
Feature request: codio_ingest_api — package API ingestion for agent-quality code generation

Problem

When agents generate notebook/script code using libraries (cogpy, scipy, xarray, etc.), they lack knowledge of:

  • Function signatures and type contracts
  • Dimension/shape broadcasting behavior (e.g. "psd_multitaper vectorizes over non-time dims")
  • Idiomatic usage patterns (e.g. "don't loop over channels, pass the full xarray")

This causes rookie mistakes like writing explicit for-loops over channels when the function already handles multi-dimensional input natively.
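cogpy itself can't be assumed here, so a minimal numpy stand-in illustrates the anti-pattern: computing per-channel spectral power with explicit loops versus one call that transforms along the time axis only (all names below are illustrative, not cogpy's API).

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal((8, 3, 1024))  # (trial, channel, time)

# Rookie pattern: explicit for-loops over every non-time dim.
looped = np.empty((8, 3, 513))
for trial in range(signal.shape[0]):
    for ch in range(signal.shape[1]):
        looped[trial, ch] = np.abs(np.fft.rfft(signal[trial, ch])) ** 2

# Vectorized pattern: one call, applied along the time axis; all other
# dims are broadcast over automatically.
vectorized = np.abs(np.fft.rfft(signal, axis=-1)) ** 2

assert np.allclose(looped, vectorized)
```

An API index that records "vectorizes over all non-time dims" is exactly the piece of knowledge that lets an agent emit the second form on the first try.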

Current skill prompts list function names but can't keep up with API evolution. Docstrings help but aren't accessible to agents at generation time.

Proposed solution: codio_ingest_api

A new codio ingestion mode that extracts and indexes package APIs for semantic search.

Ingestion flow

codio_ingest_api(package="cogpy") — works for any installed Python package:

  1. Walk the module tree via importlib
  2. For each public function/class:
     • Extract the signature via inspect.signature
     • Extract type hints (including generics like xr.DataArray)
     • Parse the docstring (numpy/google/sphinx style) into structured sections
     • Static-analyze the function body for:
       • xarray dim handling (.dims, .transpose(), xr.apply_ufunc)
       • Broadcasting patterns (which dims are reduced, which are preserved)
       • Common input validation patterns
     • If an @api_contract decorator is present, use it as high-confidence metadata
  3. Produce a structured API index: {module: {function: {signature, type_hints, dim_contract, docstring_summary, vectorization_info}}}
  4. Index into the existing RAG infrastructure for semantic search
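The extraction core of steps 1–2 can be sketched with the stdlib alone (demonstrated on textwrap since any installed package works; function and key names are placeholders, not a committed schema):

```python
import importlib
import inspect

def ingest_module_api(module_name):
    """Build a minimal API index for one module: name -> signature + summary."""
    module = importlib.import_module(module_name)
    index = {}
    for name, obj in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue  # public API only
        try:
            sig = str(inspect.signature(obj))
        except (ValueError, TypeError):
            sig = None  # some builtins have no introspectable signature
        doc = inspect.getdoc(obj)
        summary = doc.splitlines()[0] if doc else ""
        index[name] = {"signature": sig, "docstring_summary": summary}
    return index

index = ingest_module_api("textwrap")
print(index["dedent"]["signature"])  # (text)
```

The dim-contract and vectorization fields would come from the static-analysis and decorator passes layered on top of this skeleton.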

Query interface

codio_api_query(package, query) — semantic search over the API index:

codio_api_query("cogpy", "compute PSD of multichannel xarray signal")
→ cogpy.spectral.psd.psd_multitaper
    signature: (signal, fs, NW=4.0, ...)
    accepts: xr.DataArray with any dims containing "time", or 1D ndarray
    returns: (psd: DataArray[non-time-dims, freq], freqs: ndarray)
    vectorizes_over: all non-time dims — no loops needed

→ cogpy.spectral.specx.psdx
    signature: (signal, fs, ...)
    accepts: same as psd_multitaper
    returns: xr.Dataset with power and freqs coords
    note: higher-level wrapper, returns Dataset instead of tuple
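As a stand-in for the real RAG-backed search, a naive keyword-overlap ranker over the index sketches the query path (the scoring here is deliberately trivial; production would reuse the existing embedding infrastructure):

```python
def api_query(index, query, top_k=3):
    """Rank index entries by keyword overlap with the query (sketch only)."""
    q_tokens = set(query.lower().split())
    scored = []
    for name, entry in index.items():
        text = f"{name} {entry.get('docstring_summary', '')}".lower()
        score = sum(tok in text for tok in q_tokens)
        if score:
            scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy index with illustrative summaries (not cogpy's real docstrings).
index = {
    "psd_multitaper": {"docstring_summary": "Multitaper power spectral density of a signal"},
    "bandpass": {"docstring_summary": "Bandpass filter a signal"},
}
print(api_query(index, "power spectral density"))  # ['psd_multitaper']
```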

Integration with skills

The /notebook skill would call codio_api_query before writing cogpy code to get correct signatures and vectorization behavior. Similarly for /notebook-promote when extracting processing logic into scripts.

Optional: @api_contract decorator for packages

For packages that want to provide high-confidence metadata:

@api_contract(
    input={"signal": "DataArray[..., time]"},
    output={"psd": "DataArray[..., freq]", "freqs": "ndarray[freq]"},
    vectorizes_over="all non-time dims",
)
def psd_multitaper(signal, fs, NW=4.0, ...):

The ingestion flow reads these when present and falls back to static analysis otherwise. The decorators also double as documentation for human readers.
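One simple way to implement the decorator is to attach the contract as a function attribute that the ingestion pass looks up (the `__api_contract__` attribute name and fallback protocol below are assumptions for this sketch, not cogpy's actual API):

```python
def api_contract(**contract):
    """Attach machine-readable contract metadata to a function."""
    def decorate(fn):
        fn.__api_contract__ = contract  # assumed attribute name
        return fn
    return decorate

@api_contract(
    input={"signal": "DataArray[..., time]"},
    output={"psd": "DataArray[..., freq]", "freqs": "ndarray[freq]"},
    vectorizes_over="all non-time dims",
)
def psd_multitaper(signal, fs, NW=4.0):
    ...  # stub body for illustration

# Ingestion side: prefer the decorator, fall back to static analysis.
contract = getattr(psd_multitaper, "__api_contract__", None)
print(contract["vectorizes_over"])  # all non-time dims
```

Because the decorator only sets an attribute and returns the function unchanged, it adds no runtime overhead and no import-time dependency on codio itself.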

Why codio

codio already tracks packages (codio_list, codio_get, codio_registry). This extends it from "what packages exist and what do they do" to "what does each function accept and how does it handle multi-dimensional data." Same registry, deeper introspection.

Use cases

  1. /notebook skill — write correct vectorized code on first try
  2. /notebook-promote — understand function contracts when extracting to scripts
  3. /add-feature-cogpy (cogpy-dev) — check for existing API overlap before adding new functions
  4. Any agent writing code against an installed package

Source context: pixecog

PixEcog (pixecog): Neuropixels and ECoG dataset and analysis

Recent commits:

c309f45 Fix pipeline doc naming drift, populate registry doc_path, close 3 issues
84d605b Migrate 43 scripts from utils.smk.smk_log
5808910 [DATALAD] removed content

README:




Quick Start for Collaborators

Follow this checklist to get started with Pixecog documentation and workflows.

🐀 Pixecog Project — Compact Overview

Core principles

  • One immutable BIDS raw dataset (raw/) as the canonical baseline
  • Each analysis pipeline ha