pipeio: pipelines as data

Sources & anchors

  • Stack component: projio
  • Canonical artifact: pixecog/code/pipelines/lfp_extrema/ end-to-end + pixecog/code/pipelines/manifest_assemble/
  • Workshop session: Day-3 AM session 1
  • Outline: _outline.md §B

Frame

Flow registry; BidsPaths adapter; manifest.yml as cross-flow contract; ~50 MCP tools; pipeio_target_paths resolves a (flow, group, member) tuple to a path. The pain pipeio solves is hand-constructed BIDS wildcards.

The pain

A pixecog researcher writing a fresh analysis notebook needs the LFP extrema events file for sub-03, ses-04, task-pre. They know it lives somewhere under derivatives/lfp_extrema/. They open a terminal, run ls derivatives/lfp_extrema/sub-03/ses-04/, find the file, copy its absolute path, and paste it into the notebook. A week later the cohort adds an acquisition filter (acquisition: lshank) to the recording schema, and the path now contains acq-lshank. Every notebook that hard-coded the old path breaks silently.

The same problem at agent scale is worse. An agent that resolves a BIDS path by glob-and-grep is one schema change away from picking up the wrong file or constructing a path that doesn't exist. The deterministic fix — give the agent a function that takes a tuple and returns the canonical path — is what pipeio supplies.

The flow registry

A flow in pipeio is a directory under code/pipelines/<name>/ containing a Snakefile, a config.yml, optional scripts/, optional notebooks/, and optional docs/. pipeio_registry_scan() walks code/pipelines/, identifies every flow, parses its Snakefile to inventory its mods (logical sub-pipelines composed of rules), and writes the result to .projio/pipeio/registry.yml. The registry is the source of truth for what pipelines exist and what they produce.
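The directory walk behind the scan can be pictured with a minimal sketch. This is not pipeio's implementation: the helper name scan_flows, the rule that a directory counts as a flow when it contains a Snakefile, and the shape of the returned records are assumptions made for illustration. Mod parsing and writing registry.yml are left out.

```python
from pathlib import Path
import tempfile


def scan_flows(pipelines_dir: Path) -> list[dict]:
    """Return one record per sub-directory that looks like a flow.

    Hypothetical helper: a directory is treated as a flow when it
    contains a Snakefile. Mod inventory is omitted from this sketch.
    """
    flows = []
    for flow_dir in sorted(p for p in pipelines_dir.iterdir() if p.is_dir()):
        if not (flow_dir / "Snakefile").is_file():
            continue  # no Snakefile, so not a flow
        flows.append(
            {
                "name": flow_dir.name,
                "code_path": str(flow_dir),
                "has_config": (flow_dir / "config.yml").is_file(),
                # Optional scaffold directories that are actually present.
                "optional": [
                    d for d in ("scripts", "notebooks", "docs")
                    if (flow_dir / d).is_dir()
                ],
            }
        )
    return flows


# Build a throwaway project layout and scan it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "code" / "pipelines"
    (root / "lfp_extrema" / "scripts").mkdir(parents=True)
    (root / "lfp_extrema" / "Snakefile").write_text("# rules\n")
    (root / "lfp_extrema" / "config.yml").write_text("registry: {}\n")
    (root / "scratch").mkdir()  # no Snakefile, so it is skipped
    found = scan_flows(root)

flow_names = [f["name"] for f in found]
```

In this sketch the scratch directory never reaches the result, which mirrors the idea that only directories with the expected scaffold count as flows.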

The pixecog cohort's registry currently lists sixteen flows — among them preprocess_ieeg, preprocess_ecephys, brainstate, lfp_extrema, spectrogram_burst, manifest_assemble — each with its mod inventory. Once a flow is registered, every agent or collaborator can answer "what pipelines does this project have?" with one tool call:

pipeio_flow_list()
# → [{"name": "lfp_extrema", "code_path": ".../code/pipelines/lfp_extrema",
#     "app_type": "snakemake", "mods": [...]}, ...]

pipeio_flow_status(flow) adds per-flow state: missing scaffold files, last run, whether outputs are stale. The registry is regenerated by pipeio_registry_scan() and validated by pipeio_registry_validate(). A flow whose Snakefile has changed since the last scan is detected automatically.

BidsPaths — the adapter

pipeio's BIDS adapter (pipeio.adapters.bids.BidsPaths) is a small Python class that wraps a flow's output schema and exposes two methods: path(group, member, wildcards) to construct one output path, and targets(group, member) to expand the cross-product across the full BIDS wildcard table. The pixecog/code/pipelines/lfp_extrema/Snakefile shows the convention:

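from pathlib import Path

from snakebids import generate_inputs, set_bids_spec
from pipeio.adapters.bids import BidsPaths

set_bids_spec("v0_0_0")
configfile: "config.yml"

# _ecephys_pb, _ieeg_pb, _bids_dir_ecephys, _bids_dir_ieeg and repo_abs are
# defined earlier in the Snakefile and omitted from this excerpt.
# Collect BIDS inputs from each modality that is configured and present on disk.
inputs = {}
if _ecephys_pb and Path(_bids_dir_ecephys).exists():
    inputs.update(generate_inputs(_bids_dir_ecephys, _ecephys_pb))
if _ieeg_pb and Path(_bids_dir_ieeg).exists():
    inputs.update(generate_inputs(_bids_dir_ieeg, _ieeg_pb))

# Build the path adapter from the registry block of config.yml.
_registry = dict(config.get("registry") or {})
out_paths = BidsPaths(_registry, repo_abs(config["output_dir"]), inputs)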

The registry block in config.yml declares the flow's output groups. Each group has a base input (the BIDS modality it derives from), an output bids.root (which location under derivatives/<flow>/ the files land in), a datatype, and a list of members (the individual files a single run produces, typically an events TSV, a metrics parquet, and a metadata JSON). The Snakefile rule bodies use out_paths.path("detect_ripple", "events", wildcards) to construct output paths; the rule's wildcard table is the BIDS wildcard table generated by snakebids.

The point is that the output path is computed from the config, not pasted into the rule body. Change the bids.root for one group and every rule that produces members of that group emits to the new location automatically.
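A small sketch can make the config-to-path step concrete. Everything here is an assumption for illustration: the class MiniBidsPaths is a stand-in and not the real BidsPaths, the registry keys (base_input, bids.root, bids.datatype, members with suffix and extension) are a guessed layout, and treating bids.root as a sub-folder of the output directory is one possible reading of the convention.

```python
from pathlib import PurePosixPath

# Hypothetical entity ordering and prefixes used to build BIDS-style names.
ENTITY_ORDER = ("subject", "session", "task", "acquisition")
ENTITY_PREFIX = {
    "subject": "sub",
    "session": "ses",
    "task": "task",
    "acquisition": "acq",
}


class MiniBidsPaths:
    """Illustrative stand-in: computes output paths from a registry dict."""

    def __init__(self, registry: dict, output_dir: str):
        self.registry = registry
        self.output_dir = PurePosixPath(output_dir)

    def path(self, group: str, member: str, wildcards: dict) -> str:
        spec = self.registry[group]
        mem = spec["members"][member]
        # File name: entity key-value pairs in a fixed order, then the suffix.
        entities = [
            f"{ENTITY_PREFIX[k]}-{wildcards[k]}"
            for k in ENTITY_ORDER
            if k in wildcards
        ]
        filename = "_".join(entities + [mem["suffix"]]) + mem["extension"]
        # Folder: <output_dir>/<bids.root>/sub-*/ses-*/<datatype>/
        folder = self.output_dir / spec["bids"]["root"]
        for key in ("subject", "session"):
            if key in wildcards:
                folder = folder / f"{ENTITY_PREFIX[key]}-{wildcards[key]}"
        return str(folder / spec["bids"]["datatype"] / filename)


registry = {
    "detect_ripple": {
        "base_input": "ieeg",
        "bids": {"root": "detect_ripple", "datatype": "ieeg"},
        "members": {
            "events": {"suffix": "events", "extension": ".tsv"},
            "metrics": {"suffix": "metrics", "extension": ".parquet"},
        },
    }
}

out_paths = MiniBidsPaths(registry, "derivatives/lfp_extrema")
wc = {"subject": "03", "session": "04", "task": "pre"}
before = out_paths.path("detect_ripple", "events", wc)

# Editing the config moves every member of the group; no rule body changes.
registry["detect_ripple"]["bids"]["root"] = "ripple_v2"
after = out_paths.path("detect_ripple", "events", wc)

# A new entity in the wildcards shows up in the name without code edits.
with_acq = out_paths.path("detect_ripple", "events", {**wc, "acquisition": "lshank"})
```

The design choice the sketch illustrates is that callers pass a group, a member, and wildcards, and never assemble directory or file names themselves.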

manifest.yml — the cross-flow contract

Every flow's output directory carries a manifest.yml that records what was produced and how. It captures the configured registry, the resolved output paths, the wildcard tables, and a checksum of the config used. Downstream flows read upstream manifest.yml files instead of re-globbing the derivatives tree.
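As a rough illustration of what such a manifest could hold, the sketch below builds one as a plain dictionary. The field names (flow, registry, wildcards, outputs, config_sha256) and the helpers build_manifest and config_checksum are hypothetical, not pipeio's schema. The checksum is computed over a sorted-key JSON dump only to keep the sketch dependency-free; the real file is YAML.

```python
import hashlib
import json


def config_checksum(config: dict) -> str:
    """SHA-256 of a canonical (sorted-key) JSON dump of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(flow: str, config: dict, wildcard_table: list, outputs: dict) -> dict:
    """Assemble the record a flow would write next to its outputs (hypothetical fields)."""
    return {
        "flow": flow,
        "registry": config.get("registry", {}),
        "wildcards": wildcard_table,
        "outputs": outputs,
        "config_sha256": config_checksum(config),
    }


config = {
    "output_dir": "derivatives/lfp_extrema",
    "registry": {"detect_ripple": {"members": ["events"]}},
}
table = [{"subject": "03", "session": "04", "task": "pre"}]
outputs = {
    "detect_ripple": {
        "events": [
            "derivatives/lfp_extrema/sub-03/ses-04/sub-03_ses-04_task-pre_events.tsv"
        ]
    }
}
manifest = build_manifest("lfp_extrema", config, table, outputs)

# Key order does not affect the checksum; a real config change does.
reordered = {"registry": config["registry"], "output_dir": config["output_dir"]}
same_checksum = config_checksum(reordered) == manifest["config_sha256"]
changed = dict(config, output_dir="derivatives/other")
differs = config_checksum(changed) != manifest["config_sha256"]
```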

pixecog/code/pipelines/manifest_assemble/ is the canonical cross-flow example. It is itself a snakemake flow whose only job is to gather detection-event TSVs from multiple upstream flows (lfp_extrema, spectrogram_burst, sharpwaveripple, …) and assemble them into a project-wide event table. Its Snakefile reads each upstream flow's manifest.yml, enumerates the produced events files, and registers them as inputs:

_detection_dirs = [str(repo_abs(p)) for p in config.get("detection_dirs", [])]

The detection_dirs entries in manifest_assemble/config.yml name upstream flows by directory. There is no hard-coded path to a specific subject/session/task tuple; the assemble step infers the full set from the upstream manifests. Adding a new detection method in a new upstream flow is one config edit and a re-run; the assemble flow picks the new outputs up automatically.
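The gathering step can be sketched as follows, assuming each upstream manifest has already been loaded into a dictionary with an outputs mapping of group to member to list of paths. That layout, the helper name collect_event_files, and the example paths are assumptions for illustration; loading the YAML files themselves is omitted.

```python
def collect_event_files(manifests: dict, member: str = "events") -> list:
    """Return (flow, path) pairs for every `member` file the manifests list."""
    collected = []
    for flow, manifest in sorted(manifests.items()):
        for _group, members in sorted(manifest.get("outputs", {}).items()):
            for path in members.get(member, []):
                collected.append((flow, path))
    return collected


# Hypothetical, already-loaded upstream manifests keyed by flow name.
manifests = {
    "lfp_extrema": {
        "outputs": {
            "detect_ripple": {
                "events": ["derivatives/lfp_extrema/sub-03_events.tsv"],
                "metrics": ["derivatives/lfp_extrema/sub-03_metrics.parquet"],
            }
        }
    },
    "spectrogram_burst": {
        "outputs": {
            "detect_burst": {
                "events": ["derivatives/spectrogram_burst/sub-03_events.tsv"]
            }
        }
    },
}
events = collect_event_files(manifests)

# Registering one more upstream flow adds its events without other edits.
manifests["sharpwaveripple"] = {
    "outputs": {
        "detect_swr": {"events": ["derivatives/sharpwaveripple/sub-03_events.tsv"]}
    }
}
events_after = collect_event_files(manifests)
```

No subject, session, or task appears in the function; the set of files comes entirely from what the upstream manifests record.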

pipeio_target_paths — the resolution tool

The MCP tool that closes the loop is pipeio_target_paths(flow, group, member). Given a registered flow, a group name (e.g. detect_ripple), and a member name (e.g. events), it returns every concrete BIDS path that this flow would produce for the group's member, across the full wildcard table. The agent does not glob; it calls the tool, gets a list of paths, and consumes them.

The same logic powers pipeio_target_paths(flow, group, member, subject="03", session="04") for a specific tuple. The wildcards in the result come from the BIDS inputs the flow was generated against — not from a hand-maintained list.
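The expand-then-filter behaviour can be sketched in a few lines. This is an illustrative stand-in and not the MCP tool: the function target_paths below, its template argument, and the example wildcard table are all hypothetical.

```python
def target_paths(wildcard_table: list, template: str, **filters: str) -> list:
    """Expand `template` over every wildcard row that matches `filters`."""
    rows = [
        row
        for row in wildcard_table
        if all(row.get(key) == value for key, value in filters.items())
    ]
    return [template.format(**row) for row in rows]


# Hypothetical wildcard table, as it might be derived from the BIDS inputs.
table = [
    {"subject": "03", "session": "04", "task": "pre"},
    {"subject": "03", "session": "05", "task": "pre"},
    {"subject": "07", "session": "01", "task": "post"},
]
template = (
    "derivatives/lfp_extrema/sub-{subject}/ses-{session}/"
    "sub-{subject}_ses-{session}_task-{task}_events.tsv"
)

all_paths = target_paths(table, template)
one_path = target_paths(table, template, subject="03", session="04")
```

With no filters the call returns one path per row of the table; with subject and session given it narrows to the matching rows, which is the shape of the two calls described above.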

The MCP surface (~50 tools)

The breadth of the pipeio tool set reflects the breadth of the authoring surface. The tools fall into clusters:

  • Flow lifecycle: pipeio_flow_list, pipeio_flow_status, pipeio_flow_new, pipeio_flow_audit, pipeio_flow_fork, pipeio_flow_deregister.
  • Mod and rule editing: pipeio_mod_list, pipeio_mod_create, pipeio_mod_context, pipeio_mod_audit, pipeio_rule_list, pipeio_rule_stub, pipeio_rule_insert, pipeio_rule_update.
  • Config: pipeio_config_init, pipeio_config_read, pipeio_config_patch.
  • Notebooks (kind-aware, two backends — jupytext percent-format and marimo): pipeio_nb_create, pipeio_nb_read, pipeio_nb_diff, pipeio_nb_sync, pipeio_nb_exec, pipeio_nb_watch, pipeio_nb_snapshot, pipeio_nb_extract, pipeio_nb_promote.
  • Documentation: pipeio_mod_doc_refresh, pipeio_docs_collect, pipeio_docs_nav, pipeio_dag_export, pipeio_flow_report.
  • Execution: pipeio_run, pipeio_run_status, pipeio_run_dashboard, pipeio_run_kill.
  • Contracts: pipeio_contracts_validate, pipeio_cross_flow, pipeio_target_paths, pipeio_completion.

Every tool addresses flows by name, never by path. The pipeio authoring model assumes a registered flow; an unregistered directory is invisible.

End-to-end: lfp_extrema + manifest_assemble

Putting the pieces together, the lfp_extrema flow in pixecog produces, for each (subject, session, task, acquisition, recording) tuple, a detection TSV per detection method. Its config.yml enumerates a list of detection methods under detections:, each of which the flow expands into a registry entry — meaning the config drives the registry (one of pixecog's distinctive patterns: see Config-driven pipelines). The Snakefile uses BidsPaths to resolve output paths; each run's manifest.yml records exactly which detections were produced.

manifest_assemble is the consumer. It reads each upstream manifest.yml, enumerates the detection TSVs recorded there, and produces a session-level event registry: the single source of truth for "what events were detected, by which method, on which subject/session". A downstream notebook (or report) calls pipeio_target_paths("manifest_assemble", "events", "all", subject="03") to get the assembled events file for sub-03, and never has to know how the upstream flows arrange their own output directories.

The chain — preprocess_ieeg → lfp_extrema → manifest_assemble → notebook — is fully traceable: each flow's manifest names its inputs; each input is itself a manifest entry; each manifest is committed under derivatives/<flow>/; and every output is reproducible from the recorded config.

What's missing

BidsPaths and the manifest.yml convention are projio's; they are not part of the official BIDS or snakebids specs. A pipeline written to the convention is portable across pipeio projects but not, today, to a non-pipeio project. The honest framing in the survey is that the convention layered on BIDS is a projio convention, not an upstream contract — see honest gaps §2. The upstream story (proposing the convention to snakebids) is on the roadmap but out of scope for the workshop.

Further reading