pipeio: pipelines as data¶
Sources & anchors
- Stack component: projio
- Canonical artifact: pixecog/code/pipelines/lfp_extrema/ end-to-end + pixecog/code/pipelines/manifest_assemble/
- Workshop session: Day-3 AM session 1
- Outline: _outline.md §B
Frame¶
Flow registry; BidsPaths adapter; manifest.yml as cross-flow
contract; ~50 MCP tools; pipeio_target_paths resolves a (flow, group,
member) tuple to a path. The pain pipeio solves is hand-constructed
BIDS wildcards.
The pain¶
A pixecog researcher writing a fresh analysis notebook needs the LFP
extrema events file for sub-03, ses-04, task-pre. They know it lives
somewhere under derivatives/lfp_extrema/. They open a terminal,
ls derivatives/lfp_extrema/sub-03/ses-04/, find the file, copy its
absolute path, and paste it into the notebook. A week later the cohort
adds an acquisition: lshank filter to the recording schema; the path
now contains acq-lshank. Every notebook that hard-coded the old
path breaks silently.
The same problem at agent scale is worse. An agent that resolves a BIDS path by glob-and-grep is one schema change away from picking up the wrong file or constructing a path that doesn't exist. The deterministic fix — give the agent a function that takes a tuple and returns the canonical path — is what pipeio supplies.
The flow registry¶
A flow in pipeio is a directory under code/pipelines/<name>/
containing a Snakefile, a config.yml, optional scripts/,
optional notebooks/, and optional docs/. pipeio_registry_scan()
walks code/pipelines/, identifies every flow, parses its Snakefile
to inventory its mods (logical sub-pipelines composed of rules), and
writes the result to .projio/pipeio/registry.yml. The registry is
the source of truth for what pipelines exist and what they produce.
The pixecog cohort's registry currently lists sixteen flows — among
them preprocess_ieeg, preprocess_ecephys, brainstate,
lfp_extrema, spectrogram_burst, manifest_assemble — each with
its mod inventory. Once a flow is registered, every agent or
collaborator can answer "what pipelines does this project have?"
with one tool call:
pipeio_flow_list()
# → [{"name": "lfp_extrema", "code_path": ".../code/pipelines/lfp_extrema",
# "app_type": "snakemake", "mods": [...]}, ...]
pipeio_flow_status(flow) adds per-flow state: missing scaffold files,
last run, whether outputs are stale. The registry is regenerated by
pipeio_registry_scan() and validated by pipeio_registry_validate().
A flow whose Snakefile has changed since the last scan is detected
automatically.
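A rough sketch of the scanning idea, under stated assumptions: scan_flows and the entry fields has_config and optional_dirs are hypothetical names, not pipeio's implementation, and the sketch skips Snakefile parsing (so no mod inventory) and does not write registry.yml. It only shows how "a directory with a Snakefile is a flow" can be turned into registry entries.

```python
import tempfile
from pathlib import Path


def scan_flows(pipelines_dir):
    """Return one entry per sub-directory that contains a Snakefile."""
    entries = []
    for flow_dir in sorted(Path(pipelines_dir).iterdir()):
        if not (flow_dir / "Snakefile").is_file():
            continue  # no Snakefile, so this directory is not treated as a flow
        entries.append({
            "name": flow_dir.name,
            "code_path": str(flow_dir),
            "has_config": (flow_dir / "config.yml").is_file(),
            "optional_dirs": [
                d for d in ("scripts", "notebooks", "docs") if (flow_dir / d).is_dir()
            ],
        })
    return entries


# Build a throwaway code/pipelines/ tree and scan it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "code" / "pipelines"
    (root / "lfp_extrema" / "scripts").mkdir(parents=True)
    (root / "lfp_extrema" / "Snakefile").touch()
    (root / "lfp_extrema" / "config.yml").touch()
    (root / "scratch").mkdir()  # lacks a Snakefile, so the scan ignores it
    registry = scan_flows(root)

print([entry["name"] for entry in registry])
```

In this sketch the scratch directory never appears in the result, which mirrors the point that only directories following the flow layout become visible.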
BidsPaths — the adapter¶
pipeio's BIDS adapter (pipeio.adapters.bids.BidsPaths) is a small
Python class that wraps a flow's output schema and exposes
two methods: path(group, member, wildcards) to construct one output
path, and targets(group, member) to expand the cross-product across
the full BIDS wildcard table. The pixecog/code/pipelines/lfp_extrema/Snakefile
shows the convention:
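from pathlib import Path
from snakebids import generate_inputs, set_bids_spec
from pipeio.adapters.bids import BidsPaths
set_bids_spec("v0_0_0")
configfile: "config.yml"
# Excerpt: _ecephys_pb, _ieeg_pb, _bids_dir_ecephys, _bids_dir_ieeg and
# repo_abs are defined elsewhere in the Snakefile.
inputs = {}
# Only generate inputs for modalities whose BIDS directory exists.
if _ecephys_pb and Path(_bids_dir_ecephys).exists():
    inputs.update(generate_inputs(_bids_dir_ecephys, _ecephys_pb))
if _ieeg_pb and Path(_bids_dir_ieeg).exists():
    inputs.update(generate_inputs(_bids_dir_ieeg, _ieeg_pb))
# The registry block from config.yml drives every output path.
_registry = dict(config.get("registry") or {})
out_paths = BidsPaths(_registry, repo_abs(config["output_dir"]), inputs)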
The registry block in config.yml declares the flow's output
groups. Each group has a base input (which BIDS modality it derives
from), an output bids.root (under which derivatives/<flow>/ the
files land), a datatype, and a list of members (the individual
files a single run produces — typically an events TSV, a metrics
parquet, a metadata JSON). The Snakefile rule bodies use
out_paths.path("detect_ripple", "events", wildcards) to construct
output paths; the rule's wildcard table is the BIDS wildcard table
generated by snakebids.
The point is that the output path is computed from the
config, not pasted into the rule body. Change the
bids.root for one group and every rule that produces members of
that group emits to the new location automatically.
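To make the "computed from the config" point concrete, here is a toy stand-in for the adapter. It is not pipeio's BidsPaths: the class name SketchPaths, the registry keys (root, datatype, members, suffix, extension), and the filename layout are all assumptions chosen for illustration. It follows the text above in expanding targets as a cross-product of the wildcard values.

```python
from itertools import product
from pathlib import PurePosixPath


class SketchPaths:
    """Toy config-driven path adapter (hypothetical, not pipeio's BidsPaths)."""

    def __init__(self, registry, output_dir, wildcard_table):
        self.registry = registry
        self.output_dir = PurePosixPath(output_dir)
        self.wildcard_table = wildcard_table  # e.g. {"subject": [...], "session": [...]}

    def path(self, group, member, wildcards):
        """Build one output path from the registry entry and one wildcard set."""
        spec = self.registry[group]
        info = spec["members"][member]
        sub = f"sub-{wildcards['subject']}"
        ses = f"ses-{wildcards['session']}"
        name = f"{sub}_{ses}_{info['suffix']}{info['extension']}"
        return str(self.output_dir / spec["root"] / sub / ses / spec["datatype"] / name)

    def targets(self, group, member):
        """Expand path() over the cross-product of all wildcard values."""
        keys = list(self.wildcard_table)
        combos = product(*(self.wildcard_table[key] for key in keys))
        return [self.path(group, member, dict(zip(keys, combo))) for combo in combos]


registry = {
    "detect_ripple": {
        "root": "lfp_extrema",
        "datatype": "ieeg",
        "members": {"events": {"suffix": "events", "extension": ".tsv"}},
    }
}
paths = SketchPaths(registry, "derivatives", {"subject": ["03", "05"], "session": ["04"]})

one = paths.path("detect_ripple", "events", {"subject": "03", "session": "04"})
everything = paths.targets("detect_ripple", "events")

# Changing the group's root in the config moves every path; no rule body changes.
registry["detect_ripple"]["root"] = "lfp_extrema_v2"
moved = paths.path("detect_ripple", "events", {"subject": "03", "session": "04"})
print(one, moved, sep="\n")
```

The only edit between one and moved is a single registry value, which is the behaviour the paragraph above describes for bids.root.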
manifest.yml — the cross-flow contract¶
Every flow's output directory carries a manifest.yml that records
what was produced and how. It captures the configured registry, the
resolved output paths, the wildcard tables, and a checksum of the
config used. Downstream flows read upstream manifest.yml files
instead of re-globbing the derivatives tree.
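The shape of such a manifest can be sketched as follows. The field names (flow, registry, outputs, wildcards, config_sha256) and the choice of SHA-256 over canonical JSON are assumptions for illustration; the source only states that the manifest records the registry, resolved paths, wildcard tables, and a config checksum. The manifest is kept as a plain dict here rather than serialized to YAML.

```python
import hashlib
import json


def config_checksum(config):
    """Hash a config dict deterministically (sorted keys, compact JSON)."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def build_manifest(flow, config, outputs, wildcard_table):
    """Collect what a run produced and the config it was produced from."""
    return {
        "flow": flow,
        "registry": config.get("registry", {}),
        "outputs": sorted(outputs),
        "wildcards": wildcard_table,
        "config_sha256": config_checksum(config),
    }


config = {
    "output_dir": "derivatives/lfp_extrema",
    "registry": {"detect_ripple": {"members": ["events"]}},
}
manifest = build_manifest(
    "lfp_extrema",
    config,
    ["sub-03/ses-04/sub-03_ses-04_events.tsv"],
    {"subject": ["03"], "session": ["04"]},
)

# Key order does not affect the checksum; a changed value does.
reordered = {"registry": config["registry"], "output_dir": config["output_dir"]}
changed = dict(config, output_dir="derivatives/lfp_extrema_v2")
print(manifest["config_sha256"] == config_checksum(reordered))
print(manifest["config_sha256"] == config_checksum(changed))
```

Hashing a canonical serialization rather than the raw file is one possible design choice: it keeps the checksum stable under cosmetic edits such as reordered keys.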
pixecog/code/pipelines/manifest_assemble/ is the canonical
cross-flow example. It is itself a snakemake flow whose only job is
to gather detection-event TSVs from multiple upstream flows
(lfp_extrema, spectrogram_burst, sharpwaveripple, …) and assemble
them into a project-wide event table. Its Snakefile reads each
upstream flow's manifest.yml, enumerates the produced events files,
and registers them as inputs:
_detection_dirs = [str(repo_abs(p)) for p in config.get("detection_dirs", [])]
The detection_dirs entries in manifest_assemble/config.yml name
upstream flows by directory. There is no hard-coded path to a
specific subject/session/task tuple; the assemble step infers the
full set from the upstream manifests. Adding a new detection method
in a new upstream flow is one config edit and a re-run; the assemble
flow picks the new outputs up automatically.
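The assemble step described above might look roughly like this. It is a sketch under assumptions: collect_event_files and load_manifest are hypothetical names, manifests are represented as in-memory dicts with flow and outputs keys, and the filename test "_events.tsv" stands in for whatever the real flow uses to recognise events members.

```python
def collect_event_files(detection_dirs, load_manifest):
    """Gather every events TSV listed in the upstream manifests, tagged by flow."""
    rows = []
    for directory in detection_dirs:
        manifest = load_manifest(directory)  # hypothetical loader: directory -> dict
        for path in manifest["outputs"]:
            if path.endswith("_events.tsv"):
                rows.append({"flow": manifest["flow"], "path": path})
    return rows


# In-memory stand-ins for two upstream manifest.yml files.
fake_manifests = {
    "derivatives/lfp_extrema": {
        "flow": "lfp_extrema",
        "outputs": [
            "sub-03/ses-04/sub-03_ses-04_events.tsv",
            "sub-03/ses-04/sub-03_ses-04_metrics.parquet",
        ],
    },
    "derivatives/spectrogram_burst": {
        "flow": "spectrogram_burst",
        "outputs": ["sub-03/ses-04/sub-03_ses-04_events.tsv"],
    },
}

events = collect_event_files(list(fake_manifests), fake_manifests.__getitem__)
for row in events:
    print(row["flow"], row["path"])
```

Adding a third upstream flow in this sketch means adding one more directory to the list passed in; the function itself does not change, which is the "one config edit" property described above.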
pipeio_target_paths — the resolution tool¶
The MCP tool that closes the loop is pipeio_target_paths(flow,
group, member). Given a registered flow, a group name (e.g.
detect_ripple), and a member name (e.g. events), it returns
every concrete BIDS path that this flow would produce for the group's
member, across the full wildcard table. The agent does not glob; it
calls the tool, gets a list of paths, and consumes them.
The same tool accepts entity filters: pipeio_target_paths(flow, group, member,
subject="03", session="04") narrows the result to a specific tuple. The wildcards in
the result come from the BIDS inputs the flow was generated against,
not from a hand-maintained list.
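The filtering behaviour can be illustrated with a small stand-in. The function target_paths, the row-based wildcard table, and the path template below are assumptions for illustration only; the real MCP tool's internals and return format are not shown in this document.

```python
def target_paths(rows, template, **filters):
    """Render one path per wildcard row that matches every filter given."""
    selected = [
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    ]
    return [template.format(**row) for row in selected]


# Hypothetical wildcard table: one row per (subject, session, task) found in the inputs.
rows = [
    {"subject": "03", "session": "04", "task": "pre"},
    {"subject": "03", "session": "05", "task": "pre"},
    {"subject": "07", "session": "01", "task": "pre"},
]
template = (
    "derivatives/lfp_extrema/sub-{subject}/ses-{session}/"
    "sub-{subject}_ses-{session}_task-{task}_events.tsv"
)

all_paths = target_paths(rows, template)
one_tuple = target_paths(rows, template, subject="03", session="04")
print(len(all_paths), one_tuple)
```

With no filters the call returns a path for every row; each keyword narrows the set, so a fully specified tuple yields at most one path.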
The MCP surface (~50 tools)¶
The breadth of the pipeio tool set reflects the breadth of the authoring surface. The tools fall into clusters:
- Flow lifecycle: pipeio_flow_list, pipeio_flow_status, pipeio_flow_new, pipeio_flow_audit, pipeio_flow_fork, pipeio_flow_deregister.
- Mod and rule editing: pipeio_mod_list, pipeio_mod_create, pipeio_mod_context, pipeio_mod_audit, pipeio_rule_list, pipeio_rule_stub, pipeio_rule_insert, pipeio_rule_update.
- Config: pipeio_config_init, pipeio_config_read, pipeio_config_patch.
- Notebooks (kind-aware, two backends: jupytext percent-format and marimo): pipeio_nb_create, pipeio_nb_read, pipeio_nb_diff, pipeio_nb_sync, pipeio_nb_exec, pipeio_nb_watch, pipeio_nb_snapshot, pipeio_nb_extract, pipeio_nb_promote.
- Documentation: pipeio_mod_doc_refresh, pipeio_docs_collect, pipeio_docs_nav, pipeio_dag_export, pipeio_flow_report.
- Execution: pipeio_run, pipeio_run_status, pipeio_run_dashboard, pipeio_run_kill.
- Contracts: pipeio_contracts_validate, pipeio_cross_flow, pipeio_target_paths, pipeio_completion.
Every tool addresses flows by name, never by path. The pipeio authoring model assumes a registered flow; an unregistered directory is invisible.
End-to-end: lfp_extrema + manifest_assemble¶
Putting the pieces together, the lfp_extrema flow in pixecog
produces, for each (subject, session, task, acquisition, recording)
tuple, a detection TSV per detection method. Its config.yml enumerates
a list of detection methods under detections:, each of which the
flow expands into a registry entry — meaning the config drives the
registry (one of pixecog's distinctive patterns: see
Config-driven pipelines).
The Snakefile uses BidsPaths to resolve output paths; each run's
manifest.yml records exactly which detections were produced.
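The "config drives the registry" step can be sketched as a plain expansion function. Everything specific here is assumed for illustration: the function name expand_detections, the shape of each detections entry (a name and an optional datatype), the detect_<name> group naming, the default datatype, and the fixed member list. The "spindle" entry is an invented example, not a pixecog detection method.

```python
def expand_detections(config):
    """Turn each entry under 'detections' into one registry group."""
    registry = {}
    for detection in config.get("detections", []):
        registry[f"detect_{detection['name']}"] = {
            "root": config["flow"],
            "datatype": detection.get("datatype", "ieeg"),  # assumed default
            "members": ["events", "metrics", "meta"],       # assumed member set
        }
    return registry


config = {
    "flow": "lfp_extrema",
    "detections": [
        {"name": "ripple"},
        {"name": "spindle", "datatype": "ecephys"},  # hypothetical second method
    ],
}
registry = expand_detections(config)
print(sorted(registry))
```

Under this pattern, adding a detection method is a one-line change to the detections list; the registry entry and its output paths follow from it.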
manifest_assemble is the consumer. It reads each upstream
manifest.yml, enumerates the detection TSVs recorded there, and produces a session-level
event registry: the single source of truth for "what events were
detected, by which method, on which subject/session". A downstream
notebook (or report) calls pipeio_target_paths("manifest_assemble",
"events", "all", subject="03") to get the assembled events file for
sub-03, and never has to know how the upstream flows arrange their
own output directories.
The chain — preprocess_ieeg → lfp_extrema → manifest_assemble →
notebook — is fully traceable: each flow's manifest names its inputs;
each input is itself a manifest entry; each manifest is committed
under derivatives/<flow>/; and every output is reproducible from
the recorded config.
What's missing¶
BidsPaths and the manifest.yml convention are projio's; they are
not part of the official BIDS or snakebids specs. A pipeline written
to the convention is portable across pipeio projects but not, today,
to a non-pipeio project. The survey frames this as a projio
convention layered on BIDS rather than an upstream contract
(see honest gaps §2). Proposing the convention upstream to snakebids
is on the roadmap but out of scope for the workshop.
Further reading¶
- Snakemake documentation — underlying execution engine; pipeio wraps its scheduling and wildcard resolution.
- snakebids documentation — BIDS-aware input generation used inside pipeio-managed flows.