pipeio v2 roadmap: lean scope, snakebids/DataLad alignment

North star

pipeio is an agent-facing authoring + discovery layer for Snakemake/snakebids/DataLad projects. It does not compete with execution engines or provenance systems. It makes pipeline knowledge queryable and actionable for AI agents via MCP tools.

One flow = one snakebids app = one derivative directory (= one DataLad subdataset).

Principles

  1. Don't reimplement what snakebids, Snakemake, or DataLad already do
  2. Adapt their outputs into agent-usable structured data where needed
  3. Own the registry, authoring, contracts, and documentation layers they don't provide
  4. Align with BIDS derivatives metadata for cross-flow lineage

Source

See deep-research-pipeio-scope.md for the full landscape analysis.


Current tool inventory and v2 fate

KEEP — unique agent value (no ecosystem equivalent)

| Tool | Purpose | v2 changes |
|------|---------|------------|
| flow_list | List flows in registry | Treat flows as snakebids apps; include derivative dir |
| flow_status | Overview of a flow | Add snakebids app status (has run.py, .snakebids marker) |
| mod_list | List mods in a flow | Keep as-is |
| mod_resolve | Resolve modkeys to metadata | Keep as-is |
| mod_context | Bundled read: rules, scripts, doc, config | Keep as-is |
| mod_create | Scaffold mod (script + doc + I/O) | Align with snakebids workflow/ layout |
| rule_list | Parse rules from Snakefiles | Keep — agents need structured rule data |
| rule_stub | Generate rule text from I/O specs | Keep — unique authoring tool |
| rule_insert | Insert rule into .smk file | Keep — unique authoring tool |
| rule_update | Patch existing rule | Keep — unique authoring tool |
| config_read | Parse flow config with bids signatures | Evolve to read config/snakebids.yml |
| config_patch | Surgical YAML edit (preserves comments/anchors) | Keep — unique; reposition for snakebids.yml |
| cross_flow | Map output→input chains across flows | Evolve: also read BIDS dataset_description.json GeneratedBy/SourceDatasets |
| contracts_validate | Check I/O contracts | Keep — feeds DataLad run --input/--output declarations |
| registry_scan | Discover flows from filesystem | Evolve: detect snakebids app structure |
| registry_validate | Check registry consistency | Keep |
| nb_create | Scaffold notebook with bootstrap cells | Keep |
| nb_update | Update notebook metadata | Keep |
| nb_status | Notebook sync/lifecycle status | Keep |
| nb_sync | Jupytext sync | Keep — thin wrapper over jupytext |
| nb_publish | Publish notebook to docs | Keep |
| nb_analyze | Parse notebook structure | Keep |
| nb_exec | Execute notebook (papermill) | Keep |
| nb_pipeline | Chain sync→publish→collect | Keep |
| modkey_bib | Generate modkey bibliography | Keep — unique |
| docs_collect | Collect flow docs into MkDocs | Keep |
| docs_nav | Generate nav YAML fragment | Keep |
| mkdocs_nav_patch | Patch mkdocs.yml nav | Keep |

THIN OUT — replace internals with ecosystem tools

| Tool | Current impl | v2: adapter over |
|------|--------------|------------------|
| dag | Custom Snakefile parser | snakemake --d3dag JSON output |
| completion | Glob filesystem vs registry schema | snakemake --summary lifted into contract-level status |
| log_parse | Read raw snakemake logs | Pointer to snakemake --report + DataLad run record |
| config_init | Scaffold flat config.yml | Scaffold snakebids app skeleton (config/snakebids.yml + workflow/ + run.py) |
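As one example of the adapter pattern, the v2 dag tool could shell out to snakemake --d3dag and pass through the parsed JSON instead of parsing Snakefiles itself. A sketch with the runner injectable so it can be exercised without a Snakemake install — the stubbed payload shape below is illustrative; pipeio would return whatever snakemake emits:

```python
import json
import subprocess


def dag_json(snakefile: str, run=subprocess.run) -> dict:
    """Return Snakemake's DAG as parsed JSON via `snakemake --d3dag`.

    `run` is injectable for testing; by default it invokes the real CLI.
    """
    proc = run(
        ["snakemake", "--snakefile", snakefile, "--d3dag"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


# Stub standing in for the real CLI (payload shape is hypothetical):
class _FakeProc:
    stdout = '{"nodes": [{"rule": "all"}], "links": []}'


dag = dag_json("workflow/Snakefile", run=lambda *a, **kw: _FakeProc())
```

The custom parser goes away; pipeio only owns the MCP-facing shape of the result.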

STOP / REPLACE — duplicate ecosystem tools

| Tool | Current impl | v2: replaced by |
|------|--------------|-----------------|
| run | screen -dmS snakemake + runs.json | datalad run -- python run.py ... → return commit + run record |
| run_status | Parse screen sessions + log tail | DataLad run records + snakemake --summary |
| run_dashboard | Aggregate runs.json | DataLad git log of run records |
| run_kill | Kill screen sessions | Process management (if needed at all) |

Structural changes

Flow directory layout: flat → snakebids app

Current:

code/pipelines/{pipe}/{flow}/
    Snakefile
    config.yml
    scripts/

v2 (snakebids app):

code/pipelines/{flow}/               # or code/apps/{flow}/
    run.py                            # snakebids entry point
    config/
        snakebids.yml                 # pybids_inputs, parse_args, analysis_levels
    workflow/
        Snakefile
        rules/*.smk                   # mod-organized rule files
        scripts/
    notebooks/
    docs/

Impact on pipeio:

  • registry_scan: detect run.py + config/snakebids.yml as snakebids app markers
  • config_read/config_patch: target config/snakebids.yml
  • rule_insert/rule_list: look in workflow/rules/ and workflow/Snakefile
  • mod_create: scaffold scripts into workflow/scripts/
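The registry_scan marker check could be as small as this sketch — the file names follow the proposed v2 layout above, not any snakebids-mandated convention:

```python
from pathlib import Path


def looks_like_snakebids_app(flow_dir: str) -> bool:
    """Classify a flow directory as a snakebids app: an entry point
    plus the snakebids config file, per the proposed v2 layout."""
    root = Path(flow_dir)
    return (
        (root / "run.py").is_file()
        and (root / "config" / "snakebids.yml").is_file()
    )
```

During Phase 1 this check would run alongside the existing flat-layout detection, so both layouts register.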

Execution: screen → datalad run

Current:

pipeio_run(pipe, flow)
  → screen -dmS snakemake ...
  → writes runs.json

v2:

pipeio_run(pipe, flow, analysis_level="participant")
  → datalad run \
      --input {bids_dir} \
      --output {derivative_dir} \
      -- python run.py {bids_dir} {derivative_dir} {analysis_level}
  → returns { commit, run_record, derivative_dir }

Contracts feed the --input/--output declarations.
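A sketch of that wrapper: contract inputs and outputs become --input/--output flags on a datalad run invocation. The flag names follow the real datalad run CLI; the function itself and its argument names are hypothetical, and executing the command (then reading back the commit and run record) is left to the caller:

```python
def build_datalad_run_cmd(bids_dir: str, derivative_dir: str,
                          analysis_level: str = "participant") -> list:
    """Assemble the `datalad run` argv for one flow invocation."""
    return [
        "datalad", "run",
        "--input", bids_dir,         # declared so DataLad fetches/records it
        "--output", derivative_dir,  # declared so DataLad saves the results
        "--",
        "python", "run.py", bids_dir, derivative_dir, analysis_level,
    ]


cmd = build_datalad_run_cmd("data/raw", "data/derivatives/preproc")
```

Because the declarations come from contracts_validate, a failing contract can block the run before DataLad ever executes it.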

Cross-flow: registry → BIDS derivatives metadata

v2: Read/write dataset_description.json in each derivative dir:

{
  "Name": "preprocess-ecephys",
  "GeneratedBy": [{"Name": "preprocess-ecephys", "CodeURL": "..."}],
  "SourceDatasets": [{"URL": "../raw"}]
}

Standards-aligned lineage that any BIDS tool can read.
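A minimal sketch of the read/write half that cross_flow would use — the helper names are hypothetical, and the BIDSVersion pin is an assumption to be replaced with whatever the project targets:

```python
import json
from pathlib import Path


def write_lineage(derivative_dir: str, name: str, code_url: str,
                  source_urls: list) -> dict:
    """Write BIDS derivatives lineage fields into dataset_description.json."""
    desc = {
        "Name": name,
        "BIDSVersion": "1.8.0",        # assumed pin, adjust per project
        "DatasetType": "derivative",   # required by BIDS for derivatives
        "GeneratedBy": [{"Name": name, "CodeURL": code_url}],
        "SourceDatasets": [{"URL": url} for url in source_urls],
    }
    path = Path(derivative_dir) / "dataset_description.json"
    path.write_text(json.dumps(desc, indent=2))
    return desc


def read_sources(derivative_dir: str) -> list:
    """Return upstream dataset URLs recorded in a derivative dir."""
    path = Path(derivative_dir) / "dataset_description.json"
    desc = json.loads(path.read_text())
    return [src["URL"] for src in desc.get("SourceDatasets", [])]
```

cross_flow would walk these SourceDatasets links instead of relying solely on pipeio's own registry.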


Migration phases

Phase 0: Research & design (current)

  • [x] Deep research on ecosystem landscape
  • [x] Identify keep/thin/stop categories
  • [ ] Design snakebids.yml schema mapping (what pipeio reads/writes)
  • [ ] Design datalad run integration interface
  • [ ] Decide on pipe/flow hierarchy: keep pipe as category or flatten?

Phase 1: Structural alignment (non-breaking, additive)

  • [ ] registry_scan learns snakebids app layout alongside current flat layout
  • [ ] config_read supports both config.yml and config/snakebids.yml
  • [ ] config_init --snakebids generates app skeleton
  • [ ] Add dataset_description.json to flow scaffolding

Phase 2: Execution migration

  • [ ] pipeio_run gains provenance=True → wraps with datalad run
  • [ ] pipeio_dag switches to snakemake --d3dag backend
  • [ ] pipeio_completion switches to snakemake --summary backend
  • [ ] Deprecate runs.json state machine

Phase 3: Full snakebids alignment

  • [ ] Default scaffolding generates snakebids app layout
  • [ ] Remove flat layout support (or keep as legacy)
  • [ ] Explore pipeio as snakebids plugin
  • [ ] cross_flow reads BIDS GeneratedBy/SourceDatasets

Tool count

| Category | v1 | v2 |
|----------|----|----|
| Keep (unique) | 27 | 27 |
| Thin (adapter) | 4 | 4 (same API, different internals) |
| Stop (replace) | 4 | 0 |
| New | 0 | ~2 (datalad_run wrapper, bids_metadata) |
| Total | 35 | ~33 |

The surface barely changes. The difference is what's inside: pipeio stops reimplementing and starts composing.