pipeio v2 roadmap: lean scope, snakebids/DataLad alignment

North star

pipeio is an agent-facing authoring + discovery layer for Snakemake/snakebids/DataLad projects. It does not compete with execution engines or provenance systems. It makes pipeline knowledge queryable and actionable for AI agents via MCP tools.

One flow = one snakebids app = one derivative directory (= one DataLad subdataset).

Principles

  1. Don't reimplement what snakebids, Snakemake, or DataLad already do
  2. Adapt their outputs into agent-usable structured data where needed
  3. Own the registry, authoring, contracts, and documentation layers they don't provide
  4. Align with BIDS derivatives metadata for cross-flow lineage

Source

See deep-research-pipeio-scope.md for the full landscape analysis.


Current tool inventory and v2 fate

KEEP — unique agent value (no ecosystem equivalent)

| Tool | Purpose | v2 changes |
|------|---------|------------|
| flow_list | List flows in registry | Treat flows as snakebids apps; include derivative dir |
| flow_status | Overview of a flow | Add snakebids app status (has run.py, .snakebids marker) |
| mod_list | List mods in a flow | Keep as-is |
| mod_resolve | Resolve modkeys to metadata | Keep as-is |
| mod_context | Bundled read: rules, scripts, doc, config | Keep as-is |
| mod_create | Scaffold mod (script + doc + I/O) | Align with snakebids workflow/ layout |
| rule_list | Parse rules from Snakefiles | Keep — agents need structured rule data |
| rule_stub | Generate rule text from I/O specs | Keep — unique authoring tool |
| rule_insert | Insert rule into .smk file | Keep — unique authoring tool |
| rule_update | Patch existing rule | Keep — unique authoring tool |
| config_read | Parse flow config with bids signatures | Evolve to read config/snakebids.yml |
| config_patch | Surgical YAML edit (preserves comments/anchors) | Keep — unique; reposition for snakebids.yml |
| cross_flow | Map output→input chains across flows | Evolve: also read BIDS dataset_description.json GeneratedBy/SourceDatasets |
| contracts_validate | Check I/O contracts | Keep — feeds DataLad run --input/--output declarations |
| registry_scan | Discover flows from filesystem | Evolve: detect snakebids app structure |
| registry_validate | Check registry consistency | Keep |
| nb_create | Scaffold notebook with bootstrap cells | Keep |
| nb_update | Update notebook metadata | Keep |
| nb_status | Notebook sync/lifecycle status | Keep |
| nb_sync | Jupytext sync | Keep — thin wrapper over jupytext |
| nb_publish | Publish notebook to docs | Keep |
| nb_analyze | Parse notebook structure | Keep |
| nb_exec | Execute notebook (papermill) | Keep |
| nb_pipeline | Chain sync→publish→collect | Keep |
| modkey_bib | Generate modkey bibliography | Keep — unique |
| docs_collect | Collect flow docs into MkDocs | Keep |
| docs_nav | Generate nav YAML fragment | Keep |
| mkdocs_nav_patch | Patch mkdocs.yml nav | Keep |

THIN OUT — replace internals with ecosystem tools

| Tool | Current impl | v2: adapter over |
|------|--------------|------------------|
| dag | Custom Snakefile parser | snakemake --d3dag JSON output |
| completion | Glob filesystem vs registry schema | snakemake --summary lifted into contract-level status |
| log_parse | Read raw snakemake logs | Pointer to snakemake --report + DataLad run record |
| config_init | Scaffold flat config.yml | Scaffold snakebids app skeleton (config/snakebids.yml + workflow/ + run.py) |
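As one example of the adapter pattern, the v2 dag tool could shell out to snakemake --d3dag and pass through the parsed JSON instead of parsing Snakefiles itself. A sketch with the runner injectable so it can be exercised without a Snakemake install — the stubbed payload shape below is illustrative; pipeio would return whatever snakemake emits:

```python
import json
import subprocess


def dag_json(snakefile: str, run=subprocess.run) -> dict:
    """Return Snakemake's DAG as parsed JSON via `snakemake --d3dag`.

    `run` is injectable for testing; by default it invokes the real CLI.
    """
    proc = run(
        ["snakemake", "--snakefile", snakefile, "--d3dag"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


# Stub standing in for the real CLI (payload shape is hypothetical):
class _FakeProc:
    stdout = '{"nodes": [{"rule": "all"}], "links": []}'


dag = dag_json("workflow/Snakefile", run=lambda *a, **kw: _FakeProc())
```

The custom parser goes away; pipeio only owns the MCP-facing shape of the result.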

STOP / REPLACE — duplicate ecosystem tools

| Tool | Current impl | v2: replaced by |
|------|--------------|-----------------|
| run | screen -dmS snakemake + runs.json | datalad run -- python run.py ... → return commit + run record |
| run_status | Parse screen sessions + log tail | DataLad run records + snakemake --summary |
| run_dashboard | Aggregate runs.json | DataLad git log of run records |
| run_kill | Kill screen sessions | Process management (if needed at all) |

Structural changes

Flow directory layout: flat → snakebids app

Current:

code/pipelines/{pipe}/{flow}/
    Snakefile
    config.yml
    scripts/

v2 (snakebids app):

code/pipelines/{flow}/               # or code/apps/{flow}/
    run.py                            # snakebids entry point
    config/
        snakebids.yml                 # pybids_inputs, parse_args, analysis_levels
    workflow/
        Snakefile
        rules/*.smk                   # mod-organized rule files
        scripts/
    notebooks/
    docs/

Impact on pipeio:

  • registry_scan: detect run.py + config/snakebids.yml as snakebids app markers
  • config_read/config_patch: target config/snakebids.yml
  • rule_insert/rule_list: look in workflow/rules/ and workflow/Snakefile
  • mod_create: scaffold scripts into workflow/scripts/
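The registry_scan marker check could be as small as this sketch — the file names follow the proposed v2 layout above, not any snakebids-mandated convention:

```python
from pathlib import Path


def looks_like_snakebids_app(flow_dir: str) -> bool:
    """Classify a flow directory as a snakebids app: an entry point
    plus the snakebids config file, per the proposed v2 layout."""
    root = Path(flow_dir)
    return (
        (root / "run.py").is_file()
        and (root / "config" / "snakebids.yml").is_file()
    )
```

During Phase 1 this check would run alongside the existing flat-layout detection, so both layouts register.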

Execution: screen → datalad run

Current:

pipeio_run(pipe, flow)
  → screen -dmS snakemake ...
  → writes runs.json

v2:

pipeio_run(pipe, flow, analysis_level="participant")
  → datalad run \
      --input {bids_dir} \
      --output {derivative_dir} \
      -- python run.py {bids_dir} {derivative_dir} {analysis_level}
  → returns { commit, run_record, derivative_dir }

Contracts feed the --input/--output declarations.
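A sketch of that wrapper: contract inputs and outputs become --input/--output flags on a datalad run invocation. The flag names follow the real datalad run CLI; the function itself and its argument names are hypothetical, and executing the command (then reading back the commit and run record) is left to the caller:

```python
def build_datalad_run_cmd(bids_dir: str, derivative_dir: str,
                          analysis_level: str = "participant") -> list:
    """Assemble the `datalad run` argv for one flow invocation."""
    return [
        "datalad", "run",
        "--input", bids_dir,         # declared so DataLad fetches/records it
        "--output", derivative_dir,  # declared so DataLad saves the results
        "--",
        "python", "run.py", bids_dir, derivative_dir, analysis_level,
    ]


cmd = build_datalad_run_cmd("data/raw", "data/derivatives/preproc")
```

Because the declarations come from contracts_validate, a failing contract can block the run before DataLad ever executes it.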

Cross-flow: registry → BIDS derivatives metadata

v2: Read/write dataset_description.json in each derivative dir:

{
  "Name": "preprocess-ecephys",
  "GeneratedBy": [{"Name": "preprocess-ecephys", "CodeURL": "..."}],
  "SourceDatasets": [{"URL": "../raw"}]
}

Standards-aligned lineage that any BIDS tool can read.
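A minimal sketch of the read/write half that cross_flow would use — the helper names are hypothetical, and the BIDSVersion pin is an assumption to be replaced with whatever the project targets:

```python
import json
from pathlib import Path


def write_lineage(derivative_dir: str, name: str, code_url: str,
                  source_urls: list) -> dict:
    """Write BIDS derivatives lineage fields into dataset_description.json."""
    desc = {
        "Name": name,
        "BIDSVersion": "1.8.0",        # assumed pin, adjust per project
        "DatasetType": "derivative",   # required by BIDS for derivatives
        "GeneratedBy": [{"Name": name, "CodeURL": code_url}],
        "SourceDatasets": [{"URL": url} for url in source_urls],
    }
    path = Path(derivative_dir) / "dataset_description.json"
    path.write_text(json.dumps(desc, indent=2))
    return desc


def read_sources(derivative_dir: str) -> list:
    """Return upstream dataset URLs recorded in a derivative dir."""
    path = Path(derivative_dir) / "dataset_description.json"
    desc = json.loads(path.read_text())
    return [src["URL"] for src in desc.get("SourceDatasets", [])]
```

cross_flow would walk these SourceDatasets links instead of relying solely on pipeio's own registry.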


Migration phases

Phase 0: Research & design (current)

  • [x] Deep research on ecosystem landscape
  • [x] Identify keep/thin/stop categories
  • [ ] Design snakebids.yml schema mapping (what pipeio reads/writes)
  • [ ] Design datalad run integration interface
  • [ ] Decide on pipe/flow hierarchy: keep pipe as category or flatten?

Phase 1: Structural alignment (non-breaking, additive)

  • [ ] registry_scan learns snakebids app layout alongside current flat layout
  • [ ] config_read supports both config.yml and config/snakebids.yml
  • [ ] config_init --snakebids generates app skeleton
  • [ ] Add dataset_description.json to flow scaffolding

Phase 2: Execution migration

  • [ ] pipeio_run gains provenance=True → wraps with datalad run
  • [ ] pipeio_dag switches to snakemake --d3dag backend
  • [ ] pipeio_completion switches to snakemake --summary backend
  • [ ] Deprecate runs.json state machine

Phase 3: Full snakebids alignment

  • [ ] Default scaffolding generates snakebids app layout
  • [ ] Remove flat layout support (or keep as legacy)
  • [ ] Explore pipeio as snakebids plugin
  • [ ] cross_flow reads BIDS GeneratedBy/SourceDatasets

Tool count

| Category | v1 | v2 |
|----------|----|----|
| Keep (unique) | 27 | 27 |
| Thin (adapter) | 4 | 4 (same API, different internals) |
| Stop (replace) | 4 | 0 |
| New | 0 | ~2 (datalad_run wrapper, bids_metadata) |
| Total | 35 | ~33 |

The surface barely changes. The difference is what's inside: pipeio stops reimplementing and starts composing.