Skip to content

pipeio: Pipeline Documentation Conventions

Problem

Pipeline documentation exists at three levels of abstraction, but only one (mod-level) has defined conventions and tooling:

Level What it describes Convention? Scaffolded? Collected?
Mod One processing step: rationale, spec, known issues theory/spec/delta facets mod_create docs_collectmods/
Flow How mods chain within a workflow: ordering rationale, DAG, design decisions Ad-hoc Only stub index.md Copied as-is
Pipeline How flows compose across a project: layer architecture, data flow, status Manual Nothing Nothing

Agents and humans writing flow-level or pipeline-level docs have no template, no scaffold, and no guidance on what sections to include. The result is inconsistency: some flows have 170-line overviews with full DAG diagrams and design rationale, others have 3-line stubs.

Design

Principle: convention over generation

These docs are primarily human-authored narratives — scientific rationale, design decisions, known gaps. Pipeio should define what sections belong at each level and scaffold templates, but not attempt to auto-generate the narrative content. Auto-generation is limited to structural metadata that pipeio already knows (mod listings, manifest chains, config summaries).

Flow Overview: docs/overview.md

Each flow gets an overview.md in its docs/ directory. This is the entry point for understanding the flow as a whole — how mods compose into a processing chain and why.

Sections

Section Content Auto-populatable?
Purpose What this flow produces and why it exists as a unit No — requires scientific context
Input / Output Input sources, output derivative, manifest paths Partially — from config.yml
Mod Chain Processing order, dependencies between mods, DAG Partially — mod list from registry, DAG from snakemake
Design Decisions Why this ordering, why these flow boundaries, alternatives considered No — human narrative
Known Gaps Flow-level issues, missing mods, planned additions No — human narrative (delta-like)

Template

# {flow} — Flow Overview

## Purpose

<!-- What does this flow produce? Why is it a single flow rather than
     split into multiple? What downstream flows consume its output? -->

## Input

- Input directory: `{input_dir}`
- Input manifest: `{input_manifest}` (from flow: `{source_flow}`)
- Wildcards: {wildcards}

## Output

- Output directory: `{output_dir}`
- Output manifest: `{output_manifest}`

## Mod Chain

<!-- How do the mods compose? What is the processing order and why?
     Include an ASCII or mermaid DAG if helpful. -->

| Order | Mod | Purpose |
|-------|-----|---------|
| 1 | {mod} | {one-line description} |
| ... | ... | ... |

## Design Decisions

<!-- Key choices: why this mod ordering, why certain steps read from
     raw vs intermediate, why certain operations are combined or split. -->

## Known Gaps

<!-- Flow-level issues. Unlike mod-level delta.md, these are about
     missing mods, architectural problems, or cross-mod concerns.
     Remove entries as they are resolved. -->

Lifecycle

flow_new scaffold (stub) → agent/human fills Purpose + Mod Chain
  → mods evolve → update Mod Chain + Design Decisions
    → issues found → add Known Gaps
      → gaps resolved → remove from Known Gaps

The overview is a living document that evolves with the flow. It is not a one-time scaffold.

Relationship to mod docs

Flow overview describes how mods compose. Mod theory describes why a specific processing step works. Mod spec describes what it does technically. There should be no duplication — the overview references mods by name and defers detail to their facet docs.

flow overview.md    "We apply badlabel before interpolation because..."
  └→ badlabel/theory.md   "DBSCAN outlier detection on quantile-aggregated features..."
  └→ badlabel/spec.md     "Input: feature.zarr (nch × nwin × 5). Output: mask.npy (nch,)"

Pipeline Architecture: code/pipelines/architecture.md

One document per project describing how flows compose into a multi-stage analysis. This is the highest level of pipeline documentation.

Sections

Section Content Auto-populatable?
Architecture Diagram Mermaid/ASCII graph of flow dependencies Partially — from cross_flow manifest chains
Flow Table Status, layer/stage, description per flow Partially — flow names + status from registry
Data Flow How derivatives chain: which flow consumes which Yes — from cross_flow
Design Principles Why these flow boundaries, what defines a flow No — human narrative
References Links to architecture decision notes No — human curation

Template

# Pipeline Architecture

<!-- High-level description of the project's analysis pipeline.
     How do flows compose from raw data to final results? -->

## Architecture Diagram

```mermaid
graph TD
    %% Auto-generated scaffold from cross_flow manifest chains.
    %% Edit to add layers, groupings, and planned flows.
{mermaid_body}

Flows

Flow Stage Status Description
{flow_table}

Data Flow

Consumer Input Manifest Producer
{chain_table}

Design Principles

References

#### Location

`code/pipelines/architecture.md` is the source of truth. `docs_collect` copies it to `docs/pipelines/architecture.md` for inclusion in the site.

The choice of `code/pipelines/` (not `docs/`) follows the same principle as flow docs: **source lives next to code**, site copies are build artifacts.

#### Lifecycle
first flow registered → scaffold architecture.md with flow table + manifest chains → human adds layers, grouping, design principles → new flows added → update diagram + table (agent or human) → architecture decisions → add References
### `docs_collect` Changes

1. **Flow overview collection** — already handled: `overview.md` → `index.md` renaming exists.

2. **Pipeline architecture collection** — new: if `code/pipelines/architecture.md` exists, copy it to `docs/pipelines/architecture.md` (with source-path header). Include in nav before per-flow entries.

### Commands: Who Scaffolds What

#### Flow overview: `pipeio_flow_new` (existing, extended)

**CLI:** `pipeio flow new <flow>`
**MCP:** `pipeio_flow_new(flow)`

Currently scaffolds `docs/index.md` only. Extended to also scaffold `docs/overview.md`.

**Idempotent behavior** (unchanged): only writes files that don't exist. Running `flow_new` on an existing flow with a hand-written `overview.md` is safe — it won't overwrite.

Change to `mcp_flow_new` in `pipeio/mcp.py`:
```python
# docs/overview.md (new — flow overview template)
overview = flow_dir / "docs" / "overview.md"
if not overview.exists():
    overview.write_text(FLOW_OVERVIEW_TEMPLATE.format(
        flow=flow,
        input_dir=raw.get("input_dir", ""),
        output_dir=raw.get("output_dir", f"derivatives/{flow}"),
        input_manifest=raw.get("input_manifest", ""),
        output_manifest=raw.get("output_manifest", f"derivatives/{flow}/manifest.yml"),
    ), encoding="utf-8")
    created.append("docs/overview.md")

The existing docs/index.md stays as a lightweight landing page (title + mod listing). The overview carries the narrative.

For existing flows that lack an overview: run pipeio flow new <flow> again. Since it's idempotent, it only creates the missing overview.md — everything else is skipped.

Pipeline architecture: pipeio_architecture_init (new tool)

CLI: pipeio docs architecture-init [--force] MCP: pipeio_architecture_init(force=False)

Scaffolds code/pipelines/architecture.md from live registry + manifest chain data.

Behavior: 1. If code/pipelines/architecture.md exists and force=False → return {"status": "exists", "path": "..."}. No overwrite. 2. If missing or force=True: - Call cross_flow to get manifest chains - Read registry for flow names - Generate mermaid diagram: one node per flow, edges from input_manifestoutput_manifest chains - Generate flow table: name + code_path (status/layer/description left as placeholders for human) - Generate data flow table from chains - Write to code/pipelines/architecture.md 3. Return {"status": "created", "path": "...", "flows": N, "chains": M}

Why a separate tool, not part of flow_new: - flow_new operates on a single flow. Architecture is cross-flow. - flow_new is called frequently (every new flow). Architecture init is called once per project, or occasionally to re-scaffold after major changes. - Different --force semantics: flow_new is always-safe idempotent. Architecture re-scaffold should require explicit opt-in.

Implementation location: pipeio/docs.py (new function architecture_init), exposed via pipeio/mcp.py and pipeio/cli.py (pipeio docs architecture-init).

Collection: pipeio_docs_collect (existing, extended)

CLI: pipeio docs collect MCP: pipeio_docs_collect()

Extended to handle code/pipelines/architecture.md:

# --- 0. Collect pipeline-level architecture doc ---
arch_src = pipelines_dir / "architecture.md"
if arch_src.is_file():
    _copy_with_header(arch_src, docs_base / "architecture.md", root)
    collected.append(str(docs_base / "architecture.md"))

No change to how flow-level docs are collected — overview.mdindex.md renaming already works.

Extended to insert architecture.md before per-flow entries:

if (docs_base / "architecture.md").exists():
    flow_navs.insert(0, {"Architecture": "architecture.md"})

Result:

- Pipelines:
  - Architecture: architecture.md
  - preprocess_ieeg:
    - Overview: preprocess_ieeg/index.md
    - Modules: ...

Respect for Existing Content — Summary

Command Target file If exists If missing
pipeio_flow_new docs/overview.md Skip (idempotent) Create from template
pipeio_flow_new docs/index.md Skip (idempotent) Create stub
pipeio_architecture_init code/pipelines/architecture.md Skip (unless --force) Create from registry + cross_flow
pipeio_docs_collect docs/pipelines/architecture.md Overwrite (build artifact) Create from source
pipeio_docs_collect docs/pipelines/{flow}/index.md Overwrite if stale (build artifact) Create stub

Invariant: source files in code/pipelines/ are never overwritten without --force. Build artifacts in docs/pipelines/ are always overwritten (they're gitignored).

Implementation Plan

Step Component Change
1 ontology.md Add Flow Overview and Pipeline Architecture sections documenting the conventions
2 mcp_flow_new Generate docs/overview.md template alongside docs/index.md (idempotent)
3 docs.py + mcp.py + cli.py New architecture_init function + pipeio_architecture_init MCP tool + pipeio docs architecture-init CLI
4 docs_collect Collect code/pipelines/architecture.mddocs/pipelines/architecture.md
5 docs_nav Insert architecture.md at top of pipelines nav
6 mod_doc_refresh Optionally update the mod chain table in overview.md when mods are added/removed

Steps 1–2 are pure convention + scaffold extension. Steps 3–5 add the architecture tool and collection. Step 6 is optional convenience. The convention is valuable even without the tooling — an agent can scaffold the template manually using cross_flow and flow_status output.

Non-Goals

  • Auto-generating narrative content. The purpose, design decisions, and known gaps sections require human or agent judgment. Pipeio provides the template and structural metadata, not the scientific rationale.
  • Enforcing completeness. Empty sections are fine. The template signals what should be documented, not what must be.
  • Versioning architecture docs. The architecture doc is a living document, not a snapshot. Git history provides versioning. No need for dated copies or changelogs.
  • Replacing project-level planning docs. code/pipelines/architecture.md describes the technical data flow. Higher-level planning (milestones, timelines, priorities) stays in docs/plan/ or questio.