
Migration guide: pixecog preprocess/ieeg → pipeio v2 snakebids app

This document audits the current preprocess/ieeg flow in pixecog and provides a step-by-step migration guide for converting it to a pipeio v2 snakebids app layout.


1. Current layout audit

1.1 Directory structure

code/pipelines/preprocess/ieeg/
├── Snakefile              # monolithic, 693 lines, ~25 rules
├── config.yml             # 332 lines: pybids_inputs, registry, params
├── Makefile               # notebook publishing workflow
├── scripts/               # 14 active scripts + 2 deprecated subdirs
│   ├── badlabel.py
│   ├── badness_video.py
│   ├── feature.py
│   ├── feature_umap.py
│   ├── filter.py
│   ├── interpolate.py
│   ├── lfp_video.py
│   ├── plot_feature_maps.py
│   ├── rowcol_noise.py
│   ├── ttl_removal.py
│   ├── linenoise/         # 7 scripts (zapline, comb profiling, downsample)
│   └── noise_tfspace/     # 3 scripts (spectrogram, summary, report)
├── notebooks/             # 21 notebooks (jupytext .py format)
│   ├── notebook.yml       # notebook registry
│   └── {name}/{name}.py
├── report/                # .rst caption files for snakemake --report
├── docs/                  # flow documentation
├── _docs/                 # legacy docs
└── .snakemake/            # snakemake cache (not tracked)

1.2 Snakefile analysis

Key characteristics:

  • Already uses generate_inputs from snakebids (line 17) — major plus
  • Uses set_bids_spec("v0_0_0") for BIDS path generation
  • Depends on sutil.repo_root.repo_abs for absolute path resolution (~30 call sites in Snakefile)
  • Depends on sutil.bids.paths.BidsPaths for registry-driven path resolution
  • Uses configfile: "config.yml", then re-reads it with safe_load (double-parsing)
  • All rules are inline in a single Snakefile (no .smk includes)
  • Uses the report() wrapper for snakemake report integration on several outputs
  • One rule (zapline_plus) has conda: "/storage/share/python/environments/Anaconda3/envs/matlab" — a hardcoded absolute path

Rules (31 listed, including 6 target rules):

Rule | Type | Script | Notes
all | target | | Default: expand pre_all + COMB_SUMMARY_HTML
noisy_all | target | | Viz targets only
report_noisy_all | target | | Existing viz files only
noise_tfspace_all | target | | TF-space reports
report | target | | Existing all + comb summary
registry | utility | inline | Writes registry YAML to derivatives
manifest | utility | inline | Writes manifest TSV
test | target | | Single test entity
status | checkpoint | | Touch file after QC outputs
raw_zarr | transform | inline | BIDS LFP → zarr
ttl_removal | transform | scripts/ttl_removal.py | TTL artifact removal
lowpass | transform | scripts/filter.py | Lowpass filter
downsample | transform | inline | Downsample zarr
feature | transform | scripts/feature.py | Feature extraction
badlabel | transform | scripts/badlabel.py | Bad channel detection
plot_feature_maps | viz | scripts/plot_feature_maps.py | Feature map plots
rowcol_noise | qc | scripts/rowcol_noise.py | Row/col noise stats
badness_video | viz | scripts/badness_video.py | Badness animation
lfp_video | viz | scripts/lfp_video.py | LFP animation
interpolate | transform | scripts/interpolate.py | Bad channel interpolation
noise_tfspace_spectrogram | transform | scripts/noise_tfspace/compute_tfspace_spectrogram.py |
noise_tfspace_summary | transform | scripts/noise_tfspace/compute_tfspace_summary.py |
noise_tfspace_report | viz | scripts/noise_tfspace/plot_tfspace_summary.py |
downsample_for_zapline | transform | scripts/linenoise/downsample_lfp.py |
preprocess_json_sidecar | utility | shell | Symlinks sidecars
linenoise_profile_combfreqs | transform | scripts/linenoise/measure_combfreqs.py |
comb_cross_session_summary | viz | scripts/linenoise/comb_cross_session_summary.py |
comb_qc_plot | viz | scripts/linenoise/comb_qc_plot.py |
zapline_plus | transform | scripts/linenoise/clean_zapline_plus.py | Needs matlab conda env
noisy_spectrogram | viz | scripts/linenoise/sample_spectrogram_plot.py |
preprocess_alias | utility | shell | Symlinks final output

1.3 Config structure

# Input sources
input_dir: "raw"
input_registry: "raw/registry.yml"
input_dir_brainstate: "derivatives"
input_registry_brainstate: "derivatives/brainstate/flow-brainstate_registry.yml"

# pybids_inputs (2 input types: ieeg, ecephys)
pybids_inputs:
  ieeg: { filters: ..., wildcards: [subject, session, task] }
  ecephys: { filters: ..., wildcards: [subject, session, task, acquisition, recording] }

# Member set anchors (YAML &anchors for DRY)
_member_sets: { ... }

# Output
output_dir: "derivatives/preprocess"
output_registry: "derivatives/preprocess/pipe-preprocess_flow-ieeg_registry.yml"

# Registry groups (16 groups, ~60 members)
registry: { all, raw_zarr, lowpass, downsample, feature, badlabel, noise,
            noise_tfspace, interpolate, zapline_in, linenoise, viz,
            linenoise_profile, preprocess, ttl_removal }

# Processing params
geometry: { ... }
windowing: { ... }
features: [ ... ]
noise: { ... }
badlabel: { ... }
umap: { ... }
video: { ... }
linenoise: { ... }   # 20+ params
noise_tfspace: { ... }
ttl_removal: { ... }

1.4 Dependencies and cross-flow consumers

Internal dependencies (sutil):

  • sutil.repo_root.repo_abs — used in the Snakefile + 20 scripts + 15 notebooks
  • sutil.bids.paths.BidsPaths — used in the Snakefile + 2 notebooks

Cross-flow consumers (downstream flows reading derivatives/preprocess/):

  • sharpwaveripple — reads derivatives/preprocess/ via the ecephys registry
  • spectrogram/burst — reads derivatives/preprocess/ via the ecephys registry
  • Both reference pipe-preprocess_flow-ecephys_registry.yml (the ecephys sibling, not ieeg)

Key finding: No downstream flow directly consumes the ieeg preprocess registry. The ieeg flow's outputs feed the ecephys flow (which shares derivatives/preprocess/), and downstream flows consume the ecephys registry. This means the ieeg migration has no direct cross-flow breakage risk.

1.5 Output structure

derivatives/preprocess/ is a DataLad subdataset (has .git/).

Contains:

  • Per-stage subdirs: all/, badlabel/, downsample/, feature/, interpolate/, linenoise/, linenoise_in/, linenoise_profile/, lowpass/, noise/, noise_tfspace/, raw_zarr/, transient/, validate/, viz/, viz_cache/
  • Per-subject dirs: sub-01/ through sub-05/, sub-test/
  • Registry files: pipe-preprocess_flow-ieeg_registry.yml, pipe-preprocess_flow-ecephys_registry.yml
  • No dataset_description.json — needs to be created
  • No run.py — needs to be created

1.6 Pipeio registry status

The flow is already registered in pipeio at .projio/pipeio/registry.yml under preprocess/ieeg with 22 mods and full rule mapping. This is consistent with the current flat layout.


2. Current → target mapping

2.1 Directory mapping

Current path | v2 path | Action
Snakefile | workflow/Snakefile | Move; split rules into .smk files
config.yml | config/snakebids.yml | Move; add parse_args + analysis_levels
scripts/ | workflow/scripts/ | Move
scripts/linenoise/ | workflow/scripts/linenoise/ | Move
scripts/noise_tfspace/ | workflow/scripts/noise_tfspace/ | Move
notebooks/ | notebooks/ | Keep in place
report/ | workflow/report/ | Move (snakemake convention)
docs/ | docs/ | Keep in place
Makefile | Makefile | Keep; update paths
(new) | run.py | Create snakebids entry point
(new) | derivatives/preprocess/dataset_description.json | Create BIDS metadata

2.2 Snakefile split by mod

The monolithic Snakefile should be split into mod-organized .smk files:

Module | Rules | Target file
common | all, test, report, status, manifest, registry, targets | workflow/Snakefile (keep orchestration)
raw | raw_zarr, ttl_removal | workflow/rules/raw.smk
signal | lowpass, downsample, feature | workflow/rules/signal.smk
badlabel | badlabel, plot_feature_maps, badness_video | workflow/rules/badlabel.smk
noise | rowcol_noise | workflow/rules/noise.smk
interpolate | interpolate, preprocess_json_sidecar, preprocess_alias | workflow/rules/interpolate.smk
linenoise | downsample_for_zapline, linenoise_profile_combfreqs, comb_cross_session_summary, comb_qc_plot, zapline_plus, noisy_spectrogram | workflow/rules/linenoise.smk
noise_tfspace | noise_tfspace_spectrogram, noise_tfspace_summary, noise_tfspace_report | workflow/rules/noise_tfspace.smk
viz | lfp_video | workflow/rules/viz.smk

3. Blockers and decisions

3.1 sutil.repo_root.repo_abs dependency (MAJOR)

Scope: 30+ call sites in Snakefile, 20 scripts, 15 notebooks.

repo_abs(rel) resolves a path relative to the repository root. In the v2 snakebids model, the Snakefile runs from workflow/ and paths should be relative to the app root or use snakebids' own path resolution.

Decision required: how to replace repo_abs.

  • Option A: Replace with Path(workflow.basedir).parent / rel in Snakefile context (snakemake provides workflow.basedir = the directory containing the Snakefile).
  • Option B: Replace with config["root"], where root is injected by run.py.
  • Option C: Keep sutil but make repo_abs resolve from config rather than the git root.

Recommendation: Option B — run.py sets config["root"] to the repo root, and all repo_abs(x) calls become Path(config["root"]) / x. This is a mechanical find-replace.
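The mechanical find-replace can be scripted. A hypothetical helper, assuming call sites pass a single string literal (the common pattern in this Snakefile); anything fancier still needs a manual pass:

```python
import re

# Matches repo_abs("some/path") or repo_abs('some/path').
CALL = re.compile(r'repo_abs\((["\'])(.*?)\1\)')

def rewrite_repo_abs(source: str) -> str:
    """Rewrite repo_abs("x") call sites to Path(config["root"]) / "x"."""
    return CALL.sub(r'Path(config["root"]) / \1\2\1', source)
```

Run over the Snakefile and each script, then review the diff; call sites built from variables or f-strings won't match the literal pattern and must be converted by hand.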

3.2 sutil.bids.paths.BidsPaths dependency (MODERATE)

Used for registry-driven path construction: out_paths("group", "member") → BIDS path template.

In v2, this is replaced by pipeio.bids.BidsResolver. The API is similar but not identical. The Snakefile setup code (lines 15–26) needs rewriting.

Migration: BidsResolver is a drop-in adapter with the same (group, member) call signature. Import changes from sutil.bids.paths.BidsPaths to pipeio.bids.BidsResolver.
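Until BidsResolver lands, a thin adapter with the same (group, member) call signature can bridge the gap. A hypothetical sketch (the class name, backing dict, and template are illustrative, not pipeio API):

```python
class RegistryPaths:
    """Hypothetical adapter exposing the (group, member) call signature
    used throughout the Snakefile, backed by a plain registry dict.
    Swap the backend for pipeio.bids.BidsResolver once it is available."""

    def __init__(self, registry: dict):
        self._registry = registry

    def __call__(self, group: str, member: str) -> str:
        # Look up the BIDS path template for a registry member.
        return self._registry[group][member]

# Toy registry entry for illustration only:
out_paths = RegistryPaths({"feature": {"zarr": "sub-{subject}/feature.zarr"}})
```

Because the call signature is preserved, rule bodies that use out_paths("group", "member") need no changes when the backend is swapped.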

3.3 configfile double-parsing (MINOR)

Lines 8 and 12–13: configfile: "config.yml", then the same file is re-read with safe_load. This is because repo_abs needs the config dict before snakemake's config is fully available.

v2 fix: With run.py injecting paths, the double-parse becomes unnecessary. Use snakemake's native configfile: directive only.

3.4 Hardcoded conda environment (MINOR)

zapline_plus rule uses conda: "/storage/share/python/environments/Anaconda3/envs/matlab".

v2 fix: Move to a workflow/envs/matlab.yml conda env spec, or use config["conda_envs"]["matlab"] for portability.
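A minimal workflow/envs/matlab.yml sketch — the channel and package names below are placeholders, and the MATLAB engine itself is licensed software that typically cannot be solved from conda channels, so this env may only cover the Python side of the bridge:

```yaml
# workflow/envs/matlab.yml — placeholder spec, site-specific in practice
name: matlab
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
```

If the shared env truly cannot be replicated, the config["conda_envs"]["matlab"] indirection keeps the absolute path out of the rule body.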

3.5 report() paths with rst captions (MINOR)

Several rules use report(path, caption="report/foo.rst"). The caption paths are relative to the rule's location. After moving rules to workflow/rules/, these need updating.

v2 fix: Move report/ to workflow/report/ and update caption paths.

3.6 Cross-flow output directory sharing (INFO)

Both preprocess/ieeg and preprocess/ecephys write to derivatives/preprocess/. They share the DataLad subdataset but use separate registries. This is fine for v2 — BIDS derivatives directories can contain outputs from multiple pipelines. The dataset_description.json should list both as generators.


4. Step-by-step migration guide

Phase 1: Pre-migration checklist

  • [ ] Verify all current outputs are committed in the derivatives/preprocess subdataset
  • [ ] Run snakemake -n to confirm current Snakefile parses cleanly
  • [ ] Back up current Snakefile: cp Snakefile Snakefile.v1.bak
  • [ ] Verify sutil is installed in the cogpy environment
  • [ ] Check that no other flow's Snakefile imports from preprocess/ieeg/ directly
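The last checklist item can be scripted. A hedged sketch that scans sibling Snakefiles for references to this flow (the helper name and directory layout are assumptions based on the audit above):

```python
from pathlib import Path

def flows_referencing_ieeg(pipelines_dir: Path) -> list[Path]:
    """List Snakefiles under pipelines_dir (outside preprocess/ieeg)
    that mention the preprocess/ieeg flow directory."""
    offenders = []
    for snakefile in sorted(pipelines_dir.rglob("Snakefile")):
        if "preprocess/ieeg" in str(snakefile.parent):
            continue  # skip the flow being migrated
        if "preprocess/ieeg" in snakefile.read_text():
            offenders.append(snakefile)
    return offenders
```

An empty result confirms the cross-flow finding in section 1.4; any hit means that flow must be audited before the move.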

Phase 2: Create v2 directory skeleton

cd code/pipelines/preprocess/ieeg

# Create v2 directories
mkdir -p workflow/rules
mkdir -p workflow/scripts
mkdir -p config

# Move files
mv Snakefile workflow/Snakefile
mv config.yml config/snakebids.yml
mv scripts/* workflow/scripts/
rmdir scripts
mv report workflow/report

Phase 3: Create run.py

#!/usr/bin/env python
"""Snakebids entry point for preprocess/ieeg flow."""
from pathlib import Path
from snakebids.app import SnakeBidsApp

def main():
    # SnakeBidsApp takes the app directory and locates workflow/Snakefile
    # and config/snakebids.yml within it (the v2 layout).
    app = SnakeBidsApp(Path(__file__).resolve().parent)
    app.run_snakemake()

if __name__ == "__main__":
    main()

Phase 4: Update config/snakebids.yml

Add snakebids-required sections at the top:

# snakebids app metadata
app_name: preprocess-ieeg
analysis_levels: &analysis_levels
  - participant

parse_args:
  bids_dir:
    help: "Input BIDS directory"
    default: "raw"
  output_dir:
    help: "Output derivatives directory"
    default: "derivatives/preprocess"
  analysis_level:
    help: "Analysis level"
    choices: *analysis_levels
    default: "participant"

# ... rest of existing config unchanged ...

Phase 5: Update workflow/Snakefile

Key changes to the Snakefile header:

from snakemake.utils import min_version
min_version("6.0")
from snakebids import generate_inputs, bids, set_bids_spec
set_bids_spec("v0_0_0")

from pathlib import Path

configfile: str(Path(workflow.basedir).parent / "config" / "snakebids.yml")

# v2: repo root injected by run.py via config["root"]; the fallback walks up
# from workflow/ (ieeg → preprocess → pipelines → code → repo root)
ROOT = Path(config.get("root", Path(workflow.basedir).parents[4]))

# Replace all repo_abs() calls with ROOT / "path"
# e.g.: repo_abs("code/pipelines/preprocess/ieeg") → ROOT / "code/pipelines/preprocess/ieeg"

from sutil.bids.paths import BidsPaths  # or pipeio.bids.BidsResolver when ready

inputs = generate_inputs(ROOT / config["input_dir"], config["pybids_inputs"])
# ... rest of setup with ROOT instead of repo_abs ...

# Include mod rules
include: "rules/raw.smk"
include: "rules/signal.smk"
include: "rules/badlabel.smk"
include: "rules/noise.smk"
include: "rules/interpolate.smk"
include: "rules/linenoise.smk"
include: "rules/noise_tfspace.smk"
include: "rules/viz.smk"

# Keep target rules in main Snakefile
rule all:
    input:
        inputs['ieeg'].expand(pre_all),
        COMB_SUMMARY_HTML
# ... other target rules ...

Phase 6: Split rules into .smk files

For each .smk file, extract the relevant rules from the monolithic Snakefile. The rules can reference ROOT, inputs, in_paths, out_paths, and config as globals (Snakemake includes share the namespace).

Update script: directives — paths are relative to the rule file's directory:

  • In workflow/Snakefile: script: "scripts/foo.py" (relative to workflow/)
  • In workflow/rules/raw.smk: script: "../scripts/foo.py" (up one level)
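For instance, a rule moved into workflow/rules/raw.smk keeps its body unchanged apart from the script path. The rule I/O below is illustrative, not copied from the real Snakefile:

```
# workflow/rules/raw.smk — illustrative shape of an extracted rule
rule ttl_removal:
    input:
        ROOT / "derivatives/preprocess/raw_zarr/{subject}.zarr"   # ROOT is a Snakefile global
    output:
        ROOT / "derivatives/preprocess/ttl_removal/{subject}.zarr"
    script:
        "../scripts/ttl_removal.py"   # up one level from rules/ to workflow/scripts/
```

Because includes share the Snakefile's namespace, ROOT, inputs, and the path helpers need no re-import in the .smk files.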

Phase 7: Update script repo_abs calls

Mechanical replacement in all scripts:

# Before:
from sutil.repo_root import repo_abs
path = repo_abs("derivatives/preprocess/...")

# After (in snakemake script context):
from pathlib import Path
root = Path(snakemake.config.get("root", "."))
path = root / "derivatives/preprocess/..."

For scripts that use repo_abs only for log file paths or notebook references, the replacement is straightforward. Each script's snakemake.config["root"] provides the repo root.
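After the sweep, a quick scan can confirm that no repo_abs references survive. A hypothetical post-migration check, not part of pipeio:

```python
from pathlib import Path

def find_repo_abs_refs(root: Path) -> list[str]:
    """Return 'file:lineno' entries for lines still mentioning repo_abs
    under root — useful on workflow/scripts/ after the rewrite."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "repo_abs" in line:
                hits.append(f"{path}:{lineno}")
    return hits
```

An empty list over workflow/scripts/ (and later notebooks/) marks the phase as done.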

Phase 8: Create dataset_description.json

{
  "Name": "preprocess",
  "BIDSVersion": "1.9.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "preprocess-ieeg",
      "Description": "iEEG preprocessing pipeline: raw→zarr, lowpass, downsample, feature extraction, bad channel detection, interpolation, line noise removal",
      "CodeURL": "code/pipelines/preprocess/ieeg"
    },
    {
      "Name": "preprocess-ecephys",
      "Description": "Extracellular electrophysiology preprocessing pipeline",
      "CodeURL": "code/pipelines/preprocess/ecephys"
    }
  ],
  "SourceDatasets": [
    {
      "URL": "../../raw"
    }
  ]
}

Place at derivatives/preprocess/dataset_description.json.

Phase 9: Update Makefile

Update SNAKEMAKE invocation paths and any references to Snakefile or config.yml to point to the new locations.

Phase 10: Test plan

  1. Parse test: cd code/pipelines/preprocess/ieeg && snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n
  2. Dry run: snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n --forceall
  3. Single subject test: snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n --config root=$(git rev-parse --show-toplevel) -- test
  4. Entry point test: python run.py raw derivatives/preprocess participant --dry-run
  5. Full run on test subject: verify output matches v1 byte-for-byte for deterministic rules
  6. Registry scan: pipeio_registry_scan() should detect the flow as app_type: snakebids

Phase 11: Rollback plan

cd code/pipelines/preprocess/ieeg

# Restore v1 layout
mv workflow/Snakefile ./Snakefile
mv config/snakebids.yml ./config.yml
mkdir -p scripts   # scripts/ was removed by rmdir during migration
mv workflow/scripts/* ./scripts/
mv workflow/report ./report
rm -rf workflow config run.py

All changes are local to code/pipelines/preprocess/ieeg/. The derivatives directory and DataLad subdataset are unaffected. No downstream flows need changes since they consume the ecephys registry, not the ieeg one.


5. Effort estimate by component

Component | Effort | Blocking?
Directory restructure | Small | No
run.py creation | Small | No
Config additions | Small | No
Snakefile split into .smk | Medium | No
repo_abs → ROOT / in Snakefile | Small (mechanical) | No
repo_abs → config["root"] in 20 scripts | Medium (mechanical) | No
BidsPaths → BidsResolver swap | Small (API-compatible) | Needs BidsResolver impl
script: path updates in .smk files | Small (mechanical) | No
report() caption path updates | Small | No
Notebook repo_abs updates | Large (15 notebooks) | Non-blocking (defer)
Hardcoded conda env | Small | No
dataset_description.json | Small | No
Testing | Medium |

Total: ~1-2 focused sessions. The mechanical repo_abs replacement dominates.


6. Pixecog-specific vs reusable

Pixecog-specific

  • BidsPaths → BidsResolver migration (pixecog's custom path resolution)
  • sutil.repo_root.repo_abs elimination (pixecog utility)
  • Specific rule split plan (domain knowledge)
  • Cross-flow analysis (project-specific topology)
  • Hardcoded conda path

Reusable for any flow migration

  • Directory restructure template (flat → snakebids app)
  • run.py boilerplate
  • config/snakebids.yml additions (parse_args, analysis_levels)
  • dataset_description.json template
  • script: path update rules (Snakefile → rules/ relative paths)
  • Test plan structure
  • Rollback plan pattern

Candidates for pipeio automation (pipeio_flow_migrate)

  1. Directory scaffolding — mkdir -p workflow/rules config + file moves
  2. run.py generation — template with flow name substitution
  3. dataset_description.json generation — from registry metadata + config
  4. script: path rewriting — parse rules, adjust relative paths after move
  5. configfile: path update — mechanical
  6. Dry-run validation — snakemake -n after migration to verify the parse
  7. Registry rescan — verify detection as snakebids app

A pipeio_flow_migrate(pipe, flow, dry_run=True) tool could handle items 1–6 automatically, with dry_run=True showing the plan before execution.
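The planning half of such a tool could be as small as a move map plus an existence check. A hypothetical sketch — pipeio_flow_migrate does not exist yet, and the layout assumptions come from the mapping table in section 2.1:

```python
from pathlib import Path

# Flat-layout → v2-layout move map (section 2.1).
MOVE_MAP = {
    "Snakefile": "workflow/Snakefile",
    "config.yml": "config/snakebids.yml",
    "scripts": "workflow/scripts",
    "report": "workflow/report",
}

def plan_migration(flow_dir: Path) -> list[tuple[str, str]]:
    """Return (src, dst) pairs for paths that exist in flow_dir.

    Dry-run only: nothing is moved, so the plan can be reviewed
    before any execution step runs."""
    return [(src, dst) for src, dst in MOVE_MAP.items()
            if (flow_dir / src).exists()]
```

The execution half would then perform the moves, render run.py and dataset_description.json from templates, and finish with the snakemake -n parse check.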