
Migration guide: pixecog preprocess/ieeg → pipeio v2 snakebids app

This document audits the current preprocess/ieeg flow in pixecog and provides a step-by-step migration guide for converting it to a pipeio v2 snakebids app layout.


1. Current layout audit

1.1 Directory structure

code/pipelines/preprocess/ieeg/
├── Snakefile              # monolithic, 693 lines, ~25 rules
├── config.yml             # 332 lines: pybids_inputs, registry, params
├── Makefile               # notebook publishing workflow
├── scripts/               # 14 active scripts + 2 deprecated subdirs
│   ├── badlabel.py
│   ├── badness_video.py
│   ├── feature.py
│   ├── feature_umap.py
│   ├── filter.py
│   ├── interpolate.py
│   ├── lfp_video.py
│   ├── plot_feature_maps.py
│   ├── rowcol_noise.py
│   ├── ttl_removal.py
│   ├── linenoise/         # 7 scripts (zapline, comb profiling, downsample)
│   └── noise_tfspace/     # 3 scripts (spectrogram, summary, report)
├── notebooks/             # 21 notebooks (jupytext .py format)
│   ├── notebook.yml       # notebook registry
│   └── {name}/{name}.py
├── report/                # .rst caption files for snakemake --report
├── docs/                  # flow documentation
├── _docs/                 # legacy docs
└── .snakemake/            # snakemake cache (not tracked)

1.2 Snakefile analysis

Key characteristics:

  • Already uses generate_inputs from snakebids (line 17) — major plus
  • Uses set_bids_spec("v0_0_0") for BIDS path generation
  • Depends on sutil.repo_root.repo_abs for absolute path resolution (~30 call sites in Snakefile)
  • Depends on sutil.bids.paths.BidsPaths for registry-driven path resolution
  • Uses configfile: "config.yml", then re-reads it with safe_load (double-parsing)
  • All rules are inline in a single Snakefile (no .smk includes)
  • Uses the report() wrapper for snakemake report integration on several outputs
  • One rule (zapline_plus) has conda: "/storage/share/python/environments/Anaconda3/envs/matlab" — a hardcoded absolute path

Rules (31 listed, including 6 target rules):

Rule | Type | Script | Notes
all | target | | Default: expand pre_all + COMB_SUMMARY_HTML
noisy_all | target | | Viz targets only
report_noisy_all | target | | Existing viz files only
noise_tfspace_all | target | | TF-space reports
report | target | | Existing all + comb summary
registry | utility | inline | Writes registry YAML to derivatives
manifest | utility | inline | Writes manifest TSV
test | target | | Single test entity
status | checkpoint | | Touch file after QC outputs
raw_zarr | transform | inline | BIDS LFP → zarr
ttl_removal | transform | scripts/ttl_removal.py | TTL artifact removal
lowpass | transform | scripts/filter.py | Lowpass filter
downsample | transform | inline | Downsample zarr
feature | transform | scripts/feature.py | Feature extraction
badlabel | transform | scripts/badlabel.py | Bad channel detection
plot_feature_maps | viz | scripts/plot_feature_maps.py | Feature map plots
rowcol_noise | qc | scripts/rowcol_noise.py | Row/col noise stats
badness_video | viz | scripts/badness_video.py | Badness animation
lfp_video | viz | scripts/lfp_video.py | LFP animation
interpolate | transform | scripts/interpolate.py | Bad channel interpolation
noise_tfspace_spectrogram | transform | scripts/noise_tfspace/compute_tfspace_spectrogram.py |
noise_tfspace_summary | transform | scripts/noise_tfspace/compute_tfspace_summary.py |
noise_tfspace_report | viz | scripts/noise_tfspace/plot_tfspace_summary.py |
downsample_for_zapline | transform | scripts/linenoise/downsample_lfp.py |
preprocess_json_sidecar | utility | shell | Symlinks sidecars
linenoise_profile_combfreqs | transform | scripts/linenoise/measure_combfreqs.py |
comb_cross_session_summary | viz | scripts/linenoise/comb_cross_session_summary.py |
comb_qc_plot | viz | scripts/linenoise/comb_qc_plot.py |
zapline_plus | transform | scripts/linenoise/clean_zapline_plus.py | Needs matlab conda env
noisy_spectrogram | viz | scripts/linenoise/sample_spectrogram_plot.py |
preprocess_alias | utility | shell | Symlinks final output

1.3 Config structure

# Input sources
input_dir: "raw"
input_registry: "raw/registry.yml"
input_dir_brainstate: "derivatives"
input_registry_brainstate: "derivatives/brainstate/flow-brainstate_registry.yml"

# pybids_inputs (2 input types: ieeg, ecephys)
pybids_inputs:
  ieeg: { filters: ..., wildcards: [subject, session, task] }
  ecephys: { filters: ..., wildcards: [subject, session, task, acquisition, recording] }

# Member set anchors (YAML &anchors for DRY)
_member_sets: { ... }

# Output
output_dir: "derivatives/preprocess"
output_registry: "derivatives/preprocess/pipe-preprocess_flow-ieeg_registry.yml"

# Registry groups (16 groups, ~60 members)
registry: { all, raw_zarr, lowpass, downsample, feature, badlabel, noise,
            noise_tfspace, interpolate, zapline_in, linenoise, viz,
            linenoise_profile, preprocess, ttl_removal }

# Processing params
geometry: { ... }
windowing: { ... }
features: [ ... ]
noise: { ... }
badlabel: { ... }
umap: { ... }
video: { ... }
linenoise: { ... }   # 20+ params
noise_tfspace: { ... }
ttl_removal: { ... }

1.4 Dependencies and cross-flow consumers

Internal dependencies (sutil):

  • sutil.repo_root.repo_abs — used in the Snakefile + 20 scripts + 15 notebooks
  • sutil.bids.paths.BidsPaths — used in the Snakefile + 2 notebooks

Cross-flow consumers (downstream flows reading derivatives/preprocess/):

  • sharpwaveripple — reads derivatives/preprocess/ via the ecephys registry
  • spectrogram/burst — reads derivatives/preprocess/ via the ecephys registry
  • Both reference pipe-preprocess_flow-ecephys_registry.yml (the ecephys sibling, not ieeg)

Key finding: No downstream flow directly consumes the ieeg preprocess registry. The ieeg flow's outputs feed the ecephys flow (which shares derivatives/preprocess/), and downstream flows consume the ecephys registry. This means the ieeg migration has no direct cross-flow breakage risk.

1.5 Output structure

derivatives/preprocess/ is a DataLad subdataset (has .git/).

Contains:

  • Per-stage subdirs: all/, badlabel/, downsample/, feature/, interpolate/, linenoise/, linenoise_in/, linenoise_profile/, lowpass/, noise/, noise_tfspace/, raw_zarr/, transient/, validate/, viz/, viz_cache/
  • Per-subject dirs: sub-01/ through sub-05/, sub-test/
  • Registry files: pipe-preprocess_flow-ieeg_registry.yml, pipe-preprocess_flow-ecephys_registry.yml
  • No dataset_description.json — needs to be created
  • No run.py — needs to be created

1.6 Pipeio registry status

The flow is already registered in pipeio at .projio/pipeio/registry.yml under preprocess/ieeg with 22 mods and full rule mapping. This is consistent with the current flat layout.


2. Current → target mapping

2.1 Directory mapping

Current path | v2 path | Action
Snakefile | workflow/Snakefile | Move; split rules into .smk files
config.yml | config/snakebids.yml | Move; add parse_args + analysis_levels
scripts/ | workflow/scripts/ | Move
scripts/linenoise/ | workflow/scripts/linenoise/ | Move
scripts/noise_tfspace/ | workflow/scripts/noise_tfspace/ | Move
notebooks/ | notebooks/ | Keep in place
report/ | workflow/report/ | Move (snakemake convention)
docs/ | docs/ | Keep in place
Makefile | Makefile | Keep; update paths
(new) | run.py | Create snakebids entry point
(new) | derivatives/preprocess/dataset_description.json | Create BIDS metadata

2.2 Snakefile split by mod

The monolithic Snakefile should be split into mod-organized .smk files:

Module | Rules | Target file
common | all, test, report, status, manifest, registry, targets | workflow/Snakefile (keep orchestration)
raw | raw_zarr, ttl_removal | workflow/rules/raw.smk
signal | lowpass, downsample, feature | workflow/rules/signal.smk
badlabel | badlabel, plot_feature_maps, badness_video | workflow/rules/badlabel.smk
noise | rowcol_noise | workflow/rules/noise.smk
interpolate | interpolate, preprocess_json_sidecar, preprocess_alias | workflow/rules/interpolate.smk
linenoise | downsample_for_zapline, linenoise_profile_combfreqs, comb_cross_session_summary, comb_qc_plot, zapline_plus, noisy_spectrogram | workflow/rules/linenoise.smk
noise_tfspace | noise_tfspace_spectrogram, noise_tfspace_summary, noise_tfspace_report | workflow/rules/noise_tfspace.smk
viz | lfp_video | workflow/rules/viz.smk

3. Blockers and decisions

3.1 sutil.repo_root.repo_abs dependency (MAJOR)

Scope: 30+ call sites in Snakefile, 20 scripts, 15 notebooks.

repo_abs(rel) resolves a path relative to the repository root. In the v2 snakebids model, the Snakefile runs from workflow/ and paths should be relative to the app root or use snakebids' own path resolution.

Decision required: how to replace repo_abs.

  • Option A: Replace with Path(workflow.basedir).parent / rel in Snakefile context (snakemake provides workflow.basedir = the directory containing the Snakefile).
  • Option B: Replace with config["root"], where root is injected by run.py.
  • Option C: Keep sutil but make repo_abs resolve from config rather than the git root.

Recommendation: Option B — run.py sets config["root"] to the repo root, and all repo_abs(x) calls become Path(config["root"]) / x. This is a mechanical find-replace.
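The mechanical find-replace can be scripted. A hypothetical helper, assuming call sites pass a single string literal (the common pattern in this Snakefile); anything fancier still needs a manual pass:

```python
import re

# Matches repo_abs("some/path") or repo_abs('some/path').
CALL = re.compile(r'repo_abs\((["\'])(.*?)\1\)')

def rewrite_repo_abs(source: str) -> str:
    """Rewrite repo_abs("x") call sites to Path(config["root"]) / "x"."""
    return CALL.sub(r'Path(config["root"]) / \1\2\1', source)
```

Run over the Snakefile and each script, then review the diff; call sites built from variables or f-strings won't match the literal pattern and must be converted by hand.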

3.2 sutil.bids.paths.BidsPaths dependency (MODERATE)

Used for registry-driven path construction: out_paths("group", "member") → BIDS path template.

In v2, this is replaced by pipeio.bids.BidsResolver. The API is similar but not identical. The Snakefile setup code (lines 15–26) needs rewriting.

Migration: BidsResolver is a drop-in adapter with the same (group, member) call signature. Import changes from sutil.bids.paths.BidsPaths to pipeio.bids.BidsResolver.
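Until BidsResolver lands, a thin adapter with the same (group, member) call signature can bridge the gap. A hypothetical sketch (the class name, backing dict, and template are illustrative, not pipeio API):

```python
class RegistryPaths:
    """Hypothetical adapter exposing the (group, member) call signature
    used throughout the Snakefile, backed by a plain registry dict.
    Swap the backend for pipeio.bids.BidsResolver once it is available."""

    def __init__(self, registry: dict):
        self._registry = registry

    def __call__(self, group: str, member: str) -> str:
        # Look up the BIDS path template for a registry member.
        return self._registry[group][member]

# Toy registry entry for illustration only:
out_paths = RegistryPaths({"feature": {"zarr": "sub-{subject}/feature.zarr"}})
```

Because the call signature is preserved, rule bodies that use out_paths("group", "member") need no changes when the backend is swapped.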

3.3 configfile double-parsing (MINOR)

Lines 8 and 12–13: configfile: "config.yml", then the same file is re-read with safe_load. This is because repo_abs needs the config dict before snakemake's config is fully available.

v2 fix: With run.py injecting paths, the double-parse becomes unnecessary. Use snakemake's native configfile: directive only.

3.4 Hardcoded conda environment (MINOR)

zapline_plus rule uses conda: "/storage/share/python/environments/Anaconda3/envs/matlab".

v2 fix: Move to a workflow/envs/matlab.yml conda env spec, or use config["conda_envs"]["matlab"] for portability.
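A minimal workflow/envs/matlab.yml sketch — the channel and package names below are placeholders, and the MATLAB engine itself is licensed software that typically cannot be solved from conda channels, so this env may only cover the Python side of the bridge:

```yaml
# workflow/envs/matlab.yml — placeholder spec, site-specific in practice
name: matlab
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
```

If the shared env truly cannot be replicated, the config["conda_envs"]["matlab"] indirection keeps the absolute path out of the rule body.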

3.5 report() paths with rst captions (MINOR)

Several rules use report(path, caption="report/foo.rst"). The caption paths are relative to the rule's location. After moving rules to workflow/rules/, these need updating.

v2 fix: Move report/ to workflow/report/ and update caption paths.

3.6 Cross-flow output directory sharing (INFO)

Both preprocess/ieeg and preprocess/ecephys write to derivatives/preprocess/. They share the DataLad subdataset but use separate registries. This is fine for v2 — BIDS derivatives directories can contain outputs from multiple pipelines. The dataset_description.json should list both as generators.


4. Step-by-step migration guide

Phase 1: Pre-migration checklist

  • [ ] Verify all current outputs are committed in the derivatives/preprocess subdataset
  • [ ] Run snakemake -n to confirm current Snakefile parses cleanly
  • [ ] Back up current Snakefile: cp Snakefile Snakefile.v1.bak
  • [ ] Verify sutil is installed in the cogpy environment
  • [ ] Check that no other flow's Snakefile imports from preprocess/ieeg/ directly
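The last checklist item can be scripted. A hedged sketch that scans sibling Snakefiles for references to this flow (the helper name and directory layout are assumptions based on the audit above):

```python
from pathlib import Path

def flows_referencing_ieeg(pipelines_dir: Path) -> list[Path]:
    """List Snakefiles under pipelines_dir (outside preprocess/ieeg)
    that mention the preprocess/ieeg flow directory."""
    offenders = []
    for snakefile in sorted(pipelines_dir.rglob("Snakefile")):
        if "preprocess/ieeg" in str(snakefile.parent):
            continue  # skip the flow being migrated
        if "preprocess/ieeg" in snakefile.read_text():
            offenders.append(snakefile)
    return offenders
```

An empty result confirms the cross-flow finding in section 1.4; any hit means that flow must be audited before the move.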

Phase 2: Create v2 directory skeleton

cd code/pipelines/preprocess/ieeg

# Create v2 directories
mkdir -p workflow/rules
mkdir -p workflow/scripts
mkdir -p config

# Move files
mv Snakefile workflow/Snakefile
mv config.yml config/snakebids.yml
mv scripts/* workflow/scripts/
rmdir scripts
mv report workflow/report

Phase 3: Create run.py

#!/usr/bin/env python
"""Snakebids entry point for preprocess/ieeg flow."""
from pathlib import Path
from snakebids.app import SnakeBidsApp

def main():
    # SnakeBidsApp takes the app directory and locates workflow/Snakefile
    # and config/snakebids.yml within it (the v2 layout).
    app = SnakeBidsApp(Path(__file__).resolve().parent)
    app.run_snakemake()

if __name__ == "__main__":
    main()

Phase 4: Update config/snakebids.yml

Add snakebids-required sections at the top:

# snakebids app metadata
app_name: preprocess-ieeg
analysis_levels: &analysis_levels
  - participant

parse_args:
  bids_dir:
    help: "Input BIDS directory"
    default: "raw"
  output_dir:
    help: "Output derivatives directory"
    default: "derivatives/preprocess"
  analysis_level:
    help: "Analysis level"
    choices: *analysis_levels
    default: "participant"

# ... rest of existing config unchanged ...

Phase 5: Update workflow/Snakefile

Key changes to the Snakefile header:

from snakemake.utils import min_version
min_version("6.0")
from snakebids import generate_inputs, bids, set_bids_spec
set_bids_spec("v0_0_0")

from pathlib import Path

configfile: str(Path(workflow.basedir).parent / "config" / "snakebids.yml")

# v2: repo root injected by run.py via config["root"]; the fallback walks up
# from workflow/ (ieeg → preprocess → pipelines → code → repo root)
ROOT = Path(config.get("root", Path(workflow.basedir).parents[4]))

# Replace all repo_abs() calls with ROOT / "path"
# e.g.: repo_abs("code/pipelines/preprocess/ieeg") → ROOT / "code/pipelines/preprocess/ieeg"

from sutil.bids.paths import BidsPaths  # or pipeio.bids.BidsResolver when ready

inputs = generate_inputs(ROOT / config["input_dir"], config["pybids_inputs"])
# ... rest of setup with ROOT instead of repo_abs ...

# Include mod rules
include: "rules/raw.smk"
include: "rules/signal.smk"
include: "rules/badlabel.smk"
include: "rules/noise.smk"
include: "rules/interpolate.smk"
include: "rules/linenoise.smk"
include: "rules/noise_tfspace.smk"
include: "rules/viz.smk"

# Keep target rules in main Snakefile
rule all:
    input:
        inputs['ieeg'].expand(pre_all),
        COMB_SUMMARY_HTML
# ... other target rules ...

Phase 6: Split rules into .smk files

For each .smk file, extract the relevant rules from the monolithic Snakefile. The rules can reference ROOT, inputs, in_paths, out_paths, and config as globals (Snakemake includes share the namespace).

Update script: directives — paths are relative to the rule file's directory:

  • In workflow/Snakefile: script: "scripts/foo.py" (relative to workflow/)
  • In workflow/rules/raw.smk: script: "../scripts/foo.py" (up one level)
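For instance, a rule moved into workflow/rules/raw.smk keeps its body unchanged apart from the script path. The rule I/O below is illustrative, not copied from the real Snakefile:

```
# workflow/rules/raw.smk — illustrative shape of an extracted rule
rule ttl_removal:
    input:
        ROOT / "derivatives/preprocess/raw_zarr/{subject}.zarr"   # ROOT is a Snakefile global
    output:
        ROOT / "derivatives/preprocess/ttl_removal/{subject}.zarr"
    script:
        "../scripts/ttl_removal.py"   # up one level from rules/ to workflow/scripts/
```

Because includes share the Snakefile's namespace, ROOT, inputs, and the path helpers need no re-import in the .smk files.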

Phase 7: Update script repo_abs calls

Mechanical replacement in all scripts:

# Before:
from sutil.repo_root import repo_abs
path = repo_abs("derivatives/preprocess/...")

# After (in snakemake script context):
from pathlib import Path
root = Path(snakemake.config.get("root", "."))
path = root / "derivatives/preprocess/..."

For scripts that use repo_abs only for log file paths or notebook references, the replacement is straightforward. Each script's snakemake.config["root"] provides the repo root.
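After the sweep, a quick scan can confirm that no repo_abs references survive. A hypothetical post-migration check, not part of pipeio:

```python
from pathlib import Path

def find_repo_abs_refs(root: Path) -> list[str]:
    """Return 'file:lineno' entries for lines still mentioning repo_abs
    under root — useful on workflow/scripts/ after the rewrite."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "repo_abs" in line:
                hits.append(f"{path}:{lineno}")
    return hits
```

An empty list over workflow/scripts/ (and later notebooks/) marks the phase as done.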

Phase 8: Create dataset_description.json

{
  "Name": "preprocess",
  "BIDSVersion": "1.9.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "preprocess-ieeg",
      "Description": "iEEG preprocessing pipeline: raw→zarr, lowpass, downsample, feature extraction, bad channel detection, interpolation, line noise removal",
      "CodeURL": "code/pipelines/preprocess/ieeg"
    },
    {
      "Name": "preprocess-ecephys",
      "Description": "Extracellular electrophysiology preprocessing pipeline",
      "CodeURL": "code/pipelines/preprocess/ecephys"
    }
  ],
  "SourceDatasets": [
    {
      "URL": "../../raw"
    }
  ]
}

Place at derivatives/preprocess/dataset_description.json.

Phase 9: Update Makefile

Update SNAKEMAKE invocation paths and any references to Snakefile or config.yml to point to the new locations.

Phase 10: Test plan

  1. Parse test: cd code/pipelines/preprocess/ieeg && snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n
  2. Dry run: snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n --forceall
  3. Single subject test: snakemake -s workflow/Snakefile --configfile config/snakebids.yml -n --config root=$(git rev-parse --show-toplevel) -- test
  4. Entry point test: python run.py raw derivatives/preprocess participant --dry-run
  5. Full run on test subject: verify output matches v1 byte-for-byte for deterministic rules
  6. Registry scan: pipeio_registry_scan() should detect the flow as app_type: snakebids

Phase 11: Rollback plan

cd code/pipelines/preprocess/ieeg

# Restore v1 layout
mv workflow/Snakefile ./Snakefile
mv config/snakebids.yml ./config.yml
mkdir -p scripts   # scripts/ was removed by rmdir during migration
mv workflow/scripts/* ./scripts/
mv workflow/report ./report
rm -rf workflow config run.py

All changes are local to code/pipelines/preprocess/ieeg/. The derivatives directory and DataLad subdataset are unaffected. No downstream flows need changes since they consume the ecephys registry, not the ieeg one.


5. Effort estimate by component

Component | Effort | Blocking?
Directory restructure | Small | No
run.py creation | Small | No
Config additions | Small | No
Snakefile split into .smk | Medium | No
repo_abs → ROOT / in Snakefile | Small (mechanical) | No
repo_abs → config["root"] in 20 scripts | Medium (mechanical) | No
BidsPaths → BidsResolver swap | Small (API-compatible) | Needs BidsResolver impl
script: path updates in .smk files | Small (mechanical) | No
report() caption path updates | Small | No
Notebook repo_abs updates | Large (15 notebooks) | Non-blocking (defer)
Hardcoded conda env | Small | No
dataset_description.json | Small | No
Testing | Medium |

Total: ~1-2 focused sessions. The mechanical repo_abs replacement dominates.


6. Pixecog-specific vs reusable

Pixecog-specific

  • BidsPaths → BidsResolver migration (pixecog's custom path resolution)
  • sutil.repo_root.repo_abs elimination (pixecog utility)
  • Specific rule split plan (domain knowledge)
  • Cross-flow analysis (project-specific topology)
  • Hardcoded conda path

Reusable for any flow migration

  • Directory restructure template (flat → snakebids app)
  • run.py boilerplate
  • config/snakebids.yml additions (parse_args, analysis_levels)
  • dataset_description.json template
  • script: path update rules (Snakefile → rules/ relative paths)
  • Test plan structure
  • Rollback plan pattern

Candidates for pipeio automation (pipeio_flow_migrate)

  1. Directory scaffolding — mkdir -p workflow/rules config + file moves
  2. run.py generation — template with flow name substitution
  3. dataset_description.json generation — from registry metadata + config
  4. script: path rewriting — parse rules, adjust relative paths after move
  5. configfile: path update — mechanical
  6. Dry-run validation — snakemake -n after migration to verify the parse
  7. Registry rescan — verify detection as snakebids app

A pipeio_flow_migrate(pipe, flow, dry_run=True) tool could handle items 1–6 automatically, with dry_run=True showing the plan before execution.
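The planning half of such a tool could be as small as a move map plus an existence check. A hypothetical sketch — pipeio_flow_migrate does not exist yet, and the layout assumptions come from the mapping table in section 2.1:

```python
from pathlib import Path

# Flat-layout → v2-layout move map (section 2.1).
MOVE_MAP = {
    "Snakefile": "workflow/Snakefile",
    "config.yml": "config/snakebids.yml",
    "scripts": "workflow/scripts",
    "report": "workflow/report",
}

def plan_migration(flow_dir: Path) -> list[tuple[str, str]]:
    """Return (src, dst) pairs for paths that exist in flow_dir.

    Dry-run only: nothing is moved, so the plan can be reviewed
    before any execution step runs."""
    return [(src, dst) for src, dst in MOVE_MAP.items()
            if (flow_dir / src).exists()]
```

The execution half would then perform the moves, render run.py and dataset_description.json from templates, and finish with the snakemake -n parse check.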