Preprocessing Pipeline
This page explains the design of cogpy’s preprocessing pipeline: how bad channels are identified, why spatial normalization matters, and how the Snakemake workflow ties it all together.
Pipeline overview
The preprocessing pipeline converts raw ECoG recordings into clean, analysis-ready signals. It runs as a Snakemake DAG with seven stages:
raw → lowpass → downsample → feature → badlabel → plot → interpolate
| Stage | Purpose | Module |
|---|---|---|
| raw | Convert binary LFP + XML metadata to Zarr | |
| lowpass | Anti-alias filter before downsampling | |
| downsample | Decimate to target sampling rate | |
| feature | Extract channel quality features | |
| badlabel | Label bad channels via DBSCAN | |
| plot | QC visualizations | Matplotlib |
| interpolate | Spatially interpolate bad channels | |
Each stage reads from Zarr and writes to Zarr, making the pipeline restartable and inspectable at any point.
Filtering
cogpy provides xarray-aware filters in cogpy.preprocess.filtering:
| Function | Type | Use case |
|---|---|---|
| | Butterworth IIR | General frequency selection |
| | Butterworth IIR | Anti-aliasing before decimation |
| | Butterworth IIR | DC removal |
| | Notch IIR | Line noise removal (50/60 Hz) |
| | Polyphase | Downsampling with anti-alias |
| | Spatial | Common-mode rejection (subtract channel median) |
| | Spatial | Gaussian smoothing across grid |
| | Spatial | Median filtering across grid |
All filters use xr.apply_ufunc internally, preserving dimensions,
coordinates, and attributes. They are also dask-compatible for lazy
evaluation on large recordings.
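Why Butterworth IIR? Butterworth filters have a maximally flat passband and a
predictable roll-off. Applied with filtfilt, the filter runs forward and
backward over the signal: the result is zero-phase but non-causal, so it suits
offline processing rather than online use. The pipeline uses order-4 filters by default.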
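A minimal sketch of zero-phase Butterworth low-pass filtering with SciPy is shown below. The helper name `lowpass_zero_phase`, the 50 Hz cutoff, and the synthetic test signal are illustrative assumptions, not cogpy's API or defaults. Only the order-4 default and the use of filtfilt come from the text above.

```python
import numpy as np
from scipy import signal


def lowpass_zero_phase(x, fs, cutoff_hz, order=4, axis=-1):
    """Zero-phase Butterworth low-pass (hypothetical helper, not cogpy's API)."""
    # Design the filter; passing fs lets the cutoff be given in Hz.
    b, a = signal.butter(order, cutoff_hz, btype="low", fs=fs)
    # filtfilt filters forward, then backward, which cancels the phase delay.
    return signal.filtfilt(b, a, x, axis=axis)


fs = 1000.0                                # assumed sampling rate in Hz
t = np.arange(0, 2.0, 1.0 / fs)
slow = np.sin(2 * np.pi * 5 * t)           # 5 Hz component to keep
fast = np.sin(2 * np.pi * 200 * t)         # 200 Hz component to remove
filtered = lowpass_zero_phase(slow + fast, fs=fs, cutoff_hz=50.0)
```

Because the filter is applied twice, the magnitude response is squared: attenuation in dB is double that of a single order-4 pass.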
Bad-channel detection
The bad-channel pipeline is the most complex preprocessing component. It uses a four-stage approach designed to be robust to the spatial structure of electrode grids.
Stage 1: Feature extraction
compute_features_sliding() extracts seven channel-quality features in
sliding time windows:
| Feature | What it measures |
|---|---|
| | Spatial correlation with grid neighbors |
| | Variance relative to neighbors |
| | Amplitude deviation from neighbors |
| | Peak signal amplitude |
| | Rate of signal change |
| | Long-range temporal correlation |
| | Distribution tail heaviness |
Output shape: (n_features, AP, ML, n_windows).
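To make the output layout concrete, here is a toy sliding-window extractor in NumPy. It is a sketch under stated assumptions: the function `sliding_features`, the (AP, ML, time) input layout, the window and step sizes, and the four simplified features are all hypothetical and do not reproduce compute_features_sliding() or its seven features.

```python
import numpy as np


def sliding_features(sig, win, step):
    """Toy per-channel features in sliding windows (hypothetical helper).

    sig is assumed to be (AP, ML, time); the result is
    (n_features, AP, ML, n_windows).
    """
    n_time = sig.shape[-1]
    per_window = []
    for start in range(0, n_time - win + 1, step):
        w = sig[..., start:start + win]
        centered = w - w.mean(axis=-1, keepdims=True)
        var = centered.var(axis=-1)                          # signal variance
        peak = np.abs(w).max(axis=-1)                        # peak amplitude
        slope = np.abs(np.diff(w, axis=-1)).mean(axis=-1)    # rate of change
        # Kurtosis as the fourth standardized moment (tail heaviness).
        kurt = (centered ** 4).mean(axis=-1) / np.maximum(var, 1e-12) ** 2
        per_window.append(np.stack([var, peak, slope, kurt]))
    # Stack windows on a new trailing axis.
    return np.stack(per_window, axis=-1)


rng = np.random.default_rng(0)
sig = rng.standard_normal((4, 5, 1000))    # synthetic 4 x 5 grid
features = sliding_features(sig, win=200, step=100)
```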
Stage 2: Spatial normalization
Raw features are not directly comparable across the grid — a channel near the edge naturally has different variance than one in the center. Spatial normalization corrects for this by comparing each channel to its grid neighbors.
Four normalization modes:
| Mode | Formula | Used by |
|---|---|---|
| | Use raw value | |
| | | |
| | | |
| | | |
Neighbor relationships are derived from the 2D grid layout using binary dilation footprints (default: 2-iteration, connectivity-1).
Why spatial normalization? Without it, edge channels and channels near sulci would be systematically flagged as bad. The normalization isolates locally anomalous channels — those that differ from their immediate spatial context.
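The idea of comparing a channel to its spatial context can be sketched with SciPy's ndimage tools. The footprint construction follows the default stated above (connectivity-1, two iterations). The normalization itself, a difference from the neighbor median, is an assumption chosen for illustration and is not necessarily one of cogpy's four modes. Both function names are hypothetical.

```python
import numpy as np
from scipy import ndimage


def neighbor_footprint(iterations=2):
    """Connectivity-1 footprint dilated `iterations` times, center excluded."""
    base = ndimage.generate_binary_structure(2, 1)          # 3 x 3 cross
    fp = ndimage.iterate_structure(base, iterations).copy()
    fp[fp.shape[0] // 2, fp.shape[1] // 2] = False          # drop the channel itself
    return fp


def neighbor_median(feature_map, footprint):
    """Median of each channel's neighbors on the (AP, ML) grid."""
    # Padding with NaN means edge channels are compared only with real neighbors.
    return ndimage.generic_filter(
        feature_map.astype(float), np.nanmedian,
        footprint=footprint, mode="constant", cval=np.nan,
    )


grid = np.ones((6, 8))       # synthetic feature map
grid[2, 3] = 10.0            # one locally anomalous channel
fp = neighbor_footprint()
deviation = grid - neighbor_median(grid, fp)
```

In this toy map only the anomalous channel has a nonzero deviation. Edge and corner channels are not penalized, because their median is taken over the neighbors that actually exist.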
Stage 3: DBSCAN outlier labeling
After normalization, features are aggregated across time windows using
quantiles (75th–95th percentile, 5 levels), then stacked into a
(n_channels, n_feature_quantile_combos) matrix.
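The aggregation and stacking step can be sketched in NumPy as follows. Evenly spaced quantile levels between 0.75 and 0.95 are an assumption (the text only gives the range and the number of levels), and the random feature array stands in for real output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature array: (n_features, AP, ML, n_windows).
features = rng.standard_normal((7, 4, 5, 30))

# Assumed spacing: 0.75, 0.80, 0.85, 0.90, 0.95.
levels = np.linspace(0.75, 0.95, 5)

# Quantiles over the window axis; the result is (n_levels, n_features, AP, ML).
q = np.quantile(features, levels, axis=-1)
n_levels, n_features, n_ap, n_ml = q.shape

# Put grid axes first, then flatten: one row per channel,
# one column per (feature, quantile) combination.
X = q.transpose(2, 3, 1, 0).reshape(n_ap * n_ml, n_features * n_levels)
```

With this layout, row `ap * n_ml + ml` holds a channel and column `feature * n_levels + level` holds one feature/quantile pair.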
DBSCAN identifies outliers:
1. StandardScaler normalizes the feature matrix.
2. A k-distance curve estimates the optimal eps via knee detection (KneeLocator on k=10 nearest-neighbor distances).
3. DBSCAN clusters channels; noise points (label=-1) are bad.
Why DBSCAN over threshold-based methods? DBSCAN detects outliers in the joint feature space without assuming any single feature is sufficient. A channel might have acceptable variance but anomalous kurtosis — DBSCAN catches multivariate outliers that per-feature thresholds miss.
Why automatic eps? Manual eps tuning is fragile across recordings with different noise floors. The k-distance knee provides a data-driven estimate that adapts to each recording’s feature distribution.
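The labeling steps can be sketched with scikit-learn as below. Several pieces are assumptions: the knee is found with a simple "farthest from the chord" heuristic as a stand-in for KneeLocator, `min_samples` is set equal to k, and the function names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def knee_eps(X, k=10):
    """Estimate eps from the sorted k-distance curve (stand-in for KneeLocator)."""
    # k + 1 neighbors because each point is its own nearest neighbor.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kdist = np.sort(dist[:, -1])
    # Normalize both axes, then pick the point farthest below the chord
    # joining the first and last points of the curve.
    x = np.linspace(0.0, 1.0, kdist.size)
    y = (kdist - kdist[0]) / max(kdist[-1] - kdist[0], 1e-12)
    return kdist[np.argmax(x - y)]


def label_bad_channels(X, k=10):
    """Return a boolean mask that is True for DBSCAN noise points."""
    Xs = StandardScaler().fit_transform(X)
    eps = knee_eps(Xs, k=k)
    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(Xs)  # min_samples=k is assumed
    return labels == -1


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))   # 64 synthetic channels, 4 feature columns
X[[3, 40]] += 12.0                 # two channels that are outliers in every feature
bad = label_bad_channels(X)
```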
Stage 4: Interpolation
Bad channels are replaced by spatial interpolation from their grid neighbors. This preserves the spatial sampling for downstream analyses (CSD computation, spatial measures) that require a complete grid.
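The text does not specify the interpolation scheme, so the sketch below assumes the simplest one: each bad channel is replaced by the mean of its good 4-connected neighbors. The function name and the (AP, ML, time) layout are also assumptions.

```python
import numpy as np
from scipy import ndimage


def interpolate_bad(sig, bad):
    """Fill bad channels with the mean of good 4-connected neighbors.

    sig: (AP, ML, time) array; bad: (AP, ML) boolean mask.
    """
    good = ~bad
    kernel = ndimage.generate_binary_structure(2, 1).astype(float)
    kernel[1, 1] = 0.0                               # exclude the channel itself
    # Count good neighbors of every channel (zero padding at the grid edge).
    counts = ndimage.convolve(good.astype(float), kernel, mode="constant")
    # Sum good-neighbor signals at every time sample.
    masked = np.where(good[..., None], sig, 0.0)
    sums = ndimage.convolve(masked, kernel[..., None], mode="constant")
    out = sig.astype(float).copy()
    fill = bad & (counts > 0)                        # skip channels with no good neighbor
    out[fill] = sums[fill] / counts[fill][:, None]
    return out


# Synthetic grid whose value is AP index + ML index, constant over time.
sig = np.add.outer(np.arange(4.0), np.arange(5.0))[..., None] * np.ones(50)
bad = np.zeros((4, 5), dtype=bool)
bad[2, 3] = True
sig[2, 3] = 999.0                                    # corrupt the bad channel
clean = interpolate_bad(sig, bad)
```

Good channels pass through unchanged, and the grid stays complete, which is what downstream spatial measures need.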
Snakemake orchestration
The pipeline is packaged as a Snakemake workflow in
cogpy.workflows.preprocess. Key design choices:
- Rules are thin orchestrators. Each rule loads data via cogpy.io, calls cogpy compute functions, and saves via cogpy.io. No compute logic lives in the Snakefile.
- Zarr as interchange format. Every stage reads and writes Zarr, providing chunked storage, metadata preservation, and restartability.
- Config-driven parameters. All hyperparameters (filter cutoffs, window sizes, DBSCAN settings) live in YAML config files, not in code.
- Dask chunking. Scripts chunk along the time axis for memory-efficient processing of long recordings:
sigx.chunk({'time': 16*4096, 'AP': -1, 'ML': -1})
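In the chunk call above, -1 means a single chunk spanning the whole dimension, so each chunk holds the full grid for 16*4096 time samples. The arithmetic below estimates the size of one such chunk. The 16 x 16 grid and the 4-byte sample size are assumptions chosen for illustration.

```python
chunk_time = 16 * 4096      # time samples per chunk, from the call above
n_ap, n_ml = 16, 16         # hypothetical grid size
bytes_per_sample = 4        # hypothetical: 32-bit floats

# One chunk covers the full AP x ML grid for chunk_time samples.
chunk_bytes = chunk_time * n_ap * n_ml * bytes_per_sample
chunk_mib = chunk_bytes / 2**20
```

Under these assumptions a chunk occupies 64 MiB, which bounds per-task memory regardless of recording length.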
CLI entry point
The cogpy-preproc command wraps Snakemake with sensible defaults:
cogpy-preproc all data/sub-01/rec1.lfp -c 8
cogpy-preproc feature data/sub-01/rec1.lfp --configfile custom.yml
It loads the packaged default config, merges any user overrides, resolves the packaged Snakefile path, and spawns a Snakemake subprocess targeting the requested rule.
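The wrapper logic described above can be sketched roughly as follows. This is not cogpy's implementation: `deep_merge`, `build_command`, the `lfp` config key, and the example paths are hypothetical. Only the Snakemake options `--snakefile`, `--cores`, `--configfile`, and `--config` are real flags.

```python
import copy


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base` (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_command(rule, lfp_path, snakefile, cores=1, configfile=None):
    """Assemble a Snakemake invocation targeting `rule` (hypothetical helper)."""
    cmd = ["snakemake", rule, "--snakefile", str(snakefile), "--cores", str(cores)]
    if configfile is not None:
        cmd += ["--configfile", str(configfile)]
    # The "lfp" key is an assumed way of passing the input path to the workflow.
    cmd += ["--config", f"lfp={lfp_path}"]
    return cmd


defaults = {"lowpass": {"cutoff": 300, "order": 4}, "target_fs": 1250}  # made-up values
config = deep_merge(defaults, {"lowpass": {"cutoff": 250}})
cmd = build_command("feature", "data/sub-01/rec1.lfp", "Snakefile", cores=8)
# A real wrapper would then run: subprocess.run(cmd, check=True)
```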
Legacy modules
The canonical bad-channel pipeline lives in cogpy.preprocess.badchannel.
Three older modules are retained for backward compatibility but are deprecated:
| Legacy module | Replacement |
|---|---|
| | |
| | |
| | |
New code should always use the badchannel subpackage.