Preprocessing Pipeline

This page explains the design of cogpy’s preprocessing pipeline: how bad channels are identified, why spatial normalization matters, and how the Snakemake workflow ties it all together.

Pipeline overview

The preprocessing pipeline converts raw ECoG recordings into clean, analysis-ready signals. It runs as a Snakemake DAG with seven stages:

raw → lowpass → downsample → feature → badlabel → plot → interpolate

Stage              Purpose                                    Module
-----------------  -----------------------------------------  ----------------------------
raw_zarr           Convert binary LFP + XML metadata to Zarr  cogpy.io.ecog_io
lowpass            Anti-alias filter before downsampling      cogpy.preprocess.filtering
downsample         Decimate to target sampling rate           cogpy.preprocess.filtering
feature            Extract channel quality features           cogpy.preprocess.badchannel
badlabel           Label bad channels via DBSCAN              cogpy.preprocess.badchannel
plot_feature_maps  QC visualizations                          Matplotlib
interpolate        Spatially interpolate bad channels         cogpy.preprocess.interpolate

Each stage reads from Zarr and writes to Zarr, making the pipeline restartable and inspectable at any point.

Filtering

cogpy provides xarray-aware filters in cogpy.preprocess.filtering:

Function               Type             Use case
---------------------  ---------------  -----------------------------------------------
bandpassx()            Butterworth IIR  General frequency selection
lowpassx()             Butterworth IIR  Anti-aliasing before decimation
highpassx()            Butterworth IIR  DC removal
notchx() / notchesx()  Notch IIR        Line noise removal (50/60 Hz)
decimatex()            Polyphase        Downsampling with anti-alias
cmrx()                 Spatial          Common-mode rejection (subtract channel median)
gaussian_spatialx()    Spatial          Gaussian smoothing across grid
median_spatialx()      Spatial          Median filtering across grid
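As a rough illustration of the cmrx() row, common-mode rejection by median subtraction can be sketched in plain NumPy. The helper name common_median_reference and the (channel, time) array layout are assumptions made for this example; cmrx() itself operates on xarray objects and may differ in detail.

```python
import numpy as np

def common_median_reference(sig):
    """Subtract the across-channel median from every channel at each time sample.

    sig: array of shape (n_channels, n_time).
    Hypothetical helper for illustration, not cogpy's cmrx().
    """
    sig = np.asarray(sig, dtype=float)
    common = np.median(sig, axis=0, keepdims=True)  # shape (1, n_time)
    return sig - common

# Eight channels sharing one large common signal plus small independent noise.
rng = np.random.default_rng(0)
shared = np.sin(np.linspace(0, 20, 500))
local = 0.1 * rng.standard_normal((8, 500))
raw = local + shared
cleaned = common_median_reference(raw)
```

After the subtraction the across-channel median is zero at every sample, so what remains is each channel's deviation from the shared component.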

All filters use xr.apply_ufunc internally, preserving dimensions, coordinates, and attributes. They are also dask-compatible for lazy evaluation on large recordings.

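Why Butterworth IIR? Applied forward and backward with filtfilt, a Butterworth filter gives zero-phase filtering, which is non-causal and therefore suited to offline processing rather than online use. Butterworth designs have a maximally flat passband and a predictable roll-off. The pipeline uses order-4 filters by default.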
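A minimal sketch of zero-phase Butterworth low-pass filtering with SciPy is shown here. The function name lowpass_zero_phase, the 100 Hz cutoff, and the 1 kHz sampling rate are illustrative assumptions; only the order-4 default comes from the text above, and lowpassx() may be implemented differently.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_zero_phase(x, cutoff_hz, fs, order=4):
    """Forward-backward Butterworth low-pass along the last axis.

    Hypothetical helper, not cogpy's lowpassx(). Second-order sections
    are used for numerical stability.
    """
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

fs = 1000.0                           # assumed sampling rate in Hz
t = np.arange(0, 2.0, 1.0 / fs)
slow = np.sin(2 * np.pi * 5 * t)      # well inside the passband
fast = np.sin(2 * np.pi * 300 * t)    # well inside the stopband

slow_out = lowpass_zero_phase(slow, cutoff_hz=100.0, fs=fs)
fast_out = lowpass_zero_phase(fast, cutoff_hz=100.0, fs=fs)
```

Because the filter runs once in each direction, the phase shifts cancel and the 5 Hz component comes out aligned with the input, while the 300 Hz component is strongly attenuated.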

Bad-channel detection

The bad-channel pipeline is the most complex preprocessing component. It uses a four-stage approach designed to be robust to the spatial structure of electrode grids.

Stage 1: Feature extraction

compute_features_sliding() extracts seven channel-quality features in sliding time windows:

Feature            What it measures
-----------------  ---------------------------------------
anticorrelation    Spatial correlation with grid neighbors
relative_variance  Variance relative to neighbors
deviation          Amplitude deviation from neighbors
amplitude          Peak signal amplitude
time_derivative    Rate of signal change
hurst_exponent     Long-range temporal correlation
kurtosis           Distribution tail heaviness

Output shape: (n_features, AP, ML, n_windows).
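The windowing and output layout can be sketched as follows. This is not compute_features_sliding(): the function name, window length, and step are assumptions, and only four purely temporal quantities (variance, peak amplitude, mean absolute derivative, excess kurtosis) are computed to keep the example short.

```python
import numpy as np

def sliding_features(sig, win, step, eps=1e-12):
    """Compute per-channel features in sliding windows.

    sig: array of shape (AP, ML, time).
    Returns an array of shape (n_features, AP, ML, n_windows).
    Hypothetical helper for illustration only.
    """
    sig = np.asarray(sig, dtype=float)
    per_window = []
    for start in range(0, sig.shape[-1] - win + 1, step):
        seg = sig[..., start:start + win]
        centered = seg - seg.mean(axis=-1, keepdims=True)
        m2 = (centered ** 2).mean(axis=-1)          # variance
        m4 = (centered ** 4).mean(axis=-1)          # fourth central moment
        per_window.append(np.stack([
            m2,
            np.abs(seg).max(axis=-1),                     # peak amplitude
            np.abs(np.diff(seg, axis=-1)).mean(axis=-1),  # mean |dx/dt| per sample
            m4 / (m2 ** 2 + eps) - 3.0,                   # excess kurtosis
        ]))
    return np.stack(per_window, axis=-1)

rng = np.random.default_rng(0)
sig = rng.standard_normal((4, 5, 1000))           # toy 4 x 5 grid, 1000 samples
feats = sliding_features(sig, win=200, step=100)  # shape (4, 4, 5, 9)
```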

Stage 2: Spatial normalization

Raw features are not directly comparable across the grid — a channel near the edge naturally has different variance than one in the center. Spatial normalization corrects for this by comparing each channel to its grid neighbors.

Four normalization modes:

Mode        Formula                            Used by
----------  ---------------------------------  ---------------------------------------------
identity    Use raw value                      anticorrelation
ratio       x / (median_neighbor + ε)          relative_variance, amplitude, time_derivative
difference  x - median_neighbor                deviation
robust_z    (x - median) / (MAD × 1.4826 + ε)  kurtosis

Neighbor relationships are derived from the 2D grid layout using binary dilation footprints (default: 2-iteration, connectivity-1).
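The four modes can be sketched in NumPy as below. Several details are assumptions for illustration: the neighborhood is taken to be all grid positions within Manhattan distance 2 (the shape produced by dilating a connectivity-1 footprint twice), the channel itself is excluded from its own neighborhood, and the median and MAD in robust_z are computed over those neighbors. The function names are hypothetical.

```python
import numpy as np

def neighbor_values(grid, i, j, radius=2):
    """Values of grid positions within Manhattan distance `radius` of (i, j),
    excluding (i, j) itself and clipping at the grid edges."""
    n_ap, n_ml = grid.shape
    vals = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            if (di, dj) == (0, 0) or abs(di) + abs(dj) > radius:
                continue
            a, m = i + di, j + dj
            if 0 <= a < n_ap and 0 <= m < n_ml:
                vals.append(grid[a, m])
    return np.asarray(vals, dtype=float)

def normalize(grid, mode, eps=1e-12):
    """Apply one normalization mode to a 2D (AP, ML) feature map."""
    grid = np.asarray(grid, dtype=float)
    out = np.empty_like(grid)
    for (i, j), x in np.ndenumerate(grid):
        if mode == "identity":
            out[i, j] = x
            continue
        nb = neighbor_values(grid, i, j)
        med = np.median(nb)
        if mode == "ratio":
            out[i, j] = x / (med + eps)
        elif mode == "difference":
            out[i, j] = x - med
        elif mode == "robust_z":
            mad = np.median(np.abs(nb - med))
            out[i, j] = (x - med) / (mad * 1.4826 + eps)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out

# One anomalous channel in an otherwise uniform 5 x 5 feature map.
grid = np.ones((5, 5))
grid[2, 2] = 10.0
ratio = normalize(grid, "ratio")
diff = normalize(grid, "difference")
```

Because the neighbor statistic is a median, the single anomalous channel stands out in its own normalized value without distorting the values of the channels around it.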

Why spatial normalization? Without it, edge channels and channels near sulci would be systematically flagged as bad. The normalization isolates locally anomalous channels — those that differ from their immediate spatial context.

Stage 3: DBSCAN outlier labeling

After normalization, features are aggregated across time windows using quantiles (75th–95th percentile, 5 levels), then stacked into a (n_channels, n_feature_quantile_combos) matrix.
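The aggregation step amounts to a quantile over the window axis followed by a reshape. The sketch below assumes the five levels are evenly spaced between 0.75 and 0.95 and that channels are flattened in row-major (AP, ML) order; the function name is hypothetical.

```python
import numpy as np

def aggregate_quantiles(feats, levels):
    """Collapse the window axis into quantiles and stack into a 2D matrix.

    feats: array of shape (n_features, AP, ML, n_windows).
    Returns an array of shape (AP * ML, n_features * n_levels).
    """
    q = np.quantile(feats, levels, axis=-1)  # (n_levels, n_features, AP, ML)
    q = np.transpose(q, (2, 3, 1, 0))        # (AP, ML, n_features, n_levels)
    n_ap, n_ml = q.shape[:2]
    return q.reshape(n_ap * n_ml, -1)

levels = np.linspace(0.75, 0.95, 5)          # assumed spacing of the 5 levels
rng = np.random.default_rng(0)
feats = rng.standard_normal((7, 4, 5, 9))    # 7 features, 4 x 5 grid, 9 windows
X = aggregate_quantiles(feats, levels)       # shape (20, 35)
```

Each row of X describes one channel, and each column is one (feature, quantile level) combination.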

DBSCAN identifies outliers:

  1. StandardScaler normalizes the feature matrix

  2. k-distance curve estimates optimal eps via knee detection (KneeLocator on k=10 nearest-neighbor distances)

  3. DBSCAN clusters channels; noise points (label=-1) are bad
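The three steps can be sketched without external dependencies. Two substitutions are made here and should be read as stand-ins, not as the pipeline's code: the knee is located with a simple "farthest below the chord" heuristic in place of KneeLocator, and the DBSCAN noise set is computed directly from its definition (a point is noise if it is not a core point and lies within eps of no core point) in place of a clustering library. Setting min_samples equal to k is also an assumption.

```python
import numpy as np

def label_bad_channels(X, k=10, min_samples=10):
    """Return (bad_mask, eps) for a (n_channels, n_columns) feature matrix.

    Hypothetical helper illustrating the three steps above.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: z-score each column; constant columns are left unscaled.
    std = X.std(axis=0)
    Z = (X - X.mean(axis=0)) / np.where(std > 0, std, 1.0)

    # Pairwise Euclidean distances, reused by the later steps.
    dist = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))

    # Step 2: sorted k-distance curve. Index 0 of each sorted row is the
    # self-distance, so index k is the distance to the k-th nearest neighbor.
    kdist = np.sort(np.sort(dist, axis=1)[:, k])
    x = np.linspace(0.0, 1.0, kdist.size)
    y = (kdist - kdist[0]) / (kdist[-1] - kdist[0] + 1e-12)
    eps = kdist[np.argmax(x - y)]  # knee: point farthest below the chord

    # Step 3: DBSCAN noise points.
    within = dist <= eps
    core = within.sum(axis=1) >= min_samples  # count includes the point itself
    reachable = (within & core[None, :]).any(axis=1)
    return ~reachable, eps

# Toy matrix: 60 typical rows plus 3 rows far from everything else.
rng = np.random.default_rng(0)
typical = rng.standard_normal((60, 4))
outliers = np.array([[400.0, 400.0, 400.0, 400.0],
                     [-400.0, 400.0, -400.0, 400.0],
                     [400.0, -400.0, -400.0, -400.0]])
bad, eps = label_bad_channels(np.vstack([typical, outliers]))
```

In this toy case the k-distance curve is nearly flat for the 60 typical rows and jumps sharply for the last three, so the knee lands at the edge of the dense group and only the three distant rows are marked as noise.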

Why DBSCAN over threshold-based methods? DBSCAN detects outliers in the joint feature space without assuming any single feature is sufficient. A channel might have acceptable variance but anomalous kurtosis — DBSCAN catches multivariate outliers that per-feature thresholds miss.

Why automatic eps? Manual eps tuning is fragile across recordings with different noise floors. The k-distance knee provides a data-driven estimate that adapts to each recording’s feature distribution.

Stage 4: Interpolation

Bad channels are replaced by spatial interpolation from their grid neighbors. This preserves the spatial sampling for downstream analyses (CSD computation, spatial measures) that require a complete grid.
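The text does not specify the interpolation scheme, so the following is only one simple possibility: each bad channel is replaced by the mean of the good channels in its surrounding 3 × 3 neighborhood. The function name and the (AP, ML, time) layout are assumptions for this sketch.

```python
import numpy as np

def interpolate_bad(sig, bad_mask):
    """Replace bad channels with the mean of their good 3 x 3 neighbors.

    sig: array of shape (AP, ML, time); bad_mask: boolean array (AP, ML).
    Hypothetical helper; a bad channel with no good neighbor is left as-is.
    """
    sig = np.asarray(sig, dtype=float)
    bad_mask = np.asarray(bad_mask, dtype=bool)
    out = sig.copy()
    n_ap, n_ml = bad_mask.shape
    for i, j in zip(*np.nonzero(bad_mask)):
        a0, a1 = max(i - 1, 0), min(i + 2, n_ap)
        m0, m1 = max(j - 1, 0), min(j + 2, n_ml)
        good = ~bad_mask[a0:a1, m0:m1]  # the bad channel itself is excluded
        if good.any():
            # Average the original signals of the good neighbors over channels.
            out[i, j] = sig[a0:a1, m0:m1][good].mean(axis=0)
    return out

sig = np.ones((4, 5, 100))
sig[1, 2] = 100.0    # a saturated channel in the interior
sig[0, 0] = -50.0    # a bad channel in the corner
bad_mask = np.zeros((4, 5), dtype=bool)
bad_mask[1, 2] = True
bad_mask[0, 0] = True
filled = interpolate_bad(sig, bad_mask)
```

Good channels pass through unchanged, and the grid stays complete, which is what the downstream spatial analyses need.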

Snakemake orchestration

The pipeline is packaged as a Snakemake workflow in cogpy.workflows.preprocess. Key design choices:

  • Rules are thin orchestrators. Each rule loads data via cogpy.io, calls cogpy compute functions, and saves via cogpy.io. No compute logic lives in the Snakefile.

  • Zarr as interchange format. Every stage reads and writes Zarr, providing chunked storage, metadata preservation, and restartability.

  • Config-driven parameters. All hyperparameters (filter cutoffs, window sizes, DBSCAN settings) live in YAML config files, not in code.

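  • Dask chunking. Scripts chunk along the time axis for memory-efficient processing of long recordings: sigx.chunk({'time': 16*4096, 'AP': -1, 'ML': -1}). Each chunk then holds 65,536 time samples across the full grid, since -1 keeps the AP and ML dimensions in a single chunk.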

CLI entry point

The cogpy-preproc command wraps Snakemake with sensible defaults:

cogpy-preproc all data/sub-01/rec1.lfp -c 8
cogpy-preproc feature data/sub-01/rec1.lfp --configfile custom.yml

It loads the packaged default config, merges any user overrides, resolves the packaged Snakefile path, and spawns a Snakemake subprocess targeting the requested rule.
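The config-merging step might look roughly like the recursive merge below. Whether cogpy-preproc merges nested sections recursively or replaces them wholesale is not stated here, so treat the recursive behavior as an assumption. The function name and the config keys are hypothetical.

```python
from copy import deepcopy

def merge_config(defaults, overrides):
    """Return a new dict with `overrides` applied on top of `defaults`.

    Nested dicts are merged key by key; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

# Hypothetical keys and values, for illustration only.
defaults = {"lowpass": {"cutoff_hz": 300, "order": 4}, "badlabel": {"knee_k": 10}}
user = {"lowpass": {"cutoff_hz": 250}}
config = merge_config(defaults, user)
```

With a recursive merge, a user file only needs to list the parameters it changes; everything else keeps its packaged default.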

Legacy modules

The canonical bad-channel pipeline lives in cogpy.preprocess.badchannel. Three older modules are retained for backward compatibility but are deprecated:

Legacy module                 Replacement
----------------------------  ---------------------------
channel_feature.py            badchannel.channel_features
channel_feature_functions.py  badchannel.channel_features
detect_bads.py                badchannel.badlabel

New code should always use the badchannel subpackage.