Preprocessing Pipeline

This page explains the design of cogpy’s preprocessing pipeline: how bad channels are identified, why spatial normalization matters, and how the Snakemake workflow ties it all together.

Pipeline overview

The preprocessing pipeline converts raw ECoG recordings into clean, analysis-ready signals. It runs as a Snakemake DAG with seven stages:

raw → lowpass → downsample → feature → badlabel → plot → interpolate

Stage	Purpose	Module
`raw_zarr`	Convert binary LFP + XML metadata to Zarr	`cogpy.io.ecog_io`
`lowpass`	Anti-alias filter before downsampling	`cogpy.preprocess.filtering`
`downsample`	Decimate to target sampling rate	`cogpy.preprocess.filtering`
`feature`	Extract channel quality features	`cogpy.preprocess.badchannel`
`badlabel`	Label bad channels via DBSCAN	`cogpy.preprocess.badchannel`
`plot_feature_maps`	QC visualizations	Matplotlib
`interpolate`	Spatially interpolate bad channels	`cogpy.preprocess.interpolate`

After the initial conversion to Zarr, each processing stage reads from Zarr and writes to Zarr, so the pipeline can be restarted from, and inspected at, any intermediate stage.

Filtering

cogpy provides xarray-aware filters in cogpy.preprocess.filtering:

Function	Type	Use case
`bandpassx()`	Butterworth IIR	General frequency selection
`lowpassx()`	Butterworth IIR	Anti-aliasing before decimation
`highpassx()`	Butterworth IIR	DC removal
`notchx()` / `notchesx()`	Notch IIR	Line noise removal (50/60 Hz)
`decimatex()`	Polyphase	Downsampling with anti-alias
`cmrx()`	Spatial	Common-mode rejection (subtract the median across channels)
`gaussian_spatialx()`	Spatial	Gaussian smoothing across grid
`median_spatialx()`	Spatial	Median filtering across grid

All filters use xr.apply_ufunc internally, preserving dimensions, coordinates, and attributes. They are also dask-compatible for lazy evaluation on large recordings.

Why Butterworth IIR? Applied forward and backward with filtfilt, a Butterworth filter gives zero-phase output (which makes it non-causal and suited to offline processing) with a maximally flat passband and a predictable roll-off. The pipeline uses order-4 filters by default.
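As a rough illustration of zero-phase Butterworth filtering, the sketch below calls SciPy directly rather than cogpy's own `lowpassx()`, whose signature is not shown on this page. The helper name `lowpass_zero_phase`, the 40 Hz cutoff, and the synthetic test signal are assumptions made for the example.

```python
import numpy as np
from scipy import signal


def lowpass_zero_phase(x, fs, cutoff_hz, order=4, axis=-1):
    """Order-`order` Butterworth low-pass applied forward and backward."""
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = signal.butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # sosfiltfilt runs the filter in both directions, cancelling phase delay.
    return signal.sosfiltfilt(sos, x, axis=axis)


# Synthetic signal: a 5 Hz component plus 200 Hz contamination.
fs = 1000.0
t = np.arange(0.0, 2.0, 1.0 / fs)
slow = np.sin(2 * np.pi * 5 * t)
x = slow + 0.5 * np.sin(2 * np.pi * 200 * t)

y = lowpass_zero_phase(x, fs, cutoff_hz=40.0)
```

Because the filter is applied twice, the magnitude response is squared, so the effective roll-off is steeper than the design order alone suggests.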

Bad-channel detection

The bad-channel pipeline is the most complex preprocessing component. It uses a four-stage approach designed to be robust to the spatial structure of electrode grids.

Stage 1: Feature extraction

compute_features_sliding() extracts seven channel-quality features in sliding time windows:

Feature	What it measures
`anticorrelation`	Spatial correlation with grid neighbors
`relative_variance`	Variance relative to neighbors
`deviation`	Amplitude deviation from neighbors
`amplitude`	Peak signal amplitude
`time_derivative`	Rate of signal change
`hurst_exponent`	Long-range temporal correlation
`kurtosis`	Distribution tail heaviness

Output shape: (n_features, AP, ML, n_windows).
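The windowing and output layout can be sketched with plain NumPy. This is not `compute_features_sliding()` itself: the function name `sliding_features`, the window parameters, and the three simplified feature definitions below are assumptions chosen only to show how a `(n_features, AP, ML, n_windows)` array arises.

```python
import numpy as np


def sliding_features(sig, win, step):
    """sig: (AP, ML, time) array. Returns (n_features, AP, ML, n_windows)."""
    n_ap, n_ml, n_time = sig.shape
    starts = np.arange(0, n_time - win + 1, step)
    feats = np.empty((3, n_ap, n_ml, starts.size))
    for w, s in enumerate(starts):
        seg = sig[..., s:s + win]
        centered = seg - seg.mean(axis=-1, keepdims=True)
        var = centered.var(axis=-1)
        # Simplified stand-ins for three of the seven features:
        feats[0, ..., w] = np.abs(seg).max(axis=-1)                     # amplitude
        feats[1, ..., w] = np.abs(np.diff(seg, axis=-1)).mean(axis=-1)  # time_derivative
        feats[2, ..., w] = (centered ** 4).mean(axis=-1) / (var ** 2 + 1e-12)  # kurtosis
    return feats
```

With a 1000-sample recording, `win=200`, and `step=100`, the windows start at 0, 100, ..., 800, giving nine windows per channel.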

Stage 2: Spatial normalization

Raw features are not directly comparable across the grid — a channel near the edge naturally has different variance than one in the center. Spatial normalization corrects for this by comparing each channel to its grid neighbors.

Four normalization modes:

Mode	Formula	Used by
`identity`	Use raw value	`anticorrelation`
`ratio`	`x / (median_neighbor + ε)`	`relative_variance`, `amplitude`, `time_derivative`
`difference`	`x - median_neighbor`	`deviation`
`robust_z`	`(x - median) / (MAD × 1.4826 + ε)`	`kurtosis`

Neighbor relationships are derived from the 2D grid layout using binary dilation footprints (default: 2-iteration, connectivity-1).

Why spatial normalization? Without it, edge channels and channels near sulci would be systematically flagged as bad. The normalization isolates locally anomalous channels — those that differ from their immediate spatial context.
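The normalization modes above can be sketched with SciPy's `ndimage` tools. This is a minimal illustration, not cogpy's implementation: the function names are hypothetical, and it assumes that the channel itself is excluded from its neighborhood, that positions outside the grid are ignored, and that the median and MAD in `robust_z` are taken over the neighbors.

```python
import numpy as np
from scipy import ndimage


def neighbor_footprint(iterations=2):
    """Connectivity-1 structure dilated `iterations` times, centre removed."""
    base = ndimage.generate_binary_structure(2, 1)           # 3x3 cross
    fp = ndimage.iterate_structure(base, iterations).copy()  # 5x5 diamond for 2
    centre = fp.shape[0] // 2
    fp[centre, centre] = False  # assumption: a channel is not its own neighbor
    return fp


def _nanmad(values):
    med = np.nanmedian(values)
    return np.nanmedian(np.abs(values - med))


def spatial_normalize(fmap, mode, eps=1e-12):
    """fmap: (AP, ML) float map of one feature in one window."""
    if mode == "identity":
        return fmap.copy()
    fp = neighbor_footprint()
    # NaN padding lets edge channels use only the neighbors that exist.
    med = ndimage.generic_filter(fmap, np.nanmedian, footprint=fp,
                                 mode="constant", cval=np.nan)
    if mode == "ratio":
        return fmap / (med + eps)
    if mode == "difference":
        return fmap - med
    if mode == "robust_z":
        mad = ndimage.generic_filter(fmap, _nanmad, footprint=fp,
                                     mode="constant", cval=np.nan)
        return (fmap - med) / (mad * 1.4826 + eps)
    raise ValueError(f"unknown mode: {mode}")
```

On a uniform map with a single elevated channel, the `ratio` and `difference` outputs are flat everywhere except at that channel, which is the behavior the normalization is meant to produce.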

Stage 3: DBSCAN outlier labeling

After normalization, features are aggregated across time windows using quantiles (75th–95th percentile, 5 levels), then stacked into a (n_channels, n_feature_quantile_combos) matrix.
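A possible shape of this aggregation step, written with NumPy only. The function name `aggregate_quantiles` is hypothetical, and the five quantile levels are assumed to be evenly spaced between 0.75 and 0.95, which the text does not state explicitly.

```python
import numpy as np


def aggregate_quantiles(feats, q=(0.75, 0.80, 0.85, 0.90, 0.95)):
    """feats: (n_features, AP, ML, n_windows) -> (n_channels, n_features * len(q))."""
    n_feat, n_ap, n_ml, _ = feats.shape
    qv = np.quantile(feats, q, axis=-1)   # (n_q, n_features, AP, ML)
    qv = np.moveaxis(qv, 0, 1)            # (n_features, n_q, AP, ML)
    # Rows of the result are channels in row-major (AP, ML) order;
    # columns are feature-quantile combinations.
    return qv.reshape(n_feat * len(q), n_ap * n_ml).T
```

With seven features and five quantile levels, each channel is described by 35 values.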

DBSCAN identifies outliers in three steps:

1. StandardScaler normalizes the feature matrix.
2. A k-distance curve estimates the optimal eps via knee detection (KneeLocator on the k=10 nearest-neighbor distances).
3. DBSCAN clusters the channels; noise points (label=-1) are marked bad.

Why DBSCAN over threshold-based methods? DBSCAN detects outliers in the joint feature space without assuming any single feature is sufficient. A channel might have acceptable variance but anomalous kurtosis — DBSCAN catches multivariate outliers that per-feature thresholds miss.

Why automatic eps? Manual eps tuning is fragile across recordings with different noise floors. The k-distance knee provides a data-driven estimate that adapts to each recording’s feature distribution.
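The scaling, eps estimation, and labeling described above can be sketched with scikit-learn. Two things here are assumptions rather than cogpy's method: a simple "largest gap below the chord" heuristic stands in for KneeLocator, and `min_samples` is set equal to k. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def knee_eps(X, k=10):
    """Estimate eps from the sorted k-distance curve (stand-in for KneeLocator)."""
    # Querying the training points themselves counts each point as its own
    # first neighbor, which matches how DBSCAN counts min_samples.
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    kdist = np.sort(dists[:, -1])
    x = np.linspace(0.0, 1.0, kdist.size)
    y = (kdist - kdist[0]) / (kdist[-1] - kdist[0] + 1e-12)
    # Knee of a convex increasing curve: the point furthest below the chord.
    return float(kdist[np.argmax(x - y)])


def label_bad(X, k=10):
    """X: (n_channels, n_feature_quantile_combos). Returns (bad_mask, eps)."""
    Xs = StandardScaler().fit_transform(X)
    eps = knee_eps(Xs, k)
    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(Xs)
    return labels == -1, eps


# Synthetic check: 60 well-behaved channels plus 3 planted outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(60, 5))
outliers = np.array([[12.0] * 5, [-12.0] * 5, [12.0, -12.0, 12.0, -12.0, 12.0]])
bad, eps = label_bad(np.vstack([inliers, outliers]))
```

Because k exceeds the number of planted outliers, each outlier's k-distance has to reach into the main cluster, which is what produces the sharp rise at the end of the k-distance curve.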

Stage 4: Interpolation

Bad channels are replaced by spatial interpolation from their grid neighbors. This preserves the spatial sampling for downstream analyses (CSD computation, spatial measures) that require a complete grid.
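The page does not specify the interpolation scheme, so the sketch below uses the simplest option as a stand-in: each bad channel is replaced by the mean of its good neighbors in the surrounding 3×3 block. The function name `interpolate_bad` and the choice of neighborhood are assumptions.

```python
import numpy as np
from scipy import ndimage


def interpolate_bad(sig, bad):
    """sig: (AP, ML, time) array; bad: (AP, ML) boolean mask of bad channels."""
    kernel = ndimage.generate_binary_structure(2, 2).astype(float)  # 3x3 block
    kernel[1, 1] = 0.0  # a channel does not contribute to its own estimate
    # Number of good neighbors available at every grid position.
    n_good = ndimage.convolve((~bad).astype(float), kernel,
                              mode="constant", cval=0.0)
    # Zero out bad channels so they do not leak into their neighbors' sums.
    masked = np.where(bad[..., None], 0.0, sig)
    summed = ndimage.convolve(masked, kernel[..., None],
                              mode="constant", cval=0.0)
    out = np.array(sig, dtype=float)
    fill = bad & (n_good > 0)  # channels with no good neighbor are left as-is
    out[fill] = summed[fill] / n_good[fill][:, None]
    return out
```

Good channels pass through unchanged, so the output grid has the same shape and sampling as the input.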

Snakemake orchestration

The pipeline is packaged as a Snakemake workflow in cogpy.workflows.preprocess. Key design choices:

Rules are thin orchestrators. Each rule loads data via cogpy.io, calls cogpy compute functions, and saves via cogpy.io. No compute logic lives in the Snakefile.
Zarr as interchange format. Every stage reads and writes Zarr, providing chunked storage, metadata preservation, and restartability.
Config-driven parameters. All hyperparameters (filter cutoffs, window sizes, DBSCAN settings) live in YAML config files, not in code.
Dask chunking. Scripts chunk along the time axis only, keeping the full AP × ML grid together in each chunk, for memory-efficient processing of long recordings: sigx.chunk({'time': 16*4096, 'AP': -1, 'ML': -1})
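To get a feel for what the chunk specification above means in memory, the following back-of-the-envelope calculation may help. The 16 × 16 grid and the float32 dtype are assumptions for illustration; substitute the actual grid size and dtype of a recording.

```python
import numpy as np


def chunk_nbytes(time_chunk, n_ap, n_ml, dtype):
    """Bytes held by one chunk spanning the whole grid and `time_chunk` samples."""
    return time_chunk * n_ap * n_ml * np.dtype(dtype).itemsize


time_chunk = 16 * 4096                                 # samples per chunk
nbytes = chunk_nbytes(time_chunk, 16, 16, "float32")   # hypothetical 16x16 grid
print(time_chunk, nbytes / 2**20)                      # 65536 samples, 64.0 MiB
```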

CLI entry point

The cogpy-preproc command wraps Snakemake with sensible defaults:

cogpy-preproc all data/sub-01/rec1.lfp -c 8
cogpy-preproc feature data/sub-01/rec1.lfp --configfile custom.yml

It loads the packaged default config, merges any user overrides, resolves the packaged Snakefile path, and spawns a Snakemake subprocess targeting the requested rule.
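The wrapper's behavior might look roughly like the sketch below. None of this is taken from cogpy's source: the function names, the recursive merge strategy, and the example config keys are hypothetical, and how the recording path is passed to the workflow is deliberately left out because the page does not describe it. Only the Snakemake flags `--snakefile`, `--configfile`, and `--cores` are real.

```python
import copy


def deep_merge(defaults, overrides):
    """Return `defaults` updated recursively with `overrides` (inputs untouched)."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_command(rule, snakefile, configfile, cores):
    """Assemble the argument list for a Snakemake subprocess."""
    # The target rule comes first so it is not mistaken for a config file.
    return ["snakemake", rule,
            "--snakefile", str(snakefile),
            "--configfile", str(configfile),
            "--cores", str(cores)]


# Hypothetical config keys, for illustration only.
defaults = {"lowpass": {"order": 4, "cutoff": 100.0}, "dbscan": {"k": 10}}
user = {"lowpass": {"cutoff": 80.0}}
merged = deep_merge(defaults, user)
cmd = build_command("feature", "Snakefile", "merged.yml", 8)
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)` once the merged config has been written to disk.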

Legacy modules

The canonical bad-channel pipeline lives in cogpy.preprocess.badchannel. Three older modules are retained for backward compatibility but are deprecated:

Legacy module	Replacement
`channel_feature.py`	`badchannel.channel_features`
`channel_feature_functions.py`	`badchannel.channel_features`
`detect_bads.py`	`badchannel.badlabel`

New code should always use the badchannel subpackage.