--- title: Factor Analysis of Spectrograms with erpPCA file_format: mystnb kernelspec: name: cogpy display_name: cogpy language: python mystnb: execution_mode: "auto" --- # Factor Analysis of Spectrograms with erpPCA This tutorial demonstrates how to decompose a grid ECoG spectrogram into spatio-spectral factors using **varimax-rotated PCA** (erpPCA). :::{admonition} Setup :class: dropdown **Python 3.10+** required. Install cogpy with the extras this notebook needs (interactive plots + spectral backend + Jupyter kernel): ```bash pip install "ecogpy[viz,signal,notebook]" ``` Or install everything at once: `pip install "ecogpy[all]"`. For a development install see {doc}`install`. ::: :::{note} This is a **method demo** using a short bundled sample recording. The sample data is not chosen to highlight any particular oscillatory phenomenon. For real analyses (e.g. spindle detection in NREM sleep), use a domain-appropriate recording. ::: The pipeline is: 1. Load a grid ECoG signal 2. Compute a multitaper spectrogram 3. Build a design matrix (log-power, mean-centred) 4. Fit erpPCA → extract loadings and scores 5. Inspect loadings (which spatial pattern at which frequency) 6. Process factor scores (smooth, threshold, detect events) Each factor captures a recurring spatio-spectral pattern. :::{tip} **Why varimax?** Rohe & Zeng (2023) show that varimax-rotated PCA is, under mild conditions, equivalent to ICA: when the latent factors are non-Gaussian (leptokurtic, κ > 3), varimax recovers statistically identifiable independent components — the same goal ICA pursues, but via the classical eigendecomposition + rotation route rather than higher-order tensor methods. This makes erpPCA both computationally simple and theoretically grounded. *K. Rohe and M. Zeng, "Vintage factor analysis with Varimax performs statistical inference," J. R. Stat. Soc. B, vol. 85, no. 4, pp. 1037–1060, 2023.* [DOI: 10.1093/jrsssb/qkad029](https://doi.org/10.1093/jrsssb/qkad029) ::: ```{code-cell} python import numpy as np import xarray as xr import matplotlib.pyplot as plt import holoviews as hv hv.extension("bokeh") ``` ## 1. Load sample data `load_sample()` returns a preprocessed grid ECoG signal with dims `(AP, ML, time)` at 125 Hz. ```{code-cell} python from cogpy.datasets import load_sample sig = load_sample() print(sig) ``` Quick look — single-channel time trace and a spatial snapshot: ```{code-cell} python fig, axes = plt.subplots(1, 2, figsize=(12, 3)) # Time trace at center electrode sig.sel(AP=8, ML=8).plot(ax=axes[0]) axes[0].set(title="Time trace (AP=8, ML=8)", xlabel="Time (s)", ylabel="Amplitude") # Spatial snapshot at one time point sig.isel(time=sig.sizes["time"] // 2).plot(ax=axes[1], cmap="RdBu_r") axes[1].set(title="Spatial snapshot (mid-recording)", aspect="equal") plt.tight_layout() ``` ## 2. Compute multitaper spectrogram We use `spectrogramx` (ghostipy backend) to compute a time-frequency representation for every electrode in the grid. Three practical considerations for the PCA step: - **Frequency range**: clip to frequencies of interest (1–50 Hz) to avoid inflating the variable count. - **Frequency coarsening**: average adjacent frequency bins to reduce dimensionality while preserving spectral structure. This is the single most effective knob for controlling runtime. - **Grid subsampling**: for large grids, subsample spatially to keep the covariance matrix tractable. The PCA variable count is `n_AP × n_ML × n_freq`. The covariance matrix is `(n_vars × n_vars)`, so keeping this under ~3000 is recommended for interactive use. :::{note} This tutorial uses aggressive subsampling (8×8 grid, 12 freq bins → 768 variables) to keep the docs build fast (~45 s). For real analyses you can use the full 16×16 grid with 10–25 frequency bins. See [Computational considerations](#computational-considerations) at the end. ::: ```{code-cell} python from cogpy.spectral.specx import spectrogramx spec = spectrogramx(sig, bandwidth=4.0, nperseg=64, noverlap=56) spec = spec.compute() if hasattr(spec.data, "compute") else spec # Clip frequency range and subsample grid spec = spec.sel(freq=slice(1, 50)) spec = spec.isel(AP=slice(None, None, 2), ML=slice(None, None, 2)) # 8×8 # Coarsen frequency: average pairs of adjacent bins → ~12 bins # This is the key step for keeping the eigendecomposition fast. spec = spec.coarsen(freq=2, boundary="trim").mean() n_vars = spec.sizes["AP"] * spec.sizes["ML"] * spec.sizes["freq"] print(f"Spectrogram: {spec.shape}") print(f"PCA variables: {spec.sizes['AP']}×{spec.sizes['ML']}×{spec.sizes['freq']} = {n_vars}") ``` Spectrogram at one electrode and mean spectrum across the grid: ```{code-cell} python mid_ap = spec.AP.values[len(spec.AP) // 2] mid_ml = spec.ML.values[len(spec.ML) // 2] spec_img = hv.Image( spec.sel(AP=mid_ap, ML=mid_ml), kdims=["time", "freq"], label="Spectrogram", ).opts(cmap="viridis", colorbar=True, width=500, height=250, title=f"Spectrogram at AP={mid_ap}, ML={mid_ml}") mean_spec = spec.mean(dim=("AP", "ML", "time")) mean_curve = hv.Curve( (mean_spec.freq.values, mean_spec.values), kdims=["Frequency (Hz)"], vdims=["Power"], label="Mean spectrum", ).opts(width=350, height=250, title="Mean power spectrum", logy=True) spec_img + mean_curve ``` ## 3. Build the design matrix erpPCA operates on a 2D matrix `(time, variables)`. We: 1. **Rename** dims to `SpatSpecDecomposition` convention (`AP→h`, `ML→w`) 2. **Flatten** spatial + frequency dims into one axis 3. **Log-transform** power (stabilises variance) 4. **Mean-centre** each variable across time We deliberately avoid z-scoring (dividing by std) here: that would equalise the variance of every channel×frequency variable, giving noise-dominated variables the same weight as signal-rich ones in the covariance matrix. ```{code-cell} python from cogpy.decomposition.spatspec import SpatSpecDecomposition mtx = spec.rename({"AP": "h", "ML": "w"}) ss = SpatSpecDecomposition(mtx) X = ss.designmat(mtx, log=True) Xc = X - X.mean("time") Xc.data = np.nan_to_num(Xc.data) print(f"Design matrix: {Xc.shape[0]} time points × {Xc.shape[1]} variables") ``` Visualise the design matrix as a heatmap — each row is a time window, each column a (channel, frequency) variable: ```{code-cell} python hv.Image( Xc.data, kdims=["Variable", "Time"], bounds=(0, 0, Xc.shape[1], Xc.shape[0]), ).opts( cmap="RdBu_r", colorbar=True, width=600, height=250, title="Design matrix (mean-centred log-power)", xlabel="Variable index (h × w × freq)", ylabel="Time window", invert_yaxis=True, ) ``` ## 4. Fit erpPCA The `erpPCA` estimator follows the scikit-learn API: 1. Covariance matrix of the design matrix 2. Eigendecomposition 3. **Varimax rotation** to maximise simple structure 4. Sort factors by explained variance ```{code-cell} python from cogpy.decomposition.pca import erpPCA nfac = 8 erp = erpPCA(nfac=nfac, verbose=False) erp.fit(Xc.data) print(f"Loadings: {erp.LR.shape} (vars × factors)") print(f"Scores: {erp.FSr.shape} (time × factors)") print(f"Var explained (rotated): {erp.VT[3][:nfac].sum():.1f}%") ``` :::{tip} **Short recordings and rank-deficient covariance.** When the number of time windows is smaller than the number of variables (as in this demo), the empirical covariance matrix is rank-deficient and the top eigenvalues are over-estimated. A **shrinkage covariance estimator** (e.g. Ledoit-Wolf) regularises the estimate by pulling eigenvalues toward their grand mean: ```python from sklearn.covariance import LedoitWolf erp = erpPCA(nfac=nfac, cov_estimator=LedoitWolf(), verbose=False) erp.fit(Xc.data) ``` This is recommended whenever `n_time_windows < n_variables`. ::: Variance explained per factor: ```{code-cell} python var_pct = erp.VT[3][:nfac] hv.Bars( [(f"F{i}", v) for i, v in enumerate(var_pct)], kdims=["Factor"], vdims=["Variance (%)"], ).opts( width=400, height=250, color="steelblue", title="Variance explained per factor (rotated)", ) ``` ## 5. Reshape into loadings and scores Reshape the flat loading matrix back into `(factor, h, w, freq)` and wrap scores as `(time, factor)` DataArrays. ```{code-cell} python ss.ldx_set(erp.LR) ss.ldx_process() scx = ss.scx_from_FSr(erp.FSr, Xc.time.values) print(f"Loadings: {ss.ldx.shape}") print(f"Scores: {scx.shape}") print(f"\nFactor summary:") print(ss.ldx_df[["AP", "ML", "freqmax", "norm"]]) ``` ## 6. Visualise loadings ### Spatial maps (HoloViews) Each factor's spatial map at its peak frequency, laid out in a grid. The `×` marks the peak electrode. These are static `HoloMap`-compatible objects that render on Sphinx sites. ```{code-cell} python from cogpy.plot.hv.decomposition import loading_spatial_layout loading_spatial_layout(ss.ldx_slc_maxfreq, ss.ldx_df) ``` ### Browse factors with HoloMap A `HoloMap` keyed by factor — use the slider or dropdown to browse: ```{code-cell} python from cogpy.plot.hv.decomposition import factor_holomap factor_holomap(ss.ldx, ss.ldx_df) ``` ### Spectral profiles Frequency loading at the peak electrode for each factor — the red dashed line marks the peak frequency: ```{code-cell} python from cogpy.plot.hv.decomposition import loading_spectral_profiles loading_spectral_profiles(ss.ldx_slc_maxch, ss.ldx_df) ``` ### Interactive explorer (notebook only) The following `DynamicMap` provides a live slider to browse the full 4D loading tensor across factors and frequency bins. It will **not render on a static site** — run this cell in a Jupyter notebook. ```{code-cell} python :tags: [skip-execution] # NOTE: DynamicMap — requires a live kernel (won't render on static docs site) def _loading_explorer(factor, freq_idx): freq_val = float(ss.ldx.freq.values[freq_idx]) slc = ss.ldx.sel(factor=factor, freq=freq_val, method="nearest") return hv.Image(slc, kdims=["w", "h"]).opts( cmap="RdBu_r", colorbar=True, width=350, height=300, symmetric=True, invert_yaxis=True, title=f"Factor {factor} @ {freq_val:.1f} Hz", ) hv.DynamicMap( _loading_explorer, kdims=[ hv.Dimension("factor", values=list(range(nfac))), hv.Dimension("freq_idx", range=(0, ss.ldx.sizes["freq"] - 1), step=1), ], ) ``` ## 7. Factor scores over time Factor scores show how strongly each spatio-spectral pattern is expressed at each time point. The traces are stacked vertically (each normalised to unit std) and a **gain** slider lets you amplify small variations without changing the spacing between traces. ```{code-cell} python from cogpy.plot.hv.decomposition import score_traces_holomap score_traces_holomap(scx, ss.ldx_df, gains=(0.3, 0.5, 1.0, 2.0, 4.0)) ``` ## 8. Score processing Raw scores are noisy. The `scores` module provides a pipeline: 1. **Gaussian smoothing** — temporal low-pass 2. **Lower envelope removal** — removes slow baseline shifts 3. **Quantile thresholding** — keeps only prominent activations ```{code-cell} python from cogpy.decomposition.scores import scx_process scx_dict = scx_process(scx, sigma=0.25, quantile=0.25, return_all=True) ``` Compare processing stages for one factor: ```{code-cell} python ifac = 0 peak_freq = ss.ldx_df.loc[ifac, "freqmax"] stages = [ ("Raw", scx_dict["scx"], "steelblue"), ("Smoothed (no envelope)", scx_dict["scx_noenv"], "darkorange"), ("Thresholded", scx_dict["scx_thresh"], "green"), ] panels = [] for label, arr, color in stages: trace = arr.sel(factor=ifac) panels.append( hv.Curve( (trace.time.values, trace.values), kdims=["Time (s)"], vdims=["Score"], ).opts( width=700, height=150, color=color, title=f"F{ifac} ({peak_freq:.0f} Hz) — {label}", ) ) hv.Layout(panels).cols(1) ``` ## 9. Reconstruct and compare Reconstruct the spectrogram from loadings × scores and compare to the original at one electrode. ```{code-cell} python mtx_hat = ss.reconstruct(scx) mtx_z = ss.mtx_from_designmat(Xc, mtx) h_sel = ss.ldx.h.values[len(ss.ldx.h) // 2] w_sel = ss.ldx.w.values[len(ss.ldx.w) // 2] orig_slc = mtx_z.sel(h=h_sel, w=w_sel) recon_slc = mtx_hat.sel(h=h_sel, w=w_sel) resid_slc = orig_slc - recon_slc img_orig = hv.Image(orig_slc, kdims=["time", "freq"]).opts( cmap="viridis", colorbar=True, width=320, height=220, title="Original (z-scored)") img_recon = hv.Image(recon_slc, kdims=["time", "freq"]).opts( cmap="viridis", colorbar=True, width=320, height=220, title=f"Reconstructed ({nfac} factors)") img_resid = hv.Image(resid_slc, kdims=["time", "freq"]).opts( cmap="RdBu_r", colorbar=True, width=320, height=220, title="Residual", symmetric=True) img_orig + img_recon + img_resid ``` ## 10. Frequency band labelling Tag factors by their peak frequency band using the generic `mark_freq_band` method. ```{code-cell} python ss.mark_freq_band("delta", 0.5, 4) ss.mark_freq_band("theta", 4, 8) ss.mark_freq_band("alpha", 8, 13) ss.mark_freq_band("spindle", 10, 16) ss.mark_freq_band("beta", 13, 30) ss.mark_freq_band("low_gamma", 30, 80) print(ss.ldx_df[["freqmax", "is_delta", "is_theta", "is_alpha", "is_spindle", "is_beta", "is_low_gamma"]]) ``` ## Summary | Step | Function / Class | Module | |------|-----------------|--------| | Load data | `load_sample()` | `cogpy.datasets` | | Spectrogram | `spectrogramx()` | `cogpy.spectral.specx` | | Design matrix | `SpatSpecDecomposition.designmat()` | `cogpy.decomposition.spatspec` | | Fit PCA | `erpPCA.fit()` | `cogpy.decomposition.pca` | | Reshape loadings | `SpatSpecDecomposition.ldx_set()` | `cogpy.decomposition.spatspec` | | Score processing | `scx_process()` | `cogpy.decomposition.scores` | | Factor matching | `match_factors()` | `cogpy.decomposition.match` | | HV spatial layout | `loading_spatial_layout()` | `cogpy.plot.hv.decomposition` | | HV factor browser | `factor_holomap()` | `cogpy.plot.hv.decomposition` | For cross-recording factor matching (e.g. aligning factors across sessions or subjects), see {doc}`factor-matching`. (computational-considerations)= ## Computational considerations ### What this tutorial subsampled This tutorial uses a short sample recording (7.5 s). To keep the eigendecomposition tractable for interactive use, we reduced the variable count: | Parameter | Full data | This tutorial | Effect | |-----------|-----------|---------------|--------| | Grid | 16×16 = 256 ch | 8×8 = 64 ch | 2× spatial subsample | | Freq range | 0–62 Hz (33 bins) | 1–50 Hz (25 bins) | Drop DC + above-interest | | Freq coarsen | 25 bins | 12 bins (pairs averaged) | 2× freq reduction | | Overlap | 48/64 → 55 time wins | 56/64 → 109 time wins | More samples for PCA | | **PCA variables** | **8,448** | **768** | 11× reduction | The computational bottleneck is **`np.linalg.eigh`** on the `(n_vars × n_vars)` covariance matrix — an O(n³) operation. The covariance computation itself (`X.T @ X`) is fast and scales linearly with recording length. ### Scaling to real sessions (2 hours, 16×16 grid) For a 2-hour recording at 125 Hz with `nperseg=256, noverlap=224`: | | Tutorial | Real session | |---|----------|-------------| | Duration | 7.5 s | 7,200 s | | Time windows | 109 | ~28,000 | | Grid | 8×8 | 16×16 | | Freq bins (1–50 Hz) | 25 | ~100 | | **Variables** | **1,600** | **25,600** | | Cov matrix entries | 2.5 M | 655 M | With 28,000 time samples >> 25,600 variables the covariance is well-conditioned. The cost is dominated by the eigendecomposition. ### Strategies for large-scale analysis - **Frequency binning**: average power into ~10 canonical bands (delta, theta, alpha, ...) instead of keeping all freq bins. 16×16×10 = 2,560 vars — fast and interpretable. - **Truncated eigendecomposition**: replace `np.linalg.eigh` with `scipy.sparse.linalg.eigsh(k=nfac)` to compute only the top-k eigenvalues. Reduces O(n³) to O(n²·k). - **Two-stage decomposition**: first reduce frequency dimensionality per channel (standard PCA), then run spatial varimax on the reduced representation. - **Spatial subsampling**: for very large grids, subsample every 2nd electrode (as done here) with minimal information loss for low-spatial-frequency patterns. ## Next steps - {doc}`factor-matching` — match factors across sessions and animals (direct follow-up) - {doc}`spectral-analysis` — spectral analysis stack (PSD, spectrograms, features) - {doc}`spatial-measures` — spatial grid measures (Moran's I, gradient anisotropy)