---
title: Cross-Recording Factor Matching
file_format: mystnb
kernelspec:
  name: cogpy
  display_name: cogpy
  language: python
mystnb:
  execution_mode: "auto"
---

# Cross-Recording Factor Matching

This tutorial demonstrates how to **match spatio-spectral factors across
multiple recordings** (sessions or animals) and compute a consensus
(centroid) decomposition.  It is a direct follow-up to
{doc}`factor-analysis`, which covers single-recording erpPCA.

:::{admonition} Setup
:class: dropdown

**Python 3.10+** required.  Install cogpy with the extras this notebook
needs (interactive plots + spectral backend + Jupyter kernel):

```bash
pip install "ecogpy[viz,signal,notebook]"
```

Or install everything at once: `pip install "ecogpy[all]"`.
For a development install see {doc}`install`.
:::

:::{admonition} Prerequisites
:class: important

Read {doc}`factor-analysis` first.  This tutorial assumes familiarity
with `SpatSpecDecomposition`, `erpPCA`, loadings, and scores.
:::

## The problem

Varimax-rotated PCA decomposes each recording's spectrogram into
spatio-spectral factors independently.  The factors are meaningful
within a single recording, but across recordings:

- **Factor ordering differs** — variance ranking changes with noise.
- **Factor polarity and shape vary** — slightly different spatial
  patterns emerge from different data segments.

To compare factors across recordings (e.g. "is this spindle pattern
consistent across sessions?"), we need to **match** corresponding
factors and quantify their similarity.

## The approach

The matching pipeline (Garcia-Cortadella et al. 2024,
{func}`cogpy.decomposition.match.match_factors`) works in five steps:

:::{admonition} Algorithm overview
:class: tip

1. **Pairwise similarity** — for every pair of recordings, compute a
   `(nfac x nfac)` similarity matrix using spatial correlation of
   loading maps at peak frequency, gated by a frequency threshold.
2. **Hungarian assignment** — find the optimal one-to-one factor
   mapping for each pair that maximises total similarity
   (`scipy.optimize.linear_sum_assignment`).
3. **Reference selection** — pick the recording whose transitive
   matches are best overall (the "medoid"), or use a fixed reference.
4. **Transitive alignment** — re-index all recordings' factors to
   match the reference's ordering.
5. **Centroid** — average the aligned, normalised loadings to get a
   consensus decomposition.
:::

```{code-cell} python
import warnings; warnings.filterwarnings("ignore")  # suppress xarray FutureWarnings
import numpy as np
import xarray as xr
import pandas as pd
import holoviews as hv

hv.extension("bokeh")
```

## 1. Generate synthetic multi-session data

We create **4 synthetic recordings** that share the same underlying
spatio-spectral factors but with different noise realisations and
factor orderings — mimicking what you'd get from independent PCA
fits on different sessions of the same animal.

:::{admonition} Why synthetic data?
:class: note

The bundled sample recording is a single 7.5 s segment. To
demonstrate matching we need multiple recordings with shared
structure.  We extract "ground truth" loadings from the real data,
then generate synthetic spectrograms by mixing these loadings with
random scores + noise.  Each independent PCA fit recovers the factors
in a **different order** — exactly the scenario matching solves.
:::

```{code-cell} python
from cogpy.datasets import load_sample
from cogpy.spectral.specx import spectrogramx
from cogpy.decomposition.spatspec import SpatSpecDecomposition
from cogpy.decomposition.pca import erpPCA

# Fit erpPCA on the real sample to get "ground truth" loadings
sig = load_sample()
spec = spectrogramx(sig, bandwidth=4.0, nperseg=64, noverlap=56)
spec = spec.compute() if hasattr(spec.data, "compute") else spec
spec = spec.sel(freq=slice(1, 50)).isel(
    AP=slice(None, None, 2), ML=slice(None, None, 2)
)
spec = spec.coarsen(freq=2, boundary="trim").mean()

mtx = spec.rename({"AP": "h", "ML": "w"})
ss_real = SpatSpecDecomposition(mtx)
X_real = ss_real.designmat(mtx, log=True)
Xc_real = X_real - X_real.mean("time")
Xc_real.data = np.nan_to_num(Xc_real.data)

nfac = 6
erp = erpPCA(nfac=nfac, verbose=False)
erp.fit(Xc_real.data)

# Ground truth loadings: (nfac, n_vars)
L_true = erp.LR.T  # (nfac, hwf)
print(f"Ground truth: {nfac} factors, {L_true.shape[1]} variables")
```

Now generate 4 synthetic spectrograms by mixing these loadings with
random scores, adding noise, and fitting erpPCA independently on
each:

```{code-cell} python
rng = np.random.default_rng(42)
nrec = 4
n_time = 200  # time windows per synthetic recording

synthetic_series = []

for irec in range(nrec):
    # Random scores (each factor has different variance)
    variances = rng.exponential(1.0, size=nfac)
    scores = rng.normal(size=(n_time, nfac)) * np.sqrt(variances)

    # Reconstruct spectrogram + add noise
    X_synth = scores @ L_true
    noise_level = 0.3 * np.std(X_synth)
    X_synth += rng.normal(scale=noise_level, size=X_synth.shape)

    # Fit erpPCA independently — factor order will differ
    erp_synth = erpPCA(nfac=nfac, verbose=False)
    erp_synth.fit(X_synth)

    # Build SpatSpecDecomposition
    ss = SpatSpecDecomposition(ss_real.spatspec_coords)
    ss.ldx_set(erp_synth.LR)
    ss.ldx_process()
    synthetic_series.append(ss)

ss_series = pd.Series(synthetic_series)
print(f"Created {nrec} synthetic recordings, each with {nfac} factors")
```

## 2. Inspect unmatched factors

Before matching, factors from different recordings are in arbitrary
order.  Compare the spatial maps at peak frequency for the first two
recordings — notice that **similar patterns appear at different factor
indices**:

```{code-cell} python
from cogpy.plot.hv.decomposition import loading_spatial_layout

layout_0 = loading_spatial_layout(
    ss_series.iloc[0].ldx_slc_maxfreq,
    ss_series.iloc[0].ldx_df,
).opts(title="Recording 0 (unmatched)")

layout_1 = loading_spatial_layout(
    ss_series.iloc[1].ldx_slc_maxfreq,
    ss_series.iloc[1].ldx_df,
).opts(title="Recording 1 (unmatched)")

layout_0 + layout_1
```

:::{admonition} What to look for
:class: hint

You should see similar spatial patterns (blobs in similar grid
locations) but at different factor indices across the two recordings.
For example, Recording 0's Factor 0 might look like Recording 1's
Factor 3.  This is the ordering ambiguity that matching resolves.
:::

## 3. Pairwise similarity matrix

The first step in matching is computing the similarity between every
factor pair across two recordings.  The metric is the **Pearson
correlation between spatial loading maps at peak frequency**.  Factor
pairs whose peak frequencies differ by more than `freq_threshold` Hz
are assigned zero similarity:

```{code-cell} python
from cogpy.decomposition.pca import compute_similarity_matrix

simil_01 = compute_similarity_matrix(
    ss_series.iloc[0], ss_series.iloc[1], freq_threshold=3
)

hv.HeatMap(
    [(f"Rec0 F{i}", f"Rec1 F{j}", simil_01[i, j])
     for i in range(nfac) for j in range(nfac)],
    kdims=["Recording 0", "Recording 1"],
    vdims=["Similarity"],
).opts(
    cmap="RdBu_r", colorbar=True, width=400, height=350,
    title="Pairwise factor similarity (Rec 0 vs Rec 1)",
    xrotation=45, symmetric=True,
)
```

:::{admonition} Reading the heatmap
:class: hint

Each cell `(i, j)` shows the spatial correlation between Factor `i`
of Recording 0 and Factor `j` of Recording 1.  Values close to 1
indicate a strong match.  Zero values mean the peak frequencies
differed by more than `freq_threshold` — those pairs are excluded
from consideration.
:::

## 4. Hungarian assignment (bipartite matching)

The **Hungarian algorithm** (`scipy.optimize.linear_sum_assignment`)
finds the one-to-one mapping that maximises total similarity.  It
solves the assignment problem optimally in O(n^3):

```{code-cell} python
from scipy.optimize import linear_sum_assignment

row_ind, col_ind = linear_sum_assignment(-simil_01)  # negate for maximisation

print("Optimal factor mapping (Rec 0 -> Rec 1):")
for r, c in zip(row_ind, col_ind):
    print(f"  Factor {r} -> Factor {c}  (similarity: {simil_01[r, c]:.3f})")
print(f"\nMean optimal similarity: {simil_01[row_ind, col_ind].mean():.3f}")
```

:::{admonition} Why Hungarian and not greedy?
:class: note

A greedy approach (pick the best pair, remove it, repeat) can get
stuck in suboptimal solutions.  The Hungarian algorithm guarantees
the **globally optimal** one-to-one assignment — the permutation that
maximises the sum of matched similarities.
:::

## 5. Multi-recording matching with `match_factors`

For more than 2 recordings, pairwise Hungarian can't be applied
directly (it's a bipartite algorithm — it only handles two sets).
`match_factors` uses a **hub-and-spoke** strategy:

```{code-cell} python
from cogpy.decomposition.match import match_factors

results = match_factors(
    ss_series,
    nrec=nrec,
    nfac=nfac,
    freq_threshold=3,
    simil_score_threshold=0.5,
    optimal_ref=True,
)

print(f"Optimal reference recording: {int(results['metadata_csv']['opt_ref'])}")
print(f"\nMatch table (factor index per recording):")
print(results["match_df_csv"])
```

:::{admonition} Understanding the match table
:class: hint

Each **row** is a matched factor group.  The numbered columns
(0, 1, 2, 3) show which factor index in each recording corresponds
to that matched factor.  The `simil` column is the **minimum**
pairwise similarity across all recording pairs for that group — a
conservative quality measure.

Factors with similarity below `simil_score_threshold` are dropped.
:::

:::{admonition} Optimal reference selection
:class: note

When `optimal_ref=True`, the algorithm tries **every recording** as
the reference, runs the full matching, and picks the one whose
transitive matches are best overall.  This is like finding the
"medoid" of the group — the most typical recording.  Set
`optimal_ref=False` to always use recording 0 as the reference
(faster, deterministic, but may give worse matches if recording 0
is atypical).
:::

## 6. Visualise matched loadings

After matching, factor indices are aligned across recordings.
Each column below shows the **same matched factor** across all 4
recordings — they should look spatially similar:

```{code-cell} python
ss_all = results["ldx_all_pkl"]
n_matched = min(4, len(results["match_df_csv"]))

panels = []
for ifac in range(n_matched):
    for irec in range(nrec):
        slc = ss_all.ldx_slc_maxfreq.sel(rec=irec, factor=ifac)
        panels.append(
            hv.Image(
                slc, kdims=["w", "h"],
            ).opts(
                cmap="RdBu_r", colorbar=False, width=160, height=140,
                symmetric=True, invert_yaxis=True,
                title=f"Rec {irec}, F{ifac}",
            )
        )

hv.Layout(panels).cols(nrec).opts(
    title="Matched factors: each column = same factor across recordings"
)
```

## 7. Centroid (consensus) loadings

The centroid is the mean of normalised, aligned loadings — a
**consensus factor** that captures the shared spatial-spectral
pattern across all recordings:

```{code-cell} python
ss_cent = results["ldx_pkl"]

loading_spatial_layout(ss_cent.ldx_slc_maxfreq, ss_cent.ldx_df)
```

### Centroid spectral profiles

The frequency profile at the peak electrode for each consensus
factor — the red dashed line marks the peak frequency:

```{code-cell} python
from cogpy.plot.hv.decomposition import loading_spectral_profiles

loading_spectral_profiles(ss_cent.ldx_slc_maxch, ss_cent.ldx_df)
```

## 8. Quality assessment

Not all factors match well across recordings.  The `simil` column
quantifies the worst-case pairwise similarity for each matched
factor.  Low scores indicate patterns that are session-specific or
noise-driven:

```{code-cell} python
match_df = results["match_df_csv"]
hv.Bars(
    [(f"F{i}", s) for i, s in zip(match_df.index, match_df["simil"])],
    kdims=["Matched factor"], vdims=["Min. similarity"],
).opts(
    width=400, height=250, color="steelblue",
    title="Match quality per factor (min pairwise similarity)",
    ylim=(0, 1),
)
```

:::{admonition} Interpreting match quality
:class: tip

- **> 0.9**: strong match — the factor is highly reproducible.
- **0.7 -- 0.9**: moderate — likely a real pattern with some
  session-to-session variability.
- **< 0.7**: weak — consider whether this factor is meaningful
  or noise-driven.  Increase `simil_score_threshold` to filter
  these out automatically.
:::

## Summary

| Step | Function | Module |
|------|----------|--------|
| Fit per-recording | `erpPCA.fit()` | `cogpy.decomposition.pca` |
| Pairwise similarity | `compute_similarity_matrix()` | `cogpy.decomposition.pca` |
| Full matching pipeline | `match_factors()` | `cogpy.decomposition.match` |
| HV spatial layout | `loading_spatial_layout()` | `cogpy.plot.hv.decomposition` |
| HV spectral profiles | `loading_spectral_profiles()` | `cogpy.plot.hv.decomposition` |

### Key parameters

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `freq_threshold` | 3 Hz | Max peak-frequency difference to consider a match |
| `simil_score_threshold` | 0.7 | Min similarity to retain a matched factor |
| `optimal_ref` | `True` | Auto-select best reference recording |

(cross-animal-matching)=
## Cross-animal matching

When comparing factors **across animals**, the matching can be done
in two stages:

1. **Within-animal**: match factors across sessions of the same
   animal using `match_factors` to produce a per-animal centroid.
2. **Across-animal**: collect the per-animal centroids into a new
   Series and run `match_factors` again to produce a grand centroid.

```python
# Stage 1: within-animal centroids
animal_centroids = {}
for animal_id, session_series in sessions_by_animal.items():
    result = match_factors(session_series, nrec=len(session_series),
                           nfac=nfac, freq_threshold=3,
                           simil_score_threshold=0.7)
    animal_centroids[animal_id] = result["ldx_pkl"]

# Stage 2: across-animal matching
cross_animal_series = pd.Series(animal_centroids)
grand_result = match_factors(cross_animal_series,
                              nrec=len(cross_animal_series),
                              nfac=nfac, freq_threshold=5,
                              simil_score_threshold=0.5)
grand_centroid = grand_result["ldx_pkl"]
```

:::{admonition} Spatial correspondence across animals
:class: warning

The current similarity metric (spatial correlation) assumes
**pixel-to-pixel correspondence** — i.e. the same grid size and
placement across animals.  When grid placements differ, spatial
correlation may underestimate true similarity.

Garcia-Cortadella et al. (2024) use an alternative **feature-space
Euclidean distance** for cross-animal matching: a 7-dimensional
summary (AP/ML sign, grid position, peak frequency, spatial extent)
that does not require aligned grids.  A `method="feature_space"`
option for `match_factors` is planned for a future release.
:::

## Next steps

- {doc}`factor-analysis` — single-recording factor analysis
  (prerequisite)
- {doc}`spectral-analysis` — spectral analysis stack (PSD,
  spectrograms)
- `cogpy.decomposition.match` — API reference for matching functions

## References

Garcia-Cortadella, R. et al. (2024). DC-coupled, 2.5D
electrophysiological imaging of large-scale cortical dynamics.
*bioRxiv*. [DOI: 10.1101/2024.12.20.629545](https://doi.org/10.1101/2024.12.20.629545)