# Data Model

cogpy uses `xarray.DataArray` as its primary data structure. This page
explains the schema conventions and why they were chosen.

## Why xarray?

ECoG data has labeled dimensions (time, space, frequency) that are
meaningful to the analysis. Raw numpy arrays lose this context — you have
to track which axis is which separately. xarray solves this by attaching
named dimensions and coordinates to arrays.

cogpy chose xarray over custom classes because:
- It integrates with the scientific Python ecosystem (pandas, dask, zarr)
- It provides serialization for free (netCDF, Zarr)
- Dimension-aware operations (`sel`, `isel`, `groupby`) reduce indexing bugs
- It avoids the maintenance burden of a custom array class

## Signal schemas

All schemas are defined in `cogpy.base.ECoGSchema`:

### Grid ECoG: `(time, AP, ML)`

The primary schema for 2D electrode grids. Used by spatial measures,
CSD computation, and grid-aware preprocessing.

```
sig.dims  →  ("time", "AP", "ML")
sig.fs    →  1000.0  (Hz, scalar coordinate)
sig.time  →  [0.000, 0.001, 0.002, ...]  (seconds)
sig.AP    →  [0, 1, 2, ..., 15]  (grid row indices)
sig.ML    →  [0, 1, 2, ..., 15]  (grid column indices)
```

**AP** = anterior-posterior (row 0 = most posterior). **ML** = medial-lateral
(column 0 = most medial). This convention matches the physical electrode
layout.

### Flat ECoG: `(time, ch)`

For channel-indexed data without grid semantics (e.g., strip electrodes,
or after flattening a grid).

### Multichannel: `(channel, time)`

Generic multichannel format. Note the transposed axis order compared to
flat ECoG — this matches common neuroscience conventions (channels as rows).

## Sampling rate

`fs` can be stored in two ways:
- **Scalar coordinate:** `sig.coords["fs"]` — preferred
- **Attribute:** `sig.attrs["fs"]` — fallback

Use `cogpy.base.get_fs(sig)` to retrieve it regardless of storage
method, and `ensure_fs(sig, fs=1000.0)` to guarantee it is set.

## Spectrogram schemas

Two spectrogram schemas serve different roles:

### `GridWindowedSpectrum` — compute pipeline form

```
spec.dims  →  ("time_win", "AP", "ML", "freq")
```

Uppercase spatial dims match the `(..., AP, ML)` batch convention expected by
spatial measures. Use `coerce_grid_windowed_spectrum(da)` to convert
`spectrogramx()` output into this form (handles renames and transposes).

### `GridSpectrogram4D` — orthoslicer/GUI form

```
spec.dims  →  ("ml", "ap", "time", "freq")
```

Lowercase spatial dims, optimized for slice-based visualization.

### Flat form

```
spec.dims  →  ("ch", "time_win", "freq")
```

### Normalization

`normalize_spectrogram(spec, method=...)` produces a whitened or
dB-transformed spectrogram with the same dims:

- `"robust_zscore"` — `(x - median) / MAD` along `freq`
- `"db"` — `10 * log10(x)`

## Batch dimension convention

Compute functions in cogpy follow numpy's broadcasting convention with
**typed trailing axes**:

| Domain | Convention | Example |
|--------|-----------|---------|
| Temporal measures | `(..., time)` | `kurtosis(arr)` reduces last axis |
| Spectral features | `(..., freq)` | `band_power(psd, freqs, band)` |
| Spatial measures | `(..., AP, ML)` | `moran_i(grid)` reduces last two axes |

Leading `...` dimensions are batch dimensions (time windows, frequency bins,
channels, etc.). Functions broadcast over them automatically.

This design means you can apply a spatial measure to a full 4D spectrogram
`(time, freq, AP, ML)` in one call — no loops required.

## Event representations

Events are stored in `EventCatalog`, a thin pandas DataFrame wrapper with
standardized columns:

| Column | Required | Description |
|--------|----------|-------------|
| `event_id` | yes | Unique integer ID |
| `t` | yes | Event time (seconds) |
| `t0`, `t1` | no | Interval start/end |
| `duration` | no | `t1 - t0` |
| `freq` | no | Peak frequency (Hz) |
| `AP`, `ML` | no | Grid position |
| `channel` | no | Channel index |
| `label` | no | Event type label |
| `score` | no | Detection confidence |
| `detector` | no | Detector name |
| `pipeline` | no | Pipeline provenance |

## Validation boundaries

Core compute functions assume valid input — they do not coerce dimensions
or check schemas. Validation happens at **system boundaries**:

- `cogpy.io` — constructs valid DataArrays from raw files
- `cogpy.datasets.schemas` — `validate_*()` and `coerce_*()` functions
- `cogpy.cli` — argument parsing and input validation
- Frontend entry points — before passing data to the backend

This keeps core functions fast and simple.