Systematic literature processing: biblio × questio bridge

Problem

biblio manages papers (ingest, docling, grobid, enrich, RAG). questio manages hypotheses (questions.yml, milestones, evidence from result notes). No structured layer connects the two: nothing records that "paper X found Y about hypothesis H1 with parameter Z, and our dataset can reproduce/extend/contradict it."

The gap surfaced while processing Swanson et al. 2025 for pixecog: manually extracting hypothesis-relevant findings, methods parameters, and dataset opportunities took significant effort, and the resulting output had no canonical home in the biblio derivatives tree.

Suggested tools

1. biblio_extract(citekey, force=False)
   - Reads docling markdown + grobid refs for a citekey
   - Reads plan/questions.yml for hypothesis definitions
   - Writes structured YAML to bib/derivatives/claude/{citekey}/extract.yml
   - Schema per paper:
     - relevance: per-hypothesis mapping (strength: direct/supporting/tangential/none, finding, parameters, implication for our dataset)
     - methods: detection methods, thresholds, bands, software used
     - dataset_opportunities: reproduce / extend / contradict lists
     - species, regions, recording, n_subjects, sleep_stages
   - This is the LLM extraction step: it uses docling full text as input and questions.yml as the evaluation frame
   - Could be run in batch (biblio_extract_batch) for all papers with docling output
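
To make the proposed schema concrete, here is a minimal sketch of the per-paper record as Python dataclasses. The field names follow the schema listed above. The class names, the field types, and the shape of `parameters` (a flat name-to-value mapping) are assumptions for illustration, not a settled design.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

# Strength levels named in the proposal.
STRENGTHS = ("direct", "supporting", "tangential", "none")


@dataclass
class Relevance:
    """Relevance of one paper to one hypothesis (hypothetical class name)."""
    strength: str
    finding: str = ""
    parameters: dict = field(default_factory=dict)  # assumed: name -> value
    implication: str = ""  # implication for our dataset

    def __post_init__(self):
        # Reject strength labels outside the proposed vocabulary.
        if self.strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {self.strength!r}")


@dataclass
class PaperExtract:
    """One extract.yml record (hypothetical class name)."""
    citekey: str
    relevance: dict = field(default_factory=dict)  # hypothesis id -> Relevance
    methods: dict = field(default_factory=dict)
    dataset_opportunities: dict = field(
        default_factory=lambda: {"reproduce": [], "extend": [], "contradict": []}
    )
    species: str = ""
    regions: list = field(default_factory=list)
    recording: str = ""
    n_subjects: Optional[int] = None
    sleep_stages: list = field(default_factory=list)

    def to_dict(self):
        # Plain nested dicts and lists, ready for a YAML serializer.
        return asdict(self)
```

Validating `strength` at construction time would let a batch run fail fast on malformed LLM output instead of writing an extract.yml that the synthesis step cannot interpret.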

2. questio_prior_art(question_id?)
   - Reads all bib/derivatives/claude/*/extract.yml files
   - Assembles per-hypothesis literature evidence tables:
     - What each study found (finding, lag, effect size, species, n)
     - Consensus parameters (range across studies)
     - What our dataset uniquely adds
   - Outputs markdown to docs/plan/prior_art/ (or integrates into questio_docs_collect)
   - This is the synthesis layer: no LLM needed, just YAML aggregation + markdown generation
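
The aggregation step could look roughly like the sketch below. It assumes each extract.yml has already been parsed into a plain dict (for example with a YAML loader) and that `parameters` is a flat mapping of names to numbers. The function names, the table columns, and the output layout are hypothetical. The "what our dataset uniquely adds" column is left out.

```python
from collections import defaultdict


def prior_art_rows(extracts, question_id):
    """Collect one row per paper that is relevant to question_id."""
    rows = []
    for ex in extracts:
        rel = ex.get("relevance", {}).get(question_id)
        # Skip papers that do not mention the hypothesis or mark it "none".
        if not rel or rel.get("strength", "none") == "none":
            continue
        rows.append({
            "citekey": ex["citekey"],
            "strength": rel["strength"],
            "finding": rel.get("finding", ""),
            "species": ex.get("species", ""),
            "n": ex.get("n_subjects"),
            "parameters": rel.get("parameters", {}),
        })
    return rows


def consensus_ranges(rows):
    """Return {parameter: (min, max)} over all numeric parameter values."""
    values = defaultdict(list)
    for row in rows:
        for name, value in row["parameters"].items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values[name].append(value)
    return {name: (min(v), max(v)) for name, v in values.items()}


def to_markdown(question_id, rows):
    """Render the evidence table and parameter ranges as markdown text."""
    lines = [
        f"{question_id}: prior art",
        "",
        "| Study | Strength | Finding | Species | n |",
        "|---|---|---|---|---|",
    ]
    for row in rows:
        n = "" if row["n"] is None else row["n"]
        lines.append(
            f"| {row['citekey']} | {row['strength']} | {row['finding']} "
            f"| {row['species']} | {n} |"
        )
    ranges = consensus_ranges(rows)
    if ranges:
        lines += ["", "Consensus parameters (range across studies):"]
        lines += [f"- {name}: {lo} to {hi}" for name, (lo, hi) in sorted(ranges.items())]
    return "\n".join(lines)
```

Because the functions only read dicts and build strings, the same code could run in a scheduled docs build with no model access, which is the point of keeping synthesis separate from extraction.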

3. Extend questio_gap(question_id) response
   - Currently shows: milestones, result-note evidence, blockers
   - Could additionally show: "literature expectations" pulled from extract.yml files
   - e.g., "4 studies report D-U→SWR lag of 100-150 ms; our milestone delta-ripple-coupling should reproduce this"
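
One way the "literature expectations" lines might be generated is sketched below. It assumes parsed extract dicts with a flat numeric `parameters` mapping, plus a mapping from milestone name to the parameter that milestone is expected to reproduce. That mapping, the function name, and the parameter key used in the usage comment are hypothetical; the source does not say where such a link would be stored.

```python
def literature_expectations(extracts, question_id, milestone_params):
    """Summarize reported parameter ranges per milestone.

    milestone_params maps milestone name -> parameter name
    (a hypothetical link, not part of the existing questio schema).
    """
    messages = []
    for milestone, param in milestone_params.items():
        values = []
        for ex in extracts:
            rel = ex.get("relevance", {}).get(question_id, {})
            value = rel.get("parameters", {}).get(param)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values.append(value)
        if not values:
            continue  # no literature value reported for this parameter
        lo, hi = min(values), max(values)
        span = f"{lo}" if lo == hi else f"{lo}-{hi}"
        verb = "study reports" if len(values) == 1 else "studies report"
        messages.append(
            f"{len(values)} {verb} {param} of {span}; "
            f"milestone {milestone} should reproduce this"
        )
    return messages


# Usage with made-up values:
# literature_expectations(extracts, "H1", {"delta-ripple-coupling": "du_swr_lag_ms"})
```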

Why this matters

- Makes literature review a first-class pipeline output, not ad hoc notes
- Connects papers to hypotheses bidirectionally (which papers matter for H1? which hypotheses does paper X inform?)
- Captures specific parameters for pipeline validation targets (e.g., "expect 30 ms SWR→DOWN lag per Swanson 2025")
- The reproduce/extend/contradict frame drives manuscript framing directly
- Enables batch processing: ingest 20 papers → extract all → synthesize per hypothesis automatically

Design considerations

- bib/derivatives/claude/ follows the existing derivatives pattern (docling/, grobid/, openalex/)
- extract.yml is YAML, not markdown, so it stays composable for synthesis
- The extraction prompt needs the project's questions.yml as context, which makes this step project-specific rather than generic biblio
- Could live in biblio (since it's per-paper) or questio (since it's hypothesis-driven); probably biblio, since the output is per-citekey under bib/derivatives/
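
The path layout and the batch selection for biblio_extract_batch could be sketched as follows. The extract.yml location is taken from the proposal. The assumption that docling output lives in one subdirectory per citekey under bib/derivatives/docling/ is a guess at the existing layout, and both function names are hypothetical.

```python
from pathlib import Path


def extract_path(bib_root, citekey):
    """Location of the proposed extract.yml for one citekey."""
    return Path(bib_root) / "derivatives" / "claude" / citekey / "extract.yml"


def pending_citekeys(bib_root, force=False):
    """Citekeys with docling output that still need an extract.

    Assumes one directory per citekey under derivatives/docling/.
    With force=True, every citekey with docling output is returned.
    """
    docling_dir = Path(bib_root) / "derivatives" / "docling"
    if not docling_dir.is_dir():
        return []
    citekeys = sorted(p.name for p in docling_dir.iterdir() if p.is_dir())
    if force:
        return citekeys
    return [k for k in citekeys if not extract_path(bib_root, k).exists()]
```

The `force` flag mirrors the one in biblio_extract(citekey, force=False): by default a batch run only touches papers without an existing extract, so reruns after ingesting new papers stay cheap.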

Source context: pixecog

PixEcog (pixecog): Neuropixels and ECoG dataset and analysis

Recent commits:

8dc0d9d Pipeline docs: gitignore docs/pipelines/, relocate hand-authored files
96cd1ec Refactor sharpwaveripple/contracts: extract generic helpers to utils/io, remove pipelines __init__.py
36f9326 Add result note directory and sample note

README:

type: readme

Quick Start for Collaborators

Follow this checklist to get started with Pixecog documentation and workflows.

🐀 Pixecog Project: Compact Overview

Core principles

One immutable BIDS raw dataset (raw/) as the canonical baseline
Each analysis pipeline ha

idea-arash-20260407-225436-752515.md — studyio proposes hypothesis-aware research orchestration — directly overlaps with the biblio×questio bridge; both aim to connect literature evidence to hypothesis tracking
idea-arash-20260408-035007-479946.md — questio_gap→pipeio_run involves questio evidence gaps; questio_prior_art proposed here is the complementary tool that fills those gaps from the literature side
idea-arash-20260408-035035-245990.md — auto-QC structured result notes share the same schema design problem: where canonical structured evidence lives in the derivatives tree
idea-arash-20260326-155537-817950.md — biblio_docling async mode is a prerequisite for biblio_extract_batch — both operate on the same docling markdown output
idea-arash-20260403-172004-817050.md — skill candidates for projio ecosystem — biblio_extract and questio_prior_art are natural additions to that candidate list