Spec: GROBID citation context — beyond simple citation networks
Goal
(promoted from note)
Context
(see source note)
Prompt
Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021609-691872.md). Understand the problem, then implement the proposed fix.
Spec: GROBID citation context — beyond simple citation networks
biblio currently uses GROBID for header extraction and reference parsing, producing a flat list of references per paper. But GROBID can also extract citation contexts — the sentences where a reference is cited. This enables "paper X cites paper Y in context C" relationships, which are far richer than simple citation edges.
Research questions
- What does GROBID provide for citation context?
    - TEI XML `<ref>` elements have `target` attributes linking to bibliography entries
    - These refs are embedded in the full-text paragraphs — the surrounding text IS the citation context
    - How does grobid-client-python expose this?
    - What does the TEI structure look like for inline citations?
- How could biblio use citation contexts?
    - Enrich the citation graph: instead of just "A cites B", store "A cites B saying '...sharp-wave ripples were shown to...'"
    - RAG queries could return citation contexts as evidence
    - Manuscript writing: auto-generate citation sentences based on how others cited the same paper
    - Literature review: cluster papers by how they cite a common reference
- What's the data model?
    - Where to store citation contexts? `bib/derivatives/grobid/{citekey}/contexts.json`?
    - Schema: `{citing_citekey, cited_citekey, context_text, section, position}`
    - How to extract from existing TEI XML that biblio already generates
- What does biblio-glutton add?
    - biblio-glutton does high-performance bibliographic matching
    - Could replace or augment biblio's CrossRef-based `resolve_doi_by_title`
    - Matching unresolved GROBID references to DOIs
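To make the first and third questions concrete, here is a minimal sketch of pulling citation contexts out of a TEI document with the standard library. It assumes GROBID's usual TEI namespace and `<ref type="bibr" target="#bN">` callouts inside body paragraphs. The function name `extract_contexts_sketch`, the sample XML, and the citekey `doe2021` are made up for illustration; this is not biblio's existing API. The record uses the proposed schema fields plus a `bib_id`, and the whole paragraph stands in for the context because sentence splitting is not attempted here.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # TEI namespace used by GROBID output
NS = {"tei": TEI}

# Hand-written sample for illustration, not taken from a real paper.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text><body><div><head>Introduction</head>
<p>Ripples support consolidation <ref type="bibr" target="#b0">[1]</ref>.
Later work disagrees <ref type="bibr" target="#b1">(Smith, 2020)</ref>.</p>
</div></body></text></TEI>"""


def extract_contexts_sketch(tei_xml, citing_citekey):
    """Return one record per linked bibliographic <ref> in the TEI body."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", NS)
    records = []
    if body is None:
        return records
    for div in body.iter(f"{{{TEI}}}div"):
        head = div.find("tei:head", NS)
        section = "".join(head.itertext()).strip() if head is not None else None
        for para in div.findall("tei:p", NS):
            # Collapse whitespace so the context is a single clean string.
            context = " ".join("".join(para.itertext()).split())
            for ref in para.iterfind(".//tei:ref[@type='bibr']", NS):
                target = ref.get("target", "")
                if not target.startswith("#"):
                    continue  # callout without a link to a bibliography entry
                records.append({
                    "citing_citekey": citing_citekey,
                    "cited_citekey": None,  # to be resolved later from bib_id
                    "bib_id": target[1:],
                    "context_text": context,
                    "callout_text": "".join(ref.itertext()).strip(),
                    "section": section,
                    "position": len(records),  # running index in traversal order
                })
    return records


records = extract_contexts_sketch(SAMPLE_TEI, "doe2021")
```

Keeping `cited_citekey` empty at this stage separates parsing from reference resolution, so unresolved references still produce a usable context record.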
Output
Write spec to docs/specs/biblio/citation-context.md covering:
- GROBID TEI XML structure for inline citations (with examples from the indexed repo)
- Proposed data model for citation contexts in biblio
- Integration with existing graph.py and reference resolution
- MCP tools to query citation contexts
- Priority assessment: must-have vs nice-to-have
Key references (indexed in RAG)
- `.projio/codio/mirrors/grobidorg--grobid/` — GROBID source, TEI output format
- `.projio/codio/mirrors/grobidorg--grobid-client-python/` — Python client API
- `.projio/codio/mirrors/kermitt2--biblio-glutton/` — bibliographic matching
- `packages/biblio/src/biblio/grobid.py` — current GROBID integration
- `packages/biblio/src/biblio/graph.py` — current citation graph
- `packages/biblio/src/biblio/ref_md.py` — reference-markdown standardization
Related Notes
- issue-arash-20260403-193112-105596.md — Directly related: citation context extraction is a new enrichment type that fits the biblio enrichment pipeline redesign
- issue-arash-20260402-015659-415628.md — Both concern batch GROBID/docling processing of papers — citation context extraction would run alongside existing TEI XML generation
- issue-arash-20260404-014857-481322.md — Parallel enrichment concern: topics per citekey and citation contexts per citekey would share the same derivatives storage pattern
- issue-arash-20260403-193037-589959.md — Both are biblio enrichment audit/spec notes exploring what structured data can be extracted and stored per paper
- issue-arash-20260403-193002-484673.md — biblio-glutton is mentioned in both — the OpenAlex API audit and this spec both consider biblio-glutton as a bibliographic matching layer
Acceptance Criteria
- [ ]
Result
Completed in 132.4s (1 steps, model: opus)
Session: 02b6d370-6293-469e-84ce-09c3c53bc945
Result note: /storage2/arash/worklog/workflow/captures/20260405-031424-addc20/note.md
Output
Spec written to `docs/specs/biblio/citation-context.md`. It covers:
- **TEI XML structure** — how GROBID encodes inline `<ref type="bibr" target="#bN">` elements within body paragraphs, with examples
- **Data model** — per-citekey `contexts.json` schema (bib_id, cited_citekey, section, sentence, callout_text, position) and corpus-level inverted index
- **Implementation plan** — new `extract_citation_contexts()` parser in `grobid.py`, integration with `run_grobid_for_key`, corpus aggregation function. Reuses existing `parse_tei_biblstructs` and `match_biblstructs_to_corpus` from `ref_md.py`
- **MCP tools** — `biblio_citation_contexts(citekey)` and `biblio_cited_by_contexts(citekey)`
- **RAG integration** — citation context sentences as a new `grobid-context` source type
- **Priority assessment** — P1: core parser + per-citekey storage + MCP tool; P2: aggregation + cited-by tool + RAG; P3: intent classification, biblio-glutton, section density stats
- **biblio-glutton** — assessment and recommendation to defer to P3
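The corpus-level inverted index mentioned under the data model could look roughly like the sketch below. It assumes each paper's `contexts.json` has already been loaded into a list of records carrying the fields listed above, keyed by the citing citekey. The function name `build_cited_by_index` and the sample citekeys are hypothetical; the spec's actual aggregation function may differ.

```python
from collections import defaultdict


def build_cited_by_index(per_paper):
    """Invert {citing_citekey: [records]} into {cited_citekey: [records]}."""
    index = defaultdict(list)
    for citing_citekey, records in per_paper.items():
        for rec in records:
            cited = rec.get("cited_citekey")
            if not cited:
                continue  # unresolved reference: no corpus citekey to index under
            # Copy the record and remember which paper the context came from.
            index[cited].append({**rec, "citing_citekey": citing_citekey})
    for entries in index.values():
        entries.sort(key=lambda e: (e["citing_citekey"], e["position"]))
    return dict(index)


# Illustrative records only; citekeys and sentences are invented.
per_paper = {
    "lee2022": [
        {"bib_id": "b3", "cited_citekey": "smith2020",
         "sentence": "We follow [4].", "position": 0},
    ],
    "doe2021": [
        {"bib_id": "b0", "cited_citekey": "smith2020",
         "sentence": "Smith showed X [1].", "position": 0},
        {"bib_id": "b1", "cited_citekey": None,
         "sentence": "Others disagree [2].", "position": 1},
    ],
}
index = build_cited_by_index(per_paper)
```

A cited-by lookup such as the one behind `biblio_cited_by_contexts(citekey)` would then reduce to `index.get(citekey, [])` under these assumptions.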
## Next Steps
1. Implement `extract_citation_contexts()` in `packages/biblio/src/biblio/grobid.py` and extend `GrobidOutputs` / `run_grobid_for_key` to write `contexts.json`.
2. Add the `biblio_citation_contexts` and `biblio_cited_by_contexts` MCP tools in `src/projio/mcp/biblio.py`.
3. Run `biblio_grobid` on a test paper and verify that `contexts.json` is produced with correct sentence extraction and citekey resolution.
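For step 3, a small structural check can catch obvious extraction problems before reading the output by hand. The sketch below assumes `contexts.json` is a JSON list of objects with the fields named in the Output summary; the real file layout may differ, and `check_contexts` and `check_contexts_file` are hypothetical helpers, not part of biblio.

```python
import json
from pathlib import Path

# Field names taken from the schema in the Output summary.
REQUIRED_FIELDS = ("bib_id", "cited_citekey", "section",
                   "sentence", "callout_text", "position")


def check_contexts(records):
    """Return a list of human-readable problems; empty means the checks passed."""
    if not isinstance(records, list):
        return ["top-level JSON value is not a list"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append(f"record {i}: missing {', '.join(missing)}")
            continue
        sentence = rec["sentence"] or ""
        callout = rec["callout_text"] or ""
        if not sentence.strip():
            problems.append(f"record {i}: empty sentence")
        elif callout and callout not in sentence:
            problems.append(f"record {i}: callout not found in sentence")
    return problems


def check_contexts_file(path):
    """Load a contexts.json file and run the same checks on its content."""
    return check_contexts(json.loads(Path(path).read_text(encoding="utf-8")))


# Illustrative record; values are invented.
good = {"bib_id": "b0", "cited_citekey": "smith2020", "section": "Introduction",
        "sentence": "Smith showed X [1].", "callout_text": "[1]", "position": 0}
problems = check_contexts([good, {"bib_id": "b1"}])
```

These checks only cover structure. Whether a sentence boundary or a resolved citekey is actually correct still needs a manual comparison against the paper.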