Issue arash 20260404 021609 691872
title: "## Spec: GROBID citation context — beyond simple citation networks"
status: done
created: 2026-04-04
updated: 2026-04-04
timestamp: 20260404-021609-691872
tags: [issue]
source: agent-observation
project_primary: projio
capture_id: 20260404-021607-6de75f
confidence: 1.0
transcript_file: /storage2/arash/worklog/workflow/captures/20260404-021607-6de75f/transcript.txt
Spec: GROBID citation context — beyond simple citation networks
biblio currently uses GROBID for header extraction and reference parsing, producing a flat list of references per paper. But GROBID can also extract citation contexts — the sentences where a reference is cited. This enables "paper X cites paper Y in context C" relationships, which are far richer than simple citation edges.
Research questions
- What does GROBID provide for citation context?
  - TEI XML `<ref>` elements have `target` attributes linking to bibliography entries
  - These refs are embedded in the full-text paragraphs, so the surrounding text IS the citation context
  - How does grobid-client-python expose this?
  - What does the TEI structure look like for inline citations?
- How could biblio use citation contexts?
  - Enrich the citation graph: instead of just "A cites B", store "A cites B saying '...sharp-wave ripples were shown to...'"
  - RAG queries could return citation contexts as evidence
  - Manuscript writing: auto-generate citation sentences based on how others cited the same paper
  - Literature review: cluster papers by how they cite a common reference
- What's the data model?
  - Where to store citation contexts? `bib/derivatives/grobid/{citekey}/contexts.json`?
  - Schema: `{citing_citekey, cited_citekey, context_text, section, position}`
  - How to extract from existing TEI XML that biblio already generates?
- What does biblio-glutton add?
  - biblio-glutton does high-performance bibliographic matching
  - Could replace or augment biblio's CrossRef-based `resolve_doi_by_title`
  - Matching unresolved GROBID references to DOIs
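As a starting point for the TEI questions above, here is a minimal sketch of pulling citation contexts out of a full-text TEI document with the standard library. The sample fragment is hand-written in the general shape of GROBID output (`<ref type="bibr" target="#b0">` inside body paragraphs, `<biblStruct xml:id="b0">` in the bibliography); it is not taken from the indexed repo. The function name `extract_contexts`, the output keys, the choice to use the whole paragraph as context, and the meaning of `position` (ordinal index of the citation in the document) are all assumptions for illustration, not existing biblio behaviour.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written sample in the general shape of GROBID full-text TEI (illustrative only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>Sharp-wave ripples were shown to support consolidation
           <ref type="bibr" target="#b0">[1]</ref>. Later work extended this
           <ref type="bibr" target="#b1">[2]</ref>.</p>
      </div>
    </body>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0"/>
          <biblStruct xml:id="b1"/>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def extract_contexts(tei_xml: str) -> list[dict]:
    """Return one record per inline bibliographic <ref> found in the body."""
    root = ET.fromstring(tei_xml)

    # Bibliography ids that inline refs may point at via target="#id".
    known_ids = {
        bibl.get(XML_ID)
        for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS)
    }

    contexts: list[dict] = []
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return contexts

    for div in body.iter("{http://www.tei-c.org/ns/1.0}div"):
        head = div.find("tei:head", TEI_NS)
        section = head.text if head is not None else None
        for para in div.iterfind("tei:p", TEI_NS):
            # Assumption: the whole paragraph serves as the context text.
            text = " ".join("".join(para.itertext()).split())
            for ref in para.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
                target = (ref.get("target") or "").lstrip("#")
                if target not in known_ids:
                    continue  # ref without a resolvable bibliography entry
                contexts.append(
                    {
                        "target_id": target,
                        "context_text": text,
                        "section": section,
                        "position": len(contexts),  # assumed: ordinal index
                    }
                )
    return contexts


for record in extract_contexts(SAMPLE_TEI):
    print(record["target_id"], record["section"], record["position"])
```

Mapping `target_id` values such as `b0` to citekeys would still need the reference resolution step, which this sketch leaves out.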
Output
Write spec to docs/specs/biblio/citation-context.md covering:
- GROBID TEI XML structure for inline citations (with examples from the indexed repo)
- Proposed data model for citation contexts in biblio
- Integration with existing graph.py and reference resolution
- MCP tools to query citation contexts
- Priority assessment: must-have vs nice-to-have
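To make the proposed data model concrete, the sketch below writes records with the schema from the research questions to the suggested `bib/derivatives/grobid/{citekey}/contexts.json` location and reads them back by cited paper. Everything here is hypothetical: the class `CitationContext`, the helpers `save_contexts` and `contexts_citing`, and the citekeys `paperA` and `paperB` are placeholders, and the query function only hints at what an MCP tool might wrap.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CitationContext:
    # Field names follow the schema proposed in this note.
    citing_citekey: str
    cited_citekey: str
    context_text: str
    section: Optional[str]
    position: int


def save_contexts(bib_root: Path, citekey: str, contexts: list[CitationContext]) -> Path:
    """Write all contexts for one citing paper to its derivatives folder."""
    path = bib_root / "derivatives" / "grobid" / citekey / "contexts.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = [asdict(c) for c in contexts]
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def contexts_citing(bib_root: Path, cited_citekey: str) -> list[CitationContext]:
    """Collect every stored context in which `cited_citekey` is the cited paper."""
    hits: list[CitationContext] = []
    for path in sorted(bib_root.glob("derivatives/grobid/*/contexts.json")):
        for row in json.loads(path.read_text(encoding="utf-8")):
            if row["cited_citekey"] == cited_citekey:
                hits.append(CitationContext(**row))
    return hits


# Usage with placeholder citekeys, in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    bib = Path(tmp) / "bib"
    save_contexts(
        bib,
        "paperA",
        [CitationContext("paperA", "paperB", "...sharp-wave ripples were shown to...", "Introduction", 0)],
    )
    for hit in contexts_citing(bib, "paperB"):
        print(f"{hit.citing_citekey} cites {hit.cited_citekey} in {hit.section}: {hit.context_text}")
```

One file per citing paper keeps the layout aligned with the per-citekey derivatives pattern, at the cost of a scan across files for "who cites X" queries; an index could be added later if that scan becomes slow.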
Key references (indexed in RAG)
- `.projio/codio/mirrors/grobidorg--grobid/`: GROBID source, TEI output format
- `.projio/codio/mirrors/grobidorg--grobid-client-python/`: Python client API
- `.projio/codio/mirrors/kermitt2--biblio-glutton/`: bibliographic matching
- `packages/biblio/src/biblio/grobid.py`: current GROBID integration
- `packages/biblio/src/biblio/graph.py`: current citation graph
- `packages/biblio/src/biblio/ref_md.py`: reference-markdown standardization
Related Notes
- issue-arash-20260403-193112-105596.md — Directly related: citation context extraction is a new enrichment type that fits the biblio enrichment pipeline redesign
- issue-arash-20260402-015659-415628.md — Both concern batch GROBID/docling processing of papers — citation context extraction would run alongside existing TEI XML generation
- issue-arash-20260404-014857-481322.md — Parallel enrichment concern: topics per citekey and citation contexts per citekey would share the same derivatives storage pattern
- issue-arash-20260403-193037-589959.md — Both are biblio enrichment audit/spec notes exploring what structured data can be extracted and stored per paper
- issue-arash-20260403-193002-484673.md — biblio-glutton is mentioned in both — the OpenAlex API audit and this spec both consider biblio-glutton as a bibliographic matching layer