Spec: GROBID citation context — beyond simple citation networks

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021609-691872.md). Understand the problem, then implement the proposed fix.

Spec: GROBID citation context — beyond simple citation networks

biblio currently uses GROBID for header extraction and reference parsing, producing a flat list of references per paper. But GROBID can also extract citation contexts — the sentences where a reference is cited. This enables "paper X cites paper Y in context C" relationships, which are far richer than simple citation edges.

Research questions

1. What does GROBID provide for citation context?
   - TEI XML `<ref>` elements have `target` attributes linking to bibliography entries
   - These refs are embedded in the full-text paragraphs: the surrounding text IS the citation context
   - How does grobid-client-python expose this?
   - What does the TEI structure look like for inline citations?
2. How could biblio use citation contexts?
   - Enrich the citation graph: instead of just "A cites B", store "A cites B saying '...sharp-wave ripples were shown to...'"
   - RAG queries could return citation contexts as evidence
   - Manuscript writing: auto-generate citation sentences based on how others cited the same paper
   - Literature review: cluster papers by how they cite a common reference
3. What's the data model?
   - Where to store citation contexts? `bib/derivatives/grobid/{citekey}/contexts.json`?
   - Schema: `{citing_citekey, cited_citekey, context_text, section, position}`
   - How to extract from existing TEI XML that biblio already generates
4. What does biblio-glutton add?
   - biblio-glutton does high-performance bibliographic matching
   - Could replace or augment biblio's CrossRef-based `resolve_doi_by_title`
   - Matching unresolved GROBID references to DOIs
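
As a starting point for the TEI and data-model questions above, here is a minimal sketch of pulling citation contexts out of a TEI document with the Python standard library. The sample XML is hand-written for illustration and is not taken from the indexed repo. `extract_contexts` and the `bib_to_citekey` mapping are hypothetical names, and the whole paragraph is used as the context, so real GROBID output may call for sentence-level handling. The `bib_id` field is an addition to the proposed schema so that unresolved references stay traceable.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>Sharp-wave ripples were shown to support memory consolidation
        <ref type="bibr" target="#b0">[1]</ref>. Later work extended this
        <ref type="bibr" target="#b1">[2]</ref>.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract_contexts(tei_xml, citing_citekey, bib_to_citekey):
    """Return one record per inline bibliographic <ref> found in body paragraphs.

    bib_to_citekey is an assumed mapping from TEI bibliography ids ("b0")
    to corpus citekeys; unresolved ids yield cited_citekey=None.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    records = []
    if body is None:
        return records
    for div in body.iterfind(".//tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        section = "".join(head.itertext()).strip() if head is not None else None
        # Only direct <p> children, so nested divs are not counted twice.
        for para in div.iterfind("tei:p", TEI_NS):
            # Collapse line breaks and indentation into single spaces.
            context = " ".join("".join(para.itertext()).split())
            for ref in para.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
                target = ref.get("target", "")
                if not target.startswith("#"):
                    continue  # callout without a linked bibliography entry
                bib_id = target[1:]
                records.append({
                    "citing_citekey": citing_citekey,
                    "cited_citekey": bib_to_citekey.get(bib_id),
                    "bib_id": bib_id,
                    "context_text": context,
                    "section": section,
                    "position": len(records),  # order of appearance
                })
    return records


records = extract_contexts(SAMPLE_TEI, "citing2024", {"b0": "cited2015"})
for rec in records:
    print(rec["bib_id"], rec["cited_citekey"], rec["section"])
```

Paragraphs that sit directly under the body without an enclosing `div` are not visited by this sketch; whether that case matters depends on the TEI that biblio already generates.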

Output

Write spec to docs/specs/biblio/citation-context.md covering:

- GROBID TEI XML structure for inline citations (with examples from the indexed repo)
- Proposed data model for citation contexts in biblio
- Integration with existing graph.py and reference resolution
- MCP tools to query citation contexts
- Priority assessment: must-have vs nice-to-have

Key references (indexed in RAG)

.projio/codio/mirrors/grobidorg--grobid/ — GROBID source, TEI output format
.projio/codio/mirrors/grobidorg--grobid-client-python/ — Python client API
.projio/codio/mirrors/kermitt2--biblio-glutton/ — bibliographic matching
packages/biblio/src/biblio/grobid.py — current GROBID integration
packages/biblio/src/biblio/graph.py — current citation graph
packages/biblio/src/biblio/ref_md.py — reference-markdown standardization

Related Notes

issue-arash-20260403-193112-105596.md — Directly related: citation context extraction is a new enrichment type that fits the biblio enrichment pipeline redesign
issue-arash-20260402-015659-415628.md — Both concern batch GROBID/docling processing of papers — citation context extraction would run alongside existing TEI XML generation
issue-arash-20260404-014857-481322.md — Parallel enrichment concern: topics per citekey and citation contexts per citekey would share the same derivatives storage pattern
issue-arash-20260403-193037-589959.md — Both are biblio enrichment audit/spec notes exploring what structured data can be extracted and stored per paper
issue-arash-20260403-193002-484673.md — biblio-glutton is mentioned in both — the OpenAlex API audit and this spec both consider biblio-glutton as a bibliographic matching layer

Acceptance Criteria

[ ]

Result

Completed in 132.4s (1 steps, model: opus)

Session: 02b6d370-6293-469e-84ce-09c3c53bc945

Result note: /storage2/arash/worklog/workflow/captures/20260405-031424-addc20/note.md

Output

Spec written to `docs/specs/biblio/citation-context.md`. It covers:

- **TEI XML structure** — how GROBID encodes inline `<ref type="bibr" target="#bN">` elements within body paragraphs, with examples
- **Data model** — per-citekey `contexts.json` schema (bib_id, cited_citekey, section, sentence, callout_text, position) and corpus-level inverted index
- **Implementation plan** — new `extract_citation_contexts()` parser in `grobid.py`, integration with `run_grobid_for_key`, corpus aggregation function. Reuses existing `parse_tei_biblstructs` and `match_biblstructs_to_corpus` from `ref_md.py`
- **MCP tools** — `biblio_citation_contexts(citekey)` and `biblio_cited_by_contexts(citekey)`
- **RAG integration** — citation context sentences as a new `grobid-context` source type
- **Priority assessment** — P1: core parser + per-citekey storage + MCP tool; P2: aggregation + cited-by tool + RAG; P3: intent classification, biblio-glutton, section density stats
- **biblio-glutton** — assessment and recommendation to defer to P3
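
To make the data model bullet concrete, the following is a rough sketch of a per-citekey record and the corpus-level inverted index built from it. The field names follow the schema listed above; the field types, the `CitationContext` class, and the `build_cited_by_index` function are assumptions made for illustration and may differ from what the spec defines.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CitationContext:
    """One inline citation; types are assumed, field names come from the spec summary."""
    bib_id: str                    # TEI bibliography id, e.g. "b0"
    cited_citekey: Optional[str]   # None when the reference is not in the corpus
    section: Optional[str]
    sentence: str
    callout_text: str              # the inline marker, e.g. "[1]"
    position: int


def build_cited_by_index(contexts_by_citing):
    """Invert {citing_citekey: [CitationContext]} into {cited_citekey: [entry]}."""
    index = defaultdict(list)
    for citing_citekey, contexts in contexts_by_citing.items():
        for ctx in contexts:
            if ctx.cited_citekey is None:
                continue  # unresolved references cannot be indexed by citekey
            entry = asdict(ctx)
            entry["citing_citekey"] = citing_citekey
            index[ctx.cited_citekey].append(entry)
    return dict(index)


example = {
    "citing2024": [
        CitationContext("b0", "cited2015", "Introduction",
                        "Ripples were shown to matter [1].", "[1]", 0),
    ],
}
print(json.dumps(build_cited_by_index(example), indent=2))
```

An index of this shape is roughly what a cited-by lookup would read from, since it answers "who cites this citekey, and in which sentence" without rescanning every per-citekey file.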

## Next Steps

1. Implement `extract_citation_contexts()` in `packages/biblio/src/biblio/grobid.py` and extend `GrobidOutputs` / `run_grobid_for_key` to write `contexts.json`.
2. Add the `biblio_citation_contexts` and `biblio_cited_by_contexts` MCP tools in `src/projio/mcp/biblio.py`.
3. Run `biblio_grobid` on a test paper and verify that `contexts.json` is produced with correct sentence extraction and citekey resolution.
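
For the verification step, a small check along these lines could be run against the generated file. It is a sketch only: it assumes `contexts.json` holds a JSON list of records with the fields named in the data model summary, and `check_contexts` is a hypothetical helper, not part of biblio.

```python
REQUIRED_FIELDS = {
    "bib_id", "cited_citekey", "section",
    "sentence", "callout_text", "position",
}


def check_contexts(records):
    """Summarize a list of context records and collect obvious problems."""
    problems = []
    resolved = 0
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
            continue
        # The callout marker should appear inside the extracted sentence.
        if rec["callout_text"] not in rec["sentence"]:
            problems.append(f"record {i}: callout not found in sentence")
        if rec["cited_citekey"]:
            resolved += 1
    return {"total": len(records), "resolved": resolved, "problems": problems}


sample = [
    {"bib_id": "b0", "cited_citekey": "cited2015", "section": "Introduction",
     "sentence": "Ripples were shown to matter [1].", "callout_text": "[1]",
     "position": 0},
    {"bib_id": "b1", "cited_citekey": None, "section": "Methods",
     "sentence": "We followed earlier work.", "callout_text": "[2]",
     "position": 1},
]
print(check_contexts(sample))
```

In practice the records would come from `json.loads` on the file written for the test paper; the ratio of `resolved` to `total` gives a quick read on citekey resolution.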
