Spec: GROBID citation context — beyond simple citation networks

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021609-691872.md). Understand the problem, then implement the proposed fix.


biblio currently uses GROBID for header extraction and reference parsing, producing a flat list of references per paper. But GROBID can also extract citation contexts — the sentences where a reference is cited. This enables "paper X cites paper Y in context C" relationships, which are far richer than simple citation edges.
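As a rough illustration of what "citation context" means at the TEI level, the sketch below walks a small TEI fragment and pairs each inline `<ref type="bibr">` callout with the text of its enclosing paragraph. The sample XML is hand-written for this example, not taken from a real GROBID run, and `iter_citation_contexts` is a hypothetical helper name. It uses the whole paragraph as the context; narrowing to a single sentence would need sentence boundaries, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>Sharp-wave ripples were shown to support memory consolidation <ref type="bibr" target="#b0">[1]</ref>. Later work extended this finding <ref type="bibr" target="#b1">[2]</ref>.</p>
      </div>
    </body>
  </text>
</TEI>"""


def iter_citation_contexts(tei_xml):
    """Yield (bib_id, section, paragraph_text, callout_text) per inline citation."""
    root = ET.fromstring(tei_xml)
    for div in root.iterfind(".//tei:body//tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        section = head.text if head is not None else None
        # Only direct <p> children, so nested <div>s are not counted twice.
        for p in div.iterfind("tei:p", TEI_NS):
            # itertext() flattens the paragraph, callout text included.
            paragraph_text = " ".join("".join(p.itertext()).split())
            for ref in p.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
                target = ref.get("target")
                if not target:
                    continue  # callout with no link to a bibliography entry
                yield target.lstrip("#"), section, paragraph_text, ref.text or ""


for bib_id, section, text, callout in iter_citation_contexts(SAMPLE_TEI):
    print(bib_id, section, callout)
```

The `target` value (here `b0`, `b1`) is what ties a callout back to an entry in the TEI bibliography, so resolving it to a citekey is a separate step.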

Research questions

  1. What does GROBID provide for citation context?
       • TEI XML <ref> elements have target attributes linking to bibliography entries
       • These refs are embedded in the full-text paragraphs, so the surrounding text IS the citation context
       • How does grobid-client-python expose this?
       • What does the TEI structure look like for inline citations?

  2. How could biblio use citation contexts?
       • Enrich the citation graph: instead of just "A cites B", store "A cites B saying '...sharp-wave ripples were shown to...'"
       • RAG queries could return citation contexts as evidence
       • Manuscript writing: auto-generate citation sentences based on how others cited the same paper
       • Literature review: cluster papers by how they cite a common reference

  3. What's the data model?
       • Where to store citation contexts? bib/derivatives/grobid/{citekey}/contexts.json?
       • Schema: {citing_citekey, cited_citekey, context_text, section, position}
       • How to extract from the existing TEI XML that biblio already generates?

  4. What does biblio-glutton add?
       • biblio-glutton does high-performance bibliographic matching
       • Could replace or augment biblio's CrossRef-based resolve_doi_by_title
       • Matching unresolved GROBID references to DOIs
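
To make the data-model question concrete, here is a minimal sketch of the proposed schema as a dataclass, written to the candidate path bib/derivatives/grobid/{citekey}/contexts.json. The field names come from the schema above; the types, the meaning of `position` (assumed here to be the 0-based order of the callout within the citing paper), and the helper names `contexts_path` and `write_contexts` are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class CitationContext:
    citing_citekey: str
    cited_citekey: Optional[str]  # None if the reference was not resolved
    context_text: str
    section: Optional[str]
    position: int  # assumption: 0-based callout order within the citing paper


def contexts_path(bib_root: Path, citekey: str) -> Path:
    """Candidate storage location from the question above."""
    return bib_root / "derivatives" / "grobid" / citekey / "contexts.json"


def write_contexts(bib_root: Path, citekey: str,
                   contexts: Iterable[CitationContext]) -> Path:
    """Serialize contexts for one citing paper as a JSON list of objects."""
    path = contexts_path(bib_root, citekey)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = [asdict(c) for c in contexts]
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return path
```

Keeping one file per citing paper matches how the other GROBID derivatives are keyed by citekey, at the cost of needing a separate aggregation pass for "who cites X" queries.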

Output

Write spec to docs/specs/biblio/citation-context.md covering:

  • GROBID TEI XML structure for inline citations (with examples from the indexed repo)
  • Proposed data model for citation contexts in biblio
  • Integration with existing graph.py and reference resolution
  • MCP tools to query citation contexts
  • Priority assessment: must-have vs nice-to-have

Key references (indexed in RAG)

  • .projio/codio/mirrors/grobidorg--grobid/ — GROBID source, TEI output format
  • .projio/codio/mirrors/grobidorg--grobid-client-python/ — Python client API
  • .projio/codio/mirrors/kermitt2--biblio-glutton/ — bibliographic matching
  • packages/biblio/src/biblio/grobid.py — current GROBID integration
  • packages/biblio/src/biblio/graph.py — current citation graph
  • packages/biblio/src/biblio/ref_md.py — reference-markdown standardization

Acceptance Criteria

  • [ ]

Result

Completed in 132.4s (1 steps, model: opus)

Session: 02b6d370-6293-469e-84ce-09c3c53bc945

Result note: /storage2/arash/worklog/workflow/captures/20260405-031424-addc20/note.md

Output

Spec written to `docs/specs/biblio/citation-context.md`. It covers:

- **TEI XML structure** — how GROBID encodes inline `<ref type="bibr" target="#bN">` elements within body paragraphs, with examples
- **Data model** — per-citekey `contexts.json` schema (bib_id, cited_citekey, section, sentence, callout_text, position) and corpus-level inverted index
- **Implementation plan** — new `extract_citation_contexts()` parser in `grobid.py`, integration with `run_grobid_for_key`, corpus aggregation function. Reuses existing `parse_tei_biblstructs` and `match_biblstructs_to_corpus` from `ref_md.py`
- **MCP tools** — `biblio_citation_contexts(citekey)` and `biblio_cited_by_contexts(citekey)`
- **RAG integration** — citation context sentences as a new `grobid-context` source type
- **Priority assessment** — P1: core parser + per-citekey storage + MCP tool; P2: aggregation + cited-by tool + RAG; P3: intent classification, biblio-glutton, section density stats
- **biblio-glutton** — assessment and recommendation to defer to P3
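
The corpus-level inverted index mentioned under the data model could be built along these lines. This is a sketch only: it assumes each per-citekey `contexts.json` has already been loaded into a list of dicts with the fields listed above, and `build_cited_by_index` is a hypothetical name, not a function from the spec.

```python
from collections import defaultdict


def build_cited_by_index(contexts_by_citing):
    """Invert {citing_citekey: [context records]} into {cited_citekey: [records]}.

    Records whose cited_citekey is missing or None (unresolved references)
    are skipped.
    """
    index = defaultdict(list)
    for citing in sorted(contexts_by_citing):  # sorted for deterministic output
        for record in contexts_by_citing[citing]:
            cited = record.get("cited_citekey")
            if cited is None:
                continue
            index[cited].append({"citing_citekey": citing, **record})
    return dict(index)


example = {
    "smith2020": [
        {"bib_id": "b0", "cited_citekey": "jones2018", "section": "Introduction",
         "sentence": "Ripples were shown to matter [1].", "callout_text": "[1]",
         "position": 0},
        {"bib_id": "b1", "cited_citekey": None, "section": "Methods",
         "sentence": "We followed [2].", "callout_text": "[2]", "position": 1},
    ],
}
index = build_cited_by_index(example)
print(sorted(index))
```

A structure like this is what a cited-by query such as `biblio_cited_by_contexts(citekey)` would read from, though how the tool actually loads it is left to the spec.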

## Next Steps

1. Implement `extract_citation_contexts()` in `packages/biblio/src/biblio/grobid.py` and extend `GrobidOutputs` / `run_grobid_for_key` to write `contexts.json`.
2. Add the `biblio_citation_contexts` and `biblio_cited_by_contexts` MCP tools in `src/projio/mcp/biblio.py`.
3. Run `biblio_grobid` on a test paper and verify that `contexts.json` is produced with correct sentence extraction and citekey resolution.
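
For the verification in step 3, a small structural check on the loaded records could catch obvious extraction problems before manual review. The sketch below assumes the field names from the schema summary above and that `sentence` and `callout_text` are strings; `check_contexts` is a hypothetical helper, and it does not judge whether sentence boundaries or citekey resolution are actually correct.

```python
REQUIRED_FIELDS = {"bib_id", "cited_citekey", "section",
                   "sentence", "callout_text", "position"}


def check_contexts(records):
    """Return a list of human-readable problems found in context records."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
            continue
        if not rec["sentence"].strip():
            problems.append(f"record {i}: empty sentence")
        elif rec["callout_text"] and rec["callout_text"] not in rec["sentence"]:
            # The callout should appear verbatim inside its own sentence.
            problems.append(f"record {i}: callout not found in sentence")
    return problems
```

An empty return value only means the file is structurally plausible; spot-checking a few sentences against the PDF is still needed.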