
Study: biblio-glutton — reference resolution improvement

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021642-474901.md). Understand the problem, then implement the proposed fix.


Study: biblio-glutton — high-performance bibliographic matching for reference resolution

biblio currently resolves unmatched GROBID references to DOIs using CrossRef title search (crossref.py:resolve_doi_by_title). biblio-glutton is a purpose-built high-performance matching service that could replace or augment this.

Research questions

  1. What does biblio-glutton do?
       • How does it match references? (title + author + date fuzzy matching?)
       • What data sources does it use? (CrossRef dump, OpenAlex, DOI metadata)
       • Performance: how fast is it vs CrossRef API?
       • Can it run as a local service?

  2. How does it compare to biblio's current approach?
       • biblio uses CrossRef API with SequenceMatcher similarity scoring (≥0.70 threshold)
       • biblio-glutton may have better recall/precision
       • API rate limits: CrossRef has strict limits, biblio-glutton is self-hosted

  3. Integration options
       • Option A: Replace CrossRef calls with biblio-glutton API
       • Option B: Use biblio-glutton as fallback when CrossRef misses
       • Option C: Use biblio-glutton for bulk resolution (all GROBID refs at once)

  4. Is it worth it?
       • How many references does biblio currently fail to resolve via CrossRef?
       • Would biblio-glutton meaningfully improve the hit rate?
       • Deployment complexity (Java service + Elasticsearch backend)

Output

Write findings to docs/specs/biblio/glutton-study.md

Key references (indexed in RAG)

  • .projio/codio/mirrors/kermitt2--biblio-glutton/ — biblio-glutton source
  • packages/biblio/src/biblio/crossref.py — current CrossRef matching
  • packages/biblio/src/biblio/grobid.py — GROBID reference extraction
  • packages/biblio/src/biblio/graph.py — where resolved references feed into
  • issue-arash-20260404-021609-691872.md — Both deal with GROBID reference processing — biblio-glutton would replace/augment the CrossRef fallback that fires after GROBID extracts references
  • issue-arash-20260404-021628-584751.md — Parallel 'study external service' notes — both evaluate replacing or augmenting biblio's current HTTP-based resolution with a specialized service
  • issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign — biblio-glutton integration (Option A/B/C) would fit as a stage in the redesigned resolution pipeline
  • issue-arash-20260403-193002-484673.md — Audits biblio's OpenAlex/CrossRef API usage; findings directly inform whether biblio-glutton's self-hosted approach is worth the deployment complexity
  • issue-arash-20260402-015659-415628.md — Batch GROBID/docling processing — Option C (bulk resolution) in the biblio-glutton study maps to the same batch-processing need

Acceptance Criteria

  • [ ]

Result

Completed in 234.1s (1 steps, model: opus)

Session: 5bd08ddb-002f-4a97-a294-b8162a2d4866

Result note: /storage2/arash/worklog/workflow/captures/20260405-032332-ac52de/note.md

Output

Study complete. Written to `docs/specs/biblio/glutton-study.md`.

**Key findings:**

- biblio-glutton offers +1.3% recall and +0.14% precision over the CrossRef API, and matches on multiple fields (title, author, journal, volume, page, year), whereas biblio currently matches on title only.
- Local throughput is ~6.5–12.6 req/s with no rate limit, vs CrossRef's rate-limited ~50 req/s.
- Deployment cost is significant: Java + Elasticsearch + ~300 GB storage.
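
For context, the title-only matching that these findings compare against can be sketched roughly as follows. This is a minimal illustration, not the code in `crossref.py`: the 0.70 threshold and the use of `SequenceMatcher` come from the study prompt, while `normalize_title`, `best_title_match`, and the `(doi, title)` candidate shape are hypothetical names chosen for the example.

```python
from difflib import SequenceMatcher

# Threshold quoted in the study prompt (>= 0.70 counts as a match).
TITLE_THRESHOLD = 0.70


def normalize_title(title):
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def best_title_match(query_title, candidates):
    """Return (doi, score) for the best candidate at or above the threshold.

    `candidates` is a list of (doi, title) pairs, e.g. taken from a search
    response. Returns None when no candidate reaches the threshold.
    """
    best = None
    for doi, title in candidates:
        score = title_similarity(query_title, title)
        if score >= TITLE_THRESHOLD and (best is None or score > best[1]):
            best = (doi, score)
    return best
```

Because only the title is compared, two different papers with near-identical titles are indistinguishable here, which is the gap that multi-field matching is meant to close.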

**Recommendation:** Start with **Option D**, an addition to Options A–C: improve the current CrossRef matching by sending more metadata fields and adding instrumentation. Then optionally add glutton as a fallback (Option B) once the data shows that CrossRef misses are a real problem.
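
One possible shape for the fallback in Option B is an ordered chain of resolvers, where the first one to return a DOI wins. The sketch below assumes each resolver is a plain callable that takes a reference dict and returns a DOI string or `None`. The names `resolve_with_fallback` and `Resolver` are hypothetical and not part of biblio.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver takes a parsed reference and returns a DOI, or None on a miss.
Resolver = Callable[[dict], Optional[str]]


def resolve_with_fallback(
    ref: dict,
    resolvers: Sequence[Tuple[str, Resolver]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, resolver) in order and return (doi, resolver_name).

    A resolver that raises (e.g. the service is down) is treated as a miss,
    so an optional backend never blocks resolution. Returns (None, None)
    when every resolver misses.
    """
    for name, resolver in resolvers:
        try:
            doi = resolver(ref)
        except Exception:
            doi = None
        if doi:
            return doi, name
    return None, None
```

Returning the resolver name alongside the DOI makes it straightforward to log which backend produced each hit, which is the data needed to judge whether the fallback is earning its deployment cost.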

## Next Steps

1. Implement Option D: update `crossref.py:resolve_doi_by_title` to send title+author+year to CrossRef's `query.bibliographic` and add resolution instrumentation/logging.
2. Add a `glutton_url` config field to `biblio.yml` and implement Option B as an optional fallback resolver when glutton is available.
3. Build instrumentation to measure current CrossRef resolution success rates on the corpus (count unresolved refs before and after resolution, log similarity-score distributions).
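
As a rough illustration of step 1, the request could combine the available fields into a single `query.bibliographic` string for the CrossRef `/works` endpoint. The helper name `build_bibliographic_query` and its parameters are assumptions made for this sketch, and the existing signature of `resolve_doi_by_title` is not shown in this note.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_bibliographic_query(title, first_author=None, year=None, rows=5):
    """Build a CrossRef /works URL that searches on title, author and year.

    Fields that are missing from the parsed reference are simply left out,
    so the query degrades to a title-only search.
    """
    parts = [title.strip()]
    if first_author:
        parts.append(first_author.strip())
    if year:
        parts.append(str(year))
    params = {"query.bibliographic": " ".join(parts), "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"
```

The returned candidates would still need to be scored locally against the reference before a DOI is accepted.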