
## Study: biblio-glutton (high-performance bibliographic matching for reference resolution)

biblio currently resolves unmatched GROBID references to DOIs using CrossRef title search (crossref.py:resolve_doi_by_title). biblio-glutton is a purpose-built high-performance matching service that could replace or augment this.
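For orientation, the title-similarity check that this study compares against can be sketched roughly as follows. This is a minimal illustration, not the code in crossref.py: the function names, the whitespace/case normalization, and the `(doi, title)` candidate shape are assumptions; only the use of `SequenceMatcher` and the 0.70 threshold come from this issue. The DOIs in any usage are placeholders.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.70  # similarity threshold stated in this issue


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio (0.0 to 1.0) of two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query_title, candidates, threshold=THRESHOLD):
    """Pick the best-scoring candidate for a reference title.

    candidates: iterable of (doi, title) pairs, e.g. parsed from a
    title-search response (hypothetical shape).
    Returns (doi, score) if the best score reaches the threshold,
    otherwise (None, best_score).
    """
    best_doi, best_score = None, 0.0
    for doi, title in candidates:
        score = title_similarity(query_title, title)
        if score > best_score:
            best_doi, best_score = doi, score
    if best_score >= threshold:
        return best_doi, best_score
    return None, best_score
```

A check of this kind uses the title alone, so any author or date evidence that biblio-glutton brings to matching would be additional signal beyond what this sketch captures.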

Research questions

  1. What does biblio-glutton do?

     • How does it match references? (title + author + date fuzzy matching?)
     • What data sources does it use? (CrossRef dump, OpenAlex, DOI metadata)
     • Performance: how fast is it vs CrossRef API?
     • Can it run as a local service?

  2. How does it compare to biblio's current approach?

     • biblio uses CrossRef API with SequenceMatcher similarity scoring (≥0.70 threshold)
     • biblio-glutton may have better recall/precision
     • API rate limits: CrossRef has strict limits, biblio-glutton is self-hosted

  3. Integration options

     • Option A: Replace CrossRef calls with biblio-glutton API
     • Option B: Use biblio-glutton as fallback when CrossRef misses
     • Option C: Use biblio-glutton for bulk resolution (all GROBID refs at once)

  4. Is it worth it?

     • How many references does biblio currently fail to resolve via CrossRef?
     • Would biblio-glutton meaningfully improve the hit rate?
     • Deployment complexity (Java service + Elasticsearch backend)
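The integration options and the "is it worth it?" question can be framed with a small resolver chain. The sketch below is an assumption-laden illustration: `Resolver`, `resolve_with_fallback`, `hit_rate`, and both stub resolvers are hypothetical names, and the stub data is made up purely to exercise the code (it is not a measurement). Option A corresponds to a chain containing only a biblio-glutton resolver, Option B to CrossRef followed by biblio-glutton.

```python
from typing import Callable, Optional, Sequence

# A resolver takes a raw reference string and returns a DOI, or None on a miss.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(reference: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Try each resolver in order and return the first DOI found."""
    for resolve in resolvers:
        doi = resolve(reference)
        if doi:
            return doi
    return None


def hit_rate(references: Sequence[str], resolvers: Sequence[Resolver]) -> float:
    """Fraction of references that resolve to a DOI with the given chain."""
    if not references:
        return 0.0
    hits = sum(
        1 for ref in references
        if resolve_with_fallback(ref, resolvers) is not None
    )
    return hits / len(references)


# Stand-in resolvers backed by dicts; placeholder data, not real results.
_CROSSREF_HITS = {"ref-1": "10.1000/a"}
_GLUTTON_HITS = {"ref-1": "10.1000/a", "ref-2": "10.1000/b"}


def crossref_stub(reference: str) -> Optional[str]:
    """Hypothetical stand-in for the existing CrossRef title search."""
    return _CROSSREF_HITS.get(reference)


def glutton_stub(reference: str) -> Optional[str]:
    """Hypothetical stand-in for a biblio-glutton lookup."""
    return _GLUTTON_HITS.get(reference)


refs = ["ref-1", "ref-2", "ref-3"]
baseline = hit_rate(refs, [crossref_stub])                    # CrossRef only
with_fallback = hit_rate(refs, [crossref_stub, glutton_stub])  # Option B
```

Running the same comparison over a sample of real GROBID references, with the stubs swapped for the actual resolvers, would give the difference in hit rate that the study needs to weigh against deployment complexity.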

Output

Write findings to docs/specs/biblio/glutton-study.md

Key references (indexed in RAG)

  • .projio/codio/mirrors/kermitt2--biblio-glutton/ — biblio-glutton source
  • packages/biblio/src/biblio/crossref.py — current CrossRef matching
  • packages/biblio/src/biblio/grobid.py — GROBID reference extraction
  • packages/biblio/src/biblio/graph.py — where resolved references feed into
  • issue-arash-20260404-021609-691872.md — Both deal with GROBID reference processing — biblio-glutton would replace/augment the CrossRef fallback that fires after GROBID extracts references
  • issue-arash-20260404-021628-584751.md — Parallel 'study external service' notes — both evaluate replacing or augmenting biblio's current HTTP-based resolution with a specialized service
  • issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign — biblio-glutton integration (Option A/B/C) would fit as a stage in the redesigned resolution pipeline
  • issue-arash-20260403-193002-484673.md — Audits biblio's OpenAlex/CrossRef API usage; findings directly inform whether biblio-glutton's self-hosted approach is worth the deployment complexity
  • issue-arash-20260402-015659-415628.md — Batch GROBID/docling processing — Option C (bulk resolution) in the biblio-glutton study maps to the same batch-processing need