## Study: biblio-glutton, high-performance bibliographic matching for reference resolution
biblio currently resolves unmatched GROBID references to DOIs using CrossRef title search (crossref.py:resolve_doi_by_title). biblio-glutton is a purpose-built high-performance matching service that could replace or augment this.
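For orientation, the current approach is described in this note as CrossRef title search with SequenceMatcher similarity scoring and a 0.70 acceptance threshold. The following is a minimal sketch of that kind of scoring, not the actual contents of crossref.py: the function names, the whitespace/case normalization, and the example DOIs are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Threshold taken from this note; everything else here is illustrative.
SIMILARITY_THRESHOLD = 0.70


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles, using difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(
    query_title: str,
    candidates: List[Tuple[str, str]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Tuple[Optional[str], float]:
    """Pick the best-scoring (doi, title) candidate.

    Returns (doi, score) if the best score reaches the threshold,
    otherwise (None, best_score).
    """
    best_doi: Optional[str] = None
    best_score = 0.0
    for doi, title in candidates:
        score = title_similarity(query_title, title)
        if score > best_score:
            best_doi, best_score = doi, score
    if best_score >= threshold:
        return best_doi, best_score
    return None, best_score


# Hypothetical candidates, standing in for a CrossRef search response.
candidates = [
    ("10.0000/example-a", "deep   learning"),
    ("10.0000/example-b", "Quantum chromodynamics on the lattice"),
]
doi, score = best_match("Deep Learning", candidates)
```

A scorer like this looks only at titles, which is why the research questions below ask whether biblio-glutton also uses author and date signals.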
### Research questions
- What does biblio-glutton do?
  - How does it match references? (title + author + date fuzzy matching?)
  - What data sources does it use? (CrossRef dump, OpenAlex, DOI metadata)
  - Performance: how fast is it vs CrossRef API?
  - Can it run as a local service?
- How does it compare to biblio's current approach?
  - biblio uses CrossRef API with SequenceMatcher similarity scoring (≥0.70 threshold)
  - biblio-glutton may have better recall/precision
  - API rate limits: CrossRef has strict limits; biblio-glutton is self-hosted
- Integration options
  - Option A: Replace CrossRef calls with biblio-glutton API
  - Option B: Use biblio-glutton as fallback when CrossRef misses
  - Option C: Use biblio-glutton for bulk resolution (all GROBID refs at once)
- Is it worth it?
  - How many references does biblio currently fail to resolve via CrossRef?
  - Would biblio-glutton meaningfully improve the hit rate?
  - Deployment complexity (Java service + Elasticsearch backend)
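Option B and the hit-rate questions above can be explored with a small harness before committing to a deployment. The sketch below assumes each resolver is a plain callable from a reference title to a DOI or `None`; the two stub resolvers, their lookup tables, and the DOIs are hypothetical placeholders, not the real CrossRef or biblio-glutton clients.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A resolver maps a reference title to a DOI, or None on a miss.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(
    title: str, resolvers: Sequence[Tuple[str, Resolver]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each named resolver in order; return (doi, resolver_name).

    Returns (None, None) if every resolver misses.
    """
    for name, resolver in resolvers:
        doi = resolver(title)
        if doi:
            return doi, name
    return None, None


def resolution_stats(
    titles: Sequence[str], resolvers: Sequence[Tuple[str, Resolver]]
) -> Dict[str, int]:
    """Count how many titles each resolver settled, plus the unresolved rest."""
    counts: Dict[str, int] = {"unresolved": 0}
    for title in titles:
        doi, source = resolve_with_fallback(title, resolvers)
        key = source if doi and source else "unresolved"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical stand-ins for the real services.
def crossref_stub(title: str) -> Optional[str]:
    return {"known title": "10.0000/example-1"}.get(title)


def glutton_stub(title: str) -> Optional[str]:
    return {"obscure title": "10.0000/example-2"}.get(title)


chain = [("crossref", crossref_stub), ("glutton", glutton_stub)]
stats = resolution_stats(["known title", "obscure title", "missing title"], chain)
```

Running the same set of GROBID reference titles through a CrossRef-only chain and a CrossRef-then-glutton chain would give a direct estimate of how many additional references the fallback recovers. Swapping the order of `chain` approximates Option A with CrossRef as the backup.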
### Output
Write findings to docs/specs/biblio/glutton-study.md
### Key references (indexed in RAG)
- .projio/codio/mirrors/kermitt2--biblio-glutton/: biblio-glutton source
- packages/biblio/src/biblio/crossref.py: current CrossRef matching
- packages/biblio/src/biblio/grobid.py: GROBID reference extraction
- packages/biblio/src/biblio/graph.py: where resolved references feed into
### Related Notes
- issue-arash-20260404-021609-691872.md — Both deal with GROBID reference processing — biblio-glutton would replace/augment the CrossRef fallback that fires after GROBID extracts references
- issue-arash-20260404-021628-584751.md — Parallel 'study external service' notes — both evaluate replacing or augmenting biblio's current HTTP-based resolution with a specialized service
- issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign — biblio-glutton integration (Option A/B/C) would fit as a stage in the redesigned resolution pipeline
- issue-arash-20260403-193002-484673.md — Audits biblio's OpenAlex/CrossRef API usage; findings directly inform whether biblio-glutton's self-hosted approach is worth the deployment complexity
- issue-arash-20260402-015659-415628.md — Batch GROBID/docling processing — Option C (bulk resolution) in the biblio-glutton study maps to the same batch-processing need