Study: biblio-glutton — reference resolution improvement

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021642-474901.md). Understand the problem, then implement the proposed fix.

Study: biblio-glutton — high-performance bibliographic matching for reference resolution

biblio currently resolves unmatched GROBID references to DOIs using CrossRef title search (crossref.py:resolve_doi_by_title). biblio-glutton is a purpose-built high-performance matching service that could replace or augment this.

Research questions

1. What does biblio-glutton do?
   - How does it match references? (title + author + date fuzzy matching?)
   - What data sources does it use? (CrossRef dump, OpenAlex, DOI metadata)
   - Performance: how fast is it vs the CrossRef API?
   - Can it run as a local service?
2. How does it compare to biblio's current approach?
   - biblio uses the CrossRef API with SequenceMatcher similarity scoring (≥0.70 threshold)
   - biblio-glutton may have better recall/precision
   - API rate limits: CrossRef has strict limits, biblio-glutton is self-hosted
3. Integration options
   - Option A: Replace CrossRef calls with the biblio-glutton API
   - Option B: Use biblio-glutton as a fallback when CrossRef misses
   - Option C: Use biblio-glutton for bulk resolution (all GROBID refs at once)
4. Is it worth it?
   - How many references does biblio currently fail to resolve via CrossRef?
   - Would biblio-glutton meaningfully improve the hit rate?
   - Deployment complexity (Java service + Elasticsearch backend)
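
For orientation, the current approach mentioned above (SequenceMatcher similarity with a ≥0.70 threshold) could look roughly like the sketch below. It is not the actual code in `crossref.py`; the function names, the whitespace/case normalization, and the `(doi, title)` candidate shape are assumptions made for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.70  # threshold quoted in the note


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not from the source)."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query_title, candidates, threshold=THRESHOLD):
    """Return (doi, score) for the highest-scoring candidate at or above
    the threshold, or None when no candidate qualifies.

    `candidates` is an iterable of (doi, title) pairs (hypothetical shape).
    """
    best = None
    for doi, title in candidates:
        score = title_similarity(query_title, title)
        if score >= threshold and (best is None or score > best[1]):
            best = (doi, score)
    return best
```

Because only the title is compared, two different papers with near-identical titles can both clear the threshold, which is one reason the multi-field question above matters.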

Output

Write findings to docs/specs/biblio/glutton-study.md

Key references (indexed in RAG)

.projio/codio/mirrors/kermitt2--biblio-glutton/ — biblio-glutton source
packages/biblio/src/biblio/crossref.py — current CrossRef matching
packages/biblio/src/biblio/grobid.py — GROBID reference extraction
packages/biblio/src/biblio/graph.py — where resolved references feed into

Related Notes

issue-arash-20260404-021609-691872.md — Both deal with GROBID reference processing — biblio-glutton would replace/augment the CrossRef fallback that fires after GROBID extracts references
issue-arash-20260404-021628-584751.md — Parallel 'study external service' notes — both evaluate replacing or augmenting biblio's current HTTP-based resolution with a specialized service
issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign — biblio-glutton integration (Option A/B/C) would fit as a stage in the redesigned resolution pipeline
issue-arash-20260403-193002-484673.md — Audits biblio's OpenAlex/CrossRef API usage; findings directly inform whether biblio-glutton's self-hosted approach is worth the deployment complexity
issue-arash-20260402-015659-415628.md — Batch GROBID/docling processing — Option C (bulk resolution) in the biblio-glutton study maps to the same batch-processing need

Acceptance Criteria

[ ]

Result

Completed in 234.1s (1 steps, model: opus)

Session: 5bd08ddb-002f-4a97-a294-b8162a2d4866

Result note: /storage2/arash/worklog/workflow/captures/20260405-032332-ac52de/note.md

Output

Study complete. Written to `docs/specs/biblio/glutton-study.md`.

**Key findings:**

- biblio-glutton offers +1.3% recall and +0.14% precision over the CrossRef API. It matches on multiple fields (title + author + journal + volume + page + year), whereas biblio currently matches on title only.
- A local glutton instance sustains ~6.5–12.6 req/s with no rate limit. The CrossRef API is capped by its rate limit of ~50 req/s.
- Deployment cost is significant: Java + Elasticsearch + ~300 GB storage.
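
To put the throughput figures in perspective, the sketch below converts a request rate into wall-clock time for a batch of lookups. The corpus size is a made-up number, and the CrossRef figure is treated as a ceiling that a client may not actually reach, so this is back-of-the-envelope arithmetic rather than a benchmark.

```python
def hours_to_resolve(n_refs: int, req_per_s: float) -> float:
    """Wall-clock hours for n_refs lookups at a sustained request rate."""
    return n_refs / req_per_s / 3600


N_REFS = 100_000  # hypothetical corpus size, not a figure from the study

# Rates taken from the findings above; the CrossRef value is a rate-limit ceiling.
RATES = {
    "glutton (low)": 6.5,
    "glutton (high)": 12.6,
    "CrossRef ceiling": 50.0,
}

for label, rate in RATES.items():
    print(f"{label}: {hours_to_resolve(N_REFS, rate):.2f} h")
```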

**Recommendation:** Start with **Option D** (improve current CrossRef matching by sending more metadata fields + adding instrumentation), then optionally add glutton as a fallback (Option B) once we have data showing CrossRef misses are a real problem.
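
As a rough illustration of Option D, the sketch below builds a CrossRef `/works` request that passes title, first author, and year together in the `query.bibliographic` field. The helper name, its arguments, and the default `rows` value are assumptions; only the URL is constructed here, and how `crossref.py` actually issues requests is not shown in this note.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_bibliographic_query(title, first_author=None, year=None, rows=5):
    """Build a CrossRef works URL using the query.bibliographic field.

    Hypothetical helper: missing fields are skipped, so it degrades
    to a title-only query when author and year are unknown.
    """
    parts = [title, first_author, str(year) if year else None]
    query = " ".join(p for p in parts if p)
    params = {"query.bibliographic": query, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_bibliographic_query("Deep learning", "LeCun", 2015))
```

The candidates returned by such a query would still need local scoring before one is accepted as a match.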

## Next Steps

1. Implement Option D: update `crossref.py:resolve_doi_by_title` to send title+author+year to CrossRef's `query.bibliographic` and add resolution instrumentation/logging.
2. Add a `glutton_url` config field to `biblio.yml` and implement Option B as an optional fallback resolver when glutton is available.
3. Build instrumentation to measure current CrossRef resolution success rates on your corpus (count absent refs before/after resolution, log similarity distributions).
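
Steps 2 and 3 can be combined in one small wrapper: try the primary resolver, fall back to a second one only if it misses, and count which path produced the DOI. The sketch below is one possible shape. `ResolutionStats`, `resolve_with_fallback`, and the resolver signature (a callable taking a reference dict and returning a DOI or `None`) are all hypothetical, and no glutton or CrossRef client is implemented here.

```python
from collections import Counter
from typing import Callable, Optional

# A resolver takes a parsed reference and returns a DOI or None (assumed interface).
Resolver = Callable[[dict], Optional[str]]


class ResolutionStats:
    """Counts outcomes: 'primary', 'fallback', or 'unresolved'."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, outcome: str) -> None:
        self.counts[outcome] += 1

    def hit_rate(self) -> float:
        """Fraction of references resolved by either resolver."""
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return (total - self.counts["unresolved"]) / total


def resolve_with_fallback(
    ref: dict,
    primary: Resolver,
    fallback: Optional[Resolver] = None,
    stats: Optional[ResolutionStats] = None,
) -> Optional[str]:
    """Try `primary`; call `fallback` only when it returns None."""
    doi = primary(ref)
    outcome = "primary" if doi is not None else "unresolved"
    if doi is None and fallback is not None:
        doi = fallback(ref)
        if doi is not None:
            outcome = "fallback"
    if stats is not None:
        stats.record(outcome)
    return doi
```

With this shape, the fallback can be passed as `None` whenever no `glutton_url` is configured, and the share of `fallback` outcomes in the counts indicates how often CrossRef misses something glutton finds.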
