Study: Unpaywall/oadoi internals — improve biblio OA cascade¶
Goal¶
(promoted from note)
Context¶
(see source note)
Prompt¶
Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021628-584751.md). Understand the problem, then implement the proposed fix.
Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade¶
biblio's pdf_fetch_oa.py implements an OA cascade: pool → OpenAlex → Unpaywall → EZProxy. Study the Unpaywall backend (oadoi) to understand how OA resolution actually works and identify improvements.
Research questions¶
- How does Unpaywall find OA copies?
  - What sources does it check? (repositories, publisher sites, preprint servers)
  - How does it rank OA locations? (gold, green, bronze, hybrid)
  - What's the `best_oa_location` selection logic?
- How does OpenAlex's OA data relate to Unpaywall?
  - OpenAlex incorporates Unpaywall data; is biblio's separate Unpaywall call redundant?
  - When does OpenAlex's `best_oa_location` differ from direct Unpaywall?
- Can biblio's cascade be smarter?
  - Should we skip Unpaywall if OpenAlex already has the OA location?
  - Are there OA sources biblio misses? (CORE, BASE, PubMed Central direct)
  - Should the cascade order be configurable per-paper-type?
- PDF validation
  - oadoi has logic for detecting paywall pages served as PDFs
  - Compare with biblio's `pdf_validate`: can we improve detection?
Output¶
Write findings to docs/specs/biblio/oa-cascade-study.md with:
- How Unpaywall actually works (from oadoi source)
- Comparison: biblio's cascade vs Unpaywall's approach
- Recommended improvements with priority
- Whether the Unpaywall API call is still needed given OpenAlex integration
Key references (indexed in RAG)¶
- `.projio/codio/mirrors/ourresearch--oadoi/`: Unpaywall backend source
- `packages/biblio/src/biblio/pdf_fetch_oa.py`: current OA cascade
- `packages/biblio/src/biblio/unpaywall.py`: current Unpaywall client
- `.projio/codio/mirrors/ourresearch--openalex-elastic-api/`: OpenAlex OA fields
Related Notes¶
- issue-arash-20260402-233201-350554.md — Direct predecessor: the paywall-page-saved-as-PDF bug that motivates the OA cascade study
- issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign overlaps with improving the OA cascade and PDF fetch logic
- issue-arash-20260403-193002-484673.md — OpenAlex API audit is directly relevant to the question of whether Unpaywall calls are redundant given OpenAlex OA data
- issue-arash-20260402-015659-415628.md — Batch docling command is downstream of successful PDF fetching — shares the PDF pipeline context
- issue-arash-20260404-021609-691872.md — GROBID citation context study is a parallel deep-dive into biblio internals, same research methodology
Acceptance Criteria¶
- [ ]
Result¶
Completed in 244.3s (1 step, model: opus)
Session: f3546bf9-7310-4e96-a38d-a7a8e08440b3
Result note: /storage2/arash/worklog/workflow/captures/20260405-031904-2ea734/note.md
Output¶
Done. Here's what was delivered:
**Spec document:** `docs/specs/biblio/oa-cascade-study.md` — comprehensive analysis of how Unpaywall/oadoi works internally vs biblio's cascade, with prioritized recommendations.
**Code changes (3 files):**
1. **`packages/biblio/src/biblio/pdf_fetch_oa.py`** — Three improvements:
   - New `_oa_pdf_url_candidates()` returns all OA URLs from an OpenAlex record (`best_oa_location`, `oa_url`, `primary_location`, all `oa_locations`), deduplicated and priority-ordered. The cascade now tries each candidate before moving on to Unpaywall.
   - Enhanced `_download()` validation: rejects files smaller than 1 KB (error-page stubs) and reads a 4 KB header to detect encrypted PDFs (the `/Encrypt` pattern from oadoi). The existing HTML and magic-byte checks are preserved.
2. **`packages/biblio/src/biblio/unpaywall.py`** — Fixed `best_pdf_url()` fallback order: now tries all `url_for_pdf` entries across all locations before falling back to `best_oa_location.url` (which is often a landing page, not a PDF).
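The URL-selection logic in both changes can be sketched roughly as follows. This is a minimal illustration, not the code that was committed: the function names `oa_pdf_url_candidates` and `best_pdf_url_sketch` and the helper `_location_urls` are hypothetical. The record keys follow the list in item 1 and the `url_for_pdf` / `url` fields in item 2, and should be checked against real API responses.

```python
def _location_urls(location):
    """Yield a location's direct PDF URL first, then its landing page."""
    if not location:
        return
    for key in ("pdf_url", "landing_page_url"):  # assumed OpenAlex-style keys
        url = location.get(key)
        if url:
            yield url


def oa_pdf_url_candidates(work):
    """Collect OA URLs from an OpenAlex-style record, best first, no repeats."""
    ordered = list(_location_urls(work.get("best_oa_location")))
    oa_url = (work.get("open_access") or {}).get("oa_url")
    if oa_url:
        ordered.append(oa_url)
    ordered.extend(_location_urls(work.get("primary_location")))
    for location in work.get("oa_locations") or []:
        ordered.extend(_location_urls(location))
    # dict.fromkeys keeps first-seen order, so higher-priority entries win.
    return list(dict.fromkeys(ordered))


def best_pdf_url_sketch(record):
    """Prefer any direct PDF link over a best-location URL that may be a landing page."""
    best = record.get("best_oa_location") or {}
    for location in [best] + list(record.get("oa_locations") or []):
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    # Only now fall back to the generic URL of the best location.
    return best.get("url")
```

A caller would iterate over `oa_pdf_url_candidates(work)` and stop at the first URL whose download passes validation, consulting Unpaywall only if every candidate fails.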
**Key findings from the study:**
- Unpaywall API calls are largely redundant with OpenAlex (which incorporates Unpaywall data), but worth keeping as a fallback
- biblio was only trying the single best OA URL — now tries all candidates
- PDF validation was too simple compared to oadoi's approach — now catches encrypted PDFs and tiny error pages
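The validation checks named above (minimum size, HTML detection, magic bytes, `/Encrypt` in the first 4 KB) might look something like the sketch below. The function name `pdf_rejection_reason`, the constants, and the reason strings are hypothetical. The order of the checks is an assumption, and the header-only `/Encrypt` scan simply mirrors the summary rather than oadoi's actual code.

```python
MIN_PDF_BYTES = 1024   # smaller payloads are treated as error-page stubs
HEADER_BYTES = 4096    # how much of the payload is inspected


def pdf_rejection_reason(data):
    """Return a short reason if `data` should be rejected, else None."""
    if len(data) < MIN_PDF_BYTES:
        return "too small"
    header = data[:HEADER_BYTES].lstrip()
    # Paywall and login pages are usually HTML served under a .pdf URL.
    if header[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html page"
    if not header.startswith(b"%PDF-"):
        return "missing %PDF- magic bytes"
    if b"/Encrypt" in header:
        return "encrypted"
    return None
```

Returning a reason rather than a bare boolean makes it easier to log why a candidate URL was skipped before the cascade moves to the next one.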
## Next Steps
1. Run `make test` in the biblio package to verify the changes don't break existing tests.
2. Implement the Priority 3 recommendation: add `Content-Disposition` header checking to `_download()` for additional PDF confirmation.
3. Add unit tests for `_oa_pdf_url_candidates()` with multi-location OpenAlex records and for the new encrypted-PDF / minimum-size validation in `_download()`.
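As a starting point for step 2, a `Content-Disposition` check could be sketched as below. The function name `looks_like_pdf_attachment` is hypothetical, and how the result would be combined with the other checks in `_download()` is left open. Parsing relies on the standard library's `email.message.Message.get_filename()`.

```python
from email.message import Message


def looks_like_pdf_attachment(headers):
    """True if Content-Disposition names a file ending in .pdf."""
    # HTTP header names are case-insensitive; normalise a plain dict.
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("content-disposition")
    if not value:
        return False
    message = Message()
    message["Content-Disposition"] = value
    filename = message.get_filename()  # None when no filename parameter
    return bool(filename) and filename.lower().endswith(".pdf")
```

Many servers omit this header, so a `False` result is probably best treated as "no extra confirmation" rather than as grounds for rejection.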