Issue arash 20260404 021628 584751
title: "## Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade"
status: done
created: 2026-04-04
updated: 2026-04-04
timestamp: 20260404-021628-584751
tags: [issue]
source: agent-observation
project_primary: projio
capture_id: 20260404-021627-ccb7fb
confidence: 1.0
transcript_file: /storage2/arash/worklog/workflow/captures/20260404-021627-ccb7fb/transcript.txt
Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade
biblio's pdf_fetch_oa.py implements an OA cascade: pool → OpenAlex → Unpaywall → EZProxy. Study the Unpaywall backend (oadoi) to understand how OA resolution actually works and identify improvements.
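As a reference point for the study, the cascade idea itself (try each source in order, stop at the first hit) can be sketched as below. This is a minimal illustration and not the code in pdf_fetch_oa.py: `Resolver`, `CascadeResult`, and `run_cascade` are hypothetical names, and the stage functions are stand-ins for the real pool, OpenAlex, Unpaywall, and EZProxy lookups.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical resolver signature: takes a DOI, returns a PDF URL or None.
Resolver = Callable[[str], Optional[str]]


@dataclass(frozen=True)
class CascadeResult:
    source: str   # name of the stage that produced the URL
    pdf_url: str


def run_cascade(
    doi: str, stages: Sequence[Tuple[str, Resolver]]
) -> Optional[CascadeResult]:
    """Try each stage in order; return the first hit, or None if all miss."""
    for name, resolve in stages:
        try:
            url = resolve(doi)
        except Exception:
            # A failing stage should not abort the whole cascade.
            continue
        if url:
            return CascadeResult(source=name, pdf_url=url)
    return None


# Stand-in stages (assumptions): the real lookups would do I/O.
example_stages = [
    ("pool", lambda doi: None),
    ("openalex", lambda doi: "https://example.org/from-openalex.pdf"),
    ("unpaywall", lambda doi: "https://example.org/from-unpaywall.pdf"),
]
result = run_cascade("10.1234/example", example_stages)
```

Keeping the stage order as plain data is one way to make the order configurable later, which is one of the questions this study raises.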
Research questions
- How does Unpaywall find OA copies?
  - What sources does it check? (repositories, publisher sites, preprint servers)
  - How does it rank OA locations? (gold, green, bronze, hybrid)
  - What's the best_oa_location selection logic?
- How does OpenAlex's OA data relate to Unpaywall?
  - OpenAlex incorporates Unpaywall data — is biblio's separate Unpaywall call redundant?
  - When does OpenAlex's best_oa_location differ from direct Unpaywall?
- Can biblio's cascade be smarter?
  - Should we skip Unpaywall if OpenAlex already has the OA location?
  - Are there OA sources biblio misses? (CORE, BASE, PubMed Central direct)
  - Should the cascade order be configurable per-paper-type?
- PDF validation
  - oadoi has logic for detecting paywall pages served as PDFs
  - Compare with biblio's pdf_validate — can we improve detection?
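For the PDF validation question, a baseline byte-level check is sketched below so there is something concrete to compare against. It is an assumption-laden illustration: it reproduces neither oadoi's detection logic nor biblio's pdf_validate, and `looks_like_pdf` is a hypothetical name. It only catches HTML or truncated payloads saved with a .pdf extension; a well-formed PDF whose content is a paywall notice would still pass and needs text-level inspection.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic: is this payload a complete PDF rather than an HTML page?"""
    head = data[:1024]
    # PDF files carry a "%PDF-" signature; many readers accept it anywhere
    # in the first 1024 bytes, so search the head instead of byte 0 only.
    if b"%PDF-" not in head:
        return False
    # An HTML document that merely mentions "%PDF-" is still a web page.
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return False
    # A complete PDF ends with an "%%EOF" marker near the end of the file;
    # its absence usually means a truncated download.
    return b"%%EOF" in data[-2048:]
```

A check like this is cheap enough to run on every download before the file is stored.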
Output
Write findings to docs/specs/biblio/oa-cascade-study.md with:
- How Unpaywall actually works (from oadoi source)
- Comparison: biblio's cascade vs Unpaywall's approach
- Recommended improvements with priority
- Whether the Unpaywall API call is still needed given OpenAlex integration
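One way to gather evidence for the last point is to fetch both records for a sample of DOIs and tally how often they agree. The sketch below assumes the public response shapes as I understand them: Unpaywall records expose `best_oa_location.url_for_pdf` and OpenAlex works expose `best_oa_location.pdf_url`. Both should be checked against the mirrored sources before relying on them. The function names are hypothetical, and the fetching itself is left out.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Optional, Tuple

Record = Mapping[str, Any]


def unpaywall_pdf_url(record: Record) -> Optional[str]:
    # Assumed field names from the Unpaywall v2 response format.
    loc = record.get("best_oa_location") or {}
    return loc.get("url_for_pdf")


def openalex_pdf_url(work: Record) -> Optional[str]:
    # Assumed field names from the OpenAlex work object.
    loc = work.get("best_oa_location") or {}
    return loc.get("pdf_url")


def compare_best_oa(unpaywall_record: Record, openalex_work: Record) -> str:
    """Classify one DOI by which service offers a direct PDF URL."""
    up = unpaywall_pdf_url(unpaywall_record)
    oa = openalex_pdf_url(openalex_work)
    if up and oa:
        return "same" if up == oa else "different"
    if up:
        return "unpaywall_only"
    if oa:
        return "openalex_only"
    return "neither"


def tally(pairs: Iterable[Tuple[Record, Record]]) -> Counter:
    """Count outcomes over (unpaywall_record, openalex_work) pairs."""
    return Counter(compare_best_oa(u, o) for u, o in pairs)
```

A non-trivial `unpaywall_only` count over a representative sample would suggest the separate Unpaywall call still adds coverage; a count near zero would support dropping it.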
Key references (indexed in RAG)
- .projio/codio/mirrors/ourresearch--oadoi/ — Unpaywall backend source
- packages/biblio/src/biblio/pdf_fetch_oa.py — current OA cascade
- packages/biblio/src/biblio/unpaywall.py — current Unpaywall client
- .projio/codio/mirrors/ourresearch--openalex-elastic-api/ — OpenAlex OA fields
Related Notes
- issue-arash-20260402-233201-350554.md — Direct predecessor: the paywall-page-saved-as-PDF bug that motivates the OA cascade study
- issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign overlaps with improving the OA cascade and PDF fetch logic
- issue-arash-20260403-193002-484673.md — OpenAlex API audit is directly relevant to the question of whether Unpaywall calls are redundant given OpenAlex OA data
- issue-arash-20260402-015659-415628.md — Batch docling command is downstream of successful PDF fetching — shares the PDF pipeline context
- issue-arash-20260404-021609-691872.md — GROBID citation context study is a parallel deep-dive into biblio internals, same research methodology