Study: Unpaywall/oadoi internals — improve biblio OA cascade

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021628-584751.md). Understand the problem, then implement the proposed fix.

Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade

biblio's pdf_fetch_oa.py implements an OA cascade: pool → OpenAlex → Unpaywall → EZProxy. Study the Unpaywall backend (oadoi) to understand how OA resolution actually works and identify improvements.

Research questions

1. How does Unpaywall find OA copies?
   - What sources does it check? (repositories, publisher sites, preprint servers)
   - How does it rank OA locations? (gold, green, bronze, hybrid)
   - What's the best_oa_location selection logic?
2. How does OpenAlex's OA data relate to Unpaywall?
   - OpenAlex incorporates Unpaywall data — is biblio's separate Unpaywall call redundant?
   - When does OpenAlex's best_oa_location differ from direct Unpaywall?
3. Can biblio's cascade be smarter?
   - Should we skip Unpaywall if OpenAlex already has the OA location?
   - Are there OA sources biblio misses? (CORE, BASE, PubMed Central direct)
   - Should the cascade order be configurable per-paper-type?
4. PDF validation
   - oadoi has logic for detecting paywall pages served as PDFs
   - Compare with biblio's pdf_validate — can we improve detection?

Output

Write findings to docs/specs/biblio/oa-cascade-study.md with:

- How Unpaywall actually works (from oadoi source)
- Comparison: biblio's cascade vs Unpaywall's approach
- Recommended improvements with priority
- Whether the Unpaywall API call is still needed given OpenAlex integration

Key references (indexed in RAG)

.projio/codio/mirrors/ourresearch--oadoi/ — Unpaywall backend source
packages/biblio/src/biblio/pdf_fetch_oa.py — current OA cascade
packages/biblio/src/biblio/unpaywall.py — current Unpaywall client
.projio/codio/mirrors/ourresearch--openalex-elastic-api/ — OpenAlex OA fields

Related Notes

issue-arash-20260402-233201-350554.md — Direct predecessor: the paywall-page-saved-as-PDF bug that motivates the OA cascade study
issue-arash-20260403-193112-105596.md — Biblio enrichment pipeline redesign overlaps with improving the OA cascade and PDF fetch logic
issue-arash-20260403-193002-484673.md — OpenAlex API audit is directly relevant to the question of whether Unpaywall calls are redundant given OpenAlex OA data
issue-arash-20260402-015659-415628.md — Batch docling command is downstream of successful PDF fetching — shares the PDF pipeline context
issue-arash-20260404-021609-691872.md — GROBID citation context study is a parallel deep-dive into biblio internals, same research methodology

Acceptance Criteria

[ ]

Result

Completed in 244.3s (1 steps, model: opus)

Session: f3546bf9-7310-4e96-a38d-a7a8e08440b3

Result note: /storage2/arash/worklog/workflow/captures/20260405-031904-2ea734/note.md

Output

Done. Here's what was delivered:

**Spec document:** `docs/specs/biblio/oa-cascade-study.md` — comprehensive analysis of how Unpaywall/oadoi works internally vs biblio's cascade, with prioritized recommendations.

**Code changes (3 files):**

1. **`packages/biblio/src/biblio/pdf_fetch_oa.py`** — Three improvements:
   - New `_oa_pdf_url_candidates()` returns all OA URLs from an OpenAlex record (best_oa_location, oa_url, primary_location, all oa_locations), deduplicated and priority-ordered.
   - The cascade now tries each of these candidates before moving on to Unpaywall.
   - Enhanced `_download()` validation: rejects files smaller than 1 KB (error-page stubs) and reads a 4 KB header to detect encrypted PDFs (the `/Encrypt` pattern from oadoi). The existing HTML and magic-byte checks are preserved.

2. **`packages/biblio/src/biblio/unpaywall.py`** — Fixed `best_pdf_url()` fallback order: now tries all `url_for_pdf` entries across all locations before falling back to `best_oa_location.url` (which is often a landing page, not a PDF).
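
The corrected fallback order in `best_pdf_url()` could look roughly like the sketch below. It is not the code from `unpaywall.py`; it assumes the record is the JSON dict returned by the Unpaywall API, with `best_oa_location` and `oa_locations` entries that carry `url_for_pdf` and `url` fields, and it only illustrates the ordering: every explicit PDF link first, the generic URL last.

```python
from typing import Any, Dict, Optional


def best_pdf_url(record: Dict[str, Any]) -> Optional[str]:
    """Pick a PDF URL from an Unpaywall-style record (illustrative sketch)."""
    best = record.get("best_oa_location") or {}
    # Check the best location first, then every other listed location.
    locations = [best] + list(record.get("oa_locations") or [])

    # 1. Prefer any explicit PDF link, wherever it appears.
    for loc in locations:
        url = (loc or {}).get("url_for_pdf")
        if url:
            return url

    # 2. Only then fall back to the generic URL, which may be a landing page.
    return best.get("url") or None
```

With this ordering, a record whose best location has only a landing page still resolves to a repository PDF listed further down in `oa_locations`.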

**Key findings from the study:**
- Unpaywall API calls are largely redundant with OpenAlex (which incorporates Unpaywall data), but worth keeping as a fallback
- biblio was only trying the single best OA URL — now tries all candidates
- PDF validation was too simple compared to oadoi's approach — now catches encrypted PDFs and tiny error pages
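
As a rough illustration of "tries all candidates", the sketch below collects URLs in the priority order listed for `_oa_pdf_url_candidates()` and drops duplicates while keeping the first occurrence. The function name without the leading underscore, the `pdf_url` field, the `open_access.oa_url` nesting, and the `oa_locations` key are assumptions about the record layout, not a copy of biblio's implementation.

```python
from typing import Any, Dict, List, Optional


def oa_pdf_url_candidates(work: Dict[str, Any]) -> List[str]:
    """Return candidate OA URLs in priority order, without duplicates."""

    def pdf_url(location: Optional[Dict[str, Any]]) -> Optional[str]:
        # Locations may be missing or null in the record.
        return (location or {}).get("pdf_url")

    # Priority order taken from the note: best location, oa_url,
    # primary location, then every remaining OA location.
    ordered: List[Optional[str]] = [
        pdf_url(work.get("best_oa_location")),
        (work.get("open_access") or {}).get("oa_url"),
        pdf_url(work.get("primary_location")),
    ]
    ordered.extend(pdf_url(loc) for loc in work.get("oa_locations") or [])

    seen = set()
    candidates: List[str] = []
    for url in ordered:
        if url and url not in seen:  # skip empty values and repeats
            seen.add(url)
            candidates.append(url)
    return candidates
```

A caller would loop over the returned list, attempt a download for each URL, and move on to Unpaywall only when every candidate fails.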

## Next Steps
1. Run `make test` in the biblio package to verify the changes don't break existing tests.
2. Implement the Priority 3 recommendation: add `Content-Disposition` header checking to `_download()` for additional PDF confirmation.
3. Add unit tests for `_oa_pdf_url_candidates()` with multi-location OpenAlex records and for the new encrypted-PDF / minimum-size validation in `_download()`.
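
To make the validation rules in step 3 concrete enough to test, here is a minimal sketch of the checks as a pure function over the downloaded bytes. The function name, the constants, and the returned reason strings are hypothetical; only the rules themselves (minimum size, HTML detection, `%PDF-` magic bytes, `/Encrypt` in the first 4 KB) come from the summary above.

```python
from typing import Optional

MIN_PDF_BYTES = 1024  # assumed threshold, from the "< 1KB" rule
HEADER_BYTES = 4096   # assumed window, from the "4KB header" rule


def pdf_rejection_reason(data: bytes) -> Optional[str]:
    """Return why `data` should be rejected as a PDF, or None if it passes."""
    if len(data) < MIN_PDF_BYTES:
        return "too small"

    header = data[:HEADER_BYTES]
    start = header.lstrip()

    # A paywall or login page served in place of the PDF.
    if start.lower().startswith((b"<!doctype html", b"<html")):
        return "html page"
    if not start.startswith(b"%PDF-"):
        return "missing %PDF- magic bytes"
    # Encrypted PDFs reference an /Encrypt dictionary.
    if b"/Encrypt" in header:
        return "encrypted"
    return None
```

Keeping the checks free of network I/O means the unit tests can feed in small byte strings directly. One caveat: a header-only scan can miss encrypted files whose trailer sits at the end of the file, so a test suite may want a case for that as well.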
