
Study: Unpaywall/oadoi internals — improve biblio OA cascade

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-021628-584751.md). Understand the problem, then implement the proposed fix.


Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade

biblio's pdf_fetch_oa.py implements an OA cascade: pool → OpenAlex → Unpaywall → EZProxy. Study the Unpaywall backend (oadoi) to understand how OA resolution actually works and identify improvements.
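As a rough illustration of what "cascade" means here, the sketch below tries a list of stages in order and stops at the first one that yields a URL. It is not biblio's actual code: the function name `resolve_pdf_url`, the stub resolvers, and the example URLs are all hypothetical, and the real `pdf_fetch_oa.py` may structure this differently.

```python
from typing import Callable, Iterable, Optional, Tuple

# A resolver takes a DOI and returns a candidate PDF URL, or None if it has nothing.
Resolver = Callable[[str], Optional[str]]


def resolve_pdf_url(
    doi: str, stages: Iterable[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Return (stage_name, url) from the first stage that yields a URL, else None."""
    for name, resolver in stages:
        try:
            url = resolver(doi)
        except Exception:
            # A failing stage should not abort the cascade; move on to the next one.
            continue
        if url:
            return name, url
    return None


# Hypothetical stub stages standing in for pool, OpenAlex, Unpaywall and EZProxy.
def from_pool(doi: str) -> Optional[str]:
    return None


def from_openalex(doi: str) -> Optional[str]:
    return None


def from_unpaywall(doi: str) -> Optional[str]:
    return f"https://example.org/{doi}.pdf"


def from_ezproxy(doi: str) -> Optional[str]:
    return f"https://proxy.example.org/{doi}"


STAGES = [
    ("pool", from_pool),
    ("openalex", from_openalex),
    ("unpaywall", from_unpaywall),
    ("ezproxy", from_ezproxy),
]

result = resolve_pdf_url("10.1234/abc", STAGES)
```

Keeping the stages as an ordered list of (name, callable) pairs is one way to make the order configurable later, which is one of the research questions below.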

Research questions

  1. How does Unpaywall find OA copies?
     • What sources does it check? (repositories, publisher sites, preprint servers)
     • How does it rank OA locations? (gold, green, bronze, hybrid)
     • What's the best_oa_location selection logic?

  2. How does OpenAlex's OA data relate to Unpaywall?
     • OpenAlex incorporates Unpaywall data. Is biblio's separate Unpaywall call redundant?
     • When does OpenAlex's best_oa_location differ from direct Unpaywall?

  3. Can biblio's cascade be smarter?
     • Should we skip Unpaywall if OpenAlex already has the OA location?
     • Are there OA sources biblio misses? (CORE, BASE, PubMed Central direct)
     • Should the cascade order be configurable per-paper-type?

  4. PDF validation
     • oadoi has logic for detecting paywall pages served as PDFs
     • Compare with biblio's pdf_validate. Can we improve detection?

Output

Write findings to docs/specs/biblio/oa-cascade-study.md with:

  • How Unpaywall actually works (from oadoi source)
  • Comparison: biblio's cascade vs Unpaywall's approach
  • Recommended improvements with priority
  • Whether the Unpaywall API call is still needed given OpenAlex integration

Key references (indexed in RAG)

  • .projio/codio/mirrors/ourresearch--oadoi/ — Unpaywall backend source
  • packages/biblio/src/biblio/pdf_fetch_oa.py — current OA cascade
  • packages/biblio/src/biblio/unpaywall.py — current Unpaywall client
  • .projio/codio/mirrors/ourresearch--openalex-elastic-api/ — OpenAlex OA fields

Acceptance Criteria

  • [ ]

Result

Completed in 244.3s (1 step, model: opus)

Session: f3546bf9-7310-4e96-a38d-a7a8e08440b3

Result note: /storage2/arash/worklog/workflow/captures/20260405-031904-2ea734/note.md

Output

Done. Here's what was delivered:

**Spec document:** `docs/specs/biblio/oa-cascade-study.md` — comprehensive analysis of how Unpaywall/oadoi works internally vs biblio's cascade, with prioritized recommendations.

**Code changes (3 files):**

1. **`packages/biblio/src/biblio/pdf_fetch_oa.py`** — Three improvements:
   - New `_oa_pdf_url_candidates()` returns all OA URLs from an OpenAlex record (best_oa_location, oa_url, primary_location, all oa_locations), deduplicated and priority-ordered. The cascade now tries each candidate before moving to Unpaywall.
   - Enhanced `_download()` validation: rejects files smaller than 1 KB (error-page stubs) and reads the first 4 KB of the file to detect encrypted PDFs (the `/Encrypt` pattern, taken from oadoi). The existing HTML and magic-byte checks are preserved.

2. **`packages/biblio/src/biblio/unpaywall.py`** — Fixed `best_pdf_url()` fallback order: now tries all `url_for_pdf` entries across all locations before falling back to `best_oa_location.url` (which is often a landing page, not a PDF).
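The two URL-selection changes above can be sketched roughly as follows. This is an illustration under assumptions, not the code that was committed: the function bodies are reconstructed from the summary, the OpenAlex field layout (`pdf_url` on each location, `oa_url` nested under `open_access`) is assumed, and the example records are made up.

```python
from typing import Any, Dict, List, Optional


def oa_pdf_url_candidates(work: Dict[str, Any]) -> List[str]:
    """Collect OA URLs from an OpenAlex-style record, best first, without duplicates."""
    raw: List[Optional[str]] = [
        (work.get("best_oa_location") or {}).get("pdf_url"),
        (work.get("open_access") or {}).get("oa_url"),  # assumed nesting
        (work.get("primary_location") or {}).get("pdf_url"),
    ]
    for loc in work.get("oa_locations") or []:
        raw.append((loc or {}).get("pdf_url"))

    seen = set()
    ordered: List[str] = []
    for url in raw:
        # Drop empty entries and keep only the first occurrence of each URL.
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


def best_pdf_url(record: Dict[str, Any]) -> Optional[str]:
    """Prefer any url_for_pdf across all Unpaywall locations over a landing-page URL."""
    best = record.get("best_oa_location") or {}
    for loc in [best] + list(record.get("oa_locations") or []):
        url = (loc or {}).get("url_for_pdf")
        if url:
            return url
    # Last resort: best_oa_location.url, which may be a landing page rather than a PDF.
    return best.get("url")


# Made-up example records.
work = {
    "best_oa_location": {"pdf_url": "https://a.example/p.pdf"},
    "open_access": {"oa_url": "https://a.example/p.pdf"},
    "primary_location": {"pdf_url": None},
    "oa_locations": [
        {"pdf_url": "https://b.example/p.pdf"},
        {"pdf_url": "https://a.example/p.pdf"},
    ],
}
record = {
    "best_oa_location": {"url": "https://pub.example/landing", "url_for_pdf": None},
    "oa_locations": [{"url_for_pdf": None}, {"url_for_pdf": "https://repo.example/p.pdf"}],
}

candidates = oa_pdf_url_candidates(work)
unpaywall_url = best_pdf_url(record)
```

In this sketch the cascade would try each entry of `candidates` in order and only call Unpaywall if none of them downloads and validates.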

**Key findings from the study:**
- Unpaywall API calls are largely redundant with OpenAlex (which incorporates Unpaywall data), but worth keeping as a fallback
- biblio was only trying the single best OA URL — now tries all candidates
- PDF validation was too simple compared to oadoi's approach — now catches encrypted PDFs and tiny error pages
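The validation rules mentioned in these findings might look something like the sketch below. It only mirrors the checks named in the summary (minimum size, HTML detection, magic bytes, `/Encrypt` in the first 4 KB); the function name `pdf_rejection_reason`, the constants, and the order of the checks are assumptions, and the real `_download()` works on a downloaded file rather than an in-memory byte string.

```python
from typing import Optional

MIN_PDF_BYTES = 1024   # assumed threshold for "error page stub"
HEADER_BYTES = 4096    # how much of the file is inspected


def pdf_rejection_reason(data: bytes) -> Optional[str]:
    """Return a short reason string if data should be rejected, or None if it passes."""
    if len(data) < MIN_PDF_BYTES:
        return "too_small"

    header = data[:HEADER_BYTES].lstrip()
    lowered = header.lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        # An HTML page (for example a paywall or login page) served at a PDF URL.
        return "html"
    if not header.startswith(b"%PDF-"):
        return "bad_magic"
    if b"/Encrypt" in header:
        # Encryption dictionary referenced within the inspected header window.
        return "encrypted"
    return None


padding = b"0" * 2000
checks = {
    "stub": pdf_rejection_reason(b"%PDF-1.4"),
    "html": pdf_rejection_reason(b"<html>" + padding),
    "encrypted": pdf_rejection_reason(b"%PDF-1.7\n/Encrypt 5 0 R\n" + padding),
    "plain": pdf_rejection_reason(b"%PDF-1.7\n" + padding),
}
```

Returning a reason string rather than a bare boolean makes it easier to log why a candidate URL was skipped before the cascade moves on.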

## Next Steps
1. Run `make test` in the biblio package to verify the changes don't break existing tests.
2. Implement the Priority 3 recommendation: add `Content-Disposition` header checking to `_download()` for additional PDF confirmation.
3. Add unit tests for `_oa_pdf_url_candidates()` with multi-location OpenAlex records and for the new encrypted-PDF / minimum-size validation in `_download()`.
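For step 2, a header check could be sketched as below. This is only one possible shape: the function name `looks_like_pdf_response` is hypothetical, and it assumes `_download()` has access to the response headers as a plain mapping. It uses the standard library's `email.message.Message` to parse `Content-Type` and the `filename` parameter of `Content-Disposition`.

```python
from email.message import Message
from typing import Mapping


def looks_like_pdf_response(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers suggest the body is a PDF."""
    msg = Message()
    for key, value in headers.items():
        msg[key] = value

    # get_content_type() lowercases the value and ignores parameters such as charset.
    if msg.get_content_type() == "application/pdf":
        return True

    # get_filename() reads the filename parameter of Content-Disposition, if present.
    filename = msg.get_filename()
    return bool(filename) and filename.lower().endswith(".pdf")


pdf_by_type = looks_like_pdf_response({"Content-Type": "application/pdf"})
pdf_by_name = looks_like_pdf_response(
    {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": 'attachment; filename="paper.PDF"',
    }
)
html_page = looks_like_pdf_response({"Content-Type": "text/html; charset=utf-8"})
```

Headers alone can be wrong in either direction, so a check like this would most likely complement the byte-level validation rather than replace it.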