Issue arash 20260404 021628 584751


title: "## Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade"
status: done
created: 2026-04-04
updated: 2026-04-04
timestamp: 20260404-021628-584751
tags: [issue]
source: agent-observation
project_primary: projio
capture_id: 20260404-021627-ccb7fb
confidence: 1.0
transcript_file: /storage2/arash/worklog/workflow/captures/20260404-021627-ccb7fb/transcript.txt


Study: Unpaywall/oadoi internals — improve biblio's OA PDF cascade

biblio's pdf_fetch_oa.py implements an OA cascade: pool → OpenAlex → Unpaywall → EZProxy. Study the Unpaywall backend (oadoi) to understand how OA resolution actually works and identify improvements.
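
As a point of reference for the study, the ordered fallback described above can be sketched as follows. This is a minimal illustration, not the code in pdf_fetch_oa.py: `run_cascade`, the `Resolver` type, and the four stub functions are hypothetical names, and the stubs return canned values instead of touching disk or network.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver takes a DOI and returns a PDF URL, or None when it has nothing.
Resolver = Callable[[str], Optional[str]]


def run_cascade(
    doi: str, stages: Sequence[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Try each stage in order; return (stage_name, pdf_url) for the first hit."""
    for name, resolve in stages:
        try:
            url = resolve(doi)
        except Exception:
            # A failing stage should not abort the whole cascade.
            continue
        if url:
            return name, url
    return None


# Hypothetical stand-ins for the four stages (assumptions, not biblio's API).
def from_pool(doi: str) -> Optional[str]:
    return None  # pretend the local pool has no copy


def from_openalex(doi: str) -> Optional[str]:
    return None  # pretend OpenAlex lists no OA location


def from_unpaywall(doi: str) -> Optional[str]:
    return f"https://example.org/{doi}.pdf"  # pretend Unpaywall has a PDF URL


def from_ezproxy(doi: str) -> Optional[str]:
    raise RuntimeError("proxy login required")


STAGES = [
    ("pool", from_pool),
    ("openalex", from_openalex),
    ("unpaywall", from_unpaywall),
    ("ezproxy", from_ezproxy),
]

result = run_cascade("10.1234/example", STAGES)
print(result)
```

Keeping the stages as an ordered list of (name, resolver) pairs means that reordering or dropping a stage, which several of the research questions below consider, would be a data change and not a control-flow change.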

Research questions

  1. How does Unpaywall find OA copies?
     • What sources does it check? (repositories, publisher sites, preprint servers)
     • How does it rank OA locations? (gold, green, bronze, hybrid)
     • What's the best_oa_location selection logic?

  2. How does OpenAlex's OA data relate to Unpaywall?
     • OpenAlex incorporates Unpaywall data. Is biblio's separate Unpaywall call redundant?
     • When does OpenAlex's best_oa_location differ from direct Unpaywall?

  3. Can biblio's cascade be smarter?
     • Should we skip Unpaywall if OpenAlex already has the OA location?
     • Are there OA sources biblio misses? (CORE, BASE, PubMed Central direct)
     • Should the cascade order be configurable per-paper-type?

  4. PDF validation
     • oadoi has logic for detecting paywall pages served as PDFs
     • Compare with biblio's pdf_validate — can we improve detection?
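
For the PDF validation question, a baseline byte-level check is sketched below as something to compare against. It is neither oadoi's detection logic nor biblio's pdf_validate; `classify_payload` and its labels are hypothetical. It only catches HTML pages and truncated downloads, and it cannot tell when a structurally valid PDF contains nothing but a paywall notice.

```python
def classify_payload(data: bytes) -> str:
    """Classify a downloaded payload by its leading and trailing bytes.

    Returns one of "pdf", "truncated_pdf", "html", or "unknown".
    """
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        # A complete PDF normally carries an %%EOF marker near the end.
        if b"%%EOF" in data[-1024:]:
            return "pdf"
        return "truncated_pdf"
    lowered = head.lower()
    if b"<html" in lowered or b"<!doctype html" in lowered:
        # Typical of a login or paywall page returned in place of the file.
        return "html"
    return "unknown"


samples = {
    "good": b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n",
    "paywall": b"<!DOCTYPE html><html><body>Please sign in</body></html>",
    "cut_off": b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog",
}
labels = {name: classify_payload(blob) for name, blob in samples.items()}
print(labels)
```

Anything stricter, such as inspecting extracted text for paywall wording, would need a PDF parser and is left to the findings of the study.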

Output

Write findings to docs/specs/biblio/oa-cascade-study.md with:

  • How Unpaywall actually works (from oadoi source)
  • Comparison: biblio's cascade vs Unpaywall's approach
  • Recommended improvements with priority
  • Whether the Unpaywall API call is still needed given OpenAlex integration

Key references (indexed in RAG)

  • .projio/codio/mirrors/ourresearch--oadoi/ — Unpaywall backend source
  • packages/biblio/src/biblio/pdf_fetch_oa.py — current OA cascade
  • packages/biblio/src/biblio/unpaywall.py — current Unpaywall client
  • .projio/codio/mirrors/ourresearch--openalex-elastic-api/ — OpenAlex OA fields