PDF Discovery Candidates
Date: 2026-03-09
Context
biblio currently supports:
- PDF fetch from BibTeX file fields
- direct PDF ingestion via biblio ingest pdfs
It does not yet have a true PDF discovery layer for finding accessible full text from metadata such as DOI, OpenAlex IDs, or related identifiers.
Strong Candidates
Priority order:
- DOI landing page resolution
- Unpaywall-style open-access lookup
- OpenAlex open-access and metadata signals
- arXiv resolution
- PubMed Central / Europe PMC resolution
- publisher page parsing as a fallback
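As a rough illustration of the open-access lookup step in the list above, the sketch below pulls PDF URLs out of a response shaped like an Unpaywall record. The field names best_oa_location, oa_locations, and url_for_pdf follow Unpaywall's published response format. The function name pdf_urls_from_unpaywall is hypothetical and not part of biblio. No network call is made here; the payload is assumed to be fetched elsewhere.

```python
from typing import Any, Dict, List


def pdf_urls_from_unpaywall(payload: Dict[str, Any]) -> List[str]:
    """Collect distinct PDF URLs from an Unpaywall-style record.

    Hypothetical helper: the best location is listed first, followed by
    any other open-access locations that expose a direct PDF link.
    """
    locations: List[Dict[str, Any]] = []
    best = payload.get("best_oa_location")
    if best:
        locations.append(best)
    locations.extend(payload.get("oa_locations") or [])

    urls: List[str] = []
    for location in locations:
        url = location.get("url_for_pdf")
        # Skip locations without a direct PDF link and drop duplicates.
        if url and url not in urls:
            urls.append(url)
    return urls
```

A record with no open-access location yields an empty list, which a caller could treat as a signal to try the next source in the priority order.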
Rationale
- DOI resolution is the most natural first step because biblio already supports DOI ingestion.
- Unpaywall-style OA lookup is a strong legal and reliable source for accessible PDFs.
- OpenAlex is already a core metadata and graph backend for biblio, so it is a natural companion signal source.
- arXiv and PMC are high-value special cases with relatively deterministic full-text locations.
- Generic publisher scraping should remain a fallback rather than the primary architecture.
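To make the "relatively deterministic" point concrete for arXiv, here is a minimal sketch that maps an arXiv identifier or an arXiv-issued DOI to a PDF URL. It assumes the https://arxiv.org/pdf/<id> URL pattern and the 10.48550/arXiv.<id> DOI form. The function name arxiv_pdf_url is hypothetical, and old-style identifiers such as hep-th/9901001 are deliberately left unhandled.

```python
import re
from typing import Optional

# arXiv-issued DOIs have the form 10.48550/arXiv.<id>.
_ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(.+)$", re.IGNORECASE)
# New-style arXiv identifiers, e.g. 2301.01234 or 2301.01234v2.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_pdf_url(identifier: str) -> Optional[str]:
    """Return a PDF URL for a new-style arXiv ID or arXiv DOI, else None."""
    value = identifier.strip()
    match = _ARXIV_DOI.match(value)
    if match:
        value = match.group(1)
    if value.lower().startswith("arxiv:"):
        value = value[len("arxiv:"):]
    if not _ARXIV_ID.match(value):
        # Not recognizable as a new-style arXiv identifier.
        return None
    return f"https://arxiv.org/pdf/{value}"
```

Returning None for anything unrecognized keeps the resolver conservative, so a non-arXiv DOI simply falls through to the other sources.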
Recommended Product Shape
Possible future command:
biblio pdf discover
Suggested behavior:
- discover candidate PDF URLs from DOI and metadata
- record provenance and confidence
- optionally download only when explicitly requested
- keep early versions review-oriented rather than fully automatic
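One possible way to record provenance and confidence is a small immutable record per candidate URL, serialized for human review. This is only a sketch: the PdfCandidate class, its field names, and the 0.0 to 1.0 confidence scale are assumptions for illustration and do not describe an existing biblio data model.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PdfCandidate:
    """A discovered PDF URL plus how it was found (hypothetical schema)."""

    url: str
    source: str        # resolver that produced the URL, e.g. "oa_lookup"
    confidence: float  # heuristic score between 0.0 and 1.0
    identifier: str    # DOI or other identifier the lookup started from
    note: str = ""     # free-text detail for the reviewer


def rank(candidates: Iterable[PdfCandidate]) -> List[PdfCandidate]:
    """Order candidates from most to least confident."""
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


def to_review_json(candidates: Iterable[PdfCandidate]) -> str:
    """Serialize ranked candidates so a person can inspect them first."""
    return json.dumps([asdict(c) for c in rank(candidates)], indent=2)
```

Nothing in this sketch downloads a file. A review-oriented command could print the JSON and leave fetching to a separate, explicit step.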
Non-Goals
Avoid:
- brittle broad web scraping as the main strategy
- illegal or questionable download sources
- hiding provenance of how a PDF candidate was found
Suggested Architecture
- metadata and identifier normalization
- DOI -> OpenAlex enrichment
- DOI -> OA/full-text candidate lookup
- domain-specific resolvers like arXiv and PMC
- manual review or explicit download step
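The layers above could be wired together roughly as follows: normalize the identifier, pass it through an ordered list of resolvers, and tag each result with the resolver that produced it. The names normalize_doi, discover, and the Resolver type are hypothetical. Resolvers are plain callables here, so network access, enrichment, and the download step stay outside this sketch.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A resolver takes a normalized DOI and returns candidate dicts with a "url" key.
Resolver = Callable[[str], List[Dict[str, str]]]

_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip common URL or "doi:" prefixes and lowercase the DOI."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOIs are case-insensitive, so lowercasing gives a stable lookup key.
    return doi.strip().lower()


def discover(raw_doi: str, resolvers: Sequence[Tuple[str, Resolver]]) -> List[Dict[str, str]]:
    """Run resolvers in priority order and attach provenance to each result."""
    doi = normalize_doi(raw_doi)
    found: List[Dict[str, str]] = []
    for name, resolver in resolvers:
        for item in resolver(doi):
            found.append({**item, "source": name, "doi": doi})
    return found
```

Because the resolver list is ordered, the priority of sources is decided by the caller, and every returned candidate carries the name of the resolver that found it for the manual review step.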