biblio pdf_fetch_oa saves HTML paywall pages as .pdf files without content-type validation.
16 of 272 articles in sirocampus biblio have HTML files masquerading as PDFs. Docling fails on these because there's no actual PDF to parse.
Fix: After downloading, check that the file starts with the %PDF- magic bytes. If it does not, delete the file and report the fetch as failed. Optionally retry with alternative sources (Unpaywall, Semantic Scholar, etc.).
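A minimal sketch of that check, assuming the download has already been written to disk. The function names `is_pdf` and `validate_fetched_pdf` are hypothetical and are not taken from the biblio code; the retry against alternative sources is left to the caller.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF file begins with this header


def is_pdf(path):
    """Return True if the file at `path` starts with the PDF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def validate_fetched_pdf(path):
    """Keep a real PDF; delete anything else and signal a failed fetch.

    Hypothetical helper: returns True when the file is kept, False when it
    was missing or was removed because it is not a PDF (e.g. an HTML
    paywall page saved with a .pdf extension).
    """
    path = Path(path)
    if not path.is_file():
        return False
    if is_pdf(path):
        return True
    path.unlink()  # drop the masquerading file so it cannot reach docling
    return False
```

A caller could treat a `False` result as the signal to try the next source. The check is deliberately strict (header at byte 0); a more lenient variant that tolerates leading bytes would be a separate design decision.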
Failed citekeys: barachantMulticlassBraincomputerInterface2012, coonOscillatoryPhaseModulates2016, dfosterFreelymovingMonkeyTreadmill2014, harveyTopographicRepresentationNumerosity2013, kellisDecodingSpokenWords2010, kitamuraEngramsCircuitsCrucial2017, pesaranInvestigatingLargescaleBrain2018, samsonovichPathIntegrationCognitive1997, seichepineDielectrophoresisAssistedIntegration10242017, taubeAbsencePresence3D2020, taubeAbsencePresence3D2020a, tchoeHumanBrainMapping2022, toddSystematicExplorationUnsupervised2017, vorheesMorrisWaterMaze2006b, yooFunctionalDoubleDissociation2017, zubairHeadMovementWalking2016
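A list like the one above could be regenerated by scanning the already-fetched pool for `.pdf` files that lack the header. This is only a sketch: `find_fake_pdfs` is a hypothetical name, and it assumes each file's stem is its citekey, which this note does not confirm.

```python
from pathlib import Path


def find_fake_pdfs(root):
    """Return sorted stems of *.pdf files under `root` that lack the %PDF- header.

    Assumption: the file stem equals the citekey (e.g. <citekey>.pdf).
    """
    bad = []
    for pdf in Path(root).rglob("*.pdf"):
        with open(pdf, "rb") as fh:
            if fh.read(5) != b"%PDF-":
                bad.append(pdf.stem)
    return sorted(bad)
```

The scan is read-only, so it can be run before deciding whether to delete or refetch the affected files.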
Source context: sirocampus
SiroCampus (sirocampus): Sirota Lab shared repository
Recent commits:
d9383f9 codio: normalize catalog paths (python→codelib, matlab→toolboxes), add motivebatch; share refactor task notes
70ec57f remove code/lib
38ed6a4 [DATALAD] removed content
README:
sirocampus
Clone the Repository
Server:
datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" sirocampus
or a reckless clone (to save space):
datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" --reckless ephemeral sirocampus
Related Notes
- issue-arash-20260402-015659-415628.md — Batch docling processing fails on HTML-masquerading-as-PDF files; fixing pdf_fetch_oa validation would unblock batch docling runs
- issue-arash-20260402-220152-539138.md — biblio_compile merges intermediates including PDF outputs; corrupted HTML-as-PDF files in the pool would propagate into compiled artifacts
- issue-arash-20260402-220025-468258.md — Sources vs artifacts separation spec is directly relevant — validated PDFs vs raw fetched files should be distinguished in the architecture
- issue-arash-20260402-220130-159401.md — biblio merge/output reorganization; invalid fetched files affect what ends up in merged outputs