biblio pdf_fetch_oa saves HTML paywall pages as .pdf files without content-type validation.

16 of 272 articles in sirocampus biblio have HTML files masquerading as PDFs. Docling fails on these because there's no actual PDF to parse.

Fix: After downloading, check that the file starts with the %PDF- magic bytes. If it does not, delete the file and report the fetch as failed. Optionally retry with alternative sources (Unpaywall, Semantic Scholar, etc.).
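A minimal sketch of the magic-byte check, assuming each fetched file sits on disk as <citekey>.pdf. The names is_pdf, validate_fetched_pdf and audit_pdf_dir are hypothetical and not part of biblio's actual API. How pdf_fetch_oa reports failures or retries other sources is not shown in this note, so the sketch only returns a boolean.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def is_pdf(path: Path) -> bool:
    """Return True if the file begins with the PDF magic bytes."""
    with path.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def validate_fetched_pdf(path: Path) -> bool:
    """Keep a real PDF; delete anything else and signal a failed fetch.

    Hypothetical hook: the caller would mark the citekey as failed (and
    possibly retry another source) when this returns False.
    """
    if is_pdf(path):
        return True
    path.unlink()  # remove the HTML page (or other non-PDF payload)
    return False


def audit_pdf_dir(pdf_dir: Path) -> list[str]:
    """List the stems (assumed to be citekeys) of .pdf files that are not PDFs."""
    return sorted(p.stem for p in pdf_dir.glob("*.pdf") if not is_pdf(p))
```

Calling validate_fetched_pdf(path) right after each download would stop HTML pages from entering the pool. Running audit_pdf_dir over an existing PDF directory should find files like the 16 listed below. The check is deliberately strict: a PDF with stray leading bytes before the header would be rejected. If that turns out to be too aggressive, searching for the marker within the first kilobyte is a possible relaxation.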

Failed citekeys: barachantMulticlassBraincomputerInterface2012, coonOscillatoryPhaseModulates2016, dfosterFreelymovingMonkeyTreadmill2014, harveyTopographicRepresentationNumerosity2013, kellisDecodingSpokenWords2010, kitamuraEngramsCircuitsCrucial2017, pesaranInvestigatingLargescaleBrain2018, samsonovichPathIntegrationCognitive1997, seichepineDielectrophoresisAssistedIntegration10242017, taubeAbsencePresence3D2020, taubeAbsencePresence3D2020a, tchoeHumanBrainMapping2022, toddSystematicExplorationUnsupervised2017, vorheesMorrisWaterMaze2006b, yooFunctionalDoubleDissociation2017, zubairHeadMovementWalking2016

Source context: sirocampus

SiroCampus (sirocampus): Sirota Lab shared repository

Recent commits:

d9383f9 codio: normalize catalog paths (python→codelib, matlab→toolboxes), add motivebatch; share refactor task notes
70ec57f remove code/lib
38ed6a4 [DATALAD] removed content

README:

sirocampus

Clone the Repository

Server:

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" sirocampus

or a reckless clone (to save space):

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" --reckless ephemeral sirocampus

Related notes:

issue-arash-20260402-015659-415628.md — Batch docling processing fails on HTML-masquerading-as-PDF files; fixing pdf_fetch_oa validation would unblock batch docling runs
issue-arash-20260402-220152-539138.md — biblio_compile merges intermediates including PDF outputs; corrupted HTML-as-PDF files in the pool would propagate into compiled artifacts
issue-arash-20260402-220025-468258.md — Sources vs artifacts separation spec is directly relevant — validated PDFs vs raw fetched files should be distinguished in the architecture
issue-arash-20260402-220130-159401.md — biblio merge/output reorganization; invalid fetched files affect what ends up in merged outputs
