
biblio pdf_fetch_oa saves HTML paywall pages as .pdf files without content-type validation.

16 of 272 articles in sirocampus biblio have HTML files masquerading as PDFs. Docling fails on these because there's no actual PDF to parse.

Fix: After downloading, check that the file starts with the %PDF- magic bytes. If it does not, delete the file and report the fetch as failed. Optionally retry with alternative sources (Unpaywall, Semantic Scholar, etc.).
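A minimal sketch of the proposed check, assuming the download has already been written to disk. The function names `looks_like_pdf` and `validate_download` are hypothetical and are not part of the existing biblio code; how a failed fetch is reported or retried is left to the caller.

```python
from pathlib import Path

# Every PDF begins with this signature; an HTML paywall page does not.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF magic bytes."""
    with open(path, "rb") as fh:
        head = fh.read(len(PDF_MAGIC))
    return head == PDF_MAGIC


def validate_download(path):
    """Keep a real PDF and return True; delete anything else and return False.

    Hypothetical helper: the caller is expected to treat False as a failed
    fetch and, optionally, retry with an alternative source.
    """
    p = Path(path)
    if looks_like_pdf(p):
        return True
    # Not a PDF (e.g. a saved HTML paywall page): remove it so it cannot
    # be picked up later as if it were a valid document.
    p.unlink(missing_ok=True)
    return False
```

This is a strict check on the first five bytes, matching the fix described above. Some PDF readers tolerate a few junk bytes before the header, so a stricter or looser rule may be appropriate depending on what the sources actually return.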

Failed citekeys: barachantMulticlassBraincomputerInterface2012, coonOscillatoryPhaseModulates2016, dfosterFreelymovingMonkeyTreadmill2014, harveyTopographicRepresentationNumerosity2013, kellisDecodingSpokenWords2010, kitamuraEngramsCircuitsCrucial2017, pesaranInvestigatingLargescaleBrain2018, samsonovichPathIntegrationCognitive1997, seichepineDielectrophoresisAssistedIntegration10242017, taubeAbsencePresence3D2020, taubeAbsencePresence3D2020a, tchoeHumanBrainMapping2022, toddSystematicExplorationUnsupervised2017, vorheesMorrisWaterMaze2006b, yooFunctionalDoubleDissociation2017, zubairHeadMovementWalking2016


Source context: sirocampus

SiroCampus (sirocampus): Sirota Lab shared repository

Recent commits:

d9383f9 codio: normalize catalog paths (python→codelib, matlab→toolboxes), add motivebatch; share refactor task notes
70ec57f remove code/lib
38ed6a4 [DATALAD] removed content

README:

sirocampus

Clone the Repository

Server:

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" sirocampus

or a reckless clone (to save space):

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" --reckless ephemeral sirocampus