biblio needs a batch docling command that processes every citekey that has a PDF article, or a filtered subset of them.
Requirements:
- Iterate over all citekeys that have PDFs but no docling output yet
- Pace jobs to avoid CPU/memory saturation (docling uses ~2.7GB RAM and 300%+ CPU per paper)
- Configurable concurrency (default: 1, maybe 2 on beefy machines)
- Progress reporting: current citekey, N/total, elapsed, ETA
- Graceful resume: skip already-processed citekeys (check derivatives/docling/{citekey}/ exists)
- Optional filter: by tag, collection, or citekey glob
- Should work both as CLI (biblio docling batch) and MCP tool (biblio_docling_batch)
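The resume and filter requirements above could be sketched roughly as follows. This is a minimal illustration, not biblio's actual code: the function name `pending_citekeys`, its parameters, and the assumption that `derivatives/docling/` lives directly under the project root are all hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def pending_citekeys(root, citekeys_with_pdf, pattern="*"):
    """Return citekeys that match `pattern` and have no docling output yet.

    Hypothetical helper: assumes output for a citekey is stored in
    <root>/derivatives/docling/<citekey>/ and that the caller already
    knows which citekeys have a PDF.
    """
    out_dir = Path(root) / "derivatives" / "docling"
    return [
        key
        for key in sorted(citekeys_with_pdf)
        # Glob filter first, then skip anything that already has an output directory.
        if fnmatchcase(key, pattern) and not (out_dir / key).is_dir()
    ]
```

Because the check only looks at whether the output directory exists, rerunning the batch after an interruption would pick up where it left off. A run that died halfway through one paper would leave a directory behind, so a real implementation might want a stricter completeness check.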
Current state: biblio_docling processes one citekey at a time with background mode. Running 207 papers requires manually calling it 207 times or writing a wrapper script.
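The wrapper loop that a batch command would replace might look something like the sketch below. It assumes a caller-supplied `process_one(citekey)` callable (hypothetical, standing in for whatever runs docling on a single paper) and uses a thread pool only to cap how many jobs run at once; `run_batch` and `report` are illustrative names, not part of biblio.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(citekeys, process_one, concurrency=1, report=print):
    """Run process_one(citekey) for each citekey, at most `concurrency` at a time.

    Returns a dict mapping failed citekeys to the exception they raised.
    """
    total = len(citekeys)
    failed = {}
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(process_one, key): key for key in citekeys}
        for done, future in enumerate(as_completed(futures), start=1):
            key = futures[future]
            error = future.exception()
            if error is not None:
                failed[key] = error  # keep going; one bad PDF should not stop the batch
            elapsed = time.monotonic() - start
            # ETA from the mean time per finished paper so far.
            eta = elapsed / done * (total - done)
            status = "failed" if error is not None else "ok"
            report(f"[{done}/{total}] {key} {status} elapsed={elapsed:.0f}s eta={eta:.0f}s")
    return failed
```

With `concurrency=1` the papers run strictly one after another, which matches the proposed default. Progress lines are emitted as each paper finishes, so with higher concurrency they report completions rather than the citekey currently in flight.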
Example CLI:
biblio docling batch --root /storage/share/sirocampus --concurrency 1 --progress
Example MCP:
biblio_docling_batch(concurrency=1, filter_collection="share-papers", background=True)
Source context: sirocampus
SiroCampus (sirocampus): Sirota Lab shared repository
Recent commits:
d9383f9 codio: normalize catalog paths (python→codelib, matlab→toolboxes), add motivebatch; share refactor task notes
70ec57f remove code/lib
38ed6a4 [DATALAD] removed content
README:
sirocampus
Clone the Repository
Server:
datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" sirocampus
or a reckless clone (to save space):
datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" --reckless ephemeral sirocampus
Related Notes
- [[issue-arash-20260328-174906-119594.md]] — Both are biblio MCP tool feature requests expanding the biblio command surface
- [[issue-arash-20260328-145700-791144.md]] — Likely related biblio graph migration work touching the same biblio infrastructure