biblio needs a batch docling command that processes all citekeys that have PDF articles, or a filtered subset of them.

Requirements:
- Iterate over all citekeys that have PDFs but no docling output yet
- Pace jobs to avoid CPU/memory saturation (docling uses ~2.7GB RAM and 300%+ CPU per paper)
- Configurable concurrency (default: 1, maybe 2 on beefy machines)
- Progress reporting: current citekey, N/total, elapsed, ETA
- Graceful resume: skip already-processed citekeys (check that derivatives/docling/{citekey}/ exists)
- Optional filter: by tag, collection, or citekey glob
- Should work both as a CLI command (biblio docling batch) and as an MCP tool (biblio_docling_batch)
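
The resume and glob-filter requirements could be sketched roughly as follows. This is only an illustration, not biblio's actual code: `pending_citekeys`, `citekeys_with_pdf`, and `pattern` are hypothetical names, the sketch assumes derivatives/docling/ sits directly under the project root, and it leaves open how biblio enumerates citekeys that have PDFs.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Iterable, List, Optional


def pending_citekeys(
    root: str,
    citekeys_with_pdf: Iterable[str],
    pattern: Optional[str] = None,
) -> List[str]:
    """Return citekeys that still need docling output, in sorted order.

    Assumption: docling output for a citekey lives in
    <root>/derivatives/docling/<citekey>/, and the existence of that
    directory is taken to mean the citekey was already processed.
    """
    out_dir = Path(root) / "derivatives" / "docling"
    pending = []
    for key in sorted(citekeys_with_pdf):
        # Optional citekey glob filter, e.g. "smith*"
        if pattern is not None and not fnmatchcase(key, pattern):
            continue
        # Graceful resume: skip citekeys that already have an output directory
        if (out_dir / key).is_dir():
            continue
        pending.append(key)
    return pending
```

A directory-exists check is cheap but would also skip a citekey whose earlier run was interrupted after the directory was created; a real implementation might want a stronger completion marker.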

Current state: biblio_docling processes one citekey at a time, with background mode. Processing 207 papers means calling it 207 times by hand or writing a wrapper script.

Example CLI:

biblio docling batch --root /storage/share/sirocampus --concurrency 1 --progress

Example MCP:

biblio_docling_batch(concurrency=1, filter_collection="share-papers", background=True)
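
The concurrency and progress options in the examples above could be backed by a loop along these lines. It is a minimal sketch under stated assumptions: `run_batch`, `format_eta`, and `report` are hypothetical names, and `process_one` is a placeholder callable standing in for whatever runs docling on a single citekey. The ETA is a plain average (elapsed time per finished paper times papers remaining).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def format_eta(elapsed: float, done: int, total: int) -> str:
    """Estimate remaining time from the average time per finished job."""
    if done == 0:
        return "unknown"
    remaining = elapsed / done * (total - done)
    return f"{remaining:.0f}s"


def run_batch(
    citekeys: List[str],
    process_one: Callable[[str], None],  # hypothetical per-citekey worker
    concurrency: int = 1,
    report: Callable[[str], None] = print,
) -> Dict[str, str]:
    """Run process_one over citekeys with at most `concurrency` jobs at once.

    Returns a mapping of citekey -> error message for jobs that raised,
    so one failing paper does not abort the rest of the batch.
    """
    total = len(citekeys)
    failures: Dict[str, str] = {}
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(process_one, key): key for key in citekeys}
        for done, future in enumerate(as_completed(futures), start=1):
            key = futures[future]
            error = future.exception()
            if error is not None:
                failures[key] = str(error)
            elapsed = time.monotonic() - start
            eta = format_eta(elapsed, done, total)
            report(f"[{done}/{total}] {key} elapsed={elapsed:.0f}s eta={eta}")
    return failures
```

Threads are used here on the assumption that the heavy work happens in a separate process started by `process_one`; the pool size then acts only as the pacing limit, so the default of 1 keeps a single docling job running at a time.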


Source context: sirocampus

SiroCampus (sirocampus): Sirota Lab shared repository

Recent commits:

d9383f9 codio: normalize catalog paths (python→codelib, matlab→toolboxes), add motivebatch; share refactor task notes
70ec57f remove code/lib
38ed6a4 [DATALAD] removed content

README:

sirocampus

Clone the Repository

Server:

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" sirocampus

or a reckless clone (to save space):

datalad install -s "ria+file:///storage/share/git/ria-store#~sirocampus" --reckless ephemeral sirocampus

  • [[issue-arash-20260328-174906-119594.md]] — Both are biblio MCP tool feature requests expanding the biblio command surface
  • [[issue-arash-20260328-145700-791144.md]] — Likely related biblio graph migration work touching the same biblio infrastructure