biblio: pool-aware derivative resolution

Goal

Stop regenerating docling/grobid/openalex outputs that already exist in the shared pool. The shared pool at /storage/share/sirocampus/bib/ already has derivatives for most papers, so reusing them saves compute time, API calls, and storage.

Context

  • Pool PDF search works (pool.search in config, used by fetch_pdfs_oa)
  • Derivatives are NOT pool-aware — every project regenerates them independently
  • SiroCampus pool has docling outputs for 250+ papers
  • Running docling batch on a new project re-processes all of them (~45 s each, so 250 papers cost 3+ hours of redundant compute)
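
As a rough check of the estimate above, assuming exactly 250 papers (the lower bound of "250+") and a flat 45 seconds per paper:

```python
papers = 250            # lower bound taken from "250+ papers"
seconds_per_paper = 45  # approximate docling time per paper

wasted_hours = papers * seconds_per_paper / 3600
print(f"{wasted_hours:.3f} hours")  # 3.125 hours
```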

Two-tier model

Tier           Examples                                Resolution
Shared         PDFs, docling, grobid, openalex         Check pool first, symlink if found
Project-local  summaries, reviews, reading lists, RAG  Always generate per-project
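
The "check pool first, symlink if found" rule for the shared tier could look roughly like the sketch below. It is not the biblio implementation: the function name `link_from_pool`, the `SHARED_KINDS` set, and the `<root>/<kind>/<citekey>` directory layout are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Derivative kinds in the shared tier (from the table above).
SHARED_KINDS = {"docling", "grobid", "openalex"}


def link_from_pool(pool_root: Path, project_root: Path, citekey: str, kind: str) -> str:
    """Return "pool", "local", or "missing" for one derivative.

    Hypothetical helper: assumes outputs live at <root>/<kind>/<citekey>.
    """
    local = project_root / kind / citekey

    # Drop a dangling symlink left behind by a pool entry that was removed.
    if local.is_symlink() and not local.exists():
        local.unlink()

    if local.exists():
        return "pool" if local.is_symlink() else "local"

    # Project-local kinds are never taken from the pool.
    if kind not in SHARED_KINDS:
        return "missing"

    pooled = pool_root / kind / citekey
    if not pooled.exists():
        return "missing"

    # Link rather than copy so the project does not duplicate storage.
    local.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(pooled, local, target_is_directory=pooled.is_dir())
    return "pool"
```

Symlinking instead of copying keeps a single copy of each derivative on disk, which matches the storage goal. A real implementation would also need to decide what happens when the pool is not mounted.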

Acceptance Criteria

  • [ ] resolve_derivative(cfg, citekey, "docling") checks pool paths
  • [ ] run_docling_for_key skips when pool output exists (unless force=True)
  • [ ] find_pending_docling excludes keys with pool derivatives
  • [ ] paper_context reports derivative_source: "pool" | "local" | "missing"
  • [ ] Batch operations show pool skip counts
  • [ ] Config: pool.derivatives: true (default enabled)
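
One possible shape for the criteria above is sketched here. The name `resolve_derivative` and its arguments come from the checklist, but everything else is assumed: `cfg` as a plain dict, `pool.search` as a list of directories, a hypothetical `derivatives_dir` key, the `<root>/<kind>/<citekey>` layout, the returned dict, and the generic `find_pending` helper standing in for `find_pending_docling`.

```python
from pathlib import Path


def resolve_derivative(cfg: dict, citekey: str, kind: str) -> dict:
    """Report where a derivative comes from: "pool", "local", or "missing"."""
    # Assumed layout: <derivatives_dir>/<kind>/<citekey> inside the project.
    local = Path(cfg["derivatives_dir"]) / kind / citekey
    if local.exists():
        source = "pool" if local.is_symlink() else "local"
        return {"derivative_source": source, "path": local}

    pool = cfg.get("pool", {})
    # pool.derivatives defaults to enabled, as in the last criterion.
    if pool.get("derivatives", True):
        for root in pool.get("search", []):
            candidate = Path(root) / kind / citekey
            if candidate.exists():
                return {"derivative_source": "pool", "path": candidate}

    return {"derivative_source": "missing", "path": None}


def find_pending(cfg: dict, citekeys: list, kind: str, force: bool = False):
    """Return (keys still to process, counts per source) for a batch run."""
    pending = []
    counts = {"pool": 0, "local": 0, "missing": 0}
    for key in citekeys:
        source = resolve_derivative(cfg, key, kind)["derivative_source"]
        counts[source] += 1
        # force=True re-processes everything, regardless of existing outputs.
        if force or source == "missing":
            pending.append(key)
    return pending, counts
```

A batch command could then print `counts["pool"]` as its pool skip count, and `paper_context` could pass `derivative_source` through unchanged.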

Result

(Filled in after execution)