biblio + indexio: literature and retrieval
Sources & anchors
- Stack component: projio
- Canonical artifact: projio's own corpus (1.3k docs + 75k codelib chunks)
- Workshop session: Day-3 AM session 2
- Outline: _outline.md§B
Frame
Citekey resolution; docling/grobid for PDF extraction; corpus indexing, chunking, embedding; RAG. The pain that biblio + indexio address is that claims drift from data.
The pain
By the time a manuscript draft is being assembled, the author has read fifty papers, cited twenty, and paraphrased the rest from memory. Two weeks later a co-author asks "where did you get that number for the ripple frequency?" The author opens Zotero, searches for "ripple frequency", scrolls through hits, opens three PDFs, ctrl-F-s through each, and tries to reconstruct the citation. By the third paragraph of the manuscript, the claim has drifted: the paraphrase no longer matches the source figure caption, the page reference is wrong, or the cited paper turns out to have made the opposite claim with different terminology.
The fix is mechanical, not editorial: keep a queryable corpus of the project's literature — full text, sectioned, embedded — and resolve every citekey to extracted source text at write time. biblio and indexio together supply that corpus.
biblio: bibliography management
biblio is the literature subsystem. It manages the project's
bibliography as a set of human-edited source .bib files in
bib/srcbib/ plus a set of generated artifacts in .projio/biblio/
and .projio/render/.
The compilation pipeline is:
bib/srcbib/*.bib (Zotero export, human-managed)
→ biblio_merge
→ .projio/biblio/merged.bib (deduplicated, single file)
→ biblio_compile
→ .projio/render/compiled.bib (final bib used by pandoc and mkdocs-bibtex)
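The merge step above can be pictured as a deduplication over citekeys. The following is a minimal sketch under stated assumptions, not biblio's actual implementation: the function names `split_entries` and `merge_bib` are hypothetical, entries are assumed to start with `@` at the beginning of a line, and "first occurrence wins" is an assumed policy.

```python
import re

# Matches the head of a BibTeX entry, e.g. "@article{buzsaki2024oscillations,"
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def split_entries(bib_text: str) -> dict[str, str]:
    """Map citekey -> raw entry text (hypothetical helper, not a biblio API)."""
    entries: dict[str, str] = {}
    # Split before every line that begins with '@'.
    for chunk in re.split(r"(?m)^(?=@)", bib_text):
        match = ENTRY_START.match(chunk)
        # Skip non-entry blocks such as @comment, @string, @preamble.
        if match and match.group(1).lower() not in {"comment", "string", "preamble"}:
            entries[match.group(2)] = chunk.strip()
    return entries


def merge_bib(sources: list[str]) -> str:
    """Merge several .bib texts into one, keeping the first entry per citekey."""
    merged: dict[str, str] = {}
    for text in sources:
        for key, entry in split_entries(text).items():
            merged.setdefault(key, entry)  # assumed policy: first occurrence wins
    return "\n\n".join(merged[key] for key in sorted(merged)) + "\n"
```

Sorting by citekey keeps the merged file stable across runs, which makes diffs of the generated artifact easier to review.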
The MCP surface separates roles. Ingestion: biblio_ingest(dois=[...])
takes a list of DOIs, queries Crossref / OpenAlex, and creates new
.bib entries with normalized citekeys. PDF acquisition:
biblio_pdf_fetch_oa(force, citekeys) runs the open-access cascade
(Unpaywall → OpenAlex OA URL → direct publisher when available) and
drops PDFs under bib/articles/. Full-text extraction:
biblio_docling(citekey) runs docling on a PDF and produces a
structured .json and .md representation under bib/derivatives/docling/.
Reference graph: biblio_grobid(citekey) extracts the paper's own
reference list, and biblio_graph_expand(citekeys) walks the graph
to find related work that should also be in the library.
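The open-access cascade described above is an ordered fallback: try each source in turn and stop at the first that yields a PDF URL. A minimal sketch of that control flow follows; the resolver functions are stand-in stubs (no network calls), and `first_oa_url` is a hypothetical name, not part of the biblio MCP surface.

```python
from typing import Callable, Optional

# A resolver takes a DOI and returns a PDF URL, or None if it has nothing.
Resolver = Callable[[str], Optional[str]]


def first_oa_url(
    doi: str, resolvers: list[tuple[str, Resolver]]
) -> Optional[tuple[str, str]]:
    """Return (source_name, url) from the first resolver that succeeds."""
    for name, resolve in resolvers:
        try:
            url = resolve(doi)
        except Exception:
            url = None  # a failing source should not abort the cascade
        if url:
            return name, url
    return None


# Stub resolvers standing in for the real services (assumptions for illustration).
def unpaywall_stub(doi: str) -> Optional[str]:
    return None  # pretend Unpaywall has no OA copy


def openalex_stub(doi: str) -> Optional[str]:
    return f"https://example.org/oa/{doi}.pdf"


def publisher_stub(doi: str) -> Optional[str]:
    raise TimeoutError("publisher unreachable")


CASCADE = [
    ("unpaywall", unpaywall_stub),
    ("openalex", openalex_stub),
    ("publisher", publisher_stub),
]
```

Returning the source name alongside the URL lets a caller record where each PDF came from, which helps when auditing the library later.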
Citekey resolution
Every cited paper has a canonical citekey (e.g. buzsaki2024oscillations).
citekey_resolve(citekeys) returns the full bibliographic record for
each. paper_context(citekey) returns the full docling extraction —
sectioned body text, figure captions, table contents — as structured
data the agent or human can quote from directly. paper_absent_refs(citekey)
lists the citekeys this paper references that are not yet in the
library, so an automated curation pass can pull them in.
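Conceptually, the absent-references listing is a set difference between the keys a paper cites and the keys already in the library. The sketch below illustrates that idea only; `absent_refs` is a hypothetical helper, and how biblio maps extracted references to citekeys is not covered here.

```python
def absent_refs(referenced: list[str], library: set[str]) -> list[str]:
    """Return referenced citekeys missing from the library, in first-seen order."""
    seen: set[str] = set()
    missing: list[str] = []
    for key in referenced:
        if key not in library and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing
```

Preserving first-seen order keeps the output aligned with the paper's own reference list, which is convenient when a human reviews the curation queue.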
The Zotero integration closes the loop with the lab's existing
workflow. biblio_zotero_pull ingests the latest export from the
researcher's personal Zotero collection; biblio_zotero_push reflects
biblio-side enrichments (added DOIs, normalized author names) back to
Zotero. The pattern from the biblio identity memo
is that biblio is project-centric and MCP-first, complementing Zotero
(person-centric, GUI-first) and OpenAlex (corpus-centric).
indexio: corpus indexing and RAG
indexio is the retrieval subsystem. It takes a set of sources —
glob patterns, single files, or paths — and produces an embedded
Chroma index suitable for semantic search. The interface is two pairs
of MCP tools: indexio_sources_list() and indexio_build() to manage the
index, and rag_query(query, corpus=...) and
rag_query_multi(queries=[...], corpus=...) to query it.
A source in indexio is identified by id, has a glob or path, lands
in a named corpus, and carries optional metadata. The default
corpus for a project is docs; biblio outputs land in a biblio
corpus; codio mirrors land in a codelib corpus. Multiple corpora
keep retrieval focused: a query about ripple detection should not return
chunks of the snakemake source code.
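The source model described above (id, glob or path, corpus, optional metadata) can be sketched as a small data structure. This is an illustrative model only; the `Source` class and `group_by_corpus` function are hypothetical names and do not reflect indexio's internal types.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Source:
    """One indexable source, mirroring the fields named in the text."""
    id: str
    corpus: str
    glob: str
    metadata: dict[str, str] = field(default_factory=dict)


def group_by_corpus(sources: list[Source]) -> dict[str, list[str]]:
    """Map each corpus name to the ids of the sources that land in it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for source in sources:
        grouped[source.corpus].append(source.id)
    return dict(grouped)
```

Grouping by corpus makes the many-to-one relationship explicit: several sources can feed one corpus, and a query names the corpus rather than the individual sources.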
biblio_rag_sync() is the bridge: it walks bib/derivatives/docling/,
registers each extracted paper as a source under the biblio
corpus, and triggers an incremental index rebuild. codio_rag_sync()
does the same for code mirrors. The two syncs together populate the
retrieval surface; rag_query("how does snakebids handle missing
sessions?", corpus="codelib") searches code, rag_query("typical
ripple frequency band", corpus="biblio") searches papers.
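To make the effect of the corpus argument concrete, here is a toy retrieval sketch that filters chunks by corpus before ranking. It deliberately swaps in bag-of-words cosine similarity for real embeddings and an in-memory list for the Chroma index, so it illustrates only the scoping behaviour; `toy_query` and the chunk dictionary layout are assumptions, not the `rag_query` implementation.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a crude stand-in for an embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_query(query: str, chunks: list[dict], corpus: str, k: int = 3) -> list[dict]:
    """Rank only the chunks that belong to the requested corpus."""
    q = Counter(tokens(query))
    scoped = [c for c in chunks if c["corpus"] == corpus]  # corpus filter first
    ranked = sorted(
        scoped,
        key=lambda c: cosine(q, Counter(tokens(c["text"]))),
        reverse=True,
    )
    return ranked[:k]
```

Because the filter runs before scoring, a code chunk that happens to mention the query terms can never outrank a paper chunk in a biblio query; it is simply not a candidate.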
projio's own corpus as the canonical example
projio (the tool repo, dogfooding its own ecosystem) maintains two
corpora indexed under .projio/indexio/:
- a docs corpus indexing this handbook, the workshop syllabus, the survey, the specs, and the agent-activity log — currently ~1.3k chunks across the markdown tree.
- a codelib corpus indexing the ~14 external code mirrors under
.projio/codio/mirrors/*/ (snakemake, openalex-elastic-api, openalex-docs, openalex-api-tutorials, …) at ~75k chunks total.
The .projio/indexio/config.yaml declares each source explicitly:
sources:
  - id: docs
    corpus: docs
    glob: docs/**/*.md
  - id: codio-notes
    corpus: codelib
    glob: docs/reference/codelib/libraries/**/*.md
  - id: codio-src-snakemake
    corpus: codelib
    glob: .projio/codio/mirrors/snakemake--snakemake/**/*.{py,ipynb,md}
    metadata:
      library: snakemake
      kind: external_mirror
rag_query("how does snakemake report job IDs?",
corpus="codelib") searches the entire mirrored snakemake source and
returns the relevant code chunks plus surrounding context. The same
query against the docs corpus would search the handbook and the
specs. Splitting corpora at index-build time keeps queries cheap and
focused.
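The snakemake source in the config above uses a brace pattern (`*.{py,ipynb,md}`). Python's standard `glob` module does not expand braces, so anyone reproducing such a source list with the standard library would need to expand the pattern first. How indexio itself resolves these globs is not stated here; the `expand_braces` helper below is a hypothetical sketch of one way to do it.

```python
import re

# Matches an innermost brace group such as "{py,ipynb,md}".
BRACE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> list[str]:
    """Expand shell-style brace alternatives into plain glob patterns."""
    match = BRACE.search(pattern)
    if not match:
        return [pattern]  # nothing left to expand
    head, tail = pattern[: match.start()], pattern[match.end() :]
    expanded: list[str] = []
    for alternative in match.group(1).split(","):
        # Recurse so that patterns with several brace groups are fully expanded.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded
```

Each resulting pattern can then be passed to `glob.glob(pattern, recursive=True)` so that `**` matches nested directories.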
Putting them together: the manuscript story
The closing pattern for this chapter: a manuscript author drafting a
methods section calls paper_context("buzsaki2024oscillations") to
get the extracted source text, quotes the relevant passage directly,
inserts the [@buzsaki2024oscillations] citekey, and runs
manuscript_cite_check on the assembled draft. The cite check
verifies that every cited key resolves in compiled.bib, every key
in the bib is referenced (or marked as background reading), and every
quotation is traceable to the extracted text. The author is still
the one writing the prose, but the evidence is now machine-resolvable
all the way back to the source PDF. (Note: manuscript_cite_suggest
and the broader manuscript subsystem are uneven across projects — see
figio + manuscript for the honest
framing.)
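The first two checks described above (every cited key resolves, every bib key is referenced) reduce to two set differences. The sketch below illustrates that logic only and is not the `manuscript_cite_check` implementation: `cite_check` is a hypothetical name, citations are assumed to use pandoc's bracketed `[@key]` syntax, and the quotation-traceability check is out of scope.

```python
import re

# Citekey after '@' inside a bracketed pandoc citation; internal ':', '.', '-' allowed.
CITE_KEY = re.compile(r"@([A-Za-z0-9_]+(?:[:.\-][A-Za-z0-9_]+)*)")
# Citekey at the head of a BibTeX entry, e.g. "@article{key,".
BIB_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cite_check(draft: str, bib_text: str) -> dict[str, set[str]]:
    """Report cited keys missing from the bib and bib keys never cited."""
    cited: set[str] = set()
    # Only look inside [...] groups so stray '@' in prose is ignored.
    for group in re.findall(r"\[([^\]]*)\]", draft):
        cited.update(CITE_KEY.findall(group))
    known = set(BIB_KEY.findall(bib_text))
    return {"unresolved": cited - known, "uncited": known - cited}
```

An empty `unresolved` set is the condition for a clean build; a non-empty `uncited` set is only a prompt to either cite the entry or mark it as background reading.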
The pain of "claims drift from data" does not disappear, but the mechanical part of the fix — find the source, paraphrase faithfully, attach the citekey — is now a tool call rather than a manual hunt.
What is honest about this layer
The cohort use of biblio and indexio is uneven. Three of the four
study projects have populated bib/ directories and run docling
extraction periodically; one (msol) has biblio.enabled: false while
still carrying a populated bib/ — the enabled-flag drift
the handbook documents. indexio is enabled across all five projects
but only projio itself currently maintains the two-corpus pattern at
scale; the study projects index docs and rely on the ecosystem
codelib corpus for code retrieval.
Further reading
- Docling: PDF text extraction; table, figure, and structured reference extraction that biblio_docling wraps.
- GROBID: ML tool for structured reference and header extraction from PDFs; powers biblio_grobid.
- OpenAlex API: open scholarly metadata API; powers DOI resolution and citation-graph expansion in biblio.