Agent-Driven Literature & Code Ingestion
This tutorial shows how to use Claude Code (or any MCP-capable agent) to ingest papers and code libraries into a projio workspace — turning a curated reading list into queryable, structured project knowledge.
Prerequisites
- A projio workspace (`projio init .`)
- biblio and codio components activated (`projio add biblio && projio add codio`)
- The MCP server configured (`.mcp.json` in place; see Configure the MCP Server)
- Agent permissions configured (`projio add claude`; see Agent Safety & Permissions)
The scenario
You have a research topic — say, travelling wave detection methods — and a curated list of DOIs and GitHub repositories. You want to:
- Ingest the papers into your bibliography
- Register the code libraries in your code intelligence registry
- Tag and organize everything for later discovery
With projio's MCP tools, the agent handles this in a single conversation.
Step 1: Ingest papers by DOI
Give Claude Code a list of DOIs and ask it to ingest them:
You: Ingest these papers into biblio with tag "travelling_waves" and
add them to a collection called "phase-methods":
10.1038/nn.4046
10.1038/nn.4494
10.1016/j.neuron.2013.08.006
10.1152/jn.00369.2007
10.1016/j.jneumeth.2011.10.005
The agent calls biblio_ingest:
```json
{
  "dois": [
    "10.1038/nn.4046",
    "10.1038/nn.4494",
    "10.1016/j.neuron.2013.08.006",
    "10.1152/jn.00369.2007",
    "10.1016/j.jneumeth.2011.10.005"
  ],
  "tags": ["travelling_waves"],
  "status": "unread",
  "collection": "phase-methods"
}
```
The tool returns:
```json
{
  "citekeys": [
    "muller_2018_CorticalTravelling",
    "davis_2020_SpontaneousWaves",
    "rubino_2006_PropagatingWaves",
    "rubino_2007_PhaseGradient",
    "townsend_2011_PhaseGradient"
  ],
  "count": 5,
  "output_bib": "/path/to/project/bib/srcbib/imported.bib",
  "collection": "phase-methods"
}
```
Under the hood: biblio_ingest
The tool executes a multi-step pipeline:
- DOI parsing: each DOI is normalized (strips `https://doi.org/` prefixes)
- OpenAlex enrichment: queries the OpenAlex API for each DOI to resolve title, authors, year, journal, and abstract
- Citekey generation: assigns `{author}_{year}_{TitleWords}` citekeys with automatic deduplication (appends `2`, `3`, etc. on collision)
- BibTeX writing: appends `@article{...}` entries to `bib/srcbib/imported.bib`
- Library ledger: sets `status: unread` and `tags: [travelling_waves]` in `bib/config/library.yml`
- Collection: creates the "phase-methods" collection in `bib/config/collections.json` and adds the citekeys
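The citekey step above can be sketched in a few lines. This is an illustrative helper, not the actual `biblio_ingest` implementation; the stop-word list and word count are assumptions, but the `{author}_{year}_{TitleWords}` shape and the `2`, `3`, … collision suffixes follow the description above:

```python
import re

def make_citekey(author_last: str, year: int, title: str, existing: set[str]) -> str:
    """Build an {author}_{year}_{TitleWords} citekey, appending 2, 3, ... on collision."""
    stop = {"a", "an", "the", "of", "in", "on", "for", "and"}
    # Take the first two significant title words, CamelCased
    words = [w for w in re.findall(r"[A-Za-z]+", title) if w.lower() not in stop]
    title_part = "".join(w.capitalize() for w in words[:2])
    base = f"{author_last.lower()}_{year}_{title_part}"
    key, n = base, 2
    while key in existing:  # deduplicate: base, base2, base3, ...
        key = f"{base}{n}"
        n += 1
    existing.add(key)
    return key
```

For example, "Cortical travelling waves" by Muller (2018) yields `muller_2018_CorticalTravelling`, and a second paper colliding on the same key would become `muller_2018_CorticalTravelling2`.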
After ingestion, run biblio merge to fold the imported entries into bib/main.bib for downstream use by docling and GROBID.
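Conceptually, the merge is a citekey-level deduplication across source files. A minimal sketch, assuming a first-occurrence-wins policy (the real `biblio merge` may resolve duplicates differently):

```python
def merge_entries(sources: list[dict[str, str]]) -> dict[str, str]:
    """Merge citekey -> BibTeX-entry maps from several .bib sources.

    The first source containing a citekey wins; later duplicates are dropped.
    """
    merged: dict[str, str] = {}
    for src in sources:
        for citekey, entry in src.items():
            merged.setdefault(citekey, entry)
    return merged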
Step 2: Register code libraries
Now give the agent the GitHub URLs:
You: Add these libraries to the codio registry:
https://github.com/mne-tools/mne-python
https://github.com/neurodsp-tools/neurodsp
https://github.com/NeuralEnsemble/elephant
https://github.com/kemerelab/ghostipy
https://github.com/preraulab/multitaper_toolbox
https://github.com/mathLab/PyDMD
The agent calls codio_add_urls:
```json
{
  "urls": [
    "https://github.com/mne-tools/mne-python",
    "https://github.com/neurodsp-tools/neurodsp",
    "https://github.com/NeuralEnsemble/elephant",
    "https://github.com/kemerelab/ghostipy",
    "https://github.com/preraulab/multitaper_toolbox",
    "https://github.com/mathLab/PyDMD"
  ]
}
```
The tool returns:
```json
{
  "results": [
    {"url": "https://github.com/mne-tools/mne-python", "name": "mne_python", "status": "added", "repo_id": "mne-tools--mne-python"},
    {"url": "https://github.com/neurodsp-tools/neurodsp", "name": "neurodsp", "status": "added", "repo_id": "neurodsp-tools--neurodsp"},
    {"url": "https://github.com/NeuralEnsemble/elephant", "name": "elephant", "status": "added", "repo_id": "neuralensemble--elephant"},
    {"url": "https://github.com/kemerelab/ghostipy", "name": "ghostipy", "status": "added", "repo_id": "kemerelab--ghostipy"},
    {"url": "https://github.com/preraulab/multitaper_toolbox", "name": "multitaper_toolbox", "status": "added", "repo_id": "preraulab--multitaper_toolbox"},
    {"url": "https://github.com/mathLab/PyDMD", "name": "pydmd", "status": "added", "repo_id": "mathlab--pydmd"}
  ]
}
```
Under the hood: codio_add_urls
For each URL, the tool:
- Parses the URL: extracts `owner/repo` to derive a `repo_id` and library `name` (lowercased, hyphens to underscores)
- Fetches GitHub metadata: calls the GitHub API for the language, license (SPDX), description, and topic tags
- Creates a catalog entry: writes to `.projio/codio/catalog.yml` with kind `external_mirror`, the detected language, license, and summary
- Creates a profile entry: writes to `.projio/codio/profiles.yml` with `priority: tier2`, `status: candidate`, and GitHub topics as capability tags
- Sets the runtime policy: Python repos get `runtime_import: pip_only` and `decision_default: wrap`; non-Python repos get `reference_only` and `new`
- Registers the repository: writes to `.projio/codio/repos.yml` with the URL, hosting provider, and storage type
Existing libraries are skipped (idempotent). If a URL can't be parsed, it's reported as an error without blocking other URLs.
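The URL-parsing and policy steps can be sketched as follows. These are hypothetical helpers written to match the rules described above (`owner--repo` for `repo_id`, lowercased name with hyphens replaced by underscores, and the Python vs. non-Python runtime defaults), not the tool's actual code:

```python
from urllib.parse import urlparse

def parse_repo_url(url: str) -> dict:
    """Derive repo_id (owner--repo, lowercased) and library name from a GitHub URL."""
    path = urlparse(url).path.strip("/")
    owner, repo = path.split("/")[:2]
    repo = repo.removesuffix(".git")
    return {
        "repo_id": f"{owner.lower()}--{repo.lower()}",
        "name": repo.lower().replace("-", "_"),
    }

def runtime_policy(language: str) -> dict:
    """Default runtime policy: Python repos are importable, others reference-only."""
    if language == "Python":
        return {"runtime_import": "pip_only", "decision_default": "wrap"}
    return {"runtime_import": "reference_only", "decision_default": "new"}
```

Applied to `https://github.com/mne-tools/mne-python`, this yields `repo_id` `mne-tools--mne-python` and name `mne_python`, matching the result payload above.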
Step 3: Verify and explore
Now you can query what was ingested using the read tools.
Check a paper
You: What do we have on Muller 2018?
The agent calls citekey_resolve(["muller_2018_CorticalTravelling"]) and gets back the full metadata: title, authors, year, DOI, tags, and library status.
For deeper context (docling excerpt, GROBID references), the agent can call paper_context("muller_2018_CorticalTravelling").
Check a library
You: What's in the registry for mne_python?
The agent calls codio_get("mne_python") and gets the full merged record: language, license, repo URL, capabilities, priority, runtime policy, and any curated notes.
Discover by capability
You: Which libraries support phase analysis?
The agent calls codio_discover("phase analysis") to search across capability tags and descriptions.
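Conceptually, discovery is a text match over capability tags and summaries. A toy version, with made-up profile records and a deliberately naive all-terms-present match (the real tool's ranking may be more sophisticated):

```python
def discover(query: str, profiles: dict[str, dict]) -> list[str]:
    """Return library names whose capability tags or summary contain every query term."""
    terms = query.lower().split()
    hits = []
    for name, prof in profiles.items():
        haystack = " ".join(prof.get("capabilities", []) + [prof.get("summary", "")]).lower()
        if all(t in haystack for t in terms):
            hits.append(name)
    return hits

# Made-up example records for illustration
profiles = {
    "mne_python": {"capabilities": ["meg", "eeg", "phase analysis"], "summary": "MEG/EEG analysis"},
    "pydmd": {"capabilities": ["dynamic mode decomposition"], "summary": "DMD in Python"},
}
```

With these records, `discover("phase analysis", profiles)` returns `["mne_python"]`.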
Combining read and write tools
The real power is in composition. After ingesting, the agent can:
- Call `codio_discover("multitaper spectral estimation")` to find relevant libraries
- Call `paper_context` on related papers to understand the algorithms
- Call `note_search("wave detection methods")` to check prior design decisions
- Then write a new note with `note_create` summarizing the analysis
This is the search-before-creation workflow — the agent builds on existing project knowledge rather than starting from scratch.
Step 4: Bulk-update library metadata
After reviewing the ingested papers, update their status:
You: Mark the Muller and Davis papers as "reading" with high priority.
The agent calls biblio_library_set:
```json
{
  "citekeys": ["muller_2018_CorticalTravelling", "davis_2020_SpontaneousWaves"],
  "status": "reading",
  "priority": "high"
}
```
Under the hood: biblio_library_set
Updates the bib/config/library.yml ledger file. Each citekey's entry is updated with the specified fields. Fields not provided are left unchanged.
Valid statuses: unread, reading, processed, archived.
Valid priorities: low, normal, high.
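The partial-update semantics can be sketched as a per-citekey dict merge that only touches the supplied fields. This is an illustrative sketch of the behavior described above, not the tool's actual code:

```python
VALID_STATUS = {"unread", "reading", "processed", "archived"}
VALID_PRIORITY = {"low", "normal", "high"}

def library_set(ledger: dict, citekeys: list[str], **fields) -> dict:
    """Update ledger entries for the given citekeys; unspecified fields are untouched."""
    if "status" in fields and fields["status"] not in VALID_STATUS:
        raise ValueError(f"invalid status: {fields['status']}")
    if "priority" in fields and fields["priority"] not in VALID_PRIORITY:
        raise ValueError(f"invalid priority: {fields['priority']}")
    for key in citekeys:
        ledger.setdefault(key, {}).update(fields)
    return ledger
```

Note that existing fields such as `tags` survive the update, since only the fields passed in are written.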
Putting it all together
A single conversation with Claude Code can transform a curated list of DOIs and URLs into a fully structured, queryable knowledge layer:
```mermaid
graph LR
    A[DOI list] -->|biblio_ingest| B[BibTeX entries]
    B --> C[Library ledger]
    B --> D[Collections]
    E[GitHub URLs] -->|codio_add_urls| F[Catalog entries]
    F --> G[Profiles]
    F --> H[Repo registry]
    C -->|citekey_resolve| I[Agent queries]
    G -->|codio_discover| I
```
What the agent sees
From the agent's perspective, the projio MCP tools are just another set of callable functions — like file read/write or web search. The agent doesn't need to know about YAML files, BibTeX syntax, or OpenAlex APIs. It calls biblio_ingest with DOIs and gets back citekeys. It calls codio_add_urls with URLs and gets back library names.
The structured output means the agent can chain tools naturally: ingest papers, then resolve their citekeys, then search for related notes, then create a summary — all in one conversation.
Step 5: Process and index
After ingestion, the agent can run the full pipeline without leaving the conversation:
Merge imported entries:
The agent calls biblio_merge() to fold bib/srcbib/*.bib into bib/main.bib:
{"n_sources": 2, "n_entries": 5, "out_bib": "/path/to/project/bib/main.bib"}
Extract full text with Docling:
For each paper that has a PDF, the agent calls biblio_docling(citekey):
{"citekey": "muller_2018_CorticalTravelling", "md_path": "...", "json_path": "..."}
Extract references with GROBID:
The agent can first check the server with biblio_grobid_check(), then call biblio_grobid(citekey) for each paper:
{"citekey": "muller_2018_CorticalTravelling", "header_path": "...", "references_path": "..."}
Rebuild the search index:
Finally, the agent calls indexio_build() to re-index everything for semantic search:
{"store": "default", "persist_directory": "...", "source_stats": {...}}
CLI equivalents
These MCP tools correspond to the CLI commands biblio merge, biblio docling, biblio grobid, and indexio build. The MCP tools let the agent run the full pipeline autonomously in a single conversation.
Next steps
- Write curated library notes in `docs/reference/codelib/libraries/` for the codio entries