Semantic Search Pipeline¶
This tutorial shows how to build a searchable corpus from your project's documents, then query it through MCP tools — giving agents grounded access to your project knowledge.
The pipeline¶
graph LR
A[Documents] -->|register sources| B[Source config]
B -->|indexio build| C[Chunked + embedded corpus]
C -->|rag_query| D[Agent retrieval]
C -->|rag_query_multi| E[Multi-facet search]
Prerequisites¶
- A projio workspace (
projio init .) - indexio installed (
pip install "projio[indexio]") - Documents to index (markdown, PDFs, BibTeX, code)
- MCP server configured (
projio mcp-config -C . --yes) - Agent permissions configured (
projio add claude— see Agent Safety & Permissions)
Step 1: Initialize indexio¶
indexio init-config
This creates infra/indexio/config.yml with default settings for chunking and embedding.
Step 2: Register sources¶
Tell indexio what to index. Sources are directories or file patterns:
indexio add-source docs/ # markdown documentation
indexio add-source bib/main.bib # bibliography entries
indexio add-source src/ # source code
Each source gets chunking parameters appropriate to its content type. The config file (infra/indexio/config.yml) stores the registered sources.
What to index
Index the materials your agent needs for context:
- docs/ — project documentation, how-to guides, design decisions
- docs/log/ — notio notes (ideas, tasks, meeting notes)
- bib/ — bibliography entries and extracted paper text
- src/ — source code (useful for code search alongside codio)
- docs/reference/codelib/ — codio curated library notes
Step 3: Build the corpus¶
indexio build
This runs the full pipeline:
- Scan registered sources for files
- Chunk documents into passages (respecting markdown headers, code blocks, etc.)
- Embed chunks using the configured embedding model
- Store the vector index for retrieval
Build time depends on corpus size. A typical research project (100 docs, 50 papers) takes a few minutes.
Step 4: Query via MCP¶
Single query¶
Ask the agent a question and it retrieves relevant passages:
You: What methods exist for detecting travelling waves in neural data?
The agent calls rag_query(query="methods for detecting travelling waves in neural data", k=8):
{
"corpus": "default",
"results": [
{
"chunk_id": "bib/docling/muller_2018.md:chunk_3",
"score": 0.91,
"text": "Phase gradient methods compute the spatial derivative of instantaneous phase...",
"source": "bib/docling/muller_2018.md"
},
{
"chunk_id": "docs/log/idea/idea-arash-20260310.md:chunk_1",
"score": 0.84,
"text": "Compare optical flow approaches vs phase gradient for wave detection...",
"source": "docs/log/idea/idea-arash-20260310.md"
}
]
}
The results include text excerpts with source attribution, so the agent can cite where information came from.
Multi-query search¶
For complex questions that span multiple facets:
You: I need to understand both the mathematical foundations and the Python
implementations for multitaper spectral analysis.
The agent calls rag_query_multi:
{
"queries": [
"mathematical foundations of multitaper spectral analysis",
"Python implementations of multitaper spectral estimation"
],
"k": 5
}
Results are deduplicated across queries — a passage matching both facets appears once with the higher score.
List corpora¶
You: What corpora are indexed?
The agent calls corpus_list():
{
"corpora": [
{"name": "default", "chunks": 1247, "sources": 5, "last_built": "2026-03-18T14:30:00"}
]
}
Step 5: Keep the corpus current¶
After adding new documents, papers, or notes:
indexio build # rebuilds incrementally
For biblio integration, register codio and biblio sources:
codio rag sync # register codio sources in indexio
biblio rag sync # register biblio sources in indexio
indexio build # rebuild with new sources
Agent patterns¶
Grounded answers¶
The agent uses RAG to ground responses in your project's actual content rather than general knowledge:
You: Based on our project notes and papers, what's the recommended
approach for phase estimation?
The agent calls rag_query, reads the top results, and synthesizes an answer citing specific documents.
Cross-domain search¶
Combine RAG with other tools for comprehensive research:
You: Find everything we have about multitaper methods — papers,
notes, and code libraries.
The agent:
- Calls
rag_query("multitaper methods")— finds papers and notes - Calls
codio_discover("multitaper spectral estimation")— finds libraries - Calls
citekey_resolveon any cited papers — gets full metadata - Synthesizes a comprehensive summary
Worklog integration¶
In the worklog pipeline, semantic search supports:
- Note triage — find related notes before creating duplicates
- Task context — retrieve project knowledge relevant to a task before agent execution
- Run reports — ground agent summaries in actual project content
Next steps¶
- Agent Orchestration — combine search with all ecosystem tools in a single session
- Agent-Driven Ingestion — ingest new papers and libraries to expand the corpus