Topic enrichment pipeline
Goal
(promoted from note)
Context
(see source note)
Prompt
Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-014857-481322.md). Understand the problem, then implement the proposed fix.
Implement: Topic enrichment pipeline (persist OpenAlex topics per citekey)
From docs/specs/biblio/enrichment-pipeline.md and docs/specs/biblio/concept-topic-overlap.md.
What
After OpenAlex resolution, persist the topic hierarchy and keywords for each work. Use these as a free baseline for tagging before (optionally) running LLM concept extraction.
Tasks
- Enrichment storage, `packages/biblio/src/biblio/openalex/openalex_enrich.py` (new):
  - `enrich_resolved(root)`: read `resolved.jsonl`, extract topics/keywords/type/retracted per citekey
  - Write per-citekey YAML to `bib/derivatives/openalex/{citekey}.yml`:

        citekey: smith2024
        type: article
        is_retracted: false
        primary_topic:
          name: "Sharp-Wave Ripples"
          subfield: "Behavioral Neuroscience"
          field: "Psychology"
          domain: "Social Sciences"
          score: 0.92
        topics: [...]
        keywords: [...]
        counts_by_year: {2024: 5, 2023: 12, ...}

- Topic → tag mapping, `packages/biblio/src/biblio/openalex/topic_tags.py` (new):
  - Map OpenAlex topic hierarchy to biblio tag vocabulary
  - Generate tags like `domain:neuroscience`, `field:behavioral-neuroscience`, `topic:sharp-wave-ripples`
  - Auto-populate library.yml tags from topics (opt-in, configurable)
- Integration with autotag, modify `packages/biblio/src/biblio/autotag.py`:
  - If OpenAlex topics exist for a citekey, use them as context/seed in the LLM prompt
  - This makes LLM tagging more accurate and avoids redundant classification
- MCP tools:
  - `biblio_enrich(citekeys)`: run enrichment for specific citekeys
  - `biblio_enrich_all()`: run for all resolved papers
  - Update `paper_context()` to include topic data
- Pipeline integration, update the ingest pipeline documentation:
  - After `openalex_resolve` → `biblio_enrich` → then graph_expand, docling, etc.
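The enrichment-storage task can be sketched roughly as follows. This is a minimal sketch, not the actual `openalex_enrich.py`: it assumes each line of `resolved.jsonl` is a JSON object carrying a `citekey` plus OpenAlex-style fields (`primary_topic` with nested `display_name` objects, `keywords`, `type`, `is_retracted`, `counts_by_year`). The helper names `build_enrichment` and `enrich_lines` are hypothetical, and writing the resulting dict out as YAML is left aside.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator


def _name(node: Any) -> str | None:
    """Return display_name from an OpenAlex-style nested object, if present."""
    return node.get("display_name") if isinstance(node, dict) else None


def build_enrichment(record: dict[str, Any]) -> dict[str, Any]:
    """Flatten one resolved record into the per-citekey layout (assumed input shape)."""
    primary = record.get("primary_topic") or {}
    primary_topic = None
    if primary:
        primary_topic = {
            "name": _name(primary),
            "subfield": _name(primary.get("subfield")),
            "field": _name(primary.get("field")),
            "domain": _name(primary.get("domain")),
            "score": primary.get("score"),
        }
    return {
        "citekey": record["citekey"],
        "type": record.get("type"),
        "is_retracted": bool(record.get("is_retracted", False)),
        "primary_topic": primary_topic,
        "keywords": [k["display_name"] for k in record.get("keywords") or []],
        # Assumed input: a list of {"year": ..., "cited_by_count": ...} entries.
        "counts_by_year": {
            c["year"]: c["cited_by_count"] for c in record.get("counts_by_year") or []
        },
    }


def enrich_lines(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one enrichment dict per non-empty JSONL line."""
    for line in lines:
        if line.strip():
            yield build_enrichment(json.loads(line))
```

Keeping the flattening step separate from file I/O makes it easy to test against a handful of in-memory records before pointing it at a real `resolved.jsonl`.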
Key files
- `packages/biblio/src/biblio/openalex/openalex_resolve.py`: reads resolved.jsonl
- `packages/biblio/src/biblio/autotag.py`: LLM tagging to augment with topics
- `packages/biblio/src/biblio/concepts.py`: LLM concepts to compare/complement
- `packages/biblio/src/biblio/library.py`: where tags are stored
- `packages/biblio/src/biblio/mcp.py`: MCP wrappers
- `docs/specs/biblio/enrichment-pipeline.md`
- `docs/specs/biblio/concept-topic-overlap.md`
Related Notes
- issue-arash-20260403-193112-105596.md — Direct spec for the enrichment pipeline being implemented here
- issue-arash-20260403-193037-589959.md — Audit of concept tagging vs OpenAlex topic classification — the overlap analysis this pipeline addresses
- issue-arash-20260404-014840-332631.md — Companion OpenAlex API P1 fixes needed for the resolution step this pipeline depends on
- issue-arash-20260403-210020-146442.md — Zotero push-tags writes enrichments back — downstream consumer of the topics this pipeline produces
- issue-arash-20260403-193002-484673.md — OpenAlex API audit that informs which topic fields are reliably available for enrichment
Acceptance Criteria
- [ ]
Result
Completed in 385.7s (1 steps, model: opus)
Session: 302dbc10-6a25-4961-84a6-ac1b5f6dfa82
Result note: /storage2/arash/worklog/workflow/captures/20260404-022411-fe8238/note.md
Output
All implementations are working. Here's a summary of all changes:
**New files:**
- `packages/biblio/src/biblio/openalex/openalex_enrich.py` — Core enrichment module: reads `resolved.jsonl`, extracts topics/keywords/type/retraction/citation-counts, writes per-citekey YAML to `bib/derivatives/openalex/{citekey}.yml`, builds `_topic_index.yml`
- `packages/biblio/src/biblio/openalex/topic_tags.py` — Maps OpenAlex topic hierarchy to `oa:`-prefixed biblio tags (`oa:domain:*`, `oa:field:*`, `oa:subfield:*`, `oa:topic:*`, `oa:keyword:*`), with library.yml population (union merge)
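
To illustrate the `oa:`-prefixed tag mapping and the union merge, here is a minimal sketch. It is not the code in `topic_tags.py`: the function names (`slugify`, `topic_tags`, `union_merge`), the slug rule, and the input shape (a dict with `primary_topic` and `keywords`, as in the per-citekey YAML) are assumptions made for illustration.

```python
from __future__ import annotations

import re


def slugify(label: str) -> str:
    """Lowercase a label and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def topic_tags(enrichment: dict) -> list[str]:
    """Build oa:-prefixed tags from one enrichment record (assumed layout)."""
    primary = enrichment.get("primary_topic") or {}
    tags = []
    # Walk the hierarchy from broadest to narrowest level.
    for level, key in (
        ("domain", "domain"),
        ("field", "field"),
        ("subfield", "subfield"),
        ("topic", "name"),
    ):
        value = primary.get(key)
        if value:
            tags.append(f"oa:{level}:{slugify(value)}")
    for keyword in enrichment.get("keywords") or []:
        if keyword:
            tags.append(f"oa:keyword:{slugify(keyword)}")
    return tags


def union_merge(existing: list[str], new: list[str]) -> list[str]:
    """Append tags not already present, preserving the existing order."""
    merged = list(existing)
    seen = set(existing)
    for tag in new:
        if tag not in seen:
            merged.append(tag)
            seen.add(tag)
    return merged
```

A union merge of this kind never removes or reorders tags a user has already curated, which is presumably why it suits an opt-in auto-population step.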
**Modified files:**
- `packages/biblio/src/biblio/openalex/openalex_client.py` — Added `keywords` and `counts_by_year` to `DEFAULT_SELECT`
- `packages/biblio/src/biblio/openalex/openalex_resolve.py` — Extended `_work_to_minimal` with `primary_topic`, `keywords`, `type`, `is_retracted`, `counts_by_year`
- `packages/biblio/src/biblio/autotag.py` — LLM tier now loads OpenAlex enrichment and injects it as context in the classification prompt
- `packages/biblio/src/biblio/mcp.py` — Added `enrich()` and `enrich_topic_tags()` wrappers; `paper_context()` now includes `openalex` field with enrichment data
- `src/projio/mcp/biblio.py` — Added `biblio_enrich()` and `biblio_enrich_topic_tags()` tool functions
- `src/projio/mcp/server.py` — Registered `biblio_enrich` and `biblio_enrich_topic_tags` tools
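
The `autotag.py` change, injecting OpenAlex enrichment as context in the classification prompt, could look something like the sketch below. The actual prompt wording and function names in `autotag.py` are not shown in this note, so `topic_context` and `build_prompt` and the prompt text are hypothetical.

```python
from __future__ import annotations


def topic_context(enrichment: dict | None) -> str:
    """Render enrichment as a short context block; empty string if nothing is known."""
    if not enrichment:
        return ""
    primary = enrichment.get("primary_topic") or {}
    lines = []
    # Hierarchy from broadest to narrowest, skipping missing levels.
    path = [primary.get(k) for k in ("domain", "field", "subfield", "name")]
    path = [p for p in path if p]
    if path:
        lines.append("OpenAlex topic: " + " > ".join(path))
    keywords = enrichment.get("keywords") or []
    if keywords:
        lines.append("OpenAlex keywords: " + ", ".join(keywords))
    return "\n".join(lines)


def build_prompt(title: str, abstract: str, enrichment: dict | None = None) -> str:
    """Assemble a tagging prompt, adding OpenAlex context only when it exists."""
    parts = [f"Title: {title}", f"Abstract: {abstract}"]
    context = topic_context(enrichment)
    if context:
        parts.append(context)
    parts.append("Suggest tags for this paper.")  # placeholder instruction text
    return "\n\n".join(parts)
```

Because the context block is omitted when no enrichment file exists, citekeys that were never resolved would get the same prompt as before the change.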
## Next Steps
1. Run `biblio_enrich()` on the current project to populate per-citekey YAML files from existing resolved.jsonl, then `biblio_enrich_topic_tags(dry_run=True)` to preview the library tag additions.
2. Add CLI subcommands (`biblio enrich run` and `biblio enrich tags`) in `packages/biblio/src/biblio/cli.py` for command-line access to the enrichment pipeline.
3. Update `CLAUDE.md` and the projio agent routing table to document the new `biblio_enrich` → `biblio_enrich_topic_tags` pipeline step after `openalex_resolve`.
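
For step 2, the proposed `biblio enrich run` and `biblio enrich tags` subcommands might be wired up along these lines. This is only a sketch using the standard-library `argparse`; how `cli.py` is actually structured is not shown here, and the `citekeys` positional and `--dry-run` flag are assumed to mirror the MCP tool parameters rather than confirmed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing `biblio enrich run` and `biblio enrich tags`."""
    parser = argparse.ArgumentParser(prog="biblio")
    commands = parser.add_subparsers(dest="command", required=True)

    enrich = commands.add_parser("enrich", help="OpenAlex enrichment pipeline")
    actions = enrich.add_subparsers(dest="action", required=True)

    run = actions.add_parser("run", help="write per-citekey enrichment files")
    # Assumption: no citekeys means "all resolved papers".
    run.add_argument("citekeys", nargs="*", help="limit to these citekeys")

    tags = actions.add_parser("tags", help="populate library tags from topics")
    # Assumption: mirrors the dry_run parameter of biblio_enrich_topic_tags.
    tags.add_argument("--dry-run", action="store_true", help="preview only")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```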