Topic enrichment pipeline

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-014857-481322.md). Understand the problem, then implement the proposed fix.

Implement: Topic enrichment pipeline — persist OpenAlex topics per citekey

From docs/specs/biblio/enrichment-pipeline.md and docs/specs/biblio/concept-topic-overlap.md.

What

After OpenAlex resolution, persist the topic hierarchy and keywords for each work. Use them as a free baseline for tagging before optionally running LLM concept extraction.

Tasks

Enrichment storage — packages/biblio/src/biblio/openalex/openalex_enrich.py (new):
- enrich_resolved(root): read resolved.jsonl and extract topics, keywords, type, and retraction status per citekey
- Write per-citekey YAML to bib/derivatives/openalex/{citekey}.yml:

citekey: smith2024
type: article
is_retracted: false
primary_topic:
  name: "Sharp-Wave Ripples"
  subfield: "Behavioral Neuroscience"
  field: "Psychology"
  domain: "Social Sciences"
  score: 0.92
topics: [...]
keywords: [...]
counts_by_year: {2024: 5, 2023: 12, ...}
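
As a rough illustration of the extraction step, the sketch below turns one resolved row into the record shape shown above. It assumes each row of resolved.jsonl is a JSON object carrying a `citekey` plus OpenAlex-style fields: `primary_topic` and `topics` with nested `subfield`/`field`/`domain` objects that expose `display_name`, `keywords` as a list of objects with `display_name`, and `counts_by_year` as a list of `{year, cited_by_count}` entries. The helper names `extract_enrichment` and `iter_enrichment` are hypothetical, the actual row schema may differ, and writing the YAML file is left out.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def _name(obj: Any) -> Optional[str]:
    """Return the display name of an OpenAlex-style entity dict, if present."""
    return obj.get("display_name") if isinstance(obj, dict) else None


def _topic(topic: dict[str, Any]) -> dict[str, Any]:
    """Flatten one topic into name/subfield/field/domain/score."""
    return {
        "name": _name(topic),
        "subfield": _name(topic.get("subfield")),
        "field": _name(topic.get("field")),
        "domain": _name(topic.get("domain")),
        "score": topic.get("score"),
    }


def extract_enrichment(row: dict[str, Any]) -> dict[str, Any]:
    """Build the per-citekey record from one resolved row (assumed schema)."""
    primary = row.get("primary_topic")
    names = (_name(kw) for kw in row.get("keywords") or [])
    return {
        "citekey": row["citekey"],
        "type": row.get("type"),
        "is_retracted": bool(row.get("is_retracted", False)),
        "primary_topic": _topic(primary) if primary else None,
        "topics": [_topic(t) for t in row.get("topics") or []],
        "keywords": [n for n in names if n],
        # Convert the list of yearly entries into a {year: count} mapping.
        "counts_by_year": {
            c["year"]: c["cited_by_count"] for c in row.get("counts_by_year") or []
        },
    }


def iter_enrichment(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one enrichment record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield extract_enrichment(json.loads(line))
```

Keeping extraction separate from file I/O makes the mapping easy to test without a project root on disk.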

Topic → tag mapping — packages/biblio/src/biblio/openalex/topic_tags.py (new):
- Map the OpenAlex topic hierarchy to the biblio tag vocabulary
- Generate tags like domain:neuroscience, field:behavioral-neuroscience, topic:sharp-wave-ripples
- Auto-populate library.yml tags from topics (opt-in, configurable)
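
A minimal sketch of the hierarchy-to-tag mapping, assuming a flattened topic dict with `name`, `subfield`, `field`, and `domain` keys as in the YAML example. The functions `slugify` and `topic_to_tags` and the `prefix` parameter are hypothetical names for illustration, not the module's confirmed API.

```python
import re
from typing import Any


def slugify(name: str) -> str:
    """Lowercase a display name and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def topic_to_tags(topic: dict[str, Any], prefix: str = "") -> list[str]:
    """Build level:slug tags, ordered from broadest to narrowest level."""
    levels = (
        ("domain", "domain"),
        ("field", "field"),
        ("subfield", "subfield"),
        ("topic", "name"),  # the topic's own name becomes the topic: tag
    )
    tags = []
    for level, key in levels:
        value = topic.get(key)
        if value:  # skip hierarchy levels that are missing or empty
            tags.append(f"{prefix}{level}:{slugify(value)}")
    return tags
```

Passing a namespace such as `prefix="oa:"` would keep generated tags distinguishable from hand-assigned ones.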
Integration with autotag — modify packages/biblio/src/biblio/autotag.py:
- If OpenAlex topics exist for a citekey, use them as context/seed in the LLM prompt
- This makes LLM tagging more accurate and avoids redundant classification
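
One way the seeding could work is to render the stored enrichment as a short text block that is prepended to the classification prompt. The sketch below assumes the flattened record shape from the YAML example; `format_topic_context` and the exact wording of the context block are hypothetical and not taken from autotag.py.

```python
from typing import Any, Optional


def format_topic_context(enrichment: Optional[dict[str, Any]]) -> str:
    """Render OpenAlex enrichment as a short context block for an LLM prompt.

    Returns an empty string when there is nothing useful to inject, so the
    caller can fall back to the unseeded prompt.
    """
    if not enrichment:
        return ""
    lines = ["OpenAlex classification (use as a starting point, not ground truth):"]
    primary = enrichment.get("primary_topic") or {}
    if primary.get("name"):
        # Show the hierarchy from broadest to narrowest level.
        path = " > ".join(
            primary[k] for k in ("domain", "field", "subfield", "name") if primary.get(k)
        )
        lines.append(f"- primary topic: {path}")
    keywords = enrichment.get("keywords") or []
    if keywords:
        lines.append("- keywords: " + ", ".join(keywords))
    return "\n".join(lines) if len(lines) > 1 else ""
```

Returning an empty string for missing data keeps the LLM tier working unchanged for citekeys that have not been enriched yet.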
MCP tools:
- biblio_enrich(citekeys): run enrichment for specific citekeys
- biblio_enrich_all(): run for all resolved papers
- Update paper_context() to include topic data
Pipeline integration — update the ingest pipeline documentation:
- Run biblio_enrich after openalex_resolve and before graph_expand, docling, etc.

Key files

packages/biblio/src/biblio/openalex/openalex_resolve.py — reads resolved.jsonl
packages/biblio/src/biblio/autotag.py — LLM tagging to augment with topics
packages/biblio/src/biblio/concepts.py — LLM concepts to compare/complement
packages/biblio/src/biblio/library.py — where tags are stored
packages/biblio/src/biblio/mcp.py — MCP wrappers
docs/specs/biblio/enrichment-pipeline.md
docs/specs/biblio/concept-topic-overlap.md

issue-arash-20260403-193112-105596.md — Direct spec for the enrichment pipeline being implemented here
issue-arash-20260403-193037-589959.md — Audit of concept tagging vs OpenAlex topic classification — the overlap analysis this pipeline addresses
issue-arash-20260404-014840-332631.md — Companion OpenAlex API P1 fixes needed for the resolution step this pipeline depends on
issue-arash-20260403-210020-146442.md — Zotero push-tags writes enrichments back — downstream consumer of the topics this pipeline produces
issue-arash-20260403-193002-484673.md — OpenAlex API audit that informs which topic fields are reliably available for enrichment

Acceptance Criteria

[ ]

Result

Completed in 385.7s (1 step, model: opus)

Session: 302dbc10-6a25-4961-84a6-ac1b5f6dfa82
Result note: /storage2/arash/worklog/workflow/captures/20260404-022411-fe8238/note.md

Output

All implementations are working. Here's a summary of all changes:

**New files:**
- `packages/biblio/src/biblio/openalex/openalex_enrich.py` — Core enrichment module: reads `resolved.jsonl`, extracts topics/keywords/type/retraction/citation-counts, writes per-citekey YAML to `bib/derivatives/openalex/{citekey}.yml`, builds `_topic_index.yml`
- `packages/biblio/src/biblio/openalex/topic_tags.py` — Maps OpenAlex topic hierarchy to `oa:`-prefixed biblio tags (`oa:domain:*`, `oa:field:*`, `oa:subfield:*`, `oa:topic:*`, `oa:keyword:*`), with library.yml population (union merge)

**Modified files:**
- `packages/biblio/src/biblio/openalex/openalex_client.py` — Added `keywords` and `counts_by_year` to `DEFAULT_SELECT`
- `packages/biblio/src/biblio/openalex/openalex_resolve.py` — Extended `_work_to_minimal` with `primary_topic`, `keywords`, `type`, `is_retracted`, `counts_by_year`
- `packages/biblio/src/biblio/autotag.py` — LLM tier now loads OpenAlex enrichment and injects it as context in the classification prompt
- `packages/biblio/src/biblio/mcp.py` — Added `enrich()` and `enrich_topic_tags()` wrappers; `paper_context()` now includes `openalex` field with enrichment data
- `src/projio/mcp/biblio.py` — Added `biblio_enrich()` and `biblio_enrich_topic_tags()` tool functions
- `src/projio/mcp/server.py` — Registered `biblio_enrich` and `biblio_enrich_topic_tags` tools
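
The union merge mentioned for `topic_tags.py` above could look roughly like the following. This is a sketch of an order-preserving union under the assumption that tags are plain strings; `union_merge` is a hypothetical name and the actual implementation may differ.

```python
from typing import Iterable


def union_merge(existing: Iterable[str], new: Iterable[str]) -> list[str]:
    """Return existing tags followed by unseen new tags, without duplicates.

    dict.fromkeys keeps first-seen order, so hand-assigned tags stay in
    place and generated tags are only appended.
    """
    return list(dict.fromkeys([*existing, *new]))
```

Because the merge never removes anything, rerunning it with the same generated tags leaves the list unchanged.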

## Next Steps

1. Run `biblio_enrich()` on the current project to populate the per-citekey YAML files from the existing `resolved.jsonl`, then run `biblio_enrich_topic_tags(dry_run=True)` to preview the library tag additions.
2. Add CLI subcommands (`biblio enrich run` and `biblio enrich tags`) in `packages/biblio/src/biblio/cli.py` for command-line access to the enrichment pipeline.
3. Update `CLAUDE.md` and the projio agent routing table to document the new `biblio_enrich` → `biblio_enrich_topic_tags` pipeline step after `openalex_resolve`.