
Topic enrichment pipeline

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-014857-481322.md). Understand the problem, then implement the proposed fix.


Implement: Topic enrichment pipeline — persist OpenAlex topics per citekey

From docs/specs/biblio/enrichment-pipeline.md and docs/specs/biblio/concept-topic-overlap.md.

What

After OpenAlex resolution, persist the topic hierarchy and keywords for each work. Use as a free baseline for tagging before (optionally) running LLM concept extraction.

Tasks

  1. Enrichment storage: packages/biblio/src/biblio/openalex/openalex_enrich.py (new)
     • enrich_resolved(root): read resolved.jsonl, extract topics/keywords/type/retracted per citekey
     • Write per-citekey YAML to bib/derivatives/openalex/{citekey}.yml:

        citekey: smith2024
        type: article
        is_retracted: false
        primary_topic:
          name: "Sharp-Wave Ripples"
          subfield: "Behavioral Neuroscience"
          field: "Psychology"
          domain: "Social Sciences"
          score: 0.92
        topics: [...]
        keywords: [...]
        counts_by_year: {2024: 5, 2023: 12, ...}

  2. Topic → tag mapping: packages/biblio/src/biblio/openalex/topic_tags.py (new)
     • Map OpenAlex topic hierarchy to biblio tag vocabulary
     • Generate tags like domain:neuroscience, field:behavioral-neuroscience, topic:sharp-wave-ripples
     • Auto-populate library.yml tags from topics (opt-in, configurable)

  3. Integration with autotag: modify packages/biblio/src/biblio/autotag.py
     • If OpenAlex topics exist for a citekey, use them as context/seed in the LLM prompt
     • This makes LLM tagging more accurate and avoids redundant classification

  4. MCP tools:
     • biblio_enrich(citekeys): run enrichment for specific citekeys
     • biblio_enrich_all(): run for all resolved papers
     • Update paper_context() to include topic data

  5. Pipeline integration: update the ingest pipeline documentation
     • After openalex_resolve → biblio_enrich → then graph_expand, docling, etc.
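
The extraction step in the enrichment storage task can be sketched as below. This is a minimal illustration, not the actual openalex_enrich.py: it assumes each line of resolved.jsonl is a JSON object with a citekey plus OpenAlex-shaped fields (nested display_name nodes for the topic hierarchy, counts_by_year as a list of year/cited_by_count pairs). The helper names flatten_topic, enrichment_record, and iter_enrichment are hypothetical, and YAML serialization is left out.

```python
from __future__ import annotations

import json
from typing import Any, Iterator


def _name(node: dict[str, Any] | None) -> str | None:
    """Return display_name from an OpenAlex-style node, or None if absent."""
    return node.get("display_name") if node else None


def flatten_topic(topic: dict[str, Any]) -> dict[str, Any]:
    """Collapse the nested topic hierarchy into the flat shape of the YAML example."""
    return {
        "name": _name(topic),
        "subfield": _name(topic.get("subfield")),
        "field": _name(topic.get("field")),
        "domain": _name(topic.get("domain")),
        "score": topic.get("score"),
    }


def enrichment_record(row: dict[str, Any]) -> dict[str, Any]:
    """Build one per-citekey record (assumed input shape, see lead-in)."""
    primary = row.get("primary_topic")
    return {
        "citekey": row["citekey"],
        "type": row.get("type"),
        "is_retracted": bool(row.get("is_retracted", False)),
        "primary_topic": flatten_topic(primary) if primary else None,
        "topics": [flatten_topic(t) for t in row.get("topics") or []],
        "keywords": [k["display_name"] for k in row.get("keywords") or []],
        "counts_by_year": {
            c["year"]: c["cited_by_count"] for c in row.get("counts_by_year") or []
        },
    }


def iter_enrichment(jsonl_text: str) -> Iterator[dict[str, Any]]:
    """Yield one enrichment record per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield enrichment_record(json.loads(line))
```

Each yielded record could then be dumped to bib/derivatives/openalex/{citekey}.yml with whatever YAML writer the package already uses.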

Key files

  • packages/biblio/src/biblio/openalex/openalex_resolve.py — reads resolved.jsonl
  • packages/biblio/src/biblio/autotag.py — LLM tagging to augment with topics
  • packages/biblio/src/biblio/concepts.py — LLM concepts to compare/complement
  • packages/biblio/src/biblio/library.py — where tags are stored
  • packages/biblio/src/biblio/mcp.py — MCP wrappers
  • docs/specs/biblio/enrichment-pipeline.md
  • docs/specs/biblio/concept-topic-overlap.md

Acceptance Criteria

  • [ ]

Result

Completed in 385.7s (1 steps, model: opus)

Session: 302dbc10-6a25-4961-84a6-ac1b5f6dfa82

Result note: /storage2/arash/worklog/workflow/captures/20260404-022411-fe8238/note.md

Output

All implementations are working. Here's a summary of all changes:

**New files:**
- `packages/biblio/src/biblio/openalex/openalex_enrich.py` — Core enrichment module: reads `resolved.jsonl`, extracts topics/keywords/type/retraction/citation-counts, writes per-citekey YAML to `bib/derivatives/openalex/{citekey}.yml`, builds `_topic_index.yml`
- `packages/biblio/src/biblio/openalex/topic_tags.py` — Maps OpenAlex topic hierarchy to `oa:`-prefixed biblio tags (`oa:domain:*`, `oa:field:*`, `oa:subfield:*`, `oa:topic:*`, `oa:keyword:*`), with library.yml population (union merge)
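
A rough sketch of how the `oa:`-prefixed tags described above might be derived from one enrichment record. The function names (`slugify`, `topic_to_tags`, `record_to_tags`) and the slug rule (lowercase, runs of non-alphanumerics collapsed to a hyphen) are assumptions for illustration; the real `topic_tags.py` may normalize differently.

```python
import re


def slugify(label: str) -> str:
    """Lowercase the label and collapse runs of non-alphanumerics to one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def topic_to_tags(topic: dict, prefix: str = "oa") -> list[str]:
    """Map one flattened topic (name/subfield/field/domain) to hierarchy tags."""
    tags = []
    for level in ("domain", "field", "subfield"):
        value = topic.get(level)
        if value:
            tags.append(f"{prefix}:{level}:{slugify(value)}")
    if topic.get("name"):
        tags.append(f"{prefix}:topic:{slugify(topic['name'])}")
    return tags


def record_to_tags(record: dict, prefix: str = "oa") -> list[str]:
    """Collect tags from the primary topic, other topics, and keywords."""
    tags: list[str] = []
    topics = [record.get("primary_topic")] + list(record.get("topics") or [])
    for topic in topics:
        if topic:
            tags.extend(topic_to_tags(topic, prefix))
    for keyword in record.get("keywords") or []:
        tags.append(f"{prefix}:keyword:{slugify(keyword)}")
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(tags))
```

Keeping the `oa:` prefix on every generated tag means machine-derived tags stay distinguishable from hand-assigned ones in `library.yml`.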

**Modified files:**
- `packages/biblio/src/biblio/openalex/openalex_client.py` — Added `keywords` and `counts_by_year` to `DEFAULT_SELECT`
- `packages/biblio/src/biblio/openalex/openalex_resolve.py` — Extended `_work_to_minimal` with `primary_topic`, `keywords`, `type`, `is_retracted`, `counts_by_year`
- `packages/biblio/src/biblio/autotag.py` — LLM tier now loads OpenAlex enrichment and injects it as context in the classification prompt
- `packages/biblio/src/biblio/mcp.py` — Added `enrich()` and `enrich_topic_tags()` wrappers; `paper_context()` now includes `openalex` field with enrichment data
- `src/projio/mcp/biblio.py` — Added `biblio_enrich()` and `biblio_enrich_topic_tags()` tool functions
- `src/projio/mcp/server.py` — Registered `biblio_enrich` and `biblio_enrich_topic_tags` tools
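
To illustrate the `autotag.py` change, here is one way the enrichment data could be turned into prompt context. This is a sketch under assumptions: `format_openalex_context` and `build_prompt` are hypothetical names, the record shape follows the per-citekey YAML example, and the prompt wording is invented for the example rather than taken from the actual implementation.

```python
from __future__ import annotations


def format_openalex_context(record: dict | None) -> str:
    """Render an enrichment record as a short context block; empty if no record."""
    if not record:
        return ""
    lines = ["OpenAlex classification (treat as a prior, not ground truth):"]
    primary = record.get("primary_topic")
    if primary:
        levels = ("domain", "field", "subfield", "name")
        path = " > ".join(primary[k] for k in levels if primary.get(k))
        score = primary.get("score")
        suffix = f" (score {score})" if score is not None else ""
        lines.append(f"- primary topic: {path}{suffix}")
    keywords = record.get("keywords") or []
    if keywords:
        lines.append("- keywords: " + ", ".join(keywords))
    # Only the header present means there was nothing useful to report
    return "\n".join(lines) if len(lines) > 1 else ""


def build_prompt(title: str, abstract: str, record: dict | None) -> str:
    """Assemble a classification prompt, adding OpenAlex context when available."""
    parts = [f"Title: {title}", f"Abstract: {abstract}"]
    context = format_openalex_context(record)
    if context:
        parts.append(context)
    parts.append("Assign tags from the project vocabulary.")
    return "\n\n".join(parts)
```

Returning an empty string when no enrichment exists lets the LLM tier fall back to its previous prompt unchanged for unenriched citekeys.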

## Next Steps

1. Run `biblio_enrich()` on the current project to populate per-citekey YAML files from existing resolved.jsonl, then `biblio_enrich_topic_tags(dry_run=True)` to preview the library tag additions.
2. Add CLI subcommands (`biblio enrich run` and `biblio enrich tags`) in `packages/biblio/src/biblio/cli.py` for command-line access to the enrichment pipeline.
3. Update `CLAUDE.md` and the projio agent routing table to document the new `biblio_enrich` → `biblio_enrich_topic_tags` pipeline step after `openalex_resolve`.
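
For the dry-run preview in step 1, the union merge could look roughly like this. It is a sketch, not the shipped `biblio_enrich_topic_tags`: `merge_topic_tags` is a hypothetical helper, and the library is modeled as an in-memory dict of citekey to entry with a `tags` list, which may not match the real `library.yml` layout.

```python
def merge_topic_tags(
    library: dict[str, dict],
    proposed: dict[str, list[str]],
    dry_run: bool = True,
) -> dict[str, list[str]]:
    """Union-merge proposed tags into library entries.

    Returns the tags that are (or would be) added per citekey. The library
    is only mutated when dry_run is False.
    """
    added: dict[str, list[str]] = {}
    for citekey, tags in proposed.items():
        entry = library.get(citekey)
        if entry is None:
            continue  # assumption: citekeys missing from the library are skipped
        existing = list(entry.get("tags") or [])
        # dict.fromkeys removes duplicate proposals while keeping their order
        new = [t for t in dict.fromkeys(tags) if t not in existing]
        if new:
            added[citekey] = new
            if not dry_run:
                entry["tags"] = existing + new
    return added
```

Because the merge only appends tags that are not already present, re-running it after a later enrichment pass leaves existing tags in place.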