## Implement: Topic enrichment pipeline (persist OpenAlex topics per citekey)

From docs/specs/biblio/enrichment-pipeline.md and docs/specs/biblio/concept-topic-overlap.md.

What

After OpenAlex resolution, persist the topic hierarchy and keywords for each work. Use them as a free tagging baseline before optionally running LLM concept extraction.

Tasks

  1. Enrichment storage, in packages/biblio/src/biblio/openalex/openalex_enrich.py (new):
     • enrich_resolved(root): read resolved.jsonl and extract topics, keywords, type, and retracted flag per citekey
     • Write per-citekey YAML to bib/derivatives/openalex/{citekey}.yml:

    citekey: smith2024
    type: article
    is_retracted: false
    primary_topic:
      name: "Sharp-Wave Ripples"
      subfield: "Behavioral Neuroscience"
      field: "Psychology"
      domain: "Social Sciences"
      score: 0.92
    topics: [...]
    keywords: [...]
    counts_by_year: {2024: 5, 2023: 12, ...}
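
The extraction step in task 1 could be sketched roughly as follows. This is a minimal sketch, not the actual `enrich_resolved` implementation: the layout of resolved.jsonl is not given here, so the sketch assumes each line is a JSON object with a `citekey` key and a `work` key holding the OpenAlex work record. The helper names `extract_enrichment` and `iter_enrichments` are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def _flatten_topic(node: Optional[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Reduce an OpenAlex topic object to name/subfield/field/domain/score."""
    if not node:
        return None

    def label(key: str) -> Optional[str]:
        # Hierarchy levels are nested objects carrying a display_name.
        return (node.get(key) or {}).get("display_name")

    return {
        "name": node.get("display_name"),
        "subfield": label("subfield"),
        "field": label("field"),
        "domain": label("domain"),
        "score": node.get("score"),
    }


def extract_enrichment(citekey: str, work: dict[str, Any]) -> dict[str, Any]:
    """Build the per-citekey record in the shape of the YAML sample above."""
    return {
        "citekey": citekey,
        "type": work.get("type"),
        "is_retracted": bool(work.get("is_retracted", False)),
        "primary_topic": _flatten_topic(work.get("primary_topic")),
        "topics": [_flatten_topic(t) for t in work.get("topics") or []],
        "keywords": [k.get("display_name") for k in work.get("keywords") or []],
        "counts_by_year": {
            c["year"]: c["cited_by_count"] for c in work.get("counts_by_year") or []
        },
    }


def iter_enrichments(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one enrichment record per non-empty JSONL line.

    Assumption: each line looks like {"citekey": ..., "work": {...}}.
    """
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        yield extract_enrichment(row["citekey"], row["work"])
```

Each yielded dict could then be serialized to bib/derivatives/openalex/{citekey}.yml with a YAML library (for example PyYAML's `yaml.safe_dump`). Keeping extraction separate from file I/O makes the mapping easy to unit test.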
    

  2. Topic → tag mapping, in packages/biblio/src/biblio/openalex/topic_tags.py (new):
     • Map the OpenAlex topic hierarchy to the biblio tag vocabulary
     • Generate tags like domain:neuroscience, field:behavioral-neuroscience, topic:sharp-wave-ripples
     • Auto-populate library.yml tags from topics (opt-in, configurable)
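
One possible shape for the tag generation in task 2 is sketched below. It assumes the flattened `primary_topic` mapping from the YAML sample (keys `name`, `subfield`, `field`, `domain`) and simply slugifies each level. The function names and the `subfield:` prefix are assumptions for illustration. Any alias table that maps OpenAlex names onto the existing biblio vocabulary is not shown.

```python
import re

# Hierarchy levels emitted as prefixed tags, broadest first.
# The "subfield" prefix is an assumption; the task text only lists
# domain:, field:, and topic: as examples.
LEVELS = ("domain", "field", "subfield")


def slugify(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def topic_tags(primary_topic: dict) -> list[str]:
    """Turn one flattened topic mapping into prefixed tags."""
    tags = []
    for level in LEVELS:
        value = primary_topic.get(level)
        if value:
            tags.append(f"{level}:{slugify(value)}")
    if primary_topic.get("name"):
        tags.append(f"topic:{slugify(primary_topic['name'])}")
    return tags
```

Because the output is a plain list of strings, the opt-in auto-population of library.yml can be layered on top without changing this mapping.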

  3. Integration with autotag, modifying packages/biblio/src/biblio/autotag.py:
     • If OpenAlex topics exist for a citekey, use them as context/seed in the LLM prompt
     • This should make LLM tagging more accurate and avoid redundant classification
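
To illustrate the prompt seeding in task 3, here is a small sketch that turns a stored enrichment record into a text block for prepending to the tagging prompt. How autotag.py builds its prompt is not described here, so `topic_seed` and its wording are hypothetical. The input is assumed to follow the YAML sample shape, with `keywords` as a list of strings.

```python
def topic_seed(enrichment: dict, max_topics: int = 3) -> str:
    """Render OpenAlex topics as a short hint block for an LLM prompt.

    Returns an empty string when no topic data is available, so the
    caller can fall back to the unseeded prompt.
    """
    lines = []
    primary = enrichment.get("primary_topic") or {}
    if primary.get("name"):
        levels = (
            primary.get("domain"),
            primary.get("field"),
            primary.get("subfield"),
            primary["name"],
        )
        lines.append("Primary topic: " + " > ".join(p for p in levels if p))

    # Secondary topics, skipping the primary one to avoid repetition.
    others = [
        t["name"]
        for t in enrichment.get("topics") or []
        if t and t.get("name") and t["name"] != primary.get("name")
    ]
    if others:
        lines.append("Other topics: " + ", ".join(others[:max_topics]))

    keywords = enrichment.get("keywords") or []
    if keywords:
        lines.append("Keywords: " + ", ".join(keywords))

    if not lines:
        return ""
    header = "Known OpenAlex classification (treat as a hint, not ground truth):"
    return header + "\n" + "\n".join(lines)
```

Returning an empty string for missing data keeps the integration optional: citekeys without an enrichment file go through the existing LLM path unchanged.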

  4. MCP tools:
     • biblio_enrich(citekeys): run enrichment for specific citekeys
     • biblio_enrich_all(): run for all resolved papers
     • Update paper_context() to include topic data
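
The two enrichment tools in task 4 could share one code path, as in the sketch below. It deliberately leaves out MCP registration, since the wrapper conventions in mcp.py are not shown here. The extra `resolved` and `enrich_one` parameters are hypothetical and exist only to keep the example self-contained. The task list specifies the signatures as `biblio_enrich(citekeys)` and `biblio_enrich_all()`.

```python
from typing import Callable, Iterable


def biblio_enrich(
    citekeys: Iterable[str],
    resolved: dict[str, dict],
    enrich_one: Callable[[str, dict], None],
) -> dict[str, list[str]]:
    """Enrich the requested citekeys and report which had no resolved work.

    `resolved` maps citekey -> OpenAlex work record (assumed to be loaded
    from resolved.jsonl); `enrich_one` persists a single record.
    """
    enriched, missing = [], []
    for key in citekeys:
        work = resolved.get(key)
        if work is None:
            missing.append(key)
            continue
        enrich_one(key, work)
        enriched.append(key)
    return {"enriched": enriched, "missing": missing}


def biblio_enrich_all(
    resolved: dict[str, dict],
    enrich_one: Callable[[str, dict], None],
) -> dict[str, list[str]]:
    """Run enrichment for every resolved paper, in sorted citekey order."""
    return biblio_enrich(sorted(resolved), resolved, enrich_one)
```

Reporting unresolved citekeys instead of raising lets a caller see partial results in a single tool response.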

  5. Pipeline integration, updating the ingest pipeline documentation:
     • After openalex_resolve, run biblio_enrich, then graph_expand, docling, etc.

Key files

  • packages/biblio/src/biblio/openalex/openalex_resolve.py — reads resolved.jsonl
  • packages/biblio/src/biblio/autotag.py — LLM tagging to augment with topics
  • packages/biblio/src/biblio/concepts.py — LLM concepts to compare/complement
  • packages/biblio/src/biblio/library.py — where tags are stored
  • packages/biblio/src/biblio/mcp.py — MCP wrappers
  • docs/specs/biblio/enrichment-pipeline.md
  • docs/specs/biblio/concept-topic-overlap.md