## Audit: biblio concept tagging vs OpenAlex topic classification — overlap anal

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described in /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260403-193037-589959.md. Read it first, then implement the proposed fix.

Acceptance Criteria

  • [ ]

Result

Completed in 151.8s (1 steps, model: opus)

Session: 8d0b7222-f231-4935-aa1c-e04819a56b0a

Result note: /storage2/arash/worklog/workflow/captures/20260403-193715-8bcc14/note.md

Output

Done. The analysis is at `docs/specs/biblio/concept-topic-overlap.md`.

**Key finding:** The systems are **complementary, not redundant**. OpenAlex provides free hierarchical domain classification (~4,500 topics in a four-level hierarchy: domain, field, subfield, topic), while biblio's LLM extraction captures fine-grained detail that OpenAlex cannot: named datasets, specific metrics, techniques, and project-specific vocabulary. The only real overlap is at the broad domain level.

**Recommendation:** Layer them: OpenAlex topics as a free baseline (layer 0), then autotag propagation (layer 1), autotag LLM (layer 2), and concept extraction (layer 3). Biblio already fetches `topics` from OpenAlex in the resolver, so Phase 1 requires no new API calls.

## Next Steps

1. Implement Phase 1: extract and persist OpenAlex topics from the existing resolver data into `bib/derivatives/openalex/{citekey}.yml` so they're available for downstream tools.
2. Add `openalex_mappings` to `tag_vocab.yml` to map OpenAlex subfield IDs to the autotag controlled vocabulary, enabling zero-cost topic-to-tag conversion.
3. Add OpenAlex topic data to the concept index and RAG corpus so `rag_query` and `search_concepts` can surface OpenAlex-derived classifications alongside LLM-extracted concepts.
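Steps 1 and 2 above might be sketched as follows, under stated assumptions. The layout of `openalex_mappings` (subfield ID mapped to a list of tags), the subfield IDs, the tag names, and the helper names `derivative_path` and `topics_to_tags` are all hypothetical placeholders; only the `bib/derivatives/openalex/{citekey}.yml` path pattern comes from the steps above. The mapping is shown as a Python dict standing in for the parsed YAML section.

```python
from pathlib import Path
from typing import Any

# Hypothetical stand-in for the parsed `openalex_mappings` section of
# tag_vocab.yml: OpenAlex subfield ID -> autotag vocabulary tags.
# Both the IDs and the tags here are placeholders for illustration.
OPENALEX_MAPPINGS: dict[str, list[str]] = {
    "https://openalex.org/subfields/1702": ["machine-learning"],
    "https://openalex.org/subfields/2805": ["cognitive-neuroscience"],
}


def derivative_path(project_root: str, citekey: str) -> Path:
    """Build the per-citekey path where persisted topics would live (step 1)."""
    return Path(project_root) / "bib" / "derivatives" / "openalex" / f"{citekey}.yml"


def topics_to_tags(
    topics: list[dict[str, Any]],
    mappings: dict[str, list[str]],
    min_score: float = 0.0,
) -> list[str]:
    """Convert OpenAlex topics to controlled-vocabulary tags (step 2).

    Topics whose subfield ID is not in `mappings`, or whose score is below
    `min_score`, contribute nothing. The result is sorted and de-duplicated.
    """
    tags: set[str] = set()
    for topic in topics:
        if (topic.get("score") or 0.0) < min_score:
            continue
        subfield_id = (topic.get("subfield") or {}).get("id")
        tags.update(mappings.get(subfield_id, []))
    return sorted(tags)
```

The conversion is a pure dictionary lookup, which is what makes it zero-cost: no LLM call is involved, and unmapped subfields simply fall through to the later layers.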