## Audit: biblio concept tagging vs OpenAlex topic classification - overlap analysis
Biblio has LLM-based concept extraction (`packages/biblio/src/biblio/concepts.py`) that uses Claude to extract methods, datasets, metrics, domains, and techniques from papers. OpenAlex has its own concept/topic classification system. Determine whether the two are redundant and whether biblio should use OpenAlex topics instead of, or alongside, the LLM approach.
### Scope
- OpenAlex topic system: use RAG to query the indexed `openalex-concept-tagging` repo and `openalex-docs` to understand:
  - What is OpenAlex's topic/concept hierarchy? (levels, granularity)
  - How are topics assigned to works? (ML model? rule-based?)
  - What fields are returned per work? (`topics`, `concepts`, `keywords`)
  - Coverage: do all works have topics?
- Biblio concept system: read `packages/biblio/src/biblio/concepts.py` to understand:
  - What categories does biblio extract? (methods, datasets, metrics, domains, techniques)
  - How are they extracted? (LLM prompt → structured output)
  - Where are they stored? (`bib/derivatives/concepts/`)
  - How are they used? (concept index, concept search)
- Overlap analysis: compare:
  - Do OpenAlex topics cover the same ground as biblio concepts?
  - Are biblio's LLM-extracted concepts more specific/useful for research workflows?
  - Cost: LLM calls per paper vs free metadata from OpenAlex
  - Could biblio use OpenAlex topics as a baseline and LLM concepts as enrichment?
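
One way the overlap question could be made measurable is a per-paper label comparison. The sketch below is a minimal illustration, not biblio's or OpenAlex's actual code. It assumes each entry in a work's `topics`, `concepts`, and `keywords` lists carries a `display_name` key, and that a biblio concept record is a mapping from category to a list of terms. The helper names (`normalize`, `openalex_labels`, `biblio_labels`, `jaccard`) are hypothetical.

```python
from __future__ import annotations

import re


def normalize(term: str) -> str:
    """Lowercase and collapse punctuation/whitespace so labels compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def openalex_labels(work: dict) -> set[str]:
    """Collect labels from the topics/concepts/keywords lists of a work record.

    Assumption: each entry is a dict with a `display_name` key.
    """
    labels: set[str] = set()
    for field_name in ("topics", "concepts", "keywords"):
        for entry in work.get(field_name) or []:
            name = entry.get("display_name")
            if name:
                labels.add(normalize(name))
    return labels


def biblio_labels(concepts: dict[str, list[str]]) -> set[str]:
    """Flatten a hypothetical biblio record (category -> list of terms)."""
    return {normalize(t) for terms in concepts.values() for t in terms}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of labels common to both sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up example records, for illustration only.
work = {
    "topics": [{"display_name": "Topic Modeling"}],
    "concepts": [{"display_name": "Computer science"}],
    "keywords": [{"display_name": "Transformer"}],
}
concepts = {
    "methods": ["transformer"],
    "datasets": ["ImageNet"],
    "domains": ["Computer Science"],
}

overlap = jaccard(openalex_labels(work), biblio_labels(concepts))
print(f"label overlap: {overlap:.2f}")
```

Exact string matching after normalization will understate semantic overlap (for example, a broad topic versus a specific method name), so a score like this is better read as a lower bound than as a verdict on redundancy.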
### Output
Write findings to `docs/specs/biblio/concept-topic-overlap.md` with:
- Side-by-side comparison table
- Recommendation: replace, complement, or keep separate
- If complement: how to integrate OpenAlex topics into biblio's data model
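
If the recommendation turns out to be "complement", one possible shape for a combined record is sketched below. This is an illustration under stated assumptions, not biblio's existing data model: the class names `TopicRef` and `PaperConcepts`, their fields, and the `needs_llm` helper are all hypothetical, and the example values are made up.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class TopicRef:
    """One OpenAlex topic attached to a paper (hypothetical shape)."""
    display_name: str
    score: Optional[float] = None


@dataclass
class PaperConcepts:
    """Baseline OpenAlex topics plus optional LLM-extracted enrichment."""
    paper_id: str
    openalex_topics: List[TopicRef] = field(default_factory=list)
    # category name (methods, datasets, ...) -> list of extracted terms
    llm_concepts: Dict[str, List[str]] = field(default_factory=dict)

    def needs_llm(self) -> bool:
        """True while no LLM enrichment has been stored for this paper."""
        return not self.llm_concepts

    def to_json(self) -> str:
        """Serialize the whole record, nested topics included."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Made-up example: baseline topics first, LLM concepts added later.
record = PaperConcepts(
    paper_id="example-paper",
    openalex_topics=[TopicRef("Topic Modeling", 0.98)],
)
pending_before = record.needs_llm()
record.llm_concepts["methods"] = ["latent Dirichlet allocation"]
pending_after = record.needs_llm()
print(record.to_json())
```

Keeping the two sources in separate fields means the free OpenAlex baseline can be populated for every paper, while the LLM pass stays optional and can be skipped or deferred per paper.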
### Key files
- `packages/biblio/src/biblio/concepts.py`
- `packages/biblio/src/biblio/autotag.py` (also uses LLM for tagging)
- `.projio/codio/mirrors/ourresearch--openalex-concept-tagging/` (indexed in RAG)
- `.projio/codio/mirrors/ourresearch--openalex-docs/` (indexed in RAG)
### Related Notes
- issue-arash-20260403-193002-484673.md — Companion audit: both examine biblio vs OpenAlex capabilities, likely created in the same investigation session
- issue-arash-20260402-220025-468258.md — Specifies the new bib architecture — context for where concept/topic outputs fit in the biblio storage layout
- issue-arash-20260402-220152-539138.md — biblio_compile tool design — relevant if OpenAlex topics are to be incorporated into the compilation pipeline
- issue-arash-20260402-015659-415628.md — Batch docling processing — docling extracts paper content that feeds biblio concept extraction, directly upstream of the overlap analysis