
Issue arash 20260403 193112 105596


title: "## Spec: biblio enrichment pipeline redesign"
status: done
created: 2026-04-03
updated: 2026-04-03
timestamp: 20260403-193112-105596
tags: [issue]
source: agent-observation
project_primary: projio
capture_id: 20260403-193110-0173cb
confidence: 1.0
transcript_file: /storage2/arash/worklog/workflow/captures/20260403-193110-0173cb/transcript.txt


Spec: biblio enrichment pipeline redesign

Based on the audit of biblio's OpenAlex integration (see openalex-audit.md once available), design an improved enrichment pipeline that makes fuller use of OpenAlex metadata.

Current pipeline

srcbib/*.bib → biblio_merge → openalex_resolve (DOI/title → OpenAlex ID) → graph_expand (references/citing)

The current pipeline extracts only: DOI, title, year, authors, cited_by_count, OA status, referenced_works.
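As a rough illustration of that field set, here is a minimal sketch that assumes each resolved record is an OpenAlex work already parsed into a dict. The function name `extract_current_fields` and the sample values are hypothetical and are not taken from openalex_resolve.py; the keys (`publication_year`, `authorships`, `open_access`, and so on) follow the OpenAlex work object as it is commonly documented and should be checked against the live API.

```python
from typing import Any


def extract_current_fields(work: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields the current pipeline is described as extracting."""
    # Each authorship entry nests the author under an "author" key.
    names = [
        (entry.get("author") or {}).get("display_name")
        for entry in work.get("authorships") or []
    ]
    return {
        "doi": work.get("doi"),
        "title": work.get("title"),
        "year": work.get("publication_year"),
        "authors": [name for name in names if name],
        "cited_by_count": work.get("cited_by_count", 0),
        "oa_status": (work.get("open_access") or {}).get("oa_status"),
        "referenced_works": list(work.get("referenced_works") or []),
    }


# Hand-written sample shaped like an OpenAlex work; all values are illustrative.
sample_work = {
    "id": "https://openalex.org/W0000000001",
    "doi": "https://doi.org/10.0000/example",
    "title": "An example paper",
    "publication_year": 2021,
    "authorships": [
        {"author": {"display_name": "A. Author"}},
        {"author": {"display_name": "B. Author"}},
    ],
    "cited_by_count": 12,
    "open_access": {"is_oa": True, "oa_status": "gold"},
    "referenced_works": ["https://openalex.org/W0000000002"],
    "topics": [{"display_name": "Example topic", "score": 0.9}],
}

record = extract_current_fields(sample_work)
```

In this sketch the `topics` entry in the sample is simply dropped, which is the gap the proposed additions are meant to close.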

Proposed pipeline additions

Design specs for:

  1. Topic enrichment — After resolution, extract OpenAlex topics and persist them. Map to biblio's tag vocabulary where possible. Consider replacing or supplementing LLM concept extraction.

  2. Author model enrichment — Currently AuthorRecord is thin (name, affiliation, h-index). Spec what biblio should persist from OpenAlex's rich author data:

       • Affiliations over time
       • Topic profile
       • Co-author network (for lab discovery)
       • ORCID linkage

  3. Citation trend enrichment — OpenAlex provides counts_by_year. Spec how to store and surface citation trajectories (useful for identifying rising/declining papers).

  4. Funder/grant enrichment — OpenAlex links works to funders. Spec whether biblio should track this (useful for grant reporting).
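The work-level additions above (topics, citation trends, funders) could be prototyped along these lines. This is a sketch under stated assumptions, not a proposed final data model: `WorkEnrichment`, `enrich_work`, and `citation_slope` are hypothetical names, the `topics` and `counts_by_year` keys follow the OpenAlex work object as commonly documented, and the `grants` / `funder_display_name` keys are an assumption that may differ between API versions. The trend measure is a plain least-squares slope, chosen here only as one simple way to flag rising or declining papers.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkEnrichment:
    """Hypothetical container for extra per-work metadata."""

    topics: list[tuple[str, float]] = field(default_factory=list)
    citations_by_year: dict[int, int] = field(default_factory=dict)
    funders: list[str] = field(default_factory=list)

    def citation_slope(self) -> float:
        """Least-squares slope of citations per year; positive means rising."""
        points = sorted(self.citations_by_year.items())
        n = len(points)
        if n < 2:
            return 0.0  # Not enough data to estimate a trend.
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        num = sum((x - mean_x) * (y - mean_y) for x, y in points)
        den = sum((x - mean_x) ** 2 for x, _ in points)
        return num / den  # den > 0 because years are distinct dict keys.


def enrich_work(work: dict[str, Any], min_topic_score: float = 0.0) -> WorkEnrichment:
    """Collect topics, yearly citation counts, and funder names from one work."""
    topics = [
        (t["display_name"], float(t.get("score", 0.0)))
        for t in work.get("topics") or []
        if t.get("display_name") and float(t.get("score", 0.0)) >= min_topic_score
    ]
    citations = {
        int(c["year"]): int(c["cited_by_count"])
        for c in work.get("counts_by_year") or []
    }
    # Assumed key names; deduplicate and sort for stable storage.
    funders = sorted(
        {
            g["funder_display_name"]
            for g in work.get("grants") or []
            if g.get("funder_display_name")
        }
    )
    return WorkEnrichment(topics=topics, citations_by_year=citations, funders=funders)


# Hand-written sample; all values are illustrative.
sample = {
    "topics": [
        {"display_name": "Topic A", "score": 0.95},
        {"display_name": "Topic B", "score": 0.40},
    ],
    "counts_by_year": [
        {"year": 2024, "cited_by_count": 8},
        {"year": 2023, "cited_by_count": 5},
        {"year": 2022, "cited_by_count": 2},
    ],
    "grants": [
        {"funder_display_name": "Example Funder", "award_id": None},
        {"funder_display_name": "Example Funder", "award_id": "X-1"},
    ],
}

enriched = enrich_work(sample, min_topic_score=0.5)
slope = enriched.citation_slope()
```

Mapping the extracted topic names onto biblio's tag vocabulary, and the author-level fields (affiliations over time, co-author network, ORCID), are left out of this sketch because they depend on decisions the spec itself still has to make.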

Output

Write the spec to docs/specs/biblio/enrichment-pipeline.md. For each proposed addition:

  • What data is available from OpenAlex
  • Where it would be stored in biblio's data model
  • Which MCP tools would expose it
  • Priority (must-have vs nice-to-have)

Reference the discovery model spec at packages/biblio/docs/explanation/discovery.md for the overall philosophy.

Key files

  • packages/biblio/src/biblio/openalex/openalex_resolve.py (current resolution pipeline)
  • packages/biblio/src/biblio/graph.py (current graph expansion)
  • packages/biblio/src/biblio/author_search.py (current author model)
  • packages/biblio/src/biblio/library.py (library ledger — where metadata lives)
  • docs/specs/biblio/bib-architecture.md (current architecture spec)