OpenAlex API P1 fixes from audit

Goal

(promoted from note)

Context

(see source note)

Prompt

Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-014840-332631.md). Understand the problem, then implement the proposed fix.

Implement: OpenAlex API P1 fixes from audit

Apply all P1 findings from docs/specs/biblio/openalex-audit.md.

Tasks

1. Batch DOI lookups via OR filter in openalex_client.py:
   - Add get_works_by_dois(dois: list[str]) that uses filter=doi:doi1|doi2|...|doi50
   - Batch into groups of 50 (OpenAlex limit)
   - Use in openalex_resolve.py to replace one-at-a-time resolution
2. Default per_page=200 for bulk operations:
   - In openalex_client.py OpenAlexClientConfig, change default from 25 to 200
   - The API max is 200, and biblio's bulk paths (graph expand, author works, institution works) all paginate anyway
3. Extract type field in _extract_work (author_search.py):
   - Add type: str | None to WorkRecord dataclass
   - Extract from data.get("type"); values include "article", "book", "dataset", "preprint"
4. Extract is_retracted field in _extract_work:
   - Add is_retracted: bool to WorkRecord
   - Extract from data.get("is_retracted", False)
5. Extract topics in _extract_work:
   - Add topics: list[dict] to WorkRecord (or a TopicRecord dataclass)
   - Extract primary_topic + topics[] with hierarchy (domain/field/subfield/topic) and scores
   - This data is already fetched via DEFAULT_SELECT but discarded
6. Full affiliations history for _extract_author:
   - Add affiliations: list[dict] to AuthorRecord
   - Extract from data.get("affiliations"), a list of {institution, years}
   - Keep affiliation (singular) as the last_known for backwards compat
7. Cache author/institution lookups in openalex_cache.py:
   - Add author/{hash}.json and institution/{hash}.json cache paths
   - Wire into get_author() and get_institution() in client
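
The batching in the first task can be sketched as a pure function that turns a DOI list into one filter string per group of 50. This is only an illustration, not the code in openalex_client.py: `normalize_doi`, `doi_filter_batches`, and `OPENALEX_OR_LIMIT` are hypothetical names, and the prefix stripping and de-duplication are assumptions about what a caller would want. A real request would also need a per_page at least as large as the batch so that all matches fit on one page.

```python
from typing import Iterator

OPENALEX_OR_LIMIT = 50  # batch size taken from the task list above


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL/scheme prefixes (hypothetical helper)."""
    d = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if d.startswith(prefix):
            d = d[len(prefix):]
    return d


def doi_filter_batches(
    dois: list[str], batch_size: int = OPENALEX_OR_LIMIT
) -> Iterator[str]:
    """Yield one `doi:a|b|c` filter string per batch of at most `batch_size` DOIs."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(normalize_doi(d) for d in dois if d.strip()))
    for start in range(0, len(unique), batch_size):
        yield "doi:" + "|".join(unique[start:start + batch_size])


# Usage: 120 DOIs become three filter strings (50 + 50 + 20 values).
filters = list(doi_filter_batches([f"10.1234/x{i}" for i in range(120)]))
```

Each yielded string would be passed as the `filter` query parameter of one works request, so 120 DOIs cost three requests instead of 120.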

Key files

packages/biblio/src/biblio/openalex/openalex_client.py
packages/biblio/src/biblio/openalex/openalex_cache.py
packages/biblio/src/biblio/openalex/openalex_resolve.py
packages/biblio/src/biblio/author_search.py
packages/biblio/src/biblio/discovery.py
packages/biblio/src/biblio/mcp.py — update return dicts to include new fields
docs/specs/biblio/openalex-audit.md — source of requirements

issue-arash-20260403-193002-484673.md — This is the OpenAlex API audit that generated the P1 findings being implemented here
issue-arash-20260403-193112-105596.md — Enrichment pipeline redesign is directly affected by the WorkRecord/AuthorRecord field additions and batch DOI lookup changes
issue-arash-20260403-193037-589959.md — Topics extraction task (item 5) overlaps with the concept tagging vs OpenAlex topic classification audit
issue-arash-20260403-210033-704498.md — GUI outbound links depend on type, is_retracted, and topics fields being extracted from OpenAlex
issue-arash-20260403-205942-590539.md — Pool promote likely uses author/institution lookups that would benefit from the caching fix in item 7

Acceptance Criteria

[ ]

Result

Completed in 321.3s (1 steps, model: opus)

Session: ed6dcb96-5d7a-4e5e-ae13-1cbd379ada05

Result note: /storage2/arash/worklog/workflow/captures/20260404-015603-3118f6/note.md

Output

`discovery.py` is a new, untracked file, so it does not appear in the diff. All changes look correct.

Here's a summary of all 7 P1 fixes implemented:

1. **Batch DOI lookups** — Added `get_works_by_dois()` to `OpenAlexClient` using `filter=doi:d1|d2|...|d50` with batches of 50. Integrated into `resolve_srcbib_to_openalex` as a pre-resolution phase that populates the cache before per-entry processing.

2. **Default `per_page=200`** — Changed default in `openalex_config_from_mapping` from 25 to 200 (OpenAlex API max).

3. **Extract `type` field** — Added `type: str | None` to `WorkRecord`, extracted from `data.get("type")` in `_extract_work`. Added `type` to `DEFAULT_SELECT`.

4. **Extract `is_retracted` field** — Added `is_retracted: bool` to `WorkRecord`, extracted from `data.get("is_retracted", False)`. Added `is_retracted` to `DEFAULT_SELECT`.

5. **Extract `topics`** — Added `topics: list[dict] | None` to `WorkRecord`, extracting full hierarchy (domain/field/subfield/topic with scores) from both `topics` and `primary_topic`. Added `primary_topic` to `DEFAULT_SELECT`.

6. **Full affiliations history** — Added `affiliations: list[dict] | None` to `AuthorRecord`, extracting `{institution_id, institution_name, years}` from `data.get("affiliations")`. Kept `affiliation` (singular) for backwards compat.

7. **Cache author/institution lookups** — Added `path_for_author()` and `path_for_institution()` to `OpenAlexCache`. Wired caching into `get_author_by_id()` and `get_institution_by_id()` in `discovery.py`. Updated MCP callers to pass cache.
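
As a rough illustration of item 7, the sketch below shows one way a hashed cache path and a read-through lookup could fit together. It is not the `OpenAlexCache` implementation: `cache_path` and `get_cached` are hypothetical names, and the SHA-256 prefix, the ID normalisation, and the `fetch` callable are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_path(root: Path, kind: str, entity_id: str) -> Path:
    """Return `<root>/<kind>/<hash>.json` for an author or institution ID."""
    if kind not in {"author", "institution"}:
        raise ValueError(f"unsupported cache kind: {kind!r}")
    # Normalise so "https://openalex.org/A123" and "a123" share one cache entry.
    key = entity_id.strip().rsplit("/", 1)[-1].upper()
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return root / kind / f"{digest}.json"


def get_cached(
    root: Path, kind: str, entity_id: str, fetch: Callable[[str], dict]
) -> dict:
    """Return the cached JSON if present; otherwise call `fetch` and store the result."""
    path = cache_path(root, kind, entity_id)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(entity_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Passing the network call in as `fetch` keeps the cache logic testable without touching the API; a second lookup for the same ID is served from disk.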

All new fields are propagated through MCP return dicts. The `get_author_works_by_position` select string was also updated to include the new fields.
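
To make the new work fields concrete, here is a minimal sketch of the extraction described in items 3 to 5. It is not the project's `_extract_work`: `flatten_topic` and `extract_new_fields` are hypothetical names, the output keys are illustrative, and the input shape (nested `subfield`/`field`/`domain` objects with a `display_name`, plus a `score` on each topic) is an assumption about the OpenAlex work payload.

```python
from typing import Any, Optional


def flatten_topic(topic: dict[str, Any]) -> dict[str, Any]:
    """Reduce one topic object to a name per hierarchy level plus its score."""
    def name(level: str) -> Optional[str]:
        # Each hierarchy level is assumed to be a nested object, possibly missing.
        return (topic.get(level) or {}).get("display_name")

    return {
        "id": topic.get("id"),
        "topic": topic.get("display_name"),
        "subfield": name("subfield"),
        "field": name("field"),
        "domain": name("domain"),
        "score": topic.get("score"),
    }


def extract_new_fields(data: dict[str, Any]) -> dict[str, Any]:
    """Pull type, is_retracted, primary_topic and topics out of a raw work dict."""
    primary = data.get("primary_topic")
    return {
        "type": data.get("type"),
        "is_retracted": bool(data.get("is_retracted", False)),
        "primary_topic": flatten_topic(primary) if primary else None,
        # `or []` covers both a missing key and an explicit null.
        "topics": [flatten_topic(t) for t in data.get("topics") or []],
    }
```

Missing or null fields fall back to `None`, `False`, or an empty list, so records fetched before the select string was widened would still deserialize.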

## Next Steps
1. Run `cd packages/biblio && make test` to verify all existing tests pass with the new dataclass fields.
2. Update the `docs/specs/biblio/openalex-audit.md` to mark P1 items as resolved.
3. Add unit tests for `get_works_by_dois` batch logic and the new `WorkRecord`/`