OpenAlex API P1 fixes from audit
Goal
(promoted from note)
Context
(see source note)
Prompt
Fix the issue described below (source: /storage2/arash/projects/projio/docs/log/issue/issue-arash-20260404-014840-332631.md). Understand the problem, then implement the proposed fix.
Implement: OpenAlex API P1 fixes from audit
Apply all P1 findings from docs/specs/biblio/openalex-audit.md.
Tasks
- Batch DOI lookups via OR filter in `openalex_client.py`:
  - Add `get_works_by_dois(dois: list[str])` that uses `filter=doi:doi1|doi2|...|doi50`
  - Batch into groups of 50 (OpenAlex limit)
  - Use in `openalex_resolve.py` to replace one-at-a-time resolution
- Default `per_page=200` for bulk operations:
  - In `openalex_client.py` `OpenAlexClientConfig`, change default from 25 to 200
  - The API max is 200, and biblio's bulk paths (graph expand, author works, institution works) all paginate anyway
- Extract `type` field in `_extract_work` (`author_search.py`):
  - Add `type: str | None` to `WorkRecord` dataclass
  - Extract from `data.get("type")`, with values like "article", "book", "dataset", "preprint"
- Extract `is_retracted` field in `_extract_work`:
  - Add `is_retracted: bool` to `WorkRecord`
  - Extract from `data.get("is_retracted", False)`
- Extract `topics` in `_extract_work`:
  - Add `topics: list[dict]` to `WorkRecord` (or a `TopicRecord` dataclass)
  - Extract `primary_topic` + `topics[]` with hierarchy (domain/field/subfield/topic) and scores
  - This data is already fetched via `DEFAULT_SELECT` but discarded
- Full affiliations history for `_extract_author`:
  - Add `affiliations: list[dict]` to `AuthorRecord`
  - Extract from `data.get("affiliations")`, a list of `{institution, years}`
  - Keep `affiliation` (singular) as the last_known for backwards compat
- Cache author/institution lookups in `openalex_cache.py`:
  - Add `author/{hash}.json` and `institution/{hash}.json` cache paths
  - Wire into `get_author()` and `get_institution()` in client
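The batching described in the first task can be sketched as below. This is a minimal illustration and not the code in `openalex_client.py`: the helper names `normalize_doi` and `doi_filter_batches` are hypothetical, the DOI normalization is an assumption, and the batch size of 50 is taken from the task list rather than checked against the API. The sketch only builds query parameters; sending the requests is left to the client.

```python
from typing import Iterator

BATCH_SIZE = 50  # OR-filter limit as stated in the task list


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a resolver prefix (illustrative normalization)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def doi_filter_batches(
    dois: list[str], batch_size: int = BATCH_SIZE
) -> Iterator[dict[str, str]]:
    """Yield one dict of query parameters per batch of DOIs."""
    # Deduplicate while preserving order so each DOI is requested once.
    unique = list(dict.fromkeys(normalize_doi(d) for d in dois if d.strip()))
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        yield {
            # Values joined with "|" are OR-ed within a single filter.
            "filter": "doi:" + "|".join(batch),
            # Ask for at least one result per DOI so a batch fits on one page.
            "per-page": str(batch_size),
        }
```

A caller would issue one request to the works endpoint per yielded dict and index the returned works by DOI, falling back to single lookups only for DOIs that did not come back.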
Key files
- packages/biblio/src/biblio/openalex/openalex_client.py
- packages/biblio/src/biblio/openalex/openalex_cache.py
- packages/biblio/src/biblio/openalex/openalex_resolve.py
- packages/biblio/src/biblio/author_search.py
- packages/biblio/src/biblio/discovery.py
- packages/biblio/src/biblio/mcp.py (update return dicts to include new fields)
- docs/specs/biblio/openalex-audit.md (source of requirements)
Related Notes
- issue-arash-20260403-193002-484673.md — This is the OpenAlex API audit that generated the P1 findings being implemented here
- issue-arash-20260403-193112-105596.md — Enrichment pipeline redesign is directly affected by the WorkRecord/AuthorRecord field additions and batch DOI lookup changes
- issue-arash-20260403-193037-589959.md — Topics extraction task (item 5) overlaps with the concept tagging vs OpenAlex topic classification audit
- issue-arash-20260403-210033-704498.md — GUI outbound links depend on type, is_retracted, and topics fields being extracted from OpenAlex
- issue-arash-20260403-205942-590539.md — Pool promote likely uses author/institution lookups that would benefit from the caching fix in item 7
Acceptance Criteria
- [ ]
Result
Completed in 321.3s (1 steps, model: opus)
Session: ed6dcb96-5d7a-4e5e-ae13-1cbd379ada05
Result note: /storage2/arash/worklog/workflow/captures/20260404-015603-3118f6/note.md
Output
Good — `discovery.py` is a new file (untracked), so it wouldn't show in diff. All changes look correct.
Here's a summary of all 7 P1 fixes implemented:
1. **Batch DOI lookups** — Added `get_works_by_dois()` to `OpenAlexClient` using `filter=doi:d1|d2|...|d50` with batches of 50. Integrated into `resolve_srcbib_to_openalex` as a pre-resolution phase that populates the cache before per-entry processing.
2. **Default `per_page=200`** — Changed default in `openalex_config_from_mapping` from 25 to 200 (OpenAlex API max).
3. **Extract `type` field** — Added `type: str | None` to `WorkRecord`, extracted from `data.get("type")` in `_extract_work`. Added `type` to `DEFAULT_SELECT`.
4. **Extract `is_retracted` field** — Added `is_retracted: bool` to `WorkRecord`, extracted from `data.get("is_retracted", False)`. Added `is_retracted` to `DEFAULT_SELECT`.
5. **Extract `topics`** — Added `topics: list[dict] | None` to `WorkRecord`, extracting full hierarchy (domain/field/subfield/topic with scores) from both `topics` and `primary_topic`. Added `primary_topic` to `DEFAULT_SELECT`.
6. **Full affiliations history** — Added `affiliations: list[dict] | None` to `AuthorRecord`, extracting `{institution_id, institution_name, years}` from `data.get("affiliations")`. Kept `affiliation` (singular) for backwards compat.
7. **Cache author/institution lookups** — Added `path_for_author()` and `path_for_institution()` to `OpenAlexCache`. Wired caching into `get_author_by_id()` and `get_institution_by_id()` in `discovery.py`. Updated MCP callers to pass cache.
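
The caching in item 7 can be pictured roughly as follows. This is a sketch under assumptions, not the actual `OpenAlexCache`: the class `EntityCache`, its method names, and the choice of a truncated SHA-256 digest for `{hash}` are all hypothetical. Only the `author/{hash}.json` and `institution/{hash}.json` layout comes from the task list.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class EntityCache:
    """Hypothetical file cache keyed by a hash of the OpenAlex ID."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def path_for(self, kind: str, entity_id: str) -> Path:
        # kind is "author" or "institution"; the digest length is arbitrary.
        digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()[:16]
        return self.root / kind / f"{digest}.json"

    def get_or_fetch(
        self, kind: str, entity_id: str, fetch: Callable[[str], dict]
    ) -> dict:
        """Return the cached payload, calling fetch only on a cache miss."""
        path = self.path_for(kind, entity_id)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        data = fetch(entity_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data), encoding="utf-8")
        return data


if __name__ == "__main__":
    def fake_fetch(entity_id: str) -> dict:
        # Stand-in for a network call to the API.
        return {"id": entity_id, "display_name": "Example"}

    with tempfile.TemporaryDirectory() as tmp:
        cache = EntityCache(Path(tmp))
        cache.get_or_fetch("author", "A123", fake_fetch)  # miss: writes file
        cache.get_or_fetch("author", "A123", fake_fetch)  # hit: reads file
```

Passing the fetch function in as an argument keeps the cache independent of the HTTP client, which also makes the hit and miss paths easy to unit test.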
All new fields are propagated through MCP return dicts. The `get_author_works_by_position` select string was also updated to include the new fields.
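
As a rough illustration of the field extraction in items 3 to 6, the sketch below flattens the topic hierarchy and the affiliation history from raw API payloads. It is not the project's `_extract_work` or `_extract_author`: `WorkFields`, `extract_work_fields`, `extract_affiliations`, and the `is_primary` flag are hypothetical names, and the payload shape (nested `subfield`/`field`/`domain` objects with `display_name`, and `affiliations` entries with `institution` and `years`) is assumed from the OpenAlex response format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkFields:
    """Illustrative subset of a work record; not the project's WorkRecord."""
    type: str | None = None
    is_retracted: bool = False
    topics: list[dict[str, Any]] = field(default_factory=list)


def _flatten_topic(topic: dict[str, Any], is_primary: bool) -> dict[str, Any]:
    def name(level: str) -> str | None:
        # Each hierarchy level is a nested object; tolerate missing levels.
        return (topic.get(level) or {}).get("display_name")

    return {
        "id": topic.get("id"),
        "topic": topic.get("display_name"),
        "subfield": name("subfield"),
        "field": name("field"),
        "domain": name("domain"),
        "score": topic.get("score"),
        "is_primary": is_primary,
    }


def extract_work_fields(data: dict[str, Any]) -> WorkFields:
    primary = data.get("primary_topic") or {}
    primary_id = primary.get("id")
    topics = [
        _flatten_topic(t, primary_id is not None and t.get("id") == primary_id)
        for t in data.get("topics") or []
    ]
    # If the primary topic is not repeated in topics[], keep it anyway.
    if primary and not any(t["is_primary"] for t in topics):
        topics.insert(0, _flatten_topic(primary, True))
    return WorkFields(
        type=data.get("type"),
        is_retracted=bool(data.get("is_retracted", False)),
        topics=topics,
    )


def extract_affiliations(data: dict[str, Any]) -> list[dict[str, Any]]:
    out = []
    for aff in data.get("affiliations") or []:
        inst = aff.get("institution") or {}
        out.append({
            "institution_id": inst.get("id"),
            "institution_name": inst.get("display_name"),
            "years": list(aff.get("years") or []),
        })
    return out
```

Wrapping `is_retracted` in `bool(...)` guards against an explicit `null` in the payload, which `data.get("is_retracted", False)` alone would pass through as `None`.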
## Next Steps
1. Run `cd packages/biblio && make test` to verify all existing tests pass with the new dataclass fields.
2. Update the `docs/specs/biblio/openalex-audit.md` to mark P1 items as resolved.
3. Add unit tests for `get_works_by_dois` batch logic and the new `WorkRecord`/`