ChatGPT deep research

Agentic & Programmatic Scientific Figure Systems: Toward Figures as Compilable Programs

Reframing academic figures as programmable build artifacts

A modern academic “figure” is rarely a single plot. It is typically a compiled artifact produced by coordinating heterogeneous steps: data shaping, statistical analysis, per-panel rendering, multi-panel layout, global labeling, and export into manuscript-ready formats (PDF/SVG/PNG). The recurring pain point is that only part of this pipeline is treated as code; the “last mile” of composition and annotation is often handled as an opaque, manual editing session. This is one reason recent agentic figure papers repeatedly describe figure-making as a time-consuming, largely manual bottleneck and target “publication-ready” outputs directly from paper text.

Treating figures as programs that compile into visual artifacts changes what “good tools” look like. Instead of optimizing only for expressiveness in plotting, the goal becomes a full-stack system with:

- A specification layer (what the figure is, structurally and semantically).
- A deterministic build (how to produce it idempotently from inputs).
- Inspectable intermediate representations (so failures are debuggable and an agent can operate safely).
- Stable exports (so the same spec yields reliable PDF/HTML/PNG artifacts across platforms and time).

This framing is already implicit in mature plotting stacks: for example, Matplotlib explicitly models figures as a hierarchy of drawable objects (“Artists”), rendered through a backend canvas/renderer model, and can export to both raster and vector outputs. The gap is not “can we draw?” but “can we compose and maintain publication figures as rebuildable systems?”

Landscape of figure systems through the lens of composition and reproducibility

The landscape becomes clearer when tools are grouped by where they sit in the compilation chain and whether they expose structure that can be programmatically transformed.

Code-first visualization systems: strong panel programs, weak figure programs

Code-first libraries excel at producing panels reproducibly, but struggle to represent the whole figure (multi-panel composition + figure-wide semantics) as a first-class program.

Matplotlib / seaborn / Plotly (Python). Matplotlib’s internal contract is explicit object structure: almost everything visible is an “Artist,” typically owned by an Axes; many Artists are not designed to be shared across Axes. It provides layout helpers (GridSpec, tight_layout, constrained_layout, SubFigure, subplot_mosaic) aimed at arranging Axes. But the documentation is candid about the limits: tight_layout may not converge and can vary across calls; manual GridSpec adjustments often require iterative tweaking to prevent overlaps; and some manual adjustments are not compatible with automated layout modes. These are not mere usability issues—they reveal that layout is not encoded as a stable, constraint-satisfiable figure specification, but as a post-hoc, renderer-dependent negotiation.
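The newer helpers do offer partial structure. A minimal sketch (assuming Matplotlib ≥ 3.6 for the `layout=` keyword) using `subplot_mosaic`, which at least names panels rather than indexing them positionally:

```python
# Sketch: semantic subplot layout with subplot_mosaic.
# Panel names ("A", "B", "C") act as stable keys into the returned dict,
# which is the closest Matplotlib gets to addressable figure structure.
import matplotlib
matplotlib.use("Agg")  # headless backend for reproducible builds
import matplotlib.pyplot as plt

fig, axs = plt.subplot_mosaic(
    [["A", "A"],
     ["B", "C"]],
    layout="constrained",   # constraint-style spacing instead of tight_layout
    figsize=(6, 4),
)
axs["A"].plot([0, 1], [0, 1])
axs["B"].scatter([0, 1], [1, 0])
axs["C"].bar(["x", "y"], [2, 3])
for name, ax in axs.items():
    ax.set_title(f"Panel {name}")
fig.savefig("figure.svg")   # vector export; PDF/PNG also supported
```

Even here, the mosaic string encodes topology but not constraints: there is no declarative way to assert, say, that panel B's tick labels must baseline-align with panel C's.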

Seaborn sits on top of Matplotlib and inherits this architecture: it simplifies statistical plotting but does not introduce a new figure-composition IR; it remains primarily a panel generator in a Matplotlib figure context.

Plotly is closer to a declarative figure representation: figures are trees of attributes serialized to JSON, with a published JSON schema and tooling to write figures as JSON. However, Plotly’s own docs note a key reproducibility caveat: JSON produced by one Plotly.py version may not be compatible with another. That statement is important: even a declarative spec does not guarantee reproducibility unless the compiler/runtime (and its schema) is versioned as part of the build.
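One mitigation is to treat the compiler version as part of the spec itself. A stdlib-only sketch (field names such as `engine_version` are hypothetical conventions, not Plotly API):

```python
# Sketch: treating the spec *and* its compiler version as one build input.
# Field names ("engine", "engine_version", "schema") are hypothetical.
import json, hashlib

figure_spec = {
    "engine": "plotly",          # which compiler renders this spec
    "engine_version": "5.24.1",  # pinned: JSON from one version may not load in another
    "schema": "figure-spec/v1",
    "data": [{"type": "scatter", "x": [1, 2, 3], "y": [4, 1, 9]}],
    "layout": {"title": {"text": "Panel A"}},
}

# The build key covers spec + compiler version, so bumping the engine
# invalidates cached artifacts just like changing the data would.
canonical = json.dumps(figure_spec, sort_keys=True).encode()
build_key = hashlib.sha256(canonical).hexdigest()[:16]
```

The design point is that version drift becomes a cache miss rather than a silent rendering change.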

ggplot2 (R) and the “composition add-ons.” ggplot2 operationalizes a layered grammar-of-graphics approach (widely documented in the ggplot2 literature and references), which makes panels composable as layered specifications rather than imperative drawing commands. Yet multi-panel composition is explicitly outside ggplot2’s core concern: patchwork’s own package documentation states that ggplot2 “does not concern itself with composition of multiple plots,” motivating patchwork as an expansion layer for assembling complex compositions. Cowplot similarly focuses on alignment concerns (and highlights that it can align plots independently of arrangement).

This repeated pattern—core plotting system + external composition layer—signals the architectural fault line: panels have rich semantics, while figure-level layout is bolted on as a separate domain with weaker formal guarantees.

LaTeX/TikZ/PGFPlots ecosystems. TikZ/PGF are explicitly positioned as a “TeX-approach to typesetting” for graphics rather than an interactive drawing program, emphasizing precise positioning and macro-driven reuse. PGFPlots adds a plotting DSL on top of TikZ, and provides structured multi-plot grouping (e.g., the groupplots library for matrix-like arrangements). This stack demonstrates that fully code-defined vector figures are possible, but it also illustrates a tradeoff: the compiler is TeX itself, the programming model is macro-oriented, and integrating heterogeneous outputs (external plots, icons, raster images) often reintroduces ad-hoc glue at the figure level.

Vector/GUI composition systems: perfect control, absent provenance

GUI vector editors (e.g., Inkscape/Illustrator-class workflows) are optimized for manual control of the final artifact: alignment nudging, typographic tweaking, and bespoke annotation. Their core weakness is that the “figure program” is implicit in human actions rather than a stable, replayable spec.

SVG is an instructive middle ground. The SVG 2 specification describes SVG as an XML-based language for 2D vector and mixed vector/raster graphics, and defines an explicit document structure with structural elements such as <svg>, <g>, <defs>, and <use>. This means SVG is machine-editable (a DOM tree), and can serve as an intermediate representation (IR). But SVG’s structure is predominantly geometric and presentational; it does not, by itself, encode the upstream semantics of “this axis corresponds to variable X” or “this arrow represents a causal dependency extracted from method text.”

Inkscape makes the tension explicit: it uses SVG as its native file format and extends it with application-specific elements/attributes in its own namespace. That is valuable for editing, but it also means a figure’s final state can depend on editor-specific metadata and invisible conventions—hard for reproducible builds and risky for automated transformations.

Hybrid workflows (Matplotlib → SVG → vector editor) look like an escape hatch but expose another systemic mismatch: generated SVG can become “pathological” for interactive editing (e.g., huge scatterplots leading to unusable editing performance), and interoperability issues often surface at precisely the point where one wants to refine a figure.

Emerging agentic/AI systems: toward structured programs, but not consistently

Recent work such as SciFig and AutoFigure explicitly targets publication-ready scientific figures from paper text using multi-stage agentic pipelines, iterative refinement loops, and evaluation rubrics. A key differentiator from generic text-to-image is that these systems emphasize editability and structure (modules, components, connections) rather than only pixel realism.

However, a central question remains (and will matter for “agent-operable systems”): do these tools emit structured figure programs (hierarchical specs with stable identifiers, constraints, and semantics), or do they mainly emit images with some post-hoc vector overlays? AutoFigure, for example, explicitly separates a symbolic layout stage (SVG/HTML-like blueprint) from a later aesthetic rendering stage. That separation is promising—but it also highlights that “program” and “final look” may be compiled by different engines with different guarantees.


The missing layer between plotting and publication figures

Across paradigms, the recurring gap is the figure-composition layer: the part that turns multiple panels into a coherent, journal-ready figure with consistent alignment and semantics.

Evidence of the gap appears directly in the tool ecosystem:

- ggplot2’s ecosystem requires dedicated add-ons because composition is not a concern of the base plotting API; patchwork’s documentation states this explicitly.
- Matplotlib provides multiple layout mechanisms but notes fundamental limitations (non-convergent layout, incompatibilities between auto layout and manual GridSpec adjustments, iterative tweaking to avoid overlaps).
- Hybrid “export-to-SVG then edit” workflows reveal a representational mismatch between data-driven plotting and shape-level editing, sometimes severe enough to make the vector artifact hard to edit.

Conceptually, this missing layer consists of at least four entangled concerns:

Multi-panel composition as constraints, not coordinates. Publication figures require alignment (panel edges, tick-label baselines), shared legends, consistent whitespace, and stable label placement across sizes. In many plotting stacks, these are achieved through heuristics (tight_layout, constrained_layout) that attempt to avoid overlaps, but do not expose a declarative constraint set an agent can reason about.
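What such a declarative constraint set could look like, in a minimal hypothetical sketch (`Panel`, `align_left`, and `no_overlap` are invented names, not any library's API):

```python
# Sketch: alignment constraints as declarative, checkable objects rather than
# layout heuristics. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Panel:
    id: str
    x: float; y: float; w: float; h: float  # figure-fraction coordinates

def align_left(a: Panel, b: Panel, tol: float = 1e-6) -> bool:
    """Constraint: left edges coincide (e.g., shared y-axis labels)."""
    return abs(a.x - b.x) <= tol

def no_overlap(a: Panel, b: Panel) -> bool:
    """Constraint: bounding boxes are disjoint."""
    return a.x + a.w <= b.x or b.x + b.w <= a.x or \
           a.y + a.h <= b.y or b.y + b.h <= a.y

panels = {p.id: p for p in [
    Panel("A", 0.05, 0.55, 0.4, 0.4),
    Panel("B", 0.05, 0.05, 0.4, 0.4),
    Panel("C", 0.55, 0.05, 0.4, 0.9),
]}

# An agent (or a CI check) can evaluate the constraint set after any edit:
constraints = [
    ("A left-aligned with B", align_left(panels["A"], panels["B"])),
    ("A/C disjoint", no_overlap(panels["A"], panels["C"])),
]
violations = [name for name, ok in constraints if not ok]
```

Unlike `tight_layout`, such a constraint set is inspectable: a failed constraint names what broke, instead of silently renegotiating whitespace.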

Cross-panel semantics. The meaning of a composite figure is often in relationships: “A and B share the same x-scale,” “Panel C is an ablation of A,” “Annotations refer to the same entity across panels.” Plotting libraries encode per-panel marks, and SVG encodes shapes, but neither natively captures these higher-level semantic invariants.

Figure-wide consistency as a first-class object. Consistent typography, symbol reuse, arrow styles, and color tokens are part of a “figure system,” not a one-off figure. SciFig’s rubric dimensions explicitly list design consistency and technical implementation quality (including vector-quality considerations) as evaluation axes, implying that these are systematic properties rather than incidental aesthetics.

Editability that preserves structure. GUI editing is flexible but often loses provenance; image generation is fast but historically produces pixel-level outputs that are “not directly interactable.” Modern agentic systems aim to bridge this, but even their own papers sometimes include hints that a fully automatic draft still benefits from manual polish in some contexts.

Why this layer remains under-formalized in 2026 is not purely accidental. It is the intersection point between (a) typography/layout engines, (b) plotting grammars, (c) vector graphics standards, and (d) human communication goals (clarity, interpretability). SciFig explicitly argues that previous systems optimize for visual balance but fail on logical clarity and scientific accuracy, and introduces hierarchical modeling plus iterative feedback to address that mismatch. This is an implicit admission: “layout” for scientific figures is not just geometry—it is geometry under semantic constraints.

A canonical figure pipeline model and where today’s tools fit

A useful abstraction is to make the figure build explicit as a staged compiler:

data → analysis → panel generation → composition → annotation → export

Below, the point is not to map tools as a directory, but to identify representations and failure modes at each stage.

Data. Inputs range from raw files (CSV/Parquet/HDF5) to database queries. The critical system feature is dependency identity: hashes, timestamps, and schema expectations. Workflow systems like Snakemake formalize “inputs → outputs” rules and automatically build a directed acyclic graph (DAG) of jobs based on filenames and rules.

What’s missing for figures: a standardized way to declare that “Figure 3 depends on dataset X and on statistical model configuration Y” in the same build graph as the manuscript, rather than as informal knowledge.
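A stdlib sketch of that dependency identity, in the spirit of Snakemake's input/output checks (the file names and stamp-file convention are illustrative):

```python
# Sketch: content-addressed dependency identity for a figure target.
# Rebuild decisions follow from input hashes, not from informal knowledge.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_rebuild(inputs: list[Path], stamp: Path) -> bool:
    """Rebuild iff any input's content hash differs from the recorded stamp."""
    current = "\n".join(digest(p) for p in inputs)
    return not stamp.exists() or stamp.read_text() != current

def record(inputs: list[Path], stamp: Path) -> None:
    stamp.write_text("\n".join(digest(p) for p in inputs))

# Usage: "Figure 3" depends on a dataset and a model configuration.
data = Path("dataset_x.csv"); cfg = Path("model_y.json")
data.write_text("a,b\n1,2\n"); cfg.write_text('{"seed": 0}')
stamp = Path("fig3.stamp")
first = needs_rebuild([data, cfg], stamp)   # no stamp yet -> rebuild
record([data, cfg], stamp)
second = needs_rebuild([data, cfg], stamp)  # inputs unchanged -> cached
```

Workflow engines generalize exactly this check across a DAG of rules; the gap is that figures rarely participate in that graph as first-class targets.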

Analysis. This is code execution (Python/R/Julia), ideally deterministic under pinned environments. Quarto, for instance, emphasizes that executable Python code blocks embedded in markdown are re-run when a document is rendered, and exposes caching controls via rendering options.

What’s missing for figures: analysis outputs frequently are not versioned as stable intermediates; figure regeneration may require re-running expensive steps without explicit caching boundaries tied to figure specs.

Panel generation. Representations here are often rich but panel-scoped:

- Matplotlib: an Artist tree plus coordinate transform framework that maps data coordinates → axes → figure → display coordinates.
- ggplot2: a layered grammar spec compiled into grid graphics objects (conceptually), with strong support for composable layers.
- Vega/Vega-Lite: JSON specifications compiled and rendered by a runtime; Vega specifically constructs a scenegraph (tree of visual mark items) as part of its dataflow.
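For concreteness, a minimal Vega-Lite specification from the declarative family looks like this (the inline data values are illustrative):

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"values": [{"x": 1, "y": 4}, {"x": 2, "y": 1}, {"x": 3, "y": 9}]},
  "mark": "point",
  "encoding": {
    "x": {"field": "x", "type": "quantitative"},
    "y": {"field": "y", "type": "quantitative"}
  }
}
```

Everything about the panel is a structured field that can be diffed and validated against the published schema; note the `$schema` URL pins the spec version explicitly.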

What’s missing: stable identifiers for semantic elements (e.g., “the regression line in panel B”) that survive recompilation so an agent can target edits without brittle diffs.

Composition. This is the missing layer. Existing tools treat it as:

- Heuristic fitting of decorations to avoid overlap (Matplotlib layouts).
- External composition packages (patchwork/cowplot) because core plotting doesn’t model composition.
- DSL diagram tools like Graphviz (DOT language to diagrams, including SVG/PDF outputs) and Mermaid (text definitions rendered to diagrams).

What’s missing: a figure-level constraint system that merges panel outputs, supports alignment/spacing invariants, and remains stable under edits.

Annotation. Annotation spans from axis labels to callouts and arrows. Agentic figure systems now treat annotation and structure as first-class: SciFig formalizes output in terms of (i) a layout specifying position/size/styling, (ii) a connection set for arrows, and (iii) visual elements (images/text), then composes them via a composition function.

What’s missing: semantics for annotations (what they refer to, why they exist) in a machine-checkable form, enabling tasks like “ensure citation/data consistency” or “update label when upstream variable name changes.”

Export. Two families matter:

- Vector-first: SVG/PDF for print fidelity and post-editability. Matplotlib’s savefig explicitly supports saving to image or vector formats depending on backend. SVG as an XML standard is structurally editable, but editor-specific extensions can complicate portability.
- Raster-first: PNG for web and some submission systems; easier determinism but loses structure.

In practice, cross-target consistency is its own problem: even tools designed to target multiple outputs can show target-specific figure layout issues (e.g., multi-panel rendering behaving differently across HTML and PDF).

What’s missing: a back-end independent layout contract (or validation step) ensuring figure invariants hold across targets.
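A sketch of one such validation step: parse the exported SVG and check a crude bounds invariant over text anchors. The invariant and the inline SVG are illustrative; a real check would measure rendered text extents, not just anchor points:

```python
# Sketch: a backend-independent validation pass over an exported SVG,
# checking that text anchors fall inside the declared viewBox.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 80">
  <g id="panel-A"><text x="10" y="15">Panel A</text></g>
  <g id="panel-B"><text x="95" y="75">Panel B</text></g>
</svg>"""

root = ET.fromstring(svg)
min_x, min_y, width, height = map(float, root.get("viewBox").split())

out_of_bounds = [
    t.text
    for t in root.iter(f"{SVG_NS}text")
    if not (min_x <= float(t.get("x")) <= min_x + width
            and min_y <= float(t.get("y")) <= min_y + height)
]
# An empty list means the (crude) bounds invariant holds for this export.
```

Running the same pass over each export target (SVG, PDF converted to SVG, etc.) is one way to approximate a layout contract without changing any renderer.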

Representations and DSLs for figures: code, declarative specs, and intermediate graphs

A programmable/agentic figure system lives or dies by its representation—what an agent can read, reason about, and modify safely.

Comparison of representation families

| Representation family | Typical example form | Expressivity | Editability under automation | Agent suitability |
|---|---|---|---|---|
| Imperative code | Matplotlib’s stateful/OO calls; Artists rendered via backends | High for custom drawing; easy to call arbitrary computation | Refactors can be brittle; intent is implicit in procedures | Moderate: requires program analysis; hard to localize edits without breaking emergent layout |
| Declarative visualization specs | Vega-Lite JSON compiled to Vega; Plotly figures as JSON trees under a schema | High for chart families; constrained for bespoke visuals unless extended | Stronger: diffs on structured fields | High: schema-guided edits, validation, and tool-agnostic storage (but version drift remains) |
| Diagram DSLs | DOT grammar for Graphviz; Mermaid text definitions | High for graphs/flows; weaker for data plots | Strong; text-based specs | High for pipeline diagrams; semantics align naturally with graph structure |
| Vector IR (document object model) | SVG DOM with structural elements like <g> and <defs> | Very high geometric control | Moderate: editable but semantics are mostly geometry; tool-specific extensions exist | Medium: good for geometry transforms; weak for “meaning-based” edits unless enriched |
| Scene graphs / internal render trees | Vega’s scenegraph of visual elements; Matplotlib’s Artist hierarchy | High but tied to runtime | Usually not persisted stably | Medium: good for introspection/testing, but poor as an interchange format unless serialized |

The emerging consensus (visible across modern declarative stacks and agentic figure papers) is that an intermediate structured representation is the missing hinge. Vega’s “scenegraph” illustrates a pragmatic separation: users author a spec; the runtime compiles it into a structured visual tree used for rendering and updates. SciFig similarly decomposes output into a layout + connection set + visual elements and iterates via rendered feedback. AutoFigure decomposes into a symbolic blueprint (SVG/HTML-like) plus a rendering stage.

What a figure-specific IR must add beyond SVG/scenegraphs

To be “agent-operable,” a figure IR must carry information that existing intermediates usually lack:

- Stable IDs for semantic objects across builds (panels, axes, legends, marks, callouts, arrows).
- Constraints (alignment, spacing, baseline rules) as declarative objects, not emergent from heuristics. The fact that tight_layout can be non-convergent and manual adjustments often require iteration is a symptom of missing explicit constraints.
- Semantic links: annotation anchors should point to semantic targets (“this label refers to variable X”), not only coordinates.
- Provenance hooks: references to upstream data and analysis artifacts so the figure participates in a reproducible DAG, like Snakemake’s rule graph.

Agentic figure systems: when AI generates figure programs versus images

Agentic figure systems can now be evaluated by a sharper criterion than “does it look right?”: does the system produce or maintain a structured figure program that supports safe edits and rebuilds?

SciFig: hierarchical layout + iterative feedback + rubricized evaluation

SciFig explicitly models scientific pipeline figure generation as a multi-stage task with quality constraints, and introduces a multi-agent architecture: a description agent extracts hierarchical structure, a layout agent performs spatial/hierarchical reasoning, and a component agent renders elements.

Two contributions matter for “figure-as-program” thinking:

- Hierarchy as structure. SciFig defines a two-level hierarchy (modules and components) and emphasizes generating inter-module connections rather than arbitrary arrows, which is effectively a move from pixel space to a structured graph + layout representation.
- Evaluation as a first-class system layer. SciFig derives six rubrics (technical accuracy, visual clarity, structural coherence, design consistency, interpretability, implementation quality) by analyzing 2,219 real figures, and uses an evaluation agent that generates rubric-specific checks. This is important because it functions as a type system for figure quality—exactly what agentic editing needs to avoid regressions.

SciFig’s paper also includes quantitative evidence about why figures remain a bottleneck: in a participant study, reported figure creation times range from hours to over a month, with a median around one week and a mean around 9.5 days. Regardless of whether one accepts the precise estimate, the study supports the broader claim that figure composition is still a major time sink.

On the “structured program vs image” axis, SciFig claims it generates fully editable, vector-based figures and frames output in terms of layout + connections + visual elements. That suggests it is closer to generating a figure program than typical text-to-image models, even though the published artifact is a rendered vector figure rather than an exposed, standardized DSL.

AutoFigure: decoupled symbolic blueprint + rendering, with critique-and-refine

AutoFigure (ICLR 2026) frames manual scientific illustration creation as a recognized bottleneck and introduces a benchmark (FigureBench) of 3,300 text–figure pairs for long-form text-to-illustration tasks.

Its architecture is explicitly compiler-like:

- Stage I produces a machine-readable symbolic layout described as SVG/HTML geometry/topology plus a style descriptor, and even rasterizes this layout into a reference image for conditioning later stages.
- Stage II performs critique-and-refine: a self-refinement loop between a “designer” and “critic” optimizing alignment, overlap avoidance, and balance, i.e., layout quality as an explicit objective.
- Stage III executes a rendering strategy and includes an “erase-and-correct” procedure for textual accuracy (OCR + verification + vector text overlays).

This system is best understood as generating a structured blueprint and then compiling it into a polished illustration with a learned renderer. The blueprint is a figure program of sorts, but the final look depends on a generative model conditioned on that blueprint—so the last stage may be less deterministic than a pure vector compiler. AutoFigure’s own ablation discussion distinguishes performance “without rendering” vs “with rendering,” highlighting that the rendering stage changes quality characteristics and that intermediate sketch formats matter.

From an “agent-operable” standpoint, the most promising part is the explicit symbolic layout (SVG/HTML-like), because it provides a structured edit surface. AutoFigure’s open-source implementation also advertises multiple output formats, including SVG and an mxGraph XML variant compatible with draw.io, which is another structured graph representation rather than pixels.

DeTikZify: generating semantics-preserving TikZ programs

DeTikZify is a direct answer to “structured program vs image”: it synthesizes figures as semantics-preserving TikZ graphics programs (not just rendered images), trained on large TikZ corpora and refined via an MCTS-based inference mechanism. This is significant because TikZ is itself a programmable graphics language, meaning edits can be applied at the program level, and recompilation yields consistent vector output under a stable TeX toolchain.

Synthesis: what agentic systems have and what they still lack

Across these systems, the frontier is shifting from “generate an image” to “generate a representation that supports iteration.” SciFig emphasizes editability and hierarchical structure; AutoFigure explicitly outputs a symbolic blueprint; DeTikZify outputs a graphics program.

The remaining gaps, relative to fully programmable figure systems, are:

- Standardized IR and APIs. Each system invents its own symbolic layer (SVG/HTML-like, layout+connections, TikZ). There is no widely adopted interchange format that combines constraints + semantics + provenance.
- Idempotent rebuild guarantees. Agentic loops and learned rendering stages introduce nondeterminism unless randomness, model versions, and prompts are treated as pinned build inputs (analogous to Plotly schema drift concerns).
- Local edit operations with invariants. Systems are beginning to support “editability” in a human sense, but robust agent editing requires typed operations (“change label text,” “swap dataset,” “reflow layout”) that preserve rubric constraints like clarity and consistency.

Integration with manuscript build systems and a target architecture

A practical programmable figure system must integrate with manuscripts as a dependency rather than an attachment. The manuscript is itself a compiled artifact (PDF/HTML) produced by systems like Pandoc (a multi-format converter capable of PDF production) and Quarto (code-executing markdown with caching controls and incremental rendering). LaTeX workflows further rely on labeling and cross-references for figures and require multiple compilation passes for references to resolve.

System taxonomy by architectural role

The most stable taxonomy is not by brand/tool name, but by role in the figure compiler stack:

- Workflow/DAG orchestration: Snakemake-like systems that infer dependencies and build a DAG of jobs.
- Panel compilers: Matplotlib (Artists + transforms), ggplot2 (layered grammar), Vega/Vega-Lite (JSON spec → compiled runtime), Plotly (schema-driven JSON trees).
- Figure composition layers: patchwork/cowplot for R compositions; Matplotlib layout engines and mosaic/subfigures for Python; PGFPlots groupplots for TeX; diagram DSL engines (Graphviz/Mermaid) for pipeline figures.
- Vector IR and editors: SVG as IR; editors like Inkscape that preserve SVG but may extend it.
- Agentic generators/evaluators: SciFig (hierarchy + rubric evaluation), AutoFigure (symbolic blueprint + critique + rendering), TikZ program synthesis (DeTikZify).
- Manuscript compilers: Pandoc/LaTeX/Quarto as the final build stage consuming figure artifacts.

Agentic design requirements for figure systems

An “agent-compatible” figure system is not primarily about natural-language prompting; it is about safe operations over explicit structure. The minimal requirements follow directly from the failure modes documented above:

- Explicit structure, not hidden GUI state. SVG provides a structural DOM, and declarative specs like Vega-Lite and Plotly expose trees under schemas; these are inherently more inspectable than purely manual editing histories.
- Idempotent builds. DAG-based workflow systems formalize “inputs → outputs,” enabling incremental rebuilds and caching; without this, upstream data changes cannot reliably trigger figure regeneration.
- Composable transformations. Layout engines that rely on heuristics and iterative tweaking (non-convergent tight_layout, manual iteration for GridSpec) are hard for agents; a constraint-based, versioned layout spec is more transformable.
- Inspectable intermediate outputs. Vega’s scenegraph concept shows how a runtime-maintained visual tree supports updates and debugging; agentic figure systems similarly rely on intermediate layouts plus feedback loops.
- Versioned compilers. Plotly’s warning about JSON compatibility across versions highlights that the “compiler” must be pinned; the same holds for model versions in agentic pipelines.

Proposed minimal viable architecture for programmable figures

A fully programmable, agent-driven figure system can be specified as four interacting layers. The goal is not to replace plotting libraries, but to add the missing composition/semantics layer that turns panel code into a figure program.

Figure specification layer (declarative). A single “FigureSpec” should define:

- Panels (each with a referenced panel compiler: Matplotlib script, ggplot recipe, Vega-Lite JSON, Graphviz/Mermaid DSL).
- A layout graph (grid + constraints + grouping semantics).
- Annotation objects (typed: caption, label, callout, arrow), anchored to semantic IDs rather than raw coordinates.
- Style tokens (fonts, line weights, color roles) applied consistently across panels—explicitly addressing “design consistency” as a constraint rather than an aesthetic afterthought.

Build pipeline (reproducible execution). A build engine should compile FigureSpec into artifacts under a DAG:

- Each panel is a node producing a panel IR (e.g., a “panel scenegraph” or normalized SVG fragment).
- Composition is a node that solves constraints and emits a composed figure IR (structured scene/layout graph).
- Export nodes produce SVG/PDF/PNG variants.
- The pipeline should integrate with a DAG orchestrator (Snakemake-like) so figures are rebuildable dependencies of the manuscript.

Output system (vector-first with validated fallbacks). Vector-first output is fundamental for publication and editability, but should be emitted from the composed IR rather than as ad-hoc exports stitched in GUI tools. SVG provides a structurally editable target, and Matplotlib/Graphviz can already emit SVG/PDF; the architecture should treat these as compilation targets with validation checks (e.g., bounding boxes, font embedding, label overlap tests).
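Tying these layers together, a FigureSpec instance might look like the following sketch. Every field name here is hypothetical, not an existing format:

```python
# Sketch of a "FigureSpec": panels reference their own compilers, layout is a
# constraint set, and annotations anchor to semantic IDs. All fields hypothetical.
figure_spec = {
    "id": "fig3",
    "panels": [
        {"id": "A", "compiler": "matplotlib", "entry": "panels/panel_a.py"},
        {"id": "B", "compiler": "vega-lite",  "entry": "panels/panel_b.vl.json"},
    ],
    "layout": {
        "grid": [["A", "B"]],
        "constraints": [{"type": "align-top", "targets": ["A", "B"]}],
    },
    "annotations": [
        {"type": "callout", "text": "ablation", "anchor": "A/regression-line"},
    ],
    "style_tokens": {"font": "Helvetica", "color.accent": "#d62728"},
    "provenance": {"inputs": ["dataset_x.csv"],
                   "engine_versions": {"matplotlib": "3.10"}},
}

def validate(spec: dict) -> list[str]:
    """Minimal structural checks an agent could run before a rebuild."""
    panel_ids = {p["id"] for p in spec["panels"]}
    errors = []
    for row in spec["layout"]["grid"]:
        errors += [f"unknown panel {pid!r}" for pid in row if pid not in panel_ids]
    for ann in spec["annotations"]:
        if ann["anchor"].split("/")[0] not in panel_ids:
            errors.append(f"dangling anchor {ann['anchor']!r}")
    return errors
```

The validation function is the point: because composition is data rather than GUI state, structural errors (dangling anchors, unknown panels) surface before any rendering happens.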

Agent interface (queryable structure + safe edits). The agent should not “paint pixels.” It should operate on FigureSpec / figure IR via safe primitives:

- EditText(node_id, new_text) (with automatic reflow).
- ReplacePanel(panel_id, new_panel_spec) (keeping constraints and style tokens stable).
- Relayout(group_id, constraint_changes) (re-solve constraints, validate rubrics).
- RefreshFromUpstream() (trigger DAG rebuild and compare invariants).

SciFig’s rubric-based evaluation suggests how this can be made robust: the system can attach automatic checks (clarity, coherence, consistency) as postconditions for agent edits. AutoFigure’s critique-and-refine loop similarly shows how iterative evaluation feedback can be systematized, but the agent interface should target the symbolic blueprint rather than an image.
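A minimal sketch of one such primitive with a postcondition; all names, and the length-based stand-in for a "clarity" rubric, are hypothetical:

```python
# Sketch: a typed edit primitive with a rubric-style postcondition.
# Names (edit_text, MAX_LABEL_LEN) and the rubric itself are hypothetical.
import copy

MAX_LABEL_LEN = 40  # stand-in "clarity" rubric: labels must stay short

def edit_text(ir: dict, node_id: str, new_text: str) -> dict:
    """Apply EditText(node_id, new_text); reject edits that break invariants."""
    if len(new_text) > MAX_LABEL_LEN:
        raise ValueError("postcondition failed: label exceeds clarity budget")
    if node_id not in ir["nodes"]:
        raise KeyError(f"unknown node {node_id!r}")
    new_ir = copy.deepcopy(ir)  # edits are transactional, not in-place
    new_ir["nodes"][node_id]["text"] = new_text
    return new_ir

ir = {"nodes": {"label-A": {"text": "Accuracy"}, "label-B": {"text": "Loss"}}}
ir2 = edit_text(ir, "label-A", "Top-1 accuracy")
# The original IR is untouched; the edit yields a new, validated revision.
```

Because each primitive returns a fresh revision and enforces its postconditions, an agent's failed edit leaves the figure program exactly where it was, rather than in a half-nudged state.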

Why figure workflows are still “broken” in 2026, and the path forward

The persistent breakage is not a lack of plotting capability; it is the absence of a widely adopted figure composition IR that unifies constraints, semantics, and provenance across panel compilers and manuscript targets. The ecosystem evidence is consistent: composition is externalized (patchwork), heuristic and sometimes unstable (Matplotlib layout), or accomplished by manual GUI state that is difficult to reproduce (SVG editors with custom namespaces).

The path forward is a compiler architecture where:

- Panel systems remain specialized compilers.
- A figure IR becomes the missing “linker/optimizer” that composes panels under constraints and semantics.
- Agents become safe transformation engines over this IR, with rubric-like postconditions and DAG-integrated rebuilds—turning “figure editing” into “program transformation + recompilation,” not “manual nudging.”