Evaluating Visual-Logical Consistency
in Academic Illustration Generation
1Tongyi Lab, Alibaba Group 2SJTU 3FDU 4USTC 5SEU 6HKU
* Equal contribution, random order † Corresponding author
Although rapid advances in image generation have boosted a wide range of applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or scoring an illustration with a VLM is naive but demands oracle-level multi-modal understanding, which is unreliable for long, complex texts and dense illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of each paper, querying whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the capability of the judge VLM. With the high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general benchmarks, reflecting differences in their complex reasoning and high-density generation abilities. Moreover, logic and aesthetics are hard to optimize simultaneously, just as in handcrafted illustrations. Additional experiments further show that test-time scaling on both abilities significantly boosts performance on this task.
Fine-grained multi-level QA assessment across 4 hierarchical levels, from component existence to global semantics.
Explicitly separates objective logic evaluation (VQA) from subjective aesthetic assessment (UniPercept), eliminating metric ambiguity.
Automated text-to-logic directed graph pipeline with level-specific QA generators, powered by Gemini 3 Flash.
300 papers from CVPR, ICCV, ICLR, NeurIPS 2025 with 5,704 human-verified QA pairs and ~100 hours of expert annotation.
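The text-to-logic pipeline above can be illustrated with a minimal sketch. All class names, fields, and question templates here are assumptions for illustration, not AIBench's actual implementation: the idea is simply that the method section is distilled into a directed graph, from which level-specific yes/no QA pairs are generated.

```python
from dataclasses import dataclass

# Hypothetical representation of the logic diagram distilled from a paper's
# method section: components become nodes, data flows become directed edges.
@dataclass
class LogicGraph:
    nodes: list  # component names, e.g. "Encoder"
    edges: list  # (upstream, downstream) data-flow pairs

def component_questions(g: LogicGraph) -> list:
    # Level 1 (Component): existence/completeness of key nodes.
    return [f"Does the illustration contain a component labeled '{n}'?"
            for n in g.nodes]

def topology_questions(g: LogicGraph) -> list:
    # Level 2 (Topology): local connectivity between adjacent nodes.
    return [f"Is '{u}' connected to '{v}' by an arrow or data flow?"
            for u, v in g.edges]

graph = LogicGraph(
    nodes=["Encoder", "Cross-Attention", "Decoder"],
    edges=[("Encoder", "Cross-Attention"), ("Cross-Attention", "Decoder")],
)
qa_pairs = component_questions(graph) + topology_questions(graph)
```

Phase- and semantics-level generators would operate analogously on larger subgraphs and on the graph as a whole.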
Evaluation results on AIBench across five dimensions. Scores are averaged across 300 papers.
| # | Model | Size | Component ▲ | Topology ▲ | Phase ▲ | Semantics ▲ | Aesthetics ▲ | Score ▼ |
|---|---|---|---|---|---|---|---|---|
AIBench provides the first fine-grained, VQA-based evaluation for academic illustration generation with multi-level QA hierarchy.
| Benchmark | Data Construction | # Papers | Avg. # Eval Units | Eval Method | Granularity |
|---|---|---|---|---|---|
| PaperBanana | Auto | 292 | 1 | VLM-as-Judge | Coarse |
| AutoFigure | Auto | 3,300 | 1 | VLM-as-Judge | Coarse |
| AIBench (Ours) | Auto + Human | 300 | 19.01 | QA-Based | Fine-grained |
Our QA pairs span from low-level component verification to high-level semantic alignment.
**Component.** Verifies the presence and completeness of key nodes by checking whether core components and labels appear correctly.
**Topology.** Examines local connectivity and data flow between adjacent nodes, testing correct routing from upstream to downstream modules.
**Phase.** Targets macro-architectural organization, including parallel branches, feature aggregation, and global feedback loops.
**Semantics.** Evaluates the system's end-to-end design intent and task paradigm, requiring integration of evidence across the entire diagram.
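Given judged answers to the QA pairs, per-level and overall scores can be aggregated as in this minimal sketch. The unweighted-mean aggregation and field names are assumptions for illustration, not the paper's exact formula:

```python
# Each answer is True if the judge VLM's response to a QA pair is consistent
# with the paper's logic graph (assumed scoring convention).
def level_score(answers: list) -> float:
    return 100.0 * sum(answers) / len(answers)

def overall_score(per_level: dict, aesthetics: float) -> float:
    # Unweighted mean over the four logic levels plus aesthetics, matching
    # the five leaderboard columns (assumed aggregation scheme).
    dims = list(per_level.values()) + [aesthetics]
    return sum(dims) / len(dims)

scores = {
    "Component": level_score([True, True, False, True]),  # 75.0
    "Topology":  level_score([True, False]),              # 50.0
    "Phase":     level_score([True, True]),               # 100.0
    "Semantics": level_score([True]),                     # 100.0
}
total = overall_score(scores, aesthetics=56.45)
```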
Extensive evaluation of state-of-the-art closed-source and open-source models reveals significant insights for academic illustration generation.
Performance saturation on general generative benchmarks masks a profound capability gap between open- and closed-source models when they are confronted with the high-density, complex reasoning required for academic illustration generation.
Models face a conflict mirroring human design challenges: logical completeness and visual aesthetics are often at odds and hard to balance simultaneously.
Navigating the dual challenges of long, complex text reasoning and high-density rendering is crucial. Applying test-time scaling independently to each stage effectively pushes past the capability ceilings of current models.
Exploration of different strategies to push the boundaries of academic illustration generation.
| Methods | Component | Topology | Phase | Semantics | Aesthetics | Overall |
|---|---|---|---|---|---|---|
| *Rewriting* | | | | | | |
| Qwen-Image-2512 | 32.27 | 29.11 | 39.95 | 56.39 | 56.45 | 42.83 |
| Rewritten Qwen-Image-2512 | 56.97 | 45.93 | 57.71 | 71.97 | 59.35 | 58.39 |
| *AutoFigure Pipeline* | | | | | | |
| SVG Code (Gemini-2.5-Flash) | 87.51 | 79.66 | 81.90 | 91.98 | 43.12 | 76.83 |
| Nano Banana Pro | 87.80 | 74.81 | 82.67 | 88.54 | 55.04 | 77.77 |
| SVG Prompted Nano Banana Pro | 87.57 | 76.50 | 78.91 | 92.14 | 55.05 | 78.03 |
| *Post Enhancement* | | | | | | |
| Wan 2.6 | 68.60 | 56.11 | 72.56 | 80.43 | 51.50 | 65.84 |
| Wan 2.6 w/ BoN | 69.90 | 58.50 | 74.24 | 80.53 | 52.97 | 67.23 |
| Wan 2.6 w/ Edit | 75.70 | 61.80 | 72.81 | 82.02 | 54.21 | 69.31 |
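The Best-of-N (BoN) strategy in the table can be sketched as a simple sample-and-select loop. The `generate` and `qa_score` callables below are hypothetical stand-ins for an image generator and the QA-based evaluator:

```python
# Hedged sketch of Best-of-N test-time scaling: sample N candidate
# illustrations, score each with the QA-based evaluator, keep the best.
def best_of_n(generate, qa_score, prompt, n=4):
    candidates = [generate(prompt, seed=s) for s in range(n)]
    return max(candidates, key=qa_score)

# Toy stand-ins for demonstration: "candidates" are just seed indices,
# and the pre-computed scores mimic evaluator outputs.
toy_scores = {0: 42.8, 1: 58.4, 2: 51.0, 3: 49.9}
best = best_of_n(
    generate=lambda prompt, seed: seed,
    qa_score=lambda c: toy_scores[c],
    prompt="method-section logic graph",
    n=4,
)
```

The editing-based variant (`w/ Edit`) would instead feed failed QA pairs back to an image-editing model rather than resampling from scratch.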
If you find AIBench useful in your research, please consider citing our paper.
@article{liao2025aibench,
title={AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation},
author={Liao, Zhaohe and Jiang, Kaixun and Liu, Zhihang and Wei, Yujie and Yu, Junqiu and Li, Quanhao and Yu, Hongtao and Li, Pandeng and Wang, Yuzheng and Xing, Zhen and Zhang, Shiwei and Xie, Chen-Wei and Zheng, Yun and Liu, Xihui},
journal={arXiv preprint arXiv:2603.28068},
year={2026}
}