AIBench

Evaluating Visual-Logical Consistency
in Academic Illustration Generation

Zhaohe Liao1,2*, Kaixun Jiang3*, Zhihang Liu1,4*, Yujie Wei3*, Junqiu Yu1,3*, Quanhao Li1,3*, Hongtao Yu1,5*, Pandeng Li1†, Yuzheng Wang1, Zhen Xing1, Shiwei Zhang1, Chen-Wei Xie1, Yun Zheng1, Xihui Liu6†

1Tongyi Lab, Alibaba Group    2SJTU    3FDU    4USTC    5SEU    6HKU

* Equal contribution, random order    † Corresponding author

Abstract

Although rapid progress in image generation has boosted many applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or scoring an illustration with a VLM is a natural approach, but it demands near-oracle multi-modal understanding, which is unreliable for long, complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. Specifically, we design four levels of questions derived from a logic diagram summarized from the method section of each paper, querying whether the generated illustration aligns with the paper at different granularities. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the capability of the judge VLM. With the high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general benchmarks, reflecting differences in their complex reasoning and high-density generation abilities. Furthermore, logic and aesthetics are hard to optimize simultaneously, just as in handcrafted illustrations. Additional experiments show that test-time scaling applied to both abilities significantly boosts performance on this task.

300
Papers
5,704
QA Pairs
4
QA Levels
14
Models Evaluated
🔍

VQA-Based Logic Evaluation

Fine-grained multi-level QA assessment across 4 hierarchical levels, from component existence to global semantics.

⚖️

Decoupled Metrics

Explicitly separates objective logic evaluation (VQA) from subjective aesthetic assessment (UniPercept), eliminating metric ambiguity.

📊

Scalable QA Framework

Automated text-to-logic directed graph pipeline with level-specific QA generators, powered by Gemini 3 Flash.

High-Quality Data

300 papers from CVPR, ICCV, ICLR, NeurIPS 2025 with 5,704 human-verified QA pairs and ~100 hours of expert annotation.

AIBench Overview
Overview of AIBench. We introduce a comprehensive benchmark to evaluate academic illustration generation through two primary dimensions: question-answering-based logical evaluation and model-based aesthetic assessment.

🏆 Leaderboard

Evaluation results on AIBench across five dimensions. Click column headers to sort. Scores are averaged across 300 papers.

Model Performance

# Model Size Component Topology Phase Semantics Aesthetics Score

Benchmark

AIBench provides the first fine-grained, VQA-based evaluation for academic illustration generation with multi-level QA hierarchy.

Comparison with Related Benchmarks

| Benchmark | Data Construction | # Papers | # Avg. Eval Unit | Eval Method | Granularity |
|---|---|---|---|---|---|
| PaperBanana | Auto | 292 | 1 | VLM-as-Judge | Coarse |
| AutoFigure | Auto | 3,300 | 1 | VLM-as-Judge | Coarse |
| AIBench (Ours) | Auto + Human | 300 | 19.01 | QA-Based | Fine-grained |
Evaluation Pipeline
The Evaluation Pipeline of AIBench. Models generate academic illustrations from paper method descriptions, then are evaluated on logical accuracy via hierarchical QA pairs (L1–L4) and visual appeal via a model-based aesthetic score.
QA Data Construction Pipeline
QA Data Construction Pipeline. We construct multi-level QA pairs via a text-to-logic directed graph, followed by level-specific QA generation and careful human annotation.
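The pipeline above first converts a paper's method section into a directed logic graph, then emits level-specific questions from it. As a rough illustration of the first two steps (the node names and question template below are hypothetical, not taken from the paper):

```python
# Illustrative sketch of the text-to-logic step: represent a method
# section as a small directed graph and emit Level-1 (component
# existence) questions from its nodes. This is an assumption-laden
# toy version, not the paper's actual pipeline code.
from collections import defaultdict


class LogicGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> downstream nodes

    def add_edge(self, src: str, dst: str):
        self.edges[src].append(dst)

    def nodes(self):
        seen = set()
        for src, dsts in self.edges.items():
            seen.add(src)
            seen.update(dsts)
        return sorted(seen)


def level1_questions(graph: LogicGraph):
    """One yes/no existence question per component node (Level 1)."""
    return [f"Does the illustration contain a '{n}' module?" for n in graph.nodes()]


# Hypothetical architecture extracted from a method section.
g = LogicGraph()
g.add_edge("Image Encoder", "Fusion Block")
g.add_edge("Text Encoder", "Fusion Block")
g.add_edge("Fusion Block", "Decoder")
questions = level1_questions(g)
```

Higher levels (L2–L4) would query the edges, branch structure, and overall paradigm of the same graph rather than individual nodes.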

Four Hierarchical QA Levels

Our QA pairs span from low-level component verification to high-level semantic alignment.

Level 1

Component Existence

Verifies the presence and completeness of key nodes by checking whether core components and labels appear correctly.

32%
1,816 QA pairs
Level 2

Local Topology

Examines local connectivity and data flows between adjacent nodes, testing correct routing of upstream to downstream modules.

30%
1,711 QA pairs
Level 3

Phase Architecture

Targets macro-architectural organization including parallel branches, feature aggregation, and global feedback loops.

21%
1,214 QA pairs
Level 4

Global Semantics

Evaluates the system's end-to-end design intent and task paradigm, requiring integration of evidence across the entire diagram.

17%
963 QA pairs
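Given per-level VQA accuracies, an overall logic score can be formed by weighting each level by its share of QA pairs. The sketch below uses the level counts reported above; the weighting scheme itself is an illustrative assumption, not necessarily the paper's exact aggregation:

```python
# Illustrative aggregation of per-level VQA accuracies into one logic
# score, weighted by each level's share of AIBench's 5,704 QA pairs.
LEVEL_COUNTS = {          # QA pairs per hierarchical level (from AIBench stats)
    "component": 1816,    # L1: component existence
    "topology": 1711,     # L2: local topology
    "phase": 1214,        # L3: phase architecture
    "semantics": 963,     # L4: global semantics
}


def overall_logic_score(level_accuracy: dict) -> float:
    """Weighted average of per-level accuracies (0-100 scale)."""
    total = sum(LEVEL_COUNTS.values())
    weighted = sum(level_accuracy[k] * n for k, n in LEVEL_COUNTS.items())
    return weighted / total


# Hypothetical per-level accuracies for one model.
acc = {"component": 87.8, "topology": 74.8, "phase": 82.7, "semantics": 88.5}
score = overall_logic_score(acc)  # ~82.94
```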
Dataset Statistics
Statistics of AIBench. (a) Papers curated from four representative 2025 conferences. (b) QA level distribution across four hierarchical levels. (c) Word cloud showing lexical diversity. (d) Top research topics in 2025 research papers.

Experiment Results

Extensive evaluation of state-of-the-art closed-source and open-source models reveals significant insights for academic illustration generation.

Finding 1

The performance saturation on general generative benchmarks masks a profound capability gap between open- and closed-source models when confronting the high-density, complex reasoning required for academic illustration generation.

Finding 2

Models face a conflict that mirrors human design challenges: logical completeness and visual aesthetics are often in tension and hard to optimize simultaneously.

Finding 3

Navigating the dual challenges of long, complex text reasoning and high-density rendering is crucial. By applying test-time scaling to each stage independently, we can substantially raise the capability ceiling of current models.

Failure Cases
Typical generation failure modes leading to incorrect answers: (a) missing components, (b) layout errors, (c) hallucinated reasoning/incorrect logic, and (d) unclear text rendering.

Test-Time Scaling Results

Exploration of different strategies to push the boundaries of academic illustration generation.

| Methods | Component | Topology | Phase | Semantics | Aesthetics | Overall |
|---|---|---|---|---|---|---|
| **Rewriting** | | | | | | |
| Qwen-Image-2512 | 32.27 | 29.11 | 39.95 | 56.39 | 56.45 | 42.83 |
| Rewritten Qwen-Image-2512 | 56.97 | 45.93 | 57.71 | 71.97 | 59.35 | 58.39 |
| **AutoFigure Pipeline** | | | | | | |
| SVG Code (Gemini-2.5-Flash) | 87.51 | 79.66 | 81.90 | 91.98 | 43.12 | 76.83 |
| Nano Banana Pro | 87.80 | 74.81 | 82.67 | 88.54 | 55.04 | 77.77 |
| SVG Prompted Nano Banana Pro | 87.57 | 76.50 | 78.91 | 92.14 | 55.05 | 78.03 |
| **Post Enhancement** | | | | | | |
| Wan 2.6 | 68.60 | 56.11 | 72.56 | 80.43 | 51.50 | 65.84 |
| Wan 2.6 w/ BoN | 69.90 | 58.50 | 74.24 | 80.53 | 52.97 | 67.23 |
| Wan 2.6 w/ Edit | 75.70 | 61.80 | 72.81 | 82.02 | 54.21 | 69.31 |

Citation

If you find AIBench useful in your research, please consider citing our paper.

@article{liao2025aibench,
  title={AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation},
  author={Liao, Zhaohe and Jiang, Kaixun and Liu, Zhihang and Wei, Yujie and Yu, Junqiu and Li, Quanhao and Yu, Hongtao and Li, Pandeng and Wang, Yuzheng and Xing, Zhen and Zhang, Shiwei and Xie, Chen-Wei and Zheng, Yun and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.28068},
  year={2026}
}