AIBench

Evaluating Visual-Logical Consistency
in Academic Illustration Generation

Zhaohe Liao1,2*, Kaixun Jiang3*, Zhihang Liu1,4*, Yujie Wei3*, Junqiu Yu1,3*, Quanhao Li1,3*, Hongtao Yu1,5*, Pandeng Li1†, Yuzheng Wang1, Zhen Xing1, Shiwei Zhang1, Chen-Wei Xie1, Yun Zheng1, Xihui Liu6†

1Tongyi Lab, Alibaba Group    2SJTU    3FDU    4USTC    5SEU    6HKU

* Equal contribution, random order    † Corresponding author

Abstract

Although rapid progress in image generation has boosted many applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or scoring an illustration with a VLM is a natural approach, but it demands near-oracle multi-modal understanding, which is unreliable for long, complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. Specifically, we design four levels of questions derived from a logic diagram summarized from the method section of each paper, querying whether the generated illustration aligns with the paper at different granularities. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the capability of the judge VLM. With the high-quality AIBench, we conduct extensive experiments and find that the performance gap between models on this task is significantly larger than on general benchmarks, reflecting differences in their complex reasoning and high-density generation abilities. Furthermore, logic and aesthetics are hard to optimize simultaneously, just as in handcrafted illustrations. Additional experiments show that test-time scaling applied to both abilities significantly boosts performance on this task.

300
Papers
5,704
QA Pairs
4
QA Levels
14
Models Evaluated
🔍

VQA-Based Logic Evaluation

Fine-grained multi-level QA assessment across 4 hierarchical levels, from component existence to global semantics.

⚖️

Decoupled Metrics

Explicitly separates objective logic evaluation (VQA) from subjective aesthetic assessment (UniPercept), eliminating metric ambiguity.

📊

Scalable QA Framework

Automated text-to-logic directed graph pipeline with level-specific QA generators, powered by Gemini 3 Flash.

High-Quality Data

300 papers from CVPR, ICCV, ICLR, NeurIPS 2025 with 5,704 human-verified QA pairs and ~100 hours of expert annotation.

AIBench Overview
Overview of AIBench. We introduce a comprehensive benchmark to evaluate academic illustration generation through two primary dimensions: question-answering-based logical evaluation and model-based aesthetic assessment.

🏆 Leaderboard

Evaluation results on AIBench across five dimensions. Click column headers to sort. Scores are averaged across 300 papers.

Model Performance

# Model Size Component Topology Phase Semantics Aesthetics Score

Benchmark

AIBench provides the first fine-grained, VQA-based evaluation for academic illustration generation with multi-level QA hierarchy.

Comparison with Related Benchmarks

| Benchmark | Data Construction | # Papers | # Avg. Eval Unit | Eval Method | Granularity |
|---|---|---|---|---|---|
| PaperBanana | Auto | 292 | 1 | VLM-as-Judge | Coarse |
| AutoFigure | Auto | 3,300 | 1 | VLM-as-Judge | Coarse |
| AIBench (Ours) | Auto + Human | 300 | 19.01 | QA-Based | Fine-grained |
Evaluation Pipeline
The Evaluation Pipeline of AIBench. Models generate academic illustrations from paper method descriptions, then are evaluated on logical accuracy via hierarchical QA pairs (L1–L4) and visual appeal via a model-based aesthetic score.
QA Data Construction Pipeline
QA Data Construction Pipeline. We construct multi-level QA pairs via a text-to-logic directed graph, followed by level-specific QA generation and careful human annotation.
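The pipeline above first converts a paper's method section into a directed logic graph, then emits level-specific questions from it. As a rough illustration of the first two steps (the node names and question template below are hypothetical, not taken from the paper):

```python
# Illustrative sketch of the text-to-logic step: represent a method
# section as a small directed graph and emit Level-1 (component
# existence) questions from its nodes. This is an assumption-laden
# toy version, not the paper's actual pipeline code.
from collections import defaultdict


class LogicGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> downstream nodes

    def add_edge(self, src: str, dst: str):
        self.edges[src].append(dst)

    def nodes(self):
        seen = set()
        for src, dsts in self.edges.items():
            seen.add(src)
            seen.update(dsts)
        return sorted(seen)


def level1_questions(graph: LogicGraph):
    """One yes/no existence question per component node (Level 1)."""
    return [f"Does the illustration contain a '{n}' module?" for n in graph.nodes()]


# Hypothetical architecture extracted from a method section.
g = LogicGraph()
g.add_edge("Image Encoder", "Fusion Block")
g.add_edge("Text Encoder", "Fusion Block")
g.add_edge("Fusion Block", "Decoder")
questions = level1_questions(g)
```

Higher levels (L2–L4) would query the edges, branch structure, and overall paradigm of the same graph rather than individual nodes.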

Four Hierarchical QA Levels

Our QA pairs span from low-level component verification to high-level semantic alignment.

Level 1

Component Existence

Verifies the presence and completeness of key nodes by checking whether core components and labels appear correctly.

32%
1,816 QA pairs
Level 2

Local Topology

Examines local connectivity and data flows between adjacent nodes, testing correct routing of upstream to downstream modules.

30%
1,711 QA pairs
Level 3

Phase Architecture

Targets macro-architectural organization including parallel branches, feature aggregation, and global feedback loops.

21%
1,214 QA pairs
Level 4

Global Semantics

Evaluates the system's end-to-end design intent and task paradigm, requiring integration of evidence across the entire diagram.

17%
963 QA pairs
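Given per-level VQA accuracies, an overall logic score can be formed by weighting each level by its share of QA pairs. The sketch below uses the level counts reported above; the weighting scheme itself is an illustrative assumption, not necessarily the paper's exact aggregation:

```python
# Illustrative aggregation of per-level VQA accuracies into one logic
# score, weighted by each level's share of AIBench's 5,704 QA pairs.
LEVEL_COUNTS = {          # QA pairs per hierarchical level (from AIBench stats)
    "component": 1816,    # L1: component existence
    "topology": 1711,     # L2: local topology
    "phase": 1214,        # L3: phase architecture
    "semantics": 963,     # L4: global semantics
}


def overall_logic_score(level_accuracy: dict) -> float:
    """Weighted average of per-level accuracies (0-100 scale)."""
    total = sum(LEVEL_COUNTS.values())
    weighted = sum(level_accuracy[k] * n for k, n in LEVEL_COUNTS.items())
    return weighted / total


# Hypothetical per-level accuracies for one model.
acc = {"component": 87.8, "topology": 74.8, "phase": 82.7, "semantics": 88.5}
score = overall_logic_score(acc)  # ~82.94
```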
Dataset Statistics
Statistics of AIBench. (a) Papers curated from four representative 2025 conferences. (b) QA level distribution across four hierarchical levels. (c) Word cloud showing lexical diversity. (d) Top research topics in 2025 research papers.

Experiment Results

Extensive evaluation of state-of-the-art closed-source and open-source models reveals significant insights for academic illustration generation.

Finding 1

The performance saturation on general generative benchmarks masks a profound capability gap between open- and closed-source models when confronting the high-density, complex reasoning required for academic illustration generation.

Finding 2

Models face a conflict that mirrors human design challenges: logical completeness and visual aesthetics are often in tension and hard to optimize simultaneously.

Finding 3

Navigating the dual challenges of long, complex text reasoning and high-density rendering is crucial. By applying test-time scaling to each stage independently, we can substantially raise the capability ceiling of current models.

Failure Cases
Typical generation failure modes leading to incorrect answers: (a) missing components, (b) layout errors, (c) hallucinated reasoning/incorrect logic, and (d) unclear text rendering.

Test-Time Scaling Results

Exploration of different strategies to push the boundaries of academic illustration generation.

| Methods | Component | Topology | Phase | Semantics | Aesthetics | Overall |
|---|---|---|---|---|---|---|
| **Rewriting** | | | | | | |
| Qwen-Image-2512 | 32.27 | 29.11 | 39.95 | 56.39 | 56.45 | 42.83 |
| Rewritten Qwen-Image-2512 | 56.97 | 45.93 | 57.71 | 71.97 | 59.35 | 58.39 |
| **AutoFigure Pipeline** | | | | | | |
| SVG Code (Gemini-2.5-Flash) | 87.51 | 79.66 | 81.90 | 91.98 | 43.12 | 76.83 |
| Nano Banana Pro | 87.80 | 74.81 | 82.67 | 88.54 | 55.04 | 77.77 |
| SVG Prompted Nano Banana Pro | 87.57 | 76.50 | 78.91 | 92.14 | 55.05 | 78.03 |
| **Post Enhancement** | | | | | | |
| Wan 2.6 | 68.60 | 56.11 | 72.56 | 80.43 | 51.50 | 65.84 |
| Wan 2.6 w/ BoN | 69.90 | 58.50 | 74.24 | 80.53 | 52.97 | 67.23 |
| Wan 2.6 w/ Edit | 75.70 | 61.80 | 72.81 | 82.02 | 54.21 | 69.31 |

Citation

If you find AIBench useful in your research, please consider citing our paper.

@article{liao2025aibench,
  title={AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation},
  author={Liao, Zhaohe and Jiang, Kaixun and Liu, Zhihang and Wei, Yujie and Yu, Junqiu and Li, Quanhao and Yu, Hongtao and Li, Pandeng and Wang, Yuzheng and Xing, Zhen and Zhang, Shiwei and Xie, Chen-Wei and Zheng, Yun and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.28068},
  year={2026}
}