CreBench
Human-Aligned Creativity Evaluation from Idea to Process to Product
Accepted at AAAI 2026
Authors: Kaiwen Xue¹*, Li Chenglong¹*, Zhonghong Ou²†, Guoxin Zhang¹, Kaoyan Lu³, Shuai Lyu¹, Yifan Zhu¹, Zong Ping¹, Junpeng Ding¹, Xinyu Liu⁴, Qunlin Chen⁴, Weiwei Qin¹, Yiran Shen¹, Jiayi Cen⁵ (*Co-first authors, †Corresponding author)
🌟 Overview
CreBench is a creativity evaluation benchmark designed for open‑ended drawing and visual design tasks. It systematically measures model performance across three interconnected dimensions: Creative Ideas, Creative Processes, and Creative Products. CreBench focuses on whether a model’s generative behavior aligns with human‑like creative thinking throughout the full workflow of ideation, exploration, refinement, and visual execution.
🧠 Definition of Creativity
In this benchmark, creativity is defined as:
The process of generating, evaluating, and refining problem‑solving ideas, and manipulating visual or structural elements to produce novel and appropriate solutions.
This includes divergent exploration, conceptual combination, refinement of a core idea, and elaboration of visual details. A drawing is considered creative when it is both an effective solution and a meaningful creative expression.
📐 Evaluation Dimensions
CreBench decomposes creativity into three major dimensions, each with further sub‑criteria:
1. Creative Ideas
- Originality: Whether the proposed idea is novel, surprising, or unconventional.
- Appropriateness: Whether the idea meaningfully addresses the task requirements and constraints.
2. Creative Processes
- Immersion / Preparation: How much time is spent pausing to understand and explore the problem before drawing.
- Divergence: Exploration of multiple alternatives instead of committing to a single path.
- Structuring: Organization of elements into a coherent, functional system or mechanism.
- Evaluation: Evidence of iterative improvement, such as deleting, replacing, or reorganizing elements.
- Elaboration: Attention to proportions, alignment, hierarchy, and visual detail refinement.
3. Creative Products
- Effectiveness: Whether the final drawing clearly communicates how the solution works.
- Aesthetic Quality: Balance, rhythm, proportion, and overall expressive visual quality.
- Novelty: Whether the final product introduces non‑standard forms or symbolic expressions.
- Manufacturability: Structural and physical feasibility, i.e., whether the solution could actually be built.
- Systemic Complexity: Presence of multiple interacting functional components forming a coherent system.
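The taxonomy above can be written down as a small lookup structure. This is a minimal sketch only: the identifier names (`CREBENCH_DIMENSIONS`, `all_criteria`, `dimension_of`) and the snake_case keys are illustrative assumptions, not names from the CreBench release.

```python
# Hypothetical encoding of the three dimensions and their sub-criteria.
# Key names are illustrative; they are not taken from the official scripts.
CREBENCH_DIMENSIONS = {
    "creative_ideas": ["originality", "appropriateness"],
    "creative_processes": [
        "immersion_preparation",
        "divergence",
        "structuring",
        "evaluation",
        "elaboration",
    ],
    "creative_products": [
        "effectiveness",
        "aesthetic_quality",
        "novelty",
        "manufacturability",
        "systemic_complexity",
    ],
}


def all_criteria(dimensions=CREBENCH_DIMENSIONS):
    """Return (dimension, criterion) pairs in the order listed above."""
    return [(dim, crit) for dim, crits in dimensions.items() for crit in crits]


def dimension_of(criterion, dimensions=CREBENCH_DIMENSIONS):
    """Look up which dimension a criterion belongs to."""
    for dim, crits in dimensions.items():
        if criterion in crits:
            return dim
    raise KeyError(f"unknown criterion: {criterion}")
```

For example, `dimension_of("divergence")` returns `"creative_processes"`, and `all_criteria()` yields twelve pairs, one per sub-criterion.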
📊 Rubric Design
CreBench provides a five‑level (1–5) scoring rubric for each criterion, ranging from Absent (1) to Highly Developed (5). Each criterion is anchored by a progression from its lowest to its highest level, for example:
- No divergence → rich multi‑path exploration
- No structural reasoning → well‑organized functional systems
- Minimal detail → extensive refinement and elaboration
The rubric supports both human annotation and LLM‑as‑a‑judge evaluation, with structured scoring instructions for each.
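As a rough illustration of how 1–5 rubric scores might be checked and summarized, the sketch below validates per-criterion levels and averages them within each dimension. The function names are hypothetical, and averaging is an assumption made for illustration; the text does not state how CreBench aggregates scores.

```python
from statistics import mean

MIN_LEVEL, MAX_LEVEL = 1, 5
# Only the two endpoint labels are named in the text above.
ANCHORS = {MIN_LEVEL: "Absent", MAX_LEVEL: "Highly Developed"}


def validate_scores(scores):
    """Check that every score is an integer level between 1 and 5."""
    for criterion, level in scores.items():
        is_int = isinstance(level, int) and not isinstance(level, bool)
        if not is_int or not (MIN_LEVEL <= level <= MAX_LEVEL):
            raise ValueError(f"{criterion}: level must be an integer in 1..5, got {level!r}")
    return scores


def aggregate_by_dimension(scores, dimensions):
    """Average the rated criteria within each dimension (assumed aggregation)."""
    validate_scores(scores)
    result = {}
    for dim, criteria in dimensions.items():
        rated = [scores[c] for c in criteria if c in scores]
        if rated:  # skip dimensions with no rated criteria
            result[dim] = mean(rated)
    return result


# Usage with a hypothetical annotation for one drawing.
dimensions = {"creative_ideas": ["originality", "appropriateness"]}
summary = aggregate_by_dimension({"originality": 4, "appropriateness": 5}, dimensions)
print(summary)  # {'creative_ideas': 4.5}
```

The same validation step applies whether the levels come from a human annotator or from parsed LLM-as-a-judge output, since both must land on the same 1–5 scale.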
🚀 Key Contributions
- A full‑pipeline creativity definition spanning Ideas → Processes → Products.
- Behavior‑level analysis using drawing logs (pauses, trials, undo, restructuring, etc.).
- Multi‑dimensional, interpretable, and operational scoring standards.
- A scalable benchmark for studying alignment between human creative thinking and model behavior.
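To make the behavior-level analysis of drawing logs more concrete, here is a minimal sketch that counts pauses, undo actions, and restructuring actions in an event list. The log schema (dicts with a timestamp `t` in seconds and an `action` string), the action names, and the 3-second pause threshold are all assumptions for illustration; the actual CreBench log format may differ.

```python
from collections import Counter


def summarize_log(events, pause_threshold=3.0):
    """Summarize a drawing log given as dicts with 't' (seconds) and 'action'.

    The schema and action names are hypothetical, not the CreBench format.
    """
    ordered = sorted(events, key=lambda e: e["t"])
    # Time gaps between consecutive events; long gaps are treated as pauses.
    gaps = [b["t"] - a["t"] for a, b in zip(ordered, ordered[1:])]
    counts = Counter(e["action"] for e in ordered)
    return {
        "n_events": len(ordered),
        "n_pauses": sum(1 for g in gaps if g >= pause_threshold),
        "longest_gap": max(gaps, default=0.0),
        "n_undo": counts["undo"],
        # Moving or deleting elements is counted as restructuring here.
        "n_restructure": counts["move"] + counts["delete"],
    }


log = [
    {"t": 0.0, "action": "stroke"},
    {"t": 1.0, "action": "stroke"},
    {"t": 6.0, "action": "undo"},
    {"t": 7.0, "action": "stroke"},
    {"t": 12.0, "action": "move"},
]
features = summarize_log(log)
print(features)
```

With this toy log, the summary reports five events, two pauses of at least three seconds, one undo, and one restructuring action. Features like these could serve as evidence when scoring process criteria such as Immersion / Preparation or Evaluation.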
📁 Resources
- Full scoring rubric with annotated examples (coming soon).
- Drawing logs and process data (transcripts, log files, data streams).
- Evaluation scripts and baseline model results.