MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation

Jia Wang1,2    Jie Hu2    Xiaoqi Ma2    Hanghang Ma2    Yanbing Zeng2    Xiaoming Wei2

1University of Chinese Academy of Sciences    2Meituan   

Corresponding author

Abstract

Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development.

Task Intro

Figure 1: An output example of MagicAssessor-7B.

Given an image generated by an AIGC model and its corresponding prompt, our model can detect artifacts that appear in the image. In this example, an artifact in the car's rearview mirror is accurately identified.
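The expected interface is simple: the assessor takes the generated image and its prompt, decides whether any artifact is present, and returns fine-grained labels with a short rationale. The sketch below illustrates this input/output contract only; the MagicAssessor wrapper, function name, and output fields are hypothetical assumptions, not the released model's actual API.

    # Minimal sketch of an artifact-assessment call and its structured output.
    # The `assess` entry point and field names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ArtifactFinding:
        region: str          # e.g. "car rearview mirror"
        l2_label: str        # e.g. "Object Structure"
        l3_label: str        # finer label, if one is defined for this region
        explanation: str     # free-text rationale from the model

    @dataclass
    class AssessmentResult:
        has_artifact: bool                           # Level-1 decision: Normal vs. Artifact
        findings: List[ArtifactFinding] = field(default_factory=list)

    def assess(image_path: str, prompt: str) -> AssessmentResult:
        """Hypothetical entry point: run the VLM on (image, prompt) and parse
        its rationale and final labels into an AssessmentResult."""
        raise NotImplementedError

    # Illustrative usage:
    # result = assess("car.png", "a red sports car parked by the sea")
    # if result.has_artifact:
    #     for f in result.findings:
    #         print(f.region, "->", f.l2_label, "/", f.l3_label)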

Classification of Artifacts

Figure 2: The hierarchical taxonomy of image artifacts.

We define the Normal/Artifact decision as Level 1 (L1). Within the Artifact branch, we first distinguish artifacts concerning a single subject from those involving interactions between subjects. Subject-level issues are further divided into Anatomy (covering human, animal, and object structure) and Attributes (covering color and proportion); these main categories constitute our Level 2 (L2) labels. For critical areas, we define more specific Level 3 (L3) labels, such as Hand Structure Deformity.
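The hierarchy can be summarized as a simple nested structure. The sketch below includes only the categories named in this section, and the placement of the finer labels is an assumption based on the description above; the full taxonomy contains more L2/L3 labels.

    # Minimal sketch of the three-level label hierarchy (illustrative subset only).
    TAXONOMY = {
        "Normal": None,                          # L1: no artifact
        "Artifact": {                            # L1: artifact present
            "Subject": {                         # issues within a single subject
                "Anatomy": {                     # L2: structural problems
                    "Human Structure": ["Hand Structure Deformity"],  # L3 example
                    "Animal Structure": [],
                    "Object Structure": [],
                },
                "Attributes": {                  # L2: appearance problems
                    "Color": [],
                    "Proportion": [],
                },
            },
            "Interaction": {},                   # issues between subjects (labels omitted here)
        },
    }

    def l3_labels(tax=TAXONOMY):
        """Collect all leaf (L3) labels from the nested taxonomy."""
        leaves = []
        def walk(node):
            if isinstance(node, dict):
                for child in node.values():
                    walk(child)
            elif isinstance(node, list):
                leaves.extend(node)
        walk(tax)
        return leaves

    print(l3_labels())   # ['Hand Structure Deformity']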

Figure 3: Visual examples of artifacts corresponding to our taxonomy.

Data Curation

Figure 4: Data curation for MagicData340K.

The process begins with curating prompts from diverse sources and generating corresponding images with a suite of T2I models. A comprehensive annotation taxonomy for artifacts is then developed through iterative tests and applied to the resulting text-image pairs. Finally, a representative subset is selected for fine-grained annotation, where detailed descriptions for each label are written by humans and used to synthesize Chain-of-Thought (CoT) rationales with GPT-4o.
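To make the last step concrete, the sketch below shows what a single annotated record might look like and how a human-written description could be turned into a request for CoT synthesis. The field names, example values, and prompt template are illustrative assumptions, not the dataset's actual schema or the exact GPT-4o prompt used in the paper.

    # Illustrative record and CoT-synthesis request builder (schema is assumed).
    import json

    record = {
        "image": "outputs/model_a/000123.png",
        "prompt": "a chef plating dessert in a busy kitchen",
        "t2i_model": "model_a",
        "l1": "Artifact",
        "l2": ["Anatomy"],
        "l3": ["Hand Structure Deformity"],
        "description": "The chef's left hand has six fingers gripping the plate.",
    }

    def build_cot_request(rec: dict) -> list:
        """Build a chat request asking a strong VLM/LLM to expand the human
        description into a step-by-step rationale ending with the gold labels."""
        return [
            {"role": "system", "content": "You write concise visual-inspection rationales."},
            {"role": "user", "content": (
                f"Prompt: {rec['prompt']}\n"
                f"Human finding: {rec['description']}\n"
                f"Labels: {rec['l2']} / {rec['l3']}\n"
                "Write a short chain-of-thought that inspects the image region by "
                "region and concludes with exactly these labels."
            )},
        ]

    print(json.dumps(build_cot_request(record), indent=2))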

Training Strategy

Figure 5: Data sampling strategy and reward system in model training.

To enhance MagicAssessor's performance, we employ GRPO, which we adapt to our task through two key innovations: a targeted data sampling strategy and a multi-level reward system. First, to address data imbalance, the sampling strategy oversamples challenging positive cases, such as anatomically correct hands. Second, the multi-level reward system guides the model from coarse- to fine-grained detection, and a novel consistency reward aligns the model's reasoning with its final output to prevent reward hacking; a sketch of such a reward is given below.
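The following is a hedged sketch of a multi-level reward in the spirit described above: coarse (L1) correctness gates finer (L2/L3) credit, and a consistency term checks that the reasoning text agrees with the final labels. The weights, gating rule, and consistency check are illustrative assumptions, not the paper's exact formulation.

    # Illustrative multi-level reward with a simple consistency term (assumed weights).
    from typing import Set

    def multi_level_reward(
        pred_l1: str, gold_l1: str,
        pred_l2: Set[str], gold_l2: Set[str],
        pred_l3: Set[str], gold_l3: Set[str],
        reasoning: str,
    ) -> float:
        reward = 0.0

        # Level 1: Normal vs. Artifact. If this is wrong, finer levels earn nothing.
        if pred_l1 != gold_l1:
            return reward
        reward += 1.0

        # Levels 2 and 3: F1 between predicted and gold label sets.
        def set_f1(pred: Set[str], gold: Set[str]) -> float:
            if not pred and not gold:
                return 1.0
            if not pred or not gold:
                return 0.0
            tp = len(pred & gold)
            if tp == 0:
                return 0.0
            p, r = tp / len(pred), tp / len(gold)
            return 2 * p * r / (p + r)

        reward += 0.5 * set_f1(pred_l2, gold_l2)
        reward += 0.5 * set_f1(pred_l3, gold_l3)

        # Consistency: penalize answers whose final labels never appear in the
        # reasoning, a simple guard against reward hacking.
        mentioned = all(lbl.lower() in reasoning.lower() for lbl in pred_l2 | pred_l3)
        reward += 0.25 if mentioned else -0.25

        return reward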

Overall Performance

Table 1: Overall performance comparison with other models.

Our model demonstrates strong performance in both detailed artifact detection and overall quality assessment. Compared to existing VLM-based assessors, MagicAssessor achieves the best overall performance, especially excelling in detailed evaluation. It also shows a high correlation with human preferences, outperforming both traditional metrics like CLIP Score and other VLM-based models.

Benchmark Results

Table 2: Evaluation scores (↑) of different T2I models on each artifact label in MagicBench.

We evaluate various state-of-the-art T2I models using MagicBench. The results show that while some models excel in specific artifact categories, even top-tier models such as GPT-image-1 remain prone to significant artifacts. These findings offer critical insight into the current state of T2I generation and underscore the importance of fine-grained artifact evaluation.
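As a toy illustration of how such per-label scores can be aggregated, the sketch below runs over the assessor's per-image artifact labels and reports, for each label, the fraction of a model's images free of that artifact (higher is better). This aggregation rule is an assumption for illustration, not MagicBench's exact metric.

    # Illustrative per-label score aggregation (the metric itself is assumed).
    from collections import defaultdict
    from typing import Dict, Iterable, List, Set

    def per_label_scores(per_image_labels: Iterable[Set[str]],
                         all_labels: List[str]) -> Dict[str, float]:
        counts = defaultdict(int)
        n = 0
        for labels in per_image_labels:          # artifact labels found in one image
            n += 1
            for lbl in labels:
                counts[lbl] += 1
        # Score per label: share of images without that artifact.
        return {lbl: 1.0 - counts[lbl] / n for lbl in all_labels} if n else {}

    # Toy data: two of three images show a hand deformity, one shows a proportion issue.
    scores = per_label_scores(
        [{"Hand Structure Deformity"}, set(), {"Hand Structure Deformity", "Proportion"}],
        ["Hand Structure Deformity", "Proportion", "Color"],
    )
    print(scores)   # {'Hand Structure Deformity': 0.33..., 'Proportion': 0.66..., 'Color': 1.0}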

Citation

@article{wang2025magicmirror,
    title   = {MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation},
    author  = {Wang, Jia and Hu, Jie and Ma, Xiaoqi and Ma, Hanghang and Zeng, Yanbing and Wei, Xiaoming},
    journal = {arXiv preprint arXiv:2509.10260},
    year    = {2025}
}