FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

FineGRAIN Bench will appear as a Spotlight Paper at NeurIPS 2025

1University of Maryland, 2Columbia University, 3Sony AI

About FineGRAIN

FineGRAIN is a comprehensive benchmark for evaluating text-to-image (T2I) models across 27 specific failure modes. While T2I models can generate visually impressive images, they often struggle with precise prompt adherence—missing colors, object counts, spatial relationships, and other critical details.

Our benchmark provides a structured evaluation framework that tests both T2I model capabilities and Vision Language Model (VLM) performance as judges. We evaluate how well VLMs identify specific failure modes in images generated by leading T2I models, including Flux and Stable Diffusion variants, using challenging prompts designed to elicit common failure patterns.
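As an illustration of the judge protocol, each generated image can be paired with a yes/no question per failure mode, with the VLM's free-form answer parsed into a binary verdict. This is a minimal sketch, not the benchmark's actual pipeline; the function names and the `query_vlm` callable are hypothetical stand-ins for whatever VLM interface is used.

```python
def make_judge_question(failure_mode: str, prompt: str) -> str:
    """Build a yes/no question asking whether an image exhibits a failure mode."""
    return (
        f'The image was generated from the prompt: "{prompt}". '
        f"Does the image exhibit the failure mode '{failure_mode}'? "
        "Answer strictly 'yes' or 'no'."
    )

def parse_verdict(answer: str) -> bool:
    """Map a free-form VLM answer to True (failure present) or False."""
    return answer.strip().lower().startswith("yes")

def judge_image(query_vlm, image, failure_mode: str, prompt: str) -> bool:
    """Ask a VLM judge whether `image` shows `failure_mode`.

    `query_vlm` is any callable taking (image, question) and returning the
    model's text answer -- a hypothetical interface, not a real API.
    """
    question = make_judge_question(failure_mode, prompt)
    return parse_verdict(query_vlm(image, question))

# Example with a stub judge that reports a color-binding failure.
stub = lambda image, question: "Yes, the cup is blue instead of red."
print(judge_image(stub, None, "Color Binding", "a red cup on a green table"))
```

Aggregating these per-mode verdicts over many images yields the failure rates and judge accuracies reported in the tables below.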

The benchmark reveals systematic weaknesses in current models and provides actionable insights for researchers and practitioners. Use FineGRAIN to compare model performance, identify areas for improvement, and track progress in text-to-image generation quality.

For questions, feedback, or suggestions about the benchmark, please contact us at contact@finegrainbench.ai.

T2I Model Performance Comparison (Version 1.1)

(Interactive table not reproduced here. Columns: Model, Company, Average, then one score column per failure mode: Cause-effect Relations, Action and Motion, Anatomical Accuracy, BG-FG Mismatch, Blending Styles, Color Binding, Counts or Multiple Objects, Abstract Concepts, Emotional Conveyance, FG-BG Relations, Human Action, Human Anatomy, Moving, Long Text Specific, Negation, Opposite Relation, Perspective, Physics, Scaling, Shape Binding, Short Text Specific, Social Relations, Spatial Relations, Surreal, Tense and Aspect, Text Rendering Style, Text-Based, Texture Binding.)

Note: * Images resized before evaluation

VLM Judge Performance Comparison

Overall VLM Accuracy

  • Molmo: 66.1% (best overall)
  • InternVL3: 65.1%
  • Pixtral: 63.8%

(Per-failure-mode table not reproduced here. Columns: Failure Mode, Molmo, InternVL3, Pixtral, Best Performance.)

FineGRAIN: 27 Failure Modes

(Interactive table not reproduced here. Columns: Failure Mode, Failure Rate, Description, Sample Prompt.)

How to read this table:

  • Failure Rate: Percentage of images that contain the failure mode (higher = more challenging)
  • Sample Prompts: Real prompts used in our evaluation to elicit specific failure modes
  • Color coding: Red (>70%) = Very challenging, Yellow (30-70%) = Moderate, Green (<30%) = Less challenging
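The color legend above maps directly onto failure-rate thresholds. A small sketch (the bucket names mirror the legend; nothing beyond the stated thresholds is assumed):

```python
def difficulty_bucket(failure_rate: float) -> str:
    """Map a failure rate (percent of images exhibiting the failure mode)
    to the table's color legend."""
    if failure_rate > 70:
        return "red"      # very challenging
    if failure_rate >= 30:
        return "yellow"   # moderate
    return "green"        # less challenging
```

Boundary values (exactly 30% or 70%) fall into the yellow band, matching the "30-70%" range in the legend.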

BibTeX

@misc{hayes2025finegrainevaluatingfailuremodes,
      title={FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges},
      author={Kevin David Hayes and Micah Goldblum and Vikash Sehwag and Gowthami Somepalli and Ashwinee Panda and Tom Goldstein},
      year={2025},
      eprint={2512.02161},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.02161},
}