FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

FineGRAIN Bench will appear as a Spotlight Paper at NeurIPS 2025

1University of Maryland, 2Columbia University, 3Sony AI

About FineGRAIN

FineGRAIN is a comprehensive benchmark for evaluating text-to-image (T2I) models across 27 specific failure modes. While T2I models can generate visually impressive images, they often struggle with precise prompt adherence—missing colors, object counts, spatial relationships, and other critical details.

Our benchmark provides a structured evaluation framework that tests both T2I model capabilities and Vision Language Model (VLM) performance as judges. We evaluate how well VLMs identify specific failure modes in images generated by leading T2I models, including Flux and Stable Diffusion variants, using challenging prompts designed to elicit common failure patterns.
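As an illustration of the judge protocol, each generated image can be paired with a yes/no question per failure mode, with the VLM's free-form answer parsed into a binary verdict. This is a minimal sketch, not the benchmark's actual pipeline; the function names and the `query_vlm` callable are hypothetical stand-ins for whatever VLM interface is used.

```python
def make_judge_question(failure_mode: str, prompt: str) -> str:
    """Build a yes/no question asking whether an image exhibits a failure mode."""
    return (
        f'The image was generated from the prompt: "{prompt}". '
        f"Does the image exhibit the failure mode '{failure_mode}'? "
        "Answer strictly 'yes' or 'no'."
    )

def parse_verdict(answer: str) -> bool:
    """Map a free-form VLM answer to True (failure present) or False."""
    return answer.strip().lower().startswith("yes")

def judge_image(query_vlm, image, failure_mode: str, prompt: str) -> bool:
    """Ask a VLM judge whether `image` shows `failure_mode`.

    `query_vlm` is any callable taking (image, question) and returning the
    model's text answer -- a hypothetical interface, not a real API.
    """
    question = make_judge_question(failure_mode, prompt)
    return parse_verdict(query_vlm(image, question))

# Example with a stub judge that reports a color-binding failure.
stub = lambda image, question: "Yes, the cup is blue instead of red."
print(judge_image(stub, None, "Color Binding", "a red cup on a green table"))
```

Aggregating these per-mode verdicts over many images yields the failure rates and judge accuracies reported in the tables below.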

The benchmark reveals systematic weaknesses in current models and provides actionable insights for researchers and practitioners. Use FineGRAIN to compare model performance, identify areas for improvement, and track progress in text-to-image generation quality.

For questions, feedback, or suggestions about the benchmark, please contact us at contact@finegrainbench.ai.

T2I Model Performance Comparison (Version 1.1)

(Interactive table not reproduced here. Columns: Model, Company, Average, then one score column per failure mode: Cause-effect Relations, Action and Motion, Anatomical Accuracy, BG-FG Mismatch, Blending Styles, Color Binding, Counts or Multiple Objects, Abstract Concepts, Emotional Conveyance, FG-BG Relations, Human Action, Human Anatomy, Moving, Long Text Specific, Negation, Opposite Relation, Perspective, Physics, Scaling, Shape Binding, Short Text Specific, Social Relations, Spatial Relations, Surreal, Tense and Aspect, Text Rendering Style, Text-Based, Texture Binding.)

Note: * Images resized before evaluation

VLM Judge Performance Comparison

Overall VLM Accuracy

  • Molmo: 66.1% (best overall)
  • InternVL3: 65.1%
  • Pixtral: 63.8%

(Per-failure-mode table not reproduced here. Columns: Failure Mode, Molmo, InternVL3, Pixtral, Best Performance.)

FineGRAIN: 27 Failure Modes

(Interactive table not reproduced here. Columns: Failure Mode, Failure Rate, Description, Sample Prompt.)

How to read this table:

  • Failure Rate: Percentage of images that contain the failure mode (higher = more challenging)
  • Sample Prompts: Real prompts used in our evaluation to elicit specific failure modes
  • Color coding: Red (>70%) = Very challenging, Yellow (30-70%) = Moderate, Green (<30%) = Less challenging
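The color legend above maps directly onto failure-rate thresholds. A small sketch (the bucket names mirror the legend; nothing beyond the stated thresholds is assumed):

```python
def difficulty_bucket(failure_rate: float) -> str:
    """Map a failure rate (percent of images exhibiting the failure mode)
    to the table's color legend."""
    if failure_rate > 70:
        return "red"      # very challenging
    if failure_rate >= 30:
        return "yellow"   # moderate
    return "green"        # less challenging
```

Boundary values (exactly 30% or 70%) fall into the yellow band, matching the "30-70%" range in the legend.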

BibTeX

@misc{hayes2025finegrainevaluatingfailuremodes,
      title={FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges},
      author={Kevin David Hayes and Micah Goldblum and Vikash Sehwag and Gowthami Somepalli and Ashwinee Panda and Tom Goldstein},
      year={2025},
      eprint={2512.02161},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.02161},
}