ConceptMix

A Compositional Benchmark for Evaluating Text-to-Image Models

Princeton University

ConceptMix provides a flexible and scalable evaluation benchmark for T2I models.

Abstract

Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and so the evaluations have low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark consisting of two stages: (a) With categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts to generate text prompts with GPT-4o for image generation. (b) To automatically evaluate generation quality, ConceptMix uses an LLM to generate one question per visual concept, allowing automatic grading of whether each specified concept...
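Stage (a) can be illustrated with a minimal sketch. Everything named here is hypothetical: the `CONCEPTS` and `OBJECTS` vocabularies are placeholder values (only the category names come from this page), `sample_spec` is an illustrative helper, and drawing the k concepts from k distinct categories is an assumption. The step that turns the sampled specification into a natural-language prompt with GPT-4o is left out.

```python
import random

# Hypothetical vocabulary: category names follow the page,
# the values are placeholders chosen for illustration only.
CONCEPTS = {
    "color": ["red", "blue", "green"],
    "shape": ["round", "square", "triangular"],
    "texture": ["fluffy", "metallic", "wooden"],
    "size": ["tiny", "huge"],
    "style": ["watercolor", "pixel art"],
    "number": ["two", "three", "four"],
    "spatial": ["to the left of", "on top of", "behind"],
}
OBJECTS = ["dog", "chair", "teapot", "bicycle"]


def sample_spec(k, rng):
    """Sample one object plus k visual concepts.

    Assumption: the k concepts come from k distinct categories.
    """
    if not 1 <= k <= len(CONCEPTS):
        raise ValueError(f"k must be between 1 and {len(CONCEPTS)}")
    obj = rng.choice(OBJECTS)
    # sorted() gives a stable list of category names to sample from.
    categories = rng.sample(sorted(CONCEPTS), k)
    concepts = [(cat, rng.choice(CONCEPTS[cat])) for cat in categories]
    return {"object": obj, "concepts": concepts}


rng = random.Random(0)  # fixed seed so the sample is reproducible
spec = sample_spec(3, rng)
print(spec)
```

Raising k increases the number of constraints a single image must satisfy, which is how a benchmark of this kind can scale prompt difficulty.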

Figure: Many open-source models surpass proprietary model performance on existing benchmarks yet fail consistently in reasoning questions from ConceptMix.


Leaderboard

The leaderboard below reports the performance of T2I models evaluated with the ConceptMix benchmark, where k is the number of visual concepts sampled for each prompt.

Rank  Model               k=1   k=2   k=3   k=4   k=5   k=6   k=7
1     DALL·E 3            0.83  0.61  0.50  0.27  0.17  0.11  0.08
2     Playground v2.5     0.70  0.46  0.22  0.10  0.07  0.02  0.00
3     DeepFloyd IF XL v1  0.68  0.38  0.21  0.09  0.05  0.02  0.01
4     SDXL Base           0.69  0.43  0.18  0.09  0.05  0.01  0.00
5     PixArt alpha        0.66  0.37  0.17  0.09  0.05  0.01  0.01
6     SDXL Turbo          0.64  0.35  0.18  0.09  0.03  0.02  0.01
7     SD v2.1             0.52  0.29  0.14  0.06  0.03  0.01  0.00
8     SD v1.4             0.52  0.23  0.08  0.03  0.01  0.00  0.00
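
This page does not spell out how the per-concept answers are combined into a single score. One strict reading, assumed here purely for illustration, is that an image counts as correct only when every question about it is answered "yes", and the score is the fraction of such images. The function names and the example answers below are hypothetical.

```python
def image_passes(answers):
    """Return True only if every per-concept question was answered 'yes'."""
    return all(answers)


def benchmark_score(per_image_answers):
    """Fraction of images for which all concept questions passed.

    Assumption: strict all-or-nothing grading per image.
    """
    if not per_image_answers:
        raise ValueError("need at least one graded image")
    passed = sum(image_passes(answers) for answers in per_image_answers)
    return passed / len(per_image_answers)


# Hypothetical grading results for four images, three questions each.
example = [
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [False, False, True],
]
score = benchmark_score(example)
print(score)  # 2 of the 4 images pass every question, so this prints 0.5
```

Under this reading, one missed concept fails the whole image, which would be consistent with scores dropping quickly as k grows.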

If you have a model you'd like to see on the leaderboard, please submit it here.

LAION-5B Concept Diversity

We analyze the LAION-5B dataset for concept diversity. The heatmap below shows how frequently each category of visual concept appears in sampled captions.

Concept Diversity in LAION-5B Dataset. Left: A heatmap of sampled captions shows that colors and styles are the most frequent concepts, while shapes and spatial relationships are the least frequent. Right: Most examples include 2-3 concepts.

Performance Across Concept Categories

We evaluate the performance of T2I models across different concept categories. Color and style are the easiest categories, with all models achieving high scores. Performance is lower for generating specific numbers of objects and for spatial relationships, and results vary for texture and size. Overall, DALL·E 3 outperforms the other models in every category.

Performance Across Concept Categories. The bar graph shows how different models perform across various visual concept categories.

BibTeX

@article{wu2024conceptmix,
  title={ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty},
  author={Wu, Xindi and Yu, Dingli and Huang, Yangsibo and Russakovsky, Olga and Arora, Sanjeev},
  journal={arXiv preprint arXiv:2408.14339},
  year={2024}
}