Compositionality is a critical capability of Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, which limits their diversity and complexity and leaves them with low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark consisting of two stages: (a) Given categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k visual concepts, then uses GPT-4o to generate text prompts for image generation. (b) To automatically evaluate generation quality, ConceptMix uses an LLM to generate one question per visual concept, allowing automatic grading of whether each specified concept appears in the generated image.
Overview of CONCEPTMIX benchmark for T2I models.
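The two-stage pipeline can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: the concept pools are truncated, the GPT-4o prompt wording is invented, and `ask_vlm` stands in for whatever vision-language grader is used.

```python
import random
from openai import OpenAI

client = OpenAI()

# Hypothetical (truncated) concept pools; the benchmark draws from categories
# such as color, shape, texture, size, number, spatial relationship, and style.
CONCEPTS = {
    "color":   ["red", "blue", "green"],
    "shape":   ["round", "square", "triangular"],
    "spatial": ["to the left of", "on top of"],
    "style":   ["watercolor", "photorealistic"],
}
OBJECTS = ["dog", "teapot", "bicycle"]

def sample_prompt(k: int) -> tuple[str, list[str]]:
    """Stage (a): sample an object and k visual concepts, then ask
    GPT-4o to weave them into a natural T2I prompt."""
    obj = random.choice(OBJECTS)
    categories = random.sample(list(CONCEPTS), k)
    concepts = [random.choice(CONCEPTS[c]) for c in categories]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write one image-generation prompt about a {obj} "
                       f"that includes all of: {', '.join(concepts)}.",
        }],
    )
    return resp.choices[0].message.content, concepts

def grade(image, concepts: list[str], ask_vlm) -> list[bool]:
    """Stage (b): one yes/no question per concept, graded automatically.
    `ask_vlm(image, question)` is a hypothetical grader returning text."""
    questions = [f"Does the image show the concept '{c}'? Answer yes or no."
                 for c in concepts]
    return [ask_vlm(image, q).strip().lower().startswith("yes")
            for q in questions]
```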
We provide a Leaderboard table to showcase the performance of different T2I models evaluated with the ConceptMix benchmark.
| Rank | Model | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 |
|---|---|---|---|---|---|---|---|---|
| 1 | DALL·E 3 | 0.83 | 0.61 | 0.50 | 0.27 | 0.17 | 0.11 | 0.08 |
| 2 | Playground v2.5 | 0.70 | 0.46 | 0.22 | 0.10 | 0.07 | 0.02 | 0.00 |
| 3 | DeepFloyd IF XL v1 | 0.68 | 0.38 | 0.21 | 0.09 | 0.05 | 0.02 | 0.01 |
| 4 | SDXL Base | 0.69 | 0.43 | 0.18 | 0.09 | 0.05 | 0.01 | 0.00 |
| 5 | PixArt alpha | 0.66 | 0.37 | 0.17 | 0.09 | 0.05 | 0.01 | 0.01 |
| 6 | SDXL Turbo | 0.64 | 0.35 | 0.18 | 0.09 | 0.03 | 0.02 | 0.01 |
| 7 | SD v2.1 | 0.52 | 0.29 | 0.14 | 0.06 | 0.03 | 0.01 | 0.00 |
| 8 | SD v1.4 | 0.52 | 0.23 | 0.08 | 0.03 | 0.01 | 0.00 | 0.00 |
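The k-wise columns above can be read as pass rates over generated images. A minimal aggregation sketch, assuming (as the grading stage above suggests) that an image counts only when all k of its concept questions are graded correct:

```python
def conceptmix_score(results: list[list[bool]]) -> float:
    """results[i] holds the per-concept grades for image i; an image
    scores 1 only when every one of its k concepts is present."""
    return sum(all(grades) for grades in results) / len(results)

# e.g. three k=2 images: two fully correct, one missing a concept -> 0.67
print(round(conceptmix_score([[True, True], [True, False], [True, True]]), 2))
```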
We evaluate the performance of T2I models across different concept categories. Color and style are the easiest, with all models achieving high scores. Performance is lower when generating specific numbers of objects and spatial relationships, and results vary for texture and size. Overall, DALL·E 3 outperforms the other models in every category.
Performance Across Concept Categories. We evaluate T2I models across concept categories, finding high scores for color and style but lower performance for object counts and spatial relationships. DALL·E 3 outperforms others across all categories.
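A per-category breakdown like the one shown above can be derived from the same question-level grades by grouping each graded question by its concept category. A sketch with illustrative field names:

```python
from collections import defaultdict

def category_accuracy(graded):
    """graded: iterable of (category, correct) pairs, one per question.
    Returns accuracy per concept category (color, shape, spatial, ...)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in graded:
        totals[category] += 1
        hits[category] += bool(correct)
    return {c: hits[c] / totals[c] for c in totals}

print(category_accuracy([("color", True), ("color", True), ("spatial", False)]))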
ConceptMix Shows Stronger Discriminative Power: We compare five models using the 3-in-1 and GPT-4V scores (global prompt-level) from T2I-CompBench against ConceptMix at varying difficulty levels (k). ConceptMix clearly distinguishes model performance, with the gaps widening as k increases.
We compare the qualitative performance of different T2I models (SD v1.4, SD v2.1, PixArt alpha, Playground v2.5, DALL·E 3) across varying levels of compositional complexity (k = 1 to 7). As prompts become more complex, image quality degrades. DALL·E 3 performs best, while SD v1.4 performs worst.
Qualitative Comparison: Visual comparison of generated images across different models and complexity levels (k), showing degrading performance with increasing prompt complexity.
We analyze the LAION-5B dataset for concept diversity; the heatmap below shows the frequency of these visual concepts in sampled captions.
Concept Diversity in LAION-5B Dataset. Left: Heatmap of sampled captions shows colors and styles are most frequent, while shapes and spatial relationships are least frequent. Right: Most examples include 2-3 concepts.
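A rough version of this caption analysis is straightforward: sample captions and count keyword hits per concept category. This sketch assumes simple substring matching over a local list of captions and invented keyword lists; the actual analysis may use more careful matching.

```python
from collections import Counter

# Hypothetical keyword lists per concept category.
KEYWORDS = {
    "color":   ["red", "blue", "green", "yellow"],
    "shape":   ["round", "square", "triangular"],
    "spatial": ["left of", "on top of", "next to"],
    "style":   ["watercolor", "oil painting", "photo"],
}

def concept_frequencies(captions):
    """Count how many sampled captions mention each concept category."""
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for category, words in KEYWORDS.items():
            if any(w in text for w in words):
                counts[category] += 1
    return counts

print(concept_frequencies(["A red round teapot",
                           "Watercolor of a dog next to a cat"]))
```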
@article{wu2024conceptmix,
title={ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty},
author={Wu, Xindi and Yu, Dingli and Huang, Yangsibo and Russakovsky, Olga and Arora, Sanjeev},
journal={arXiv preprint arXiv:2408.14339},
year={2024}
}