Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. We explore the impact of sample complexity on data curation and introduce COMPACT (COMPositional Atomic-to-Complex Visual Capability Tuning), a compositional tuning data recipe that scales training-sample complexity by combining multiple atomic visual capabilities in a single training example.
When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Training on COMPACT data outperforms training on the full-scale VIT data on complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%), offering a scalable and efficient synthetic data generation recipe for vision-language tasks.
Complexity k. Increasing the complexity of LLaVA samples improves downstream performance. We define k as the number of atomic visual capabilities required to answer a question. Existing VIT datasets are dominated by simple (k ≤ 2) questions; augmenting samples with one additional capability (LLaVA k+1) shifts the distribution rightward and boosts accuracy.
COMPACT is a four-step compositional data recipe that scales the complexity of training samples by combining atomic visual capabilities. Instead of chasing quantity, COMPACT asks a more information-dense question for each image, lifting the average k-value of the training data.
COMPACT data generation pipeline. (Left) We uniformly sample kgen ∈ {1, 2, 3} atomic capabilities for each image. (Center) We prompt Gemini-2.0-Flash to generate questions that naturally integrate all sampled capabilities and verify their quality. (Right) We combine the synthetic compositional tuning data with a small 5% VIT subset for response formatting.
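As a minimal sketch of the first two stages (capability sampling and prompt construction), assuming an illustrative prompt rather than the paper's exact wording, and with the actual Gemini-2.0-Flash request and quality verification left as placeholders:

```python
import random

# The 10 atomic capabilities from the taxonomy below.
ATOMIC_CAPABILITIES = [
    "color", "shape",                                              # Attribution
    "object recognition", "action recognition", "text recognition",
    "spatial recognition", "counting",                             # Recognition
    "spatial relationship", "object interaction", "scene understanding",  # Relation
]

def sample_capabilities(rng: random.Random) -> list[str]:
    """Uniformly sample k_gen in {1, 2, 3}, then draw that many distinct capabilities."""
    k_gen = rng.choice([1, 2, 3])
    return rng.sample(ATOMIC_CAPABILITIES, k_gen)

def build_generation_prompt(capabilities: list[str]) -> str:
    """Illustrative prompt asking the generator to integrate all sampled capabilities."""
    caps = ", ".join(capabilities)
    return (
        "Write one question about this image that can only be answered by jointly "
        f"using the following visual capabilities: {caps}. Then give the correct answer."
    )

rng = random.Random(0)
for image_path in ["000123.jpg"]:  # iterate over the full image pool in practice
    prompt = build_generation_prompt(sample_capabilities(rng))
    # qa = generate_qa(image_path, prompt)   # placeholder for the Gemini-2.0-Flash call
    # keep qa only if it passes the quality-verification step
```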
For a task T, the number of atomic visual capabilities {c1, …, ck} required to solve it defines its compositional complexity k. A higher k implies a more information-dense question that forces the model to actually use the visual content.
We identify 10 vision-centric atomic capabilities grouped into three categories: Attribution, Recognition, and Relation. These are the building blocks COMPACT combines to synthesize complex training questions.
| Group | Capability | Definition | Example Question |
|---|---|---|---|
| Attribution | Color | Identifying or comparing colors of objects. | "What color is the car?" |
| Attribution | Shape | Recognizing and describing shapes of objects. | "What shape is the dining table?" |
| Recognition | Object Recognition | Identifying and naming objects present in the image. | "What object is on the table?" |
| Recognition | Action Recognition | Identifying what action is being performed. | "What is the person doing?" |
| Recognition | Text Recognition | Reading and interpreting text visible in the image. | "What word is written on the sign?" |
| Recognition | Spatial Recognition | Understanding the overall scene layout. | "How is the furniture arranged in the room?" |
| Recognition | Counting | Determining the number of instances. | "How many people are in the room?" |
| Relation | Spatial Relationship | Identifying how objects are positioned relative to each other. | "What is next to the red car?" |
| Relation | Object Interaction | Analyzing how multiple objects interact. | "How is the woman interacting with the laptop?" |
| Relation | Scene Understanding | Identifying the type of environment/setting. | "Where is this scene taking place?" |
With 32K compositional tuning samples + 5% LLaVA-665K VIT (≈65K total — 10% of the full VIT budget), COMPACT matches full-scale VIT performance (100.2% relative score) and outperforms strong data-selection baselines including EL2N, D2-Pruning, and ICONS on 8 multimodal benchmarks.
| Method | #Data | InfoVQA | SeedB2+ | MME | TextVQA | MM-Vet | CV-Bench | MMStar | LLaVA-W | Rel. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-665K (full) | 665K | 20.80 | 41.72 | 1478.48 | 46.99 | 29.22 | 60.92 | 35.11 | 68.50 | 100.0 |
| Random | 65K | 20.05 | 41.85 | 1327.70 | 42.88 | 30.46 | 54.71 | 34.13 | 64.30 | 95.4 |
| EL2N | 65K | 20.52 | 42.95 | 1350.10 | 42.41 | 33.53 | 50.92 | 33.82 | 62.30 | 97.1 |
| D2-Pruning | 65K | 20.90 | 43.70 | 1362.30 | 41.82 | 31.61 | 48.49 | 36.63 | 61.80 | 97.1 |
| ICONS | 65K | 21.00 | 42.03 | 1402.75 | 43.12 | 31.23 | 55.96 | 35.96 | 61.80 | 97.5 |
| COMPACT (ours) | 65K | 23.68 | 43.13 | 1379.94 | 44.37 | 31.74 | 55.28 | 36.13 | 64.50 | 100.2 |
Main comparison. COMPACT (65K) outperforms all 65K data-selection baselines and matches the full 665K VIT baseline.
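The Rel. (%) column is consistent with averaging each method's per-benchmark score divided by the corresponding full-VIT score; under that assumption, the short check below reproduces the table's values:

```python
BENCHMARKS = ["InfoVQA", "SeedB2+", "MME", "TextVQA", "MM-Vet", "CV-Bench", "MMStar", "LLaVA-W"]
FULL_VIT = [20.80, 41.72, 1478.48, 46.99, 29.22, 60.92, 35.11, 68.50]   # LLaVA-665K (full)
COMPACT_65K = [23.68, 43.13, 1379.94, 44.37, 31.74, 55.28, 36.13, 64.50]

def relative_score(scores, baseline):
    """Mean of per-benchmark ratios to the full-VIT baseline, in percent."""
    return 100 * sum(s / b for s, b in zip(scores, baseline)) / len(baseline)

print(f"{relative_score(COMPACT_65K, FULL_VIT):.1f}")  # 100.2
```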
Performance across compositional tuning data scales. Fixing the VIT subset at 5% of LLaVA-665K and scaling COMPACT's compositional tuning data from 2K to 32K. COMPACT (solid) consistently outperforms VIT-only baselines (dashed) with far less data. On spatially complex tasks like SeedBench2Plus, COMPACT's 2K model rivals VIT's 32K model.
We analyze 5,400 LLaVA-665K questions and 7,200 COMPACT questions with Gemini-2.0-Flash. LLaVA is dominated by easy samples (77% have k ≤ 2); COMPACT redistributes mass toward higher complexity.
| Statistic | LLaVA-665K | COMPACT |
|---|---|---|
| Mean k-value | 1.95 | 2.89 |
| Mode k-value | 2 | 3 |
| Samples with k ≤ 2 | 77% | 35% |
| Zero-capability (k = 0) samples | 1.1% | 0% |
~1.1% of LLaVA-665K questions require no visual capabilities at all; they could be answered without looking at the image.
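Assuming each question has already been tagged with its required capabilities by a judge model (e.g., Gemini-2.0-Flash), the audit that produces the statistics above can be sketched as:

```python
from statistics import mean, mode

def audit_complexity(capability_tags: list[list[str]]) -> dict:
    """capability_tags[i] is the list of atomic capabilities the judge model
    assigns to question i; k is the number of distinct capabilities."""
    ks = [len(set(tags)) for tags in capability_tags]
    n = len(ks)
    return {
        "mean_k": mean(ks),
        "mode_k": mode(ks),
        "frac_k_le_2": sum(k <= 2 for k in ks) / n,
        "frac_k_eq_0": sum(k == 0 for k in ks) / n,
    }

# A text-only question gets an empty tag list and counts toward k = 0.
print(audit_complexity([["color", "counting"], [],
                        ["object recognition", "spatial relationship", "scene understanding"]]))
```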
Capability correlation. Spatial capabilities (scene understanding, spatial recognition, spatial relationship) are locally correlated; object recognition co-occurs with most other capabilities. Because capabilities co-occur, a question generated from kgen sampled capabilities often ends up requiring more, so kgen acts as a lower bound on the final k-value.
We dissect COMPACT along four axes: (A) complexity distribution, (B) atomic-capability coverage, (C) instruction-tuning ratio, and (D) the kgen range used during generation.
When we force COMPACT's k-distribution to match LLaVA-665K's (unbalanced), relative performance drops from 98.8% → 97.6%, erasing roughly half of COMPACT's gain over random selection; at least half of the gain therefore comes from higher-complexity samples.
Leave-one-out: removing any atomic capability hurts performance. Scene understanding (−5.2) and spatial relationship (−4.9) drive the largest gains; none are redundant.
Instruction-following is orthogonal to visual reasoning — performance stabilizes around a 5% VIT mix with diminishing returns beyond.
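A minimal sketch of the final data mix, assuming in-memory sample lists and a uniformly drawn 5% subset of LLaVA-665K for response formatting:

```python
import random

def build_training_mix(compact_samples, llava_samples, vit_fraction=0.05, seed=0):
    """Combine compositional tuning data with a small VIT subset for response formatting."""
    rng = random.Random(seed)
    n_vit = int(vit_fraction * len(llava_samples))   # ~33K when len(llava_samples) = 665K
    vit_subset = rng.sample(llava_samples, n_vit)
    mix = compact_samples + vit_subset               # ~32K + ~33K ≈ 65K samples total
    rng.shuffle(mix)
    return mix
```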
Effect of kgen range. Training on kgen ∈ {1,2,3} outperforms kgen = 3 alone. Complex samples are most useful in the presence of simpler ones — a range from simple to complex is optimal.
@article{wu2025compact,
  title   = {{COMPACT}: Compositional Atomic-to-Complex Visual Capability Tuning},
  author  = {Wu, Xindi and Hwang, Hee Seung and Kirichenko, Polina and Tureci, Esin and Russakovsky, Olga},
  journal = {arXiv preprint arXiv:2504.21850},
  year    = {2025}
}
This material is based upon work supported by the National Science Foundation under Grants 2107048 and 2112562, and by Solidigm AI SW. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Solidigm Research. All experiments, data collection, and processing activities were conducted at Princeton University. Meta was involved solely in an advisory role; no experiments, data collection, or processing activities were conducted on Meta infrastructure. We thank Allison Chen, Sanghyuk Chun, and Jihoon Chung for helpful discussions and feedback.