Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. We explore the impact of sample complexity on data curation and introduce COMPACT (COMPositional Atomic-to-Complex Visual Capability Tuning), a compositional tuning data recipe that scales training-sample complexity by combining multiple atomic visual capabilities in a single training example.
When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Training on COMPACT data outperforms training on the full-scale VIT data on complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%), offering a scalable and efficient synthetic data generation recipe for vision-language tasks.
Complexity k. Increasing the complexity of LLaVA samples improves downstream performance. We define k as the number of atomic visual capabilities required to answer a question. Existing VIT datasets are dominated by simple (k ≤ 2) questions; augmenting samples with one additional capability (LLaVA k+1) shifts the distribution rightward and boosts accuracy.
COMPACT is a four-step compositional data recipe that scales the complexity of training samples by combining atomic visual capabilities. Instead of chasing quantity, COMPACT asks a more information-dense question for each image, lifting the average k-value of the training data.
COMPACT data generation pipeline. (Left) We uniformly sample kgen ∈ {1, 2, 3} atomic capabilities for each image. (Center) We prompt Gemini-2.0-Flash to generate questions that naturally integrate all sampled capabilities and verify their quality. (Right) We combine the synthetic compositional tuning data with a small 5% VIT subset for response formatting.
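As a minimal sketch of the first two stages (capability sampling and prompt construction), assuming an illustrative prompt rather than the paper's exact wording, and with the actual Gemini-2.0-Flash request and quality verification left as placeholders:

```python
import random

# The 10 atomic capabilities from the taxonomy below.
ATOMIC_CAPABILITIES = [
    "color", "shape",                                              # Attribution
    "object recognition", "action recognition", "text recognition",
    "spatial recognition", "counting",                             # Recognition
    "spatial relationship", "object interaction", "scene understanding",  # Relation
]

def sample_capabilities(rng: random.Random) -> list[str]:
    """Uniformly sample k_gen in {1, 2, 3}, then draw that many distinct capabilities."""
    k_gen = rng.choice([1, 2, 3])
    return rng.sample(ATOMIC_CAPABILITIES, k_gen)

def build_generation_prompt(capabilities: list[str]) -> str:
    """Illustrative prompt asking the generator to integrate all sampled capabilities."""
    caps = ", ".join(capabilities)
    return (
        "Write one question about this image that can only be answered by jointly "
        f"using the following visual capabilities: {caps}. Then give the correct answer."
    )

rng = random.Random(0)
for image_path in ["000123.jpg"]:  # iterate over the full image pool in practice
    prompt = build_generation_prompt(sample_capabilities(rng))
    # qa = generate_qa(image_path, prompt)   # placeholder for the Gemini-2.0-Flash call
    # keep qa only if it passes the quality-verification step
```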
For a task T, the number of atomic visual capabilities {c1, …, ck} required to solve it defines its compositional complexity k. A higher k implies a more information-dense question that forces the model to actually use the visual content.
We identify 10 vision-centric atomic capabilities grouped into three categories: Attribution, Recognition, and Relation. These are the building blocks COMPACT combines to synthesize complex training questions.
| Group | Capability | Definition | Example Question |
|---|---|---|---|
| Attribution | Color | Identifying or comparing colors of objects. | "What color is the car?" |
| Attribution | Shape | Recognizing and describing shapes of objects. | "What shape is the dining table?" |
| Recognition | Object Recognition | Identifying and naming objects present in the image. | "What object is on the table?" |
| Recognition | Action Recognition | Identifying what action is being performed. | "What is the person doing?" |
| Recognition | Text Recognition | Reading and interpreting text visible in the image. | "What word is written on the sign?" |
| Recognition | Spatial Recognition | Understanding the overall scene layout. | "How is the furniture arranged in the room?" |
| Recognition | Counting | Determining the number of instances. | "How many people are in the room?" |
| Relation | Spatial Relationship | Identifying how objects are positioned relative to each other. | "What is next to the red car?" |
| Relation | Object Interaction | Analyzing how multiple objects interact. | "How is the woman interacting with the laptop?" |
| Relation | Scene Understanding | Identifying the type of environment/setting. | "Where is this scene taking place?" |
With 32K compositional tuning samples + 5% LLaVA-665K VIT (≈65K total — 10% of the full VIT budget), COMPACT matches full-scale VIT performance (100.2% relative score) and outperforms strong data-selection baselines including EL2N, D2-Pruning, and ICONS on 8 multimodal benchmarks.
| Method | #Data | InfoVQA | SeedB2+ | MME | TextVQA | MM-Vet | CV-Bench | MMStar | LLaVA-W | Rel. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-665K (full) | 665K | 20.80 | 41.72 | 1478.48 | 46.99 | 29.22 | 60.92 | 35.11 | 68.50 | 100.0 |
| Random | 65K | 20.05 | 41.85 | 1327.70 | 42.88 | 30.46 | 54.71 | 34.13 | 64.30 | 95.4 |
| EL2N | 65K | 20.52 | 42.95 | 1350.10 | 42.41 | 33.53 | 50.92 | 33.82 | 62.30 | 97.1 |
| D2-Pruning | 65K | 20.90 | 43.70 | 1362.30 | 41.82 | 31.61 | 48.49 | 36.63 | 61.80 | 97.1 |
| ICONS | 65K | 21.00 | 42.03 | 1402.75 | 43.12 | 31.23 | 55.96 | 35.96 | 61.80 | 97.5 |
| COMPACT (ours) | 65K | 23.68 | 43.13 | 1379.94 | 44.37 | 31.74 | 55.28 | 36.13 | 64.50 | 100.2 |
Main comparison. COMPACT (65K) outperforms all 65K data-selection baselines and matches the full 665K VIT baseline.
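The Rel. (%) column is consistent with averaging each method's per-benchmark score divided by the corresponding full-VIT score; under that assumption, the short check below reproduces the table's values:

```python
BENCHMARKS = ["InfoVQA", "SeedB2+", "MME", "TextVQA", "MM-Vet", "CV-Bench", "MMStar", "LLaVA-W"]
FULL_VIT = [20.80, 41.72, 1478.48, 46.99, 29.22, 60.92, 35.11, 68.50]   # LLaVA-665K (full)
COMPACT_65K = [23.68, 43.13, 1379.94, 44.37, 31.74, 55.28, 36.13, 64.50]

def relative_score(scores, baseline):
    """Mean of per-benchmark ratios to the full-VIT baseline, in percent."""
    return 100 * sum(s / b for s, b in zip(scores, baseline)) / len(baseline)

print(f"{relative_score(COMPACT_65K, FULL_VIT):.1f}")  # 100.2
```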
Performance across compositional tuning data scales. Fixing the VIT subset at 5% of LLaVA-665K and scaling COMPACT's compositional tuning data from 2K to 32K. COMPACT (solid) consistently outperforms VIT-only baselines (dashed) with far less data. On spatially complex tasks like SeedBench2Plus, COMPACT's 2K model rivals VIT's 32K model.
We analyze 5,400 LLaVA-665K questions and 7,200 COMPACT questions with Gemini-2.0-Flash. LLaVA is dominated by easy samples (77% have k ≤ 2); COMPACT redistributes mass toward higher complexity.
| Statistic | LLaVA-665K | COMPACT |
|---|---|---|
| Mean k-value | 1.95 | 2.89 |
| Mode k-value | 2 | 3 |
| Samples with k ≤ 2 | 77% | 35% |
| Zero-capability (k = 0) samples | 1.1% | 0% |
~1.1% of LLaVA-665K questions require no visual capabilities at all; they could be answered without looking at the image.
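Assuming each question has already been tagged with its required capabilities by a judge model (e.g., Gemini-2.0-Flash), the audit that produces the statistics above can be sketched as:

```python
from statistics import mean, mode

def audit_complexity(capability_tags: list[list[str]]) -> dict:
    """capability_tags[i] is the list of atomic capabilities the judge model
    assigns to question i; k is the number of distinct capabilities."""
    ks = [len(set(tags)) for tags in capability_tags]
    n = len(ks)
    return {
        "mean_k": mean(ks),
        "mode_k": mode(ks),
        "frac_k_le_2": sum(k <= 2 for k in ks) / n,
        "frac_k_eq_0": sum(k == 0 for k in ks) / n,
    }

# A text-only question gets an empty tag list and counts toward k = 0.
print(audit_complexity([["color", "counting"], [],
                        ["object recognition", "spatial relationship", "scene understanding"]]))
```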
Capability correlation. Spatial capabilities (scene understanding, spatial recognition, spatial relationship) are locally correlated; object recognition co-occurs with most other capabilities. Because capabilities co-occur, a question generated from kgen sampled capabilities often ends up requiring more, so kgen acts as a lower bound on the final k-value.
We dissect COMPACT along four axes: (A) complexity distribution, (B) atomic-capability coverage, (C) instruction-tuning ratio, and (D) the kgen range used during generation.
When we force COMPACT's k-distribution to match LLaVA-665K's (unbalanced), relative performance drops from 98.8% → 97.6%, erasing roughly half of COMPACT's gain over random selection; at least half of the gain therefore comes from higher-complexity samples.
Leave-one-out: removing any atomic capability hurts performance. Scene understanding (−5.2) and spatial relationship (−4.9) drive the largest gains; none are redundant.
Instruction-following is orthogonal to visual reasoning — performance stabilizes around a 5% VIT mix with diminishing returns beyond.
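A minimal sketch of the final data mix, assuming in-memory sample lists and a uniformly drawn 5% subset of LLaVA-665K for response formatting:

```python
import random

def build_training_mix(compact_samples, llava_samples, vit_fraction=0.05, seed=0):
    """Combine compositional tuning data with a small VIT subset for response formatting."""
    rng = random.Random(seed)
    n_vit = int(vit_fraction * len(llava_samples))   # ~33K when len(llava_samples) = 665K
    vit_subset = rng.sample(llava_samples, n_vit)
    mix = compact_samples + vit_subset               # ~32K + ~33K ≈ 65K samples total
    rng.shuffle(mix)
    return mix
```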
Effect of kgen range. Training on kgen ∈ {1,2,3} outperforms kgen = 3 alone. Complex samples are most useful in the presence of simpler ones — a range from simple to complex is optimal.
@article{wu2025compact,
  title   = {{COMPACT}: Compositional Atomic-to-Complex Visual Capability Tuning},
  author  = {Wu, Xindi and Hwang, Hee Seung and Kirichenko, Polina and Tureci, Esin and Russakovsky, Olga},
  journal = {arXiv preprint arXiv:2504.21850},
  year    = {2025}
}
This material is based upon work supported by the National Science Foundation under Grants 2107048 and 2112562, and by Solidigm AI SW. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Solidigm Research. All experiments, data collection, and processing activities were conducted at Princeton University. Meta was involved solely in an advisory role; no experiments, data collection, or processing activities were conducted on Meta infrastructure. We thank Allison Chen, Sanghyuk Chun, and Jihoon Chung for helpful discussions and feedback.