We propose COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a data recipe that explicitly controls the compositional complexity of the training examples. COMPACT data lets multimodal large language models (MLLMs) train on combinations of atomic capabilities so that they learn complex capabilities more efficiently. Across complex multi-capability benchmarks, COMPACT outperforms LLaVA-665K visual instruction tuning (VIT) while using less than 10% of its data budget.
Training on COMPACT data, whose samples require up to three atomic capabilities, generalizes well to tasks that have even higher capability requirements. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.
 
Current Visual Instruction Tuning Data vs. Our Compositional Tuning Data: The VIT data is dominated by simple queries (k = 1), while our COMPACT data is balanced across compositional complexity levels (k = 1, 2, 3).
COMPACT scales capabilities of MLLMs from atomic (k = 1) to composite (k > 1) complexity levels. Our approach generates multi-capability training data by prompting vision-language models to create questions that integrate k = (1, 2, 3) atomic visual capabilities.
 
COMPACT Data Generation Pipeline: (Left): We sample atomic capabilities (k = 1) such as color, object recognition, and spatial relationship. (Center): We generate questions (k = 1, 2, 3) that integrate all the sampled capabilities. (Right): We verify the quality of generated conversations and combine them with instruction tuning data to maintain instruction following capability.
Atomic capabilities are foundational skills that can be combined to solve complex tasks. For example, a model needs to acquire object recognition, color attribution, and spatial relationship understanding capabilities to identify how objects of different colors are spatially oriented. We define the number of atomic capabilities required to solve a task as its compositional complexity k.
 
We identify 10 atomic capabilities and categorize them into three groups:
| Group | Capability | Definition | Example Question |
|---|---|---|---|
| Attribution | Color | Identifying or comparing colors of objects in the image | What color is the car? |
| Attribution | Shape | Recognizing and describing the shapes of objects in the image | What shape is the dining table? |
| Recognition | Object Recognition | Identifying and naming objects present in the image | What object is on the table? |
| Recognition | Action Recognition | Identifying what action is being performed | What is the person doing in this image? |
| Recognition | Text Recognition | Reading and interpreting text visible in the image | What word is written on the sign? |
| Recognition | Spatial Recognition | Understanding the overall spatial layout and arrangement of the entire scene | How is the furniture arranged in this room? |
| Recognition | Counting | Determining the number of instances of something in the image | How many people are in the room? |
| Relation | Spatial Relationship | Identifying how specific objects are positioned relative to each other | What is next to the red car? |
| Relation | Object Interaction | Analyzing how multiple objects interact with each other | How is the woman interacting with the laptop? |
| Relation | Scene Understanding | Identifying the type of environment/setting | Where is this scene taking place? |
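The sampling step of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the uniform sampling strategy, the function names `sample_capabilities` and `build_prompt`, and the prompt wording are all assumptions made for the example. Only the taxonomy itself comes from the table above.

```python
import math
import random

# Taxonomy copied from the table above: group -> atomic capabilities.
ATOMIC_CAPABILITIES = {
    "Attribution": ["color", "shape"],
    "Recognition": [
        "object recognition",
        "action recognition",
        "text recognition",
        "spatial recognition",
        "counting",
    ],
    "Relation": ["spatial relationship", "object interaction", "scene understanding"],
}

ALL_CAPABILITIES = [c for group in ATOMIC_CAPABILITIES.values() for c in group]


def sample_capabilities(k, rng):
    """Draw k distinct atomic capabilities uniformly at random (assumed strategy)."""
    return sorted(rng.sample(ALL_CAPABILITIES, k))


def build_prompt(capabilities):
    """Hypothetical prompt template; the actual COMPACT prompt is not shown here."""
    listed = ", ".join(capabilities)
    return (
        "Write one question about the image that requires all of these "
        f"{len(capabilities)} capabilities: {listed}. Then answer it."
    )


# Number of distinct capability sets available at each complexity level k.
combos_per_k = {k: math.comb(len(ALL_CAPABILITIES), k) for k in (1, 2, 3)}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
example = sample_capabilities(3, rng)
prompt = build_prompt(example)
```

With 10 capabilities there are 10, 45, and 120 distinct capability sets at k = 1, 2, and 3, so higher complexity levels offer many more combinations to draw from.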
With 32K samples of our compositional tuning data and 5% of the LLaVA-665K VIT data (about 65K samples in total, roughly 10% of the size of the full VIT dataset), COMPACT matches the performance of full-scale VIT (100.18% relative score) and demonstrates exceptional generalization to complex tasks.
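The data budget can be checked with a short calculation. The sketch below assumes a nominal size of exactly 665,000 samples for LLaVA-665K, so the results are approximate. The variable names are illustrative.

```python
# Nominal dataset size (assumption: "665K" taken as exactly 665,000).
FULL_VIT_SIZE = 665_000

compositional_samples = 32_000               # COMPACT compositional tuning data
vit_subset_samples = FULL_VIT_SIZE * 5 // 100  # 5% of LLaVA-665K

total_samples = compositional_samples + vit_subset_samples
budget_fraction = total_samples / FULL_VIT_SIZE

print(f"VIT subset: {vit_subset_samples:,} samples")
print(f"Total:      {total_samples:,} samples")
print(f"Fraction of full VIT data: {budget_fraction:.1%}")
```

Under this assumption the mix holds 65,250 samples, just under 10% of the full VIT data, which is why the tables list COMPACT at 65K.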
Baseline Comparisons: COMPACT (65K) outperforms a random subset of the VIT data (65K), a subset of the VIT data selected by a gradient-based approach (65K), and even the full VIT data (665K) on diverse multimodal benchmarks. The best and second best results for each benchmark are shown in bold and underlined, respectively. COMPACT integrates atomic capabilities into tasks of higher compositional complexity, enabling models to generalize and handle complex tasks without explicit decomposition.
| Recipe | #Data | InfoVQA | SeedBench2Plus | MME | TextVQA | MM-Vet | CV-Bench | MMStar | LLaVA-W | Rel. (%) | 
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 665K | 20.80 | 41.72 | 1478.48 | 46.99 | 29.22 | 60.92 | 35.11 | 68.50 | 100.00 | 
| Random | 65K | 20.05 | 41.85 | 1327.70 | 42.88 | 30.46 | 54.71 | 34.13 | 64.30 | 95.38 | 
| ICONS | 65K | 21.00 | 42.03 | 1402.75 | 43.12 | 31.23 | 55.96 | 35.96 | 61.80 | 97.47 | 
| COMPACT (ours) | 65K | 23.68 | 43.13 | 1379.94 | 44.37 | 31.74 | 55.28 | 36.13 | 64.50 | 100.18 | 
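The Rel. (%) column appears to be the mean of the per-benchmark ratios against the full 665K baseline. That formula is an inference from the numbers in the table, not a definition stated here. The sketch below recomputes it under that assumption and agrees with the reported values to within a few hundredths of a point.

```python
# Benchmark order follows the table columns.
BENCHMARKS = [
    "InfoVQA", "SeedBench2Plus", "MME", "TextVQA",
    "MM-Vet", "CV-Bench", "MMStar", "LLaVA-W",
]

# Scores copied from the table above.
SCORES = {
    "Original": [20.80, 41.72, 1478.48, 46.99, 29.22, 60.92, 35.11, 68.50],
    "Random": [20.05, 41.85, 1327.70, 42.88, 30.46, 54.71, 34.13, 64.30],
    "ICONS": [21.00, 42.03, 1402.75, 43.12, 31.23, 55.96, 35.96, 61.80],
    "COMPACT": [23.68, 43.13, 1379.94, 44.37, 31.74, 55.28, 36.13, 64.50],
}


def relative_score(scores, reference):
    """Mean of per-benchmark ratios to the reference, in percent (assumed formula)."""
    ratios = [s / r for s, r in zip(scores, reference)]
    return 100.0 * sum(ratios) / len(ratios)


rel = {name: relative_score(vals, SCORES["Original"]) for name, vals in SCORES.items()}

for name, value in rel.items():
    print(f"{name:10s} {value:7.2f}")
```

Averaging ratios rather than raw scores keeps MME, whose scale is in the thousands, from dominating the aggregate.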
Compositional Generalization to Higher Complexities: We compare the performance of COMPACT (65K) and LLaVA-665K VIT (665K) at each compositional complexity (k) level. COMPACT exceeds the LLaVA-665K baseline on tasks with k = 3, 4, and 5 while using significantly less training data.
 
Performance Across Compositional Tuning Data Scales: We fix the VIT subset (5% of LLaVA-665K) and scale the compositional tuning data in COMPACT from 2K to 32K. For comparison, we remove the compositional tuning data and add more VIT data (2K-32K) instead to prepare VIT-only baselines with equal data budgets. At every data budget, COMPACT (solid lines) consistently outperforms the VIT-only baselines (dashed lines). The performance gap is pronounced for complex reasoning benchmarks such as MM-Vet and MMStar, where the 8K COMPACT model often exceeds the VIT-only baseline at 32K. This demonstrates the data efficiency of COMPACT, which requires substantially less data than LLaVA-665K VIT to achieve comparable or better results.
 
We conduct a series of ablation studies to investigate key design considerations in COMPACT.
A. Compositional Complexity Distribution: To show that the performance improvement of COMPACT mainly comes from the balanced distribution of compositional complexity in the compositional tuning data, we analyze the impact on performance when its compositional complexity is unbalanced. We generate a 16K-sample compositional tuning dataset whose distribution of k resembles that of the LLaVA-665K VIT data. The performance of unbalanced COMPACT (96.62%) is close to the random baseline (96.28%). However, the balanced COMPACT (original) performance jumps to 98.83%, suggesting that most of the performance gain in COMPACT comes from the fair representation of higher k samples in the dataset.
| Recipe | #Data | InfoVQA | SeedBench2Plus | MME | TextVQA | MM-Vet | CV-Bench | MMStar | LLaVA-W | Rel. (%) | 
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-665K VIT | 665K | 20.80 | 41.72 | 1478.48 | 46.99 | 29.22 | 60.92 | 35.11 | 68.50 | 100.00 | 
| Random | 49K | 20.33 | 42.38 | 1290.45 | 42.22 | 30.18 | 54.75 | 34.30 | 70.50 | 96.28 | 
| Unbalanced COMPACT | 49K | 22.28 | 41.17 | 1339.24 | 43.08 | 29.22 | 55.84 | 34.80 | 64.50 | 96.62 | 
| Original COMPACT | 49K | 22.68 | 42.82 | 1362.68 | 43.73 | 30.78 | 54.69 | 35.59 | 66.60 | 98.83 | 
B. Atomic Capability Coverage: To validate our choice of atomic capabilities and understand their relative importance, we conduct a leave-one-out analysis by systematically excluding questions that require a specific capability while keeping the total number of training examples fixed. The figure shows that each capability contributes meaningfully to overall performance without being redundant.
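A leave-one-out split of this kind could be built as in the sketch below. The sample schema (a `capabilities` list per question), the function name `leave_one_out`, and the toy pool are hypothetical and chosen only for illustration. The key point is that the ablated set is subsampled to the same fixed budget regardless of which capability is removed.

```python
import random


def leave_one_out(samples, excluded, budget, seed=0):
    """Drop every sample that requires `excluded`, then subsample to `budget`."""
    kept = [s for s in samples if excluded not in s["capabilities"]]
    if len(kept) < budget:
        raise ValueError(f"only {len(kept)} samples remain, need {budget}")
    # Seeded sampling keeps the ablation reproducible.
    return random.Random(seed).sample(kept, budget)


# Toy pool (hypothetical data): 60 questions cycling through three capability sets.
capability_sets = [
    ["color"],
    ["color", "counting"],
    ["object recognition", "spatial relationship", "counting"],
]
pool = [{"id": i, "capabilities": capability_sets[i % 3]} for i in range(60)]

# Remove every question that needs counting, keeping the budget fixed at 15.
ablated = leave_one_out(pool, excluded="counting", budget=15)
```

Holding the budget constant separates the effect of losing a capability from the effect of simply training on fewer examples.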
 
C. Instruction Tuning Ratio: We vary the amount of instruction tuning data sampled from LLaVA-665K VIT to understand the impact of the mixing ratio on model performance. As we scale the VIT subset from 0% (pure compositional tuning) to 7% of LLaVA-665K VIT, we observe an upward trend with diminishing returns. These results suggest that instruction following capability is potentially orthogonal to the capabilities of the base model and the atomic visual capabilities, and can be acquired with minimal instruction tuning data.
 
D. Compositional Complexity Range: To isolate the effect of the range of compositional complexities while controlling for data quality, we generate three sets of 16K-sample compositional tuning data, covering k = 1, k = 1, 2, and k = 1, 2, 3, respectively. For a fair comparison, we maintain consistent sample counts and use an identical set of images in all three settings. The model trained on k = 1, 2, 3 outperforms the other two settings. Although the model trained on k = 1 data can solve tasks with lower compositional complexity, it does not generalize to higher k tasks.
 
We perform a quantitative analysis comparing COMPACT with LLaVA-665K VIT. COMPACT's compositional tuning data has a more balanced distribution of capabilities than LLaVA-665K VIT. In addition, the bar plot below shows the distribution of compositional complexity in a random subset of LLaVA-665K VIT data. Unlike COMPACT, where the compositional complexity is designed to be balanced, the compositional complexity in LLaVA-665K VIT data shows a complexity cliff, characterized by a lack of higher k samples.
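Given per-question capability annotations, the distribution of compositional complexity is a simple histogram over k. The sketch below assumes a hypothetical schema in which each sample carries a `capabilities` list, and the toy samples are made up for illustration. How the annotations are obtained for LLaVA-665K is not covered here.

```python
from collections import Counter


def complexity_histogram(samples):
    """Return the fraction of samples at each compositional complexity k."""
    # k is the number of distinct atomic capabilities a question requires.
    counts = Counter(len(set(s["capabilities"])) for s in samples)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


# Toy annotations (hypothetical data).
toy_samples = [
    {"capabilities": ["color"]},
    {"capabilities": ["counting"]},
    {"capabilities": ["color", "object recognition"]},
    {"capabilities": ["counting", "spatial relationship", "text recognition"]},
]

histogram = complexity_histogram(toy_samples)
```

A dataset with a complexity cliff would put most of its mass at k = 1 and very little at k = 3 and above, while a balanced one spreads it evenly across the levels.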
 
             
This material is based upon work supported by the National Science Foundation under Grants 2107048 and 2112562. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. All experiments, data collection, and processing activities were conducted at Princeton University. Meta was involved solely in an advisory role and no experiments, data collection or processing activities were conducted on Meta infrastructure. We thank Allison Chen for helpful discussions and feedback.