Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often contains redundant information that increases computational costs without proportional performance gains.
The key element of our approach is cross-task influence consensus: majority voting across task-specific influence matrices identifies samples that are consistently valuable across multiple tasks, allowing us to prioritize data that improves overall performance.
Experiments show that models trained on our selected data (20% of LLAVA-665K) achieve 98.6% of the relative performance obtained with the full dataset, and exceed 102% of full-dataset performance at a 60% selection ratio.
We release LLAVA-ICONS-133K, a compact yet highly informative subset of the LLAVA-665K visual instruction tuning data that preserves high-impact training samples for efficient vision-language model development.
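As a rough sketch of the cross-task influence consensus described above (the variable names and the top-fraction cutoff are illustrative assumptions, not the exact implementation):

```python
import numpy as np

def consensus_select(influence_by_task, budget, top_frac=0.2):
    """Illustrative majority-vote selection over task-specific influence scores.

    influence_by_task: dict mapping task name -> array of shape (n_train,),
                       the influence of each training sample on that task.
    budget:            number of training samples to keep.
    top_frac:          a sample earns one vote from a task if it falls in that
                       task's top `top_frac` fraction by influence (assumed cutoff).
    """
    n_train = len(next(iter(influence_by_task.values())))
    votes = np.zeros(n_train, dtype=int)

    for task, scores in influence_by_task.items():
        scores = np.asarray(scores)
        k = int(top_frac * n_train)
        top_idx = np.argpartition(-scores, k)[:k]  # this task's most influential samples
        votes[top_idx] += 1                        # each task casts one vote per sample

    # Keep the samples that are consistently valuable across tasks (most votes first).
    return np.argsort(-votes)[:budget]
```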
Method | VQAv2 | GQA | VizWiz | SQA-I | TextVQA | POPE | MME | MMBench (en) | MMBench (cn) | LLaVA-W Bench | Rel. (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
Full | 79.1 | 63.0 | 47.8 | 68.4 | 58.2 | 86.4 | 1476.9 | 66.1 | 58.9 | 67.9 | 100 |
Random | 75.7 | 58.9 | 44.3 | 68.5 | 55.3 | 84.7 | 1483.0 | 62.2 | 54.8 | 65.0 | 95.8 |
CLIP-Score | 73.4 | 51.4 | 43.0 | 65.0 | 54.7 | 85.3 | 1331.6 | 55.2 | 52.0 | 66.2 | 91.2 |
EL2N | 76.2 | 58.7 | 43.7 | 65.5 | 53.0 | 84.3 | 1439.5 | 53.2 | 47.4 | 64.9 | 92.0 |
Perplexity | 75.8 | 57.0 | 47.8 | 65.1 | 52.8 | 82.6 | 1341.4 | 52.0 | 45.8 | 68.3 | 91.6 |
SemDeDup | 74.2 | 54.5 | 46.9 | 65.8 | 55.5 | 84.7 | 1376.9 | 52.2 | 48.5 | 70.0 | 92.6 |
D2-Pruning | 73.0 | 58.4 | 41.9 | 69.3 | 51.8 | 85.7 | 1391.2 | 65.7 | 57.6 | 63.9 | 94.8 |
Self-Sup | 74.9 | 59.5 | 46.0 | 67.8 | 49.3 | 83.5 | 1335.9 | 61.4 | 53.8 | 63.3 | 93.4 |
Self-Filter | 73.7 | 58.3 | 53.2 | 61.4 | 52.9 | 83.8 | 1306.2 | 48.8 | 45.3 | 64.9 | 90.9 |
COINCIDE | 76.5 | 59.8 | 46.8 | 69.2 | 55.6 | 86.1 | 1495.6 | 63.1 | 54.5 | 67.3 | 97.4 |
ICONS (ours) | 76.3 | 60.7 | 50.1 | 70.8 | 55.6 | 87.5 | 1485.7 | 63.1 | 55.8 | 66.1 | 98.6 |
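For reference, we read the Rel. (%) column as each method's per-task score normalized by the full-data score and averaged over tasks (an assumed reading of the metric, not a definition quoted from the table):

```latex
% Assumed reading of the Rel. (%) column: T evaluation tasks,
% s_t = a method's score on task t, f_t = the full-data score on task t.
\mathrm{Rel.}\,(\%) = \frac{100}{T} \sum_{t=1}^{T} \frac{s_t}{f_t}
```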
Figure: Consensus across task-specific influence patterns identifies a compact, high-performing universal training set.
Figure: Performance across different data selection ratios.
Figure: Data overlap between specialist and generalist selections; pairwise overlap heatmap across tasks.
Performance on additional benchmarks beyond the ten evaluated above:
Method | AI2D | ChartQA | DocVQA | InfoVQA | MMVet | NaturalBench | RealWorldQA | CMMMU | Rel. (%) |
---|---|---|---|---|---|---|---|---|---|
Full | 55.4 | 17.5 | 28.9 | 26.5 | 31.1 | 12.4 | 52.4 | 22.1 | 100.0 |
Random | 50.2 | 15.1 | 25.2 | 24.3 | 27.6 | 11.1 | 49.8 | 21.9 | 91.6 |
LLAVA-ICONS-133K | 53.9 | 17.1 | 27.9 | 27.5 | 29.7 | 12.8 | 55.0 | 25.2 | 98.7 |
Per-task Rel. (%) | 97.3 | 97.7 | 96.5 | 103.8 | 95.5 | 103.2 | 104.4 | 114.0 | - |
Data selected using LLaVA-1.5-7B remains effective for LLaVA-1.5-13B.
Method | VQAv2 | GQA | VizWiz | SQA-I | TextVQA | POPE | MME | MMBench (en) | MMBench (cn) | LLaVA-W Bench | Rel. (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
Full | 80.0 | 63.3 | 58.9 | 71.2 | 60.2 | 86.7 | 1541.7 | 68.5 | 61.5 | 69.5 | 100.0 |
Random | 77.3 | 60.7 | 57.6 | 69.1 | 56.8 | 82.9 | 1517.2 | 63.2 | 56.3 | 67.5 | 95.7 |
7B-selected | 78.8 | 60.4 | 57.4 | 70.4 | 58.3 | 84.3 | 1527.5 | 64.9 | 59.7 | 68.2 | 97.3 |
13B-selected | 78.9 | 61.2 | 57.5 | 71.3 | 58.4 | 85.9 | 1535.2 | 66.1 | 59.8 | 68.8 | 98.1 |
To understand how the aggregation strategy affects performance, we compare our approach (Vote) with four alternatives: (1) Mean, (2) Max, (3) Rank, and (4) Norm.
Task | Mean | Max | Rank | Norm | Vote (ours) |
---|---|---|---|---|---|
VQAv2 | 75.7 | 75.2 | 74.9 | 75.1 | 76.3 |
GQA | 59.6 | 59.8 | 58.6 | 60.1 | 60.7 |
VizWiz | 47.9 | 48.1 | 40.5 | 46.4 | 50.1 |
SQA-I | 65.5 | 66.2 | 69.8 | 69.8 | 70.8 |
TextVQA | 55.5 | 55.5 | 55.2 | 54.5 | 55.6 |
POPE | 86.0 | 85.5 | 85.6 | 85.6 | 87.5 |
MME | 1422.1 | 1470.7 | 1490.0 | 1482.6 | 1485.7 |
MMBench (en) | 59.0 | 58.3 | 59.0 | 58.9 | 63.1 |
MMBench (cn) | 51.0 | 51.8 | 50.8 | 52.5 | 55.8 |
LLaVA-W Bench | 66.2 | 66.2 | 66.4 | 66.3 | 66.1 |
Rel.(%) | 96.4 | 96.1 | 95.9 | 96.8 | 98.6 |
Vote outperforms the alternatives with 98.6% relative performance. Rank scores highest on MME but underperforms on most other tasks. Mean and Max select 99.9% of the same samples, so they behave nearly identically.
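To make the comparison concrete, here is a minimal sketch of the four alternative aggregators over a matrix of per-task influence scores with shape (num_tasks, num_train); the normalization choices are our assumptions, and the Vote strategy is sketched near the top of the page:

```python
import numpy as np

def aggregate_scores(scores, method):
    """Collapse a (num_tasks, num_train) influence matrix into one score per sample.

    Mirrors the Mean / Max / Rank / Norm baselines compared above; the exact
    normalization used in the paper may differ from these assumptions.
    """
    if method == "mean":   # average influence across tasks
        return scores.mean(axis=0)
    if method == "max":    # best-case influence on any single task
        return scores.max(axis=0)
    if method == "rank":   # average per-task rank (rank 0 = most influential)
        ranks = np.argsort(np.argsort(-scores, axis=1), axis=1)
        return -ranks.mean(axis=0)
    if method == "norm":   # z-normalize each task's scores, then average
        z = (scores - scores.mean(axis=1, keepdims=True)) / (scores.std(axis=1, keepdims=True) + 1e-8)
        return z.mean(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

# Usage: keep the `budget` highest-scoring samples under a given aggregator.
# selected = np.argsort(-aggregate_scores(scores, "mean"))[:budget]
```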
To validate our choice of a consensus-aware approach over a direct-merge alternative (computing influence directly on a combined validation set), we compare the two.
Task | Full | Direct merge | Consensus aware (ours) | Delta (%) |
---|---|---|---|---|
VQAv2 | 79.1 | 76.1 | 76.3 | 0.26 |
GQA | 63.0 | 59.4 | 60.7 | 2.19 |
VizWiz | 47.8 | 46.1 | 50.1 | 8.67 |
SQA-I | 68.4 | 68.7 | 70.8 | 3.06 |
TextVQA | 58.2 | 54.1 | 55.6 | 2.77 |
POPE | 86.4 | 85.1 | 87.5 | 2.82 |
MME | 1476.9 | 1419.2 | 1485.7 | 4.69 |
MMBench (en) | 66.1 | 61.9 | 63.1 | 1.94 |
MMBench (cn) | 58.9 | 50.3 | 55.8 | 10.94 |
LLaVA-W Bench | 67.9 | 65.2 | 66.1 | 1.38 |
Rel. (%) | 100 | 94.7 | 98.6 | - |
Our consensus-aware approach is particularly strong on complex tasks such as MME because it treats each task's characteristics and validation-set size separately rather than letting large validation sets dominate a merged score. It is also scalable: adding a new task only requires computing that task's influence scores and updating the vote.
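For contrast, a minimal sketch of the direct-merge baseline, where `influence` is a stand-in for whatever influence estimator is used (the consensus-aware alternative is sketched at the top of the page):

```python
import numpy as np

def direct_merge_select(influence, train_set, val_sets, budget):
    """Direct merge: pool every task's validation set, score training samples once,
    and keep the top `budget`. Larger validation sets dominate the pooled score,
    which is the behavior the consensus-aware approach avoids.

    `influence(train_set, val_set)` is assumed to return one score per training sample.
    """
    merged_val = [example for val in val_sets.values() for example in val]  # pool all tasks
    scores = np.asarray(influence(train_set, merged_val))                   # single global score
    return np.argsort(-scores)[:budget]                                     # top-budget samples
```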
@article{wu2024icons,
title={ICONS: Influence Consensus for Vision-Language Data Selection},
author={Wu, Xindi and Xia, Mengzhou and Shao, Rulin and Deng, Zhiwei and
Koh, Pang Wei and Russakovsky, Olga},
journal={arXiv preprint arXiv:2501.00654},
year={2024}
}
This material is based upon work supported by the National Science Foundation under Grant No. 2107048 and No. 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This work is also supported by the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme (award number AIVP-2024-001).
We thank many people for their helpful discussion and feedback, listed in alphabetical order by last name: Allison Chen, Hamish Ivison, Carlos E. Jimenez, Polina Kirichenko, Jaewoo Lee, Tiffany Ling, Zhiqiu Lin, Ethan Tseng, Shengbang Tong, Justin Wang, Zirui Wang.