ICONS: Influence Consensus for Vision-Language Data Selection

1. Princeton University | 2. University of Washington | 3. Google DeepMind | 4. Allen Institute for AI

Abstract

Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often contains redundant information that increases computational costs without proportional performance gains.

Approach

The key element of our approach is cross-task influence consensus: majority voting across task-specific influence matrices identifies samples that are consistently valuable across multiple tasks, allowing us to prioritize the data that matters most for overall performance.

Experiments

Experiments show that models trained on our selected data (20% of LLaVA-665K) achieve 98.6% of the relative performance obtained with the full dataset and exceed 102% of full-dataset performance at a 60% selection ratio.

Artifact

We release LLAVA-ICONS-133K, a compact yet highly informative subset of the LLaVA-665K visual instruction tuning data that preserves high-impact training samples for efficient vision-language model development.

Method

Stage 1: Specialist

This process is repeated for each target task:
  • Step 1: Warmup training on a small subset of data.
  • Step 2: Gradient computation for both training and target task validation data.
  • Step 3: Influence matrix computation to generate per-task influence scores.
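
A minimal sketch of Steps 2-3 for a single target task: we assume here that influence is approximated by inner products between per-sample training gradients and validation gradients taken after warmup; the exact estimator, any gradient projection, and the helper names (per_sample_grad, loss_fn) are illustrative rather than the implementation used in the paper.

import torch

def per_sample_grad(model, loss_fn, example):
    """Flattened gradient of the loss on one example w.r.t. model parameters.
    (In practice gradients are typically projected to a low dimension; full
    gradients are used here only to keep the sketch short.)"""
    model.zero_grad()
    loss_fn(model, example).backward()
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])

def influence_matrix(model, loss_fn, train_examples, val_examples):
    """Influence of every training example on every validation example of one
    target task, approximated as a gradient inner product."""
    g_train = torch.stack([per_sample_grad(model, loss_fn, x) for x in train_examples])
    g_val = torch.stack([per_sample_grad(model, loss_fn, x) for x in val_examples])
    return g_train @ g_val.T  # shape: [n_train, n_val]

def task_influence_scores(model, loss_fn, train_examples, val_examples):
    """Collapse the influence matrix into one score per training example."""
    return influence_matrix(model, loss_fn, train_examples, val_examples).mean(dim=1)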

Stage 2: Generalist

  • Step 1: Sets task-specific thresholds at the 80th percentile of influence scores.
  • Step 2: Allocates votes to samples that exceed the threshold for each task.
  • Step 3: Selects the top 20% most influential samples based on total votes across tasks.
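
A minimal sketch of this consensus stage under the settings listed above (per-task 80th-percentile cutoff, final selection of the top 20% by vote count). The input task_scores dictionary and the stable tie-breaking are assumptions made for illustration.

import numpy as np

def consensus_select(task_scores, selection_ratio=0.2, percentile=80.0):
    """Select training-sample indices by majority voting across target tasks.

    task_scores maps each target task name to an array with one influence
    score per training example (same ordering for every task)."""
    n = len(next(iter(task_scores.values())))
    votes = np.zeros(n, dtype=int)
    for scores in task_scores.values():
        threshold = np.percentile(scores, percentile)  # Step 1: per-task cutoff
        votes += (np.asarray(scores) > threshold)      # Step 2: one vote per task
    k = int(selection_ratio * n)                       # Step 3: top 20% by votes
    return np.argsort(-votes, kind="stable")[:k]

With ten target tasks, each sample receives between 0 and 10 votes, so the selection favors samples that many tasks agree on rather than samples with one extreme task-specific score.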
Figure: ICONS pipeline overview. Our two-stage approach combines specialist task-specific influence analysis with generalist consensus-based selection.

Experiments

Key Performance Highlights

  • 98.6% relative performance using only 20% of training data
  • 102.1% relative performance at 60% data selection
  • Effective across 10+ vision-language benchmarks
  • Strong generalization to unseen tasks and architectures

Evaluation Setup

  • Dataset: LLaVA-665K for visual instruction tuning
  • Base Model: LLaVA-v1.5-7b-lora
  • Benchmarks: VQAv2, GQA, VizWiz, SQA-I, TextVQA, POPE, MME, MMBench (en/cn), LLaVA-Bench
  • Baselines: Random, CLIP-Score, EL2N, Perplexity, SemDeDup, D2-Pruning, Self-Sup, Self-Filter, COINCIDE

Baseline Comparisons

Method VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMBench (en) MMBench (cn) LLaVA-W Bench Rel. (%)
Full 79.1 63.0 47.8 68.4 58.2 86.4 1476.9 66.1 58.9 67.9 100
Random 75.7 58.9 44.3 68.5 55.3 84.7 1483.0 62.2 54.8 65.0 95.8
CLIP-Score 73.4 51.4 43.0 65.0 54.7 85.3 1331.6 55.2 52.0 66.2 91.2
EL2N 76.2 58.7 43.7 65.5 53.0 84.3 1439.5 53.2 47.4 64.9 92.0
Perplexity 75.8 57.0 47.8 65.1 52.8 82.6 1341.4 52.0 45.8 68.3 91.6
SemDeDup 74.2 54.5 46.9 65.8 55.5 84.7 1376.9 52.2 48.5 70.0 92.6
D2-Pruning 73.0 58.4 41.9 69.3 51.8 85.7 1391.2 65.7 57.6 63.9 94.8
Self-Sup 74.9 59.5 46.0 67.8 49.3 83.5 1335.9 61.4 53.8 63.3 93.4
Self-Filter 73.7 58.3 53.2 61.4 52.9 83.8 1306.2 48.8 45.3 64.9 90.9
COINCIDE 76.5 59.8 46.8 69.2 55.6 86.1 1495.6 63.1 54.5 67.3 97.4
ICONS (ours) 76.3 60.7 50.1 70.8 55.6 87.5 1485.7 63.1 55.8 66.1 98.6

Key Findings

From Specialist to Generalist

Consensus across task-specific influence patterns identifies a compact, high-performing universal training set:

  • Only 1.33% average performance drop vs. specialist baselines
  • Some tasks improve under generalist selection (SQA-I: +1.43%, POPE: +1.04%)
  • Data overlap varies by task complexity: from 3.27% (VQAv2) to 24.21% (LLaVA-W Bench)

Selection Ratio Impact

Performance across different data selection ratios:

  • Stronger performance than baselines in the low-selection regime (5-20%)
  • Reaches 102.1% relative performance at 60% selection

Cross-Task Influence Patterns

Pairwise overlap analysis reveals:

  • High overlap in related tasks:
    • MMBench (en-cn): 67.4%
    • POPE-GQA: 60.2%
    • VQAv2-VizWiz: 49.0%
  • Low overlap between dissimilar tasks (e.g., MMBench-GQA: 3.3%)
  • Findings support using influence consensus for effective multi-task selection
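
These overlap figures can be reproduced from the per-task selections alone. A small sketch, assuming overlap between two tasks is measured as the shared fraction of their selected sets (the paper's exact definition may differ):

def pairwise_overlap(selected):
    """Overlap between every pair of task-specific selections.

    selected maps task name -> set of selected sample indices; overlap is
    defined here (as an assumption) as |A & B| / min(|A|, |B|)."""
    tasks = sorted(selected)
    return {
        (a, b): len(selected[a] & selected[b]) / min(len(selected[a]), len(selected[b]))
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
    }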
Figure: Data overlap between specialist and generalist selections.

Figure: Progressive performance across selection ratios.

Figure: Benchmark performance heatmap.

Figure: Pairwise overlap heatmap across tasks.

Transferability

Unseen-task Generalization

  • Evaluates transferability across unseen benchmarks.
  • LLAVA-ICONS-133K achieves 95.5-113.9% per-task relative performance and outperforms random selection on every benchmark.
Method AI2D ChartQA DocVQA InfoVQA MMVet NaturalBench RealWorldQA CMMMU Rel. (%)
Full 55.4 17.5 28.9 26.5 31.1 12.4 52.4 22.1 100.0
Random 50.2 15.1 25.2 24.3 27.6 11.1 49.8 21.9 91.6
LLAVA-ICONS-133K 53.9 17.1 27.9 27.5 29.7 12.8 55.0 25.2 98.7
Per-task Rel. (%) 97.3 97.7 96.5 103.8 95.5 103.2 104.4 114.0 -

Cross-Architecture-Scale Generalization

  • Tests transferability across different model scales.
  • Subset selected with LLaVA-1.5-7B remains effective for LLaVA-1.5-13B.
  • Achieves 98.1% relative performance, indicating that the selected data remains valuable across model scales.
Method VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMBench (en) MMBench (cn) LLaVA-W Rel. (%)
Full 80.0 63.3 58.9 71.2 60.2 86.7 1541.7 68.5 61.5 69.5 100.0
Random 77.3 60.7 57.6 69.1 56.8 82.9 1517.2 63.2 56.3 67.5 95.7
7B-selected 78.8 60.4 57.4 70.4 58.3 84.3 1527.5 64.9 59.7 68.2 97.3
13B-selected 78.9 61.2 57.5 71.3 58.4 85.9 1535.2 66.1 59.8 68.8 98.1

Ablation Studies

Different Aggregation Approaches

To understand how different aggregation approaches affect performance, we compare our voting-based approach (Vote) with four alternatives: (1) Mean, (2) Max, (3) Rank, and (4) Norm.
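
All five can be seen as different reductions of the same samples-by-tasks score matrix. The sketch below shows one plausible implementation of each; the precise definitions of Mean, Max, Rank, and Norm used in the ablation may differ in detail.

import numpy as np

def aggregate(scores, method, percentile=80.0):
    """Reduce a [n_samples, n_tasks] influence-score matrix to one score per sample.

    'mean' and 'max' pool raw scores, 'rank' averages per-task ranks, 'norm'
    averages per-task z-normalized scores, and 'vote' counts how many tasks
    place a sample above that task's percentile cutoff (our method)."""
    if method == "mean":
        return scores.mean(axis=1)
    if method == "max":
        return scores.max(axis=1)
    if method == "rank":
        ranks = scores.argsort(axis=0).argsort(axis=0)  # higher score -> higher rank
        return ranks.mean(axis=1)
    if method == "norm":
        z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
        return z.mean(axis=1)
    if method == "vote":
        cutoffs = np.percentile(scores, percentile, axis=0)  # one cutoff per task
        return (scores > cutoffs).sum(axis=1)
    raise ValueError(f"unknown aggregation method: {method}")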

Task Mean Max Rank Norm Vote (ours)
VQAv2 75.7 75.2 74.9 75.1 76.3
GQA 59.6 59.8 58.6 60.1 60.7
VizWiz 47.9 48.1 40.5 46.4 50.1
SQA-I 65.5 66.2 69.8 69.8 70.8
TextVQA 55.5 55.5 55.2 54.5 55.6
POPE 86.0 85.5 85.6 85.6 87.5
MME 1422.1 1470.7 1490.0 1482.6 1485.7
MMBench (en) 59.0 58.3 59.0 58.9 63.1
MMBench (cn) 51.0 51.8 50.8 52.5 55.8
LLaVA-W Bench 66.2 66.2 66.4 66.3 66.1
Rel.(%) 96.4 96.1 95.9 96.8 98.6

The Vote method outperforms all alternatives, reaching 98.6% relative performance. Rank excels on MME but generally underperforms elsewhere. Mean and Max produce nearly identical selections (99.9% overlap).

Consensus-aware Selection vs. Direct-merge Selection

To validate our choice of a consensus-aware approach, we compare it against a direct-merge alternative that computes influence once against a combined validation set drawn from all target tasks.
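
For concreteness, a self-contained sketch of the distinction, operating on precomputed per-sample gradients and using a mean-inner-product influence estimate as a stand-in for the actual scoring function; all names and estimator details are illustrative.

import numpy as np

def scores_against(train_grads, val_grads):
    """Mean gradient inner product of each training sample with a validation set."""
    return (train_grads @ val_grads.T).mean(axis=1)

def select_direct_merge(train_grads, val_grads_by_task, ratio=0.2):
    """Direct merge: a single influence computation against the pooled validation set."""
    pooled = np.concatenate(list(val_grads_by_task.values()), axis=0)
    scores = scores_against(train_grads, pooled)
    return np.argsort(-scores)[: int(ratio * len(train_grads))]

def select_consensus(train_grads, val_grads_by_task, ratio=0.2, percentile=80.0):
    """Consensus-aware: score each task separately, threshold per task, then vote."""
    n = len(train_grads)
    votes = np.zeros(n, dtype=int)
    for val_grads in val_grads_by_task.values():
        s = scores_against(train_grads, val_grads)
        votes += (s > np.percentile(s, percentile))
    return np.argsort(-votes, kind="stable")[: int(ratio * n)]

In the direct-merge variant, tasks with larger validation sets or larger gradient magnitudes contribute more to every score, whereas the consensus variant gives each task exactly one vote per sample.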

Task Full Direct-merge Consensus-aware (ours) Delta (%, consensus vs. direct-merge)
VQAv2 79.1 76.1 76.3 0.26
GQA 63.0 59.4 60.7 2.19
VizWiz 47.8 46.1 50.1 8.67
SQA-I 68.4 68.7 70.8 3.06
TextVQA 58.2 54.1 55.6 2.77
POPE 86.4 85.1 87.5 2.82
MME 1476.9 1419.2 1485.7 4.69
MMBench (en) 66.1 61.9 63.1 1.94
MMBench (cn) 58.9 50.3 55.8 10.94
LLaVA-W Bench 67.9 65.2 66.1 1.38
Rel. (%) 100 94.7 98.6 -

Our consensus-aware approach excels on complex tasks like MME because the per-task thresholds account for differences in score scale and validation set size, preventing any single task from dominating the selection. It is also scalable: adding a new target task requires only computing that task's influence scores and updating the vote counts.
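
As an illustration of that scalability claim (a sketch with hypothetical names): adding a new target task only requires computing its influence scores once and folding them into the existing vote counts, leaving all previously computed task scores untouched.

import numpy as np

def add_task_votes(votes, new_task_scores, percentile=80.0):
    """Fold a newly added target task into existing vote counts without
    recomputing anything for the tasks that are already included."""
    cutoff = np.percentile(new_task_scores, percentile)
    return votes + (new_task_scores > cutoff).astype(votes.dtype)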

Citation

@article{wu2024icons,
    title={ICONS: Influence Consensus for Vision-Language Data Selection},
    author={Wu, Xindi and Xia, Mengzhou and Shao, Rulin and Deng, Zhiwei and 
            Koh, Pang Wei and Russakovsky, Olga},
    journal={arXiv preprint arXiv:2501.00654},
    year={2024}
}

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 2107048 and No. 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This work is also supported by the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme (award number AIVP-2024-001).

We thank many people for their helpful discussion and feedback, listed in alphabetical order by last name: Allison Chen, Hamish Ivison, Carlos E. Jimenez, Polina Kirichenko, Jaewoo Lee, Tiffany Ling, Zhiqiu Lin, Ethan Tseng, Shengbang Tong, Justin Wang, Zirui Wang.