ICONS: Influence Consensus for Vision-Language Data Selection

1. Princeton University | 2. University of Washington | 3. Google DeepMind | 4. Allen Institute for AI

Abstract

Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often contains redundant information that increases computational costs without proportional performance gains.

Approach

The key element of our approach is cross-task influence consensus: majority voting across task-specific influence matrices identifies samples that are consistently valuable across multiple tasks, allowing us to prioritize the data that matters most for overall performance.

Experiments

Experiments show that models trained on our selected data (20% of LLaVA-665K) achieve 98.6% of the relative performance obtained with the full dataset and exceed 102% of full-dataset performance at a 60% selection ratio.

Artifact

We release LLAVA-ICONS-133K, a compact yet highly informative subset of the LLaVA-665K visual instruction tuning data that preserves high-impact training samples for efficient vision-language model development.

Method

Stage 1: Specialist

This process is repeated for each target task:
  • Step 1: Warmup training on a small subset of data.
  • Step 2: Gradient computation for both training and target task validation data.
  • Step 3: Influence matrix computation to generate per-task influence scores.
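
A minimal sketch of Steps 2-3 for a single target task: we assume here that influence is approximated by inner products between per-sample training gradients and validation gradients taken after warmup; the exact estimator, any gradient projection, and the helper names (per_sample_grad, loss_fn) are illustrative rather than the implementation used in the paper.

import torch

def per_sample_grad(model, loss_fn, example):
    """Flattened gradient of the loss on one example w.r.t. model parameters.
    (In practice gradients are typically projected to a low dimension; full
    gradients are used here only to keep the sketch short.)"""
    model.zero_grad()
    loss_fn(model, example).backward()
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])

def influence_matrix(model, loss_fn, train_examples, val_examples):
    """Influence of every training example on every validation example of one
    target task, approximated as a gradient inner product."""
    g_train = torch.stack([per_sample_grad(model, loss_fn, x) for x in train_examples])
    g_val = torch.stack([per_sample_grad(model, loss_fn, x) for x in val_examples])
    return g_train @ g_val.T  # shape: [n_train, n_val]

def task_influence_scores(model, loss_fn, train_examples, val_examples):
    """Collapse the influence matrix into one score per training example."""
    return influence_matrix(model, loss_fn, train_examples, val_examples).mean(dim=1)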

Stage 2: Generalist

  • Step 1: Sets task-specific thresholds at the 80th percentile of influence scores.
  • Step 2: Allocates votes to samples that exceed the threshold for each task.
  • Step 3: Selects the top 20% most influential samples based on total votes across tasks.
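
A minimal sketch of this consensus stage under the settings listed above (per-task 80th-percentile cutoff, final selection of the top 20% by vote count). The input task_scores dictionary and the stable tie-breaking are assumptions made for illustration.

import numpy as np

def consensus_select(task_scores, selection_ratio=0.2, percentile=80.0):
    """Select training-sample indices by majority voting across target tasks.

    task_scores maps each target task name to an array with one influence
    score per training example (same ordering for every task)."""
    n = len(next(iter(task_scores.values())))
    votes = np.zeros(n, dtype=int)
    for scores in task_scores.values():
        threshold = np.percentile(scores, percentile)  # Step 1: per-task cutoff
        votes += (np.asarray(scores) > threshold)      # Step 2: one vote per task
    k = int(selection_ratio * n)                       # Step 3: top 20% by votes
    return np.argsort(-votes, kind="stable")[:k]

With ten target tasks, each sample receives between 0 and 10 votes, so the selection favors samples that many tasks agree on rather than samples with one extreme task-specific score.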
Figure: ICONS pipeline overview. Our two-stage approach combines specialist task-specific influence analysis with generalist consensus-based selection.

Experiments

Key Performance Highlights

  • 98.6% relative performance using only 20% of training data
  • 102.1% relative performance at 60% data selection
  • Effective across 10+ vision-language benchmarks
  • Strong generalization to unseen tasks and architectures

Evaluation Setup

  • Dataset: LLaVA-665K for visual instruction tuning
  • Base Model: LLaVA-v1.5-7b-lora
  • Benchmarks: VQAv2, GQA, VizWiz, SQA-I, TextVQA, POPE, MME, MMBench (en/cn), LLaVA-Bench
  • Baselines: Random, CLIP-Score, EL2N, Perplexity, SemDeDup, D2-Pruning, Self-Sup, Self-Filter, COINCIDE

Baseline Comparisons

Method VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMBench (en) MMBench (cn) LLaVA-W Bench Rel. (%)
Full 79.1 63.0 47.8 68.4 58.2 86.4 1476.9 66.1 58.9 67.9 100
Random 75.7 58.9 44.3 68.5 55.3 84.7 1483.0 62.2 54.8 65.0 95.8
CLIP-Score 73.4 51.4 43.0 65.0 54.7 85.3 1331.6 55.2 52.0 66.2 91.2
EL2N 76.2 58.7 43.7 65.5 53.0 84.3 1439.5 53.2 47.4 64.9 92.0
Perplexity 75.8 57.0 47.8 65.1 52.8 82.6 1341.4 52.0 45.8 68.3 91.6
SemDeDup 74.2 54.5 46.9 65.8 55.5 84.7 1376.9 52.2 48.5 70.0 92.6
D2-Pruning 73.0 58.4 41.9 69.3 51.8 85.7 1391.2 65.7 57.6 63.9 94.8
Self-Sup 74.9 59.5 46.0 67.8 49.3 83.5 1335.9 61.4 53.8 63.3 93.4
Self-Filter 73.7 58.3 53.2 61.4 52.9 83.8 1306.2 48.8 45.3 64.9 90.9
COINCIDE 76.5 59.8 46.8 69.2 55.6 86.1 1495.6 63.1 54.5 67.3 97.4
ICONS (ours) 76.3 60.7 50.1 70.8 55.6 87.5 1485.7 63.1 55.8 66.1 98.6

Key Findings

From Specialist to Generalist

Consensus across task-specific influence patterns identifies a compact, high-performing universal training set:

  • Only 1.33% average performance drop vs. specialist baselines
  • Some tasks improve under generalist selection (SQA-I: +1.43%, POPE: +1.04%)
  • Data overlap varies by task complexity: from 3.27% (VQAv2) to 24.21% (LLaVA-W Bench)

Selection Ratio Impact

Performance across different data selection ratios:

  • Stronger performance than baselines in the low-selection regime (5-20%)
  • Reaches 102.1% relative performance at 60% selection

Cross-Task Influence Patterns

Pairwise overlap analysis reveals:

  • High overlap in related tasks:
    • MMBench (en-cn): 67.4%
    • POPE-GQA: 60.2%
    • VQAv2-VizWiz: 49.0%
  • Low overlap between dissimilar tasks (e.g., MMBench-GQA: 3.3%)
  • Findings support using influence consensus for effective multi-task selection
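
These overlap figures can be reproduced from the per-task selections alone. A small sketch, assuming overlap between two tasks is measured as the shared fraction of their selected sets (the paper's exact definition may differ):

def pairwise_overlap(selected):
    """Overlap between every pair of task-specific selections.

    selected maps task name -> set of selected sample indices; overlap is
    defined here (as an assumption) as |A & B| / min(|A|, |B|)."""
    tasks = sorted(selected)
    return {
        (a, b): len(selected[a] & selected[b]) / min(len(selected[a]), len(selected[b]))
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
    }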
Figure: Data overlap between specialist and generalist selections.

Figure: Progressive performance across selection ratios.

Figure: Benchmark performance heatmap.

Figure: Pairwise overlap heatmap across tasks.

Transferability

Unseen-task Generalization

  • Evaluates transferability across unseen benchmarks.
  • LLAVA-ICONS-133K achieves 95.5-113.9% per-task relative performance and outperforms random selection on every benchmark.
Method AI2D ChartQA DocVQA InfoVQA MMVet NaturalBench RealWorldQA CMMMU Rel. (%)
Full 55.4 17.5 28.9 26.5 31.1 12.4 52.4 22.1 100.0
Random 50.2 15.1 25.2 24.3 27.6 11.1 49.8 21.9 91.6
LLAVA-ICONS-133K 53.9 17.1 27.9 27.5 29.7 12.8 55.0 25.2 98.7
Per-task Rel. (%) 97.3 97.7 96.5 103.8 95.5 103.2 104.4 114.0 -

Cross-Architecture-Scale Generalization

  • Tests transferability across different model scales.
  • Subset selected with LLaVA-1.5-7B remains effective for LLaVA-1.5-13B.
  • Achieves 98.1% relative performance, indicating that the selected data remains valuable across model scales.
Method VQAv2 GQA VizWiz SQA-I TextVQA POPE MME MMBench (en) MMBench (cn) LLaVA-W Rel. (%)
Full 80.0 63.3 58.9 71.2 60.2 86.7 1541.7 68.5 61.5 69.5 100.0
Random 77.3 60.7 57.6 69.1 56.8 82.9 1517.2 63.2 56.3 67.5 95.7
7B-selected 78.8 60.4 57.4 70.4 58.3 84.3 1527.5 64.9 59.7 68.2 97.3
13B-selected 78.9 61.2 57.5 71.3 58.4 85.9 1535.2 66.1 59.8 68.8 98.1

Ablation Studies

Different Aggregation Approaches

To understand how different aggregation approaches affect performance, we compare our voting-based approach (Vote) with four alternatives: (1) Mean, (2) Max, (3) Rank, and (4) Norm.
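
All five can be seen as different reductions of the same samples-by-tasks score matrix. The sketch below shows one plausible implementation of each; the precise definitions of Mean, Max, Rank, and Norm used in the ablation may differ in detail.

import numpy as np

def aggregate(scores, method, percentile=80.0):
    """Reduce a [n_samples, n_tasks] influence-score matrix to one score per sample.

    'mean' and 'max' pool raw scores, 'rank' averages per-task ranks, 'norm'
    averages per-task z-normalized scores, and 'vote' counts how many tasks
    place a sample above that task's percentile cutoff (our method)."""
    if method == "mean":
        return scores.mean(axis=1)
    if method == "max":
        return scores.max(axis=1)
    if method == "rank":
        ranks = scores.argsort(axis=0).argsort(axis=0)  # higher score -> higher rank
        return ranks.mean(axis=1)
    if method == "norm":
        z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
        return z.mean(axis=1)
    if method == "vote":
        cutoffs = np.percentile(scores, percentile, axis=0)  # one cutoff per task
        return (scores > cutoffs).sum(axis=1)
    raise ValueError(f"unknown aggregation method: {method}")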

Task Mean Max Rank Norm Vote (ours)
VQAv2 75.7 75.2 74.9 75.1 76.3
GQA 59.6 59.8 58.6 60.1 60.7
VizWiz 47.9 48.1 40.5 46.4 50.1
SQA-I 65.5 66.2 69.8 69.8 70.8
TextVQA 55.5 55.5 55.2 54.5 55.6
POPE 86.0 85.5 85.6 85.6 87.5
MME 1422.1 1470.7 1490.0 1482.6 1485.7
MMBench (en) 59.0 58.3 59.0 58.9 63.1
MMBench (cn) 51.0 51.8 50.8 52.5 55.8
LLaVA-W Bench 66.2 66.2 66.4 66.3 66.1
Rel.(%) 96.4 96.1 95.9 96.8 98.6

The Vote method outperforms all alternatives, reaching 98.6% relative performance. Rank excels on MME but generally underperforms elsewhere. Mean and Max produce nearly identical selections (99.9% overlap).

Consensus-aware Selection vs. Direct-merge Selection

To validate our choice of a consensus-aware approach, we compare it against a direct-merge alternative that computes influence once against a combined validation set drawn from all target tasks.
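
For concreteness, a self-contained sketch of the distinction, operating on precomputed per-sample gradients and using a mean-inner-product influence estimate as a stand-in for the actual scoring function; all names and estimator details are illustrative.

import numpy as np

def scores_against(train_grads, val_grads):
    """Mean gradient inner product of each training sample with a validation set."""
    return (train_grads @ val_grads.T).mean(axis=1)

def select_direct_merge(train_grads, val_grads_by_task, ratio=0.2):
    """Direct merge: a single influence computation against the pooled validation set."""
    pooled = np.concatenate(list(val_grads_by_task.values()), axis=0)
    scores = scores_against(train_grads, pooled)
    return np.argsort(-scores)[: int(ratio * len(train_grads))]

def select_consensus(train_grads, val_grads_by_task, ratio=0.2, percentile=80.0):
    """Consensus-aware: score each task separately, threshold per task, then vote."""
    n = len(train_grads)
    votes = np.zeros(n, dtype=int)
    for val_grads in val_grads_by_task.values():
        s = scores_against(train_grads, val_grads)
        votes += (s > np.percentile(s, percentile))
    return np.argsort(-votes, kind="stable")[: int(ratio * n)]

In the direct-merge variant, tasks with larger validation sets or larger gradient magnitudes contribute more to every score, whereas the consensus variant gives each task exactly one vote per sample.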

Task Full Direct-merge Consensus-aware (ours) Delta (%, consensus vs. direct-merge)
VQAv2 79.1 76.1 76.3 0.26
GQA 63.0 59.4 60.7 2.19
VizWiz 47.8 46.1 50.1 8.67
SQA-I 68.4 68.7 70.8 3.06
TextVQA 58.2 54.1 55.6 2.77
POPE 86.4 85.1 87.5 2.82
MME 1476.9 1419.2 1485.7 4.69
MMBench (en) 66.1 61.9 63.1 1.94
MMBench (cn) 58.9 50.3 55.8 10.94
LLaVA-W Bench 67.9 65.2 66.1 1.38
Rel. (%) 100 94.7 98.6 -

Our consensus-aware approach excels on complex tasks like MME because the per-task thresholds account for differences in score scale and validation set size, preventing any single task from dominating the selection. It is also scalable: adding a new target task requires only computing that task's influence scores and updating the vote counts.
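
As an illustration of that scalability claim (a sketch with hypothetical names): adding a new target task only requires computing its influence scores once and folding them into the existing vote counts, leaving all previously computed task scores untouched.

import numpy as np

def add_task_votes(votes, new_task_scores, percentile=80.0):
    """Fold a newly added target task into existing vote counts without
    recomputing anything for the tasks that are already included."""
    cutoff = np.percentile(new_task_scores, percentile)
    return votes + (new_task_scores > cutoff).astype(votes.dtype)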

Citation

@article{wu2024icons,
    title={ICONS: Influence Consensus for Vision-Language Data Selection},
    author={Wu, Xindi and Xia, Mengzhou and Shao, Rulin and Deng, Zhiwei and 
            Koh, Pang Wei and Russakovsky, Olga},
    journal={arXiv preprint arXiv:2501.00654},
    year={2024}
}

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 2107048 and No. 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This work is also supported by the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme (award number AIVP-2024-001).

We thank many people for their helpful discussion and feedback, listed in alphabetical order by last name: Allison Chen, Hamish Ivison, Carlos E. Jimenez, Polina Kirichenko, Jaewoo Lee, Tiffany Ling, Zhiqiu Lin, Ethan Tseng, Shengbang Tong, Justin Wang, Zirui Wang.