HIVE: Evaluating the Human Interpretability of Visual Explanations

Sunnie S. Y. Kim   Nicole Meister   Vikram V. Ramaswamy   Ruth Fong   Olga Russakovsky
Princeton University, Princeton NJ 08544, USA
{sunniesuhyoung, nmeister, vr23, ruthfong, olgarus}@princeton.edu

(Top left) Heatmap explanations by GradCAM and BagNet highlight the image regions most relevant to the model's decision. (Bottom left) Prototype-based explanations by ProtoPNet and ProtoTree match image regions to prototypical parts learned during training. The schematic is greatly simplified compared to actual explanations. (Right) An actual ProtoPNet explanation example from the original paper. While existing evaluation methods typically apply to only one explanation form, HIVE evaluates and compares diverse interpretability methods.
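
For readers unfamiliar with heatmap explanations, below is a minimal Grad-CAM sketch in PyTorch. This is our own illustration, not the authors' released code; the choice of layer, the preprocessing assumptions, and the variable names are all illustrative.

    import torch.nn.functional as F
    from torchvision.models import resnet50

    model = resnet50(weights="IMAGENET1K_V1").eval()
    activations, gradients = {}, {}

    def save_activation(module, inputs, output):
        activations["value"] = output.detach()

    def save_gradient(module, grad_input, grad_output):
        gradients["value"] = grad_output[0].detach()

    # Hook the last residual block to capture its activations and gradients.
    target_layer = model.layer4[-1]
    target_layer.register_forward_hook(save_activation)
    target_layer.register_full_backward_hook(save_gradient)

    def gradcam(x, class_idx=None):
        """x: preprocessed image tensor of shape (1, 3, 224, 224)."""
        logits = model(x)
        if class_idx is None:
            class_idx = logits[0].argmax().item()
        model.zero_grad()
        logits[0, class_idx].backward()
        acts, grads = activations["value"], gradients["value"]   # (1, C, H, W)
        weights = grads.mean(dim=(2, 3), keepdim=True)           # pooled gradients per channel
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted sum + ReLU
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # normalize to [0, 1]
        return cam[0, 0], class_idx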
 

Paper

Supplement

Extended Abstract

Poster

2min Talk

4min Talk

8min Talk

Code

News

01/2023: "Help Me Help the AI" has been conditionally accepted to CHI 2023!

10/2022: Check out our follow-up paper "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction. We interviewed 20 end-users of the Merlin bird identification app and studied (1) what explainability needs they have, (2) how they intend to use explanations of the AI system's predictions, and (3) how they perceive existing XAI approaches (heatmap, example, concept, prototype).

07/2022: HIVE has been accepted to ECCV 2022. The camera-ready version is available here!

06/2022: We presented HIVE at the CVPR 2022 Explainable AI for Computer Vision Workshop (spotlight talk & poster) and the CVPR 2022 Women in Computer Vision Workshop (poster).

05/2022: We presented HIVE at the CHI 2022 Human-Centered Explainable AI Workshop (poster).

Abstract

As AI technology is increasingly applied to high-impact, high-risk domains, there have been a number of new methods aimed at making AI models more human interpretable. Despite the recent growth of interpretability work, there is a lack of systematic evaluation of proposed techniques. In this work, we introduce HIVE (Human Interpretability of Visual Explanations), a novel human evaluation framework that assesses the utility of explanations to human users in AI-assisted decision making scenarios, and enables falsifiable hypothesis testing, cross-method comparison, and human-centered evaluation of visual interpretability methods. To the best of our knowledge, this is the first work of its kind. Using HIVE, we conduct IRB-approved human studies with nearly 1000 participants and evaluate four methods that represent the diversity of computer vision interpretability works: GradCAM, BagNet, ProtoPNet, and ProtoTree. Our results suggest that explanations engender human trust, even for incorrect predictions, yet are not distinct enough for users to distinguish between correct and incorrect predictions. We open-source HIVE to enable future studies and encourage more human-centered approaches to interpretability research.

Citation


    @inproceedings{Kim2022HIVE,
      author = {Sunnie S. Y. Kim and Nicole Meister and Vikram V. Ramaswamy and Ruth Fong and Olga Russakovsky},
      title = {{HIVE}: Evaluating the Human Interpretability of Visual Explanations},
      booktitle = {European Conference on Computer Vision (ECCV)},
      year = {2022}
    }
  

HIVE (Human Interpretability of Visual Explanations)

We propose HIVE, a novel human evaluation framework for visual interpretability methods. Through careful design, HIVE allows for falsifiable hypothesis testing regarding the utility of explanations for identifying model errors, cross-method comparison between different interpretability techniques, and human-centered evaluation for understanding the practical effectiveness of interpretability.

In particular, we focus on AI-assisted decision making scenarios where humans use an AI (image classification) model and an interpretability method to decide whether the model's prediction is correct or, more generally, whether to use the model at all. We evaluate how useful a given interpretability method is in these scenarios through the following tasks.



First, we evaluate interpretability methods on a simple agreement task, where we present users with a single model prediction-explanation pair for a given image and ask how confident they are in the prediction. This task simulates a common decision making setting and is close to existing evaluation schemes that consider a model’s top-1 prediction and an explanation for it. See above for the ProtoPNet evaluation UI.
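
As a concrete illustration of how agreement-task responses can be analyzed, here is a small Python sketch. It is our own illustration, not the released HIVE code; the CSV file name, the column names, and the 1-4 confidence scale are hypothetical.

    import csv
    from statistics import mean

    def summarize_agreement(path="agreement_responses.csv"):
        # Hypothetical columns: participant_id, prediction_correct (0/1),
        # confidence (e.g., 1 = not at all confident, 4 = very confident).
        ratings = {0: [], 1: []}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                ratings[int(row["prediction_correct"])].append(int(row["confidence"]))
        return {
            "mean_confidence_when_correct": mean(ratings[1]),
            "mean_confidence_when_incorrect": mean(ratings[0]),
        }

    # Similar mean confidence for correct and incorrect predictions indicates
    # confirmation bias: participants trust the prediction either way.
    print(summarize_agreement())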



However, it has been previously observed that users tend to believe model predictions are correct simply because explanations are provided for them. Hence, we evaluate methods on a distinction task to mitigate the effect of such confirmation bias in interpretability evaluation. Here we simultaneously show four prediction-explanation pairs and ask users to identify the correct prediction based on the provided explanations. This task measures how well explanations help users distinguish between correct and incorrect predictions. See above for the GradCAM evaluation UI.
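
Since participants pick one of four prediction-explanation pairs, random guessing yields 25% accuracy. The sketch below, an illustration under assumed data rather than the released HIVE code, scores such responses and tests them against chance; responses is a hypothetical list of (chosen_index, correct_index) pairs.

    from scipy.stats import binomtest

    def score_distinction(responses, n_options=4):
        n_correct = sum(chosen == correct for chosen, correct in responses)
        n_total = len(responses)
        accuracy = n_correct / n_total
        # One-sided test: do participants beat the 1/n_options chance rate?
        test = binomtest(n_correct, n_total, p=1.0 / n_options, alternative="greater")
        return accuracy, test.pvalue

    # Made-up example: 45 correct choices out of 150 responses (30% accuracy).
    responses = [(0, 0)] * 45 + [(1, 0)] * 105
    print(score_distinction(responses))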



In summary, HIVE consists of the following steps: we first introduce the study and the interpretability method to be evaluated. Next, we show a preview of the evaluation task and provide example explanations for one correct and one incorrect model prediction so that participants have appropriate reference points. Participants then complete the evaluation task. Throughout the study, we also ask subjective evaluation and user preference questions to make the most of the human studies. Our study design was approved by our university’s Institutional Review Board (IRB).
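
To make the flow concrete, here is a minimal sketch of a study session as an ordered list of phases. The phase names are our paraphrases of the steps above, not identifiers from the released code.

    # Ordered phases of a HIVE study session (paraphrased, illustrative only).
    STUDY_PHASES = [
        "introduction",          # describe the study and the interpretability method
        "task_preview",          # example explanations for one correct and one incorrect prediction
        "evaluation_task",       # agreement or distinction task
        "subjective_questions",  # subjective evaluation and user preference questions
    ]

    for step, phase in enumerate(STUDY_PHASES, start=1):
        print(f"Step {step}: {phase}")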

Key Findings

1. When provided explanations, participants tend to believe that the model predictions are correct, revealing an issue of confirmation bias.

2. When given multiple model predictions and explanations, participants struggle to distinguish between correct and incorrect predictions based on the explanations. This result suggests that interpretability methods need to be improved to be reliably useful for AI-assisted decision making.

3. There is a gap between the similarity judgments of humans and those of prototype-based models, which can hurt the quality of their interpretability.

4. Participants prefer to use a model with explanations over a baseline model without explanations. To switch their preference, they require the baseline model to have +6.2% to +10.9% higher accuracy.

Please see the full paper for details.

Related Work

Below are some papers related to our work. We discuss them in more detail in the related work section of our paper.
 
[1] Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra. IJCV 2019.
 
[2] Approximating CNNs with Bag-of-local-Features Models Works Surprisingly Well on ImageNet. Wieland Brendel, Matthias Bethge. ICLR 2019.
 
[3] This Looks Like That: Deep Learning for Interpretable Image Recognition. Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, Cynthia Rudin. NeurIPS 2019.
 
[4] This Looks Like That... Does it? Shortcomings of Latent Space Prototype Interpretability in Deep Networks. Adrian Hoffmann, Claudio Fanconi, Rahul Rade, Jonas Kohler. ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI.
 
[5] Neural Prototype Trees for Interpretable Fine-grained Image Recognition. Meike Nauta, Ron van Bree, Christin Seifert. CVPR 2021.
 
[6] Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.

Acknowledgements

This material is based upon work partially supported by the National Science Foundation (NSF) under Grant No. 1763642. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. We also acknowledge support from the Princeton SEAS Howard B. Wentz, Jr. Junior Faculty Award (OR), Princeton SEAS Project X Fund (RF, OR), Open Philanthropy (RF, OR), and Princeton SEAS and ECE Senior Thesis Funding (NM). We thank the authors of [1, 2, 3, 4, 5] for open-sourcing their code and the authors of [2, 4, 5, 6] for sharing their trained models. We also thank the AMT workers who participated in our studies, anonymous reviewers who provided thoughtful feedback, and Princeton Visual AI Lab members (especially Dora Zhao, Kaiyu Yang, and Angelina Wang) who tested our user interface and provided helpful suggestions.

Contact

Sunnie S. Y. Kim (sunniesuhyoung@princeton.edu)