Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Title: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Authors: Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh
Published: Dec 2, 2016
Link: https://arxiv.org/abs/1612.00837
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper addresses the issue of language bias in Visual Question Answering (VQA) and aims to elevate the role of image understanding.
Challenges:
- Existing VQA models often exploit language priors, leading to inflated performance without true visual understanding.
Methods:
- The authors construct a balanced VQA dataset by collecting complementary images, so that every question is paired with two similar images that lead to different answers (a minimal sketch of such a pair follows below).
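A minimal sketch, in Python, of how one complementary-image pair could be represented; the class and field names are illustrative assumptions and do not follow the actual VQA v2.0 annotation format.

```python
from dataclasses import dataclass

@dataclass
class ComplementaryPair:
    """One question paired with two similar images that yield different answers."""
    question: str   # identical question text for both images
    image_a: str    # ID/path of the original image
    answer_a: str   # ground-truth answer for image_a
    image_b: str    # ID/path of the complementary (visually similar) image
    answer_b: str   # ground-truth answer for image_b; must differ from answer_a

pair = ComplementaryPair(
    question="Who is wearing glasses?",
    image_a="img_001.jpg", answer_a="man",
    image_b="img_002.jpg", answer_b="woman",
)
assert pair.answer_a != pair.answer_b  # the balancing constraint
```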
Novelties:
- Introduction of a balanced dataset that reduces language biases, plus a novel interpretable model that explains its answers with counter-examples: similar images for which the answer would differ (sketched below).
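A hedged sketch of the counter-example idea: given a question and an image, search visually similar candidates for one that the model would answer differently. The `model.predict` interface and the `candidate_images` list are assumptions made for illustration, not the paper's actual architecture.

```python
def explain_by_counter_example(model, question, image, candidate_images):
    """Return a similar image for which the model's answer changes, or None.

    `model.predict(question, image)` is a hypothetical interface standing in
    for any trained VQA model; `candidate_images` would be nearest neighbors
    of the original image in some visual feature space.
    """
    original_answer = model.predict(question, image)
    for candidate in candidate_images:
        if model.predict(question, candidate) != original_answer:
            return candidate  # "the answer would not hold for this image"
    return None  # no counter-example found among the candidates
```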
Results:
- State-of-the-art VQA models perform significantly worse on the balanced dataset, confirming their reliance on language priors.
Performances:
- Models trained on the balanced dataset perform better when evaluated on balanced data, indicating a need for larger, more balanced datasets (an illustrative evaluation sketch follows below).
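An illustrative evaluation over complementary pairs, reusing the pair structure sketched above. The per-image accuracy mirrors standard VQA evaluation; the stricter "both images answered correctly" rate is an assumed diagnostic for language-prior reliance, not a metric reported in the paper.

```python
def evaluate_on_pairs(model, pairs):
    """Return (per-image accuracy, fraction of pairs with both images correct)."""
    correct, both_correct = 0, 0
    for p in pairs:
        ok_a = model.predict(p.question, p.image_a) == p.answer_a
        ok_b = model.predict(p.question, p.image_b) == p.answer_b
        correct += int(ok_a) + int(ok_b)
        both_correct += int(ok_a and ok_b)
    return correct / (2 * len(pairs)), both_correct / len(pairs)
```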
Limitations:
- The dataset is not perfectly balanced, and some questions lack a suitable complementary image.
Discussion:
- The balanced dataset and counter-example explanations can help build trust in VQA models and push the field towards better visual understanding.