Title: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Authors: Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

Published: Dec 2, 2016

Link: https://arxiv.org/abs/1612.00837

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper addresses language bias in Visual Question Answering (VQA): because the question text alone often predicts the answer, models can score well without truly looking at the image, so the authors aim to elevate the role of image understanding.

Challenges:

  • Existing VQA models often exploit language priors rather than the image, which inflates benchmark performance without true visual understanding; for example, on the original VQA dataset, blindly answering "tennis" to questions starting with "What sport is" is correct about 41% of the time.

Methods:

  • The authors create a balanced VQA dataset by collecting complementary images: for every (image, question) pair, human annotators pick a visually similar image for which the same question has a different answer, so each question is paired with two similar images yielding two different answers (a toy representation is sketched below).
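
To make this pairing concrete, here is a minimal Python sketch of one balanced entry. The class and field names are hypothetical, not the paper's annotation format; the example question and answers are a complementary pair shown in the paper:

```python
from dataclasses import dataclass

# Hypothetical representation of one balanced entry in a VQA v2-style dataset.
@dataclass
class ComplementaryPair:
    question: str  # shared natural-language question
    image_a: str   # ID of the original image
    answer_a: str  # answer for image_a
    image_b: str   # ID of a visually similar, complementary image
    answer_b: str  # answer for image_b

    def is_balanced(self) -> bool:
        # The balancing protocol requires the two images to yield different answers.
        return self.answer_a != self.answer_b

pair = ComplementaryPair(
    question="Who is wearing glasses?",
    image_a="img_001.jpg", answer_a="man",
    image_b="img_002.jpg", answer_b="woman",
)
assert pair.is_balanced()
```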

Novelties:

  • Introduction of a balanced dataset that reduces language biases, and a novel interpretable model that explains its answers with counter-examples: images similar to the query image for which the model believes the answer is different (see the sketch after this item).
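
A hedged sketch of how counter-example selection might work: the paper's model learns to rank a query image's nearest neighbors, preferring candidates that look similar but are unlikely to yield the same answer. The hand-crafted score and all names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pick_counter_example(img_feat, candidate_feats, answer_probs):
    """Rank nearest-neighbor images as candidate counter-examples.

    img_feat:        (d,) feature vector of the query image
    candidate_feats: (k, d) features of k visually similar candidate images
    answer_probs:    (k,) estimated probability that each candidate yields
                     the same answer the model predicted for the query image
    """
    # Cosine similarity to the query image: a good counter-example looks similar.
    sims = candidate_feats @ img_feat / (
        np.linalg.norm(candidate_feats, axis=1) * np.linalg.norm(img_feat) + 1e-8
    )
    # Trade-off: visually similar, but unlikely to share the predicted answer.
    scores = sims * (1.0 - answer_probs)
    return int(np.argmax(scores))

# Toy usage with random features for 5 candidate images.
rng = np.random.default_rng(0)
best = pick_counter_example(rng.normal(size=64),
                            rng.normal(size=(5, 64)),
                            rng.uniform(size=5))
```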

Results:

  • State-of-the-art VQA models perform significantly worse on the balanced dataset than on the original, unbalanced one, confirming their reliance on language priors (the accuracy metric is sketched below).
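
For context, accuracy in these comparisons is the standard VQA metric, which gives an answer full credit when at least 3 of the 10 human annotators agree with it. A minimal sketch, using the common simplification of the official metric (which additionally averages over subsets of 9 annotators):

```python
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    # Full credit if at least 3 of the 10 annotators gave this answer,
    # proportional credit otherwise.
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> accuracy of 2/3.
print(vqa_accuracy("blue", ["blue", "blue"] + ["green"] * 8))
```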

Performances:

  • Models retrained on the balanced data generalize better to the balanced test set, and accuracy keeps improving as more balanced training data is added, indicating a need for larger, more balanced datasets.

Limitations:

  • The dataset is not perfectly balanced: for some questions no suitable complementary image exists, and annotators could flag such cases, so a fraction of questions remain unpaired.

Discussion:

  • The balanced dataset and counter-example explanations can help build trust in VQA models and push the field towards better visual understanding.