Title: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Authors: Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

Published: Dec 2, 2016

Link: https://arxiv.org/abs/1612.00837

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper addresses language bias in Visual Question Answering (VQA): because the question text alone often predicts the answer, models can score well without truly looking at the image, so the authors aim to elevate the role of image understanding.

Challenges:

  • Existing VQA models often exploit language priors rather than the image, which inflates benchmark performance without true visual understanding; for example, on the original VQA dataset, blindly answering "tennis" to questions starting with "What sport is" is correct about 41% of the time.

Methods:

  • The authors create a balanced VQA dataset by collecting complementary images: for every (image, question) pair, human annotators pick a visually similar image for which the same question has a different answer, so each question is paired with two similar images yielding two different answers (a toy representation is sketched below).
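
To make this pairing concrete, here is a minimal Python sketch of one balanced entry. The class and field names are hypothetical, not the paper's annotation format; the example question and answers are a complementary pair shown in the paper:

```python
from dataclasses import dataclass

# Hypothetical representation of one balanced entry in a VQA v2-style dataset.
@dataclass
class ComplementaryPair:
    question: str  # shared natural-language question
    image_a: str   # ID of the original image
    answer_a: str  # answer for image_a
    image_b: str   # ID of a visually similar, complementary image
    answer_b: str  # answer for image_b

    def is_balanced(self) -> bool:
        # The balancing protocol requires the two images to yield different answers.
        return self.answer_a != self.answer_b

pair = ComplementaryPair(
    question="Who is wearing glasses?",
    image_a="img_001.jpg", answer_a="man",
    image_b="img_002.jpg", answer_b="woman",
)
assert pair.is_balanced()
```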

Novelties:

  • Introduction of a balanced dataset that reduces language biases, and a novel interpretable model that explains its answers with counter-examples: images similar to the query image for which the model believes the answer is different (see the sketch after this item).
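
A hedged sketch of how counter-example selection might work: the paper's model learns to rank a query image's nearest neighbors, preferring candidates that look similar but are unlikely to yield the same answer. The hand-crafted score and all names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pick_counter_example(img_feat, candidate_feats, answer_probs):
    """Rank nearest-neighbor images as candidate counter-examples.

    img_feat:        (d,) feature vector of the query image
    candidate_feats: (k, d) features of k visually similar candidate images
    answer_probs:    (k,) estimated probability that each candidate yields
                     the same answer the model predicted for the query image
    """
    # Cosine similarity to the query image: a good counter-example looks similar.
    sims = candidate_feats @ img_feat / (
        np.linalg.norm(candidate_feats, axis=1) * np.linalg.norm(img_feat) + 1e-8
    )
    # Trade-off: visually similar, but unlikely to share the predicted answer.
    scores = sims * (1.0 - answer_probs)
    return int(np.argmax(scores))

# Toy usage with random features for 5 candidate images.
rng = np.random.default_rng(0)
best = pick_counter_example(rng.normal(size=64),
                            rng.normal(size=(5, 64)),
                            rng.uniform(size=5))
```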

Results:

  • State-of-the-art VQA models perform significantly worse on the balanced dataset than on the original, unbalanced one, confirming their reliance on language priors (the accuracy metric is sketched below).
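
For context, accuracy in these comparisons is the standard VQA metric, which gives an answer full credit when at least 3 of the 10 human annotators agree with it. A minimal sketch, using the common simplification of the official metric (which additionally averages over subsets of 9 annotators):

```python
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    # Full credit if at least 3 of the 10 annotators gave this answer,
    # proportional credit otherwise.
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> accuracy of 2/3.
print(vqa_accuracy("blue", ["blue", "blue"] + ["green"] * 8))
```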

Performances:

  • Models retrained on the balanced data generalize better to the balanced test set, and accuracy keeps improving as more balanced training data is added, indicating a need for larger, more balanced datasets.

Limitations:

  • The dataset is not perfectly balanced: for some questions no suitable complementary image exists, and annotators could flag such cases, so a fraction of questions remain unpaired.

Discussion:

  • The balanced dataset and counter-example explanations can help build trust in VQA models and push the field towards better visual understanding.