Title: VQA: Visual Question Answering

Authors: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Published: May 3, 2015

Link: https://arxiv.org/abs/1505.00468

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper introduces the task of Visual Question Answering (VQA), which involves answering natural language questions about images.

Challenges:

  • VQA requires detailed image understanding and complex reasoning beyond simple image captioning.

Methods:

  • The authors provide a large dataset with ∼0.25M images, ∼0.76M questions, and ∼10M answers.
  • They compare numerous baselines and methods for VQA, including models that combine LSTM question encodings with CNN image features (sketched below).
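To make the baseline family concrete, here is a minimal PyTorch sketch in the spirit of the paper's "LSTM Q + norm I" model: an LSTM encodes the question, l2-normalized CNN image features are projected into the same space, the two are fused by element-wise multiplication, and a softmax selects one of the top-K most frequent answers. Layer sizes and all names are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of an LSTM-question + normalized-image-feature VQA baseline.
# Dimensions and names are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers=1000,
                 embed_dim=300, hidden_dim=1024, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)  # map fc7-style features
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, image_features):
        # Encode the question; keep the final LSTM hidden state.
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                    # (batch, hidden_dim)
        # l2-normalize image features, project into the common space.
        v = self.img_proj(F.normalize(image_features, dim=1))
        # Fuse by element-wise multiplication, then classify over answers.
        return self.classifier(torch.tanh(q * v))

# Toy usage: 2 questions padded to length 8, plus fc7-like image features.
model = VQABaseline(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 8)), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 1000])
```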

Novelties:

  • The task involves open-ended, free-form questions and answers, increasing the diversity of knowledge and reasoning needed.

Results:

  • The dataset includes 204,721 images from MS COCO and 50,000 abstract scenes.
  • Three questions were collected for each image or scene, and each question was answered by ten different subjects.
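These counts line up with the totals above: (204,721 + 50,000) images × 3 questions ≈ 0.76M questions, and ten human answers per question gives ≈ 7.6M answers; the remainder of the ∼10M answer total presumably comes from additional candidate answers collected for evaluation (an assumption, not stated in this summary).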

Performances:

  • The paper reports human performance as a reference point and proposes an automatic, consensus-based accuracy metric for open-ended answers (sketched below).
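The paper's automatic metric gives full credit when a predicted answer matches at least three of the ten human answers: Acc(ans) = min(#humans that said ans / 3, 1). A minimal sketch follows; the function and variable names are mine, and note the released evaluation code additionally averages this over subsets of annotators.

```python
# Sketch of the paper's consensus accuracy metric: an answer scores
# min(#humans who gave that answer / 3, 1), so matching at least three
# of the ten annotators earns full credit. Names are illustrative.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("no",  ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("two", ["yes"] * 7 + ["no"] * 3))  # 0.0
```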

Limitations:

  • Some questions can be answered from commonsense knowledge or language priors alone, without looking at the image; the paper probes this with question-only baselines (sketched below).
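A minimal sketch of such a "blind" language-prior baseline: ignore the image and answer with the most common training answer for the question's opening words. The data and helper names here are hypothetical, intended only to illustrate why language priors alone can score well.

```python
# "Blind" language-prior baseline: answer from question text alone by
# memorizing the most frequent training answer per question opening.
from collections import Counter, defaultdict

def question_type(q: str, n_words: int = 2) -> str:
    return " ".join(q.lower().split()[:n_words])

def fit_prior(train_pairs):
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_type(question)][answer] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

prior = fit_prior([("Is the man smiling?", "yes"),
                   ("Is the dog asleep?", "no"),
                   ("Is the light on?", "yes"),
                   ("What color is the bus?", "red")])
print(prior.get(question_type("Is the cat black?"), "yes"))  # "yes"
```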

Discussion:

  • The authors frame VQA as an "AI-complete" task, one that pushes the boundaries of both computer vision and natural language processing.