VQA: Visual Question Answering
Title: VQA: Visual Question Answering
Authors: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Published: May 3, 2015
Link: https://arxiv.org/abs/1505.00468
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper introduces the task of Visual Question Answering (VQA), which involves answering natural language questions about images.
Challenges:
- VQA requires detailed image understanding and complex reasoning beyond simple image captioning.
Methods:
- The authors present a large-scale dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers.
- They compare various baselines and methods for VQA.
Novelties:
- The task involves open-ended, free-form questions and answers, increasing the diversity of knowledge and reasoning needed.
Results:
- The dataset includes 204,721 images from MS COCO and 50,000 abstract scenes.
- Three questions were collected per image or scene, and each question was answered by ten human subjects.
Performances:
- The paper reports human performance on the task and proposes an automatic evaluation metric for open-ended answers based on agreement with the ten human-provided answers.
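The paper's consensus-based accuracy for open-ended answers can be sketched as below. This is a minimal version of the rule stated in the paper (an answer is counted fully correct if at least three of the ten annotators gave it); the official evaluation additionally averages this over all ten choose nine subsets of annotators, which is omitted here for brevity.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Accuracy = min(# humans who gave `predicted` / 3, 1), per the VQA paper.

    `human_answers` is the list of ten annotator answers for one question.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators agree -> fully correct.
print(vqa_accuracy("yes", ["yes"] * 4 + ["no"] * 6))  # 1.0
# Only 2 of 10 agree -> partial credit of 2/3.
print(vqa_accuracy("yes", ["yes"] * 2 + ["no"] * 8))
```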
Limitations:
- Some questions can be answered from commonsense knowledge alone, without looking at the image, which can inflate language-only baselines.
Discussion:
- VQA is seen as a step towards solving AI-complete problems, pushing the boundaries of computer vision and natural language processing.