Title: VQA: Visual Question Answering

Authors: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

Published: May 3, 2015

Link: https://arxiv.org/abs/1505.00468

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper introduces the task of Visual Question Answering (VQA), which involves answering natural language questions about images.

Challenges:

  • VQA requires detailed image understanding and complex reasoning beyond simple image captioning.

Methods:

  • The authors provide a large dataset with ∼0.25M images, ∼0.76M questions, and ∼10M answers.
  • They compare numerous baselines and methods for VQA, including models that combine LSTM question encodings with CNN image features (sketched below).
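To make the baseline family concrete, here is a minimal PyTorch sketch in the spirit of the paper's "LSTM Q + norm I" model: an LSTM encodes the question, l2-normalized CNN image features are projected into the same space, the two are fused by element-wise multiplication, and a softmax selects one of the top-K most frequent answers. Layer sizes and all names are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of an LSTM-question + normalized-image-feature VQA baseline.
# Dimensions and names are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers=1000,
                 embed_dim=300, hidden_dim=1024, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)  # map fc7-style features
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, image_features):
        # Encode the question; keep the final LSTM hidden state.
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                    # (batch, hidden_dim)
        # l2-normalize image features, project into the common space.
        v = self.img_proj(F.normalize(image_features, dim=1))
        # Fuse by element-wise multiplication, then classify over answers.
        return self.classifier(torch.tanh(q * v))

# Toy usage: 2 questions padded to length 8, plus fc7-like image features.
model = VQABaseline(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 8)), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 1000])
```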

Novelties:

  • The task involves open-ended, free-form questions and answers, increasing the diversity of knowledge and reasoning needed.

Results:

  • The dataset includes 204,721 images from MS COCO and 50,000 abstract scenes.
  • Three questions were collected for each image or scene, and each question was answered by ten different subjects.
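These counts line up with the totals above: (204,721 + 50,000) images × 3 questions ≈ 0.76M questions, and ten human answers per question gives ≈ 7.6M answers; the remainder of the ∼10M answer total presumably comes from additional candidate answers collected for evaluation (an assumption, not stated in this summary).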

Performances:

  • The paper reports human performance as a reference point and proposes an automatic, consensus-based accuracy metric for open-ended answers (sketched below).
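The paper's automatic metric gives full credit when a predicted answer matches at least three of the ten human answers: Acc(ans) = min(#humans that said ans / 3, 1). A minimal sketch follows; the function and variable names are mine, and note the released evaluation code additionally averages this over subsets of annotators.

```python
# Sketch of the paper's consensus accuracy metric: an answer scores
# min(#humans who gave that answer / 3, 1), so matching at least three
# of the ten annotators earns full credit. Names are illustrative.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("no",  ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("two", ["yes"] * 7 + ["no"] * 3))  # 0.0
```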

Limitations:

  • Some questions can be answered from commonsense knowledge or language priors alone, without looking at the image; the paper probes this with question-only baselines (sketched below).
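A minimal sketch of such a "blind" language-prior baseline: ignore the image and answer with the most common training answer for the question's opening words. The data and helper names here are hypothetical, intended only to illustrate why language priors alone can score well.

```python
# "Blind" language-prior baseline: answer from question text alone by
# memorizing the most frequent training answer per question opening.
from collections import Counter, defaultdict

def question_type(q: str, n_words: int = 2) -> str:
    return " ".join(q.lower().split()[:n_words])

def fit_prior(train_pairs):
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_type(question)][answer] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

prior = fit_prior([("Is the man smiling?", "yes"),
                   ("Is the dog asleep?", "no"),
                   ("Is the light on?", "yes"),
                   ("What color is the bus?", "red")])
print(prior.get(question_type("Is the cat black?"), "yes"))  # "yes"
```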

Discussion:

  • The authors frame VQA as an "AI-complete" task, one that pushes the boundaries of both computer vision and natural language processing.