Title: Yin and Yang: Balancing and Answering Binary Visual Questions

Authors: Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

Published: Nov 16, 2015

Link: https://arxiv.org/abs/1511.05099

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper addresses binary Visual Question Answering (VQA) on abstract scenes, framing the task as visual verification of the concepts queried in the question.

Challenges:

  • Language priors can let models score well without genuine visual understanding.
  • Dataset biases can hinder progress in multi-modal AI.

Methods:

  • Convert each question into a tuple that concisely summarizes the visual concept being asked about.
  • Use abstract scenes to balance the dataset so that every question has an equal number of “yes” and “no” answers, obtained by collecting complementary scenes (see the sketch after this list).
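
The tuple-and-verify idea can be illustrated with a small sketch. The data structures and helper names below (Tuple3, extract_tuple, verify, and the set-of-facts scene representation) are assumptions made for illustration, not the authors' code:

```python
# Illustrative sketch only: summarize a binary question as a
# (primary object, relation, secondary object) tuple and answer "yes"
# if that concept is depicted in the abstract scene.
from typing import NamedTuple, Optional, Set, Tuple


class Tuple3(NamedTuple):
    primary: str               # e.g. "dog"
    relation: str              # e.g. "next to"
    secondary: Optional[str]   # e.g. "table"


def extract_tuple(question: str) -> Tuple3:
    """Toy stand-in: the paper extracts the tuple from the question text;
    here we simply return the tuple for the example question below."""
    return Tuple3(primary="dog", relation="next to", secondary="table")


def verify(tup: Tuple3, scene_facts: Set[Tuple[str, str, str]]) -> str:
    """Answer "yes" if the queried concept appears among the scene's facts."""
    present = (tup.primary, tup.relation, tup.secondary) in scene_facts
    return "yes" if present else "no"


# A scene represented (hypothetically) as a set of depicted facts.
scene = {("dog", "next to", "table"), ("boy", "holding", "ball")}
print(verify(extract_tuple("Is the dog next to the table?"), scene))  # -> yes
```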

Novelties:

  • Balanced dataset creation with complementary scenes (a sketch of the pairing idea follows this list).
  • Tuple extraction for concise visual-concept representation.
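
A minimal sketch of the balancing idea, under an assumed data layout (the Example record and the complements lookup are illustrative, not the released dataset format): each kept question is paired with one scene answered “yes” and a complementary scene answered “no”.

```python
# Illustrative sketch only: pair each (question, scene) example with a
# complementary scene whose answer is flipped, so "yes" and "no" are balanced
# per question and language priors alone cannot do well.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Example:
    question: str
    scene_id: str
    answer: str  # "yes" or "no"


def build_balanced_split(originals: List[Example],
                         complements: Dict[Tuple[str, str], str]) -> List[Example]:
    """Keep an example only if a complementary scene exists (the paper notes
    some scenes cannot be complemented due to the limited clipart library)."""
    balanced: List[Example] = []
    for ex in originals:
        comp_scene = complements.get((ex.question, ex.scene_id))
        if comp_scene is None:
            continue  # no complement available; drop to preserve balance
        flipped = "no" if ex.answer == "yes" else "yes"
        balanced.append(ex)
        balanced.append(Example(ex.question, comp_scene, flipped))
    return balanced


# Toy usage:
orig = [Example("Is the dog next to the table?", "scene_001", "yes")]
comps = {("Is the dog next to the table?", "scene_001"): "scene_001_comp"}
print(build_balanced_split(orig, comps))
```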

Results:

  • Language-only models perform poorly on the balanced dataset.
  • The proposed approach matches state-of-the-art performance on the unbalanced dataset and outperforms it on the balanced dataset.

Performances:

  • Significant improvement in visual reasoning and understanding.
  • Attending to the image regions relevant to the question yields better performance.

Limitations:

  • Some scenes cannot be given complementary counterparts because of the limited clipart library.
  • Handling negative questions remains challenging.

Discussion:

  • Balancing the dataset encourages genuine visual understanding rather than reliance on language priors.
  • Future work should focus on detailed visual semantics and real images.