Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Published: Oct 22, 2020

Link: https://arxiv.org/abs/2010.11929

What I Learned:

  • Transformer architectures lack some of the inductive biases that CNNs have.
  • Large Transformer-based models typically follow a two-stage strategy: 1) pre-training on a large corpus, 2) fine-tuning on the downstream task.
    • BERT: a denoising self-supervised pre-training task.
    • GPT: language modeling as the pre-training task.
  • Differences in inductive bias:
    • CNNs: locality, 2D neighborhood structure, and translation equivariance are baked into every layer throughout the entire model.
    • ViT:
      • MLP layers: local and translation-equivariant.
      • Self-attention layers: global.
      • Patch extraction and the resolution-dependent adjustment of position embeddings at fine-tuning: the only points where the 2D neighborhood structure enters (see the sketch after this list).
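
As a concrete illustration of that last point, here is a minimal PyTorch sketch (my own, not code from the paper) of adjusting pre-trained position embeddings when fine-tuning at a higher resolution. The function name resize_pos_embed and the choice of bicubic resampling are assumptions; the paper only specifies a 2D interpolation of the position embeddings according to their location in the image.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate learned position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_grid*old_grid, dim), with the [class] token first.
    Returns:   (1, 1 + new_grid*new_grid, dim)
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # Lay the flat patch sequence back out on its 2D grid ...
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # ... and resample it for the larger grid used at the new resolution.
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. 224px pre-training with 16px patches -> 14x14 grid;
# 384px fine-tuning -> 24x24 grid.
pos = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pos, 14, 24).shape)  # torch.Size([1, 577, 768])
```

Apart from this interpolation and the patch extraction itself, the model receives no 2D structure; everything else has to be learned from data.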

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper explores the application of Transformers to image recognition, proposing the Vision Transformer (ViT), which processes images as sequences of patches.

Challenges:

  • Transformers lack the inductive biases inherent to CNNs, such as translation equivariance and locality, making them less effective with smaller datasets.

Methods:

  • Images are split into fixed-size patches, which are linearly embedded and fed into a standard Transformer encoder, as sketched below. The model is pre-trained on large datasets and fine-tuned on smaller benchmarks.
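
Below is a minimal sketch of that input pipeline, assuming PyTorch and the ViT-Base defaults (16x16 patches, 768-dim embeddings). The PatchEmbed class name is mine, and the strided convolution is a common implementation shortcut equivalent to flattening each patch and applying a shared linear projection; the Transformer encoder that consumes the resulting token sequence is omitted.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project them linearly."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution flattens each patch and applies
        # the same linear projection to all of them.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.num_patches, dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the learnable [class] token
        return x + self.pos_embed            # add learned position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```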

Novelties:

  • The approach eliminates the need for CNNs, using a pure Transformer architecture for image classification.

Results:

  • ViT achieves excellent results on benchmarks like ImageNet and CIFAR-100 when pre-trained on large datasets.

Performance:

  • When pre-trained on large datasets, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to train.

Limitations:

  • ViT underperforms on smaller datasets due to the lack of inductive biases.

Discussion:

  • The paper suggests that large-scale pre-training can compensate for the lack of inductive biases in Transformers, making them competitive with CNNs. Future work includes applying ViT to other vision tasks and exploring self-supervised pre-training methods.