An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Published: Oct 22 2020
Link: https://arxiv.org/abs/2010.11929
What I Learned:
- Transformer architectures lack some of the inductive biases that CNNs have.
- Large Transformer-based models often follow a two-stage strategy: 1) pre-training on a large corpus, 2) fine-tuning on the target task.
- BERT: a denoising self-supervised pre-training task.
- GPT: language modeling as the pre-training task.
- Differences in inductive bias:
- CNNs: locality, 2D neighborhood structure, and translation equivariance are baked into every layer throughout the entire model.
- ViT:
- MLP layers: local and translation equivariant
- Self-attention layers: global
- Creating patches and adjusting the position embeddings at fine-tuning time: the only places where 2D neighborhood structure is used (see the patch embedding sketch after this list)
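To make the patch-based view concrete, here is a minimal PyTorch sketch of the patch embedding step (names such as `PatchEmbed`, `img_size`, and `embed_dim` are my own, not from the paper's code). A `Conv2d` with kernel size and stride equal to the patch size is equivalent to splitting the image into non-overlapping 16x16 patches and applying a shared linear projection; a learnable [class] token and 1D position embeddings are then added, so no 2D structure is imposed beyond the patching itself.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    flattening each 16x16 patch and applying a shared linear projection.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and 1D position embeddings (no 2D structure is imposed).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) -- sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [class] token -> (B, 197, 768)
        return x + self.pos_embed              # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```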
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper explores the application of Transformers to image recognition, proposing the Vision Transformer (ViT), which processes images as sequences of patches.
Challenges:
- Transformers lack the inductive biases inherent to CNNs, such as translation equivariance and locality, making them less effective with smaller datasets.
Methods:
- Images are split into patches, which are linearly embedded and fed into a Transformer. The model is pre-trained on large datasets and fine-tuned on smaller benchmarks.
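Fine-tuning is typically done at a higher resolution than pre-training, which changes the number of patches; the paper handles this by 2D-interpolating the pre-trained position embeddings onto the new patch grid. A rough sketch of that step (the helper name `resize_pos_embed`, the bicubic mode, and the grid sizes are my assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """2D-interpolate pre-trained position embeddings onto a new patch grid.

    pos_embed: (1, 1 + old_grid**2, D) -- [class] token embedding plus patch embeddings.
    Returns:   (1, 1 + new_grid**2, D)
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pe.shape[-1]
    # Reshape the flat patch embeddings back onto their 2D grid, interpolate, flatten again.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. pre-trained at 224px (14x14 patches of size 16), fine-tuned at 384px (24x24 patches)
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe).shape)  # torch.Size([1, 577, 768])
```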
Novelties:
- The approach eliminates the need for CNNs, using a pure Transformer architecture for image classification.
Results:
- ViT achieves excellent results on benchmarks like ImageNet and CIFAR-100 when pre-trained on large datasets.
Performances:
- ViT outperforms state-of-the-art CNNs with fewer computational resources when pre-trained on large datasets.
Limitations:
- ViT underperforms on smaller datasets due to the lack of inductive biases.
Discussion:
- The paper suggests that large-scale pre-training can compensate for the lack of inductive biases in Transformers, making them competitive with CNNs. Future work includes applying ViT to other vision tasks and exploring self-supervised pre-training methods.
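As a concrete illustration of the self-supervised direction, below is a rough sketch of a BERT-style masked patch prediction objective: corrupt a subset of patch tokens and reconstruct them from the encoder output. This is my own simplification (the masking ratio, zero-corruption, pixel-regression target, and all names are assumptions), not the paper's exact masked patch prediction setup.

```python
import torch
import torch.nn as nn

def masked_patch_loss(encoder, head, patches, mask_ratio=0.5):
    """Generic masked-patch objective: corrupt a random subset of patch tokens
    and regress their original (flattened) pixel values from the encoder output.

    patches: (B, N, P) flattened image patches; `encoder` maps (B, N, P) -> (B, N, D);
    `head` maps D -> P. All names here are illustrative, not from the paper's code.
    """
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio      # True = corrupted
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)         # zero out masked patches
    pred = head(encoder(corrupted))                                  # (B, N, P) reconstruction
    # Loss is computed only on the corrupted positions, as in BERT-style denoising.
    return ((pred - patches) ** 2)[mask].mean()

# Toy usage with stand-in modules: 196 patches of 16*16*3 = 768 pixels each.
enc = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 768)
loss = masked_patch_loss(enc, head, torch.randn(4, 196, 768))
loss.backward()
```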