An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Published: Oct 22 2020
Link: https://arxiv.org/abs/2010.11929
What I Learned:
- Transformer architectures lack some of the inductive biases that CNNs have.
- Large Transformer-based models often follow a two-stage strategy: 1) pre-training on a large corpus, 2) fine-tuning on the target task.
- BERT: a denoising self-supervised pre-training task.
- GPT: language modeling as the pre-training task.
- Differences in inductive bias:
- CNNs: locality, 2D neighborhood structure, and translation equivariance are baked into every layer throughout the entire model.
- ViT:
- MLP layers: local and translation equivariant
- Self-attention layers: global
- Creating patches and adjusting the position embeddings at fine-tuning time: the only places where 2D neighborhood structure is used (see the patch embedding sketch after this list)
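To make the patch-based view concrete, here is a minimal PyTorch sketch of the patch embedding step (names such as `PatchEmbed`, `img_size`, and `embed_dim` are my own, not from the paper's code). A `Conv2d` with kernel size and stride equal to the patch size is equivalent to splitting the image into non-overlapping 16x16 patches and applying a shared linear projection; a learnable [class] token and 1D position embeddings are then added, so no 2D structure is imposed beyond the patching itself.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    flattening each 16x16 patch and applying a shared linear projection.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and 1D position embeddings (no 2D structure is imposed).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) -- sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [class] token -> (B, 197, 768)
        return x + self.pos_embed              # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```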
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper explores the application of Transformers to image recognition, proposing the Vision Transformer (ViT), which processes images as sequences of patches.
Challenges:
- Transformers lack the inductive biases inherent to CNNs, such as translation equivariance and locality, making them less effective with smaller datasets.
Methods:
- Images are split into patches, which are linearly embedded and fed into a Transformer. The model is pre-trained on large datasets and fine-tuned on smaller benchmarks.
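Fine-tuning is typically done at a higher resolution than pre-training, which changes the number of patches; the paper handles this by 2D-interpolating the pre-trained position embeddings onto the new patch grid. A rough sketch of that step (the helper name `resize_pos_embed`, the bicubic mode, and the grid sizes are my assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """2D-interpolate pre-trained position embeddings onto a new patch grid.

    pos_embed: (1, 1 + old_grid**2, D) -- [class] token embedding plus patch embeddings.
    Returns:   (1, 1 + new_grid**2, D)
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pe.shape[-1]
    # Reshape the flat patch embeddings back onto their 2D grid, interpolate, flatten again.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. pre-trained at 224px (14x14 patches of size 16), fine-tuned at 384px (24x24 patches)
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe).shape)  # torch.Size([1, 577, 768])
```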
Novelties:
- The approach eliminates the need for CNNs, using a pure Transformer architecture for image classification.
Results:
- ViT achieves excellent results on benchmarks like ImageNet and CIFAR-100 when pre-trained on large datasets.
Performances:
- ViT outperforms state-of-the-art CNNs with fewer computational resources when pre-trained on large datasets.
Limitations:
- ViT underperforms on smaller datasets due to the lack of inductive biases.
Discussion:
- The paper suggests that large-scale pre-training can compensate for the lack of inductive biases in Transformers, making them competitive with CNNs. Future work includes applying ViT to other vision tasks and exploring self-supervised pre-training methods.
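As a concrete illustration of the self-supervised direction, below is a rough sketch of a BERT-style masked patch prediction objective: corrupt a subset of patch tokens and reconstruct them from the encoder output. This is my own simplification (the masking ratio, zero-corruption, pixel-regression target, and all names are assumptions), not the paper's exact masked patch prediction setup.

```python
import torch
import torch.nn as nn

def masked_patch_loss(encoder, head, patches, mask_ratio=0.5):
    """Generic masked-patch objective: corrupt a random subset of patch tokens
    and regress their original (flattened) pixel values from the encoder output.

    patches: (B, N, P) flattened image patches; `encoder` maps (B, N, P) -> (B, N, D);
    `head` maps D -> P. All names here are illustrative, not from the paper's code.
    """
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio      # True = corrupted
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)         # zero out masked patches
    pred = head(encoder(corrupted))                                  # (B, N, P) reconstruction
    # Loss is computed only on the corrupted positions, as in BERT-style denoising.
    return ((pred - patches) ** 2)[mask].mean()

# Toy usage with stand-in modules: 196 patches of 16*16*3 = 768 pixels each.
enc = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 768)
loss = masked_patch_loss(enc, head, torch.randn(4, 196, 768))
loss.backward()
```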