A2MIM: Architecture-Agnostic Masked Image Modeling, ICML 2023
Overview
Paper: Li et al., Architecture-Agnostic Masked Image Modeling - From ViT back to CNN (ICML 2023 open access or arXiv).
(Figures and tables in this post are from the original paper)
Novelties of the Paper
- They proposed a new approach called “Architecture-Agnostic Masked Image Modeling” (A2MIM) that encourages networks to learn middle-order interactions between patches.
- Instead of the learnable mask token used in existing MIM frameworks, they fill masked patches with the image's mean RGB value (a minimal sketch of this masking step is shown after this list).
- A Fourier-domain loss, inspired by the Focal Frequency Loss, is added to the reconstruction objective so that middle-order interactions are modeled better (a sketch of such a loss also follows this list).
- Because it is architecture-agnostic, A2MIM can be applied to improve both CNNs and Transformers.
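To make the masking step concrete, here is a minimal PyTorch sketch that fills randomly chosen patches with the per-image mean RGB value. The function name, patch size, and masking ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def mask_with_mean_rgb(images: torch.Tensor, mask_ratio: float = 0.6, patch_size: int = 32):
    """Fill randomly selected patches with the per-image mean RGB value.

    images: (B, 3, H, W) batch with H and W divisible by patch_size.
    Returns the masked images and the patch-level boolean mask.
    Simplified sketch of the idea, not the paper's code.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size

    # Per-image mean RGB value, shape (B, 3, 1, 1).
    mean_rgb = images.mean(dim=(2, 3), keepdim=True)

    # Randomly choose which patches to mask for each image.
    mask = torch.rand(B, gh * gw, device=images.device) < mask_ratio
    mask_pix = mask.view(B, 1, gh, gw).float()
    # Upsample the patch-level mask to pixel resolution.
    mask_pix = mask_pix.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)

    # Replace masked pixels with the mean RGB value; keep the rest unchanged.
    masked_images = images * (1 - mask_pix) + mean_rgb * mask_pix
    return masked_images, mask
```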
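The frequency-domain term can be sketched in a similar spirit: compare the 2D FFT spectra of the reconstruction and the target, and re-weight each frequency by its current error, as the Focal Frequency Loss does. This is an assumed simplification rather than the exact loss formulation in the paper.

```python
import torch

def fourier_domain_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 1.0):
    """Frequency-domain reconstruction loss in the spirit of the Focal Frequency Loss.

    pred, target: (B, C, H, W) reconstructed and original images.
    Hypothetical sketch; the paper's exact weighting may differ.
    """
    pred_freq = torch.fft.fft2(pred, norm="ortho")
    target_freq = torch.fft.fft2(target, norm="ortho")

    # Squared distance between the complex spectra at every frequency.
    diff = pred_freq - target_freq
    freq_dist = diff.real ** 2 + diff.imag ** 2

    # Dynamic weighting: frequencies with larger current error get larger weights.
    weight = freq_dist.detach() ** alpha
    weight = weight / (weight.max() + 1e-8)

    return (weight * freq_dist).mean()
```

In a full MIM objective, a term like this would be added to the spatial reconstruction loss computed over the masked patches.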
Performance Evaluation Methods
- They evaluated A2MIM from three perspectives: fine-tuning for image classification, and transfer learning for object detection and semantic segmentation.
- They showed that A2MIM pre-training improves the representation quality of the pre-trained networks.
Discussions
- CNNs benefited less from A2MIM than Transformers did. The authors conjectured that the inductive bias of CNNs limits how well middle-order interactions can be learned.
- Transformers benefited more from longer pre-training with A2MIM.
What I learned
- ViTs and CNNs behave like low-pass and high-pass filters, respectively; because each favors a particular frequency band, it is difficult for both to model middle-order interactions well.