Overview

Paper: Li et al., "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN" (ICML 2023; also available on arXiv).

(Figures and tables in this post are from the original paper)

Novelties of the Paper

  • They proposed Architecture-Agnostic Masked Image Modeling (A2MIM), a pre-training approach designed to strengthen middle-order interactions between patches.
  • Instead of the learnable mask token used in existing MIM frameworks, they filled masked patches with the mean RGB value (see the masking sketch after this list).
  • They introduced a Fourier-domain loss, inspired by the Focal Frequency Loss, so that the reconstruction also attends to medium-frequency components and thereby encourages middle-order interactions (see the loss sketch after this list).
  • A2MIM can be applied to improve both CNNs and Transformers.
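As a concrete illustration of the masking scheme, here is a minimal PyTorch sketch that fills randomly chosen patches with the mean RGB value. This is my own reconstruction, not the authors' released code: the helper name `mean_rgb_mask`, the defaults (`patch_size`, `mask_ratio`), and the choice of a per-image mean are assumptions for illustration.

```python
import torch

def mean_rgb_mask(images: torch.Tensor, patch_size: int = 32, mask_ratio: float = 0.6):
    """Replace a random subset of patches with the per-image mean RGB value.

    images: (B, 3, H, W), with H and W divisible by patch_size.
    Returns the masked images and the boolean patch-grid mask.
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    num_patches = gh * gw
    num_masked = int(num_patches * mask_ratio)

    # Per-image mean RGB value, broadcastable over the spatial dims.
    mean_rgb = images.mean(dim=(2, 3), keepdim=True)            # (B, 3, 1, 1)

    # Sample a random set of patches to mask for each image.
    noise = torch.rand(b, num_patches, device=images.device)
    ids = noise.argsort(dim=1)[:, :num_masked]                  # indices of masked patches
    mask = torch.zeros(b, num_patches, dtype=torch.bool, device=images.device)
    mask.scatter_(1, ids, True)
    mask = mask.view(b, 1, gh, gw)

    # Upsample the patch mask to pixel resolution, then fill with the mean RGB.
    pixel_mask = mask.repeat_interleave(patch_size, 2).repeat_interleave(patch_size, 3)
    masked = torch.where(pixel_mask, mean_rgb.expand_as(images), images)
    return masked, mask
```

During pre-training, the network would receive `masked` and be trained to reconstruct the original `images`; because the fill value is a plain image statistic rather than a learnable token, the same masking works for CNNs and Transformers alike.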
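The frequency-domain loss can be sketched in the same spirit. The snippet below compares the 2D FFT spectra of prediction and target and up-weights frequencies with larger current error, following the general Focal Frequency Loss idea; the exact weighting used in A2MIM may differ, and `fourier_loss` is a hypothetical name.

```python
import torch

def fourier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8):
    """Spectral reconstruction loss between (B, C, H, W) tensors."""
    # Complex spectra of prediction and target, per channel.
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")

    # Squared distance between the complex spectra at every frequency.
    dist = (pred_f - target_f).abs() ** 2

    # Focal-style weighting: frequencies that are currently harder get
    # larger weight, so easy (typically low) frequencies do not dominate.
    weight = dist.detach() / (dist.detach().amax(dim=(-2, -1), keepdim=True) + eps)
    return (weight * dist).mean()
```

In practice this term would be added to the usual pixel-space reconstruction loss, nudging the model to recover middle-frequency content that a plain L1/L2 loss tends to neglect.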

Performance Evaluation Methods

  • They evaluated A2MIM from three perspectives: fine-tuning for image classification, and transfer learning for object detection and for semantic segmentation.
  • They showed that A2MIM improves the representation quality of pre-trained networks.

Discussions

  • CNNs benefited less from A2MIM than Transformers did. The authors conjectured that the inductive bias of CNNs limits the learning of middle-order interactions.
  • Transformers gained more from longer pre-training with A2MIM.

What I learned

  • ViTs behave like low-pass filters while CNNs behave like high-pass filters; because each favors its own frequency band, both struggle to model middle-order interactions, which correspond to the middle frequencies (a small band-splitting sketch follows below).
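To make the frequency-band intuition concrete, here is a small sketch (not from the paper) that splits an image into low-, middle-, and high-frequency parts with radial FFT masks; the cutoff radii `low` and `high` are arbitrary illustration values.

```python
import torch

def band_split(img: torch.Tensor, low: float = 0.1, high: float = 0.4):
    """Split a (C, H, W) image into low-, mid-, and high-frequency parts."""
    c, h, w = img.shape
    f = torch.fft.fftshift(torch.fft.fft2(img, norm="ortho"), dim=(-2, -1))

    # Normalized radial frequency for every FFT bin (0 at the center).
    fy = torch.linspace(-0.5, 0.5, h).view(h, 1)
    fx = torch.linspace(-0.5, 0.5, w).view(1, w)
    radius = (fy ** 2 + fx ** 2).sqrt()

    def back(masked):
        # Inverse transform of a masked spectrum, keeping the real part.
        shifted = torch.fft.ifftshift(masked, dim=(-2, -1))
        return torch.fft.ifft2(shifted, norm="ortho").real

    low_part = back(f * (radius <= low))
    mid_part = back(f * ((radius > low) & (radius <= high)))
    high_part = back(f * (radius > high))
    return low_part, mid_part, high_part
```

Visualizing the three parts shows what each architecture tends to emphasize: the low band carries smooth global structure (where ViTs are strong), the high band carries edges and texture (where CNNs are strong), and the middle band is exactly the region A2MIM tries to exploit.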