Understanding CL and MIM in Vision Transformers: A Comparative Analysis


Recently, the paper "What Do Self-Supervised Vision Transformers Learn?" caught my attention, and I decided to write this short post for those new to Vision Transformers (ViTs).

In the realm of ViTs, there are two fundamental self-supervised learning techniques: Contrastive Learning (CL) and Masked Image Modeling (MIM).

CL is a widely used self-supervised learning method. It pulls the embeddings (representations in a high-dimensional space) of different transformations of the same image, such as a rotated or cropped version, closer together, and pushes the embeddings of different images apart. Because no labels are involved, "different" here means different images, not different classes. MIM, on the other hand, has risen to prominence more recently, particularly with ViTs. MIM masks random patches of the input image and then reconstructs the missing pixels, as shown in the image below.
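To make the two objectives concrete, here is a minimal NumPy sketch. It is an illustration under my own assumptions, not the code used in the paper: the function names, the InfoNCE-style form of the contrastive loss, the temperature of 0.1, and the mean-squared-error reconstruction loss are all choices made for this example.

```python
import numpy as np


def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss for two (N, D) batches of embeddings.

    Row i of z1 and row i of z2 are two augmented views of the same image
    (a positive pair); every other row of z2 serves as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask of shape (num_patches,), True where a patch is hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(patches, predicted, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((predicted[mask] - patches[mask]) ** 2))
```

In a real pipeline, `z1` and `z2` would come from a ViT encoder applied to two augmentations of each image, and `predicted` from a decoder that only sees the unmasked patches. Here they are plain arrays so the losses can be inspected in isolation.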

The paper "What Do Self-Supervised Vision Transformers Learn?" (https://arxiv.org/abs/2305.00729) by Park et al. studies CL-trained and MIM-trained ViTs in detail and reports the following findings:

1. As expected, CL primarily captures global patterns, whereas MIM focuses on more local ones. CL is also more shape-oriented, while MIM is more texture-oriented.

2. CL plays a significant role in the later layers of the ViT architecture, training the self-attention layers to capture longer-range global patterns, such as the shape of an object. However, this global focus also reduces the diversity of representations, which worsens scalability and dense prediction performance.

3. MIM utilizes the high-frequency signals of the representations and mainly focuses on the early layers of the ViT.
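Two of the notions in this list, how far attention reaches and how much high-frequency content a representation carries, can be probed with simple proxies. The sketch below is my own illustration and not the paper's exact metrics: `mean_attention_distance` and `high_frequency_energy_ratio` are hypothetical helper names, and the cutoff at half the Nyquist frequency is an arbitrary choice.

```python
import numpy as np


def mean_attention_distance(attn, grid_size):
    """Average spatial distance (in patch units) weighted by attention.

    attn: (N, N) array whose rows sum to 1, with N = grid_size ** 2 patch
    tokens (the class token is assumed to have been removed beforehand).
    """
    ys, xs = np.divmod(np.arange(grid_size ** 2), grid_size)
    coords = np.stack([ys, xs], axis=1).astype(float)                  # (N, 2)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # (N, N)
    return float(np.mean(np.sum(attn * dist, axis=1)))


def high_frequency_energy_ratio(feature_map, cutoff=0.5):
    """Fraction of spectral energy above cutoff * Nyquist for an (H, W) map."""
    spectrum = np.abs(np.fft.fft2(feature_map)) ** 2
    fy = np.fft.fftfreq(feature_map.shape[0])[:, None]
    fx = np.fft.fftfreq(feature_map.shape[1])[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)   # cycles per sample; Nyquist is 0.5
    high = radius > cutoff * 0.5
    return float(spectrum[high].sum() / spectrum.sum())
```

An attention map in which every token attends only to itself has a mean attention distance of zero, while uniform attention over the whole grid gives a much larger value. Under these proxies, a more global model would score higher on the first function, and a more texture-oriented representation higher on the second.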

The paper argues that CL and MIM can complement each other and that even the simplest harmonization can help leverage the advantages of both methods.
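One straightforward way to read "simplest harmonization" is a weighted sum of the two losses. The snippet below is a sketch under that assumption; the function name `hybrid_loss`, the parameter `cl_weight`, and its default value are hypothetical and chosen only for illustration.

```python
def hybrid_loss(loss_cl, loss_mim, cl_weight=0.2):
    """Convex combination of a contrastive loss and a masked-modeling loss.

    cl_weight = 0 recovers pure MIM; cl_weight = 1 recovers pure CL.
    """
    if not 0.0 <= cl_weight <= 1.0:
        raise ValueError("cl_weight must lie in [0, 1]")
    return cl_weight * loss_cl + (1.0 - cl_weight) * loss_mim
```

The weight then acts as a single knob for how much of CL's global, shape-oriented behaviour is mixed into MIM's local, texture-oriented behaviour.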