Idea
- BERT stands for Bidirectional Encoder Representations from Transformers
- BEIT stands for Bidirectional Encoder representation from Image Transformers
In computer vision, Vision Transformers (ViT) require more training data than CNNs. However, high-quality labeled data is scarce and expensive to acquire, so self-supervised pre-training is needed.
BEIT and MAE are both designed for self-supervised learning by predicting masked information. The major difference between the two models is that BEIT predicts discrete visual tokens while MAE regresses the original pixels (see the sketch below).
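A minimal sketch of the two training targets (all tensors here are random placeholders, not either model's real outputs):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: B masked patches, V-way visual-token vocabulary, P pixel values per patch
B, V, P = 64, 8192, 16 * 16 * 3

# BEIT-style target: classify the discrete visual token of each masked patch
masked_logits = torch.randn(B, V)           # prediction head output
target_tokens = torch.randint(0, V, (B,))   # token ids from the pre-trained image tokenizer
beit_loss = F.cross_entropy(masked_logits, target_tokens)

# MAE-style target: regress the raw pixels of each masked patch
pred_pixels = torch.randn(B, P)
target_pixels = torch.randn(B, P)
mae_loss = F.mse_loss(pred_pixels, target_pixels)
```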
Method
Before pre-training
Obtain an image tokenizer trained via autoencoding-style reconstruction and save the learned image-to-token vocabulary (BEIT reuses the publicly released discrete VAE tokenizer from DALL-E, with a vocabulary of 8192 visual tokens)
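A toy sketch of what such a tokenizer could look like; the class, names, and dimensions are illustrative, not the actual tokenizer used by BEIT:

```python
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Toy VQ-style tokenizer: encode a patch, snap it to the nearest codebook entry, decode.

    Real tokenizers (e.g., the discrete VAE reused by BEIT) make the discretization
    differentiable with Gumbel-softmax / straight-through tricks; this sketch omits that.
    """
    def __init__(self, patch_dim=16 * 16 * 3, vocab_size=8192, code_dim=256):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, code_dim)
        self.codebook = nn.Embedding(vocab_size, code_dim)   # the learned visual vocabulary
        self.decoder = nn.Linear(code_dim, patch_dim)

    def tokenize(self, patches):                              # patches: (B, N, patch_dim)
        z = self.encoder(patches)                             # (B, N, code_dim)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, vocab_size)
        return dists.argmin(dim=-1)                           # discrete token ids, (B, N)

    def forward(self, patches):
        tokens = self.tokenize(patches)
        return self.decoder(self.codebook(tokens))            # reconstruction drives training
```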
During pre-training
- Randomly mask some proportion of image patches and replace them with a special mask embedding [M]
- Feed the patch embeddings (including the mask embeddings) into a backbone vision Transformer
- The objective is to predict the visual tokens of the original image at the masked positions, i.e., the tokens that the tokenizer's decoder could use to reconstruct the image, although the decoder itself is not used during pre-training (a minimal sketch follows this list)
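A minimal sketch of the masked prediction step, assuming a generic Transformer encoder as a stand-in for the ViT backbone and random token ids as a stand-in for the tokenizer output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shapes, the backbone, and the random targets below are placeholders.
B, N, D, V = 8, 196, 768, 8192            # batch, patches per image, embedding dim, token vocabulary

patch_embed = nn.Linear(16 * 16 * 3, D)   # embed flattened 16x16 RGB patches
mask_embed = nn.Parameter(torch.zeros(1, 1, D))   # the special [M] embedding
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
head = nn.Linear(D, V)                    # predicts one visual-token id per masked patch

patches = torch.randn(B, N, 16 * 16 * 3)
target_tokens = torch.randint(0, V, (B, N))       # stand-in for tokenizer.tokenize(patches)

# BEIT masks roughly 40% of patches (blockwise in the paper; uniform random here for brevity)
mask = torch.rand(B, N) < 0.4
x = patch_embed(patches)
x = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, N, D), x)  # swap masked patches for [M]

logits = head(backbone(x))                                  # (B, N, V)
loss = F.cross_entropy(logits[mask], target_tokens[mask])   # loss only on masked positions
```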
Fine-tune on downstream tasks
such as image classification and semantic segmentation, by training a task-specific head on top of the pre-trained encoder (sketched below)
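For classification, for example, the mask-prediction head is discarded and a fresh task head is fine-tuned together with the encoder. A hedged sketch, where the wrapper class and names are illustrative:

```python
import torch
import torch.nn as nn

class ImageClassifier(nn.Module):
    """Hypothetical fine-tuning wrapper: pre-trained encoder + a fresh task head."""
    def __init__(self, pretrained_backbone, embed_dim=768, num_classes=1000):
        super().__init__()
        self.backbone = pretrained_backbone            # keeps the pre-trained weights
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_embeddings):               # (B, N, embed_dim)
        features = self.backbone(patch_embeddings)     # contextualized patch representations
        pooled = features.mean(dim=1)                  # average-pool patch features for classification
        return self.classifier(pooled)                 # trained end-to-end with cross-entropy
```

For semantic segmentation, the linear classifier is replaced by a decoder head that produces per-pixel predictions.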
Contribution
BEIT demonstrated for the first time that generative pre-training can achieve better fine-tuning results than contrastive learning, excelling in image classification and semantic segmentation. More importantly, by removing the reliance on supervised pre-training, BEIT can efficiently use unlabeled images to scale Vision Transformers to large model sizes. The "generative self-supervised renaissance" initiated by BEIT in the visual domain is believed to accelerate the field toward "the BERT moment of CV".