Idea

Develop an asymmetric encoder-decoder model, where

  • A large random subset of image patches is masked out, and the visible patches are taken as the input of the encoder
  • Masked tokens are introduces after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that aims to reconstruct the original image in pixels.
  • The two steps above is during the pre-training process. After per-training, the decoder is discarded and the encoder is applied to uncorrupted images for recognition tasks

Questions

  • Why mask a large ratio of image patches (such as 75%):
    • It optimizes accuracy since the task is becoming challenging
    • Reduce memory and time costs during pre-training