Idea

  • Apply the Transformer (the same architecture as in NLP) to vision tasks, which paves the way for multi-modal models
  • Completely abandon convolutions; because the Transformer lacks the inductive biases of CNNs (the built-in assumptions that guide learning, such as locality and translation equivariance), it may need more data to match the performance of a convolutional network

Model

  • Slice the original image into fixed-size patches
  • Convert these patches to embeddings through a linear layer (each patch gets a vector embedding, similar to a word embedding such as word2vec)
  • Add positional embeddings
  • Prepend a special token to the embedding sequence. This can be a [CLS] token (see Steps), whose output representation in the latent space after the final layer holds the information of the whole sequence (image); a minimal sketch of these input steps follows this list
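
A minimal PyTorch sketch of the input pipeline above (patchify, linear projection, [CLS] token, positional embeddings). The hyperparameters (224x224 images, 16x16 patches, 768-dim embeddings) are illustrative assumptions, not values fixed by these notes; the strided Conv2d is just a common equivalent way to implement the linear projection of flattened patches.

```python
import torch
import torch.nn as nn


class ViTEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches, implemented as a strided conv
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend [CLS]: (B, 197, 768)
        return x + self.pos_embed               # add positional embeddings


tokens = ViTEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                             # torch.Size([2, 197, 768])
```

The resulting sequence of 197 token embeddings is what gets fed into the standard Transformer encoder, and the output at the [CLS] position is used for classification.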