An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Idea
Apply the Transformer (the same architecture as in NLP) to vision tasks, which paves the way for multi-modal models
Completely abandon CNNs. The model may need more data to match CNN performance, since the Transformer lacks their inductive biases, i.e., the built-in assumptions that guide learning, such as the locality and translation equivariance present in CNNs
Model
Slice the original image into 16×16 pixel patches
Convert these patches to embeddings through a linear projection (each patch gets a vector embedding, similar to a word embedding in word2vec)
Add positional embeddings
Prepend a special token to the embedding sequence. This can be a [CLS] token (see Steps), whose output vector representation in the latent space after the final layer holds the information of the whole sequence (image) and is fed to the classification head. A sketch of this pipeline follows this list
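Below is a minimal PyTorch sketch of the patch-embedding pipeline described above (patching, linear projection, [CLS] token, positional embeddings). The class name `PatchEmbedding` and the defaults (224×224 input, 16×16 patches, 768-dim embeddings, roughly the ViT-Base setting) are illustrative assumptions, not the paper's reference code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and embed each one (illustrative sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying a shared linear layer to each.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and positional embeddings (one per patch + one for [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                              # x: (B, C, H, W)
        x = self.proj(x)                               # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)               # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                 # prepend [CLS] -> (B, N+1, D)
        x = x + self.pos_embed                         # add positional embeddings
        return x

# Usage: a 224x224 image becomes 196 patch tokens plus one [CLS] token.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting token sequence is then fed to a standard Transformer encoder; the output vector at the [CLS] position is the one used for classification.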