Idea

  • Apply the Transformer (the same architecture as in NLP) to vision tasks, which paves the way for multi-modal models
  • Completely abandon convolutions; because the Transformer lacks the inductive biases of CNNs (the built-in assumptions that guide learning, such as locality and translation equivariance), it may need more data to match the performance of a convolutional network

Model

  • Slice the original image into fixed-size patches
  • Convert these patches to embeddings through a linear layer (each patch gets a vector embedding, similar to a word embedding such as word2vec)
  • Add positional embeddings
  • Prepend a special token to the embedding sequence. This can be a [CLS] token (see Steps), whose output representation in the latent space after the final layer holds the information of the whole sequence (image); a minimal sketch of these input steps follows this list
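
A minimal PyTorch sketch of the input pipeline above (patchify, linear projection, [CLS] token, positional embeddings). The hyperparameters (224x224 images, 16x16 patches, 768-dim embeddings) are illustrative assumptions, not values fixed by these notes; the strided Conv2d is just a common equivalent way to implement the linear projection of flattened patches.

```python
import torch
import torch.nn as nn


class ViTEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches, implemented as a strided conv
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend [CLS]: (B, 197, 768)
        return x + self.pos_embed               # add positional embeddings


tokens = ViTEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                             # torch.Size([2, 197, 768])
```

The resulting sequence of 197 token embeddings is what gets fed into the standard Transformer encoder, and the output at the [CLS] position is used for classification.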