Idea

Apply a Vision Transformer (ViT) to semantic segmentation. Unlike DPT (Vision Transformers for Dense Prediction), this paper also uses a mask Transformer as the decoder.

The class masks make the decoder focus on specific foreground areas, which could help extract more relevant and localized features for segmentation.

Architecture

Encoder

Patches are flattened and linearly projected to form the embeddings, then position embeddings are added, just like in the original Vision Transformer paper.
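A minimal sketch of this embedding step in PyTorch (the class name, patch size, and dimensions are illustrative, not prescribed by the paper):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten image patches, project them linearly, add position embeddings.

    A sketch of the ViT-style embedding; names and defaults are illustrative.
    """
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A stride-P convolution is equivalent to splitting the image into
        # P x P patches, flattening each, and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D) patch embeddings
        return x + self.pos_embed            # add learned position embeddings
```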

Decoder

The sequence of patch encodings $z \in \mathbb{R}^{N \times D}$ is decoded to a segmentation map $s \in \mathbb{R}^{H \times W \times K}$, where $K$ is the number of classes.

The decoder learns to map the patch-level encodings coming from the encoder to patch-level class scores. Next, these patch-level class scores are upsampled to pixel-level scores by bilinear interpolation, as in the sketch below.
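The upsampling step on its own could look like this (a sketch; tensor names and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

B, K, P = 2, 21, 16                               # batch, classes, patch size
H, W = 224, 224                                   # target image size
patch_scores = torch.randn(B, K, H // P, W // P)  # patch-level class scores

# Bilinear interpolation lifts patch-level scores to pixel-level scores.
pixel_scores = F.interpolate(patch_scores, size=(H, W), mode="bilinear",
                             align_corners=False)  # (B, K, H, W)
```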

Linear

A point-wise linear layer is applied to the patch encodings $z \in \mathbb{R}^{N \times D}$ to produce patch-level class logits $z_{\text{lin}} \in \mathbb{R}^{N \times K}$. The sequence is then reshaped into a 2D feature map $s_{\text{lin}} \in \mathbb{R}^{H/P \times W/P \times K}$. A softmax is then applied on the class dimension to obtain the final segmentation map.
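A sketch of this linear decoder in PyTorch (class and argument names are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder(nn.Module):
    """Point-wise linear decoder: per-patch logits -> 2D map -> upsample -> softmax.

    A sketch of the linear baseline; names and defaults are illustrative.
    """
    def __init__(self, embed_dim=768, num_classes=21, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.head = nn.Linear(embed_dim, num_classes)   # applied per patch

    def forward(self, z, image_size):        # z: (B, N, D), N = (H/P) * (W/P)
        H, W = image_size
        h, w = H // self.patch_size, W // self.patch_size
        logits = self.head(z)                # (B, N, K) patch-level class logits
        B, N, K = logits.shape
        s = logits.transpose(1, 2).reshape(B, K, h, w)   # 2D feature map
        s = F.interpolate(s, size=(H, W), mode="bilinear",
                          align_corners=False)           # pixel-level scores
        return s.softmax(dim=1)              # softmax over the class dimension
```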

Mask Transformer

Introduce a set of $K$ learnable class embeddings $\text{cls} = [\text{cls}_1, \dots, \text{cls}_K] \in \mathbb{R}^{K \times D}$. Each class embedding is initialized randomly and assigned to a single semantic class.

The class embeddings are processed jointly with the patch encodings by the decoder, as sketched below.
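A minimal sketch of this joint processing, assuming standard self-attention blocks stand in for the decoder layers (all names, sizes, and the layer count are illustrative):

```python
import torch
import torch.nn as nn

K, D = 21, 768                                   # classes, embedding dim (illustrative)
cls_emb = nn.Parameter(torch.randn(1, K, D))     # one learnable embedding per class

# Generic transformer blocks as a stand-in for the paper's decoder layers.
block = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
decoder = nn.TransformerEncoder(block, num_layers=2)

z = torch.randn(4, 196, D)                       # patch encodings from the encoder
tokens = torch.cat([z, cls_emb.expand(z.size(0), -1, -1)], dim=1)  # (B, N + K, D)
tokens = decoder(tokens)                         # patches and classes attend jointly
z_patches, c = tokens[:, :-K], tokens[:, -K:]    # split back into the two streams
```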

The mask Transformer generates masks by computing the scalar product between the L2-normalized patch embeddings $z' \in \mathbb{R}^{N \times D}$ and class embeddings $c \in \mathbb{R}^{K \times D}$ output by the decoder.

These mask sequences are reshaped into 2D masks to form $s_{\text{mask}} \in \mathbb{R}^{H/P \times W/P \times K}$ and bilinearly upsampled to the original image size to obtain a feature map $s \in \mathbb{R}^{H \times W \times K}$.
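Putting the mask computation together, a sketch (function and argument names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def class_masks(z_patches, c, grid_hw, image_hw):
    """Scalar product of L2-normalized patch and class embeddings -> class masks.

    A sketch; names are illustrative.
    z_patches: (B, N, D) patch embeddings output by the decoder
    c:         (B, K, D) class embeddings output by the decoder
    """
    z_patches = F.normalize(z_patches, dim=-1)   # L2-normalize both streams
    c = F.normalize(c, dim=-1)
    masks = z_patches @ c.transpose(1, 2)        # (B, N, K) mask sequences
    B, N, K = masks.shape
    h, w = grid_hw
    masks = masks.transpose(1, 2).reshape(B, K, h, w)    # 2D masks
    # Bilinear upsampling to the original image size yields the feature map;
    # a softmax over the class dimension is applied downstream.
    return F.interpolate(masks, size=image_hw, mode="bilinear",
                         align_corners=False)    # (B, K, H, W)
```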