Idea

Based on MaskFormer, with two key changes:

  • Apply masked attention in the decoder
  • Use multi-scale high-resolution features

Architecture

Differences from MaskFormer:

Masked Attention

Constrain cross-attention to the predicted mask regions instead of attending to the whole image.

From the standard cross-attention ($l$ is the layer index, and $X_l$ is the query features at the $l$-th layer)

$$X_l = \mathrm{softmax}(Q_l K_l^{\mathrm{T}})\, V_l + X_{l-1}$$

to masked attention

$$X_l = \mathrm{softmax}(\mathcal{M}_{l-1} + Q_l K_l^{\mathrm{T}})\, V_l + X_{l-1}$$

where the attention mask at feature location $(x, y)$ is

$$\mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } M_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise.} \end{cases}$$

Here $M_{l-1}$ is the binarized output (with threshold at 0.5) of the resized mask prediction of the previous, i.e. $(l-1)$-th, Transformer decoder layer.
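As a concrete illustration, below is a minimal PyTorch-style sketch of one masked cross-attention step. It is a single-head sketch under assumed tensor shapes, without the learned projections, scaling, or layer norms of a real decoder layer; the NaN guard for fully-empty masks is also an assumption of this sketch, not taken from the paper.

```python
import torch

def masked_cross_attention(X_prev, Q, K, V, mask_logits_prev):
    """One masked cross-attention step (minimal sketch, single head).

    X_prev:           (N, C)  query features from the previous decoder layer
    Q:                (N, C)  queries derived from X_prev
    K, V:             (HW, C) keys/values from one image feature scale
    mask_logits_prev: (N, HW) mask prediction of the previous layer,
                              already resized to this feature resolution
    """
    # Binarize the previous mask prediction at 0.5 and turn it into the additive
    # attention mask M_{l-1}: 0 where the query's mask is foreground, -inf elsewhere.
    foreground = mask_logits_prev.sigmoid() >= 0.5
    attn_mask = torch.where(foreground,
                            torch.zeros_like(mask_logits_prev),
                            torch.full_like(mask_logits_prev, float("-inf")))

    # Guard (assumption of this sketch): if a query predicts an entirely empty mask,
    # fall back to full attention so the softmax row is not all -inf (and thus NaN).
    empty = foreground.logical_not().all(dim=-1, keepdim=True)
    attn_mask = attn_mask.masked_fill(empty, 0.0)

    # X_l = softmax(M_{l-1} + Q_l K_l^T) V_l + X_{l-1}
    logits = Q @ K.transpose(-2, -1) + attn_mask        # (N, HW)
    return torch.softmax(logits, dim=-1) @ V + X_prev   # (N, C)
```

In a full multi-head implementation, the same additive mask would simply be shared across heads.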

High-resolution features

Instead of always using the high-resolution feature map, utilize a feature pyramid consisting of both low- and high-resolution features, and feed one resolution of the multi-scale features to one Transformer decoder layer at a time.

Specifically, use the feature maps produced by the pixel decoder at 1/32, 1/16, and 1/8 of the original image resolution.
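A small sketch of that round-robin scale schedule is shown below. The shapes, variable names, and the plain nn.TransformerDecoderLayer used as a stand-in for the masked-attention layer are assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

C, N, L = 256, 100, 9                # channels, number of queries, decoder layers (3 rounds of 3 scales)
H, W = 512, 512                      # example input image size

# Pixel-decoder features at 1/32, 1/16, and 1/8 resolution, flattened to (HW, C).
pyramid = [torch.randn((H // s) * (W // s), C) for s in (32, 16, 8)]

queries = torch.randn(N, C)          # in Mask2Former these are learnable query features
layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model=C, nhead=8) for _ in range(L)])

for l, layer in enumerate(layers):
    feats = pyramid[l % 3]           # 1/32 -> 1/16 -> 1/8, repeated for three rounds
    # Each decoder layer attends to exactly one scale; a plain decoder layer stands in
    # here for the masked-attention layer sketched above, just to show the schedule.
    queries = layer(queries.unsqueeze(1), feats.unsqueeze(1)).squeeze(1)
```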