Idea
Based on MaskFormer,
- Apply masked attention in the decoder
- Use multi-scale high-resolution features
Architecture
Differences from MaskFormer:
Masked Attention
Constrain cross-attention to the foreground region of each query's predicted mask, rather than attending to the whole image
From (here $l$ is the layer index and $X_l \in \mathbb{R}^{N \times C}$ denotes the $N$ $C$-dimensional query features at the $l$-th layer)

$$X_l = \mathrm{softmax}(Q_l K_l^\top) V_l + X_{l-1}$$

to

$$X_l = \mathrm{softmax}(\mathcal{M}_{l-1} + Q_l K_l^\top) V_l + X_{l-1}$$

where

$$\mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } M_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise} \end{cases}$$

here $M_{l-1} \in \{0, 1\}^{N \times H_l W_l}$ is the binarized output (with threshold at $0.5$) of the resized mask prediction of the previous, $(l-1)$-th, Transformer decoder layer
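A minimal PyTorch sketch (not the authors' implementation) of one masked cross-attention update, assuming `Q`, `K`, `V` are already projected and `mask_logits` is the resized mask prediction from the previous layer; the scaling factor and the empty-mask fallback are practical additions not spelled out in the formula above.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(X_prev, Q, K, V, mask_logits):
    """One masked cross-attention step (single head, no batch dim).

    X_prev:      (N, C)   query features X_{l-1} from the previous layer
    Q:           (N, C)   projected queries
    K, V:        (HW, C)  projected image features at the current resolution
    mask_logits: (N, HW)  resized mask prediction of the previous layer
    """
    # Binarize the previous mask prediction with threshold 0.5
    M = mask_logits.sigmoid() > 0.5                        # (N, HW) boolean
    # Attention mask: 0 inside the predicted mask, -inf outside
    attn_mask = torch.zeros_like(mask_logits)
    attn_mask.masked_fill_(~M, float("-inf"))
    # Practical fallback (assumption): if a mask is empty, attend everywhere
    attn_mask = attn_mask.masked_fill(M.sum(-1, keepdim=True) == 0, 0.0)

    scores = Q @ K.transpose(-1, -2) / Q.shape[-1] ** 0.5  # (N, HW)
    attn = F.softmax(scores + attn_mask, dim=-1)
    return attn @ V + X_prev                               # residual: + X_{l-1}
```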
High-resolution features
Instead of always using the high-resolution feature map, use a feature pyramid consisting of both low- and high-resolution features, and feed one resolution of the multi-scale features to one Transformer decoder layer at a time, in a round-robin fashion.
Specifically, use the feature maps produced by the pixel decoder with resolutions $1/32$, $1/16$, and $1/8$ of the original image, as sketched below.
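A rough sketch of the round-robin scheme, assuming a list of pixel-decoder feature maps ordered [1/32, 1/16, 1/8] and a hypothetical decoder-layer interface that takes the query features, one feature scale, and the previous layer's mask prediction:

```python
from itertools import cycle

def run_decoder(decoder_layers, queries, multi_scale_feats):
    """Feed one pyramid scale to one Transformer decoder layer at a time.

    decoder_layers:    list of decoder layers (hypothetical interface below)
    queries:           (N, C) initial query features
    multi_scale_feats: pixel-decoder outputs at 1/32, 1/16, 1/8 resolution
    """
    feats = cycle(multi_scale_feats)   # 1/32 -> 1/16 -> 1/8 -> 1/32 -> ...
    mask_pred = None                   # no mask yet before the first layer
    for layer in decoder_layers:
        feat = next(feats)
        # each layer runs masked cross-attention against one resolution,
        # using the mask predicted by the previous layer (resized to feat)
        queries, mask_pred = layer(queries, feat, prev_mask=mask_pred)
    return queries, mask_pred
```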