Idea

Instead of per-pixel classification, we could first predict a set of binary masks and then assign a single class to each mask.

Each prediction is supervised with a per-pixel binary mask loss and a classification loss.
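The supervision of a single prediction can be sketched as below. This is only an illustration under stated assumptions: the mask loss is taken to be plain binary cross-entropy, the prediction is assumed to be already paired with its target, and the mask term is skipped when the target is the no-object label. The function names and the `mask_weight` parameter are hypothetical, not taken from these notes.

```python
import numpy as np

def binary_mask_loss(mask_logits, target_mask):
    """Per-pixel binary cross-entropy averaged over pixels (assumed loss choice)."""
    x, t = mask_logits, target_mask
    # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|))
    return np.mean(np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x))))

def classification_loss(class_logits, target_class):
    """Cross-entropy over the K + 1 labels of one prediction."""
    shifted = class_logits - class_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_class]

def prediction_loss(class_logits, mask_logits, target_class, target_mask,
                    no_object, mask_weight=1.0):
    """Classification loss plus (for real segments only) the binary mask loss."""
    loss = classification_loss(class_logits, target_class)
    # Assumption: a prediction matched to "no object" has no mask to supervise.
    if target_class != no_object:
        loss += mask_weight * binary_mask_loss(mask_logits, target_mask)
    return loss
```

With $K = 3$ classes, the no-object label would be index 3 in this sketch, so a call looks like `prediction_loss(class_logits, mask_logits, 1, target_mask, no_object=3)`.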

Architecture

Use a backbone to extract image features $F$.
Upsample the features to obtain per-pixel embeddings $E_{pixel}$.
A transformer decoder attends to the image features and produces $N$ per-segment embeddings $Q$; these embeddings then independently generate:
- $N$ class predictions of shape $N \times (K + 1)$, where the extra label $\emptyset$ represents the "no object" class
- $N$ corresponding mask embeddings $E_{mask}$
Obtain $N$ possibly overlapping binary mask predictions via a dot product between the pixel embeddings $E_{pixel}$ and the mask embeddings $E_{mask}$, followed by a sigmoid activation.
Finally, we get the final prediction by combining the $N$ binary masks with their class predictions using a simple matrix multiplication.
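The last two steps can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the embeddings and class logits are random stand-ins for the decoder outputs, the embedding dimension `C` and all function names are hypothetical, and the no-object label is assumed to be the last column and to be dropped before the per-pixel argmax.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_masks(e_mask, e_pixel):
    """e_mask: (N, C), e_pixel: (C, H, W) -> mask probabilities (N, H, W)."""
    # Dot product over the channel axis between each mask embedding and each pixel.
    return sigmoid(np.einsum("nc,chw->nhw", e_mask, e_pixel))

def combine(class_logits, masks):
    """class_logits: (N, K + 1), masks: (N, H, W) -> per-pixel scores (K, H, W)."""
    # Assumption: the no-object label is the last column and is excluded here.
    class_probs = softmax(class_logits, axis=-1)[:, :-1]      # (N, K)
    n, h, w = masks.shape
    scores = class_probs.T @ masks.reshape(n, h * w)          # (K, N) @ (N, H*W)
    return scores.reshape(-1, h, w)

# Random stand-ins for the decoder and pixel-decoder outputs.
rng = np.random.default_rng(0)
N, K, C, H, W = 5, 3, 8, 4, 6
e_mask = rng.normal(size=(N, C))
e_pixel = rng.normal(size=(C, H, W))
class_logits = rng.normal(size=(N, K + 1))

masks = predict_masks(e_mask, e_pixel)    # (N, H, W)
scores = combine(class_logits, masks)     # (K, H, W)
labels = scores.argmax(axis=0)            # (H, W) class index per pixel
```

Flattening the spatial axes turns the combination into a single $(K \times N)$ by $(N \times HW)$ matrix product, which is why the final step needs nothing more than a matrix multiplication.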