Idea

Instead for pre-pixel classification, we could first predict a set of binary masks and then assign a single class to each mask.

Each prediction is supervised with a per-pixel binary mask loss and a classification loss

Architecture

  1. Using a backbone to extract images features
  2. Upsample the features to obtain per-pixel embeddings
  3. A transformer decoder attends to image features and produces per-segment embeddings , then embeddings then independently generate
    • class predictions (shape ), where is introduced to represent "no-object" class
    • corresponding mask embeddings
  4. Predict possibly overlapping binary mask predictions via a dot product between pixel embeddings and mask embeddings followed by a sigmoid activation.
  5. Finally we can get the prediction by combining binary masks with their class predictions using a simple matrix multiplication.