Idea

Propose a universal model for semantic, instance, and panoptic segmentation (see the differences between the tasks), which only needs to be trained once.

  • Introduce a task token to condition the model on the task at hand
  • Use a query-text contrastive loss

Architecture

Multi-Scale Feature Modeling

Extract multi-scale features from an input image using a backbone, followed by a pixel decoder that translates the high-level features into pixel-level representations.
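A toy numpy sketch of this stage (the pooling-based "backbone" and summation-based "decoder" are hypothetical stand-ins, not the paper's actual modules):

```python
import numpy as np

def fake_backbone(image):
    """Stand-in for a real backbone: produce feature maps at strides
    4, 8, 16 and 32 via simple average pooling (hypothetical)."""
    feats = []
    for stride in (4, 8, 16, 32):
        h, w = image.shape[0] // stride, image.shape[1] // stride
        x = image[: h * stride, : w * stride]
        feats.append(x.reshape(h, stride, w, stride, -1).mean(axis=(1, 3)))
    return feats

def pixel_decoder(feats):
    """Stand-in pixel decoder: upsample every scale to the 1/4-scale
    grid (nearest neighbour) and fuse by summation."""
    target_h, target_w, _ = feats[0].shape
    fused = np.zeros_like(feats[0])
    for f in feats:
        rep_h = target_h // f.shape[0]
        rep_w = target_w // f.shape[1]
        up = np.repeat(np.repeat(f, rep_h, axis=0), rep_w, axis=1)
        fused += up[:target_h, :target_w]
    return fused

image = np.random.rand(64, 64, 3)
multi_scale = fake_backbone(image)            # features at strides 4/8/16/32
pixel_features = pixel_decoder(multi_scale)   # fused 1/4-scale feature map
```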

Unified Task-Conditional Query Formulation

Task Input Token

Uniformly sample {task} from {panoptic, instance, semantic} to form the task input token

At the same time, sample the ground-truth labels for the chosen task to derive the text queries

Text Query (Q_text)

First iterate over the set of ground-truth masks to create a list of text (T_list) with the template "a photo with a {CLS}", where CLS is the class name of the corresponding mask. Then pad T_list with "a/an {task} photo" entries to obtain a padded list (T_pad) of constant length N_text, with the padded entries representing no-object masks. To obtain the text queries Q_text, first tokenize the entries of T_pad and pass the tokenized representation through a text encoder, which is a 6-layer transformer. The N_text encoded text embeddings represent the binary masks and their corresponding classes in the input image. A set of N_ctx learnable text context embeddings (Q_ctx) is then concatenated to the encoded text embeddings to obtain the final N text queries, Q_text. The goal of Q_ctx is to provide a unified textual context that captures the information relevant to the task at hand.
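The list construction and padding step can be sketched in a few lines of Python (the constant length and the exact padding string are assumptions following the description above):

```python
# Constant length of the padded text list (a hyperparameter; value is illustrative).
N_TEXT = 8

def build_text_list(class_names, task):
    """Build the padded text list: one templated entry per ground-truth
    mask, padded with no-object entries up to N_TEXT."""
    # One entry per mask, using the "a photo with a {CLS}" template.
    t_list = [f"a photo with a {cls}" for cls in class_names]
    # Padded entries stand in for no-object masks (assumed padding template).
    pad_entry = f"a/an {task} photo"
    return t_list + [pad_entry] * (N_TEXT - len(t_list))

t_pad = build_text_list(["car", "person", "road"], task="panoptic")
```

Each entry of `t_pad` would then be tokenized and passed through the text encoder to produce one text embedding per (real or no-object) mask.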

Object Query (Q)

First initialize the object queries (Q') as N−1 repetitions of the task token Q_task. Then update Q' with guidance from the flattened 1/4-scale features inside a 2-layer transformer. The updated Q' from the transformer (rich with image-contextual information) is concatenated with Q_task to obtain a task-conditioned representation of the N queries, Q.
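A minimal numpy sketch of the initialization and concatenation, where a single softmax-weighted sum stands in (hypothetically) for the 2-layer transformer update:

```python
import numpy as np

N, D = 6, 4  # number of queries and embedding dimension (toy values)
rng = np.random.default_rng(0)

q_task = rng.normal(size=(1, D))             # task token from the task input text
q_prime = np.repeat(q_task, N - 1, axis=0)   # initialize N-1 object queries

# Stand-in for the 2-layer transformer: inject flattened 1/4-scale image
# features via one cross-attention-like weighted sum (hypothetical).
feats = rng.normal(size=(16, D))             # flattened image features
attn = np.exp(q_prime @ feats.T)
attn /= attn.sum(axis=1, keepdims=True)      # softmax over feature positions
q_prime = q_prime + attn @ feats             # queries now carry image context

Q = np.concatenate([q_task, q_prime], axis=0)  # task-conditioned N queries
```

Note that all N−1 object queries start identical (copies of the task token); they only differentiate through attention over the image features.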

Query-Text Contrastive Loss

Consider a batch of B object-text query pairs {(q_i^obj, q_i^txt)}, i = 1..B, where q_i^obj and q_i^txt are the object and text queries of the i-th pair. We measure the similarity between the queries with a dot product and apply a symmetric contrastive loss:

L_{Q→Q_text} = −(1/B) Σ_{i=1}^{B} log [ exp(q_i^obj · q_i^txt / τ) / Σ_{j=1}^{B} exp(q_i^obj · q_j^txt / τ) ]

L_{Q_text→Q} = −(1/B) Σ_{i=1}^{B} log [ exp(q_i^txt · q_i^obj / τ) / Σ_{j=1}^{B} exp(q_i^txt · q_j^obj / τ) ]

L_contrastive = L_{Q→Q_text} + L_{Q_text→Q}

Here τ is a learnable temperature parameter that scales the contrastive logits.