Idea
Propose a universal model for semantic, instance, and panoptic segmentation (see the differences between the three tasks) that requires only a single, once-and-for-all training run.
- Introduce a task token to condition the model on the task at hand
- Use a query-text contrastive loss
Architecture
Multi-Scale Feature Modeling
Extract multi-scale features from the input image using a backbone, followed by a pixel decoder that translates the high-level backbone features into high-resolution, pixel-level feature maps.
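Below is a minimal sketch of this stage, assuming a torchvision ResNet-50 backbone and a simple FPN-style pixel decoder with top-down fusion; the backbone choice, channel widths, and decoder design here are illustrative stand-ins, not the paper's exact architecture.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor


class MultiScaleFeatures(nn.Module):
    """Backbone + simple FPN-style pixel decoder (illustrative stand-in)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Tap the four ResNet stages (strides 4, 8, 16, 32).
        self.backbone = create_feature_extractor(
            resnet50(weights=None),
            return_nodes={"layer1": "res2", "layer2": "res3",
                          "layer3": "res4", "layer4": "res5"},
        )
        chans = {"res2": 256, "res3": 512, "res4": 1024, "res5": 2048}
        self.lateral = nn.ModuleDict(
            {name: nn.Conv2d(c, d_model, kernel_size=1) for name, c in chans.items()}
        )

    def forward(self, image: torch.Tensor) -> dict:
        feats = self.backbone(image)          # multi-scale backbone features
        out, prev = {}, None
        # "Pixel decoder": project each scale to d_model and fuse top-down.
        for name in ["res5", "res4", "res3", "res2"]:
            x = self.lateral[name](feats[name])
            if prev is not None:              # upsample the coarser map and add
                x = x + F.interpolate(prev, size=x.shape[-2:],
                                      mode="bilinear", align_corners=False)
            out[name], prev = x, x
        return out                            # {scale: (B, d_model, H, W)}
```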
Unified Task-Conditional Query Formulation
Task Input Token
Uniformly sample {task} from {panoptic, instance, semantic} to form the task input token $Q_{task}$ (sketched below). At the same time, sample the corresponding ground truth for the chosen task to generate the text queries $Q_{text}$.
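A minimal sketch of the task sampling and task-token creation, assuming a $d$-dimensional token; in the paper the task token is derived from a task text prompt, while this sketch uses one learnable embedding per task for brevity.

```python
import random
import torch
from torch import nn

TASKS = ("panoptic", "instance", "semantic")


class TaskToken(nn.Module):
    """Maps the sampled task to a single d-dimensional task token Q_task."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # One learnable embedding per task (illustrative simplification of
        # tokenizing a task text prompt and projecting it to d_model).
        self.embed = nn.Embedding(len(TASKS), d_model)

    def forward(self, task: str) -> torch.Tensor:
        idx = torch.tensor(TASKS.index(task))
        return self.embed(idx)                # Q_task: (d_model,)


# During training, uniformly sample the task for each training sample.
task = random.choice(TASKS)
q_task = TaskToken()(task)
```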
Text Query ($Q_{text}$)
As shown below, first iterate over the set of ground-truth masks to create a list of text entries ($T_{list}$) with the template "a photo with a {CLS}", where {CLS} is the class name of the corresponding mask. Then pad $T_{list}$ with "a/an {task}" entries to obtain a padded list ($T_{pad}$) of constant length $N_{text}$, with the padded entries representing no-object masks.
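A sketch of the list construction, following the templates above; the class names, the a/an handling, and the $N_{text}$ value are illustrative.

```python
from typing import List


def build_text_list(class_names: List[str], task: str, n_text: int) -> List[str]:
    """Create T_list from the ground-truth mask classes, then pad it to T_pad."""
    # One entry per ground-truth binary mask: "a photo with a {CLS}".
    t_list = [f"a photo with a {cls}" for cls in class_names]
    # Pad with "a/an {task}" entries up to the constant length N_text;
    # the padded entries represent no-object masks.
    article = "an" if task[0] in "aeiou" else "a"
    t_pad = t_list + [f"{article} {task}"] * (n_text - len(t_list))
    return t_pad


# Example: two masks in a sample, padded to N_text = 6 (illustrative).
print(build_text_list(["car", "person"], task="instance", n_text=6))
# ['a photo with a car', 'a photo with a person', 'an instance', ...]
```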
To obtain the text queries $Q_{text}$, we first tokenize the text entries in $T_{pad}$ and pass the tokenized representation through a text encoder, which is a 6-layer transformer. The encoded text embeddings represent the number of binary masks and their corresponding classes in the input image. We further concatenate a set of $N_{ctx}$ learnable text context embeddings $Q_{ctx}$ to the encoded text embeddings to obtain the final $N$ text queries $Q_{text}$.
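A sketch of the text-query encoder, assuming token IDs are produced by an external tokenizer; the vocabulary size, number of attention heads, mean pooling over tokens, and $N_{ctx}$ value are illustrative assumptions.

```python
import torch
from torch import nn


class TextQueryEncoder(nn.Module):
    """Encodes the padded text list T_pad into text queries Q_text (sketch)."""

    def __init__(self, vocab_size: int = 49408, d_model: int = 256,
                 n_layers: int = 6, n_ctx: int = 16):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learnable text context embeddings Q_ctx, concatenated to the
        # encoded text embeddings to form the final text queries.
        self.q_ctx = nn.Parameter(torch.randn(n_ctx, d_model))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (N_text, L) integer IDs from an external tokenizer.
        x = self.encoder(self.tok_embed(token_ids))    # (N_text, L, d_model)
        text_emb = x.mean(dim=1)                       # one embedding per text entry
        return torch.cat([text_emb, self.q_ctx], 0)    # Q_text: (N_text + N_ctx, d_model)
```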
The goal of $Q_{ctx}$ is to provide a unified textual context that captures the relevant information necessary for the different image segmentation tasks.
Object Query ($Q$)
First initialize the object queries ($Q'$) as $N-1$ repetitions of the task token $Q_{task}$. Then update $Q'$ with guidance from the flattened $1/4$-scale features inside a 2-layer transformer. The updated $Q'$ from the transformer (rich with image-contextual information) is concatenated with $Q_{task}$ to obtain a task-conditioned representation of $N$ queries, $Q$.
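A sketch of the object-query formation under these definitions; the number of queries, the head count, and the use of nn.TransformerDecoder layers for the 2-layer update are assumptions for illustration.

```python
import torch
from torch import nn


class ObjectQueryFormer(nn.Module):
    """Forms the task-conditioned object queries Q from Q_task and image features."""

    def __init__(self, d_model: int = 256, n_queries: int = 150, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.n_queries = n_queries

    def forward(self, q_task: torch.Tensor, feat_quarter: torch.Tensor) -> torch.Tensor:
        # q_task: (B, d_model); feat_quarter: (B, d_model, H/4, W/4) from the pixel decoder.
        q_prime = q_task.unsqueeze(1).repeat(1, self.n_queries - 1, 1)  # N-1 copies of Q_task
        memory = feat_quarter.flatten(2).transpose(1, 2)                # flattened 1/4-scale features
        q_prime = self.decoder(q_prime, memory)                         # update Q' against the image
        # Concatenate Q_task to obtain the N task-conditioned queries Q.
        return torch.cat([q_task.unsqueeze(1), q_prime], dim=1)         # (B, N, d_model)
```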
Query-Text Contrastive Loss
Consider a batch of $B$ object-text query pairs $\{(q^{obj}_i, q^{txt}_i)\}_{i=1}^{B}$, where $q^{obj}_i$ and $q^{txt}_i$ are the corresponding object and text queries, respectively, of the $i$-th pair. We measure the similarity between the queries with a dot product and compute a symmetric contrastive loss over the batch:

$$
\mathcal{L}_{Q \rightarrow Q_{text}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(q^{obj}_i \cdot q^{txt}_i / \tau)}{\sum_{j=1}^{B} \exp(q^{obj}_i \cdot q^{txt}_j / \tau)}, \qquad
\mathcal{L}_{Q_{text} \rightarrow Q} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(q^{txt}_i \cdot q^{obj}_i / \tau)}{\sum_{j=1}^{B} \exp(q^{txt}_i \cdot q^{obj}_j / \tau)}
$$

$$
\mathcal{L}_{Q \leftrightarrow Q_{text}} = \mathcal{L}_{Q \rightarrow Q_{text}} + \mathcal{L}_{Q_{text} \rightarrow Q}
$$

Here $\tau$ is a learnable temperature parameter used to scale the contrastive logits.
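A sketch of this loss in PyTorch; the L2 normalization of the queries, the log-space parameterization of the temperature, and its initial value are assumptions rather than details stated above.

```python
import torch
import torch.nn.functional as F
from torch import nn


class QueryTextContrastiveLoss(nn.Module):
    """Symmetric contrastive loss between object and text queries (sketch)."""

    def __init__(self, init_tau: float = 0.07):
        super().__init__()
        # Learnable temperature tau, parameterized in log space for stability.
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, q_obj: torch.Tensor, q_txt: torch.Tensor) -> torch.Tensor:
        # q_obj, q_txt: (B, d_model) -- the i-th rows form a positive pair.
        q_obj = F.normalize(q_obj, dim=-1)
        q_txt = F.normalize(q_txt, dim=-1)
        logits = q_obj @ q_txt.t() / self.log_tau.exp()   # (B, B) dot-product similarities
        targets = torch.arange(q_obj.size(0), device=q_obj.device)
        # Object-to-text and text-to-object cross-entropy terms; the matched
        # pairs on the diagonal are the positives.
        loss_o2t = F.cross_entropy(logits, targets)
        loss_t2o = F.cross_entropy(logits.t(), targets)
        return loss_o2t + loss_t2o
```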