Motivation

  • Previous object detection models tackle the task indirectly, by defining surrogate regression and classification problems on a large set of proposals, anchors, or window centers. This requires many hand-designed components such as anchor generation and non-maximum suppression (NMS).
  • To simplify these pipelines, the authors propose an end-to-end approach that bypasses the surrogate tasks.

General Idea

View object detection as a direct set prediction problem, that is, directly predict the final set of boxes, without redundant bounding boxes that would need post-processing such as NMS.

Set Prediction Loss

Bipartite matching

Let the ground truth set of objects be denoted by $y$, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, we consider $y$ also as a set of size $N$ padded with $\varnothing$ (no object). To find a bipartite matching (see matching in bipartite graphs) between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)$$

where $\mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)$ is a pair-wise matching cost between the ground truth $y_i$ and the prediction with index $\sigma(i)$.

Each element $i$ of the ground truth set can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector that defines the ground truth box center coordinates and its height and width relative to the image size. If $\hat{p}_{\sigma(i)}(c_i)$ is the predicted probability of class $c_i$ for the prediction with index $\sigma(i)$, and $\hat{b}_{\sigma(i)}$ its predicted box, we define

$$\mathcal{L}_{\text{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\sigma(i)}\right)$$

where

$$\mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{\text{iou}}\, \mathcal{L}_{\text{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) + \lambda_{\text{L1}}\, \left\| b_i - \hat{b}_{\sigma(i)} \right\|_1$$

with hyperparameters $\lambda_{\text{iou}}, \lambda_{\text{L1}} \in \mathbb{R}$, and the IoU loss is the generalized IoU (GIoU) loss

$$\mathcal{L}_{\text{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) = 1 - \left( \frac{\left|b_i \cap \hat{b}_{\sigma(i)}\right|}{\left|b_i \cup \hat{b}_{\sigma(i)}\right|} - \frac{\left|B\left(b_i, \hat{b}_{\sigma(i)}\right) \setminus \left(b_i \cup \hat{b}_{\sigma(i)}\right)\right|}{\left|B\left(b_i, \hat{b}_{\sigma(i)}\right)\right|} \right)$$

where $|\cdot|$ denotes area and $B\left(b_i, \hat{b}_{\sigma(i)}\right)$ is the smallest box containing both $b_i$ and $\hat{b}_{\sigma(i)}$.
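
A minimal sketch of this matching step for a single image, using SciPy's linear_sum_assignment as the Hungarian solver (as the official DETR implementation does). The helper names, the box format (normalized $(c_x, c_y, w, h)$), and the cost weights are assumptions for illustration, roughly following the paper's defaults ($\lambda_{\text{L1}} = 5$, $\lambda_{\text{iou}} = 2$):

```python
import torch
from scipy.optimize import linear_sum_assignment


def box_cxcywh_to_xyxy(b):
    # convert (center_x, center_y, w, h) boxes to (x0, y0, x1, y1)
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def generalized_iou(boxes1, boxes2):
    # pairwise GIoU matrix between two sets of (x0, y0, x1, y1) boxes
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = area1[:, None] + area2[None, :] - inter
    iou = inter / union
    # B: the smallest box enclosing both boxes
    lt_b = torch.min(boxes1[:, None, :2], boxes2[None, :, :2])
    rb_b = torch.max(boxes1[:, None, 2:], boxes2[None, :, 2:])
    area_b = (rb_b - lt_b).clamp(min=0).prod(-1)
    return iou - (area_b - union) / area_b


@torch.no_grad()
def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                    lambda_iou=2.0, lambda_l1=5.0):
    # pred_logits: (N, num_classes + 1), pred_boxes: (N, 4) in cxcywh format
    # tgt_labels: (M,), tgt_boxes: (M, 4), with N >= M
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, tgt_labels]                  # (N, M)
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (N, M)
    cost_giou = -generalized_iou(box_cxcywh_to_xyxy(pred_boxes),
                                 box_cxcywh_to_xyxy(tgt_boxes))
    cost = cost_class + lambda_l1 * cost_l1 + lambda_iou * cost_giou
    pred_idx, tgt_idx = linear_sum_assignment(cost.cpu().numpy())
    return pred_idx, tgt_idx  # matched prediction / ground-truth indices
```

Note that, as in the paper, the class term of the matching cost uses the probability $\hat{p}_{\sigma(i)}(c_i)$ directly rather than its log, and the assignment step itself is not differentiated through.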

Hungarian Loss

In the previous section we found the permutation $\hat{\sigma}$ that best matches the set of predictions to the set of ground truth objects; given this matching, we then compute the loss function, the Hungarian loss:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right]$$

In practice the log-probability term is down-weighted by a factor 10 when $c_i = \varnothing$ to account for class imbalance.
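
Continuing the sketch above, the per-image loss given the matching might look as follows (hypothetical helper names from the previous block; the no-object down-weighting and the paper's normalization by the number of objects are omitted for brevity):

```python
import torch
import torch.nn.functional as F


def hungarian_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                   pred_idx, tgt_idx, lambda_iou=2.0, lambda_l1=5.0):
    # pred_idx / tgt_idx: the assignment returned by hungarian_match above
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    tgt_idx = torch.as_tensor(tgt_idx, dtype=torch.long)

    # classification: every prediction gets a target class; unmatched
    # predictions must predict the extra "no object" class (last index)
    N, num_logits = pred_logits.shape
    target_classes = torch.full((N,), num_logits - 1, dtype=torch.long)
    target_classes[pred_idx] = tgt_labels[tgt_idx]
    loss_class = F.cross_entropy(pred_logits, target_classes)

    # box losses apply only to the matched (non-empty) pairs
    matched = pred_boxes[pred_idx]
    target = tgt_boxes[tgt_idx]
    loss_l1 = F.l1_loss(matched, target, reduction="mean")
    giou = generalized_iou(box_cxcywh_to_xyxy(matched),
                           box_cxcywh_to_xyxy(target)).diag()
    loss_giou = (1.0 - giou).mean()

    return loss_class + lambda_l1 * loss_l1 + lambda_iou * loss_giou
```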

Method

Backbone

Starting from the initial image $x_{\text{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$, a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. The typical values are $C = 2048$ and $H = \frac{H_0}{32}$, $W = \frac{W_0}{32}$.
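
For instance, with a ResNet-50 from torchvision (one of the backbones used in the paper; the input size here is arbitrary), the shapes work out as follows:

```python
import torch
import torchvision

# ResNet-50 with its classification head (avgpool + fc) stripped off,
# so it acts as a feature extractor
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x_img = torch.randn(1, 3, 640, 640)  # (B, 3, H0, W0)
f = backbone(x_img)
print(f.shape)  # torch.Size([1, 2048, 20, 20]): C = 2048, H = H0/32, W = W0/32
```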

Transformer Encoder

  1. A $1 \times 1$ convolution reduces the channel dimension of the high-level activation map from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$.
  2. Collapse the spatial dimensions of $z_0$ to $d \times HW$, add positional encodings, and feed the resulting sequence into the transformer encoder (see the sketch after this list).
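
A sketch of these two steps, assuming the paper's default $d = 256$ with 8 heads and 6 layers; the positional encoding here is a random stand-in for DETR's fixed sine encodings (which the paper actually adds at every attention layer, not only at the input):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 20, 20

input_proj = nn.Conv2d(C, d, kernel_size=1)  # 1x1 conv: C -> d
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

f = torch.randn(1, C, H, W)            # backbone activation map
z0 = input_proj(f)                     # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)   # (HW, 1, d): a sequence of HW tokens
pos = torch.randn(H * W, 1, d)         # stand-in for fixed positional encodings
memory = encoder(src + pos)            # (HW, 1, d)
```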

Transformer Decoder

The input of the decoder is a sequence of $N$ learnable positional encodings called object queries, which the decoder transforms into $N$ output embeddings in parallel. The queries can be seen as something like learned anchor boxes: they give the model some prior knowledge about where and how to place boxes, but without human design.
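
A simplified sketch using PyTorch's built-in decoder, assuming the paper's defaults of $N = 100$ queries and 6 layers (DETR itself re-adds the queries at each attention layer; here they are simply added to the input):

```python
import torch
import torch.nn as nn

d, num_queries = 256, 100

# object queries: one learned embedding per potential detection
query_embed = nn.Embedding(num_queries, d)

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(400, 1, d)             # encoder output (HW, B, d)
tgt = torch.zeros(num_queries, 1, d)        # decoder input starts from zeros
queries = query_embed.weight.unsqueeze(1)   # (N, 1, d)
hs = decoder(tgt + queries, memory)         # (N, 1, d): one embedding per query
```

Unlike a language-model decoder, all $N$ objects are decoded in parallel rather than autoregressively.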

Prediction Feed-Forward Networks

Take in the output embeddings of the transformer decoder and obtain the $N$ final predictions mentioned above: a 3-layer FFN with ReLU activations predicts the normalized box center coordinates, height, and width, and a linear layer with a softmax predicts the class label, including the $\varnothing$ (no object) class.
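
A sketch of the two heads, assuming COCO's 91 classes as an example; the sigmoid keeps the box coordinates normalized to $[0, 1]$ relative to the image:

```python
import torch
import torch.nn as nn

d, num_classes, num_queries = 256, 91, 100

# linear head: num_classes real classes plus one extra "no object" class
class_head = nn.Linear(d, num_classes + 1)

# 3-layer perceptron with ReLU for the box (cx, cy, w, h)
box_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)

hs = torch.randn(num_queries, 1, d)   # decoder output embeddings
pred_logits = class_head(hs)          # (N, 1, num_classes + 1)
pred_boxes = box_head(hs).sigmoid()   # (N, 1, 4), normalized to [0, 1]
```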