Idea

Use a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation.

Method

Divide the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts $B$ bounding boxes and a confidence score for each box. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is.
Each grid cell also predicts $C$ conditional class probabilities. These probabilities are conditioned on the grid cell containing an object. The model predicts only one set of class probabilities per grid cell, regardless of the number of boxes $B$.
Therefore, the predictions are encoded as an $S \times S \times (5 B + C)$ tensor, since each box carries five values: $x$, $y$, $w$, $h$, and its confidence.
At test time, each grid cell therefore yields box positions together with a score for every class. The boxes for a given class are obtained by collecting, across all grid cells, the predicted boxes whose score for that class is relatively high.
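The encoding and selection steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the per-cell layout (`B` boxes of `[x, y, w, h, confidence]` followed by `C` class probabilities), the function names `responsible_cell`, `class_scores`, and `boxes_for_class`, and the threshold value are all assumptions made for the example. The class-specific score of a box is taken as its confidence multiplied by the cell's conditional class probability.

```python
def responsible_cell(cx, cy, S):
    """Return (row, col) of the grid cell that contains a box center.

    cx, cy are center coordinates normalized to [0, 1] over the whole image.
    """
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * S), S - 1)
    return row, col


def class_scores(cell, B, C):
    """Split one cell's prediction vector into (box, per-class scores) pairs.

    Assumed layout: B boxes of [x, y, w, h, confidence], then C class probabilities.
    """
    assert len(cell) == 5 * B + C
    class_probs = cell[5 * B:]
    results = []
    for b in range(B):
        x, y, w, h, conf = cell[5 * b:5 * b + 5]
        # Class-specific score: box confidence times conditional class probability.
        scores = [conf * p for p in class_probs]
        results.append(((x, y, w, h), scores))
    return results


def boxes_for_class(pred, S, B, C, cls, threshold=0.2):
    """Collect boxes from all cells whose score for class `cls` exceeds `threshold`.

    pred is a nested list of shape S x S x (5B + C).
    """
    selected = []
    for row in range(S):
        for col in range(S):
            for box, scores in class_scores(pred[row][col], B, C):
                if scores[cls] > threshold:
                    selected.append((row, col, box, scores[cls]))
    return selected


# Toy example: a 2 x 2 grid, one box per cell, two classes.
S, B, C = 2, 1, 2
pred = [[[0.0] * (5 * B + C) for _ in range(S)] for _ in range(S)]
pred[0][1] = [0.5, 0.5, 0.3, 0.4, 0.9, 0.1, 0.8]  # confident box, mostly class 1
hits = boxes_for_class(pred, S, B, C, cls=1)
```

With the settings used in the YOLO paper for PASCAL VOC ($S = 7$, $B = 2$, $C = 20$), the same layout gives a $7 \times 7 \times 30$ prediction tensor.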

Architecture

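The first $20$ convolutional layers are pre-trained on ImageNet; four more convolutional layers and two fully connected layers are then added for detection.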

Loss Function

$$
\begin{aligned}
L ={} & \lambda_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_{i} - \hat{x}_{i})^{2} + (y_{i} - \hat{y}_{i})^{2} \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (w_{i} - \hat{w}_{i})^{2} + (h_{i} - \hat{h}_{i})^{2} \right] \\
& + \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_{i} - \hat{C}_{i})^{2} \\
& + \lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_{i} - \hat{C}_{i})^{2} \\
& + \sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{\text{obj}} (p_{i}(c) - \hat{p}_{i}(c))^{2}
\end{aligned}
$$

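where the $(x, y)$ coordinates give the center of the box relative to the bounds of the grid cell, and the width $w$ and height $h$ are predicted relative to the whole image. $C_{i}$ is the confidence of a box in cell $i$ (not the number of classes $C$ used above), and $p_{i}(c)$ is the conditional probability of class $c$ in cell $i$; the last term is taken over every class $c$. The indicator $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object appears in cell $i$, and $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when the $j$-th box predictor in cell $i$ is responsible for that prediction.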
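To make the five terms concrete, the loss can be sketched as a plain Python function. This is only an illustration under stated assumptions: the data layout (a list of per-cell dictionaries), the name `yolo_loss`, and the treatment of $\mathbb{1}_{ij}^{\text{noobj}}$ as the complement of $\mathbb{1}_{ij}^{\text{obj}}$ are choices made for the example, and the default weights $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are the values reported in the YOLO paper rather than anything stated in this note.

```python
def yolo_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss over grid cells, following the formula above.

    pred:   list of cells, each {"boxes": [(x, y, w, h, conf), ...], "probs": [...]}
    target: list of cells, each None (no object in the cell) or
            {"box": (x, y, w, h), "conf": float, "probs": [...], "responsible": j}
    """
    loss = 0.0
    for p_cell, t_cell in zip(pred, target):
        for j, (x, y, w, h, conf) in enumerate(p_cell["boxes"]):
            # Indicator 1_ij^obj: box j of this cell is responsible for the object.
            is_obj = t_cell is not None and t_cell["responsible"] == j
            if is_obj:
                tx, ty, tw, th = t_cell["box"]
                loss += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)  # center term
                loss += lambda_coord * ((w - tw) ** 2 + (h - th) ** 2)  # size term
                loss += (conf - t_cell["conf"]) ** 2                    # confidence, object
            else:
                # Assumption: every non-responsible box has a target confidence of 0.
                loss += lambda_noobj * conf ** 2                        # confidence, no object
        if t_cell is not None:
            # Indicator 1_i^obj: class term, summed over all classes c.
            loss += sum((p - t) ** 2 for p, t in zip(p_cell["probs"], t_cell["probs"]))
    return loss


# One cell with an object (predicted perfectly) and one empty cell.
pred = [
    {"boxes": [(0.5, 0.5, 0.25, 0.25, 1.0), (0.0, 0.0, 0.0, 0.0, 0.0)], "probs": [1.0, 0.0]},
    {"boxes": [(0.0, 0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0, 0.0)], "probs": [0.5, 0.5]},
]
target = [
    {"box": (0.5, 0.5, 0.25, 0.25), "conf": 1.0, "probs": [1.0, 0.0], "responsible": 0},
    None,
]
perfect = yolo_loss(pred, target)
```

The sketch follows the size term as written in this note. The original YOLO paper computes that term on $\sqrt{w}$ and $\sqrt{h}$ instead, so that the same absolute error counts for more in a small box than in a large one.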

You Only Look Once: Unified, Real-Time Object Detection
