Motivation
An improvement of YOLO v1; the same framework is also trained jointly on detection and classification data so it can detect far more object categories (YOLO9000).
Better (Accuracy Improvement)
- Apply Batch Normalization
- Use a high-resolution classifier backbone (fine-tune the classification network at 448×448 before detection training), like SSD
- Use anchor boxes instead of directly predicting exact coordinates, as Faster R-CNN and SSD do
- Rather than hand-picking the dimensions (heights and widths) of the anchor boxes, the new architecture first collects the ground-truth box dimensions in the training data and then runs $k$-means to find the best prior dimensions (see the clustering sketch after this list)
- Predict the position of the box center using relative coordinates squashed by a logistic function (so that they fall in $(0, 1)$) instead of exact positions (see the decoding equations after this list)
- Get features for multi-scale objects via a passthrough layer that concatenates low-level features with high-level features (whereas Faster R-CNN and SSD run their detection heads on feature maps of different sizes); a passthrough sketch follows this list
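
Here is a minimal NumPy sketch of the anchor clustering above, assuming ground-truth boxes are given as (width, height) pairs and using the paper's distance metric $d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})$; the function names are mine, not the paper's:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, assuming all boxes share the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union  # shape (num_boxes, k)

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with d = 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # minimizing 1 - IoU is the same as maximizing IoU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

The paper settles on $k = 5$ as a good trade-off between recall and model complexity.

For reference, these are the paper's decoding equations: given the offsets $(c_x, c_y)$ of the grid cell from the image's top-left corner and the anchor prior dimensions $(p_w, p_h)$, the network's raw outputs $(t_x, t_y, t_w, t_h)$ become

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \qquad & b_w &= p_w e^{t_w}, \\
b_y &= \sigma(t_y) + c_y, \qquad & b_h &= p_h e^{t_h},
\end{aligned}
$$

where the logistic function $\sigma$ constrains the predicted center to stay inside its grid cell.

And here is a PyTorch sketch of the passthrough idea, assuming the standard YOLOv2 feature-map shapes (the function name is mine): it reorganizes each 2×2 spatial block of the fine feature map into channels (space-to-depth) so it can be concatenated with the coarse map.

```python
import torch

def passthrough(fine, coarse, stride=2):
    """Space-to-depth reorg of a high-resolution map, then channel-wise concat.
    fine: (N, C, s*H, s*W), coarse: (N, C2, H, W) -> (N, C*s*s + C2, H, W)."""
    n, c, h, w = fine.shape
    fine = fine.view(n, c, h // stride, stride, w // stride, stride)
    fine = fine.permute(0, 1, 3, 5, 2, 4).contiguous()
    fine = fine.view(n, c * stride * stride, h // stride, w // stride)
    return torch.cat([fine, coarse], dim=1)

# With the paper's shapes: 26x26x512 reorganizes into 13x13x2048,
# which is concatenated with the 13x13x1024 map -> 13x13x3072.
out = passthrough(torch.randn(1, 512, 26, 26), torch.randn(1, 1024, 13, 13))
assert out.shape == (1, 3072, 13, 13)
```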
Faster
- Use the customized backbone network Darknet-19: 19 convolutional layers (mostly 3×3, interleaved with 1×1 channel-compression layers) plus 5 max-pooling layers; a sketch follows
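
Here is a PyTorch sketch of the Darknet-19 classifier, following the layer table in the paper (the config encoding is my own shorthand):

```python
import torch.nn as nn

# 'M' marks a 2x2 maxpool; tuples are (kernel_size, out_channels).
# 3x3 convs double the channel count after each pool; 1x1 convs compress.
CFG = [(3, 32), 'M', (3, 64), 'M', (3, 128), (1, 64), (3, 128), 'M',
       (3, 256), (1, 128), (3, 256), 'M',
       (3, 512), (1, 256), (3, 512), (1, 256), (3, 512), 'M',
       (3, 1024), (1, 512), (3, 1024), (1, 512), (3, 1024)]

def darknet19(num_classes=1000):
    layers, in_ch = [], 3
    for item in CFG:
        if item == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            k, out_ch = item
            layers += [nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                       nn.BatchNorm2d(out_ch),  # every conv is batch-normalized
                       nn.LeakyReLU(0.1)]
            in_ch = out_ch
    # classification head: 1x1 conv to class scores + global average pooling
    layers += [nn.Conv2d(in_ch, num_classes, 1),
               nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)
```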
Stronger (More Categories Supported)
Since object detection datasets are far smaller than classification datasets, the authors propose a mechanism for jointly training on classification and detection data.
When the network sees an image labelled for detection, it back-propagates the full YOLOv2 loss function; when it sees a classification image, only the loss from the classification-specific parts of the architecture is back-propagated.
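
A simplified sketch of that dispatch, assuming hypothetical `detection_loss_fn` and `classification_loss_fn` callables that stand in for the paper's full YOLOv2 loss and its hierarchical (WordTree) classification loss:

```python
def joint_training_step(model, images, targets, is_detection, optimizer,
                        detection_loss_fn, classification_loss_fn):
    """One optimization step on a batch from either dataset (sketch only)."""
    preds = model(images)
    if is_detection:
        # detection image: back-propagate the full YOLOv2 loss
        loss = detection_loss_fn(preds, targets)
    else:
        # classification image: back-propagate only the classification loss
        loss = classification_loss_fn(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```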