Single-shot Models

YOLO vs SSD

Feature          YOLO                                             SSD
Speed            Up to 155 FPS                                    22-46 FPS
Accuracy         Generally lower; struggles with small objects    Higher; better with varying object sizes
Architecture     Fully connected layers, grid-based predictions   Convolutional layers, anchor boxes
Best Use Case    Real-time applications                           Applications needing a balance of accuracy and speed
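To make the "grid-based predictions" row concrete, here is a minimal sketch of decoding a YOLO-style output grid into boxes. The tensor layout (S x S cells, B boxes of (tx, ty, tw, th, conf) per cell) and all names are illustrative assumptions, not the exact YOLOv1 format.

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, img_size=448):
    """Decode a YOLO-style S x S x (B*5) grid of box predictions into
    absolute (x1, y1, x2, y2, confidence) boxes.

    Assumed layout: each cell predicts B boxes as (tx, ty, tw, th, conf),
    where (tx, ty) are offsets within the cell and (tw, th) are fractions
    of the image size. This is a sketch, not the exact YOLOv1 encoding.
    """
    cell = img_size / S
    boxes = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                tx, ty, tw, th, conf = pred[row, col, b * 5 : b * 5 + 5]
                # Box center in absolute pixels: cell origin + in-cell offset.
                cx = (col + tx) * cell
                cy = (row + ty) * cell
                w, h = tw * img_size, th * img_size
                boxes.append((cx - w / 2, cy - h / 2,
                              cx + w / 2, cy + h / 2, conf))
    return boxes

# Example: random network output for a 7x7 grid with 2 boxes per cell.
pred = np.random.rand(7, 7, 10)
print(len(decode_yolo_grid(pred)))  # 7 * 7 * 2 = 98 candidate boxes
```

SSD decodes similarly, except each position holds offsets relative to pre-defined anchor boxes of several scales and aspect ratios rather than raw grid offsets.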

Development of YOLO

Two-stage Models

Traditional R-CNNs

  • R-CNN first uses selective search to find candidate bounding boxes, converting the detection task into a classification task on each region
  • SPP-net applies spatial pyramid pooling so that the classification CNN can accept inputs of multiple sizes
  • Fast R-CNN speeds up classification by first converting the whole image into a feature map and then pooling each region proposal's features directly from that shared map (RoI pooling), instead of running the CNN once per region
  • Faster R-CNN abandons selective search altogether, using a Region Proposal Network (RPN) to find candidate bounding boxes, and proposes the anchor-box method (see the generation sketch after this list)
  • Mask R-CNN adds a small FCN head to Faster R-CNN that outputs a segmentation prediction; built originally for instance segmentation, it also improves accuracy on the detection task
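As referenced above, here is a minimal sketch of Faster R-CNN-style anchor generation: at every feature-map position, a fixed set of reference boxes is placed at the corresponding image location. The stride, scales, and aspect ratios below follow commonly cited defaults and should be read as assumptions, not the only valid choice.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place len(scales) * len(ratios) reference boxes at every
    feature-map position, centered on the corresponding image location.
    Returns an (N, 4) array of (x1, y1, x2, y2) anchors."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature cell, mapped back to image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep area ~ scale**2 while varying the aspect ratio.
                    w = scale * np.sqrt(1.0 / ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

# A 38x50 feature map (e.g. VGG conv5 on a ~600x800 image) yields 38*50*9 anchors.
print(generate_anchors(38, 50).shape)  # (17100, 4)
```

The RPN then predicts, for each anchor, an objectness score and coordinate offsets; anchors thus replace selective search as the source of candidate boxes.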

Novel CNN-based Architectures

  • R-FCN divides each class's features into position-sensitive parts and evaluates the score of each part of each class at every position of the whole image, which eliminates the need to pass each RoI through a separate detector network.
  • FPN proposes an additional top-down pathway (with lateral connections) so that high-resolution feature maps also carry strong semantic features; a minimal sketch follows this list.
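A minimal sketch of that top-down pathway, assuming a ResNet-like backbone and PyTorch; the channel widths and stage shapes are illustrative, not FPN's only configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """FPN-style top-down pathway: 1x1 lateral convs project each backbone
    stage to a common width, then coarser levels are upsampled and added,
    so high-resolution maps inherit semantically strong features."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats ordered fine -> coarse, e.g. (C3, C4, C5)
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Walk top-down: upsample the coarser map and add it to the lateral.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # A 3x3 conv per level smooths upsampling artifacts.
        return [s(p) for s, p in zip(self.smooth, laterals)]

# Example with ResNet-like stage shapes on a 512x512 input.
c3 = torch.randn(1, 512, 64, 64)
c4 = torch.randn(1, 1024, 32, 32)
c5 = torch.randn(1, 2048, 16, 16)
p3, p4, p5 = TopDownFPN()([c3, c4, c5])
print(p3.shape, p5.shape)  # all 256-channel, at their original resolutions
```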

Tackling the Imbalance between Foreground and Background Objects

  • OHEM adds a read-only copy of the RoI network that first computes a loss for every RoI; the regions with the highest loss (the hardest examples) are then passed to the trainable RoI network for back-propagation.
  • RetinaNet introduces Focal Loss to down-weight the loss assigned to well-classified examples (see the sketch after this list).
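The two ideas can be contrasted in a few lines: both start from a per-example loss, but OHEM keeps only the k hardest examples while focal loss re-weights all of them. The batch size and k below are assumptions; alpha = 0.25 and gamma = 2 are the defaults reported in the RetinaNet paper.

```python
import torch
import torch.nn.functional as F

def ohem_loss(logits, targets, k=128):
    """OHEM-style mining: score every candidate with its loss (the
    'read-only' forward pass), then back-propagate only through the
    k hardest examples. k is an illustrative choice."""
    per_example = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    hard = torch.topk(per_example, k=min(k, per_example.numel())).values
    return hard.mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss: scale each example's cross-entropy by (1 - p_t)**gamma
    so well-classified (easy, mostly background) examples contribute
    little to the total loss."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy batch: 1000 candidates with ~1% foreground, as in dense detection.
logits = torch.randn(1000)
targets = (torch.rand(1000) < 0.01).float()
print(ohem_loss(logits, targets).item(), focal_loss(logits, targets).item())
```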

Transformer-based Architectures

DETR casts the detection task as a set prediction problem and introduces an end-to-end model built on a transformer encoder-decoder over CNN features, eliminating the need for non-maximum suppression and hand-designed anchor boxes.
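A minimal sketch of the set-prediction matching step, using SciPy's Hungarian solver. The L1-only cost here is a simplification; DETR's actual matching cost also includes class probabilities and generalized IoU.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes):
    """Set-prediction matching in the spirit of DETR: build a pairwise cost
    between every predicted and ground-truth box, then solve a one-to-one
    assignment with the Hungarian algorithm. Unmatched predictions are
    trained to predict 'no object'."""
    # cost[i, j] = L1 distance between prediction i and ground truth j.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))

# 5 predicted boxes (a fixed-size query set) matched against 3 ground truths.
preds = np.random.rand(5, 4)
gts = np.random.rand(3, 4)
print(match_predictions(preds, gts))  # 3 unique (pred, gt) pairs
```

Because the assignment is one-to-one, duplicate detections are penalized during training, which is what lets DETR drop non-maximum suppression at inference time.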