Single-shot Models

YOLO vs SSD

Feature          YOLO                                             SSD
Speed            Up to 155 FPS                                    22-46 FPS
Accuracy         Generally lower; struggles with small objects    Higher; better with varying object sizes
Architecture     Fully connected layers, grid-based predictions   Convolutional layers, anchor boxes
Best Use Case    Real-time applications                           Applications needing a balance of accuracy and speed
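To make the "grid-based predictions" row concrete, here is a minimal sketch of decoding a YOLO-style output grid into boxes. The tensor layout (S x S cells, B boxes of (tx, ty, tw, th, conf) per cell) and all names are illustrative assumptions, not the exact YOLOv1 format.

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, img_size=448):
    """Decode a YOLO-style S x S x (B*5) grid of box predictions into
    absolute (x1, y1, x2, y2, confidence) boxes.

    Assumed layout: each cell predicts B boxes as (tx, ty, tw, th, conf),
    where (tx, ty) are offsets within the cell and (tw, th) are fractions
    of the image size. This is a sketch, not the exact YOLOv1 encoding.
    """
    cell = img_size / S
    boxes = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                tx, ty, tw, th, conf = pred[row, col, b * 5 : b * 5 + 5]
                # Box center in absolute pixels: cell origin + in-cell offset.
                cx = (col + tx) * cell
                cy = (row + ty) * cell
                w, h = tw * img_size, th * img_size
                boxes.append((cx - w / 2, cy - h / 2,
                              cx + w / 2, cy + h / 2, conf))
    return boxes

# Example: random network output for a 7x7 grid with 2 boxes per cell.
pred = np.random.rand(7, 7, 10)
print(len(decode_yolo_grid(pred)))  # 7 * 7 * 2 = 98 candidate boxes
```

SSD decodes similarly, except each position holds offsets relative to pre-defined anchor boxes of several scales and aspect ratios rather than raw grid offsets.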

Development of YOLO

Two-stage Models

Traditional R-CNNs

  • R-CNN first uses selective search to find candidate bounding boxes, converting the detection task into a classification task on each region
  • SPP-net applies spatial pyramid pooling so that the classification CNN can accept inputs of multiple sizes
  • Fast R-CNN speeds up classification by first converting the whole image into a feature map and then pooling each region proposal's features directly from that shared map (RoI pooling), instead of running the CNN once per region
  • Faster R-CNN abandons selective search altogether, using a Region Proposal Network (RPN) to find candidate bounding boxes, and proposes the anchor-box method (see the generation sketch after this list)
  • Mask R-CNN adds a small FCN head to Faster R-CNN that outputs a segmentation prediction; built originally for instance segmentation, it also improves accuracy on the detection task
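As referenced above, here is a minimal sketch of Faster R-CNN-style anchor generation: at every feature-map position, a fixed set of reference boxes is placed at the corresponding image location. The stride, scales, and aspect ratios below follow commonly cited defaults and should be read as assumptions, not the only valid choice.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place len(scales) * len(ratios) reference boxes at every
    feature-map position, centered on the corresponding image location.
    Returns an (N, 4) array of (x1, y1, x2, y2) anchors."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature cell, mapped back to image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep area ~ scale**2 while varying the aspect ratio.
                    w = scale * np.sqrt(1.0 / ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

# A 38x50 feature map (e.g. VGG conv5 on a ~600x800 image) yields 38*50*9 anchors.
print(generate_anchors(38, 50).shape)  # (17100, 4)
```

The RPN then predicts, for each anchor, an objectness score and coordinate offsets; anchors thus replace selective search as the source of candidate boxes.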

Novel CNN-based Architectures

  • R-FCN divides each class's features into position-sensitive parts and evaluates the score of each part of each class at every position of the whole image, which eliminates the need to pass each RoI through a separate detector network.
  • FPN proposes an additional top-down pathway (with lateral connections) so that high-resolution feature maps also carry strong semantic features; a minimal sketch follows this list.
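A minimal sketch of that top-down pathway, assuming a ResNet-like backbone and PyTorch; the channel widths and stage shapes are illustrative, not FPN's only configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """FPN-style top-down pathway: 1x1 lateral convs project each backbone
    stage to a common width, then coarser levels are upsampled and added,
    so high-resolution maps inherit semantically strong features."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats ordered fine -> coarse, e.g. (C3, C4, C5)
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Walk top-down: upsample the coarser map and add it to the lateral.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # A 3x3 conv per level smooths upsampling artifacts.
        return [s(p) for s, p in zip(self.smooth, laterals)]

# Example with ResNet-like stage shapes on a 512x512 input.
c3 = torch.randn(1, 512, 64, 64)
c4 = torch.randn(1, 1024, 32, 32)
c5 = torch.randn(1, 2048, 16, 16)
p3, p4, p5 = TopDownFPN()([c3, c4, c5])
print(p3.shape, p5.shape)  # all 256-channel, at their original resolutions
```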

Tackling the Imbalance between Foreground and Background Objects

  • OHEM adds a read-only copy of the RoI network that first computes a loss for every RoI; the regions with the highest loss (the hardest examples) are then passed to the trainable RoI network for back-propagation.
  • RetinaNet introduces Focal Loss to down-weight the loss assigned to well-classified examples (see the sketch after this list).
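The two ideas can be contrasted in a few lines: both start from a per-example loss, but OHEM keeps only the k hardest examples while focal loss re-weights all of them. The batch size and k below are assumptions; alpha = 0.25 and gamma = 2 are the defaults reported in the RetinaNet paper.

```python
import torch
import torch.nn.functional as F

def ohem_loss(logits, targets, k=128):
    """OHEM-style mining: score every candidate with its loss (the
    'read-only' forward pass), then back-propagate only through the
    k hardest examples. k is an illustrative choice."""
    per_example = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    hard = torch.topk(per_example, k=min(k, per_example.numel())).values
    return hard.mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss: scale each example's cross-entropy by (1 - p_t)**gamma
    so well-classified (easy, mostly background) examples contribute
    little to the total loss."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy batch: 1000 candidates with ~1% foreground, as in dense detection.
logits = torch.randn(1000)
targets = (torch.rand(1000) < 0.01).float()
print(ohem_loss(logits, targets).item(), focal_loss(logits, targets).item())
```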

Transformer-based Architectures

DETR casts the detection task as a set prediction problem and introduces an end-to-end model built on a transformer encoder-decoder over CNN features, eliminating the need for non-maximum suppression and hand-designed anchor boxes.
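A minimal sketch of the set-prediction matching step, using SciPy's Hungarian solver. The L1-only cost here is a simplification; DETR's actual matching cost also includes class probabilities and generalized IoU.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes):
    """Set-prediction matching in the spirit of DETR: build a pairwise cost
    between every predicted and ground-truth box, then solve a one-to-one
    assignment with the Hungarian algorithm. Unmatched predictions are
    trained to predict 'no object'."""
    # cost[i, j] = L1 distance between prediction i and ground truth j.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))

# 5 predicted boxes (a fixed-size query set) matched against 3 ground truths.
preds = np.random.rand(5, 4)
gts = np.random.rand(3, 4)
print(match_predictions(preds, gts))  # 3 unique (pred, gt) pairs
```

Because the assignment is one-to-one, duplicate detections are penalized during training, which is what lets DETR drop non-maximum suppression at inference time.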