Motivation

  • Selective search, the region-proposal step used by R-CNN and Fast R-CNN, is slow
  • Generate region proposals with a CNN that shares its convolutional layers with the detector instead

Architecture

Region Proposal Network (RPN)

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature.

This feature is fed into two sibling fully-connected layers:

  • box-regression layer (reg)
  • box-classification layer (cls)

Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations.
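The sliding-window step can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the function name, the zero padding, the odd window size n = 3, and the ReLU after the shared layer are assumptions made for the example. It shows that applying one set of fully-connected weights to every window is all that "shared across all spatial locations" means.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sliding_window_features(feature_map, weights, bias, n=3):
    """Map every n x n window of a (C, H, W) feature map to a d-dim vector.

    weights has shape (d, C * n * n) and bias has shape (d,).
    The same weights are applied at every spatial location.
    Assumes n is odd and uses zero padding so the output keeps H x W.
    """
    c, h, w = feature_map.shape
    pad = n // 2
    padded = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)))
    # windows: (C, H, W, n, n), one n x n patch per channel and location
    windows = sliding_window_view(padded, (n, n), axis=(1, 2))
    # Flatten each location's patch into a vector of length C * n * n
    patches = windows.transpose(1, 2, 0, 3, 4).reshape(h, w, c * n * n)
    # Shared fully-connected layer followed by ReLU (assumed nonlinearity)
    return np.maximum(patches @ weights.T + bias, 0.0)
```

Because the weights do not depend on the location, this is equivalent to an n × n convolution, which is how such a layer would normally be implemented in practice.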

At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. Then,

  • the reg layer has 4k outputs encoding the coordinates of k boxes
  • the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal
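To make the output sizes concrete, here is a hedged NumPy sketch of the two sibling layers, writing k for the maximum number of proposals per location. The function name `rpn_heads`, the (H, W, d) layout of the features, and the two-class softmax over the cls scores are assumptions chosen for illustration.

```python
import numpy as np


def rpn_heads(features, w_reg, b_reg, w_cls, b_cls):
    """Apply the sibling reg and cls layers at every location.

    features: (H, W, d) lower-dimensional features, one per window.
    w_reg: (4k, d), b_reg: (4k,)  -> 4 box coordinates per proposal.
    w_cls: (2k, d), b_cls: (2k,)  -> object / not-object score per proposal.
    """
    h, w, _ = features.shape
    k = w_reg.shape[0] // 4
    # 4k regression outputs per location, grouped as k boxes of 4 values
    reg = (features @ w_reg.T + b_reg).reshape(h, w, k, 4)
    # 2k classification outputs per location, grouped as k pairs of scores
    logits = (features @ w_cls.T + b_cls).reshape(h, w, k, 2)
    # Numerically stable softmax over the two classes (assumed scoring)
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return reg, probs
```

For an H × W feature map this yields H · W · k candidate boxes in total, each with four coordinates and a pair of probabilities.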

Anchor Boxes

The proposals are parameterized relative to reference boxes, which we call anchors. The network predicts offsets that indicate how much each anchor box needs to be modified to better fit the actual object in the image.
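As an illustration of how predicted offsets modify an anchor, the sketch below uses the center/size parameterization from the Faster R-CNN paper: center shifts are scaled by the anchor's width and height, and size changes are predicted in log space. These notes do not spell the formula out, so treat it as an assumption; the function name and the (x_center, y_center, width, height) box layout are likewise hypothetical.

```python
import numpy as np


def apply_offsets(anchors, offsets):
    """Turn anchors plus predicted offsets into proposal boxes.

    anchors: (N, 4) as (x_center, y_center, width, height).
    offsets: (N, 4) as (tx, ty, tw, th).
    Returns (N, 4) boxes in the same layout as the anchors.
    """
    xa, ya, wa, ha = anchors.T
    tx, ty, tw, th = offsets.T
    x = xa + tx * wa       # shift center by a fraction of the anchor width
    y = ya + ty * ha       # shift center by a fraction of the anchor height
    w = wa * np.exp(tw)    # scale width; tw = 0 keeps it unchanged
    h = ha * np.exp(th)    # scale height; th = 0 keeps it unchanged
    return np.stack([x, y, w, h], axis=1)
```

With all offsets equal to zero the proposal is exactly the anchor, so the network only has to learn a correction rather than absolute coordinates.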