Motivation
- Selective search is slow in R-CNN and Fast R-CNN
- Use a single CNN to generate region proposals
Architecture
Region Proposal Network (RPN)
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature.
This feature is fed into two sibling fully-connected layers:
- a box-regression layer (reg)
- a box-classification layer (cls)
Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations.
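The weight sharing described above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the helper names (`dot`, `rpn_head`, `rand_matrix`), the zero padding at the borders, the ReLU on the intermediate feature, the odd window size n = 3, and all sizes are assumptions chosen for the example. Each location predicts a single box here to keep the shapes small.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def rpn_head(feature_map, n, w_mid, w_reg, w_cls):
    """Apply one shared mini-network at every location of an H x W x C map.

    Assumes n is odd and pads out-of-bounds positions with zeros.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    pad = n // 2
    reg_out, cls_out = [], []
    for y in range(height):
        reg_row, cls_row = [], []
        for x in range(width):
            # Flatten the n x n window centred on (y, x).
            window = []
            for dy in range(-pad, pad + 1):
                for dx in range(-pad, pad + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        window.extend(feature_map[yy][xx])
                    else:
                        window.extend([0.0] * channels)
            # Shared layer mapping the window to a lower-dimensional feature
            # (ReLU is an assumption of this sketch).
            hidden = [max(0.0, dot(row, window)) for row in w_mid]
            # Two sibling layers read the same feature.
            reg_row.append([dot(row, hidden) for row in w_reg])
            cls_row.append([dot(row, hidden) for row in w_cls])
        reg_out.append(reg_row)
        cls_out.append(cls_row)
    return reg_out, cls_out


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


H, W, C = 4, 5, 2  # toy feature map size (hypothetical)
n, d = 3, 3        # window size and intermediate feature size (hypothetical)
feature_map = [[[1.0] * C for _ in range(W)] for _ in range(H)]
w_mid = rand_matrix(d, n * n * C)
w_reg = rand_matrix(4, d)  # 4 box coordinates
w_cls = rand_matrix(2, d)  # object / not-object scores
reg_out, cls_out = rpn_head(feature_map, n, w_mid, w_reg, w_cls)
```

Because the same `w_mid`, `w_reg`, and `w_cls` are reused at every location, two locations that see identical windows produce identical outputs. In a deep learning framework this loop would normally be written as an n × n convolution followed by two 1 × 1 convolutions.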
At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. Then,
- the reg layer has 4k outputs encoding the coordinates of k boxes
- the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal
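As a quick check on the arithmetic: if each location predicts k proposals, the reg layer needs 4 values per box and the cls layer 2 scores per box. The sketch below counts these for a whole feature map. The function name `rpn_output_sizes` and the example sizes (a 40 × 60 feature map with k = 9) are assumptions picked for illustration.

```python
def rpn_output_sizes(height, width, k):
    """Count RPN outputs for a height x width feature map with k proposals per location."""
    return {
        "reg_outputs_per_location": 4 * k,  # 4 coordinates per box
        "cls_outputs_per_location": 2 * k,  # object / not-object score per box
        "max_proposals": height * width * k,
    }


# Hypothetical sizes, for illustration only.
sizes = rpn_output_sizes(height=40, width=60, k=9)
```

With these example numbers, each location yields 36 regression outputs and 18 classification scores, and the map as a whole yields at most 21,600 proposals.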
Anchor Boxes
The proposals are parameterized relative to reference boxes, which we call anchors. The network predicts offsets that indicate how much each anchor box needs to be modified to better fit the actual object in the image.
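A minimal sketch of how such offsets can be applied, assuming the center/size parameterization used in the Faster R-CNN paper: boxes are (center x, center y, width, height), center offsets are scaled by the anchor size, and size offsets are log ratios. The function names `encode` and `decode` and the example numbers are hypothetical.

```python
import math


def encode(box, anchor):
    """Offsets (tx, ty, tw, th) that turn `anchor` into `box`."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(offsets, anchor):
    """Apply predicted offsets to an anchor to recover the proposal box."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 80.0, 64.0, 128.0)       # reference box (hypothetical)
ground_truth = (110.0, 90.0, 80.0, 120.0)  # object box (hypothetical)
offsets = encode(ground_truth, anchor)
recovered = decode(offsets, anchor)
```

All-zero offsets leave the anchor unchanged, and `decode(encode(box, anchor), anchor)` returns `box` up to floating-point error. Normalizing by the anchor's width and height keeps the regression targets on a similar scale for small and large anchors.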