Motivation

  • R-CNN is slow because it performs a CNN forward pass for each object proposal, without sharing computation
  • Selective search is also time-consuming (the paper does not optimize this)

Method

  1. Input the image and use selective search to generate object proposals
  2. Pass the whole input image into a CNN to extract feature maps
  3. Apply RoI Pooling to extract a fixed-size feature map for each object proposal (Key optimization: each proposal is projected onto the shared CNN feature map and pooling reads only from that region, so the CNN forward pass runs once per image instead of once per proposal)
  4. Compute the classification loss and box regression loss
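
Step 4 can be sketched for a single RoI as follows. This is a minimal illustration, assuming the log loss over softmax class probabilities and the smooth L1 box loss that Fast R-CNN uses; the function names, the `lam` weight, and the convention that class index 0 means background are assumptions made for this sketch.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Loss for one RoI (hypothetical helper, not the paper's code).

    class_probs: softmax output over K+1 classes, index 0 = background (assumed).
    box_pred, box_target: 4 regression offsets for the true class.
    lam: weight balancing the two terms (assumed default of 1.0).
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so no regression term.
        return cls_loss
    reg_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * reg_loss
```

In training, this per-RoI value would be averaged over the RoIs sampled in a mini-batch.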

RoI Pooling

Like SPP, RoI pooling extracts a fixed-size feature representation from a variable-sized RoI on the input feature map. This allows fully connected layers to follow the convolutional layers even though the RoIs have different sizes.
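
A minimal pure-Python sketch of the idea, for a single channel. The function name, the `stride` of 16 (the total downsampling of the backbone, e.g. VGG16), and the floor/ceil quantization of the RoI borders are assumptions for illustration; real implementations differ in rounding details and operate on batched tensors.

```python
import math

def roi_max_pool(feature_map, roi, out_h=2, out_w=2, stride=16):
    """Max-pool one RoI into a fixed out_h x out_w grid.

    feature_map: 2-D list (H x W), one channel of the shared CNN output.
    roi: (x1, y1, x2, y2) in input-image pixels.
    stride: assumed image-to-feature-map downsampling factor.
    """
    H, W = len(feature_map), len(feature_map[0])
    x1, y1, x2, y2 = roi
    # Project the proposal onto the feature map and clamp to its bounds.
    fx1 = min(max(math.floor(x1 / stride), 0), W - 1)
    fy1 = min(max(math.floor(y1 / stride), 0), H - 1)
    fx2 = min(max(math.ceil(x2 / stride), fx1 + 1), W)
    fy2 = min(max(math.ceil(y2 / stride), fy1 + 1), H)
    roi_h, roi_w = fy2 - fy1, fx2 - fx1

    def bin_range(start, size, idx, n_bins):
        # Bin idx covers [floor(idx*size/n), ceil((idx+1)*size/n)) within the RoI.
        lo = start + (idx * size) // n_bins
        hi = start + -(-((idx + 1) * size) // n_bins)
        return lo, hi

    pooled = []
    for i in range(out_h):
        ys, ye = bin_range(fy1, roi_h, i, out_h)
        row = []
        for j in range(out_w):
            xs, xe = bin_range(fx1, roi_w, j, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
full = roi_max_pool(fmap, (0, 0, 64, 64))     # covers a 4x4 region
part = roi_max_pool(fmap, (16, 16, 48, 64))   # covers a 3x2 region
```

Both RoIs cover regions of different sizes on the feature map, yet both produce a 2x2 output, which is what lets a fixed-size fully connected layer follow.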