Motivation

  • Extend Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding box recognition.
  • The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.

Architecture

Procedure

Mask R-CNN adopts the same two-stage procedure as Faster R-CNN, with an identical first stage (the Region Proposal Network, RPN). In the second stage, in parallel with predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI via a small FCN.

Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$, where $L_{cls}$ and $L_{box}$ are identical to those in Faster R-CNN. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask.
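
The mask loss for a single RoI can be sketched as follows. This is a minimal illustration in plain Python, not the reference implementation: it assumes the mask branch output is given as K grids of m x m raw logits (one grid per class), and the names `bce_with_logits`, `mask_loss`, and `multitask_loss` are hypothetical helpers introduced only for this example. Only the grid for the ground-truth class contributes to the loss; the other K - 1 grids are ignored.

```python
import math


def bce_with_logits(z, y):
    """Binary cross-entropy of sigmoid(z) against label y (0 or 1).

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel BCE on the gt_class-th mask only.

    mask_logits: list of K grids, each an m x m nested list of logits
                 (assumed layout for this sketch).
    gt_mask:     m x m nested list of 0/1 labels for the RoI.
    gt_class:    0-based index of the ground-truth class.
    """
    logits_k = mask_logits[gt_class]  # masks of other classes are not used
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits_k, gt_mask):
        for z, y in zip(row_logits, row_labels):
            total += bce_with_logits(z, y)
            count += 1
    return total / count


def multitask_loss(l_cls, l_box, l_mask):
    """Unweighted sum of the three per-RoI losses."""
    return l_cls + l_box + l_mask


if __name__ == "__main__":
    # K = 2 classes, m = 2: class 0 has all-zero logits, class 1 is arbitrary.
    logits = [
        [[0.0, 0.0], [0.0, 0.0]],
        [[5.0, -3.0], [2.0, 7.0]],
    ]
    target = [[1, 0], [0, 1]]
    print(mask_loss(logits, target, gt_class=0))
```

Because each class has its own sigmoid-activated mask and the loss reads only the ground-truth class's mask, the per-class masks do not compete with one another, in contrast to a per-pixel softmax over classes.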