Idea

A limitation of traditional residual networks and CNNs: strided convolutions keep downsampling the input image and discarding spatial information, which makes precise segmentation difficult.

DeepLab v1 introduces dilated (atrous) convolution to tackle this problem. Although it successfully maintains the feature-map resolution by replacing strided convolutions with dilated ones at different dilation rates, the model costs a large amount of memory.
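To make the idea concrete, here is a minimal numpy sketch of a 2-D dilated convolution (single channel, stride 1, zero padding); it is an illustration of the operation, not DeepLab's actual implementation. The key point is that the output keeps the input's spatial size while the receptive field grows with the dilation rate.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """2-D dilated ('atrous') convolution with zero padding so the output
    keeps the input's spatial size (stride is always 1)."""
    k = kernel.shape[0]                      # assume a square kernel
    eff = k + (k - 1) * (rate - 1)           # effective receptive-field size
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # sample the padded input at dilated offsets
            patch = xp[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.random.rand(8, 8)
k = np.ones((3, 3)) / 9.0
y = dilated_conv2d(x, k, rate=2)
print(y.shape)  # (8, 8): resolution preserved despite a 5x5 receptive field
```

With `rate=1` this reduces to an ordinary 3x3 convolution; larger rates enlarge the context each output pixel sees without any downsampling.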

RefineNet proposes a new method that enjoys both memory and computational benefits while still producing effective high-resolution segmentation predictions via multi-path refinement.

Architecture

Multi-path Refinement

Each RefineNet block takes in the corresponding ResNet feature map at that downsampling stage and the output of the preceding, lower-resolution RefineNet block.
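The data flow of one refinement stage can be sketched as follows. This is a hypothetical simplification (the function name and nearest-neighbour upsampling are my own choices, not the paper's): the coarser path is upsampled to the finer path's size and the two are fused by summation.

```python
import numpy as np

def refine(high_res_feat, low_res_feat):
    """Sketch of one multi-path refinement step: upsample the coarser
    feature map (nearest-neighbour repetition here) and fuse it with the
    finer one by summation."""
    factor = high_res_feat.shape[0] // low_res_feat.shape[0]
    up = low_res_feat.repeat(factor, axis=0).repeat(factor, axis=1)
    return high_res_feat + up

f4 = np.random.rand(8, 8)    # e.g. a 1/4-resolution ResNet feature map
f8 = np.random.rand(4, 4)    # 1/8-resolution output of the previous block
out = refine(f4, f8)
print(out.shape)  # (8, 8)
```

Chaining such steps from the coarsest to the finest resolution is what produces the final high-resolution prediction.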

Residual Convolution Unit

A simplified version of the convolution unit in the original ResNet, with batch normalization removed.
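A single-channel numpy sketch of such a unit, assuming the standard ReLU-conv-ReLU-conv layout with an identity shortcut (weights here are arbitrary stand-ins):

```python
import numpy as np

def conv3x3(x, w):
    """3x3 convolution with zero padding (stride 1, single channel)."""
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def rcu(x, w1, w2):
    """Residual Convolution Unit: ReLU -> conv -> ReLU -> conv, plus the
    identity shortcut. Unlike ResNet's original unit, no batch norm."""
    h = conv3x3(np.maximum(x, 0), w1)   # ReLU then 3x3 conv
    h = conv3x3(np.maximum(h, 0), w2)
    return x + h                         # residual (identity) connection

x = np.random.rand(6, 6)
w1 = np.random.rand(3, 3) * 0.1
w2 = np.random.rand(3, 3) * 0.1
y = rcu(x, w1, w2)
print(y.shape)  # (6, 6)
```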

Multi-resolution Fusion

This block fuses all path inputs into a high-resolution feature map. The steps are:

  • Use a convolution layer to project all inputs to the same feature dimension (the smallest among them)
  • Upsample all the feature maps to the largest resolution of the inputs
  • Fuse all the feature maps by summation
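The three steps above can be sketched in numpy for feature maps shaped `(channels, H, W)`. Random matrices stand in for the learned 1x1 convolutions, and nearest-neighbour repetition stands in for the upsampling; both are illustrative assumptions.

```python
import numpy as np

def fuse(paths):
    """Multi-resolution fusion sketch: (1) project each path to the
    smallest channel depth with a 1x1-conv-style channel-mixing matrix,
    (2) upsample every map to the largest spatial size, (3) sum them."""
    c_min = min(p.shape[0] for p in paths)
    h_max = max(p.shape[1] for p in paths)
    projected = []
    for p in paths:
        w = np.random.rand(c_min, p.shape[0]) * 0.1   # stand-in 1x1 conv
        q = np.tensordot(w, p, axes=1)                # -> (c_min, H, W)
        f = h_max // q.shape[1]
        q = q.repeat(f, axis=1).repeat(f, axis=2)     # upsample H and W
        projected.append(q)
    return sum(projected)

a = np.random.rand(256, 8, 8)   # finer, shallower path
b = np.random.rand(512, 4, 4)   # coarser, deeper path
out = fuse([a, b])
print(out.shape)  # (256, 8, 8): smallest depth, largest resolution
```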

Chained Residual Pooling

This pooling aims to capture background context from a large image region. It efficiently pools features with multiple window sizes and fuses them using learnable weights.
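A minimal sketch of the chaining idea, assuming stride-1 max pooling and plain scalar weights standing in for the learned convolutions after each pooling stage: each stage re-pools the previous stage's output (so later stages effectively see larger windows) and adds its weighted result back to the running sum.

```python
import numpy as np

def max_pool_same(x, size=5):
    """Max pooling with stride 1 and padding, so spatial size is kept."""
    pad = size // 2
    xp = np.pad(x, pad, constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i + size, j:j + size].max()
    return out

def chained_residual_pool(x, weights):
    """Chained residual pooling sketch: each stage pools the previous
    stage's output, scales it by a learnable weight (a scalar here,
    standing in for a conv layer), and adds it to the running sum."""
    out = x
    h = x
    for w in weights:
        h = max_pool_same(h)    # chaining enlarges the effective window
        out = out + w * h       # residual fusion with a learnable weight
    return out

x = np.random.rand(8, 8)
y = chained_residual_pool(x, [0.5, 0.25])
print(y.shape)  # (8, 8)
```

Because the stages are chained rather than applied in parallel, multiple window sizes are covered with only small pooling kernels, which keeps the block cheap.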