Idea

The core innovation of PSPNet (Pyramid Scene Parsing Network) is its pyramid pooling module. This module aggregates contextual information from different regions of the image by partitioning the feature map into grids of sub-regions at several scales and pooling within each sub-region.

Method

The pyramid pooling module fuses features at four different pyramid scales. The coarsest level (highlighted in red in the paper's figure) is global pooling, which produces a single-bin output. Each subsequent level divides the feature map into a finer grid of sub-regions and forms a pooled representation for each location. In the paper, the four levels use bin sizes of 1×1, 2×2, 3×3 and 6×6.
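The multi-scale pooling step can be sketched with NumPy as below. This is a minimal illustration, not the reference implementation: the helper names `bin_edges` and `adaptive_avg_pool` are hypothetical, average pooling is assumed, and the floor/ceil rule for bin boundaries is borrowed from the convention used by PyTorch's adaptive pooling layers (bins may overlap by one cell when the map size is not divisible by the bin count). The 1, 2, 3, 6 bin sizes follow the paper.

```python
import numpy as np

def bin_edges(size, n):
    """Split a length `size` into n bins: start = floor(i*size/n), end = ceil((i+1)*size/n)."""
    return [(i * size // n, ((i + 1) * size + n - 1) // n) for i in range(n)]

def adaptive_avg_pool(feat, n):
    """Average-pool a (C, H, W) feature map into a (C, n, n) grid of bins."""
    c, h, w = feat.shape
    out = np.empty((c, n, n), dtype=feat.dtype)
    for i, (y0, y1) in enumerate(bin_edges(h, n)):
        for j, (x0, x1) in enumerate(bin_edges(w, n)):
            # One pooled value per channel for the sub-region (i, j).
            out[:, i, j] = feat[:, y0:y1, x0:x1].mean(axis=(1, 2))
    return out

# Hypothetical input: 4 channels on a 12x12 spatial grid.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 12, 12)).astype(np.float32)

# Four pyramid levels: 1x1 is global pooling, the others pool finer sub-regions.
pyramid = {n: adaptive_avg_pool(feat, n) for n in (1, 2, 3, 6)}
for n, pooled in pyramid.items():
    print(n, pooled.shape)
```

The 1×1 level reduces each channel to its global mean, while the 6×6 level keeps a coarse spatial layout, which is how the pyramid captures context at different scales.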

The different levels of the pyramid pooling module output feature maps of different sizes. To maintain the weight of the global feature, we apply a 1×1 convolution layer after each pyramid level to reduce the dimension of the context representation to 1/N of the original, where N is the number of pyramid levels. We then directly upsample the low-dimensional feature maps to the size of the original feature map via bilinear interpolation. Finally, the features from the different levels are concatenated to form the final pyramid pooling global feature.
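The reduce, upsample and concatenate steps can be sketched as follows, again as an illustrative NumPy approximation rather than the authors' code. Assumptions: the pooled maps are random stand-ins for the outputs of the 1×1, 2×2, 3×3 and 6×6 levels; the 1×1 convolution weights are random rather than learned; the helpers `conv1x1`, `bilinear_resize` and `pyramid_fusion` are hypothetical names; the bilinear interpolation uses the corner-aligned variant; and, following common PSPNet implementations, the input feature map is concatenated alongside the pyramid levels.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: mix channels at each position. x: (C_in, h, w), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def bilinear_resize(x, out_h, out_w):
    """Corner-aligned bilinear upsampling of a (C, h, w) map to (C, out_h, out_w)."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]  # horizontal interpolation weights, shape (1, out_w)
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def pyramid_fusion(feat, pooled_levels, rng):
    """Reduce each level to C/N channels, upsample to (H, W), then concatenate."""
    c, h, w = feat.shape
    reduced_c = c // len(pooled_levels)  # 1/N of the original dimension for N levels
    branches = [feat]  # assumption: the input map is kept in the concatenation
    for pooled in pooled_levels:
        # Random stand-in for the learned 1x1 convolution weights.
        weight = rng.standard_normal((reduced_c, c)) / np.sqrt(c)
        branches.append(bilinear_resize(conv1x1(pooled, weight), h, w))
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
C, H, W = 8, 12, 12
feat = rng.standard_normal((C, H, W))
# Random stand-ins for the pooled outputs of the four pyramid levels.
pooled_levels = [rng.standard_normal((C, n, n)) for n in (1, 2, 3, 6)]
out = pyramid_fusion(feat, pooled_levels, rng)
print(out.shape)  # (16, 12, 12): 8 input channels + 4 levels x 2 channels
```

With four levels each branch keeps a quarter of the channels, so the four context branches together contribute as many channels as the input map. For example, a 2048-channel backbone feature would yield 512 channels per level.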