Idea

Improve semantic segmentation by exploiting global context: a Context Encoding Module captures the semantic context of a scene and selectively highlights class-dependent feature maps.

Context Encoding Module

Encoding Layer

Input

Treat the input feature map of shape $C \times H \times W$ as a set of $C$-dimensional input features $X = \{x_{1}, \dots, x_{N}\}$, where $N = H \times W$ is the number of spatial positions.
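The reshaping described above can be sketched in NumPy. The sizes and the `feature_map` array are made up for illustration; the only point is that each spatial position becomes one row $x_i$ of length $C$.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
C, H, W = 4, 3, 5

# A dummy feature map of shape (C, H, W).
feature_map = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# Flatten the spatial dimensions, then transpose so that each row is one
# C-dimensional feature x_i. The result has shape (N, C) with N = H * W.
X = feature_map.reshape(C, H * W).T
N = X.shape[0]
```

Row `X[i]` holds the feature at spatial position `(i // W, i % W)`.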

Codebook

The Encoding Layer learns a codebook $D$ containing $K$ codewords (or visual centers):

$$D = \{d_{1}, d_{2}, \dots, d_{K}\}$$

Each $d_{k}$ is a $C$-dimensional visual center in the feature space that captures a specific semantic meaning or category in the input data.

Smoothing Factor

Alongside the codebook, the layer also learns a set of smoothing factors:

$$S = \{s_{1}, s_{2}, \dots, s_{K}\}$$

Each $s_{k}$ is the smoothing factor for the visual center $d_{k}$. These factors control the influence of each visual center on the feature representation.

Output

The layer outputs the residual encoding by aggregating the residuals with soft-assignment weights: $e_{k} = \sum_{i = 1}^{N} e_{ik}$, where

$$e_{ik} = \frac{\exp(-s_{k} \|r_{ik}\|^{2})}{\sum_{j = 1}^{K} \exp(-s_{j} \|r_{ij}\|^{2})} \, r_{ik}$$

and the residuals are given by $r_{ik} = x_{i} - d_{k}$. The final output is $e = \sum_{k = 1}^{K} \phi(e_{k})$, where $\phi$ denotes Batch Normalization with ReLU activation. Each $e_{k}$, and therefore $e$, is a $C$-dimensional vector.
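The aggregation above can be sketched in NumPy for a single image. This is a minimal illustration, not the authors' implementation: the function names and random inputs are hypothetical, and $\phi$ is replaced by a plain ReLU because Batch Normalization needs batch statistics and learned parameters that are out of scope here.

```python
import numpy as np

def soft_assign_encode(X, D, s):
    """Aggregate residuals with soft-assignment weights.

    X: (N, C) input features, D: (K, C) codewords, s: (K,) smoothing factors.
    Returns E of shape (K, C), whose rows are e_k, and the weights A of shape (N, K).
    """
    R = X[:, None, :] - D[None, :, :]            # residuals r_ik, shape (N, K, C)
    sq_dist = np.sum(R ** 2, axis=2)             # ||r_ik||^2, shape (N, K)
    logits = -s[None, :] * sq_dist               # -s_k * ||r_ik||^2
    # Subtract the row maximum for numerical stability (softmax is unchanged).
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)         # softmax over the K codewords
    E = np.einsum("nk,nkc->kc", A, R)            # e_k = sum_i A_ik * r_ik
    return E, A

def relu_stand_in(E):
    # Assumption: stands in for phi (BatchNorm + ReLU); normalization is omitted.
    return np.maximum(E, 0.0)

# Hypothetical toy inputs.
rng = np.random.default_rng(0)
N, C, K = 6, 3, 2
X = rng.normal(size=(N, C))
D = rng.normal(size=(K, C))
s = np.ones(K)

E, A = soft_assign_encode(X, D, s)
e = relu_stand_in(E).sum(axis=0)                 # e = sum_k phi(e_k), shape (C,)
```

Each row of `A` sums to 1, so every feature $x_i$ distributes its residuals over the $K$ codewords. Summing over $k$ at the end makes the encoding independent of the codeword order.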

Feature Map Attention

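A fully connected layer on top of the encoding predicts the feature map scaling factors $\gamma = \delta(W e)$, where $W$ denotes the layer weights and $\delta$ is the sigmoid function. The module output is then $Y = X \otimes \gamma$, where $\otimes$ is channel-wise multiplication, so each of the $C$ channels of $X$ is scaled by its own factor in $(0, 1)$.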
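A minimal NumPy sketch of this channel-wise scaling follows. The function name, the `weights` matrix, and the toy inputs are hypothetical, and no bias term is used because the formula above has none.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map_attention(feature_map, e, weights):
    """Scale each channel of the feature map by a predicted factor.

    feature_map: (C, H, W), e: (C,) encoded semantics, weights: (C, C).
    Returns the rescaled map Y of shape (C, H, W) and gamma of shape (C,).
    """
    gamma = sigmoid(weights @ e)                  # gamma = sigmoid(W e)
    # Broadcast gamma over the spatial dimensions: one factor per channel.
    Y = feature_map * gamma[:, None, None]
    return Y, gamma

# Hypothetical toy inputs.
rng = np.random.default_rng(0)
C, H, W = 3, 2, 4
feature_map = rng.normal(size=(C, H, W))
e = rng.normal(size=C)
weights = rng.normal(size=(C, C))

Y, gamma = feature_map_attention(feature_map, e, weights)
```

Because `gamma` lies strictly between 0 and 1, the layer can only attenuate channels relative to one another; it emphasizes a channel by suppressing the others.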

Context Encoding for Semantic Segmentation