Idea
- Aggregate the pixel representations according to their region information (the category each pixel is predicted to belong to by the backbone network, a prediction supervised by the ground-truth segmentation through an auxiliary loss during training) to obtain the Object Region Representations.
- Compute the relation between each pixel and each object region, then augment the representation of each pixel with its object-contextual representation, which is an aggregation of all the object region representations weighted by those relations.
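The two steps above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: pixels are flattened into an `(N, C)` array, the backbone's coarse prediction is given as `(N, K)` class scores, the learned transformation layers a real model would use are omitted, and the `1/sqrt(C)` scaling of the pixel-region similarity is an assumption. The function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def object_contextual_representation(pixels, coarse_logits):
    """Sketch of the two-step idea (no learned transforms).

    pixels:        (N, C) pixel representations
    coarse_logits: (N, K) per-pixel class scores from the backbone
    Returns the augmented pixel representations (N, 2C) and the
    pixel-region relation matrix (N, K).
    """
    n_channels = pixels.shape[1]

    # Step 1: soft object regions. Normalizing over pixels (axis 0) gives,
    # for each class, a weighting of the pixels that belong to it.
    region_weights = softmax(coarse_logits, axis=0)          # (N, K)
    regions = region_weights.T @ pixels                      # (K, C)

    # Step 2: relation between each pixel and each object region,
    # normalized over regions (axis 1). The scaling is an assumption.
    similarity = pixels @ regions.T / np.sqrt(n_channels)    # (N, K)
    relation = softmax(similarity, axis=1)                   # (N, K)

    # Object-contextual representation: relation-weighted sum of regions.
    context = relation @ regions                             # (N, C)

    # Augment each pixel with its context (concatenation, assumed here).
    augmented = np.concatenate([pixels, context], axis=1)    # (N, 2C)
    return augmented, relation
```

For example, calling the function with 12 pixels of 4 channels and 3 classes returns an augmented array of shape `(12, 8)` and a relation matrix of shape `(12, 3)` whose rows each sum to 1.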
Segmentation Transformer
The pipeline above can be rephrased as the following Transformer encoder-decoder architecture: