Idea

Another application of the Vision Transformer (ViT) to semantic segmentation, built on two components:

  • A hierarchical Transformer encoder that produces multiscale features. Its Mix-FFN block shows that a convolution inside the feed-forward network (a 3×3 depthwise convolution) supplies enough positional information, so explicit positional encodings can be dropped.
  • A lightweight all-MLP decoder that fuses features from all encoder stages, combining the more local attention of the lower stages with the more global attention of the higher ones. This yields powerful representations without the heavy, hand-designed decoders of earlier segmentation models (a minimal sketch of both components follows this list).
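
The PyTorch sketch below is a minimal illustration of both components under stated assumptions: the 3×3 depthwise convolution is the standard Mix-FFN formulation, while the layer widths and all class and parameter names (MixFFN, MLPDecoder, embed_dim, ...) are illustrative rather than taken from any official codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise convolution between the two
    linear layers. The zero-padded convolution leaks position into the
    features, so no explicit positional encoding is needed."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depthwise 3x3 conv: groups == channels; padding keeps spatial size.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N == h * w
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)   # tokens -> feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)            # feature map -> tokens
        x = F.gelu(x)
        return self.fc2(x)


class MLPDecoder(nn.Module):
    """Lightweight all-MLP decoder: project each encoder stage to a common
    width with one linear layer, upsample everything to the largest stage's
    resolution, concatenate, and fuse. Lower stages contribute local detail,
    higher stages global context."""

    def __init__(self, in_dims: list[int], embed_dim: int, num_classes: int):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, kernel_size=1)
        self.classify = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: multiscale maps (B, C_i, H_i, W_i); feats[0] is the largest.
        target = feats[0].shape[2:]
        outs = []
        for f, proj in zip(feats, self.projs):
            b, c, h, w = f.shape
            f = proj(f.flatten(2).transpose(1, 2))      # (B, N, embed_dim)
            f = f.transpose(1, 2).reshape(b, -1, h, w)  # back to a map
            outs.append(F.interpolate(f, size=target, mode="bilinear",
                                      align_corners=False))
        return self.classify(self.fuse(torch.cat(outs, dim=1)))
```

Note how little the decoder learns: per-stage linear projections plus two 1×1 convolutions, which is what keeps it lightweight compared with traditional decoder heads.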

Architecture

Overlap Patch Merging

Merges features from overlapping patches, which preserves local detail around patch borders. Compared with non-overlapping patch merging (e.g. in the Swin Transformer), this approach improves local continuity across patches and supports flexible input sizes.
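
A minimal sketch under the same assumptions as above: overlapped patch merging reduces to a strided convolution whose kernel is larger than its stride, so adjacent patches share pixels. The K=7, S=4 values are commonly reported first-stage settings, and the class name OverlapPatchMerging is hypothetical.

```python
import torch
import torch.nn as nn


class OverlapPatchMerging(nn.Module):
    def __init__(self, in_ch: int, embed_dim: int,
                 kernel_size: int = 7, stride: int = 4):
        super().__init__()
        # kernel_size > stride => neighboring patches overlap, preserving
        # local continuity across borders; padding keeps sizes aligned.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor):
        x = self.proj(x)                  # (B, C, H/S, W/S)
        _, _, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
        return self.norm(x), h, w


# Fully convolutional, so any input size works without retraining:
tokens, h, w = OverlapPatchMerging(3, 64)(torch.randn(1, 3, 224, 224))
print(tokens.shape, h, w)  # torch.Size([1, 3136, 64]) 56 56
```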