[ML] SegNet Paper One-page Summary

1. Abstract

Image segmentation is a classic and important field of computer vision, with many applications such as self-driving cars, CCTV, and other areas of robotics. In this post I review SegNet, one of the best-known and most-cited papers on the topic. I also personally like this paper: it delivers a clear message, and with a fairly simple structure it outperformed other models of its time.

2. Structure

The network has an encoder-decoder structure, so I will introduce the model in that order.

a) Encoder

[Figure: SegNet encoder network]

The model has a simple encoder network that performs convolution and max-pooling. As the figure above shows, it consists of 13 convolutional layers with 3x3 kernels, modeled on the convolutional layers of VGG16 with the fully connected layers removed.
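
As a rough illustration of the layer count, here is a minimal sketch. It assumes the standard VGG16 convolutional layout (five stages of 2, 2, 3, 3, and 3 layers with 64 to 512 channels) and one 2x2 pooling step per stage; the names `VGG16_CONV_STAGES`, `count_conv_layers`, and `encoder_output_size` are made up for this example and do not come from the paper.

```python
# Assumed VGG16 layout: (number of 3x3 conv layers, output channels) per stage.
# Each stage is assumed to end with one 2x2, stride-2 max-pooling layer.
VGG16_CONV_STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def count_conv_layers(stages):
    """Total number of convolutional layers across all stages."""
    return sum(num_layers for num_layers, _ in stages)


def encoder_output_size(height, width, stages):
    """Spatial size after one 2x2, stride-2 pooling per stage (floor division)."""
    for _ in stages:
        height, width = height // 2, width // 2
    return height, width


print(count_conv_layers(VGG16_CONV_STAGES))              # 13
print(encoder_output_size(224, 224, VGG16_CONV_STAGES))  # (7, 7)
```

The 224x224 input is only an example size; any input is shrunk by a factor of 32 per side under these assumptions.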

With this deep stack of convolutional layers, the encoder can extract features with more non-linearity. It also performs 2x2 max-pooling, which halves the height and width of the feature maps. Notably, it stores the max-pooling indices, that is, the location of the maximum in each pooling window, during pooling. These indices are used in the decoding process.
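
To make "storing max-pooling indices" concrete, here is a small pure-Python sketch of 2x2 max-pooling on a single-channel map that also records where each maximum came from. The function name `max_pool_2x2_with_indices` and the toy input are assumptions for illustration; a real implementation would work on batched, multi-channel tensors.

```python
def max_pool_2x2_with_indices(feature_map):
    """2x2, stride-2 max-pooling on a 2D list with even height and width.

    Returns the pooled map and, for each window, the (row, col) position
    of the maximum in the input map.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled, indices = [], []
    for r in range(0, rows, 2):
        pooled_row, index_row = [], []
        for c in range(0, cols, 2):
            # Collect the four (value, position) pairs of this window.
            window = [
                (feature_map[r + dr][c + dc], (r + dr, c + dc))
                for dr in (0, 1)
                for dc in (0, 1)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(feature_map)
print(pooled)   # [[4, 5], [6, 7]]
print(indices)  # [[(1, 0), (1, 3)], [(3, 0), (2, 2)]]
```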

b) Decoder

[Figure: SegNet decoder (left) compared with FCN decoder (right)]

The decoder network consists of upsampling and convolutional layers that mirror the encoder, followed by a softmax classifier for pixel-wise segmentation. Interestingly, upsampling is performed using the max-pooling indices stored by the encoder. The next part covers this in more detail.
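
The index-based upsampling step can be sketched as follows. This is a minimal single-channel illustration, not the paper's implementation: each pooled value is written back to the position recorded for it and every other position stays zero. The function name `max_unpool_2x2` and the toy values are assumptions for this example.

```python
def max_unpool_2x2(pooled, indices, out_rows, out_cols):
    """Place each pooled value at its recorded (row, col); fill the rest with 0."""
    output = [[0] * out_cols for _ in range(out_rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            output[r][c] = value
    return output


# Toy pooled map and the positions its maxima came from in a 4x4 input.
pooled = [[4, 5], [6, 7]]
indices = [[(1, 0), (1, 3)], [(3, 0), (2, 2)]]

upsampled = max_unpool_2x2(pooled, indices, 4, 4)
for row in upsampled:
    print(row)
# [0, 0, 0, 0]
# [4, 0, 0, 5]
# [0, 0, 7, 0]
# [6, 0, 0, 0]
```

The result is sparse. In SegNet the decoder's convolutional layers then turn this sparse map into a dense one.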

3. Upsampling Using Max-Pooling Indices

Why use max-pooling indices in the upsampling layers, and what is the benefit? This is one of the main contributions of the paper. The authors name three advantages of reusing max-pooling indices.

First, it improves boundary delineation. Boundary details are lost in the encoder's max-pooling layers, but when the stored max-pooling indices are used for upsampling, the decoder can recover that spatial information.

Second, it reduces the number of parameters, enabling end-to-end training. Comparing the left side (SegNet) and the right side (FCN) of the figure above, upsampling in FCN uses the entire encoder feature map and is more computationally expensive than SegNet's upsampling.

Finally, this form of upsampling can be incorporated into any encoder-decoder architecture. The encoder-decoder structure is still widely used in many fields. By reusing the stored indices, the upsampling step could become lighter and perform better without losing spatial information.
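
To get a feel for why stored indices are cheap, here is a back-of-the-envelope sketch. It assumes each 2x2 pooling window needs 2 bits to identify one of its 4 positions, and compares that with keeping the same feature map in 32-bit floats. The function names and the 360x480x64 feature-map size are illustrative assumptions, not figures from the paper's tables.

```python
def index_storage_bits(height, width, channels):
    """Bits needed to store one 2-bit index per 2x2 pooling window."""
    windows = (height // 2) * (width // 2) * channels
    return 2 * windows


def feature_map_storage_bits(height, width, channels, bits_per_value=32):
    """Bits needed to store the full feature map (32-bit floats assumed)."""
    return height * width * channels * bits_per_value


# Illustrative feature-map size (even height and width).
height, width, channels = 360, 480, 64

index_bits = index_storage_bits(height, width, channels)
full_bits = feature_map_storage_bits(height, width, channels)

print(index_bits // 8)          # bytes for the indices: 691200
print(full_bits // 8)           # bytes for the float feature map: 44236800
print(full_bits // index_bits)  # 64
```

Under these assumptions the indices take 64 times less space than the feature map itself, whatever the feature-map size, as long as height and width are even.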

4. Result

[Figure: qualitative segmentation results] [Table: quantitative comparison with other models]

As the figure and table above show, SegNet performs reasonably well both qualitatively and quantitatively despite its simple structure. In the table on the right, SegNet outperforms other models such as DeepLab, except on mIoU.
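
For readers unfamiliar with the metrics, here is a minimal sketch of how global accuracy, class-average accuracy, and mIoU are commonly computed from a confusion matrix. It uses the usual textbook definitions and assumes every class appears at least once; the function name `segmentation_metrics` and the toy 2-class matrix are made up for this example and are not results from the paper.

```python
def segmentation_metrics(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j.

    Returns (global accuracy, class-average accuracy, mean IoU).
    Assumes every class has at least one true pixel.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(num_classes))

    class_accuracies, ious = [], []
    for i in range(num_classes):
        true_positive = confusion[i][i]
        actual = sum(confusion[i])                                    # true pixels of class i
        predicted = sum(confusion[j][i] for j in range(num_classes))  # pixels predicted as i
        union = actual + predicted - true_positive
        class_accuracies.append(true_positive / actual)
        ious.append(true_positive / union)

    return (
        correct / total,
        sum(class_accuracies) / num_classes,
        sum(ious) / num_classes,
    )


# Toy 2-class confusion matrix (hypothetical pixel counts).
confusion = [
    [8, 2],
    [1, 9],
]
global_acc, class_avg_acc, mean_iou = segmentation_metrics(confusion)
print(f"{global_acc:.3f} {class_avg_acc:.3f} {mean_iou:.3f}")  # 0.850 0.850 0.739
```

mIoU penalizes false positives as well as misses, which is one reason a model can lead on accuracy and still trail on mIoU.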

[Table: forward/backward time, model size, and GPU memory comparison]

Because of its encoder-decoder structure, SegNet has slower forward and backward times than FCN and DeepLab. But as mentioned in part 3, storing max-pooling indices contributes to a smaller model size and lower GPU memory use than FCN and DeconvNet.

All credit to Vijay Badrinarayanan et al., "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation", PAMI 2017.
