<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Skip-SegFormer: Efficient Semantic Segmentation for Urban Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Lombardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuel Di Nardo</string-name>
          <email>emanuel.dinardo@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Ciaramella</string-name>
          <email>angelo.ciaramella@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Autonomous driving, Image Segmentation, Computer Vision, Transformer</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples Parthenope</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Environmental perception is a crucial aspect of autonomous urban driving: it provides information about the environment, identifying clear driving areas and possible surrounding obstacles. Semantic segmentation is a widely used perception method for self-driving cars, and the predicted image pixels can be used to bias the vehicle's behaviour and avoid collisions. In this work a semantic segmentation model based on the SegFormer architecture is proposed, made more efficient by what we call our Skip-Decoder module. The model is fine-tuned on urban driving datasets and produces accurate segmentation masks in a short time, making the architecture readily adaptable to an autonomous driving car system.</p>
      </abstract>
      <kwd-group>
        <kwd>Autonomous driving</kwd>
        <kwd>Image segmentation</kwd>
        <kwd>Computer vision</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Autonomous driving cars need to be equipped with the necessary perception to understand the nearby situation, so that they can safely integrate into our existing roads and have enough information about the environment, clear driving areas and possible surrounding obstacles. One of the many sensors involved in autonomous driving is usually a camera, which allows the system to process the rich visual signal using, for example, semantic segmentation, so that possible obstacles can be recognized and collisions avoided. This work proposes an accurate real-time semantic segmentation model for self-driving cars based on SegFormer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a Transformer-based architecture with a lightweight decoder and an efficient multi-head attention. This method guarantees a fast inference time and good performance. The model is fine-tuned and tested on urban driving datasets such as Cityscapes [2] and ApolloScape [3], showing the capability of SegFormer to be easily adapted to downstream tasks.
      </p>
      <p>The proposed architecture is a variation of the original SegFormer implementation, which uses a so-called Skip-Decoder that simulates the U-Net “expanding path” and expands the hidden states using different local size information at each iteration.</p>
      <p>By using the term “autonomous driving” in this paper, we refer to the different levels of driving automation.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Works</title>
      <p>Building on the Vision Transformer (ViT) [5], recent works such as T2T ViT [6] and ViT ADP [7] introduce tailored changes to ViT to further improve image classification performance.</p>
      <p>Other recent works like Swin Transformer [8] and CvT [9] enhance the local continuity of features in the image by removing the fixed-size position embedding, improving the performance of Transformers in dense prediction tasks.</p>
      <p>
        For semantic segmentation in particular, SETR [10] provides an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. A relevant work using the Transformer architecture is SegFormer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a powerful segmentation framework made of two main ultra-efficient modules.
      </p>
      <p>We will further elaborate on SegFormer in Section 4, exploring its modules and their approaches. However, some aspects of semantic segmentation, such as computational efficiency, have not been thoroughly studied in the literature; in fact, these Transformer-based methods have very low efficiency and are thus difficult to deploy in real-time applications. Crucial applications such as road scene understanding in autonomous vehicles need much more segmentation accuracy without affecting the efficiency.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <p>Thanks to the recent works and the state-of-the-art methods in semantic segmentation, a variety of datasets such as ADE20k [11], COCO-Stuff [12] and PASCAL VOC [13] have been proposed, but none of them is precisely developed for an urban driving environment. Recently, significant research efforts have gone into new vision technologies for understanding complex traffic scenes and driving scenarios. In this paper, we use two challenging datasets:
• Cityscapes [2]: a benchmark suite with a corresponding dataset specifically tailored for autonomous driving in an urban environment, involving a much wider range of highly complex inner-city street scenes recorded in 50 different cities with different sizes, geographic positions and different times of the year. The base dataset consists of 5000 images with fine pixel-level annotations, created as layered polygons and realized in-house to guarantee the highest quality levels (a minimal loading sketch is shown after this list).
• ApolloScape [3]: ApolloScape is a dataset used to prove the learning strength of the model, and was chosen for its more challenging environments, for instance high-contrast regions due to sunlight and large areas of shadow from overpasses. The specifications of ApolloScape for semantic scene parsing are the following: 143,906 video frames and their corresponding pixel-level semantic labelling. The number of given samples is large and, furthermore, the dataset is more complex due to the variety of the environments and image features.</p>
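      <p>As a practical illustration of how such data can be consumed, the sketch below loads the Cityscapes fine semantic annotations with torchvision; the root path, transform and leftImg8bit/gtFine layout are illustrative assumptions, not the exact pipeline used in this work.</p>
      <preformat>
# Minimal sketch: loading Cityscapes fine semantic masks with torchvision.
# The root path and transform are illustrative, not the training pipeline used here.
from torchvision import datasets, transforms

cityscapes = datasets.Cityscapes(
    root="./data/cityscapes",      # expects the leftImg8bit/ and gtFine/ folders
    split="train",
    mode="fine",                   # the 5000 finely annotated images
    target_type="semantic",
    transform=transforms.ToTensor(),
)

image, mask = cityscapes[0]        # image: 3xHxW tensor, mask: PIL label image
print(image.shape, mask.size)
      </preformat>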
      <sec id="sec-2-1">
        <title>3.1. Metrics</title>
        <p>In this work, the mean IoU (mIoU) has been used: a particular version of the Jaccard index calculated by taking the IoU of each class and averaging them, giving a single value in output. The Jaccard index measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. Another important metric, related to the inference time, is FLOPs (Floating Point Operations), which represents the total number of floating point operations required for a single forward pass. The higher the FLOPs, the slower the model and hence the lower the throughput. We will use this metric to measure the efficiency of the proposed model.</p>
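        <p>For concreteness, a minimal sketch of the per-class IoU and mean IoU computation is given below; the number of classes and the ignore index are illustrative assumptions and not values prescribed by this work.</p>
        <preformat>
# Minimal sketch: mean IoU from predicted and ground-truth label maps.
# num_classes and ignore_index are illustrative assumptions.
import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = np.logical_and(pred == c, valid)
        target_c = np.logical_and(target == c, valid)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue                      # class absent in both: skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
        </preformat>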
      </sec>
    </sec>
    <sec id="sec-2b">
      <title>4. Methodology</title>
      <p>This section introduces the basic model used and the proposed version of the architecture.</p>
      <sec id="sec-2-2">
        <title>4.1. SegFormer in detail</title>
        <p>
          SegFormer [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is an efficient, robust and powerful segmentation framework without hand-crafted and computationally demanding modules. The architecture consists of two main modules: a hierarchical Transformer encoder to generate high-resolution coarse features and low-resolution fine features, and a lightweight All-MLP decoder to fuse these multi-level features and produce the final semantic segmentation mask. Image patches of size 4 x 4 (smaller patches favor the dense prediction task) are used as input to the hierarchical Transformer encoder to obtain multi-level features at 1/4, 1/8, 1/16 and 1/32 of the original image resolution, which are then passed to the All-MLP decoder to predict the segmentation mask at 1/4 of the image size, i.e. a fraction of the amount of pixels of the original image, later up-sampled to the full resolution.
        </p>
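        <p>As an illustration of this encoder/decoder pipeline, the sketch below runs a publicly available SegFormer checkpoint through the Hugging Face transformers API; the checkpoint name and input image are assumptions, and this is not the fine-tuned model proposed in this work.</p>
        <preformat>
# Minimal sketch: running a pretrained SegFormer (MiT-B0) checkpoint.
# The checkpoint name is an assumption; this is not the model fine-tuned here.
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-cityscapes-1024-1024"
processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint).eval()

image = Image.open("street_scene.png").convert("RGB")   # any urban driving image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # coarse logits at 1/4 of the processed size

# Up-sample the coarse logits back to the image resolution before the argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1)[0]
        </preformat>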
      </sec>
      <sec id="sec-3-1">
        <title>4.2. SegFormer with Skip Connections</title>
        <p>The All-MLP decoder is what makes the SegFormer architecture so fast and lightweight. The original up-sampling stage is basically performed with a bilinear interpolation, which makes a large number of features be estimated using just a portion of the 1/4-resolution features. This work proposes a variant of the previously mentioned decoder, in which the up-sampling stage uses more features to build the feature map of size H/4 x W/4 that is eventually given as input to the MLP classifier layer. Taking inspiration from the work proposed by the U-Net [4] authors, this decoder uses some sort of skip connections: considering the decoder as an expansive path, each decoder step doubles the feature map size and concatenates two encoder hidden state outputs at each iteration. As shown in Figure 1, there are a few simple steps in this decoder.</p>
        <p>First, the i-th hidden state is fused with the previous one (assuming that the first has already been up-sampled) using a 1x1 convolution that keeps the same size of the features. Then, this fused feature map is up-sampled to match the size of the hidden state of the next iteration; the up-sampling is performed as a bilinear interpolation. This loop is performed 4 times, because in our experiments the encoder is made of 4 Transformer encoder blocks. Finally, another MLP classifies these fused encoder hidden states, whose sizes range from 1/32 up to 1/4 of the input resolution, to produce the segmentation map of size H/4 x W/4. A sketch of this decoder is given below.</p>
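        <p>A minimal sketch of this Skip-Decoder idea is given below; the channel widths, embedding dimension and fusion via 1x1 convolutions are assumptions made for illustration and do not reproduce the exact implementation.</p>
        <preformat>
# Minimal sketch of a U-Net-like skip decoder over 4 encoder hidden states.
# Channel widths and the embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoder(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        # 1x1 convolutions that fuse the running map with each finer hidden state
        # while keeping the number of channels fixed.
        self.fuse = nn.ModuleList(
            nn.Conv2d(embed_dim + c, embed_dim, kernel_size=1)
            for c in in_channels[::-1][1:]
        )
        self.proj = nn.Conv2d(in_channels[-1], embed_dim, kernel_size=1)
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, hidden_states):
        # hidden_states: features at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
        feats = hidden_states[::-1]        # start from the coarsest (1/32) map
        x = self.proj(feats[0])
        for fuse, skip in zip(self.fuse, feats[1:]):
            # Up-sample to the size of the next (finer) hidden state, then fuse.
            x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
            x = fuse(torch.cat([x, skip], dim=1))
        return self.classifier(x)          # logits at 1/4 of the input resolution
        </preformat>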
      <sec id="sec-3-1">
        <title>4.2. SegFormer with Skip Connections</title>
      </sec>
      <sec id="sec-3-2">
        <title>5.1. Expertimental Settings</title>
        <p>The encoder is pretrained on Imagenet-1k [14] dataset
and the decoder is randomly initialized. During training,
data augmentation was applied on Cityscapes dataset
The All-MLP decoder is what makes the SegFormer archi- through random cropping to 1024 x 1024 and random
tecture so fast and lightweight. The original up-sampling
stage is basically performed with a bilinear interpolation
horizontal flipping. The ApolloScape dataset, instaed,
was first resized to 2048 x 1024 and then passed through
which makes a large number of the features to be esti- the same data augmentation as the Cityscapes dataset.
mated using just a portion of the  x
x  features. This</p>
        <p>AdamW optimizer [15] has been used with an initial
work proposes a variant of the previously mentioned de- learning rate value of 0.0006, combined with a StepLR
inspiration from the work proposed by U-Net [4] authors, the full image size using a bilinear interpolation.
4
4

4

4
coder, in which the up-sampling stage uses more features
to build the feature map with size  x</p>
        <p>x  that is
eventually given in input to the MLP classifier layer. Taking
this decoder uses some sort of skip-connections.
Considering the decoder as an expansive path, each decoder step
doubles the feature map size and concatenates two
encoder hidden state outputs at each iteration. As shown
in Figure 1, there are few simple steps in this decoder.</p>
        <p>First, the i-th hidden state is fused with the previous one
(assuming that the first has already been up-sampled)
using a 1x1 convolution that keeps the same size of the
features. Then, this</p>
        <p>, is up-sampled to match
the size of the hidden state of the next iteration. The
upsampling is performed as a bilinear interpolation. This
loop is performed 4 times, because in our experiments
the encoder is made by 4 transformer encoder blocks.
Finally, another MLP classifies these fused encoder hidden
states to produce the segmentation map with size  x
4

4
of pixels. With the hidden states with size 1 th, 1 th and
16
ficiency trade-of.
schedule with a factor of 0.5 and a patience of 5 epochs.</p>
        <p>During the training, test and evaluation, before any
measurement, the output of the model has been restored to</p>
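        <p>To make these settings concrete, a hedged sketch of the optimizer and augmentation setup is shown below; the model is a stand-in module, joint image/mask transform handling is omitted, and the factor-0.5, patience-5 step schedule described above is expressed with PyTorch's ReduceLROnPlateau as an assumption.</p>
        <preformat>
# Hedged sketch of the training settings described above; the model is a
# stand-in and the data pipeline is omitted.
import torch
import torch.nn as nn
from torchvision import transforms

# Random 1024x1024 crops and random horizontal flips, as used on Cityscapes
# (mask-aware joint transforms are omitted in this sketch).
train_transform = transforms.Compose([
    transforms.RandomCrop(1024),
    transforms.RandomHorizontalFlip(p=0.5),
])

model = nn.Conv2d(3, 19, kernel_size=1)            # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

# A factor-0.5, patience-5 step schedule corresponds to ReduceLROnPlateau in PyTorch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5
)

val_miou = 0.0                                     # placeholder validation metric
scheduler.step(val_miou)                           # called once per epoch
        </preformat>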
      </sec>
      <sec id="sec-3-3">
        <title>5.2. Results</title>
        <p>The proposed SegFormer model and its variant are tested on both datasets presented in Section 3, and then compared to state-of-the-art architectures. A discussion about limitations, performance and model features is provided in this section.</p>
        <sec id="sec-3-3-1">
          <title>Method</title>
          <p>FCN - MobileNet v2
DeeplabV3 - MobileNet v2
EncNet - ResNet101
FCN - ResNet101
DeeplabV3 - ResNet101
MiT-B0
MiT-B1
MiT-B0 - SkipDecoder (Ours)
MiT-B1 - SkipDecoder (Ours)
FLOPs</p>
          <p>mIoU
317.1
556.2
1748
are also provided for each model to show the
performance/ef</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>5.2.1. Results on Cityscapes</title>
          <p>As can be seen from Table 2, the most important classes are well detected and segmented. Summarily, the model detects all the flat surfaces and the constructions, respectively with a 96.93 and 87.58 IoU. On the flip side, there are also different irrelevant misclassified classes, strictly related to occlusion problems, which is a common issue that different systems have. Having a look at the image in Figure 2, it is evident that the model has problems in distinguishing the man’s leg from the bicycle, due to the uncommon position taken by the rider, who is usually seen as a pedestrian in an upright position. In fact, this problem also leads to a medium/low accuracy on the object category classes. Without these less relevant classes, the mean IoU would be about 87.3%, which is largely comparable with the state-of-the-art methods and outperforms the methods listed in Table 1.</p>
          <p>The same situation was replicated with the proposed SegFormer variant, which uses the decoder described in subsection 4.2. As expected, the model is faster despite having the exact same number of layers and the same performance. On the other hand, this version brings up the same issues: the up-scaling technique used in the Skip-Decoder does not resolve the low IoU encountered over the object classes. The MiT-B0 model, being a lighter version of the MiT-B1 model, shows a lower mIoU, but brings with it all the advantages and disadvantages: in this case the model is very fast and portable, and the accuracy makes the model suitable for a driver assistance system.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>5.3. Results on ApolloScape</title>
        <p>Although ApolloScape is a very demanding and challenging dataset, the results are partially comparable to those of Cityscapes. This dataset has multiple classes that refer to particular elements of the environment that could be considered less relevant to autonomous driving scene understanding, such as road piles, dustbins, tunnels and bridges. Despite the good mIoU, several relevant classes are not well segmented by the MiT-B1 model, such as person, wall and motorcycle. The model is not capable of generalizing on a very challenging dataset, and these results show us the handicaps of very lightweight architectures in real-life applications. Anyway, it is possible to appreciate the MiT-B1 architecture performance compared to the MiT-B0 architecture, which was not able to reach the same level of accuracy despite having a very large number of samples to be trained on. This could be caused by the small size of the hidden layers, which is not enough to guarantee a good generalization capability but, on the other hand, is also the reason why the model has a low inference time. More parameters are needed to truly appreciate the model’s performance on such demanding data.</p>
        <p>On ApolloScape, the proposed Skip-Decoder SegFormer performed as well as on Cityscapes, without any meaningful accuracy improvement. The Skip-Decoder also shows how the proposed architecture is capable of keeping the same results regardless of the features of the dataset which, in the ApolloScape case, are very challenging and tough. Given the results, the model shows a good robustness even on such demanding datasets.</p>
        <sec id="sec-3-4-1">
          <title>Method</title>
          <p>ResNet-38
ERFNet-IntRA-KD
MiT-B0
MiT-B1
MiT-B0 - SkipDecoder (Ours)
MiT-B1 - SkipDecoder (Ours)
FLOPs
175.4</p>
          <p>125.5
247.9
107.3
230.7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Inference Time</title>
      <p>In this work, the inference time is a crucial aspect, because the real-time capability of the model should be enough to make the driving automation system able to take the correct decision, e.g. warn the driver to turn in a particular direction to avoid an obstacle that the driver was not able to see in time. The GPU used for the inference time tests is an NVIDIA Tesla V100 with 32 GB of memory, and the tests are made for two different image sizes (2048x2048 and 3480x3480).</p>
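      <p>A minimal sketch of how such GPU inference times can be measured is given below; the stand-in model, warm-up count and iteration count are illustrative assumptions.</p>
      <preformat>
# Minimal sketch: measuring GPU inference time with warm-up and synchronization.
# The stand-in model, input size and iteration counts are assumptions.
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=1).cuda().eval()   # stand-in for the model
x = torch.randn(1, 3, 2048, 2048, device="cuda")

with torch.no_grad():
    for _ in range(10):                                  # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                             # wait for the GPU to finish
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000

print(f"average inference time: {elapsed_ms:.1f} ms")
      </preformat>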
      <table-wrap id="tab5">
        <label>Table 5</label>
        <caption>
          <p>Inference time for two different image sizes.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Model type</th>
              <th>2048</th>
              <th>3480</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>MiT-B0</td>
              <td>41 ms</td>
              <td>69 ms</td>
            </tr>
            <tr>
              <td>MiT-B0 + SkipDecoder (Ours)</td>
              <td>40 ms</td>
              <td>68 ms</td>
            </tr>
            <tr>
              <td>MiT-B1</td>
              <td>44 ms</td>
              <td>72 ms</td>
            </tr>
            <tr>
              <td>MiT-B1 + SkipDecoder (Ours)</td>
              <td>41 ms</td>
              <td>69 ms</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The FLOPs difference between the model with and without the Skip-Decoder is noticeable, but the inference time difference shown in Table 5 is only slightly lower. Hence, even if the number of floating point operations is lower, the difference is not very significant. We now analyze the inference time presented above by considering a hypothetical situation of a car going 60 km/h (about 16 m/s) in an urban environment. On 2048x2048 images the inference time is better and the model is able to perform at about 25 frames per second, making it potentially capable of giving remarkable support to almost all driving automation levels. At 60 km/h, the system is able to give a result every 0.6 meters (every 40 ms). At 100 km/h, still working at 40 ms, the model gives a result roughly every meter, which is acceptable considering a highly automated driving system that modifies the behavior of the car (such as speed, trajectory, etc.) without the human control that would also include human reaction times in the calculation.</p>
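      <p>The frame-rate and distance figures quoted above follow directly from the measured latency; the short calculation below reproduces them, taking the 40 ms latency and the vehicle speeds from the text.</p>
      <preformat>
# Back-of-the-envelope check of the figures above: at 40 ms per prediction the
# model runs at 25 fps; the text rounds 60 km/h to 16 m/s, giving roughly
# 0.6-0.7 m per result, and about 1.1 m at 100 km/h.
latency_s = 0.040                       # 40 ms per prediction

fps = 1.0 / latency_s                   # 25 frames per second
for speed_kmh in (60, 100):
    speed_ms = speed_kmh / 3.6          # km/h to m/s
    metres_per_result = speed_ms * latency_s
    print(f"{speed_kmh} km/h: {fps:.0f} fps, one result every {metres_per_result:.2f} m")
      </preformat>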
    </sec>
    <sec id="sec-5">
      <title>7. Conclusions</title>
      <p>The work presented herein is a study of real-time semantic segmentation for autonomous driving purposes and of new techniques that may or may not be useful for new methodologies. The Skip-Decoder SegFormer has been introduced: an efficient model composed of a Transformer encoder that can be used for difficult segmentation tasks, and a lightweight All-MLP decoder which, in the proposed version, uses a faster and more efficient up-sampling technique inspired by U-Net. Finally, this has allowed us to find an effective method both from the point of view of metrics and from that of computational resources, balancing these two aspects into a light model powerful enough to be run in real time without compromising too much on performance. In addition, such a solution can be compared with a variety of techniques in terms of the speed and accuracy trade-off, being capable of increasing its performance just by modifying some architecture parameters at the expense of efficiency.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was completed in part with resources provided by the University of Naples “Parthenope”, Department of Science and Technologies, Research Computing Facilities (https://rcf.uniparthenope.it).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.</p>
      <p>[3] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, R. Yang, The ApolloScape dataset for autonomous driving, in: CVPR Workshops, IEEE Computer Society, 2018, pp. 954–960. URL: http://dblp.uni-trier.de/db/conf/cvpr/cvprw2018.html#HuangCGCZWLY18.</p>
      <p>[4] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, 2015. URL: https://arxiv.org/abs/1505.04597. doi:10.48550/ARXIV.1505.04597.</p>
      <p>[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).</p>
      <p>[6] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. H. Tay, J. Feng, S. Yan, Tokens-to-token ViT: Training vision transformers from scratch on ImageNet, CoRR abs/2101.11986 (2021). URL: https://arxiv.org/abs/2101.11986. arXiv:2101.11986.</p>
      <p>[7] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, Y. Qiao, Vision transformer adapter for dense predictions, 2022. URL: https://arxiv.org/abs/2205.08534. doi:10.48550/ARXIV.2205.08534.</p>
      <p>[8] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, CoRR abs/2103.14030 (2021). URL: https://arxiv.org/abs/2103.14030. arXiv:2103.14030.</p>
      <p>[9] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: Introducing convolutions to vision transformers, CoRR abs/2103.15808 (2021). URL: https://arxiv.org/abs/2103.15808. arXiv:2103.15808.</p>
      <p>[10] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6877–6886. doi:10.1109/CVPR46437.2021.00681.</p>
      <p>[11] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, A. Torralba, Semantic understanding of scenes through the ADE20K dataset, International Journal of Computer Vision 127 (2019) 302–321.</p>
      <p>[12] H. Caesar, J. R. R. Uijlings, V. Ferrari, COCO-Stuff: Thing and stuff classes in context, CoRR abs/1612.03716 (2016). URL: http://arxiv.org/abs/1612.03716. arXiv:1612.03716.</p>
      <p>[13] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2010) 303–338. URL: http://dblp.uni-trier.de/db/journals/ijcv/ijcv88.html#EveringhamGWWZ10.</p>
      <p>[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.</p>
      <p>[15] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017). URL: http://dblp.uni-trier.de/db/journals/corr/corr1711.html#abs-1711-05101.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>SegFormer: Simple and efficient design for semantic segmentation with transformers</article-title>
          ,
          <source>CoRR abs/2105.15203</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2105.15203. arXiv:2105.15203.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>