<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Skip-SegFormer: Efficient Semantic Segmentation for Urban Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Lombardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuel Di Nardo</string-name>
          <email>emanuel.dinardo@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Ciaramella</string-name>
          <email>angelo.ciaramella@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Autonomous driving, Image Segmentation, Computer Vision, Transformer</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples Parthenope</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Environmental perception is a crucial aspect of autonomous urban driving: it provides information about the environment, identifying clear driving areas and possible surrounding obstacles. Semantic segmentation is a widely used perception method for self-driving cars, and the predicted image pixels can be used to bias the vehicle's behaviour and avoid collisions. In this work a semantic segmentation model based on the SegFormer architecture is proposed, made more efficient by what we call our Skip-Decoder module. The model is fine-tuned on urban driving datasets and produces accurate segmentation masks in a short time, making the architecture readily adaptable to an autonomous driving car system.</p>
      </abstract>
      <kwd-group>
        <kwd>Autonomous driving</kwd>
        <kwd>Image segmentation</kwd>
        <kwd>Computer vision</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Autonomous driving cars need to be equipped with the necessary perception to understand the nearby situation, so that they can safely integrate into our existing roads and have enough information about the environment, clear driving areas and possible surrounding obstacles. One of the many sensors involved in autonomous driving is usually a camera, which allows the system to process the rich visual signal using, for example, semantic segmentation, so that possible obstacles can be recognized and collisions avoided. This work proposes an accurate real-time semantic segmentation model for self-driving cars based on SegFormer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a Transformer-based architecture with a lightweight decoder and an efficient multi-head attention. This method guarantees a fast inference time and good performance. The model is fine-tuned and tested on urban driving datasets such as Cityscapes [2] and ApolloScape [3], showing the capability of SegFormer to be easily adapted to downstream tasks.
      </p>
      <p>The proposed architecture is a variation of the original SegFormer implementation, which uses a so-called Skip-Decoder that simulates the U-Net “expanding path” and expands the hidden states using different local size information at each iteration.</p>
      <p>By using the term “autonomous driving” in this paper, we refer to the different levels of driving automation.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Works</title>
      <p>Building on the Vision Transformer (ViT) [5], recent works such as T2T ViT [6] and ViT ADP [7] introduce tailored changes to ViT to further improve image classification performance.</p>
      <p>Other recent works like Swin Transformer [8] and CvT [9] enhance the local continuity of features in the image by removing the fixed-size position embedding, improving the performance of Transformers in dense prediction tasks.</p>
      <p>
        For semantic segmentation in particular, SETR [10] provides an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. A relevant work using the Transformer architecture is SegFormer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a powerful segmentation framework made of two main ultra-efficient modules.
      </p>
      <p>We will further elaborate on SegFormer in Section 4, exploring its modules and their approaches. However, some aspects of semantic segmentation, such as computational efficiency, have not been thoroughly studied in the literature; in fact, these Transformer-based methods have very low efficiency and are thus difficult to deploy in real-time applications. Crucial applications such as road scene understanding in autonomous vehicles need much more segmentation accuracy without affecting the efficiency.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Datasets</title>
      <p>Thanks to the recent works and the state-of-the-art methods in semantic segmentation, a variety of datasets such as ADE20k [11], COCO-Stuff [12] and PASCAL VOC [13] have been proposed, but none of them is precisely developed for an urban driving environment. Recently, significant research efforts have gone into new vision technologies for understanding complex traffic scenes and driving scenarios. In this paper, we use two challenging datasets:
• Cityscapes [2]: a benchmark suite with a corresponding dataset specifically tailored for autonomous driving in an urban environment, involving a much wider range of highly complex inner-city street scenes recorded in 50 different cities with different sizes, geographic positions and different times of the year. The base dataset consists of 5000 images with fine pixel-level annotations, created as layered polygons and realized in-house to guarantee the highest quality levels (a minimal loading sketch is shown after this list).
• ApolloScape [3]: ApolloScape is a dataset used to prove the learning strength of the model, and was chosen for its more challenging environments, for instance high-contrast regions due to sunlight and large areas of shadow from overpasses. The specifications of ApolloScape for semantic scene parsing are the following: 143,906 video frames and their corresponding pixel-level semantic labelling. The number of given samples is large and, furthermore, the dataset is more complex due to the variety of the environments and image features.</p>
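      <p>As a practical illustration of how such data can be consumed, the sketch below loads the Cityscapes fine semantic annotations with torchvision; the root path, transform and leftImg8bit/gtFine layout are illustrative assumptions, not the exact pipeline used in this work.</p>
      <preformat>
# Minimal sketch: loading Cityscapes fine semantic masks with torchvision.
# The root path and transform are illustrative, not the training pipeline used here.
from torchvision import datasets, transforms

cityscapes = datasets.Cityscapes(
    root="./data/cityscapes",      # expects the leftImg8bit/ and gtFine/ folders
    split="train",
    mode="fine",                   # the 5000 finely annotated images
    target_type="semantic",
    transform=transforms.ToTensor(),
)

image, mask = cityscapes[0]        # image: 3xHxW tensor, mask: PIL label image
print(image.shape, mask.size)
      </preformat>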
      <sec id="sec-2-1">
        <title>3.1. Metrics</title>
        <p>In this work, the mean IoU (mIoU) has been used: a particular version of the Jaccard index calculated by taking the IoU of each class and averaging them, giving a single value in output. The Jaccard index measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. Another important metric, related to the inference time, is FLOPs (Floating Point Operations), which represents the total number of floating point operations required for a single forward pass. The higher the FLOPs, the slower the model and hence the lower the throughput. We will use this metric to measure the efficiency of the proposed model.</p>
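        <p>For concreteness, a minimal sketch of the per-class IoU and mean IoU computation is given below; the number of classes and the ignore index are illustrative assumptions and not values prescribed by this work.</p>
        <preformat>
# Minimal sketch: mean IoU from predicted and ground-truth label maps.
# num_classes and ignore_index are illustrative assumptions.
import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = np.logical_and(pred == c, valid)
        target_c = np.logical_and(target == c, valid)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue                      # class absent in both: skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
        </preformat>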
      </sec>
    </sec>
    <sec id="sec-2b">
      <title>4. Methodology</title>
      <p>This section introduces the basic model used and the proposed version of the architecture.</p>
      <sec id="sec-2-2">
        <title>4.1. SegFormer in detail</title>
        <p>
          SegFormer [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is an efficient, robust and powerful segmentation framework without hand-crafted and computationally demanding modules. The architecture consists of two main modules: a hierarchical Transformer encoder to generate high-resolution coarse features and low-resolution fine features, and a lightweight All-MLP decoder to fuse these multi-level features and produce the final semantic segmentation mask. Image patches of size 4 x 4 (smaller patches favor the dense prediction task) are used as input to the hierarchical Transformer encoder to obtain multi-level features at 1/4, 1/8, 1/16 and 1/32 of the original image resolution, which are then passed to the All-MLP decoder to predict the segmentation mask at 1/4 of the image size, i.e. a fraction of the amount of pixels of the original image, later up-sampled to the full resolution.
        </p>
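        <p>As an illustration of this encoder/decoder pipeline, the sketch below runs a publicly available SegFormer checkpoint through the Hugging Face transformers API; the checkpoint name and input image are assumptions, and this is not the fine-tuned model proposed in this work.</p>
        <preformat>
# Minimal sketch: running a pretrained SegFormer (MiT-B0) checkpoint.
# The checkpoint name is an assumption; this is not the model fine-tuned here.
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-cityscapes-1024-1024"
processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint).eval()

image = Image.open("street_scene.png").convert("RGB")   # any urban driving image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # coarse logits at 1/4 of the processed size

# Up-sample the coarse logits back to the image resolution before the argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1)[0]
        </preformat>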
      </sec>
      <sec id="sec-3-1">
        <title>4.2. SegFormer with Skip Connections</title>
        <p>The All-MLP decoder is what makes the SegFormer architecture so fast and lightweight. The original up-sampling stage is basically performed with a bilinear interpolation, which makes a large number of features be estimated using just a portion of the 1/4-resolution features. This work proposes a variant of the previously mentioned decoder, in which the up-sampling stage uses more features to build the feature map of size H/4 x W/4 that is eventually given as input to the MLP classifier layer. Taking inspiration from the work proposed by the U-Net [4] authors, this decoder uses some sort of skip connections: considering the decoder as an expansive path, each decoder step doubles the feature map size and concatenates two encoder hidden state outputs at each iteration. As shown in Figure 1, there are a few simple steps in this decoder.</p>
        <p>First, the i-th hidden state is fused with the previous one (assuming that the first has already been up-sampled) using a 1x1 convolution that keeps the same size of the features. Then, this fused feature map is up-sampled to match the size of the hidden state of the next iteration; the up-sampling is performed as a bilinear interpolation. This loop is performed 4 times, because in our experiments the encoder is made of 4 Transformer encoder blocks. Finally, another MLP classifies these fused encoder hidden states, whose sizes range from 1/32 up to 1/4 of the input resolution, to produce the segmentation map of size H/4 x W/4. A sketch of this decoder is given below.</p>
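        <p>A minimal sketch of this Skip-Decoder idea is given below; the channel widths, embedding dimension and fusion via 1x1 convolutions are assumptions made for illustration and do not reproduce the exact implementation.</p>
        <preformat>
# Minimal sketch of a U-Net-like skip decoder over 4 encoder hidden states.
# Channel widths and the embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoder(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        # 1x1 convolutions that fuse the running map with each finer hidden state
        # while keeping the number of channels fixed.
        self.fuse = nn.ModuleList(
            nn.Conv2d(embed_dim + c, embed_dim, kernel_size=1)
            for c in in_channels[::-1][1:]
        )
        self.proj = nn.Conv2d(in_channels[-1], embed_dim, kernel_size=1)
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, hidden_states):
        # hidden_states: features at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
        feats = hidden_states[::-1]        # start from the coarsest (1/32) map
        x = self.proj(feats[0])
        for fuse, skip in zip(self.fuse, feats[1:]):
            # Up-sample to the size of the next (finer) hidden state, then fuse.
            x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
            x = fuse(torch.cat([x, skip], dim=1))
        return self.classifier(x)          # logits at 1/4 of the input resolution
        </preformat>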
      <sec id="sec-3-1">
        <title>4.2. SegFormer with Skip Connections</title>
      </sec>
      <sec id="sec-3-2">
        <title>5.1. Expertimental Settings</title>
        <p>The encoder is pretrained on Imagenet-1k [14] dataset
and the decoder is randomly initialized. During training,
data augmentation was applied on Cityscapes dataset
The All-MLP decoder is what makes the SegFormer archi- through random cropping to 1024 x 1024 and random
tecture so fast and lightweight. The original up-sampling
stage is basically performed with a bilinear interpolation
horizontal flipping. The ApolloScape dataset, instaed,
was first resized to 2048 x 1024 and then passed through
which makes a large number of the features to be esti- the same data augmentation as the Cityscapes dataset.
mated using just a portion of the  x
x  features. This</p>
        <p>AdamW optimizer [15] has been used with an initial
work proposes a variant of the previously mentioned de- learning rate value of 0.0006, combined with a StepLR
inspiration from the work proposed by U-Net [4] authors, the full image size using a bilinear interpolation.
4
4

4

4
coder, in which the up-sampling stage uses more features
to build the feature map with size  x</p>
        <p>x  that is
eventually given in input to the MLP classifier layer. Taking
this decoder uses some sort of skip-connections.
Considering the decoder as an expansive path, each decoder step
doubles the feature map size and concatenates two
encoder hidden state outputs at each iteration. As shown
in Figure 1, there are few simple steps in this decoder.</p>
        <p>First, the i-th hidden state is fused with the previous one
(assuming that the first has already been up-sampled)
using a 1x1 convolution that keeps the same size of the
features. Then, this</p>
        <p>, is up-sampled to match
the size of the hidden state of the next iteration. The
upsampling is performed as a bilinear interpolation. This
loop is performed 4 times, because in our experiments
the encoder is made by 4 transformer encoder blocks.
Finally, another MLP classifies these fused encoder hidden
states to produce the segmentation map with size  x
4

4
of pixels. With the hidden states with size 1 th, 1 th and
16
ficiency trade-of.
schedule with a factor of 0.5 and a patience of 5 epochs.</p>
        <p>During the training, test and evaluation, before any
measurement, the output of the model has been restored to</p>
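        <p>To make these settings concrete, a hedged sketch of the optimizer and augmentation setup is shown below; the model is a stand-in module, joint image/mask transform handling is omitted, and the factor-0.5, patience-5 step schedule described above is expressed with PyTorch's ReduceLROnPlateau as an assumption.</p>
        <preformat>
# Hedged sketch of the training settings described above; the model is a
# stand-in and the data pipeline is omitted.
import torch
import torch.nn as nn
from torchvision import transforms

# Random 1024x1024 crops and random horizontal flips, as used on Cityscapes
# (mask-aware joint transforms are omitted in this sketch).
train_transform = transforms.Compose([
    transforms.RandomCrop(1024),
    transforms.RandomHorizontalFlip(p=0.5),
])

model = nn.Conv2d(3, 19, kernel_size=1)            # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

# A factor-0.5, patience-5 step schedule corresponds to ReduceLROnPlateau in PyTorch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5
)

val_miou = 0.0                                     # placeholder validation metric
scheduler.step(val_miou)                           # called once per epoch
        </preformat>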
      </sec>
      <sec id="sec-3-3">
        <title>5.2. Results</title>
        <p>The proposed SegFormer model and its variant are tested on both datasets presented in Section 3, and then compared to state-of-the-art architectures. A discussion about limitations, performance and model features is provided in this section.</p>
        <sec id="sec-3-3-1">
          <title>Method</title>
          <p>FCN - MobileNet v2
DeeplabV3 - MobileNet v2
EncNet - ResNet101
FCN - ResNet101
DeeplabV3 - ResNet101
MiT-B0
MiT-B1
MiT-B0 - SkipDecoder (Ours)
MiT-B1 - SkipDecoder (Ours)
FLOPs</p>
          <p>mIoU
317.1
556.2
1748
are also provided for each model to show the
performance/ef</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>5.2.1. Results on Cityscapes</title>
          <p>As can be seen from Table 2, the most important classes are well detected and segmented. Summarily, the model detects all the flat surfaces and the constructions, respectively with a 96.93 and 87.58 IoU. On the flip side, there are also different irrelevant misclassified classes, strictly related to occlusion problems, which is a common issue that different systems have. Having a look at the image in Figure 2, it is evident that the model has problems in distinguishing the man’s leg from the bicycle, due to the uncommon position taken by the rider, who is usually seen as a pedestrian in an upright position. In fact, this problem also leads to a medium/low accuracy on the object category classes. Without these less relevant classes, the mean IoU would be about 87.3%, which is largely comparable with the state-of-the-art methods and outperforms the methods listed in Table 1.</p>
          <p>The same situation was replicated with the proposed SegFormer variant, which uses the decoder described in subsection 4.2. As expected, the model is faster despite having the exact same number of layers and the same performance. On the other hand, this version brings up the same issues: the up-scaling technique used in the Skip-Decoder does not resolve the low IoU encountered over the object classes. The MiT-B0 model, being a lighter version of the MiT-B1 model, shows a lower mIoU, but brings with it all the advantages and disadvantages: in this case the model is very fast and portable, and the accuracy makes the model suitable for a driver assistance system.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>5.3. Results on ApolloScape</title>
        <p>Although ApolloScape is a very demanding and challenging dataset, the results are partially comparable to those of Cityscapes. This dataset has multiple classes that refer to particular elements of the environment that could be considered less relevant to autonomous driving scene understanding, such as road piles, dustbins, tunnels and bridges. Despite the good mIoU, several relevant classes are not well segmented by the MiT-B1 model, such as person, wall and motorcycle. The model is not capable of generalizing on a very challenging dataset, and these results show us the handicaps of very lightweight architectures in real-life applications. Anyway, it is possible to appreciate the MiT-B1 architecture performance compared to the MiT-B0 architecture, which was not able to reach the same level of accuracy despite having a very large number of samples to be trained on. This could be caused by the small size of the hidden layers, which is not enough to guarantee a good generalization capability but, on the other hand, is also the reason why the model has a low inference time. More parameters are needed to truly appreciate the model’s performance on such demanding data.</p>
        <p>On ApolloScape, the proposed Skip-Decoder SegFormer performed as well as on Cityscapes, without any meaningful accuracy improvement. The Skip-Decoder also shows how the proposed architecture is capable of keeping the same results regardless of the features of the dataset which, in the ApolloScape case, are very challenging and tough. Given the results, the model shows a good robustness even on such demanding datasets.</p>
        <sec id="sec-3-4-1">
          <title>Method</title>
          <p>ResNet-38
ERFNet-IntRA-KD
MiT-B0
MiT-B1
MiT-B0 - SkipDecoder (Ours)
MiT-B1 - SkipDecoder (Ours)
FLOPs
175.4</p>
          <p>125.5
247.9
107.3
230.7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Inference Time</title>
      <p>In this work, the inference time is a crucial aspect, because the real-time capability of the model should be enough to make the driving automation system able to take the correct decision, e.g. warn the driver to turn in a particular direction to avoid an obstacle that the driver was not able to see in time. The GPU used for the inference time tests is an NVIDIA Tesla V100 with 32 GB of memory, and the tests are made for two different image sizes (2048x2048 and 3480x3480).</p>
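      <p>A minimal sketch of how such GPU inference times can be measured is given below; the stand-in model, warm-up count and iteration count are illustrative assumptions.</p>
      <preformat>
# Minimal sketch: measuring GPU inference time with warm-up and synchronization.
# The stand-in model, input size and iteration counts are assumptions.
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=1).cuda().eval()   # stand-in for the model
x = torch.randn(1, 3, 2048, 2048, device="cuda")

with torch.no_grad():
    for _ in range(10):                                  # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                             # wait for the GPU to finish
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000

print(f"average inference time: {elapsed_ms:.1f} ms")
      </preformat>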
      <table-wrap id="tab5">
        <label>Table 5</label>
        <caption>
          <p>Inference time for two different image sizes.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Model type</th>
              <th>2048</th>
              <th>3480</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>MiT-B0</td>
              <td>41 ms</td>
              <td>69 ms</td>
            </tr>
            <tr>
              <td>MiT-B0 + SkipDecoder (Ours)</td>
              <td>40 ms</td>
              <td>68 ms</td>
            </tr>
            <tr>
              <td>MiT-B1</td>
              <td>44 ms</td>
              <td>72 ms</td>
            </tr>
            <tr>
              <td>MiT-B1 + SkipDecoder (Ours)</td>
              <td>41 ms</td>
              <td>69 ms</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The FLOPs difference between the model with and without the Skip-Decoder is noticeable, but the inference time difference shown in Table 5 is only slightly lower. Hence, even if the number of floating point operations is lower, the difference is not very significant. We now analyze the inference time presented above by considering a hypothetical situation of a car going 60 km/h (about 16 m/s) in an urban environment. On 2048x2048 images the inference time is better and the model is able to perform at about 25 frames per second, making it potentially capable of giving remarkable support to almost all driving automation levels. At 60 km/h, the system is able to give a result every 0.6 meters (every 40 ms). At 100 km/h, still working at 40 ms, the model gives a result roughly every meter, which is acceptable considering a highly automated driving system that modifies the behavior of the car (such as speed, trajectory, etc.) without the human control that would also include human reaction times in the calculation.</p>
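      <p>The frame-rate and distance figures quoted above follow directly from the measured latency; the short calculation below reproduces them, taking the 40 ms latency and the vehicle speeds from the text.</p>
      <preformat>
# Back-of-the-envelope check of the figures above: at 40 ms per prediction the
# model runs at 25 fps; the text rounds 60 km/h to 16 m/s, giving roughly
# 0.6-0.7 m per result, and about 1.1 m at 100 km/h.
latency_s = 0.040                       # 40 ms per prediction

fps = 1.0 / latency_s                   # 25 frames per second
for speed_kmh in (60, 100):
    speed_ms = speed_kmh / 3.6          # km/h to m/s
    metres_per_result = speed_ms * latency_s
    print(f"{speed_kmh} km/h: {fps:.0f} fps, one result every {metres_per_result:.2f} m")
      </preformat>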
    </sec>
    <sec id="sec-5">
      <title>7. Conclusions</title>
      <p>The work presented herein is a study of real-time semantic segmentation for autonomous driving purposes and of new techniques that may or may not be useful for new methodologies. The Skip-Decoder SegFormer has been introduced: an efficient model composed of a Transformer encoder that can be used for difficult segmentation tasks, and a lightweight All-MLP decoder which, in the proposed version, uses a faster and more efficient up-sampling technique inspired by U-Net. Finally, this has allowed us to find an effective method both from the point of view of metrics and from that of computational resources, balancing these two aspects into a light model powerful enough to be run in real time without compromising too much on performance. In addition, such a solution can be compared with a variety of techniques in terms of the speed and accuracy trade-off, being capable of increasing its performance just by modifying some architecture parameters at the expense of efficiency.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was completed in part with resources provided by the University of Naples “Parthenope”, Department of Science and Technologies, Research Computing Facilities (https://rcf.uniparthenope.it).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.</p>
      <p>[3] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, R. Yang, The ApolloScape dataset for autonomous driving, in: CVPR Workshops, IEEE Computer Society, 2018, pp. 954–960. URL: http://dblp.uni-trier.de/db/conf/cvpr/cvprw2018.html#HuangCGCZWLY18.</p>
      <p>[4] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, 2015. URL: https://arxiv.org/abs/1505.04597. doi:10.48550/ARXIV.1505.04597.</p>
      <p>[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).</p>
      <p>[6] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. H. Tay, J. Feng, S. Yan, Tokens-to-token ViT: Training vision transformers from scratch on ImageNet, CoRR abs/2101.11986 (2021). URL: https://arxiv.org/abs/2101.11986. arXiv:2101.11986.</p>
      <p>[7] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, Y. Qiao, Vision transformer adapter for dense predictions, 2022. URL: https://arxiv.org/abs/2205.08534. doi:10.48550/ARXIV.2205.08534.</p>
      <p>[8] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, CoRR abs/2103.14030 (2021). URL: https://arxiv.org/abs/2103.14030. arXiv:2103.14030.</p>
      <p>[9] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: Introducing convolutions to vision transformers, CoRR abs/2103.15808 (2021). URL: https://arxiv.org/abs/2103.15808. arXiv:2103.15808.</p>
      <p>[10] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, L. Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6877–6886. doi:10.1109/CVPR46437.2021.00681.</p>
      <p>[11] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, A. Torralba, Semantic understanding of scenes through the ADE20K dataset, International Journal of Computer Vision 127 (2019) 302–321.</p>
      <p>[12] H. Caesar, J. R. R. Uijlings, V. Ferrari, COCO-Stuff: Thing and stuff classes in context, CoRR abs/1612.03716 (2016). URL: http://arxiv.org/abs/1612.03716. arXiv:1612.03716.</p>
      <p>[13] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2010) 303–338. URL: http://dblp.uni-trier.de/db/journals/ijcv/ijcv88.html#EveringhamGWWZ10.</p>
      <p>[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.</p>
      <p>[15] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, CoRR abs/1711.05101 (2017). URL: http://dblp.uni-trier.de/db/journals/corr/corr1711.html#abs-1711-05101.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>SegFormer: Simple and efficient design for semantic segmentation with transformers</article-title>
          ,
          <source>CoRR abs/2105.15203</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2105.15203. arXiv:2105.15203.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>