<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Early fusion of Dense Optical Flow with Image for Semantic Segmentation in Autonomous Driving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prashanth Viswanath</string-name>
          <email>prashanth.viswanath@valeo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ganesh Sistu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Ilie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Senthil Yogamani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Horgan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Valeo Vision Systems</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Precise understanding of the scene around the car is of utmost importance to achieve autonomous driving. Convolutional Neural Networks (CNNs) have been widely used for road scene understanding in the last few years with great success. However, most of these networks have a complex architecture that requires a complex system to be deployed in the car. Typical systems today take the input from cameras placed around the car, and the CNNs process these inputs to provide an understanding of the environment. Various hardware manufacturers today include hardware accelerators in their Systems on Chip (SoCs) for certain computer vision tasks, such as Optical Flow (OF) and Stereo Vision (SV), which can achieve good accuracy and fast runtime. If these accelerators can be used in tandem with the CNN to enhance the accuracy of perception, it would be hugely beneficial. In this paper, we explore the possibility of using the Dense Optical Flow output from the hardware accelerator as input along with the image so that CNNs can perceive the scene better and faster. We show that by fusing optical flow and image, the mean Intersection over Union (IoU) of segmentation improves by over 1%, and the accuracy of major classes such as road, person, rider, motorcycle and bicycle improves by 2%, 1%, 5%, 7% and 11% respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks (CNN)</kwd>
        <kwd>Dense Optical Flow (DOF)</kwd>
        <kwd>Stereo Vision (SV)</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Autonomous Driving</kwd>
        <kwd>System on Chip (SoC)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Object detection and localization around the ego vehicle is of great importance
for driver assistance systems and autonomous driving systems. The current trend
is to use convolutional neural networks (CNNs) for the scene perception task and
provide the locations of various objects around the ego vehicle. CNNs are used
for providing semantic information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], object detection information [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], scene 3D reconstruction [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and object motion information [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Various
sensory inputs like camera [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], lidar [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and radar [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have been used by CNNs
to perceive the environment. Despite these efforts, the accurate delineation of
object boundaries remains a challenge.
      </p>
      <p>
        Most state-of-the-art CNNs assume very high compute and often cannot be
deployed in the small systems that are present in cars. There are various
restrictions on systems that can be deployed in the car: thermal footprint, memory
footprint, and placement of the system, which affects how the sensors are connected,
all of which have a direct impact on the cost of these systems. In order to
meet the thermal and memory bandwidth constraints, many hardware
manufacturers are providing accelerators or fixed processing engines for CNNs, dense
optical flow (DOF) and stereo vision (SV) in their System on Chip (SoC) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The typical compute supported by these accelerators is between 1 and 4
Tera-Operations per Second (TOPS) within a power budget of 5W. Given the
limited compute available on the SoC, it is critical to have an optimized CNN
for the perception task and obtain the best performance. Since DOF and SV
engines can be run in tandem on the respective accelerators, it would be very
beneficial if the CNN could take advantage of the motion and depth cues to improve
the perception accuracy. This also helps to make the network smaller while
meeting the accuracy and run-time requirements.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] shows that optical flow is very useful in detecting moving objects like
vehicles and pedestrians. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] show that motion boundaries improve semantic
segmentation. However, in these approaches the DOF output undergoes considerable processing before
it is fused with the image input. In this paper, we propose to leverage the motion
cues by using the DOF outputs from the accelerators with minimal
preprocessing before combining them with the image as an input to the CNN, in order to
have an optimal and real-time implementation on SoCs that can be deployed
in the car. In order to simulate the DOF outputs from hardware accelerators,
we use the OpenCV Farneback [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] function. The OpenCV Farneback function
gives a good representation of the DOF algorithm present in the SoCs, as most
hardware companies benchmark their algorithms against it and generally perform
better. We consider different formats of optical flow data, such as magnitude only,
magnitude and direction, and color wheel format, concatenated with the RGB
channels of the image as input to the CNN, and analyze the resulting performance on the
semantic segmentation task.
      </p>
      <p>The rest of the paper is organized as follows: Section 2 provides information
on the related work. Section 3 details the proposed method for incorporating
optical flow input in the segmentation task. Section 4 shows the experimental results
and discussions. Finally, Section 5 provides concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Semantic Segmentation: [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] were the first to propose an end-to-end CNN for
semantic segmentation. They modified the last layers of the CNN, thus producing a
fully convolutional neural network (FCN). Due to the large receptive fields of
FCNs, the localization of object boundaries is insufficiently precise. In order to
overcome this, many solutions were proposed, such as applying fully connected
conditional random fields (CRFs) to the output of the CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or introducing a
global energy model along with boundary cues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These post-processing steps
require additional parameter tuning and compute time. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed an
encoder-decoder based architecture which requires one fourth of the memory usage and about
half the inference time compared to FCNs, making it an ideal architecture for
efficient segmentation. Figure 1 shows the encoder-decoder type architecture for
semantic segmentation. The encoder extracts features from the image, which are
then decoded to produce the semantic segmentation output. ImageNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
pretrained networks such as VGG16 [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], Resnet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are typically used as the encoder.
In early architectures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], the decoder was a mirror image of the encoder and had
the same complexity. Newer architectures use a relatively smaller decoder. There
can also be additional connections from encoder to decoder. For example, Segnet
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] passes max-pooling indices and U-Net [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] passes intermediate feature maps
to the decoder as well.
      </p>
      <p>
        Motion Estimation: Optical flow is an important step in deriving motion
boundaries. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] use motion boundaries along with images to improve semantic
segmentation. The motion boundaries are computed based on a learning based
prediction proposed in [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. This post-processing of optical flow to obtain motion
boundaries involves additional computation and memory usage, which are unlikely to be
available on SoCs that are deployed in the car. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] consider a two-stream approach
where they have separate encoders to extract features from the image channels
and the DOF channels and concatenate these features. This results in duplicating
the encoder network, which hugely impacts the size and run-time of the network.
Also, the optical flow input is obtained from a Flownet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] type CNN which
outputs a color wheel representation of the optical flow, which requires additional
processing to generate. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] derived motion boundaries from the gradient of
optical flow computed by a traditional computer vision approach and concentrated
on the motion of only a single object in the scene, which is typically not the case in
an autonomous driving system. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also show that motion boundaries can be
leveraged to improve semantic segmentation. However, the motion boundaries
are used as an additional modality in a late fusion post-processing step, which
increases the computation and complexity of the system, similar to [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] use KITTI [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and CamVid [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] datasets respectively, which have very
few images with segmentation annotations. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] uses a total of 1950 frames from
KITTI raw dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses only 367 images for training and 233 images
for testing from the CamVid [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] dataset.
      </p>
      <p>Fig. 3: Different formats of DOF inputs considered. (a) Normalized magnitude of flow vectors. (b) Angle of flow vectors. (c) Color-wheel representation of flow vectors. (d) Scaled magnitude representation of flow vectors. (e) Flow vector in direction x cast into 8-bit. (f) Flow vector in direction y cast into 8-bit.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Method</title>
      <p>In this section, details of our approach are provided. Figure 2 shows the block
diagram explaining the pipeline of our approach. The DOF output is obtained
using the current and previous frame. The DOF output is then concatenated
with the current frame as additional channels before they are input to the CNN.
There are multiple methods of representing the DOF data, as discussed in
Section 3.1. The most important aspect of concatenating optical flow with the image is
the normalization of the optical flow data such that the values of the flow vectors
are in the same range as those of the image pixels. The most effective representation,
which provides optimal run-time and improved segmentation performance, is
determined by various experiments as discussed in Section 4. We propose a method
of scaling the flow vectors by a fixed constant in order to reduce the amount of
additional processing required and still improve semantic segmentation
performance.</p>
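      <p>The early fusion step described above can be sketched as a per-pixel channel concatenation. This is a minimal illustration only: the function name fuse_image_and_flow and the toy 1 x 2 image are hypothetical, and plain Python lists are used so that the sketch has no dependencies. It assumes the flow channels have already been brought into the 0-255 range of the image.</p>
      <preformat>
```python
def fuse_image_and_flow(rgb, flow_channels):
    """Append per-pixel flow channels to an RGB image (early fusion).

    rgb:           H x W list of [r, g, b] values in 0-255
    flow_channels: list of H x W single-channel maps, already in 0-255
    Returns an H x W list of [r, g, b, f1, f2, ...] pixels.
    """
    height, width = len(rgb), len(rgb[0])
    for channel in flow_channels:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("flow channel size must match the image size")
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Image channels first, then one value per flow channel.
            row.append(list(rgb[y][x]) + [channel[y][x] for channel in flow_channels])
        fused.append(row)
    return fused


# Hypothetical 1 x 2 image with a single flow-magnitude channel.
image = [[[10, 20, 30], [40, 50, 60]]]
magnitude = [[0, 255]]
fused = fuse_image_and_flow(image, [magnitude])
print(fused)
```
      </preformat>
      <p>With one flow channel the CNN input grows from three to four channels; a two-channel format such as magnitude and angle would give five.</p>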
      <sec id="sec-3-1">
        <title>Dense Optical Flow data</title>
        <p>
          The DOF data is computed using the OpenCV Farneback DOF algorithm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
The default settings are used to generate the flow output. The Farneback
algorithm outputs 32-bit floating point flow vectors in the x and y directions. From
these, different formats of DOF inputs for the CNN were computed, as
follows:
          <list list-type="bullet">
            <list-item><p>Normalized magnitude: Magnitude is computed from the dx, dy flow vectors
and normalized to the range 0-255 in 8-bit unsigned integer format, to be in the
same range as the image channel input, as shown in Figure 3a.</p></list-item>
            <list-item><p>Angle: Angle of direction is computed from the dx, dy flow vectors and
represented in degrees in the range 0-180 in 8-bit unsigned integer format, as shown in
Figure 3b.</p></list-item>
            <list-item><p>Color wheel format: The flow vectors are represented in the color wheel
format similar to the Middlebury dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where the color represents the
direction of the flow and the intensity of the color represents the magnitude of the flow,
as shown in Figure 3c.</p></list-item>
            <list-item><p>dx, dy: The flow vectors in each direction dx, dy are typecast to 8-bit unsigned
integer format, as shown in Figure 3e and Figure 3f.</p></list-item>
            <list-item><p>Scaled magnitude: Magnitude is computed from the dx, dy flow vectors and
scaled uniformly by a fixed number (255), as shown in Figure 3d. This is done
in order to simplify the preprocessing step.</p></list-item>
          </list>
        </p>
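        <p>The formats listed above (except the color wheel) can be sketched as follows. The sketch takes the float dx, dy maps as given; in practice they would come from a dense flow routine such as OpenCV's calcOpticalFlowFarneback. Several details are assumptions rather than statements from the text: min-max remapping for the normalized magnitude, halving the angle so that 0-360 degrees fits into 0-180, and a saturating cast for dx, dy. The helper names to_uint8 and flow_formats are hypothetical.</p>
        <preformat>
```python
import math


def to_uint8(value):
    """Round and clamp a number into the unsigned 8-bit range 0-255."""
    return max(0, min(255, int(round(value))))


def flow_formats(dx, dy, scale=255.0):
    """Convert H x W float flow maps (dx, dy) into 8-bit input channels."""
    pairs = [list(zip(row_x, row_y)) for row_x, row_y in zip(dx, dy)]
    mag = [[math.hypot(u, v) for u, v in row] for row in pairs]
    lo = min(min(row) for row in mag)
    hi = max(max(row) for row in mag)
    span = hi - lo

    return {
        # Normalized magnitude: per-frame min-max remap to 0-255 (assumed scheme).
        "norm_magnitude": [[to_uint8(255.0 * (m - lo) / span) if span > 0 else 0
                            for m in row] for row in mag],
        # Angle: direction in degrees, halved so 0-360 fits into 0-180 (assumed).
        "angle": [[to_uint8((math.degrees(math.atan2(v, u)) % 360.0) / 2.0)
                   for u, v in row] for row in pairs],
        # dx, dy: saturating cast to unsigned 8-bit, so negative values become 0.
        "dx": [[to_uint8(u) for u, _ in row] for row in pairs],
        "dy": [[to_uint8(v) for _, v in row] for row in pairs],
        # Scaled magnitude: one fixed factor for every pixel and every frame.
        "scaled_magnitude": [[to_uint8(m * scale) for m in row] for row in mag],
    }


# Hypothetical 2 x 2 flow field, in pixels per frame.
dx = [[3.0, 0.0], [-1.0, 0.0]]
dy = [[4.0, 0.0], [0.0, 2.0]]
channels = flow_formats(dx, dy)
for name, channel in channels.items():
    print(name, channel)
```
        </preformat>
        <p>Only the normalized magnitude needs per-frame statistics (the minimum and maximum); the scaled magnitude is a single multiply and clamp per pixel, which is what makes it cheaper to compute.</p>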
        <p>All the above formats of DOF output were considered for fusion with
image channels to evaluate the performance of semantic segmentation, which is
discussed in Section 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Segmentation</title>
        <p>
          An encoder-decoder type architecture similar to MultiNet [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] is used for the
segmentation task. The encoder is Resnet10 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] architecture and the decoder is
a cut down version of FCN8 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] architecture, with only three upsample layers
similar to the MultiNet [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] architecture. The encoder and decoder are combined
similar to the MultiNet architecture, where the intermediate layers from the
encoder are connected to the decoder using skip connections. The network
architecture is as shown in Figure 4. Different inputs for the motion stream along
with the image are considered with the same network architecture. Pixel-wise cross
entropy loss is used for the network. In order to compensate for the low
representation of certain classes, other loss functions such as the median frequency based
weighted cross entropy loss function [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and alpha focal loss [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] were also tried.
        </p>
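        <p>As an illustration of the median frequency based weighting mentioned above, the following sketch uses the commonly cited definition of median frequency balancing: each class weight is the median class frequency divided by that class's frequency, where a class frequency is its pixel count divided by the total pixels of the images in which it appears. Whether the experiments used exactly this definition is an assumption, and the function names and toy label maps are hypothetical.</p>
        <preformat>
```python
import math
from statistics import median


def median_frequency_weights(label_maps, num_classes):
    """Median frequency balancing: weight_c = median(freq) / freq_c.

    freq_c = pixels of class c / total pixels of the images that contain c.
    """
    class_pixels = [0] * num_classes
    image_pixels = [0] * num_classes
    for labels in label_maps:
        flat = [c for row in labels for c in row]
        for c in set(flat):
            class_pixels[c] += flat.count(c)
            image_pixels[c] += len(flat)
    freqs = [class_pixels[c] / image_pixels[c] if image_pixels[c] else 0.0
             for c in range(num_classes)]
    med = median([f for f in freqs if f > 0])
    # Classes that never occur get weight 0 so they do not affect the loss.
    return [med / f if f > 0 else 0.0 for f in freqs]


def weighted_cross_entropy(probs, target, weights):
    """Loss for one pixel: -w[target] * log(p[target])."""
    return -weights[target] * math.log(probs[target])


# Hypothetical 2 x 2 label maps with three classes.
label_maps = [
    [[0, 0], [0, 1]],
    [[0, 0], [2, 2]],
]
weights = median_frequency_weights(label_maps, num_classes=3)
print(weights)
print(weighted_cross_entropy([0.7, 0.2, 0.1], target=1, weights=weights))
```
        </preformat>
        <p>Rare classes receive weights above 1 and frequent classes weights below 1, which is consistent with well-represented classes being weighted less.</p>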
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>In this section, the experimental setup and the results of various experiments
are detailed.</p>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>
          The proposed framework is trained and tested on the challenging Cityscapes
dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Although there exist other motion segmentation datasets such as
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], they are either synthetic [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], relatively small [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
or have limited camera motion [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] unlike what is present in autonomous driving
scenes. The Cityscapes dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] provides 5000 images with fine pixel-wise
annotations, along with the sequence of images which can be used to compute
DOF data. Out of 5000 images, 2975 images are used for training and 500 images
are used for evaluation. The results presented for the various experiments are
based on the evaluation set. For computing DOF, only two frames (current
and previous frames) were considered in order to mimic the actual hardware
accelerator setup.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Setup</title>
        <p>
          For all the experiments, the network architecture is kept the same, as shown in Figure
4. A baseline with the network configuration using image-only input is obtained
first. The Adam optimizer is used with a learning rate of 5e-5. No learning rate
decay is used during training, and L2 regularization is applied.
The network is trained for a maximum of 30 epochs with early stopping based
on validation loss with a patience of 5. The encoder is initialized with
the Resnet pretrained weights on Imagenet [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and the transposed convolution
layers of the decoder are initialized to bilinear upsampling, while training the
network with image-only input. For the network structure with additional
channel input from optical flow data, the pretrained weights from the network trained
using image-only input are used. The image resolution is 1024x512.
        </p>
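        <p>The training schedule described above can be sketched as a configuration plus an early stopping rule. The values in CONFIG are those stated in the text; the L2 regularization strength is not given and is therefore left out. The function run_early_stopping and the list of validation losses are hypothetical, and the sketch replays pre-recorded losses instead of training a network.</p>
        <preformat>
```python
# Settings stated in the text; the L2 strength is not given, so it is omitted.
CONFIG = {
    "optimizer": "adam",
    "learning_rate": 5e-5,
    "lr_decay": None,
    "max_epochs": 30,
    "patience": 5,
    "input_size": (1024, 512),
}


def run_early_stopping(validation_losses, max_epochs, patience):
    """Replay per-epoch validation losses; return (epochs_run, best_epoch)."""
    best_loss = float("inf")
    best_epoch = 0
    stale_epochs = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        epochs_run = epoch
        if best_loss > loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch


# Hypothetical validation losses, one per epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.75, 0.74, 0.6]
print(run_early_stopping(losses, CONFIG["max_epochs"], CONFIG["patience"]))
```
        </preformat>
        <p>In this toy run the best loss occurs at epoch 3 and training stops after epoch 8, so the later drop at epoch 9 is never reached; this is the usual trade-off of a small patience value.</p>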
        <p>The evaluation metrics used in the segmentation are global average accuracy,
precision, recall, F1-score and mean intersection over union (IoU). The individual
class accuracies are also evaluated based on the confusion matrix results.</p>
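        <p>For reference, the metrics named above can all be derived from a confusion matrix, as in the sketch below. It assumes rows are ground truth classes and columns are predictions, and that the per-class accuracy is the fraction of a class's pixels that are labeled correctly; the text does not spell out these conventions. The function name and the two-class matrix are hypothetical.</p>
        <preformat>
```python
def segmentation_metrics(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    precision, recall, f1, iou = [], [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                       # truly c, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
        iou.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
    return {
        "global_accuracy": correct / total,
        "class_accuracy": recall,  # assumed: correct pixels / pixels of that class
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
        "mean_iou": sum(iou) / n,
    }


# Hypothetical two-class confusion matrix (pixel counts).
confusion = [
    [50, 10],
    [5, 35],
]
metrics = segmentation_metrics(confusion)
for name, value in metrics.items():
    print(name, value)
```
        </preformat>
        <p>Mean IoU penalizes both false positives and false negatives per class, so it can stay low when small classes are poorly segmented even if the global accuracy is high.</p>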
        <p>Fig. 5 and Fig. 6: Sample segmentation results for different inputs. Panels: (a) image only; (b) image and normalized magnitude; (c) image and color format flow; (d) image, normalized magnitude and angle flow; (e) image and dx, dy flow; (f) image and fixed scaled magnitude; (g) image only, using weighted cross entropy loss; (h) image and normalized magnitude, using weighted cross entropy loss; (i) ground truth segmentation of the image; (j) ground truth segmentation overlaid on the image.</p>
        <p>Union (mean IoU). A closer look at other metrics shows that using the
normalized magnitude as shown in Figure 3a provides the best precision, with a significant
increase in class accuracies for road, rider, car, motorcycle and bicycle, but
decreased class accuracies for person and sidewalk. The flow magnitudes for person
and sidewalk are very small, so after normalization they are close to zero. The
mean IoU is lower than the state of the art. This is due to two factors:
          <list list-type="bullet">
            <list-item><p>The small size of the network used to obtain real-time performance.</p></list-item>
            <list-item><p>The under-represented classes of the Cityscapes dataset, such as pole, wall, fence, truck, bus, train,
motorcycle, rider, traffic sign and traffic light.</p></list-item>
          </list>
The per-class IoU improves significantly for all moving objects such as persons,
riders, cars and bicycles, as shown in Table 2. Computing the normalized
magnitude involves a significant amount of preprocessing. First, the distribution of the
flow in an image has to be computed and then remapped to the 0-255 range by
multiplying each flow with a different scaling factor. In order to reduce the amount
of preprocessing, a simple fixed scaling of magnitude was implemented, where the
magnitude of each flow vector was multiplied by 255, as shown in Figure
3d; the results are shown in row 6 of Table 1. As can be seen, the overall
metrics are improved further. The accuracy for person is also improved due to
the scaling, as compared to the normalized magnitude approach. The proposed
scaling approach scales any flow vector greater than 0 to 255, thereby removing
the importance of flow vectors for objects that are moving faster, essentially
converting it into a binary image. The scaling factor can be adjusted to maintain
the importance of fast moving objects and can even be a learned parameter.
Figure 5 and Figure 6 show sample results of segmentation considering various
formats of input to the network. One interesting observation from the results is
the improvement in accuracy in the segmentation of the road class, which is
counter-intuitive. This is because the optical flow is inaccurate on the road surface and
hence typically made invalid or void for those regions, thus helping the CNN to
classify the road class better.</p>
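        <p>The effect of the fixed scaling factor can be illustrated with a few numbers. The sketch assumes the product is rounded and clamped to the 8-bit range; the magnitudes and the alternative factor of 10 are illustrative values, not taken from the experiments, and scale_magnitude is a hypothetical helper.</p>
        <preformat>
```python
def scale_magnitude(magnitudes, scale):
    """Multiply flow magnitudes (in pixels) by a fixed factor and clamp to 0-255."""
    return [min(255, int(round(m * scale))) for m in magnitudes]


# Hypothetical magnitudes: static background, slow object, faster objects.
magnitudes = [0.0, 0.4, 1.0, 6.0, 25.0]

for scale in (255.0, 10.0):
    print(scale, scale_magnitude(magnitudes, scale))
```
        </preformat>
        <p>Under these assumptions a factor of 255 saturates every magnitude of about one pixel or more, which gives the nearly binary moving/static channel described above, while a smaller factor keeps slow and fast objects distinguishable at the cost of a weaker response to small motions.</p>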
        <p>
          Experiments with different loss functions such as weighted cross entropy and
alpha focal loss were tried in order to improve the segmentation of classes such as
rider, motorcycle and bicycle, which are under-represented. Rows 7 and 8 of Table
1 show the results of the network trained with image and image + normalized
magnitude of optical flow input respectively, using weighted cross entropy loss.
Median frequency [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] of classes was used to weight the loss function accordingly.
The results clearly show that the under-represented classes such as rider, motorcycle and
bicycle improve substantially. However, the other classes such as road, pedestrian and
car, which have good representation in the dataset, suffer as they are weighted less.
Along the same lines, alpha focal loss [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] was also tried, but no improvement
was observed.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we explored combining dense optical flow data in various formats
with the image to improve semantic segmentation. We have shown that by
combining the normalized magnitude of optical flow with the image, the accuracy for
segmenting moving objects and road improves. We also present a simpler
method that scales the magnitude of optical flow and combines it with the image, thereby
reducing the amount of preprocessing needed while still improving the
segmentation results. Furthermore, the scaling parameter could be determined by a learning
approach. DOF and CNN accelerators are present in several SoCs, and hence we
have provided an analysis of how best to utilize them in the SoC.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Movidius Myriad X VPU</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>Product specifications of the R-Car V3H</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. S32v234:
          <article-title>64-bit multi-core a53 processor for vision and adas applications</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Badrinarayanan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cipolla</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Segnet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          .
          <source>arXiv preprint arXiv:1511.00561</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szeliski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A database and evaluation methodology for optical flow</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>92</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>31</lpage>
          (Mar
          <year>2011</year>
          ). https://doi.org/10.1007/s11263-010-0390-2
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bertasius</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torresani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Semantic segmentation with boundary neural fields</article-title>
          .
          <source>CoRR abs/1511.02674</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1511.02674
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Brostow</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fauqueur</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cipolla</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Semantic object classes in video: A high-definition ground truth database</article-title>
          .
          <source>Pattern Recogn. Lett</source>
          .
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>88</fpage>
          –
          <lpage>97</lpage>
          (Jan
          <year>2009</year>
          ). https://doi.org/10.1016/j.patrec.2008.04.005
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feris</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasconcelos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>A unified multi-scale deep convolutional neural network for fast object detection</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . pp.
          <fpage>354</fpage>
          –
          <lpage>370</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Capobianco</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Facheris</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuccoli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marinai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Vehicle classification based on convolutional networks applied to fm-cw radar signals</article-title>
          .
          <source>arXiv preprint arXiv:1710.05718v3</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papandreou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kokkinos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuille</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs</article-title>
          .
          <source>arXiv preprint arXiv:1606.00915</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cordts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehfeld</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Enzweiler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benenson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franke</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The cityscapes dataset for semantic urban scene understanding</article-title>
          .
          <source>In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In: CVPR09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Farneback, G.:
          <article-title>Two-frame motion estimation based on polynomial expansion</article-title>
          . In: Bigun,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Gustavsson</surname>
          </string-name>
          , T. (eds.)
          <article-title>Image Analysis</article-title>
          . pp.
          <volume>363</volume>
          {
          <fpage>370</fpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dosovitskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Hausser,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Hazirbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Golkov</surname>
          </string-name>
          , V., van der Smagt,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Cremers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>Flownet: Learning optical flow with convolutional networks</article-title>
          .
          <source>CoRR abs/1504.06852</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1504.06852
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stiller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Vision meets robotics: The kitti dataset</article-title>
          .
          <source>International Journal of Robotics Research (IJRR)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>CoRR abs/1512.03385</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1512.03385
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oramas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuytelaars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leuven</surname>
            ,
            <given-names>K.D.V.K.U.</given-names>
          </string-name>
          :
          <article-title>Do motion boundaries improve semantic segmentation?</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Focal loss for dense object detection</article-title>
          .
          <source>CoRR abs/1708.02002</source>
          (
          <year>2017</year>
          ), http://arxiv.org/abs/1708.02002
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Ssd: Single shot multibox detector</article-title>
          .
          <source>In: European conference on computer vision</source>
          . pp.
          <fpage>21</fpage>
          –
          <lpage>37</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>3431</fpage>
          –
          <lpage>3440</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mayer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Hausser,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Cremers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation</article-title>
          .
          <source>CoRR abs/1512.02134</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1512.02134
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ochs</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Segmentation of moving objects by long term video analysis</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>36</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1187</fpage>
          –
          <lpage>1200</lpage>
          (Jun
          <year>2014</year>
          ), http://lmb.informatik.uni-freiburg.de/Publications/2014/OB14b, preprint
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Papazoglou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Fast object segmentation in unconstrained video</article-title>
          .
          <source>In: Proceedings of the 2013 IEEE International Conference on Computer Vision</source>
          . pp.
          <fpage>1777</fpage>
          –
          <lpage>1784</lpage>
          . ICCV '13, IEEE Computer Society, Washington, DC, USA (
          <year>2013</year>
          ). https://doi.org/10.1109/ICCV.2013.223, http://dx.doi.org/10.1109/ICCV.2013.223
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaurasia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Culurciello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Enet: A deep neural network architecture for real-time semantic segmentation</article-title>
          .
          <source>arXiv preprint arXiv:1606.02147</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Perazzi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pont-Tuset</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McWilliams</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorkine-Hornung</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A benchmark dataset and evaluation methodology for video object segmentation</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ronneberger</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          .
          <source>CoRR abs/1505.04597</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1505.04597
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Siam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahgoub</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zahran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yogamani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagersand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Modnet: Moving object detection network with motion and appearance for autonomous driving</article-title>
          .
          <source>arXiv preprint arXiv:1709.04821v2</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1409.1556
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Teichmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Zollner,
          <string-name>
            <given-names>J.M.</given-names>
            ,
            <surname>Cipolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Urtasun</surname>
          </string-name>
          , R.: Multinet:
          <article-title>Realtime joint semantic reasoning for autonomous driving</article-title>
          .
          <source>CoRR abs/1612.07695</source>
          (
          <year>2016</year>
          ), http://arxiv.org/abs/1612.07695
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Usenko</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Engel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuckler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cremers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Reconstructing street-scenes in real-time from a driving car</article-title>
          .
          <source>In: International Conference on 3D Vision</source>
          . IEEE (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Weinzaepfel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Revaud</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harchaoui</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning to detect motion boundaries</article-title>
          .
          <source>2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          pp.
          <fpage>2578</fpage>
          –
          <lpage>2586</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zelener</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Cnn-based object segmentation in urban lidar with missing points</article-title>
          .
          <source>In: Fourth International Conference on 3D Vision</source>
          . IEEE (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>