<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autonomous Driving in Simulation using Domain-Independent Perception</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Audun Wigum Arbo</string-name>
          <email>audun.wa@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Even Dalen</string-name>
          <email>even.dalen@live.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Lindseth</string-name>
          <email>frankl@ntnu.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Trondheim</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>ions generalize better than a model trained directly on RGB images in simulation, even when the perception model is trained on real-world data. We also show that a perception model trained on several tasks using multi-task learning leads to better-performing driving policies than one that learns only semantic segmentation.</p>
      </abstract>
      <kwd-group>
        <kwd>End-to-end Autonomous Driving</kwd>
        <kwd>AV Domain Transfer</kwd>
        <kwd>Multi-task Learning</kwd>
        <kwd>Conditional Imitation Learning</kwd>
        <kwd>Semantic Segmentation</kwd>
        <kwd>Depth Estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Autonomous vehicles have been a popular research domain for many years, and
there have recently been large investments from both technology and car
companies to be the first to solve the problem. The most prominent approach in
recent years has been the modular approach, where the driving is divided into
several sub-tasks such as perception, localization, and planning. The modular
approach often results in a very complex solution, where each module has to be
fine-tuned individually. The scalability of this approach can therefore become an
issue when expanding the approach to more complex situations.</p>
      <p>Another rising approach is the end-to-end approach, where the entire driving
policy is generated within a single system. The system takes sensor input and
converts it directly to driving commands, similar to how humans drive vehicles.
End-to-end systems for autonomous vehicles require large amounts of data, and
the ability to train on many different scenarios. Therefore, simulated
environments have been explored for training in different scenarios and creating large
datasets. These environments, however, differ significantly from the real world,
and the learned driving policy does not transfer adequately between
environments.</p>
      <p>
        This paper attempts to improve the ability for driving policies to be
transferred between domains by abstracting away both the perception task, and the
raw throttle and brake control of the vehicle, focusing mainly on the perception
task. The Mapillary Vistas dataset [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is used for learning perception in a
real-world driving environment, and the autonomous vehicle simulator CARLA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is
used to learn both driving and perception. The ultimate goal of this paper is to
reduce the amount of real-world data required to train an autonomous vehicle,
by utilizing simulated environments for training.
      </p>
      <p>The paper is structured as follows: Section 2 investigates related work, while
additionally providing a brief history of the field itself. Section 3 presents our
method, including the data, neural network architectures, and evaluation
metrics. Section 4 describes our experiments, their results, and discussion related to
these. Section 5 discusses the overall implications of the experimental results, and
compares our results with conclusions from related work. Section 6 draws a final
conclusion of the work conducted, and addresses the paper's merits, weaknesses,
and potential future work.</p>
    </sec>
    <sec>
      <title>Related Work</title>
      <p>
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] arrange autonomous vehicle control algorithms into two categories:
modular approaches and end-to-end approaches. Modular approaches divide the
responsibility of driving into several sub-tasks, such as perception, localization,
planning, and control. Conversely, end-to-end approaches can be defined as a
function f(x) = a, where x is any input needed to make decisions (typically
sensor data and environmental information) and a denotes the output controls that
are sent to the vehicle's actuators.
      </p>
      <p>
        The end-to-end approach was first demonstrated in the ALVINN project,
described by [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. ALVINN was able to follow simple tracks, but had no means
to handle more complex environments. Since then, large advancements have
been made within neural networks, resulting in new research within end-to-end
vehicle control. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] approaches the problem using modern techniques, and
showcases a driving policy capable of driving on both highways and residential roads
in varied weather conditions. More recent approaches [
        <xref ref-type="bibr" rid="ref14 ref15 ref20 ref26 ref6">6,20,15,14,26</xref>
        ] are based
on Conditional Imitation Learning (CIL), introduced by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in 2017, where the
driving policy is given instructions, called high-level commands (HLCs), on which
actions to take (e.g. turn left in next intersection). [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] shows that an architecture
can be re-used for both simulated and physical environments, but they make no
attempt to use the same model weights across the two domains. Codevilla et
al. output a steering angle, and either throttle or brake, which are sent to the
vehicle's control systems. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] outputs the target speed of the vehicle, leaving
the raw throttle and brake adjustments to a lower-level system. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] proposes
to abstract the commands even further, into several waypoints in space. Their
model outputs two waypoints, 5 and 20 meters away from the vehicle, which a
PID controller uses to control the vehicle's steering and velocity.
      </p>
      <p>
        Transfer from simulation to real world. Many studies have been conducted
on transferring learning from simulation to the real world. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] used images from
the driving game Grand Theft Auto to train their object detection model, and
achieved state-of-the-art performance on the KITTI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Cityscapes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
datasets. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] successfully used simulation to train a model for robotic grasping of
physical unseen objects. Among the techniques used was applying randomization
in the form of random textures, lighting, and camera position, to enable their
model to generalize from the simulated source domain to their physical target
domain.
      </p>
      <p>
        Transferring driving policies between domains also requires an abstraction
of the perception data. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] uses a perception model to generate segmentation
maps which are forwarded to the driving model, in order to generate similar
perception environments for both simulation and real-world. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] combines
ground-truth segmentation and depth data from CARLA to increase driving
performance. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] uses an encoder-decoder network with three decoder heads
(segmentation, depth estimation, and original RGB reproduction) to maximize the
model's scene understanding. Hawke et al. also remove the decoding process
when training their driving policy, making their driving policy model take only
the compressed encoding of scene understanding as input. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] finds that the
performance of such multi-task prediction models depends highly on the relative
weighting between each task's loss. Tuning these weights manually is an
error-prone and time-consuming process, and they therefore suggest a solution for
tuning weights based on the homoscedastic uncertainty of each task. They show that
the multi-task approach outperforms separate models trained individually. The
uncertainty-based weighting was later used by Hawke et al. and produced good
results for generating an optimal encoding of a driving scene. Depth images have also
proven useful in other simulation-to-real-world knowledge
transfers, such as robotic grasping [
        <xref ref-type="bibr" rid="ref12 ref25">25,12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data and Methods</title>
      <p>Our approach consists of two separately trained models: a perception model
and a driving model. The reason for this separation is to decouple the task of
scene understanding from the task of driving. This opens up the possibility of
improving the tasks independently, and we can train the models separately and
with different datasets. An important goal was then to make the output of the
perception model domain-independent, which in turn makes the driving policy
model domain-independent. Domain independence is in this context defined as
the ability to generalize between multiple domains (e.g. simulated and real), as
well as completely unseen domains.</p>
      <p>The perception model is trained on datasets containing RGB images paired
with semantic segmentation and depth information. These datasets can contain
images not directly related to driving, as they are used to train a general scene
understanding.</p>
      <p>The driving model is trained on datasets recorded from an expert driver. The
dataset contains RGB images and driving data such as steering angle, current
speed and target speed. The datasets for both models can either be collected
from the real world, generated from the CARLA simulator, or a combination of
both real-world and simulated data.</p>
      <sec id="sec-2-1">
        <title>Perception Model</title>
        <p>
          The perception model takes raw RGB images as input, and tries to predict one
or more outputs related to scene understanding: always semantic segmentation,
and in some experiments an additional depth map. The model has an
encoder-decoder structure, compressing the input into a layer with few neurons (encoder)
before expanding towards one or more prediction outputs (decoders). To train
the model, data from driving situations in different environments and
geographical areas are used. Some experiments also use data generated from CARLA as a
means to improve the model's performance in simulated environments.
        </p>
        <p>
          Data. The Mapillary Vistas dataset [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] (henceforth Mapillary) was used for
RGB and ground-truth semantic segmentation data. The dataset consists of
25 000 high-resolution images from different driving situations, with a large
variety of weather and geographical locations. To simplify the environment for
the perception network, the number of classes for segmentation was reduced from
the original 66 object classes to five classes: unlabeled, road, lane markings,
humans, and vehicles. To train the model's depth decoder, ground-truth depth
maps were generated using the Monodepth2 network from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], as Mapillary
lacks this information. Figure 1 shows a sample from this dataset.
        </p>
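        <p>The class reduction described above can be viewed as a lookup from original class indices to the five target classes. The following is a minimal sketch under stated assumptions: the original class names in GROUPS are hypothetical examples rather than the dataset's actual label list, and any class not listed is assumed to fall back to "unlabeled".</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Five target classes named in the text; index 0 is the fallback.
TARGET_CLASSES = ["unlabeled", "road", "lane markings", "humans", "vehicles"]

# Hypothetical grouping of original class names (illustrative only;
# the real dataset defines 66 classes with its own names).
GROUPS = {
    "road": ["road"],
    "lane markings": ["lane-marking-general", "lane-marking-crosswalk"],
    "humans": ["person", "bicyclist"],
    "vehicles": ["car", "bus", "truck"],
}


def build_lookup(original_names):
    """Return a list mapping each original class index to a target class index."""
    name_to_target = {}
    for target, names in GROUPS.items():
        for name in names:
            name_to_target[name] = TARGET_CLASSES.index(target)
    # Assumption: every class that is not listed becomes "unlabeled" (index 0).
    return [name_to_target.get(name, 0) for name in original_names]


def remap_mask(mask, lookup):
    """Apply the lookup to a 2D mask given as nested lists of class indices."""
    return [[lookup[c] for c in row] for row in mask]
```
]]></preformat>
        <p>With a real label list, build_lookup would be called once, and remap_mask applied to every ground-truth mask before training.</p>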
        <p>
          In addition to using real-world perception data, we generated synthesized
data in CARLA. A Python script spawns a large variety of vehicles and
pedestrians, and captures RGB, semantic segmentation, and depth data from the
vehicles as they navigate the simulated world. The field of view (FOV) and
camera yaw angle were randomly distributed to generalize between different camera
setups. The simulated weather was additionally changed periodically, varying
cloudiness, amount of rain, time of day, and other modifiable weather
parameters in CARLA. The final size of this synthesized dataset is about 20 000 images.
        </p>
        <p>
          Architecture. Several encoders and decoders were explored when deciding the
model's architecture. Encoders tested were: MobileNet [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], ResNet-50 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and
a vanilla CNN, while decoders tested were: U-Net [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and SegNet [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. To
generate a network that could predict both depth and segmentation estimations, we
modified the existing MobileNet-U-Net architecture to include a second U-Net
decoder. The decoder was modified to predict only one value per pixel, use the
sigmoid activation function, and train with a regression loss function for depth
estimation, adapted from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Figure 2 illustrates the new MobileNet-U-Net with
two decoders.
        </p>
        <p>Evaluation and Metrics. The segmentation prediction was evaluated using
Intersection over Union (IoU), calculated with the following equation:
<inline-formula><tex-math>\mathrm{IoU} = \frac{gt \cap p}{gt \cup p}</tex-math></inline-formula>, where
gt is the ground truth segmentation and p is the predicted segmentation. Mean
IoU was used as the main indicator for performance, calculated by taking the
mean of the class-wise IoU. Frequency weighted IoU was also calculated,
measured as the mean IoU weighted by the number of pixels for each class.</p>
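        <p>The segmentation metrics can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used in the paper: it assumes flattened label sequences, skips classes that appear in neither the ground truth nor the prediction, and weights by ground-truth pixel counts for the frequency weighted variant.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def class_iou(gt, pred, cls):
    """IoU of one class over flat label sequences; None if the class never occurs."""
    inter = sum(1 for g, p in zip(gt, pred) if g == cls and p == cls)
    union = sum(1 for g, p in zip(gt, pred) if g == cls or p == cls)
    return inter / union if union else None


def mean_iou(gt, pred, num_classes):
    """Unweighted mean of the class-wise IoU values."""
    ious = [class_iou(gt, pred, c) for c in range(num_classes)]
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


def frequency_weighted_iou(gt, pred, num_classes):
    """Class-wise IoU weighted by each class's share of ground-truth pixels."""
    total = len(gt)
    score = 0.0
    for c in range(num_classes):
        iou = class_iou(gt, pred, c)
        if iou is not None:
            freq = sum(1 for g in gt if g == c) / total
            score += freq * iou
    return score
```
]]></preformat>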
        <p>
          The accuracy within threshold, as described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], was chosen as the metric
for depth estimation. Given the predicted depth value d<sub>p</sub> and the ground truth
depth value d<sub>gt</sub>, the accuracy within threshold th is defined as
<inline-formula><tex-math>\max\left(\frac{d_{gt}}{d_p}, \frac{d_p}{d_{gt}}\right) = \delta &lt; th</tex-math></inline-formula>.
Each pixel gets labeled as true or false based on whether the pixel is
within the specified threshold or not. The accuracy of an image is then calculated
by taking the average of all the pixels in the image. th is a threshold that we
varied between the values 1.25, 1.25<sup>2</sup>, and 1.25<sup>3</sup>, as in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
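        <p>As a worked illustration of the accuracy-within-threshold metric, the sketch below computes the fraction of pixels whose depth ratio stays under the threshold. It assumes flattened sequences of strictly positive depth values; the function name and the sample numbers are illustrative only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def accuracy_within_threshold(pred, gt, th=1.25):
    """Fraction of pixels where max(d_gt / d_p, d_p / d_gt) is below th.

    Assumes all depth values are strictly positive.
    """
    flags = [max(g / p, p / g) < th for p, g in zip(pred, gt)]
    return sum(flags) / len(flags)


# Illustrative values: per-pixel ratios are 1.0, 1.2, 2.0 and 1.3.
pred = [1.0, 2.0, 4.0, 1.0]
gt = [1.0, 2.4, 2.0, 1.3]
scores = [accuracy_within_threshold(pred, gt, 1.25 ** k) for k in (1, 2, 3)]
```
]]></preformat>
        <p>Raising the threshold from 1.25 to its square and cube can only keep or increase the score, since each larger threshold accepts every pixel the smaller one did.</p>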
      </sec>
      <sec id="sec-2-2">
        <title>Driving Model</title>
        <p>The driving model runs raw RGB images through the perception model, and
uses its output segmentation and depth predictions as input. These images are
coupled with driving data recorded from an expert driver. The driving model
processes these inputs through its own layers, before outputting a steering angle
and target speed.</p>
        <p>Data. The driving data was generated in CARLA version 0.9.9. This was done
by making an autopilot control a car in various environments, and recording
video from three forward-facing cameras, its steering angle, speed, target speed,
and HLC (left, right, straight, or follow lane). The autopilot has access to the full
state of the world, which includes an HD map, its own location and velocity, and
the states of other vehicles and pedestrians. It uses this information to generate
waypoints, which are finally fed into a PID-based controller to apply throttle,
brake, and steering angle. The collected training data was unevenly distributed
with regard to HLCs and steering angles, and we therefore down-sampled
overrepresented values for an improved data distribution.</p>
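        <p>The text does not specify how the down-sampling was carried out. One plausible reading, shown here purely as an assumption, is to bin the samples by a key such as the HLC and to cap the number of samples kept per bin. The function name, the cap, and the seed are all hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def downsample(samples, key, max_per_bin, seed=0):
    """Keep at most max_per_bin samples for each value returned by key."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    bins = defaultdict(list)
    for sample in samples:
        bins[key(sample)].append(sample)
    kept = []
    for members in bins.values():
        if len(members) > max_per_bin:
            # Overrepresented bin: draw a random subset without replacement.
            members = rng.sample(members, max_per_bin)
        kept.extend(members)
    return kept
```
]]></preformat>
        <p>The same function could be applied to steering angles by passing a key that rounds the angle to a bin edge.</p>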
        <p>
          Various datasets were gathered for training the driving policy, all of which
were collected in Town01. These differ in complexity: the magnitude of steering
noise the autopilot has to account for, the weather conditions,
and the light conditions. 30 641 samples were collected in total, where the
weather varied according to CARLA's 15 default weather presets. The training
data was effectively multiplied by three, as we made two copies of each data
point, where we used the recorded image from each side camera instead of the
main camera. To adjust for the slightly modified camera perspective, we added
an offset of 0.05 and -0.05 in steering angle for the left and right
camera variants, respectively. This technique was first introduced by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and has since been used
successfully in other papers [
          <xref ref-type="bibr" rid="ref14 ref6">6,14</xref>
          ].
        </p>
        <p>
          Architecture. The input of the driving model is a concatenation of the output
from the perception model and an information vector containing the HLC
(one-hot encoded), current speed, current speed limit, and the upcoming traffic light's
state. The driving model is trained on simulation data with all the layers of
the perception model frozen (that is, non-trainable) to preserve generalizability.
Figure 3 shows an overview of the model.
        </p>
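        <p>The side-camera augmentation described in the data paragraph above can be sketched as follows. The dictionary keys are hypothetical; only the offsets of 0.05 and -0.05 for the left and right cameras come from the text, and no clipping of the resulting angle is applied in this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SIDE_OFFSET = 0.05  # steering offset stated in the text


def expand_with_side_cameras(sample):
    """Turn one recorded sample into three training samples.

    `sample` is assumed to hold the three camera images under the
    hypothetical keys "center", "left" and "right", plus "steering".
    """
    steer = sample["steering"]
    return [
        {"image": sample["center"], "steering": steer},
        # Left camera variant: add the offset to the recorded angle.
        {"image": sample["left"], "steering": steer + SIDE_OFFSET},
        # Right camera variant: subtract the offset.
        {"image": sample["right"], "steering": steer - SIDE_OFFSET},
    ]
```
]]></preformat>
        <p>Applying this to every recorded sample triples the dataset, which matches the effective multiplication by three mentioned above.</p>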
        <p>[Figure 3: Overview of the driving model. Inputs: traffic light state, speed limit, current speed, and HLC. Outputs: steering angle and speed.]</p>
        <p>
          The segmentation and depth outputs of the perception model are
concatenated channel-wise, and resemble an RGBD (RGB + depth) image. This
representation is then run through 5 convolutional blocks, each consisting of zero
padding of 1, 2D convolution with kernel size 3, batch normalization, ReLU
activation, and finally max-pooling with pool size 2. The filter counts are 64, 128, 256,
256, and 256, respectively. The current HLC, whether the traffic light was red or not,
speed, and speed limit are concatenated with feature vectors generated from the
perception data. The last layers are a combination of fully-connected layers,
where we concatenate the HLC vector at each step, similar to [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The first
output of the model is the steering prediction: one neuron outputting the optimal
steering (between 0 and 1, 0 being max leftward, 1 being max rightward), later
mapped to CARLA's [-1, 1] range. The second output is the optimal vehicle
speed, output as a fraction of 100 km/h (between 0 and 1).
        </p>
        <p>Evaluation and Metrics. The main metric used for measuring driving model
performance was Mean Completion Rate (MCR) during real-time evaluation.
This is calculated by dividing the completed distance d<sub>c</sub> by the total route
distance d<sub>t</sub> of each run-through r of a route, averaged over all run-throughs R:
<disp-formula><tex-math>\mathrm{MCR} = \frac{1}{|R|} \sum_{r \in R} \frac{d_c}{d_t}</tex-math></disp-formula>
Traffic violations were not included as metrics, as the scope of this
paper is mainly within completing routes without major incidents, and the models
were therefore not trained to avoid such violations. The model's validation loss
was also used as a rough metric for performance. Based on empirical observations, we
only picked models with a validation loss of roughly 0.03 or lower for further evaluation. The
validation loss metric was used as an initial performance estimation because the
MCR evaluation was considerably more time-consuming.</p>
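        <p>The Mean Completion Rate can be computed directly from the per-run distances. The sketch below assumes that each run-through is recorded as a pair of completed distance and total route distance; the function name and input layout are illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def mean_completion_rate(runs):
    """Average of completed / total distance over all run-throughs.

    `runs` is a list of (completed_distance, total_route_distance) pairs,
    with both distances in the same unit.
    """
    rates = [d_c / d_t for d_c, d_t in runs]
    return sum(rates) / len(rates)
```
]]></preformat>
        <p>For example, three runs completing 100 of 100 m, 50 of 200 m, and 30 of 120 m have completion rates of 1.0, 0.25, and 0.25, giving an MCR of 0.5.</p>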
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <p>There are two main experiments conducted in this paper. The first experiment
and its sub-experiments focus on generating the best perception model to
be used when training the driving network. Model architecture, dataset
variants, augmentation, and multi-task learning are parameters experimented with
to increase performance. The second experiment is conducted in CARLA. This
experiment assesses the driving policy performance given the different models
derived in the first set of experiments. The generalizability of each model is tested
using different unseen environments. Each perception model is then compared
to a baseline model trained only on the CARLA dataset using Mean Completion
Rate as the metric.</p>
      <sec id="sec-3-1">
        <title>Experiment 1: Perception Model</title>
        <p>The perception experiments use semantic segmentation and occasionally depth
estimation to generalize the driving environment when training and testing the
driving models. All of the perception experiments use the same dataset for
evaluation, and the results can therefore be compared across experiments. The
CARLA data generated and used for evaluation consists of 4 400 images with
corresponding ground truth segmentation and depth maps from Town 3-4. The
Mapillary evaluation dataset is a set of 2 000 images from the original Mapillary
test set.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experiment 1-1: Encoder-decoder models</title>
        <p>
          This experiment attempts to find the best encoder and decoder to use for the perception network. All the
encoders tested were picked because they have previously shown good results in
other papers, and were implemented in a common library by [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>The three encoders showed increased performance as the complexity and size
of the encoder increased. The Vanilla CNN encoder performed worst, with the
lowest Mean IoU score; however, it was also the fastest model during training
and testing. MobileNet gave better results while keeping much of the speed
advantage of the Vanilla CNN network, with MobileNet-U-Net displaying the
best overall performance when combining scores. ResNet50 performed well, as
expected, with a higher Mean IoU than MobileNet-U-Net; however, the difference
from MobileNet-U-Net was smaller than expected. MobileNet was used for further
experiments as it was significantly faster than ResNet50.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Model</th><th>Mean IoU</th><th>Weighted IoU</th></tr>
            </thead>
            <tbody>
              <tr><td>VanillaCNN-SegNet</td><td>0.324</td><td>0.712</td></tr>
              <tr><td>VanillaCNN-U-Net</td><td>0.351</td><td>0.705</td></tr>
              <tr><td>MobileNet-SegNet</td><td>0.368</td><td>0.775</td></tr>
              <tr><td>MobileNet-U-Net</td><td>0.403</td><td>0.774</td></tr>
              <tr><td>ResNet50-SegNet</td><td>0.405</td><td>0.767</td></tr>
              <tr><td>ResNet50-U-Net</td><td>0.383</td><td>0.733</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-6">
          <title>Experiment 1-2: Training data</title>
          <p>
            To improve the model further, some CARLA
data was introduced to the Mapillary dataset. Augmentation was also introduced
for further improvements and better generalization. The Mapillary+CARLA
dataset consisted of 20 000 data points from the Mapillary dataset and 3 250
samples from Town01 and Town02 in CARLA. The dataset with only
augmented CARLA data (CARLA+Aug) used a different dataset of 15 000 samples
from Town 1-4, and 4 000 samples from Town 5 as validation. The results were
evaluated on Town 3-4 as Town 1-2 was used when training Mapillary+CARLA.
The augmentation included, among others, Gaussian noise, translation,
rotation, and hue and saturation augmentations, and was adapted from [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>[Table: results by training dataset. Rows: Mapillary, Mapillary+Aug, Mapillary+CARLA, Mapillary+CARLA+Aug, CARLA+Aug. Surviving values: Mapillary+CARLA+Aug 0.478; CARLA+Aug 0.436, 0.469.]</p>
          <p>The dataset experiment shows that including CARLA data as a
component when training the perception models increases the total performance. As
the model's goal is to make good predictions in both real and simulated
environments, combining data from both seems to be a reasonable approach.
CARLA+Aug achieves great results when evaluating on CARLA data; however,
the performance decreased drastically when predicting in real-world
environments. Models trained on real-world data tend to generalize better to unseen
simulated environments than the other way around. Incorporating some CARLA
data into the real-world data in addition to augmenting the images yields the
best results overall.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Experiment 1-3: Multi-task perception</title>
        <p>Inspired by [<xref ref-type="bibr" rid="ref15">15</xref>], we introduced a
depth estimation decoder to the MobileNet-U-Net model. The model was trained
with ground truth data generated by the Monodepth2 network using images from
the Mapillary dataset, while depth maps for the CARLA data were included in
the generated CARLA dataset.</p>
        <p>[Table: multi-task perception results. Rows: Mapillary, CARLA, Mapillary+CARLA. Surviving values: Mapillary+CARLA 0.520 (+0.05), 0.854.]</p>
        <p>
          Including a depth estimation decoder increases the segmentation performance
for each model. The mean increase in IoU on the CARLA test set is 8%, which
is consistent with the results reported by [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], who reported a 4.17% increase in
performance when training semantic segmentation with depth estimation. An
increase in overall scene understanding can also be expected as depth is
introduced to the model; however, this has to be verified as part of the overall driving
policy experiments.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Experiment 2: Driving Model</title>
        <p>
          This experiment aims to assess the overall performance of the two-part
(perception and driving policy) architecture. We run real-time evaluations on variants
of our proposed architecture, including a baseline network where the complete
network is trained at once. The evaluation is conducted with a custom
scenario runner for CARLA, originally introduced by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and extended for our
experiments. The real-time nature of this experiment makes it different from the
previous experiments: the models' steering and speed outputs affect the camera
input in subsequent simulation steps, and each prediction is therefore dependent
on the ones before.
        </p>
        <p>The scenario runner. The scenario runner makes each model drive through
a predefined set of routes, each of which is defined by a set of waypoints. The
model navigates each route using HLCs provided automatically when passing
each waypoint. Each attempt at a route ends either when the vehicle completes
the route, or when the vehicle enters any of the following erroneous states: stuck
on an obstacle, leaving its correct lane and not returning within five seconds, or
ignoring an HLC. The models are then compared on their mean route completion
rate.</p>
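        <p>The termination rules of a route attempt can be summarized as a small decision function. This is a hypothetical sketch, not the scenario runner's actual code: the names, the input signals, and the order in which the conditions are checked are assumptions, and only the five-second lane grace period is taken from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from enum import Enum


class Outcome(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    STUCK = "stuck on an obstacle"
    LEFT_LANE = "left lane for too long"
    IGNORED_HLC = "ignored HLC"


LANE_GRACE_SECONDS = 5.0  # time allowed to return to the correct lane


def check_attempt(route_completed, is_stuck, seconds_off_lane, ignored_hlc):
    """Return the state of a route attempt given the current signals."""
    if route_completed:
        return Outcome.COMPLETED
    if is_stuck:
        return Outcome.STUCK
    if seconds_off_lane > LANE_GRACE_SECONDS:
        return Outcome.LEFT_LANE
    if ignored_hlc:
        return Outcome.IGNORED_HLC
    return Outcome.RUNNING
```
]]></preformat>
        <p>An evaluation loop would call check_attempt once per simulation step and stop the attempt as soon as the result is anything other than RUNNING.</p>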
        <p>Environments and Routes. The models were tested in two environments,
Town02 and Town07. Town02 is similar, but not identical, to the town in the
driving policy's training data, which is Town01. Town07 is quite different, and
is rural with narrow roads (some without any centre marking), fields, and barns.
There are three routes in each environment, which the models try to
complete in six different weather conditions. Three of the weather conditions have
already been observed in the training data, while the three remaining are
unknown to the policy. The training data only contains samples from day-time
weather, but two of the unknown weather conditions are at midnight.</p>
        <p>Results. Table 4 summarizes the driving performance of the different models.
The model trained only on driving data and without a frozen perception model,
RGB, was the best-performing model on Town02, but it struggles with Town07.</p>
        <p>The model names starting with SD indicate that they use the
segmentation and depth perception model (henceforth SD). SD-CARLA, which uses
SD trained only on perception data from CARLA, outperforms all other models
when ranked by Mean Completion Rate (MCR) over both towns. To demonstrate
its performance, we made a video (https://youtu.be/HL5LStDe7wY) showing
some of its well-performing moments. SD-Mapillary uses SD as well, but only
had perception training data from Mapillary. While not performing as well as
SD-CARLA, it still has impressive results. Its perception model has not seen any
CARLA data, but is still able to predict segmentation and depth well enough
for the driving model to beat even the RGB model. SD-Combined used
perception data from both Mapillary and CARLA, and performs slightly worse than
SD-Mapillary.</p>
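        <p>The model names starting with S indicate that they use the
segmentation-only perception model (henceforth S). S-CARLA is the S-counterpart of
SD-CARLA, and it performs very well in Town02. In Town07, however, it struggles
with night-time weather. S-Mapillary is the S-counterpart of SD-Mapillary, and it
has the lowest MCR in both towns. In any run with Fog (S), it fails almost
immediately. S-Combined uses combined perception data, the same as SD-Combined.
It performs a bit better than S-Mapillary in Town02, and is the fourth best
in Town07.</p>
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th></th><th colspan="3">Seen weather</th></tr>
              <tr><th>Model</th><th>Clear (D)</th><th>Rain (D)</th><th>Wet (S)</th></tr>
            </thead>
            <tbody>
              <tr><td>RGB</td><td>100.00 %</td><td>28.72 %</td><td>36.05 %</td></tr>
              <tr><td>SD-CARLA</td><td>100.00 %</td><td>43.30 %</td><td>55.36 %</td></tr>
              <tr><td>S-CARLA</td><td>93.83 %</td><td>7.23 %</td><td>43.51 %</td></tr>
              <tr><td>SD-Mapillary</td><td>88.51 %</td><td>42.16 %</td><td>67.22 %</td></tr>
              <tr><td>SD-Combined</td><td>93.53 %</td><td>9.92 %</td><td>47.42 %</td></tr>
              <tr><td>S-Combined</td><td>90.71 %</td><td>44.02 %</td><td>23.12 %</td></tr>
              <tr><td>S-Mapillary</td><td>72.12 %</td><td>43.30 %</td><td>40.85 %</td></tr>
            </tbody>
          </table>
        </table-wrap>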
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussions</title>
      <p>
        Results in comparison to related work
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] uses ground-truth semantic segmentation data generated from CARLA, not
predicted as we do, and combine segmentation with both ground-truth depth
maps and depth estimated by a separate network. Their results aligns with
our results; using semantic segmentation data beats just using raw images, and
combining both segmentation and depth performs the best. With a combination
of ground truth segmentation and estimated depth, their policy is still able to
beat the raw image-based policy. Our models estimate both segmentation and
depth, and is still able to perform good in comparison to our baseline
RGBmodel.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] use predicted binary segmentation (road/not road) as driving input,
and our work extends this with predicted depth, giving additional performance
bene ts. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] achieved higher completion rates even with tra c, but focused
more on the impact of larger datasets and encoding temporal information in the
model, while this paper focused mainly on generalizability.
      </p>
      <p>
        The driving model by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] did not include the perception model's decoding
layers in its architecture, which seems to be a more efficient approach overall.
Because the U-Net architecture used in our paper has connections between each
encoder layer and its corresponding decoder layer, information could have been
lost by not including the decoding layers. In future work, a model without
connections between the encoder and decoder layers could be explored to take
advantage of the approach of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Driving models</title>
        <p>We find that models with a learned understanding of the semantics and/or
geometry of the scene are able to navigate never-before-seen environments and
weather. Our real-time experiment shows that these driving models often
perform better than models that learn from raw image inputs directly, with models
utilizing both semantics and geometry performing best overall.</p>
        <p>
          Variance. It is important to note that we observed a high variance when
training and evaluating our models. Two models trained with the exact same setup
could perform significantly differently, despite having the exact same training data.
We suspect that this is the same variance problem that [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] experienced. We
handled the variance by training and testing the models several times to make sure
the results were representative. Still, conclusions based on the results in
Experiment 2 must be drawn carefully. A more robust approach could be to train
multiple models with the same parameters and average their results.
        </p>
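        <p>As a rough illustration of the averaging idea mentioned above, the following minimal sketch summarizes the completion rates of several training runs of one configuration by their mean and sample standard deviation. The function name summarize_runs and the example numbers are hypothetical placeholders introduced here for illustration; they are not results or code from our experiments.</p>
        <preformat><![CDATA[
```python
from statistics import mean, stdev


def summarize_runs(completion_rates):
    """Return (mean, sample standard deviation) over repeated training runs."""
    if len(completion_rates) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(completion_rates), stdev(completion_rates)


# Placeholder completion rates (percent) for three runs of one hypothetical
# configuration; illustrative numbers only, not results from the paper.
runs = [60.0, 70.0, 80.0]
avg, spread = summarize_runs(runs)
print(f"mean = {avg:.2f} %, std = {spread:.2f} % over {len(runs)} runs")
```
]]></preformat>
        <p>Reporting the spread alongside the mean makes it easier to judge whether a difference between two configurations is larger than the run-to-run variation.</p>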
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Splitting end-to-end models for autonomous vehicles into separate models for
perception and driving policy is shown to give good results in simulated
environments. Perception models trained from public datasets such as Mapillary
Vistas can be used to reduce the amount of driving data needed when training
an end-to-end driving policy network. This approach makes it possible to train the
driving policy in a simulated environment, while still getting good performance
in real-world environments.</p>
      <p>Future work should explore how these results transfer to the real world.
Evaluating the performance of a model trained solely in simulation directly in a
real-world environment will be an important next step as a means of testing the
validity of these results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alhashim</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wonka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>High quality monocular depth estimation via transfer learning</article-title>
          .
          <source>arXiv:1812.11941 [cs]</source>
          (Mar
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Badrinarayanan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cipolla</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Segnet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          .
          <source>arXiv:1511.00561 [cs]</source>
          (Oct
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojarski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Testa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dworakowski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flepp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackel</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monfort</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zieba</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>End to End Learning for Self-Driving Cars</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bousmalis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irpan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wohlhart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelcey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalakrishnan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downs</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibarz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pastor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konolige</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Using simulation and domain adaptation to improve efficiency of deep robotic grasping</article-title>
          .
          <source>In: 2018 IEEE International Conference on Robotics and Automation (ICRA)</source>
          . pp.
          <fpage>4243</fpage>
          –
          <lpage>4250</lpage>
          (May
          <year>2018</year>
          ). https://doi.org/10.1109/ICRA.2018.8460875
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xian</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Monocular depth estimation with augmented ordinal depth relationships</article-title>
          .
          <source>arXiv:1806.00585 [cs]</source>
          (Jul
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Codevilla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Muller,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Koltun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>End-to-end Driving via Conditional Imitation Learning</article-title>
          .
          <source>arXiv:1710.02410 [cs] (Oct</source>
          <year>2017</year>
          ), arXiv:
          <fpage>1710</fpage>
          .
          <fpage>02410</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Codevilla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santana</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaidon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Exploring the Limitations of Behavior Cloning for Autonomous Driving</article-title>
          .
          <source>arXiv:1904.08980 [cs]</source>
          (Apr
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cordts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehfeld</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Enzweiler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benenson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franke</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The cityscapes dataset for semantic urban scene understanding</article-title>
          .
          <source>arXiv:1604.01685 [cs]</source>
          (Apr
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dosovitskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Codevilla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>CARLA: An open urban driving simulator</article-title>
          .
          <source>In: Proceedings of the 1st Annual Conference on Robot Learning</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stiller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Vision meets robotics: The kitti dataset</article-title>
          .
          <source>International Journal of Robotics Research (IJRR)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Godard</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mac Aodha</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brostow</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Digging into self-supervised monocular depth estimation</article-title>
          .
          <source>arXiv:1806.01260 [cs, stat]</source>
          (Aug
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gualtieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pas</surname>
            ,
            <given-names>A.t.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>High precision grasp pose detection in dense clutter</article-title>
          .
          <source>arXiv:1603.01564 [cs]</source>
          (Jun
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Image segmentation keras : Implementation of segnet, fcn, unet, pspnet and other models in keras</article-title>
          . (
          <year>2020</year>
          ), https://github.com/divamgupta/image-segmentation-keras
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Haavaldsen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aasboe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindseth</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Autonomous Vehicle Control: End-to-end Learning in Simulated Urban Environments</article-title>
          .
          <source>arXiv:1905.06712 [cs]</source>
          (May
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hawke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurau</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazur</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Micklethwaite</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Urban Driving with Conditional Imitation Learning</article-title>
          .
          <source>arXiv:1912.00177 [cs]</source>
          (Dec
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>arXiv:1512.03385 [cs]</source>
          (Dec
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalenichenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weyand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andreetto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adam</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Mobilenets: Efficient convolutional neural networks for mobile vision applications</article-title>
          .
          <source>arXiv:1704.04861 [cs]</source>
          (Apr
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Johnson-Roberson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sridhar</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosaen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?</article-title>
          <source>arXiv:1610.01983 [cs]</source>
          (Feb
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kendall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cipolla</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Multi-task learning using uncertainty to weigh losses for scene geometry and semantics</article-title>
          .
          <source>arXiv:1705.07115 [cs]</source>
          (Apr
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. Muller,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Koltun</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          :
          <article-title>Driving Policy Transfer via Modularity and Abstraction</article-title>
          . arXiv:
          <year>1804</year>
          .09364 [cs] (
          <year>Dec 2018</year>
          ), arXiv:
          <year>1804</year>
          .09364
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Neuhold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ollmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rota Bulo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontschieder</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The mapillary vistas dataset for semantic understanding of street scenes</article-title>
          .
          <source>In: International Conference on Computer Vision</source>
          (ICCV) (
          <year>2017</year>
          ), https://www.mapillary.com/dataset/vistas
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Pomerleau</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <source>Advances in Neural Information Processing Systems</source>
          <volume>1</volume>
          , pp.
          <fpage>305</fpage>
          –
          <lpage>313</lpage>
          . Morgan Kaufmann Publishers Inc. (
          <year>1989</year>
          ), http://dl.acm.org/citation.cfm?id=89851.89891
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Ronneberger</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          .
          <source>arXiv:1505.04597 [cs]</source>
          (May
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Standley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamir</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guibas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savarese</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Which tasks should be learned together in multi-task learning?</article-title>
          <source>arXiv:1905.07553 [cs]</source>
          (May
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Viereck</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pas</surname>
            ,
            <given-names>A.t.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning a visuomotor controller for real world robotic grasping using simulated depth images</article-title>
          .
          <source>arXiv:1706.04652 [cs]</source>
          (Nov
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Codevilla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurram</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urfalioglu</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Multimodal End-to-End Autonomous Driving</article-title>
          .
          <source>arXiv:1906.03199 [cs]</source>
          (Jun
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Yurtsever</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carballo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A Survey of Autonomous Driving: Common Practices and Emerging Technologies</article-title>
          .
          <source>arXiv:1906.05113 [cs, eess]</source>
          (Jun
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>