<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning versus Multi-agent Learning regarding Distributed Decision-Making in Highway Traffic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Schutera</string-name>
          <email>mark.schutera@kit.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niklas Goby</string-name>
          <email>niklas.goby@is.uni-freiburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dirk Neumann</string-name>
          <email>dirk.neumann@is.uni-freiburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Reischl</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair for Information Systems Research, University of Freiburg</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IT Innovation Chapter Data Science, ZF Friedrichshafen AG</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Research and Development, ZF Friedrichshafen AG</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Transportation and traffic are currently undergoing a rapid increase in terms of both scale and complexity. At the same time, an increasing share of traffic participants are being transformed into agents driven or supported by artificial intelligence resulting in mixed-intelligence traffic. This work explores the implications of distributed decision-making in mixed-intelligence traffic. The investigations are carried out on the basis of an online-simulated highway scenario, namely the MIT DeepTraffic simulation. In the first step traffic agents are trained by means of a deep reinforcement learning approach, being deployed inside an elitist evolutionary algorithm for hyperparameter search. The resulting architectures and training parameters are then utilized in order to either train a single autonomous traffic agent and transfer the learned weights onto a multi-agent scenario or else to conduct multi-agent learning directly. Both learning strategies are evaluated on different ratios of mixed-intelligence traffic. The strategies are assessed according to the average speed of all agents driven by artificial intelligence. Traffic patterns that provoke a reduction in traffic flow are analyzed with respect to the different strategies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The level of automation in traffic and transportation is
increasing rapidly, especially in highway scenarios, where
complexity is lower than in urban street scenarios. Traffic
congestion is annoying, stressful, and time-consuming. Progress
in autonomous driving thus offers the opportunity to improve
this condition, enhance traffic flow, and yield corresponding
benefits such as reduced energy consumption [Winner et al., 2015].
At the same time, autonomous systems are distributed by a number
of different manufacturers and suppliers. This raises the
challenge of interaction between different autonomous
systems and human-operated vehicles. It is therefore conceivable
that increased automation may compromise the average flow of
mixed-intelligence traffic. As highway traffic can be described as
a multi-agent system in which independent agents cooperate and
compete to achieve an objective, the key to high-performance
highway traffic flow might lie in multi-agent learning, and thus in
the understanding and exploration of distributed decision-making
and its strategies. Transfer learning is used with increasing
frequency within deep learning and might prove able to adapt
artificial neural networks to bordering tasks [Prodanova et al., 2018].
Within the automotive industry, the pros and cons of each
strategy are still subject to ongoing discussion. This work
contributes to this discussion by investigating the performance of
transfer learning, as opposed to multi-agent learning, regarding
distributed decision-making in highway traffic. For the
experiments, agents are trained with different learning
strategies and deployed to the DeepTraffic micro-traffic
simulation, which was introduced along with the MIT 6.S094:
Deep Learning for Self-Driving Cars course [Fridman et al.,
2018]. The aim of this study is to examine the impact on
mixed-intelligence traffic in the form it is expected to take with
the adoption of Level 5 autonomous driving. To this end, the
following steps are taken:</p>
      <p>Traffic agents are trained within a micro-traffic
simulation, through deep reinforcement learning.</p>
      <p>An evolutionary algorithm is designed to embed the
traffic agents’ learning procedure.</p>
      <p>A single traffic agent’s model is applied to multiple
agents (transfer learning strategy).</p>
      <p>Multiple traffic agents are jointly trained (multi-agent
learning strategy).</p>
      <p>The two learning strategies are evaluated by means of
speed and traffic flow patterns.</p>
    </sec>
    <sec id="sec-2">
      <title>Micro-Traffic Simulation Environment</title>
      <p>In the DeepTraffic challenge (https://selfdrivingcars.mit.edu/deeptraffic/),
the task is to train a car agent
with the goal of achieving the highest average speed over a
period of time. In order to succeed, the agent has to choose
the optimal action <italic>a<sub>t</sub></italic> at each time step <italic>t</italic> given the state
<italic>s<sub>t</sub></italic>. Possible actions are: accelerate, decelerate, goLeft (lane
change to the left), goRight (lane change to the right), or
noAction (remain in the current lane at the same speed). The
agent’s observed state <italic>x<sub>t</sub></italic> at time step <italic>t</italic> is defined as a slice of
grid cells surrounding the agent. The size of the slice
is adjustable via three parameters: lanesSide,
representing the width of the slice; patchesAhead, denoting the
length of the slice in the forward direction; and
patchesBehind, representing the length of the slice in the backward
direction. Depending on the temporal window parameter
<italic>w</italic>, the state <italic>s<sub>t</sub></italic> can be extended into the sequence
<inline-formula><tex-math>s_t = (x_{t-w}, a_{t-w}, x_{t-w+1}, a_{t-w+1}, \ldots, x_{t-1}, a_{t-1}, x_t)</tex-math></inline-formula>.
If <italic>w</italic> = 0, then <italic>s<sub>t</sub></italic> = <italic>x<sub>t</sub></italic>. Cell values denote the maximum speed the
agent can achieve when it is inside the cell. The maximum
speed in an empty cell is set to 80 mph. A cell occupied by a
car takes the speed of that car.</p>
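      <p>The construction of the state from the temporal window can be illustrated with a minimal sketch. It assumes that each observation is a flat list of cell values and that actions are encoded as integer indices; the function name <monospace>build_state</monospace> and this encoding are illustrative assumptions and are not taken from the DeepTraffic code.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple


def build_state(history: Sequence[Tuple[List[float], int]],
                x_t: List[float],
                w: int) -> List[float]:
    """Build s_t from the last w (observation, action) pairs and x_t.

    history holds past steps, oldest first, as (x, a) tuples.
    Encoding the action as a single number is an assumption of this sketch.
    """
    if w < 0:
        raise ValueError("temporal window w must be non-negative")
    if len(history) < w:
        raise ValueError("history holds fewer than w past steps")
    state: List[float] = []
    # Oldest step first: x_{t-w}, a_{t-w}, ..., x_{t-1}, a_{t-1}
    past = history[-w:] if w > 0 else []
    for x_past, a_past in past:
        state.extend(x_past)
        state.append(float(a_past))
    # The current observation x_t always closes the sequence.
    state.extend(x_t)
    return state
```
]]></preformat>
      <p>With <italic>w</italic> = 0 the function returns the current observation unchanged, which matches the special case stated above.</p>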
      <p>There are a total of 20 cars inside the environment, of
which up to 11 can be controlled intelligently. The
remaining cars choose their actions randomly. Central to the
intelligent control is a neural network. It receives the
observed state <italic>s<sub>t</sub></italic> as input and returns an action <italic>a<sub>t</sub></italic>, thereby
defining the agent’s behavior. The implemented
algorithm is a JavaScript implementation of the DQN
algorithm [Mnih et al., 2013]. Please refer to Section 3.1 for
more information.</p>
      <p>The environment allows for the adjustment of a whole set
of hyperparameters in order to improve the agents’ performance.
Table 1 lists the most important hyperparameters, which have
proven to have a significant influence on the agents’
performance [Fridman et al., 2018]. These hyperparameters, as
well as the network architecture itself, can be adjusted
directly within the browser. To automate the configuration,
training, and validation process for the experiments, a
Python-based helper robot using the Selenium<sup>2</sup> package was
implemented.</p>
    </sec>
    <sec id="sec-3">
      <title>Training Advanced Traffic Agents</title>
      <sec id="sec-3-2">
        <title>Deep Reinforcement Learning and the Deep Q-Network (DQN)</title>
        <p>Deep reinforcement learning (DRL) is the combination of
two general-purpose frameworks: reinforcement learning
(RL) for decision-making, and deep learning (DL) for
representation learning [Silver, 2016].</p>
        <p>In the RL framework, an agent’s task is to learn actions
within an initially unknown environment. The learning
follows a trial-and-error strategy based on rewards or
punishments. The agent’s goal is to select actions that maximize the
cumulative future reward over a period of time. In the DL
framework, an algorithm learns a representation from raw
input that is required to achieve a given objective. The
combined DRL approach enables agents to engage in more
human-like learning, whereby they construct and acquire their
knowledge directly from raw inputs, such as vision,
without any hand-engineered features or domain heuristics. This
new generation of algorithms has recently achieved human-like
results in mastering complex tasks with a very large state
space and with no prior knowledge [Mnih et al., 2013; 2015;
Silver et al., 2017].</p>
        <preformat>Algorithm 1: Deep Q-learning with Experience Replay

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
for episode = 1, M do
    Observe initial state s_1
    for t = 1, T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t
        Observe reward r_t and state s_{t+1}
        Store experience (s_t, a_t, r_t, s_{t+1}) in D
        Sample random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j = r_j                                  for terminal s_{j+1}
            y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ)   for non-terminal s_{j+1}
        Train the Q network using (y_j − Q(s_j, a_j; θ))^2 as loss
    end for
end for</preformat>
        <p><sup>2</sup> http://selenium-python.readthedocs.io/</p>
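        <p>The action-selection step of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the simulation's JavaScript code: the ordering of the tuple <monospace>ACTIONS</monospace> and the function name <monospace>select_action</monospace> are assumptions made for this example, while the five action names are those listed for the environment.</p>
        <preformat><![CDATA[
```python
import random
from typing import Optional, Sequence

# The five actions named for the environment; this ordering is an assumption.
ACTIONS = ("noAction", "accelerate", "decelerate", "goLeft", "goRight")


def select_action(q_values: Sequence[float],
                  epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Epsilon-greedy selection over the Q-values of the current state."""
    rng = rng or random.Random()
    # Explore: with probability epsilon pick a uniformly random action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: otherwise pick the action with the highest Q-value (argmax).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```
]]></preformat>
        <p>For example, <monospace>ACTIONS[select_action(q, epsilon=0.0)]</monospace> always yields the greedy action for the Q-values <monospace>q</monospace>, whereas <monospace>epsilon=1.0</monospace> yields purely random behavior.</p>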
        <p>By default, the simulation environment implements the
DQN algorithm introduced in [Mnih et al., 2013; 2015] for
training the advanced traffic agents. As a variant of the
popular Q-learning algorithm [Watkins and Dayan, 1992], DQN
uses a neural network to approximate the optimal state-action
value function (i.e., the Q-function). To make this work, DQN
utilizes four core concepts: experience replay [Lin, 1993],
a fixed target network, reward clipping, and frame skipping
[Mnih et al., 2015].</p>
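        <p>The resulting approximate state-action value function
<inline-formula><tex-math>Q(s, a; \theta_i)</tex-math></inline-formula> is parametrized by <inline-formula><tex-math>\theta_i</tex-math></inline-formula>, the
parameters (i.e., weights) of the Q-network at iteration <italic>i</italic> [Mnih
et al., 2015]. To train the Q-network at iteration <italic>i</italic>, one has to
minimize the following loss function:
<disp-formula id="eq-1"><label>(1)</label><tex-math>L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],</tex-math></disp-formula>
in which <inline-formula><tex-math>(s, a, r, s') \sim U(D)</tex-math></inline-formula> represents samples of
experiences, drawn uniformly at random from the
experience replay memory <italic>D</italic> (experience replay),
<inline-formula><tex-math>y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)</tex-math></inline-formula> is the target for iteration <italic>i</italic>,
<inline-formula><tex-math>\gamma</tex-math></inline-formula> is the discount factor determining the agent’s horizon,
<inline-formula><tex-math>\theta_i</tex-math></inline-formula> are the parameters of the Q-network at iteration <italic>i</italic>, and
<inline-formula><tex-math>\theta_i^-</tex-math></inline-formula> are the network parameters used to compute the target at iteration
<italic>i</italic>, which are updated every <italic>C</italic> steps and held fixed otherwise (fixed
target network) [Mnih et al., 2015]. Algorithm 1 outlines the
full algorithm in pseudo-code.</p>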
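        <p>To make the loss in Eq. (1) concrete, the following minimal sketch computes the target and the mean squared error over a minibatch. The Q-networks are stood in for by plain callables that map a state to one Q-value per action; the names <monospace>td_target</monospace>, <monospace>minibatch_loss</monospace>, <monospace>q_online</monospace>, and <monospace>q_target</monospace> are assumptions of this example and do not refer to the simulation's implementation.</p>
        <preformat><![CDATA[
```python
from typing import Callable, List, Sequence, Tuple

# (state, action index, reward, next state, terminal flag)
Transition = Tuple[Sequence[float], int, float, Sequence[float], bool]
# A Q-function maps a state to a list with one Q-value per action.
QFunction = Callable[[Sequence[float]], List[float]]


def td_target(reward: float, next_state: Sequence[float], terminal: bool,
              q_target: QFunction, gamma: float) -> float:
    """y = r for terminal transitions, else r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(q_target(next_state))


def minibatch_loss(batch: Sequence[Transition], q_online: QFunction,
                   q_target: QFunction, gamma: float) -> float:
    """Mean of (y - Q_online(s, a))^2 over the sampled minibatch."""
    total = 0.0
    for state, action, reward, next_state, terminal in batch:
        y = td_target(reward, next_state, terminal, q_target, gamma)
        total += (y - q_online(state)[action]) ** 2
    return total / len(batch)
```
]]></preformat>
        <p>Keeping <monospace>q_target</monospace> as a separate callable mirrors the fixed target network: it would be refreshed from <monospace>q_online</monospace> only every <italic>C</italic> steps.</p>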
      </sec>
      <sec id="sec-3-3">
        <title>Extended Hyperparameter Search</title>
        <p>Deep reinforcement learning requires a structured approach
to determining suitable hyperparameter configurations. This
is important for both the neural network’s architecture and
the training process. The following approach fulfills this
requirement over multiple search iterations. The
micro-traffic simulation has already been
used to conduct a large-scale, crowd-sourced
hyperparameter search [Fridman et al., 2018]. In a first step, the proposals
drawn from this hyperparameter search are used
to define the intervals of the hyperparameters (see Tab. 1).
Building on these hyperparameter bounds, a 15-fold random
search is performed, as proposed by [Bergstra and Bengio,
2012; Goodfellow et al., 2016].</p>
        <p>Subsequently, the five best-performing networks generated
by the random search are used to initialize an elitist
evolutionary algorithm. Hyperparameter search for artificial neural
networks is hampered by the comparatively long training time
required for each hyperparameter configuration. Therefore, an
elitist, fast-converging evolutionary algorithm is deployed to
automate the process further. The overall hyperparameter search
reduces the effects of poor agent configurations, rendering the
effects of the transfer learning and multi-agent approaches more
visible and reproducible. In the future, we would also like to
exploit the hyperparameter tuning capabilities of evolutionary
algorithms to create highly optimized agents [Salimans et al.,
2017; Such et al., 2017; Conti et al., 2017].
</p>
        <p>In the transfer learning strategy, a core neural
network ANNcore is trained first. The network is trained with
a single agent deployed within the micro-traffic simulation.
The training is iterated while training and evaluating different
hyperparameter configurations (for the hyperparameter
configuration see Tab. 1). Subsequently, the learned model is
repurposed for a multi-agent system. The decision-making process
is distributed over multiple independent agents. The transfer
learning approach presented here is based on parameter
sharing among multiple agents, while the agents maintain their
ability to carry out self-determined actions. To that end, the
previously learned weights of ANNcore are transferred onto
each further agent <italic>A<sub>i</sub></italic>, as described by [Olivas et
al., 2009] and displayed in Fig. 1.
</p>
        <p>In the multi-agent learning strategy, the agents are
trained simultaneously without being aware of each other.
More precisely, they have to interact with each other
without being able to communicate among themselves. This
makes joint planning impossible. The resulting network
ANNshared is trained with the joint objective of achieving
the highest average speed across all agents, but as in the transfer learning
scenario, actions are taken individually and greedily.
The parameters of the neural network ANNshared are distributed
and shared across all agents <italic>A<sub>i</sub></italic> (see Fig. 2). In contrast to the
transfer learning approach, the multi-agent strategy enables
the agents to learn to interact directly with other agents in
order to increase the reward [Tuyls and Weiss, 2012].
</p>
        <p>The traffic congestion feature vector <italic>c<sub>p</sub></italic> comprises a Boolean <italic>B</italic> stating whether the car is blocked
to the front and/or sides (1) or whether one of the lanes (left
lane, front lane, or right lane) is passable (0) under the safety
regulations within the safety catchment area. Furthermore,
the feature vector annotation takes into account the speed <italic>S</italic> at
which the agent drove into the congestion as well as the loss
in speed, or deceleration, <italic>D</italic> that the agent’s vehicle experiences
within half a second of simulated time after encountering the
congestion. The feature <italic>C</italic> reflects whether the agent was
compromised by another intelligent agent and thus assesses
the amount of cooperation during evaluation. The number of
congestions throughout the evaluation runs is recorded
as <italic>n<sub>cp</sub></italic>.</p>
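        <p>The difference between the two learning strategies described in this section lies in how many agents are present during training, not in how the resulting parameters are deployed. A minimal sketch, assuming a hypothetical <monospace>train</monospace> callable that returns network weights after training with a given number of agents (it stands in for the browser-based training and is not part of the simulation's API):</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, List

Weights = Dict[str, object]
# Hypothetical training routine: number of trained agents -> learned weights.
TrainFn = Callable[[int], Weights]


def transfer_strategy(train: TrainFn, n_agents: int) -> List[Weights]:
    """Train ANN_core with a single agent, then reuse its weights for all agents."""
    core = train(1)
    return [core for _ in range(n_agents)]


def multi_agent_strategy(train: TrainFn, n_agents: int) -> List[Weights]:
    """Train ANN_shared with all agents present, then share it across them."""
    shared = train(n_agents)
    return [shared for _ in range(n_agents)]
```
]]></preformat>
        <p>In both cases every agent acts on the same parameters; only the multi-agent variant has encountered other trained agents during training.</p>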
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>The first experiments focus on the hyperparameter search
described in Section 3.2. The elitist evolutionary algorithm
is configured as follows: a small population size of 5, a
directed population initialization by means of a random
search, and retention of the best parent
during the transition into the next generation. The crossover rate
is set to <italic>p</italic><sub>cross</sub> = 0.3 and the mutation rate to <italic>p</italic><sub>mut</sub> = 0.1.
This configuration strongly favors exploitation over
exploration of the hyperparameter space. Hence, the approach
converges quickly, at the cost of exploring less of the
hyperparameter space.</p>
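      <p>One generation of such an elitist evolutionary algorithm can be sketched as follows. Only the population size, the elitism, the crossover rate, and the mutation rate are taken from the text. The parent selection, the per-gene uniform crossover, and the mutation by resampling within the bounds are assumptions made for this illustration, as are all names in the code.</p>
      <preformat><![CDATA[
```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def next_generation(population: List[Config],
                    fitness: Callable[[Config], float],
                    bounds: Bounds,
                    p_cross: float = 0.3,
                    p_mut: float = 0.1,
                    rng: Optional[random.Random] = None) -> List[Config]:
    """Produce the next generation while keeping the best parent unchanged."""
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness, reverse=True)
    # Elitism: the best parent survives into the next generation as is.
    children: List[Config] = [dict(ranked[0])]
    while len(children) < len(population):
        parent_a, parent_b = rng.sample(ranked, 2)
        child = dict(parent_a)
        for name, (low, high) in bounds.items():
            # Assumed operator: per-gene uniform crossover with rate p_cross.
            if rng.random() < p_cross:
                child[name] = parent_b[name]
            # Assumed operator: mutation resamples the gene within its bounds.
            if rng.random() < p_mut:
                child[name] = rng.uniform(low, high)
        children.append(child)
    return children
```
]]></preformat>
      <p>In the experiments, evaluating <monospace>fitness</monospace> corresponds to training an agent and measuring its average speed, which is why a small population and fast convergence are attractive.</p>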
      <p>In order to compare the transfer learning strategy to the
multi-agent strategy (see Fig. 5), the neural network
architecture and training parameters (see Tab. 1) discovered by the
hyperparameter search are reused. Each
strategy is applied to different numbers of trainable agents,
ranging from 1 to 11 agents. Each arrangement is
evaluated 5-fold to account for expected deviations due to differing
evaluation data. However, this is found to pose only a minor issue,
as the gap between the minimal and maximal validation performance
is less than 0.5 mph in all arrangements.</p>
      <sec id="sec-4-1">
        <title>Results</title>
        <p>In the search for a high-performance hyperparameter
configuration, the 15-fold random search first evaluates
configurations in the micro-traffic simulation, reaching a
maximum average speed of 64.13 mph (see search iteration 0
in Fig. 4). The average speed is used as an indicator of
traffic flow. The best five configurations are selected to initialize
the evolutionary algorithm, which leaves the configurations
with a maximum of 64.13 mph, a minimum of 60.27 mph,
and a mean of 62.55 mph. The evolutionary algorithm is
run for six generations (see search iterations 2-7 in
Fig. 4). As discussed in Section 3.2, the evolutionary
algorithm is elitist, with a focus on exploitation at the expense of
extended exploration. Nevertheless, the influence of exploration
is observed in search iteration 3, while the stronger
exploitation is evident in search iteration 4, where the range of values
decreases again. After completion of the evolutionary
algorithm, the configurations have a maximum of 67.93 mph,
a minimum of 62.63 mph, and a mean of 65.66 mph. This
is an increase of 3.8 mph in the maximum average speed without any user interaction
apart from choosing educated upper and lower bounds for the
hyperparameter search space.</p>
        <p>Both strategies experience a drop in performance when
applied to multi-agent scenarios (see Fig. 5). For the initial
addition of supplementary agents, the performance downturn
is likely due to the fact that the network architecture and
training hyperparameters have been optimized for
single-agent performance and then face a different
scenario during the reconditioned evaluation.
Nevertheless, an overall increase in performance with an
increasing number of agents can be recognized. The slopes
of the regression curves are 0.342 mph per agent added for
the transfer learning strategy and 0.379 mph per agent added
for the multi-agent training strategy (compare with Fig. 5).
The multi-agent strategy, having the edge over the transfer
learning strategy, is able to profit from training in a
multi-agent scenario. By contrast, agents in the transfer learning
strategy never had the opportunity to learn how to react to
and interact with other trained agents.</p>
        <p>Further insight is gained by analyzing the traffic
congestion feature vectors (see Section 4.3 and Fig. 6). Most
striking is the counterintuitive finding that the number of
congestions (gray area) increases with the number of trained
agents deployed in the micro-traffic simulation, even as the
average evaluation speed increases. At the same time, the
number of congestions in which the car is held in full enclosure
(dark gray line) remains roughly constant, fluctuating around 50
incidences.</p>
        <p>This contradiction is only apparent, as most of
the additional congestion incidences can be
attributed to low decelerations 1-11 (see Fig. 7). In terms of
traffic flow, this means that the trained agents are able to
anticipate and withdraw from potentially congestive positions
in advance, or else dissolve a formation conducive to
congestion. Thus, the trained agents are able to accelerate again
shortly after driving into an area of congestion, which leads to
better performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The influence of transfer learning and multi-agent learning
in the presence of multiple trainable agents has been
investigated with respect to distributed decision-making in order
to increase simulated highway traffic flow. Both strategies
were implemented and evaluated in the micro-traffic
simulation environment. Since the micro-traffic simulation only
allows for multi-agent learning, the newly conducted
strategy comparison, the deployment of the transfer learning strategy,
and the evaluation tooling allow for extended testing
and evaluation.</p>
      <p>It was demonstrated that transfer learning strategies are
applicable within the utilized micro-traffic simulation. A
beneficial effect of such strategies, correlating with the number
of trainable agents deployed in mixed-intelligence traffic, has
been shown. It was found that the transfer learning
strategy and the multi-agent strategy reach approximately
the same level of performance, while also displaying similar
characteristics. Concentrating on traffic patterns, it became
evident that the average speed is not necessarily contingent on
the number of congestions an agent experiences. More
important are the magnitude of deceleration required of the agent
and the time needed to withdraw from a congested situation.
The micro-traffic scenario is a vast simplification of real
traffic. Our findings suggest that multi-agent learning has an edge
with respect to performance in scenarios with more
intelligent agents involved. This leads to the assumption that, with
a growing number of intelligent agents taking to the roads,
multi-agent learning strategies will become indispensable.</p>
      <p>Further comparisons between the investigated multi-agent
strategies might reveal explicit distinctions. To this end,
investigating ratios with a higher share of trainable agents is
advisable. Moreover, the multi-agent strategies should
benefit from a network architecture and training design that is
tailored to the multi-agent scenario (as opposed to
the single-agent scenario). Increasing the number of training
iterations and deepening the hyperparameter search are
recommended.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank Dr. Jochen Abhau and Dr. Stefan Elser from
Research and Development, as well as the whole Data Science
Team at ZF Friedrichshafen AG, for supporting this research.
Thank you for all the assistance and comments that greatly
improved this work. We would also like to express our
gratitude to Prof. Dr. Ralf Mikut from the Institute for Automation
and Applied Informatics, Karlsruhe Institute of Technology,
who provided insight and expertise that greatly enhanced this
and other research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bergstra and Bengio,
          <year>2012</year>
          ]
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>13</volume>
          (Feb):
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Conti et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Edoardo</given-names>
            <surname>Conti</surname>
          </string-name>
          , Vashisht Madhavan, Felipe Petroski Such, Joel Lehman,
          <string-name>
            <given-names>Kenneth O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Clune</surname>
          </string-name>
          .
          <article-title>Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents</article-title>
          .
          <source>CoRR, abs/1712.06560</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Fridman et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Lex</given-names>
            <surname>Fridman</surname>
          </string-name>
          , Benedikt Jenik, and
          <string-name>
            <given-names>Jack</given-names>
            <surname>Terwilliger</surname>
          </string-name>
          .
          <article-title>DeepTraffic: Driving fast through dense traffic with deep reinforcement learning</article-title>
          .
          <source>CoRR, abs/1801.02805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Goodfellow et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <article-title>Deep Learning Book</article-title>
          . MIT Press,
          <year>2016</year>
          . http://www.deeplearningbook.org.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Lin,
          <year>1993</year>
          ]
          <string-name>
            <given-names>Long-Ji</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Reinforcement learning for robots using neural networks</article-title>
          .
          <source>Technical report</source>
          , Carnegie-Mellon Univ Pittsburgh PA School of Computer Science,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Mnih et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Koray Kavukcuoglu, David Silver,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Ioannis Antonoglou, Daan Wierstra, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          .
          <article-title>Playing atari with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1312.5602</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Mnih et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Olivas et al.,
          <year>2009</year>
          ]
          <string-name>
            <given-names>Emilio</given-names>
            <surname>Soria Olivas</surname>
          </string-name>
          , Jose David Martin Guerrero, Marcelino Martinez Sober, Jose Rafael Magdalena Benedito, and Antonio Jose Serrano Lopez.
          <source>Handbook Of Research On Machine Learning Applications and Trends: Algorithms, Methods and Techniques - 2 Volumes. Information Science Reference - Imprint of: IGI Publishing</source>
          , Hershey, PA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Prodanova et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>N.</given-names>
            <surname>Prodanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stegmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allgeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Stachs</surname>
          </string-name>
          , B. Köhler, R. Mikut, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bartschat</surname>
          </string-name>
          .
          <article-title>Transfer learning with human corneal tissues: An analysis of optimal cut-off layer</article-title>
          .
          <source>MIDL Amsterdam</source>
          ,
          <year>2018</year>
          . Submitted paper, online available.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Salimans et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Salimans</surname>
          </string-name>
          , Jonathan Ho, Xi Chen, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <article-title>Evolution strategies as a scalable alternative to reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1703.03864</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Silver et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          , Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran,
          <string-name>
            <given-names>Thore</given-names>
            <surname>Graepel</surname>
          </string-name>
          , et al.
          <article-title>Mastering chess and shogi by self-play with a general reinforcement learning algorithm</article-title>
          .
          <source>arXiv preprint arXiv:1712.01815</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Silver, 2016]
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          .
          <article-title>ICML 2016 Tutorial: Deep Reinforcement Learning</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Such et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Felipe Petroski</given-names>
            <surname>Such</surname>
          </string-name>
          , Vashisht Madhavan, Edoardo Conti, Joel Lehman,
          <string-name>
            <given-names>Kenneth O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Clune</surname>
          </string-name>
          .
          <article-title>Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning</article-title>
          .
          <source>CoRR, abs/1712.06567</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Tuyls and Weiss, 2012]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Tuyls</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>Multiagent learning: Basics, challenges, and prospects</article-title>
          .
          <source>Association for the Advancement of Artificial Intelligence</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Watkins and Dayan, 1992]
          <string-name>
            <given-names>Christopher J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dayan</surname>
          </string-name>
          .
          <article-title>Q-learning</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Winner et al.,
          <year>2015</year>
          ] Hermann Winner, Felix Lotz, Stephan Hakuli, and
          <string-name>
            <given-names>Christina</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <source>Handbuch Fahrerassistenzsysteme - Grundlagen, Komponenten und Systeme für aktive Sicherheit und Komfort</source>
          . Springer Vieweg,
          <edition>3rd edition</edition>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>