<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Artificial Neural Networks Applied to Taxi Destination Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre de Brébisson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Étienne Simon</string-name>
          <email>esimon@esimon.eu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Auvolat</string-name>
          <email>alex.auvolat@ens.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Vincent</string-name>
          <email>vincentp@iro.umontreal.ca</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoshua Bengio</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MILA lab, University of Montréal</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe our first-place solution to the ECML/PKDD discovery challenge on taxi destination prediction. The task consisted of predicting the destination of a taxi based on the beginning of its trajectory, represented as a variable-length sequence of GPS points, and diverse associated meta-information, such as the departure time, the driver id and client information. Contrary to most published competitor approaches, we used an almost fully automated approach based on neural networks and we ranked first out of 381 teams. The architectures we tried use multi-layer perceptrons, bidirectional recurrent neural networks and models inspired by recently introduced memory networks. Our approach could easily be adapted to other applications in which the goal is to predict a fixed-length output from a variable-length sequence.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Authors are listed in random order; the order does not reflect the weight of their contributions.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The taxi destination prediction challenge was organized by the 2015
ECML/PKDD conference<sup>5</sup> and proposed as a Kaggle competition<sup>6</sup>. It consisted
of predicting the destinations (latitude and longitude) of taxi trips based on
initial partial trajectories (which we call prefixes) and some meta-information
associated with each ride. Such prediction models could help to dispatch taxis more
efficiently.</p>
      <p>The dataset is composed of all the complete trajectories of 442 taxis running
in the city of Porto (Portugal) for a complete year (from 2013-07-01 to
2014-06-30). The training dataset contains 1.7 million datapoints, each one representing
a complete taxi ride and being composed of the following attributes<sup>7</sup>:
<list list-type="bullet">
<list-item><p>the complete taxi ride: a sequence of GPS positions (latitude and longitude)
measured every 15 seconds. The last position represents the destination and
different trajectories have different GPS sequence lengths.</p></list-item>
<list-item><p>metadata associated with the taxi ride:
<list list-type="bullet">
<list-item><p>if the client called the taxi by phone, then we have a client ID. If the
client called the taxi at a taxi stand, then we have a taxi stand ID.
Otherwise we have no client identification,</p></list-item>
<list-item><p>the taxi ID,</p></list-item>
<list-item><p>the time of the beginning of the ride (unix timestamp).</p></list-item>
</list></p></list-item>
</list></p>
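      <p>As an illustration of the attributes listed above, one ride could be held in a small record type. This is only a sketch: the class name, the field names and the sample values are hypothetical and are not the column names of the competition files.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SAMPLING_PERIOD_S = 15  # the text states one GPS point every 15 seconds


@dataclass
class TaxiRide:
    """One complete ride. All field names here are hypothetical."""
    gps: List[Tuple[float, float]]   # (latitude, longitude) pairs
    taxi_id: int
    start_timestamp: int             # unix timestamp of the start of the ride
    client_id: Optional[int] = None  # known only when the taxi was called by phone
    stand_id: Optional[int] = None   # known only when the taxi was taken at a stand

    @property
    def destination(self) -> Tuple[float, float]:
        """The last GPS position of the ride is its destination."""
        return self.gps[-1]

    @property
    def duration_seconds(self) -> int:
        """Elapsed time between the first and the last GPS point."""
        return SAMPLING_PERIOD_S * (len(self.gps) - 1)


# Made-up sample values, for illustration only.
ride = TaxiRide(
    gps=[(41.1579, -8.6291), (41.1585, -8.6280), (41.1592, -8.6272)],
    taxi_id=589,
    start_timestamp=1372636858,
    stand_id=7,
)
```
]]></preformat>
      <p>Leaving both <monospace>client_id</monospace> and <monospace>stand_id</monospace> unset corresponds to the case where no client identification is available.</p>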
      <p>In the competition setup, the testing dataset is composed of 320 partial
trajectories, which were created from five snapshots taken at different timestamps.
This testing dataset is divided into two subsets of equal size: the public
and private test sets. The public set was used throughout the competition to
compare models while the private set was only used at the end of the competition
for the final leaderboard.</p>
      <p>Our approach uses very little hand-engineering compared to those published
by other competitors. It is almost fully automated and based on artificial neural
networks. Section 2 introduces our winning model, which is based on a variant
of a multi-layer perceptron (MLP) architecture. Section 3 describes more
sophisticated alternative architectures that we also tried. Although they did not
perform as well as our simpler winning model for this particular task, we
believe that they can provide further insight into how to apply neural networks to
similar tasks. Section 4 and Section 5 compare and analyse our various models
quantitatively and qualitatively on both the competition testing set and a bigger
custom testing set.</p>
      <p><sup>5</sup> http://www.geolink.pt/ecmlpkdd2015-challenge/</p>
      <p><sup>6</sup> https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i</p>
      <p><sup>7</sup> The exact list of attributes for each trajectory can be found here: https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data</p>
    </sec>
    <sec id="sec-3">
      <title>The Winning Approach</title>
      <sec id="sec-3-1">
        <title>Data Distribution</title>
        <p>Our task is to predict the destination of a taxi given a prefix of its trajectory. As
the dataset is composed of full trajectories, we have to generate trajectory
prefixes by cutting the trajectories in the right way. The provided training dataset
is composed of more than 1.7 million complete trajectories, which gives 83,480,696
possible prefixes. The distribution of the training prefixes should be as close
as possible to that of the provided testing dataset on which we were eventually
evaluated. This test set was selected by taking five snapshots of the taxi
network activity at various dates and times. This means that the probability that
a trajectory appears in the test set is proportional to its length and that, for
each entire testing trajectory, all its possible prefixes had an equal probability
of being selected in the test set. Therefore, generating a training set with all
the possible prefixes of all the complete trajectories of the original training set
provides us with a training set which has the same distribution over prefixes
(and whole trajectories) as the test set.</p>
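        <p>The prefix generation described above can be sketched as follows. This is a minimal illustration, not our training pipeline: the function names are hypothetical, and we assume here that the complete trajectory counts as one of its own prefixes.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def all_prefixes(trajectory: Sequence[Point]) -> List[List[Point]]:
    """Return every non-empty prefix of a trajectory, shortest first."""
    return [list(trajectory[:n]) for n in range(1, len(trajectory) + 1)]


def training_examples(
    trajectories: Sequence[Sequence[Point]],
) -> Iterator[Tuple[List[Point], Point]]:
    """Yield (prefix, destination) pairs for all prefixes of all trajectories."""
    for trajectory in trajectories:
        if not trajectory:
            continue  # a ride without GPS points has no destination
        destination = trajectory[-1]
        for prefix in all_prefixes(trajectory):
            yield prefix, destination


toy_trajectories = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0)],
]
examples = list(training_examples(toy_trajectories))
```
]]></preformat>
        <p>A trajectory with n points contributes n examples, so longer rides are represented in proportion to their length, as in the test set.</p>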
      </sec>
      <sec id="sec-3-2">
        <title>MLP Architecture</title>
        <p>
          A Multi-Layer Perceptron (MLP) is a neural network in which each neuron of a
given layer is connected to all the neurons of the next layer, without any cycle.
It takes fixed-size vectors as input and processes them through one or several
hidden layers that compute higher-level representations of the input. Finally,
the output layer returns the prediction for the corresponding inputs. In our
case, the input layer receives a representation of the taxi’s prefix with associated
metadata and the output layer predicts the destination of the taxi (latitude and
longitude). We used standard hidden layers consisting of a matrix multiplication
followed by a bias and a nonlinearity. The nonlinearity we chose to use is the
Rectified Linear Unit (ReLU) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which simply computes max(0, x). Compared
to traditional sigmoid-shaped activation functions, the ReLU limits the vanishing
gradient problem as its derivative is always one when x is positive. For our
winning approach, we used a single hidden layer of 500 ReLU neurons.
        </p>
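        <p>A hidden layer of this kind can be written in a few lines of NumPy. The sketch below is illustrative only: the input size, the initialization scale and the variable names are assumptions, and only the layer width of 500 comes from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: element-wise max(0, x)."""
    return np.maximum(0.0, x)


def hidden_layer(x, W, b):
    """Matrix multiplication, followed by a bias and a ReLU nonlinearity."""
    return relu(x @ W + b)


rng = np.random.default_rng(0)
n_inputs, n_hidden = 20, 500          # n_inputs is an arbitrary choice for the demo
W = rng.normal(scale=0.01, size=(n_inputs, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=(4, n_inputs))    # a batch of 4 input vectors
h = hidden_layer(x, W, b)             # shape (4, 500), all values non-negative
```
]]></preformat>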
      </sec>
      <sec id="sec-3-3">
        <title>Input Layer</title>
        <p>One of the first problems we encountered was that the trajectory prefixes are
varying-length sequences of GPS points, which is incompatible with the fixed-size
input of the MLP. To circumvent (temporarily, see Section 3.1) this limitation,
we chose to consider only the first k points and last k points of the trajectory
prefix, which gives us a total of 2k points, or 4k numerical values for our input
vector. For the winning model we took k = 5. These GPS points are standardized
(zero-mean, unit-variance). When the trajectory prefix contains fewer
than 2k points, the first k and the last k points may overlap. When the
trajectory prefix contains fewer than k points, we pad the input vector by
repeating either the first or the last point.</p>
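        <p>The construction of the fixed-size GPS input can be sketched as below. The function name and its parameters are hypothetical. The padding rule is one possible reading of the text: we assume that the window of first points is padded by repeating the last point, and the window of last points by repeating the first point. The <monospace>mean</monospace> and <monospace>std</monospace> arguments stand for standardization statistics that would be computed on the training set.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def prefix_to_input(prefix, k=5, mean=0.0, std=1.0):
    """Map a variable-length prefix to a vector of 4k values.

    The vector holds the first k points followed by the last k points,
    each point being a standardized (latitude, longitude) pair.
    """
    points = np.asarray(prefix, dtype=float)
    if points.ndim != 2 or points.shape[0] == 0 or points.shape[1] != 2:
        raise ValueError("prefix must be a non-empty sequence of (lat, lon) pairs")
    points = (points - mean) / std
    n = len(points)
    first, last = points[:k], points[-k:]   # these overlap when n < 2k
    if n < k:
        pad = k - n
        # Assumed padding rule: repeat the last point after the first window,
        # and repeat the first point before the last window.
        first = np.concatenate([first, np.repeat(points[-1:], pad, axis=0)])
        last = np.concatenate([np.repeat(points[:1], pad, axis=0), last])
    return np.concatenate([first, last]).ravel()


short = prefix_to_input([(1.0, 2.0), (3.0, 4.0)], k=5)
longer = prefix_to_input([(float(i), float(i)) for i in range(7)], k=5)
```
]]></preformat>
        <p>Both calls return a vector of 20 values, whatever the length of the prefix.</p>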
        <p>
          To deal with the discrete metadata, consisting of client ID, taxi ID, date
and time information, we learn embeddings jointly with the model for each of
these pieces of information. This is inspired by neural language modeling approaches [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
in which each word is mapped to a vector space of fixed size (the vector is called
the embedding of the word). The table of embeddings for the words is included
in the model parameters that we learn and behaves as a regular parameter
matrix: the embeddings are first randomly initialized and are then modified by
the training algorithm like the other parameters. In our case, instead of words,
we have metadata values. More precisely, we have one embedding table for each
metadata field, with one row for each possible value of that field. For the date
and time, we decided to create higher-level variables that better describe human
activity: quarter of an hour, day of the week, week of the year (one embedding table
is learnt for each of them). These embeddings are then simply concatenated to
the 4k GPS coordinate values to form the input vector of the MLP. The complete list of
embeddings used in the winning model is given in Table 1.
        </p>
        <p>As the destination we aim to predict is composed of two scalar values (latitude
and longitude), it is natural to have two output neurons. However, we found that
it was difficult to train such a simple model because it does not take into account
any prior information on the distribution of the data. To tackle this issue, we
integrate prior knowledge of the destinations directly in the architecture of our
model: instead of directly predicting the destination position, we use a predefined
set <inline-formula><tex-math>(c_i)_{1 \le i \le C}</tex-math></inline-formula> of a few thousand destination cluster centers and a hidden layer
that associates a scalar value <inline-formula><tex-math>(p_i)_i</tex-math></inline-formula> (similar to a probability) to each of these
clusters. As the network must output a single destination position, for our output
prediction <inline-formula><tex-math>\hat{y}</tex-math></inline-formula>, we compute a weighted average of the predefined destination cluster
centers:
<disp-formula><tex-math>\hat{y} = \sum_{i=1}^{C} p_i c_i.</tex-math></disp-formula>
Note that this operation is equivalent to a simple linear output layer whose
weight matrix would be initialized as our cluster centers and kept fixed during
training. The hidden values <inline-formula><tex-math>(p_i)_i</tex-math></inline-formula> must sum to one so that <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> corresponds to a
centroid calculation and thus we compute them using a softmax layer:
<disp-formula><tex-math>p_i = \frac{\exp(e_i)}{\sum_{j=1}^{C} \exp(e_j)},</tex-math></disp-formula>
where <inline-formula><tex-math>(e_j)_j</tex-math></inline-formula> are the activations of the previous layer.</p>
        <p>The clusters <inline-formula><tex-math>(c_i)_i</tex-math></inline-formula> were calculated with a mean-shift clustering algorithm on
the destinations of all the training trajectories, returning a set of C = 3392
clusters. Our final MLP architecture is represented in Figure 1.</p>
        <p>[Figure 1: MLP architecture. A hidden layer produces the activations (e_i)_i, a softmax turns them into (p_i)_i, and the centroid of the clusters (c_i) weighted by (p_i)_i gives the destination prediction ŷ.]</p>
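        <p>To make the path from metadata embeddings to the centroid output concrete, the sketch below runs one forward pass in NumPy. It is an illustration under stated assumptions, not our implementation: the parameter names, the embedding size of 10, the 96 quarter-hour values, the random cluster centers and the small number of clusters are all hypothetical choices made for the demo (the text reports C = 3392 clusters obtained by mean-shift).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def softmax(e):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    weights = np.exp(e - e.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)


def mlp_forward(gps, meta_ids, params, clusters):
    """Embeddings + GPS values -> ReLU hidden layer -> softmax -> centroid."""
    names = sorted(meta_ids)  # fixed order so the input layout is stable
    embedded = [params["embeddings"][name][meta_ids[name]] for name in names]
    x = np.concatenate([gps] + embedded, axis=1)
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    e = hidden @ params["W2"] + params["b2"]   # one activation per cluster
    p = softmax(e)                             # rows sum to one
    return p @ clusters, p                     # y_hat = sum_i p_i c_i


rng = np.random.default_rng(0)
batch, k, n_hidden, n_clusters = 3, 5, 500, 8
# Hypothetical embedding tables: (number of possible values, embedding size).
table_shapes = {"taxi_id": (442, 10), "quarter_hour": (96, 10), "day_of_week": (7, 10)}
n_inputs = 4 * k + sum(shape[1] for shape in table_shapes.values())
params = {
    "embeddings": {name: rng.normal(scale=0.01, size=shape)
                   for name, shape in table_shapes.items()},
    "W1": rng.normal(scale=0.01, size=(n_inputs, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.01, size=(n_hidden, n_clusters)),
    "b2": np.zeros(n_clusters),
}
# Random stand-ins for the cluster centers, as (latitude, longitude) pairs.
clusters = rng.uniform(low=[41.0, -8.7], high=[41.3, -8.5], size=(n_clusters, 2))

gps = rng.normal(size=(batch, 4 * k))          # standardized first/last k points
meta_ids = {
    "taxi_id": np.array([0, 17, 441]),
    "quarter_hour": np.array([5, 40, 95]),
    "day_of_week": np.array([0, 3, 6]),
}
y_hat, p = mlp_forward(gps, meta_ids, params, clusters)
```
]]></preformat>
        <p>Because the weights p are non-negative and sum to one, each prediction is a convex combination of the cluster centers.</p>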
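        <p>The evaluation cost of the competition is the mean Haversine distance, which is
defined as follows (<inline-formula><tex-math>\lambda_x</tex-math></inline-formula> is the longitude of point x, <inline-formula><tex-math>\phi_x</tex-math></inline-formula> is its latitude, and R is the
radius of the Earth):
<disp-formula><tex-math>d_{\mathrm{haversine}}(x, y) = 2R \arctan\left(\sqrt{\frac{a(x, y)}{1 - a(x, y)}}\right),</tex-math></disp-formula>
where a(x, y) is defined as:
<disp-formula><tex-math>a(x, y) = \sin^2\left(\frac{\phi_y - \phi_x}{2}\right) + \cos(\phi_x)\cos(\phi_y)\sin^2\left(\frac{\lambda_y - \lambda_x}{2}\right).</tex-math></disp-formula>
Our models did not learn very well when trained directly on the Haversine
distance function and thus we used the simpler equirectangular distance instead,
which is a very good approximation at the scale of the city of Porto:
<disp-formula><tex-math>d_{\mathrm{equirectangular}}(x, y) = R \sqrt{\left((\lambda_y - \lambda_x)\cos\left(\frac{\phi_x + \phi_y}{2}\right)\right)^2 + (\phi_y - \phi_x)^2}.</tex-math></disp-formula></p>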
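        <p>The two distances can be compared numerically with the short sketch below. The function names, the Earth radius of 6371 km and the two sample coordinates are assumptions made for the illustration. The Haversine function uses <monospace>atan2</monospace>, which equals the arctangent of the square root of a / (1 - a) and also behaves well when a equals 1. The equirectangular function uses the cosine of the mean latitude, which is the standard form of that approximation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    lam1, lam2 = math.radians(lon1), math.radians(lon2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * radius * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def equirectangular(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Flat-map approximation, accurate for points that are close together."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    lam1, lam2 = math.radians(lon1), math.radians(lon2)
    x = (lam2 - lam1) * math.cos((phi1 + phi2) / 2)
    y = phi2 - phi1
    return radius * math.sqrt(x * x + y * y)


# Two arbitrary points a few kilometres apart in the Porto area.
d_hav = haversine(41.1496, -8.6110, 41.1621, -8.5830)
d_eqr = equirectangular(41.1496, -8.6110, 41.1621, -8.5830)
```
]]></preformat>
        <p>At this scale the two values differ by far less than a metre, which is consistent with using the equirectangular distance as the training cost.</p>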
        <p>We used stochastic gradient descent (SGD) with momentum to minimise the
mean equirectangular distance between our predictions and the actual
destination points. We set a fixed learning rate of 0.01, a momentum of 0.9 and a batch
size of 200.</p>
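        <p>For reference, one common form of the momentum update is sketched below with the hyperparameters quoted above. The text does not specify which momentum variant was used, so the classical formulation shown here is an assumption, as are the function name and the toy quadratic objective that replaces the real mini-batch cost.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def sgd_momentum_step(params, grads, velocity, learning_rate=0.01, momentum=0.9):
    """Classical momentum: v = momentum * v - lr * grad, then param = param + v."""
    for name in params:
        velocity[name] = momentum * velocity[name] - learning_rate * grads[name]
        params[name] = params[name] + velocity[name]


# Toy objective 0.5 * ||w - target||^2, whose gradient is simply (w - target).
target = np.array([3.0, -2.0])
params = {"w": np.zeros(2)}
velocity = {"w": np.zeros(2)}
for _ in range(500):
    grads = {"w": params["w"] - target}
    sgd_momentum_step(params, grads, velocity)
```
]]></preformat>
        <p>In actual training, <monospace>grads</monospace> would be the gradient of the mean equirectangular distance over a batch of 200 prefixes.</p>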
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Alternative Approaches</title>
      <p>The models that we are going to present in this section did not perform as well
for our specific destination prediction task on the competition test set but we believe that
they can provide interesting insights for other problems involving fixed-length
outputs and variable-length inputs.</p>
      <p><bold>Recurrent Neural Networks</bold></p>
      <p>[Figure 2: recurrent architecture, showing the chain of hidden states h.]</p>
      <p>
        As stated previously, an MLP is constrained by its fixed-length input, which
prevents us from fully exploiting the entire trajectory prefix. Therefore we
naturally considered recurrent neural net (RNN) architectures, which can read all
the GPS points one by one, updating a fixed-length internal state with the same
transition matrix at each time step. The last internal state of the RNN is
expected to summarize the prefix with relevant features for the specific task. Such
recurrent architectures are difficult to train due in particular to the problem of
vanishing and exploding gradients [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This problem is partially solved with long
short-term memory (LSTM) units [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which are crucial components in many
state of the art architectures for tasks including handwriting recognition [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ],
speech recognition [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], image captioning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or machine translation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>We implemented and trained an LSTM RNN that reads the trajectory one
GPS point at a time from the beginning to the end of each input prefix. This
architecture is represented in Figure 2. Furthermore, in order to help the network
to better identify short-term dependencies (such as the velocity of the taxi as
the difference between two successive data points), we also considered a variant
in which the input of the RNN is no longer a single GPS point but a window
of 5 successive GPS points of the prefix. The window shifts along the prefix by
one point at each RNN time step.</p>
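      <p>The windowed input can be sketched as follows. The function name is hypothetical, and the handling of prefixes shorter than the window (padding at the front with the first point) is an assumption, since the text does not describe that case.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def sliding_windows(prefix, size=5):
    """Return one row per RNN time step, each holding `size` successive points."""
    points = np.asarray(prefix, dtype=float)
    if points.ndim != 2 or len(points) == 0:
        raise ValueError("prefix must be a non-empty sequence of (lat, lon) pairs")
    if len(points) < size:
        # Assumed rule for short prefixes: repeat the first point at the front.
        pad = np.repeat(points[:1], size - len(points), axis=0)
        points = np.concatenate([pad, points])
    n_steps = len(points) - size + 1
    # The window shifts by one point at each time step.
    return np.stack([points[t:t + size].ravel() for t in range(n_steps)])


prefix = [(float(i), 10.0 * i) for i in range(7)]
windows = sliding_windows(prefix, size=5)      # shape (3, 10)
short_windows = sliding_windows(prefix[:2])    # shape (1, 10) after padding
```
]]></preformat>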
      <sec id="sec-4-1">
        <title>Bidirectional Recurrent Neural Networks</title>
        <p>
          We noticed that the most relevant parts of the prefix are its beginning and its end
and we therefore tried a bidirectional RNN [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (BRNN) to focus on these two
particular parts. Our previously described RNN reads the prefix forwards from
the first to the last known point, which leads to a final internal state containing
more information about the last points (the information about the first points
is more easily forgotten). In the BRNN architecture, one RNN reads the prefix
forwards while a second RNN reads the prefix backwards. The two final internal
states of the two RNNs are then concatenated and fed to a standard MLP that
will predict the destination in the same way as in our previous models. The
concatenation of these two final states is likely to capture more information
about the beginning and the end of the prefix. Figure 3 represents the BRNN
component of our architecture.
        </p>
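        <p>The idea of concatenating the two final states can be illustrated with the sketch below. To keep it short, it uses a plain tanh recurrence instead of LSTM units, which is a simplification and not the model we trained. The function names, the hidden size and the initialization are likewise assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def rnn_final_state(sequence, W_in, W_rec, b):
    """Read the sequence point by point and return the last hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x in sequence:
        h = np.tanh(x @ W_in + h @ W_rec + b)  # same transition at every step
    return h


def brnn_encode(sequence, forward_params, backward_params):
    """Concatenate the final states of a forward and a backward RNN."""
    h_forward = rnn_final_state(sequence, *forward_params)
    h_backward = rnn_final_state(sequence[::-1], *backward_params)
    return np.concatenate([h_forward, h_backward])


def init_rnn(rng, n_inputs, n_hidden):
    """Small random weights for the demo: (W_in, W_rec, b)."""
    return (rng.normal(scale=0.1, size=(n_inputs, n_hidden)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))


rng = np.random.default_rng(0)
n_hidden = 16
forward_params = init_rnn(rng, 2, n_hidden)
backward_params = init_rnn(rng, 2, n_hidden)

short_prefix = rng.normal(size=(4, 2))    # 4 GPS points
long_prefix = rng.normal(size=(30, 2))    # 30 GPS points
code_short = brnn_encode(short_prefix, forward_params, backward_params)
code_long = brnn_encode(long_prefix, forward_params, backward_params)
```
]]></preformat>
        <p>Both prefixes are mapped to a vector of the same size (twice the hidden size), which can then be fed to the MLP.</p>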
        <p>Fig. 3: Bidirectional RNN architecture.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Memory Networks</title>
        <p>
          Memory networks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] have been recently introduced as an architecture that
can exploit an external database by retrieving and storing relevant information
for each prediction. We have implemented a loosely related architecture, which
is represented in Figure 4 and which we will now describe. For each prefix to
predict, we extract m entire trajectories (which we call candidates) from the
training dataset. We then use two neural network encoders to encode the prefix
and the candidates respectively (the same encoder is used for all the candidates). An
encoder is the same as one of our previous architectures (either feedforward or
recurrent), except that we stop at the hidden layer instead of predicting an
output. This results in m + 1 fixed-length representations in the same vector space
so that they can be easily compared. Then we compute similarities by taking the
dot products of the prefix representation with all the candidate representations.
Finally we normalize these m similarity values with a softmax and use the
resulting probabilities to weight the destinations of the corresponding candidates.
In other words, the final destination prediction of the prefix is the centroid of
the candidate destinations weighted by the softmax probabilities. This is similar
to the way we combine clusters in our previously described architectures.
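        </p>
        <p>[Figure 4: memory network architecture. Encoder 1 maps the trajectory prefix to a representation r and encoder 2 maps the candidates to representations (r_i)_i. The dot products (e_i)_i are passed through a softmax to give (p_i)_i, and the centroid of the candidate destinations (c_i)_i weighted by (p_i)_i gives the destination prediction ŷ.]</p>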
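        <p>The prediction step of this architecture can be sketched as follows. The encoders here are single ReLU layers applied to fixed-size input vectors, and all names, sizes and random values are assumptions made for the illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def softmax(e):
    """Softmax over a vector of similarities, shifted for numerical stability."""
    weights = np.exp(e - e.max())
    return weights / weights.sum()


def encode(x, W, b):
    """Encoder that stops at the hidden layer (one ReLU layer in this sketch)."""
    return np.maximum(0.0, x @ W + b)


def memory_predict(prefix_x, candidate_x, candidate_destinations,
                   prefix_encoder, candidate_encoder):
    """Weight the candidate destinations by softmax-normalized dot products."""
    r = encode(prefix_x, *prefix_encoder)           # shape (d,)
    r_i = encode(candidate_x, *candidate_encoder)   # shape (m, d)
    e = r_i @ r                                     # m dot-product similarities
    p = softmax(e)
    return p @ candidate_destinations, p


rng = np.random.default_rng(0)
n_inputs, d, m = 20, 32, 100
prefix_encoder = (rng.normal(scale=0.1, size=(n_inputs, d)), np.zeros(d))
candidate_encoder = (rng.normal(scale=0.1, size=(n_inputs, d)), np.zeros(d))

prefix_x = rng.normal(size=n_inputs)
candidate_x = rng.normal(size=(m, n_inputs))
# Random stand-ins for the destinations of the m candidate trajectories.
candidate_destinations = rng.uniform(low=[41.0, -8.7], high=[41.3, -8.5], size=(m, 2))

y_hat, p = memory_predict(prefix_x, candidate_x, candidate_destinations,
                          prefix_encoder, candidate_encoder)
```
]]></preformat>
        <p>With a single candidate, the softmax weight is one and the prediction is exactly that candidate's destination.</p>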
        <p>As the trajectory database is very large, such an architecture is quite
challenging to implement efficiently. Therefore, for each prefix (more precisely for
each batch of prefixes), we naively select m = 10,000 random candidates. We
believe that more sophisticated retrieval functions could significantly improve the
results, but we did not have time to implement them. In particular, one could use
a pre-defined (hand-engineered) similarity measure to retrieve the candidates most
similar to the particular prefix.</p>
        <p>The two encoders that map prefixes and candidates into the same
representation space can either be feedforward or recurrent (bidirectional). As RNNs
are more expensive to train (both in terms of computation time and RAM
consumption), we had to limit ourselves to an MLP with a single hidden layer of
500 ReLUs for the encoder. We trained the architecture with a batch size of
5,000 examples, and for each batch we randomly pick 10,000 candidates from the
training set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <sec id="sec-5-1">
        <title>Custom Validation Set</title>
        <p>As the competition testing dataset is particularly small, we cannot reliably
compare models on it. Therefore, for the purpose of this paper, we will compare our
models on two bigger datasets: a validation dataset composed of 19,427
trajectories and a testing dataset composed of 19,770 trajectories. We obtained these
new testing and validation sets by extracting (and removing) random portions of
the original training set. The validation dataset is used to early-stop our training
algorithms for each model based on the best validation score, while the testing
dataset is used to compare our different trained models.</p>
        <p><sup>8</sup> Our winning submission on Kaggle scored 2.03 but the model had not been trained
until convergence.</p>
        <p><sup>9</sup> Average over 381 teams; the submissions with worse scores than the public benchmark
(in which the center of Porto is always predicted) have been discarded.</p>
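        <p>Carving validation and test sets out of the training set, as described in this section, can be done with a random permutation of trajectory indices. The sketch below is one simple way to do it. The function name, the seed and the toy sizes are assumptions, and it does not reproduce the exact split we used.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def holdout_split(n_trajectories, n_valid, n_test, seed=0):
    """Randomly remove validation and test trajectories from the training set."""
    if n_valid + n_test > n_trajectories:
        raise ValueError("held-out sets are larger than the dataset")
    order = np.random.default_rng(seed).permutation(n_trajectories)
    valid = order[:n_valid]
    test = order[n_valid:n_valid + n_test]
    train = order[n_valid + n_test:]   # what remains is used for training
    return train, valid, test


train_idx, valid_idx, test_idx = holdout_split(1000, n_valid=20, n_test=25)
```
]]></preformat>
        <p>The three index sets are disjoint, so no held-out trajectory is seen during training.</p>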
      </sec>
      <sec id="sec-5-2">
        <title>Results</title>
        <p>Our winning ECML/PKDD challenge model is the MLP model that uses clusters
of the destinations. However, it is not our best model on our custom test set, which
is the BRNN with window. As our custom test set is much larger, the scores it
gives us are significantly more reliable than those on the competition test set and we
can therefore assert that our overall best model is the BRNN with window.</p>
        <p>
          The results also show that embeddings and clusters significantly improve
our models. The importance of embeddings can also be confirmed by visualizing
them. Figure 5 shows 2D t-SNE [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] projections for two of these embeddings
and clear patterns can be observed, suggesting that quarters of an hour and weeks of
the year are important features for the prediction.
        </p>
        <p>The reported score of the memory network is lower than the others but this
might be due to the fact that it was not trained until convergence (we stopped
after one week on a high-end GPU).</p>
        <p>The scores on our custom test set are higher than the scores on the public
and private test sets used for the competition. This suggests that the competition
testing set is composed of rides that took place at very specific dates and times
with very particular trajectory distributions. The gap between the public and
private test sets is probably due to the fact that they are particularly small.
In contrast, our validation and test sets are big enough to obtain more reliable
statistics.</p>
        <p>All the models we have explored are very computationally intensive and we
thus had to train them on GPUs to avoid weeks of training. Our competition-winning
model is the least intensive and can be trained in half a day on a GPU.
On the other hand, our recurrent and memory networks are much slower and we
believe that we could reach even better scores by training them longer.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We introduced an almost fully automated neural network approach to predict
the destination of a taxi based on the beginning of its trajectory and
associated metadata. Our best model uses a bidirectional recurrent neural network to
encode the prefix, several embeddings to encode the metadata and destination
clusters to generate the output.</p>
      <p>One potential limitation of our clustering-based output layer is that the final
prediction can only fall in the convex hull of the clusters. A potential solution
would be to learn the clusters as parameters of the network and initialize them
either randomly or from the mean-shift clusters.</p>
      <p>Concerning the memory network, one could consider more sophisticated ways
to extract candidates, such as using a hand-engineered similarity measure or
even the similarity measure learnt by the memory network. In the latter case,
the learnt similarity should be used to extract only a proportion of the
candidates, so that candidates with poor similarity still have a chance of being selected.
Furthermore, instead of using the dot product to compare prefix and candidate
representations, more complex functions could be used (such as the
concatenation of the representations followed by non-linear layers).</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>
          The authors would like to thank the developers of Theano [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], Blocks and
Fuel [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] for developing such powerful tools. We acknowledge the support of the
following organizations for research funding and computing support: Samsung,
NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and
CIFAR.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Glorot</surname>
          </string-name>
          , Antoine Bordes, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Deep sparse rectifier neural networks</article-title>
          .
          <source>In International Conference on Artificial Intelligence and Statistics</source>
          , pages
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Réjean Ducharme, Pascal Vincent, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Janvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Patrice Simard, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <article-title>Learning long-term dependencies with gradient descent is difficult</article-title>
          .
          <source>Neural Networks, IEEE Transactions on</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>A novel connectionist system for unconstrained handwriting recognition</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence, IEEE Transactions on</source>
          ,
          <volume>31</volume>
          (
          <issue>5</issue>
          ):
          <fpage>855</fpage>
          -
          <lpage>868</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Doetsch</surname>
          </string-name>
          , Michal Kozielski, and Hermann Ney.
          <article-title>Fast and robust training of recurrent neural networks for offline handwriting recognition</article-title>
          .
          <source>In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on</source>
          , pages
          <fpage>279</fpage>
          -
          <lpage>284</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdel-rahman</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on</source>
          , pages
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Haşim</given-names>
            <surname>Sak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Senior</surname>
          </string-name>
          , and Françoise Beaufays.
          <article-title>Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1402.1128</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Kelvin</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>arXiv preprint arXiv:1502.03044</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Santiago Fernández, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Bidirectional LSTM networks for improved phoneme classification and recognition</article-title>
          .
          <source>In Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005</source>
          , pages
          <fpage>799</fpage>
          -
          <lpage>804</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , Sumit Chopra, and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          .
          <article-title>Memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1410.3916</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Laurens</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Frédéric</given-names>
            <surname>Bastien</surname>
          </string-name>
          , Pascal Lamblin, Razvan Pascanu, James Bergstra,
          <string-name>
            <given-names>Ian J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Arnaud Bergeron, Nicolas Bouchard, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Theano: new features and speed improvements</article-title>
          .
          <source>Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          , Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Theano: a CPU and GPU math expression compiler</article-title>
          .
          <source>In Proceedings of the Python for Scientific Computing Conference (SciPy)</source>
          ,
          June
          <year>2010</year>
          . Oral Presentation.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Serdyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chorowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Blocks and Fuel: Frameworks for deep learning</article-title>
          .
          <source>arXiv e-prints</source>
          , June
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>