<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tree-Structured Composition in Neural Networks without Tree-Structured Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samuel R. Bowman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher D. Manning</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Potts</string-name>
          <email>cgpotts@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>Stanford, CA 94305-2150</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tree-structured neural networks encode a particular tree geometry for a sentence in the network design. However, these models have at best only slightly outperformed simpler sequence-based models. We hypothesize that neural sequence models like LSTMs are in fact able to discover and implicitly use recursive compositional structure, at least for tasks with clear cues to that structure in the data. We demonstrate this possibility using an artificial data task for which recursive compositional structure is crucial, and find an LSTM-based sequence model can indeed learn to exploit the underlying tree structure. However, its performance consistently lags behind that of tree models, even on large training sets, suggesting that tree-structured models are more effective at exploiting recursive structure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Neural networks that encode sentences as real-valued vectors have been successfully used in a wide
array of NLP tasks, including translation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], parsing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and sentiment analysis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These models
are generally either sequence models based on recurrent neural networks, which build
representations incrementally from left to right [
        <xref ref-type="bibr" rid="ref1 ref4">4, 1</xref>
        ], or tree-structured models based on recursive neural
networks, which build representations incrementally according to the hierarchical structure of
linguistic phrases [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        While both model classes perform well on many tasks, and both are under active development, tree
models are often presented as the more principled choice, since they align with standard
linguistic assumptions about constituent structure and the compositional derivation of complex meanings.
Nevertheless, tree models have not shown the kinds of dramatic performance improvements over
sequence models that their billing would lead one to expect: head-to-head comparisons with sequence
models show either modest improvements [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or none at all [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        We propose a possible explanation for these results: standard sequence models can learn to exploit
recursive syntactic structure in generating representations of sentence meaning, thereby learning to
use the structure that tree models are explicitly designed around. This requires that sequence models
be able to identify syntactic structure in natural language. We believe this is plausible on the basis
of other recent research [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In this paper, we evaluate whether LSTM sequence models are able
to use such structure to guide interpretation, focusing on cases where syntactic structure is clearly
indicated in the data.
      </p>
      <p>
        We compare standard tree and sequence models on their handling of recursive structure by training
the models on sentences whose length and recursion depth are limited, and then testing them on
longer and more complex sentences, such that only models that exploit the recursive structure will
be able to generalize in a way that yields correct interpretations for these test sentences. Our methods
extend those of our earlier work in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which introduces an experiment and corresponding artificial
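dataset to test this ability in two tree models. We adapt that experiment to sequence models by
decorating the statements with an explicit bracketing, and we use this design to compare an LSTM
sequence model with three tree models, with a focus on what data each model needs in order to
generalize well.
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Example sentence pairs from the artificial language.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Premise</th><th>Hypothesis</th></tr>
          </thead>
          <tbody>
            <tr><td>not p3</td><td>p3</td></tr>
            <tr><td>p3</td><td>p3 or p2</td></tr>
            <tr><td>(not p2) and p6</td><td>not (p6 or (p5 or p3))</td></tr>
            <tr><td>p4 or (not ((p1 or p6) or p4))</td><td>not ((((not p6) or (not p4)) and (not p5)) and (p6 and p6))</td></tr>
          </tbody>
        </table>
      </table-wrap>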
      <p>
        As in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we find that standard tree neural networks are able to make the necessary generalizations,
with their performance decaying gradually as the structures in the test set grow in size. We
additionally find that extending the training set to include larger structures mitigates this decay. Turning
to sequence models, we find that a single-layer LSTM is also able to generalize to unseen
large structures, but that it does this only when trained on a larger and more complex training set
than is needed by the tree models to reach the same generalization performance.
      </p>
      <p>
        Our results engage with those of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who find that sequence models can learn to recognize
syntactic structure in natural language, at least when trained on explicitly syntactic tasks. The
simplest model presented in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses an LSTM sequence model to encode each sentence as a vector,
and then generates a linearized parse (a sequence of brackets and constituent labels) with high
accuracy using only the information present in the vector. This shows that the LSTM is able to identify
the correct syntactic structures and also hints that it is able to develop a generalizable method for
encoding these structures in vectors. However, the massive size of the dataset needed to train that
model, 250M tokens, leaves open the possibility that it primarily learns to generate only tree
structures that it has already seen, representing them as simple hashes (which would not capture unseen
tree structures) rather than as structured objects. Our experiments show that LSTMs can
learn to understand tree structures when given enough data, suggesting that there is no fundamental
obstacle to learning this kind of structured representation. We also find, however, that sequence
models lag behind tree models across the board, even on training corpora that are quite large relative to
the complexity of the underlying grammar, suggesting that tree models can play a valuable role in
tasks that require recursive interpretation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Recursive structure in artificial data</title>
      <p>
        Reasoning about entailment. The data that we use define a version of the recognizing textual
entailment task, in which the goal is to determine what kind of logical consequence relation holds
between two sentences, drawing on a small fixed vocabulary of relations such as entailment,
contradiction, and synonymy. This task is well suited to evaluating neural network models for sentence
interpretation: models must develop comprehensive representations of the meanings of each
sentence to do well at the task, but the data do not force these representations to take a specific form,
allowing the model to learn whatever kind of representations it can use most effectively.
      </p>
      <p>
        The data we use are labeled with the seven mutually exclusive logical relations of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which
distinguish entailment in two directions (⊏, ⊐), equivalence (≡), exhaustive and non-exhaustive
contradiction (^, |), and two types of semantic independence (#, ⌣).
      </p>
      <p>
        The artificial language. The language described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (§4) is designed to highlight the use of
recursive structure with minimal additional complexity. Its vocabulary consists only of six unanalyzed
word types (p1, p2, p3, p4, p5, p6) and the connectives and, or, and not. Sentences of the language can be
straightforwardly interpreted as statements of propositional logic (where the six unanalyzed word types are
variable names), and labeled sentence pairs can be interpreted as theorems of that logic. Some
example pairs are provided in Table 1.
      </p>
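      <p>As a sketch of how labeled pairs can be read as theorems of propositional logic, the following Python fragment derives a relation label for a pair by enumerating all 64 truth assignments to p1 through p6. The nested-tuple sentence encoding, the function names, and the English relation names (negation and alternation for exhaustive and non-exhaustive contradiction, cover and independence for the two kinds of semantic independence) are illustrative assumptions, not part of the original data pipeline. The order of the checks only matches the seven relations when neither sentence is a tautology or a contradiction.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

VARIABLES = ("p1", "p2", "p3", "p4", "p5", "p6")


def evaluate(sentence, assignment):
    """Recursively evaluate a sentence given as a variable name or a nested tuple."""
    if isinstance(sentence, str):
        return assignment[sentence]
    operator = sentence[0]
    if operator == "not":
        return not evaluate(sentence[1], assignment)
    left = evaluate(sentence[1], assignment)
    right = evaluate(sentence[2], assignment)
    if operator == "and":
        return left and right
    if operator == "or":
        return left or right
    raise ValueError("unknown operator: " + repr(operator))


def models(sentence):
    """Return the set of truth assignments (as tuples) that satisfy the sentence."""
    satisfying = set()
    for values in product((False, True), repeat=len(VARIABLES)):
        if evaluate(sentence, dict(zip(VARIABLES, values))):
            satisfying.add(values)
    return satisfying


def relation(premise, hypothesis):
    """Name the set-theoretic relation between the two sets of satisfying assignments."""
    a, b = models(premise), models(hypothesis)
    disjoint = not (a & b)
    exhaustive = len(a | b) == 2 ** len(VARIABLES)
    if a == b:
        return "equivalence"
    if a < b:  # strict subset: the premise entails the hypothesis
        return "forward entailment"
    if a > b:  # strict superset: the hypothesis entails the premise
        return "reverse entailment"
    if disjoint and exhaustive:
        return "negation"
    if disjoint:
        return "alternation"
    if exhaustive:
        return "cover"
    return "independence"
```
]]></preformat>
      <p>Under these assumptions, relation(("not", "p3"), "p3") yields negation and relation("p3", ("or", "p3", "p2")) yields forward entailment.</p>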
      <p>Crucially, the language is defined such that any sentence can be embedded under negation or
conjunction to create a new sentence, allowing for arbitrary-depth recursion, and such that the scope of
negation and conjunction is determined only by bracketing with parentheses (rather than bare word
order). The compositional structure of each sentence can thus be an arbitrary tree, and interpreting
a sentence correctly requires using that structure.</p>
      <p>The data come with parentheses representing a complete binary bracketing. Our models use this
information in two ways. For the tree models, the parentheses are not word tokens, but rather are
used in the expected way to build the tree. For the sequence model, the parentheses are word tokens
with associated learned embeddings. This approach provides the models with equivalent data, so
their ability to handle unseen structures can be reasonably compared.</p>
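      <p>The two uses of the bracketing can be illustrated with a minimal Python sketch. It assumes a hypothetical input format in which every constituent is wrapped in parentheses and tokens are separated by spaces, for example "( ( not p2 ) ( and p6 ) )"; the exact bracketing convention of the released data and the function names below are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def tokenize(sentence):
    """Split a fully bracketed sentence into word and parenthesis tokens."""
    return sentence.replace("(", " ( ").replace(")", " ) ").split()


def sequence_input(sentence):
    """Sequence-model view: parentheses are kept as ordinary tokens."""
    return tokenize(sentence)


def tree_input(sentence):
    """Tree-model view: parentheses are consumed to build a nested binary tree."""
    tokens = tokenize(sentence)
    position = 0

    def parse():
        nonlocal position
        token = tokens[position]
        position += 1
        if token != "(":
            return token  # a leaf: a word such as "p3", "not", or "or"
        children = []
        while tokens[position] != ")":
            children.append(parse())
        position += 1  # skip the closing parenthesis
        if len(children) != 2:
            raise ValueError("expected a binary bracketing")
        return (children[0], children[1])

    tree = parse()
    if position != len(tokens):
        raise ValueError("trailing tokens after the root constituent")
    return tree
```
]]></preformat>
      <p>For the example string above, sequence_input returns ten tokens (six of them parentheses), while tree_input returns the nested pair (("not", "p2"), ("and", "p6")) with no parenthesis tokens.</p>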
      <p>The data. Our sentence pairs are divided into thirteen bins according to the number of logical
connectives (and, or, not) in the longer of the two sentences in each pair. We test each model on
each bin separately (58k total examples, using an 80%/20% train/test split) in order to evaluate how
each model’s performance depends on the complexity of the sentences. In three experiments, we
train our models on the training portions of bins 0–3 (62k examples), 0–4 (90k), and 0–6 (160k), and
test on every bin but the trivial bin 0. Capping the size of the training sentences allows us to evaluate
how the models interpret the sentences: if a model’s performance falls off abruptly above the cutoff,
it is reasonable to conclude that it relies heavily on specific sentence structures and cannot generalize
to new structures. If a model’s performance decays gradually<sup>1</sup> with no such abrupt change, then it
must have learned a more generally valid interpretation function for the language which respects its
recursive structure.</p>
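      <p>The binning and the training cutoff can be sketched in a few lines of Python. The sketch reads "the longer of the two sentences" as the sentence with more connectives, which is an interpretive assumption, and the function names and the representation of a pair as a (premise, hypothesis) tuple of strings are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
CONNECTIVES = {"and", "or", "not"}


def connective_count(sentence):
    """Count the logical connectives in a sentence, ignoring parentheses."""
    words = sentence.replace("(", " ").replace(")", " ").split()
    return sum(1 for word in words if word in CONNECTIVES)


def bin_index(premise, hypothesis):
    """A pair's bin is the connective count of its more complex sentence."""
    return max(connective_count(premise), connective_count(hypothesis))


def training_subset(pairs, max_bin):
    """Keep only the pairs whose bin index does not exceed the training cutoff."""
    return [pair for pair in pairs if bin_index(pair[0], pair[1]) <= max_bin]
```
]]></preformat>
      <p>With a cutoff of max_bin=3, for instance, a pair such as ("(not p2) and p6", "not (p6 or (p5 or p3))") falls in bin 3 and is retained, while any pair containing a sentence with four or more connectives is held out for testing only.</p>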
    </sec>
    <sec id="sec-3">
      <title>Testing sentence models on entailment</title>
      <p>
        We use the architecture depicted in Figure 1a, which builds on the one used in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The model
architecture uses two copies of a single sentence model (a tree or sequence model) to encode the
premise and hypothesis (left and right side) expressions, and then uses those encodings as the
features for a multilayer classifier which predicts one of the seven relations. Since the encodings are
computed separately, the sentence models must encode complete representations of the meanings of
the two sentences for the downstream model to be able to succeed.
      </p>
      <p>
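        Classifier. The classifier component of the model consists of a combining layer which takes the two
sentence representations as inputs, followed by two neural network layers, then a softmax classifier.
For the combining layer, we use a neural tensor network (NTN, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) layer, which sums the output
of a plain recursive/recurrent neural network layer with a vector computed using two multiplications
with a learned (full rank) third-order tensor parameter:
        <disp-formula id="eq-1"><label>(1)</label><tex-math><![CDATA[\vec{y}_{NN} = \tanh\left(\mathbf{M}\begin{bmatrix}\vec{x}^{(l)} \\ \vec{x}^{(r)}\end{bmatrix} + \vec{b}\right)]]></tex-math></disp-formula>
        <disp-formula id="eq-2"><label>(2)</label><tex-math><![CDATA[\vec{y}_{NTN} = \vec{y}_{NN} + \tanh\left(\vec{x}^{(l)T}\,\mathbf{T}^{[1 \ldots n]}\,\vec{x}^{(r)}\right)]]></tex-math></disp-formula>
      </p>
      <p>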
Our model is largely identical to the model from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but adds the two additional tanh NN layers,
which we found help performance across the board, and also uses the NTN combination layer when
evaluating all four models, rather than just the TreeRNTN model, so as to ensure that the sentence
models are compared in as similar a setting as possible.
      </p>
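      <p>To make the two layer functions in equations (1) and (2) concrete, here is a minimal pure-Python sketch. It is not the implementation used in the experiments: the list-based vectors, the function names, and the small uniform initialization in random_parameters are all assumptions chosen to keep the example self-contained.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def nn_layer(x_left, x_right, M, b):
    """Equation (1): tanh(M [x_left; x_right] + b)."""
    x = x_left + x_right  # list concatenation stacks the two input vectors
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(M, b)]


def ntn_layer(x_left, x_right, M, b, T):
    """Equation (2): the NN output plus tanh of one bilinear form per output unit."""
    y_nn = nn_layer(x_left, x_right, M, b)
    y_ntn = []
    for y, T_k in zip(y_nn, T):
        # T_k is the k-th slice of the third-order tensor (a square matrix).
        bilinear = sum(x_left[i] * T_k[i][j] * x_right[j]
                       for i in range(len(x_left))
                       for j in range(len(x_right)))
        y_ntn.append(y + math.tanh(bilinear))
    return y_ntn


def random_parameters(input_dim, output_dim, seed=0):
    """Draw illustrative parameters; the initialization range is an assumption."""
    rng = random.Random(seed)

    def draw():
        return rng.uniform(-0.1, 0.1)

    M = [[draw() for _ in range(2 * input_dim)] for _ in range(output_dim)]
    b = [draw() for _ in range(output_dim)]
    T = [[[draw() for _ in range(input_dim)] for _ in range(input_dim)]
         for _ in range(output_dim)]
    return M, b, T
```
]]></preformat>
      <p>When every entry of T is zero the bilinear term vanishes and ntn_layer returns exactly the nn_layer output, which makes explicit that the NTN layer adds a multiplicative interaction term on top of the plain layer.</p>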
      <p>
        We only study models that encode entire sentences in fixed-length vectors, and we set aside models
with attention [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a technique which gives the downstream model (here, the classifier) the potential
to access each input token individually through a soft content addressing system. While attention
simplifies the problem of learning complex correspondences between input and output, there is
no apparent reason to believe that it should improve or harm a model’s ability to track structural
information like a given token’s position in a tree. As such, we expect our results to reflect the same
basic behaviors that would be seen in attention-based models.
      </p>
      <p><sup>1</sup>Since sentences are fixed-dimensional vectors of fixed-precision floating point numbers, all models will
make errors on sentences above some length, and L2 regularization (which helps overall performance)
exacerbates this by discouraging the model from using the kind of numerically precise, nonlinearity-saturating
functions that generalize best.</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>(a) The general architecture shared across models: a 50d premise sentence model and a 50d hypothesis sentence model, each applied to its own input, feed a 100d tanh NTN layer, followed by two 100d tanh NN layers and a 7-way softmax classifier.</p>
          <p>(b) The architecture for the tree-structured sentence models. Terminal nodes are learned embeddings and nonterminal nodes are NN, NTN, or TreeLSTM layers.</p>
          <p>(c) The architecture for the sequence sentence model. Nodes in the lower row are learned embeddings and nodes in the upper row are LSTM layers.</p>
        </caption>
      </fig>
      <p>
Sentence models. The sentence encoding component of the model transforms the (learned)
embeddings of the input words for each sentence into a single vector representing that sentence. We
experiment with tree-structured models (Figure 1b) with TreeRNN (eqn. 1), TreeRNTN (eqn. 2),
and TreeLSTM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] activation functions. In addition, we use a sequence model (Figure 1c) with an
LSTM activation function [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] implemented as in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In experiments with a simpler non-LSTM
RNN sequence model, the model tended to badly underfit the training data, and those results are not
included here.
      </p>
      <p>
        Training. We randomly initialize all embeddings and layer parameters, and train them using
minibatch stochastic gradient descent with AdaDelta [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] learning rates. Our objective is the standard
negative log likelihood classification objective with L2 regularization (tuned on a separate train/test
split). All models were trained for 100 epochs, after which all had largely converged without
significantly declining from their peak performances.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>The results are shown in Figure 2. The tree models fit the training data well, reaching 98.9, 98.8,
and 98.4% overall accuracy respectively in the ≤ 6 setting, with the LSTM underfitting slightly at
94.8%. In that setting, all models generalized well to structures of familiar length, with the tree
models all surpassing 97% on examples in bin 4, and the LSTM reaching 94.8%. On the longer
test sentences, the tree models decay smoothly in performance across the board, while the LSTM
decays more quickly and more abruptly, with a striking difference in the ≤ 4 setting, where LSTM
performance falls 10% from bin 4 to bin 5, compared to 4.4% for the next worst model. However,
the LSTM improves considerably with more ample training data in the ≤ 6 condition, showing only
a 3% drop and generalization results better than the best model’s in the ≤ 3 setting.</p>
      <p>
        All four models robustly beat the simple baselines reported in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: the most frequent class occurs
just over 50% of the time and a neural bag of words model does reasonably on the shortest examples
but falls below 60% by bin 4.
      </p>
      <fig id="fig-2">
        <label>Figure 2</label>
        <caption>
          <p>Test accuracy on three experiments with increasingly rich training sets. The horizontal axis on each graph divides the test set expression pairs into bins by the number of logical operators in the more complex of the two expressions in the pair. The dotted line shows the size of the largest examples in the training set in each experiment. Vertical axis: accuracy. Horizontal axis: size of longer expression. Models plotted: 50d LSTM, 50d TreeRNN, 50d TreeRNTN, 50d TreeLSTM.</p>
        </caption>
      </fig>
      <p>
We find that all four models are able to effectively exploit a recursively defined language to interpret
sentences with complex unseen structures. We find that tree models’ biases allow them to do this
with greater efficiency, outperforming sequence-based models substantially in every experiment.</p>
      <p>However, our sequence model is nonetheless able to generalize smoothly from seen sentence
structures to unseen ones, showing that its lack of explicit recursive structure does not prevent it from
recognizing recursive structure in our artificial language.</p>
      <p>We interpret these results as evidence that both tree and sequence architectures can play valuable
roles in the construction of sentence models over data with recursive syntactic structure. Tree
architectures provide an explicit bias that makes it possible to efficiently learn compositional
interpretation, which is difficult for sequence models. Sequence models, on the other hand, lack this
bias, but have other advantages. Since they use a consistent graph structure across examples, it is
easy to accelerate minibatch training in ways that yield substantially faster training times than are
possible with tree models, especially with GPUs. In addition, when sequence models integrate each
word into a partial sentence representation, they have access to the entire sentence representation up
to that point, which may provide valuable cues for the resolution of lexical ambiguity. Lexical ambiguity is not
present in our artificial language, but it is a serious concern in natural language text.</p>
      <p>Finally, we suggest that, because of the well-supported linguistic claim that the kind of recursive
structure that we study here is key to the understanding of real natural languages, there is likely to
be value in developing sequence models that can more efficiently exploit this structure without fully
sacrificing the flexibility that makes them succeed.</p>
      <p>Acknowledgments.
We gratefully acknowledge a Google Faculty Research Award, a gift from Bloomberg L.P., and
support from the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and
Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no.
FA8750-13-2-0040, the National Science Foundation under grant no. IIS 1159679, and the Department of
the Navy, Office of Naval Research, under grant no. N00014-13-1-0287. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of Google, Bloomberg L.P., DARPA, AFRL, NSF, ONR, or the US
government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Ilya</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>Oriol</given-names> <surname>Vinyals</surname></string-name>, and
          <string-name><given-names>Quoc V.</given-names> <surname>Le</surname></string-name>.
          <article-title>Sequence to sequence learning with neural networks</article-title>.
          <source>In Proc. NIPS</source>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Chris</given-names> <surname>Dyer</surname></string-name>,
          <string-name><given-names>Miguel</given-names> <surname>Ballesteros</surname></string-name>,
          <string-name><given-names>Wang</given-names> <surname>Ling</surname></string-name>,
          <string-name><given-names>Austin</given-names> <surname>Matthews</surname></string-name>, and
          <string-name><given-names>Noah A.</given-names> <surname>Smith</surname></string-name>.
          <article-title>Transition-based dependency parsing with stack long short-term memory</article-title>.
          <source>In Proc. ACL</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Kai Sheng</given-names> <surname>Tai</surname></string-name>,
          <string-name><given-names>Richard</given-names> <surname>Socher</surname></string-name>, and
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>Improved semantic representations from tree-structured long short-term memory networks</article-title>.
          <source>In Proc. ACL</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Jeffrey L.</given-names> <surname>Elman</surname></string-name>.
          <article-title>Finding structure in time</article-title>.
          <source>Cognitive Science</source>,
          <volume>14</volume>(<issue>2</issue>),
          <year>1990</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Goller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Kuchler</surname>
          </string-name>
          .
          <article-title>Learning task-dependent distributed representations by backpropagation through structure</article-title>
          .
          <source>In Proc. IEEE International Conference on Neural Networks</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Richard</given-names> <surname>Socher</surname></string-name>,
          <string-name><given-names>Jeffrey</given-names> <surname>Pennington</surname></string-name>,
          <string-name><given-names>Eric H.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>Andrew Y.</given-names> <surname>Ng</surname></string-name>, and
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>Semi-supervised recursive autoencoders for predicting sentiment distributions</article-title>.
          <source>In Proc. EMNLP</source>,
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Jiwei</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Minh-Thang</given-names> <surname>Luong</surname></string-name>,
          <string-name><given-names>Dan</given-names> <surname>Jurafsky</surname></string-name>, and
          <string-name><given-names>Eduard</given-names> <surname>Hovy</surname></string-name>.
          <article-title>When are tree structures necessary for deep learning of representations?</article-title>
          <source>In Proc. EMNLP</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
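          [8]
          <string-name><given-names>Oriol</given-names> <surname>Vinyals</surname></string-name>,
          <string-name><given-names>Lukasz</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>Terry</given-names> <surname>Koo</surname></string-name>,
          <string-name><given-names>Slav</given-names> <surname>Petrov</surname></string-name>,
          <string-name><given-names>Ilya</given-names> <surname>Sutskever</surname></string-name>, and
          <string-name><given-names>Geoffrey</given-names> <surname>Hinton</surname></string-name>.
          <article-title>Grammar as a foreign language</article-title>.
          <source>In Proc. NIPS</source>,
          <year>2015</year>.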
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Andrej</given-names> <surname>Karpathy</surname></string-name>,
          <string-name><given-names>Justin</given-names> <surname>Johnson</surname></string-name>, and
          <string-name><given-names>Fei-Fei</given-names> <surname>Li</surname></string-name>.
          <article-title>Visualizing and understanding recurrent networks</article-title>.
          <source>arXiv:1506.02078</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Samuel R.</given-names> <surname>Bowman</surname></string-name>,
          <string-name><given-names>Christopher</given-names> <surname>Potts</surname></string-name>, and
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>Recursive neural networks can learn logical semantics</article-title>.
          <source>In Proc. of the 3rd Workshop on Continuous Vector Space Models and their Compositionality</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Bill</given-names> <surname>MacCartney</surname></string-name> and
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>An extended model of natural logic</article-title>.
          <source>In Proc. of the Eighth International Conference on Computational Semantics</source>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Danqi</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Richard</given-names> <surname>Socher</surname></string-name>,
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>, and
          <string-name><given-names>Andrew Y.</given-names> <surname>Ng</surname></string-name>.
          <article-title>Learning new facts from knowledge bases with neural tensor networks and semantic word vectors</article-title>.
          <source>In Proc. ICLR</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Dzmitry</given-names> <surname>Bahdanau</surname></string-name>,
          <string-name><given-names>Kyunghyun</given-names> <surname>Cho</surname></string-name>, and
          <string-name><given-names>Yoshua</given-names> <surname>Bengio</surname></string-name>.
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>.
          <source>In Proc. ICLR</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>Sepp</given-names> <surname>Hochreiter</surname></string-name> and
          <string-name><given-names>Jürgen</given-names> <surname>Schmidhuber</surname></string-name>.
          <article-title>Long short-term memory</article-title>.
          <source>Neural Computation</source>,
          <volume>9</volume>(<issue>8</issue>),
          <year>1997</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Wojciech</given-names> <surname>Zaremba</surname></string-name>,
          <string-name><given-names>Ilya</given-names> <surname>Sutskever</surname></string-name>, and
          <string-name><given-names>Oriol</given-names> <surname>Vinyals</surname></string-name>.
          <article-title>Recurrent neural network regularization</article-title>.
          <source>In Proc. ICLR</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Matthew D.</given-names> <surname>Zeiler</surname></string-name>.
          <article-title>ADADELTA: an adaptive learning rate method</article-title>.
          <source>arXiv:1212.5701</source>,
          <year>2012</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>