<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tree LSTMs for Learning Sentence Representations</article-title>
      </title-group>
      <abstract>
        <p>English. In this work we obtain sentence embeddings with a recursive model using dependency graphs as network structure, trained with dictionary definitions. We compare the performance of our recursive Tree-LSTMs against other deep learning models: a recurrent version which considers a sequential connection between sentence elements, and a bag of words model which does not consider word ordering at all. We compare the approaches in an unsupervised similarity task in which general purpose embeddings should help to distinguish related content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italian. In this work we produce sentence embeddings with a recursive model, using dependency trees as the network structure and training on dictionary definitions. We compare the performance of our recursive Tree-LSTMs with other deep learning models: a recurrent network that considers a sequential connection between the words of a sentence, and a bag-of-words model, which does not consider word order. The models are evaluated on an unsupervised similarity task in which general purpose embeddings help to distinguish related content.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        Word embeddings have succeeded in obtaining
word semantics and projecting this information
in a vector space.
        <xref ref-type="bibr" rid="ref15">(Mikolov et al., 2013)</xref>
        proposed two methodologies for learning semantic
abstractions of words from large volumes of
unlabelled data, Skipgram and CBOW, which together make up the word2vec framework. Another approach
is GloVe
        <xref ref-type="bibr" rid="ref17">(Pennington et al., 2014)</xref>
        , which learns
from statistical co-occurrences of words. The
two conceptually similar algorithms employ a
sliding window of words, the context, with the
intuition that words appearing frequently together
are semantically related and thus should be represented closer together in R^n. The resulting vectors have
shown strong correlation with human annotations
in word-analogy tests (Griffiths et al., 2007).
      </p>
      <p>
        Despite the success of word embeddings in capturing semantic information, they cannot, on their own, capture the composition of longer constructions, which is essential for natural language understanding. Thus, several methods combine word vectors to obtain sentence representations, using linear mappings
        <xref ref-type="bibr" rid="ref2">(Baroni and Zamparelli, 2010)</xref>
        or deep neural networks, which make use of multiple network layers to obtain higher levels of abstraction
        <xref ref-type="bibr" rid="ref20">(Socher et al., 2012)</xref>
        . One of the first approaches for obtaining generic embeddings was Paragraph2Vec
        <xref ref-type="bibr" rid="ref12">(Le and Mikolov, 2014)</xref>
        . Paragraph2Vec can learn unsupervised sentence representations, analogous to word2vec models for word representation, by adding an extra node, indicating the document contribution, to the model.
      </p>
      <p>Depending on how the nodes of the network are linked to each other, two approaches are common in NLP: recurrent neural networks and recursive neural networks (RNN); we use the same classification as in <xref ref-type="bibr" rid="ref13">(Li et al., 2015)</xref>. Recurrent models consider sequential links among words, while recursive models use graph-like structures for organizing the network operations. They process neighbouring words following the tree order (dependency or syntactic graphs), and recursively compute a representation for each parent node from its children until they reach the root of the tree, which gives the final sentence abstraction.</p>
      <p>In this work, we train a variant of Tree-LSTM models for learning concept abstractions with dictionary descriptions as input. To the best of our knowledge, this is the first attempt to embed dictionaries using such an approach. Our model takes complex graph-like structures (e.g. syntactic or dependency graphs) as input, as opposed to the most common approaches, which employ recurrent models or unordered distributions of words as the sentence representation. We use an unsupervised similarity benchmark with the intuition that better sentence embeddings will produce stronger agreement with human annotations (comparably to the word analogy task for word embeddings).</p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>The following recurrent models are capable of
obtaining general purpose embeddings of sentences:
Skip-thought Vectors and DictRep.</p>
      <p>
        Skip-thought Vectors
        <xref ref-type="bibr" rid="ref10 ref24">(Kiros et al., 2015)</xref>
        learns
general semantic sentence abstractions with
unsupervised training. This concept is similar to the
learning of word embeddings with the skipgram
model
        <xref ref-type="bibr" rid="ref15">(Mikolov et al., 2013)</xref>
        . Skip-thoughts tries to encode a sentence in such a way that it maximises the probability of recovering the preceding and following sentences in a document.
      </p>
      <p>
        DictRep
        <xref ref-type="bibr" rid="ref5">(Hill et al., 2015)</xref>
        trains RNN and BoW models that map definitions to words using different error functions (cosine similarity and ranking loss). Whilst the RNN models take word ordering into account, the BoW
models are just a weighted combination of the input
embeddings. The simplest BoW approach offered
competitive results against its RNN counterparts,
beating them in most tests
        <xref ref-type="bibr" rid="ref6">(Hill et al., 2016)</xref>
        .
      </p>
      <p>
        Recurrent models have achieved good results in different tasks such as polarity
detection (e.g. bidirectional LSTMs in
        <xref ref-type="bibr" rid="ref22">(Tai et al.,
2015)</xref>
        ), machine translation
        <xref ref-type="bibr" rid="ref4">(Cho et al., 2014)</xref>
        or
sentence similarity detection (e.g. Skip-thoughts),
just to name a few.
      </p>
      <p>
        Although less explored for building general purpose sentence embeddings, tree-structured RNNs represent the current state of the art in several classification tasks. In their seminal paper,
        <xref ref-type="bibr" rid="ref21">(Socher et al., 2013)</xref>
        captured complex
interactions among words with tensor operations and
graph-like links among network nodes. Recursive Neural Tensor Networks (RNTNs) have been used to solve a simplified version of a QA
system in
        <xref ref-type="bibr" rid="ref7">(Iyyer et al., 2014)</xref>
        .
      </p>
      <p>
        In
        <xref ref-type="bibr" rid="ref3">(Bowman, 2013)</xref>
        , the authors built a natural
language inference system using RNTN in a
simplified scenario with basic sentence constructions.
Although the results show that the system is able
to learn inference relationships in most cases, it is
unclear if this model could be generalised for more
complex sentences. RNTNs were subsequently
improved by
        <xref ref-type="bibr" rid="ref22">(Tai et al., 2015)</xref>
        , using LSTMs in
the network nodes instead of tensors. With tree structures the network can capture language
constructions which greatly affect the polarity of
sentences (e.g. negation, polarity reversal, etc.).
      </p>
      <p>
        A more complete benchmark was conducted by
        <xref ref-type="bibr" rid="ref13">(Li et al., 2015)</xref>
        . There, sequential and
recursive RNNs were tested in different tasks:
sentiment analysis, question-answer matching,
discourse parsing and semantic relation extraction.
Recursive models excelled in tasks with enough
available supervised data, when nodes different
from the root are labelled, or when semantic
relationships must be extracted from distant words
in a sentence.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 Approach</title>
      <p>
        Learning models that build a dictionary of
embeddings have solid advantages over other supervised
approaches, since they take advantage of large
volumes of data that are already available online. The
training data of the system are definition/target-word pairs, which can be built from dictionaries or encyclopedia descriptions (e.g. picking the first sentences of a description as training data).
We follow the previous work of
        <xref ref-type="bibr" rid="ref5">(Hill et al., 2015)</xref>
        , which employed dictionaries with sequential connections, but use tree structures instead.
      </p>
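      <p>For illustration, a training pair in this setup couples a definition with the word it defines; the examples below are our own hypothetical pairs, not items from the actual training data.</p>
      <preformat>
# Hypothetical definition/target pairs in the style described above; the real
# training data come from dictionary and encyclopedia entries.
training_pairs = [
    ("a domesticated carnivorous mammal that barks", "dog"),
    ("a large natural stream of water flowing to the sea", "river"),
]
</preformat>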
      <p>
        We used the Tree-LSTM as the starting point
to build our system. The inputs to the system are the words forming a definition, together with the graph structure encoding the syntactic/dependency relationships, and the word that the definition describes, i.e. the target. Typically, LSTM nodes are intended for strictly sequential information propagation. Our variant is based on the previous work of
        <xref ref-type="bibr" rid="ref22">(Tai et al., 2015)</xref>
        .
      </p>
      <p>The main differences with respect to the original LSTM node are the presence of two forget gates instead of one, and the operation over two previous (child) nodes of the system, whose states modify the node state and the gates. Hence, sub-indexes 1 and 2 are reserved for the left and right child nodes of the graph, respectively. In this LSTM node there are no peephole connections between the memory states and the gates.</p>
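      <p>As a minimal sketch, the following code illustrates a binary Tree-LSTM cell with two forget gates in the spirit of <xref ref-type="bibr" rid="ref22">(Tai et al., 2015)</xref>; parameter names, initialisation and dimensions are our own illustrative choices, not the exact formulation used in our system.</p>
      <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BinaryTreeLSTMCell:
    """Binary Tree-LSTM cell with two forget gates (one per child),
    in the spirit of Tai et al. (2015). Names and shapes are illustrative."""

    GATES = ("i", "f1", "f2", "o", "u")

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d, h = input_dim, hidden_dim
        self.W = {g: rng.normal(0, 0.1, (h, d)) for g in self.GATES}   # input weights
        self.U1 = {g: rng.normal(0, 0.1, (h, h)) for g in self.GATES}  # left-child weights
        self.U2 = {g: rng.normal(0, 0.1, (h, h)) for g in self.GATES}  # right-child weights
        self.b = {g: np.zeros(h) for g in self.GATES}

    def _pre(self, g, x, h1, h2):
        # Pre-activation for gate g: input plus both children's hidden states.
        return self.W[g] @ x + self.U1[g] @ h1 + self.U2[g] @ h2 + self.b[g]

    def forward(self, x, left, right):
        """x: embedding at this node; left/right: (h, c) pairs of the child nodes."""
        (h1, c1), (h2, c2) = left, right
        i = sigmoid(self._pre("i", x, h1, h2))     # input gate
        f1 = sigmoid(self._pre("f1", x, h1, h2))   # forget gate over the left child state
        f2 = sigmoid(self._pre("f2", x, h1, h2))   # forget gate over the right child state
        o = sigmoid(self._pre("o", x, h1, h2))     # output gate
        u = np.tanh(self._pre("u", x, h1, h2))     # candidate memory update
        c = i * u + f1 * c1 + f2 * c2              # both children feed the memory state
        h = o * np.tanh(c)                         # no peephole connections, as in the text
        return h, c
</preformat>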
      <p>The state value in the root node is fed to the last
layer of the system. Then, a non-linear
transformation is applied to obtain the sentence
embedding. In the basic configuration of the model, the
error is measured by calculating the cosine
similarity between target and predicted embeddings.
The target is the embedding of the word being defined. Pre-trained word embeddings or randomly initialised embeddings may be employed. In the second case, the error is also propagated to the leaf nodes of the graph and thus the word embeddings are updated during training. We did not initialise embeddings randomly because this consistently produced poorer results in comparison with the same model using pre-trained word embeddings.</p>
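      <p>A minimal sketch of this output stage, assuming a tanh projection of the root state and a one-minus-cosine-similarity error (the function names and the choice of tanh are ours, for illustration only):</p>
      <preformat>
import numpy as np

def sentence_embedding(root_h, W_out, b_out):
    """Non-linear projection of the root node's hidden state into the
    sentence-embedding space (tanh chosen here for illustration)."""
    return np.tanh(W_out @ root_h + b_out)

def cosine_loss(predicted, target, eps=1e-8):
    """1 minus cosine similarity between the predicted sentence embedding and
    the embedding of the defined (target) word, as described above."""
    num = predicted @ target
    den = np.linalg.norm(predicted) * np.linalg.norm(target) + eps
    return 1.0 - num / den
</preformat>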
      <p>
        In the network configurations of the tree-LSTM
models, we added an extra backward link between
the root node and the leaves reversing the uplink
path (as hinted in
        <xref ref-type="bibr" rid="ref16 ref19">(Socher et al., 2011; Paulus et
al., 2014)</xref>
        ). In these settings, the error to minimise is a combination of the target-word similarity and the leaf-word similarities, modulated by a smoothing parameter.
      </p>
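      <p>A plausible form of this combined objective (our notation; the exact formula is not given above) uses a smoothing parameter λ to weight the target-word term against the average leaf-word term:</p>
      <disp-formula>
        <tex-math>E = \lambda \, \big(1 - \cos(\hat{s}, w)\big) + (1 - \lambda) \, \frac{1}{N} \sum_{i=1}^{N} \big(1 - \cos(\hat{x}_i, x_i)\big)</tex-math>
      </disp-formula>
      <p>Here ŝ is the predicted sentence embedding, w the target word embedding, and x̂_i the reconstruction of the leaf word embedding x_i produced through the backward link.</p>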
      <p>
        We implemented our model with Theano
        <xref ref-type="bibr" rid="ref23">(Theano Development Team, 2016)</xref>
        and trained
it with minibatches of size 30 and Adam
        <xref ref-type="bibr" rid="ref8">(Kingma and Ba, 2014)</xref>
        as the optimisation algorithm (with parameters β1 = 0.9, β2 = 0.999 and learning rate l = 0.002). This configuration has achieved state of the art performance in other NLP tasks
        <xref ref-type="bibr" rid="ref11">(Kumar et al., 2015)</xref>
        .
      </p>
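      <p>For reference, one Adam update with the hyperparameters reported above looks as follows; this is a generic sketch, not our Theano implementation.</p>
      <preformat>
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba, 2014) with the hyperparameters reported
    above; a reference implementation for illustration only."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
</preformat>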
    </sec>
    <sec id="sec-5">
      <title>4 Experiments</title>
      <p>We compared DictRep (BoW and RNN) and our
Tree-LSTM variant in a benchmark of
unsupervised text similarity tasks and a supervised task
(sentiment polarity). These tasks greatly benefit from a good representation of sentences, and building supervised datasets for them requires considerable human effort.</p>
      <p>
        DictRep models were trained using available
data and online code. For a fair comparison, all
models employed the pre-trained word
embeddings and training data provided by
        <xref ref-type="bibr" rid="ref5">(Hill et al.,
2015)</xref>
        and cosine similarity as the error metric. The configuration settings were similar for all the models.
      </p>
      <p>Our model employs two connection configurations: the Tree-LSTM with transformed dependency graphs, and a sequential mapping of connections, which is conceptually similar to the DictRep-RNN model.</p>
      <p>
        For SkipThoughts we used the code available online
        <xref ref-type="bibr" rid="ref18">(skip-thoughts code)</xref>
        and the pre-trained model with a sentence representation of 4800 dimensions. Additionally, we trained a compressed model, with sentence and word representation dimensions of 1200 and 320 respectively, in about three weeks. As with the available model, the 80 million records of the BookCorpus dataset
        <xref ref-type="bibr" rid="ref24">(Zhu et al., 2015)</xref>
        were used during the training process.
      </p>
      <p>
        The objective of the semantic similarity task
benchmark is to measure the similarity between a
pair of sentences. SemEval STS 2014
        <xref ref-type="bibr" rid="ref1">(Agirre et
al., 2014)</xref>
        and SICK
        <xref ref-type="bibr" rid="ref14">(Marelli et al., 2014)</xref>
        datasets
were used for benchmarks. In both datasets, each
example was gold-standard ranked between 0
(totally unrelated sentences) and 5 (completely
similar). Furthermore, the SICK dataset considers three different types of semantic relatedness (Neutral, Entailment and Contradiction). We tested the
models against the three relations to check if
recursive and recurrent models exhibited different
behaviour.
      </p>
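      <p>The evaluation itself can be sketched as follows: each model scores a sentence pair with the cosine similarity of its two embeddings, and the resulting scores are correlated with the 0-5 gold annotations. The use of Spearman correlation and the helper names below are our own assumptions for illustration.</p>
      <preformat>
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embed, sentence_pairs, gold_scores):
    """Score a sentence encoder on an STS/SICK-style relatedness set:
    cosine similarity per pair, then rank correlation against the gold
    0-5 scores. Spearman correlation is an assumption; the text only
    speaks of agreement with human annotations."""
    predicted = []
    for s1, s2 in sentence_pairs:
        v1, v2 = embed(s1), embed(s2)
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        predicted.append(cos)
    rho, _ = spearmanr(predicted, gold_scores)
    return rho
</preformat>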
      <p>
        This is the same dataset used in previous work
        <xref ref-type="bibr" rid="ref6">(Hill et al., 2016)</xref>
        but excluding the WordNet set,
since it was used as part of the training.
      </p>
      <p>For the sentiment polarity task, we used the Sentiment Penn Treebank dataset (http://nlp.stanford.edu/sentiment/treebank.html) as training/validation data. In this dataset, each sentence node is labelled with an intensity label on a 5-point scale from 0, the most negative, to 4. Sentences are already binarised in the same format as our TreeDict approach, so no preprocessing is needed in this task for the tree models. We used for training and testing the labels at the root node, which give the overall sentence polarity. For completeness, we repeat the analysis for a 3-label annotation over the same dataset. We used the same SVM classifier for all the models and trained it with the sentence vectors as input.</p>
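      <p>A minimal sketch of this supervised setup, assuming scikit-learn's SVC with a linear kernel (the text only states that the same SVM classifier was used for all models):</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

def polarity_classifier(train_vectors, train_labels, test_vectors):
    """Train an SVM on fixed sentence vectors and predict root-node polarity
    labels (5-way or 3-way). Kernel and hyperparameters are illustrative."""
    clf = SVC(kernel="linear")
    clf.fit(np.asarray(train_vectors), np.asarray(train_labels))
    return clf.predict(np.asarray(test_vectors))
</preformat>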
    </sec>
    <sec id="sec-6">
      <title>5 Results and conclusion</title>
      <p>The DictRep BoW model was undeniably better than the recurrent and recursive models, achieving the best position in all cases (Table 1). The TreeDict-Dep model ranked second (the character “-” indicates that some vectors for a sentence could not be obtained, e.g. due to a malformed dependency graph).</p>
      <p>All models capture the correlations with human
annotations better in neutral contexts. If there are
contradictions and entailment relationships, the
agreement with human annotations is less evident.
Nevertheless, this behaviour is expected and also
desirable, as this is an unsupervised benchmark
and the system has no way of learning a similar
but conflicting relationship without external help.</p>
      <p>It is clear that BoW models offered the best
performance in all the datasets. The Tree-LSTM
model, which is consistently better than the
sequential models, ranked second. Table 2 shows
the correlation among models over the SICK
similarity dataset. All the models exhibit strong cross-correlations with each other, but the Tree-LSTM with dependency parsing showed the
closest correlation with the BoW and recurrent
models.</p>
      <p>Table 3 shows the performance of the models in the supervised polarity tasks. BoW and SkipThoughts models achieve similar outcomes in the 5-label and 3-label tasks. Models trained with dictionary definitions (DictRep and TreeDict) lag behind those models. However, all the networks using dependency structures have consistently beaten their sequential counterparts. This is a strong indication of the benefits of using this more complex network structure. The differences between the network configurations of the same model are less pronounced than in the similarity tasks, but in our tests the models that used the extra backward link achieved small gains (at least in the 3-label task).</p>
      <p>
        In previous work,
        <xref ref-type="bibr" rid="ref6">(Hill et al., 2016)</xref>
        compared
other models in this same similarity benchmark
achieving comparable results. Not only did DictRep-BoW models outperform the DictRep-RNNs, but the Skip-thought model, which considers the order of the words in a sentence, was also beaten by FastSent, its counterpart that employs a BoW representation of a sentence.
      </p>
      <p>The effect of word orderings is not clear. BoW models are far from ideal, as they cannot capture which parts are negated or the dependencies among the different elements of the sentence (e.g. “the black dog chases the white cat” and “the black cat chases the white dog” cannot be differentiated using BoW models alone).</p>
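      <p>The example above can be checked directly: the two sentences contain exactly the same multiset of words, so any order-insensitive BoW representation (e.g. an average of word embeddings) assigns them identical vectors.</p>
      <preformat>
from collections import Counter

# Two sentences with identical word multisets: a BoW representation
# cannot tell them apart, whatever the embedding of each word is.
s1 = "the black dog chases the white cat"
s2 = "the black cat chases the white dog"
print(Counter(s1.split()) == Counter(s2.split()))  # True
</preformat>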
      <p>
        It is important to mention that the similarity
was tested only at the root node when using
Tree-LSTMs. Notwithstanding, recursive models allow the use of more elaborate strategies, taking advantage of the dependencies used to build the relationships of the nodes in the deep network. These strategies could combine similarities at different levels of the sentence to obtain a more accurate similarity value (e.g. using a pooling matrix over all the nodes of the parse tree
        <xref ref-type="bibr" rid="ref19">(Socher et
al., 2011)</xref>
        ).
      </p>
      <p>The errors during training on held-out data were 0.57 for BoW models versus the 0.51 achieved by recurrent and recursive models.
Nevertheless, better dictionary embeddings do not
seem to directly translate into better performance
at inferring general purpose sentence embeddings
in the benchmarks. Results in the test also show
that we need better mechanisms to infer sentence
level representations.</p>
      <p>In this paper we introduced the use of recursive models for the generation of general purpose embeddings, trained by embedding dictionary definitions. We compared recurrent and recursive models in the dictionary embedding task, and we tested the validity of these embeddings as a general purpose codification of sentences with both similarity benchmarks.</p>
      <p>Results demonstrate slight advantages of the Tree recursive variant over the recurrent models that learn from dictionaries, which are more frequently employed. Recursive models are computationally more expensive and have a more complex implementation, but they exhibit better performance on longer sentences. However, with current learning techniques recurrent and recursive models cannot offer better results than simpler models such as BoW representations of sentences in unsupervised similarity benchmarks. These findings should be confirmed in the future in more complex scenarios, such as large-scale QA.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been funded by the Spanish Ministerio de Economía y Competitividad through the project INRISCO (TEC2014-54335-C4-4-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          , Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and
          <string-name>
            <given-names>Janyce</given-names>
            <surname>Wiebe</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Semeval-2014 task 10: Multilingual Semantic Textual Similarity</article-title>
          .
          <source>In Proceedings of the 8th international workshop on semantic evaluation (SemEval</source>
          <year>2014</year>
          ), pages
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Zamparelli</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space</article-title>
          .
          <source>In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1183</fpage>
          -
          <lpage>1193</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Samuel R Bowman</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Can recursive neural tensor networks learn logical reasoning?</article-title>
          .
          <source>arXiv preprint arXiv:1312.6192</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merrie¨nboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning Phrase Representations using RNN EncoderThomas L Griffiths</article-title>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Steyvers</surname>
          </string-name>
          , and Joshua B Tenenbaum.
          <year>2007</year>
          .
          <article-title>Topics in Semantic Representation</article-title>
          .
          <source>Psychological review</source>
          ,
          <volume>114</volume>
          (
          <issue>2</issue>
          ):
          <fpage>211</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Felix</given-names>
            <surname>Hill</surname>
          </string-name>
          , Kyunghyun Cho, Anna Korhonen, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning to Understand Phrases by Embedding the Dictionary</article-title>
          .
          <article-title>Transactions of the Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Felix</given-names>
            <surname>Hill</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning Distributed Representations of Sentences from Unlabelled Data</article-title>
          . arXiv:
          <volume>1602</volume>
          .
          <fpage>03483</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Mohit</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jordan L Boyd-Graber</surname>
          </string-name>
          , Leonardo Max Batista Claudino, Richard Socher, and Hal Daume´ III.
          <year>2014</year>
          .
          <article-title>A Neural Network for Factoid Question Answering over Paragraphs</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <fpage>633</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Diederik</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <source>A Method for Stochastic arXiv:1412</source>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>2014. Adam: Optimization.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Kiros</surname>
          </string-name>
          , Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Skip-Thought Vectors</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3294</fpage>
          -
          <lpage>3302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Ankit</given-names>
            <surname>Kumar</surname>
          </string-name>
          , Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Ondruska</surname>
          </string-name>
          , Ishaan Gulrajani, and Richard Socher.
          <year>2015</year>
          .
          <article-title>Ask Me Anything: Dynamic Memory Networks for Natural Language Processing</article-title>
          .
          <source>arXiv preprint arXiv:1506</source>
          .
          <fpage>07285</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Quoc</surname>
            <given-names>V</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
            and
            <given-names>Tomas</given-names>
          </string-name>
          <string-name>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          . In ICML, volume
          <volume>14</volume>
          , pages
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Minh-Thang</surname>
            <given-names>Luong</given-names>
          </string-name>
          , Dan Jurafsky, and
          <string-name>
            <given-names>Eudard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>When Are Tree Structures Necessary for Deep Learning of Representations</article-title>
          ? arXiv:
          <fpage>1503</fpage>
          .
          <fpage>00185</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Marelli</surname>
          </string-name>
          , Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Zamparelli</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A SICK cure for the evaluation of compositional distributional semantic models</article-title>
          .
          <source>In LREC</source>
          , pages
          <fpage>216</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>arXiv preprint arXiv:1301</source>
          .
          <fpage>3781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Romain</surname>
            <given-names>Paulus</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Global belief recursive neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2888</fpage>
          -
          <lpage>2896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global Vectors for Word Representation</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>Sent2Vec encoder and training code from the paper “Skip-Thought Vectors”</article-title>
          . https://github. com/ryankiros/skip-thoughts.
          <source>Accessed: 2017-07-07.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          , Eric H Huang, Jeffrey Pennin,
          <string-name>
            <surname>Christopher D Manning</surname>
          </string-name>
          , and Andrew Y Ng.
          <year>2011</year>
          .
          <article-title>Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>801</fpage>
          -
          <lpage>809</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          , Brody Huval,
          <string-name>
            <surname>Christopher D Manning</surname>
          </string-name>
          , and Andrew Y Ng.
          <year>2012</year>
          .
          <article-title>Semantic Compositionality through Recursive Matrix-vector Spaces</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          , pages
          <fpage>1201</fpage>
          -
          <lpage>1211</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          , Alex Perelygin, Jean Y Wu, Jason Chuang,
          <string-name>
            <surname>Christopher D Manning</surname>
            , Andrew Y Ng, and
            <given-names>Christopher</given-names>
          </string-name>
          <string-name>
            <surname>Potts</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank</article-title>
          .
          <source>In Proceedings of the conference on empirical methods in natural language processing (EMNLP)</source>
          , volume
          <volume>1631</volume>
          , page 1642. Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Kai</given-names>
            <surname>Sheng</surname>
          </string-name>
          <string-name>
            <surname>Tai</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Improved Semantic Representations from Tree-structured Long Short-term Memory Networks</article-title>
          . ACL.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Theano</given-names>
            <surname>Development Team</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Theano: A Python framework for fast computation of mathematical expressions</article-title>
          . arXiv e-prints,
          <source>abs/1605</source>
          .02688, May.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Yukun</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books</article-title>
          . In arXiv preprint arXiv:
          <volume>1506</volume>
          .
          <fpage>06724</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>