=Paper= {{Paper |id=Vol-2006/paper005 |storemode=property |title=Tree LSTMs for Learning Sentence Representations |pdfUrl=https://ceur-ws.org/Vol-2006/paper005.pdf |volume=Vol-2006 |authors=Héctor Cerezo Costas,Manuela I. Martín-Vicente,Francisco J. Gonzalez-Castano |dblpUrl=https://dblp.org/rec/conf/clic-it/Cerezo-CostasMG17 }} ==Tree LSTMs for Learning Sentence Representations== https://ceur-ws.org/Vol-2006/paper005.pdf
                Tree LSTMs for Learning Sentence Representations
Héctor Cerezo-Costas, AtlantTic (Universidade de Vigo) and Gradiant, Edificio CITEXVI, local 14, Vigo, Pontevedra 36310, Spain, hcerezo@gradiant.org
Manuela Martín-Vicente, Gradiant, Edificio CITEXVI, local 14, Vigo, Pontevedra 36310, Spain, mmartin@gradiant.org
F.J. González-Castaño, Dept. Enxeñaría Telemática, E.E. de Telecomunicación, Universidade de Vigo, Spain, javier@det.uvigo.es


                    Abstract

English. In this work we obtain sentence embeddings with a recursive model, using dependency graphs as the network structure and training on dictionary definitions. We compare the performance of our recursive Tree-LSTMs against other deep learning models: a recurrent version, which considers a sequential connection between sentence elements, and a bag-of-words model, which does not consider word ordering at all. We compare the approaches on an unsupervised similarity task in which general purpose embeddings should help to distinguish related content.

Italiano. In questo lavoro produciamo sentence embedding con un modello ricorsivo, utilizzando alberi di dipendenze come struttura di rete, addestrandoli su definizioni di dizionario. Confrontiamo le prestazioni dei nostri alberi-LSTM ricorsivi con altri modelli di apprendimento profondo: una rete ricorrente che considera una connessione sequenziale tra le parole della frase, e un modello bag-of-words, che non ne considera l'ordine. La valutazione dei modelli viene effettuata su un task di similarità non supervisionata, in cui embedding di uso generale aiutano a distinguere i contenuti correlati.

1   Introduction

Word embeddings have succeeded in capturing word semantics and projecting this information into a vector space. (Mikolov et al., 2013) proposed two methodologies for learning semantic abstractions of words from large volumes of unlabelled data, Skipgram and CBOW, comprised in the word2vec framework. Another approach is GloVe (Pennington et al., 2014), which learns from statistical co-occurrences of words. The two conceptually similar algorithms employ a sliding window of words, the context, with the intuition that words appearing frequently together are semantically related and thus should be represented close together in Rⁿ. The resulting vectors have shown strong correlation with human annotations in word-analogy tests (Griffiths et al., 2007).
   Despite the success of word embeddings in capturing semantic information, they cannot obtain on their own the composition of longer constructions, which is essential for natural language understanding. Thus, several methods combine word vectors to obtain sentence representations, either with linear mappings (Baroni and Zamparelli, 2010) or with deep neural networks, which make use of multiple network layers to obtain higher levels of abstraction (Socher et al., 2012). One of the first approaches to obtaining generic embeddings was Paragraph2Vec (Le and Mikolov, 2014). Paragraph2Vec can learn unsupervised sentence representations, analogously to word2vec models for word representation, by adding an extra node, indicating the document contribution, to the model.
   Attending to the way the nodes of the network link with each other, two approaches are frequent in NLP: recurrent neural networks and recursive neural networks (RNN)¹. Recurrent models consider sequential links among words, while recursive models use graph-like structures to organise the network operations. They process neighbouring words following the tree order (dependency or syntactic graphs), and compute node representations for each parent recursively from the previous step until they reach the root of the tree, which gives the final sentence abstraction.
   In this work, we train a variant of Tree-LSTM models for learning concept abstractions with dictionary descriptions as input. To the best of our knowledge, this is the first attempt to embed dictionaries using such an approach. Our model takes complex graph-like structures (e.g. syntactic or dependency graphs) as input, as opposed to the most common approaches, which employ recurrent models or unordered distributions of words as the sentence representation. We use an unsupervised similarity benchmark with the intuition that better sentence embeddings will produce more coincidences with human annotations (comparably to the word-analogy task for word embeddings).

   ¹ We use the same classification as in (Li et al., 2015).

2   Related Work

The following recurrent models are capable of obtaining general purpose embeddings of sentences: Skip-thought Vectors and DictRep.
   Skip-thought Vectors (Kiros et al., 2015) learns general semantic sentence abstractions with unsupervised training. The concept is similar to the learning of word embeddings with the skipgram model (Mikolov et al., 2013): Skip-thoughts tries to encode a sentence in such a way that it maximises the probability of recovering the preceding and following sentences in a document.
   DictRep (Hill et al., 2015) trains RNN and BoW models mapping definitions to words with different error functions (cosine similarity and ranking loss). Whilst the RNN models take word order into account, the BoW models are just a weighted combination of the input embeddings. The simplest BoW approach offered competitive results against its RNN counterparts, beating them in most tests (Hill et al., 2016).
   Recurrent models have achieved good performance in different tasks such as polarity detection (e.g. bidirectional LSTMs in (Tai et al., 2015)), machine translation (Cho et al., 2014) or sentence similarity detection (e.g. Skip-thoughts), to name a few.
   Despite being less explored for building general purpose sentence embeddings, tree-structured RNNs represent the current state of the art in several classification tasks. In their seminal paper, (Socher et al., 2013) captured complex interactions among words with tensor operations and graph-like links among network nodes. Recursive Neural Tensor Networks (RNTN) have been used to solve a simplified version of a QA system in (Iyyer et al., 2014).
   In (Bowman, 2013), the authors built a natural language inference system using RNTNs in a simplified scenario with basic sentence constructions. Although the results show that the system is able to learn inference relationships in most cases, it is unclear whether this model could be generalised to more complex sentences. RNTNs were subsequently improved by (Tai et al., 2015), using LSTMs in the network nodes instead of tensors. With tree structures the network can capture language constructions which greatly affect the polarity of sentences (e.g. negation, polarity reversal, etc.).
   A more complete benchmark was conducted by (Li et al., 2015). There, sequential and recursive RNNs were tested on different tasks: sentiment analysis, question-answer matching, discourse parsing and semantic relation extraction. Recursive models excelled in tasks with enough available supervised data, when nodes other than the root are labelled, or when semantic relationships must be extracted from distant words in a sentence.

3   Approach

Learning models that build a dictionary of embeddings have solid advantages over other supervised approaches, since they take advantage of large volumes of data that are already available online. The training data of the system are definition/target-word pairs, which can be built from dictionaries or encyclopedia descriptions (e.g. picking the first sentences of a description as training data). We follow the previous work of (Hill et al., 2015), which employed dictionaries with sequential connections, but we use tree structures instead.
   We used the Tree-LSTM as the starting point to build our system. The input to the system is the set of words forming a definition, together with the structure of the graph encoding the syntactic/dependency relationships, and the word closest to this definition, i.e. the target. Typically, LSTM nodes are intended for strictly sequential information propagation. Our variant is based on the previous work of (Tai et al., 2015).
   The main differences with the original LSTM node are the presence of two forget gates instead of one, and the operation over two previous nodes of the system, which modify node states and inhibitor gates. Hence, sub-indexes 1 and 2 are reserved for the left and right child nodes of the graph, respectively. In this LSTM node there are no peephole connections between memory states and the inhibitor gates.
   The state value at the root node is fed to the last layer of the system. Then, a non-linear transformation is applied to obtain the sentence embedding. In the basic configuration of the model, the error is measured by calculating the cosine similarity between the target and predicted embeddings. The target is the embedding of the word defined by the definition. Pre-trained word embeddings or randomly initialised embeddings might be employed. In the second case, the error is also propagated to the leaf nodes of the graph and thus the word embeddings are updated during training. We did not initialise embeddings randomly because this consistently produced poorer results in comparison with the same model using pre-trained word embeddings.
   In the network configurations of the Tree-LSTM models, we added an extra backward link between the root node and the leaves, reversing the uplink path (as hinted in (Socher et al., 2011; Paulus et al., 2014)). In these settings, the error to minimise is a combination of the target word similarity and the leaf word similarity, modulated by a smoothing parameter.
   We implemented our model with Theano (Theano Development Team, 2016) and trained it with minibatches of size 30 and Adam (Kingma and Ba, 2014) as the optimisation algorithm (with parameters β1 = 0.9, β2 = 0.999 and learning rate l = 0.002). This configuration has achieved state-of-the-art performance in other NLP tasks (Kumar et al., 2015).

4   Experiments

We compared DictRep (BoW and RNN) and our Tree-LSTM variant in a benchmark of unsupervised text similarity tasks and a supervised task (sentiment polarity). These tasks greatly benefit from a good representation of sentences, and building datasets for them requires a lot of human effort.
   DictRep models were trained using the available data and online code. For a fair comparison, all models employed the pre-trained word embeddings and training data provided by (Hill et al., 2015), and cosine similarity as the error metric. The configuration settings were similar for all the models.
   Our model employs two connection configurations: the Tree-LSTM with transformed dependency graphs, and the sequential mapping of connections, which is conceptually similar to the DictRep-RNN model.
   For SkipThoughts we used the code available online (https://github.com/ryankiros/skip-thoughts) and the pre-trained model with a sentence representation of 4800 dimensions. Additionally, we trained a compressed model, with sentence and word representation dimensions of 1200 and 320 respectively, in about three weeks. As with the available model, the 80 million records of the BookCorpus dataset (Zhu et al., 2015) were used during the training process.
   The objective of the semantic similarity benchmark is to measure the similarity between a pair of sentences. The SemEval STS 2014 (Agirre et al., 2014) and SICK (Marelli et al., 2014) datasets were used for the benchmarks. In both datasets, each example carries a gold-standard rank between 0 (totally unrelated sentences) and 5 (completely similar). Furthermore, the SICK dataset considers three different types of semantic relatedness (Neutral, Entailment and Contradiction). We tested the models against the three relations to check whether recursive and recurrent models exhibited different behaviour.
   This is the same dataset used in previous work (Hill et al., 2016), but excluding the WordNet set, since it was used as part of the training.
   For sentiment polarity, we used the Sentiment Penn Treebank dataset² as training/validation data. In this dataset, each sentence node is labelled with a 5-level intensity tag, from 0 (the most negative) to 4. Sentences are already binarised in the same format as our TreeDict approach, so no preprocessing is needed in this task for the tree models. For training and test we used the labels at the root node, which give the overall sentence polarity. For completeness, we repeat the analysis for 3-label annotations over the same dataset. We used the same SVM classifier for all the models, trained with the sentence vectors as input.

5   Results and conclusion

The DictRep BoW model was undeniably better than the recurrent and recursive models, achieving the best position in all cases (Table 1). The TreeDict-Dep model ranked second³.

   ² http://nlp.stanford.edu/sentiment/treebank.html
   ³ The character “-” indicates that some vectors for a sentence could not be obtained (e.g. due to a malformed dependency graph).
 Figure 1: Tree-LSTM schema employed. Dotted blocks and lines depict the optional reverse channel.
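The node in Figure 1 combines the states of two child nodes through separate forget gates. As a rough illustration of such a binary Tree-LSTM node (a minimal pure-Python sketch with hypothetical parameter names, not the authors' exact formulation, which the paper does not spell out):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tree_lstm_node(x, h1, c1, h2, c2, params):
    """One binary Tree-LSTM node combining a left (1) and a right (2) child.

    x        -- input embedding at this node (list of floats)
    h1, c1   -- hidden and memory state of the left child
    h2, c2   -- hidden and memory state of the right child
    params   -- per-gate parameters: params[g] = (Wx, U1, U2, b)
                for the gates 'i', 'f1', 'f2', 'o', 'u' (names invented here)
    """
    def affine(gate):
        Wx, U1, U2, b = params[gate]
        return [
            sum(Wx[j][k] * x[k] for k in range(len(x)))
            + sum(U1[j][k] * h1[k] for k in range(len(h1)))
            + sum(U2[j][k] * h2[k] for k in range(len(h2)))
            + b[j]
            for j in range(len(b))
        ]

    i  = [sigmoid(v) for v in affine('i')]    # input gate
    f1 = [sigmoid(v) for v in affine('f1')]   # forget gate for the left child
    f2 = [sigmoid(v) for v in affine('f2')]   # forget gate for the right child
    o  = [sigmoid(v) for v in affine('o')]    # output gate
    u  = [math.tanh(v) for v in affine('u')]  # candidate update
    # Each child's memory is modulated by its own forget gate:
    c = [i[j] * u[j] + f1[j] * c1[j] + f2[j] * c2[j] for j in range(len(u))]
    h = [o[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c
```

A sentence embedding would then be obtained by applying this node bottom-up over the binarised dependency tree and reading out the root's hidden state.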


   All models capture the correlations with human annotations better in neutral contexts. When there are contradiction and entailment relationships, the agreement with human annotations is less evident. Nevertheless, this behaviour is expected and even desirable, as this is an unsupervised benchmark and the system has no way of learning a similar but conflicting relationship without external help.
   It is clear that BoW models offered the best performance on all the datasets. The Tree-LSTM model, which is consistently better than the sequential models, ranked second. Table 2 shows the correlation among models over the SICK similarity dataset. All the models exhibit strong cross-correlations, but the Tree-LSTM with dependency parsing showed the closest correlation with the BoW and recurrent models.
   Table 3 shows the performance of the models in the supervised polarity tasks. The BoW and SkipThoughts models obtain similar outcomes for the 5- and 3-label tasks. Models trained with dictionary definitions (DictRep and TreeDict) lag behind those models. However, all the networks using dependency structures consistently beat their sequential counterparts. This is a strong indication of the benefits of using this more complex network structure. The differences between the network configurations of the same model are less pronounced than in the similarity tasks, but in our tests the models that used the extra backward link achieved small gains (at least in the 3-label task).
   In previous work, (Hill et al., 2016) compared other models on this same similarity benchmark, achieving comparable results. Not only did the DictRep-BoW models outperform the DictRep-RNNs, but the Skip-thought model, which considers the order of the words in a sentence, was also beaten by FastSent, its counterpart that employs a BoW representation of a sentence.
   The effect of word order is not clear. BoW models are far from ideal, as they cannot capture which parts are negated, nor the dependencies among the different elements of the sentence (e.g. "the black dog chases the white cat" and "the black cat chases the white dog" cannot be differentiated by BoW models alone).
   It is important to mention that the similarity was tested only at the root node when using the Tree-LSTM. Notwithstanding, recursive models allow more elaborate strategies, taking advantage of the dependencies used to build the relationships of the nodes in the deep network. These strategies could combine similarities at different levels of the sentence to obtain a more accurate similarity value (e.g. using a pooling matrix with all the nodes of the parse tree (Socher et al., 2011)).
   The errors during training on held-out data were 0.57 for BoW models versus the 0.51 achieved by recurrent and recursive models. Nevertheless, better dictionary embeddings do not seem to translate directly into better performance at inferring general purpose sentence embeddings in the benchmarks. The test results also show that we need better mechanisms to infer sentence-level representations.
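The word-order limitation discussed above is easy to reproduce. In this toy sketch (word vectors invented purely for the illustration), a bag-of-words average assigns the two example sentences identical embeddings:

```python
# Toy word vectors, made up for the illustration only.
emb = {
    'the':    [0.125, 0.0],
    'black':  [0.0, 1.0],
    'dog':    [1.0, 0.0],
    'chases': [0.5, 0.5],
    'white':  [0.0, -1.0],
    'cat':    [-1.0, 0.0],
}

def bow(sentence):
    """Average the word vectors: word order is discarded entirely."""
    words = sentence.split()
    dim = len(next(iter(emb.values())))
    return [sum(emb[w][k] for w in words) / len(words) for k in range(dim)]

s1 = bow('the black dog chases the white cat')
s2 = bow('the black cat chases the white dog')
# Same multiset of words, hence identical BoW embeddings:
print(s1 == s2)  # True
```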
                                             STS 2014                                          Sick
   Model                  News     Forum     Twitter Images       Headlines    Neu       Ent          Con    All
   DictRep-BoW           .67/.74   .42/.39   .60/.65    .71/.74    .58/.62    .60/.70   .58/.56   .12/.18   .62/.72
   DictRep-RNN           .45/.52   .06/.04   .30/.32    .57/.57    .39/.42    .52/.59   .22/.23   .09/.10   .48/.56
   TreeDict-Seq          .48/.54   .24/.23   .40/.45    .60/.64    .46/.51    .51/.59   .24/.27   .07/.10   .51/.59
   TreeDict-Seq 250      .50/.58   .20/.21   .44/.47    .61/.66    .46/.49    .56/.62   .27/.30   .08/.11   .54/.64
   TreeDict-Seq 250BL    .47/.47   .23/.21   .52/.59    .51/.51    .43/.45    .48/.52   .29/.33   .10/.14   .51/.56
   TreeDict-Dep          .48/.55   .29/.28      -       .61/.67       -       .56/.64   .35/.39   .08/.13   .55/.65
   TreeDict-Dep 250      .50/.56   .31/.30      -       .56/.63       -       .55/.61   .36/.41   .09/.12   .56/.63
   TreeDict-Dep 250BL    .43/.45   .30/.28      -       .56/.58       -       .52/.56   .34/.38   .09/.11   .55/.60
   SkipThoughts-4800     .43/.23   .13/.13   .42/.40    .48/.51    .36/.37    .49/.49   .19/.25   .10/.15   .48/.50
   SkipThoughts-1200     .55/.54   .22/.23      -       .55/.61    .39/.41    .56/.56   .21/.24   .09/.15   .53/.56


Table 1: Performance of the models measured with Spearman/Pearson correlations against gold-standard annotations in the similarity benchmarks.
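The figures in Table 1 are Spearman/Pearson correlations between the models' predicted similarities and the gold annotations. A self-contained sketch of that scoring step (pure Python, with invented scores in place of real model outputs; ties in the Spearman ranks are not averaged here):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson over rank-transformed scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Invented example: gold relatedness (0-5) vs. model cosine similarities.
gold  = [4.8, 1.2, 3.5, 0.4, 2.9]
model = [0.92, 0.35, 0.71, 0.10, 0.64]
print(round(spearman(gold, model), 2), round(pearson(gold, model), 2))  # 1.0 0.99
```

Here the rankings agree perfectly, so Spearman is exactly 1.0 even though the linear (Pearson) fit is slightly weaker.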


 Model      D.BoW       D.RNN       T.Seq      T.Dep
 D.BoW      1.0/1.0     .70/.71    .74/.75    .80/.82
 D.RNN      .70/.71     1.0/1.0    .77/.75    .73/.72
 T.Seq      .74/.75     .77/.75    1.0/1.0    .79/.78
 T.Dep      .80/.82     .73/.72    .78/.78    1.0/1.0

Table 2: Spearman/Pearson correlations among the different models in the SICK dataset.

  Model                           F1-score
                            (5-label) (3-label)
  DictRep-BoW                    .40         .56
  DictRep-RNN                    .32         .49
  TreeDict-Seq                   .31         .49
  TreeDict-Seq 250               .32         .48
  TreeDict-Seq 250BL             .32         .49
  TreeDict-Dep                   .35         .53
  TreeDict-Dep 250               .35         .51
  TreeDict-Dep 250BL             .35         .53
  SkipThoughts-4800              .40         .56
  SkipThoughts-1200              .38         .55

Table 3: Performance of the models in the polarity detection task.

   In this paper we introduced the use of recursive models for the generation of general purpose embeddings, trained by embedding dictionary definitions. We compare recurrent and recursive models on the dictionary embedding task, and we test the validity of these embeddings as general purpose codifications of sentences on both similarity and polarity tasks.
   Results demonstrate slight advantages of the tree recursive variant over the recurrent models that learn from dictionaries, which are more frequently employed. Recursive models are computationally more expensive and have a more complex implementation, but they exhibit better performance on longer sentences. However, with current learning techniques, recurrent and recursive models cannot offer better results than simpler models, such as BoW representations of sentences, in unsupervised similarity benchmarks. These findings should be confirmed in the future in more complex scenarios, such as large-scale QA.

Acknowledgments

This work has been funded by the Spanish Ministerio de Economía y Competitividad through the project INRISCO (TEC2014-54335-C4-4-R).

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics.

Samuel R. Bowman. 2013. Can recursive neural tensor networks learn logical reasoning? arXiv:1312.6192.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.

Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in Semantic Representation. Psychological Review, 114(2):211.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2015. Learning to Understand Phrases by Embedding the Dictionary. Transactions of the Association for Computational Linguistics.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data. arXiv:1602.03483.

Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In EMNLP, pages 633–644.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv preprint arXiv:1506.07285.

Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, volume 14, pages 1188–1196.

Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When Are Tree Structures Necessary for Deep Learning of Representations? arXiv:1503.00185.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Romain Paulus, Richard Socher, and Christopher D. Manning. 2014. Global Belief Recursive Neural Networks. In Advances in Neural Information Processing Systems, pages 2888–2896.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543.

Sent2Vec encoder and training code from the paper "Skip-Thought Vectors". https://github.com/ryankiros/skip-thoughts. Accessed: 2017-07-07.

Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information Processing Systems, pages 801–809.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. ACL.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv preprint arXiv:1506.06724.