=Paper=
{{Paper
|id=Vol-2006/paper005
|storemode=property
|title=Tree LSTMs for Learning Sentence Representations
|pdfUrl=https://ceur-ws.org/Vol-2006/paper005.pdf
|volume=Vol-2006
|authors=Héctor Cerezo Costas,Manuela I. Martín-Vicente,Francisco J. Gonzalez-Castano
|dblpUrl=https://dblp.org/rec/conf/clic-it/Cerezo-CostasMG17
}}
==Tree LSTMs for Learning Sentence Representations==
Héctor Cerezo-Costas
AtlantTic, Gradiant, Universidade de Vigo, Edificio CITEXVI, local 14, Vigo, Pontevedra 36310, Spain
hcerezo@gradiant.org

Manuela Martín-Vicente
Gradiant, Edificio CITEXVI, local 14, Vigo, Pontevedra 36310, Spain
mmartin@gradiant.org

F.J. González-Castaño
Dept. Enxeñaría Telemática, E.E. de Telecomunicación, Universidade de Vigo, Spain
javier@det.uvigo.es
Abstract

English. In this work we obtain sentence embeddings with a recursive model using dependency graphs as the network structure, trained on dictionary definitions. We compare the performance of our recursive Tree-LSTMs against other deep learning models: a recurrent version, which considers a sequential connection between sentence elements, and a bag-of-words model, which does not consider word ordering at all. We compare the approaches on an unsupervised similarity task in which general-purpose embeddings should help to distinguish related content.

Italiano. In this work we produce sentence embeddings with a recursive model, using dependency trees as the network structure and training them on dictionary definitions. We compare the performance of our recursive Tree-LSTMs with other deep learning models: a recurrent network that considers a sequential connection between the words of the sentence, and a bag-of-words model, which does not consider their order. The models are evaluated on an unsupervised similarity task, in which general-purpose embeddings help to distinguish related content.

1 Introduction

Word embeddings have succeeded in capturing word semantics and projecting this information into a vector space. (Mikolov et al., 2013) proposed two methodologies for learning semantic abstractions of words from large volumes of unlabelled data, Skipgram and CBOW, comprised in the word2vec framework. Another approach is GloVe (Pennington et al., 2014), which learns from statistical co-occurrences of words. The two conceptually similar algorithms employ a sliding window of words, the context, with the intuition that words appearing frequently together are semantically related and should thus be represented closer in R^n. The resulting vectors have shown strong correlation with human annotations in word-analogy tests (Griffiths et al., 2007).

Despite the success of word embeddings in capturing semantic information, they cannot, on their own, capture the composition of longer constructions, which is essential for natural language understanding. Thus, several methods combine word vectors to obtain sentence representations, using linear mappings (Baroni and Zamparelli, 2010) or deep neural networks, which make use of multiple network layers to obtain higher levels of abstraction (Socher et al., 2012). One of the first approaches to obtaining generic embeddings was Paragraph2Vec (Le and Mikolov, 2014). Paragraph2Vec can learn unsupervised sentence representations, analogous to word2vec models for word representation, by adding an extra node, indicating the document contribution, to the model.

Attending to the way the nodes of the network are linked with each other, two approaches are frequent in NLP: recurrent neural networks and recursive neural networks (RNN) (we use the same classification as in (Li et al., 2015)). Recurrent models consider sequential links among words, while recursive models use graph-like structures to organise the network operations. They process neighbouring words following the tree order (of dependency or syntactic graphs) and compute node representations for each parent recursively from the previous step until they reach the root of the tree, which gives the final sentence abstraction.

In this work, we train a variant of Tree-LSTM models for learning concept abstractions with dictionary descriptions as input. To the best of our knowledge, this is the first attempt to embed dictionaries using such an approach. Our model takes complex graph-like structures (e.g. syntactic or dependency graphs), as opposed to the most common approaches, which employ recurrent models or unordered distributions of words as the representation of sentences. We use an unsupervised similarity benchmark with the intuition that better sentence embeddings will produce more coincidences with human annotations (comparably to the word-analogy task for word embeddings).
2 Related Work

The following recurrent models are capable of obtaining general-purpose embeddings of sentences: Skip-thought Vectors and DictRep.

Skip-thought Vectors (Kiros et al., 2015) learn general semantic sentence abstractions with unsupervised training. The concept is similar to the learning of word embeddings with the skipgram model (Mikolov et al., 2013): Skip-thoughts tries to encode a sentence in such a way that it maximises the probability of recovering the preceding and following sentences in a document.

DictRep (Hill et al., 2015) trains RNN and BoW models that map definitions to words with different error functions (cosine similarity and ranking loss). Whilst the RNN models take word order into account, the BoW models are just a weighted combination of the input embeddings. The simplest BoW approach offered competitive results against its RNN counterparts, beating them in most tests (Hill et al., 2016).

Recurrent models have achieved good performance in different tasks such as polarity detection (e.g. bidirectional LSTMs in (Tai et al., 2015)), machine translation (Cho et al., 2014) or sentence similarity detection (e.g. Skip-thoughts), just to name a few.

Despite being less explored for building general-purpose sentence embeddings, tree-structured RNNs represent the current state of the art in several classification tasks. In their seminal paper, (Socher et al., 2013) captured complex interactions among words with tensor operations and graph-like links among network nodes. Recursive Neural Tensor Networks (RNTN) have been used to solve a simplified version of a QA system in (Iyyer et al., 2014). In (Bowman, 2013), the authors built a natural language inference system using RNTNs in a simplified scenario with basic sentence constructions. Although the results show that the system is able to learn inference relationships in most cases, it is unclear whether this model could be generalised to more complex sentences. RNTNs were subsequently improved by (Tai et al., 2015), using LSTMs in the network nodes instead of tensors. With tree structures, the network can capture language constructions that greatly affect the polarity of sentences (e.g. negation, polarity reversal, etc.).

A more complete benchmark was conducted by (Li et al., 2015). There, sequential and recursive RNNs were tested in different tasks: sentiment analysis, question-answer matching, discourse parsing and semantic relation extraction. Recursive models excelled in tasks with enough available supervised data, when nodes other than the root are labelled, or when semantic relationships must be extracted from distant words in a sentence.

3 Approach

Learning models that build a dictionary of embeddings have solid advantages over other supervised approaches, since they take advantage of large volumes of data that are already available online. The training data of the system are definition/target-word pairs, which can be built from dictionaries or encyclopedia descriptions (e.g. picking the first sentences of a description as training data). We follow the previous work of (Hill et al., 2015), which embedded dictionaries with sequential connections, but we use tree structures instead.
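The preprocessing of this training data is not detailed in the text, so the following is only a minimal sketch of how (definition, target word) pairs could be assembled from a dictionary or encyclopedia dump; the (word, description) input format and the first-sentence heuristic are assumptions for illustration.

<syntaxhighlight lang="python">
def build_training_pairs(entries, max_sentences=1):
    """entries: iterable of (word, description) tuples from a dictionary or
    encyclopedia dump (hypothetical format); returns (definition, target word) pairs."""
    pairs = []
    for word, description in entries:
        # Keep only the first sentence(s) of each description, as suggested above for encyclopedias.
        definition = ". ".join(description.split(". ")[:max_sentences]).strip()
        if definition:
            pairs.append((definition, word))
    return pairs

# e.g. build_training_pairs([("dog", "A domesticated carnivorous mammal. It barks.")])
# -> [("A domesticated carnivorous mammal", "dog")]
</syntaxhighlight>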
We used the Tree-LSTM as the starting point to build our system. The input to the system is the set of words conforming a definition, together with the structure of the graph encoding the syntactic/dependency relationships, and the word that corresponds to the definition, i.e. the target. Typically, LSTM nodes are intended for strictly sequential information propagation; our variant is based on the previous work of (Tai et al., 2015).

The main differences with respect to the original LSTM node are the presence of two forget gates instead of one, and the operation over two previous nodes of the system, which modify the node states and inhibitor gates. Hence, sub-indexes 1 and 2 are reserved for the left and right child nodes of the graph, respectively. In this LSTM node there are no peephole connections between the memory states and the inhibitor gates.
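No equations are given for this node, so the sketch below is only a plausible reading of the description, in the style of the binary Tree-LSTM of (Tai et al., 2015): one input gate, one forget gate per child, an output gate, and no peephole connections. Packing all gate pre-activations into a single affine map is an implementation convenience, not something stated in the text.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(x, left, right, params):
    """One binary Tree-LSTM node.

    x            -- input word embedding at this node (may be a zero vector for internal nodes)
    left, right  -- (h, c) hidden state and memory cell of the left and right children
    params       -- dict with matrices W (5d x dx), U1, U2 (5d x d) and bias b (5d,)
    """
    h1, c1 = left
    h2, c2 = right
    d = h1.shape[0]
    # All gate pre-activations computed with one affine map over the input and both children.
    z = params["W"] @ x + params["U1"] @ h1 + params["U2"] @ h2 + params["b"]
    i  = sigmoid(z[0:d])        # input gate
    f1 = sigmoid(z[d:2*d])      # forget gate for the left child
    f2 = sigmoid(z[2*d:3*d])    # forget gate for the right child
    o  = sigmoid(z[3*d:4*d])    # output gate
    u  = np.tanh(z[4*d:5*d])    # candidate update
    c = i * u + f1 * c1 + f2 * c2   # memory cell mixes both children; no peephole connections
    h = o * np.tanh(c)
    return h, c
</syntaxhighlight>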
The state value at the root node is fed to the last layer of the system. Then, a non-linear transformation is applied to obtain the sentence embedding. In the basic configuration of the model, the error is measured by calculating the cosine similarity between the target and the predicted embeddings. The target is the embedding of the word that the definition describes. Pre-trained word embeddings or randomly initialised embeddings might be employed; in the second case, the error is also propagated to the leaf nodes of the graph and thus the word embeddings are updated during training. We did not initialise embeddings randomly, because this consistently produced poorer results compared with the same model using pre-trained word embeddings.

In the network configurations of the Tree-LSTM models, we added an extra backward link between the root node and the leaves, reversing the uplink path (as hinted at in (Socher et al., 2011; Paulus et al., 2014)). In these settings, the error to minimise is a combination of the target word similarity and the leaf word similarity, modulated by a smoothing parameter.
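As an illustration, the sketch below computes the basic cosine-based error between the predicted root embedding and the target word embedding, and optionally blends in a leaf-level term when the reverse channel is active. The parameter name alpha and the linear mixing are assumptions; the text only states that the two similarities are combined and modulated by a smoothing parameter.

<syntaxhighlight lang="python">
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def definition_loss(root_emb, target_emb, leaf_embs=None, leaf_targets=None, alpha=0.5):
    """Cosine-distance loss between the root embedding and the target word embedding.

    When the optional reverse channel is active, a leaf-level term (average cosine
    distance between back-propagated leaf states and the original word embeddings)
    is blended in, weighted by the smoothing parameter alpha (hypothetical name).
    """
    loss = 1.0 - cosine(root_emb, target_emb)
    if leaf_embs is not None:
        leaf_loss = np.mean([1.0 - cosine(h, w) for h, w in zip(leaf_embs, leaf_targets)])
        loss = alpha * loss + (1.0 - alpha) * leaf_loss
    return loss
</syntaxhighlight>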
We implemented our model with Theano (Theano Development Team, 2016) and trained it with minibatches of size 30 and Adam (Kingma and Ba, 2014) as the optimisation algorithm (with parameters β1 = 0.9, β2 = 0.999 and learning rate l = 0.002). This configuration has achieved state-of-the-art performance in other NLP tasks (Kumar et al., 2015).
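For reference, a single Adam parameter update under the configuration above would look roughly as follows; β1 = 0.9, β2 = 0.999 and the learning rate of 0.002 come from the text, while ε is the usual Adam default and is assumed here.

<syntaxhighlight lang="python">
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba, 2014) with the hyperparameters used above."""
    m = beta1 * m + (1.0 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
</syntaxhighlight>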
4 Experiments

We compared DictRep (BoW and RNN) and our Tree-LSTM variant on a benchmark of unsupervised text similarity tasks and on a supervised task (sentiment polarity). These tasks benefit greatly from a good representation of sentences, and building datasets for them requires a lot of human effort.

DictRep models were trained using the available data and online code. For a fair comparison, all models employed the pre-trained word embeddings and training data provided by (Hill et al., 2015), and cosine similarity as the error metric. The configuration settings were similar for all the models. Our model employs two connection configurations: the Tree-LSTM with transformed dependency graphs, and a sequential mapping of connections, which is conceptually similar to the DictRep-RNN model.

For Skip-thoughts we used the publicly available code (https://github.com/ryankiros/skip-thoughts) and the pre-trained model with a sentence representation of 4800 dimensions. Additionally, we trained a compressed model, with sentence and word representation dimensions of 1200 and 320 respectively, in about three weeks. As with the available model, the 80 million entries of the BookCorpus dataset (Zhu et al., 2015) were used during training.

The objective of the semantic similarity benchmark is to measure the similarity between a pair of sentences. The SemEval STS 2014 (Agirre et al., 2014) and SICK (Marelli et al., 2014) datasets were used. In both datasets, each example has a gold-standard score between 0 (totally unrelated sentences) and 5 (completely similar). Furthermore, the SICK dataset considers three different types of semantic relatedness (Neutral, Entailment and Contradiction); we tested the models against the three relations to check whether recursive and recurrent models exhibited different behaviour. This is the same benchmark used in previous work (Hill et al., 2016), but excluding the WordNet set, since it was used as part of the training.
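The unsupervised evaluation therefore amounts to scoring each sentence pair by the cosine similarity of its embeddings and correlating those scores with the gold annotations, as in Table 1. A minimal sketch, assuming a hypothetical embed() function that maps a sentence to its vector for whichever model is being tested:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_similarity(embed, sentence_pairs, gold_scores):
    """Correlate cosine similarities of sentence embeddings with the 0-5 gold scores."""
    sims = []
    for s1, s2 in sentence_pairs:
        v1, v2 = embed(s1), embed(s2)
        sims.append(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return spearmanr(sims, gold_scores)[0], pearsonr(sims, gold_scores)[0]
</syntaxhighlight>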
For sentiment polarity, we used as training/validation data the Sentiment Penn Treebank dataset (http://nlp.stanford.edu/sentiment/treebank.html). In this dataset, each sentence node is labelled with a 5-level intensity tag from 0, the most negative, to 4. Sentences are already binarised in the same format as our TreeDict approach, so no preprocessing is needed for the tree models in this task. For training and testing we used the labels at the root node, which give the overall sentence polarity. For completeness, we repeat the analysis with 3-label annotations over the same dataset. We used the same SVM classifier for all the models and trained it with the sentence vectors as input.
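The supervised task thus reduces to training a single classifier on fixed sentence vectors. The text only states that the same SVM is shared by all models; the linear kernel and the macro-averaged F1 below are assumptions made to keep the sketch concrete.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def polarity_f1(train_vecs, train_labels, test_vecs, test_labels):
    """Train an SVM on sentence vectors and report F1 on the test split."""
    clf = LinearSVC()
    clf.fit(np.asarray(train_vecs), np.asarray(train_labels))
    pred = clf.predict(np.asarray(test_vecs))
    return f1_score(test_labels, pred, average="macro")
</syntaxhighlight>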
5 Results and conclusion

The DictRep BoW model was undeniably better than the recurrent and recursive models, achieving the best position in all cases (Table 1). The TreeDict-Dep model ranked second (the character "-" in the table indicates that some sentence vectors could not be obtained, e.g. due to a malformed dependency graph).
Figure 1: Tree-LSTM schema employed. Dotted blocks and lines depict the optional reverse channel.
All models capture the correlations with human annotations better in neutral contexts. When there are contradiction and entailment relationships, the agreement with human annotations is less evident. Nevertheless, this behaviour is expected and even desirable, as this is an unsupervised benchmark and the system has no way of learning a similar but conflicting relationship without external help.

It is clear that BoW models offered the best performance on all the datasets. The Tree-LSTM model, which is consistently better than the sequential models, ranked second. Table 2 shows the correlation among models over the SICK similarity dataset. All the models show strong cross-correlations, but the Tree-LSTM with dependency parsing showed the closest correlation with the BoW and recurrent models.

Table 3 shows the performance of the models in the supervised polarity tasks. BoW and Skip-thoughts models obtain similar outcomes for the 5-label and 3-label tasks. Models trained with dictionary definitions (DictRep and TreeDict) lag behind them. However, all the networks using dependency structures consistently beat their sequential counterparts, a strong indication of the benefits of using this more complex network structure. The differences between the network configurations of the same model are less pronounced than in the similarity tasks, but in our tests the models that used the extra backward link achieved small gains (at least in the 3-label task).

In previous work, (Hill et al., 2016) compared other models on this same similarity benchmark, achieving comparable results. Not only did DictRep-BoW models outperform the DictRep-RNNs, but the Skip-thought model, which considers the order of the words in a sentence, was also beaten by FastSent, its counterpart that employs a BoW representation of a sentence.

The effect of word ordering is not clear. BoW models are far from ideal, as they cannot capture which parts are negated or the dependencies among the different elements of the sentence (e.g. "the black dog chases the white cat" and "the black cat chases the white dog" cannot be differentiated using only BoW models).

It is important to mention that the similarity was tested only at the root node when using the Tree-LSTM. Notwithstanding, recursive models allow more elaborate strategies that take advantage of the dependencies used to build the relationships of the nodes in the deep network. These strategies could combine similarities at different levels of the sentence to obtain a more accurate similarity value (e.g. using a pooling matrix over all the nodes of the parse tree (Socher et al., 2011)).
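A simpler variant of this idea (not the dynamic pooling matrix of (Socher et al., 2011) itself) would be to pool the hidden states of all tree nodes into a single vector and compare those pooled vectors instead of the root states alone; the sketch below assumes the per-node hidden states are available.

<syntaxhighlight lang="python">
import numpy as np

def pooled_sentence_vector(node_states):
    """Combine the hidden states of every node in the parse tree by mean- and max-pooling."""
    states = np.stack(node_states)          # shape: (num_nodes, d)
    return np.concatenate([states.mean(axis=0), states.max(axis=0)])
</syntaxhighlight>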
The errors at training time on held-out data were 0.57 for the BoW models versus the 0.51 achieved by the recurrent and recursive models. Nevertheless, better dictionary embeddings do not seem to translate directly into better performance at inferring general-purpose sentence embeddings in the benchmarks. The test results also show that we need better mechanisms to infer sentence-level representations.
                            STS 2014                                              SICK
Model                  News     Forum    Twitter  Images   Headlines   Neu      Ent      Con      All
DictRep-BoW            .67/.74  .42/.39  .60/.65  .71/.74  .58/.62     .60/.70  .58/.56  .12/.18  .62/.72
DictRep-RNN            .45/.52  .06/.04  .30/.32  .57/.57  .39/.42     .52/.59  .22/.23  .09/.10  .48/.56
TreeDict-Seq           .48/.54  .24/.23  .40/.45  .60/.64  .46/.51     .51/.59  .24/.27  .07/.10  .51/.59
TreeDict-Seq 250       .50/.58  .20/.21  .44/.47  .61/.66  .46/.49     .56/.62  .27/.30  .08/.11  .54/.64
TreeDict-Seq 250BL     .47/.47  .23/.21  .52/.59  .51/.51  .43/.45     .48/.52  .29/.33  .10/.14  .51/.56
TreeDict-Dep           .48/.55  .29/.28  -        .61/.67  -           .56/.64  .35/.39  .08/.13  .55/.65
TreeDict-Dep 250       .50/.56  .31/.30  -        .56/.63  -           .55/.61  .36/.41  .09/.12  .56/.63
TreeDict-Dep 250BL     .43/.45  .30/.28  -        .56/.58  -           .52/.56  .34/.38  .09/.11  .55/.60
SkipThoughts-4800      .43/.23  .13/.13  .42/.40  .48/.51  .36/.37     .49/.49  .19/.25  .10/.15  .48/.50
SkipThoughts-1200      .55/.54  .22/.23  -        .55/.61  .39/.41     .56/.56  .21/.24  .09/.15  .53/.56

Table 1: Performance of the models measured with Spearman/Pearson correlations against gold-standard annotations in the similarity benchmarks.
Model    D.BoW     D.RNN     T.Seq     T.Dep
D.BoW    1.0/1.0   .70/.71   .74/.75   .80/.82
D.RNN    .70/.71   1.0/1.0   .77/.75   .73/.72
T.Seq    .74/.75   .77/.75   1.0/1.0   .79/.78
T.Dep    .80/.82   .73/.72   .78/.78   1.0/1.0

Table 2: Spearman/Pearson correlations among the different models on the SICK dataset.
Model                F1-score (5-label)   F1-score (3-label)
DictRep-BoW          .40                  .56
DictRep-RNN          .32                  .49
TreeDict-Seq         .31                  .49
TreeDict-Seq 250     .32                  .48
TreeDict-Seq 250BL   .32                  .49
TreeDict-Dep         .35                  .53
TreeDict-Dep 250     .35                  .51
TreeDict-Dep 250BL   .35                  .53
SkipThoughts-4800    .40                  .56
SkipThoughts-1200    .38                  .55

Table 3: Performance of the models in the polarity detection task.

In this paper we introduced the use of recursive models for the generation of general-purpose embeddings, trained by embedding dictionary definitions. We compare recurrent and recursive models on the dictionary embedding task and test the validity of these embeddings as general-purpose codifications of sentences on both similarity and polarity tasks.

Results demonstrate slight advantages of the tree recursive variant over the recurrent models that learn from dictionaries, which are more frequently employed. Recursive models are computationally more expensive and have a more complex implementation, but they exhibit better performance on longer sentences. However, with current learning techniques, recurrent and recursive models cannot offer better results than simpler models such as BoW representations of sentences in unsupervised similarity benchmarks. These findings should be confirmed in the future in more complex scenarios, such as large-scale QA.

Acknowledgments

This work has been funded by the Spanish Ministerio de Economía y Competitividad through the project INRISCO (TEC2014-54335-C4-4-R).
References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81-91.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183-1193. Association for Computational Linguistics.

Samuel R. Bowman. 2013. Can recursive neural tensor networks learn logical reasoning? arXiv:1312.6192.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.

Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in Semantic Representation. Psychological Review, 114(2):211.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2015. Learning to Understand Phrases by Embedding the Dictionary. Transactions of the Association for Computational Linguistics.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data. arXiv:1602.03483.

Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In EMNLP, pages 633-644.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Advances in Neural Information Processing Systems, pages 3294-3302.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv preprint arXiv:1506.07285.

Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, volume 14, pages 1188-1196.

Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When Are Tree Structures Necessary for Deep Learning of Representations? arXiv:1503.00185.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216-223.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Romain Paulus, Richard Socher, and Christopher D. Manning. 2014. Global belief recursive neural networks. In Advances in Neural Information Processing Systems, pages 2888-2896.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532-1543.

Sent2Vec encoder and training code from the paper "Skip-Thought Vectors". https://github.com/ryankiros/skip-thoughts. Accessed: 2017-07-07.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information Processing Systems, pages 801-809.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201-1211. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631-1642.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. ACL.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv preprint arXiv:1506.06724.