=Paper=
{{Paper
|id=Vol-1749/paper_030
|storemode=property
|title=Tandem LSTM–SVM Approach for Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_030.pdf
|volume=Vol-1749
|authors=Andrea Cimino,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/clic-it/CiminoD16a
}}
==Tandem LSTM–SVM Approach for Sentiment Analysis==
Andrea Cimino and Felice Dell’Orletta
Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{andrea.cimino, felice.dellorletta}@ilc.cnr.it
Abstract

English. In this paper we describe our approach to the EVALITA 2016 SENTIPOLC task. We participated in all the sub-tasks in the constrained setting: Subjectivity Classification, Polarity Classification and Irony Detection. We developed a tandem architecture in which a Long Short Term Memory recurrent neural network is used to learn the feature space and to capture temporal dependencies, while Support Vector Machines are used for classification. The SVMs combine the document embedding produced by the LSTM with a wide set of general-purpose features qualifying the lexical and grammatical structure of the text. We achieved the second best accuracy in Subjectivity Classification, the third position in Polarity Classification and the sixth position in Irony Detection.

Italiano (translated). In this paper we describe the system we used to tackle the sub-tasks of the SENTIPOLC task at the EVALITA 2016 conference. In this edition we participated in all the sub-tasks in the constrained setting, i.e. without using manually annotated resources other than those distributed by the organizers. For this participation we developed a method that combines a Long Short Term Memory recurrent neural network, used to learn the feature space and to capture temporal dependencies, with Support Vector Machines for classification. The SVMs combine the document representation produced by the LSTM with a wide set of features describing the lexical and grammatical structure of the text. With this system we obtained the second position in Subjectivity Classification, the third position in Polarity Classification and the sixth in Irony Detection.

1 Description of the system

We addressed the EVALITA 2016 SENTIPOLC task (Barbieri et al., 2016) as three classification problems: two binary classification tasks (Subjectivity Classification and Irony Detection) and a four-class classification task (Polarity Classification).

We implemented a tandem LSTM-SVM classifier operating on morpho-syntactically tagged texts. We used this architecture since similar systems were successfully employed to tackle different classification problems such as keyword spotting (Wöllmer et al., 2009) or the automatic estimation of human affect from speech signals (Wöllmer et al., 2010), showing that tandem architectures outperform the single classifiers.

In this work we used the Keras (Chollet, 2016) deep learning framework and LIBSVM (Chang and Lin, 2001) to generate the LSTM and the SVM statistical models, respectively.

Since our approach relies on morpho-syntactically tagged texts, both training and test data were automatically morpho-syntactically tagged by the POS tagger described in (Dell'Orletta, 2009). In addition, in order to improve the overall accuracy of our system (described in 1.2), we developed the sentiment polarity and word embedding lexicons(1) described below.

(1) All the created lexicons are made freely available at the following website: http://www.italianlp.it/.
1.1 Lexical resources

1.1.1 Sentiment Polarity Lexicons
Sentiment polarity lexicons provide mappings between a word and its sentiment polarity (positive, negative, neutral). For our experiments, we used a publicly available lexicon for Italian and two English lexicons that we automatically translated. In addition, we adopted an unsupervised method to automatically create a lexicon specific to the Italian Twitter language.

Existing Sentiment Polarity Lexicons
We used the Italian sentiment polarity lexicon (hereafter referred to as OPENER) (Maks et al., 2014) developed within the OpeNER European project(2). This is a freely available lexicon for the Italian language(3) and includes 24,000 Italian word entries. It was automatically created using a propagation algorithm, and the most frequent words were manually reviewed.

Automatically translated Sentiment Polarity Lexicons
• The Multi-Perspective Question Answering (hereafter referred to as MPQA) Subjectivity Lexicon (Wilson et al., 2005). This lexicon consists of approximately 8,200 English words with their associated polarity. In order to use this resource for the Italian language, we translated all the entries through the Yandex translation service(4).
• The Bing Liu Lexicon (hereafter referred to as BL) (Hu and Liu, 2004). This lexicon includes approximately 6,000 English words with their associated polarity. This resource was also automatically translated by the Yandex translation service.

Automatically created Sentiment Polarity Lexicons
We built a corpus of positive and negative tweets following the Mohammad et al. (2013) approach adopted in the SemEval 2013 sentiment polarity detection task. For this purpose we queried the Twitter API with a set of hashtag seeds indicating positive and negative sentiment polarity. We selected 200 positive word seeds (e.g. "vincere" to win, "splendido" splendid, "affascinante" fascinating) and 200 negative word seeds (e.g. "tradire" betray, "morire" die). These terms were chosen from the OPENER lexicon. The resulting corpus is made up of 683,811 tweets extracted with positive seeds and 1,079,070 tweets extracted with negative seeds.

The main purpose of this procedure was to assign a polarity score to each n-gram occurring in the corpus. For each n-gram (we considered n-grams up to length five) we calculated the corresponding sentiment polarity score with the following scoring function: score(ng) = PMI(ng, pos) − PMI(ng, neg), where PMI stands for pointwise mutual information.

(2) http://www.opener-project.eu/
(3) https://github.com/opener-project/public-sentiment-lexicons
(4) http://api.yandex.com/translate/
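To make the scoring function concrete, the snippet below is a minimal illustrative sketch, not the authors' code: it assumes plain n-gram count dictionaries over the two seed-derived corpora, and the additive smoothing constant is our own addition to avoid log(0) for n-grams that occur in only one corpus.

```python
import math
from collections import Counter

def pmi_scores(pos_ngrams, neg_ngrams):
    """Compute score(ng) = PMI(ng, pos) - PMI(ng, neg) for every n-gram
    observed in the positive / negative seed-extracted tweet corpora.

    pos_ngrams, neg_ngrams: iterables of n-grams (tuples of tokens).
    """
    pos_counts, neg_counts = Counter(pos_ngrams), Counter(neg_ngrams)
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())
    n_all = n_pos + n_neg

    scores = {}
    for ng in set(pos_counts) | set(neg_counts):
        freq = pos_counts[ng] + neg_counts[ng]
        # PMI(ng, class) = log[ P(ng, class) / (P(ng) * P(class)) ];
        # 0.5 is added to the joint counts as smoothing (an assumption).
        pmi_pos = math.log(((pos_counts[ng] + 0.5) / n_all) /
                           ((freq / n_all) * (n_pos / n_all)))
        pmi_neg = math.log(((neg_counts[ng] + 0.5) / n_all) /
                           ((freq / n_all) * (n_neg / n_all)))
        scores[ng] = pmi_pos - pmi_neg
    return scores
```

Up to smoothing, this reduces to log((c_pos(ng) * N_neg) / (c_neg(ng) * N_pos)), so n-grams that are relatively more frequent in the positive corpus get positive scores.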
1.1.2 Word Embedding Lexicons
Since the lexical information in tweets can be very sparse, to overcome this problem we built two word embedding lexicons.

For this purpose, we trained two predict models using the word2vec(5) toolkit (Mikolov et al., 2013). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. These models learn lower-dimensional word embeddings: embeddings are represented by a set of latent (hidden) variables, and each word is a multidimensional vector that represents a specific instantiation of these variables. We built two Word Embedding Lexicons starting from the following corpora:
• The first lexicon was built using a tokenized version of the itWaC corpus(6). The itWaC corpus is a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.
• The second lexicon was built from a tokenized corpus of tweets. This corpus was collected using the Twitter APIs and is made up of 10,700,781 Italian tweets.

(5) http://code.google.com/p/word2vec/
(6) http://wacky.sslmit.unibo.it/doku.php?id=corpora
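As a hedged illustration of this training setup, the sketch below uses the gensim re-implementation of word2vec rather than the original C toolkit. The CBOW architecture, the symmetric window of 5 and the 128-dimensional vectors match the settings reported in this paper (the dimensionality is stated in section 1.2.1); `min_count` and the form of the input iterable are our assumptions.

```python
from gensim.models import Word2Vec

def build_embedding_lexicon(tokenized_sentences):
    """tokenized_sentences: an iterable of token lists, e.g. one per
    sentence of the tokenized itWaC corpus or one per tweet."""
    model = Word2Vec(
        sentences=tokenized_sentences,
        vector_size=128,  # one 128-dim embedding per word, as in the paper
        window=5,         # symmetric context window of 5 words
        sg=0,             # sg=0 selects the CBOW architecture
        min_count=5,      # frequency cutoff (assumed, not stated in the paper)
        workers=4,
    )
    return model.wv       # the resulting word -> vector lexicon
```

Running this once on the itWaC corpus and once on the tweet corpus would yield the two Word Embedding Lexicons described above.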
1.2 The LSTM-SVM tandem system

SVM is an extremely efficient learning algorithm and hard to outperform. Unfortunately, this type of algorithm captures "sparse" and "discrete" features in document classification tasks, which makes the detection of relations within sentences really hard; such relations are often the key factor in detecting the overall sentiment polarity of documents (Tang et al., 2015). On the contrary, Long Short Term Memory (LSTM) networks are a specialization of Recurrent Neural Networks (RNN) that are able to capture long-term dependencies in a sentence. This type of neural network was recently tested on sentiment analysis tasks (Tang et al., 2015; Xu et al., 2016), where it was proven to outperform commonly used learning algorithms on several sentiment analysis tasks (Nakov et al., 2016), showing improvements of 3-4 points. For this work, we implemented a tandem LSTM-SVM to take advantage of both classification strategies.

[Figure 1: The LSTM-SVM architecture. The diagram shows unlabeled tweets and the itWaC corpus producing the Twitter/itWaC word embeddings used by the LSTM, which is trained on the labeled tweets; sentence embeddings extracted from the LSTM are combined with document features extracted from the labeled tweets for SVM model generation, yielding the final statistical model.]
Figure 1 shows a graphical representation of the proposed tandem architecture. This architecture is composed of two sequential machine learning steps, both involved in the training and classification phases. In the training phase, the LSTM network is trained on the training documents and the corresponding gold labels. Once the statistical model of the LSTM neural network is computed, a document vector (document embedding) is computed for each document of the training set by feeding the document to the LSTM network and exploiting the weights obtained from the penultimate network layer (the layer before the SoftMax classifier). The document embeddings are used as features during the training phase of the SVM classifier, in conjunction with a set of widely used document classification features. Once the training phase of the SVM classifier is completed, the tandem architecture is considered trained. The same stages are involved in the classification phase: for each document to be classified, an embedding vector is obtained by exploiting the previously trained LSTM network. Finally, the embedding is used jointly with the other document classification features by the SVM classifier, which outputs the predicted class.
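A minimal sketch of these two phases follows. It is not the authors' implementation: it assumes an already trained Keras model whose penultimate layer produces the document embedding, and it uses scikit-learn's SVC (which wraps LIBSVM) in place of a direct LIBSVM call; all function and variable names are illustrative.

```python
import numpy as np
from keras.models import Model
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def train_tandem(lstm_model, X_seq, handcrafted, y):
    """lstm_model: trained Keras network ending in a softmax layer;
    X_seq: padded sequences of per-word input vectors;
    handcrafted: matrix of the document features of section 1.2.2;
    y: gold labels."""
    # Expose the penultimate layer (the one before the softmax classifier)
    # and use its activations as document embeddings.
    embedder = Model(inputs=lstm_model.input,
                     outputs=lstm_model.layers[-2].output)
    doc_embeddings = embedder.predict(X_seq)

    # Train the SVM on the embeddings joined with the handcrafted features.
    svm = SVC(kernel="poly", degree=2)  # the quadratic-kernel configuration
    svm.fit(np.hstack([doc_embeddings, handcrafted]), y)
    return embedder, svm

def classify_tandem(embedder, svm, X_seq, handcrafted):
    # Classification repeats the same two stages.
    emb = embedder.predict(X_seq)
    return svm.predict(np.hstack([emb, handcrafted]))
```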
1.2.1 The LSTM network
In this part, we describe the LSTM model employed in the tandem architecture. The LSTM unit was initially proposed by Hochreiter and Schmidhuber (1997). LSTM units are able to propagate an important feature that appeared early in the input sequence over a long distance, thus capturing potential long-distance dependencies.

LSTM is a state-of-the-art learning algorithm for semantic composition: it allows the representation of a document to be computed from the representations of its words, with multiple levels of abstraction. Each word is represented by a low-dimensional, continuous and real-valued vector, also known as a word embedding, and all the word vectors are stacked in a word embedding matrix.

We employed a bidirectional LSTM architecture since this kind of architecture captures long-range dependencies from both directions of a document by constructing bidirectional links in the network (Schuster and Paliwal, 1997). In addition, we applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks (Gal and Ghahramani, 2015). As suggested in (Gal and Ghahramani, 2015), we chose a dropout factor in the optimum range [0.3, 0.5], more specifically 0.45 for this work. As for the optimization process, categorical cross-entropy is used as the loss function and optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012).

Each input word to the LSTM architecture is represented by a 262-dimensional vector, composed as follows:
Word embeddings: the concatenation of the two word embeddings extracted from the two available Word Embedding Lexicons (128 dimensions for each word embedding, a total of 256 dimensions); for each word embedding an extra component was added in order to handle the "unknown word" case (2 dimensions).
Word polarity: the corresponding word polarity obtained by exploiting the Sentiment Polarity Lexicons. This results in 3 components, one for each possible lexicon outcome (negative, neutral, positive) (3 dimensions). We assumed that a word not found in the lexicons has a neutral polarity.
End of Sentence: a component indicating whether or not the sentence was completely read (1 dimension).
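The following sketch shows how such a network could be assembled in Keras under the settings stated above (bidirectional LSTM, dropout 0.45 on inputs and recurrent connections, categorical cross-entropy, rmsprop). The hidden size, the maximum sequence length and the intermediate dense layer (serving as the penultimate, embedding-producing layer) are assumptions, as the paper does not specify them.

```python
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

MAX_LEN = 50  # assumed maximum tweet length in tokens (not stated in the paper)

def build_blstm(n_classes, hidden_units=64):  # hidden size is an assumption
    model = Sequential()
    # Bidirectional LSTM over the sequence of 262-dim word vectors, with a
    # 0.45 dropout factor on both the inputs and the recurrent connections.
    model.add(Bidirectional(
        LSTM(hidden_units, dropout=0.45, recurrent_dropout=0.45),
        input_shape=(MAX_LEN, 262)))
    # Assumed penultimate dense layer: its activations would serve as the
    # document embedding passed to the SVM in the tandem system.
    model.add(Dense(hidden_units, activation="relu"))
    model.add(Dense(n_classes, activation="softmax"))
    # Loss and optimizer as stated in the paper.
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model
```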
1.2.2 The SVM classifier
The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2014 SENTIPOLC edition (Cimino et al., 2014). The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features.

Raw and Lexical Text Features
Topic: the manually annotated topic class provided by the task organizers for each tweet.
Number of tokens: the number of tokens occurring in the analyzed tweet.
Character n-grams: presence or absence of contiguous sequences of characters in the analyzed tweet.
Word n-grams: presence or absence of contiguous sequences of tokens in the analyzed tweet.
Lemma n-grams: presence or absence of contiguous sequences of lemmas occurring in the analyzed tweet.
Repetition of n-gram chars: presence or absence of contiguous repetitions of characters in the analyzed tweet.
Number of mentions: the number of mentions (@) occurring in the analyzed tweet.
Number of hashtags: the number of hashtags occurring in the analyzed tweet.
Punctuation: whether the analyzed tweet finishes with one of the following punctuation characters: "?", "!".

Morpho-syntactic Features
Coarse-grained Part-Of-Speech n-grams: presence or absence of contiguous sequences of coarse-grained PoS tags, corresponding to the main grammatical categories (noun, verb, adjective).
Fine-grained Part-Of-Speech n-grams: presence or absence of contiguous sequences of fine-grained PoS tags, which represent subdivisions of the coarse-grained tags (e.g. the class of nouns is subdivided into proper vs. common nouns, verbs into main verbs, gerund forms, past participles).
Coarse-grained Part-Of-Speech distribution: the distribution of nouns, adjectives, adverbs and numbers in the tweet.

Lexicon features
Emoticons: presence or absence of positive or negative emoticons in the analyzed tweet. The lexicon of emoticons was extracted from the site http://it.wikipedia.org/wiki/Emoticon and manually classified.
Lemma sentiment polarity n-grams: for each n-gram of lemmas extracted from the analyzed tweet, the feature records the polarity of each component lemma in the existing sentiment polarity lexicons. Lemmas that are not present are marked with the ABSENT tag. This is for example the case of the trigram "tutto molto bello" (all very nice), which is marked as "ABSENT-POS-POS" because molto and bello are marked as positive in the considered polarity lexicon while tutto is absent. The feature is computed for each existing sentiment polarity lexicon.
Polarity modifier: for each lemma in the tweet occurring in the existing sentiment polarity lexicons, the feature checks for the presence of adjectives or adverbs in a left context window of size 2. If this is the case, the polarity of the lemma is assigned to the modifier. This is for example the case of the bigram "non interessante" (not interesting), where "interessante" is a positive word and "non" is an adverb; accordingly, the feature "non POS" is created. The feature is computed 3 times, once for each existing sentiment polarity lexicon.
PMI score: for the unigrams, bigrams, trigrams, four-grams and five-grams occurring in the analyzed tweet, the feature computes, for each n-gram length i, the score Σ_{i-gram ∈ tweet} score(i-gram), and returns the minimum and the maximum of the five resulting values (approximated to the nearest integer).
Distribution of sentiment polarity: this feature computes the percentage of positive, negative and neutral lemmas occurring in the tweet. To overcome the sparsity problem, the percentages are rounded to the nearest multiple of 5. The feature is computed for each existing lexicon (see the sketch after this list).
Most frequent sentiment polarity: the feature returns the most frequent sentiment polarity of the lemmas in the analyzed tweet. The feature is computed for each existing lexicon.
Sentiment polarity in tweet sections: the feature first splits the tweet into three equal sections. For each section the most frequent polarity is computed using the available sentiment polarity lexicons. This feature is aimed at identifying changes of polarity within the same tweet.
Word embeddings combination: the feature returns the vectors obtained by separately averaging the word embeddings of the nouns, adjectives and verbs of the tweet. It is computed once for each word embedding lexicon, obtaining a total of 6 vectors for each tweet.
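As an illustration of how one of these lexicon features could be computed, the sketch below implements the "Distribution of sentiment polarity" feature for a single lexicon. It assumes the lexicon is a plain mapping from lemma to one of the strings "positive", "negative" or "neutral"; the function name and signature are ours, not the authors'.

```python
def polarity_distribution(lemmas, lexicon):
    """Percentage of positive/negative/neutral lemmas in the tweet,
    rounded to the nearest multiple of 5 to limit feature sparsity.
    Lemmas absent from the lexicon are treated as neutral."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for lemma in lemmas:
        counts[lexicon.get(lemma, "neutral")] += 1
    total = max(len(lemmas), 1)  # guard against empty tweets
    return {pol: 5 * round(100.0 * c / total / 5)
            for pol, c in counts.items()}
```

The same rounding idea (discretizing real-valued percentages into a small set of buckets) is what keeps this feature from exploding the SVM feature space.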
2 Results and Discussion

We tested five different learning configurations of our system: linear and quadratic support vector machines (linear SVM, quadratic SVM) using the features described in section 1.2.2, with the exception of the document embeddings generated by the LSTM; the LSTM alone, using the word embeddings described in 1.2.2; and a tandem LSTM-SVM combination with linear and quadratic SVM kernels (linear Tandem, quadratic Tandem) using the features described in section 1.2.2 together with the document embeddings generated by the LSTM. To test the proposed classification models, we created an internal development set randomly selected from the training set distributed by the task organizers. The resulting development set is composed of 10% (740 tweets) of the whole training set.

Configuration       Subject.  Polarity  Irony
linear SVM          0.725     0.713     0.636
quadratic SVM       0.740     0.730     0.595
LSTM                0.777     0.747     0.646
linear Tandem       0.764     0.743     0.662
quadratic Tandem    0.783     0.754     0.675

Table 1: Classification results of the different learning models on our development set.

Table 1 reports the overall accuracies achieved by the classifiers on our internal development set for all the tasks. The accuracy is calculated as the F-score obtained using the evaluation tool provided by the organizers. It is worth noting that the accuracies of the proposed learning models show similar trends across the three tasks. In particular, LSTM outperforms the SVM models, while the tandem systems clearly outperform both the SVM and LSTM ones. In addition, the quadratic models perform better than the linear ones. These results led us to choose the linear and quadratic tandem models as the final systems to be used on the official test set.

Configuration        Subject.  Polarity  Irony
best official runs   0.718     0.664     0.548
quadratic SVM        0.704     0.646     0.477
linear SVM           0.661     0.631     0.495
LSTM                 0.716     0.674     0.468
linear Tandem*       0.676     0.650     0.499
quadratic Tandem*    0.713     0.643     0.472

Table 2: Classification results of the different learning models on the official test set.

Table 2 reports the overall accuracies achieved by all our classifier configurations on the official test set; the officially submitted runs are starred in the table. The "best official runs" row reports, for each task, the best official result in EVALITA 2016 SENTIPOLC. As can be seen, the accuracies of the different learning models reveal a different trend when tested on the development and the test sets. Differently from what we observed in the development experiments, the best system turns out to be the LSTM one, and the gap in accuracy between the linear and quadratic models is smaller or absent. In addition, the accuracies of all the systems are definitely lower than the ones obtained in our development experiments. In our opinion, such results may depend on the occurrence of out-of-domain tweets in the test set with respect to the ones contained in the training set. Different groups of annotators could be a further explanation for these different results and trends.

3 Conclusion

In this paper, we reported the results of our participation in the EVALITA 2016 SENTIPOLC tasks. By resorting to a tandem LSTM-SVM system, we achieved second place in the Subjectivity Classification task, third place in the Sentiment Polarity Classification task and sixth place in the Irony Detection task. This tandem system combines the ability of the bidirectional LSTM to capture long-range dependencies between words from both directions of a tweet with SVMs that are able to exploit the document embeddings produced by the LSTM in conjunction with a wide set of general-purpose features qualifying the lexical and grammatical structure of a text. A current direction of research is the introduction of a character-based LSTM (dos Santos and Zadrozny, 2014) in the tandem system, since character-based models have proven particularly suitable for analyzing social media texts (Dhingra et al., 2016).
References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). December, Naples, Italy.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Andrea Cimino, Stefano Cresci, Felice Dell'Orletta and Maurizio Tesconi. 2014. Linguistically-motivated and Lexicon Features for Sentiment Analysis of Italian Tweets. In Proceedings of EVALITA '14, Evaluation of NLP and Speech Tools for Italian. December, Pisa, Italy.

Felice Dell'Orletta. 2009. Ensemble system for Part-of-Speech tagging. In Proceedings of EVALITA '09, Evaluation of NLP and Speech Tools for Italian. December, Reggio Emilia, Italy.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl and William Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. In Proceedings of the 54th Annual Meeting of the ACL. Berlin, Germany.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04. 168-177, New York, NY, USA. ACM.

Isa Maks, Ruben Izquierdo, Francesca Frontini, Montse Cuadros, Rodrigo Agerri and Piek Vossen. 2014. Generating Polarity Lexicons with WordNet propagation in 5 languages. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Saif Mohammad, Svetlana Kiritchenko and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises, SemEval-2013. 321-327, Atlanta, Georgia, USA.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP 2015. 1422-1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005. 347-354, Stroudsburg, PA, USA. ACL.

Martin Wöllmer, Florian Eyben, Alex Graves, Björn Schuller and Gerhard Rigoll. 2009. Tandem BLSTM-DBN architecture for keyword spotting with enhanced context modeling. In Proceedings of NOLISP.

Martin Wöllmer, Björn Schuller, Florian Eyben and Gerhard Rigoll. 2010. Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening. IEEE Journal of Selected Topics in Signal Processing.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).