         Tandem LSTM-SVM Approach for Sentiment Analysis

                         Andrea Cimino and Felice Dell’Orletta
         Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
                              ItaliaNLP Lab - www.italianlp.it
           {andrea.cimino, felice.dellorletta}@ilc.cnr.it



                          Abstract

English. In this paper we describe our approach to the EVALITA 2016 SENTIPOLC task. We participated in all the sub-tasks in the constrained setting: Subjectivity Classification, Polarity Classification and Irony Detection. We developed a tandem architecture in which a Long Short Term Memory recurrent neural network is used to learn the feature space and to capture temporal dependencies, while Support Vector Machines are used for classification. The SVMs combine the document embedding produced by the LSTM with a wide set of general-purpose features qualifying the lexical and grammatical structure of the text. We achieved the second best accuracy in Subjectivity Classification, the third position in Polarity Classification and the sixth position in Irony Detection.

Italiano. In this paper we describe the system we used to tackle the sub-tasks of the SENTIPOLC task at the EVALITA 2016 conference. In this edition we participated in all the sub-tasks in the constrained setting, i.e. without using manually annotated resources other than those distributed by the organizers. For this participation we developed a method that combines a Long Short Term Memory recurrent neural network, used to learn the feature space and to capture temporal dependencies, with Support Vector Machines for classification. The SVMs combine the document representation produced by the LSTM with a wide set of features describing the lexical and grammatical structure of the text. With this system we obtained the second position in Subjectivity Classification, the third position in Polarity Classification and the sixth in Irony Detection.


1    Description of the system

We addressed the EVALITA 2016 SENTIPOLC task (Barbieri et al., 2016) as a three-classification problem: two binary classification tasks (Subjectivity Classification and Irony Detection) and a four-class classification task (Polarity Classification).
   We implemented a tandem LSTM-SVM classifier operating on morpho-syntactically tagged texts. We chose this architecture because similar systems have been successfully employed to tackle different classification problems, such as keyword spotting (Wöllmer et al., 2009) and the automatic estimation of human affect from speech signals (Wöllmer et al., 2010), showing that tandem architectures outperform the single classifiers.
   In this work we used the Keras deep learning framework (Chollet, 2016) and LIBSVM (Chang and Lin, 2001) to generate the LSTM and SVM statistical models, respectively.
   Since our approach relies on morpho-syntactically tagged texts, both training and test data were automatically tagged by the POS tagger described in (Dell'Orletta, 2009). In addition, in order to improve the overall accuracy of our system (described in Section 1.2), we developed the sentiment polarity and word embedding lexicons[1] described below.

   [1] All the created lexicons are made freely available at the following website: http://www.italianlp.it/.
1.1    Lexical resources

1.1.1  Sentiment Polarity Lexicons
Sentiment polarity lexicons provide mappings between a word and its sentiment polarity (positive, negative, neutral). For our experiments, we used a publicly available lexicon for Italian and two English lexicons that we automatically translated. In addition, we adopted an unsupervised method to automatically create a lexicon specific to the Italian Twitter language.

Existing Sentiment Polarity Lexicons
We used the Italian sentiment polarity lexicon (hereafter referred to as OPENER) (Maks et al., 2014) developed within the OpeNER European project[2]. This is a freely available lexicon for the Italian language[3] and includes 24,000 Italian word entries. It was automatically created using a propagation algorithm, and the most frequent words were manually reviewed.

Automatically translated Sentiment Polarity Lexicons

   • The Multi-Perspective Question Answering (hereafter referred to as MPQA) Subjectivity Lexicon (Wilson et al., 2005). This lexicon consists of approximately 8,200 English words with their associated polarity. In order to use this resource for the Italian language, we translated all the entries through the Yandex translation service[4].

   • The Bing Liu Lexicon (hereafter referred to as BL) (Hu and Liu, 2004). This lexicon includes approximately 6,000 English words with their associated polarity. This resource was also automatically translated by the Yandex translation service.

Automatically created Sentiment Polarity Lexicons
We built a corpus of positive and negative tweets following the Mohammad et al. (2013) approach adopted in the SemEval 2013 sentiment polarity detection task. For this purpose we queried the Twitter API with a set of hashtag seeds that indicate positive and negative sentiment polarity. We selected 200 positive word seeds (e.g. "vincere" to win, "splendido" splendid, "affascinante" fascinating) and 200 negative word seeds (e.g. "tradire" to betray, "morire" to die). These terms were chosen from the OPENER lexicon. The resulting corpus is made up of 683,811 tweets extracted with positive seeds and 1,079,070 tweets extracted with negative seeds.
   The main purpose of this procedure was to assign a polarity score to each n-gram occurring in the corpus. For each n-gram (we considered n-grams up to length five) we calculated the corresponding sentiment polarity score with the following scoring function: score(ng) = PMI(ng, pos) - PMI(ng, neg), where PMI stands for pointwise mutual information.
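As a concrete illustration, the following sketch (ours, not the system's actual code) computes this scoring function over the two seed-extracted corpora; the add-one smoothing is an assumption, since the paper does not specify how n-grams observed with only one label are handled:

```python
import math
from collections import Counter

def ngrams(tokens, max_n=5):
    """All contiguous n-grams of a token list, for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def pmi_polarity_scores(pos_tweets, neg_tweets, max_n=5):
    """score(ng) = PMI(ng, pos) - PMI(ng, neg) for every n-gram.

    pos_tweets / neg_tweets: tokenized tweets extracted with the
    positive and negative hashtag seeds described above."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in pos_tweets:
        pos_counts.update(ngrams(tweet, max_n))
    for tweet in neg_tweets:
        neg_counts.update(ngrams(tweet, max_n))
    total_pos = sum(pos_counts.values())
    total_neg = sum(neg_counts.values())
    scores = {}
    for ng in set(pos_counts) | set(neg_counts):
        # The marginal P(ng) and the class priors cancel out in the
        # difference of the two PMI terms, leaving
        # score(ng) = log P(ng|pos) - log P(ng|neg).
        # Add-one smoothing (our assumption) keeps the score finite.
        p_pos = (pos_counts[ng] + 1) / (total_pos + 1)
        p_neg = (neg_counts[ng] + 1) / (total_neg + 1)
        scores[ng] = math.log(p_pos / p_neg)
    return scores
```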
1.1.2  Word Embedding Lexicons
Since the lexical information in tweets can be very sparse, we built two word embedding lexicons to overcome this problem.
   For this purpose, we trained two predict models using the word2vec[5] toolkit (Mikolov et al., 2013). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. These models learn lower-dimensional word embeddings: embeddings are represented by a set of latent (hidden) variables, and each word is a multidimensional vector that represents a specific instantiation of these variables. We built two Word Embedding Lexicons starting from the following corpora (a training sketch follows the list):

   • The first lexicon was built using a tokenized version of the itWaC corpus[6]. The itWaC corpus is a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

   • The second lexicon was built from a tokenized corpus of tweets. This corpus was collected using the Twitter APIs and is made up of 10,700,781 Italian tweets.
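As an illustration, this training setup can be reproduced with the gensim implementation of word2vec (our substitution for the original C toolkit; the frequency cut-off is an assumption, since the paper does not report one):

```python
from gensim.models import Word2Vec

# corpus: an iterable of tokenized texts (itWaC sentences or tweets), e.g.:
corpus = [["il", "gatto", "dorme"], ["che", "bella", "giornata"]]  # toy example

model = Word2Vec(
    sentences=corpus,
    vector_size=128,  # 128-dimensional embeddings, as used in Section 1.2.1
    window=5,         # symmetric context window of 5 words
    sg=0,             # CBOW architecture
    cbow_mean=0,      # combine the context vectors by sum, as described above
    min_count=1,      # raise for real corpora (the paper reports no cut-off)
    workers=4,
)
model.save("cbow_128.model")
```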
   [2] http://www.opener-project.eu/
   [3] https://github.com/opener-project/public-sentiment-lexicons
   [4] http://api.yandex.com/translate/
   [5] http://code.google.com/p/word2vec/
   [6] http://wacky.sslmit.unibo.it/doku.php?id=corpora

1.2    The LSTM-SVM tandem system
SVMs are extremely efficient learning algorithms and hard to outperform; unfortunately, they capture "sparse" and "discrete" features in document classification tasks, which makes it really hard to detect relations between the elements of a sentence, often the key factor in detecting the overall sentiment polarity of a document (Tang et al., 2015). On the contrary, Long Short Term Memory (LSTM) networks are a specialization of Recurrent Neural Networks (RNNs) that are able to capture long-term dependencies in a sentence. This type of neural network was recently tested on sentiment analysis tasks (Tang et al., 2015; Xu et al., 2016), where it has been proven to outperform commonly used learning algorithms in several sentiment analysis tasks (Nakov et al., 2016), with improvements of 3-4 points. For this work, we implemented a tandem LSTM-SVM architecture to take advantage of both classification strategies.

[Figure 1: The LSTM-SVM architecture. Twitter/itWaC word embeddings, built from unlabeled tweets and the itWaC corpus, feed an LSTM trained on the labeled tweets; the sentence embeddings extracted from the LSTM are combined with the extracted document features in the SVM model generation step, which yields the final statistical model.]
   Figure 1 shows a graphical representation of the proposed tandem architecture. The architecture is composed of two sequential machine learning steps, both involved in the training and classification phases. In the training phase, the LSTM network is trained on the training documents and the corresponding gold labels. Once the statistical model of the LSTM network is computed, a document vector (document embedding) is computed for each document of the training set by feeding the document to the LSTM network and taking the weights of the penultimate network layer (the layer before the SoftMax classifier). The document embeddings are then used as features in the training phase of the SVM classifier, in conjunction with a set of widely used document classification features. Once the training phase of the SVM classifier is completed, the tandem architecture is considered trained. The same stages are involved in the classification phase: for each document to be classified, an embedding vector is obtained from the previously trained LSTM network; finally, the embedding is used jointly with the other document classification features by the SVM classifier, which outputs the predicted class.
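The two steps can be sketched as follows (a minimal illustration, not the authors' code: we stand in for LIBSVM with scikit-learn's SVC, which wraps LIBSVM, and we model the quadratic kernel as a degree-2 polynomial kernel):

```python
import numpy as np
from tensorflow.keras.models import Model
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def train_tandem(lstm_model, X_seq, X_features, y):
    """Train the SVM stage on top of an already-trained LSTM classifier.

    lstm_model: a trained Keras model whose penultimate layer (the one
    before the softmax) produces the document embedding.
    X_seq: padded sequences of per-word vectors; X_features: handcrafted
    document features (Section 1.2.2); y: gold labels."""
    # Sub-model that outputs the activations of the penultimate layer.
    embedder = Model(inputs=lstm_model.input,
                     outputs=lstm_model.layers[-2].output)
    doc_embeddings = embedder.predict(X_seq)
    # Document embedding + handcrafted features form one SVM feature vector.
    svm = SVC(kernel="poly", degree=2)  # the "quadratic Tandem" configuration
    svm.fit(np.hstack([doc_embeddings, X_features]), y)
    return embedder, svm

def predict_tandem(embedder, svm, X_seq, X_features):
    return svm.predict(np.hstack([embedder.predict(X_seq), X_features]))
```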
1.2.1  The LSTM network
In this section, we describe the LSTM model employed in the tandem architecture. The LSTM unit was initially proposed by Hochreiter and Schmidhuber (1997). LSTM units are able to propagate an important feature that came early in the input sequence over a long distance, thus capturing potential long-distance dependencies.
   LSTM is a state-of-the-art learning algorithm for semantic composition: it allows computing the representation of a document from the representations of its words, with multiple levels of abstraction. Each word is represented by a low-dimensional, continuous, real-valued vector, also known as a word embedding, and all the word vectors are stacked in a word embedding matrix.
   We employed a bidirectional LSTM architecture, since this kind of architecture is able to capture long-range dependencies from both directions of a document by constructing bidirectional links in the network (Schuster and Paliwal, 1997). In addition, we applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, a typical issue in neural networks (Gal and Ghahramani, 2015). As suggested in (Gal and Ghahramani, 2015), we chose a dropout factor in the optimal range [0.3, 0.5], more specifically 0.45 for this work. As for the optimization process, categorical cross-entropy is used as the loss function and optimization is performed with the rmsprop optimizer (Tieleman and Hinton, 2012).
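In modern tf.keras terms, this configuration corresponds to something like the following sketch (the hidden-layer size and the maximum sequence length are our assumptions; the paper does not report them):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense

MAX_LEN = 50       # assumed maximum tweet length (not reported in the paper)
INPUT_DIM = 262    # per-word input vector, described below
N_CLASSES = 2      # 2 for Subjectivity/Irony, 4 for Polarity

model = Sequential([
    Input(shape=(MAX_LEN, INPUT_DIM)),
    # Bidirectional LSTM over the word-vector sequence, with the 0.45
    # dropout applied to the inputs and to the recurrent connections.
    Bidirectional(LSTM(64, dropout=0.45, recurrent_dropout=0.45)),
    Dense(N_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
```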
   Each input word to the LSTM architecture is represented by a 262-dimensional vector composed of:
Word embeddings: the concatenation of the two word embeddings extracted from the two available Word Embedding Lexicons (128 dimensions for each word embedding, for a total of 256 dimensions); for each word embedding an extra component was added in order to handle the "unknown word" case (2 dimensions).
Word polarity: the word polarity obtained from the Sentiment Polarity Lexicons, with one component for each possible lexicon outcome (negative, neutral, positive) (3 dimensions). We assumed that a word not found in the lexicons has neutral polarity.
End of Sentence: a component (1 dimension) indicating whether or not the sentence has been read to the end.
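Assembling this vector can be sketched as follows (our illustration: the zero vector for out-of-vocabulary words and the dictionary-based lexicon interfaces are assumptions):

```python
import numpy as np

EMB_DIM = 128
POL_INDEX = {"negative": 0, "neutral": 1, "positive": 2}

def word_vector(word, emb_twitter, emb_itwac, polarity, is_last):
    """Build the 262-dimensional input vector for one word.

    emb_twitter / emb_itwac: dicts mapping a word to its 128-dim numpy
    vector; polarity: dict mapping a word to "negative", "neutral" or
    "positive"; is_last: end-of-sentence flag."""
    parts = []
    for emb in (emb_twitter, emb_itwac):
        known = word in emb
        # 128 embedding components (zeros for unknown words, our
        # assumption) plus 1 "unknown word" flag per lexicon.
        parts.append(emb[word] if known else np.zeros(EMB_DIM))
        parts.append(np.array([0.0 if known else 1.0]))
    # One component per polarity outcome; missing words count as neutral.
    pol = np.zeros(3)
    pol[POL_INDEX[polarity.get(word, "neutral")]] = 1.0
    parts.append(pol)
    # End-of-sentence component.
    parts.append(np.array([1.0 if is_last else 0.0]))
    return np.concatenate(parts)  # 2*(128+1) + 3 + 1 = 262 dimensions
```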
1.2.2  The SVM classifier
The SVM classifier exploits a wide set of features ranging across different levels of linguistic description. With the exception of the word embedding combination, these features were already tested in our previous participation in the EVALITA 2014 SENTIPOLC edition (Cimino et al., 2014). The features are organised into three main categories: raw and lexical text features, morpho-syntactic features and lexicon features.

Raw and Lexical Text Features
Topic: the manually annotated topic class provided by the task organizers for each tweet.
Number of tokens: the number of tokens occurring in the analyzed tweet.
Character n-grams: presence or absence of contiguous sequences of characters in the analyzed tweet.
Word n-grams: presence or absence of contiguous sequences of tokens in the analyzed tweet.
Lemma n-grams: presence or absence of contiguous sequences of lemmas occurring in the analyzed tweet.
Repetition of n-grams chars: presence or absence of contiguous repetitions of characters in the analyzed tweet.
Number of mentions: the number of mentions (@) occurring in the analyzed tweet.
Number of hashtags: the number of hashtags occurring in the analyzed tweet.
Punctuation: whether the analyzed tweet ends with one of the following punctuation characters: "?", "!".

Morpho-syntactic Features
Coarse grained Part-Of-Speech n-grams: presence or absence of contiguous sequences of coarse-grained PoS tags, corresponding to the main grammatical categories (noun, verb, adjective).
Fine grained Part-Of-Speech n-grams: presence or absence of contiguous sequences of fine-grained PoS tags, which represent subdivisions of the coarse-grained tags (e.g. the class of nouns is subdivided into proper vs. common nouns, verbs into main verbs, gerund forms, past participles).
Coarse grained Part-Of-Speech distribution: the distribution of nouns, adjectives, adverbs and numbers in the tweet.

Lexicon features
Emoticons: presence or absence of positive or negative emoticons in the analyzed tweet. The lexicon of emoticons was extracted from http://it.wikipedia.org/wiki/Emoticon and manually classified.
Lemma sentiment polarity n-grams: for each n-gram of lemmas extracted from the analyzed tweet, the feature checks the polarity of each component lemma in the existing sentiment polarity lexicons. Lemmas that are not present are marked with the ABSENT tag. This is for example the case of the trigram "tutto molto bello" (all very nice), which is marked as "ABSENT-POS-POS" because "molto" and "bello" are marked as positive in the considered polarity lexicon while "tutto" is absent. The feature is computed for each of the existing sentiment polarity lexicons.
Polarity modifier: for each lemma in the tweet occurring in the existing sentiment polarity lexicons, the feature checks the presence of adjectives or adverbs in a left context window of size 2. If this is the case, the polarity of the lemma is assigned to the modifier. This is for example the case of the bigram "non interessante" (not interesting), where "interessante" is a positive word and "non" is an adverb: accordingly, the feature "non POS" is created. The feature is computed three times, once for each of the existing sentiment polarity lexicons (see the sketch at the end of this list).
PMI score: for the unigrams, bigrams, trigrams, four-grams and five-grams occurring in the analyzed tweet, the feature computes, for each n-gram length, the sum of score(i-gram) over all the i-grams occurring in the tweet, and returns the minimum and the maximum of the five resulting values (rounded to the nearest integer).
Distribution of sentiment polarity: this feature computes the percentage of positive, negative and neutral lemmas occurring in the tweet. To overcome the sparsity problem, the percentages are rounded to the nearest multiple of 5. The feature is computed for each existing lexicon.
Most frequent sentiment polarity: the feature returns the most frequent sentiment polarity of the lemmas in the analyzed tweet. The feature is computed for each existing lexicon.
Sentiment polarity in tweet sections: the feature first splits the tweet into three equal sections. For each section the most frequent polarity is computed using the available sentiment polarity lexicons. This feature is aimed at identifying changes of polarity within the same tweet.
Word embeddings combination: the feature returns the vectors obtained by separately averaging the word embeddings of the nouns, adjectives and verbs of the tweet. It is computed once for each word embedding lexicon, yielding a total of 6 vectors for each tweet.
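As referenced above, two of the lexicon features can be made concrete with a short sketch (ours; the PoS tag names and the lexicon interface are illustrative assumptions):

```python
def polarity_modifier(lemmas, pos_tags, lexicon):
    """'Polarity modifier' sketch: for each polarized lemma, look for an
    adjective or adverb in a left window of size 2 and emit e.g. "non POS".

    lexicon: dict mapping a lemma to "POS", "NEG" or "NEU"."""
    feats = []
    for i, lemma in enumerate(lemmas):
        pol = lexicon.get(lemma)
        if pol is None:
            continue
        for j in range(max(0, i - 2), i):
            if pos_tags[j] in ("ADJ", "ADV"):  # assumed tagset names
                feats.append(f"{lemmas[j]} {pol}")
    return feats

def polarity_distribution(lemmas, lexicon, base=5):
    """'Distribution of sentiment polarity': percentages of positive,
    negative and neutral lemmas, rounded to the nearest multiple of 5."""
    counts = {"POS": 0, "NEG": 0, "NEU": 0}
    for lemma in lemmas:
        counts[lexicon.get(lemma, "NEU")] += 1
    total = max(1, len(lemmas))
    return {k: base * round(100 * v / total / base) for k, v in counts.items()}
```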
2     Results and Discussion
We tested five different learning configurations of our system: linear and quadratic support vector machines (linear SVM, quadratic SVM) using the features described in Section 1.2.2, with the exception of the document embeddings generated by the LSTM; an LSTM using the word representations described in Section 1.2.1; and tandem LSTM-SVM combinations with linear and quadratic SVM kernels (linear Tandem, quadratic Tandem) using the features described in Section 1.2.2 plus the document embeddings generated by the LSTM. To test the proposed classification models, we created an internal development set randomly selected from the training set distributed by the task organizers. The resulting development set is composed of 10% (740 tweets) of the whole training set.
    Configuration        Subject.   Polarity   Irony
    linear SVM            0.725      0.713     0.636
    quadratic SVM         0.740      0.730     0.595
    LSTM                  0.777      0.747     0.646
    linear Tandem         0.764      0.743     0.662
    quadratic Tandem      0.783      0.754     0.675

Table 1: Classification results of the different learning models on our development set.
   Table 1 reports the overall accuracies achieved by the classifiers on our internal development set for all the tasks. The accuracy is calculated as the F-score obtained using the evaluation tool provided by the organizers. It is worth noting that the accuracies of the proposed learning models follow similar trends across the three tasks. In particular, LSTM outperforms the SVM models, while the Tandem systems clearly outperform both the SVM and the LSTM ones. In addition, the quadratic models perform better than the linear ones. These results led us to choose the linear and quadratic tandem models as the final systems to be run on the official test set.

    Configuration         Subject.   Polarity   Irony
    best official Runs     0.718      0.664     0.548
    quadratic SVM          0.704      0.646     0.477
    linear SVM             0.661      0.631     0.495
    LSTM                   0.716      0.674     0.468
    linear Tandem*         0.676      0.650     0.499
    quadratic Tandem*      0.713      0.643     0.472

Table 2: Classification results of the different learning models on the official test set.

   Table 2 reports the overall accuracies achieved by all our classifier configurations on the official test set; the official submitted runs are starred in the table. The "best official Runs" row reports, for each task, the best official result in EVALITA 2016 SENTIPOLC. As can be seen, the accuracies of the different learning models reveal a different trend when tested on the development and the test sets. Differently from what we observed in the development experiments, the best system turns out to be the LSTM one, and the gap in accuracy between the linear and quadratic models is smaller or absent. In addition, the accuracies of all the systems are definitely lower than the ones obtained in our development experiments. In our opinion, such results may depend on the occurrence of out-of-domain tweets in the test set with respect to the ones contained in the training set. Different groups of annotators could be a further motivation for these different results and trends.
3   Conclusion

In this paper, we reported the results of our participation in the EVALITA 2016 SENTIPOLC tasks. By resorting to a tandem LSTM-SVM system we achieved the second place in the Subjectivity Classification task, the third place in the Sentiment Polarity Classification task and the sixth place in the Irony Detection task. This tandem system combines the ability of the bidirectional LSTM to capture long-range dependencies between words from both directions of a tweet with SVMs, which are able to exploit the document embeddings produced by the LSTM in conjunction with a wide set of general-purpose features qualifying the lexical and grammatical structure of a text. A current direction of research is the introduction of a character-based LSTM (dos Santos and Zadrozny, 2014) in the tandem system, since character-based LSTMs have proven to be particularly suitable for analyzing social media texts (Dhingra et al., 2016).


References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). December, Naples, Italy.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Andrea Cimino, Stefano Cresci, Felice Dell'Orletta and Maurizio Tesconi. 2014. Linguistically-motivated and Lexicon Features for Sentiment Analysis of Italian Tweets. In Proceedings of EVALITA '14, Evaluation of NLP and Speech Tools for Italian. December, Pisa, Italy.

Felice Dell'Orletta. 2009. Ensemble system for Part-of-Speech tagging. In Proceedings of EVALITA '09, Evaluation of NLP and Speech Tools for Italian. December, Reggio Emilia, Italy.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl and William Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. In Proceedings of the 54th Annual Meeting of the ACL. Berlin, Germany.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04. 168-177, New York, NY, USA. ACM.

Isa Maks, Ruben Izquierdo, Francesca Frontini, Montse Cuadros, Rodrigo Agerri and Piek Vossen. 2014. Generating Polarity Lexicons with WordNet propagation in 5 languages. In Proceedings of the 9th LREC, Language Resources and Evaluation Conference. Reykjavik, Iceland.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Saif Mohammad, Svetlana Kiritchenko and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval-2013. 321-327, Atlanta, Georgia, USA.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP 2015. 1422-1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005. 347-354, Stroudsburg, PA, USA. ACL.

Martin Wöllmer, Florian Eyben, Alex Graves, Björn Schuller and Gerhard Rigoll. 2009. Tandem BLSTM-DBN architecture for keyword spotting with enhanced context modeling. In Proceedings of NOLISP.

Martin Wöllmer, Björn Schuller, Florian Eyben and Gerhard Rigoll. 2010. Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening. IEEE Journal of Selected Topics in Signal Processing.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).