Bidirectional Attentional LSTM for Aspect Based Sentiment Analysis on Italian

Giancarlo Nicola
University of Pavia
giancarlo.nicola01@universitadipavia.it

Abstract

English. This paper describes the SentITA system that participated in the ABSITA task proposed at Evalita 2018. The system is based on a Bidirectional Long Short Term Memory network with attention that exploits word embeddings and sentiment-specific polarity embeddings. The model also leverages grammatical information from POS tagging and NER tagging. The system participated in both the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks, achieving 5th place in the ACD task and 2nd place in the ACP task.

Italiano. This paper describes the SentITA system evaluated in the ABSITA task proposed within Evalita 2018. The system is based on a recurrent neural network with Long Short Term Memory cells and an attention mechanism. The model exploits both general word embeddings and sentiment-specific polarity embeddings, and it also uses information derived from POS tagging and NER tagging. The system participated in both the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) challenges, placing fifth in the former and second in the latter.

1 Introduction

This paper describes the SentITA system that participated in the ABSITA task (Basile et al. 2018) proposed at Evalita 2018. In ABSITA the task consists in performing Aspect Based Sentiment Analysis (ABSA) on self-contained sentences scraped from the "booking.com" website. The aspects are related to accommodation reviews and cover topics such as cleanliness, comfort, location, etc. The task is divided into two subtasks: Aspect Category Detection (ACD) and Aspect Category Polarity (ACP). The first, ACD, consists in identifying the aspects mentioned in the sentence, while the second requires associating a sentiment polarity label with the aspects evoked in the sentence. Both tasks are addressed with the same architecture and the same data preprocessing. The system is based on a deep learning model, a Bidirectional Long Short Term Memory network with attention. The model exploits word embeddings and sentiment-specific polarity embeddings, and it also leverages grammatical information from POS tagging and NER tagging.

Recently, deep learning has emerged as a powerful machine learning technique achieving state-of-the-art results in many application domains, including sentiment analysis. Among the deep learning frameworks applied to sentiment analysis, many employ a combination of semantic vector representations (Mikolov et al. 2013), (Pennington et al. 2014) and different deep learning architectures. Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997), (Socher et al. 2013), (Cho et al. 2014) have been applied to model complex and long-term non-local relationships in both word-level and character-level text sequences. Recursive Neural Tensor Networks (RNTN) have shown strong results for semantic compositionality (Socher et al. 2011), (Socher et al. 2013), and convolutional networks (CNN) for both sentiment analysis (Collobert et al. 2011) and sentence modelling (Kalchbrenner et al. 2014) have outperformed previous state-of-the-art methodologies. In most applications, all these methods receive as input a vector representation of words called word embeddings.
(Mikolov 2012), (Mikolov et al. 2013) and (Pennington et al. 2014), further expanding the work on word embeddings (Bengio et al. 2003), which builds on the idea of distributed representations for symbols (Hinton et al. 1986), introduced unsupervised learning methods to create dense multidimensional spaces in which words are represented by vectors. The position of such vectors is related to their semantic meaning and grammatical properties, and they are widely used in all of modern NLP. In fact, they allow for a dimensionality reduction compared to traditional sparse vector space models, and they are often used as pre-trained initialization for the first embedding layers of neural networks in NLP tasks. In (Le and Mikolov 2014), expanding the previous work on word embeddings, a model is developed that can also represent sentences in a dense multidimensional space. In this case too, sentences are represented by vectors whose position is related to the semantic content of the sentence, with similar sentences represented by vectors that are close to each other.

When working with isolated and short sentences, often written in a specific style, like tweets or phrases extracted from internet reviews, many long-term text dependencies are lost and cannot be exploited. In this situation it is important that the model learns both to pay attention to specific words that play key roles in determining the sentence polarity, such as negations, magnifiers and adjectives, and to model the discourse, but with less focus on long-term dependencies (due to the text brevity). For this reason, deep learning word-embedding-based models augmented with task-specific gazettes (dictionaries) and features represent a solid baseline when working with these kinds of datasets (Nakov et al. 2016), (Attardi et al. 2016), (Castellucci et al. 2016), (Cimino et al. 2016), (Deriu et al. 2016). In this system, a polarity dictionary for Italian has been included as a feature of the model in addition to the word embeddings. Moreover, during preprocessing every sentence is augmented with its NER tags and POS tags, which are then used as features in the model. Thanks to the inclusion of these task-relevant features, in combination with word embeddings and an attentional bidirectional LSTM recurrent neural network, the model achieves useful results with a few thousand labelled examples.

The remainder of the paper presents the model and the experiments on the ABSITA task. In Section 2 the model and its features are explained; in Section 3 the model training and its performance are discussed; in Section 4 a conclusion with the next improvements of the model is given.

2 Description of the system

The model implemented is an Attentional Bidirectional Recurrent Neural Network with LSTM cells. It operates at word level, and therefore each sentence is represented as a sequence of word representations that are sequentially fed to the model one after another until the sequence has been entirely used up. One sentence sequence coupled with its polarity scores represents a single datapoint for the model.

The input to the model are sentences, represented as sequences of word representations. The maximum sequence length has been set to 35, with shorter sentences left-padded to this length and longer sentences cut to this length. Each word of the sequence is represented by five vectors corresponding to 5 different features: a high dimensional word embedding, the word polarity, the word NER tag, the word POS tag and a custom low dimensional word embedding. The high dimensional word embeddings are the pretrained Fastext embeddings for Italian (Grave et al. 2018). They are 300-dimensional vectors obtained using the skip-gram model described in (Bojanowski et al. 2016) with default parameters. The word polarity is obtained from the OpeNER Sentiment Lexicon Italian (Russo et al. 2016). This freely available Italian sentiment lexicon contains a total of 24,293 lexical entries annotated for positive/negative/neutral polarity. It was semi-automatically developed using a propagation algorithm starting from a list of seed key-words and manually reviewing the most frequent entries.
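As an illustration of this input encoding, the sketch below builds the five per-word feature channels for one tokenized sentence; the lookup tables (word2id, custom2id, polarity_lexicon, ner2id, pos2id) and the padding and OOV conventions are assumptions made for the example, not the exact SentITA preprocessing code.

```python
import numpy as np

MAX_LEN = 35  # maximum sequence length used by the system

def encode_sentence(tokens, ner_tags, pos_tags,
                    word2id, custom2id, polarity_lexicon, ner2id, pos2id,
                    pad_id=0, oov_id=1):
    """Encode one tokenized sentence into the five feature channels
    (Fastext word ids, polarity+confidence, NER tag ids, POS tag ids, custom word ids),
    left-padded or truncated to MAX_LEN."""
    n = min(len(tokens), MAX_LEN)
    word_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    custom_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    ner_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    pos_ids = np.full(MAX_LEN, pad_id, dtype=np.int32)
    polarity = np.zeros((MAX_LEN, 2), dtype=np.float32)  # (polarity score, confidence)

    offset = MAX_LEN - n  # left padding: real tokens occupy the last n positions
    for i in range(n):
        tok = tokens[i].lower()
        word_ids[offset + i] = word2id.get(tok, oov_id)      # Fastext vocabulary id
        custom_ids[offset + i] = custom2id.get(tok, oov_id)  # custom low-dim vocabulary id
        ner_ids[offset + i] = ner2id.get(ner_tags[i], oov_id)
        pos_ids[offset + i] = pos2id.get(pos_tags[i], oov_id)
        polarity[offset + i] = polarity_lexicon.get(tok, (0.0, 0.0))
    return word_ids, polarity, ner_ids, pos_ids, custom_ids
```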
Both the NER tags and the POS tags are obtained from the Spacy library Tagger model for Italian (Spacy 2.0.11 - https://spacy.io/). The custom low dimensional word embeddings are generated by random initialization and are added to provide an embedding representation for the words that are missing from the Fastext embeddings, which would otherwise all be represented by the same out-of-vocabulary (OOV) token. In general, it would be possible to train and fine-tune these custom embeddings on specific datasets to let the model learn the usage of words in specific cases.

The information extracted from the OpeNER Sentiment Lexicon Italian is the word polarity together with its confidence; these are concatenated in a vector of length 2 that is one of the inputs to the first layer of the network. The NER tags and POS tags are instead mapped to randomly initialized embeddings of dimensionality 2 and 4, respectively, that were not trained during model training for the task submission. With more data available it would probably be beneficial to train all the NER, POS and custom embeddings, but for this specific dataset the results were comparable and slightly better when the embeddings were not trained.

Figure 1: Model architecture

The model, whose architecture is schematized in Fig. 1, performs in its initial layer a dimensionality reduction on the Fastext embeddings and then concatenates them with the rest of the embeddings (polarity, NER tag, POS tag, and custom word embeddings) at each timestep (word) of the sequence. The concatenation of the embeddings is fed into a sequence of two bidirectional recurrent layers with LSTM cells. The result of these recurrent layers is passed to the attention mechanism presented in (Raffel et al. 2016) and finally to the dense layers that output the aspect detection and aspect polarity signals.
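A minimal Keras-style sketch of this architecture is given below. The hidden layer sizes (other than the 300-dimensional Fastext input and the 2- and 4-dimensional NER/POS embeddings), the vocabulary sizes and the number of output labels are assumptions, since the paper does not report them, and the attention layer is a generic feed-forward attention in the spirit of (Raffel et al. 2016) rather than the exact SentITA implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 35
VOCAB, CUSTOM_VOCAB, N_NER, N_POS = 150_000, 20_000, 20, 20  # assumed vocabulary sizes
N_LABELS = 8                                                  # assumed number of output signals

# One input per feature channel.
words  = layers.Input((MAX_LEN,), dtype="int32", name="fastext_ids")
polar  = layers.Input((MAX_LEN, 2), name="polarity")          # polarity score + confidence
ner    = layers.Input((MAX_LEN,), dtype="int32", name="ner_ids")
pos    = layers.Input((MAX_LEN,), dtype="int32", name="pos_ids")
custom = layers.Input((MAX_LEN,), dtype="int32", name="custom_ids")

# Frozen pretrained Fastext embeddings followed by a dimensionality reduction.
w = layers.Embedding(VOCAB, 300, trainable=False, name="fastext")(words)
w = layers.Dense(64, activation="relu")(w)   # dimensionality reduction (size assumed)
w = layers.Dropout(0.5)(w)

n = layers.Embedding(N_NER, 2, trainable=False)(ner)              # NER tag embeddings (dim 2)
p = layers.Embedding(N_POS, 4, trainable=False)(pos)              # POS tag embeddings (dim 4)
c = layers.Embedding(CUSTOM_VOCAB, 16, trainable=False)(custom)   # custom low-dim embeddings

x = layers.Concatenate()([w, polar, n, p, c])

# Two bidirectional LSTM layers; dropout 0.5 between timesteps and 0.3 on the outputs.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, recurrent_dropout=0.5))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, recurrent_dropout=0.5))(x)
x = layers.Dropout(0.3)(x)

# Feed-forward attention: an adaptive weighted average of the RNN states h.
scores = layers.Dense(1, activation="tanh")(x)   # (batch, MAX_LEN, 1)
alphas = layers.Softmax(axis=1)(scores)          # attention weights over time
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, alphas])

h = layers.Dense(64, activation="relu")(context)
out = layers.Dense(N_LABELS, activation="sigmoid")(h)  # aspect detection / polarity signals

model = Model([words, polar, ner, pos, custom], out)
```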
The attention mechanism in this formulation produces a fixed-length embedding of the input sequence by computing an adaptive weighted average of the sequence of states (normally denoted as "h") of the RNN. This form of integration is similar to the "global temporal pooling" described in (Sander 2014), which is based on the "global average pooling" technique of (Min et al. 2014). The non-linear activations used in the model are Rectified Linear Units (ReLU) for the internal dense layers, hyperbolic tangent (tanh) in the recurrent layers and sigmoid in the output dense layer. In order to counter overfitting, dropout has been used after the Fastext embedding dimensionality reduction with rate 0.5, in both recurrent layers between each timestep with rate 0.5, and on the output of the recurrent layers with rate 0.3.

The model has 61,368 trainable parameters and a total of 45,233,366 parameters, the majority of them representing the Fastext embedding matrix (45,000,300). Compared to many NLP models used today, the number of trainable parameters is quite small, to reduce the possibility of overfitting the training dataset (6,337 examples is small compared to many English sentiment datasets) and also because it is compensated by the addition of engineered features, like the polarity dictionary, NER tags and POS tags, that help in classifying the examples.

3 Training and results

The only preprocessing applied to the text is the conversion of each character to its lower case form. The vocabulary of the model is then limited to the first 150,000 words of the Fastext embeddings through a cap on the maximum number of embeddings, due to memory constraints of the GPU used for training the model. The Fastext embeddings are sorted by descending frequency of appearance in their training corpus, so the vocabulary comprises the 150,000 most frequent words in Italian. The other words that fall outside this cut are represented in the model's high dimensional embeddings (the Fastext embeddings) by an out-of-vocabulary token. However, all the training set words are included in the custom low dimensional word embeddings; this is done because both our training text and user text in general (especially reviews, tweets and social network platforms) can be quite different from the corpus on which the Fastext embeddings were trained. In addition, the NER-tagging and POS-tagging models for Italian included in the Spacy library are applied to the text to compute the additional NER-tag and POS-tag features for each word.

To train the model and generate the challenge submission, a k-fold cross-validation strategy has been applied. The dataset has been divided into 5 folds and 5 different instantiations of the same model (with the same architecture) have been trained, picking each time a different fold as validation set (20%) and the remaining 4 folds as training set (80%). The number of training epochs is determined with the early stopping technique with a patience parameter equal to 7. Once the training epochs are completed, the model snapshot that achieved the best validation loss is loaded. At the end, the predictions from the 5 models are averaged together and thresholded at 0.5. Training five different instantiations of the same model and averaging their predictions compensates for the fact that, in each fold, the model selection based on the best validation loss is biased towards the validation fold itself.
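The sketch below outlines this 5-fold training and prediction-averaging scheme. It assumes a build_model() factory returning the network sketched above and in-memory arrays X (the five input channels) and y (the label matrix); the EarlyStopping callback with patience 7 and restore_best_weights plays the role of reloading the best-validation-loss snapshot, while the batch size and epoch cap are assumptions and the Nadam settings follow the values reported below.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def train_ensemble(build_model, X, y, n_folds=5, patience=7):
    """Train one model instantiation per fold, as described in the paper."""
    models = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(y):
        model = build_model()
        # The paper also reports schedule decay = 0.004, a parameter of the older
        # standalone Keras Nadam that is not exposed by tf.keras Nadam.
        model.compile(
            optimizer=tf.keras.optimizers.Nadam(learning_rate=0.002, beta_1=0.9, beta_2=0.999),
            loss="binary_crossentropy",
        )
        model.fit(
            [x[train_idx] for x in X], y[train_idx],
            validation_data=([x[val_idx] for x in X], y[val_idx]),
            epochs=100, batch_size=32,
            callbacks=[tf.keras.callbacks.EarlyStopping(
                monitor="val_loss", patience=patience, restore_best_weights=True)],
        )
        models.append(model)
    return models

def predict_ensemble(models, X_test, threshold=0.5):
    """Average the sigmoid outputs of the five models and threshold at 0.5."""
    probs = np.mean([m.predict(X_test) for m in models], axis=0)
    return (probs >= threshold).astype(int)
```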
Each of the five models is trained by minimizing the crossentropy loss on the different classes with the Nesterov Adam (Nadam) optimizer (Dozat 2015) with default parameters (λ = 0.002, β1 = 0.9, β2 = 0.999, schedule decay = 0.004). The Nesterov Adam optimizer is similar to the Adam optimizer (Kingma et al. 2014), but momentum is replaced with Nesterov momentum (Nesterov 1983). Adam in fact combines two algorithms known to work well for different reasons: momentum, which points the model in a better direction, and RMSProp, which adapts how far the model goes in that direction on a per-parameter basis. However, Nesterov momentum, which can be viewed as a simple modification of the former, increases stability and can sometimes provide a distinct improvement in performance, superior to plain momentum (Sutskever et al. 2013). For this reason the two approaches are combined in the Nadam optimizer.

This system obtained the 5th place in the ACD task and the 2nd place in the ACP task, as reported respectively in Table 1 and Table 2. In these tables the performances of the systems participating in the challenge have been ranked by F1-score by the task organizers. In particular, the second place in the ACP task is interesting, since the model is oriented more towards polarity classification, for which it has specific dictionaries, than towards aspect detection. This is confirmed also by the high precision score obtained by the model in the ACP task, the 2nd highest among the participating systems.

Ranking   Micro Precision   Micro Recall   Micro F1-score
1         0.8397            0.7837         0.8108
2         0.8713            0.7504         0.8063
3         0.8697            0.7481         0.8043
4         0.8626            0.7519         0.8035
---------------------------------------------------------
5         0.8819            0.7378         0.8035
---------------------------------------------------------
6         0.898             0.6937         0.7827
7         0.8658            0.697          0.7723
8         0.7902            0.7181         0.7524
9         0.6232            0.6093         0.6162
10        0.6164            0.6134         0.6149
11        0.5443            0.5418         0.5431
12        0.6213            0.433          0.5104
baseline  0.4111            0.2866         0.3377

Table 1: Task ACD (Aspect Category Detection) ranking. This system's score is reported between dashed lines.

Ranking   Micro Precision   Micro Recall   Micro F1-score
1         0.8264            0.7161         0.7673
---------------------------------------------------------
2         0.8612            0.6562         0.7449
---------------------------------------------------------
3         0.7472            0.7186         0.7326
4         0.7387            0.7206         0.7295
5         0.8735            0.5649         0.6861
6         0.6869            0.5409         0.6052
7         0.4123            0.3125         0.3555
8         0.5452            0.2511         0.3439
baseline  0.2451            0.1681         0.1994

Table 2: Task ACP (Aspect Category Polarity) ranking. This system's score is reported between dashed lines.

4 Discussion

The results obtained by the SentITA system at ABSITA 2018 are promising, as the system placed 2nd in the ACP task and 5th in the ACD task, but not very far from the 1st in terms of F1-score. The model in general shows a high precision but a lower recall compared to the other systems. The proposed architecture makes use of different features that are easy to obtain through other models, like POS and NER tags, polarity and word embeddings; for this reason, the human effort in the data preprocessing is very limited. One important direction to further improve the model would be to rely more on unsupervised learning, which at the moment is used only for the word embeddings. It could be possible, for example, to integrate in the model features based on language models or encoder-decoder networks. More unsupervised learning would better ensure the generalization of the model to cover most of the topical and lexical content of the Italian language, thanks to the large quantity of text available, and thus also improve the model recall.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta, Federica Semplici. 2016. Convolutional Neural Networks for Sentiment Analysis on Italian Tweets. CLiC-it/EVALITA (2016).

Pierpaolo Basile, Valerio Basile, Danilo Croce and Marco Polignano. 2018. Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA). Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), CEUR.org, Turin.

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv:1607.04606v2.

Giuseppe Castellucci, Danilo Croce, Roberto Basili. 2016. Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian. CLiC-it/EVALITA (2016).
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

Andrea Cimino, Felice Dell'Orletta. 2016. Tandem LSTM-SVM Approach for Sentiment Analysis. In Castellucci, Giuseppe et al., CLiC-it/EVALITA (2016).

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.

Jan Deriu, Mark Cieliebak. 2016. Sentiment Detection using Convolutional Neural Networks with Multi-Task Training and Distant Supervision. CLiC-it/EVALITA (2016).

Timothy Dozat. 2015. Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf

E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov. 2018. Learning Word Vectors for 157 Languages. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. 1986. Distributed representations. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA, pp. 77–109.

S. Hochreiter, J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

N. Kalchbrenner, E. Grefenstette, P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014.

Kingma, Diederik and Ba, Jimmy. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. https://arxiv.org/pdf/1412.6980.pdf

Q. Le, T. Mikolov. 2014. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP, volume 32.

T. Mikolov. 2012. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at International Conference on Learning Representations, 2013.

Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network in network. arXiv preprint arXiv:1312.4400.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. Proceedings of SemEval-2016, pages 1–18, San Diego, California, June 16-17, 2016.

Y. Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376.

J. Pennington, R. Socher, and C. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Colin Raffel, Daniel P. W. Ellis. 2016. Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. https://arxiv.org/abs/1512.08756

Russo, Irene; Frontini, Francesca and Quochi, Valeria. 2016. OpeNER Sentiment Lexicon Italian - LMF, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, Pisa. http://hdl.handle.net/20.500.11752/ILC-73.

Sander Dieleman. 2014. Recommending music on Spotify with deep learning. http://benanne.github.io/2014/08/05/spotify-cnns.html

R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and Christopher D. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Stroudsburg, PA, October. Association for Computational Linguistics.
Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1139–1147.