=Paper= {{Paper |id=Vol-2263/paper017 |storemode=property |title=Ensemble of LSTMs for EVALITA 2018 Aspect-based Sentiment Analysis Task (ABSITA) (Short Paper) |pdfUrl=https://ceur-ws.org/Vol-2263/paper017.pdf |volume=Vol-2263 |authors=Mauro Bennici,Xileny Seijas Portocarrero |dblpUrl=https://dblp.org/rec/conf/evalita/BenniciP18 }} ==Ensemble of LSTMs for EVALITA 2018 Aspect-based Sentiment Analysis Task (ABSITA) (Short Paper)== https://ceur-ws.org/Vol-2263/paper017.pdf
                    Ensemble of LSTMs for
    EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA)
                         (Short Paper)


              Mauro Bennici                                    Xileny Seijas Portocarrero
             You Are My GUide                                      You Are My GUide
        mauro@youaremyguide.com                             xileny@youaremyguide.com



Abstract

English. To identify the different emotions present in a review, it is necessary to distinguish the individual entities it mentions and the specific semantic relations between them. The number of reviews needed to build a dataset that covers every possible option is not predictable.

The approach described here studies first the aspect and then the polarity, and creates an ensemble of the two models to provide a better understanding of the dataset.

Italian (translated). To identify the different emotions present in a review, it is necessary to distinguish the individual entities present and the individual semantic relations. The number of reviews needed to have a complete dataset for every single possible option is not predictable.

The approach described starts from the possibility of creating two different models, one for the categorization part and the other for the polarity part, and of combining the two models to obtain a greater understanding of the dataset.

1    Introduction

With the increase in interactions between users and businesses across different channels and different languages, it becomes increasingly difficult for businesses to respond promptly and effectively. Not all businesses can have a team dedicated to public relations, and they often rely on external agencies that do not know the internal operations of the company.

Automating the correct recognition of the various problems can lead to their timely routing to the people appointed to solve them.

The research was carried out with the dataset provided within the task called ABSITA, Aspect-based Sentiment Analysis at EVALITA 2018 [1] (Basile et al., 2018). The task was a combination of two subtasks, Aspect Category Detection (ACD) and Aspect Category Polarity (ACP).

The dataset is a selection of hotel reviews in Italian taken from the Booking.com portal.

2    Description of the system

Each review was cleaned of special characters, lemmatized, and converted to lowercase with the spaCy [2] framework.

The word vectors were generated with fastText [3] from generic Italian texts rather than from reviews in the accommodation context, so that the model remains suitable for other business domains. The best embedding has a dimension of 200, character n-grams of length 5, a window of size 5, and 10 negative samples.

The system is an ensemble of two different models, which improves its ability to discover hidden properties (Akhtar et al., 2018).

The first model is a bi-directional Long Short-Term Memory network (BI-LSTM). This model is used to detect the ASPECT.

[1] http://sag.art.uniroma2.it/absita/
[2] https://spacy.io
[3] https://fasttext.cc/
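
The Param # columns in the Keras layer summaries of the two BI-LSTM models can be cross-checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: standard Keras LSTM cells with bias terms, 256 units per direction (so that the concatenated output has size 512), and a vocabulary of 7102 entries (inferred as 1420400 / 200). None of these three values is stated in the paper, and the function names are hypothetical.

```python
# Parameter counts for the layers in the Keras summaries of the two models.
# Assumptions (not stated in the paper): standard LSTM cells with bias,
# 256 units per direction, and a vocabulary of 7102 entries.

def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """A forward and a backward LSTM with separate weights."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim: int, outputs: int) -> int:
    """Weight matrix plus one bias per output."""
    return input_dim * outputs + outputs


VOCAB_SIZE = 7102     # inferred from the summary, not reported in the paper
EMBEDDING_DIM = 200   # fastText vector size given in the text
UNITS = 256           # assumed units per direction

summary = {
    "embedding": embedding_params(VOCAB_SIZE, EMBEDDING_DIM),
    "bidirectional": bidirectional_lstm_params(EMBEDDING_DIM, UNITS),
    "dense_aspect": dense_params(2 * UNITS, 7),
    "dense_polarity": dense_params(2 * UNITS, 14),
}

for name, count in summary.items():
    print(f"{name}: {count}")
```

Under these assumptions the four values are 1420400, 935936, 3591 and 7182, which match the Param # columns of the two summaries.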
_______________________________________
Layer (type)      Output Shape       Param #
=======================================
e (Embedding)     (None, 100, 200)   1420400
_______________________________________
b (Bidirection)   (None, 512)         935936
_______________________________________
d (Dense)         (None, 7)             3591
=======================================

A second BI-LSTM model is used to detect the POLARITY.

_______________________________________
Layer (type)      Output Shape       Param #
=======================================
e (Embedding)     (None, 100, 200)   1420400
_______________________________________
b (Bidirection)   (None, 512)         935936
_______________________________________
d (Dense)         (None, 14)            7182
=======================================

A dropout and a recurrent_dropout of 0.1 are applied.
The optimizer for both models is RMSProp.
The loaded embedding is trainable.
Both systems use Keras [4] to create the RNN models.

The models were trained and tested with 5-fold cross-validation, with a ratio of 80% training and 20% testing. The best model was automatically saved at each iteration.

A threshold of 0.5 was applied to the output of the last layer of the first model to activate a label. For the second model, the threshold was 0.43.

         Aspect Category Detection (ACD)

micro precision   micro recall    micro F1 score
0.8397            0.8050          0.8204

Table 1: micro precision, micro recall and micro F1 score with the gold dataset.

         Aspect Category Polarity (ACP)

micro precision   micro recall    micro F1 score
0.8138            0.6593          0.7172

Table 2: micro precision, micro recall and micro F1 score with the gold dataset.

The results show that the models capture the category of a review better than its polarity.

We then combined the two models in an ensemble (Choi et al., 2018) to obtain a system that outperforms each single model on the ACP task, at the cost of a lower result on the ACD task (Table 3).

The ensemble was built as a cascade, so that one model acts as an attention mechanism for the underlying one. The activation threshold ranged between 0.45 and 0.55.

A third model, a LightGBM [5] model (Bennici and Portocarrero, 2018), was also tested. The following properties are extracted from the review text:

    •   length of the review
    •   percentage of special characters
    •   the number of exclamation points
    •   the number of question marks
    •   the number of words
    •   the number of characters
    •   the number of spaces
    •   the number of stop words
    •   the ratio between words and stop words
    •   the ratio between words and spaces

These properties are joined to the vector created from the bigrams and trigrams of the text itself, at word and character level.

The number of leaves is 250, the learner is set to ‘Feature’, and the learning rate is 0.04.

The combination of the three models could not be submitted to the final evaluation because of the limit of two submissions. In the tests carried out after the release of the complete dataset, it reported results higher than 83% for ASPECT and 75% for POLARITY. Also, inference is faster than with the RNN models.

[4] https://keras.io
[5] https://github.com/Microsoft/LightGBM
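
The surface properties extracted for the LightGBM model can be sketched in plain Python. This is a minimal illustration, not the authors' code: the stop-word list is a tiny hypothetical sample, "special character" is assumed to mean anything that is neither alphanumeric nor whitespace, "number of characters" is assumed to exclude whitespace so that it differs from the review length, and mapping "learner set as 'Feature'" to LightGBM's `tree_learner` option is also an assumption. The word and character n-gram vector that the paper joins to these properties is not covered here.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the paper does not say
# which list was used.
STOP_WORDS = {"il", "lo", "la", "i", "gli", "le", "un", "una", "di", "a",
              "da", "in", "con", "su", "per", "e", "ma", "che", "non"}


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def review_features(text: str) -> dict:
    """Compute the ten surface properties listed for the LightGBM model."""
    tokens = text.split()
    # Strip surrounding punctuation before the stop-word lookup.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    n_stop = sum(1 for w in words if w in STOP_WORDS)
    n_spaces = text.count(" ")
    # Assumption: "special" means neither alphanumeric nor whitespace.
    n_special = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return {
        "length": len(text),
        "pct_special": safe_ratio(n_special, len(text)),
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "n_words": len(tokens),
        # Assumption: characters are counted without whitespace.
        "n_chars": sum(1 for c in text if not c.isspace()),
        "n_spaces": n_spaces,
        "n_stop_words": n_stop,
        "words_per_stop_word": safe_ratio(len(tokens), n_stop),
        "words_per_space": safe_ratio(len(tokens), n_spaces),
    }


# Hyperparameters reported in the text; "tree_learner" is an assumed mapping.
LIGHTGBM_PARAMS = {"num_leaves": 250, "learning_rate": 0.04,
                   "tree_learner": "feature"}

features = review_features("Colazione ottima! Ma la camera era piccola?")
print(features)
```

The resulting dictionary can be turned into a numeric row and concatenated with any n-gram vector before training; the ratios fall back to 0.0 when a review contains no stop words or no spaces.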
3      Results

         Aspect Category Detection (ACD)

Runs    micro precision   micro recall   micro F1
Run 1   0.8713            0.7504         0.8063
Run 2   0.8697            0.7481         0.8043

Table 3: micro precision, micro recall and micro F1 score for the submitted ACD subtask.

         Aspect Category Polarity (ACP)

Runs    micro precision   micro recall   micro F1
Run 1   0.7387            0.7206         0.7295
Run 2   0.7472            0.7186         0.7326

Table 4: micro precision, micro recall and micro F1 score for the submitted ACP subtask.

The evaluation results support the choice of combining the two models in an ensemble.

The ACP task (Table 4) clearly benefits from this process, whereas the ACD task (Table 3) loses more than one point.

The study is affected by the small size of the training dataset and by the specificity of some terms that could refer to different categories, such as the comfort of the room and the quality/price ratio.

Various other types of data preparation were also tried, including preserving special characters, keeping the shape of words (to better identify cities or places written with capital letters), and applying some SMOTE functions to increase the number of entries, but with poor results and noticeable overfitting.

4      Conclusion

Creating an ensemble of models to bring out various properties of a review gave better results than using a single model for polarity identification.

The terms used in a review are sometimes misleading: they can be used both positively and negatively, and can refer to different categories of the hotel.

In the near future, we plan to create a system that splits the text of a review in order to categorize a single sentence at a time, or even a single subject or object. In this way, we will also be able to evaluate the polarity of each single object or subject, using only the terms related to it, to improve the result of the ACP task.

The performance of the system will also be evaluated by replacing all the possible entities with known variables such as:

    •   City
    •   Museum
    •   Panoramic Point
    •   Railway station
    •   Street

and with a pre-category known a priori, such as Breakfast for words like Coffee, Cornetto, and Jam.

The expected result is to reduce the variance of the dataset, to improve the ACD result, and to be able to use the system in production.

Finally, we will evaluate the speed and effectiveness of a CNN model in which the two tasks, ASPECT and POLARITY, can be studied separately and then merged.

References

Basile, P., Basile, V., Croce, D., & Polignano, M. (2018). Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA). Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA’18).

Akhtar, M., Ghosal, D., Ekbal, A., Bhattacharyya, P., & Kurohashi, S. (2018, October 15). A Multi-task Ensemble Framework for Emotion, Sentiment and Intensity Prediction. Retrieved from https://arxiv.org/abs/1808.01216

Choi, J. Y. and Bumshik, L. (2018). “Combining LSTM Network Ensemble via Adaptive Weighting for Improved Time Series Forecasting,” Mathematical Problems in Engineering, vol. 2018, Article ID 2470171, 8 pages. doi: https://doi.org/10.1155/2018/2470171.

Bennici, M. and Seijas Portocarrero, X. (2018). The validity of dictionaries over the time in Emoji prediction. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA’18), Turin, Italy. CEUR.org.