Aspect-based Sentiment Analysis: X2Check at ABSITA 2018

Emanuele Di Rosa (Chief Technology Officer, App2Check s.r.l., emanuele.dirosa@app2check.com)
Alberto Durante (Research Scientist, App2Check s.r.l., alberto.durante@app2check.com)

Abstract

In this paper we describe and present the results of the two systems, called here X2C-A and X2C-B, that we specifically developed and submitted for our participation in ABSITA 2018, for the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks. The results show that X2C-A is a top-ranking system in the official results of the ACD task, at a distance of just 0.0073 from the best system; moreover, its improved post-deadline version, called X2C-A-s, scores first in the official ACD results. Regarding the ACP results, our X2C-A-s system, which takes advantage of our ready-to-use industrial Sentiment API, scores at a distance of just 0.0577 from the best system, even though it has not been specifically trained on the training set of the evaluation.

1 Introduction

The traditional task of sentiment analysis is the classification of a sentence according to the positive, negative, or neutral classes. However, this simple formulation is not enough when a sentence contains mixed sentiment, in which a positive sentiment refers to one aspect and a negative sentiment to another. Aspect-based sentiment analysis focuses on the sentiment classification (negative, neutral, positive) for a given aspect/category in a sentence. Nowadays, reviews have become an important tool widely used by consumers to evaluate services and products. Given the large amount of reviews available online, systems that automatically classify reviews according to different categories, and assign a sentiment to each of those categories, are gaining more and more interest in the market. The former task is called Aspect Category Detection (ACD), since it detects whether a review speaks about one of the categories under evaluation; the latter, called Aspect Category Polarity (ACP), tries to assign a sentiment independently for each aspect. In this paper, we present X2C-A and X2C-B, two different implementations for dealing with the ACD and ACP tasks, specifically developed for the ABSITA evaluation (Basile et al., 2018). In particular, we describe the models used to participate in the ACD competition together with some post-deadline results, in which we had the opportunity to improve our ACD results and to evaluate our systems on the ACP task as well. The results show that our X2C-A system is top ranking in the official ACD competition and scores first in its X2C-A-s version. Moreover, by testing our ACD models on the ACP task, with the help of our standard X2Check sentiment API, the X2C-A-s system scores fifth at a distance of just 0.057 from the best system, even though the other systems have a sentiment classifier specifically trained on the training set of the competition.

This paper is structured as follows: after the introduction we present the descriptions of our two systems submitted to ABSITA and the results on the development set; then we show and discuss the results on the official test set of the competition for both ACD and ACP; finally, we provide our conclusions.

2 Systems description

The official training dataset has been split into our internal training set (80% of the documents) and development set (the remaining 20%). We randomly sampled the examples for each category, thus obtaining different sets for training and test, keeping the per-category distribution of the samples across the sets. We submitted two runs, as the results of the two different systems we developed for each category, called X2C-A and X2C-B. The former has been developed on top of the Scikit-learn library in the Python language (Pedregosa et al., 2011), and the latter on top of the WEKA library (Frank et al., 2016) in the Java language. In both cases, the input text has been cleaned with a typical NLP pipeline, involving punctuation, number and stopword removal. The two systems have been developed separately, but the best algorithms obtained by both model selections are different implementations of Support Vector Machines. More details follow in the next sections.

2.1 X2C-A

The X2C-A system has been created by applying an NLP pipeline including a vectorization of the collection of reviews to a matrix of token counts of the bi-grams; then, the count matrix has been transformed to a normalized tf-idf representation (term frequency times inverse document frequency). As the machine learning algorithm, an implementation of the Support Vector Machine has been used, specifically the LinearSVC. This algorithm has been selected as the best performer on this dataset compared to other common implementations available in the sklearn library. Table 1 shows the F1 score on the positive label on the development set for each category; the average value over all of the categories is 84.92%. X2C-A shows the lowest performance on the Value category, while it shows the best performance on Location, and high scores on Wifi and Staff.

2.2 X2C-B

In the model selection process, the two best algorithms have been Naive Bayes and SMO. We built a model with both algorithms for each category, and took into account the F1 score on the positive labels in order to select the best algorithm. In this implementation, SMO (Sequential Minimal Optimization) (Platt, 1998; Keerthi et al., 2001; Hastie et al., 1998) has been the best-performing algorithm on all of the categories, and showed an average F1 score across all categories of 85.08%. Its scores are reported in Table 1, where we also compare its performance with the X2C-A one on the development set.

The two systems are built on different implementations of support vector machines, as previously pointed out, and differ in the feature extraction process. In fact, X2C-B takes into account a vocabulary of the 1000 most frequent words in the training set, according to the size limit parameter available in the StringToWordVector WEKA filter. Moreover, it uses unigrams instead of the bi-gram extraction performed in X2C-A. The two systems reach similar results, i.e. high scores on Location, Wifi and Staff, and low scores on the Value category. The overall weighted performance is very close, around 85% F1 on the positive labels, and since X2C-A is better for some categories and X2C-B for others, we decided to submit both implementations, in order to understand which is the best one on the test set of the ABSITA evaluation.

Category     X2C-A    X2C-B
Cleanliness  0.8675   0.8882
Comfort      0.8017   0.7995
Amenities    0.8041   0.7896
Staff        0.8917   0.8978
Value        0.7561   0.7333
Wifi         0.9056   0.9412
Location     0.9179   0.9058

Table 1: F1 score per category on the positive labels on the development set. Best system in bold.
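The X2C-A pipeline of Section 2.1 (bi-gram token counts, tf-idf weighting, a LinearSVC per category) can be sketched in scikit-learn roughly as follows. The helper name and the toy reviews/labels are ours for illustration, not ABSITA data, and we assume pure bi-grams since the text contrasts them with X2C-B's unigrams.

```python
# Sketch of an X2C-A-style per-category classifier: bi-gram token counts,
# a normalized tf-idf transformation, and a linear SVM, as in Section 2.1.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def make_category_model():
    return Pipeline([
        ("counts", CountVectorizer(ngram_range=(2, 2))),  # matrix of bi-gram token counts
        ("tfidf", TfidfTransformer()),                    # normalized tf-idf representation
        ("svm", LinearSVC()),                             # linear SVM classifier
    ])

# One binary model per category; here a toy "Staff" detector (labels are ours).
reviews = [
    "the staff was very kind and helpful",
    "the staff was rude at check in",
    "the room was clean and the bed comfortable",
    "breakfast was poor and the room dirty",
]
staff_labels = [1, 1, 0, 0]  # 1 = review mentions the Staff category

staff_model = make_category_model()
staff_model.fit(reviews, staff_labels)
print(staff_model.predict(["the staff was kind"]))
```

An X2C-B-like variant would instead cap the vocabulary at the most frequent words (e.g. `max_features=1000` in CountVectorizer, mirroring WEKA's size limit) and use unigrams.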
3 Results on the ABSITA testset

3.1 Aspect Category Detection

Table 2 shows the official results of the Aspect Category Detection task, with the addition of two post-deadline results obtained by additional versions of X2C-A and X2C-B, called X2C-A-s and X2C-B-s.

Team      Mic-Pr   Mic-Re   Mic-F1
X2C-A-s   0.8278   0.8014   0.8144
1         0.8397   0.7837   0.8108
2         0.8713   0.7504   0.8063
3         0.8697   0.7481   0.8043
X2C-A     0.8626   0.7519   0.8035
5         0.8819   0.7378   0.8035
X2C-B     0.8980   0.6937   0.7827
X2C-B-s   0.8954   0.6855   0.7765
7         0.8658   0.6970   0.7723
8         0.7902   0.7181   0.7524
9         0.6232   0.6093   0.6162
10        0.6164   0.6134   0.6149
11        0.5443   0.5418   0.5431
12        0.6213   0.4330   0.5104
baseline  0.4111   0.2866   0.3377

Table 2: ACD results.

The difference between the submitted results and the versions called X2C-A-s and X2C-B-s lies only at prediction time: X2C-A and X2C-B make a prediction at document level, i.e. on the whole review, while X2C-A-s and X2C-B-s make a prediction at sentence level, where each sentence has been obtained by splitting the reviews on some punctuation and key conjunction words. This makes it more likely that each sentence contains a single category, which seems to make category detection easier for the models. For example, the review

    The sight is beautiful, but the staff is rude

is about Location and Staff, but since only a part of it is about Location, the model of this category would receive a document containing "noise" from its point of view. In the post-deadline runs, we reduce the "noise" by splitting this example review into "The sight is beautiful", which is only about Location, and "but the staff is rude", which is only about Staff. As we can see in Table 2, the performance of X2C-A increased significantly and reached a score that is better even than that of the first classified system. However, the performance of X2C-B slightly decreased in its X2C-B-s version. This means that the model of this latter system is not helped by this kind of "noise" removal technique. This last result shows that such an approach does not have general applicability but depends on the model; however, it works very well for X2C-A.

In order to identify the categories where we perform better, we calculated the score of our systems on each category¹, as shown in Table 3 and Table 4. In Table 3, X2C-A is the best of our systems on all the categories except Cleanliness and Wifi, where X2C-B reached the higher score. In Table 4, X2C-A-s shows the best performance on all of the categories. By comparing the results across Tables 3 and 4, we can see that X2C-A-s is the best system on all of the categories, with the exception of Cleanliness, where X2C-B shows a slightly better performance. Comparing the results on the development set (Table 1) with those on the ABSITA test set, Value is confirmed to be the most difficult category for our systems to detect, with a score of 0.6168. Instead, concerning Wifi, which was the easiest category in Table 1, Table 4 shows a lower relative score, while the easiest category to detect overall was Location, on which X2C-A-s reached a score of 0.8898.

¹ To obtain these scores, we modified the ABSITA evaluation script so that only one category is taken into account.

Category     X2C-A    X2C-B
Cleanliness  0.8357   0.8459
Comfort      0.7940   0.7475
Amenities    0.8156   0.7934
Staff        0.8751   0.8681
Value        0.6146   0.6141
Wifi         0.8403   0.8667
Location     0.8887   0.8681

Table 3: X2Check per-category results submitted to ACD.
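The sentence-splitting step used by the "-s" runs in Section 3.1 can be sketched as follows. The paper only says reviews are split on "some punctuation and key conjunction words"; the concrete pattern and conjunction list below are our guess for illustration, chosen so that the conjunction stays with the following segment, as in the paper's example.

```python
# Sketch of the sentence-splitting used by the "-s" variants (assumed pattern).
import re

# Split on sentence punctuation, or right before an adversative conjunction
# (the conjunction is kept with the following segment).
_SPLITTER = re.compile(r"[.;!?]+|,?\s+(?=(?:but|however|ma|però)\b)", re.IGNORECASE)

def split_review(review):
    """Return the sentence-like segments of a review, dropping empty parts."""
    return [part.strip() for part in _SPLITTER.split(review) if part.strip()]

print(split_review("The sight is beautiful, but the staff is rude"))
# -> ['The sight is beautiful', 'but the staff is rude']
```

Each segment is then fed independently to the per-category classifiers, so a mixed review contributes less "noise" to each category model.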
Category     X2C-A-s  X2C-B-s
Cleanliness  0.8445   0.8411
Comfort      0.8099   0.7390
Amenities    0.8308   0.7884
Staff        0.8833   0.8652
Value        0.6168   0.6089
Wifi         0.8702   0.8667
Location     0.8898   0.8584

Table 4: X2Check per-category results submitted post deadline to ACD.

3.2 Aspect Category Polarity

In Table 5 we show the results of the Aspect Category Polarity task, in which X2Check did not formally participate; in fact, we only had time to work on the ACP task after the evaluation deadline. In order to deal with it, we decided to take advantage of our ready-to-use, standard X2Check sentiment API (Di Rosa and Durante, 2017). Since we have an industrial perspective, we realized that in a real-world setting, training an aspect-based sentiment system on a specific training set carries a high effort and cannot have general-purpose applicability. A very common case is one in which new categories to predict have to be quickly added to the system; in this setting, a high-effort activity of labeling examples for the training set would be required. Moreover, labeling a review according to the aspects mentioned, and additionally assigning a sentiment to each aspect, requires higher human effort than just labeling the category. For this reason, we decided not to train a sentiment predictor specialized on the categories/aspects given in the evaluation. Thus, we performed an experimental evaluation in which, after the prediction of the category in the review, our standard X2Check sentiment API was called to predict the sentiment. Since we are aware that a review may, in general, speak about multiple aspects with different associated sentiments, we decided to apply the X2C-A-s and X2C-B-s versions, which use the splitting method described in Section 3.1. More specifically:

1. each review document has been split into sentences;

2. both the X2Check sentiment API and the X2C-A/X2C-B category classifiers were run on each sentence: the former gives as output the polarity of each sentence (our assumption is that each portion of the review has a high probability of carrying just one sentiment), while the latter gives as output all of the categories detected in each sentence;

3. the overall result for a review is given by the collection of all of the category-sentiment pairs found in its sentences.

The results shown in Table 5 confirm that our assumption is valid. In fact, despite using a single sentiment model for all of the categories, we reach fifth place in the official ranking with our X2C-A-s system, at a distance of just 0.057 from the best system, which was specifically trained on the competition training set. Furthermore, the ACP performance depends on the ACD results: the former task cannot reach a performance higher than the latter. For this reason, we decided to evaluate the sentiment performance reached on the reviews whose categories have been correctly predicted. Thus, we created a score capturing the relationship between the two results: it is the ratio between the micro F1 score obtained in the ACP task and the one obtained in the ACD task. This hand-crafted score shows the quality of the sentiment model, removing the influence of the performance on the ACD task. The overall sentiment score obtained is 88.0% for X2C-B and 87.1% for X2C-A, showing that even though no task-specific training has been performed, the general-purpose X2Check sentiment API achieves very good results (recall that, according to (Wilson et al., 2009), humans agree on the sentiment classification in 82% of cases).

Team      Mic-Pr   Mic-Re   Mic-F1
1         0.8264   0.7161   0.7673
2         0.8612   0.6562   0.7449
3         0.7472   0.7186   0.7326
4         0.7387   0.7206   0.7295
X2C-A-s   0.7175   0.7019   0.7096
5         0.8735   0.5649   0.6861
X2C-B-s   0.7888   0.6025   0.6832
6         0.6869   0.5409   0.6052
7         0.4123   0.3125   0.3555
8         0.5452   0.2511   0.3439
baseline  0.2451   0.1681   0.1994

Table 5: ACP results.

Tables 6 and 7 show, for each category, the micro-F1 of the ACP task (calculated per category as in Table 4) and the sentiment score (SS), i.e. the ratio between the per-category ACP and ACD scores.
The latter gives as output all of the see that the sentiment model has reached a very detected categories in each sentence good performance on Cleanliness, Comfort, Staff and Location since it is close or over the 90%. 3. the overall result of a review is given by However, like noticed for the ACD results, it is difficult to handle reviews about the Value cate- 2018 Aspect-based Sentiment Analysis task (AB- gory. SITA) in Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Pro- Micro-F1 SS cessing and Speech tools for Italian (EVALITA’18), Cleanliness 0.7739 91.6% CEUR.org, Turin Comfort 0.7165 88.5% Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. Amenities 0.6618 79.7% The WEKA Workbench. Online Appendix for Staff 0.8086 91.5% ”Data Mining: Practical Machine Learning Tools Value 0.4533 73.5% and Techniques”, Morgan Kaufmann, Fourth Edi- Wifi 0.6615 76.0% tion, 2016. Location 0.8168 91.8% Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Table 6: X2C-A ACP results and sentiment score Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and by category. Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. 2011. Scikit-learn: Machine Learn- ing in Python in Journal of Machine Learning Re- Micro-F1 SS search, pp. 2825–2830. Cleanliness 0.7626 90.7% Emanuele Di Rosa and Alberto Durante. LREC 2016 Comfort 0.671 90.8% App2Check: a Machine Learning-based system for Amenities 0.6276 79.6% Sentiment Analysis of App Reviews in Italian Lan- Staff 0.7948 91.9% guage in Proc. of the 2nd International Workshop on Value 0.4581 75.2% Social Media World Sensors, pp. 8-11. Wifi 0.6441 74.3% Emanuele Di Rosa and Alberto Durante. 2017. Eval- Location 0.7969 92.8% uating Industrial and Research Sentiment Analysis Engines on Multiple Sources in Proc. 
of AI*IA 2017 Advances in Artificial Intelligence - International Table 7: X2C-B ACP results and sentiment score Conference of the Italian Association for Artificial by category. Intelligence, Bari, Italy, November 14-17, 2017, pp. 141-155. Sophie de Kok, Linda Punt, Rosita van den Puttelaar, 4 Conclusions Karoliina Ranta, Kim Schouten and Flavius Frasin- In this paper we presented a description of two dif- car. 2018. Review-aggregated aspect-based senti- ment analysis with ontology features in Prog Artif ferent implementations for dealing with the ACD Intell (2018) 7: 295. and ACP tasks at ABSITA 2018. In particular, we described the models used to participate to the Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2009. Recognizing Contextual Polarity: An Explo- ACD competition together with some post dead- ration of Features for Phrase-Level Sentiment Anal- line results, in which we had the opportunity to ysis in Computational Linguistic, pp. 399–433. improve our ACD results and evaluate our systems Seth Grimes. 2010. Expert Analysis: Is also on the ACP task. The resuls show that our Sentiment Analysis an 80% Solution? X2C-A system is top ranking in the official ACD http://www.informationweek.com/software/information- competition and scores first, in its X2C-A-s ver- management/expert-analysis-is-sentiment-analysis- sion. Moreover, by testing our ACD models on the an-80–solution/d/d-id/1087919 ACP tasks, with the help of our standard X2Check J. Platt. 1998. Fast Training of Support Vector sentiment API, the X2C-A-s system scores fifth Machines using Sequential Minimal Optimization at a distance of just 0.057 from the best system, in Advances in Kernel Methods - Support Vector Learning even if the other systems have a sentiment classi- fier specifically trained on the training set of the S.S. Keerthi and S.K. Shevade and C. Bhattacharyya competition. and K.R.K. Murthy. 2001. 
Improvements to Platt’s SMO Algorithm for SVM Classifier Design in Neural Computation, volume 13, pp. 637-649. References Trevor Hastie and Robert Tibshirani 1998. Classifi- cation by Pairwise Coupling in Advances in Neural Pierpaolo Basile, Valerio Basile, Danilo Croce and Information Processing Systems, volume 10. Marco Polignano. 2018. Overview of the EVALITA