=Paper=
{{Paper
|id=Vol-2263/paper015
|storemode=property
|title=Aspect-based Sentiment Analysis: X2Check at ABSITA 2018
|pdfUrl=https://ceur-ws.org/Vol-2263/paper015.pdf
|volume=Vol-2263
|authors=Emanuele Di Rosa,Alberto Durante
|dblpUrl=https://dblp.org/rec/conf/evalita/RosaD18
}}
==Aspect-based Sentiment Analysis: X2Check at ABSITA 2018==
Emanuele Di Rosa, Chief Technology Officer, App2Check s.r.l., emanuele.dirosa@app2check.com
Alberto Durante, Research Scientist, App2Check s.r.l., alberto.durante@app2check.com
Abstract

English. In this paper we describe and present the results of the two systems, called here X2C-A and X2C-B, that we specifically developed and submitted for our participation in ABSITA 2018, for the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks. The results show that X2C-A is a top ranker in the official results of the ACD task, at a distance of just 0.0073 from the best system; moreover, its improved post-deadline version, called X2C-A-s, scores first against the official ACD results. Regarding the ACP results, our X2C-A-s system, which takes advantage of our ready-to-use industrial Sentiment API, scores at a distance of just 0.0577 from the best system, even though it has not been specifically trained on the training set of the evaluation.

Italiano. In this paper we describe and present the results of the two systems, called here X2C-A and X2C-B, that we specifically developed to participate in ABSITA 2018, for the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks. The results show that X2C-A places at a distance of just 0.0073 from the best system in the ACD task; moreover, its improved version, called X2C-A-s, produced after the deadline, achieves a score that places it first in the official ranking of the ACD task. Regarding the ACP task, the X2C-A-s system, which uses our standard Sentiment API, obtains a score at a distance of just 0.0577 from the best system, even though the sentiment classifier has not been specifically trained on the training set of the evaluation.

1 Introduction

The traditional task of sentiment analysis is the classification of a sentence according to the positive, negative, or neutral classes. However, this simple version of the task is not enough to detect when a sentence contains mixed sentiment, in which a positive sentiment refers to one aspect and a negative sentiment to another. Aspect-based sentiment analysis focuses on the sentiment classification (negative, neutral, positive) for a given aspect/category in a sentence. Nowadays, reviews have become an important tool widely used by consumers to evaluate services and products. Given the large amount of reviews available online, systems that automatically classify reviews according to different categories, and assign a sentiment to each of those categories, are gaining more and more interest in the market. The former task is called Aspect Category Detection (ACD), since it detects whether a review speaks about one of the categories under evaluation; the latter task, called Aspect Category Polarity (ACP), tries to assign a sentiment independently for each aspect. In this paper, we present X2C-A and X2C-B, two different implementations for dealing with the ACD and ACP tasks, specifically developed for the ABSITA evaluation (Basile et al., 2018). In particular, we describe the models used to participate in the ACD competition, together with some post-deadline results, in which we had the opportunity to improve our ACD results and evaluate our systems also on the ACP task. The results show that our X2C-A system is a top ranker in the official ACD competition and scores first in its X2C-A-s version. Moreover, by testing our ACD models on the ACP task, with the help of our standard X2Check sentiment API, the X2C-A-s system scores fifth at a
distance of just 0.057 from the best system, even if the other systems have a sentiment classifier specifically trained on the training set of the competition. This paper is structured as follows: after the introduction, we present the descriptions of our two systems submitted to ABSITA and the results on the development set; then we show and discuss the results on the official test set of the competition for both ACD and ACP; finally, we provide our conclusions.

2 Systems description

The official training dataset has been split into our internal training set (80% of the documents) and development set (the remaining 20%). We randomly sampled the examples for each category, thus obtaining different sets for training/test, keeping the per-category distribution of the samples across the three sets. We submitted two runs, as the results of the two different systems we developed for each category, called X2C-A and X2C-B. The former has been developed on top of the Scikit-learn library in the Python language (Pedregosa et al., 2011), and the latter on top of the WEKA library (Frank et al., 2016) in the Java language. In both cases, the input text has been cleaned with a typical NLP pipeline, involving punctuation, number and stopword removal. The two systems have been developed separately, but the best algorithms obtained by both model selections are different implementations of the Support Vector Machine. More details are given in the following sections.

2.1 X2C-A

The X2C-A system has been created by applying an NLP pipeline including a vectorization of the collection of reviews to a matrix of token counts of the bi-grams; then, the count matrix has been transformed to a normalized tf-idf representation (term frequency times inverse document frequency). As machine learning algorithm, an implementation of the Support Vector Machine has been used, specifically the LinearSVC. This algorithm has been selected as the best performer on this dataset compared to other common implementations available in the sklearn library. Table 1 shows the F1 score on the positive label in the development set for each category, where the average value over all of the categories is 84.92%. X2C-A shows the lowest performance on the Value category, while it shows the best performance on Location, and high scores on Wifi and Staff.

2.2 X2C-B

In the model selection process, the two best algorithms have been Naive Bayes and SMO. We built a model with both algorithms for each category. We took into account the F1 score on the positive labels in order to select the best algorithm. In this implementation, SMO (Sequential Minimal Optimization) (Platt, 1998; Keerthi et al., 2001; Hastie et al., 1998) has been the best performing algorithm on all of the categories, and showed an average F1 score across all categories of 85.08%. Its scores are reported in Table 1, where we also compare its performance with the X2C-A one on the development set.

The two systems are built on different implementations of support vector machines, as previously pointed out, and differ in the feature extraction process. In fact, X2C-B takes into account a vocabulary of the 1000 most mentioned words in the training set, according to the size limit parameter available in the StringToWordVector Weka function. Moreover, it uses unigrams instead of the bi-gram extraction performed in X2C-A. The two systems reach similar results, i.e. high scores on Location, Wifi and Staff, and low scores on the Value category. However, the overall weighted performance is very close, around 85% of F1 on the positive labels, and since X2C-A is better for some categories and X2C-B for others, we decided to submit both implementations, in order to understand which is the best one on the test set of the ABSITA evaluation.

Category      X2C-A    X2C-B
Cleanliness   0.8675   0.8882
Comfort       0.8017   0.7995
Amenities     0.8041   0.7896
Staff         0.8917   0.8978
Value         0.7561   0.7333
Wifi          0.9056   0.9412
Location      0.9179   0.9058

Table 1: F1 score per category on the positive labels on the development set. Best system in bold.
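The X2C-A steps above (bi-gram counts, tf-idf normalization, LinearSVC, one binary model per category, selected on the F1 of the positive label) can be sketched with scikit-learn as follows. This is a minimal sketch, not the authors' code: the toy reviews and the exact vectorizer settings (e.g. whether unigrams are kept alongside bi-grams) are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Toy data for one binary per-category model (here, "Staff");
# label 1 means the review mentions the category.
reviews = [
    "the staff was kind and helpful",
    "rude staff at the reception",
    "great location near the station",
    "the wifi never worked in our room",
]
staff_labels = [1, 1, 0, 0]

# Token counts -> normalized tf-idf -> LinearSVC, mirroring the
# pipeline described for X2C-A.
model = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # assumption: uni+bi-grams
    ("tfidf", TfidfTransformer()),
    ("svm", LinearSVC()),
])
model.fit(reviews, staff_labels)

# Model selection in the paper uses the F1 score on the positive label.
pred = model.predict(reviews)
print(f1_score(staff_labels, pred, pos_label=1))
```

One such pipeline would be fit per category on the internal 80% training split and compared on the 20% development set.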
3 Results on the ABSITA test set

3.1 Aspect Category Detection

Table 2 shows the official results of the Aspect Category Detection task, with the addition of two post-deadline results obtained by an additional version of X2C-A and X2C-B, called X2C-A-s and X2C-B-s.

Team       Mic-Pr   Mic-Re   Mic-F1
X2C-A-s    0.8278   0.8014   0.8144
1          0.8397   0.7837   0.8108
2          0.8713   0.7504   0.8063
3          0.8697   0.7481   0.8043
X2C-A      0.8626   0.7519   0.8035
5          0.8819   0.7378   0.8035
X2C-B      0.8980   0.6937   0.7827
X2C-B-s    0.8954   0.6855   0.7765
7          0.8658   0.697    0.7723
8          0.7902   0.7181   0.7524
9          0.6232   0.6093   0.6162
10         0.6164   0.6134   0.6149
11         0.5443   0.5418   0.5431
12         0.6213   0.433    0.5104
baseline   0.4111   0.2866   0.3377

Table 2: ACD results.

The difference between the submitted results and the versions called X2C-A-s and X2C-B-s is just at prediction time: X2C-A and X2C-B make a prediction at document level, i.e. on the whole review, while X2C-A-s and X2C-B-s make a prediction at sentence level, where each sentence has been obtained by splitting the reviews on some punctuation and key conjunction words. This makes it more likely that each sentence contains a single category, and it seems to make category detection easier for the models. For example, the review

The sight is beautiful, but the staff is rude

is about Location and Staff, but since only a part of it is about Location, the model for this category would receive a document containing "noise" from its point of view. In the post-deadline runs, we reduce the "noise" by splitting this example review into "The sight is beautiful", which is only about Location, and "but the staff is rude", which is only about Staff. As we can see in Table 2, the performance of X2C-A increased significantly and reached a performance score that is better even than that of the first classified. However, the performance of X2C-B slightly decreased in its X2C-B-s version. This means that the model of this latter system is not helped by this kind of "noise" removal technique. This last result shows that such an approach does not have general applicability but depends on the model; however, it works very well on X2C-A.

In order to identify the categories where we perform better, we calculated the score of our systems on each category (to obtain these scores, we modified the ABSITA evaluation script so that only one category is taken into account), as shown in Table 3 and Table 4. In Table 3, X2C-A is the best of our systems on all the categories except Cleanliness and Wifi, where X2C-B has reached the higher score. In Table 4, X2C-A-s shows the best performance on all of the categories. By comparing the results across Tables 3 and 4, we can see that X2C-A-s is the best system on all of the categories, with the exception of Cleanliness, where X2C-B shows a slightly better performance. Comparing the results on the development set (Table 1) and the ones on the ABSITA test set, Value is confirmed as the most difficult category to detect for our systems, with a score of 0.6168. Instead, Wifi, which has been the easiest category in Table 1, shows a relatively lower score in Table 4, while the easiest category to detect overall was Location, on which X2C-A-s has reached a score of 0.8898.

              X2C-A    X2C-B
Cleanliness   0.8357   0.8459
Comfort       0.794    0.7475
Amenities     0.8156   0.7934
Staff         0.8751   0.8681
Value         0.6146   0.6141
Wifi          0.8403   0.8667
Location      0.8887   0.8681

Table 3: X2Check per-category results submitted to ACD.

3.2 Aspect Category Polarity

In Table 5 we show the results of the Aspect Category Polarity task, in which X2Check did not formally participate. In fact, only after the evaluation deadline did we have time to work on the ACP task.
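The review-splitting step introduced in Section 3.1 and reused by the -s variants can be sketched as below. The paper does not list the exact punctuation marks or conjunction words used, so the split tokens here are illustrative assumptions only.

```python
import re

# Illustrative split points: common sentence punctuation plus a comma
# followed by a conjunction ("but"/"and", or "ma"/"e" in Italian); the
# actual list used by X2C-A-s / X2C-B-s is not specified in the paper.
SPLIT_RE = re.compile(r"(?:[.;!?]|,\s*(?=(?:but|and|ma|e)\b))", re.IGNORECASE)

def split_review(review: str) -> list[str]:
    """Split a review into sentence-like segments, dropping empty pieces."""
    parts = SPLIT_RE.split(review)
    return [p.strip() for p in parts if p.strip()]

# The paper's example review becomes two single-aspect segments:
print(split_review("The sight is beautiful, but the staff is rude"))
# -> ['The sight is beautiful', 'but the staff is rude']
```

Each resulting segment is then classified independently, so the per-category models see less "noise" from unrelated aspects.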
              X2C-A-s   X2C-B-s
Cleanliness   0.8445    0.8411
Comfort       0.8099    0.739
Amenities     0.8308    0.7884
Staff         0.8833    0.8652
Value         0.6168    0.6089
Wifi          0.8702    0.8667
Location      0.8898    0.8584

Table 4: X2Check per-category results submitted post deadline to ACD.

In order to deal with the ACP task, we decided to take advantage of our ready-to-use, standard X2Check sentiment API (Di Rosa and Durante, 2017). In fact, since we have an industrial perspective, we realized that, in a real-world setting, training an aspect-based sentiment system on a specific training set has a high associated effort and cannot have a general-purpose application. In fact, a very common case is the one in which new categories to predict have to be quickly added to the system. In this setting, a high-effort activity of labeling examples for the training set would be required. Moreover, labeling a review according to the aspects mentioned, and additionally assigning a sentiment to each aspect, requires a higher human effort than just labeling the category. For this reason, we decided not to specifically train a sentiment predictor specialized on the given categories/aspects in the evaluation. Thus, we performed an experimental evaluation in which, after the prediction of the category in the review, our standard X2Check sentiment API has been called to predict the sentiment. Since we are aware that a review may, in general, speak about multiple aspects with different associated sentiments, we decided to apply the X2C-A-s and X2C-B-s versions, which use the splitting method described in Section 3.1. More specifically:

1. each review document has been split into sentences;

2. both the X2Check sentiment API and the X2C-A/X2C-B category classifiers were run on each sentence. The former gives as output the polarity of each sentence; our assumption is that each portion of the review has a high probability of having just one associated sentiment. The latter gives as output all of the categories detected in each sentence;

3. the overall result for a review is given by the collection of all of the category-sentiment pairs found in its sentences.

The results shown in Table 5 show that our assumption is valid. In fact, despite using a single sentiment model for all of the categories, we reach fifth place in the official ranking with our X2C-A-s system, at a distance of just 0.057 from the best system, which was specifically trained on such a training set. Furthermore, the ACP performance depends on the ACD results; in fact, the former task cannot reach a performance higher than the latter. For this reason, we decided to evaluate the sentiment performance reached on the reviews whose categories have been correctly predicted. Thus, we created a score capturing the relationship between the two results: it is the ratio between the micro F1 score obtained in the ACP task and the one obtained in the ACD task. This hand-crafted score shows the quality of the sentiment model, by removing the influence of the performance on the ACD task. The overall sentiment score obtained is 88.0% for X2C-B and 87.1% for X2C-A, showing that even if no specific training has been performed, the general-purpose X2Check sentiment API shows very good results (recall that, according to (Wilson et al., 2009), humans agree on the sentiment classification in 82% of cases).

Team       Mic-Pr   Mic-Re   Mic-F1
1          0.8264   0.7161   0.7673
2          0.8612   0.6562   0.7449
3          0.7472   0.7186   0.7326
4          0.7387   0.7206   0.7295
X2C-A-s    0.7175   0.7019   0.7096
5          0.8735   0.5649   0.6861
X2C-B-s    0.7888   0.6025   0.6832
6          0.6869   0.5409   0.6052
7          0.4123   0.3125   0.3555
8          0.5452   0.2511   0.3439
baseline   0.2451   0.1681   0.1994

Table 5: ACP results.

Tables 6 and 7 show, for each category, the micro-F1 of the ACP task, calculated per category as for Table 4, and the sentiment score (SS), i.e. the relationship between the ACP and ACD scores per category. We can see that the sentiment model has reached a very good performance on Cleanliness, Comfort, Staff and Location, since it is close to or over 90%.
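As a check, the overall sentiment scores reported above can be recomputed from the micro-F1 values in Tables 2 and 5, assuming the ratio is taken between the X2C-A-s and X2C-B-s rows of the two tables:

```python
# Sentiment score = ACP micro-F1 / ACD micro-F1 (values from Tables 2 and 5).
def sentiment_score(acp_micro_f1: float, acd_micro_f1: float) -> float:
    return acp_micro_f1 / acd_micro_f1

# X2C-A-s: 0.7096 / 0.8144 ~= 0.871, the 87.1% reported for X2C-A
# X2C-B-s: 0.6832 / 0.7765 ~= 0.880, the 88.0% reported for X2C-B
print(round(sentiment_score(0.7096, 0.8144), 3))  # 0.871
print(round(sentiment_score(0.6832, 0.7765), 3))  # 0.88
```

The same ratio applied to the per-category scores yields the SS columns of Tables 6 and 7.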
However, as noticed for the ACD results, it is difficult to handle reviews about the Value category.

              Micro-F1   SS
Cleanliness   0.7739     91.6%
Comfort       0.7165     88.5%
Amenities     0.6618     79.7%
Staff         0.8086     91.5%
Value         0.4533     73.5%
Wifi          0.6615     76.0%
Location      0.8168     91.8%

Table 6: X2C-A ACP results and sentiment score by category.

              Micro-F1   SS
Cleanliness   0.7626     90.7%
Comfort       0.671      90.8%
Amenities     0.6276     79.6%
Staff         0.7948     91.9%
Value         0.4581     75.2%
Wifi          0.6441     74.3%
Location      0.7969     92.8%

Table 7: X2C-B ACP results and sentiment score by category.

4 Conclusions

In this paper we presented a description of two different implementations for dealing with the ACD and ACP tasks at ABSITA 2018. In particular, we described the models used to participate in the ACD competition, together with some post-deadline results, in which we had the opportunity to improve our ACD results and evaluate our systems also on the ACP task. The results show that our X2C-A system is a top ranker in the official ACD competition and scores first in its X2C-A-s version. Moreover, by testing our ACD models on the ACP task, with the help of our standard X2Check sentiment API, the X2C-A-s system scores fifth at a distance of just 0.057 from the best system, even if the other systems have a sentiment classifier specifically trained on the training set of the competition.

References

Pierpaolo Basile, Valerio Basile, Danilo Croce and Marco Polignano. 2018. Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), CEUR.org, Turin.

Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, pp. 2825-2830.

Emanuele Di Rosa and Alberto Durante. 2016. App2Check: a Machine Learning-based system for Sentiment Analysis of App Reviews in Italian Language. In Proc. of the 2nd International Workshop on Social Media World Sensors (LREC 2016), pp. 8-11.

Emanuele Di Rosa and Alberto Durante. 2017. Evaluating Industrial and Research Sentiment Analysis Engines on Multiple Sources. In Proc. of AI*IA 2017, Advances in Artificial Intelligence - International Conference of the Italian Association for Artificial Intelligence, Bari, Italy, November 14-17, 2017, pp. 141-155.

Sophie de Kok, Linda Punt, Rosita van den Puttelaar, Karoliina Ranta, Kim Schouten and Flavius Frasincar. 2018. Review-aggregated aspect-based sentiment analysis with ontology features. Progress in Artificial Intelligence (2018) 7: 295.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann. 2009. Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis. Computational Linguistics, pp. 399-433.

Seth Grimes. 2010. Expert Analysis: Is Sentiment Analysis an 80% Solution? http://www.informationweek.com/software/information-management/expert-analysis-is-sentiment-analysis-an-80–solution/d/d-id/1087919

J. Platt. 1998. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning.

S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy. 2001. Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, volume 13, pp. 637-649.

Trevor Hastie and Robert Tibshirani. 1998. Classification by Pairwise Coupling. In Advances in Neural Information Processing Systems, volume 10.