=Paper=
{{Paper
|id=Vol-1749/paper_033
|storemode=property
|title=Tweet2Check evaluation at Evalita Sentipolc 2016
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_033.pdf
|volume=Vol-1749
|authors=Emanuele Di Rosa,Alberto Durante
|dblpUrl=https://dblp.org/rec/conf/clic-it/RosaD16
}}
==Tweet2Check evaluation at Evalita Sentipolc 2016==
Emanuele Di Rosa (Head of ML and Semantic Analysis, Finsa s.p.a., Via XX Settembre 14, emanuele.dirosa@finsa.it)
Alberto Durante (Research Scientist, Finsa s.p.a., Via XX Settembre 14, alberto.durante@finsa.it)
Abstract

In this paper we present our Tweet2Check tool, provide an analysis of the experimental results it obtained at the Evalita Sentipolc 2016 evaluation, and compare its performance with the state-of-the-art tools that participated in the evaluation. In the experimental analysis, we show that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13), according to a score designed to identify the most complete and best-performing tools for Sentiment Analysis of tweets, i.e. obtained by summing up the best F-score of each team for the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, considering the polarity and irony tasks together.

1 Introduction

In this paper we present Tweet2Check, a machine learning-based tool for sentiment analysis of tweets, in which we applied the same approach that we implemented in App2Check and have already validated in Di Rosa and Durante (2016-a; 2016-b), showing that it works very well (most of the time it is the best tool) in the field of app review analysis; moreover, this approach has also been validated on general product/service reviews, since our tool was classified second at the International Semantic Sentiment Analysis Challenge 2016 (Sack et al., 2016), related to the polarity classification of Amazon product reviews. Our research interest in participating in the Sentipolc 2016 evaluation is to apply the methodology that was mainly designed to analyze app reviews, adapted here to analyze tweets, and to evaluate its performance on tweets. From a research point of view, it is also interesting to understand whether it is possible to obtain good results by applying the same approach to very different domains such as app reviews and tweets.

Starting from the results provided by the organizers of the Sentipolc 2016 evaluation, we performed an analysis in which we show that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering just the unconstrained
runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13), according to a score designed to identify the most complete and best-performing tools for Sentiment Analysis of tweets, i.e. obtained by summing up the best F-score of each team for the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, considering the polarity and irony tasks together.

Finally, we show that the Tweet2Check unconstrained runs are overall always better than (or almost equal to) the constrained ones. To support this hypothesis, we also provide an evaluation of Tweet2Check on the Sentipolc 2014 (Basile et al., 2014) datasets. This is very important for an industrial tool, since keeping in the training set a higher number of examples discussing different topics potentially allows it to predict well tweets coming from new domains, and thus to generalize well from the perspective of the final user.

2 Tweet2Check description

Tweet2Check is an industrial system in which supervised learning methods are applied in order to build predictive models for the classification of subjectivity, polarity and irony in tweets. The overall machine learning system is an ensemble that combines many different classifiers, each of which is built by us using a different machine learning algorithm and implementing different features: this allows us to take advantage of different complementary approaches, both discriminative and generative. To this aim, we considered the most well-known machine learning algorithms, covering both the most established and the newest approaches. For each task, every classifier has been trained separately; then, the ensemble combines the predictions of the underlying classifiers. The training of the models is performed by considering only the tweets provided by Sentipolc 2016 for the constrained run, and also other tweets discussing other topics for the unconstrained run. While training the models, many features, both Twitter-specific and source-independent, are generated. Moreover, some features that "connect" different tasks are also considered in the pipelines that determine subjectivity, polarity and irony. For example, in the pipeline that determines the polarity of a tweet, a score related to its subjectivity is also included as a feature, thus reflecting the conceptual connection that exists in reality between subjectivity and polarity: if a tweet can be assigned a polarity, it is also subjective. The same kind of connection is also applied to the other models.

Tweet2Check does not use just the prediction coming from the predictive model: it also applies a set of algorithms based on natural language processing techniques, allowing it, e.g., to automatically perform topic/named entity extraction, as well as other resources that have been both handcrafted and automatically extracted. Unfortunately, it is not possible to give more details about the engine due to non-disclosure restrictions.

Tweet2Check is not only a web service providing access to the sentiment prediction of sentences; it is also a full user-friendly web application allowing, among other features, to:

• perform queries on Twitter;

• show the main topics discussed in tweets, which can be comment-specific, associated to a specific month, or evaluated over the overall results obtained by the query;

• show the polarity, subjectivity and irony associated to each tweet under evaluation;

• show the sentiment of the extracted topics.

A demo of Tweet2Check and its API can be made available for research purposes only, by sending a request by email to the first author of the paper. Thus, the results of all of the experiments are repeatable.

3 Experimental Analysis

Considering the Sentipolc 2016 results, we can see that:

• some tools performed very well in one task and very badly in another (e.g. team2 was the second team for subjectivity and the last one for polarity, team7 was the seventh for subjectivity and the first one for polarity, etc.);

• some other tools show a much better performance on the unconstrained run than on the constrained run (e.g. team1 shows for the subjectivity-unconstrained task a score that is 4% higher than for the constrained run).
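Before turning to the comparison, the ensemble combination described in Section 2 can be sketched in a few lines. Since the actual Tweet2Check classifiers, features and combination rule are covered by non-disclosure restrictions, the two base classifiers and the majority vote below are purely hypothetical stand-ins.

```python
# Hypothetical sketch of an ensemble in the spirit of Section 2: several
# base classifiers each label a tweet, and a combiner merges their
# predictions. The real Tweet2Check models are not disclosed; everything
# here is an illustrative stand-in.

def lexicon_clf(tweet):
    """Toy lexicon-based classifier: counts positive vs. negative words."""
    pos, neg = {"buono", "ottimo", "bello"}, {"pessimo", "brutto", "male"}
    words = tweet.lower().split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return "pos" if score > 0 else ("neg" if score < 0 else "neutral")

def length_clf(tweet):
    """Toy stand-in for a learned model with a different inductive bias."""
    return "neutral" if len(tweet.split()) < 3 else "pos"

def ensemble_predict(tweet, classifiers):
    """Combine base predictions by majority vote (ties broken arbitrarily)."""
    votes = [clf(tweet) for clf in classifiers]
    return max(set(votes), key=votes.count)

print(ensemble_predict("che bello questo tweet", [lexicon_clf, length_clf]))
# prints "pos": both toy classifiers agree on a positive label
```

In the real system, the subjectivity score is additionally fed as a feature into the polarity pipeline; a richer sketch would model that as an extra input to each base classifier.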
However, if the goal is to find which are overall the most complete and best-performing tools, i.e. those performing well considering the contribution that each tool provided on all of the tasks, an overall score/indicator is needed. To this aim, we propose the following score that takes into account, for each team, the best run per task. Formula 1 shows that, given a team and a task, we consider the highest F-score among the available runs (both constrained and unconstrained). Then, in formula 2, we introduce a score per team, calculated as the sum of the contributions provided by the team on the tasks under evaluation (even a subset of them).

S_team,task = max_run F_team,task,run    (1)

S_team = Σ_task S_team,task    (2)

Thanks to this score, it is possible to get an idea of the overall best available tools on: (i) each single task; (ii) a collection of tasks (a pair of tasks at a time in our case); or (iii) all of the tasks. Please consider also that this score can be even more restrictive for our tool: we perform better on the unconstrained runs than on the constrained ones, and there are tools that took part in the constrained runs and perform better than our unconstrained version, so they would gain positions in the chart (e.g. team3, team4 and team5 perform better on the constrained version for the polarity task). Moreover, we are giving the same weight to all of the tasks, even though we focused more on the polarity and irony tasks, which are more related to the original App2Check approach, i.e. more useful for and related to the evaluation of app reviews.

Tables 1, 2 and 3 show the results of each single task, sorted by the score obtained. The columns contain (from left to right): ranking, team name, the score obtained with formula 1, and a label reporting whether the best run for the team was constrained (c) or unconstrained (u). In Tables 1 and 2 we consider the F-score value coming from the Tweet2Check amended run, representing the correct system answer.

   Team         S_team   con/uncon
1  team1        0.7444   u
2  team2        0.7184   c
3  team3        0.7134   c
4  team4        0.7107   c
5  team5        0.7105   c
6  team6        0.7086   c
7  team7        0.6937   c/u
8  team8        0.6495   c
9  Tweet2Check  0.6317   u
10 team10       0.5647   -
11 team11       -        -
12 team12       -        -
13 team13       -        -

Table 1: Subjectivity task at Sentipolc 2016.

For the subjectivity task in Table 1, Tweet2Check does not show good results compared to the other tools, and there is clearly room for further improvements. For all of the other results, Tweet2Check performs well:

• in Table 2, related to Polarity classification, it is very close to the best result, at a distance of just 0.0188, and it is the second tool considering only the results for the unconstrained run (which are directly comparable);

• in Table 3, related to Irony detection, it is the second best tool, at a distance of just 0.0068 from the first classified.

Tables 4 and 5 show the results obtained using formula 2 considering, respectively, polarity and irony together, and all of the three tasks together. Since some teams did not participate in all of the tasks, their results are marked as follows: * the tool did not participate in the Irony task; ** the tool participated only in the Polarity task; *** the tool participated only in the Irony task. In Table 4, Tweet2Check is the second best tool, at a distance of 0.0014 from team4, which is the best tool according to this score. This is clearly our best result at Sentipolc 2016 considering more tasks together, thus highlighting that polarity classification and irony detection are the tasks best performed by the current version of Tweet2Check. In Table 5, where we also consider the impact of the subjectivity task on the results, Tweet2Check is the fifth classified, at a distance of 0.0930 from team4. In this last case, Tweet2Check is in the top 5 tools out of 13. Finally, Tables 6, 7 and 8 report the results obtained training and evaluating Tweet2Check on the Evalita Sentipolc 2014 (Basile et al., 2014) datasets.
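Formulas 1 and 2 can also be turned into a short computation; the teams, tasks and F-scores below are made-up placeholders for illustration, not the official Sentipolc figures.

```python
# Sketch of the team score of formulas 1 and 2. The (team, task, run)
# F-scores below are illustrative placeholders, not official results.

# run is "c" (constrained) or "u" (unconstrained)
f_scores = {
    ("teamA", "polarity", "c"): 0.64,
    ("teamA", "polarity", "u"): 0.66,
    ("teamA", "irony", "c"): 0.51,
    ("teamB", "polarity", "c"): 0.62,
}

def team_task_score(team, task):
    """Formula 1: best F-score over the runs available for (team, task)."""
    runs = [f for (t, k, _), f in f_scores.items() if t == team and k == task]
    return max(runs, default=0.0)

def team_score(team, tasks):
    """Formula 2: sum of the per-task best scores over the evaluated tasks."""
    return sum(team_task_score(team, task) for task in tasks)

print(round(team_score("teamA", ["polarity", "irony"]), 4))  # 0.66 + 0.51 = 1.17
```

A team with no run for a task simply contributes 0 to the sum here, mirroring the dashes and asterisks in the tables below.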
   Team         S_team   con/uncon
1  team7        0.6638   c
2  team1        0.6620   u
3  team4        0.6522   c
4  team3        0.6504   c
5  team5        0.6453   c
6  Tweet2Check  0.6450   u
7  team10       0.6367   c
8  team11       0.6281   c
9  team12       0.6099   c
10 team6        0.6075   u
11 team8        0.6046   c
12 team2        0.5683   c
13 team13       -        -

Table 2: Polarity task at Sentipolc 2016.

   Team         S_team   con/uncon
1  team4        0.5480   c
2  Tweet2Check  0.5412   c
3  team13       0.5251   c
4  team5        0.5133   c
5  team3        0.4992   c
6  team8        0.4961   c
7  team1        0.4810   u
8  team2        -        -
9  team6        -        -
10 team7        -        -
11 team10       -        -
12 team11       -        -
13 team12       -        -

Table 3: Irony task at Sentipolc 2016.

   Team         S_team
1  team4        1.2002
2  Tweet2Check  1.1862
3  team5        1.1586
4  team3        1.1496
5  team1        1.1430
6  team8        1.1007
7  team7*       0.6638
8  team10*      0.6367
9  team11**     0.6281
10 team12**     0.6099
11 team6*       0.6075
12 team2*       0.5683
13 team13***    0.5251

Table 4: The best performing tools on the Polarity and Irony tasks.

   Team         S_team
1  team4        1.9109
2  team1        1.8874
3  team5        1.8691
4  team3        1.8630
5  Tweet2Check  1.8179
6  team8        1.7502
7  team7*       1.3575
8  team6*       1.3161
9  team2*       1.2867
10 team10*      1.2014
11 team11**     0.6281
12 team12**     0.6099
13 team13***    0.5251

Table 5: The best performing tools on the three tasks.
The second and third columns of these tables contain, respectively, the F-score of the constrained and of the unconstrained runs. We can see in Table 6 that Tweet2Check ranks first for subjectivity in the unconstrained run, and second in the constrained run. In Tables 7 and 8, Tweet2Check is the best tool for both polarity and irony. Moreover, since we think that Tweet2Check is always better in the unconstrained setting, we decided to experimentally confirm this observation further: we trained Tweet2Check on the training set of Sentipolc 2014 with the same approach used for the 2016 edition, and then tested it on the test set of the Sentipolc 2014 evaluation. We show that, also in this case, the Tweet2Check unconstrained runs perform better than the constrained ones, and that our tool is the best compared to the tools that participated in 2014.

Team             F(C)    F(U)
uniba2930        0.7140  0.6892
Tweet2Check      0.6927  0.6903
UNITOR           0.6871  0.6897
IRADABE          0.6706  0.6464
UPFtaln          0.6497  -
ficlit+cs@unibo  0.5972  -
mind             0.5901  -
SVMSLU           0.5825  -
fbkshelldkm      0.5593  -
itagetaruns      0.5224  -

Table 6: Tweet2Check ranking on the Sentipolc 2014 subjectivity task.

Team             F(C)    F(U)
Tweet2Check      0.7048  0.7142
uniba2930        0.6771  0.6638
IRADABE          0.6347  0.6108
CoLingLab        0.6312  -
UNITOR           0.6299  0.6546
UPFtaln          0.6049  -
SVMSLU           0.6026  -
ficlit+cs@unibo  0.5980  -
fbkshelldkm      0.5626  -
mind             0.5342  -
itagetaruns      0.5181  -
Itanlp-wafi*     0.5086  -
*amended run     0.6637  -

Table 7: Tweet2Check ranking on the Sentipolc 2014 polarity task.

Team             F(C)    F(U)
Tweet2Check      0.5915  -
UNITOR           0.5759  0.5959
IRADABE          0.5415  0.5513
SVMSLU           0.5394  -
itagetaruns      0.4929  -
mind             0.4771  -
fbkshelldkm      0.4707  -
UPFtaln          0.4687  -

Table 8: Tweet2Check ranking on the Sentipolc 2014 irony task.

4 Conclusion

In this paper we presented Tweet2Check and discussed the analysis of the results from Sentipolc 2016, showing that our tool is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13), according to a score designed to identify the most complete and best-performing tools for Sentiment Analysis of tweets, i.e. obtained by summing up the best F-score of each team for the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, considering the polarity and irony tasks together.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task.

Emanuele Di Rosa and Alberto Durante. 2016. App2Check: a Machine Learning-based System for Sentiment Analysis of App Reviews in Italian Language. In Proceedings of the 2nd International Workshop on Social Media World Sensors (at LREC 2016), pp. 8-11. http://ceur-ws.org/Vol-1696/

Emanuele Di Rosa and Alberto Durante. 2016. App2Check Extension for Sentiment Analysis of Amazon Products Reviews. In Semantic Web Challenges, CCIS vol. 641, Springer.

Diego Reforgiato. 2016. Results of the Semantic Sentiment Analysis 2016 International Challenge. https://github.com/diegoref/SSA2016 and http://2016.eswc-conferences.org/program/eswc-challenges

Filipe N. Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and Fabrício Benevenuto. 2016. SentiBench: a Benchmark Comparison of State-of-the-practice Sentiment Analysis Methods. EPJ Data Science.

Harald Sack, Stefan Dietze, and Anna Tordai, editors. 2016. Semantic Web Challenges: Third SemWebEval Challenge at ESWC 2016. CCIS, Springer.