              Tweet2Check evaluation at Evalita Sentipolc 2016

              Emanuele Di Rosa                               Alberto Durante
      Head of ML and Semantic Analysis                       Research Scientist
       Finsa s.p.a., Via XX Settembre 14             Finsa s.p.a., Via XX Settembre 14
      emanuele.dirosa@finsa.it                      alberto.durante@finsa.it



                  Abstract

In this paper we present our Tweet2Check tool, provide an analysis of the experimental results it obtained at the Evalita Sentipolc 2016 evaluation, and compare its performance with that of the state-of-the-art tools that participated in the evaluation. In the experimental analysis, we show that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score designed to identify the most complete and best performing tools for sentiment analysis of tweets, obtained by summing up the best F-score of each team over the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when the polarity and irony tasks are considered together.

1   Introduction

In this paper we present Tweet2Check, a machine learning-based tool for sentiment analysis of tweets, in which we applied the same approach that we implemented in App2Check and already validated in Di Rosa and Durante (2016-a; 2016-b), showing that it works very well (most of the time it is the best tool) in the analysis of app reviews. Moreover, this approach has also been validated on general product/service reviews, since our tool was classified second at the International Semantic Sentiment Analysis Challenge 2016 (Sack et al., 2016), on the polarity classification of Amazon product reviews. Our research interest in participating in the Sentipolc 2016 evaluation is to take a methodology that was mainly designed to analyze app reviews, adapt it to analyze tweets, and evaluate its performance on tweets. From a research point of view, it is also interesting to understand whether it is possible to obtain good results by applying the same approach to very different domains such as app reviews and tweets.

   Starting from the results provided by the organizers of the Sentipolc 2016 evaluation, we performed an analysis showing that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering just the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score designed to identify the most complete and best performing tools for sentiment analysis of tweets, obtained by summing up the best F-score of each team over the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when the polarity and irony tasks are considered together.

   Finally, we show that the Tweet2Check unconstrained runs are overall always better than (or almost equal to) the constrained ones. To support this hypothesis, we also provide an evaluation of Tweet2Check on the Sentipolc 2014 datasets (Basile et al., 2014). This is very important for an industrial tool: by keeping in the training set a larger number of examples discussing different topics, the tool can potentially predict well tweets coming from new domains, and thus generalize well from the perspective of the final user.
2   Tweet2Check description

Tweet2Check is an industrial system that applies supervised learning methods to build predictive models for the classification of subjectivity, polarity and irony in tweets. The overall machine learning system is an ensemble that combines many different classifiers, each of which we built using a different machine learning algorithm and a different feature set: this makes it possible to exploit complementary approaches, both discriminative and generative. To this aim, we considered the most well-known machine learning algorithms, including both established and recent approaches. For each task, every classifier is trained separately; the ensemble then combines the predictions of the underlying classifiers. The models are trained on the tweets provided by Sentipolc 2016 only for the constrained run, and also on other tweets discussing other topics for the unconstrained run. During training, many features are generated, both Twitter-specific and source-independent. Moreover, some features that "connect" the different tasks are included in the pipelines that determine subjectivity, polarity and irony. For example, in the pipeline that determines the polarity of a tweet, a score related to its subjectivity is also included as a feature, reflecting the conceptual connection that exists between subjectivity and polarity: if a tweet can be assigned a polarity, it is also subjective. The same kind of connection is also applied to the other models; the sketch below illustrates the idea.
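Since the engine itself is not disclosed (see below), the cross-task feature idea can only be illustrated with a minimal, self-contained sketch. Everything in it is an assumption for illustration: scikit-learn models stand in for the undisclosed classifiers, random vectors for the tweet features, and a single binary label per task.

```python
# Minimal sketch of the cross-task feature idea described above: the
# subjectivity model's score is appended as an extra feature for the
# polarity ensemble. All components here are illustrative stand-ins,
# NOT the actual (undisclosed) Tweet2Check engine.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def add_subjectivity_feature(X, subj_model):
    """Append P(subjective) as one extra column of the feature matrix."""
    subj_score = subj_model.predict_proba(X)[:, 1]
    return np.hstack([X, subj_score.reshape(-1, 1)])

# Stand-ins for tweet feature vectors and the task labels (0/1).
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y_subj = rng.integers(0, 2, 200)
y_pol = rng.integers(0, 2, 200)

# 1) Each task's classifier is trained separately; subjectivity first.
subj_model = LogisticRegression(max_iter=1000).fit(X, y_subj)

# 2) The polarity ensemble mixes discriminative and generative members
#    and is trained on features extended with the subjectivity score.
X_pol = add_subjectivity_feature(X, subj_model)
polarity_ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("svm", SVC(probability=True))],
    voting="soft",  # average the members' predicted probabilities
).fit(X_pol, y_pol)

print(polarity_ensemble.predict(add_subjectivity_feature(X[:3], subj_model)))
```

Soft voting, which averages the members' predicted probabilities, is one simple way to combine discriminative and generative classifiers in a single ensemble.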
   Tweet2Check does not rely only on the predictions coming from the predictive models: it also applies a set of algorithms based on natural language processing techniques, which e.g. automatically perform topic/named-entity extraction, and it exploits further resources that have been both handcrafted and automatically extracted. Unfortunately, it is not possible to give more details about the engine due to non-disclosure restrictions.

   Tweet2Check is not only a web service providing access to the sentiment prediction of sentences; it is also a full, user-friendly web application allowing, among other features, to:

   • perform queries on Twitter;

   • show the main topics discussed in the tweets, either comment-specific, associated to a specific month, or computed over the overall results obtained by the query;

   • show the polarity, subjectivity and irony associated to each tweet under evaluation;

   • show the sentiment of the extracted topics.

A demo of Tweet2Check and its API is available for research purposes only, upon request by email to the first author of the paper. The results of all of the experiments are therefore repeatable.
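Because the API is private, no real client code can be shown; the following is a purely hypothetical sketch of what a client for such a web service could look like. The endpoint URL, request fields and response payload are all invented for illustration.

```python
# Purely hypothetical client sketch: the real Tweet2Check API is private,
# so the endpoint URL, request fields and response payload below are all
# invented for illustration and do NOT document the actual service.
import requests

API_URL = "https://example.com/tweet2check/api/predict"  # placeholder URL

def classify_tweet(text: str, api_key: str) -> dict:
    """Send one tweet and return a (hypothetical) sentiment payload,
    e.g. subjectivity, polarity, irony and extracted topics."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text, "lang": "it"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example call (would only work against a real deployment):
# classify_tweet("Che bella giornata oggi!", api_key="...")
```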
3   Experimental Analysis

Considering the Sentipolc 2016 results, we can see that:

   • some tools performed very well in one task and very badly in another one (e.g. team2 was the second team for subjectivity and the last one for polarity, team7 was the seventh for subjectivity and the first one for polarity, etc.);

   • some other tools show a much better performance on the unconstrained run than on the constrained run (e.g., on the subjectivity task, team1's unconstrained score is 4% higher than its constrained one).
   However, if the goal is to find which tools are overall the most complete and best performing, i.e. performing well considering the contribution that each tool provided on all of the tasks, an overall score/indicator is needed. To this aim, we propose the following score, which takes into account, for each team, the best run per task. Formula 1 states that, given a team and a task, we consider the highest F-score among the available runs (both constrained and unconstrained). Formula 2 then introduces a score per team, calculated as the sum of the contributions provided by that team on the tasks under evaluation (even a subset of them):

   S_{team,task} = \max_{run}(F_{team,task,run})    (1)

   S_{team} = \sum_{task} S_{team,task}    (2)
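To make formulas 1 and 2 concrete, the short sketch below computes them for team1, using its published best F-scores from Tables 1, 2 and 3; the resulting sums match team1's entries in Tables 4 and 5.

```python
# Formulas (1) and (2): per task, take the best F-score over a team's runs;
# the team score is the sum over the tasks under evaluation.
def team_score(f_scores, tasks):
    """S_team = sum_{task in tasks} max_{run} F_{team,task,run}."""
    return sum(max(runs.values())
               for task, runs in f_scores.items() if task in tasks)

# team1's best published F-scores (Tables 1-3); the runs that were not the
# best are not reported in the paper, so each task lists a single run here.
team1 = {"subjectivity": {"u": 0.7444},
         "polarity":     {"u": 0.6620},
         "irony":        {"u": 0.4810}}

print(round(team_score(team1, {"polarity", "irony"}), 4))
# -> 1.143, i.e. team1's 1.1430 in Table 4
print(round(team_score(team1, {"subjectivity", "polarity", "irony"}), 4))
# -> 1.8874, i.e. team1's entry in Table 5
```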
   Thanks to this score, it is possible to get an overall idea of the best available tools on: (i) each single task; (ii) a collection of tasks (a pair of tasks at a time, in our case); or (iii) all of the tasks.

   Please consider also that this score can be even more restrictive for our tool: we perform better on the unconstrained runs than on the constrained ones, and there are more tools for the constrained runs that perform better than our unconstrained version, so they would gain positions in the chart (e.g. team3, team4 and team5 perform better on the constrained version of the polarity task). Moreover, we give the same weight to all of the tasks, even though we focused more on the polarity and irony tasks, which are closer to the original App2Check approach, i.e. more useful for and related to the evaluation of app reviews.

   Tables 1, 2 and 3 show the results of each single task, sorted by the score obtained. The columns contain (from left to right): ranking, team name, the score obtained with formula 1, and a label reporting whether the best run for the team was constrained (c) or unconstrained (u). In Tables 1 and 2 we consider the F-score value coming from the Tweet2Check amended run, representing the correct system answer. For the subjectivity task in Table 1, Tweet2Check does not show good results compared to the other tools, and there is clearly room for further improvement. In all of the other cases, Tweet2Check performs well:

   • in Table 2, related to polarity classification, it is very close to the best result, at a distance of just 0.0188, and it is the second tool considering only the results for the unconstrained run (which are directly comparable);

   • in Table 3, related to irony detection, it is the second best tool, at a distance of just 0.0068 from the first classified.

   Tables 4 and 5 show the results obtained using formula 2 considering, respectively, polarity and irony together, and all of the three tasks together.¹

   ¹ Since some teams did not participate in all of the tasks, their results are marked as follows:
   * The tool did not participate in the Irony task
   ** The tool participated only in the Polarity task
   *** The tool participated only in the Irony task

         Team            S_team    con/uncon
    1    team1           0.7444    u
    2    team2           0.7184    c
    3    team3           0.7134    c
    4    team4           0.7107    c
    5    team5           0.7105    c
    6    team6           0.7086    c
    7    team7           0.6937    c/u
    8    team8           0.6495    c
    9    Tweet2Check     0.6317    u
    10   team10          0.5647    c
    11   team11          -         -
    12   team12          -         -
    13   team13          -         -

   Table 1: Subjectivity task at Sentipolc 2016.
         Team            S_team    con/uncon
    1    team7           0.6638    c
    2    team1           0.6620    u
    3    team4           0.6522    c
    4    team3           0.6504    c
    5    team5           0.6453    c
    6    Tweet2Check     0.6450    u
    7    team10          0.6367    c
    8    team11          0.6281    c
    9    team12          0.6099    c
    10   team6           0.6075    u
    11   team8           0.6046    c
    12   team2           0.5683    c
    13   team13          -         -

   Table 2: Polarity task at Sentipolc 2016.

         Team            S_team    con/uncon
    1    team4           0.5480    c
    2    Tweet2Check     0.5412    c
    3    team13          0.5251    c
    4    team5           0.5133    c
    5    team3           0.4992    c
    6    team8           0.4961    c
    7    team1           0.4810    u
    8    team2           -         -
    9    team6           -         -
    10   team7           -         -
    11   team10          -         -
    12   team11          -         -
    13   team12          -         -

   Table 3: Irony task at Sentipolc 2016.

         Team            S_team
    1    team4           1.2002
    2    Tweet2Check     1.1862
    3    team5           1.1586
    4    team3           1.1496
    5    team1           1.1430
    6    team8           1.1007
    7    team7*          0.6638
    8    team10*         0.6367
    9    team11**        0.6281
    10   team12**        0.6099
    11   team6*          0.6075
    12   team2*          0.5683
    13   team13***       0.5251

   Table 4: The best performing tools on the Polarity and Irony tasks.

         Team            S_team
    1    team4           1.9109
    2    team1           1.8874
    3    team5           1.8691
    4    team3           1.8630
    5    Tweet2Check     1.8179
    6    team8           1.7502
    7    team7*          1.3575
    8    team6*          1.3161
    9    team2*          1.2867
    10   team10*         1.2014
    11   team11**        0.6281
    12   team12**        0.6099
    13   team13***       0.5251

   Table 5: The best performing tools on the three tasks.

   In Table 4, Tweet2Check is the second best tool, at a distance of 0.0014 from team4, which is the best tool according to this score. This is clearly our best result at Sentipolc 2016 when more tasks are considered together, highlighting that polarity classification and irony detection are the tasks best performed by the current version of Tweet2Check. In Table 5, where the impact of the subjectivity task is also considered, Tweet2Check is the fifth classified, at a distance of 0.0930 from team4; even in this case, Tweet2Check is in the top 5 of the 13 participating tools.

   Finally, Tables 6, 7 and 8 report the results obtained by training and evaluating Tweet2Check on the Evalita Sentipolc 2014 datasets (Basile et al., 2014). The second and third columns of these tables contain, respectively, the F-score of the constrained and of the unconstrained runs (in bold the best results in the original paper). We can see in Table 6 that Tweet2Check ranks first for subjectivity in the unconstrained run, and second in the constrained run. In Tables 7 and 8, Tweet2Check is the best tool for both polarity and irony. Moreover, since we believe that Tweet2Check is always better in the unconstrained setting, we decided to confirm this observation experimentally: we trained Tweet2Check on the Sentipolc 2014 training set with the same approach used for the 2016 edition, and then tested it on the test set of the Sentipolc 2014 evaluation. We show that, also in this case, the Tweet2Check unconstrained runs perform better than the constrained ones, and that our tool is the best compared to the tools that participated in 2014.
        Team                F(C)      F(U)
        uniba2930           0.7140    0.6892
        Tweet2Check         0.6927    0.6903
        UNITOR              0.6871    0.6897
        IRADABE             0.6706    0.6464
        UPFtaln             0.6497    -
        ficlit+cs@unibo     0.5972    -
        mind                0.5901    -
        SVMSLU              0.5825    -
        fbkshelldkm         0.5593    -
        itagetaruns         0.5224    -

   Table 6: Tweet2Check ranking on the Sentipolc 2014 subjectivity task.

        Team                F(C)      F(U)
        Tweet2Check         0.7048    0.7142
        uniba2930           0.6771    0.6638
        IRADABE             0.6347    0.6108
        CoLingLab           0.6312    -
        UNITOR              0.6299    0.6546
        UPFtaln             0.6049    -
        SVMSLU              0.6026    -
        ficlit+cs@unibo     0.5980    -
        fbkshelldkm         0.5626    -
        mind                0.5342    -
        itagetaruns         0.5181    -
        Itanlp-wafi*        0.5086    -

        * amended run: 0.6637

   Table 7: Tweet2Check ranking on the Sentipolc 2014 polarity task.

        Team                F(C)      F(U)
        Tweet2Check         0.5915    -
        UNITOR              0.5759    0.5959
        IRADABE             0.5415    0.5513
        SVMSLU              0.5394    -
        itagetaruns         0.4929    -
        mind                0.4771    -
        fbkshelldkm         0.4707    -
        UPFtaln             0.4687    -

   Table 8: Tweet2Check ranking on the Sentipolc 2014 irony task.

4   Conclusion

In this paper we presented Tweet2Check and discussed the analysis of the results from Sentipolc 2016, showing that our tool is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score designed to identify the most complete and best performing tools for sentiment analysis of tweets, obtained by summing up the best F-score of each team over the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when the polarity and irony tasks are considered together.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of EVALITA 2014.

Emanuele Di Rosa and Alberto Durante. 2016. App2Check: a Machine Learning-based System for Sentiment Analysis of App Reviews in Italian Language. In Proceedings of the 2nd International Workshop on Social Media World Sensors (LREC 2016), pages 8-11. http://ceur-ws.org/Vol-1696/

Emanuele Di Rosa and Alberto Durante. 2016. App2Check Extension for Sentiment Analysis of Amazon Products Reviews. In Semantic Web Challenges, CCIS Vol. 641, Springer.

ESWC 2016 Challenges. http://2016.eswc-conferences.org/program/eswc-challenges

Diego Reforgiato. 2016. Results of the Semantic Sentiment Analysis 2016 International Challenge. https://github.com/diegoref/SSA2016

Filipe N. Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and Fabrício Benevenuto. 2016. SentiBench: a Benchmark Comparison of State-of-the-practice Sentiment Analysis Methods. EPJ Data Science.

Harald Sack, Stefan Dietze, and Anna Tordai. 2016. Semantic Web Challenges: Third SemWebEval Challenge at ESWC 2016. CCIS, Springer.