<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Rosa</string-name>
          <role>Head of ML</role>
          <email>emanuele.dirosa@finsa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Durante</string-name>
          <role>Research Scientist</role>
          <email>alberto.durante@finsa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Finsa s.p.a.</institution>
          ,
          <addr-line>Via XX Settembre 14</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>English. In this paper we present our Tweet2Check tool, provide an analysis of the experimental results obtained by our tool in the Evalita Sentipolc 2016 evaluation, and compare its performance with the state-of-the-art tools that participated in the evaluation. In the experimental analysis, we show that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score that indicates the most complete, best-performing tools for Sentiment Analysis of tweets, obtained by summing up the best F-score of each team on the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when considering the polarity and irony tasks together.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italian (translated). In this paper we present our Tweet2Check system, provide an analysis of the experimental results obtained by our tool in the evaluation carried out within Evalita Sentipolc 2016, and compare its performance with that of the other participating systems. In the experimental analysis, we show that Tweet2Check is: (i) the second classified for the irony detection task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity classification task, considering the unconstrained systems, at a distance of 0.017 from the first classified; (iii) among the top 5 tools (out of 13) according to a score aimed at identifying the most complete, best-performing tools for sentiment analysis of tweets, obtained by summing the best F-score of each team on the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when considering the polarity and irony tasks together.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        In this paper we present Tweet2Check, a machine learning-based tool for sentiment analysis of tweets, which applies the same approach that we implemented in App2Check and already validated in Di Rosa and Durante (2016-a; 2016-b), where we showed that it works very well (most of the time it is the best tool) for the analysis of app reviews. Moreover, this approach has also been validated on general product/service reviews, since our tool was classified second at the International Semantic Sentiment Analysis Challenge 2016
        <xref ref-type="bibr" rid="ref4">(Sack et al., 2016)</xref>
        , which concerned the polarity classification of Amazon product reviews. Our research interest in participating in the Sentipolc 2016 evaluation is to take the methodology that was mainly designed to analyze app reviews, adapt it to analyze tweets, and evaluate its performance on tweets. From a research point of view, it is also interesting to understand whether good results can be obtained by applying the same approach to very different domains such as app reviews and tweets.
      </p>
      <p>Starting from the results provided by the organizers of the Sentipolc 2016 evaluation, we performed an analysis in which we show that Tweet2Check is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering just the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score that indicates the most complete, best-performing tools for Sentiment Analysis of tweets, obtained by summing up the best F-score of each team on the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when considering the polarity and irony tasks together.</p>
      <p>
        Finally, we show that the Tweet2Check unconstrained runs are overall always better than (or almost equal to) the constrained ones. To support this claim, we also provide an evaluation of Tweet2Check on the Sentipolc 2014
        <xref ref-type="bibr" rid="ref5">(Basile et al., 2014)</xref>
        datasets. This is very important for an industrial tool, since keeping a higher number of examples discussing different topics in the training set allows the tool to predict well tweets coming from new domains, and thus to generalize well from the perspective of the final user.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 Tweet2Check description</title>
      <p>Tweet2Check is an industrial system in which supervised learning methods are applied to build predictive models for the classification of subjectivity, polarity and irony in tweets. The overall machine learning system is an ensemble learning system that combines many different classifiers, each built by us using a different machine learning algorithm and implementing different features: this allows us to take advantage of different complementary approaches, both discriminative and generative. To this aim, we considered the most well-known machine learning algorithms, covering both the most established and the newest approaches. For each task, every classifier was trained separately; then, the ensemble combines the predictions of the underlying classifiers. The training of the models is performed by considering only the tweets provided by Sentipolc 2016 for the constrained run, and additional tweets discussing other topics for the unconstrained run. While training the models, many features are generated, both Twitter-specific and source-independent. Moreover, some features that "connect" different tasks are also included in the pipelines that determine subjectivity, polarity and irony. For example, the pipeline that determines the polarity of a tweet also includes a score related to its subjectivity as a feature, reflecting the conceptual connection that exists in reality between subjectivity and polarity: if a tweet can be assigned a polarity, it is also subjective. The same kind of connection is applied to the other models.</p>
      <p>Tweet2Check does not use just the prediction coming from the predictive model: it also applies a set of algorithms based on natural language processing techniques, allowing it, for example, to automatically perform topic/named-entity extraction, and it uses other resources that have been both handcrafted and automatically extracted. Unfortunately, it is not possible to give more details about the engine due to non-disclosure restrictions.</p>
      <p>Tweet2Check is not only a web service providing access to the sentiment prediction of sentences; it is also a full, user-friendly web application allowing the user, among other features, to:</p>
      <list list-type="bullet">
        <list-item><p>perform queries on Twitter;</p></list-item>
        <list-item><p>show the main topics discussed in tweets, whether comment-specific, associated to a specific month, or evaluated on the overall results obtained by the query;</p></list-item>
        <list-item><p>show the polarity, subjectivity and irony associated to each tweet under evaluation;</p></list-item>
        <list-item><p>show the sentiment of the extracted topics.</p></list-item>
      </list>
      <p>A demo of Tweet2Check and its API are available for research purposes only, by sending a request by email to the first author of the paper. Thus, the results of all of the experiments are repeatable.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Experimental Analysis</title>
      <p>Considering the Sentipolc 2016 results, we can see that: some tools performed very well in one task and very badly in another (e.g., team2 was the second team for subjectivity and the last one for polarity, team7 was the seventh for subjectivity and the first one for polarity, etc.); some other tools show a much better performance on the unconstrained run than on the constrained run (e.g., team1 shows, for the subjectivity-unconstrained task, a score that is 4% higher than for the constrained run).</p>
      <p>However, if the goal is to find the overall most complete, best-performing tools, i.e. the tools performing well across all of the tasks they contributed to, an overall score/indicator is needed. To this aim, we propose the following score, which takes into account, for each team, the best run per task. Formula 1 shows that we consider, for a given team and task, the highest F-score value among the available runs (considering both constrained and unconstrained runs). Then, in formula 2, we introduce a score per team, calculated as the sum of the contributions provided by that team on the tasks under evaluation (even a subset of them).</p>
      <p>S<sub>team,task</sub> = max<sub>run</sub> F<sub>team,task,run</sub> (1)</p>
      <p>S<sub>team</sub> = &#x2211;<sub>task</sub> S<sub>team,task</sub> (2)</p>
      <p>Thanks to this score, it is possible to get an idea of the overall best available tools on: (i) each single task; (ii) a collection of tasks (a pair of tasks at a time in our case); or (iii) all of the tasks.</p>
      <p>Please also consider that this score can be even more restrictive for our tool: we perform better on the unconstrained runs than on the constrained ones, while several other tools perform better on their constrained runs and would therefore gain positions in the chart (e.g., team3, team4 and team5 perform better on the constrained version for the polarity task). Moreover, we are giving the same weight to all of the tasks, even though we focused more on the polarity and irony tasks, which are more related to the original App2Check approach, i.e. more useful for and related to the evaluation of app reviews.</p>
      <p>Tables 1, 2 and 3 show the results of each single task, sorted by the score obtained. The columns contain (from left to right): ranking, team name, the score obtained with formula 1, and a label reporting whether the best run for the team was constrained (c) or unconstrained (u). In Tables 1 and 2 we consider the F-score value coming from the Tweet2Check amended run, representing the correct system answer. For the subjectivity task in Table 1, Tweet2Check does not show good results compared to the other tools, and there is clearly room for further improvement. In all of the other cases, Tweet2Check shows good results: in Table 2, related to polarity classification, it is very close to the best result, at a distance of just 0.0188, and it is the second tool considering only the results for the unconstrained run (which are directly comparable); in Table 3, related to irony detection, it is the second best tool, at a distance of just 0.0068 from the first classified.</p>
      <p>Tables 4 and 5 show the results obtained using formula 2 considering, respectively, polarity and irony together, and all of the three tasks together<sup>1</sup>.</p>
      <p>[Table residue: ranking and team-name columns (team1&#8211;team13, with Tweet2Check ranked among them) flattened during extraction.]</p>
      <sec id="sec-4-1">
        <title>-</title>
        <p>
          In Table 4, Tweet2Check is the second best tool, at a distance of 0.0014 from team4, which is the best tool according to this score. This is clearly our best result at Sentipolc 2016 when considering several tasks together, highlighting that polarity classification and irony detection are the tasks best performed by the current version of Tweet2Check. In Table 5, where we also consider the impact of the subjectivity task on the results, Tweet2Check is the fifth classified, at a distance of 0.0930 from team4; in this last case, Tweet2Check is in the top-5 chart out of 13 tools. Finally, Tables 6, 7 and 8 report the results obtained by training and evaluating Tweet2Check on the Evalita Sentipolc 2014
          <xref ref-type="bibr" rid="ref5">(Basile et al., 2014)</xref>
          datasets.
          <sup>1</sup>Since some teams did not participate in all of the tasks, their results are marked as follows: * the tool did not participate in the Irony task; ** the tool participated only in the Polarity task; *** the tool participated only in the Irony task.
        </p>
        <p>[Table residue: the S<sub>team</sub> score columns of Tables 4 and 5 were flattened during extraction and are omitted here.]</p>
        <p>The second and third columns of these tables contain, respectively, the F-scores of the constrained and the unconstrained runs (best results in bold). We can see in Table 6 that Tweet2Check ranks first for subjectivity in the unconstrained run, and second in the constrained run. In Tables 7 and 8, Tweet2Check is the best tool for both polarity and irony. Moreover, since we believe that Tweet2Check always performs better in the unconstrained setting, we decided to confirm this observation experimentally: we trained Tweet2Check on the Sentipolc 2014 training set with the same approach we used for the 2016 edition, and then tested it on the Sentipolc 2014 test set. Also in this case, the Tweet2Check unconstrained runs perform better than the constrained ones, and our tool is the best compared to the tools that participated in 2014.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusion</title>
      <p>In this paper we presented Tweet2Check and discussed the analysis of the results from Sentipolc 2016, showing that our tool is: (i) the second classified for the irony task, at a distance of just 0.0068 from the first classified; (ii) the second classified for the polarity task, considering the unconstrained runs, at a distance of 0.017 from the first tool; (iii) in the top 5 tools (out of 13) according to a score that indicates the most complete, best-performing tools for Sentiment Analysis of tweets, obtained by summing up the best F-score of each team on the three tasks (subjectivity, polarity and irony); (iv) the second best tool, according to the same score, when considering the polarity and irony tasks together.</p>
      <p>[Table residue: F(U) score columns of the Sentipolc 2014 tables, flattened during extraction and omitted here.]</p>
      <sec id="sec-5-2">
        <title>-</title>
        <p>[Table residue: team-ranking column listing Tweet2Check, UNITOR, IRADABE, SVMSLU, itagetaruns, mind, fbkshelldkm and UPFtaln.]</p>
        <p>Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) &amp; Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).
Emanuele Di Rosa and Alberto Durante. 2016. App2Check: a Machine Learning-based System for Sentiment Analysis of App Reviews in Italian Language. In Proceedings of the 2nd International Workshop on Social Media World Sensors (LREC 2016), pp. 8-11. http://ceur-ws.org/Vol-1696/
Emanuele Di Rosa and Alberto Durante. 2016. App2Check Extension for Sentiment Analysis of Amazon Products Reviews. In Semantic Web Challenges, Vol. 641-1, CCIS, Springer.
Diego Reforgiato. 2016. Results of the Semantic Sentiment Analysis 2016 International Challenge. https://github.com/diegoref/SSA2016 ; ESWC 2016 Challenges: http://2016.eswcconferences.org/program/eswc-challenges</p>
      </sec>
      <sec id="sec-5-4">
        <title>-</title>
        <p>[Table residue: team-ranking column listing Tweet2Check, UNITOR, IRADABE, UPFtaln, ficlit+cs@unibo, mind, SVMSLU, fbkshelldkm and itagetaruns.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Harald</given-names>
            <surname>Sack</surname>
          </string-name>
          , Stefan Dietze and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Tordai</surname>
          </string-name>
          , editors.
          <year>2016</year>
          .
          <source>Semantic Web Challenges</source>
          :
          <article-title>Third SemWebEval Challenge at ESWC 2016</article-title>
          . CCIS, Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Andrea Bolioli, Malvina Nissim, Viviana Patti and Paolo Rosso.
          <year>2014</year>
          .
          <article-title>Overview of the Evalita 2014 SENTIment POLarity Classification Task</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>