Lacam&Int@UNIBA at the EVALITA 2016-SENTIPOLC Task

Vito Vincenzo Covella, Stefano Ferilli, Berardina De Carolis, Domenico Redavid
Department of Computer Science, University of Bari "Aldo Moro", Italy
covinc93@gmail.com, stefano.ferilli@uniba.it, berardina.decarolis@uniba.it, domenico.redavid@uniba.it

Abstract

English. This paper describes our first experience of participation in the EVALITA challenge. We took part only in the SENTIPOLC Sentiment Polarity subtask and, to this purpose, we tested two systems, both developed for a generic Text Categorization task, in the context of sentiment analysis: SentimentWS and SentiPy. Both were developed according to the same pipeline, but using different feature sets and classification algorithms. The first system does not use any resource specifically developed for the sentiment analysis task. The second one, which had a slightly better performance in the polarity detection subtask, was enriched with an emoticon classifier in order to better fit the purpose of the challenge.

Italian. This paper describes our first experience of participation in EVALITA. Our team took part only in the Sentiment Polarity recognition subtask; in this context we tested two systems developed generically for Text Categorization, applying them to this specific task: SentimentWS and SentiPy. Both systems use the same pipeline but with different feature sets and classification algorithms. The first system does not use any resource specific to sentiment analysis, while the second, which ranked better while preserving its generality in text classification, was enriched with an emoticon classifier to try to make it better suited to the purpose of the challenge.

1 Introduction

We tested two systems to analyze Sentiment Polarity for Italian. They were designed and created as generic Text Categorization (TC) systems, without any specific feature or resource to support Sentiment Analysis. We have used them in various domains (movie reviews, opinions about public administration services, mood detection, Facebook posts, polarity expressed in the linguistic content of speech interaction, etc.).

Both systems were applied to the EVALITA 2016 SENTIPOLC Sentiment Polarity detection subtask (Barbieri et al., 2016) in order to understand whether, notwithstanding their "general-purpose" and context-independent setting, they were flexible enough to reach a good accuracy. If so, this would mean that the Sentiment Analysis task can be approached without creating special resources for this purpose, which is known to be a costly and critical activity, or that, if available, such resources may further improve performance.

We present here only the results of the constrained runs, in which only the provided training data were used to develop the systems.

The first system was entirely developed by the LACAM research group (all the classes used in the pipeline). After studying the effect of different combinations of features and algorithms on the automatic learning of sentiment polarity classifiers for Italian based on the EVALITA SENTIPOLC 2014 dataset, we applied the best one to the EVALITA 2016 training set in order to participate in the challenge.
The second system was developed using the scikit-learn (Pedregosa et al., 2011) and NLTK (Bird et al., 2009) libraries for building the pipeline; in order to optimize performance on the provided training set, classification algorithms and feature sets different from those used in SentimentWS were tested.

Even though both were initially conceived as generic TC systems, with the aim of tuning them for the SENTIPOLC task they also consider the emoticons present in the tweets. In the first system this was done by including them in the feature set, while in the second one emoticons were handled by building a dedicated classifier whose output influences the sentiment polarity prediction. The results obtained by the two systems are comparable, even if the second one shows a better overall accuracy and ranked higher than the first one in the challenge.

2 Systems Description

2.1 SentimentWS

In a previous work (Ferilli et al., 2015) we developed a system for Sentiment Analysis/Opinion Mining in Italian. It was called SentimentWS, since it was initially developed to run as a web service analyzing opinions coming from web-based applications. SentimentWS casts the Sentiment Classification problem as a TC task, where the categories represent the polarities. To be general and context-independent, it relies on supervised Machine Learning approaches. To learn a classifier, one must first choose which features to use to describe the documents and which learning method to exploit. An analysis of the state of the art suggested that no single approach can be considered the absolute winner, and that different approaches, based on different perspectives, may reach interesting results on different features. As regards the features, for the sake of flexibility, the system allows one to select different combinations of features to be used for learning the predictive models. As regards the approaches, our proposal is to select a set of approaches that are sufficiently complementary to mutually provide strengths and compensate for weaknesses.

As regards the internal representation of text, most NLP approaches and applications focus on the lexical/grammatical level as a good tradeoff between expressiveness and complexity, effectiveness and efficiency. Accordingly, we decided to take into account the following kinds of descriptors:

- single normalized words (ignoring dates, numbers and the like), which we believe convey most of the informational content of the text;
- abbreviations, acronyms, and colloquial expressions, especially those often found in informal texts such as blog posts and SMS messages;
- n-grams (groups of n consecutive terms) whose frequency of occurrence in the corpus is above a pre-defined threshold, which may sometimes be particularly meaningful;
- PoS tags, which are intuitively discriminant for subjectivity;
- expressive punctuation (dots, exclamation and question marks), which may be indicative of subjectivity and emotional involvement.

In order to test the system in the context of Sentiment Analysis, we added emoticons to the set of features to be considered, due to their direct and explicit relationship to emotions and moods.

As regards NLP pre-processing, we used TreeTagger (Schmid, 1994) for PoS-tagging and the Snowball suite (Porter, 2001) for stemming. All the selected features are collectively represented in a single vector space based on the real-valued Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme (Robertson, 2004). To map values into [0, 1] we use cosine normalization. To reduce the dimensionality of the vector space, Document Frequency selection (i.e., removing terms that do not pass a predefined frequency threshold) was used as a good tradeoff between simplicity and effectiveness.
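To make this representation concrete, the following is a minimal scikit-learn sketch of the same weighting scheme: TF-IDF weights with cosine (L2) normalization and a Document Frequency threshold. It is illustrative only: SentimentWS is an in-house implementation, and the toy documents and the min_df value here are ours, not the system's.

```python
# Minimal sketch of a SentimentWS-style document representation:
# TF-IDF weighting, cosine (L2) normalization into [0, 1], and a
# Document Frequency threshold for dimensionality reduction.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "che bel film , davvero emozionante !",   # toy pre-processed documents
    "film noioso , che delusione ...",
]

vectorizer = TfidfVectorizer(
    min_df=1,    # Document Frequency threshold (illustrative value)
    norm="l2",   # cosine normalization -> each document vector has unit length
)
X = vectorizer.fit_transform(docs)            # sparse document-term matrix
print(X.shape)
print(vectorizer.get_feature_names_out())
```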
To build the classification model we focused on two complementary approaches that have proved effective in the literature: a similarity-based one (Rocchio) and a probabilistic one (Naive Bayes). SentimentWS combines the two approaches in a committee, where each classifier (i = 1, 2) plays the role of a different domain expert that assigns a score s_ik to category c_k for each document to be classified. The final prediction is obtained as the class c = argmax_k S_k, where S_k = f(s_1k, s_2k). There is a wide range of options for the combination function f (Tulyakov et al., 2008). In our case we use a weighted sum, which requires that the values returned by the single approaches are comparable, i.e. that they refer to the same scale. This holds here: the Naive Bayes approach returns probability values and Rocchio's classifier returns similarity values, both in [0, 1].
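A small sketch of this combination rule follows, under the stated assumptions: two experts returning per-category scores in [0, 1], combined by a weighted sum. The equal weights mirror those used in the movie-review experiments of Section 3.1; the function and variable names are ours.

```python
# Sketch of the SentimentWS committee rule: c = argmax_k S_k with
# S_k = f(s_1k, s_2k), where f is a weighted sum of the Naive Bayes
# probability and the Rocchio similarity for category k.
import numpy as np

def combine(scores_nb, scores_rocchio, w_nb=0.5, w_rocchio=0.5):
    """Each argument is an array of per-category scores s_ik in [0, 1]."""
    s = w_nb * np.asarray(scores_nb) + w_rocchio * np.asarray(scores_rocchio)
    return int(np.argmax(s))   # index k of the predicted class c

# Toy example over the categories [negative, positive]:
print(combine([0.35, 0.65], [0.48, 0.52]))   # -> 1 (positive wins)
```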
2.2 SentiPy

SentiPy was developed using the scikit-learn and NLTK libraries for building the pipeline; in order to optimize performance on the provided training set, classification algorithms and feature sets different from those used in SentimentWS were tested. It uses a committee of two classifiers, one for the text component of the message and the other for the emoticons. For the first classifier we use a very simple set of features (any string made of at least two characters) and linear SVC as the classification algorithm. Even if this might seem too simple, we ran experiments testing other feature configurations based on i) lemmatization, ii) lemmatization followed by POS-tagging, iii) stemming, iv) stemming followed by POS-tagging. All of them were tested with and without removing Italian stopwords (taken from nltk.corpus.stopwords.words("italian")). We also tested other classification algorithms (Passive Aggressive Classifier, SGDClassifier, Multinomial Naive Bayes), but their performance was less accurate than that of linear SVC, which we therefore selected.

Before fitting the classifier, text preprocessing was performed according to the following steps, sketched in code after the list:

- Twitter "mentions" (identified by the character '@' followed by the username) and http links are removed;
- the retweet markers ("RT" and "rt") are removed;
- hashtags are "purged" by removing the character '#' preceding the string, which is then left unmodified;
- non-BMP UTF-8 characters (characters outside the Basic Multilingual Plane), usually used to encode special emoticons and emojis in tweets, are handled by replacing them with their hexadecimal encoding; this is done to avoid errors while reading the files.
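A possible rendering of these steps in Python is shown below. The regular expressions are plausible reconstructions, not the system's published code, and the preprocess name is ours.

```python
# Sketch of the SentiPy text preprocessing steps listed above.
import re

def preprocess(tweet: str) -> str:
    t = re.sub(r"@\w+", " ", tweet)             # remove Twitter mentions
    t = re.sub(r"http\S+", " ", t)              # remove http links
    t = re.sub(r"\brt\b", " ", t, flags=re.I)   # remove retweet markers RT/rt
    t = t.replace("#", "")                      # purge '#', keep the hashtag text
    # replace non-BMP characters (e.g. many emojis) with a hexadecimal encoding
    t = "".join(c if ord(c) <= 0xFFFF else f"\\U{ord(c):08x}" for c in t)
    return re.sub(r"\s+", " ", t).strip()

print(preprocess("RT @user Grande film! 😀 #cinema http://t.co/abc"))
# -> Grande film! \U0001f600 cinema
```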
After running the aforementioned experiments using the training and test sets provided by Sentipolc 2014, which were also used to fine-tune the parameters of the LinearSVC algorithm, we compared the two most successful approaches: tokenization done using nltk.tokenize.TweetTokenizer followed by stemming, and feature extraction simply done using the default tokenizer provided by scikit-learn (it tokenizes the string by extracting words of at least 2 letters). The best configurations are those shown in Table 1 and Table 2.

| Parameter | Value |
|---|---|
| Tokenization | Scikit-Learn default tokenizer |
| Maximum document frequency (CountVectorizer parameter) | 0.5 |
| Maximum number of terms for the vocabulary | unlimited |
| n-grams | unigrams and bigrams |
| Term weights | tf-idf |
| Vector normalization | l2 |
| fit_intercept classifier parameter | False |
| dual classifier parameter | True |
| Number of iterations over training data | 1000 |
| Class balancing | automatic |

Table 1: SentiPy - positive vs all best configuration based on Sentipolc 2014.

| Parameter | Value |
|---|---|
| Tokenization | Scikit-Learn default tokenizer |
| Maximum document frequency (CountVectorizer parameter) | 0.5 |
| Maximum number of terms for the vocabulary | unlimited |
| n-grams | unigrams and bigrams |
| Term weights | tf-idf |
| Vector normalization | l2 |
| fit_intercept classifier parameter | True |
| dual classifier parameter | True |
| Number of iterations over training data | 1000 |
| Class balancing | automatic |

Table 2: SentiPy - negative vs all best configuration based on Sentipolc 2014.

As far as emoticons and emojis are concerned, in this system, unlike the solution adopted in SentimentWS, we decided to exclude them from the feature set and to train a separate classifier according to the valence with which the tweet was labeled. This approach may be useful to detect irony or to recognize valence in particular domains in which emoticons are used with a different meaning. Emoticons and emojis were retrieved using a dictionary of strings and some regular expressions. The retrieved emoticons and emojis are replaced with identifiers, removing all other terms not related to emoticons, thus obtaining an emoticons-classes matrix. The underlying classifier takes this matrix as input and creates the model that will be used in the classification phase. The algorithm used in the experiments is Multinomial Naive Bayes.

The committee of classifiers was built using the VotingClassifier class provided by the Scikit-Learn framework. The chosen voting technique is so-called "hard voting", based on the majority voting rule; in case of a tie, the classifier selects the class based on ascending sort order (e.g., if classifier 1 votes class 2 and classifier 2 votes class 1, class 1 is selected).
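The committee can be reproduced in outline as follows. The LinearSVC settings come from Tables 1 and 2; the emoticon extraction (a simple token pattern here) is a stand-in for the system's unpublished dictionary and regular expressions, and the toy data are ours.

```python
# Sketch of the SentiPy committee: a text pipeline (TF-IDF + LinearSVC)
# and an emoticon pipeline (emoticon tokens + Multinomial Naive Bayes),
# combined with hard (majority) voting.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

text_clf = make_pipeline(
    TfidfVectorizer(max_df=0.5,          # maximum document frequency (Tables 1-2)
                    ngram_range=(1, 2),  # unigrams and bigrams
                    norm="l2"),          # l2 vector normalization
    LinearSVC(fit_intercept=False, dual=True, max_iter=1000,
              class_weight="balanced"),  # automatic class balancing
)

emoticon_clf = make_pipeline(
    # keep only emoticon-like tokens; the real system uses a dictionary of
    # strings and regular expressions that also cover emojis
    CountVectorizer(token_pattern=r"[:;=8][-o*']?[)(\[\]dDpP/\\]"),
    MultinomialNB(),
)

committee = VotingClassifier(
    estimators=[("text", text_clf), ("emoticons", emoticon_clf)],
    voting="hard",                       # majority voting rule
)

tweets = ["bel film :)", "che noia :(", "fantastico :)", "pessimo :("]
labels = [1, 0, 1, 0]                    # toy polarity labels
committee.fit(tweets, labels)
print(committee.predict(["film fantastico :)"]))   # expected: [1]
```

Note that with two voters, scikit-learn's hard voting resolves ties in favor of the lower class index, which matches the tie-breaking rule described above.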
These two configurations, which share the same fine-tuned LinearSVC parameters, were compared by observing the evaluation data obtained by testing them on the Sentipolc 2016 training set with a standard technique, 10-fold cross-validation, whose results are shown in Table 3. The obtained results were comparable; we therefore selected the configuration shown in the first two rows of Table 3, combined with the emoticon classifier, since the latter was not present in the SentimentWS system.

| Configuration | F1-score macro averaging | Accuracy |
|---|---|---|
| VotingClassifier, default tokenization - positive vs all | 0.70388 | 0.77383 |
| VotingClassifier, default tokenization - negative vs all | 0.70162 | 0.70648 |
| VotingClassifier, stemming - positive vs all | 0.70654 | 0.75424 |
| VotingClassifier, stemming - negative vs all | 0.6952 | 0.70351 |

Table 3: 10-fold cross-validation on the Sentipolc 2016 training set.

3 Experiments

Both systems were tested on other domains before applying them to the SENTIPOLC subtask.

In the results tables, for each class (positive and negative), 0 represents the value "False" used in the dataset annotations for the specific tweet and class, and 1 represents "True", following the Sentipolc 2016 task guidelines. Thus the cell identified by the row "positive" and the column "prec. 0" shows the precision for the tweets whose positive polarity annotation is set to False. The meaning of the other cells can be derived analogously.

3.1 SentimentWS Results

SentimentWS was initially tested on a dataset of 2000 reviews in Italian, concerning 558 movies, taken from http://filmup.leonardo.it/. In this case, classification performance was evaluated on 17 different feature settings using a 5-fold cross-validation procedure. Equal weight was assigned to all classifiers in the SentimentWS committee. The overall accuracy reported in (Ferilli et al., 2015) was always above 81%. When Rocchio outperformed Naive Bayes, the accuracy of the committee was greater than that of its components; in the other cases, corresponding to settings that used n-grams, Naive Bayes alone was the winner.

Before tackling the EVALITA 2016 SENTIPOLC task, in order to tune the system in a (hopefully) similar environment, we tested it on the EVALITA 2014 dataset and thereby determined the combination of features with the best accuracy on that dataset. We tested the system using a subset of ~900 tweets (taken from the dataset provided in Sentipolc 2014) in order to find the best configuration of parameters, which turned out to be the following:

- term normalization: lemmatization;
- minimum number of occurrences for a term to be considered: 3;
- POS tags used: NOUN, WH, CLI, ADV, NEG, CON, CHE, DET, NPR, PRE, ART, INT, ADJ, VER, PRO, AUX;
- n-grams: unigrams.

With the configuration described above, SentimentWS classified the whole Sentipolc 2014 test set (1935 tweets), obtaining a combined F-score of 0.6285. The same best configuration was also used in one of the two runs submitted to Sentipolc 2016, obtaining a combined F-score of 0.6037, as shown in Table 4.

| class | prec. 0 | rec. 0 | F-sc. 0 | prec. 1 | rec. 1 | F-sc. 1 | F-sc |
|---|---|---|---|---|---|---|---|
| positive | 0.8642 | 0.7646 | 0.8113 | 0.2841 | 0.4375 | 0.3445 | 0.5779 |
| negative | 0.7087 | 0.7455 | 0.7266 | 0.5567 | 0.5104 | 0.5325 | 0.6296 |

Table 4: SentimentWS - Sentipolc 2016 test set - combined F-score = 0.6037.

3.2 SentiPy Results

With the configuration discussed above, SentiPy obtained a combined F-score of 0.6281, as shown in Table 5.

| class | prec. 0 | rec. 0 | F-sc. 0 | prec. 1 | rec. 1 | F-sc. 1 | F-sc |
|---|---|---|---|---|---|---|---|
| positive | 0.8792 | 0.7992 | 0.8373 | 0.3406 | 0.4858 | 0.4005 | 0.6189 |
| negative | 0.7001 | 0.8577 | 0.7709 | 0.6450 | 0.4130 | 0.5036 | 0.6372 |

Table 5: SentiPy@Sentipolc2016 results (LinearSVC fine-tuned + EmojiCustomClassifier) - combined F-score = 0.6281.

We ran further experiments on the Sentipolc 2016 test set after the EVALITA deadline. Their results, even if unofficial, show significant improvements: we obtained a combined F-score of 0.6403. We achieved it by making specific changes to the positive vs all classifier: we used lemmatization (without stopword removal), unigrams only (no other n-grams), and set the fit_intercept parameter of the LinearSVC algorithm to True. The other parameters remained unchanged, and no changes were made to the negative vs all classifier.

4 Conclusions

Looking at the results of the Sentiment Polarity detection subtask, we were surprised by the overall performance of the systems presented in this paper, since they are simply Text Categorization systems. The only additions to the original systems, made to tune their performance on the sentiment polarity detection task, concerned emoticons: in SentimentWS these were included in the feature set, while SentiPy was enriched with a classifier created specifically for handling emoticons.

Besides the experiments executed on the SENTIPOLC dataset, we tested both systems on a dataset of Facebook posts in Italian collected and annotated by a group of researchers in our laboratories. This experiment was important to understand whether their performance was comparable to that obtained in the SENTIPOLC challenge. The results were encouraging, since both systems achieved a combined F-score higher than 0.8.

We are currently working on improving the performance of the systems by tuning them to the Sentiment Analysis context. To this aim we are developing a specific module to handle negation in Italian. In future work we plan to integrate the two systems by creating a single committee including all the classifiers; moreover, we plan to include an approach based on a combination of probabilistic and lexicon-based methods (De Carolis et al., 2015).

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

B. De Carolis, D. Redavid, and A. Bruno. 2015. A Sentiment Polarity Analyser based on a Lexical-Probabilistic Approach. In Proceedings of IT@LIA2015, 1st AI*IA Workshop on Intelligent Techniques At LIbraries and Archives, co-located with the XIV Conference of the Italian Association for Artificial Intelligence (AI*IA 2015).

Stefano Ferilli, Berardina De Carolis, Floriana Esposito, and Domenico Redavid. 2015. Sentiment analysis as a text categorization task: A study on feature and algorithm selection for Italian language. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

M. F. Porter. 2001. Snowball: A language for stemming algorithms. [Online]. http://snowball.tartarus.org/texts/introduction.html

Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520.

H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pp. 44-49.

S. Tulyakov, S. Jaeger, V. Govindaraju, and D. Doermann. 2008. Review of classifier combination methods. In Studies in Computational Intelligence (SCI), vol. 90, pp. 361-386. Springer.