                            Sanremo’s winner is...
            Category-driven Selection Strategies for Active Learning

 Anne-Lyse Minard, Manuela Speranza, Mohammed R. H. Qwaider, Bernardo Magnini
                      Fondazione Bruno Kessler, Trento, Italy
              {minard,manspera,qwaider,magnini}@fbk.eu



                              Abstract

    English. This paper compares Active Learning selection strategies for
    sentiment analysis of Twitter data. We focus mainly on category-driven
    strategies, which select training instances taking into consideration
    the confidence of the system as well as the category of the tweet
    (e.g. positive or negative). We show that this combination is
    particularly effective when the performance of the system is
    unbalanced over the different categories. This work was conducted in
    the framework of automatically ranking the songs of “Festival di
    Sanremo 2017” based on sentiment analysis of the tweets posted during
    the contest.

    Italiano. Questo lavoro confronta strategie di selezione di Active
    Learning per l’analisi del sentiment dei tweet focalizzandosi su
    strategie guidate dalla categoria. Selezioniamo istanze di
    addestramento combinando la categoria del tweet (per esempio positivo
    o negativo) con il grado di confidenza del sistema. Questa
    combinazione è particolarmente efficace quando la distribuzione delle
    categorie non è bilanciata. Questo lavoro aveva come scopo il ranking
    delle canzoni del “Festival di Sanremo 2017” sulla base dell’analisi
    del sentiment dei tweet postati durante la manifestazione.

1   Introduction

Active Learning (AL) is a well known technique for the selection of
training samples to be annotated by a human when developing a supervised
machine learning system. AL allows for the collection of more useful
training data, while at the same time reducing the annotation effort
(Cohn et al., 1994). In the AL framework samples are usually selected
according to several criteria, such as informativeness,
representativeness, and diversity (Shen et al., 2004).

   This paper investigates AL selection strategies that consider the
categories the current classifier assigns to samples, combined with the
confidence of the classifier on the same samples. We are interested in
understanding whether these strategies are effective, particularly when
category distribution and category performance are unbalanced. By
comparing several options, we show that selecting low confidence samples
of the category with the highest performance is a better strategy than
selecting high confidence samples of the category with the lowest
performance.

   The context of our study is the development of a sentiment analysis
system that classifies tweets in Italian. We used the system to
automatically rank the songs of Sanremo 2017 based on the sentiment of
the tweets posted during the contest.

   The paper is structured as follows. In Section 2 we give an overview
of the state-of-the-art in selection strategies for AL. Then we present
our experimental setting (Section 3) before detailing the tested
selection strategies (Section 4). Finally, we describe the results of our
experiment in Section 5 and the application of the system to ranking
Sanremo’s songs in Section 6.

2   Related Work

AL (Cohn et al., 1994; Settles, 2010) provides a well known methodology
for reducing the amount of human supervision (and the corresponding cost)
for the production of training datasets necessary in many Natural
Language Processing tasks. An incomplete list of references includes Shen
et al. (2004) for Named Entity Recognition, Ringger et al. (2007) for PoS
Tagging, and Schohn and Cohn (2000) for Text Classification.

   AL methods are based on strategies for sample selection. Although
there are two main types of selection methods, certainty-based and
committee-based, here we concentrate only on certainty-based selection
methods. The main certainty-based strategy used is the uncertainty
sampling method (Lewis and Gale, 1994). Shen et al. (2004) propose a
strategy which is based on the combination of several criteria:
informativeness, representativeness, and diversity. The results presented
by Settles and Craven (2008) show that information density is the best
criterion for sequence labeling. Tong and Koller (2002) propose three
selection strategies that are specific to SVM learners and are based on
different measures taking into consideration the distances to the
decision hyperplane and margins.

   Many NLP tasks suffer from unbalanced data. Ertekin et al. (2007) show
that selecting examples within the margin overcomes the problem of
unbalanced data.

   The previously cited selection strategies are often applied to binary
classification and do not take into account the predicted class. In this
work we are interested in multi-class classification tasks, and in the
problem of unbalanced data and dominant classes in terms of performance.

   Esuli and Sebastiani (2009) define three criteria that they combine to
create different selection strategies in the context of multi-label text
classification. The criteria are based on the confidence of the system
for each label, a combination of the confidence of each class for one
document, and a weight (based on the F1-measure) assigned to each class
to distinguish those for which the system performs badly. They show that
in most cases this last criterion does not improve the selection.

   Our applicative context is a bit different as we are not working on a
multi-label task. Instead of computing a weight according to the
F1-measure, we experimented with a change of strategy where we focus on a
single class.

3   Experimental Setting

The context of our study was the development of a supervised sentiment
analysis system that classifies tweets into one of the following four
classes: positive, negative, neutral, and n/a (i.e. not applicable).

   The manual annotation of the data was mainly performed by 25 3rd and
4th year students from local high schools who were doing a one-week group
internship at Fondazione Bruno Kessler.

   We created an initial training set using an AL mechanism that selects
the samples with the lowest system confidence [1], i.e. those closer to
the hyperplane and therefore most difficult to classify. In the following
we describe the sentiment analysis system, the Active Learning process
and the creation of the test and the initial training set. Finally, we
introduce the experiments performed on selection strategies for Active
Learning.

    [1] The confidence score is computed as the average of the margin
    estimated by the SVM classifier for each entity.

Sentiment Analysis System. Our system for sentiment analysis is based on
a supervised machine learning method using the SVM-MultiClass tool
(Joachims et al., 2009) [2]. We extract the following features from each
tweet: the tokens composing the tweet, and the number of urls, hashtags,
and aliases it contains. It takes as input a tokenized tweet [3] and
returns as output its polarity.

    [2] https://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html
    [3] Tokenization is performed using the Twokenizer java library
    https://github.com/vinhkhuc/Twitter-Tokenizer/blob/master/src/Twokenizer.java
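
To make the feature set concrete, the following minimal Python sketch
builds a feature dictionary from an already tokenized tweet. The function
name, the simple prefix tests for urls, hashtags and aliases, and the toy
example are ours for illustration only; they do not reproduce the actual
TextPro/SVM-MultiClass feature encoding.

    from collections import Counter

    def tweet_features(tokens):
        """Toy feature extractor: bag of tokens plus counts of urls,
        hashtags and aliases (@-mentions), as listed above."""
        feats = dict(Counter(tokens))                       # token features
        feats["NUM_URLS"] = sum(t.startswith("http") for t in tokens)
        feats["NUM_HASHTAGS"] = sum(t.startswith("#") for t in tokens)
        feats["NUM_ALIASES"] = sum(t.startswith("@") for t in tokens)
        return feats

    # Example with a toy tokenized tweet (e.g. the output of Twokenizer):
    print(tweet_features(["#Sanremo2017", "che", "bella", "canzone", "!"]))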

AL Process. We used TextPro-AL, a platform which integrates an NLP
pipeline, an AL mechanism and an annotation interface (Magnini et al.,
2016). The AL process is as follows: (i) a large unlabeled dataset is
annotated by the sentiment analysis system (with a small temporary model
used to initialize the AL process [4]); (ii) samples are selected
according to a selection strategy; (iii) annotators annotate the selected
tweets; (iv) the new annotated samples are accumulated in the batch;
(v) when the batch is full the annotated data are added to the existing
training dataset and a new model is built; (vi) the unlabeled dataset is
annotated again using the newly built model and the cycle begins again
at (ii).

    [4] The temporary model has been built using 155 tweets annotated
    manually by one annotator. After the first step of the AL process,
    these tweets are removed from the training set.

   The unlabeled dataset consists of 400,000 tweets that contained the
hashtag #Sanremo2017. The maximum size of the batch is 120, so retraining
takes place every 120 annotated tweets.
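
Steps (i)-(vi) amount to the standard pool-based loop sketched below with
a batch size of 120. This is only an illustration: the select, annotate
and train arguments stand for the selection strategy, the human
annotators and SVM-MultiClass training respectively, and all names are
ours rather than the actual TextPro-AL API.

    def active_learning_loop(unlabeled, model, select, annotate, train,
                             batch_size=120, rounds=10):
        """Pool-based AL cycle corresponding to steps (i)-(vi) above."""
        training_set = []
        for _ in range(rounds):
            # (i) label the whole unlabeled pool with the current model
            predictions = [model(tweet) for tweet in unlabeled]
            # (ii) pick a batch of samples according to the selection strategy
            batch = select(unlabeled, predictions, batch_size)
            # (iii)-(iv) the selected tweets are annotated, accumulated,
            # and removed from the pool
            training_set.extend(annotate(batch))
            unlabeled = [t for t in unlabeled if t not in batch]
            # (v) once the batch is full, retrain on all annotated data
            model = train(training_set)
            # (vi) the new model re-labels the pool at the next iteration
        return model, training_set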

Training and Performance. The initial training set, whose creation
required half a day of work [5], is composed of 2,702 tweets. The class
negative is the most represented, covering almost 40% of the total,
compared to around 30% for positive. The distribution of the two minor
classes is rather close, with 18% for neutral and 13% for n/a.

    [5] The 25 high school students worked in pairs and trios, for a
    total of 12 groups.

   As a test set we used 1,136 tweets randomly selected from among all
the tweets which mentioned either a Sanremo song or singer. The test set
was annotated partly by the high school students (656 tweets) and partly
by two expert annotators (480 tweets); each tweet was annotated with the
same category by at least two annotators. 58% of the tweets are positive,
20% are negative, 14% are neutral, and 8% are n/a.

   We built the test set selecting the tweets randomly from the unlabeled
dataset in order to make it representative of the whole dataset.

   The overall performance of the system trained on the initial set is
40.7 in terms of F1 (see Eval2702 in Table 1). The F1 obtained on the two
main categories, i.e. positive and negative, is 54.5, but the system
performs more poorly on negative than on positive, with F1-measures of
33.6 and 75.4 respectively.

Experiment. As the evaluation showed good results on positive but poor
results on negative, we devised and tested novel selection strategies
better able to balance the performance of the system over the two
classes. We divided the 25 annotators into three different groups: each
group annotated 775 tweets. The tweets annotated by the first group were
selected with the same strategy used before, whereas for the other two
groups we implemented two new selection strategies taking into account
not only the confidence of the system but also the class it assigns to a
tweet. As a result we obtained three different extensions of the same
size and were thus able to compare the performance of the system trained
on the initial training set plus each of the extensions.

4   Selection Strategies

We tested three selection strategies that take into account the
classification proposed by the system in order to select the most useful
samples to improve the distinction between positive and negative; a
compact sketch of the three strategies is given at the end of this
section.

S1: low confidence. The first strategy we tested is the baseline
strategy, which selects tweets classified by the system with the lowest
confidence. The low confidence strategy was also used to build the
initial training set (S0: lowC) as described in Section 3.

S2: NEGATIVE with high confidence. The second strategy consists of
selecting the samples classified as negative with the highest confidence.
We assume that this will increase the number of negative tweets selected,
thus enabling us to improve the performance of the system on the negative
class. Nevertheless, as the system has a high confidence on the
classification of these tweets, through this strategy we are adding easy
examples to the training set that the system is probably already able to
classify correctly.

S3: POSITIVE with low confidence. The third strategy aims at selecting
the positive tweets for which the system has the lowest confidence. We
expect in this way to get the difficult cases, i.e. tweets that are close
to the hyperplane and that are classified as positive but whose
classification has a high chance of being incorrect.

   As the initial system has high recall (82.8) but low precision (69.3)
for the class positive, we assume that it needs to improve on the
examples wrongly classified as positive. We expect that among the tweets
wrongly classified as positive we will find difficult cases of negative
tweets which will help to improve the system on the negative class. On
the other hand, recall for the negative class is low (25.7), whereas
precision is slightly better (48.7), which is why we decided to extract
positive tweets with low confidence instead of negative tweets with low
confidence.
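
The three strategies differ only in whether the pool of automatically
labelled tweets is first filtered by the predicted class and in the
direction of the confidence ordering, where the confidence score is the
average SVM margin mentioned in footnote [1]. The sketch below assumes
each pool item is a (tweet, predicted_label, confidence) triple; the
function and variable names are ours, not part of the actual system.

    def select_batch(pool, batch_size=120, target_label=None, lowest=True):
        """Category-driven selection: optionally keep only the samples
        predicted as target_label, then take the batch_size samples with
        the lowest (or highest) confidence score."""
        candidates = [x for x in pool
                      if target_label is None or x[1] == target_label]
        candidates.sort(key=lambda x: x[2], reverse=not lowest)
        return candidates[:batch_size]

    # S1: lowest confidence, regardless of the predicted class (baseline)
    s1 = lambda pool: select_batch(pool)
    # S2: tweets predicted as negative, with the highest confidence
    s2 = lambda pool: select_batch(pool, target_label="negative", lowest=False)
    # S3: tweets predicted as positive, with the lowest confidence
    s3 = lambda pool: select_batch(pool, target_label="positive", lowest=True)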

5   Results and Discussion

In Table 1 we present the results (in terms of F1) obtained by the system
using the additional training data selected through the three different
selection strategies described above. In order to facilitate the
interpretation of the results, we also report the performance obtained by
the system trained only on the initial set of 2,702 tweets. Additionally,
in Table 2, we give the results obtained by the system for each
configuration also in terms of recall and precision (besides F1).

   The first four lines of Table 1 report the results for each of the
four categories, while the last two lines report respectively the
macro-average F1 over the four classes and the macro-average F1 over the
two most important classes, i.e. positive and negative. For each
selection strategy, we indicate the difference in performance obtained
with respect to the system trained on the initial set, as well as the
number of annotated tweets that have been added.

                          Eval2702            Experiment on selection strategies
 Strategy used            S0: lowC         S1: lowC        S2: NEG-highC    S3: POS-lowC
                          F1    tweets     F1     tweets   F1     tweets    F1     tweets
 NEGATIVE                 33.6  1,080      34.8   1,374    32.0   1,669     39.3   1,299
              wrt S0      -     -          (+1.2) (+294)   (-1.6) (+589)    (+5.7) (+219)
 POSITIVE                 75.4  798        74.8   975      74.8   869       76.5   1,065
              wrt S0      -     -          (-0.6) (+177)   (-0.6) (+71)     (+1.1) (+267)
 NEUTRAL                  22.3  476        20.9   595      23.3   567       24.6   672
              wrt S0      -     -          (-1.4) (+119)   (+1.0) (+91)     (+2.3) (+196)
 N/A                      31.3  348        28.6   533      27.6   372       28.6   441
              wrt S0      -     -          (-2.7) (+185)   (-3.7) (+24)     (-2.7) (+93)
 Average 4 classes        40.7  2,702      39.8   3,477    39.4   3,477     42.3   3,477
              wrt S0      -     -          (-0.9) (+775)   (-1.3) (+775)    (+1.6) (+775)
 Average POS/NEG          54.5  -          54.8   -        53.4   -         57.9   -
              wrt S0      -     -          (+0.3) -        (-1.1) -         (+3.4) -

Table 1: Performance of the system trained on 2,702 tweets and performance of the
system trained on the same set of data incremented with 775 tweets selected through
three different selection strategies.


                          Eval2702             Experiment on selection strategies
 Strategy used            S0: lowC           S1: lowC          S2: NEG-highC      S3: POS-lowC
                        R     P     F1     R     P     F1     R     P     F1     R     P     F1
 NEGATIVE              25.7  48.7  33.6   28.4  45.0  34.8   24.3  46.6  32.0   30.6  54.8  39.3
 POSITIVE              82.8  69.3  75.4   81.6  69.0  74.8   82.2  68.7  74.8   85.3  69.3  76.5
 NEUTRAL               20.1  25.0  22.3   17.7  25.4  20.9   20.7  26.6  23.3   21.3  29.2  24.6
 N/A                   32.6  30.0  31.3   30.4  26.9  28.6   29.3  26.0  27.6   27.2  30.1  28.6
 Average 4 classes     40.3  43.2  40.7   39.5  41.6  39.8   39.2  41.9  39.4   41.1  45.9  42.3
 Average POS/NEG       54.3  59.0  54.5   55.0  57.0  54.8   53.3  57.6  53.4   57.9  62.1  57.9

Table 2: Performance in terms of recall, precision and F1 of the system trained on the
different training sets. The last two lines are the averages of recall, precision and F1
over 4 and 2 classes respectively.
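
The two “Average” rows in both tables are plain macro-averages of the
per-class scores; as a quick sanity check, the values in the S0 column of
Table 1 can be recomputed as follows.

    # Per-class F1 of the system trained on the initial 2,702 tweets (Table 1, S0)
    f1 = {"negative": 33.6, "positive": 75.4, "neutral": 22.3, "n/a": 31.3}

    macro_4 = sum(f1.values()) / len(f1)                   # 40.65 -> 40.7 as reported
    macro_pos_neg = (f1["positive"] + f1["negative"]) / 2  # 54.5, as in the last row
    print(macro_4, macro_pos_neg)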



   With the baseline strategy (S1: lowC, i.e., selection of the tweets
for which the system has the lowest confidence) the performance of the
system decreases slightly, from an F1 of 40.7 to an F1 of 39.8. The
largest share of the added samples are negative tweets (38%), which
enables the system to increase its performance on this class by 1.2
points.

   When using the second strategy (S2: NEG-highC, i.e. selection of the
negative tweets with the highest confidence), 76% of the new tweets are
negative, but the performance of the system on this class decreases. Even
the overall performance of the system decreases, despite adding 775
tweets.

   We observe that the best strategy is S3 (POS-lowC, i.e., selection of
the positive tweets with the lowest confidence), with an improvement of
the macro-average F1-measure over the 4 classes by 1.6 points and over
the positive and negative classes by 3.4 points. Although we add more
positive than negative tweets to the training data (34%), the performance
of the system on the negative class increases as well, from an F1 of 33.6
to an F1 of 39.3. This strategy worked very well in enabling us to select
the examples which help the system discriminate between the two main
classes.

6   Application: Sanremo’s Ranking

After evaluating the three different selection strategies, we trained a
new model using all the tweets that had been annotated. With this new
model, as expected, we obtained the best results: the average F-measure
on the negative and positive classes is 58.2, and the average F-measure
over the 4 classes is 42.1.

   For the annotation used to produce the automatic ranking, we provided
the system with some gazetteers, i.e. a list of words that carry positive
polarity and a list of words that carry negative polarity. We thus
obtained a small improvement in system performance, with an F1 of 42.8 on
the average of the four classes and an F1 of 58.3 on the average of
positive and negative.

   As explained in the Introduction, the applicative scope of our work
was to rank the songs competing in Sanremo 2017. For this, we used only
the total number of tweets talking about each singer and the polarity
assigned to each tweet by the system. In total we had 118,000 tweets
containing either a reference to a competing singer or song that had been
annotated automatically by the sentiment analysis system. By ranking the
singers according to the proportion of positive tweets each of them
received, we were able to identify 4 out of the top 5 songs and 4 out of
the 5 last place songs. In Table 3, we show the official ranking versus
the automatic ranking. The Spearman’s rank correlation coefficient
between the official ranking and our ranking is 0.83, and the Kendall’s
tau coefficient is 0.67.

     Singer                  Official   System
     Francesco Gabbani          1          8
     Fiorella Mannoia           2          4
     Ermal Meta                 3          1
     Michele Bravi              4          2
     Paola Turci                5          5
     Sergio Sylvestre           6          6
     Fabrizio Moro              7          3
     Elodie                     8          9
     Bianca Atzei               9         13
     Samuel                    10          7
     Michele Zarrillo          11         10
     Lodovica Comello          12         12
     Marco Masini              13         14
     Chiara                    14         11
     Alessio Bernabei          15         16
     Clementino                16         15

Table 3: Sanremo’s official ranking and the ranking produced by our
system.
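
As a sketch of how the comparison between the two rankings can be
reproduced, the script below recomputes the two coefficients from the
rank columns of Table 3, using scipy in place of whatever tool the
coefficients were originally computed with, and illustrates the ranking
rule described above; the helper function and its input format are
hypothetical, since the per-singer tweet counts are not reported here.

    from scipy.stats import spearmanr, kendalltau

    # Official vs. system ranks, copied from Table 3 (same row order)
    official = list(range(1, 17))
    system = [8, 4, 1, 2, 5, 6, 3, 9, 13, 7, 10, 12, 14, 11, 16, 15]

    rho, _ = spearmanr(official, system)
    tau, _ = kendalltau(official, system)
    print(round(rho, 2), round(tau, 2))   # 0.83 and 0.67, as reported

    # Ranking rule: singers ordered by their share of positive tweets.
    def rank_by_positive_share(counts):
        """counts: {singer: (n_positive_tweets, n_total_tweets)} (hypothetical)"""
        share = {s: pos / tot for s, (pos, tot) in counts.items()}
        return sorted(share, key=share.get, reverse=True)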

7   Conclusion

We have presented a comparative study of three AL selection strategies.
We have shown that a strategy that takes into account both the
automatically assigned category and the system’s confidence performs well
in the case of unbalanced performance over the different classes.

   To complete our study it would be interesting to perform further
experiments on other multi-class classification problems. Unfortunately
this study required intensive annotation work, so its replication on
other tasks would be very expensive. A lot of work on Active Learning has
been done using existing annotated corpora, but we think that this is too
far removed from a real annotation situation, as the datasets used are
generally limited in terms of size.

   In order to test different selection strategies, we have evaluated the
sentiment analysis system against a gold standard, but we have also
performed an application-oriented evaluation by ranking the songs
participating in Sanremo 2017.

   As future work, we want to explore the possibility of automatically
adapting the selection strategies while annotating. For example, if the
performance of the classifier on one class is low, the strategy in use
could be changed in order to select the samples needed to improve on that
class.

Acknowledgments

This work has been partially funded by the EuclipRes project, under the
program Bando Innovazione 2016 of the Autonomous Province of Bolzano. We
also thank the high school students who contributed to this study with
their annotation work within the FBK Junior initiative.

References

David Cohn, Richard Ladner, and Alex Waibel. 1994. Improving
  generalization with active learning. In Machine Learning, pages
  201–221.

Seyda Ertekin, Jian Huang, Léon Bottou, and C. Lee Giles. 2007. Learning
  on the border: active learning in imbalanced data classification. In
  Mário J. Silva, Alberto H. F. Laender, Ricardo A. Baeza-Yates, Deborah
  L. McGuinness, Bjørn Olstad, Øystein Haug Olsen, and André O. Falcão,
  editors, Proceedings of the Sixteenth ACM Conference on Information and
  Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007,
  pages 127–136. ACM.

Andrea Esuli and Fabrizio Sebastiani. 2009. Active learning strategies
  for multi-label text classification. In Mohand Boughanem, Catherine
  Berrut, Josiane Mothe, and Chantal Soulé-Dupuy, editors, Advances in
  Information Retrieval, 31st European Conference on IR Research, ECIR
  2009, Toulouse, France, April 6-9, 2009. Proceedings, volume 5478 of
  Lecture Notes in Computer Science, pages 102–113. Springer.

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. 2009.
  Cutting-plane training of structural SVMs. Mach. Learn., 77(1):27–59,
  October.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for
  training text classifiers. In Proceedings of the International ACM
  SIGIR Conference on Research and Development in Information Retrieval
  (SIGIR), pages 3–12, New York, NY, USA. Springer-Verlag New York, Inc.

Bernardo Magnini, Anne-Lyse Minard, Mohammed R. H. Qwaider, and Manuela
  Speranza. 2016. TextPro-AL: An Active Learning Platform for Flexible
  and Efficient Production of Training Data for NLP Tasks. In Proceedings
  of COLING 2016, the 26th International Conference on Computational
  Linguistics: System Demonstrations.

Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc
  Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. 2007. Active
  learning for part-of-speech tagging: Accelerating corpus annotation. In
  Proceedings of the Linguistic Annotation Workshop, LAW ’07, pages
  101–108, Stroudsburg, PA, USA. Association for Computational
  Linguistics.

Greg Schohn and David Cohn. 2000. Less is more: Active learning with
  support vector machines. In Proceedings of the Seventeenth
  International Conference on Machine Learning, ICML ’00, pages 839–846,
  San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Burr Settles and Mark Craven. 2008. An analysis of active learning
  strategies for sequence labeling tasks. In 2008 Conference on Empirical
  Methods in Natural Language Processing, EMNLP 2008, Proceedings of the
  Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of
  SIGDAT, a Special Interest Group of the ACL, pages 1070–1079. ACL.

Burr Settles. 2010. Active learning literature survey. Technical report.

Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. 2004.
  Multi-criteria-based active learning for named entity recognition. In
  Proceedings of the 42nd Annual Meeting of the Association for
  Computational Linguistics, ACL ’04, Stroudsburg, PA, USA. Association
  for Computational Linguistics.

Simon Tong and Daphne Koller. 2002. Support vector machine active
  learning with applications to text classification. J. Mach. Learn.
  Res., 2:45–66, March.