=Paper=
{{Paper
|id=Vol-2006/paper029
|storemode=property
|title=Sanremo's Winner Is... Category-driven Selection Strategies for Active Learning
|pdfUrl=https://ceur-ws.org/Vol-2006/paper029.pdf
|volume=Vol-2006
|authors=Anne-Lyse Minard,Manuela Speranza,Mohammed R. H. Qwaider,Bernardo Magnini
|dblpUrl=https://dblp.org/rec/conf/clic-it/MinardSQM17
}}
==Sanremo's Winner Is... Category-driven Selection Strategies for Active Learning==
Sanremo's winner is... Category-driven Selection Strategies for Active Learning

Anne-Lyse Minard, Manuela Speranza, Mohammed R. H. Qwaider, Bernardo Magnini
Fondazione Bruno Kessler, Trento, Italy
{minard,manspera,qwaider,magnini}@fbk.eu

Abstract

English. This paper compares Active Learning selection strategies for sentiment analysis of Twitter data. We focus mainly on category-driven strategies, which select training instances taking into consideration the confidence of the system as well as the category of the tweet (e.g. positive or negative). We show that this combination is particularly effective when the performance of the system is unbalanced over the different categories. This work was conducted in the framework of automatically ranking the songs of "Festival di Sanremo 2017" based on sentiment analysis of the tweets posted during the contest.

Italian. This work compares Active Learning selection strategies for the sentiment analysis of tweets, focusing on category-driven strategies. We select training instances by combining the category of the tweet (e.g. positive or negative) with the confidence of the system. This combination is particularly effective when the distribution of the categories is unbalanced. The goal of this work was the ranking of the songs of "Festival di Sanremo 2017" based on sentiment analysis of the tweets posted during the event.

1 Introduction

Active Learning (AL) is a well known technique for the selection of training samples to be annotated by a human when developing a supervised machine learning system. AL allows for the collection of more useful training data, while at the same time reducing the annotation effort (Cohn et al., 1994). In the AL framework samples are usually selected according to several criteria, such as informativeness, representativeness, and diversity (Shen et al., 2004).

This paper investigates AL selection strategies that consider the categories the current classifier assigns to samples, combined with the confidence of the classifier on the same samples. We are interested in understanding whether these strategies are effective, particularly when category distribution and category performance are unbalanced. By comparing several options, we show that selecting low confidence samples of the category with the highest performance is a better strategy than selecting high confidence samples of the category with the lowest performance.

The context of our study is the development of a sentiment analysis system that classifies tweets in Italian. We used the system to automatically rank the songs of Sanremo 2017 based on the sentiment of the tweets posted during the contest.

The paper is structured as follows. In Section 2 we give an overview of the state of the art in selection strategies for AL. Then we present our experimental setting (Section 3) before detailing the tested selection strategies (Section 4). Finally, we describe the results of our experiment in Section 5 and the application of the system to ranking Sanremo's songs in Section 6.

2 Related Work

AL (Cohn et al., 1994; Settles, 2010) provides a well known methodology for reducing the amount of human supervision (and the corresponding cost) for the production of the training datasets necessary in many Natural Language Processing tasks. An incomplete list of references includes Shen et al. (2004) for Named Entity Recognition, Ringger et al. (2007) for PoS Tagging, and Schohn and Cohn (2000) for Text Classification.

AL methods are based on strategies for sample selection. Although there are two main types of selection methods, certainty-based and committee-based, here we concentrate only on certainty-based selection methods. The main certainty-based strategy used is the uncertainty sampling method (Lewis and Gale, 1994). Shen et al. (2004) propose a strategy which is based on the combination of several criteria: informativeness, representativeness, and diversity. The results presented by Settles and Craven (2008) show that information density is the best criterion for sequence labeling. Tong and Koller (2002) propose three selection strategies that are specific to SVM learners and are based on different measures taking into consideration the distances to the decision hyperplane and the margins.

Many NLP tasks suffer from unbalanced data. Ertekin et al. (2007) show that selecting examples within the margin overcomes the problem of unbalanced data.

The previously cited selection strategies are often applied to binary classification and do not take into account the predicted class. In this work we are interested in multi-class classification tasks, and in the problem of unbalanced data and dominant classes in terms of performance.

Esuli and Sebastiani (2009) define three criteria that they combine to create different selection strategies in the context of multi-label text classification. The criteria are based on the confidence of the system for each label, on a combination of the confidence of each class for one document, and on a weight (based on the F1-measure) assigned to each class to distinguish those for which the system performs badly. They show that in most cases this last criterion does not improve the selection.

Our applicative context is slightly different, as we are not working on a multi-label task. Instead of computing a weight according to the F1-measure, we experimented with a change of strategy in which we focus on a single class.
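To ground the discussion, the following is a minimal sketch of margin-based uncertainty sampling, the certainty-based baseline discussed above. It assumes scikit-learn's LinearSVC as a stand-in for the SVM tools cited in this section, and all function and variable names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_uncertain(model, X_pool, k):
    """Return indices of the k pool samples with the smallest classification margin."""
    scores = model.decision_function(X_pool)         # (n_samples, n_classes)
    top2 = np.partition(scores, -2, axis=1)[:, -2:]  # two highest class scores per sample
    margin = top2[:, 1] - top2[:, 0]                 # gap between best and second-best class
    return np.argsort(margin)[:k]                    # smallest gap = most uncertain

# toy usage: random vectors standing in for tweet features, 4 polarity classes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 20)), rng.integers(0, 4, size=200)
model = LinearSVC().fit(X, y)
print(select_uncertain(model, rng.normal(size=(50, 20)), k=5))
```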
3 Experimental Setting

The context of our study was the development of a supervised sentiment analysis system that classifies tweets into one of the following four classes: positive, negative, neutral, and n/a (i.e. not applicable). The manual annotation of the data was mainly performed by 25 3rd and 4th year students from local high schools who were doing a one-week internship at Fondazione Bruno Kessler. We created an initial training set using an AL mechanism that selects the samples with the lowest system confidence [1], i.e. those closer to the hyperplane and therefore most difficult to classify. In the following we describe the sentiment analysis system, the Active Learning process, and the creation of the test set and the initial training set. Finally, we introduce the experiments performed on selection strategies for Active Learning.

[1] The confidence score is computed as the average of the margins estimated by the SVM classifier for each entity.

Sentiment Analysis System. Our system for sentiment analysis is based on a supervised machine learning method using the SVM-MultiClass tool (Joachims et al., 2009) [2]. It takes as input a tokenized tweet [3] and returns as output its polarity. We extract the following features from each tweet: the tokens composing the tweet, and the number of urls, hashtags, and aliases it contains.

[2] https://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html
[3] Tokenization is performed using the Twokenizer java library: https://github.com/vinhkhuc/Twitter-Tokenizer/blob/master/src/Twokenizer.java
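A minimal sketch of this feature set, assuming the tweet is already tokenized (the actual system uses the Twokenizer library and SVM-MultiClass; the helper below and its feature names are illustrative only):

```python
import re
from collections import Counter

def extract_features(tokens):
    """Bag of tokens plus counts of urls, hashtags, and aliases (@-mentions)."""
    feats = Counter(tokens)  # the tokens composing the tweet
    feats["__n_urls__"] = sum(bool(re.match(r"https?://", t)) for t in tokens)
    feats["__n_hashtags__"] = sum(t.startswith("#") for t in tokens)
    feats["__n_aliases__"] = sum(t.startswith("@") for t in tokens)
    return feats

# the tweet is assumed to be tokenized already (the paper tokenizes with Twokenizer)
print(extract_features("#Sanremo2017 che emozione @FiorellaMannoia https://t.co/x".split()))
```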
AL Process. We used TextPro-AL, a platform which integrates an NLP pipeline, an AL mechanism and an annotation interface (Magnini et al., 2016). The AL process is as follows: (i) a large unlabeled dataset is annotated by the sentiment analysis system (with a small temporary model used to initialize the AL process [4]); (ii) samples are selected according to a selection strategy; (iii) annotators annotate the selected tweets; (iv) the new annotated samples are accumulated in the batch; (v) when the batch is full the annotated data are added to the existing training dataset and a new model is built; (vi) the unlabeled dataset is annotated again using the newly built model and the cycle begins again at (ii).

[4] The temporary model has been built using 155 tweets annotated manually by one annotator. After the first step of the AL process, these tweets are removed from the training set.

The unlabeled dataset consists of 400,000 tweets that contained the hashtag #Sanremo2017. The maximum size of the batch is 120, so retraining takes place every 120 annotated tweets.
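The six-step loop can be summarized roughly as follows; this is a schematic re-creation rather than the TextPro-AL implementation, and select_batch, annotate and train are placeholders for the real components.

```python
def active_learning_loop(model, pool, training_set, select_batch, annotate, train,
                         batch_size=120):
    """Schematic re-creation of the six-step AL loop described above."""
    while pool:
        predictions = [model(x) for x in pool]               # (i) annotate the pool automatically
        chosen = set(select_batch(predictions, batch_size))  # (ii) apply a selection strategy
        batch = [(pool[i], annotate(pool[i])) for i in chosen]  # (iii)-(iv) human labels, batched
        pool = [x for i, x in enumerate(pool) if i not in chosen]
        training_set.extend(batch)                           # (v) extend training data when full
        model = train(training_set)                          #     and build a new model
        # (vi) the next iteration re-annotates the remaining pool with the new model
    return model
```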
Training and Performance. The initial training set, whose creation required half a day of work [5], is composed of 2,702 tweets. The class negative is the most represented, covering almost 40% of the total, compared to around 30% for positive. The distribution of the two minor classes is rather close, with 18% for neutral and 13% for n/a.

[5] The 25 high school students worked in pairs and trios, for a total of 12 groups.

As a test set we used 1,136 tweets randomly selected from among all the tweets which mentioned either a Sanremo song or singer. We selected the tweets randomly from the unlabeled dataset in order to make the test set representative of the whole dataset. The test set was annotated partly by the high school students (656 tweets) and partly by two expert annotators (480 tweets); each tweet was annotated with the same category by at least two annotators. 58% of the tweets are positive, 20% are negative, 14% are neutral, and 8% are n/a.

The overall performance of the system trained on the initial set is 40.7 in terms of F1 (see Eval2702 in Table 1). The F1 obtained on the two main categories, i.e. positive and negative, is 54.5, but the system performs more poorly on negative than on positive, with F1-measures of 33.6 and 75.4 respectively.

Experiment. As the evaluation showed good results on positive but poor results on negative, we devised and tested novel selection strategies better able to balance the performance of the system over the two classes. We divided the 25 annotators into three different groups; each group annotated 775 tweets. The tweets annotated by the first group were selected with the same strategy used before, whereas for the other two groups we implemented two new selection strategies taking into account not only the confidence of the system but also the class it assigns to a tweet. As a result we obtained three different extensions of the same size and were thus able to compare the performance of the system trained on the initial training set plus each of the extensions.

4 Selection Strategies

We tested three selection strategies that take into account the classification proposed by the system in order to select the most useful samples to improve the distinction between positive and negative.

S1: low confidence. The first strategy we tested is the baseline strategy, which selects the tweets classified by the system with the lowest confidence. The low confidence strategy was also used to build the initial training set (S0: lowC), as described in Section 3.

S2: NEGATIVE with high confidence. The second strategy consists of selecting the samples classified as negative with the highest confidence. We assume that this will increase the amount of negative tweets selected, thus enabling us to improve the performance of the system on the negative class. Nevertheless, as the system has a high confidence on the classification of these tweets, through this strategy we are adding easy examples to the training set that the system is probably already able to classify correctly.

S3: POSITIVE with low confidence. The third strategy aims at selecting the positive tweets for which the system has the lowest confidence. We expect in this way to get the difficult cases, i.e. tweets that are close to the hyperplane and that are classified as positive but whose classification has a high chance of being incorrect. As the initial system has high recall (82.8) but low precision (69.3) for the class positive, we assume that it needs to improve on the examples wrongly classified as positive. We expect that among the tweets wrongly classified as positive we will find difficult cases of negative tweets which will help to improve the system on the negative class. On the other hand, recall for the negative class is low (25.7), whereas precision is slightly better (48.7), which is why we decided to extract positive tweets with low confidence instead of negative tweets with low confidence.
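Each of the three strategies reduces to a few lines of code. The sketch below is our illustrative rendering (names are ours), operating on parallel arrays of system confidences and predicted classes:

```python
import numpy as np

def select_S1_lowC(conf, labels, k):
    """S1: the k lowest-confidence tweets, whatever their predicted class."""
    return np.argsort(conf)[:k]

def select_S2_NEG_highC(conf, labels, k):
    """S2: the k tweets predicted 'negative' with the highest confidence."""
    neg = np.flatnonzero(labels == "negative")
    return neg[np.argsort(conf[neg])[::-1][:k]]

def select_S3_POS_lowC(conf, labels, k):
    """S3: the k tweets predicted 'positive' with the lowest confidence."""
    pos = np.flatnonzero(labels == "positive")
    return pos[np.argsort(conf[pos])[:k]]

# toy usage with made-up confidences and predicted classes
conf = np.array([0.9, 0.1, 0.4, 0.8, 0.2])
labels = np.array(["positive", "negative", "positive", "negative", "positive"])
print(select_S3_POS_lowC(conf, labels, k=2))   # indices of the low-confidence positives
```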
5 Results and Discussion

In Table 1 we present the results (in terms of F1) obtained by the system using the additional training data selected through the three different selection strategies described above. In order to facilitate the interpretation of the results, we also report the performance obtained by the system trained only on the initial set of 2,702 tweets. Additionally, in Table 2, we give the results obtained by the system for each configuration also in terms of recall and precision (besides F1).

The first four lines of each table report the results for each of the four categories, while the last two lines report respectively the macro-average F1 over the four classes and the macro-average F1 over the two most important classes, i.e. positive and negative. For each selection strategy, we indicate the difference in performance obtained with respect to the system trained on the initial set, as well as the number of annotated tweets that have been added.

                     Eval2702         Experiment on selection strategies
Strategy used        S0: lowC         S1: lowC          S2: NEG-highC     S3: POS-lowC
                     F1      tweets   F1      tweets    F1      tweets    F1      tweets
NEGATIVE             33.6    1,080    34.8    1,374     32.0    1,669     39.3    1,299
  wrt S0             -       -        (+1.2)  (+294)    (-1.6)  (+589)    (+5.7)  (+219)
POSITIVE             75.4    798      74.8    975       74.8    869       76.5    1,065
  wrt S0             -       -        (-0.6)  (+177)    (-0.6)  (+71)     (+1.1)  (+267)
NEUTRAL              22.3    476      20.9    595       23.3    567       24.6    672
  wrt S0             -       -        (-1.4)  (+119)    (+1.0)  (+91)     (+2.3)  (+196)
N/A                  31.3    348      28.6    533       27.6    372       28.6    441
  wrt S0             -       -        (-2.7)  (+185)    (-3.7)  (+24)     (-2.7)  (+93)
Average 4 classes    40.7    2,702    39.8    3,477     39.4    3,477     42.3    3,477
  wrt S0             -       -        (-0.9)  (+775)    (-1.3)  (+775)    (+1.6)  (+775)
Average POS/NEG      54.5    -        54.8    -         53.4    -         57.9    -
  wrt S0             -       -        (+0.3)  -         (-1.1)  -         (+3.4)  -

Table 1: Performance of the system trained on 2,702 tweets and performance of the system trained on the same set of data incremented with 775 tweets selected through three different selection strategies.

                     Eval2702          Experiment on selection strategies
Strategy used        S0: lowC          S1: lowC           S2: NEG-highC      S3: POS-lowC
                     R     P     F1    R     P     F1     R     P     F1     R     P     F1
NEGATIVE             25.7  48.7  33.6  28.4  45.0  34.8   24.3  46.6  32.0   30.6  54.8  39.3
POSITIVE             82.8  69.3  75.4  81.6  69.0  74.8   82.2  68.7  74.8   85.3  69.3  76.5
NEUTRAL              20.1  25.0  22.3  17.7  25.4  20.9   20.7  26.6  23.3   21.3  29.2  24.6
N/A                  32.6  30.0  31.3  30.4  26.9  28.6   29.3  26.0  27.6   27.2  30.1  28.6
Average 4 classes    40.3  43.2  40.7  39.5  41.6  39.8   39.2  41.9  39.4   41.1  45.9  42.3
Average POS/NEG      54.3  59.0  54.5  55.0  57.0  54.8   53.3  57.6  53.4   57.9  62.1  57.9

Table 2: Performance in terms of recall, precision and F1 of the system trained on the different training sets. The last two lines are the averages of recall, precision and F1 over the 4 and 2 classes.

With the baseline strategy (S1: lowC, i.e., selection of the tweets for which the system has the lowest confidence) the performance of the system decreases slightly, from an F1 of 40.7 to an F1 of 39.8. Most of the added samples are negative tweets (38%), which enables the system to increase its performance on this class by 1.2 points.

When using the second strategy (S2: NEG-highC, i.e. selection of the negative tweets with the highest confidence), 76% of the new tweets are negative, but the performance of the system on this class decreases. Even the overall performance of the system decreases, despite adding 775 tweets.

We observe that the best strategy is S3 (POS-lowC, i.e., selection of the positive tweets with the lowest confidence), with an improvement of the macro-average F1-measure over the 4 classes by 1.6 points and over the positive and negative classes by 3.4 points. Although we add more positive than negative tweets to the training data (34%), the performance of the system on the negative class increases as well, from an F1 of 33.6 to an F1 of 39.3. This strategy worked very well in enabling us to select the examples which help the system discriminate between the two main classes.
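Both reported averages are plain macro-averages of the per-class F1 scores; as a worked check, the S0 column of Table 1 gives:

```python
# Worked check of the two averages for the S0 column of Tables 1-2
f1 = {"negative": 33.6, "positive": 75.4, "neutral": 22.3, "n/a": 31.3}
macro_4 = sum(f1.values()) / len(f1)                   # 40.65, reported as 40.7
macro_pos_neg = (f1["positive"] + f1["negative"]) / 2  # 54.5
print(f"{macro_4:.2f} {macro_pos_neg:.1f}")
```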
6 Application: Sanremo's Ranking

After evaluating the three different selection strategies, we trained a new model using all the tweets that had been annotated. With this new model, as expected, we obtained the best results: the average F-measure on the negative and positive classes is 58.2, and the average F-measure over the 4 classes is 42.1.

For the annotation to be used for producing the automatic ranking, we provided the system with some gazetteers, i.e. a list of words that carry positive polarity and a list of words that carry negative polarity. We thus obtained a small improvement in system performance, with an F1 of 42.8 on the average of the four classes and an F1 of 58.3 on the average of positive and negative.

As explained in the Introduction, the applicative scope of our work was to rank the songs competing in Sanremo 2017. For this, we used only the total number of tweets talking about each singer and the polarity assigned to each tweet by the system. In total we had 118,000 tweets containing either a reference to a competing singer or song, which had been annotated automatically by the sentiment analysis system. By ranking the singers according to their proportion of positive tweets, we were able to identify 4 out of the top 5 songs and 4 out of the 5 last-place songs. In Table 3 we show the official ranking versus the automatic ranking. The Spearman's rank correlation coefficient between the official ranking and our ranking is 0.83, and the Kendall's tau coefficient is 0.67.

Singer               Official   System
Francesco Gabbani        1          8
Fiorella Mannoia         2          4
Ermal Meta               3          1
Michele Bravi            4          2
Paola Turci              5          5
Sergio Sylvestre         6          6
Fabrizio Moro            7          3
Elodie                   8          9
Bianca Atzei             9         13
Samuel                  10          7
Michele Zarrillo        11         10
Lodovica Comello        12         12
Marco Masini            13         14
Chiara                  14         11
Alessio Bernabei        15         16
Clementino              16         15

Table 3: Sanremo's official ranking and the ranking produced by our system.
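The ranking procedure and the two agreement figures can be reproduced as sketched below, using the ranks of Table 3 and assuming SciPy is available; rank_singers is our illustrative helper, not the actual pipeline.

```python
from scipy.stats import spearmanr, kendalltau

def rank_singers(classified_tweets):
    """classified_tweets: iterable of (singer, polarity) pairs as labelled by the system.
    Singers are ranked by their proportion of positive tweets, highest first."""
    totals, positives = {}, {}
    for singer, polarity in classified_tweets:
        totals[singer] = totals.get(singer, 0) + 1
        positives[singer] = positives.get(singer, 0) + (polarity == "positive")
    return sorted(totals, key=lambda s: positives[s] / totals[s], reverse=True)

# Agreement between the official ranks and the system ranks of Table 3
official = list(range(1, 17))
system = [8, 4, 1, 2, 5, 6, 3, 9, 13, 7, 10, 12, 14, 11, 16, 15]
rho, _ = spearmanr(official, system)
tau, _ = kendalltau(official, system)
print(round(rho, 2), round(tau, 2))   # 0.83 and 0.67, as reported above
```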
7 Conclusion

We have presented a comparative study of three AL selection strategies. We have shown that a strategy that takes into account both the automatically assigned category and the system's confidence performs well in the case of unbalanced performance over the different classes.

To complete our study it would be interesting to perform further experiments on other multi-class classification problems. Unfortunately this work required intensive annotation work, so its replication on other tasks would be very expensive. A lot of work on Active Learning has been done using existing annotated corpora, but we think that this is too far from a real annotation situation, as the datasets used are generally limited in terms of size.

In order to test different selection strategies, we have evaluated the sentiment analysis system against a gold standard, but we have also performed an application-oriented evaluation by ranking the songs participating in Sanremo 2017.

As future work, we want to explore the possibility of automatically adapting the selection strategies while annotating. For example, if the performance of the classifier on one class is low, the strategy in use could be changed in order to select the samples needed to improve on that class.

Acknowledgments

This work has been partially funded by the EuclipRes project, under the program Bando Innovazione 2016 of the Autonomous Province of Bolzano. We also thank the high school students who contributed to this study with their annotation work within the FBK Junior initiative.

References

David Cohn, Richard Ladner, and Alex Waibel. 1994. Improving generalization with active learning. In Machine Learning, pages 201–221.

Seyda Ertekin, Jian Huang, Léon Bottou, and C. Lee Giles. 2007. Learning on the border: active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pages 127–136. ACM.

Andrea Esuli and Fabrizio Sebastiani. 2009. Active learning strategies for multi-label text classification. In Advances in Information Retrieval, 31st European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009, volume 5478 of Lecture Notes in Computer Science, pages 102–113. Springer.

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, October.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 3–12, New York, NY, USA. Springer-Verlag New York, Inc.

Bernardo Magnini, Anne-Lyse Minard, Mohammed R. H. Qwaider, and Manuela Speranza. 2016. TextPro-AL: An Active Learning Platform for Flexible and Efficient Production of Training Data for NLP Tasks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations.

Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. 2007. Active learning for part-of-speech tagging: Accelerating corpus annotation. In Proceedings of the Linguistic Annotation Workshop, LAW '07, pages 101–108, Stroudsburg, PA, USA. Association for Computational Linguistics.

Greg Schohn and David Cohn. 2000. Less is more: Active learning with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 839–846, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, 25-27 October 2008, Honolulu, Hawaii, USA, pages 1070–1079. ACL.

Burr Settles. 2010. Active learning literature survey. Technical report.

Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Simon Tong and Daphne Koller. 2002. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, March.