=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Rep-GobeilEt2014
|storemode=property
|title=Instance-based Learning for Tweet Categorization in CLEF RepLab 2014
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Rep-GobeilEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/GobeillGR14
}}
==Instance-based Learning for Tweet Categorization in CLEF RepLab 2014==
Julien Gobeill (1,2), Arnaud Gaudinat (1), Patrick Ruch (1,2)

(1) BiTeM group, HEG / HES-SO, University of Applied Sciences, 7 rte de Drize, 1227 Carouge, Switzerland
(2) SIBtex group, SIB Swiss Institute of Bioinformatics, 1 rue Michel-Servet, 1206 Genève, Switzerland
{julien.gobeill, arnaud.gaudinat, patrick.ruch}@hesge.ch

Abstract. BiTeM/SIBtex is a university research group with a strong background in Text Mining and Bibliomics, and a long tradition of participating in large evaluation campaigns. The CLEF RepLab 2014 Track was the occasion to integrate several local tools into a complete system for tweet monitoring and categorization based on instance-based learning. The algorithm we implemented was a k Nearest Neighbors (k-NN). Regarding the domain (automotive or banking) and the language (English or Spanish), the experiments showed that the categorizer was not affected by the choice of representation: even with all data merged into one single Knowledge Base (KB), the observed performances were close to those obtained with dedicated KBs. Furthermore, English training data in addition to the sparse Spanish data were useful for Spanish categorization (+14% accuracy for automotive, +26% for banking). Finally, our best official run ranked in the top five. Yet, performances suffered from an over-prediction of the most prevalent category, an issue of unbalanced labels that we were not able to address within the competition time. The algorithm showed the defects of its virtues: it was very robust, but not easy to improve. BiTeM/SIBtex tools for tweet monitoring are available on the DrugsListener Project page of the BiTeM website (http://bitem.hesge.ch/).

1 Introduction

BiTeM/SIBtex is a university research group with a strong background in Text Mining and Bibliomics, and a particular focus on clinical and biological data. Occasionally, the group is involved in studies with data from the intellectual property (granted patents) or social media (tweets and reviews) domains. The group also has a long tradition of participating in large evaluation campaigns, such as TREC, NTCIR or CLEF [1-4]. The CLEF RepLab 2014 Track was the occasion to integrate several local tools into a complete system, and to evaluate a simple and robust statistical approach for tweet classification in competition.

BiTeM/SIBtex only took part in the first task, Reputation Dimensions. The goal of the task was to perform text categorization on Twitter, i.e. to design a system able to assign a predefined category to a tweet. This category was one out of eight categories related to companies' reputations. All tweets dealt with entities from the automotive (20 entities) or the banking (11 entities) domain, and were in English (93%) or in Spanish (7%). For training purposes, participants were provided with approximately 15,000 tweets labeled by human experts (the training set). Additionally, participants were allowed to use provided sets of tweets related to the mentioned companies for incorporating domain knowledge. The systems then had to predict the correct categories for 32,000 unlabeled tweets (the test set). In this task, the main difficulty was to efficiently preprocess the text, as standard Natural Language Processing strategies can fail to deal with the short, noisy, and strongly contextualised nature of tweets.
Another difficulty was to efficiently learn from unbalanced classes: indeed, the "Products & Services" category was assigned to 44% of the training tweets, versus only 1% for the "Innovation" category. Finally, this was a multilingual task, but the language distribution also was unbalanced, with less than 10% of the learning instances in Spanish.

We applied a simple and robust statistical approach in order to design our system, based on instance-based learning for categorization purposes. Instance-based learning is a kind of machine learning that compares unseen instances with labelled instances contained in a Knowledge Base (KB). The instance-based learning algorithm we chose to implement is k Nearest Neighbors (k-NN). Three questions were investigated during this study:

- Q1: is it better to build one KB for each domain, or to merge automotive and banking data into the same KB?
- Q2: is it better to build one KB for each language, or to merge English and Spanish data into the same KB?
- Q3: as the labels are unbalanced, is it efficient to use weighting strategies for categorization?

2 Methods

2.1 Overall architecture of the system

Figure 1 illustrates the overall architecture of our system. The workflow is divided into two steps: the training phase (offline) and the test phase (online). Three independent components act cooperatively to preprocess the data (component 1), build the Knowledge Base (component 2) and classify tweets (component 3).

Figure 1: Overall architecture of the system

During the training phase, all tweets belonging to the training set were preprocessed by component 1, which applies several standard Natural Language Processing treatments along with a language detector. They were then indexed in one or several indexes by component 2, in order to build the KB. Component 2 is an Information Retrieval platform, which builds indexes for related-documents retrieval. During the test phase, all tweets belonging to the test set were also preprocessed by component 1. Then, for a given test tweet, component 3 (the k-NN) exploited the KB in order to retrieve the most similar tweets seen in the training data, and to infer a predicted category. Official runs were computed with the whole test set.

2.2 Data

A training set of approximately 15,000 labelled tweets was provided by the organizers. There was an average of 511 tweets per automotive entity, versus 485 per banking entity. Table 1 shows the average distribution of each category for a given entity.

Table 1: Average distribution of reputation labels in training entities.

Category             | Automotive (20 entities) | Banking (11 entities)
---------------------|--------------------------|----------------------
Citizenship          | 53                       | 104
Governance           | 2                        | 114
Innovation           | 8                        | 4
Leadership           | 4                        | 19
Performance          | 20                       | 49
Products & Services  | 338                      | 104
Undefined            | 75                       | 66
Workplace            | 10                       | 25
TOTAL                | 511                      | 485

The first observation from Table 1 is that classes are unbalanced. For the automotive domain, 66% of training tweets deal with Products & Services, while only 0.8% deal with Leadership. The second observation is that the distributions differ in the banking domain (e.g. only 21.4% for Products & Services). The distributions observed in the test set (not reported) were consistent with those observed in the training set.

Here is a representative example of a tweet:

208844584137134080: Me and a sexy BMW M3 at last nights shoot pic.twitter.com/ibW6sdXW

Tweets often contain metadata within tags, the most frequent being hyperlinks and emphasis. Moreover, they often lack proper punctuation.

2.3 Preprocessing

The goal of component 1 was to preprocess the tweets in order to obtain proper and efficient instances to index (for the training phase) or to search (for the test phase). For this purpose, a set of basic rules was applied. Tags were first discarded. Contents within an emphasis tag were repeated in order to be overweighted. Contents within a hyperlink tag were also repeated, and were preceded by the "HREF" mention. For language detection, we performed simple N-gram-based text categorization, based on the work of Cavnar and Trenkle [5]. This approach compares the n-gram frequency profile of a given text with profiles computed on large English and Spanish corpora, and is reported to reach an accuracy in the range of 92% to 99%. The n-gram profiles were taken from [6].
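To make these two preprocessing steps concrete, here is a minimal Python sketch under stated assumptions: the exact tag names (`<b>`, `<a ...>`) and the repetition factor are not given in the paper, and the language profiles passed in stand for the full English and Spanish profiles of [6]. The detector implements the out-of-place measure of Cavnar and Trenkle [5].

```python
import re
from collections import Counter

def preprocess(tweet):
    """Basic rules of Section 2.3; the tag names are assumptions."""
    emphasized = re.findall(r"<b>(.*?)</b>", tweet)      # emphasis contents
    links = re.findall(r"<a[^>]*>(.*?)</a>", tweet)      # hyperlink contents
    text = re.sub(r"<[^>]+>", " ", tweet)                # discard all tags
    text += "".join(" " + s for s in emphasized)         # repeat = overweight
    text += "".join(" HREF " + s for s in links)         # repeat, with "HREF" mention
    return re.sub(r"\s+", " ", text).strip()

def ngram_profile(text, n=3, top=300):
    """Ranked character n-gram profile (Cavnar & Trenkle)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def detect_language(text, lang_profiles, n=3):
    """Return the language whose profile minimizes the out-of-place distance."""
    doc = ngram_profile(text, n)
    distances = {}
    for lang, prof in lang_profiles.items():
        rank = {g: i for i, g in enumerate(prof)}
        # An n-gram absent from the profile gets the maximum penalty len(prof).
        distances[lang] = sum(abs(i - rank.get(g, len(prof)))
                              for i, g in enumerate(doc))
    return min(distances, key=distances.get)
```

In practice, `lang_profiles` would map "en" and "es" to profiles computed once on large reference corpora, as in [6].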
2.4 Indexing

The goal of component 2 was to build one or several indexes from the training data, in order to obtain a related-documents search engine. For this purpose, we used the Terrier platform [7], with its default stemming and stop-word removal, and a Poisson weighting scheme (PL2). Dealing with Q1 and Q2, we investigated several strategies and built several indexes:

- all: a unique index with all the training tweets;
- cars: an index with all tweets from the automotive domain;
- banks: an index with all tweets from the banking domain;
- cars_en: an index with all English tweets from the automotive domain;
- banks_en: an index with all English tweets from the banking domain;
- cars_es: an index with all Spanish tweets from the automotive domain;
- banks_es: an index with all Spanish tweets from the banking domain.

2.5 k-NN

The goal of component 3 was to categorize the tweets of the test set. For this purpose, we used a k-NN, a remarkably simple algorithm which assigns to a new text the categories that are most prevalent among the k most similar tweets contained in the KB [8]. Similar tweets were retrieved thanks to component 2. A score computer then inferred the category from the k most similar instances, following this formula:

\mathrm{predcat} = \arg\max_{c \in \{c_1, c_2, \dots, c_m\}} \sum_{x_i \in K} E(x_i, c) \cdot RSV(x_i)

where predcat is the predicted category for a test tweet, c_1, c_2, ..., c_m are the possible categories, K is the set of the k nearest neighbors of the test tweet, RSV(x_i) is the retrieval status value (i.e. the similarity score) given by component 2 for the neighbor x_i, and E(x_i, c) is 1 when x_i is of category c, and 0 otherwise.

Dealing with Q3, an additional score computation was tested in order to handle the issue of unbalanced labels with a k-NN. Several studies have addressed this issue [9-12]. Solutions vary from rebalancing the training data to injecting weights into the score computation, and the conclusions about how much the k-NN really suffers from unbalanced data are not clear-cut. Due to a lack of time, we investigated only one solution and chose to compute a weight associated with the local distribution of training tweets. The formula thus evolved into:

\mathrm{predcat} = \arg\max_{c \in \{c_1, c_2, \dots, c_m\}} \sum_{x_i \in K} E(x_i, c) \cdot RSV(x_i) \cdot W(x_i, k+d, c)

where d is a parameter and W(x_i, k+d, c) is the frequency of training tweets of category c in the set of the k+d nearest neighbors of x_i.
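As an illustration of the score computer, here is a minimal Python sketch of both variants of the formula above. Neighbor retrieval is assumed to be done by component 2 (Terrier with PL2), so neighbors arrive here as (id, category, RSV) triples ranked by decreasing RSV; the `training_neighborhood` callable, which returns the categories of a training tweet's own k+d nearest neighbors in the KB, is our own assumption.

```python
from collections import Counter, defaultdict

def knn_predict(neighbors, k):
    """Unweighted k-NN: sum the RSV of each of the k nearest neighbors
    into its category, then return the argmax category."""
    scores = defaultdict(float)
    for _, category, rsv in neighbors[:k]:
        scores[category] += rsv
    return max(scores, key=scores.get)

def knn_predict_weighted(neighbors, k, d, training_neighborhood):
    """Weighted variant: each vote is multiplied by W(x_i, k+d, c), read here
    as the frequency of category c among the k+d nearest training neighbors
    of x_i (retrieved from the KB by the assumed callable)."""
    scores = defaultdict(float)
    for x_id, category, rsv in neighbors[:k]:
        local = Counter(training_neighborhood(x_id, k + d))
        w = local[category] / (k + d)      # local frequency of c around x_i
        scores[category] += rsv * w
    return max(scores, key=scores.get)

# Toy usage, with neighbors ranked by decreasing RSV as component 2 would return them.
ranked = [("t1", "Products & Services", 9.1), ("t2", "Citizenship", 8.7),
          ("t3", "Products & Services", 8.2), ("t4", "Governance", 7.9)]
print(knn_predict(ranked, k=3))            # -> Products & Services
```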
3 Results and Discussions

The Q1, Q2 and Q3 issues were addressed with the training data, thanks to a ten-fold cross-validation strategy (a sketch of such a harness is given at the end of this section).

3.1 Q1: is it better to build one KB for each domain, or to merge automotive and banking into the same KB?

First, we investigated Q1 by exploiting the all, cars and banks indexes. Both languages were merged into the same indexes. Figures 2a and 2b show the performances of the system for different values of k.

Figure 2a: Performances for the cars test set, using the all index (all training data merged) or the specific cars index (only cars training data), for different values of k.

Figure 2b: Performances for the banks test set, using the all index (all training data merged) or the specific banks index (only banks training data), for different values of k.

Experiments showed that the optimal k for these data was around 10. They also showed that, throughout the curves, it was better to use specific indexes (orange curves) than a unique merged index (blue curves). Yet, the difference between the best performances is not significant, with an accuracy of 0.69 for both the all and the banks indexes on banking tweets (at k=10), and accuracies of 0.77 versus 0.76 for the cars and the all indexes on automotive tweets. We can say that, for categorizing tweets from a given domain, data from the other domain do not provide useful information, but do not degrade the optimal performances either, thanks to the robustness of the k-NN.

3.2 Q2: is it better to build one KB for each language, or to merge English and Spanish into the same KB?

Second, we investigated Q2, especially for the Spanish language, which represented less than 7% of the training data. We exploited the cars, banks, cars_es and banks_es indexes. Figures 3a and 3b show the performances of the system for different values of k.

Figure 3a: Performances for the cars - Spanish test set, using the cars index (English and Spanish merged) or the specific cars - Spanish index (only Spanish data), for different values of k.

Figure 3b: Performances for the banks - Spanish test set, using the banks index (English and Spanish merged) or the specific banks - Spanish index (only Spanish data), for different values of k.

Experiments showed that the optimal k for Spanish data was around 30, significantly higher than in the general case. This could be explained by the smaller set of Spanish instances. They also showed that it was better to use the merged-language indexes (orange curves) than a Spanish-specific index (blue curves). We can say that, for categorizing Spanish tweets, an additional amount of English data provides useful information and increases the top accuracy (from 0.69 to 0.79 for cars, from 0.57 to 0.72 for banks). The same experiments with the English language (not reported) showed no significant differences between the merged and the English-specific indexes.

3.3 Q3: as the labels are unbalanced, is it efficient to use weighting strategies for categorization?

The last experiments aimed at tuning the k-NN to deal with unbalanced labels. Results with different values of d (not reported) showed no improvement over the unweighted k-NN. Other strategies need to be investigated for this issue.
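For completeness, here is a minimal sketch of the ten-fold cross-validation harness mentioned at the top of this section. The `build_index` and `predict` callables, standing for components 2 and 3, are assumptions, and the fold construction (plain shuffling, no stratification) is our own choice since the paper does not detail it.

```python
import random

def ten_fold_accuracy(tweets, labels, build_index, predict, seed=42):
    """Ten-fold cross-validation: index 9 folds, categorize the held-out fold.
    build_index(train_tweets, train_labels) -> KB handle (component 2),
    predict(kb, tweet) -> predicted category (component 3)."""
    order = list(range(len(tweets)))
    random.Random(seed).shuffle(order)
    folds = [order[i::10] for i in range(10)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_ids = [i for i in order if i not in held_out]
        kb = build_index([tweets[i] for i in train_ids],
                         [labels[i] for i in train_ids])
        correct += sum(predict(kb, tweets[i]) == labels[i] for i in fold)
    return correct / len(tweets)
```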
3.4 Official submissions and results

We finally submitted two runs. For both runs, the automotive and banking training tweets were kept in separate Knowledge Bases. For run 1 (SIBtex_RD_1), we used a merged index for both languages. For run 2 (SIBtex_RD_2), we used language-specific indexes. The best accuracy in the competition was 0.731. SIBtex_RD_1 had an official accuracy of 0.707 and was ranked #4. SIBtex_RD_2 had an official accuracy of 0.704 and was ranked #6. Interestingly, performances were better on the test set than in our cross-validation experiments. Official statistics also showed that, in our runs, the "Products & Services" category was overrepresented (68% of the predictions, instead of 49% in the gold standard). Although we failed to design an efficient strategy for dealing with unbalanced data, this distribution shows that our k-NN probably suffered from this issue.

4 Conclusion

We designed a complete system for tweet categorization according to predefined reputational categories. Regarding the domain (automotive or banking) and the language (English or Spanish), we explored a range of representations and wanted to know whether it was better to use separate or merged Knowledge Bases. The experiments showed that the k-NN was not much affected by the kind of representation: even with all data merged into one single KB, the observed performances are close to those observed with dedicated KBs. Moreover, English training data were useful for Spanish categorization (+14% accuracy for automotive, +26% for banking). Yet, the unbalanced labels made the k-NN predict the most prevalent category ("Products & Services") more often than necessary (68% instead of 49%); this issue needs to be investigated in future work. The k-NN showed the defects of its virtues: it was robust, but not easy to improve. BiTeM/SIBtex tools for tweet monitoring are available on the DrugsListener Project page of the BiTeM website [13].

5 References

1. Gobeill J., Teodoro D., Pasche E. and Ruch P., Report on the TREC 2009 experiments: Chemical IR track. In: The Eighteenth Text REtrieval Conference (TREC-18), 2009.
2. Gobeill J., Pasche E., Teodoro D. and Ruch P., Simple Pre and Post Processing Strategies for Patent Searching in CLEF Intellectual Property Track. 2009.
3. Teodoro D., Gobeill J., Pasche E., Ruch P., Vishnyakova D. and Lovis C., Automatic IPC encoding and novelty tracking for effective patent mining. In: The 8th NTCIR Workshop Meeting on Evaluation of Information Access Technologies, Tokyo, Japan, pp 309-317, 2010.
4. Vishnyakova D., Pasche E. and Ruch P., Selection of relevant articles for curation for the Comparative Toxicogenomics Database. In: BioCreative Workshop, pp 31-38, 2012.
5. Cavnar W. and Trenkle J., N-gram-based Text Categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
6. http://practicalcryptography.com/
7. Ounis I., Amati G., Plachouras V., He B., Macdonald C. and Lioma C., Terrier: A High Performance and Scalable Information Retrieval Platform. In: Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval, 2006.
8. Manning C. and Schütze H., Foundations of Statistical Natural Language Processing. Cambridge: MIT Press, 1999.
9. Tan S., Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications 28(4), 2005.
10. Yang Y., An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval 1, pp 69-90, 1999.
11. Yang Y. and Liu X., A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), New York, NY, USA, pp 42-49, 1999.
12. Qiao X. and Liu Y., Adaptive weighted learning for unbalanced multicategory classification. Biometrics 65(1), pp 159-168, 2008.
13. http://bitem.hesge.ch/