=Paper=
{{Paper
|id=Vol-1737/T2-6
|storemode=property
|title=Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-6.pdf
|volume=Vol-1737
|authors=Ali Hürriyetoğlu,Antal van den Bosch,Nelleke Oostdijk
|dblpUrl=https://dblp.org/rec/conf/fire/HurriyetogluBO16
}}
==Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case==
Ali Hürriyetoğlu, Antal van den Bosch, Nelleke Oostdijk
Centre for Language Studies, Radboud University
P.O. Box 9103, NL-6500 HD, Nijmegen, the Netherlands
a.hurriyetoglu@let.ru.nl, a.vandenbosch@let.ru.nl, n.oostdijk@let.ru.nl

1. INTRODUCTION

In this working note we describe our submission to the FIRE 2016 Microblog track Information Extraction from Microblogs Posted during Disasters [1]. The task in this track was to extract all relevant tweets pertaining to seven given topics from a set of tweets. The tweet set was collected using key terms related to the Nepal earthquake (https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake).

Our submission is based on a semi-automatic approach in which we used Relevancer, a complete analysis pipeline designed for analyzing a tweet collection. The main analysis steps supported by Relevancer are (1) preprocessing the tweets, (2) clustering them, (3) manually labeling the coherent clusters, and (4) creating a classifier that can be used for classifying tweets that were not placed in any coherent cluster, and for classifying new (i.e. previously unseen) tweets using the labels defined in step (3).

The data and the system are described in more detail in Sections 2 and 3, respectively.

2. DATA

At the time of download (August 3, 2016), 49,660 tweet IDs were available out of the 50,068 tweet IDs provided for this task. The missing tweets had been deleted by the people who originally posted them. We used only the English tweets, 48,679 tweets in all, based on the language tag provided by the Twitter API. Tweets in this data set had already been deduplicated by the task organisation team as much as possible.

The final tweet collection contains tweets that were posted between April 25, 2015 and May 10, 2015. The daily distribution of the tweets is visualized in Figure 1.

[Figure 1: Temporal distribution of the tweets (tweet count per day, late April to May 10, 2015)]

3. SYSTEM OVERVIEW

The typical analysis steps of Relevancer were applied to the data provided for this task. The current focus of the Relevancer tool is on the text and the posting date of a tweet. Relevancer aims at discovering and distinguishing between the different topically coherent information threads in a tweet collection [3, 2]. Tweets are clustered such that each cluster represents an information thread, and the clusters can be used to train a classifier. Each step of the analysis process is described in some detail in the following subsections (see http://relevancer.science.ru.nl and https://bitbucket.org/hurrial/relevancer for further details).

3.1 Normalisation

Normalisation starts with converting user names and URLs that occur in the tweet text to the dummy values 'usrusrusr' and 'urlurlurl' respectively.

After inspection of the data, we decided to normalise a number of additional phenomena. First, we removed certain automatically generated parts at the beginning and at the end of a tweet text. We determined these manually, e.g. 'live updates:', 'I posted 10 photos on Facebook in the album' and 'via usrusrusr'. After that, words that end in '...' were removed as well. These words are mostly incomplete due to the length restriction of a tweet text, and usually occur at the end of tweets generated from within another application. Finally, we eliminated any consecutive duplication of a token. Duplication of tokens mostly occurs with the dummy forms for user names and URLs, and with event-related key words and entities. For instance, in the tweet "#nepal: nepal: nepal earthquake: main language groups (10 may 2015) urlurlurl #crisismanagement", two of the three consecutive tokens at the beginning were removed in this last step of normalization. This last step facilitates the process of identifying the actual content of the tweet text.
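The normalisation step can be summarised in a few lines of Python. This is a minimal sketch rather than Relevancer's actual code: the function name, the (incomplete) list of automatically generated fragments, and the way duplicate tokens are matched are assumptions made for illustration.

<pre>
import re

# Manually determined, automatically generated fragments (example subset only).
AUTO_PARTS = [
    "live updates:",
    "i posted 10 photos on facebook in the album",
    "via usrusrusr",
]

def normalise_tweet(text: str) -> str:
    # Replace URLs and user names with dummy values.
    text = re.sub(r"https?://\S+", "urlurlurl", text)
    text = re.sub(r"@\w+", "usrusrusr", text)
    # Strip automatically generated parts at the beginning or end of the tweet.
    for part in AUTO_PARTS:
        if text.lower().startswith(part):
            text = text[len(part):].strip()
        if text.lower().endswith(part):
            text = text[:-len(part)].strip()
    # Drop words ending in '...': these are mostly cut off by the length limit.
    tokens = [t for t in text.split() if not t.endswith("...")]
    # Collapse consecutive duplicates; '#nepal:', 'nepal:' and 'nepal' count as the
    # same token here (an assumption about how Relevancer matches duplicates).
    canon = lambda t: t.lower().lstrip("#").rstrip(":")
    deduped = []
    for token in tokens:
        if not deduped or canon(deduped[-1]) != canon(token):
            deduped.append(token)
    return " ".join(deduped)

# Example: two of the three leading 'nepal' tokens are removed.
print(normalise_tweet("#nepal: nepal: nepal earthquake: help neede... http://t.co/x"))
</pre>

In the example call, the repeated 'nepal' tokens are collapsed into one, the truncated word is dropped, and the URL is replaced by its dummy form, mirroring the example given above.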
3.2 Clustering and labeling

The clustering step aims at finding topically coherent groups of tweets that we call information threads. These groups are labeled as relevant, irrelevant, or incoherent. Coherent clusters were selected from the output of the K-Means clustering algorithm with k = 200, i.e. a preset number of 200 clusters (we used scikit-learn v0.17.1, http://scikit-learn.org, for all machine learning tasks in this study). The coherency of a cluster is calculated based on the distance between the tweets in a particular cluster and the cluster center. Tweets that are in incoherent clusters (as determined by the algorithm) were clustered again, relaxing the coherency restrictions until the algorithm reaches the requested number of coherent clusters. The second stop criterion for the algorithm is the limit on the relaxation of the coherency parameter.

The coherent clusters were then extended with tweets that were not in any coherent cluster. This step was performed by iterating over all coherent clusters in descending order of the total length of the tweets in a cluster, and adding to a cluster those tweets that have a cosine similarity higher than 0.85 with respect to its center. The total number of tweets transferred to the clusters in this way was 847.

As Relevancer treats the posting date as relevant information, the tool first searches for coherent clusters of tweets in each day separately. Then, in a second step, it clusters all tweets from all days that were not previously placed in any coherent cluster. Applying the two steps sequentially enables Relevancer to detect local and global information threads, respectively, as coherent clusters.

For each cluster thus identified, a list of tweets is presented to an expert, who then determines which are the relevant and irrelevant clusters (the first author of this working note acted as the expert for this task; a real scenario would require a domain expert). Clusters that contain both relevant and irrelevant tweets are labeled as incoherent by the expert (although the algorithmic approach determines which clusters are returned as coherent, the expert may not agree with it). Relevant clusters are those which an expert considers to be relevant for the aim she wants to achieve. In the present context, more specifically, clusters that are about a topic specified as relevant by the task organisation team should be labeled as relevant; any other coherent cluster should be labeled as irrelevant.
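The clustering and cluster-extension procedure can be sketched with scikit-learn as follows. This is a rough approximation under stated assumptions: TF-IDF features, a coherence test based on the mean distance of a cluster's tweets to its center (the threshold max_mean_dist is illustrative), and a single clustering pass. The iterative relaxation of the coherency parameter and the per-day clustering described above are omitted, and Relevancer's actual coherence criteria may differ.

<pre>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_tweets(tweets, k=200, max_mean_dist=0.8, sim_threshold=0.85):
    """Cluster tweets, keep the coherent clusters, and extend them."""
    X = TfidfVectorizer().fit_transform(tweets)
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    coherent, leftover = {}, []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) == 0:
            continue
        # Coherence proxy: mean distance of the cluster's tweets to its center.
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
        if dists.mean() <= max_mean_dist:
            coherent[c] = list(idx)
        else:
            # In Relevancer these tweets are re-clustered with relaxed limits.
            leftover.extend(idx)
    # Extend coherent clusters, largest total tweet length first, with left-over
    # tweets whose cosine similarity to the cluster center exceeds the threshold.
    remaining = set(leftover)
    order = sorted(coherent,
                   key=lambda c: sum(len(tweets[i]) for i in coherent[c]),
                   reverse=True)
    for c in order:
        center = km.cluster_centers_[c].reshape(1, -1)
        for i in sorted(remaining):
            if cosine_similarity(X[i], center)[0, 0] > sim_threshold:
                coherent[c].append(i)
                remaining.discard(i)
    return coherent, sorted(remaining)
</pre>

In Relevancer itself this pass is first applied to each day's tweets separately and then to the residue of all days combined, and the tweets left over here are re-clustered with relaxed coherency limits rather than discarded.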
3.3 Creating the classifier

The classifier was trained on the tweets labeled as relevant or irrelevant in the previous step; tweets in the incoherent clusters were not used. The Naive Bayes method was used to train the classifier, with word unigrams and bigrams as features.

We used a small set of stop words. These are a small set of key words (nouns), viz. nepal, earthquake, quake, kathmandu and their hashtag versions (this set was based on our own observation, as we did not have access to the key words that were used to collect this data set); the determiners the, a, an; the conjunctions and, or; the prepositions to, of, from, with, in, on, for, at, by, about, under, above, after, before; and the news-related words breaking and news and their hashtag versions. The normalized forms of the user names and URLs, usrusrusr and urlurlurl, are included in the stop word list as well.

We optimized the smoothing prior parameter α to 0.31 by cross-validation, comparing the classifier performance for 20 equally spaced values of α between 0 and 2. The performance of the classifier on 15% held-out data is provided in Tables 1 and 2 (since we optimize the classifier for this collection, the performance of the classifier on unseen data is not relevant here).

               Irrelevant   Relevant
  Irrelevant       720          34
  Relevant          33         257

Table 1: Confusion matrix of the Naive Bayes classifier on the test data. The rows and the columns represent the actual and the predicted labels of the test tweets, respectively. The diagonal provides the number of correct predictions.

              precision   recall    F1   support
  Irrelevant     .96        .95    .96       754
  Relevant       .88        .89    .88       290
  Avg/Total      .94        .94    .94     1,044

Table 2: Precision, recall, and F1-score of the classifier on the test collection. The recall is based on the test set.

The whole collection was classified with the trained Naive Bayes classifier; 11,300 tweets were predicted as relevant. We continued the analysis with these relevant tweets.
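The training and tuning of the classifier described above can be sketched as follows, using current scikit-learn interfaces rather than the v0.17.1 API used for the paper. The train/test split, the exact α grid, the cross-validation setup, and the tokenization that keeps hashtags as tokens are assumptions for illustration; the stop-word list follows the description in the text.

<pre>
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

# Stop words as listed in the text: event key words, determiners, conjunctions,
# prepositions, news-related words, and the user-name/URL dummies.
STOP_WORDS = [
    "nepal", "#nepal", "earthquake", "#earthquake", "quake", "#quake",
    "kathmandu", "#kathmandu", "the", "a", "an", "and", "or", "to", "of",
    "from", "with", "in", "on", "for", "at", "by", "about", "under", "above",
    "after", "before", "breaking", "#breaking", "news", "#news",
    "usrusrusr", "urlurlurl",
]

def train_relevance_classifier(texts, labels):
    # Hold out 15% of the labeled tweets, as for the evaluation in Tables 1 and 2.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.15, random_state=0)
    pipeline = Pipeline([
        # Word unigrams and bigrams; the token pattern keeps hashtags as tokens
        # so that the hashtag stop words apply (assumption about tokenization).
        ("vec", CountVectorizer(ngram_range=(1, 2), stop_words=STOP_WORDS,
                                token_pattern=r"(?u)\b\w\w+\b|#\w+")),
        ("nb", MultinomialNB()),
    ])
    # 20 roughly equally spaced alpha values between 0 and 2 (alpha = 0 itself is
    # skipped for numerical reasons); the paper reports an optimum of alpha = 0.31.
    grid = GridSearchCV(pipeline, {"nb__alpha": np.linspace(0.1, 2.0, 20)}, cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, (X_test, y_test)
</pre>

The returned estimator can then be applied to the whole collection (model.predict(all_tweets)); in the paper's run this yielded the 11,300 tweets predicted as relevant.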
3.4 Clustering and labeling relevant tweets

The relevant tweets, as predicted by the automatic classifier, were clustered without filtering them on the coherency criteria. In contrast to the first clustering step, the output of K-means was used as is, again with k = 200. These clusters were annotated using the seven topics predetermined by the task. To the extent possible, incoherent clusters were labeled with the closest provided topic; otherwise, the cluster was labeled as irrelevant.

The clusters that have a topic label contain 8,654 tweets. Since the remaining clusters, containing 2,646 tweets, were evaluated as irrelevant, they were not included in the submitted set.

4. RESULTS

The result of our submission was recorded under the ID relevancer ru nl. The performance of our results was evaluated by the organisation committee at ranks 20, 1,000, and all, considering the tweets retrieved up to the respective ranks. As announced by the organisation committee, our results are as follows: 0.3143 precision at rank 20, 0.1329 recall and 0.0319 Mean Average Precision (MAP) at rank 1,000, and 0.0406 MAP considering all tweets in our submitted results.

We generated an additional evaluation of our results based on the annotated tweets provided by the task organizers. The overall precision and recall are 0.081 and 0.34 respectively. The performance for the topics FMT1 (available resources), FMT2 (required resources), FMT3 (available medical resources), FMT4 (required medical resources), FMT5 (resource availability at certain locations), FMT6 (NGO and governmental organization activities), and FMT7 (infrastructure damage and restoration reports) is provided in Table 3.

          precision   recall    F1   percentage
  FMT1       0.17      0.50    0.26      0.27
  FMT2       0.35      0.09    0.15      0.14
  FMT3       0.19      0.28    0.23      0.16
  FMT4       0.06      0.06    0.06      0.05
  FMT5       0.05      0.06    0.06      0.09
  FMT6       0.05      0.74    0.09      0.18
  FMT7       0.25      0.08    0.12      0.12

Table 3: Precision, recall, and F1-score of our submission, and the percentage of the annotated tweets per topic.

On the basis of these results, we conclude that the success of our method differs drastically across topics. In Table 3, we observe a clear relation between the F1-score and the percentage of the tweets per topic in the manually annotated data. Consequently, we conclude that our method performs better when a topic is well represented in the collection.

5. CONCLUSION

In this study we applied the methodology supported by the Relevancer system in order to identify relevant information by enabling human input in terms of cluster labels. This method yielded an average performance in comparison to the other participating systems.

We observed that clustering the tweets of each day separately enabled the unsupervised clustering algorithm to identify specific coherent clusters in a shorter time than clustering the whole set at once would take. Moreover, this setting provided an overview that realistically changes from day to day in the period following the earthquake.

Our approach is optimized to incorporate human input. In principle, an expert should be able to refine a tweet collection until she reaches a point where the time spent on the task is optimal and the performance is sufficient. However, for this particular task an annotation manual was not available, and the expert had to stop after one iteration without being sure to what extent certain information threads were actually relevant to the task at hand; for example, are (clusters of) tweets pertaining to providing or collecting funds for the disaster victims considered relevant or not?

It is important to note that the Relevancer system yields its results in random order, as it has no mechanism that ranks posts for relative importance. We speculate that rank-based performance metrics are therefore not optimally suited for evaluating it.

In our future work we will aim to increase the precision and diminish the performance differences across topics, possibly by using downsampling or upsampling methods to tackle class imbalance.

6. ACKNOWLEDGEMENTS

This research was funded by the Dutch national research programme COMMIT.

7. REFERENCES

[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[2] A. Hürriyetoğlu, C. Gudehus, N. Oostdijk, and A. van den Bosch. Relevancer: Finding and labeling relevant information in tweet collections. In E. Spiro and Y.-Y. Ahn, editors, Social Informatics, volume 10046. Springer International Publishing, November 2016.

[3] A. Hürriyetoğlu, A. van den Bosch, and N. Oostdijk. Analysing role of key term inflections in knowledge discovery on Twitter. In International Workshop on Knowledge Discovery on the Web, Cagliari, Italy, September 2016.