=Paper=
{{Paper
|id=Vol-2036/T2-1
|storemode=property
|title=Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis)
|pdfUrl=https://ceur-ws.org/Vol-2036/T2-1.pdf
|volume=Vol-2036
|authors=Moumita Basu,Saptarshi Ghosh,Kripabandhu Ghosh,Monojit Choudhury
|dblpUrl=https://dblp.org/rec/conf/fire/BasuGGC17
}}
==Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis)==
Moumita Basu (UEM Kolkata, India; IIEST Shibpur, India), Saptarshi Ghosh (IIT Kharagpur, India; IIEST Shibpur, India), Kripabandhu Ghosh (IIT Kanpur, India), Monojit Choudhury (Microsoft Research, India)

ABSTRACT

The FIRE 2017 Information Retrieval from Microblogs during Disasters (IRMiDis) track focused on the retrieval and matching of needs and availabilities of resources from microblogs posted on Twitter during disaster events. A dataset of around 67,000 microblogs (tweets) in English as well as in local languages such as Hindi and Nepali, posted during the Nepal earthquake in April 2015, was made available to the participants. There were two tasks. The first task (Task 1) was to retrieve tweets that inform about needs and availabilities of resources; these tweets are called need-tweets and availability-tweets. The second task (Task 2) was to match need-tweets with appropriate availability-tweets.

CCS CONCEPTS
• Information systems → Query reformulation

1 INTRODUCTION

A lot of important information is posted on online social media such as Twitter during disaster events like floods and earthquakes. However, this information is immersed in a large amount of conversational content, such as prayers and sympathy for the victims. Hence, automated methodologies are needed to extract the important information from the deluge of tweets posted during such an event [3]. In this track, we focused on two types of tweets that are very important for coordinating relief operations in a disaster situation:

(1) Need-tweets: tweets which inform about the need or requirement of some specific resource, such as food, water, medical aid, shelter, or mobile/Internet connectivity.
(2) Availability-tweets: tweets which inform about the availability of some specific resource. This class includes both tweets that inform about potential availability, such as resources being transported or dispatched to the disaster-struck area, and tweets that inform about actual availability in the disaster-struck area, such as food being distributed.

The track had two tasks, as described below.

Task 1: Identifying need-tweets and availability-tweets. Here the participants were asked to develop methodologies for identifying need-tweets and availability-tweets. Note that this task can be approached in different ways. It can be treated as a retrieval or search problem, where the two types of tweets are to be retrieved. Alternatively, it can be viewed as a classification problem, e.g., where tweets are classified into three classes: need-tweets, availability-tweets, and others.

Task 2: Matching need-tweets and availability-tweets. An availability-tweet is said to match a need-tweet if the availability-tweet informs about the availability of at least one resource whose need is indicated in the need-tweet. In this task, the participants were asked to develop methodologies for matching need-tweets with appropriate availability-tweets.

Table 1 shows some examples of need-tweets and availability-tweets from the dataset that was made available to the participants (described in the next section). Note that the dataset contains tweets not only in English but also in local languages such as Hindi and Nepali, as well as code-mixed tweets. The Hindi and Nepali examples below are given in English translation.

Examples of need-tweets:
• (Nepali) "Saddened by the news that no relief materials or rescue teams have reached Thansing VDC of Nuwakot district so far; the concerned [authorities] ..."
• (Hindi) "Shortage of medicines in Nepal, crowds of thousands at the airport - Aaj Tak #World [url]"
• "after 7days of earthquake! people are still crying, sleeping in rain, lack of food and water! hope it was dream but this all happens to us!"
• "Nepal earthquake: Homeless urgently need tents; Death toll above 5,200 Read More… [url]"

Examples of availability-tweets:
• "Nepal earthquake: Spiritual group sends relief materials to victims [url]"
• (Nepali) "In coordination with the Ministry of Health and WHO, about half a dozen film artists [are engaged] in medicine and food distribution and public-awareness programmes #earthquake #Nepalifilms"
• (Hindi) "RT @abpnewshindi: Food, water and blankets have been sent to Nepal by aircraft. S. Jaishankar #NepalEarthquake watch live [url]"
• "#grgadventure donating our tents and sleepig bags for victims of the #nepal #earthquake [url]"

Table 1: Examples of need-tweets and matching availability-tweets, posted during the 2015 Nepal earthquake
2 THE TEST COLLECTION

In this track, our objective was to develop a test collection containing code-mixed microblogs for evaluating:

• methodologies for extracting two specific types of actionable situational information – needs and availabilities of various types of resources (need-tweets and availability-tweets), and
• methodologies for matching need-tweets with availability-tweets.

In this section, we describe how the test collection for both tasks of the IRMiDis track was developed.

2.1 Tweet dataset

As part of the same track in FIRE 2016, we had released a collection of 50,018 English tweets related to the devastating earthquake that occurred in Nepal and parts of India on 25th April 2015 (https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake) [2]. We also utilized this collection to evaluate several IR methodologies developed by ourselves and others [1, 2]. We re-use these tweets in the present track. Additionally, in the present track, we collected tweets in Hindi and Nepali (based on the language identification performed by Twitter itself) using the Twitter Search API [4] with the keyword 'नेपाल' (Nepal), posted during the same period as the English tweets. A total of about 90K tweets were collected, and after removing duplicates and near-duplicates as before [1, 2], we obtained a set of 16,903 tweets. Hence, a set of 66,921 tweets was obtained – containing 50,018 English tweets and 16,903 Hindi, Nepali or code-mixed tweets – which was used as the test collection for the track.

The data was ordered chronologically based on the timestamp assigned by Twitter, and released in two stages. At the start of the track, the chronologically earlier 20K tweets were released (training set), along with a sample of need-tweets and availability-tweets in these 20K tweets (development set). The participating teams were expected to use the training and development sets to formulate their methodologies. Next, about two weeks before the submission of results, the set of chronologically later 46K tweets was released (test set). The methodologies were evaluated based on their performance over the test set.
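The duplicate and near-duplicate removal above follows the procedure of [1, 2], which is not reproduced here. For illustration only, the following minimal Python sketch shows one common way to drop near-duplicate tweets: Jaccard similarity over word shingles with a fixed threshold. The shingle size and threshold value are assumptions for this sketch, not the values used for the track collection.

```python
import re

def shingles(text, n=3):
    """Lowercase, strip URLs/mentions, and return the set of word n-grams."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    words = re.findall(r"\w+", text)
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def remove_near_duplicates(tweets, threshold=0.7):
    """Keep a tweet only if it is not too similar to any tweet kept so far."""
    kept, kept_shingles = [], []
    for tweet in tweets:
        sh = shingles(tweet)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(tweet)
            kept_shingles.append(sh)
    return kept

# Example: the retweet is dropped as a near-duplicate of the original tweet.
print(remove_near_duplicates([
    "Medical camp set up at Tundikhel, volunteers needed",
    "RT: Medical camp set up at Tundikhel, volunteers needed",
]))
```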
2.2 Developing the gold standard for retrieval

The gold standard for both tasks was generated through 'manual runs'. To develop the gold-standard set of need-tweets and availability-tweets, three human annotators with proficiency in English, Hindi and Nepali were involved. Additionally, each annotator was a regular user of Twitter and had previous experience of working with social media content posted during disasters. The gold-standard development involved the same three phases as described in [1, 2] – first, each annotator individually retrieved need-tweets and availability-tweets; then the annotators resolved conflicts through mutual discussion; and finally there was a pooling step over all the runs submitted to the track.

The number of need-tweets and availability-tweets present in the final gold standard for each of the three languages is reported in Table 2.

Topic for retrieval  | Hindi tweets | Nepali tweets | English tweets
Need-tweets          | 31           | 82            | 558
Availability-tweets  | 238          | 206           | 1326

Table 2: Summary of the gold standard used in IRMiDis

2.3 Developing the gold standard for matching

To develop the gold standard for matching, the same human annotators were involved. The annotators were asked to inspect the gold standard of need-tweets and availability-tweets, to manually find the set of need-tweets for which at least one matching availability-tweet exists, and to find the matching availability-tweets for each such need-tweet. Additionally, pooling was used over the participant runs to identify relevant matches which the annotators might not have found (a minimal sketch of this pooling step is given below).
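The pooling step mentioned above is the standard depth-k pooling used in IR evaluation; the pool depth used in the track is not stated here, so the following minimal Python sketch (with a hypothetical depth and hypothetical run and tweet identifiers) only illustrates the idea: the candidate need/availability pairs shown to the annotators are the union of the top-ranked pairs of every submitted run.

```python
def build_pool(runs, depth=50):
    """Pool the top-`depth` results from each submitted run.

    `runs` maps a run id to a ranked list of (need_tweet_id, availability_tweet_id)
    pairs; the returned set is the union of the top-`depth` pairs of every run,
    which is then judged by the annotators.
    """
    pool = set()
    for ranked_pairs in runs.values():
        pool.update(ranked_pairs[:depth])
    return pool

# Example with two tiny hypothetical runs.
runs = {
    "run_A": [("n1", "a3"), ("n1", "a7"), ("n2", "a5")],
    "run_B": [("n1", "a3"), ("n2", "a9")],
}
print(sorted(build_pool(runs, depth=2)))  # pairs to be judged by the annotators
```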
3 TASK 1: IDENTIFYING NEED-TWEETS AND AVAILABILITY-TWEETS

11 teams participated in Task 1 and 18 runs were submitted. A summary of the methodologies used by each team is given in the next sub-section.

3.1 Methodologies

We now summarize the methodologies adopted in the submitted runs.

• iitbhu_fmt17: This team participated from Indian Institute of Technology (BHU) Varanasi, India. It submitted the following two automatic (i.e., involving no manual step) runs. Both runs used the Google Translate API to convert the code-mixed tweets.
  – iitbhu_fmt17_task1_1: This run used Apache Lucene, an open-source Java-based text search engine library (https://lucene.apache.org/). The training data was indexed using the Standard Analyzer, and the frequency of each token in the training set was recorded. A query was generated as the disjunction of tokens whose frequency was greater than or equal to a threshold value. Tweets were categorized according to the score returned by the Lucene search engine.
  – iitbhu_fmt17_task1_2: This run treated the task as a classification task and used an SVM. Undersampling was employed, and a threshold of 0.2 on the score predicted by the SVM classifier was used to decide whether a tweet is relevant (an illustrative sketch of this kind of pipeline appears after this list).
• DataBros: This team participated from Indian Institute of Information Technology, Kalyani, India. It submitted one automatic run, described below:
  – iiests_IRMiDis_FIRE2017_1: A bag-of-words model was used with TfidfVectorizer to collect unigram and bigram features. The Recursive Feature Elimination (RFE) algorithm with a linear SVM was used to compute ranking weights for all features and sort the features according to the weight vectors. In addition, a Decision Tree classifier was applied to classify the data.
• Bits_Pilani_WiSoc: This team participated from Birla Institute of Technology and Science, Pilani, India. It submitted two automatic runs. Both runs first created word embeddings and then used the fastText classification algorithm to classify each tweet into its appropriate category. The fastText classifier was trained on the labelled data and the previously created word embeddings.
  – BITS_PILANI_RUN1: created word embeddings using the Skip-gram model.
  – BITS_PILANI_RUN2: created word embeddings using the CBOW model.
• Data Engineering Group: This team participated from Indraprastha Institute of Information Technology, Delhi, India. It submitted one automatic run, described as follows:
  – DataEngineeringGroup_1: This run used the Stanford CoreNLP library (https://nlp.stanford.edu/software/tagger.shtml) for POS tagging and lemma identification of all the words in the tweet set. Features were constructed from both the words present in the tweets and their POS tags. A Logistic Regression model was used for the classification task.
• DIA Lab - NITK: This team participated from National Institute of Technology Karnataka, India. It submitted one automatic run, described as follows:
  – daiict_irlab_1: This run used a Doc2vec model to transform tweets into embedding vectors of size 100. To handle the code-mixed tweets, ASCII transliterations of the Unicode text were used. The frequency of each token in a tweet was also used as a feature. These embeddings were the input to a multilayer perceptron (a feed-forward artificial neural network) for classification, and the w-Ranking Key algorithm was used to rank the tweets.
• FAST-NU: This team participated from FAST National University, Karachi Campus, Pakistan. It submitted one automatic run, described below:
  – NU_Team_run01: This run extracted textual features using tf-idf scores. All non-English tweets were translated into English using the Google Translate API. A logistic regression based classifier was used for classification.
• HLJIT2017-IRMIDIS: This team participated from Heilongjiang Institute of Technology, China. It submitted three automatic runs. The task was viewed as a classification task in all the runs, and feature selection was based on the logistic regression method.
  – HLJIT2017-IRMIDIS_task1_1: an SVM classifier with a linear kernel was used.
  – HLJIT2017-IRMIDIS_task1_2: an AdaBoost classifier was used.
  – HLJIT2017-IRMIDIS_task1_3: an SVM classifier with a nonlinear kernel was used.
• HLJIT2017-IRMIDIS_1: This team also participated from Heilongjiang Institute of Technology, China. It submitted three automatic runs. The task was viewed as a classification task in all the runs, using words as features.
  – HLJIT2017-IRMIDIS_1_task1_1: a LibSVM classifier was used.
  – HLJIT2017-IRMIDIS_1_task1_2: a LibSVM classifier was used.
  – HLJIT2017-IRMIDIS_1_task1_3: a Linear Regression model was used.
• Iwist-Group: This team participated from the University of Hildesheim, Germany. It submitted one automatic run, Iwist_task1_1, described as follows. A pole-based overlapping clustering algorithm was used to measure the degree of relevance of each tweet. For ranking the tweets, Euclidean distance was used as the similarity measure, and an object closer to a pole was ranked higher.
• Radboud_CLS Netherlands: This team participated from Radboud University, the Netherlands, and submitted the following two semi-automatic runs. Code-mixed tweets were preprocessed and translated to English using Google Translate.
  – Radboud_CLS_task1_1: A lexicon and a set of hand-crafted rules were used to tag the relevant n-grams, and class labels were then automatically assigned to the tagged output. The output was initially ranked using a combined score of the human-estimated confidence of the specific class label and the tag pattern. The final ranking was generated by ordering the tweets within these ranked sets according to their tweet IDs.
  – Radboud_CLS_task1_2: This run used the tool Relevancer for initial clustering of the tweets tagged as English or Hindi. The English clusters were annotated and used as training data for an SVM-based classifier.
• Amrita CEN 1: This team participated from Amrita School of Engineering, Coimbatore, India. It submitted one semi-automatic run, AU_NLP_1, described as follows. The training data was tokenized, and a classifier was trained using word counts as features. Cosine similarity was used for ranking the tweets.
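To make the best-performing family of approaches concrete, the following minimal Python sketch (using scikit-learn) illustrates the kind of pipeline described for iitbhu_fmt17_task1_2: bag-of-words features, random undersampling of the majority "other" class, a linear SVM, and a fixed decision threshold of 0.2 on the classifier score. The feature set, sampling scheme, toy data and library choices are our assumptions for illustration, not the team's actual code.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def undersample(texts, labels, seed=0):
    """Randomly drop majority-class examples so both classes are equally frequent."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    random.Random(seed).shuffle(neg)
    keep = pos + neg[:len(pos)]
    return [texts[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy training data: 1 = need/availability-tweet, 0 = other.
train_texts = ["need water and food in Sindhupalchok",
               "medical camp distributing medicines at Tundikhel",
               "praying for nepal", "so sad about the earthquake",
               "thoughts are with the victims", "terrible news today"]
train_labels = [1, 1, 0, 0, 0, 0]

texts, labels = undersample(train_texts, train_labels)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

# Classify unseen tweets: keep those whose decision score reaches the 0.2 threshold.
test_texts = ["urgent need of tents in Gorkha", "heartbreaking scenes everywhere"]
scores = clf.decision_function(vectorizer.transform(test_texts))
print([t for t, s in zip(test_texts, scores) if s >= 0.2])
```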
3.2 Evaluation Measures and Results

We now report the performance of the methodologies submitted to Task 1 of the FIRE 2017 IRMiDis track. We consider the following measures to evaluate performance: (i) Precision at 100 (Precision@100) – the fraction of the top-ranked 100 results that are actually relevant according to the gold standard, i.e., the fraction of the retrieved tweets that are actually need-tweets or availability-tweets; (ii) Recall at 1000 (Recall@1000) – the fraction of relevant tweets (according to the gold standard) that appear in the top 1000 retrieved tweets; and (iii) Mean Average Precision (MAP) over the full retrieved ranked list. A minimal code sketch of these measures is given at the end of this section.

Run Id                      | Type           | Precision@100 | Recall@1000 | MAP    | Method summary
iitbhu_fmt17_task1_2        | Automatic      | 0.7900        | 0.6160      | 0.4386 | SVM classifier, undersampling
iiests_IRMiDis_FIRE2017_1   | Automatic      | 0.7850        | 0.3542      | 0.2639 | TfidfVectorizer, LinearSVM, Decision Tree classifier
Bits_Pilani_1               | Automatic      | 0.6800        | 0.2983      | 0.2073 | POS tagging, word embeddings, Skip-gram model, fastText classifier
Bits_Pilani_2               | Automatic      | 0.7300        | 0.2634      | 0.1993 | POS tagging, word embeddings, CBOW model, fastText classifier
DataEngineeringGroup_1      | Automatic      | 0.5400        | 0.2896      | 0.1304 | POS tagging, lemma identification, Logistic Regression model
HLJIT2017-IRMIDIS_1_task1_3 | Automatic      | 0.6850        | 0.1662      | 0.1208 | Words as features, Linear Regression model
iitbhu_fmt17_task1_1        | Automatic      | 0.5600        | 0.1570      | 0.0906 | Query generated as disjunction of tokens above a frequency threshold, Apache Lucene
HLJIT2017-IRMIDIS_1_task1_2 | Automatic      | 0.3650        | 0.1176      | 0.0710 | Words as features, LibSVM classifier
HLJIT2017-IRMIDIS_task1_3   | Automatic      | 0.4450        | 0.1642      | 0.0687 | Logistic regression based feature selection, SVM with nonlinear kernel
DIA_Lab_NITK_task1_1        | Automatic      | 0.3850        | 0.1437      | 0.0681 | Doc2vec, multilayer perceptron, w-Ranking Key
HLJIT2017-IRMIDIS_task1_2   | Automatic      | 0.5500        | 0.1094      | 0.0633 | Logistic regression based feature selection, AdaBoost classifier
HLJIT2017-IRMIDIS_1_task1_1 | Automatic      | 0.3050        | 0.0636      | 0.0317 | Words as features, LibSVM classifier
Iwist_task1_1               | Automatic      | 0.0350        | 0.0916      | 0.0291 | POS tagging, cosine similarity, greedy search
HLJIT2017-IRMIDIS_task1_1   | Automatic      | 0.1250        | 0.1414      | 0.0286 | Logistic regression based feature selection, SVM with linear kernel
NU_Team_run01               | Automatic      | 0.0700        | 0.0478      | 0.0047 | tf-idf scores, logistic regression based classifier
Radboud_CLS_task1_1         | Semi-automatic | 0.7400        | 0.3731      | 0.2458 | Linguistic approach, tagged n-grams, automatically assigned class labels
Radboud_CLS_task1_2         | Semi-automatic | 0.5500        | 0.2189      | 0.1736 | Relevancer for initial clustering, SVM-based classifier, cosine similarity
AU_NLP_1                    | Semi-automatic | 0.0800        | 0.0645      | 0.0199 | Tokenization, word count as feature, classification, cosine similarity

Table 3: Comparison among all the submitted runs in Task 1 (identifying need-tweets and availability-tweets). Runs are ranked in decreasing order of MAP score within each type.

Table 3 reports the retrieval performance of all the submitted runs in Task 1. Each of the measures (Precision@100, Recall@1000, MAP) is reported as an average over the two topics, need-tweets and availability-tweets. As is evident from the scores in Table 3, classification-based approaches performed better than the other methodologies based on word embeddings or on search tools such as Apache Lucene.
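For completeness, here is a minimal, self-contained Python sketch of the three Task 1 measures as defined above (Precision@100, Recall@1000, and average precision over the full ranked list, averaged over the two topics to give MAP). This is an illustrative re-implementation from the definitions, with hypothetical toy identifiers, not the track's official evaluation script.

```python
def precision_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the top-k retrieved tweets that are relevant."""
    return sum(1 for t in ranked_ids[:k] if t in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of all relevant tweets that appear in the top-k retrieved tweets."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    """Average precision over the full retrieved ranked list."""
    hits, score = 0, 0.0
    for rank, tweet_id in enumerate(ranked_ids, start=1):
        if tweet_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0

# MAP for a run = mean of average precision over the two topics
# (need-tweets and availability-tweets), using hypothetical toy data here.
run = {"need": ["t3", "t1", "t9"], "avail": ["t7", "t2"]}
gold = {"need": {"t1", "t9", "t4"}, "avail": {"t2"}}
map_score = sum(average_precision(run[t], gold[t]) for t in run) / len(run)
print(round(map_score, 4))
```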
4 TASK 2: MATCHING NEED-TWEETS AND AVAILABILITY-TWEETS

In Task 2, 5 teams participated and 10 runs were submitted. We first describe the runs, and then report the comparative evaluation.

4.1 Methodologies

We now describe the submitted runs.

• DataBros: This team participated from Indian Institute of Information Technology, Kalyani, India. It submitted one automatic run. This run used POS (part-of-speech) tagging, and the matching score was obtained from the number of overlapping common nouns between need-tweets and availability-tweets.
• Data Engineering Group: This team participated from Indraprastha Institute of Information Technology, Delhi, India. It submitted two automatic runs, described as follows. Both runs used the POS tags of nouns, and the similarity between need-tweets and availability-tweets was measured by cosine similarity, with a similarity threshold of 0.7 inferred on the basis of experimentation (an illustrative sketch of this similarity-based matching appears after this list).
  – In the first submitted run, a brute-force approach was followed in searching.
  – In the second submitted run, a greedy approach was followed: for each need-tweet, the search stopped as soon as it found the first five (or fewer) availability-tweets with a cosine similarity score greater than the 0.7 threshold.
• HLJIT2017-IRMIDIS: This team participated from Heilongjiang Institute of Technology, China. It submitted three automatic runs. The task was viewed as an IR task. All the runs used the open-source retrieval tool Indri with a Dirichlet-smoothed language model for retrieval and KL distance as the sorting model.
• HLJIT2017-IRMIDIS_1: This team also participated from Heilongjiang Institute of Technology, China. It submitted three automatic runs. The task was viewed as an IR task, with the need-tweets used as the query set and the availability-tweets used as the document collection. All the runs used the open-source retrieval tool Indri with a Dirichlet-smoothed language model to solve the matching problem; the three runs differ in their preprocessing steps.
• Radboud_CLS Netherlands: This team participated from Radboud University, the Netherlands, and submitted one semi-automatic run, Radboud_CLS_task1_1. This method used the tagged output obtained while processing the tweets for Task 1 using a linguistic approach. For every need-tweet, all the word n-grams tagged as identifying a resource were considered; the approach attempted to find an exact match in the availability-tweets and ranked the availability-tweets accordingly.
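To make the similarity-based matching approaches concrete, the following minimal Python sketch illustrates the general idea shared by the DataBros and Data Engineering Group runs: represent each tweet by its nouns, score a need/availability pair by cosine similarity of noun counts, and greedily keep up to five availability-tweets per need-tweet above a 0.7 threshold. For simplicity the nouns are hand-picked here rather than produced by a POS tagger; the data and vectorization choices are illustrative assumptions, not the teams' actual implementations.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two noun-count vectors (Counter objects)."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def greedy_match(need_nouns, avail_tweets, threshold=0.7, max_matches=5):
    """Return up to `max_matches` availability-tweet ids whose noun vectors
    have cosine similarity above `threshold` with the need-tweet."""
    need_vec = Counter(need_nouns)
    matches = []
    for tweet_id, nouns in avail_tweets:
        if cosine(need_vec, Counter(nouns)) > threshold:
            matches.append(tweet_id)
            if len(matches) == max_matches:
                break  # greedy: stop after the first five matches
    return matches

# Hypothetical nouns extracted from one need-tweet and three availability-tweets.
need = ["tents", "blankets", "Gorkha"]
avail = [("a1", ["tents", "blankets", "Kathmandu"]),
         ("a2", ["medicines", "doctors"]),
         ("a3", ["tents", "blankets", "Gorkha"])]
print(greedy_match(need, avail))
```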
4.2 Evaluation Measures and Results

The runs were evaluated against the gold standard generated by the manual runs. Additionally, the annotators (the same as those who developed the gold standard) checked many of the need-availability pairs matched by the methodologies (after pooling), and judged whether each match was correct.

We used the following IR measures to evaluate the runs; a minimal sketch of these measures follows the discussion below.

(i) Precision@5: Let n be the number of need-tweets correctly identified (i.e., present in the gold standard) by a particular matching methodology. For each need-tweet, we consider the top 5 matching availability-tweets returned by the method. The precision of the methodology is the fraction of pairs that are matched correctly (out of the 5 × n pairs).
(ii) Recall: The recall of matching is the fraction of all the need-tweets (present in the gold standard) which a methodology is able to match correctly.
(iii) F-Score: The F-score of a matching methodology is the harmonic mean of its precision and recall.

Team Id                 | Precision@5 | Recall | F-Score | Type           | Method summary
DataBros                | 0.2482      | 0.3888 | 0.3030  | Automatic      | POS tagging, common-noun overlap
Data Engineering Group  | 0.2081      | 0.2904 | 0.2424  | Automatic      | POS tagging, cosine similarity, brute-force search
Data Engineering Group  | 0.1758      | 0.3677 | 0.2379  | Automatic      | POS tagging, cosine similarity, greedy search
HLJIT2017-IRMIDIS       | 0.1819      | 0.1546 | 0.1671  | Automatic      | Indri, Dirichlet smoothing, KL distance sorting model
HLJIT2017-IRMIDIS       | 0.2033      | 0.1405 | 0.1662  | Automatic      | Indri, Dirichlet smoothing, KL distance sorting model
HLJIT2017-IRMIDIS       | 0.2051      | 0.0913 | 0.1264  | Automatic      | Indri, Dirichlet smoothing, KL distance sorting model
HLJIT2017-IRMIDIS_1     | 0.0882      | 0.2178 | 0.1256  | Automatic      | Indri, Dirichlet smoothing, correlation calculation
HLJIT2017-IRMIDIS_1     | 0.0825      | 0.1475 | 0.1058  | Automatic      | Indri, Dirichlet smoothing, correlation calculation
HLJIT2017-IRMIDIS_1     | 0.0889      | 0.0211 | 0.0341  | Automatic      | Indri, Dirichlet smoothing, correlation calculation
Radboud_CLS Netherlands | 0.3305      | 0.4450 | 0.3793  | Semi-automatic | n-grams, resource tagging

Table 4: Comparison among all the submitted runs in Task 2 (matching need-tweets and availability-tweets). Runs are ranked in decreasing order of F-score within each type.

Table 4 shows the evaluation results of each submitted run, along with a brief method summary. For each type, the runs are arranged in decreasing order of F-score. It is evident that the methods which computed the matching score from noun overlap or cosine similarity between need-tweets and availability-tweets (after POS tagging) outperformed the other methodologies.
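The following minimal Python sketch restates the three Task 2 measures from their definitions above; it operates on hypothetical identifiers and is an illustrative re-implementation, not the official evaluation script. In particular, "matched correctly" for recall is taken here to mean that at least one of the top 5 predicted availability-tweets for a need-tweet is correct, which is an interpretation of the definition above.

```python
def evaluate_matching(predicted, gold, k=5):
    """Precision@k, recall and F-score for need/availability matching.

    `predicted` maps each need-tweet id to its ranked list of availability-tweet ids;
    `gold` maps each gold-standard need-tweet id to the set of correct matches.
    """
    # Only need-tweets that are present in the gold standard are scored.
    scored = {need: pairs for need, pairs in predicted.items() if need in gold}
    n = len(scored)
    correct_pairs = sum(len(set(pairs[:k]) & gold[need]) for need, pairs in scored.items())
    precision = correct_pairs / (k * n) if n else 0.0
    # Assumption: a need-tweet counts as matched correctly if at least one predicted pair is correct.
    matched = sum(1 for need, pairs in scored.items() if set(pairs[:k]) & gold[need])
    recall = matched / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score

# Hypothetical toy example.
gold = {"n1": {"a3", "a7"}, "n2": {"a5"}, "n3": {"a9"}}
predicted = {"n1": ["a3", "a2"], "n2": ["a8", "a5"]}
print(evaluate_matching(predicted, gold))
```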
5 CONCLUSION AND FUTURE DIRECTIONS

The FIRE 2017 IRMiDis track successfully created a benchmark collection of code-mixed microblogs posted during disaster events. The track also compared the performance of various methodologies in retrieving and matching two pertinent and actionable types of information, namely need-tweets and availability-tweets. We hope that the test collection developed in this track will help the research community in developing better models for retrieval and matching in the future.

In this year's track we considered a static collection of code-mixed microblogs. However, in reality, microblogs arrive in a continuous stream. The challenge can be extended to retrieving relevant microblogs dynamically from a live stream of microblogs. We plan to explore this direction in the coming years.

ACKNOWLEDGEMENTS

The track organizers thank all the participants for their interest in this track. We also thank the FIRE 2017 organizers for their support in organizing the track.

REFERENCES

[1] M. Basu, K. Ghosh, S. Das, R. Dey, S. Bandyopadhyay, and S. Ghosh. 2017. Identifying Post-Disaster Resource Needs and Availabilities from Microblogs. In Proc. ASONAM.
[2] M. Basu, A. Roy, K. Ghosh, S. Bandyopadhyay, and S. Ghosh. 2017. Microblog Retrieval in a Disaster Situation: A New Test Collection for Evaluation. In Proc. Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP), co-located with the European Conference on Information Retrieval. 22–31. http://ceur-ws.org/Vol-1832/SMERP_2017_peer_review_paper_3.pdf
[3] M. Imran, C. Castillo, F. Diaz, and S. Vieweg. 2015. Processing Social Media Messages in Mass Emergency: A Survey. ACM Computing Surveys 47, 4 (June 2015), 67:1–67:38.
[4] Twitter Search API. 2017. https://dev.twitter.com/rest/public/search