=Paper=
{{Paper
|id=Vol-1737/T2-5
|storemode=property
|title=IIT BHU at FIRE 2016 Microblog Track: A Semi-automatic Microblog Retrieval System
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-5.pdf
|volume=Vol-1737
|authors=Ribhav Soni,Sukomal Pal
|dblpUrl=https://dblp.org/rec/conf/fire/SoniP16
}}
==IIT BHU at FIRE 2016 Microblog Track: A Semi-automatic Microblog Retrieval System==
Ribhav Soni and Sukomal Pal
Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi
ribhav.soni.cse13@iitbhu.ac.in, spal.cse@iitbhu.ac.in

ABSTRACT

This paper presents our work for the Microblog Track in FIRE 2016. The task involved utilizing microblog data (tweets) to retrieve useful information during times of disaster. In particular, given a set of tweets posted during the Nepal earthquake in 2015, the goal was to judge the relevance of each tweet against a set of topics which reflected useful information needs. Our approach indexed the tweets using Lucene and then searched for relevant tweets with manually formed queries based on the information required for each topic.

1. INTRODUCTION

This paper describes our approach for the Microblog Track in FIRE 2016 [1]. Microblogging sites like Twitter are important sources of real-time information and can therefore be used to extract significant information at times of disasters such as floods, earthquakes, and cyclones. The aim of the Microblog Track at FIRE 2016 [1] was to develop IR systems to retrieve important information from microblogs posted at the time of disasters. The task involved identifying tweets relevant to the given topics, which reflect information needs at critical times. The topics were provided in the standard TREC format, containing a title, a brief description, and a detailed narrative specifying what type of tweets would be considered relevant to the topic. An example of a topic is:

  Number: FMT4
  WHAT MEDICAL RESOURCES WERE REQUIRED
  Description: Identify the messages which describe the requirement of some medicine or other medical resources.
  Narrative: A relevant message must mention the requirement of some medical resource like medicines, medical equipment, supplementary food items, blood, human resources like doctors/staff, and resources to build or support medical infrastructure like tents, water filters, power supply, ambulances, etc. Generalized statements without reference to medical resources would not be relevant.

Three types of runs were considered in the track, based on the amount of manual intervention in different stages such as query formation and document retrieval:

(1) Fully automatic, where no step involves manual intervention
(2) Semi-automatic, where there is some manual intervention, but only in the query formation stage
(3) Manual, where there is manual intervention in both the query formation and document retrieval stages

We submitted one run in the semi-automatic category. We used Lucene [2] for indexing, and retrieved relevant tweets for each of the topics by manual query formation. Results show that our system performs reasonably well given its simplicity, but is outperformed by more complex systems.

The rest of this paper is structured as follows. We describe the training data in Section 2, the main challenges arising from the nature of microblogs in Section 3, our approach in Section 4, results and discussion in Section 5, and conclusion and future work in Section 6.

2. DATA

The training data was a collection of about 50,000 tweets posted during the Nepal earthquake in April 2015, along with the associated metadata for each tweet [4].

3. CHALLENGES

Tweets have a stringent length limit, and users often make use of innovative abbreviations which are difficult for retrieval systems to handle. Besides, tweets are mostly informal and may mix multiple languages in the same tweet (called code mixing), or even multiple scripts. It is also difficult to make sense of emoticons, especially innovative ones made up by users.

4. OUR APPROACH

Our run was in the semi-automatic category, which includes systems with manual intervention in the query formation stage. We used Apache Lucene, an open-source text search engine library, for indexing the available tweets.
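As a rough illustration of this indexing step (a minimal sketch, not the code actually submitted for the run), tweets could be indexed with Lucene 6.x as follows. The index path and the field names "id" and "text" are hypothetical choices for this example; the StandardAnalyzer used for tokenization is described later in this section.

<pre>
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TweetIndexer {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer tokenizes with the Unicode word-break rules [5],
        // lowercases each token, and removes English stop words.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("tweet-index")), config)) {
            // Hypothetical example document; in practice the tweet id and text
            // would be read from the metadata distributed with the collection [4].
            Document doc = new Document();
            doc.add(new StringField("id", "12345", Field.Store.YES));
            doc.add(new TextField("text",
                    "drinking water and food packets available near the camp",
                    Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
</pre>

StringField keeps the tweet id as a single untokenized term, while TextField runs the tweet text through the analyzer so that its terms can be searched.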
For retrieval, we used Lucene's search facility with manual search queries, which were formed on the basis of the information requirements for each topic. The search queries that we used for each of the topics are given in Table 1.

Table 1: Query strings for each topic

  What Resources Were Available: "food water clothes volunteers power charge available^2"
  What Resources Were Required: "food water clothes volunteers power charge available^2"
  What Medical Resources Were Available: "doctor medicine ambulance blood milk baby food nurse water tent power available^2"
  What Medical Resources Were Required: "doctor medicine ambulance food blood nurse water tent power require^2 need^2"
  What Were The Requirements / Availability Of Resources At Specific Locations: "location^3 place^2 town^2 kathmandu village available need"
  What Were The Activities Of Various NGOs / Government Organizations: "NGO^5 government^4 work^3"
  What Infrastructure Damage And Restoration Were Being Reported: "road railway house damage place town"

Lucene first selects the documents to be scored based on Boolean logic from the query specification, and then ranks them via the specified retrieval model [3]. We made use of the default similarity model, which computes scores using a combination of the Vector Space Model (VSM) and probabilistic models such as Okapi BM25.

For tokenization, we used the StandardAnalyzer, which creates tokens using the word break rules of the Unicode Text Segmentation algorithm specified in [5]. It is capable of handling names and email addresses, lowercases each token, and removes stop words and punctuation.

Lucene ranks the returned search results (retrieved tweets) by degree of relevance using its scoring algorithms, and returns the ranked list along with individual scores. Since the task involved tagging all relevant tweets, all the returned tweets were marked relevant. In addition, the returned scores were normalized to the range [0, 1] to assign a relevance score to each returned topic-tweet pair, as per the run submission instructions.
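The retrieval and normalization step can be sketched as follows (again an illustration, not the submitted code). The sketch parses the boosted query string listed in Table 1 for the "What Medical Resources Were Required" topic against the hypothetical "text" field from the indexing sketch above; dividing every score by the top score is one simple way to map Lucene scores into [0, 1], and is an assumption here rather than a description of the exact normalization used.

<pre>
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TweetSearcher {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("tweet-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Manually formed query for the "What Medical Resources Were Required"
            // topic (Table 1); ^2 boosts the terms "require" and "need".
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            Query query = parser.parse(
                "doctor medicine ambulance food blood nurse water tent power require^2 need^2");

            TopDocs results = searcher.search(query, 1000);
            ScoreDoc[] hits = results.scoreDocs;

            // Hits are sorted by descending score, so the first hit carries the
            // maximum score; dividing by it maps all scores into [0, 1].
            float maxScore = hits.length > 0 ? hits[0].score : 1f;
            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                System.out.printf("%s\t%.4f%n", doc.get("id"), hit.score / maxScore);
            }
        }
    }
}
</pre>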
5. RESULTS AND DISCUSSION

The results of our run on several metrics are given in Table 2. Our system performed reasonably well in the semi-automatic category. However, it was outperformed in the task by more elaborate systems.

Table 2: Results of our run

  RUN ID: iitbhu fmt16 1
  Precision@20: 0.3214
  Recall@1000: 0.2581
  MAP@1000: 0.0670
  Overall MAP: 0.0827

6. CONCLUSION AND FUTURE WORK

Our system was overly simplistic and offers much scope for improvement through state-of-the-art IR techniques. Approaches to improve the system include: better preprocessing of tweets (essential for microblog retrieval, given the challenges described in Section 3); taking the quality of tweets into account by considering prior tweets by the same author; query expansion using external information (such as Google search results or a Wikipedia corpus); and pseudo-relevance feedback techniques to better tune the search results. Also, Lucene was too liberal in returning tweets with very low scores as part of the search results, and most of these were found to be non-relevant. It is therefore important to further refine the results returned for a query by setting a threshold, so that only tweets with a reasonably large similarity score are considered relevant.

7. REFERENCES

[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog Track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] Apache Lucene. https://lucene.apache.org/
[3] org.apache.lucene.search (Lucene 6.2.1 API). https://lucene.apache.org/core/6_2_1/core/org/apache/lucene/search/package-summary.html#package.description
[4] Tweets – Twitter Developers. https://dev.twitter.com/overview/api/tweets
[5] Unicode Standard Annex #29: Unicode Text Segmentation. http://unicode.org/reports/tr29/