
IIT BHU at FIRE 2016 Microblog Track: A Semi-automatic Microblog Retrieval System

Ribhav Soni

ribhav.soni.cse13@iitbhu.ac.in

Sukomal Pal

spal.cse@iitbhu.ac.in

Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi

This paper presents our work for the Microblog Track in FIRE 2016. The task involved utilizing microblog data (tweets) to retrieve useful information during times of disasters. In particular, given a set of tweets posted during the Nepal earthquake in 2015, the goal was to judge the relevance of each tweet against a set of topics which reflected useful information needs. Our approach indexed the tweets using Lucene and then searched for relevant tweets with queries formed manually from the information required for each topic.

1. INTRODUCTION

This paper describes our approach for the Microblog Track in FIRE 2016 [1]. Microblogging sites like Twitter are important sources of real-time information, and thus can be utilized for extracting significant information at times of disasters such as floods, earthquakes, and cyclones. The aim of the Microblog Track at FIRE 2016 [1] was to develop IR systems to retrieve important information from microblogs posted at the time of disasters. The task involved identifying tweets relevant to the given topics, which reflect the information needs at critical times. The topics were provided in a standard TREC format, containing a title, a brief description, and a detailed narrative specifying what type of tweets would be considered relevant to the topic. An example of a topic is:

<top>

<num> Number: FMT4

<title> WHAT MEDICAL RESOURCES WERE REQUIRED

<desc> Description:

Identify the messages which describe the requirement of some medicine or other medical resources.

<narr> Narrative:

A relevant message must mention the requirement of some medical resource like medicines, medical equipments, supplementary food items, blood, human resources like doctors/staff and resources to build or support medical infrastructure like tents, water filter, power supply, ambulance, etc. Generalized statements without reference to medical resources would not be relevant.

</top>
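The topic layout shown above can be read programmatically. The following is a minimal sketch, assuming each topic is wrapped in `<top>` ... `</top>` and uses the four tags from the example; the function name `parse_trec_topic` is hypothetical, and the narrative in the sample string is shortened for brevity.

```python
import re

def parse_trec_topic(text):
    """Split one <top>...</top> block into its num, title, desc and narr fields."""
    fields = {}
    # Each field starts at its tag and runs until the next tag or the closing </top>.
    pattern = re.compile(
        r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)", re.S
    )
    for tag, body in pattern.findall(text):
        # Drop the human-readable label ("Number:", "Description:", "Narrative:").
        body = re.sub(r"^\s*(?:Number|Description|Narrative):", "", body)
        # Collapse line breaks and repeated spaces into single spaces.
        fields[tag] = " ".join(body.split())
    return fields

# Sample topic (narrative shortened for illustration).
topic = """<top>
<num> Number: FMT4
<title> WHAT MEDICAL RESOURCES WERE REQUIRED
<desc> Description:
Identify the messages which describe the requirement of some medicine or other medical resources.
<narr> Narrative:
A relevant message must mention the requirement of some medical resource.
</top>"""

fields = parse_trec_topic(topic)
print(fields["num"], "|", fields["title"])
```

In our semi-automatic setting, the parsed title and narrative would only serve as input for a person writing the query, not as the query itself.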

Three types of runs were considered in the track, based on the amount of manual intervention in different stages such as query formation and document retrieval:

(1) Fully automatic, where no step involves manual intervention

(2) Semi-automatic, where there is some manual intervention but only in the query formation stage

(3) Manual, where there is manual intervention in both the query formation and document retrieval stages

We submitted one run in the semi-automatic category. We used Lucene [2] for indexing, and retrieved relevant tweets for each of the topics by manual query formation. Results show that our system performs reasonably well, given its simplicity, but is outperformed by more complex systems.

The rest of this paper is structured as follows. We describe the training data used in Section 2, the main challenges involved due to the nature of microblogs in Section 3, our approach in Section 4, results and discussions in Section 5, and conclusion and future work in Section 6.

2. DATA

The training data was a collection of about 50,000 tweets posted during the Nepal earthquake in April 2015, along with the associated metadata for each tweet [4].

3. CHALLENGES

Tweets have a stringent character limit, and users often resort to innovative abbreviations which are difficult for retrieval systems to handle. Besides, tweets are mostly informal, and may involve the use of multiple languages in the same tweet (called code mixing), or even multiple scripts in a tweet. It is also difficult to make sense of emoticons, especially innovative ones made up by users.

4. OUR APPROACH

Our run was in the semi-automatic category, which includes systems with manual intervention in the query formation stage. We used Apache Lucene, an open-source textual search engine library, for indexing the available tweets.

For retrieval, we used Lucene's search facility with manual search queries, which were formed on the basis of requirements for each topic. The search queries that we used for each of the topics are given in Table 1.

Lucene first selects the documents to be scored based on Boolean logic from the query specification, and then ranks them via the specified retrieval model [3]. We made use of the default similarity model, which computes scores using a combination of the Vector Space Model (VSM) and probabilistic models such as Okapi BM25.
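The two-stage behaviour (Boolean selection, then ranking) can be illustrated with a toy sketch. This is not Lucene's implementation or its exact scoring formula: it assumes OR semantics between query terms, whitespace tokenization, and a simple boosted TF-IDF sum, and the names `parse_query` and `search` are hypothetical.

```python
import math
from collections import Counter

def parse_query(query):
    """Parse 'term^boost' tokens into a {term: boost} dict; boost defaults to 1.0."""
    boosts = {}
    for token in query.lower().split():
        term, _, boost = token.partition("^")
        boosts[term] = float(boost) if boost else 1.0
    return boosts

def search(query, docs):
    """Select docs containing any query term, then rank them by a boosted TF-IDF sum."""
    boosts = parse_query(query)
    tokenized = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    # Document frequency of each query term.
    df = {t: sum(1 for counts in tokenized if t in counts) for t in boosts}
    results = []
    for i, counts in enumerate(tokenized):
        matched = [t for t in boosts if t in counts]
        if not matched:
            continue  # Boolean step: no query term present, so the doc is never scored
        score = sum(
            boosts[t] * math.sqrt(counts[t]) * (1 + math.log(n / (df[t] + 1)))
            for t in matched
        )
        results.append((i, score))
    # Ranking step: highest score first.
    return sorted(results, key=lambda r: r[1], reverse=True)

docs = [
    "need doctor and medicine in kathmandu",
    "road damage near the town",
    "need need blood",
]
ranked = search("doctor medicine blood need^2", docs)
print(ranked)
```

The second document contains none of the query terms, so it is dropped before scoring; the boost on `need` raises the contribution of that term in the remaining two.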

Table 1: Search queries used for each topic

TOPIC | QUERY STRING
What Resources Were Available | "food water clothes volunteers power charge available^2"
What Resources Were Required | "food water clothes volunteers power charge available^2"
What Medical Resources Were Available | "doctor medicine ambulance blood milk baby food nurse water tent power available^2"
What Medical Resources Were Required | "doctor medicine ambulance food blood nurse water tent power require^2 need^2"
What Were The Requirements / Availability Of Resources At Specific Locations | "location^3 place^2 town^2 kathmandu village available need"
What Were The Activities Of Various NGOs / Government Organizations | "NGO^5 government^4 work^3"
What Infrastructure Damage And Restoration Were Being Reported | "road railway house damage place town"

Table 2: Results of our run

RUN ID | MAP@1000 | OverallMAP
iitbhu fmt16 1 | 0.0670 | 0.0827

For tokenization, we used the StandardAnalyzer, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in [5]. It is capable of handling names and email addresses, lowercases each token, and removes stopwords and punctuation.
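The effect of such an analysis chain on a tweet can be approximated with a very rough sketch. This is not the StandardAnalyzer: it splits on non-alphanumeric characters instead of applying the Unicode word break rules, and the stopword list and the function name `analyze` are assumptions made for illustration.

```python
import re

# A small illustrative stopword list (assumption; not Lucene's actual list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def analyze(text):
    """Lowercase the text, split it on non-alphanumeric characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("Doctors are needed in Kathmandu!"))
```

Because the same analysis is applied to both tweets and queries, a query term such as `kathmandu` matches regardless of the capitalization or punctuation used in the tweet.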

Lucene ranks the returned search results (retrieved tweets) based on the degree of relevance using its scoring algorithms, and returns the ranked list as well as individual scores. Since the task involved tagging all relevant tweets, all the returned tweets were marked relevant. In addition, the returned scores were normalized to the range of [0, 1] for assigning relevance scores to each returned topic-tweet pair as per the run submission instructions.
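One simple way to map raw scores into [0, 1] is to divide by the highest score returned for the topic. The sketch below assumes that scheme and a list of (tweet ID, score) pairs; the exact normalization used for the run is not specified here, and `normalize_scores` and the tweet IDs are hypothetical.

```python
def normalize_scores(scored):
    """Scale raw scores into [0, 1] by dividing by the highest score."""
    if not scored:
        return []
    top = max(score for _, score in scored)
    if top <= 0:
        # Degenerate case: no positive score to scale against.
        return [(tweet_id, 0.0) for tweet_id, _ in scored]
    return [(tweet_id, score / top) for tweet_id, score in scored]

# Hypothetical raw scores for one topic.
print(normalize_scores([("t1", 4.0), ("t2", 1.0)]))
```

Dividing by the maximum keeps the ranking unchanged and gives the top tweet for each topic a score of 1.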

5. RESULTS AND DISCUSSION

The results of our run based on several metrics are given in Table 2.

Our system performed reasonably well in the semi-automatic category. However, it was outperformed in the task by more elaborate systems.

6. CONCLUSION AND FUTURE WORK

Our system was overly simplistic, and offers much scope for improvement by making use of state-of-the-art IR techniques.

Some approaches to improve on the system include better preprocessing of tweets (which is essential for microblog retrieval tasks, given the challenges with the nature of microblogs that we described in Section 3), taking the quality of tweets into account by considering the prior tweets by the author, query expansion approaches using external information (like Google search results, or the Wikipedia corpus), and using pseudo-relevance feedback techniques to better tune relevant search results. Also, Lucene was too liberal in returning tweets with a very low score as part of search results, most of which were found to be non-relevant. Thus, it is important to further refine the results returned by the search query by setting a threshold, so that only tweets with a reasonably large similarity score are considered relevant to a query.
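The thresholding idea can be sketched as follows, assuming scores already normalized to [0, 1]. The cut-off value of 0.2 and the name `filter_by_threshold` are hypothetical; a suitable threshold would have to be tuned, for example on the training topics.

```python
def filter_by_threshold(scored, threshold=0.2):
    """Keep only (tweet_id, score) pairs whose normalized score reaches the threshold."""
    return [(tweet_id, score) for tweet_id, score in scored if score >= threshold]

# Hypothetical normalized scores for one topic.
results = [("t1", 1.0), ("t2", 0.45), ("t3", 0.03)]
print(filter_by_threshold(results))
```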

REFERENCES