IIT BHU at FIRE 2017 IRMiDis Track - Fully Automatic Approaches to Information Retrieval

Harshit Mehrotra, Ribhav Soni, Sukomal Pal
Department of Computer Science and Engineering
Indian Institute of Technology (BHU), Varanasi 221005
harshit.mehrotra.cse15@iitbhu.ac.in, ribhav.soni.cse13@iitbhu.ac.in, spal.cse@iitbhu.ac.in

ABSTRACT

This paper presents the work of the team of IIT (BHU) Varanasi for the IRMiDis track at FIRE 2017. The task was to classify tweets posted during a disaster into those expressing the need for and the availability of various types of resources, given a set of tweets from the 2015 Nepal earthquake. We submitted two runs, both of which were fully automatic.

KEYWORDS

Information retrieval, microblogs, disaster, Lucene, query generation

1 INTRODUCTION

With the growing impact of social media, websites like Twitter that provide microblogging services have become increasingly popular. Apart from acting as a window to the outside world, they also serve as an important means to communicate and collect information, especially in times of emergency or disaster. The IRMiDis track at FIRE 2017 [5] posed the challenge of working with such data. Specifically, the task was to develop IR methodologies to classify tweets as:

• Need-tweets: Indicating the need or requirement of some specific resource such as food, water, medical aid, or shelter. Tweets pointing to the scarcity or non-availability of some resource also qualify for this category.
• Availability-tweets: Informing about the potential or actual availability of resources, e.g., resources being transported or food packets being delivered.

A tweet may be both a need-tweet and an availability-tweet. We submitted two runs, both fully automatic, i.e., no retrieval step involved manual intervention. Details of the runs are given in the subsequent sections.

2 DATA

The data contained around 70,000 microblogs (tweets) posted on Twitter during the 2015 Nepal earthquake, some of which were code-mixed, i.e., written in multiple languages and/or scripts. Around 20,000 of these were provided for development/training and the remaining 50,000 for testing and evaluation.

3 OUR METHODOLOGY - RUN 1

This run is fully automatic in both query generation and searching. It makes use of Apache Lucene, an open-source Java-based text search engine library [1]. The run can be divided into the following steps (illustrative sketches of the steps appear after this list):

(1) Cleaning and Tokenization: The tweets in the training data are first cleaned to remove hashtags, numbers, addresses (of the type @…) and URLs. These objects are not indicative of the category (Nepal-Need/Nepal-Avail) a tweet falls in; many hashtags (like #earthquake, #nepal, #NepalEarthquake) can appear in tweets of either category. The cleaned tweets are then tokenized using Lucene's Standard Analyzer, which converts each token to lowercase and removes stopwords and punctuation, if any, before indexing [4]. The frequency of each token in the training set is then recorded.

(2) Query Generation: The token set of each category is first reduced to its set difference with the token set of the other category. The queries for the two categories are then generated as follows:

• Nepal-Avail: disjunction of tokens with frequency greater than or equal to 3, each weighted by its frequency divided by 3.
• Nepal-Need: disjunction of tokens with frequency greater than or equal to 2, each weighted by its frequency divided by 2.

The threshold frequencies are set in accordance with the number of tweets of each category present in the training set.

(3) Searching and Scoring: The test set is pre-processed and indexed in the same way as the training set in step 1. The test index is then searched with the queries generated in step 2, and the scores are computed by Lucene. Lucene's scoring combines the Vector Space Model (VSM) of information retrieval with the Boolean model to determine how relevant a given document is to a query [3].

(4) Categorization: The scores returned by Lucene are normalized to (0,1), and tweets scoring >= 0.1 and >= 0.2 are assigned to the categories Nepal-Avail and Nepal-Need respectively. We observed that since tokens for Nepal-Avail are selected with a higher frequency threshold, the corresponding query retrieves suitable tweets even at a lower score, hence the difference in thresholds.
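To make step (1) concrete, the following is a minimal Python sketch of the cleaning and token-frequency counting. The actual run used Lucene's StandardAnalyzer (in Java); the regular expressions, the abbreviated stop-word list, and the function names below are our own illustrative simplifications, not Lucene's exact behaviour.

```python
import re
from collections import Counter

# Abbreviated stop-word list for illustration; Lucene's StandardAnalyzer
# ships its own (larger) English list.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
             "is", "it", "of", "on", "or", "the", "to", "was", "with"}

def clean(tweet):
    """Remove URLs, @-addresses, hashtags and numbers, as in step (1)."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)       # @-addresses and hashtags
    return re.sub(r"\d+", " ", tweet)            # numbers

def tokenize(tweet):
    """Lowercase, strip punctuation and drop stop words -- roughly what
    the Standard Analyzer does before indexing."""
    return [t for t in re.findall(r"[a-z]+", clean(tweet).lower())
            if t not in STOPWORDS]

def token_frequencies(tweets):
    """Record the frequency of every token across the training tweets."""
    freqs = Counter()
    for tweet in tweets:
        freqs.update(tokenize(tweet))
    return freqs
```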
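Step (2) can then be sketched as below. In the run itself the query is a Lucene disjunction of boosted terms; here a query is simply represented as a token-to-weight mapping, and build_query is a hypothetical helper of ours, not a library function.

```python
from collections import Counter

def build_query(own: Counter, other: Counter, min_freq: int) -> dict:
    """Weighted disjunctive query for one category: keep only tokens
    unique to this category (set difference), drop tokens below the
    frequency threshold, and weight each survivor by frequency/threshold."""
    return {token: freq / min_freq
            for token, freq in own.items()
            if token not in other and freq >= min_freq}

# Thresholds from the paper: 3 for Nepal-Avail, 2 for Nepal-Need.
# avail_query = build_query(avail_freqs, need_freqs, min_freq=3)
# need_query  = build_query(need_freqs, avail_freqs, min_freq=2)
```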
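Steps (3) and (4) are approximated below, reusing tokenize and the queries from the previous sketches. The score function is a crude stand-in for Lucene's combined Boolean/VSM scoring, and normalizing by the maximum raw score is our assumption about how the scores were mapped into (0, 1).

```python
def score(query, tweet_tokens):
    """Stand-in for Lucene scoring: sum the weights of the query terms
    that occur in the tweet."""
    present = set(tweet_tokens)
    return sum(w for term, w in query.items() if term in present)

def categorize(query, test_tweets, threshold):
    """Normalize raw scores by the maximum and keep tweets at or above
    the category threshold (0.1 for Nepal-Avail, 0.2 for Nepal-Need)."""
    raw = [(tweet, score(query, tokenize(tweet))) for tweet in test_tweets]
    max_score = max((s for _, s in raw), default=0.0) or 1.0
    return [tweet for tweet, s in raw if s / max_score >= threshold]

# avail_tweets = categorize(avail_query, test_tweets, threshold=0.1)
# need_tweets  = categorize(need_query,  test_tweets, threshold=0.2)
```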
4 OUR METHODOLOGY - RUN 2

This run is also fully automatic, i.e., no retrieval step required manual intervention.

• The task was treated as a classification task, and an SVM classifier was applied, as implemented in the scikit-learn machine learning library [6].
• Preprocessing included the removal of tokens like "RT", URLs, and tokens starting with "@" or "#".
• Besides the code-mixed training data provided for this task, the gold standard from the FIRE 2016 Microblog Track was also used.
• Undersampling was employed, i.e., only as many non-relevant tweets were given to the classifier as relevant tweets, since relevant tweets were far fewer than irrelevant ones.
• To deal with code-mixed tweets, Google Translate [2] was used to convert tweets in other languages to English. Specifically, if the language field of the tweet metadata was "hi" (Hindi) or "ne" (Nepali), the tweet was translated from that language to English. Tweets in any other non-English language were assumed to be in Nepali (the most common non-English language in the data) and translated accordingly.
• A tweet was classified as relevant if the score predicted by the SVM classifier was at least 0.2.

A sketch of this pipeline is given below.
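This is a minimal sketch of Run 2, assuming TF-IDF features and the SVM's probability estimates as the predicted score; the paper specifies neither, so both are assumptions, as is the caller-supplied translate function standing in for Google Translate.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def to_english(text, lang, translate):
    """Route code-mixed tweets through a translator. `translate` is a
    caller-supplied function translate(text, source_lang) -> str (the
    run used Google Translate); any non-English language other than
    Hindi ("hi") is treated as Nepali ("ne")."""
    if lang == "en":
        return text
    return translate(text, lang if lang in ("hi", "ne") else "ne")

def undersample(relevant, irrelevant, seed=0):
    """Balance the classes: keep only as many irrelevant tweets as
    there are relevant ones."""
    irrelevant = random.Random(seed).sample(irrelevant, len(relevant))
    return relevant + irrelevant, [1] * len(relevant) + [0] * len(irrelevant)

def train_and_classify(relevant, irrelevant, test_tweets, threshold=0.2):
    """Train a linear SVM on TF-IDF features (an assumed representation)
    and call a test tweet relevant when its predicted probability of the
    relevant class is at least `threshold`."""
    train, labels = undersample(relevant, irrelevant)
    vectorizer = TfidfVectorizer()
    clf = SVC(kernel="linear", probability=True)
    clf.fit(vectorizer.fit_transform(train), labels)
    probs = clf.predict_proba(vectorizer.transform(test_tweets))[:, 1]
    return [tweet for tweet, p in zip(test_tweets, probs) if p >= threshold]
```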
5 RESULTS

The results of our runs on several metrics are given in Table 1. Our run with Run ID iitbhu_fmt17_task1_2 was the best-performing run in this task.

Table 1: Results of our runs

S. No.  Run ID                Availability-Tweets Evaluation        Need-Tweets Evaluation                Average MAP
                              Precision@100  Recall@100  MAP        Precision@100  Recall@100  MAP
1       iitbhu_fmt17_task1_2  0.79           0.5082      0.3786     0.79           0.7237      0.4986    0.4386
2       iitbhu_fmt17_task1_1  0.54           0.0867      0.057      0.58           0.2272      0.1241    0.0906

REFERENCES

[1] Apache Lucene Core. https://lucene.apache.org/core/.
[2] Google Translate. https://translate.google.com/.
[3] Lucene Scoring. https://lucene.apache.org/core/3_6_0/scoring.html.
[4] Lucene Standard Analyzer. https://www.tutorialspoint.com/lucene/lucene_standardanalyzer.htm.
[5] M. Basu, S. Ghosh, K. Ghosh, and M. Choudhury. 2017. Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.