     IIT BHU at FIRE 2016 Microblog Track: A Semi-automatic
                   Microblog Retrieval System

                            Ribhav Soni                                             Sukomal Pal
              Department of Computer Science and                       Department of Computer Science and
                            Engineering                                              Engineering
          Indian Institute of Technology (BHU) Varanasi            Indian Institute of Technology (BHU) Varanasi
               ribhav.soni.cse13@iitbhu.ac.in                                 spal.cse@iitbhu.ac.in

ABSTRACT
This paper presents our work for the Microblog Track in FIRE 2016. The task involved utilizing microblog data (tweets) to retrieve useful information during times of disasters. In particular, given a set of tweets posted during the Nepal earthquake in 2015, the goal was to judge the relevance of each tweet against a set of topics which reflected useful information needs. Our approach made use of manual query formation for searching relevant tweets based on the information required for each topic, after indexing them using Lucene.

1.   INTRODUCTION
   This paper describes our approach for the Microblog Track in FIRE 2016 [1]. Microblogging sites like Twitter are important sources of real-time information, and thus can be utilized for extracting significant information at times of disasters such as floods, earthquakes, cyclones, etc. The aim of the Microblog track at FIRE 2016 [1] was to develop IR systems to retrieve important information from microblogs posted at the time of disasters. The task involved identifying tweets relevant to the given topics which reflect the information needs at critical times. The topics were provided in a standard TREC format, containing a title, a brief description, and a detailed narrative specifying what type of tweets would be considered relevant to the topic. An example of a topic is:

   Number: FMT4
   WHAT MEDICAL RESOURCES WERE REQUIRED
   <desc> Description:
   Identify the messages which describe the requirement of some medicine or other medical resources.
   <narr> Narrative:
   A relevant message must mention the requirement of some medical resource like medicines, medical equipments, supplementary food items, blood, human resources like doctors/staff and resources to build or support medical infrastructure like tents, water filter, power supply, ambulance, etc. Generalized statements without reference to medical resources would not be relevant.
   </top>

   Three types of runs were considered in the track, based on the amount of manual intervention in different stages such as query formation and document retrieval:
   (1) Fully automatic, where no step involves manual intervention
   (2) Semi-automatic, where there is some manual intervention, but only in the query formation stage
   (3) Manual, where there is manual intervention in both the query formation and document retrieval stages
   We submitted one run in the semi-automatic category. We used Lucene [2] for indexing, and retrieved relevant tweets for each of the topics by manual query formation. Results show that our system performs reasonably well, given its simplicity, but is outperformed by more complex systems.
   The rest of this paper is structured as follows. We describe the training data used in Section 2, the main challenges posed by the nature of microblogs in Section 3, our approach in Section 4, results and discussion in Section 5, and conclusion and future work in Section 6.

2.   DATA
   The training data was a collection of about 50,000 tweets posted during the Nepal earthquake in April 2015, along with the associated metadata for each tweet [4].

3.   CHALLENGES
   Tweets have a stringent word limit, and users often make use of innovative abbreviations which are difficult to handle for retrieval systems. Besides, they are mostly informal, and may involve the use of multiple languages in the same tweet (called code mixing), or even multiple scripts in a tweet. It is also difficult to make sense of emoticons, especially innovative ones made up by users.

4.   OUR APPROACH
   Our run was in the semi-automatic category, which includes systems with manual intervention in the query formation stage. We used Apache Lucene, an open-source textual search engine library, for indexing the available tweets.
   For retrieval, we used Lucene's search facility with manual search queries, which were formed on the basis of the requirements for each topic. The search queries that we used for each of the topics are given in Table 1.
   Lucene first selects the documents to be scored based on Boolean logic from the query specification, and then ranks them via the specified retrieval model [3]. We made use of the default similarity model, which computes scores using a combination of the Vector Space Model (VSM) and probabilistic models such as Okapi BM25.
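   To make the pipeline concrete, the following is a minimal sketch of how tweets can be indexed and then searched with boosted keyword queries of the kind listed in Table 1, using the standard Lucene 6.x API. It is a sketch rather than the exact code used for the run: the field names ("id", "text"), the in-memory directory, and the sample tweet are assumptions.

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.*;
   import org.apache.lucene.index.*;
   import org.apache.lucene.queryparser.classic.QueryParser;
   import org.apache.lucene.search.*;
   import org.apache.lucene.store.RAMDirectory;

   public class TweetSearchSketch {
       public static void main(String[] args) throws Exception {
           StandardAnalyzer analyzer = new StandardAnalyzer();
           RAMDirectory dir = new RAMDirectory();   // in-memory index, for illustration only

           // Index each tweet as a Lucene document carrying its id and text.
           IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
           Document doc = new Document();
           doc.add(new StringField("id", "123456", Field.Store.YES));
           doc.add(new TextField("text", "Medicines and doctors urgently needed near Kathmandu", Field.Store.YES));
           writer.addDocument(doc);
           writer.close();

           // Search with a manually formed query; "^2" boosts the weight of a term.
           IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
           Query query = new QueryParser("text", analyzer)
                   .parse("doctor medicine ambulance food blood nurse water tent power require^2 need^2");
           TopDocs hits = searcher.search(query, 1000);
           for (ScoreDoc sd : hits.scoreDocs) {
               System.out.println(searcher.doc(sd.doc).get("id") + "\t" + sd.score);
           }
       }
   }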
                                           Table 1: Query strings for each topic

  TOPIC                                                                          QUERY STRING
  What Resources Were Available                                                  "food water clothes volunteers power charge available^2"
  What Resources Were Required                                                   "food water clothes volunteers power charge available^2"
  What Medical Resources Were Available                                          "doctor medicine ambulance blood milk baby food nurse water tent power available^2"
  What Medical Resources Were Required                                           "doctor medicine ambulance food blood nurse water tent power require^2 need^2"
  What Were The Requirements / Availability Of Resources At Specific Locations   "location^3 place^2 town^2 kathmandu village available need"
  What Were The Activities Of Various NGOs / Government Organizations            "NGO^5 government^4 work^3"
  What Infrastructure Damage And Restoration Were Being Reported                 "road railway house damage place town"


                                             Table 2: Results of our run

  RUN ID            Precision@20    Recall@1000    MAP@1000    Overall MAP
  iitbhu fmt16 1    0.3214          0.2581         0.0670      0.0827


   For tokenization, we used the StandardAnalyzer, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in [5]. It is capable of handling names and email addresses, lowercases each token, and removes stopwords and punctuation.
   Lucene ranks the returned search results (retrieved tweets) by degree of relevance using its scoring algorithms, and returns the ranked list along with the individual scores. Since the task involved tagging all relevant tweets, all the returned tweets were marked relevant. In addition, the returned scores were normalized to the range [0, 1] to assign a relevance score to each returned topic-tweet pair, as per the run submission instructions.
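   A minimal sketch of this normalization step is given below. It assumes a simple min-max rescaling, which is one way of mapping raw Lucene scores to [0, 1]; the run only constrained the target range, so this choice of rescaling is an assumption.

   import org.apache.lucene.search.ScoreDoc;

   public class ScoreNormalizer {
       // Rescale raw Lucene scores to [0, 1]; min-max is an assumed choice of map.
       public static float[] normalize(ScoreDoc[] hits) {
           float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
           for (ScoreDoc sd : hits) {
               min = Math.min(min, sd.score);
               max = Math.max(max, sd.score);
           }
           float[] normalized = new float[hits.length];
           for (int i = 0; i < hits.length; i++) {
               // If all scores are equal, assign 1.0 to avoid division by zero.
               normalized[i] = (max == min) ? 1.0f : (hits[i].score - min) / (max - min);
           }
           return normalized;
       }
   }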
5.   RESULTS AND DISCUSSION
   The results of our run based on several metrics are given
in Table 2.
   Our system performed reasonably well in the semi-automatic
category. However, it was outperformed in the task by more
elaborate systems.

6.   CONCLUSION AND FUTURE WORK
   Our system was overly simplistic, and offers much scope for improvement by making use of state-of-the-art IR techniques.
   Some approaches to improve the system include better preprocessing of tweets (which is essential for microblog retrieval tasks, given the challenges posed by the nature of microblogs described in Section 3), taking the quality of tweets into account by considering the prior tweets of the author, query expansion using external information (such as Google search results or a Wikipedia corpus), and pseudo-relevance feedback techniques to better tune the retrieved results. Also, Lucene was too liberal in returning tweets with very low scores among the search results, most of which were found to be non-relevant. It is therefore important to further refine the results returned for a query by setting a threshold, so that only tweets with a reasonably high similarity score are considered relevant; a small sketch of such filtering follows.
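   The sketch below shows one hypothetical realization of this threshold filtering; the cutoff value is not something we tuned and is purely illustrative.

   import java.util.ArrayList;
   import java.util.List;
   import org.apache.lucene.search.ScoreDoc;

   public class ThresholdFilter {
       // Keep only hits whose (normalized) score reaches the cutoff; the cutoff
       // itself would have to be tuned, e.g. on a held-out set of topics.
       public static List<ScoreDoc> filter(ScoreDoc[] hits, float threshold) {
           List<ScoreDoc> kept = new ArrayList<>();
           for (ScoreDoc sd : hits) {
               if (sd.score >= threshold) {
                   kept.add(sd);
               }
           }
           return kept;
       }
   }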

7.   REFERENCES
[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] Apache Lucene. https://lucene.apache.org/.
[3] org.apache.lucene.search (Lucene 6.2.1 API). https://lucene.apache.org/core/6_2_1/core/org/apache/lucene/search/package-summary.html#package.description.
[4] Tweets – Twitter Developers. https://dev.twitter.com/overview/api/tweets.
[5] Unicode Standard Annex #29: Unicode Text Segmentation. http://unicode.org/reports/tr29/.