<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IIT BHU at FIRE 2016 Microblog Track: A Semi-automatic Microblog Retrieval System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ribhav Soni</string-name>
          <email>ribhav.soni.cse13@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spal.cse@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our work for the Microblog Track in FIRE 2016. The task involved utilizing microblog data (tweets) to retrieve useful information during times of disasters. In particular, given a set of tweets posted during the Nepal earthquake in 2015, the goal was to judge relevance of each tweet against a set of topics which reflected useful information needs. Our approach made use of manual query formation for searching relevant tweets based on the information required for each topic, after indexing them using Lucene.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>This paper describes our approach for the Microblog Track
in FIRE 2016 [1]. Microblogging sites like Twitter are
important sources of real-time information, and thus can be
utilized for extracting significant information at times of
disasters such as floods, earthquakes, cyclones, etc. The aim
of the Microblog track at FIRE 2016 [1] was to develop IR
systems to retrieve important information from microblogs
posted at the time of disasters. The task involved identifying
tweets relevant to the given topics which reflect the
information needs at critical times. The topics were provided in
a standard TREC format, containing a title, a brief
description, and a detailed narrative specifying what type of tweets
would be considered relevant to the topic. An example of a
topic is:</p>
      <preformat>&lt;top&gt;
&lt;num&gt; Number: FMT4
&lt;title&gt;
WHAT MEDICAL RESOURCES WERE REQUIRED
&lt;desc&gt; Description:
Identify the messages which describe the requirement of
some medicine or other medical resources.
&lt;narr&gt; Narrative:
A relevant message must mention the requirement of some
medical resource like medicines, medical equipments,
supplementary food items, blood, human resources like doctors/staff
and resources to build or support medical infrastructure like
tents, water filter, power supply, ambulance, etc.
Generalized statements without reference to medical resources would
not be relevant.
&lt;/top&gt;</preformat>
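      <p>To illustrate how a topic in this format can be split into its fields, the following is a minimal sketch and not part of our submitted system. The tag names (num, title, desc and narr) are taken from the example topic; the helper name parse_topic and the returned dictionary layout are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical helper: split one TREC-style topic into its fields.
# Tag names come from the example topic; everything else is illustrative.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>|\Z)",
    re.S,
)
LABEL_RE = re.compile(r"^\s*(Number|Description|Narrative):\s*")


def parse_topic(text):
    """Return a dict mapping each field tag to its whitespace-normalized text."""
    fields = {}
    for tag, body in FIELD_RE.findall(text):
        body = LABEL_RE.sub("", body)         # drop "Number:", "Description:", ...
        fields[tag] = " ".join(body.split())  # collapse line breaks and extra spaces
    return fields


# Example topic from the text; the narrative is shortened for brevity.
example = """<top>
<num> Number: FMT4
<title>
WHAT MEDICAL RESOURCES WERE REQUIRED
<desc> Description:
Identify the messages which describe the requirement of
some medicine or other medical resources.
<narr> Narrative:
A relevant message must mention the requirement of some
medical resource.
</top>"""

topic = parse_topic(example)
print(topic["num"], "|", topic["title"])
```
]]></preformat>
      <p>Keeping the title, description and narrative as separate fields makes it easier to consult each of them when forming a query by hand.</p>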
      <p>Three types of runs were considered in the track, based on
the amount of manual intervention in different stages such
as query formation and document retrieval:</p>
      <p>(1) Fully automatic, where no step involves manual
intervention</p>
      <p>(2) Semi-automatic, where there is some manual
intervention but only in the query formation stage</p>
      <p>(3) Manual, where there is manual intervention in both
the query formation and document retrieval stages</p>
      <p>We submitted one run in the semi-automatic category. We
used Lucene [2] for indexing, and retrieved relevant tweets
for each of the topics by manual query formation. Results
show that our system performs reasonably well, given its
simplicity, but is outperformed by more complex systems.</p>
      <p>The rest of this paper is structured as follows. We describe
the training data used in Section 2, the main challenges
involved due to the nature of microblogs in Section 3, our
approach in Section 4, results and discussions in Section 5,
and conclusion and future work in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA</title>
      <p>The training data was a collection of about 50,000 tweets
posted during the Nepal earthquake in April 2015, along
with the associated metadata for each tweet [4].</p>
    </sec>
    <sec id="sec-3">
      <title>3. CHALLENGES</title>
      <p>Tweets have a stringent word limit, and users often make
use of innovative abbreviations which are difficult to handle
for retrieval systems. Besides, they are mostly informal, and
may involve the use of multiple languages in the same tweet
(called code mixing), or even multiple scripts in a tweet. It
is also difficult to make sense of emoticons, especially
innovative ones made up by users.</p>
    </sec>
    <sec id="sec-4">
      <title>4. OUR APPROACH</title>
      <p>Our run was in the semi-automatic category, which
includes systems with manual intervention in the query
formation stage. We used Apache Lucene, an open-source textual
search engine library, for indexing the available tweets.</p>
      <p>For retrieval, we used Lucene's search facility with manual
search queries, which were formed on the basis of
requirements for each topic. The search queries that we used for
each of the topics are given in Table 1.</p>
      <p>Lucene first selects the documents to be scored based on
Boolean logic from the query specification, and then ranks
them via the specified retrieval model [3]. We made use of
the default similarity model, which computes scores using a
combination of the Vector Space Model (VSM) and
probabilistic models such as Okapi BM25.</p>
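      <p>The two-step behaviour described above (Boolean selection followed by ranking) can be sketched as follows. This is a textbook Okapi BM25 variant written for illustration under stated assumptions; it is not Lucene's implementation and not the exact similarity used in our run. The function name bm25_rank, the toy documents, and the parameter values k1 = 1.2 and b = 0.75 are assumptions.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank tokenized docs against query terms; return (doc_index, score) pairs."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    ranked = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        # Boolean step: keep only docs matching at least one query term (OR).
        if not any(t in tf for t in query_terms):
            continue
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # The "1 +" inside the log keeps the idf non-negative.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        ranked.append((i, score))
    # Ranking step: highest score first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy documents, invented for this example.
docs = [
    "need doctor and medicine in kathmandu".split(),
    "road damage reported near the village".split(),
    "blood and medicine required urgently".split(),
]
results = bm25_rank(["medicine", "blood"], docs)
print(results)
```
]]></preformat>
      <p>In this toy example the second document matches no query term and is never scored, while the third document matches both terms and is ranked first.</p>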
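      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Search queries used for each topic</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Topic</th>
              <th>Query string</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>What Resources Were Available</td>
              <td>"food water clothes volunteers power charge available^2"</td>
            </tr>
            <tr>
              <td>What Resources Were Required</td>
              <td>"food water clothes volunteers power charge available^2"</td>
            </tr>
            <tr>
              <td>What Medical Resources Were Available</td>
              <td>"doctor medicine ambulance blood milk baby food nurse water tent power available^2"</td>
            </tr>
            <tr>
              <td>What Medical Resources Were Required</td>
              <td>"doctor medicine ambulance food blood nurse water tent power require^2 need^2"</td>
            </tr>
            <tr>
              <td>What Were The Requirements / Availability Of Resources At Specific Locations</td>
              <td>"location^3 place^2 town^2 kathmandu village available need"</td>
            </tr>
            <tr>
              <td>What Were The Activities Of Various NGOs / Government Organizations</td>
              <td>"NGO^5 government^4 work^3"</td>
            </tr>
            <tr>
              <td>What Infrastructure Damage And Restoration Were Being Reported</td>
              <td>"road railway house damage place town"</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>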
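      <p>In the query strings of Table 1, a suffix such as ^2 raises the weight of a term relative to the others. As a rough illustration of how such a string breaks down into terms and boosts, consider the sketch below. It is a simplified stand-in for a real query parser; the helper name parse_boosted_query and the default boost of 1.0 for unmarked terms are assumptions.</p>
      <preformat><![CDATA[
```python
def parse_boosted_query(query):
    """Split a query string into (term, boost) pairs; unmarked terms get 1.0."""
    pairs = []
    for token in query.split():
        # "need^2" -> ("need", "^", "2"); "doctor" -> ("doctor", "", "").
        term, _, boost = token.partition("^")
        pairs.append((term.lower(), float(boost) if boost else 1.0))
    return pairs


weights = parse_boosted_query("NGO^5 government^4 work^3")
print(weights)
```
]]></preformat>
      <p>A scorer can then multiply each term's contribution by its boost, so that a match on a boosted term counts for more than a match on an unmarked one.</p>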
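      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption>
          <p>Results of our run</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run ID</th>
              <th>MAP@1000</th>
              <th>Overall MAP</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>iitbhu_fmt16_1</td>
              <td>0.0670</td>
              <td>0.0827</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>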
      <p>For tokenization, we used the StandardAnalyzer, which
creates tokens using the Word Break rules from the Unicode
Text Segmentation algorithm specified in [5]. It is capable
of handling names and email addresses, lowercases each token,
and removes stopwords and punctuation.</p>
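      <p>The effect of this analysis chain on a tweet can be approximated with the sketch below. It is only a rough stand-in: a simple regular expression replaces the Unicode word-break rules, and the stopword set is a small illustrative subset rather than the analyzer's actual list. The names analyze, STOPWORDS and TOKEN_RE are assumptions.</p>
      <preformat><![CDATA[
```python
import re

# Illustrative subset of English stopwords; the real analyzer's list is longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Rough stand-in for the Unicode word-break rules: runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def analyze(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in STOPWORDS]


tokens = analyze("Doctors and medicines are needed in Kathmandu!")
print(tokens)
```
]]></preformat>
      <p>Note that no stemming is applied in this chain, which is one reason why the queries in Table 1 list several surface forms (for example, both require and need).</p>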
      <p>Lucene ranks the returned search results (retrieved tweets)
based on the degree of relevance using its scoring algorithms,
and returns the ranked list as well as individual scores. Since
the task involved tagging all relevant tweets, all the
returned tweets were marked relevant. In addition, the
returned scores were normalized to the range of [0, 1] for
assigning relevance scores to each returned topic-tweet pair as
per the run submission instructions.</p>
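      <p>One simple way to map raw retrieval scores to the range [0, 1] is to divide each score by the highest score returned for the topic. The sketch below shows this variant; the choice of max-normalization and the helper name normalize_scores are assumptions made for illustration, and other schemes such as min-max scaling would also satisfy the requirement.</p>
      <preformat><![CDATA[
```python
def normalize_scores(scores):
    """Scale non-negative scores to [0, 1] by dividing by the maximum score."""
    top = max(scores, default=0.0)
    if top == 0.0:
        # No results, or all scores zero: nothing to scale.
        return [0.0 for _ in scores]
    return [s / top for s in scores]


normalized = normalize_scores([4.0, 2.0, 1.0])
print(normalized)
```
]]></preformat>
      <p>With this scheme the top-ranked tweet for every topic receives a score of 1, and the relative order of the remaining tweets is preserved.</p>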
    </sec>
    <sec id="sec-5">
      <title>5. RESULTS AND DISCUSSION</title>
      <p>The results of our run based on several metrics are given
in Table 2.</p>
      <p>Our system performed reasonably well in the semi-automatic
category. However, it was outperformed in the task by more
elaborate systems.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSION AND FUTURE WORK</title>
      <p>Our system was overly simplistic, and offers much scope
for improvement by making use of state-of-the-art IR
techniques.</p>
      <p>Some approaches to improve on the system include better
preprocessing of tweets (which is essential for microblog
retrieval tasks, given the challenges with the nature of
microblogs that we described in Section 3), taking the quality
of tweets into account by considering the prior tweets by the
author, query expansion approaches using external
information (like Google search results or a Wikipedia corpus), and
using pseudo-relevance feedback techniques to better tune
relevant search results. Also, Lucene was too liberal in
returning tweets with a very low score as part of search results,
most of which were found to be non-relevant. Thus, it is
important to further refine the results returned by the
search query by setting a threshold, so that only those tweets
with a reasonably large similarity score are considered
relevant to a query.</p>
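      <p>The thresholding idea can be sketched as a simple filter over normalized scores. This is a proposal for future work, not something our submitted run did; the helper name filter_by_threshold and the cut-off value of 0.3 are hypothetical and would need to be tuned on training data.</p>
      <preformat><![CDATA[
```python
def filter_by_threshold(scored_tweets, threshold=0.3):
    """Keep (tweet_id, score) pairs whose normalized score reaches the threshold."""
    return [(tid, s) for tid, s in scored_tweets if s >= threshold]


# Toy scores in [0, 1], invented for this example.
kept = filter_by_threshold([("t1", 1.0), ("t2", 0.5), ("t3", 0.05)], threshold=0.3)
print(kept)
```
]]></preformat>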
    </sec>
    <sec id="sec-7">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>