Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters

Saptarshi Ghosh, Department of CST, IIEST Shibpur, India, sghosh@cs.iiests.ac.in
Kripabandhu Ghosh, Indian Statistical Institute, Kolkata, India, kripa.ghosh@gmail.com

ABSTRACT
The FIRE 2016 Microblog track focused on retrieval of microblogs (tweets posted on Twitter) during disaster events. A collection of about 50,000 microblogs posted during a recent disaster event was made available to the participants, along with a set of seven practical information needs during a disaster situation. The task was to retrieve microblogs relevant to these needs. 10 teams participated in the task, submitting a total of 15 runs. The task resulted in a comparison among the performances of various microblog retrieval strategies over a benchmark collection, and brought out the challenges in microblog retrieval.

CCS Concepts
• Information systems → Query reformulation;

Keywords
FIRE 2016; Microblog track; Microblog retrieval; Disaster

1. INTRODUCTION
Microblogging sites such as Twitter (https://twitter.com) have become important sources of situational information during disaster events, such as earthquakes, floods, and hurricanes [2, 11]. On such sites, a lot of content is posted during disaster events (in the order of thousands to millions of tweets), and the important situational information is usually immersed in large amounts of general conversational content, e.g., sympathy for the victims of the disaster. Hence, automated IR techniques are needed to retrieve specific types of situational information from the large amount of text.

There have been a few prior attempts to develop IR techniques over microblogs posted during disasters, but there has been little effort till now to develop a benchmark dataset / test collection using which various microblog retrieval methodologies can be compared and evaluated. The objectives of the FIRE 2016 Microblog track are two-fold: (i) to develop a test collection of microblogs posted during a disaster situation, which can serve as a benchmark dataset for evaluation of microblog retrieval methodologies, and (ii) to evaluate and compare the performance of various IR methodologies over the test collection. The track is inspired by the TREC Microblog Track [4], which aims to evaluate microblog retrieval strategies in general. In contrast, the FIRE 2016 Microblog Track focuses on microblog retrieval in a disaster situation.

In this track, a collection of about 50,000 microblogs posted during a recent disaster event was made available to the participants, along with a set of seven practical information needs that are faced in a disaster situation by the agencies responding to the disaster. Details of the collection are discussed in Section 2. The task was to retrieve microblogs relevant to the information needs (see Section 3). 10 teams participated in the track, submitting a total of 15 runs, which are described in Section 4. The runs were evaluated against a gold standard developed by human assessors, using standard measures like Precision, Recall, and MAP.

2. THE TEST COLLECTION
In this section, we describe how the test collection for the Microblog track was developed. Following the Cranfield style [1], we describe the creation of the topics (information needs), the collection of the document set (here, microblogs or tweets), and the relevance assessment performed to prepare the gold standard necessary for evaluation of IR methodologies.

2.1 Topics for retrieval
In this track, our objective was to develop a test collection to evaluate IR methodologies for extracting information (from microblogs) that can potentially help responding agencies during a disaster situation such as an earthquake or a flood. To this end, we consulted members of some NGOs who regularly work in disaster-affected regions – such as Doctors For You (http://doctorsforyou.org/) and SPADE (http://www.spadeindia.org/) – to know the typical information requirements during a disaster relief operation. They identified certain information needs, such as what resources are required / available (especially medical resources), what infrastructure damages are being reported, the situation at specific geographical locations, the ongoing activities of various NGOs and government agencies (so that the operations of various responding agencies can be coordinated), and so on. Based on their feedback, we identified seven topics on which information needs to be retrieved during a disaster.
Table 1 states the seven topics which we have developed as a part of the test collection. These topics are written in the format conventionally used for TREC topics (see trec.nist.gov/pubs/trec6/papers/overview.ps.gz). Each topic contains an identifying number (num), a textual representation of the information need (title), a brief description (desc) of the same, and a more detailed narrative (narr) explaining what type of documents (tweets) will be considered relevant to the topic, and what type of tweets would not be considered relevant.

<num> Number: FMT1
<title> What resources were available
<desc> Identify the messages which describe the availability of some resources.
<narr> A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, human resources like volunteers, or resources to build or support infrastructure, like tents, water filters, power supply and so on. Messages informing about the availability of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any resource, or messages asking for donation of money, would not be relevant.

<num> Number: FMT2
<title> What resources were required
<desc> Identify the messages which describe the requirement or need of some resources.
<narr> A relevant message must mention the requirement / need of some resource like food, water, shelter, clothes, blankets, human resources like volunteers, or resources to build or support infrastructure like tents, water filters, power supply, and so on. A message informing about the requirement of transport vehicles for assisting the resource distribution process would also be relevant. However, generalized statements without reference to any particular resource, or messages asking for donation of money, would not be relevant.
<num> Number: FMT3
<title> What medical resources were available
<desc> Identify the messages which give some information about the availability of medicines and other medical resources.
<narr> A relevant message must mention the availability of some medical resource like medicines, medical equipment, blood, supplementary food items (e.g., milk for infants), human resources like doctors/staff, or resources to build or support medical infrastructure like tents, water filters, power supply, ambulances, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT4
<title> What medical resources were required
<desc> Identify the messages which describe the requirement of some medicine or other medical resources.
<narr> A relevant message must mention the requirement of some medical resource like medicines, medical equipment, supplementary food items, blood, human resources like doctors/staff, or resources to build or support medical infrastructure like tents, water filters, power supply, ambulances, etc. Generalized statements without reference to medical resources would not be relevant.

<num> Number: FMT5
<title> What were the requirements / availability of resources at specific locations
<desc> Identify the messages which describe the requirement or availability of resources at some particular geographical location.
<narr> A relevant message must mention both the requirement or availability of some resource (e.g., human resources like volunteers/medical staff, food, water, shelter, medical resources, tents, power supply) as well as a particular geographical location. Messages containing only the requirement / availability of some resource, without mentioning a geographical location, would not be relevant.

<num> Number: FMT6
<title> What were the activities of various NGOs / Government organizations
<desc> Identify the messages which describe on-ground activities of different NGOs and Government organizations.
<narr> A relevant message must contain information about the relief-related activities of different NGOs and Government organizations in the rescue and relief operation. Messages that contain information about volunteers visiting different geographical locations would also be relevant. However, messages that do not contain the name of any NGO / Government organization would not be relevant.

<num> Number: FMT7
<title> What infrastructure damage and restoration were being reported
<desc> Identify the messages which contain information related to infrastructure damage or restoration.
<narr> A relevant message must mention the damage or restoration of some specific infrastructure resources, such as structures (e.g., dams, houses, mobile towers), communication infrastructure (e.g., roads, runways, railways), electricity, mobile or Internet connectivity, etc. Generalized statements without reference to infrastructure resources would not be relevant.

Table 1: The seven topics (information needs) used in the track. Each topic is written following the format conventionally used in TREC tracks (containing a number, title, description and narrative). The task is to retrieve microblogs relevant to these topics.

2.2 Tweet dataset
We collected a large set of tweets related to the devastating earthquake that occurred in Nepal and parts of India on 25th April 2015 (see https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake). Using the Twitter Search API [10] with the keyword 'nepal', we collected tweets posted during the two weeks following the earthquake. We collected only tweets in English (based on language identification by Twitter itself), obtaining about 100K tweets in total.

Tweets often contain duplicates and near-duplicates, since the same information is frequently retweeted / re-posted by multiple users [9]. However, duplicates are not desirable in a test collection for IR, since the presence of duplicates can result in over-estimation of the performance of an IR methodology. Additionally, the presence of duplicate documents also creates information overload for the human annotators while developing the gold standard [3]. Hence, we removed duplicate and near-duplicate tweets using a simplified version of the methodologies discussed in [9], as follows. Each tweet was considered as a bag of words (excluding standard English stopwords and URLs), and the similarity between two tweets was measured as the Jaccard similarity between the two corresponding bags (sets) of words. If the Jaccard similarity between two tweets was found to be higher than a threshold value (0.7), the two tweets were considered near-duplicates, and only the longer tweet (potentially more informative) was retained in the collection. After removing duplicates and near-duplicates, we obtained a set of 50,068 tweets, which was used as the test collection for the track.
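For concreteness, the de-duplication step can be sketched as follows. This is a minimal illustration of the procedure described above (bag-of-words Jaccard similarity with a 0.7 threshold, keeping the longer tweet), not the exact implementation used for the track; the tokenizer and the stopword list are assumptions, and the pairwise comparison shown here is far less scalable than the techniques of [9].

    import re

    STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on"}  # assumed list

    def bag_of_words(tweet_text):
        # Lowercase, drop URLs and stopwords, and return the set of remaining terms.
        text = re.sub(r"http\S+", " ", tweet_text.lower())
        tokens = re.findall(r"[a-z0-9#@']+", text)
        return {t for t in tokens if t not in STOPWORDS}

    def jaccard(a, b):
        # Jaccard similarity between two sets of words.
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def deduplicate(tweets, threshold=0.7):
        # Keep only one tweet (the longer one) from each group of near-duplicates.
        kept = []  # list of (text, bag) pairs retained so far
        for text in tweets:
            bag = bag_of_words(text)
            match = next((i for i, (_, kb) in enumerate(kept) if jaccard(bag, kb) > threshold), None)
            if match is None:
                kept.append((text, bag))
            elif len(text) > len(kept[match][0]):
                kept[match] = (text, bag)  # retain the longer, potentially more informative tweet
        return [text for text, _ in kept]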
2.3 Developing gold standard for retrieval
Evaluation of any IR methodology requires a gold standard containing the documents that are actually relevant to the topics. As is the standard procedure, we used human annotators to develop this gold standard. A set of three human annotators was used, each of whom is proficient in English, is a regular user of Twitter, and has prior experience of working with social media content posted during disasters. The development of the gold standard involved three phases.

Phase 1: Each annotator was given the set of 50,068 tweets, and the seven topics (in TREC format, as stated in Table 1). Each annotator was asked to identify all tweets relevant to each topic, independently, i.e., without consulting the other annotators. To help the annotators, the tweets were indexed using the Indri IR system [8], which helped the annotators to search for tweets containing specific terms. For each topic, the annotators were asked to think of appropriate search-terms, retrieve tweets containing those search terms (using Indri), and to judge the relevance of the retrieved tweets. After the first phase, we observed that the sets of tweets identified as relevant to the same topic by different annotators were considerably different. This difference arose because different annotators used different search-terms to retrieve tweets. (Since the different annotators retrieved and judged very different sets of tweets, it is not meaningful to report inter-annotator agreement for this phase.) Hence, we conducted a second phase.

Phase 2: In this phase, for a particular topic, all tweets that were judged relevant by at least one annotator (in the first phase) were considered. The decision whether a tweet is relevant to a topic was finalised through discussion among all the annotators and mutual agreement.

Phase 3: The third phase used standard pooling [7] (as commonly done in TREC tracks) – the top 30 results of all the submitted runs were pooled (separately for each topic) and judged by the annotators. In this phase, all annotators were judging a common set of tweets, hence inter-annotator agreement could be measured. There was agreement among all annotators for over 90% of the tweets; for the rest, the relevance was decided through discussion among all the annotators and mutual agreement.

The final gold standard contains the following number of tweets judged relevant to the seven topics – FMT1: 589, FMT2: 301, FMT3: 334, FMT4: 112, FMT5: 189, FMT6: 378, FMT7: 254.
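As an aside, the depth-30 pooling used in Phase 3 is straightforward to implement; a minimal sketch is given below. It assumes the submitted runs are stored as files in the common TREC run format (topic-id, 'Q0', tweet-id, rank, score, run-tag); this file layout is an assumption made for illustration, not the track's actual submission interface.

    from collections import defaultdict

    def read_run(path):
        # Parse one run file in TREC format: "topic_id Q0 tweet_id rank score run_tag" per line.
        run = defaultdict(list)  # topic_id -> list of (rank, tweet_id)
        with open(path) as f:
            for line in f:
                topic_id, _, tweet_id, rank, _score, _tag = line.split()
                run[topic_id].append((int(rank), tweet_id))
        return run

    def build_pool(run_paths, depth=30):
        # Pool the top-`depth` tweets of every run, separately for each topic.
        pool = defaultdict(set)  # topic_id -> set of tweet_ids to be judged by the annotators
        for path in run_paths:
            for topic_id, ranked in read_run(path).items():
                for _, tweet_id in sorted(ranked)[:depth]:
                    pool[topic_id].add(tweet_id)
        return pool

    # Hypothetical usage: build_pool(["run1.txt", "run2.txt"], depth=30)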
2.4 Insights from the gold standard development process
Through the process described above, we understood that for any of the topics, there are several tweets which are definitely relevant to the topic, but which were difficult to retrieve even for human annotators. This is evident from the fact that many of the relevant tweets could initially be retrieved by only one out of the three annotators (in the first phase), but when these tweets were shown to the other annotators (in the second phase), they unanimously agreed that the tweets were relevant. These observations highlight the challenges in microblog retrieval.

Note that our approach for developing the gold standard is different from that used in TREC tracks, where the gold standard is usually developed by pooling a few top-ranked documents retrieved by the different submitted systems, and then annotating these top-ranked documents [7]. In other words, only the third phase (as described above) is applied in TREC tracks. Given that it is challenging to identify many of the tweets relevant to a topic (as discussed above), annotating only a relatively small pool of documents retrieved by IR methodologies has the potential risk of missing many of the relevant documents which are more difficult to retrieve. We believe that our approach, where the annotators viewed the entire dataset instead of a relatively small pool, is likely to be more robust, and is expected to have resulted in the development of a more complete gold standard which is independent of the performance of any IR methodology.

3. DESCRIPTION OF THE TASK
The participants were given the tweet collection and the seven topics described earlier. It can be noted that the Twitter terms and conditions prohibit direct public sharing of tweets. Hence, only the tweet-ids of the tweets (Twitter assigns a unique numeric id to each tweet, called the tweet-id) were distributed among the participants, along with a Python script using which the tweets can be downloaded via the Twitter API.

The participants were invited to develop IR methodologies for retrieving tweets relevant to the seven topics, and were asked to submit a ranked list of tweets that they judge relevant to each topic. The ranked list was evaluated against the gold standard (developed as described earlier) using the following measures: (i) Precision at 20 (Prec@20), i.e., what fraction of the top-ranked 20 results are actually relevant according to the gold standard; (ii) Recall at 1000 (Recall@1000), i.e., what fraction of all tweets relevant to a topic (as identified in the gold standard) is present among the top-ranked 1000 results; (iii) Mean Average Precision at 1000 (MAP@1000); and (iv) overall MAP, considering the full retrieved ranked list. Out of these, we report only the Prec@20 and MAP measures (in the next section).

The track invited three types of methodologies: (i) Automatic, where both query formulation and retrieval are automated; (ii) Semi-automatic, where manual intervention is involved in the query formulation stage (but not in the retrieval stage); and (iii) Manual, where manual intervention is involved in both the query formulation and retrieval stages. 15 runs were submitted by the participants, out of which one run was fully automatic, while the others were semi-automatic. The methodologies are summarized and compared in the next section.
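The measures above are standard; the following sketch shows how Prec@20 and (mean) average precision could be computed from a ranked list of tweet-ids and the gold standard. In practice a standard tool such as trec_eval would normally be used, so this sketch is only illustrative.

    def precision_at_k(ranked_ids, relevant_ids, k=20):
        # Fraction of the top-k retrieved tweets that are in the gold standard.
        return sum(1 for t in ranked_ids[:k] if t in relevant_ids) / k

    def average_precision(ranked_ids, relevant_ids, cutoff=None):
        # Average of the precision values at the ranks where relevant tweets are retrieved.
        if cutoff is not None:
            ranked_ids = ranked_ids[:cutoff]
        hits, precision_sum = 0, 0.0
        for rank, tweet_id in enumerate(ranked_ids, start=1):
            if tweet_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(run_by_topic, qrels_by_topic, cutoff=None):
        # MAP over all topics; cutoff=1000 gives MAP@1000, cutoff=None the overall MAP.
        topics = list(qrels_by_topic)
        return sum(average_precision(run_by_topic[t], qrels_by_topic[t], cutoff)
                   for t in topics) / len(topics)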
4. METHODOLOGIES
Ten teams participated in the FIRE 2016 Microblog track. A summary of the methodologies used by each team is given in the next sub-section. Table 2 shows the evaluation performance of each submitted run, along with a brief summary. For each type, the runs are arranged in decreasing order of the primary measure, i.e., Precision@20. In case of a tie, the arrangement is done in decreasing order of MAP.

Run Id | Precision@20 | MAP | Type | Method summary
dcu_fmt16_1 | 0.3786 | 0.1103 | Automatic | WordNet, query expansion
iiest_saptarashmi_bandyopadhyay_1 | 0.4357 | 0.1125 | Semi-automatic | Correlation, NER, Word2Vec
JU_NLP_1 | 0.4357 | 0.1079 | Semi-automatic | WordNet, query expansion, NER, GloVe
dcu_fmt16_2 | 0.4286 | 0.0815 | Semi-automatic | WordNet, query expansion, relevance feedback
JU_NLP_2 | 0.3714 | 0.0881 | Semi-automatic | WordNet, query expansion, NER, GloVe, word bags split
JU_NLP_3 | 0.3714 | 0.0881 | Semi-automatic | WordNet, query expansion, NER, GloVe, word bags split
iitbhu_fmt16_1 | 0.3214 | 0.0827 | Semi-automatic | Lucene default model
relevancer_ru_nl | 0.3143 | 0.0406 | Semi-automatic | Relevancer system, clustering, manual labelling, Naive Bayes classification
daiict_irlab_1 | 0.3143 | 0.0275 | Semi-automatic | Word2vec, query expansion, equal term weights
daiict_irlab_2 | 0.3000 | 0.0250 | Semi-automatic | Word2vec, query expansion, unequal term weights, WordNet
trish_iiest_ss | 0.0929 | 0.0203 | Semi-automatic | Word-overlap, POS tagging
trish_iiest_ws | 0.0786 | 0.0099 | Semi-automatic | WordNet, wup score, POS tagging
nita_nitmz_1 | 0.0583 | 0.0031 | Semi-automatic | Apache Nutch 0.9, query segmentation, result merging
Helpingtech_1 (on 5 topics) | 0.7700 | 0.2208 | Semi-automatic | Entity and action verb relationships, temporal importance
GANJI_1, GANJI_2, GANJI_3 (combined; on 3 topics) | 0.8500 | 0.2420 | Semi-automatic | Keyword extraction, POS tagging, Word2Vec, WordNet, Terrier retrieval, SVM classification

Table 2: Comparison among all the submitted runs. Runs which attempted retrieval for only a subset of the topics are listed separately at the end of the table.
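Several of the runs in Table 2 rely on WordNet-based synonym expansion of the topic terms (detailed in Section 4.1 below). A minimal sketch of this general technique using NLTK's WordNet interface is shown here; it illustrates the idea only and is not a reproduction of any team's pipeline, and the stopword handling and the cap on synonyms per word are assumptions.

    # Assumes: pip install nltk, plus nltk.download('wordnet') and nltk.download('stopwords')
    from nltk.corpus import stopwords, wordnet

    def expand_query(topic_text, max_synonyms_per_word=3):
        # Build an expanded query by adding WordNet synonyms of each (non-stopword) topic word.
        stop = set(stopwords.words("english"))
        terms = [w.lower() for w in topic_text.split() if w.lower() not in stop]
        expanded = list(terms)
        for term in terms:
            synonyms = []
            for synset in wordnet.synsets(term):
                for lemma in synset.lemma_names():
                    candidate = lemma.replace("_", " ").lower()
                    if candidate != term and candidate not in synonyms:
                        synonyms.append(candidate)
            expanded.extend(synonyms[:max_synonyms_per_word])
        return expanded

    # e.g. expand_query("What medical resources were available") returns the topic
    # words plus a few WordNet synonyms of 'medical', 'resources' and 'available'.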
4.1 Method summary
We now summarize the methodologies adopted in the submitted runs.

• dcu_fmt16: This team participated from the ADAPT Centre, School of Computing, Dublin City University, Ireland. It used WordNet (https://wordnet.princeton.edu/) to perform synonym-based query expansion, and submitted the following two runs:

  1. dcu_fmt16_1: This is an Automatic run (i.e., no manual step involved). First, the words in <title> and <narr> were considered, from which the stopwords were removed; thus the initial query was formed. Then, for each word in the query, synonyms were added using WordNet, resulting in the expanded query. Retrieval was done with this expanded query using the BM25 model [6].

  2. dcu_fmt16_2: This is a Semi-automatic run (i.e., a manual step was involved). First, an initial ranked list was generated using the original topic. From the top 30 tweets, 1-2 relevant tweets were manually identified, and query expansion was done from these relevant tweets. The expanded query was further expanded using WordNet, just as done for dcu_fmt16_1. This final expanded query was used for retrieval.

• iiest_saptarashmi_bandyopadhyay: This team participated from the Indian Institute of Engineering Science and Technology, Shibpur, India. It submitted one Semi-automatic run, described below:

  – iiest_saptarashmi_bandyopadhyay_1: The correlation between the topic words and the tweet was calculated, and this value determined the relevance score for a given topic-tweet pair. The Stanford NER tagger (nlp.stanford.edu/software/Stanford-ner-2015-04-20.zip) was used to identify the LOCATION, ORGANIZATION and PERSON names in the tweets. For each topic, some keywords were manually selected, on which a number of tools (e.g., PyDictionary, the NodeBox toolkit, etc.) were used to find the corresponding synonyms, inflectional variants, etc. The bag of words for each topic was further converted into a vector using the Word2Vec package (https://deeplearning4j.org/word2vec). Finally, the relevance score was calculated from the correlation between the vector representations of the topic word bags and the tweet text.

• JU_NLP: This team participated from Jadavpur University, India. It submitted three Semi-automatic runs, described below:

  1. JU_NLP_1: This run was generated by using word embeddings. For each topic, relevant words were manually chosen and expanded using the synonyms obtained from the NLTK WordNet toolkit. In addition, past, past participle and present continuous forms of verbs were obtained using the NodeBox library for Python. For the topics FMT5 and FMT6, location and organization information was extracted using the Stanford NER tagger. A GloVe [5] model was trained on the tweet collection. A tweet vector, as well as a query vector, was formed by taking the normalized sum of the GloVe vectors of the constituent words. Then, for each query-tweet pair, the similarity score was calculated as the cosine similarity of the corresponding vectors (a sketch of this kind of embedding-based scoring is given after this list of methods).

  2. JU_NLP_2: This run is similar to JU_NLP_1, except that here the word bags were split categorically, and the average similarity between the tweet vector and the split topic vectors was calculated.

  3. JU_NLP_3: This run is identical to JU_NLP_2.

• iitbhu_fmt16: This team participated from the Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi, India. It submitted one Semi-automatic run, described as follows:

  – iitbhu_fmt16_1: The Lucene (https://lucene.apache.org/) default similarity model, which combines the Vector Space Model (VSM) and probabilistic models (e.g., BM25), was used to generate the run. Lucene's StandardAnalyzer was used, which handles names and email addresses, lowercases each token, and removes stopwords and punctuation. The query formulation stage involved manual intervention.

• daiict_irlab: This team participated from DAIICT, Gandhinagar, India and LDRP, Gandhinagar, India. It submitted two Semi-automatic runs, described as follows:

  1. daiict_irlab_1: This run was generated using query expansion, where the 5 most similar words and hashtags from a Word2vec model, trained on the tweet corpus, were added to the original query. Equal weight was assigned to each term.

  2. daiict_irlab_2: This run was generated in the same way as daiict_irlab_1, except that different weights were assigned to the expanded terms than to the original terms. Higher weights were assigned to words like 'required' and 'available'. These terms were also expanded using WordNet.
• trish_iiest: This team participated from the Indian Institute of Engineering Science and Technology, Shibpur, India. It submitted two Semi-automatic runs, described below:

  1. trish_iiest_ss: The similarity score between a query and a tweet is the word-overlap between them, normalized by the query length. For each topic, the nouns, identified by the Stanford Part-Of-Speech Tagger, were selected to form the query. In addition, more weight was assigned to words like 'availability' or 'requirement'.

  2. trish_iiest_ws: For this run, the wup (Wu-Palmer) similarity score (http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm) was calculated on the synsets of each term obtained from WordNet.

• nita_nitmz: This team participated from the National Institute of Technology, Agartala, India and the National Institute of Technology, Mizoram. It submitted one Semi-automatic run, described below:

  – nita_nitmz_1: This run was generated on Apache Nutch 0.9. Search was done using different combinations of the words present in the query, and the results obtained from the different combinations were merged.

• Helpingtech: This team participated from the Indian Institute of Technology, Patna, Bihar, India and submitted the following Semi-automatic run (on 5 topics only):

  – Helpingtech_1: For each query, relationships between entities and action verbs were defined through manual inspection. The ranking score was calculated on the basis of the presence of these pre-defined relationships in the tweet for a given query. More importance was given to a tweet which indicated immediate action than to one which indicated a proposed action for the future.

• GANJI: This team participated from Évora University, Portugal. It submitted three retrieval results (GANJI_1, GANJI_2, GANJI_3) for the first three topics only, using a Semi-automatic methodology, described below:

  – GANJI_1, GANJI_2, GANJI_3 (combined): First, keyword extraction was done using a part-of-speech tagger, Word2Vec (to obtain the nouns) and WordNet (to obtain the verbs). Then, retrieval was performed on Terrier (http://terrier.org) using the BM25 model. Finally, an SVM classifier was used to classify the retrieved tweets into 'available', 'required' and 'other' classes.

• relevancer_ru_nl: This team participated from Radboud University, the Netherlands and submitted the following Semi-automatic run:

  – relevancer_ru_nl: This run was produced using the Relevancer tool. After a pre-processing step, the tweet collection was clustered to identify coherent clusters. Each such cluster was manually labelled by experts as relevant or non-relevant. This training data was used for Naive Bayes based classification. For each topic, the test tweets predicted as relevant by the classifier were submitted.
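As referenced in the JU_NLP_1 description above, several runs score a tweet by the cosine similarity between a query vector and a tweet vector, each formed as the normalized sum of the word embeddings of the constituent terms. A minimal sketch of this kind of scoring is given below; the embedding lookup is left abstract (a GloVe or Word2Vec model trained on the tweet collection would supply it), and nothing here reproduces any team's exact code.

    import numpy as np

    def text_vector(tokens, embeddings, dim):
        # Normalized sum of the word vectors of the tokens that have an embedding.
        # `embeddings` is assumed to be a dict mapping word -> numpy array of length `dim`.
        vec = np.zeros(dim)
        for tok in tokens:
            if tok in embeddings:
                vec += embeddings[tok]
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def cosine(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return float(u @ v / (nu * nv)) if nu > 0 and nv > 0 else 0.0

    def rank_tweets(query_tokens, tweets_tokens, embeddings, dim=100):
        # Rank tweets (given as lists of tokens) by cosine similarity to the query vector.
        q_vec = text_vector(query_tokens, embeddings, dim)
        scored = [(cosine(q_vec, text_vector(toks, embeddings, dim)), idx)
                  for idx, toks in enumerate(tweets_tokens)]
        return sorted(scored, reverse=True)  # highest-scoring tweets first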
5. CONCLUSION AND FUTURE DIRECTIONS
The FIRE 2016 Microblog track successfully created a benchmark collection of microblogs posted during disaster events, and compared the performance of various IR methodologies over the collection.

In subsequent years, we hope to conduct extended versions of the Microblog track, where the following extensions can be considered:

• Instead of just considering binary relevance (where a tweet is either relevant to a topic or not), graded relevance can be considered, e.g., based on factors like how important or actionable the information contained in the tweet is, how useful the tweet is likely to be to the agencies responding to the disaster, and so on.

• The challenge in this year's track considered a static set of microblogs. But in reality, microblogs are obtained as a continuous stream. The challenge can be extended to retrieve relevant microblogs dynamically, e.g., as and when they are posted.

It can be noted that even the best performing method submitted in the track achieved a relatively low MAP score of 0.24 (considering only three topics), which highlights the difficulty and challenges of microblog retrieval during a disaster situation. We hope that the test collection developed in this track will help the development of better models for microblog retrieval in future.

Acknowledgements
The track organizers thank all the participants for their interest in this track. We also acknowledge our assessors, notably Moumita Basu and Somenath Das, for their help in developing the gold standard for the test collection. We also thank the FIRE 2016 organizers for their support in organizing the track.

6. REFERENCES
[1] C. Cleverdon. The Cranfield tests on index language devices. In K. Sparck Jones and P. Willett, editors, Readings in Information Retrieval, pages 47-59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[2] M. Imran, C. Castillo, F. Diaz, and S. Vieweg. Processing Social Media Messages in Mass Emergency: A Survey. ACM Computing Surveys, 47(4):67:1-67:38, June 2015.
[3] J. Lin, M. Efron, Y. Wang, G. Sherman, and E. Voorhees. Overview of the TREC-2015 Microblog Track. Available at: https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_TREC2015.pdf, 2015.
[4] I. Ounis, C. Macdonald, J. Lin, and I. Soboroff. Overview of the TREC-2011 Microblog Track. Available at: http://trec.nist.gov/pubs/trec20/papers/MICROBLOG.OVERVIEW.pdf, 2011.
[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Proc. Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[6] S. E. Robertson and H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333-389, 2009.
[7] K. Sparck Jones and C. van Rijsbergen. Report on the need for and provision of an ideal information retrieval test collection. Technical Report 5266, Computer Laboratory, University of Cambridge, UK, 1975.
[8] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. In Proc. ICIA, 2004. Available at: http://www.lemurproject.org/indri/.
[9] K. Tao, F. Abel, C. Hauff, G.-J. Houben, and U. Gadiraju. Groundhog Day: Near-duplicate Detection on Twitter. In Proc. World Wide Web (WWW), 2013.
[10] Twitter Search API. https://dev.twitter.com/rest/public/search.
[11] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging During Two Natural Hazards Events: What Twitter May Contribute to Situational Awareness. In Proc. ACM SIGCHI, 2010.