                           An Information Retrieval System for
                               FIRE 2016 Microblog Track
                                                         Trishnendu Ghorai
                                                           Department of CST
                                                             IIEST, Shibpur




ABSTRACT
This paper describes our approach to the FIRE (Forum for Information Retrieval Evaluation) 2016 Microblog track. The main aim of this track was to develop an information retrieval system that can identify relevant tweets posted during a disaster event. Relevance is measured with respect to predefined topics provided by the track organizers. In this working note we describe the system that took part in this year's FIRE track and analyse its performance.

Keywords
FIRE; information retrieval; tweet; relevancy;
1 INTRODUCTION
User-written informal microblogs, like tweets, are quite important and a big source of real-time information. As these microblogs are quite informal and do not obey a standard vocabulary, special information retrieval and recommendation systems are needed to retrieve information from them. To boost the performance of such retrieval systems, FIRE has introduced this track this year [4]. In this task the participating IR systems have to find relevant tweets from a set of tweets posted during a recent disaster. The initial dataset consists of around 50,000 tweets from Twitter that were posted during the recent Nepal earthquake. The relevance of a tweet is measured with respect to topics which identify different resources that are available or required during the disaster. The organizers provide a set of seven topics in the standard TREC format. The main challenge of the task is to tackle the noisiness of the tweets while still finding the most relevant ones. To deal with the problem of noise we apply a preprocessing phase that removes noisy data from the tweets. The tweets are converted to bags of words to ease the scoring process. To calculate relevance, we have developed two different scoring and ranking methods. The topics are optimized by constructing new queries from the given topics.
2 SYSTEM OVERVIEW
In this section we describe the system architecture for the data challenge. The system consists of tweet preprocessing, query generation, scoring of tweets and result analysis.

2.1 Brief Overview
In this task, a set of previously collected tweets (more specifically, tweet ids) on the Nepal earthquake of 2015 was provided, alongside 7 queries given in the traditional TREC format (an XML-like format). The goal of the task was to find the most relevant tweets from this set based on the queries.

Our system has four main components:
  1) Tweet Preprocessing – As tweets are informally written, they generally contain a lot of noise and unnecessary data. For this reason, data filters are applied to the tweets in the preprocessing stage to get rid of the unwanted data.
  2) Query Construction – The topics provided have three parts, namely the title, narrative and description. To retrieve more relevant tweets, a new set of queries is constructed from these topics.
  3) Scoring of tweets – Once the queries are constructed, each tweet is scored against each query. Two different approaches have been used for scoring the tweets.
  4) Final filtering – When each tweet has a score against each topic, a heuristically set threshold is applied to keep only good quality tweets.

2.2 Tweet Preprocessing
The following steps have been taken to preprocess the tweet text (a code sketch follows the list).
  1) Punctuation removal – Punctuation is removed from each tweet. We have not given any extra importance to hashtags; all '#' symbols are also removed.
  2) Case folding – All capital letters in the tweets are converted to lower case.
  3) Stop word removal – All commonly used English words which do not carry much significance for the subject matter of a tweet but are used only for grammatical reasons are removed. A list of the most frequently used words (around 500 words) is used as the stop word list, and any word of a tweet that appears in this list is removed.
  4) Non-ASCII character removal – In addition, we have removed all non-ASCII characters, which appear in tweets due to the use of emoticons and other symbols.
  5) Constructing the bag of words – Each tweet is then split into words and converted to a set of words. Each set represents the collection of distinct words present in the tweet. Each bag of words is identified by the tweet id, which is unique to the tweet and can be used to track it in subsequent steps.
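To make the pipeline concrete, the following is a minimal Python sketch of the preprocessing stage. The stop word list here is a tiny illustrative stand-in: the actual system used a list of roughly 500 of the most frequent English words, which the paper does not reproduce.

    import string

    # Tiny illustrative stop list; the actual system used the ~500 most
    # frequent English words (the exact list is not given in the paper).
    STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "to", "and", "for"}

    def preprocess(tweet_text):
        """Turn raw tweet text into a bag (set) of cleaned words."""
        # Step 1: remove punctuation, including '#' (hashtags get no special treatment).
        text = tweet_text.translate(str.maketrans("", "", string.punctuation))
        # Step 2: case folding.
        text = text.lower()
        # Step 4: drop non-ASCII characters left behind by emoticons and other symbols.
        text = text.encode("ascii", errors="ignore").decode("ascii")
        # Steps 3 and 5: split into words, remove stop words, keep only distinct words.
        return {w for w in text.split() if w not in STOP_WORDS}

    # Step 5 continued: each bag of words is keyed by its tweet id, e.g.
    # bags = {tweet_id: preprocess(text) for tweet_id, text in tweets}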
2.3 Query Construction
Topics are made of three fields, namely the title, the description and the narrative. Titles contain three or four keywords, while descriptions are one-sentence statements of the user's information need; narratives are paragraph-length descriptions of the tweets that the user wants to receive. Each topic is assigned a topic id which uniquely identifies it at the submission stage. Query construction consists of two phases, described as follows (a code sketch appears after the list):
  1) Keyword extraction – As the nouns in a sentence hold most of the information, we choose the nouns in the topics as the keywords for the query. We have used the Stanford Part-Of-Speech Tagger [1] to label the parts of speech and then collected the words identified as nouns.
  2) Giving weight to keywords – All the topics can be broadly classified into two groups based on whether they ask for tweets about 'availability' or 'requirement'. For this reason, the words 'availability' and 'requirement' have been assigned more weight than the other keywords in the topics.
Each query can thus be expressed as a set of keywords where each keyword is assigned a definite weight, and each query is assigned the topic id so that it can be identified at a later stage.
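As an illustration of the two phases, the sketch below extracts nouns with NLTK's part-of-speech tagger as a stand-in for the Stanford tagger used in the actual system. The weight values are assumptions, since the paper does not state them.

    import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    # Words marking the two broad topic groups ('availability' vs 'requirement').
    PRIORITY_WORDS = {"availability", "requirement"}

    def build_query(topic_text, priority_weight=2.0, base_weight=1.0):
        """Phase 1: keep nouns as keywords. Phase 2: weight them (keyword -> weight)."""
        tokens = nltk.word_tokenize(topic_text)
        tagged = nltk.pos_tag(tokens)  # stand-in for the Stanford POS Tagger [1]
        query = {}
        for word, tag in tagged:
            if tag.startswith("NN"):  # NN/NNS/NNP/NNPS are the noun tags
                w = word.lower()
                query[w] = priority_weight if w in PRIORITY_WORDS else base_weight
        return query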
2.4 Scoring
After the queries are constructed, the bag of words corresponding to each tweet is assigned a score with respect to each query. We have used two different scoring techniques for the two separate runs.

Method-1: Co-occurrence based Similarity
This method is based on a co-occurrence based similarity measure [2]. It determines how many words from the query also occur in the tweet and scores the tweet based on that. For a given tweet T = {t1, t2, ..., tn} and a given query Q = {q1, q2, ..., qm}, the score of the tweet is calculated as follows:

    Score(T, Q) = |T ∩ Q| / |Q|,

where |Q| denotes the number of elements in the set Q. That is, this score measures the fraction of query words that also appear in the tweet: the higher the fraction, the higher the probability that the tweet is relevant to the query.
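A direct implementation of this formula on the bag-of-words representation takes only a few lines of Python; the function name here is our own.

    def cooccurrence_score(tweet_words, query_words):
        """Score(T, Q) = |T intersection Q| / |Q|."""
        query_words = set(query_words)
        if not query_words:
            return 0.0
        return len(tweet_words & query_words) / len(query_words)

    # Example: the tweet bag {"water", "needed", "kathmandu"} scored against
    # the query {"water", "food"} gives 1/2 = 0.5.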
Method-2: WordNet based Semantic Similarity
The previous method is based purely on co-occurrence and does not take the similarity in meaning of two words into account. This problem can be addressed by a WordNet [3] based approach. WordNet is a lexical database of English in which each word belongs to sets of cognitive synonyms called synsets. To find the similarity between two words we can calculate the similarity between their synsets.

For a given tweet T = {t1, t2, ..., tn} and a given query Q = {q1, q2, ..., qm}, the score of the tweet is calculated as follows (see the sketch after this list):
  1) For each ti and qj we first find the synsets of the two words, say S1 and S2 respectively. For each term in S1 and each term in S2 we calculate the wup (Wu-Palmer) similarity.¹ All the wup scores are then added up and normalized; this normalized score denotes the similarity value between ti and qj.
  2) We iterate over all pairs of terms from the tweet and the query, sum up the similarity scores of all pairs, and normalize the sum.
  3) This normalized score is the final score of the tweet with respect to that particular query.
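The sketch below mirrors these steps using the Wu-Palmer similarity from NLTK's WordNet interface, as a stand-in for the WordNet::Similarity wup module cited in the footnote. Since the paper does not fully specify the normalization, both levels are normalized here by simple averaging, which is an assumption.

    from itertools import product
    from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

    def word_similarity(t, q):
        """Step 1: average wup similarity over all synset pairs of the two words."""
        s1, s2 = wn.synsets(t), wn.synsets(q)
        if not s1 or not s2:
            return 0.0
        total = sum(a.wup_similarity(b) or 0.0 for a, b in product(s1, s2))
        return total / (len(s1) * len(s2))

    def wordnet_score(tweet_words, query_words):
        """Steps 2-3: sum the word-pair similarities and normalize the sum."""
        if not tweet_words or not query_words:
            return 0.0
        total = sum(word_similarity(t, q) for t in tweet_words for q in query_words)
        return total / (len(tweet_words) * len(query_words))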
2.5 Final Filtering
After scoring the tweets according to their relevance to each topic, we need to choose the most relevant tweets for a given topic. For this reason we use a heuristically set threshold-based filtering method. The threshold has been set to 0.25: the tweets which have a score greater than 0.25 are considered relevant and are submitted, while all other tweets are discarded.
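Given per-topic scores keyed by tweet id, the filter itself reduces to a comparison against the threshold; this small sketch uses our own names.

    THRESHOLD = 0.25  # heuristically set, as described above

    def filter_relevant(scores_for_topic):
        """Keep the ids of tweets whose score against the topic exceeds 0.25."""
        return [tweet_id for tweet_id, score in scores_for_topic.items()
                if score > THRESHOLD]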
3 RESULT ANALYSIS
Table-1 shows the results of our two submitted runs. The run obtained from Method-1 is tagged "ss" and the run obtained from Method-2 is tagged "ws". The runs have been evaluated against the ground truth provided by the organizers, using metrics such as Precision@20, Recall@1000, MAP@1000 and overall MAP.

    Run Id            Precision@20   Recall@1000   MAP@1000   Overall MAP
    trish_iiest_ss    0.0929         0.1407        0.0140     0.0203
    trish_iiest_ws    0.0786         0.0618        0.0032     0.0099

                                   Table-1

As can be clearly seen from the results, although the second method uses a deeper similarity measure than the first, the first approach performs better. The most probable reason for this is the lack of grammatical and spelling correctness in tweets. Most of the tweets are informally written microblogs, so standard English dictionary based filters and standard semantics based methods are not very effective in practice, while the much simpler co-occurrence based similarity measure outperforms them in retrieval performance as well as in running time and cost.

4 CONCLUSION
In this working note we have presented a brief discussion of our approach to the FIRE 2016 Microblog task. We have observed that traditional dictionary and vocabulary based filtering techniques are quite inefficient for informally written documents like tweets, while the relatively simpler co-occurrence based methods suit such documents well. Future work includes finding new filtering techniques and parameters to tackle such informally written documents.
¹ http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm
5   REFERENCES

[1] http://nlp.stanford.edu/software/tagger.shtml.
[2] S. Niwattanakul, J. Singthongchai, E. Naenudorn and
    S. Wanapu. Using of Jaccard coefficient for keywords
    similarity. Volume 1, pages 380-384, Hong Kong, 2013.
[3] https://wordnet.princeton.edu/
[4] S. Ghosh and K. Ghosh. Overview of the FIRE 2016
    Microblog track: Information Extraction from Microblogs
    Posted during Disasters. In Working notes of FIRE 2016 -
    Forum for Information Retrieval Evaluation, Kolkata, India,
    December 7-10, 2016, CEUR Workshop Proceedings.
    CEUR-WS.org, 2016.