=Paper=
{{Paper
|id=Vol-1737/T2-8
|storemode=property
|title=An Information Retrieval System for FIRE 2016 Microblog Track
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-8.pdf
|volume=Vol-1737
|authors=Trishnendu Ghorai
|dblpUrl=https://dblp.org/rec/conf/fire/Ghorai16
}}
==An Information Retrieval System for FIRE 2016 Microblog Track==
Trishnendu Ghorai, Department of CST, IIEST, Shibpur

ABSTRACT

This paper describes our approach to the FIRE (Forum for Information Retrieval Evaluation) 2016 Microblog track. The main aim of this track was to develop an information retrieval system that can identify relevant tweets posted during a disaster event. Relevance is measured with respect to predefined topics provided by the track organizers. In this working note we describe the system that took part in this year's FIRE track and analyse its performance.

Keywords: FIRE; information retrieval; tweet; relevancy

1 INTRODUCTION

User-written informal microblogs, like tweets, are an important and large source of real-time information. Because these microblogs are quite informal and do not follow a standard vocabulary, special information retrieval and recommendation systems are needed to retrieve information from them. To boost the performance of such retrieval systems, FIRE introduced this track this year [4]. In this task the participating IR systems have to find relevant tweets from a set of tweets posted during a recent disaster. The initial dataset consists of around 50,000 tweets from Twitter that were posted during the recent Nepal earthquake. The relevance of a tweet is measured with respect to topics that identify different resources which are available or required during the disaster. The organizers provide a set of seven topics in the standard TREC format. The main challenge of the task is to tackle the noisiness of the tweets and at the same time find the most relevant tweets. To deal with the problem of noise we apply a preprocessing phase that removes noisy data from the tweets. The tweets are converted to bags of words to ease the scoring process. To calculate relevance, we have developed two different scoring and ranking methods. The topics are optimized by constructing new queries from the given topics.

2 SYSTEM OVERVIEW

In this section we describe the system architecture for the data challenge. The system consists of tweet preprocessing, query generation, scoring of tweets and result analysis.

2.1 Brief Overview

In this task, a set of previously collected tweets (more specifically, tweet ids) on the Nepal Earthquake 2015 was provided, and alongside it 7 queries were given in the traditional TREC format (an XML-like format). The goal of the task was to find the most relevant tweets from the set of tweets based on these queries.

Our system has four main components:

1) Tweet Preprocessing – As tweets are informally written, they generally contain a lot of noise and unnecessary data. For this reason, data filters are applied to the tweets in the preprocessing stage to get rid of the unwanted data.

2) Query Construction – The provided topics have three parts, namely title, narration and description. To retrieve more relevant tweets, a new set of queries is constructed from these topics.

3) Scoring of tweets – Once the queries are constructed, each tweet is scored against each query. Two different approaches have been used to score the tweets.

4) Final filtering – When each tweet has a score against each topic, a heuristic threshold is applied to keep only good quality tweets.

2.2 Tweet Preprocessing

The following steps have been taken to preprocess the tweet text (a short sketch of this pipeline follows the list):

1) Punctuation removal – Punctuation is removed from each tweet. We have not given any extra importance to hashtags; all '#' symbols are also removed.

2) Case folding – All capital letters in the tweets are converted to lower case.

3) Stop word removal – All commonly used English words which do not carry much significance for the subject matter of the tweet but are used only for grammatical reasons are removed. A list of the most frequently used words (around 500 words) is used as the stop word list, and any word of the tweet that appears in this list is removed.

4) Non-ASCII character removal – In addition, we have removed all non-ASCII characters, which appear in tweets due to the use of emoticons and other symbols.

5) Constructing a bag of words – Each tweet is then split into words and converted to a set of words. Each set represents the collection of distinct words present in the tweet. Each bag of words is identified by the tweet id, which is unique to the tweet and can be used to track it in the next steps.
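The preprocessing steps above can be illustrated with a short Python sketch. The function name, the tiny inline stop word list and the example tweet are our own illustrative assumptions; the actual system used a larger list of roughly 500 frequent English words.

```python
import re
import string

# Tiny illustrative stop word list; the actual system used ~500 frequent English words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "to", "and", "for"}

def preprocess(tweet_text):
    """Turn raw tweet text into a bag (set) of cleaned words."""
    # 1) Punctuation removal ('#' symbols are dropped like any other punctuation).
    text = tweet_text.translate(str.maketrans("", "", string.punctuation))
    # 2) Case folding.
    text = text.lower()
    # 4) Non-ASCII character removal (emoticons and other symbols).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 3) + 5) Stop word removal and construction of the bag of words.
    words = re.split(r"\s+", text.strip())
    return {w for w in words if w and w not in STOP_WORDS}

# Example: each bag of words would be stored against its (hypothetical) tweet id.
bags = {20160001: preprocess("Food and water are needed in #Kathmandu!!")}
print(bags)  # {20160001: {'food', 'water', 'needed', 'kathmandu'}}
```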
2.3 Query Construction

Topics are made of three fields, namely the title, description and narrative. Titles contain three or four keywords; descriptions are one-sentence statements of the users' information needs; narratives are paragraph-length descriptions of the tweets that the users want to receive. Each topic is assigned a topic id which can be used to uniquely identify it at submission stage. The query construction part consists of two phases, described as follows:

1) Keyword extraction – As the nouns in a sentence hold most of the information, we choose the nouns in the topics as the keywords for the query. We have used the Stanford Part-Of-Speech Tagger [1] to label the different parts of speech first, and then collected the words identified as nouns.

2) Giving weight to keywords – All the topics can be broadly classified into two groups based on whether they ask for tweets about 'availability' or about 'requirement'. For this reason, words like 'availability' or 'requirement' have been assigned more weight than the other keywords in the topic.

Each query can thus be expressed as a set of keywords where each keyword carries a definite weight, and each query carries the topic id so that it can be identified at a later stage.

2.4 Scoring

After construction of the queries, the bag of words corresponding to each tweet is assigned a score with respect to each query. We have used two different scoring techniques for the two separate runs; a sketch of both methods is given after this section.

Method 1: Co-occurrence based similarity

This method is based on a co-occurrence based similarity measure [2]. It tries to find out how many words from the query also occur in the tweet and scores the tweet on that basis. For a given tweet T = {t1, t2, ..., tn} and a given query Q = {q1, q2, ..., qm} the score of the tweet is calculated as

Score(T, Q) = |T ∩ Q| / |Q|,

where |Q| denotes the number of elements in the set Q. That is, this score measures the fraction of query words that also appear in the tweet. The higher the fraction, the higher the probability that the tweet is relevant to the query.

Method 2: WordNet based semantic similarity

The previous method is based purely on co-occurrence and does not take the meaning-wise similarity of two words into account. This problem can be addressed with a WordNet [3] based approach. WordNet is a lexical database of English; each word in WordNet belongs to sets of cognitive synonyms called synsets. To find the similarity between two words we can calculate the similarity between their synsets.

For a given tweet T = {t1, t2, ..., tn} and a given query Q = {q1, q2, ..., qm} the score of the tweet is calculated as follows:

1) For each ti and qj we first find the synsets of the two words, say S1 and S2 respectively. For each term in S1 and each term in S2 we calculate the Wu-Palmer (wup) similarity.[a] All wup scores are then added up and normalized; this normalized score denotes the similarity value between ti and qj.

2) We iterate through all the terms in the tweet and the query, sum up the similarity scores of all pairs and normalize the sum.

3) This normalized score is the final score of the tweet with respect to that particular query.

[a] http://search.cpan.org/dist/WordNet-Similarity/lib/WordNet/Similarity/wup.pm
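As a rough illustration of the two scoring methods, the sketch below computes the co-occurrence score exactly as defined above and approximates the WordNet based score with NLTK's WordNet interface and its wup_similarity measure. The original system used the Perl WordNet::Similarity package, and the exact normalization used there is not specified, so the averaging below is our own assumption.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def cooccurrence_score(tweet_bag, query_terms):
    """Method 1: fraction of query terms that also occur in the tweet."""
    return len(tweet_bag & set(query_terms)) / len(query_terms)

def term_similarity(word1, word2):
    """Average Wu-Palmer similarity over all synset pairs of the two words."""
    pairs = list(product(wn.synsets(word1), wn.synsets(word2)))
    if not pairs:
        return 0.0
    scores = [s1.wup_similarity(s2) or 0.0 for s1, s2 in pairs]
    return sum(scores) / len(scores)

def wordnet_score(tweet_bag, query_terms):
    """Method 2: normalized sum of term-to-term WordNet similarities (assumed normalization)."""
    pairs = list(product(tweet_bag, query_terms))
    if not pairs:
        return 0.0
    return sum(term_similarity(t, q) for t, q in pairs) / len(pairs)

# Example usage with the hypothetical bag of words built earlier.
bag = {"food", "water", "needed", "kathmandu"}
query = ["food", "water", "requirement"]
print(cooccurrence_score(bag, query))  # 2 of the 3 query terms occur in the tweet
print(wordnet_score(bag, query))       # semantic score in [0, 1]
```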
2.5 Final Filtering

After scoring the tweets according to their relevance to each topic, we need to choose the most relevant tweets for a given topic. For this we use a heuristically set, threshold based filtering method. The threshold has been set to 0.25; that is, the tweets whose score is greater than 0.25 are considered relevant and are submitted, and all other tweets are discarded.

3 RESULT ANALYSIS

Table-1 shows the results of our two submitted runs. The run obtained from Method 1 is tagged "ss" and the run obtained from Method 2 is tagged "ws". The runs have been evaluated against the ground truth provided by the organizers, using metrics such as Precision@20, Recall@1000, MAP@1000 and overall MAP.

Run Id          Precision@20  Recall@1000  MAP@1000  Overall MAP
trish_iiest_ss  0.0929        0.1407       0.0140    0.0203
trish_iiest_ws  0.0786        0.0618       0.0032    0.0099

Table-1: Evaluation results of the two submitted runs

As can be clearly seen from the results, although the second method uses a deeper similarity measure than the first, the first approach performs better. The most probable reason for this is the lack of grammatical and spelling correctness in tweets. Most of the tweets are informally written microblogs, so standard English dictionary based filters and standard semantics based methods are not very effective in practice, while the much simpler co-occurrence based similarity measure outperforms them both in retrieval performance and in running time and cost.

4 CONCLUSION

In this working note, we have presented a brief discussion of our approach to the FIRE 2016 Microblog task. We have observed that traditional dictionary and vocabulary based filtering techniques are quite inefficient for informally written documents like tweets, and that relatively simpler co-occurrence based methods suit them better. Our future work includes finding new filtering techniques and parameters to tackle such informally written documents.

5 REFERENCES

[1] Stanford Part-Of-Speech Tagger. http://nlp.stanford.edu/software/tagger.shtml

[2] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu. Using of Jaccard coefficient for keywords similarity. Volume 1, pages 380–384, Hong Kong, 2013.

[3] WordNet. https://wordnet.princeton.edu/

[4] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.