=Paper=
{{Paper
|id=Vol-1737/T2-7
|storemode=property
|title=Real Time Information Extraction from Microblog
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-7.pdf
|volume=Vol-1737
|authors=Sandip Modha,Chintak Mandalia,Krati Agrawal,Deepali Verma,Prasenjit Majumder
|dblpUrl=https://dblp.org/rec/conf/fire/ModhaMAVM16
}}
==Real Time Information Extraction from Microblog==
Sandip Modha (DAIICT Gandhinagar, Gujarat-382007, India), Chintak Mandalia (LDRP Gandhinagar, Gujarat-382015, India), Krati Agrawal (DAIICT Gandhinagar, Gujarat-382007, India), Deepali Verma (DAIICT Gandhinagar, Gujarat-382007, India), Prasenjit Majumder (DAIICT Gandhinagar, Gujarat-382007, India)

ABSTRACT
This paper presents the participation of the Information Retrieval Lab (IR Lab, DA-IICT Gandhinagar) in the FIRE 2016 Microblog track. The main objective of the track is to identify Information Retrieval methodologies for retrieving important information from tweets posted during disasters. We submitted two runs for this track. In the first run, daiict irlab 1, we expanded the topic terms using a Word2vec model trained on the tweet corpus provided by the organizers; relevance scores between tweets and topics are calculated with the Okapi BM25 model. Precision@20, the primary metric, for this run is 0.3143. In the second run, daiict irlab 2, we set different weights for the original terms and the expanded topic terms, and achieve Precision@20 of around 0.30.

1. INTRODUCTION
Social media, like Twitter, is a massive source of real-time information. Twitter is one of the most popular microblogging websites and carries massive user-generated content due to its large number of registered users. During disasters, Twitter has proved its importance on many occasions.

In the FIRE 2016 Microblog track [2], a large set of microblogs (tweets) posted during the Nepal earthquake was made available by the track organizers, along with a set of topics (in TREC format). Each topic identifies a broad information need during a disaster, such as what resources are needed by the population in the disaster-affected area, what resources are available, and what resources are required or available in which geographical region. Specifically, each topic contains a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic.

2. RELATED WORK
We started our work by referring to the TREC 2015 Microblog track papers [1, 5, 4]. CLIP [1] trained their Word2vec model on a four-year tweet corpus and used the Okapi BM25 relevance model to calculate the score. To refine the scores, the relevant tweets were rescored with the SVMrank package using the relevance score of the previous stage. Novelty detection was then performed using Jaccard similarity, and tweets which are not useful were discarded. The University of Waterloo [4] implemented the filtering tasks by building a term vector for each user profile and assigning different weights to different types of terms. To discover the most significant tokens in each user profile, they calculated pointwise KL divergence and ranked the scores for each token in the profile.

3. PROBLEM STATEMENT
Given a set of topics Q = {FMT1, ..., FMT7}, each representing a different information need, and a corpus of tweets T = {t1, t2, ..., tn}, we need to compute the relevance score between tweets and topics:

R_score = f(T, Q)

4. OUR APPROACH
In this section, we discuss the architecture of the proposed system.

4.1 Topic Pre-processing
The FIRE 2016 Microblog track provides 7 topics; essentially, these topics are our queries. We converted each topic into a query by removing stop words and keeping only nouns, proper nouns, and verbs identified with the Stanford POS tagger.

4.2 Topic (Query) Expansion
We trained a Word2vec model [3] on the corpus provided by the organizers to expand the topic terms, taking the 5 most similar words and hashtags for each term. We set equal weights for all terms in the first run (daiict irlab 1). In the second run, we set different weights for the original terms and the expanded terms. Words like "required" and "available" were additionally expanded with their synonyms using WordNet and assigned more weight. A sketch of this expansion step is given below.
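As an illustration only, the following is a minimal sketch of Word2vec-based query expansion with gensim, not the exact code used for our runs. The file name tweets.txt, the training parameters, and the 2:1 weighting shown are assumptions for the example.

<pre>
# Minimal sketch of Word2vec-based query expansion (assumed corpus file: tweets.txt).
# Parameters and weights are illustrative, not the exact settings of the submitted runs.
from gensim.models import Word2Vec

# Each line of the corpus file is one pre-processed tweet.
with open("tweets.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

# Train a Word2vec model on the tweet corpus provided by the organizers.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

def expand_query(terms, topn=5, original_weight=2.0, expansion_weight=1.0):
    """Return {term: weight} with the original query terms plus the top-n
    most similar words/hashtags found by the Word2vec model."""
    weighted = {t: original_weight for t in terms}
    for t in terms:
        if t in model.wv:
            for similar, _score in model.wv.most_similar(t, topn=topn):
                weighted.setdefault(similar, expansion_weight)
    return weighted

print(expand_query(["water", "medicine", "required"]))
</pre>

In the first run every term would receive the same weight; the second run corresponds to the 2:1 weighting of original versus expanded terms shown above.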
4.3 Tweet Pre-processing
In this step, non-English tweets were filtered out. Tweets include smileys, hashtags, and many special characters; we did not consider retweets or tweets containing only hashtags, emoticons, or special characters. We also ignored tweets with fewer than 5 words and removed all stop words from the tweets.

4.4 Query Normalization
In this step, the title and description of each topic were merged to make the topics more informative. To increase relevance, the topics were also pre-processed by converting all characters to lower case and expanding abbreviations, e.g. NYC to New York City. The topics were also stemmed, e.g. "behaving" was converted to "behave".

4.5 Relevance Score
In this phase, we calculated the relevance score between tweets and topics. In the first run, we kept the same weight for the original terms and the expanded terms. In the second run, we set a weight of 2 for the original terms in the topics and 1 for the expanded terms. We used the Okapi BM25 model for calculating the relevance score between the expanded topics and the tweets:

R_score = BM25_Sim(Q_exp, T)

Figure 1: BM25

A sketch of this weighted BM25 scoring is given below.
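The following is a minimal, self-contained sketch of Okapi BM25 scoring in which each query term's contribution is multiplied by its weight (2 for original terms, 1 for expanded terms). It is not the exact implementation used for the runs; the k1 and b values and the example tweets are illustrative assumptions.

<pre>
# Minimal sketch of weighted Okapi BM25 scoring between an expanded topic and tweets.
# Not the exact implementation used for the runs; parameters and data are illustrative.
import math
from collections import Counter

def bm25_scores(tweets, weighted_query, k1=1.2, b=0.75):
    """tweets: list of token lists; weighted_query: dict {term: weight}.
    Returns one BM25 score per tweet, with each query term's contribution
    scaled by its weight."""
    N = len(tweets)
    avgdl = sum(len(t) for t in tweets) / N
    # Document frequency of each query term.
    df = {q: sum(1 for t in tweets if q in t) for q in weighted_query}
    scores = []
    for tweet in tweets:
        tf = Counter(tweet)
        score = 0.0
        for q, weight in weighted_query.items():
            if tf[q] == 0:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(tweet) / avgdl))
            score += weight * idf * norm
        scores.append(score)
    return scores

tweets = [["water", "needed", "in", "kathmandu"],
          ["medicine", "available", "at", "camp"],
          ["earthquake", "relief", "volunteers", "required"]]
query = {"water": 2.0, "required": 2.0, "needed": 1.0}  # original vs. expanded terms
print(bm25_scores(tweets, query))
</pre>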
5. RESULTS
We misunderstood the track guideline and submitted only the top 100 tweets for each topic. As a result, our Precision@20 is in line with the other participants, but the other metrics were substantially lower. Table 1 presents the results declared by the track organizers. After obtaining the gold-standard data from the track organizers, we repeated the experiments; Table 2 shows the results when the top 1000 tweets are retrieved for each topic.

Table 1: Official results as declared by the track organizers
Run Id           Precision@20   Recall@1000   MAP@1000   Overall MAP
daiict irlab 1   0.3143         0.0729        0.0275     0.0275
daiict irlab 2   0.3000         0.0704        0.0250     0.0250

Table 2: Post-evaluation results on the top 1000 tweets
Run Id           Precision@20   Recall@1000   MAP@1000   Overall MAP
daiict irlab 1   0.3143         0.1499        0.0638     0.0638
daiict irlab 2   0.3000         0.1528        0.0625     0.0625

6. CONCLUSION
We submitted two runs in the FIRE 2016 Microblog track. In the first run, we expanded the topic terms by training a Word2vec model on the corpus provided by the track organizers and calculated the relevance score between the expanded topic terms and the tweets using the Okapi BM25 model, keeping the same weight for the original and expanded terms. In the second run, we set the weights of the original terms and the expanded terms in the ratio 2:1 and put more weight on words like "available" and "required". After analyzing the results, we conclude that changing the weights of the original and expanded terms does not improve Precision@20 and in fact has a slight adverse effect; however, Recall@1000 improves by approximately 2%.

7. REFERENCES
[1] M. Bagdouri and D. W. Oard. CLIP at TREC 2015: Microblog and LiveQA. In Proc. TREC 2015, 2015.
[2] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[4] L. Tan, A. Roegiest, and C. L. Clarke. University of Waterloo at TREC 2015 Microblog Track. In Proc. TREC 2015, 2015.
[5] X. Zhu et al. NUDTSNA at TREC 2015 Microblog Track. In Proc. TREC 2015, 2015.