=Paper=
{{Paper
|id=Vol-1737/T2-7
|storemode=property
|title=Real Time Information Extraction from Microblog
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-7.pdf
|volume=Vol-1737
|authors=Sandip Modha,Chintak Mandalia,Krati Agrawal,Deepali Verma,Prasenjit Majumder
|dblpUrl=https://dblp.org/rec/conf/fire/ModhaMAVM16
}}
==Real Time Information Extraction from Microblog==
Sandip Modha (DAIICT Gandhinagar, Gujarat-382007, India), Chintak Mandalia (LDRP Gandhinagar, Gujarat-382015, India), Krati Agrawal (DAIICT Gandhinagar, Gujarat-382007, India), Deepali Verma (DAIICT Gandhinagar, Gujarat-382007, India), Prasenjit Majumder (DAIICT Gandhinagar, Gujarat-382007, India)

ABSTRACT
This paper presents the participation of the Information Retrieval Lab (IR Lab, DA-IICT Gandhinagar) in the FIRE 2016 Microblog track. The main objective of the track is to identify Information Retrieval methodologies for retrieving important information from tweets posted during disasters. We submitted two runs for this track. In the first run, daiict irlab 1, we expanded the topic terms using a Word2vec model trained on the tweet corpus provided by the organizers; relevance scores between tweets and topics are calculated with the Okapi BM25 model. Precision@20, the primary metric, for this run is 0.3143. In the second run, daiict irlab 2, we set different weights for the original terms and the expanded topic terms, and achieve Precision@20 of around 0.30.

1. INTRODUCTION
Social media, like Twitter, is a massive source of real-time information. Twitter is one of the most popular microblogging websites and carries massive user-generated content due to its large number of registered users. During disasters, Twitter has proved its importance on many occasions.

In the FIRE 2016 Microblog track [2], a large set of microblogs (tweets) posted during the Nepal earthquake was made available by the track organizers, along with a set of topics (in TREC format). Each topic identifies a broad information need during a disaster, such as what resources are needed by the population in the disaster-affected area, what resources are available, and what resources are required or available in which geographical region. Specifically, each topic contains a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic.

2. RELATED WORK
We started our work by referring to the TREC 2015 Microblog track papers [1, 5, 4]. CLIP [1] trained their Word2vec model on a four-year tweet corpus and used the Okapi BM25 relevance model to calculate the score. To refine the scores, the relevant tweets were rescored with the SVMrank package using the relevance score of the previous stage. Novelty detection was then performed using Jaccard similarity, and tweets which are not useful were discarded. The University of Waterloo [4] implemented the filtering tasks by building a term vector for each user profile and assigning different weights to different types of terms. To discover the most significant tokens in each user profile, they calculated pointwise KL divergence and ranked the scores for each token in the profile.

3. PROBLEM STATEMENT
Given a set of topics Q = {FMT1, ..., FMT7}, each representing a different information need, and a corpus of tweets T = {t1, t2, ..., tn}, we need to compute the relevance score between tweets and topics:

R_score = f(T, Q)

4. OUR APPROACH
In this section, we discuss the architecture of the proposed system.

4.1 Topic Pre-processing
The FIRE 2016 Microblog track provides 7 topics; essentially, these topics are our queries. We converted each topic into a query by removing stop words and keeping only nouns, proper nouns, and verbs identified with the Stanford POS tagger.

4.2 Topic (Query) Expansion
We trained a Word2vec model [3] on the corpus provided by the organizers to expand the topic terms, taking the 5 most similar words and hashtags for each term. We set equal weights for all terms in the first run (daiict irlab 1). In the second run, we set different weights for the original terms and the expanded terms. Words like "required" and "available" were additionally expanded with their synonyms using WordNet and assigned more weight. A sketch of this expansion step is given below.
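As an illustration only, the following is a minimal sketch of Word2vec-based query expansion with gensim, not the exact code used for our runs. The file name tweets.txt, the training parameters, and the 2:1 weighting shown are assumptions for the example.

<pre>
# Minimal sketch of Word2vec-based query expansion (assumed corpus file: tweets.txt).
# Parameters and weights are illustrative, not the exact settings of the submitted runs.
from gensim.models import Word2Vec

# Each line of the corpus file is one pre-processed tweet.
with open("tweets.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

# Train a Word2vec model on the tweet corpus provided by the organizers.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

def expand_query(terms, topn=5, original_weight=2.0, expansion_weight=1.0):
    """Return {term: weight} with the original query terms plus the top-n
    most similar words/hashtags found by the Word2vec model."""
    weighted = {t: original_weight for t in terms}
    for t in terms:
        if t in model.wv:
            for similar, _score in model.wv.most_similar(t, topn=topn):
                weighted.setdefault(similar, expansion_weight)
    return weighted

print(expand_query(["water", "medicine", "required"]))
</pre>

In the first run every term would receive the same weight; the second run corresponds to the 2:1 weighting of original versus expanded terms shown above.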
4.3 Tweet Pre-processing
In this step, non-English tweets were filtered out. Tweets include smileys, hashtags, and many special characters; we did not consider retweets or tweets containing only hashtags, emoticons, or special characters. We also ignored tweets with fewer than 5 words and removed all stop words from the tweets.

4.4 Query Normalization
In this step, the title and description of each topic were merged to make the topics more informative. To increase relevance, the topics were also pre-processed by converting all characters to lower case and expanding abbreviations, e.g. NYC to New York City. The topics were also stemmed, e.g. "behaving" was converted to "behave".

4.5 Relevance Score
In this phase, we calculated the relevance score between tweets and topics. In the first run, we kept the same weight for the original terms and the expanded terms. In the second run, we set a weight of 2 for the original terms in the topics and 1 for the expanded terms. We used the Okapi BM25 model for calculating the relevance score between the expanded topics and the tweets:

R_score = BM25_Sim(Q_exp, T)

Figure 1: BM25

A sketch of this weighted BM25 scoring is given below.
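The following is a minimal, self-contained sketch of Okapi BM25 scoring in which each query term's contribution is multiplied by its weight (2 for original terms, 1 for expanded terms). It is not the exact implementation used for the runs; the k1 and b values and the example tweets are illustrative assumptions.

<pre>
# Minimal sketch of weighted Okapi BM25 scoring between an expanded topic and tweets.
# Not the exact implementation used for the runs; parameters and data are illustrative.
import math
from collections import Counter

def bm25_scores(tweets, weighted_query, k1=1.2, b=0.75):
    """tweets: list of token lists; weighted_query: dict {term: weight}.
    Returns one BM25 score per tweet, with each query term's contribution
    scaled by its weight."""
    N = len(tweets)
    avgdl = sum(len(t) for t in tweets) / N
    # Document frequency of each query term.
    df = {q: sum(1 for t in tweets if q in t) for q in weighted_query}
    scores = []
    for tweet in tweets:
        tf = Counter(tweet)
        score = 0.0
        for q, weight in weighted_query.items():
            if tf[q] == 0:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(tweet) / avgdl))
            score += weight * idf * norm
        scores.append(score)
    return scores

tweets = [["water", "needed", "in", "kathmandu"],
          ["medicine", "available", "at", "camp"],
          ["earthquake", "relief", "volunteers", "required"]]
query = {"water": 2.0, "required": 2.0, "needed": 1.0}  # original vs. expanded terms
print(bm25_scores(tweets, query))
</pre>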
5. RESULTS
We misunderstood the track guideline and submitted only the top 100 tweets for each topic. As a result, our Precision@20 is in line with the other participants, but the other metrics were substantially lower. Table 1 presents the results declared by the track organizers. After obtaining the gold-standard data from the track organizers, we repeated the experiments; Table 2 shows the results when the top 1000 tweets are retrieved for each topic.

Table 1: Official results as declared by the track organizers
Run Id           Precision@20   Recall@1000   MAP@1000   Overall MAP
daiict irlab 1   0.3143         0.0729        0.0275     0.0275
daiict irlab 2   0.3000         0.0704        0.0250     0.0250

Table 2: Post-evaluation results on the top 1000 tweets
Run Id           Precision@20   Recall@1000   MAP@1000   Overall MAP
daiict irlab 1   0.3143         0.1499        0.0638     0.0638
daiict irlab 2   0.3000         0.1528        0.0625     0.0625

6. CONCLUSION
We submitted two runs in the FIRE 2016 Microblog track. In the first run, we expanded the topic terms by training a Word2vec model on the corpus provided by the track organizers and calculated the relevance score between the expanded topic terms and the tweets using the Okapi BM25 model, keeping the same weight for the original and expanded terms. In the second run, we set the weights of the original terms and the expanded terms in the ratio 2:1 and put more weight on words like "available" and "required". After analyzing the results, we conclude that changing the weights of the original and expanded terms does not improve Precision@20 and in fact has a slight adverse effect; however, Recall@1000 improves by approximately 2%.

7. REFERENCES
[1] M. Bagdouri and D. W. Oard. CLIP at TREC 2015: Microblog and LiveQA. In Proc. TREC 2015, 2015.
[2] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[4] L. Tan, A. Roegiest, and C. L. Clarke. University of Waterloo at TREC 2015 Microblog Track. In Proc. TREC 2015, 2015.
[5] X. Zhu et al. NUDTSNA at TREC 2015 Microblog Track. In Proc. TREC 2015, 2015.