=Paper= {{Paper |id=Vol-1737/T2-9 |storemode=property |title=Information Extraction from Microblogs |pdfUrl=https://ceur-ws.org/Vol-1737/T2-9.pdf |volume=Vol-1737 |authors=Prashant Bhardwaj,Partha Pakray |dblpUrl=https://dblp.org/rec/conf/fire/BhardwajP16 }} ==Information Extraction from Microblogs== https://ceur-ws.org/Vol-1737/T2-9.pdf
                       Information Extraction from Microblogs
                 Prashant Bhardwaj                                                     Partha Pakray
        Computer Science and Engineering                                   Computer Science and Engineering
      National Institute of Technology Agartala                          National Institute of Technology Mizoram
                cse.pbh@gmail.com                                               parthapakray@gmail.com

ABSTRACT                                                               2010 by Sarah Vieweg et.al. [1] and Leysia Palen et.al.[2]. But
The micro blogging sites contain the emotion and expression of the     there has been tremendous work since then and a new field of
public in raw format. The data can be used to extract much             information retrieval has come into existence. Sudha Verma et.al.
meaningful information that could be used to develop technologies      wrote on Situational Awareness through tweets [3]. The research
for future use. There are numerous micro blogging sites available      on location of disaster hit area, the response and the information
these days that are used in different contexts. Some are used          extraction has been going on since then [4][5][6][7]. One of the
basically for conversation, some for image and video sharing, and      important part of the information retrieval part is the part of
some for formal and official purposes. Twitter is one of the most      speech     tagging     in     the     code    mixed      microblog
outspoken platform for sharing the emotions and comments on            data[8][9][10][11][12][13]. Several researchers even work on
almost every topic starting from sport to entertainment, religion to   information extraction from mixed script analysis in social media
politics and many more. The paper attempts to extract information      websites and forums. English used to dominate the micro
from the database of the tweets collected from Twitter. The task is    blogging sites previously such as Twitter and Facebook.
to develop methodologies for extracting tweets that are relevant to
each topic with high precision. This paper presents the nita_nitmz
team participation in FIRE 2016 Microblog track.                       3 TASK DESCRIPTION
CCS Concepts                                                           A large set of microblogs (tweets) posted during a recent disaster
 •    Computing methodologies ~ Natural language                       event was be made available, along with a set of topics (in TREC
      Processing                                                       format). Each ‘topic’ identified a broad information need during a
 •    Information systems ~ Information extraction                     disaster, such as – what resources are needed by the population in
                                                                       the disaster affected area, what resources are available, what
Keywords                                                               resources are required / available in which geographical region,
Information Retrieval ; Micro blogging ; Twitter                       and so on. Specifically, each topic contained a title, a brief
                                                                       description, and a more detailed narrative on what type of tweets
                                                                       will be considered relevant to the topic. The participants are
                                                                       required to develop methodologies for extracting tweets that are
1 INTRODUCTION                                                         relevant to each topic with high precision as well as high recall.
The faster growth of Internet in current period provides new           The data contained:
sources of information. Now a day’s people prefer to express
                                                                            •    Around 50,000 microblogs (tweets) from Twitter, those
themselves more often on social sites than any print media. The
idea of Information Extraction from Microblogs Posted during                     were posted during the Nepal earthquake in April 2015.
Disasters was introduced by Sarah Vieweg et.al., in 2010                         Only the tweetids of the tweets was provided, along
Proceedings of the SIGCHI Conference on Human Factors in                         with a script that was be used to download the tweets
Computing Systems [1]. Henceforth it has become one of the                       using the Twitter API. Out of 50,000 tweets only 48845
most researched topics considering the possibilities it contained                could be downloaded on my experimental setup.
for the proper accessing of any incidence. The importance of the
                                                                            •    A set of 5 – 8 topics in TREC format, each containing a
said topic can be attributed to the fact that people rarely provide
any false information in the social sites and pour their emotions                title, a brief description, and a more detailed narrative.
according to their knowledge and wisdom. This paper presents the
experiments carried out at National Institute of Technology
Agartala as part of the participation in the Forum for Information
Retrieval Evaluation (FIRE) 2016 in Information Extraction from        4 METHODOLOGY
Microblogs Posted during Disasters [14]. The experiments carried       For the given task we created the required searching configuration
out by us for FIRE 2016 are based on stemming, zonal indexing,         on Apache Nutch 0.9 which is a highly extensible and scalable
theme identification, TF-IDF based ranking model and positional        open source web crawler software project. The implementation of
information. The data contained 48845 tweets out of 50,000             the said task is done in two steps. First is creating the searching
tweets mentioned in the workshop website. Query was provided           environment, Secondly apply appropriate test queries to search the
by the organizing committee and each query was specified using         results from the previously configured Nutch using Tomcat
title, narration and description format.                               server.
2 RELATED WORKS                                                        4.1 Preparation of the data
The problem of Information Extraction from Microblogs Posted           The python script provided by the organizers helped to get the
during Disasters is researched for a couple of years starting from     tweets. But file generated was of json type. So another script had
to be developed to extract the tweets from the json file. The json                the query text is merged to get one of the probable
file contained the tweets, tweet_id and many more metadata. First                 results of the search.
problem arises when after extracting tweets from the json file, we
found out that only 48845 tweets were downloaded via the given
script. As per the norms of the Apache Nutch, the tweets had to          5    RESULT AND CONCLUSION
separated into different files. Since the task was to extract relevant   We submitted 37 results of the four query subtexts. The run
tweets , the files containing the tweets had to be named according       submission was accepted in the category of semi automatic run.
to their tweet_ids. We developed two files one containing the            The results from the organizers after judging the submitted run is
tweets only and other containing the tweet_ids. After that we            provided in Table 1.
wrote a code to take the tweet_id from one file , create a text file
with that name , take a tweet from another file and store it in the           Table 1. Evaluation Results of Semi-automatic Runs
newly created file. But due to some unavoidable errors, only                              Precision     Recall      MAP        Overall
48815 file could be created, each having the name as tweet_id and            Run Id
                                                                                            @20         @1000       @1000       MAP
containing the corresponding tweet inside.                               nita_nitmz_1      0.0583       0.0046      0.0031     0.0031

                                                                         The results are not encouraging, but considering the fact that we
4.2 Crawling using Nutch                                                 started from the scratch, we have much to learn. The different
We used another code to store the addresses of the different file in     participating teams have employed different algorithms to extract
the urls.txt file. We started the crawling part using the nutch. It      the results. We would try to enhance our methodology for future
works in following steps:                                                research.
Injecter: All the URLs are taken by the injector from the
          seed.txt (here urls.txt) file, compares urls with regex-
          urlfiler regex and update crawldb with supported urls..        6    REFERENCES
          The crawldb maintains information on all known URLs
          (fetch schedule, fetch status, metadata, …).                   [1] Vieweg S., L. Hughes A., Starbird K., Plaen L., 2010.
                                                                             Microblogging during two natural hazards events: what
Generator: Based on the data of crawldb, the generator selects               twitter may contribute to situational awareness. In
          best-scoring urls due for fetch and the segments                   Proceedings of the SIGCHI Conference on Human Factors
          directory is created.                                              in Computing Systems(Atlanta, GA, USA, April 10-15,2010).
Fetcher and CrawlDb Update: Next, the fetcher fetches                        CHI '10. ACM, Atlanta, GA, 1079-1088.
          the remote pages of the URLs on the fetch list and                 DOI=10.1145/1753326.1753486.
          updates it to the segment directory. This step takes a lot     [2] Palen L., M. Anderson K., Mark G.,Martin J., Sicker D.,
          of time.                                                           Palmer M., Grunwald D. 2010. A vision for technology-
Parser: The contents of each web page are parsed. If the crawl               mediated support for public participation & assistance in
          produces another extension to an already existing one,             mass emergencies & disasters. In Proceedings of the 2010
          the updater adds the new data to the crawldb.                      ACM-BCS Visions of Computer Science Conference
                                                                             (Swindon, UK, 13-16 April 2010). ACM-BCS '10, Swindon,
Inverting: The links need to be inverted before indexing. The                UK.
          fact that the number of incoming links is more valuable
                                                                         [3] Verma S., Vieweg S., J. Corvey W., Plaen L., H. Martin J.,
          than the outgoing links is taken care off, similar to how
                                                                             Palmer M., Schram A., M. Anderson K., 2011. Natural
          Google PageRank works. The inverted links are saved
                                                                             Language Processing to the Rescue?: Extracting “Situational
          in the linkdb.
                                                                             Awareness” Tweets During Mass Emergency. Association
Indexing, Deduplicating and Merging: Using data from                         for the Advancement of Artificial Intelligence.
          crawldb, linkdb and segments, the indexer creates an
                                                                         [4] Watanabe K., Ochi M., Okabe M., Onai R.2011. Jasmine: a
          index and saves it. The Lucene library is used for
                                                                             real-time local-event detection system based on geolocation
          Indexing.
                                                                             information propagated to microblogs. In Proceedings of the
                                                                             20th ACM international conference on Information and
4.3 Searching for test Queries                                               knowledge management(Glasow,UK, 24-28 October 2011).
Now, we can search for tweets regarding the crawled database.                CIKM '11, Glasow,UK, 2541-2544,
The searching of the test query takes place in the following steps:          DOI=10.1145/2063576.2064014.

Stop Word Removal: From the given query the stop words                   [5] Lingad J., Karimi S., Yin J. 2013. Location extraction from
          have to be removed, because they do not contribute                 disaster-related microblogs. In Proceedings of the 22nd
                                                                             International Conference on World Wide Web(Rio De
          much to the searching procedure.
                                                                             Janerio, Brazil, 13-17 May). WWW '13 Companion, Rio De
Query Segmentation: for a given set of words in a query, the                 Janerio, Brazil, 1017-1020. DOI=10.1145/2487788.2488108
          search engine may not give proper results, so it searched      [6] Imran M., Elbassuoni S., Castillo C., Diaz F., Meier P. 2013.
          for every combination of the words contained in the                Practical extraction of disaster-relevant information from
          search query.                                                      social media. In Proceedings of the 22nd International
Merging: As discussed in the Query segmentation, the results                 Conference on World Wide Web(Rio De Janerio, Brazil, 13-
          for a given query may not be available; so the results             17 May). WWW '13 Companion, Rio De Janerio, Brazil,
          obtained from the different combination of the words of            1021-1024. DOI=10.1145/2487788.2488109
[7] Imran M., Castillo C., Lucas J., Meier P., Vieweg S. 2014.
    AIDR: artificial intelligence for disaster response. In
    Proceedings of the 23nd International Conference on World
    Wide Web(Seoul, South Korea,7-11 April). WWW '14
    Companion, Seoul, South Korea, 159-162. DOI=
    10.1145/2567948.2577034
[8] Anupam Jamatia,Amitava Das.Part-of-Speech Tagging
    System for Indian Social Media Text on Twitter. In
    Proceedings Workshop on Language Technologies For
    Indian Social Media(SOCIAL-INDIA), Pages 21- 28.
[9] Pinaki Bhaskar, Amitava Das, Partha Pakray, Sivaji
    Bandyopadhyay, Theme Based English and Bengali Ad-hoc
    Monolingual Information Retrieval in FIRE 2010, In FIRE
    2010, Working Notes. [2010]
[10] Barman, U., Das, A., Wagner, J., and Foster, J. 2014 Code-
     Mixing: A Challenge for Language Identification in the
     Language of Social Media. In The 1st Workshop on
     Computational Approaches to Code Switching, EMNLP
     2014, October, 2014, Doha, Qatar.
[11] Björn Gambäck, and Amitava Das.2016. Comparing the
     Level of CodeSwitching in Corpora. In the 10th edition of
     the Language Resources and Evaluation Conference
     (LREC), 2328 May 2016, Portorož (Slovenia).
[12] Anupam Jamatia, Björn Gambäck, and Amitava Das. 2016.
     Collecting and Annotating Indian Social Media CodeMixed
     Corpora. In the 17th International Conference on Intelligent
     Text Processing and Computational Linguistics (CICLING),
     April 3–9, Konya, Turkey.
[13] Kunal Chakma, and Amitava Das. CMIR:A Corpus for
     Evaluation of Code Mixed Information Retrieval of Hindi-
     English Tweets. 2016. In the 17th International Conference
     on Intelligent Text Processing and Computational Linguistics
     (CICLING), April 3–9, Konya, Turkey.
[14] S. Ghosh and K. Ghosh. Overview of the FIRE 2016
     Microblog track: Information Extraction from Microblogs
     Posted during Disasters. In Working notes of FIRE 2016 -
     Forum for Information Retrieval Evaluation, Kolkata, India,
     December 7-10, 2016, CEUR Workshop Proceedings.
     CEUR-WS.org, 2016.