=Paper=
{{Paper
|id=Vol-1737/T2-9
|storemode=property
|title=Information Extraction from Microblogs
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-9.pdf
|volume=Vol-1737
|authors=Prashant Bhardwaj,Partha Pakray
|dblpUrl=https://dblp.org/rec/conf/fire/BhardwajP16
}}
==Information Extraction from Microblogs==
Information Extraction from Microblogs

Prashant Bhardwaj
Computer Science and Engineering
National Institute of Technology Agartala
cse.pbh@gmail.com

Partha Pakray
Computer Science and Engineering
National Institute of Technology Mizoram
parthapakray@gmail.com

ABSTRACT

Microblogging sites carry the emotions and expressions of the public in raw form. This data can be mined for meaningful information that could be used to develop technologies for future use. Numerous microblogging sites are available these days and are used in different contexts: some mainly for conversation, some for image and video sharing, and some for formal and official purposes. Twitter is one of the most outspoken platforms for sharing emotions and comments on almost every topic, from sport to entertainment and from religion to politics. This paper attempts to extract information from a database of tweets collected from Twitter. The task is to develop methodologies for extracting tweets that are relevant to each topic with high precision. The paper presents the nita_nitmz team's participation in the FIRE 2016 Microblog track.

CCS Concepts

• Computing methodologies ~ Natural Language Processing
• Information systems ~ Information extraction

Keywords

Information Retrieval; Microblogging; Twitter

1 INTRODUCTION

The rapid growth of the Internet provides new sources of information. Nowadays people prefer to express themselves on social sites more often than in any print medium. The idea of information extraction from microblogs posted during disasters was introduced by Vieweg et al. in 2010 in the Proceedings of the SIGCHI Conference on Human Factors in Computing Systems [1]. It has since become one of the most researched topics, considering its potential for properly assessing any incident. The importance of the topic can be attributed to the fact that people rarely post false information on social sites and express their emotions according to their knowledge and wisdom. This paper presents the experiments carried out at National Institute of Technology Agartala as part of the participation in the Forum for Information Retrieval Evaluation (FIRE) 2016 track on Information Extraction from Microblogs Posted during Disasters [14]. Our experiments for FIRE 2016 are based on stemming, zonal indexing, theme identification, a TF-IDF based ranking model, and positional information. The data contained 48,845 of the 50,000 tweets mentioned on the workshop website. Queries were provided by the organizing committee, each specified in title, narration, and description format.

2 RELATED WORKS

The problem of information extraction from microblogs posted during disasters has been researched for several years, starting in 2010 with Vieweg et al. [1] and Palen et al. [2]. There has been tremendous work since then, and a new field of information retrieval has come into existence. Verma et al. wrote on situational awareness through tweets [3]. Research on locating disaster-hit areas, on disaster response, and on information extraction has continued since then [4][5][6][7]. One important part of the information retrieval problem is part-of-speech tagging of code-mixed microblog data [8][9][10][11][12][13]. Several researchers have also worked on information extraction from mixed-script text in social media websites and forums; previously, English used to dominate microblogging sites such as Twitter and Facebook.

3 TASK DESCRIPTION

A large set of microblogs (tweets) posted during a recent disaster event was made available, along with a set of topics (in TREC format). Each 'topic' identified a broad information need during a disaster, such as: what resources are needed by the population in the disaster-affected area, what resources are available, what resources are required or available in which geographical region, and so on. Specifically, each topic contained a title, a brief description, and a more detailed narrative on what type of tweets would be considered relevant to the topic. Participants were required to develop methodologies for extracting tweets relevant to each topic with high precision as well as high recall. The data contained:

• Around 50,000 microblogs (tweets) from Twitter that were posted during the Nepal earthquake in April 2015. Only the tweet IDs were provided, along with a script to be used to download the tweets through the Twitter API. Of the 50,000 tweets, only 48,845 could be downloaded on our experimental setup.

• A set of 5-8 topics in TREC format, each containing a title, a brief description, and a more detailed narrative.

4 METHODOLOGY

For the given task we created the required search configuration on Apache Nutch 0.9, a highly extensible and scalable open-source web crawler. The implementation of the task was done in two steps: first, creating the search environment; second, applying the test queries against the configured Nutch index through a Tomcat server.

4.1 Preparation of the data

The Python script provided by the organizers fetched the tweets, but the generated file was in JSON format, so another script had to be developed to extract the tweets from it. The JSON file contained the tweets, the tweet_ids, and other metadata. The first problem arose after extracting the tweets from the JSON file: only 48,845 tweets had been downloaded via the given script. To suit Apache Nutch, the tweets had to be separated into different files, and since the task was to extract relevant tweets, the files had to be named after their tweet_ids. We produced two intermediate files, one containing only the tweets and the other containing the tweet_ids, and then wrote code to take a tweet_id from one file, create a text file with that name, take the corresponding tweet from the other file, and store it in the newly created file. Due to some unavoidable errors, only 48,815 files could be created, each named after a tweet_id and containing the corresponding tweet. A rough sketch of this step is shown below.
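The following is a minimal sketch of the preparation step, not the authors' actual script. It assumes the organizers' download script wrote one JSON object per line, with the standard Twitter API fields id_str and text; the exact field names and file layout of the provided data are assumptions.

```python
import json
import os

def split_tweets(json_path, out_dir):
    """Write each tweet into its own file, named after its tweet_id,
    so the collection can be fed to Nutch."""
    os.makedirs(out_dir, exist_ok=True)
    created = 0
    with open(json_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip malformed records
            tweet_id = tweet.get("id_str")  # assumed field name
            text = tweet.get("text")        # assumed field name
            if not tweet_id or not text:
                continue
            with open(os.path.join(out_dir, tweet_id + ".txt"),
                      "w", encoding="utf-8") as out:
                out.write(text)
            created += 1
    return created

if __name__ == "__main__":
    print(split_tweets("tweets.json", "tweet_files"), "files created")
```

Skipping malformed records silently, as above, would also explain why slightly fewer files (48,815) were created than tweets downloaded (48,845).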
4.2 Crawling using Nutch

We used another script to store the addresses of the different files in the urls.txt file, and then started the crawl with Nutch. Nutch works in the following steps:

Injector: The injector takes all the URLs from the seed file (here, urls.txt), compares them against the regex-urlfilter patterns, and updates the crawldb with the supported URLs. The crawldb maintains information on all known URLs (fetch schedule, fetch status, metadata, etc.).

Generator: Based on the data in the crawldb, the generator selects the best-scoring URLs due for fetching, and the segments directory is created.

Fetcher and CrawlDb Update: Next, the fetcher fetches the remote pages of the URLs on the fetch list and writes them to the segment directory. This step takes a lot of time.

Parser: The contents of each page are parsed. If the crawl produces an extension of an already existing page, the updater adds the new data to the crawldb.

Inverting: The links need to be inverted before indexing. This takes care of the fact that incoming links are more valuable than outgoing links, similar to how Google PageRank works. The inverted links are saved in the linkdb.

Indexing, Deduplicating and Merging: Using the data from the crawldb, linkdb, and segments, the indexer creates a Lucene index and saves it. A sketch of how such a crawl can be launched is given below.
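Nutch 0.9 bundles all of the above steps into a single crawl command. The invocation below is a minimal sketch under our assumptions: the directory names and the depth/topN values are illustrative and not taken from the paper, and crawling local files additionally requires the file protocol plugin to be enabled in the Nutch configuration.

```python
import subprocess

# One-shot Nutch 0.9 crawl: inject, generate, fetch, parse, update,
# invert links, and index in a single command.
# "urls" is the directory holding urls.txt; "crawl" is the output dir.
subprocess.run(
    ["bin/nutch", "crawl", "urls", "-dir", "crawl",
     "-depth", "1", "-topN", "50000"],
    check=True,
)
```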
4.3 Searching for Test Queries

Now we can search the crawled database for tweets. Searching for a test query takes place in the following steps:

Stop Word Removal: The stop words are removed from the given query, because they contribute little to the searching procedure.

Query Segmentation: For a given set of words in a query, the search engine may not give proper results, so every combination of the words contained in the search query is searched.

Merging: As discussed under Query Segmentation, results for the full query may not be available, so the results obtained from the different combinations of the query words are merged to obtain the probable results of the search. A sketch of this query pipeline is shown below.
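The following is a rough reconstruction of the query-side pipeline, not the code actually used. The search() stub stands in for a call to the Nutch search application running on Tomcat (whose scores could come from a TF-IDF style model, as mentioned in the introduction), and the stop-word list is abbreviated.

```python
from itertools import combinations

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "what"}

def search(query):
    """Stub for a call to the Nutch search servlet on Tomcat;
    should return a list of (tweet_id, score) pairs."""
    raise NotImplementedError

def remove_stop_words(query):
    # Stop words contribute little to the search, so drop them first.
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def segmented_search(query, min_len=2):
    """Search every combination of the remaining query words and
    merge the per-combination results, keeping each tweet's best score."""
    words = remove_stop_words(query)
    merged = {}
    for r in range(len(words), min_len - 1, -1):
        for combo in combinations(words, r):
            for tweet_id, score in search(" ".join(combo)):
                if score > merged.get(tweet_id, 0.0):
                    merged[tweet_id] = score
    # Rank the merged results, best first.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```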
5 RESULT AND CONCLUSION

We submitted 37 results for the four query subtexts. The run submission was accepted in the semi-automatic category. The results from the organizers after judging the submitted run are provided in Table 1.

Table 1. Evaluation Results of Semi-automatic Runs

Run Id         Precision@20   Recall@1000   MAP@1000   Overall MAP
nita_nitmz_1   0.0583         0.0046        0.0031     0.0031

The results are not encouraging, but considering that we started from scratch, we have much to learn. The different participating teams have employed different algorithms to extract the results. We will try to enhance our methodology in future research.

6 REFERENCES

[1] Vieweg, S., Hughes, A. L., Starbird, K., Palen, L. 2010. Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '10), Atlanta, GA, USA, April 10-15, 2010. ACM, 1079-1088. DOI=10.1145/1753326.1753486.

[2] Palen, L., Anderson, K. M., Mark, G., Martin, J., Sicker, D., Palmer, M., Grunwald, D. 2010. A vision for technology-mediated support for public participation & assistance in mass emergencies & disasters. In Proceedings of the 2010 ACM-BCS Visions of Computer Science Conference (ACM-BCS '10), Swindon, UK, April 13-16, 2010.

[3] Verma, S., Vieweg, S., Corvey, W. J., Palen, L., Martin, J. H., Palmer, M., Schram, A., Anderson, K. M. 2011. Natural Language Processing to the Rescue? Extracting "Situational Awareness" Tweets During Mass Emergency. Association for the Advancement of Artificial Intelligence.

[4] Watanabe, K., Ochi, M., Okabe, M., Onai, R. 2011. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, October 24-28, 2011. 2541-2544. DOI=10.1145/2063576.2064014.

[5] Lingad, J., Karimi, S., Yin, J. 2013. Location extraction from disaster-related microblogs. In Proceedings of the 22nd International Conference on World Wide Web (WWW '13 Companion), Rio de Janeiro, Brazil, May 13-17, 2013. 1017-1020. DOI=10.1145/2487788.2488108.

[6] Imran, M., Elbassuoni, S., Castillo, C., Diaz, F., Meier, P. 2013. Practical extraction of disaster-relevant information from social media. In Proceedings of the 22nd International Conference on World Wide Web (WWW '13 Companion), Rio de Janeiro, Brazil, May 13-17, 2013. 1021-1024. DOI=10.1145/2487788.2488109.

[7] Imran, M., Castillo, C., Lucas, J., Meier, P., Vieweg, S. 2014. AIDR: artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion), Seoul, South Korea, April 7-11, 2014. 159-162. DOI=10.1145/2567948.2577034.

[8] Jamatia, A., Das, A. Part-of-Speech Tagging System for Indian Social Media Text on Twitter. In Proceedings of the Workshop on Language Technologies for Indian Social Media (SOCIAL-INDIA), 21-28.

[9] Bhaskar, P., Das, A., Pakray, P., Bandyopadhyay, S. 2010. Theme Based English and Bengali Ad-hoc Monolingual Information Retrieval in FIRE 2010. In FIRE 2010 Working Notes.

[10] Barman, U., Das, A., Wagner, J., Foster, J. 2014. Code-Mixing: A Challenge for Language Identification in the Language of Social Media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, EMNLP 2014, Doha, Qatar, October 2014.

[11] Gambäck, B., Das, A. 2016. Comparing the Level of Code-Switching in Corpora. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), Portorož, Slovenia, May 23-28, 2016.

[12] Jamatia, A., Gambäck, B., Das, A. 2016. Collecting and Annotating Indian Social Media Code-Mixed Corpora. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016), Konya, Turkey, April 3-9, 2016.

[13] Chakma, K., Das, A. 2016. CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016), Konya, Turkey, April 3-9, 2016.

[14] Ghosh, S., Ghosh, K. 2016. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016. CEUR Workshop Proceedings, CEUR-WS.org.