=Paper=
{{Paper
|id=Vol-1737/T2-11
|storemode=property
|title=Semi-automatic keyword based approach for FIRE 2016 Microblog Track
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-11.pdf
|volume=Vol-1737
|authors=Ganchimeg Lkhagvasuren,Teresa Gonçalves,José Saias
|dblpUrl=https://dblp.org/rec/conf/fire/LkhagvasurenGS16
}}
==Semi-automatic keyword based approach for FIRE 2016 Microblog Track==
Ganchimeg Lkhagvasuren, Teresa Gonçalves, José Saias
Évora University
ganchimeg@seas.num.edu.mn, tcg@uevora.pt, jsaias@uevora.pt

ABSTRACT
This paper describes our semi-automatic keyword based approach to the first four topics of the Information Extraction from Microblogs Posted during Disasters task at the Forum for Information Retrieval Evaluation (FIRE) 2016. The approach consists of three phases: keyword extraction, retrieval, and classification.

CCS Concepts
• Computing methodologies → Support Vector Machine
• Information systems → Information Extraction

Keywords
Supervised classification; Information extraction; Terrier; Twitter.

1 INTRODUCTION
It is undeniable that microblogging sites have become key resources of significant information during disaster events [1]. One of these microblogging sites, Twitter, is a social networking website which enables users to post 140-character messages called "tweets" every day. A huge number of tweets is posted, including both informative and non-informative messages, which creates opportunities for information extraction [3]. However, dealing with tweets and identifying specific keywords is challenging due to the nature of Twitter. The short, noisy and fragmented tweets have very simple discourse and pragmatic structure, issues which still challenge state-of-the-art NLP systems [2].

Task description: The aim is to retrieve tweets relevant to each provided topic with both high precision and high recall. The titles of the topics are provided in TREC format as follows:
1. What resources were available
2. What resources were required
3. What medical resources were available
4. What medical resources were required
5. What were the requirements or availability of resources at specific locations
6. What were the activities of various NGOs or government organizations
7. What infrastructure damage or restoration were reported

Dataset: Approximately 50,000 tweets posted during the Nepal earthquake disaster were given in JSON format. A main feature of the task is that a gold standard dataset was not provided.

In our approach, we propose to address the first four topics using keyword extraction with manual work and classification methods. This paper is organized as follows. First, the components of the approach are described separately. Then, the result analysis and conclusion are presented. Our work was submitted to the FIRE 2016 Microblog track [7].

2 KEYWORD BASED APPROACH
Our approach for the Microblog track comprises three phases: keyword extraction, retrieval, and classification (see Figure 1). In the first phase, we extracted all relief resources (keywords) that were available or required. In the middle phase, using those keywords and the Terrier search engine (http://terrier.org), we retrieved the tweets that include at least one keyword. In the last phase, the retrieved tweets are classified into the first and second topics using a Support Vector Machine (SVM).

[Figure 1. Processing pipeline for the task]

2.1 Extracting keywords
In order to extract keywords, we used the following two methods separately, combined with manual work. The keywords we extracted first are those provided in the topic descriptions, such as food, water, volunteer, money, medicine and transportation. The quantitative results of this phase are presented in Table 1.

Since tweets are usually written in an informal style, most NLP tools show poor performance on Twitter datasets. We therefore tried to exploit Twitter-specific NLP tools, namely [4] and [5].

Word embedding: Based on the relief resources mentioned before, we attempted to obtain more keywords from the given dataset. To do so, we first tagged all tweets with the GATE Twitter part-of-speech tagger [4]. After identifying all nouns, each noun is represented by a Word2Vec model [5] that was trained particularly on Twitter data to cope with noisy tweets. Then the 50 nearest-neighbour nouns of each keyword extracted from the descriptions are found as candidates. From these candidates, we manually labeled 86 nouns that are likely to be relief resources during the earthquake as keywords. However, it was clear that there were more keywords we could not extract, such as Nepali words.
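As a small illustration of this expansion step, the Python sketch below uses gensim to query a pre-trained Twitter Word2Vec model for the 50 nearest neighbours of each seed keyword. The model file name word2vec_twitter_model.bin and the seed list are assumptions made for the example, not details taken from the original system.

<pre>
# Sketch of the keyword-expansion step: collect the 50 nearest-neighbour
# words of each seed keyword from a Word2Vec model trained on tweets.
from gensim.models import KeyedVectors

SEED_KEYWORDS = ["food", "water", "volunteer", "money", "medicine", "transportation"]

# Load pre-trained Twitter word embeddings (hypothetical file name).
vectors = KeyedVectors.load_word2vec_format("word2vec_twitter_model.bin", binary=True)

candidates = set()
for seed in SEED_KEYWORDS:
    if seed not in vectors:          # skip seeds missing from the vocabulary
        continue
    # topn=50 mirrors the "50 nearest-neighbour nouns" described above.
    for word, _similarity in vectors.most_similar(seed, topn=50):
        candidates.add(word)

# In the actual system the candidates were restricted to nouns (via the POS
# tagger) and then filtered manually to obtain the final keyword list.
print(len(candidates), "candidate keywords")
</pre>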
Chunking and WordNet: Chunking, one of the basic techniques for information extraction, is also used to identify keywords in our approach. We defined a chunk grammar rule (CHUNK) over the part-of-speech tags produced in the previous step. Next, the nouns were filtered using WordNet [6] and specific verbs such as distribute, give, provide, support and hand. Then we manually enriched the keyword list from the filtered nouns.

Table 1. Results of the keyword extraction phase
Extracted nouns using POS: 12236
Extracted keywords from the descriptions: 16
Manually extracted keywords using Word2Vec: 86
Number of verbs used with chunking: 18
Manually extracted keywords using chunking: 38
Total number of keywords: 124

2.2 Retrieval
Once we had the keywords extracted in the previous phase, we used them in Terrier to retrieve all tweets that include at least one keyword (around 8620 tweets). There are a few open-source search engines; however, we chose Terrier taking some of its advantages into consideration. As the scoring model, we employed BM25, which is based on the probabilistic retrieval framework. The ranks and scores are later used to compute the relevance of a tweet to a topic.

2.3 Classifying into topics
Most of the tweets retrieved in the previous phase can be related to the first two topics, while some of them cannot. For instance, even though the following two tweets both include water (a keyword), the former is related to the first topic, what resources were available, whereas the latter is not related to any topic.

"Anyone in need of drinking water contact me. Have some can donate #earthquake #Nepal #bhaktapur"

"#ShameOnYou #nepalgov Rs 20 water cost Rs 40 #earthquakenepal #earthquake #Nepal #fuckoff don't #donate #unknown #website"

Therefore we classified the tweets into three classes – available, required and other. To do that, we first annotated 1000 tweets manually. In the preprocessing for classification, all URLs, user tags and some symbols were removed. Then we employed three classifiers with basic features such as unigrams, bag-of-words and some Twitter-specific features in the WEKA open-source machine learning software (http://www.cs.waikato.ac.nz/ml/weka/). The best result was obtained by the SVM (see Table 2).

For the third and fourth topics, "What medical resources were available" and "What medical resources were required", we retrieved the relevant tweets from the tweets of the first and second topics, respectively, using the medical relief resources.

Table 2. Accuracy results of cross-validation on the training data
SVM: 81.5
MaxEnt: 78.9
Naive Bayes: 77.2
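As a rough sketch of this classification step, the example below trains a linear SVM on unigram bag-of-words features over the three classes. scikit-learn stands in here for the WEKA setup actually used, and the tiny training set is invented purely for illustration.

<pre>
# Sketch of the three-class tweet classification (available / required / other).
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def preprocess(tweet):
    """Strip URLs, user tags and stray symbols, as in the preprocessing above."""
    tweet = re.sub(r"http\S+", " ", tweet)    # URLs
    tweet = re.sub(r"@\w+", " ", tweet)       # user tags
    tweet = re.sub(r"[^#\w\s]", " ", tweet)   # other symbols (hashtags kept)
    return tweet.lower()

# Invented examples; the real system was trained on 1000 manually annotated tweets.
tweets = [
    "Anyone in need of drinking water contact me. Have some can donate #Nepal",
    "We urgently need medicine and volunteers in Bhaktapur",
    "Rs 20 water cost Rs 40 #ShameOnYou",
]
labels = ["available", "required", "other"]

# Unigram bag-of-words features + linear SVM (the best classifier in Table 2).
model = make_pipeline(CountVectorizer(preprocessor=preprocess), LinearSVC())
model.fit(tweets, labels)

print(model.predict(["Food and water available at the camp near Kathmandu"]))
</pre>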
3 RESULT
It is not possible to compare our results with those of the other participants, because we submitted runs for only three topics to the organizers. However, the results estimated by the organizers were reasonable, which encouraged us to complete our work. The result is presented in Table 3.

Table 3. Results estimated by the organizers
Run_ID: Ganji_1
Precision @ 20: 0.8500
Recall @ 1000: 0.4988
MAP @ 1000: 0.2204
Overall MAP: 0.2420

4 CONCLUSION
In this paper, we have presented our keyword based approach to the first four topics of the FIRE 2016 Microblog Track. Our system is semi-automatic, as it includes manual work in the keyword extraction phase. Moreover, the phases are not integrated with each other. Next, we plan to make our system fully automatic and to use more advanced methods.

5 REFERENCES
[1] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, P. Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In: Proceedings of the 10th International ISCRAM Conference, 2013.
[2] A. Ritter, Mausam and O. Etzioni. Open Domain Event Extraction from Twitter. In: KDD '12, 2012.
[3] J. Piskorski, R. Yangarber. Information Extraction: Past, Present and Future. In: Multi-source, Multilingual Information Extraction and Summarisation, 2013.
[4] L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL, 2013.
[5] F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle. Multimedia Lab @ ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations. In: Workshop on Noisy User-generated Text, ACL, 2015.
[6] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41, 1995.
[7] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog Track: Information Extraction from Microblogs Posted during Disasters. In: Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org, 2016.