   Semi-automatic keyword based approach for FIRE 2016
                     Microblog Track
    Ganchimeg Lkhagvasuren                                Teresa Gonçalves                                 José Saias
             Évora University                                Évora University                            Évora University
ganchimeg@seas.num.edu.mn                                   tcg@uevora.pt                             jsaias@uevora.pt

ABSTRACT
This paper describes our semi-automatic keyword based approach for the first four topics of the Information Extraction from Microblogs Posted during Disasters task at the Forum for Information Retrieval Evaluation (FIRE) 2016. The approach consists of three phases: keyword extraction, retrieval, and classification.

CCS Concepts
• Computing methodologies → Support vector machines • Information systems → Information extraction.

Keywords
Supervised classification; Information extraction; Terrier; Twitter.
1 INTRODUCTION
It is undeniable that microblogging sites have become key resources of significant information during disaster events [1]. One of these microblogging sites, Twitter, is a social networking website that enables users to post 140-character messages named "tweets" every day. A huge number of tweets is posted, including both informative and non-informative messages, which creates opportunities for information extraction [3].
However, dealing with tweets and identifying specific keywords is challenging work due to the nature of Twitter. Small, noisy and fragmented tweets have very simple discourse and pragmatic structure, issues which still challenge state-of-the-art NLP systems [2].

Task description: The aim is to retrieve tweets relevant to each of the provided topics with high precision as well as high recall. The topic titles are provided in TREC format as follows:

     1.   What resources were available
     2.   What resources were required
     3.   What medical resources were available
     4.   What medical resources were required
     5.   What were the requirements or availability of resources at specific locations
     6.   What were the activities of various NGOs or government organizations
     7.   What infrastructure damage or restoration were reported

Dataset: Approximately 50,000 tweets posted during the Nepal earthquake disaster were given in JSON format. A main feature of the task is that a gold standard dataset was not provided.

In terms of our approach, we propose to address the first four topics using keyword extraction with manual work and classification methods.

This paper is organized as follows. First, the components of the approach are described separately. Then, the result analysis and conclusion are presented. Our work was submitted to the FIRE 2016 Microblog track [7].

[Figure 1. Processing pipeline for the task]

2 KEYWORD BASED APPROACH
Our approach for the Microblog track comprises three phases: keyword extraction, retrieval, and classification (see Figure 1). In the first phase, we extracted all relief resources (keywords) that were available or required. In the middle phase, using those keywords and the Terrier search engine (http://terrier.org), we retrieved the tweets that each include at least one keyword. In the last phase, the retrieved tweets were classified into the first and second topics using a Support Vector Machine (SVM).

2.1 Extracting keywords
In order to extract keywords, we separately used the following two methods combined with manual work. The keywords we extracted first are those provided in the topic descriptions, such as food, water, volunteer, money, medicine and transportation. The quantitative results of this phase are presented in Table 1.

Since tweets are usually written in an informal style, most NLP tools show poor performance on Twitter datasets. So we tried to exploit Twitter-specific NLP tools, namely [4] and [5].
Word embedding: Based on the relief resources mentioned before, we attempted to obtain more keywords from the given dataset. To do that, we first tagged all tweets with the GATE Twitter part-of-speech tagger [4]. After identifying all nouns, each noun was represented with a Word2Vec model [5] that was trained specifically on Twitter datasets to deal with noisy tweets. Then the 50 nearest neighbor nouns of each keyword extracted from the descriptions were found as candidates. From these candidates, we manually labeled as keywords the 86 nouns most likely to be relief resources during the earthquake. However, it was clear that there were more keywords we could not extract, such as Nepali words. A minimal sketch of this expansion step is shown below.
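The expansion can be sketched with gensim as follows; the model file name and the seed list are illustrative assumptions, not the exact resources used in our experiments.

```python
# Illustrative sketch: expand seed keywords with Word2Vec nearest neighbors.
# Assumes a Twitter-trained Word2Vec model in word2vec binary format; the
# file name and seed list below are placeholders, not the paper's exact
# resources.
from gensim.models import KeyedVectors

MODEL_PATH = "word2vec_twitter_model.bin"  # assumed model location
seeds = ["food", "water", "volunteer", "money", "medicine", "transportation"]

vectors = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

candidates = set()
for seed in seeds:
    if seed in vectors:
        # Take the 50 nearest neighbors of each seed, as described above.
        for word, _similarity in vectors.most_similar(seed, topn=50):
            candidates.add(word)

# The candidate set is then reviewed by hand (86 nouns were kept).
print(len(candidates), sorted(candidates)[:10])
```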
Chunking and Wordnet: Chunking, one of the basic techniques for information extraction, is also used to identify keywords in our approach. We defined some chunk grammar rules over the POS tags assigned in the previous step. Next, the chunked nouns were filtered by Wordnet [6] and by specific verbs such as distribute, give, provide, support and hand. Then we manually enriched the keyword list from the filtered nouns, as illustrated in the sketch below.
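The following sketch shows this step with NLTK; the chunk rule is an illustrative assumption rather than our exact grammar, and the WordNet and verb filters are simplified.

```python
# Illustrative sketch of the chunking step with NLTK. The grammar is an
# assumed example rule, not the exact one used in our experiments.
# Requires the WordNet corpus: nltk.download('wordnet')
import nltk
from nltk.corpus import wordnet

# Example rule: an optional determiner, any adjectives, then one or more nouns.
parser = nltk.RegexpParser("CHUNK: {<DT>?<JJ>*<NN.*>+}")

RELIEF_VERBS = {"distribute", "give", "provide", "support", "hand"}

def candidate_nouns(tagged_tweet):
    """Return chunked nouns from a POS-tagged tweet mentioning a relief verb."""
    if not any(tok.lower() in RELIEF_VERBS for tok, _tag in tagged_tweet):
        return []
    nouns = []
    for subtree in parser.parse(tagged_tweet).subtrees():
        if subtree.label() == "CHUNK":
            nouns += [tok for tok, tag in subtree.leaves()
                      if tag.startswith("NN")]
    # Keep only nouns known to WordNet, as a rough quality filter.
    return [n for n in nouns if wordnet.synsets(n, pos=wordnet.NOUN)]

tagged = [("volunteers", "NNS"), ("distribute", "VBP"),
          ("drinking", "JJ"), ("water", "NN")]
print(candidate_nouns(tagged))  # ['volunteers', 'water']
```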
        Table 1. Quantitative results of the keyword extraction phase

   Extracted nouns using POS                        12236
   Extracted keywords from the descriptions            16
   Manually extracted keywords using Word2Vec          86
   Number of verbs used with Chunking                  18
   Manually extracted keywords using Chunking          38
   Total number of keywords                           124

2.2 Retrieval
Once we had the set of keywords extracted in the previous phase, we used them on Terrier to retrieve all tweets that include at least one keyword (around 8620 tweets). There are few open search engines; we chose Terrier taking some of its advantages into consideration. As the scoring model, we employed BM25, which is based on the probabilistic retrieval framework. The ranks and scores are later used to compute the relevance of a tweet to a topic.
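For reference, BM25 scores each tweet D against a topic's keyword set Q as below; the parameter values shown are the usual defaults, given here as an assumption since the exact Terrier configuration is not reported:

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathit{avgdl}}\right)},
  \qquad k_1 = 1.2,\quad b = 0.75
```

where f(q, D) is the frequency of keyword q in tweet D, |D| is the tweet length in words, avgdl is the average tweet length in the collection, and IDF(q) down-weights keywords that occur in many tweets.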
2.3 Classifying into topics
Most of the tweets retrieved in the previous phase can be strongly related to the first two topics, while some of them cannot. For instance, even though the following two tweets both include water (a keyword), the former is related to the first topic, what resources were available, whereas the latter is not related to any topic.

   Anyone in need of drinking water contact me. Have some can donate #earthquake #Nepal #bhaktapur

   #ShameOnYou #nepalgov Rs 20 water cost Rs 40 #earthquakenepal #earthquake #Nepal #fuckoff don't #donate #unknown #website

Therefore, we classified the tweets into three classes: available, required and other. To do that, we first annotated 1000 tweets manually. In the preprocessing for classification, all URLs, user tags and some symbols were removed. Then we employed three classifiers with basic features such as unigrams, bag-of-words and some Twitter-specific features using the WEKA open source machine learning software (http://www.cs.waikato.ac.nz/ml/weka/). The best result was achieved by SVM (see Table 2).

In terms of the third and fourth topics, "What medical resources were available" and "What medical resources were required", we retrieved the relevant tweets from the tweets of the first and second topics, respectively, using medical relief resources.

      Table 2. Accuracy results of cross-validation on the training data

            Method          Accuracy
            SVM                 81.5
            MaxEnt              78.9
            Naive Bayes         77.2
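We ran the classifiers in WEKA; the sketch below reproduces the same pipeline shape (URL and user-tag removal, unigram bag-of-words features, linear SVM) in Python with scikit-learn as an illustrative stand-in, with toy data in place of the 1000 annotated tweets.

```python
# Illustrative stand-in for the WEKA pipeline described above: strip URLs and
# user tags, build unigram bag-of-words features, and train a linear SVM to
# label tweets as available / required / other.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def preprocess(tweet):
    """Remove URLs, @user mentions and leftover symbols."""
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    return re.sub(r"[^\w#\s]", " ", tweet).lower()

# Toy examples standing in for the manually annotated training data.
texts = [
    "Anyone in need of drinking water contact me. Have some can donate",
    "Urgent need of tents and blankets near #bhaktapur",
    "#ShameOnYou #nepalgov Rs 20 water cost Rs 40",
]
labels = ["available", "required", "other"]

classifier = make_pipeline(
    CountVectorizer(preprocessor=preprocess),  # unigram bag-of-words
    LinearSVC(),
)
classifier.fit(texts, labels)
print(classifier.predict(["water bottles available near the school"]))
```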
3 RESULT
It is impossible to compare our results to those of the other participants because we submitted attempts for only three of the topics to the organizers. However, the results estimated by the organizers were reasonable, which encouraged us to complete our work. The result is presented in Table 3.

            Table 3. Results estimated by the organizers

   Run_ID     Precision @ 20   Recall @ 1000   MAP @ 1000   Overall MAP
   Ganji_1        0.8500           0.4988         0.2204       0.2420

4 CONCLUSION
In this paper, we have presented our keyword based approach for the first four topics of the FIRE 2016 Microblog track. Our system is semi-automatic, as it includes manual work in the keyword extraction phase. Moreover, the phases are not integrated with each other.

Next, we plan to make our system fully automatic and to use more advanced methods.

5 REFERENCES
[1] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, P. Meier. Extracting Information Nuggets from Disaster Related Messages in Social Media. In: Proceedings of the 10th International ISCRAM Conference, 2013.
[2] A. Ritter, Mausam, and O. Etzioni. Open Domain Event Extraction from Twitter. In: KDD '12, 2012.
[3] J. Piskorski, R. Yangarber. Information Extraction: Past, Present and Future. In: Multi-source, Multilingual Information Extraction and Summarisation, 2013.
[4] L. Derczynski, A. Ritter, S. Clarke, and K. Bontcheva. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL, 2013.
[5] F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle. Multimedia Lab @ ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts Using Distributed Word Representations. In: Workshop on Noisy User-generated Text, ACL, 2015.
[6] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41, 1995.
[7] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog Track: Information Extraction from Microblogs Posted during Disasters. In: Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.