   Semi-automatic keyword based approach for FIRE 2016
                     Microblog Track
    Ganchimeg Lkhagvasuren                                Teresa Gonçalves                                 José Saias
             Évora University                                Évora University                            Évora University
ganchimeg@seas.num.edu.mn                                   tcg@uevora.pt                             jsaias@uevora.pt

ABSTRACT
This paper describes our semi-automatic keyword based approach for the first four topics of the Information Extraction from Microblogs Posted during Disasters task at the Forum for Information Retrieval Evaluation (FIRE) 2016. The approach consists of three phases: keyword extraction, retrieval, and classification.

CCS Concepts
• Computing methodologies → Support vector machines • Information systems → Information extraction.

Keywords
Supervised classification; Information extraction; Terrier; Twitter.
1 INTRODUCTION
It is undeniable that microblogging sites have become key resources of significant information during disaster events [1]. One of these microblogging sites, Twitter, is a social networking website that enables users to post 140-character messages named "tweets" every day. A huge number of tweets is posted, including both informative and non-informative messages, which creates opportunities for information extraction [3].
However, dealing with tweets and identifying specific keywords is challenging work due to the nature of Twitter. Small, noisy and fragmented tweets have very simple discourse and pragmatic structure, issues which still challenge state-of-the-art NLP systems [2].

Task description: The aim is to retrieve tweets relevant to each of the provided topics with high precision as well as high recall. The topic titles are provided in TREC format as follows:

     1.   What resources were available
     2.   What resources were required
     3.   What medical resources were available
     4.   What medical resources were required
     5.   What were the requirements or availability of resources at specific locations
     6.   What were the activities of various NGOs or government organizations
     7.   What infrastructure damage or restoration were reported

Dataset: Approximately 50,000 tweets posted during the Nepal earthquake disaster were given in JSON format. A main feature of the task is that a gold standard dataset was not provided.

In terms of our approach, we propose to address the first four topics using keyword extraction with manual work and classification methods.

This paper is organized as follows. First, the components of the approach are described separately. Then, the result analysis and conclusion are presented. Our work was submitted to the FIRE 2016 Microblog track [7].

[Figure 1. Processing pipeline for the task]

2 KEYWORD BASED APPROACH
Our approach for the Microblog track comprises three phases: keyword extraction, retrieval, and classification (see Figure 1). In the first phase, we extracted all relief resources (keywords) that were available or required. In the middle phase, using those keywords and the Terrier search engine (http://terrier.org), we retrieved the tweets that each include at least one keyword. In the last phase, the retrieved tweets were classified into the first and second topics using a Support Vector Machine (SVM).

2.1 Extracting keywords
In order to extract keywords, we separately used the following two methods combined with manual work. The keywords we extracted first are those provided in the topic descriptions, such as food, water, volunteer, money, medicine and transportation. The quantitative results of this phase are presented in Table 1.

Since tweets are usually written in an informal style, most NLP tools show poor performance on Twitter datasets. So we tried to exploit Twitter-specific NLP tools, namely [4] and [5].
Word embedding: Based on the relief resources mentioned before, we attempted to obtain more keywords from the given dataset. To do that, we first tagged all tweets with the GATE Twitter part-of-speech tagger [4]. After identifying all nouns, each noun was represented with a Word2Vec model [5] that was trained specifically on Twitter datasets to deal with noisy tweets. Then the 50 nearest neighbor nouns of each keyword extracted from the descriptions were found as candidates. From these candidates, we manually labeled as keywords the 86 nouns most likely to be relief resources during the earthquake. However, it was clear that there were more keywords we could not extract, such as Nepali words. A minimal sketch of this expansion step is shown below.
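The expansion can be sketched with gensim as follows; the model file name and the seed list are illustrative assumptions, not the exact resources used in our experiments.

```python
# Illustrative sketch: expand seed keywords with Word2Vec nearest neighbors.
# Assumes a Twitter-trained Word2Vec model in word2vec binary format; the
# file name and seed list below are placeholders, not the paper's exact
# resources.
from gensim.models import KeyedVectors

MODEL_PATH = "word2vec_twitter_model.bin"  # assumed model location
seeds = ["food", "water", "volunteer", "money", "medicine", "transportation"]

vectors = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

candidates = set()
for seed in seeds:
    if seed in vectors:
        # Take the 50 nearest neighbors of each seed, as described above.
        for word, _similarity in vectors.most_similar(seed, topn=50):
            candidates.add(word)

# The candidate set is then reviewed by hand (86 nouns were kept).
print(len(candidates), sorted(candidates)[:10])
```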
Chunking and Wordnet: Chunking, one of the basic techniques for information extraction, is also used to identify keywords in our approach. We defined some chunk grammar rules over the POS tags assigned in the previous step. Next, the chunked nouns were filtered by Wordnet [6] and by specific verbs such as distribute, give, provide, support and hand. Then we manually enriched the keyword list from the filtered nouns, as illustrated in the sketch below.
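The following sketch shows this step with NLTK; the chunk rule is an illustrative assumption rather than our exact grammar, and the WordNet and verb filters are simplified.

```python
# Illustrative sketch of the chunking step with NLTK. The grammar is an
# assumed example rule, not the exact one used in our experiments.
# Requires the WordNet corpus: nltk.download('wordnet')
import nltk
from nltk.corpus import wordnet

# Example rule: an optional determiner, any adjectives, then one or more nouns.
parser = nltk.RegexpParser("CHUNK: {<DT>?<JJ>*<NN.*>+}")

RELIEF_VERBS = {"distribute", "give", "provide", "support", "hand"}

def candidate_nouns(tagged_tweet):
    """Return chunked nouns from a POS-tagged tweet mentioning a relief verb."""
    if not any(tok.lower() in RELIEF_VERBS for tok, _tag in tagged_tweet):
        return []
    nouns = []
    for subtree in parser.parse(tagged_tweet).subtrees():
        if subtree.label() == "CHUNK":
            nouns += [tok for tok, tag in subtree.leaves()
                      if tag.startswith("NN")]
    # Keep only nouns known to WordNet, as a rough quality filter.
    return [n for n in nouns if wordnet.synsets(n, pos=wordnet.NOUN)]

tagged = [("volunteers", "NNS"), ("distribute", "VBP"),
          ("drinking", "JJ"), ("water", "NN")]
print(candidate_nouns(tagged))  # ['volunteers', 'water']
```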
        Table 1. Quantitative results of the keyword extraction phase

   Extracted nouns using POS                        12236
   Extracted keywords from the descriptions            16
   Manually extracted keywords using Word2Vec          86
   Number of verbs used with Chunking                  18
   Manually extracted keywords using Chunking          38
   Total number of keywords                           124

2.2 Retrieval
Once we had the set of keywords extracted in the previous phase, we used them on Terrier to retrieve all tweets that include at least one keyword (around 8620 tweets). There are few open search engines; we chose Terrier taking some of its advantages into consideration. As the scoring model, we employed BM25, which is based on the probabilistic retrieval framework. The ranks and scores are later used to compute the relevance of a tweet to a topic.
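For reference, BM25 scores each tweet D against a topic's keyword set Q as below; the parameter values shown are the usual defaults, given here as an assumption since the exact Terrier configuration is not reported:

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathit{avgdl}}\right)},
  \qquad k_1 = 1.2,\quad b = 0.75
```

where f(q, D) is the frequency of keyword q in tweet D, |D| is the tweet length in words, avgdl is the average tweet length in the collection, and IDF(q) down-weights keywords that occur in many tweets.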
2.3 Classifying into topics
Most of the tweets retrieved in the previous phase can be strongly related to the first two topics, while some of them cannot. For instance, even though the following two tweets both include water (a keyword), the former is related to the first topic, what resources were available, whereas the latter is not related to any topic.

   Anyone in need of drinking water contact me. Have some can donate #earthquake #Nepal #bhaktapur

   #ShameOnYou #nepalgov Rs 20 water cost Rs 40 #earthquakenepal #earthquake #Nepal #fuckoff don't #donate #unknown #website

Therefore, we classified the tweets into three classes: available, required and other. To do that, we first annotated 1000 tweets manually. In the preprocessing for classification, all URLs, user tags and some symbols were removed. Then we employed three classifiers with basic features such as unigrams, bag-of-words and some Twitter-specific features using the WEKA open source machine learning software (http://www.cs.waikato.ac.nz/ml/weka/). The best result was achieved by SVM (see Table 2).

In terms of the third and fourth topics, "What medical resources were available" and "What medical resources were required", we retrieved the relevant tweets from the tweets of the first and second topics, respectively, using medical relief resources.

      Table 2. Accuracy results of cross-validation on the training data

            Method          Accuracy
            SVM                 81.5
            MaxEnt              78.9
            Naive Bayes         77.2
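We ran the classifiers in WEKA; the sketch below reproduces the same pipeline shape (URL and user-tag removal, unigram bag-of-words features, linear SVM) in Python with scikit-learn as an illustrative stand-in, with toy data in place of the 1000 annotated tweets.

```python
# Illustrative stand-in for the WEKA pipeline described above: strip URLs and
# user tags, build unigram bag-of-words features, and train a linear SVM to
# label tweets as available / required / other.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def preprocess(tweet):
    """Remove URLs, @user mentions and leftover symbols."""
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    return re.sub(r"[^\w#\s]", " ", tweet).lower()

# Toy examples standing in for the manually annotated training data.
texts = [
    "Anyone in need of drinking water contact me. Have some can donate",
    "Urgent need of tents and blankets near #bhaktapur",
    "#ShameOnYou #nepalgov Rs 20 water cost Rs 40",
]
labels = ["available", "required", "other"]

classifier = make_pipeline(
    CountVectorizer(preprocessor=preprocess),  # unigram bag-of-words
    LinearSVC(),
)
classifier.fit(texts, labels)
print(classifier.predict(["water bottles available near the school"]))
```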
3 RESULT
It is impossible to compare our results to those of the other participants because we submitted attempts for only three of the topics to the organizers. However, the results estimated by the organizers were reasonable, which encouraged us to complete our work. The result is presented in Table 3.

            Table 3. Results estimated by the organizers

   Run_ID     Precision @ 20   Recall @ 1000   MAP @ 1000   Overall MAP
   Ganji_1        0.8500           0.4988         0.2204       0.2420

4 CONCLUSION
In this paper, we have presented our keyword based approach for the first four topics of the FIRE 2016 Microblog track. Our system is semi-automatic, as it includes manual work in the keyword extraction phase. Moreover, the phases are not integrated with each other.

Next, we plan to make our system fully automatic and to use more advanced methods.

5 REFERENCES
[1] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, P. Meier. Extracting Information Nuggets from Disaster Related Messages in Social Media. In: Proceedings of the 10th International ISCRAM Conference, 2013.
[2] A. Ritter, Mausam, and O. Etzioni. Open Domain Event Extraction from Twitter. In: KDD '12, 2012.
[3] J. Piskorski, R. Yangarber. Information Extraction: Past, Present and Future. In: Multi-source, Multilingual Information Extraction and Summarisation, 2013.
[4] L. Derczynski, A. Ritter, S. Clarke, and K. Bontcheva. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL, 2013.
[5] F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle. Multimedia Lab @ ACL W-NUT NER Shared Task: Named Entity Recognition for Twitter Microposts Using Distributed Word Representations. In: Workshop on Noisy User-generated Text, ACL, 2015.
[6] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41, 1995.
[7] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog Track: Information Extraction from Microblogs Posted during Disasters. In: Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.