Correlation Distance based Information Extraction System at FIRE 2016 Microblog Track

Saptarashmi Bandyopadhyay
Indian Institute of Engineering Science and Technology, Shibpur
Howrah 711103, India
saptarashmicse@gmail.com

ABSTRACT
The FIRE 2016 Microblog track provided a set of tweets posted during the Nepal earthquake in April 2015, and a set of seven topics. The challenge was to extract all tweets relevant to each topic. In this method, separate word bags are constructed for each topic describing a generic information need during a disaster situation, using topic seed words, stemmers, a dictionary and other NLP tools. The word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed, and this correlation distance measure is used to compute the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. Organization names and location names are identified in the crawled tweet texts. The presence of geo locations in the crawled tweets is also identified by the tweet parser. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the word bag vectors in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system depends on accounting for the presence of scrambled noisy words in the tweet texts.

CCS Concepts
• Information systems ➝ Information Retrieval ➝ Information Retrieval Query Processing

Keywords
FIRE 2016; Microblog Track; Twitter Information Extraction; Vector Model; Correlation Distance.

1. INTRODUCTION
User-generated content in microblogging sites like Twitter is an important source of real-time information on various events, including disaster events like floods, earthquakes and terrorist attacks. The aim of the FIRE 2016 Microblog track [14] is to develop Information Retrieval methodologies for extracting important information from microblogs posted during disasters.

A total of 49,774 tweets that were posted during the Nepal earthquake in April 2015 have been provided as the data for the task, along with a set of 7 topics in TREC format. Each topic contains an identifier number, a title, a topic description and a more detailed narrative which describes the types of tweets that would be considered relevant to the topic. Each of the seven topics identifies a broad information need during a disaster: what resources are available (FMT1), what resources are required (FMT2), what medical resources are available (FMT3), what medical resources are required (FMT4), what were the requirements / availability of resources at specified locations (FMT5), what were the activities of various NGOs / Government Organizations (FMT6), and what infrastructure damage / restoration was being reported (FMT7). The corresponding topic ids are given within brackets.

The task was to develop methodologies for extracting tweets that are relevant to each topic with high precision as well as high recall. The main challenges in this ad-hoc search task are dealing with the noisy nature of the tweets and identifying specific keywords relevant to each topic. Tweet texts contain a maximum of 140 characters and are often informally written using abbreviations, colloquial terms, etc. An individual tweet text might not contain most of the specific keywords even though the tweet is relevant to a topic.

In the present system, the tweet parser parses the tweets and extracts the tweet texts. Organization names and location names are identified in the crawled tweet texts. The presence of geo locations in the crawled tweets is also identified. Separate word bags are constructed for each topic, and the topic word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed and used to derive the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively; for the system without the scrambled noisy words in the word bag vectors, the corresponding scores are 0.4000, 0.3401, 0.0860 and 0.1119. The results indicate that the precision of the information extraction system depends on accounting for the presence of scrambled noisy words in the tweet texts. In the present work no attempt has been made to identify duplicate tweets; this policy is in line with what is mentioned in the problem statement about weeding out duplicate tweets.

2. PREPROCESSING
The 49,774 tweets have been made available as json files as part of the FIRE 2016 Microblog track. The tweet parser parses the tweets and extracts the tweet text, appending a newline after each tweet text. During preprocessing, the string stored in the text attribute of a tweet is parsed. It has been observed that some of the tweets contain non-ASCII characters in the tweet text which are not needed for the vector correlation distance computation; such characters are removed using a Python script. The newline character present in the parsed tweet text is also removed.
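The following is a minimal sketch of this preprocessing step. It assumes one JSON object per line in the track's data files and uses the standard Twitter 'text' and 'id_str' attributes; the actual file layout provided by the track may differ.

import json

def extract_tweet_texts(json_path):
    # Parse the crawled tweets and return (tweet id, cleaned text) pairs.
    pairs = []
    with open(json_path) as f:
        for line in f:
            tweet = json.loads(line)
            text = tweet['text']
            # Drop non-ASCII characters, which are not needed for the
            # vector correlation distance computation.
            text = text.encode('ascii', 'ignore').decode('ascii')
            # Remove newline characters from the parsed tweet text.
            text = text.replace('\n', ' ').strip()
            pairs.append((tweet['id_str'], text))
    return pairs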
The StanfordNER (Named Entity Recognizer) Tagger [1], available through the NLTK 3.0 toolkit [3], has been used on the parsed and preprocessed (after removal of non-ASCII characters and newline characters) tweet texts to identify the location and organization names in the tweet texts. A big benefit of the Stanford NER tagger is that it provides a few different models for pulling out named entities. Any of the following can be used:

• 3 class model for recognizing locations, persons, and organizations
• 4 class model for recognizing locations, persons, organizations, and miscellaneous entities
• 7 class model for recognizing locations, persons, organizations, times, money, percents, and dates

The NLTK toolkit provides a wrapper to the StanfordNERTagger class so that it can be used in Python. The parameters passed to the StanfordNERTagger class include:

1. Classification model path
2. Stanford tagger jar file path
3. Training data encoding

In the present work, the 3 class model for English has been used as the classification model, the jar file from stanford-ner-2015-04-20/stanford-ner.zip has been used as the Stanford tagger jar file, and the default ASCII encoding has been used as the training data encoding. The output tags are obtained in UTF-8 encoding for the LOCATION, ORGANIZATION and PERSON named entities.
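A minimal sketch of this tagging step follows. The model and jar paths are illustrative placeholders for the unpacked stanford-ner-2015-04-20 distribution, and the import path of the wrapper class varies across NLTK versions.

from nltk.tag import StanfordNERTagger

# Illustrative paths into the unpacked stanford-ner-2015-04-20 distribution.
MODEL = 'stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz'
JAR = 'stanford-ner-2015-04-20/stanford-ner.jar'

# 3 class model (LOCATION, PERSON, ORGANIZATION) with ASCII training data
# encoding, as described above.
tagger = StanfordNERTagger(MODEL, JAR, encoding='ascii')

def named_entities(tweet_text):
    # Tag the tokens of one tweet text and collect the location and
    # organization names needed for topics FMT5 and FMT6.
    tagged = tagger.tag(tweet_text.split())
    locations = [word for word, tag in tagged if tag == 'LOCATION']
    organizations = [word for word, tag in tagged if tag == 'ORGANIZATION']
    return locations, organizations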
It may be observed at this point that identified location names may not belong to Nepal, the country where the disaster took place, while the topic with id FMT5 requires that the availability or requirement of resources be referred to specific locations in the place of disaster. Organization names must be present in tweet texts that are examined for activities of NGOs / Government Organizations (FMT6), and it is observed that the identified organization names may not correspond to NGOs / Government Organizations working in Nepal. This situation will have an effect on the precision and recall of the information extraction system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs or Government Organizations in Nepal.

The crawled tweets in the json files are also checked for the presence of geo locations. Geo locations present in a tweet identify the location from which the tweet has been submitted; they are present only when the feature is turned on before sending the tweet. It is observed that geo locations present in a tweet may not belong to Nepal, the place of disaster. It has also been observed that location named entities are not always present in the tweet texts. The presence of location named entities or geo locations is considered for the relevance of tweets with respect to topic id FMT5, and organization named entities are considered for the relevance of tweets with respect to topic id FMT6.

The following bags of words are initially created considering the information needs of the seven topics with topic ids FMT1 through FMT7: available, resources, required, medical, working, relief, infrastructure, damage and restoration. The 'working' and 'relief' bags have been considered to take care of topic id FMT6. The word bags have been identified by analyzing the topic descriptions. These word bags are created in the following manner. First, by looking into the narratives of the FMTs, seed words have been identified; for example, the seed words for the 'available' word bag as identified from the FMT are 'available' and 'availability'. Then PyDictionary 1.5.2 [2] has been used to find the synonyms of the seed words, and these synonyms have been included in the word bag. PyDictionary uses the WordNet corpus, though not directly. Next, stemming has been carried out using the Porter Stemmer module available in the NLTK 3.0 toolkit [3]; other stemmers, such as the Lancaster stemmer, could also have been used. Next, the NodeBox toolkit [4] has been used for generating the surface-level inflected forms of the words in the word bags. The library bundles WordNet (using Oliver Steele's PyWordNet [5]), NLTK [3], Damian Conway's pluralisation rules [6], Bermi Ferrer's singularization rules [7], Jason Wiener's Brill tagger [8], several algorithms adopted from Michael Granger's Ruby Linguistics module [9], Charles K. Ogden's list of basic English words [10], and Peter Norvig's spelling corrector [11]. The words in the word bag have been pluralized and the past forms of the verb words have been generated. Finally, by looking into the topic narratives, the words in each word bag that fit the sense of the particular topic have been retained.
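A sketch of this expansion pipeline is given below, assuming PyDictionary's synonym() call, the NLTK Porter stemmer, and the NodeBox Linguistics 'en' module (a Python 2 era library); the final manual filtering against the topic narrative is not shown.

from PyDictionary import PyDictionary
from nltk.stem.porter import PorterStemmer
import en  # NodeBox Linguistics English module [4]

dictionary = PyDictionary()
stemmer = PorterStemmer()

def build_word_bag(seed_words):
    # Expand topic seed words into a word bag: synonyms, Porter stems,
    # plural forms and past verb forms.
    bag = set(seed_words)
    for seed in seed_words:
        synonyms = dictionary.synonym(seed) or []
        bag.update(s.lower() for s in synonyms)
    for word in list(bag):
        bag.add(stemmer.stem(word))       # Porter stem
        bag.add(en.noun.plural(word))     # pluralised form
        try:
            bag.add(en.verb.past(word))   # past form, where known
        except Exception:
            pass                          # not a known verb
    return bag

# For example, the 'available' bag starts from its FMT seed words.
available_bag = build_word_bag(['available', 'availability'])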
Scrambled words like 'avlbl' for the 'available' word bag have also been added by randomly selecting tweets and inspecting them against the topic narrative; such noisy words are often present in tweets. A separate set of word bags has also been developed for each topic without the inclusion of these scrambled words.

Separate word bags have thus been constructed for each of the topics, as shown in Table 1.

Table 1. Topic word bags

Topic Id | Topic Description | Topic Word Bag
FMT1 | availability of resources | available + resources
FMT2 | requirement of resources | required + resources
FMT3 | availability of medical resources | available + medical
FMT4 | requirement of medical resources | required + medical
FMT5 | availability and requirement of general and medical resources at specified locations | available + required + resources + medical + (occurrence of location Named Entities or geo locations in the tweet text is a must)
FMT6 | activities of various NGOs / Government Organizations | working + relief + (occurrence of organization Named Entities in the tweet text is a must)
FMT7 | infrastructure damage and restoration | infrastructure + damage + restoration

3. INFORMATION EXTRACTION SYSTEM
The basic task in the extraction system is to look for the co-occurrence of words from each topic FMT word bag in each tweet text. The objective is to assign a relevance score to each tweet text corresponding to each topic FMT. This is accomplished by converting each topic FMT word bag and each tweet text into separate vectors; the distance between the two vectors then yields the relevance score of the tweet text for that topic FMT.

Each word bag for each topic FMT is converted to a vector of 200 dimensions by using the Word2Vec package [13] with w2v.twitter.200d.txt as the model file. Each tweet text is also converted to a vector of 200 dimensions in a similar manner. Word2Vec [13] is a two-layer neural net that processes text: its input is a text corpus and its output is a set of feature vectors for the words in that corpus. While Word2Vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.

The correlation distance between each tweet vector and each topic FMT word bag vector is computed, and these values are stored in an array for each topic FMT. The distance correlation of two random variables is obtained by dividing their distance covariance by the product of their distance standard deviations. The correlation distance dCor(u,v) between two one-dimensional vectors u and v is defined as

dCor(u,v) = dCov(u,v) / SQRT(dVar(u) * dVar(v))    (1)

where dCov(u,v) is the distance covariance of the two vectors u and v, and dVar(u) and dVar(v) are the distance standard deviations of the vectors u and v respectively. It may be noted that the distance values are computed as (1 - correlation distance) by the scipy package's spatial distance module [12].
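The sketch below illustrates these two steps. The paper does not state how a set of words is mapped to a single 200-dimensional vector; averaging the individual word vectors is assumed here. The model file is assumed to use the common plain-text format of one word followed by its 200 values per line.

import numpy as np
from scipy.spatial.distance import correlation

def load_vectors(path='w2v.twitter.200d.txt'):
    # Load the pretrained 200-dimensional Twitter word vectors.
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def to_vector(words, vectors, dim=200):
    # Represent a word bag or a tokenized tweet text as the mean of its
    # word vectors (an assumption; the paper does not specify the mapping).
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# scipy's correlation() returns 1 - correlation of the two vectors,
# which is the quantity referred to in equation (5).
def scipy_correlation_distance(tweet_vec, topic_vec):
    return correlation(tweet_vec, topic_vec)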
The relevance scores of each tweet text for each topic FMT are calculated in the following way. For topics FMT1-FMT4 and FMT7, the relevance score of each tweet text corresponding to each topic FMT is computed as

relevance score = 1 - correlation distance as computed by the spatial distance module    (3)

which can be simplified as

relevance score = actual correlation distance    (4)

since the correlation distance as computed by the spatial distance module = 1 - actual correlation distance    (5)

Relevance scores for each tweet as computed by equations 3-5 above will have a lower value if the tweet is relevant to the corresponding topic FMT, and a higher value if the tweet is less relevant to it. Hence, the relevance scores of each tweet text for each topic FMT are subtracted from 1 and stored as the final relevance score for each tweet text corresponding to that topic FMT. Thus,

final relevance score = 1 - relevance score    (6)

This ensures that relevant tweets corresponding to each topic FMT will have a high relevance score.

For topic FMT5, if no location names have been identified in the tweet text and no geo locations have been identified in the tweet, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values of 0.5 and 0.05 have been chosen heuristically. It may be noted that tweets with location names or geo locations are considered relevant to topic FMT5 provided the other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no location names in the tweet text and no geo locations in the tweet)
else
relevance score = actual correlation distance + 0.05 (location names in the tweet text or geo locations in the tweet)    (7)

The final relevance score for each tweet corresponding to topic FMT5 is then computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no location names in the tweet text and no geo locations become small with respect to other tweets relevant to topic FMT5, while the relevance scores for tweets with location names or geo locations become higher with respect to other FMT5 tweets. It may be noted that tweet texts with no location names or geo locations have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.

For topic FMT6, if no organization names have been identified in the tweet text, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values of 0.5 and 0.05 have again been chosen heuristically. It may be noted that tweets with specific NGO or Government Organization names are considered relevant to topic FMT6 provided the other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no organization names in the tweet text)
else
relevance score = actual correlation distance + 0.05 (organization names in the tweet text)    (8)

The final relevance score for each tweet corresponding to topic FMT6 is then computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no organization names in the tweet text become small with respect to other tweets relevant to topic FMT6, while the relevance scores for tweets with organization names become higher with respect to other FMT6 tweets. It may be noted that tweet texts with no organization names have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.
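A sketch consolidating equations (3)-(8) is given below, reusing scipy_correlation_distance() from the previous sketch; the flag parameters are hypothetical names for the named entity and geo location checks described above.

def final_relevance(tweet_vec, topic_vec, topic_id,
                    has_location=False, has_geo=False, has_org=False):
    # Equations (3)-(5): the relevance score is the actual correlation
    # distance, recovered from scipy's (1 - correlation) output.
    score = 1.0 - scipy_correlation_distance(tweet_vec, topic_vec)
    if topic_id == 'FMT5':
        # Equation (7): heuristic addition of 0.05 when location names or
        # geo locations are present, 0.5 otherwise.
        score += 0.05 if (has_location or has_geo) else 0.5
    elif topic_id == 'FMT6':
        # Equation (8): the same heuristic for organization names.
        score += 0.05 if has_org else 0.5
    # Equation (6): the final relevance score.
    return 1.0 - score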
Next, these relevance scores are sorted, and for each topic FMT the tweet id and relevance score pairs are obtained in descending order of final relevance score, so that highly relevant tweets are placed high in the list. The final result is submitted in TREC format.
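A sketch of this final step follows, assuming the standard TREC run file layout (topic id, 'Q0', document id, rank, score, run tag); the exact submission format is the one specified by the track.

def write_run(scores, out_path='run.txt', run_tag='correlation_distance'):
    # scores maps each topic id to a {tweet id: final relevance score} dict.
    with open(out_path, 'w') as out:
        for topic_id in sorted(scores):
            ranked = sorted(scores[topic_id].items(),
                            key=lambda item: item[1], reverse=True)
            for rank, (tweet_id, score) in enumerate(ranked, start=1):
                out.write('%s Q0 %s %d %.6f %s\n' %
                          (topic_id, tweet_id, rank, score, run_tag))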
4. SYSTEM EVALUATION RESULTS
The submitted system has been evaluated by the FIRE 2016 Microblog track organizers in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP. The evaluation scores for each topic have been averaged to generate the evaluation scores for the submitted systems. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported by the track organizers as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the topic FMT word bag vectors in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system improves if scrambled noisy words are included in the topic FMT word bag vectors.

5. CONCLUSION
The submitted information extraction system has only considered the correlation distance measure between the vector representations of the topic word bags and the tweet texts. It would be interesting to consider other vector distance measures such as cosine similarity or Euclidean distance. In the present system, additional heuristic scores of 0.5 and 0.05 have been used to identify and rank relevant tweets for topic ids FMT5 and FMT6; multiple experiments with different values of these heuristic scores should be carried out, and the values that generate the best results adopted. Identified location names, geo locations and organization names are not checked for their occurrence in the place of disaster, which affects the performance of the system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs or Government Organizations in Nepal, so that identified location or organization names in tweet texts could be checked against these lists. The use of general resource and medical resource ontologies, as well as infrastructure ontologies, in the preparation of the word bags would likely have produced more relevant results. Scrambled noisy words have been included in the topic word bags in an ad hoc manner; a better alternative would be to collect such scrambled noisy words from a large tweet corpus and to develop a methodology for identifying the scrambled noisy words to be included in a topic word bag.

6. REFERENCES
[1] nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
[2] https://pypi.python.org/pypi/PyDictionary/1.5.2
[3] https://pypi.python.org/pypi/nltk/3.0.0
[4] https://www.nodebox.net/code/index.php/Linguistics
[5] https://pypi.python.org/pypi/pywordnet
[6] www.csse.monash.edu.au/~damian/papers/extabs/Plurals.htm
[7] https://github.com/bermi/Python-Inflector/blob/master/rules/english.py
[8] pydoc.net/Python/Pattern/1.5/pattern.en.parser/
[9] https://github.com/bruce/linguistics
[10] ogden.basic-english.org/
[11] norvig.com/spell-correct.html
[12] docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.spatial.distance.correlation.html
[13] https://deeplearning4j.org/word2vec
[14] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.