Correlation Distance based Information Extraction System
at FIRE 2016 Microblog Track
Saptarashmi Bandyopadhyay
Indian Institute of Engineering Science
and Technology, Shibpur
Howrah 711103, India
saptarashmicse@gmail.com
ABSTRACT
The FIRE 2016 Microblog track provided a set of tweets posted during the Nepal earthquake in April 2015, and a set of seven topics. The challenge was to extract all tweets relevant to each topic. In this method, separate word bags are constructed for each topic describing a generic information need during a disaster situation, using topic seed words, stemmers, a dictionary and other NLP tools. The topic word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed, and this measure is used to compute the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. Organization names and location names are identified in the crawled tweet texts, and the presence of geo locations in the crawled tweets is also identified by the tweet parser. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the word bag vectors, in terms of the same measures, have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system improves when the scrambled noisy words found in tweet texts are included in the topic word bags.

CCS Concepts
• Information systems ➝ Information Retrieval ➝ Information Retrieval Query Processing

Keywords
FIRE 2016; Microblog Track; Twitter Information Extraction; Vector Model; Correlation Distance.

1. INTRODUCTION
User-generated content in microblogging sites like Twitter is an important source of real-time information on various events, including disaster events like floods, earthquakes and terrorist attacks. The aim of the FIRE 2016 Microblog track [14] is to develop Information Retrieval methodologies for extracting important information from microblogs posted during disasters.

A total of 49,774 tweets that were posted during the Nepal earthquake in April 2015 have been provided as the data for the task, along with a set of 7 topics in TREC format. Each topic contains an identifier number, a title, a description and a more detailed narrative which describes the types of tweets that would be considered relevant to the topic. Each of the seven topics identifies a broad information need during a disaster: what resources are available (FMT1), what resources are required (FMT2), what medical resources are available (FMT3), what medical resources are required (FMT4), what were the requirements / availability of resources at specified locations (FMT5), what were the activities of various NGOs / Government Organizations (FMT6) and what infrastructure damage / restoration were being reported (FMT7). The corresponding topic ids are mentioned within brackets.

The task was to develop methodologies for extracting tweets that are relevant to each topic with high precision as well as high recall. The main challenges in this ad-hoc search task are dealing with the noisy nature of the tweets and identifying specific keywords relevant to each topic. Tweet texts contain a maximum of 140 characters and are often informally written using abbreviations, colloquial terms, etc. An individual tweet text might not contain most of the specific keywords even though the tweet is relevant to a topic.

In the present system, the tweet parser parses the tweets and extracts the tweet texts. Organization names and location names are identified in the crawled tweet texts, and the presence of geo locations in the crawled tweets is also identified. Separate word bags are constructed for each topic, and the topic word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed, and this measure is used to compute the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the word bag vectors, in terms of the same measures, have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system improves when the scrambled noisy words found in tweet texts are included in the topic word bags. In the present work no attempt has been made to identify duplicate tweets; this policy is in line with what is mentioned in the problem statement about weeding out duplicate tweets.

2. PREPROCESSING
The 49,774 tweets have been made available as json files as part of the FIRE 2016 Microblog track. The tweet parser parses the tweets and extracts each tweet as a tweet id and tweet text pair, with a new line included after the tweet text. During preprocessing, the string in the text attribute of each tweet is parsed. Non-ASCII characters present in the tweet text are removed using a Python script; it has been observed that some of the tweets contain non-ASCII characters in the tweet text which are not necessary for the vector correlation distance computation. The newline character present in the parsed tweet text is also removed.
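A minimal Python sketch of this preprocessing step is given below. The one-tweet-per-line json layout and the 'id' attribute name are assumptions for illustration; only the 'text' attribute is mentioned in the text above.

import json

def preprocess_tweet_text(raw_text):
    # Drop non-ASCII characters, which are not needed for the
    # vector correlation distance computation.
    ascii_text = raw_text.encode('ascii', 'ignore').decode('ascii')
    # Remove newline characters from the parsed tweet text.
    return ascii_text.replace('\n', ' ').replace('\r', ' ').strip()

def parse_tweets(json_path):
    # Assumed layout: one json tweet object per line.
    tweets = {}
    with open(json_path) as f:
        for line in f:
            tweet = json.loads(line)
            tweets[tweet['id']] = preprocess_tweet_text(tweet['text'])
    return tweets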
The Stanford NER (Named Entity Recognizer) Tagger [1] class available as part of the NLTK 3.0 toolkit [3] has been used on the parsed and preprocessed tweet texts (after removal of non-ASCII characters and newline characters) to identify the location and organization names in the tweet texts. A big benefit of the Stanford NER tagger is that it provides a few different models for pulling out named entities. Any of the following can be used:

• 3 class model for recognizing locations, persons, and organizations
• 4 class model for recognizing locations, persons, organizations, and miscellaneous entities
• 7 class model for recognizing locations, persons, organizations, times, money, percents, and dates

The NLTK toolkit provides a wrapper to the StanfordNERTagger class so that it can be used in Python. The parameters passed to the StanfordNERTagger class include:

1. Classification model path
2. Stanford tagger jar file path
3. Training data encoding (the default ASCII encoding has been used in the present work)

In the present work, the 3 class model for English has been used as the classification model, the Stanford-ner-2015-04-20/Stanford-ner.zip file has been used as the Stanford tagger jar file, and the default ASCII encoding has been used as the training data encoding. The output tags are obtained in UTF-8 encoding for the LOCATION, ORGANIZATION and PERSON named entities.
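A minimal sketch of this tagging step through the NLTK wrapper is shown below, assuming the stanford-ner-2015-04-20 distribution [1] has been unpacked locally; the exact paths are illustrative.

from nltk.tag import StanfordNERTagger

# 3 class English model and tagger jar from the stanford-ner-2015-04-20
# distribution [1]; paths are illustrative.
st = StanfordNERTagger(
    'stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner-2015-04-20/stanford-ner.jar')

tokens = 'Red Cross volunteers reach Kathmandu'.split()
tagged = st.tag(tokens)  # list of (token, tag) pairs

# Collect LOCATION and ORGANIZATION named entities from the tagged output.
locations = [w for w, t in tagged if t == 'LOCATION']
organizations = [w for w, t in tagged if t == 'ORGANIZATION']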
It may be observed at this point that identified location names may not belong to Nepal, the country of the disaster, while the topic with id FMT5 requires that the availability or requirement of resources refer to specific locations in the place of the disaster. Organization names must be present in tweet texts for the topic that looks for activities of NGOs / Government Organizations (FMT6), and it is observed that the identified organization names may not identify NGOs / Government Organizations working in Nepal. This situation will have an effect on the precision and recall of the information extraction system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs and Government Organizations in Nepal.

The crawled tweets in the json files are also checked for the presence of geo locations. Geo locations present in a tweet identify the location from which the tweet was submitted, and are present only when the feature was turned on before sending the tweet. It is observed that geo locations present in a tweet may not belong to Nepal, the place of the disaster. It has also been observed that location named entities are not always present in the tweet texts. The presence of location named entities or geo locations is considered for the relevance of tweets with respect to topic id FMT5, and organization named entities are considered for the relevance of tweets with respect to topic id FMT6.

The following bags of words are initially created considering the information needs of the seven topics with topic ids FMT1 through FMT7: available, resources, required, medical, working, relief, infrastructure, damage and restoration. The 'working' and 'relief' bags have been considered to take care of topic id FMT6. The word bags have been identified by analyzing the topic descriptions.

These word bags are created in the following manner. First, seed words have been identified by looking into the narratives of the FMTs; for example, the seed words for the 'available' word bag as identified from the FMT are 'available' and 'availability'. Then PyDictionary 1.5.2 [2] has been used to find the synonyms of the seed words, and these synonyms have been included in the word bag (PyDictionary uses the WordNet corpus, but not directly). Next, stemming has been carried out using the Porter Stemmer module available in the NLTK 3.0 toolkit [3]; other possible stemmers could have been the Lancaster stemmer, etc. Next, the NodeBox toolkit [4] has been used for generating the surface level inflected forms of the words in the word bags. The library bundles WordNet (using Oliver Steele's PyWordNet [5]), NLTK [3], Damian Conway's pluralisation rules [6], Bermi Ferrer's singularization rules [7], Jason Wiener's Brill tagger [8], several algorithms adopted from Michael Granger's Ruby Linguistics module [9], Charles K. Ogden's list of basic English words [10], and Peter Norvig's spelling corrector [11]. The words in the word bag have been pluralized and the past forms of the verb words have been generated. Finally, by looking into the topic narratives, appropriate words in each word bag have been identified which fit the sense of the particular topic. A sketch of this expansion pipeline is given below.
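The sketch expands the 'available' bag under the assumption that PyDictionary's synonym() call and the NodeBox Linguistics 'en' module behave as documented in [2] and [4]; error handling and the final manual filtering against the topic narrative are omitted.

from PyDictionary import PyDictionary
from nltk.stem.porter import PorterStemmer
import en  # NodeBox Linguistics module [4]

dictionary = PyDictionary()
stemmer = PorterStemmer()

# Seed words for the 'available' bag, taken from the topic narrative.
bag = set(['available', 'availability'])

# Add dictionary synonyms of the seed words.
for word in list(bag):
    bag.update(dictionary.synonym(word) or [])

# Add Porter stems of all words collected so far.
bag.update(stemmer.stem(w) for w in list(bag))

# Add surface level inflected forms: plurals and past verb forms.
for w in list(bag):
    bag.add(en.noun.plural(w))
    try:
        bag.add(en.verb.past(w))
    except KeyError:
        pass  # word is not in the NodeBox verb list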
Scrambled words like 'avlbl' for the 'available' word bag have also been added by randomly selecting tweets and looking into the narratives; such noisy words are often present in tweets. A separate set of word bags has also been developed for each topic without the inclusion of the scrambled words.

Now, separate word bags have been constructed for each of the topics as shown in Table 1.

Table 1. Topic Word Bags

Topic Id | Topic Description | Topic Word Bag
FMT1 | availability of resources | available + resources
FMT2 | requirement of resources | required + resources
FMT3 | availability of medical resources | available + medical
FMT4 | requirement of medical resources | required + medical
FMT5 | availability and requirement of general and medical resources at specified locations | available + required + resources + medical + (occurrence of location Named Entities or geo locations in the tweet text is a must)
FMT6 | activities of various NGOs / Government Organizations | working + relief + (occurrence of organization Named Entities in the tweet text is a must)
FMT7 | infrastructure damage and restoration | infrastructure + damage + restoration

3. INFORMATION EXTRACTION SYSTEM
The basic task in the extraction system is to look for the co-occurrence of words corresponding to each topic FMT (words in the topic FMT word bag) in each tweet text. The objective is to assign a relevance score to each tweet text corresponding to each topic FMT. This is accomplished by converting each topic FMT word bag and each tweet text into separate vectors; the distance between the two vectors then yields a relevance score for each tweet text corresponding to each topic FMT.

Each word bag for each topic FMT is converted to a vector of 200 dimensions by using the Word2Vec package [13] with w2v.twitter.200d.txt as the model file. Each tweet text is also converted to a vector of 200 dimensions in a similar manner. Word2Vec [13] is a two-layer neural net that processes text: its input is a text corpus and its output is a set of feature vectors for the words in that corpus. While Word2Vec [13] is not a deep neural network, it turns text into a numerical form that deep nets can understand.
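The text does not spell out how a set of words is reduced to a single 200-dimensional vector; a common choice, assumed in the sketch below, is to average the individual word embeddings read from the model file, whose plain "word v1 ... v200" line format is also an assumption.

import numpy as np

def load_vectors(path):
    # Read a plain text embedding file: one word followed by 200 floats per line.
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def to_vector(words, vectors, dim=200):
    # Average the embeddings of the known words; a zero vector if none are known.
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

vectors = load_vectors('w2v.twitter.200d.txt')
topic_vector = to_vector(['available', 'resources'], vectors)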
The correlation distance between each tweet vector and each topic FMT word bag vector is computed, and these values are stored in an array for each topic FMT. The distance correlation of two random variables is obtained by dividing their distance covariance by the product of their distance standard deviations. The correlation distance dCor(u,v) between two one-dimensional vectors u and v is defined as

dCor(u,v) = dCov(u,v) / SQRT(dVar(u) * dVar(v))    (1)

where dCov(u,v) is the distance covariance of the two vectors u and v, and dVar(u) and dVar(v) are the distance standard deviations of the u and v vectors respectively. It may be noted that the distance values are computed as (1 - correlation distance) in the scipy package spatial distance module [12].

The relevance scores of each tweet text for each topic FMT are calculated in the following way. For topics FMT1-FMT4 and FMT7, the relevance score of each tweet text corresponding to each topic FMT is computed as

relevance score = 1 - correlation distance as computed by the spatial distance module    (3)

which can be simplified as

relevance score = actual correlation distance    (4)

since

correlation distance as computed by the spatial distance module = 1 - actual correlation distance    (5)

Relevance scores for each tweet as computed by equations 3-5 above will have a lower value if the tweet is relevant to the corresponding topic FMT, and a higher value if the tweet is less relevant. Hence, the relevance scores of each tweet text for each topic FMT are subtracted from 1 and stored as the final relevance score for each tweet text corresponding to that topic FMT. Thus,

final relevance score = 1 - relevance score    (6)

This ensures that relevant tweets corresponding to each topic FMT will have a high relevance score.
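A minimal sketch of this scoring step for topics FMT1-FMT4 and FMT7 follows; note that the two subtractions from 1 cancel, so the final relevance score equals the value returned by scipy.

import numpy as np
from scipy.spatial.distance import correlation

def final_relevance_score(tweet_vector, topic_vector):
    d = correlation(tweet_vector, topic_vector)  # 1 - actual correlation distance, eq (5)
    relevance = 1 - d                            # equations (3)-(4)
    return 1 - relevance                         # equation (6); equal to d

u = np.array([0.1, 0.4, 0.3])
v = np.array([0.2, 0.5, 0.1])
print(final_relevance_score(u, v))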
For topic FMT5, if no location names have been identified in the tweet text and no geo locations have been identified in the tweet, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values 0.5 and 0.05 have been chosen heuristically. It may be noted that tweets with location names or geo locations are considered relevant to topic FMT5 provided other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no location names in the tweet text and no geo locations in the tweet)
else
relevance score = actual correlation distance + 0.05 (location names in the tweet text or geo locations in the tweet)    (7)

The final relevance score for each tweet corresponding to topic FMT5 is computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no location names in the tweet text and no geo locations become small with respect to other topic FMT5 relevant tweets, while the relevance scores for tweets with location names in the tweet text or geo locations become higher with respect to other FMT5 tweets. It may be noted that tweet texts with no location names or geo locations have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.

For topic FMT6, if no organization names have been identified in the tweet text, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values 0.5 and 0.05 have again been chosen heuristically. It may be noted that tweets with specific NGO or Government Organization names are considered relevant to topic FMT6 provided other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no organization names in the tweet text)
else
relevance score = actual correlation distance + 0.05 (organization names in the tweet text)    (8)

The final relevance score for each tweet corresponding to topic FMT6 is computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no organization names in the tweet text become small with respect to other topic FMT6 relevant tweets, while the relevance scores for tweets with organization names become higher with respect to other topic FMT6 tweets. It may be noted that tweet texts with no organization names have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.
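The following sketch illustrates the heuristic scoring of equations (7) and (8), with the named entity presence test abstracted into a boolean flag.

def fmt5_fmt6_relevance_score(actual_correlation_distance, has_required_entities):
    # has_required_entities: True if location names / geo locations (FMT5)
    # or organization names (FMT6) were found for this tweet.
    penalty = 0.05 if has_required_entities else 0.5   # heuristic values
    relevance = actual_correlation_distance + penalty  # equations (7)/(8)
    return 1 - relevance                               # final score, equation (6)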
Next, these relevance scores are sorted, and for each topic FMT we obtain the tweet id and relevance score pairs in descending order of final relevance score, so that highly relevant tweets are placed high in the list. The final result is submitted in TREC format as tweet id and relevance score pairs for each topic.
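A minimal sketch of this result generation step is given below; the exact column layout of the TREC run file is not spelled out in the text, so the format shown is an assumption.

def write_run_file(topic_scores, path):
    # topic_scores: {topic_id: {tweet_id: final_relevance_score}}
    with open(path, 'w') as out:
        for topic_id in sorted(topic_scores):
            # Sort tweet ids in descending order of final relevance score.
            ranked = sorted(topic_scores[topic_id].items(),
                            key=lambda pair: pair[1], reverse=True)
            for rank, (tweet_id, score) in enumerate(ranked, start=1):
                out.write('%s %s %d %.4f\n' % (topic_id, tweet_id, rank, score))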
4. SYSTEM EVALUATION RESULTS
The submitted system has been evaluated by the FIRE 2016 Microblog Track organizers in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP. The evaluation scores for each topic have been averaged to generate the evaluation scores for the submitted systems. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported by the track organizers as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the topic FMT word bag vectors, in terms of the same measures, have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system improves if scrambled noisy words are included in the topic FMT word bag vectors.

5. CONCLUSION
The submitted information extraction system has only considered the correlation distance measure between the vector representations of the topic word bags and the tweet texts. It would be interesting to consider other vector distance measures such as Cosine similarity or Euclidean distance. In the present system, additional heuristic scores of 0.5 and 0.05 have been used to identify and rank relevant tweets for topic ids FMT5 and FMT6; multiple experiments need to be carried out with different values of these heuristic scores so that the values generating the best results can be adopted. Identified location names, geo locations and organization names are not checked for their occurrence in the place of the disaster, which has an effect on the performance of the system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs and Government Organizations in Nepal, so that identified location or organization names in tweet texts could be checked against these lists. The use of general resource, medical resource and infrastructure ontologies in the preparation of the word bags would also have produced more relevant results. Scrambled noisy words have been included in the topic word bags in an ad hoc manner; a better alternative will be to collect such scrambled noisy words from a large tweet corpus and to develop a methodology for identifying the scrambled noisy words that can be included in a topic word bag.
6. REFERENCES
[1] nlp.stanford.edu/software/Stanford-ner-2015-04-20.zip
[2] https://pypi.python.org/pypi/PyDictionary/1.5.2
[3] https://pypi.python.org/pypi/nltk/3.0.0
[4] https://www.nodebox.net/code/index.php/Linguistics
[5] https://pypi.python.org/pypi/pywordnet
[6] www.csse.monash.edu.au/~damian/papers/extabs/Plurals.htm
[7] https://github.com/bermi/Python-Inflector/blob/master/rules/english.py
[8] pydoc.net/Python/Pattern/1.5/pattern.en.parser/
[9] https://github.com/bruce/linguistics; www.nodebox.net/code/index.php/Linguistics
[10] ogden.basic-english.org/
[11] norvig.com/spell-correct.html
[12] docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.spatial.distance.correlation.html
[13] https://deeplearning4j.org/word2vec
[14] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.