Correlation Distance based Information Extraction System at FIRE 2016 Microblog Track

Saptarashmi Bandyopadhyay
Indian Institute of Engineering Science and Technology, Shibpur
Howrah 711103, India
saptarashmicse@gmail.com

ABSTRACT
The FIRE 2016 Microblog track provided a set of tweets posted during the Nepal earthquake in April 2015, and a set of seven topics. The challenge was to extract all tweets relevant to each topic. In this method, separate word bags are constructed for each topic describing a generic information need during a disaster situation, using topic seed words, stemmers, a dictionary and other NLP tools. The word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed, and this correlation distance measure is used to compute the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. Organization names and location names are identified in the crawled tweet texts. The presence of geo locations in the crawled tweets is also identified by the tweet parser. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the word bag vectors in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system depends on accounting for the presence of scrambled noisy words in the tweet texts.

CCS Concepts
• Information systems ➝ Information Retrieval ➝ Information Retrieval Query Processing

Keywords
FIRE 2016; Microblog Track; Twitter Information Extraction; Vector Model; Correlation Distance.

1. INTRODUCTION
User-generated content in microblogging sites like Twitter is an important source of real-time information on various events, including disaster events like floods, earthquakes and terrorist attacks. The aim of the FIRE 2016 Microblog track [14] is to develop Information Retrieval methodologies for extracting important information from microblogs posted during disasters.

A total of 49,774 tweets that were posted during the Nepal earthquake in April 2015 have been provided as the data for the task, along with a set of 7 topics in TREC format. Each topic contains an identifier number, a title, a topic description and a more detailed narrative which describes the types of tweets that would be considered relevant to the topic. Each of the seven topics identifies a broad information need during a disaster: what resources are available (FMT1), what resources are required (FMT2), what medical resources are available (FMT3), what medical resources are required (FMT4), what were the requirements / availability of resources at specified locations (FMT5), what were the activities of various NGOs / Government Organizations (FMT6), and what infrastructure damage / restoration was being reported (FMT7). The corresponding topic ids are given within brackets.

The task was to develop methodologies for extracting tweets that are relevant to each topic with high precision as well as high recall. The main challenges in this ad-hoc search task are dealing with the noisy nature of the tweets and identifying specific keywords relevant to each topic. Tweet texts contain a maximum of 140 characters and are often informally written using abbreviations, colloquial terms, etc. An individual tweet text might not contain most of the specific keywords even though the tweet is relevant to a topic.

In the present system, the tweet parser parses the tweets and extracts the tweet texts. Organization names and location names are identified in the crawled tweet texts. The presence of geo locations in the crawled tweets is also identified. Separate word bags are constructed for each topic, and the topic word bags have been populated with scrambled words that generally appear as noise words in tweet texts. The correlation distance between the topic word bag vectors and each tweet text vector is computed and used to derive the relevance score of each tweet for each topic. Special consideration is taken for the topics that are conditioned on the presence of organization names, location names and geo locations. The system response is generated by ordering tweet ids in descending order of their relevance score with respect to each topic. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4357, 0.3420, 0.0869 and 0.1125 respectively; for the system without the scrambled noisy words in the word bag vectors, the corresponding scores are 0.4000, 0.3401, 0.0860 and 0.1119. The results indicate that the precision of the information extraction system depends on accounting for the presence of scrambled noisy words in the tweet texts. In the present work no attempt has been made to identify duplicate tweets; this policy is in line with what is mentioned in the problem statement about weeding out duplicate tweets.

2. PREPROCESSING
The 49,774 tweets have been made available as json files as part of the FIRE 2016 Microblog track. The tweet parser parses the tweets and extracts the tweet text, appending a newline after each tweet text. During preprocessing, the string stored in the text attribute of a tweet is parsed. It has been observed that some of the tweets contain non-ASCII characters in the tweet text which are not needed for the vector correlation distance computation; such characters are removed using a Python script. The newline character present in the parsed tweet text is also removed.
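The following is a minimal sketch of this preprocessing step. It assumes one JSON object per line in the track's data files and uses the standard Twitter 'text' and 'id_str' attributes; the actual file layout provided by the track may differ.

import json

def extract_tweet_texts(json_path):
    # Parse the crawled tweets and return (tweet id, cleaned text) pairs.
    pairs = []
    with open(json_path) as f:
        for line in f:
            tweet = json.loads(line)
            text = tweet['text']
            # Drop non-ASCII characters, which are not needed for the
            # vector correlation distance computation.
            text = text.encode('ascii', 'ignore').decode('ascii')
            # Remove newline characters from the parsed tweet text.
            text = text.replace('\n', ' ').strip()
            pairs.append((tweet['id_str'], text))
    return pairs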
The StanfordNER (Named Entity Recognizer) Tagger [1], available through the NLTK 3.0 toolkit [3], has been used on the parsed and preprocessed (after removal of non-ASCII characters and newline characters) tweet texts to identify the location and organization names in the tweet texts. A big benefit of the Stanford NER tagger is that it provides a few different models for pulling out named entities. Any of the following can be used:

• 3 class model for recognizing locations, persons, and organizations
• 4 class model for recognizing locations, persons, organizations, and miscellaneous entities
• 7 class model for recognizing locations, persons, organizations, times, money, percents, and dates

The NLTK toolkit provides a wrapper to the StanfordNERTagger class so that it can be used in Python. The parameters passed to the StanfordNERTagger class include:

1. Classification model path
2. Stanford tagger jar file path
3. Training data encoding

In the present work, the 3 class model for English has been used as the classification model, the jar file from stanford-ner-2015-04-20/stanford-ner.zip has been used as the Stanford tagger jar file, and the default ASCII encoding has been used as the training data encoding. The output tags are obtained in UTF-8 encoding for the LOCATION, ORGANIZATION and PERSON named entities.
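A minimal sketch of this tagging step follows. The model and jar paths are illustrative placeholders for the unpacked stanford-ner-2015-04-20 distribution, and the import path of the wrapper class varies across NLTK versions.

from nltk.tag import StanfordNERTagger

# Illustrative paths into the unpacked stanford-ner-2015-04-20 distribution.
MODEL = 'stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz'
JAR = 'stanford-ner-2015-04-20/stanford-ner.jar'

# 3 class model (LOCATION, PERSON, ORGANIZATION) with ASCII training data
# encoding, as described above.
tagger = StanfordNERTagger(MODEL, JAR, encoding='ascii')

def named_entities(tweet_text):
    # Tag the tokens of one tweet text and collect the location and
    # organization names needed for topics FMT5 and FMT6.
    tagged = tagger.tag(tweet_text.split())
    locations = [word for word, tag in tagged if tag == 'LOCATION']
    organizations = [word for word, tag in tagged if tag == 'ORGANIZATION']
    return locations, organizations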
It may be observed at this point that identified location names may not belong to Nepal, the country where the disaster took place, while the topic with id FMT5 requires that the availability or requirement of resources be referred to specific locations in the place of disaster. Organization names must be present in tweet texts that are examined for activities of NGOs / Government Organizations (FMT6), and it is observed that the identified organization names may not correspond to NGOs / Government Organizations working in Nepal. This situation will have an effect on the precision and recall of the information extraction system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs or Government Organizations in Nepal.

The crawled tweets in the json files are also checked for the presence of geo locations. Geo locations present in a tweet identify the location from which the tweet has been submitted; they are present only when the feature is turned on before sending the tweet. It is observed that geo locations present in a tweet may not belong to Nepal, the place of disaster. It has also been observed that location named entities are not always present in the tweet texts. The presence of location named entities or geo locations is considered for the relevance of tweets with respect to topic id FMT5, and organization named entities are considered for the relevance of tweets with respect to topic id FMT6.

The following bags of words are initially created considering the information needs of the seven topics with topic ids FMT1 through FMT7: available, resources, required, medical, working, relief, infrastructure, damage and restoration. The 'working' and 'relief' bags have been considered to take care of topic id FMT6. The word bags have been identified by analyzing the topic descriptions. These word bags are created in the following manner. First, by looking into the narratives of the FMTs, seed words have been identified; for example, the seed words for the 'available' word bag as identified from the FMT are 'available' and 'availability'. Then PyDictionary 1.5.2 [2] has been used to find the synonyms of the seed words, and these synonyms have been included in the word bag. PyDictionary uses the WordNet corpus, though not directly. Next, stemming has been carried out using the Porter Stemmer module available in the NLTK 3.0 toolkit [3]; other stemmers, such as the Lancaster stemmer, could also have been used. Next, the NodeBox toolkit [4] has been used for generating the surface-level inflected forms of the words in the word bags. The library bundles WordNet (using Oliver Steele's PyWordNet [5]), NLTK [3], Damian Conway's pluralisation rules [6], Bermi Ferrer's singularization rules [7], Jason Wiener's Brill tagger [8], several algorithms adopted from Michael Granger's Ruby Linguistics module [9], Charles K. Ogden's list of basic English words [10], and Peter Norvig's spelling corrector [11]. The words in the word bag have been pluralized and the past forms of the verb words have been generated. Finally, by looking into the topic narratives, the words in each word bag that fit the sense of the particular topic have been retained.
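A sketch of this expansion pipeline is given below, assuming PyDictionary's synonym() call, the NLTK Porter stemmer, and the NodeBox Linguistics 'en' module (a Python 2 era library); the final manual filtering against the topic narrative is not shown.

from PyDictionary import PyDictionary
from nltk.stem.porter import PorterStemmer
import en  # NodeBox Linguistics English module [4]

dictionary = PyDictionary()
stemmer = PorterStemmer()

def build_word_bag(seed_words):
    # Expand topic seed words into a word bag: synonyms, Porter stems,
    # plural forms and past verb forms.
    bag = set(seed_words)
    for seed in seed_words:
        synonyms = dictionary.synonym(seed) or []
        bag.update(s.lower() for s in synonyms)
    for word in list(bag):
        bag.add(stemmer.stem(word))       # Porter stem
        bag.add(en.noun.plural(word))     # pluralised form
        try:
            bag.add(en.verb.past(word))   # past form, where known
        except Exception:
            pass                          # not a known verb
    return bag

# For example, the 'available' bag starts from its FMT seed words.
available_bag = build_word_bag(['available', 'availability'])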
Scrambled words like 'avlbl' for the 'available' word bag have also been added by randomly selecting tweets and inspecting them against the topic narrative; such noisy words are often present in tweets. A separate set of word bags has also been developed for each topic without the inclusion of these scrambled words.

Separate word bags have thus been constructed for each of the topics, as shown in Table 1.

Table 1. Topic word bags

Topic Id | Topic Description | Topic Word Bag
FMT1 | availability of resources | available + resources
FMT2 | requirement of resources | required + resources
FMT3 | availability of medical resources | available + medical
FMT4 | requirement of medical resources | required + medical
FMT5 | availability and requirement of general and medical resources at specified locations | available + required + resources + medical + (occurrence of location Named Entities or geo locations in the tweet text is a must)
FMT6 | activities of various NGOs / Government Organizations | working + relief + (occurrence of organization Named Entities in the tweet text is a must)
FMT7 | infrastructure damage and restoration | infrastructure + damage + restoration

3. INFORMATION EXTRACTION SYSTEM
The basic task in the extraction system is to look for the co-occurrence of words from each topic FMT word bag in each tweet text. The objective is to assign a relevance score to each tweet text corresponding to each topic FMT. This is accomplished by converting each topic FMT word bag and each tweet text into separate vectors; the distance between the two vectors then yields the relevance score of the tweet text for that topic FMT.

Each word bag for each topic FMT is converted to a vector of 200 dimensions by using the Word2Vec package [13] with w2v.twitter.200d.txt as the model file. Each tweet text is also converted to a vector of 200 dimensions in a similar manner. Word2Vec [13] is a two-layer neural net that processes text: its input is a text corpus and its output is a set of feature vectors for the words in that corpus. While Word2Vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.

The correlation distance between each tweet vector and each topic FMT word bag vector is computed, and these values are stored in an array for each topic FMT. The distance correlation of two random variables is obtained by dividing their distance covariance by the product of their distance standard deviations. The correlation distance dCor(u,v) between two one-dimensional vectors u and v is defined as

dCor(u,v) = dCov(u,v) / SQRT(dVar(u) * dVar(v))    (1)

where dCov(u,v) is the distance covariance of the two vectors u and v, and dVar(u) and dVar(v) are the distance standard deviations of the vectors u and v respectively. It may be noted that the distance values are computed as (1 - correlation distance) by the scipy package's spatial distance module [12].
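The sketch below illustrates these two steps. The paper does not state how a set of words is mapped to a single 200-dimensional vector; averaging the individual word vectors is assumed here. The model file is assumed to use the common plain-text format of one word followed by its 200 values per line.

import numpy as np
from scipy.spatial.distance import correlation

def load_vectors(path='w2v.twitter.200d.txt'):
    # Load the pretrained 200-dimensional Twitter word vectors.
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def to_vector(words, vectors, dim=200):
    # Represent a word bag or a tokenized tweet text as the mean of its
    # word vectors (an assumption; the paper does not specify the mapping).
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# scipy's correlation() returns 1 - correlation of the two vectors,
# which is the quantity referred to in equation (5).
def scipy_correlation_distance(tweet_vec, topic_vec):
    return correlation(tweet_vec, topic_vec)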
The relevance scores of each tweet text for each topic FMT are calculated in the following way. For topics FMT1-FMT4 and FMT7, the relevance score of each tweet text corresponding to each topic FMT is computed as

relevance score = 1 - correlation distance as computed by the spatial distance module    (3)

which can be simplified as

relevance score = actual correlation distance    (4)

since the correlation distance as computed by the spatial distance module = 1 - actual correlation distance    (5)

Relevance scores for each tweet as computed by equations 3-5 above will have a lower value if the tweet is relevant to the corresponding topic FMT, and a higher value if the tweet is less relevant to it. Hence, the relevance scores of each tweet text for each topic FMT are subtracted from 1 and stored as the final relevance score for each tweet text corresponding to that topic FMT. Thus,

final relevance score = 1 - relevance score    (6)

This ensures that relevant tweets corresponding to each topic FMT will have a high relevance score.

For topic FMT5, if no location names have been identified in the tweet text and no geo locations have been identified in the tweet, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values of 0.5 and 0.05 have been chosen heuristically. It may be noted that tweets with location names or geo locations are considered relevant to topic FMT5 provided the other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no location names in the tweet text and no geo locations in the tweet)
else
relevance score = actual correlation distance + 0.05 (location names in the tweet text or geo locations in the tweet)    (7)

The final relevance score for each tweet corresponding to topic FMT5 is then computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no location names in the tweet text and no geo locations become small with respect to other tweets relevant to topic FMT5, while the relevance scores for tweets with location names or geo locations become higher with respect to other FMT5 tweets. It may be noted that tweet texts with no location names or geo locations have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.

For topic FMT6, if no organization names have been identified in the tweet text, a score of 0.5 is added to the actual correlation distance score already obtained; otherwise a score of 0.05 is added. The values of 0.5 and 0.05 have again been chosen heuristically. It may be noted that tweets with specific NGO or Government Organization names are considered relevant to topic FMT6 provided the other conditions are satisfied.

relevance score = actual correlation distance + 0.5 (no organization names in the tweet text)
else
relevance score = actual correlation distance + 0.05 (organization names in the tweet text)    (8)

The final relevance score for each tweet corresponding to topic FMT6 is then computed as in equation 6. The objective is to ensure that the relevance scores for tweets with no organization names in the tweet text become small with respect to other tweets relevant to topic FMT6, while the relevance scores for tweets with organization names become higher with respect to other FMT6 tweets. It may be noted that tweet texts with no organization names have not been completely rejected, since the named entity identification process using the Stanford NER package may have missed such names.
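A sketch consolidating equations (3)-(8) is given below, reusing scipy_correlation_distance() from the previous sketch; the flag parameters are hypothetical names for the named entity and geo location checks described above.

def final_relevance(tweet_vec, topic_vec, topic_id,
                    has_location=False, has_geo=False, has_org=False):
    # Equations (3)-(5): the relevance score is the actual correlation
    # distance, recovered from scipy's (1 - correlation) output.
    score = 1.0 - scipy_correlation_distance(tweet_vec, topic_vec)
    if topic_id == 'FMT5':
        # Equation (7): heuristic addition of 0.05 when location names or
        # geo locations are present, 0.5 otherwise.
        score += 0.05 if (has_location or has_geo) else 0.5
    elif topic_id == 'FMT6':
        # Equation (8): the same heuristic for organization names.
        score += 0.05 if has_org else 0.5
    # Equation (6): the final relevance score.
    return 1.0 - score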
Next, these relevance scores are sorted, and for each topic FMT the tweet id and relevance score pairs are obtained in descending order of final relevance score, so that highly relevant tweets are placed high in the list. The final result is submitted in TREC format.
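A sketch of this final step follows, assuming the standard TREC run file layout (topic id, 'Q0', document id, rank, score, run tag); the exact submission format is the one specified by the track.

def write_run(scores, out_path='run.txt', run_tag='correlation_distance'):
    # scores maps each topic id to a {tweet id: final relevance score} dict.
    with open(out_path, 'w') as out:
        for topic_id in sorted(scores):
            ranked = sorted(scores[topic_id].items(),
                            key=lambda item: item[1], reverse=True)
            for rank, (tweet_id, score) in enumerate(ranked, start=1):
                out.write('%s Q0 %s %d %.6f %s\n' %
                          (topic_id, tweet_id, rank, score, run_tag))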
4. SYSTEM EVALUATION RESULTS
The submitted system has been evaluated by the FIRE 2016 Microblog track organizers in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP. The evaluation scores for each topic have been averaged to generate the evaluation scores for the submitted systems. The evaluation scores of the submitted system in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported by the track organizers as 0.4357, 0.3420, 0.0869 and 0.1125 respectively. The evaluation scores of the system without the scrambled noisy words in the topic FMT word bag vectors in terms of Precision@20, Recall@1000, MAP@1000 and Overall MAP have been reported as 0.4000, 0.3401, 0.0860 and 0.1119 respectively. The results indicate that the precision of the information extraction system improves if scrambled noisy words are included in the topic FMT word bag vectors.

5. CONCLUSION
The submitted information extraction system has only considered the correlation distance measure between the vector representations of the topic word bags and the tweet texts. It would be interesting to consider other vector distance measures such as cosine similarity or Euclidean distance. In the present system, additional heuristic scores of 0.5 and 0.05 have been used to identify and rank relevant tweets for topic ids FMT5 and FMT6; multiple experiments with different values of these heuristic scores should be carried out, and the values that generate the best results adopted. Identified location names, geo locations and organization names are not checked for their occurrence in the place of disaster, which affects the performance of the system. A better alternative would have been the development of a list of location names in Nepal and a list of the NGOs or Government Organizations in Nepal, so that identified location or organization names in tweet texts could be checked against these lists. The use of general resource and medical resource ontologies, as well as infrastructure ontologies, in the preparation of the word bags would likely have produced more relevant results. Scrambled noisy words have been included in the topic word bags in an ad hoc manner; a better alternative would be to collect such scrambled noisy words from a large tweet corpus and to develop a methodology for identifying the scrambled noisy words to be included in a topic word bag.

6. REFERENCES
[1] nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
[2] https://pypi.python.org/pypi/PyDictionary/1.5.2
[3] https://pypi.python.org/pypi/nltk/3.0.0
[4] https://www.nodebox.net/code/index.php/Linguistics
[5] https://pypi.python.org/pypi/pywordnet
[6] www.csse.monash.edu.au/~damian/papers/extabs/Plurals.htm
[7] https://github.com/bermi/Python-Inflector/blob/master/rules/english.py
[8] pydoc.net/Python/Pattern/1.5/pattern.en.parser/
[9] https://github.com/bruce/linguistics
[10] ogden.basic-english.org/
[11] norvig.com/spell-correct.html
[12] docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.spatial.distance.correlation.html
[13] https://deeplearning4j.org/word2vec
[14] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.