=Paper=
{{Paper
|id=Vol-1737/T2-4
|storemode=property
|title=Word Embeddings for Information Extraction from Tweets
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-4.pdf
|volume=Vol-1737
|authors=Surajit Dasgupta,Abhash Kumar,Dipankar Das,Sudip Kumar Naskar,Sivaji Bandyopadhyay
|dblpUrl=https://dblp.org/rec/conf/fire/DasguptaKDNB16
}}
==Word Embeddings for Information Extraction from Tweets==
Word Embeddings for Information Extraction from Tweets

Surajit Dasgupta* (surajit.techie@gmail.com), Abhash Kumar* (abhashmaddi@gmail.com), Dipankar Das (ddas@cse.jdvu.ac.in), Sudip Kumar Naskar (sudip.naskar@cse.jdvu.ac.in) and Sivaji Bandyopadhyay (sivaji.cse.ju@gmail.com)
Jadavpur University, India
* These authors contributed equally.

ABSTRACT

This paper describes our approach to "Information Extraction from Microblogs Posted during Disasters", a shared task of the Microblog Track at the Forum for Information Retrieval Evaluation (FIRE) 2016 [2]. Our method uses vector space word embeddings to extract information from microblogs (tweets) related to disaster scenarios, and it can be replicated across other domains. The system, which shows encouraging performance, was evaluated on the Twitter dataset provided by the FIRE 2016 shared task.

CCS Concepts

• Computing methodologies → Natural language processing; Information extraction

Keywords

word embedding; information retrieval; information extraction; social media

1. INTRODUCTION

Social media plays a very important role in the dissemination of real-time information, for instance during disaster outbreaks. Efficient processing of information from social media websites such as Twitter can help us pursue proper disaster mitigation strategies. Extracting relevant information from tweets, however, proves to be a challenging task owing to their short and noisy nature. Information extraction from social media text is a well researched problem [3], [1], [9], [4], [8], [7]. Approaches based on the bag-of-words model, n-gram methods and machine learning have been used extensively to extract information from microblogs.

2. TASK DEFINITION

A set of tweets, T = {t1, t2, t3, ..., tn}, and a set of topics, Q = {q1, q2, q3, ..., qm}, are given. Each topic contains a title, a brief description and a detailed narrative on what type of tweets is considered relevant to the topic. The tweets given in the task were posted during the Nepal earthquake of April 2015 (http://en.wikipedia.org/wiki/April_2015_Nepal_earthquake). Each topic represents a broad information need during a disaster, such as the availability or requirement of general or medical resources by the population in the disaster affected area, the availability or requirement of resources in a geographical region, reports of relief work being carried out by an organization, or reports of damage to infrastructure. The main objective of the task is to extract all tweets ti ∈ T that are relevant to each topic qj ∈ Q with high precision and high recall, and to rank them in order of relevance.

3. DATA AND RESOURCES

This section describes the dataset and resources provided to the shared task participants. The organizers provided a text file containing the identifiers of 50,068 tweets posted during the Nepal earthquake in April 2015, along with a Python script that downloaded the tweets through the Twitter API (https://dev.twitter.com/overview/api/tweets) into a JSON encoded tweet file, which was then processed during the task. A text file of topic descriptions in TREC format (http://trec.nist.gov) was also provided; it contained the information necessary for the extraction of relevant tweets. The topic file consisted of 7 topics: FMT1, FMT2, FMT3, FMT4, FMT5, FMT6 and FMT7. Each topic consisted of the following 4 sections:

· the topic number;
· the title of the topic;
· a description of the topic;
· a detailed narrative which describes what types of tweets would be considered relevant to the topic.

4. SYSTEM DESCRIPTION

4.1 Preprocessing

We parsed the JSON encoded tweets and retrieved the following attributes: tweet identifier, tweet text and geolocation. Using regular expressions, we removed from the tweets the Twitter handles starting with @, URLs, and all punctuation marks except single instances of "." (period) and "," (comma). We also removed the non-ASCII characters from the tweets and converted the remaining text to lower case. We also preprocessed the relevant sections of the topic file.
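The tweet cleaning steps described above can be sketched in a few lines of Python. The paper does not give its actual regular expressions, so the patterns below (and the function name) are illustrative assumptions that follow the described order of operations:

```python
import re

def preprocess_tweet(text):
    """Clean a tweet roughly as described in Section 4.1.

    The exact regular expressions used by the authors are not given in
    the paper; the patterns below are illustrative assumptions.
    """
    text = re.sub(r"@\w+", " ", text)         # drop Twitter handles starting with @
    text = re.sub(r"https?://\S+", " ", text) # drop URLs
    text = re.sub(r"[^\x00-\x7F]", " ", text) # drop non-ASCII characters
    text = re.sub(r"[^\w\s.,]", " ", text)    # drop punctuation except "." and ","
    text = re.sub(r"\s+", " ", text)          # normalize whitespace
    return text.strip().lower()               # lower-case the remaining text
```

For example, `preprocess_tweet("Help needed in #Kathmandu! @user http://t.co/xyz")` yields `"help needed in kathmandu"`.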
From these sections, we removed punctuation marks and stop words, and converted the text to lower case. The preprocessed section contents of each topic were then used for building the word bags.

4.2 Word Bags

To build the topic-specific word bags, the preprocessed topic sections were manually checked so as to retain only the words relevant to each topic. The topic words were expanded with synonyms obtained from NLTK WordNet (http://www.nltk.org/howto/wordnet.html). The past, past participle and present continuous forms of verbs were obtained using the NodeBox library for Python (https://www.nodebox.net/code/index.php/Linguistics). Vowels, except a word-initial vowel, were removed to create the unnormalized versions of words that are commonly used on Twitter owing to its 140 character limit. The resultant set of words formed the word bag for each topic.

4.3 Entity Detection

For the topics FMT5 and FMT6, location and organization information had to be detected from the tweets. To extract the location information, we used the geolocation attribute of the tweets together with the Stanford NER tagger (http://nlp.stanford.edu/software/CRF-NER.shtml), which extracts location names from the tweet text. Similarly, we used the Stanford NER tagger to detect organizations in the tweet text. We split the tweet file into 10 files of 5,000 tweets each and ran the Stanford NER tagger on the 10 splits in parallel to identify the locations and organizations, if any; this reduced the computation time by 85%.

4.4 Word Vectors

We used the pre-trained 200 dimensional GloVe [6] word vectors trained on Twitter data (2 billion tweets; http://nlp.stanford.edu/projects/glove/) to create the vectors of the preprocessed tweets and of the word bags.

The tweet vectors were created by taking the normalized summation of the vectors of the words in a tweet that were present in the vocabulary of the pre-trained GloVe model. A word that was not part of the model vocabulary was assigned the null vector:

    \vec{t}_i = \frac{1}{N_v(t_i)} \sum_{j=1}^{N_v(t_i)} \vec{u}_{ij}, \quad \vec{u}_{ij} = \vec{0} \text{ if } u_{ij} \notin \text{vocabulary}

where
\vec{t}_i = tweet vector of the ith tweet, t_i;
N_v(t_i) = number of words in t_i present in the vocabulary;
\vec{u}_{ij} = vector of the jth word in the ith tweet.

Similarly, the word bag vectors were created by taking the normalized summation of the vectors of the words in each word bag that were present in the vocabulary of the pre-trained GloVe model; out-of-vocabulary words were again assigned the null vector:

    \vec{q}_i = \frac{1}{N_v(q_i)} \sum_{j=1}^{N_v(q_i)} \vec{w}_{ij}, \quad \vec{w}_{ij} = \vec{0} \text{ if } w_{ij} \notin \text{vocabulary}

where
\vec{q}_i = topic vector of the ith word bag, q_i;
N_v(q_i) = number of words in q_i present in the vocabulary;
\vec{w}_{ij} = vector of the jth word in the ith word bag.

The Word2Vec [5] module of Gensim (https://radimrehurek.com/gensim/models/word2vec.html) was used to create the tweet vectors and the topic vectors from the pre-trained GloVe model. The GloVe vectors were converted to Word2Vec format using code from the GitHub repository manasRK/glove-gensim (https://github.com/manasRK/glove-gensim).

4.5 Similarity Metric

The tweet vector \vec{t}_i and the topic vector \vec{q}_j are used to calculate the similarity. We used the cosine similarity, S, between the tweet vector and the topic vector:

    S = \text{cosine-sim}(\vec{t}_i, \vec{q}_j) = \frac{\vec{t}_i \cdot \vec{q}_j}{\lVert \vec{t}_i \rVert \, \lVert \vec{q}_j \rVert}

A high value of S denotes a higher similarity between the tweet vector \vec{t}_i and the topic vector \vec{q}_j, and vice versa.

For topics such as FMT5 and FMT6, where entity information such as a location (LOC) or an organization (ORG) was required, a consolidated score S' was calculated as follows:

    S' = \frac{S + I}{2}, \quad I = \begin{cases} 1, & \text{if a LOC or ORG entity is present} \\ 0, & \text{otherwise} \end{cases}

The consolidated score S' shifts the cosine similarity towards 1 if the location or organization information is present (high relevance) and towards 0 otherwise (low relevance).

5. RESULTS AND ERROR ANALYSIS

Table 1 presents the results obtained by our word embedding based approach.

Run ID    Precision@20   Recall@1000   MAP@1000   Overall MAP
JU NLP 1  0.4357         0.3420        0.0869     0.1125
JU NLP 2  0.3714         0.3004        0.0647     0.0881
JU NLP 3  0.3714         0.3004        0.0647     0.0881

Table 1: Results of the automatic runs

As seen in the table, Run 1 achieved the best results among the three runs, owing to the fact that its word bags were built from the complete corresponding descriptions of each topic, whereas the other runs split the word bags categorically and averaged the similarity between the tweet vector and the split topic vectors. The lower performance of Runs 2 and 3 is a result of this averaging, which only approximated the actual cosine similarity between the tweet and topic vectors. Runs 2 and 3, which are identical in nature, used cosine distance as their similarity metric.

6. CONCLUSION

In this paper, we presented a brief overview of our system for information extraction from microblog data. We observed that building word bags that contained all the words relevant to a topic gave better results than splitting the word bags; accordingly, Run 1 outperformed the rest. Considering hashtags as a feature should also improve the performance of the system.
As future work, we would like to explore more sophisticated techniques for building the vector of a tweet from the vectors of its constituent words, taking the sequence of the words into account. We also plan to incorporate more topic specific features to improve the performance of our system.

7. REFERENCES

[1] S. Choudhury, S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. Entity extraction from social media using machine learning approaches. In Working Notes in Forum for Information Retrieval Evaluation (FIRE 2015), pages 106–109, Gandhinagar, India, 2015.
[2] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, and P. Meier. Practical extraction of disaster-relevant information from social media. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13 Companion, pages 1021–1024, New York, NY, USA, 2013. ACM.
[4] M. Imran, S. M. Elbassuoni, C. Castillo, F. Diaz, and P. Meier. Extracting information nuggets from disaster-related messages in social media. In Proceedings of ISCRAM, Baden-Baden, Germany, 2013.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[6] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[7] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in Twitter to improve information filtering.
In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 841–842, New York, NY, USA, 2010. ACM.
[8] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1079–1088. ACM, 2010.
[9] X. Yang, C. Macdonald, and I. Ounis. Using word embeddings in Twitter election classification. arXiv preprint arXiv:1606.07006, 2016.
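For illustration, the vector construction and scoring described in Sections 4.4 and 4.5 can be sketched in a few lines of Python. The tiny two-dimensional embedding table and all helper names below are illustrative assumptions, not the paper's code; the actual system used the 200-dimensional GloVe Twitter vectors loaded through Gensim.

```python
import math

DIM = 2  # toy dimensionality; the paper used 200-dimensional GloVe vectors

# Stand-in embedding table; real vectors come from the pre-trained
# GloVe Twitter model (http://nlp.stanford.edu/projects/glove/).
EMBED = {
    "water":  [1.0, 0.0],
    "food":   [0.8, 0.2],
    "need":   [0.5, 0.5],
    "rescue": [0.0, 1.0],
}

def text_vector(words):
    # Normalized summation of in-vocabulary word vectors (Section 4.4);
    # out-of-vocabulary words contribute the null vector.
    in_vocab = [EMBED[w] for w in words if w in EMBED]
    if not in_vocab:
        return [0.0] * DIM
    return [sum(v[k] for v in in_vocab) / len(in_vocab) for k in range(DIM)]

def cosine_sim(a, b):
    # S = (a . b) / (||a|| ||b||)  (Section 4.5)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def consolidated_score(s, has_loc_or_org):
    # S' = (S + I) / 2, with I = 1 if a LOC or ORG entity was detected
    return (s + (1.0 if has_loc_or_org else 0.0)) / 2.0

tweet_vec = text_vector("need water food xyz".split())  # "xyz" is out of vocabulary
topic_vec = text_vector(["water", "food"])
s = cosine_sim(tweet_vec, topic_vec)
```

With these toy vectors, the tweet scores close to 1 against the topic word bag; for FMT5/FMT6-style topics, that cosine score would then be averaged with the entity indicator via `consolidated_score`.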