Event extraction from Social Media text using Conditional Random Fields

Nagesh Bhattu Sristy, IDRBT, Hyderabad, Telangana (nageshbs@idrbt.ac.in)
N. Satya Krishna*, IDRBT, Hyderabad, Telangana (satya.krishna.nunna@gmail.com)
D. V. L. N. Somayajulu, NIT Warangal, Warangal, Telangana (soma@nitw.ac.in)

* Work done as part of Ph.D.

ABSTRACT
Social media tools have popularized digital devices among the masses, making information dissemination easier and faster. Exchange of text is the most popular and effective means of communication among social media users. Since these messages reach a wide audience, it has become necessary to process and understand their semantics. Event extraction refers to understanding the events across streams of social media messages. It helps in taking quicker corrective actions in the case of natural calamities, and can thus possibly save lives. The main objective of the task is to draw specific knowledge from digital text in order to predict the events (incidents) it describes. We propose a two-step procedure to extract events. The first phase applies a binary classifier to identify the messages containing an event. The second phase applies a sequence labeling technique, conditional random fields (CRF), to extract the event from the message. As social media text is noisy, it is a challenge to develop learning algorithms for these tasks. We use part-of-speech (POS) tags of the words to address some of the issues in this challenge.

CCS CONCEPTS
• Information systems → Structured text search;

KEYWORDS
ACM proceedings, LaTeX, text tagging

1 INTRODUCTION
Twitter (http://www.twitter.com) and Facebook (http://www.facebook.com) messages provide up-to-date information about current events. In today's world, with the proliferation of social media usage, extracting current events from unstructured tweets and posts has gained ample attention. Social media data has unusual characteristics such as short length, stylistic variations, acronyms, and noisy, unstructured forms. This makes event extraction a challenging problem in Natural Language Processing (NLP). Numerous tools have been developed [14], [8] in the recent past to enable short-text processing for tasks such as POS tagging, chunking and named entity recognition. These tools use special techniques to account for out-of-vocabulary terms in text collected from twitter. Ritter et al. [14] use Brown clustering on a huge corpus of english tweets to cluster similarly used words such as '2morrow', 'tmrrow', '2mar'. In comparison to english, the resources available for the remaining languages are quite scarce, even for formal text. This shared task pertains to processing text written in the Indian languages hindi, tamil and malayalam to obtain the events from social media messages.

Event extraction is one of the most valuable tasks in natural language processing and information extraction. For example, accurate selection of news messages improves the performance of news systems [5]. Furthermore, by detecting the occurrence of events as early as possible, the performance of risk analysis systems (Capet et al. [4]) and traffic monitoring systems (Kamijo et al. [7]) can be improved, and civil unrest can be forecast [13].

Most early methods for event extraction (Allan et al. [1], [6], Yang et al. [17]) focused on news articles, which were then the best source of information on current events. With the ability of social media tools to virally popularize news items, and their acceptance across the masses, numerous media agencies have come to rely on twitter and facebook feed pages to disseminate their news highlights. Twitter feeds for hindi (https://twitter.com/aajtak?lang=en, https://twitter.com/bbchindi?lang=en), tamil (https://twitter.com/news7tamil?lang=en, https://twitter.com/thatstamil?lang=en) and malayalam (https://twitter.com/manoramanews?lang=en, https://twitter.com/beatsofkerala?lang=en) are a few examples of social media forums continuously posting news items. Among the posts made by these feeds, only a small fraction of tweets contain events.

Alan Ritter et al. [1] developed the first open-domain event extraction tool (TWICAL) for twitter data. Extraction of information about NASDAQ-100 listed companies from RSS feeds using StockWatcher was proposed in [10]. Hermes (Borsje et al. [3]) is a news interpreter that supports the decision-making process of filtering relevant news using Semantic Web technologies. Using specific features related to natural disasters, Sakaki et al. [15] proposed a method to detect earthquake-related tweets. Benson et al. [2] presented a relation extractor to predict artists and venues from tweets.
2 OVERVIEW
Social media text written in Indian languages has received much less attention than english. The multiplicity of languages in India, and their use by comparatively smaller populations, can be thought of as possible reasons for this observation. Tokenization is the first step in processing social media text written in Indian languages. Effective tokenization helps in segregating meaningful features from noise. Feature extraction involves converting the text into lemma form and morphological analysis. POS tagging attributes POS tags to the text, observed as a sequence of words. We depend on the available tools [12] (http://ltrc.iiit.ac.in/download.php) for POS tagging of sentences in the various Indian languages. The task is to identify event-carrying messages and then extract the relevant portions of the events from such event-carrying text. The main phase of the approach consists of building a classifier to distinguish messages based on whether or not they carry an event. Post-processing is then applied to extract the actual event from the event text. This is done through a sequence tagger trained on the provided training data. Figure 1 gives an overview of our approach for the shared task.

[Figure 1: Overview of the approach followed]

We use the mallet toolkit [9] (http://mallet.cs.umass.edu/index.php) for the various machine learning algorithms applied in this work.

2.1 Problem Statement
Given a collection D of word sequences, where each sentence is of the form w_1, w_2, ..., w_n, identify the sequences carrying events. For the identified sequences, a label sequence l_1, l_2, ..., l_n is predicted, where each tag l_i ∈ {B, I, O}. The tags B, I, O indicate the beginning, inside and outside of an event in event-carrying text.

2.2 Tokenization & POS Tagging
Tokenization in our approach follows that of twokenizer (https://github.com/brendano/tweetmotif). It was extended to handle the unicode ranges of the Indian languages: hindi, tamil and malayalam occupy the unicode ranges 0x0900-0x097F, 0x0B80-0x0BFF and 0x0D00-0x0D7F respectively. Emoticons, urls, hashtags and userids are various other special tokens. The POS taggers used (http://ltrc.iiit.ac.in/download.php) have their own tokenizers and lemmatizers for morphological analysis. These tools are designed for newswire text, which is supposed to be much cleaner than the text seen on social media. We modified the respective tokenizers to process social media text.
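The tokenizer extension for Indic scripts described in Section 2.2 can be sketched as a regular expression over the unicode ranges listed above. The pattern and function names below are illustrative assumptions, not the actual twokenizer code.

```python
import re

# Token patterns for a twokenizer-style tokenizer extended to the
# Indic unicode blocks mentioned above (Devanagari U+0900-097F,
# Tamil U+0B80-0BFF, Malayalam U+0D00-0D7F). Alternatives are tried
# left to right, so urls and mentions win over plain word runs.
TOKEN_RE = re.compile(
    r"https?://\S+"            # urls
    r"|[@#]\w+"                # mentions and hashtags
    r"|[\u0900-\u097F]+"       # Devanagari (hindi) runs
    r"|[\u0B80-\u0BFF]+"       # Tamil runs
    r"|[\u0D00-\u0D7F]+"       # Malayalam runs
    r"|[A-Za-z0-9']+"          # latin words and numbers
    r"|\S"                     # any other single symbol (emoticons etc.)
)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

A real tokenizer would also need to handle combining marks across script boundaries and multi-character emoticons; this sketch only shows how the unicode ranges slot into the pattern.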
2.3 Event Detection
The critical component of our solution is the classifier which detects the events. We analysed the detection capabilities of three classifiers, namely naive Bayes, logistic regression and semi-supervised naive Bayes EM.

The naive Bayes (NB) algorithm models the classifier as the outcome of a generative process. If D is a training dataset of N examples x_1, x_2, ..., x_N with corresponding labels y_1, y_2, ..., y_N, the NB model is expressed in equation (1). We assume that each x_i ∈ R^F and each y_i ∈ {1, 2, ..., L}, where F is the number of features and L is the number of labels. x_{ij} denotes the count of the j'th feature in the i'th example, w_j is the j'th feature in the feature set, and p(w_j | y) is the j'th element of the parameter vector associated with label y.

maximize \sum_{i=1}^{N} \log p(x_i, y_i)    (1)

p(x_i, y_i) = p(y_i) p(x_i | y_i)    (2)

p(x_i, y_i) = p(y_i) \prod_{j=1}^{F} p(w_j | y_i)^{x_{ij}}    (3)

Naive Bayes EM (NBEM) is a semi-supervised algorithm [11] which makes use of the test data along with the training data to infer the classifier. If the training and testing portions of the dataset are designated D_l and D_u respectively, NBEM learns the classifier as a maximizer of (4). The first part of the objective is the same as that of the NB approach. The labels of the test examples are not known and are learnt by an iterative EM algorithm, in which the E-step predicts the labels of the test examples and the M-step learns the parameters of the model from the probabilistic labels produced in the E-step.

maximize \sum_{i \in D_l} \log p(x_i, y_i) + \sum_{i \in D_u} \log \sum_{y=1}^{L} p(x_i, y)    (4)

Maximum entropy or logistic regression (MaxEnt) is a discriminative approach to building the classifier and hence does not benefit much from the EM setting. MaxEnt learns the model which maximizes the objective in equation (6). The conditional distribution in equation (7) is the softmax function. The numerator of equation (7) is the score of example x_i in class y_i; the denominator is a normalizer which ensures that p(y|x) sums to 1. The method is known as maximum entropy because equation (6) is the dual of an equivalent entropy maximization under feature constraints. \mu_y is the parameter vector corresponding to class y, of the same size as F (the number of features); L is the total number of labels.

maximize \sum_{i=1}^{N} \log p(y_i | x_i)    (6)

p(y_i | x_i) = \exp(\mu_{y_i}^T x_i) / \sum_{y=1}^{L} \exp(\mu_y^T x_i)    (7)
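The naive Bayes model of equations (1)-(3) can be sketched as follows. This is a minimal count-based illustration with Laplace smoothing, not the mallet implementation used in the paper.

```python
import numpy as np

# Minimal multinomial naive Bayes sketch of equations (1)-(3):
# p(x, y) = p(y) * prod_j p(w_j | y)^x_ij, fit by counting.
# X is an (N, F) feature-count matrix, y holds labels 0..L-1.
def fit_nb(X, y, n_labels, alpha=1.0):
    N, F = X.shape
    log_prior = np.zeros(n_labels)
    log_lik = np.zeros((n_labels, F))
    for c in range(n_labels):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / N)           # log p(y=c)
        counts = Xc.sum(axis=0) + alpha              # smoothed feature counts
        log_lik[c] = np.log(counts / counts.sum())   # log p(w_j | y=c)
    return log_prior, log_lik

def predict_nb(X, log_prior, log_lik):
    # argmax_y  log p(y) + sum_j x_ij * log p(w_j | y)
    return np.argmax(log_prior + X @ log_lik.T, axis=1)
```

Working in log space turns the product of equation (3) into the dot product above and avoids numerical underflow for long messages.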
2.4 Event Extraction
Event extraction is performed using a CRF [16]. If (x_i, y_i) is the i'th example, where x_i is a feature sequence of length T and y_i is the corresponding label sequence, the CRF maximizes the conditional likelihood of equation (6), with p(y_i | x_i) defined as in equation (8). The difference between the numerators of equations (7) and (8) lies in the features considered: the CRF models sequential dependencies, while the MaxEnt classifier disregards them. The feature vector f(x_i, y_i) in equation (8) is similar to the feature vector in equation (7), encoding the number of times a feature is associated with a label. The feature vector g(y_i) encodes the label-sequence features, i.e. the number of times a label combination appears in succession in the example (x_i, y_i); g(y_i) has L × L features, each encoding a possible label bigram. The denominator Z(x_i) is a normalizer evaluated over all possible label assignments (L^T for x_i) over the label space; it is computed efficiently using the forward-backward algorithm. \mu and \eta are the parameters associated with the node features and edge features of the CRF respectively.

p(y_i | x_i) = \exp(\mu^T f(x_i, y_i) + \eta^T g(y_i)) / Z(x_i)    (8)

We used CRF++ (https://taku910.github.io/crfpp/) as our sequence labeller. CRF++ accepts feature templates for learning. The features used are unigrams within a window of 5 words around the current word. Unigram features are extremely noisy, since most are seen very few times in the corpus and test-data unigrams are mostly seen for the first time; we therefore use POS-tag-based features as a second feature set, with a similar 5-tag window, to help the model inferred by the CRF in mimicking the event extraction process. We also used label bigram features as label-based features. The dataset provided for the shared task gives the start and end positions of the event within each matched event text. We converted this input format to B, I, O tagging to produce the input for CRF++. As the tokenization employed adds spaces, the output of the CRF is remapped onto the original text to recover the event positions in the form expected by the shared task.
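The conversion from annotated character spans to B, I, O tags described above can be sketched as follows, assuming one event span per message. The function name and input format are illustrative, not the shared task's actual file format.

```python
# Sketch of span-to-BIO conversion: given a message, its tokens, and the
# event's (start, end) character offsets from the annotation, emit one
# B/I/O tag per token. Tokens fully inside the span get B (first) or I;
# everything else gets O.
def spans_to_bio(text, tokens, start, end):
    tags, pos = [], 0
    begun = False
    for tok in tokens:
        tok_start = text.index(tok, pos)   # locate token in original text
        tok_end = tok_start + len(tok)
        pos = tok_end
        if tok_start >= start and tok_end <= end:
            tags.append("I" if begun else "B")
            begun = True
        else:
            tags.append("O")
    return tags
```

The inverse mapping (from predicted tags back to character offsets) can reuse the same `text.index` bookkeeping, which is essentially the remapping step mentioned above.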
3 EXPERIMENTS
The datasets used for the task are summarized in Table 1. Preprocessing replaces all urls with the keyword URL and '@' mentions with USER; hashtags are preserved, as they sometimes contain event-specific tags such as #BombBlast or #EarthQuake. The test set for hindi is four times the size of its training set, while for tamil and malayalam this ratio is more favourable. The number of unique words is smaller for malayalam.

Table 1: Dataset Characteristics

Language        | No of Instances | No of Tokens | Dictionary Size
Hindi Train     | 1025            | 19,497       | 4,381
Hindi Test      | 4451            | 89,167       | 11,420
Malayalam Train | 2218            | 18,449       | 4,065
Malayalam Test  | 5173            | 38,625       | 13,460
Tamil Train     | 3843            | 37,221       | 12,033
Tamil Test      | 5304            | 50,365       | 15,255
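The preprocessing step described above can be sketched as two substitutions. The regular expressions are an assumption of how urls and mentions are matched, not the paper's exact rules.

```python
import re

# Sketch of the preprocessing described above: urls become URL,
# '@' mentions become USER, hashtags are kept verbatim because they
# sometimes carry event-specific tags (#BombBlast, #EarthQuake).
def preprocess(text):
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)
    return text
```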
3.1 Event Detection Performance
We used the mallet library to build the binary classifier for event detection. The training dataset consisted of an event-text file and an annotation file. The event-text file contains one message per line, with additional details such as user-id and message-id. The annotation file gives the message-id and event index for each event-carrying message in the event-text file. We prepared the binary classification data by treating the messages missing from the annotation file as labelled 'no' and the matched events as 'yes'. The performance of the classifier is measured using two metrics, Accuracy and F1-Score, defined in equations (9)-(12).

Accuracy = No. of correct predictions / Total no. of predictions    (9)

Precision = TruePositives / TotalPositivePredictions    (10)

Recall = TruePositives / TotalPositiveExamples    (11)

F1-Score = 2 * Precision * Recall / (Precision + Recall)    (12)

We performed 5-fold cross-validation for event detection. The classification accuracies and F1-Scores of the NB, MaxEnt and NBEM classifiers are reported in Table 2.

Table 2: Event Detection Accuracy & F1 Score

Language  | Method       | Accuracy | F1-Score
Hindi     | NB           | 73.66    | 0.6304
Hindi     | MaxEnt       | 75.61    | 0.6641
Hindi     | NBEM         | 80.00    | 0.7218
Hindi     | NB + POS     | 80.0     | 0.7730
Hindi     | MaxEnt + POS | 80.0     | 0.7551
Hindi     | NBEM + POS   | 83.57    | 0.8096
Malayalam | NB           | 72.88    | 0.6243
Malayalam | MaxEnt       | 82.73    | 0.8583
Malayalam | NBEM         | 82.75    | 0.797
Tamil     | NB           | 76.79    | 0.8437
Tamil     | MaxEnt       | 77.05    | 0.8481
Tamil     | NBEM         | 80.00    | 0.866

We observe that the MaxEnt classifier performs significantly better than NB in all cases, asserting the superiority of discriminative approaches. Adding POS-tag features improved the classification accuracy of MaxEnt and NB by 4.4% and 6.4% respectively. NBEM, the semi-supervised approach, is consistently better than the other two methods both with and without POS tags, as it uses the test portion of the data to learn its model.
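The detection metrics of equations (9)-(12) can be computed as follows from the gold and predicted 'yes'/'no' labels. This is an illustrative sketch, not the shared task's evaluation script.

```python
# Metrics of equations (9)-(12) for the binary event-detection labels.
# 'yes' is the positive class; max() guards against division by zero.
def detection_metrics(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p == "yes")
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    accuracy = correct / len(gold)
    precision = tp / max(1, sum(1 for p in pred if p == "yes"))
    recall = tp / max(1, sum(1 for g in gold if g == "yes"))
    f1 = 2 * precision * recall / max(1e-12, precision + recall)
    return accuracy, precision, recall, f1
```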
3.2 Event Extraction Accuracy
The event extraction module produces the tags B, I, O, indicating the beginning, inside and outside of an event. The accuracy and F1-Score of equations (9) and (12) are extended to the CRF output, and the tagging effectiveness is reported in Table 3. Our submission includes only one language, namely hindi.

Table 3: Event Extraction Accuracy & F1 Score

Language | Precision | Recall | F-measure
Hindi    | 31.56     | 71.39  | 43.77

REFERENCES
[1] James Allan, Ron Papka, and Victor Lavrenko. 1998. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 37–45.
[2] Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 389–398.
[3] Jethro Borsje, Leonard Levering, and Flavius Frasincar. 2008. Hermes: a semantic web-based news decision support system. In Proceedings of the 2008 ACM symposium on Applied computing. ACM, 2415–2420.
[4] Philippe Capet, Thomas Delavallade, Takuya Nakamura, Agnes Sandor, Cedric Tarsitano, and Stavroula Voyatzi. 2008. A Risk Assessment System with Automatic Extraction of Event Types. Springer US, Boston, MA, 220–229. https://doi.org/10.1007/978-0-387-87685-6_27
[5] Philipp Cimiano and Steffen Staab. 2004. Learning by Googling. SIGKDD Explor. Newsl. 6, 2 (Dec. 2004), 24–33. https://doi.org/10.1145/1046456.1046460
[6] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph M Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In LREC, Vol. 2. 837–840.
[7] Shunsuke Kamijo, Yasuyuki Matsushita, Katsushi Ikeuchi, and Masao Sakauchi. 2000. Traffic monitoring and accident detection at intersections. IEEE Transactions on Intelligent Transportation Systems 1, 2 (2000), 108–118.
[8] Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. TwiNER: Named Entity Recognition in Targeted Twitter Stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12). ACM, New York, NY, USA, 721–730. https://doi.org/10.1145/2348283.2348380
[9] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
[10] Alex Micu, Laurens Mast, Viorel Milea, Flavius Frasincar, and Uzay Kaymak. 2009. Financial news analysis using a semantic web approach. In Semantic Knowledge Management: An Ontology-Based Framework. IGI Global, 311–328.
[11] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39, 2 (May 2000), 103–134. https://doi.org/10.1023/A:1007692713085
[12] Avinesh PVS and G Karthik. 2007. Part-of-speech tagging and chunking using conditional random fields and transformation based learning. In Proceedings of IJCAI Workshop on Shallow Parsing for South Asian Languages, Vol. 21. 21–25.
[13] Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, Nathan Self, Rupinder Khandpur, Parang Saraf, Wei Wang, Jose Cadena, Anil Vullikanti, Gizem Korkmaz, Chris Kuhlman, Achla Marathe, Liang Zhao, Ting Hua, Feng Chen, Chang Tien Lu, Bert Huang, Aravind Srinivasan, Khoa Trinh, Lise Getoor, Graham Katz, Andy Doyle, Chris Ackermann, Ilya Zavorin, Jim Ford, Kristen Summers, Youssef Fayed, Jaime Arredondo, Dipak Gupta, and David Mares. 2014. 'Beating the News' with EMBERS: Forecasting Civil Unrest Using Open Source Indicators. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1799–1808. https://doi.org/10.1145/2623330.2623373
[14] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1524–1534. http://dl.acm.org/citation.cfm?id=2145432.2145595
[15] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web. ACM, 851–860.
[16] Charles Sutton, Andrew McCallum, et al. 2012. An introduction to conditional random fields. Foundations and Trends in Machine Learning 4, 4 (2012), 267–373.
[17] Yiming Yang, Tom Pierce, and Jaime Carbonell. 1998.
A study of retrospective and on-line event detection. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 28–36.