Event extraction from Social Media text using Conditional Random Fields

Nagesh Bhattu Sristy, IDRBT, Hyderabad, Telangana (nageshbs@idrbt.ac.in)
N. Satya Krishna*, IDRBT, Hyderabad, Telangana (satya.krishna.nunna@gmail.com)
D. V. L. N. Somayajulu, NIT Warangal, Warangal, Telangana (soma@nitw.ac.in)

* Work done as part of Ph.D.

ABSTRACT
Social media tools have popularized digital devices among the masses, making information dissemination easier and faster. Exchange of text is the most popular and effective means of communication among social media users. Since these messages reach a wide audience, it has become necessary to process and understand their semantics. Event extraction refers to understanding the events across streams of social media messages. It helps in taking quicker corrective actions in the case of natural calamities, and can thus possibly save lives. The main objective of the task is to draw specific knowledge from digital text in order to predict the events (incidents) it describes. We propose a two-step procedure to extract events. The first phase applies a binary classifier to identify the messages containing an event. The second phase applies a sequence labeling technique, conditional random fields (CRF), to extract the event from the message. As social media text is noisy, it is a challenge to develop learning algorithms for these tasks. We use part-of-speech (POS) tags of the words to address some of the issues in this challenge.

CCS CONCEPTS
• Information systems → Structured text search;

KEYWORDS
ACM proceedings, LaTeX, text tagging

1 INTRODUCTION
Twitter (http://www.twitter.com) and Facebook (http://www.facebook.com) messages provide up-to-date information about current events. In today's world, with the proliferation of social media usage, extracting current events from unstructured tweets and posts has gained ample attention. Social media data has unusual characteristics such as short length, stylistic variations, acronyms, and noisy, unstructured forms. This makes event extraction a challenging problem in Natural Language Processing (NLP). Numerous tools have been developed [14], [8] in the recent past to enable short-text processing for tasks such as POS tagging, chunking and named entity recognition. These tools use special techniques to account for out-of-vocabulary terms in text collected from twitter. Ritter et al. [14] use Brown clustering on a huge corpus of english tweets to cluster similarly used words such as '2morrow', 'tmrrow', '2mar'. In comparison to english, the resources available for the remaining languages are quite scarce, even for formal text. This shared task pertains to processing text written in the Indian languages hindi, tamil and malayalam to obtain the events from social media messages.

Event extraction is one of the most valuable tasks in natural language processing and information extraction. For example, accurate selection of news messages improves the performance of news systems [5]. Furthermore, by detecting the occurrence of events as early as possible, the performance of risk analysis systems (Capet et al. [4]) and traffic monitoring systems (Kamijo et al. [7]) can be improved, and civil unrest can be forecast [13].

Most early methods for event extraction (Allan et al. [1], [6], Yang et al. [17]) focused on news articles, which were then the best source of information on current events. With the ability of social media tools to virally popularize news items, and their acceptance across the masses, numerous media agencies have come to rely on twitter and facebook feed pages to disseminate their news highlights. Twitter feeds for hindi (https://twitter.com/aajtak?lang=en, https://twitter.com/bbchindi?lang=en), tamil (https://twitter.com/news7tamil?lang=en, https://twitter.com/thatstamil?lang=en) and malayalam (https://twitter.com/manoramanews?lang=en, https://twitter.com/beatsofkerala?lang=en) are a few examples of social media forums continuously posting news items. Among the posts made by these feeds, only a small fraction of tweets contain events.

Alan Ritter et al. [1] developed the first open-domain event extraction tool (TWICAL) for twitter data. Extraction of information about NASDAQ-100 listed companies from RSS feeds using StockWatcher was proposed in [10]. Hermes (Borsje et al. [3]) is a news interpreter that supports the decision-making process of filtering relevant news using Semantic Web technologies. Using specific features related to natural disasters, Sakaki et al. [15] proposed a method to detect earthquake-related tweets. Benson et al. [2] presented a relation extractor to predict artists and venues from tweets.
2 OVERVIEW
Social media text written in Indian languages has received much less attention than english. The multiplicity of languages in India, and their use by comparatively smaller populations, can be thought of as possible reasons for this observation. Tokenization is the first step in processing social media text written in Indian languages. Effective tokenization helps in segregating meaningful features from noise. Feature extraction involves converting the text into lemma form and morphological analysis. POS tagging attributes POS tags to the text, observed as a sequence of words. We depend on the available tools [12] (http://ltrc.iiit.ac.in/download.php) for POS tagging of sentences in the various Indian languages. The task is to identify event-carrying messages and then extract the relevant portions of the events from such event-carrying text. The main phase of the approach consists of building a classifier to distinguish messages based on whether or not they carry an event. Post-processing is then applied to extract the actual event from the event text. This is done through a sequence tagger trained on the provided training data. Figure 1 gives an overview of our approach for the shared task.

[Figure 1: Overview of the approach followed]

We use the mallet toolkit [9] (http://mallet.cs.umass.edu/index.php) for the various machine learning algorithms applied in this work.

2.1 Problem Statement
Given a collection D of word sequences, where each sentence is of the form w_1, w_2, ..., w_n, identify the sequences carrying events. For the identified sequences, a label sequence l_1, l_2, ..., l_n is predicted, where each tag l_i ∈ {B, I, O}. The tags B, I, O indicate the beginning, inside and outside of an event in event-carrying text.

2.2 Tokenization & POS Tagging
Tokenization in our approach follows that of twokenizer (https://github.com/brendano/tweetmotif). It was extended to handle the unicode ranges of the Indian languages: hindi, tamil and malayalam occupy the unicode ranges 0x0900-0x097F, 0x0B80-0x0BFF and 0x0D00-0x0D7F respectively. Emoticons, urls, hashtags and userids are various other special tokens. The POS taggers used (http://ltrc.iiit.ac.in/download.php) have their own tokenizers and lemmatizers for morphological analysis. These tools are designed for newswire text, which is supposed to be much cleaner than the text seen on social media. We modified the respective tokenizers to process social media text.
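The tokenizer extension for Indic scripts described in Section 2.2 can be sketched as a regular expression over the unicode ranges listed above. The pattern and function names below are illustrative assumptions, not the actual twokenizer code.

```python
import re

# Token patterns for a twokenizer-style tokenizer extended to the
# Indic unicode blocks mentioned above (Devanagari U+0900-097F,
# Tamil U+0B80-0BFF, Malayalam U+0D00-0D7F). Alternatives are tried
# left to right, so urls and mentions win over plain word runs.
TOKEN_RE = re.compile(
    r"https?://\S+"            # urls
    r"|[@#]\w+"                # mentions and hashtags
    r"|[\u0900-\u097F]+"       # Devanagari (hindi) runs
    r"|[\u0B80-\u0BFF]+"       # Tamil runs
    r"|[\u0D00-\u0D7F]+"       # Malayalam runs
    r"|[A-Za-z0-9']+"          # latin words and numbers
    r"|\S"                     # any other single symbol (emoticons etc.)
)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

A real tokenizer would also need to handle combining marks across script boundaries and multi-character emoticons; this sketch only shows how the unicode ranges slot into the pattern.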
2.3 Event Detection
The critical component of our solution is the classifier which detects the events. We analysed the detection capabilities of three classifiers, namely naive Bayes, logistic regression and semi-supervised naive Bayes EM.

The naive Bayes (NB) algorithm models the classifier as the outcome of a generative process. If D is a training dataset of N examples x_1, x_2, ..., x_N with corresponding labels y_1, y_2, ..., y_N, the NB model is expressed in equation (1). We assume that each x_i ∈ R^F and each y_i ∈ {1, 2, ..., L}, where F is the number of features and L is the number of labels. x_{ij} denotes the count of the j'th feature in the i'th example, w_j is the j'th feature in the feature set, and p(w_j | y) is the j'th element of the parameter vector associated with label y.

maximize \sum_{i=1}^{N} \log p(x_i, y_i)    (1)

p(x_i, y_i) = p(y_i) p(x_i | y_i)    (2)

p(x_i, y_i) = p(y_i) \prod_{j=1}^{F} p(w_j | y_i)^{x_{ij}}    (3)

Naive Bayes EM (NBEM) is a semi-supervised algorithm [11] which makes use of the test data along with the training data to infer the classifier. If the training and testing portions of the dataset are designated D_l and D_u respectively, NBEM learns the classifier as a maximizer of (4). The first part of the objective is the same as that of the NB approach. The labels of the test examples are not known and are learnt by an iterative EM algorithm, in which the E-step predicts the labels of the test examples and the M-step learns the parameters of the model from the probabilistic labels produced in the E-step.

maximize \sum_{i \in D_l} \log p(x_i, y_i) + \sum_{i \in D_u} \log \sum_{y=1}^{L} p(x_i, y)    (4)

Maximum entropy or logistic regression (MaxEnt) is a discriminative approach to building the classifier and hence does not benefit much from the EM setting. MaxEnt learns the model which maximizes the objective in equation (6). The conditional distribution in equation (7) is the softmax function. The numerator of equation (7) is the score of example x_i in class y_i; the denominator is a normalizer which ensures that p(y|x) sums to 1. The method is known as maximum entropy because equation (6) is the dual of an equivalent entropy maximization under feature constraints. \mu_y is the parameter vector corresponding to class y, of the same size as F (the number of features); L is the total number of labels.

maximize \sum_{i=1}^{N} \log p(y_i | x_i)    (6)

p(y_i | x_i) = \exp(\mu_{y_i}^T x_i) / \sum_{y=1}^{L} \exp(\mu_y^T x_i)    (7)
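The naive Bayes model of equations (1)-(3) can be sketched as follows. This is a minimal count-based illustration with Laplace smoothing, not the mallet implementation used in the paper.

```python
import numpy as np

# Minimal multinomial naive Bayes sketch of equations (1)-(3):
# p(x, y) = p(y) * prod_j p(w_j | y)^x_ij, fit by counting.
# X is an (N, F) feature-count matrix, y holds labels 0..L-1.
def fit_nb(X, y, n_labels, alpha=1.0):
    N, F = X.shape
    log_prior = np.zeros(n_labels)
    log_lik = np.zeros((n_labels, F))
    for c in range(n_labels):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / N)           # log p(y=c)
        counts = Xc.sum(axis=0) + alpha              # smoothed feature counts
        log_lik[c] = np.log(counts / counts.sum())   # log p(w_j | y=c)
    return log_prior, log_lik

def predict_nb(X, log_prior, log_lik):
    # argmax_y  log p(y) + sum_j x_ij * log p(w_j | y)
    return np.argmax(log_prior + X @ log_lik.T, axis=1)
```

Working in log space turns the product of equation (3) into the dot product above and avoids numerical underflow for long messages.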
2.4 Event Extraction
Event extraction is performed using a CRF [16]. If (x_i, y_i) is the i'th example, where x_i is a feature sequence of length T and y_i is the corresponding label sequence, the CRF maximizes the conditional likelihood of equation (6), with p(y_i | x_i) defined as in equation (8). The difference between the numerators of equations (7) and (8) lies in the features considered: the CRF models sequential dependencies, while the MaxEnt classifier disregards them. The feature vector f(x_i, y_i) in equation (8) is similar to the feature vector in equation (7), encoding the number of times a feature is associated with a label. The feature vector g(y_i) encodes the label-sequence features, i.e. the number of times a label combination appears in succession in the example (x_i, y_i); g(y_i) has L × L features, each encoding a possible label bigram. The denominator Z(x_i) is a normalizer evaluated over all possible label assignments (L^T for x_i) over the label space; it is computed efficiently using the forward-backward algorithm. \mu and \eta are the parameters associated with the node features and edge features of the CRF respectively.

p(y_i | x_i) = \exp(\mu^T f(x_i, y_i) + \eta^T g(y_i)) / Z(x_i)    (8)

We used CRF++ (https://taku910.github.io/crfpp/) as our sequence labeller. CRF++ accepts feature templates for learning. The features used are unigrams within a window of 5 words around the current word. Unigram features are extremely noisy, since most are seen very few times in the corpus and test-data unigrams are mostly seen for the first time; we therefore use POS-tag-based features as a second feature set, with a similar 5-tag window, to help the model inferred by the CRF in mimicking the event extraction process. We also used label bigram features as label-based features. The dataset provided for the shared task gives the start and end positions of the event within each matched event text. We converted this input format to B, I, O tagging to produce the input for CRF++. As the tokenization employed adds spaces, the output of the CRF is remapped onto the original text to recover the event positions in the form expected by the shared task.
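The conversion from annotated character spans to B, I, O tags described above can be sketched as follows, assuming one event span per message. The function name and input format are illustrative, not the shared task's actual file format.

```python
# Sketch of span-to-BIO conversion: given a message, its tokens, and the
# event's (start, end) character offsets from the annotation, emit one
# B/I/O tag per token. Tokens fully inside the span get B (first) or I;
# everything else gets O.
def spans_to_bio(text, tokens, start, end):
    tags, pos = [], 0
    begun = False
    for tok in tokens:
        tok_start = text.index(tok, pos)   # locate token in original text
        tok_end = tok_start + len(tok)
        pos = tok_end
        if tok_start >= start and tok_end <= end:
            tags.append("I" if begun else "B")
            begun = True
        else:
            tags.append("O")
    return tags
```

The inverse mapping (from predicted tags back to character offsets) can reuse the same `text.index` bookkeeping, which is essentially the remapping step mentioned above.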
3 EXPERIMENTS
The datasets used for the task are summarized in Table 1. Preprocessing replaces all urls with the keyword URL and '@' mentions with USER; hashtags are preserved, as they sometimes contain event-specific tags such as #BombBlast or #EarthQuake. The test set for hindi is four times the size of its training set, while for tamil and malayalam this ratio is more favourable. The number of unique words is smaller for malayalam.

Table 1: Dataset Characteristics

Language        | No of Instances | No of Tokens | Dictionary Size
Hindi Train     | 1025            | 19,497       | 4,381
Hindi Test      | 4451            | 89,167       | 11,420
Malayalam Train | 2218            | 18,449       | 4,065
Malayalam Test  | 5173            | 38,625       | 13,460
Tamil Train     | 3843            | 37,221       | 12,033
Tamil Test      | 5304            | 50,365       | 15,255
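The preprocessing step described above can be sketched as two substitutions. The regular expressions are an assumption of how urls and mentions are matched, not the paper's exact rules.

```python
import re

# Sketch of the preprocessing described above: urls become URL,
# '@' mentions become USER, hashtags are kept verbatim because they
# sometimes carry event-specific tags (#BombBlast, #EarthQuake).
def preprocess(text):
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)
    return text
```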
3.1 Event Detection Performance
We used the mallet library to build the binary classifier for event detection. The training dataset consisted of an event-text file and an annotation file. The event-text file contains one message per line, with additional details such as user-id and message-id. The annotation file gives the message-id and event index for each event-carrying message in the event-text file. We prepared the binary classification data by treating the messages missing from the annotation file as labelled 'no' and the matched events as 'yes'. The performance of the classifier is measured using two metrics, Accuracy and F1-Score, defined in equations (9)-(12).

Accuracy = No. of correct predictions / Total no. of predictions    (9)

Precision = TruePositives / TotalPositivePredictions    (10)

Recall = TruePositives / TotalPositiveExamples    (11)

F1-Score = 2 * Precision * Recall / (Precision + Recall)    (12)

We performed 5-fold cross-validation for event detection. The classification accuracies and F1-Scores of the NB, MaxEnt and NBEM classifiers are reported in Table 2.

Table 2: Event Detection Accuracy & F1 Score

Language  | Method       | Accuracy | F1-Score
Hindi     | NB           | 73.66    | 0.6304
Hindi     | MaxEnt       | 75.61    | 0.6641
Hindi     | NBEM         | 80.00    | 0.7218
Hindi     | NB + POS     | 80.0     | 0.7730
Hindi     | MaxEnt + POS | 80.0     | 0.7551
Hindi     | NBEM + POS   | 83.57    | 0.8096
Malayalam | NB           | 72.88    | 0.6243
Malayalam | MaxEnt       | 82.73    | 0.8583
Malayalam | NBEM         | 82.75    | 0.797
Tamil     | NB           | 76.79    | 0.8437
Tamil     | MaxEnt       | 77.05    | 0.8481
Tamil     | NBEM         | 80.00    | 0.866

We observe that the MaxEnt classifier performs significantly better than NB in all cases, asserting the superiority of discriminative approaches. Adding POS-tag features improved the classification accuracy of MaxEnt and NB by 4.4% and 6.4% respectively. NBEM, the semi-supervised approach, is consistently better than the other two methods both with and without POS tags, as it uses the test portion of the data to learn its model.
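The detection metrics of equations (9)-(12) can be computed as follows from the gold and predicted 'yes'/'no' labels. This is an illustrative sketch, not the shared task's evaluation script.

```python
# Metrics of equations (9)-(12) for the binary event-detection labels.
# 'yes' is the positive class; max() guards against division by zero.
def detection_metrics(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p == "yes")
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    accuracy = correct / len(gold)
    precision = tp / max(1, sum(1 for p in pred if p == "yes"))
    recall = tp / max(1, sum(1 for g in gold if g == "yes"))
    f1 = 2 * precision * recall / max(1e-12, precision + recall)
    return accuracy, precision, recall, f1
```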
3.2 Event Extraction Accuracy
The event extraction module produces the tags B, I, O, indicating the beginning, inside and outside of an event. The accuracy and F1-Score of equations (9) and (12) are extended to the CRF output, and the tagging effectiveness is reported in Table 3. Our submission includes only one language, namely hindi.

Table 3: Event Extraction Accuracy & F1 Score

Language | Precision | Recall | F-measure
Hindi    | 31.56     | 71.39  | 43.77

REFERENCES
[1] James Allan, Ron Papka, and Victor Lavrenko. 1998. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 37–45.
[2] Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 389–398.
[3] Jethro Borsje, Leonard Levering, and Flavius Frasincar. 2008. Hermes: a semantic web-based news decision support system. In Proceedings of the 2008 ACM symposium on Applied computing. ACM, 2415–2420.
[4] Philippe Capet, Thomas Delavallade, Takuya Nakamura, Agnes Sandor, Cedric Tarsitano, and Stavroula Voyatzi. 2008. A Risk Assessment System with Automatic Extraction of Event Types. Springer US, Boston, MA, 220–229. https://doi.org/10.1007/978-0-387-87685-6_27
[5] Philipp Cimiano and Steffen Staab. 2004. Learning by Googling. SIGKDD Explor. Newsl. 6, 2 (Dec. 2004), 24–33. https://doi.org/10.1145/1046456.1046460
[6] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph M Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In LREC, Vol. 2. 837–840.
[7] Shunsuke Kamijo, Yasuyuki Matsushita, Katsushi Ikeuchi, and Masao Sakauchi. 2000. Traffic monitoring and accident detection at intersections. IEEE Transactions on Intelligent Transportation Systems 1, 2 (2000), 108–118.
[8] Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. TwiNER: Named Entity Recognition in Targeted Twitter Stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12). ACM, New York, NY, USA, 721–730. https://doi.org/10.1145/2348283.2348380
[9] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
[10] Alex Micu, Laurens Mast, Viorel Milea, Flavius Frasincar, and Uzay Kaymak. 2009. Financial news analysis using a semantic web approach. In Semantic Knowledge Management: An Ontology-Based Framework. IGI Global, 311–328.
[11] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39, 2 (May 2000), 103–134. https://doi.org/10.1023/A:1007692713085
[12] Avinesh PVS and G Karthik. 2007. Part-of-speech tagging and chunking using conditional random fields and transformation based learning. In Proceedings of IJCAI Workshop on Shallow Parsing for South Asian Languages, Vol. 21. 21–25.
[13] Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, Nathan Self, Rupinder Khandpur, Parang Saraf, Wei Wang, Jose Cadena, Anil Vullikanti, Gizem Korkmaz, Chris Kuhlman, Achla Marathe, Liang Zhao, Ting Hua, Feng Chen, Chang Tien Lu, Bert Huang, Aravind Srinivasan, Khoa Trinh, Lise Getoor, Graham Katz, Andy Doyle, Chris Ackermann, Ilya Zavorin, Jim Ford, Kristen Summers, Youssef Fayed, Jaime Arredondo, Dipak Gupta, and David Mares. 2014. 'Beating the News' with EMBERS: Forecasting Civil Unrest Using Open Source Indicators. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1799–1808. https://doi.org/10.1145/2623330.2623373
[14] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1524–1534. http://dl.acm.org/citation.cfm?id=2145432.2145595
[15] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web. ACM, 851–860.
[16] Charles Sutton, Andrew McCallum, et al. 2012. An introduction to conditional random fields. Foundations and Trends in Machine Learning 4, 4 (2012), 267–373.
[17] Yiming Yang, Tom Pierce, and Jaime Carbonell. 1998.
A study of retrospective and on-line event detection. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 28–36.