=Paper= {{Paper |id=Vol-2036/T5-1 |storemode=property |title=EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 - An Overview |pdfUrl=https://ceur-ws.org/Vol-2036/T5-1.pdf |volume=Vol-2036 |authors=Pattabhi RK Rao,Sobha Lalitha Devi |dblpUrl=https://dblp.org/rec/conf/fire/RaoD17 }} ==EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 - An Overview== https://ceur-ws.org/Vol-2036/T5-1.pdf
EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 – An Overview
Pattabhi RK Rao and Sobha Lalitha Devi
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet, Chennai, India
+91 44 22232711
{pattabhi, sobha}@au-kbc.org


ABSTRACT
Social media platforms have made communication so fast that people across the world learn of an event happening in any corner of the world within a fraction of a second. The penetration of smart phones, tablets, etc. has significantly changed the way people communicate. Facebook and Twitter are two of the most popular social media platforms, where people post about events, their daily personal activities and plans, and also their thoughts, responses or reactions to any public cause or issue. In recent times we have seen how Facebook posts and tweets have helped in mobilizing people in states such as Tamil Nadu (TN) and Jammu & Kashmir (J&K) in India. The mass public protests for the “Jallikattu” event in TN and the stone pelting protests in J&K are prominent examples of how social media has impacted the common man. Real-time information about events or happenings is very valuable to the administration for disaster management, crowd control and public alerting. This information, when used in the development of recommender systems, also adds value for the growth of business enterprises. Thus there is a great need to develop systems for automatic event extraction. This paper presents an overview of the task “Event extraction in Indian languages”, a track in FIRE 2017. The task of this track is to extract events from social media text, specifically Twitter. Some of the main issues in handling such social media texts are i) spelling errors, ii) abbreviated new language vocabulary such as “gr8” for great, iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hash tags and v) code mixing, though in this track we have not considered code mixing. Though event extraction from Indian language texts is gaining attention in the Indian research community, there is no benchmark data available for testing the systems. Hence we have organized the Event Extraction in social media text track for Indian languages (EventXtract-IL) in the Forum for Information Retrieval Evaluation (FIRE). The paper describes the corpus created for three languages, viz., Hindi, Malayalam and Tamil, and presents an overview of the approaches used by the participants.

CCS Concepts
• Computing methodologies ~ Artificial intelligence
• Computing methodologies ~ Natural language processing
• Information systems ~ Information extraction

Keywords
Event Extraction; Social Media Text; Twitter; Indian Languages; Tamil; Hindi; Malayalam; Event Annotated Corpora for Indian Language data.

1.        INTRODUCTION
Over the past decade, Indian language content on various media types such as websites, blogs, email and chats has increased significantly. With the advent of smart phones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations, governments, etc. Content growth is thus driven by people from non-metros and small cities, who are generally more comfortable with their own mother tongue than with English. The growth of Indian language content is expected to increase by more than 70% every year. Hence there is a great need to process these data automatically. This requires natural language processing software systems which extract events, entities or the associations between them. Thus an automatic event extraction system is required.

The objectives of the evaluation are:
• Creation of benchmark data for Event Extraction in Indian language Social Media text.
• To encourage development of Event extraction systems for Indian language Social Media text.

Event extraction has been actively researched for over a decade. Most of the research has, however, been focused on resource rich languages, such as English, French and Spanish. The scope of this work covers the task of event recognition and extraction in social media text (Twitter data) for Indian languages. In the past there were events such as the Workshop on NER for South and South East Asian Languages (NER-SSEA, 2008) and the Workshop on South and South East Asian Natural Language Processing (SANLP, 2010 & 2011), conducted to bring the various research works on NER onto a single platform. The NER-IL tracks at FIRE (Forum for Information Retrieval Evaluation) in 2013, 2014 and 2015 and the Code Mix Entity Extraction track (CMEE-IL) in 2016 have contributed to the development of benchmark data and boosted research on NER for Indian languages. But it is observed that there is very little work on Indian language event extraction. User generated texts such as Twitter and Facebook texts are diverse and noisy. These texts contain non-standard spellings and abbreviations, and unreliable punctuation styles. Apart from these writing style and language challenges, another challenge is concept drift (Dredze et al., 2010; Fromreide et al., 2014); the distribution of language and topics on Twitter and Facebook is constantly shifting, thus leading to performance degradation of NLP tools over time.

Some of the main issues in handling of such texts are i) Spelling errors ii) Abbreviated new language vocabulary such as “gr8” for
great iii) use of symbols such as emoticons/emojis iv) use of meta tags and hash tags v) Code mixing.
For example:

“Muje kabi bhoolen gy to nhi na? :(
Want ur sweet feedback about my FC ? mai
dilli jaa rahi hoon”.

Research on analyzing social media data has been attempted in English through various shared tasks. The language identification in tweets (tweetLID) shared task held at SEPLN 2014 had the task of identifying tweets from six different languages. SemEval 2013, 2014 and 2015 held shared task tracks focused on sentiment analysis in tweets. They conducted two sub-tasks, namely contextual polarity disambiguation and message polarity classification. In Indian languages, Amitav et al (2015) organized a shared task titled 'Sentiment Analysis in Indian languages' as a part of MIKE 2015, where sentiment analysis was done for tweets in Hindi, Bengali and Tamil. Named entity recognition in Twitter was explored through a shared task on noisy user-generated text organized by Microsoft as part of ACL-IJCNLP 2015, which had two sub-tasks, namely Twitter text normalization and named entity recognition for English. The ESM-IL track at FIRE 2015 came up with entity annotated benchmark data for social media text, where each text was written in only one language. But there is no such shared task for event identification and extraction. Thus there is a need to develop systems that focus on social media texts for event extraction.

The paper is organized as follows: section 2 describes the challenges in event extraction on Indian languages. Section 3 describes the corpus annotation, the tag set and corpus statistics. Section 4 gives an overview of the approaches used by the participants and section 5 concludes the paper.

2.        GENERAL CHALLENGES IN INDIAN LANGUAGE EVENT EXTRACTION
The challenges in the development of event extraction systems for Indian languages from social media text arise due to several factors. One of the main factors is that there is no annotated data available for any of the Indian languages. Apart from the lack of annotated data, the other factors which differentiate Indian languages from European languages are the following:
     a)   Ambiguity – Ambiguity between common and proper nouns. E.g., a common word such as “Roja”, meaning rose flower, is also the name of a person.
     b)   Spell variations – One of the major challenges is that different people spell the same entity differently. For example, in Tamil the person name Roja is spelt as "rosa" or "roja".
     c)   Less Resources – Most of the Indian languages are less-resourced languages. There are no automated tools that can handle social media text for the preprocessing tasks required for NER, such as part-of-speech tagging and chunking.
Apart from these challenges we also find that development of automatic event recognition systems is difficult due to the following reasons:
    i) Tweets contain a huge range of distinct event types. Almost all these types are relatively infrequent, so even a large sample of manually annotated tweets will contain very few training examples.
    ii) In comparison with English, Indian languages have more dialectal variations. These dialects are mainly influenced by different regions and communities.
    iii) Indian language tweets are multilingual in nature and predominantly contain English words.
The following examples illustrate the usage of English words and spoken, dialectal forms in the tweets.
Example 1 (Tamil):
Ta: Stamp veliyittu ivaga ativaangi …..
En: stamp released these_people get_beaten ….
Ta: othavaangi …. kadasiya kovai
En: get_slapped … at_end kovai
Ta: pooyi pallakaatti kuththu vaangiyaachchu.
En: gone show_tooth punch got
(“They released stamp, got slapping and beating … at the end reached Kovai and got punched on the face”)
This example is a Tamil tweet written in a particular dialect which also uses English words.

Similarly, in Hindi we find a lot of spelling variations. For example, for the words “mumbai”, “gaandhi”, “sambandh” and “thanda” there are at least three different spelling variations.

3.        CORPUS DESCRIPTION
The corpus was collected using the Twitter API in two different time periods. The training partition of the corpus was collected during June 2017 and the test partition was collected during Aug 2017. As explained in the above sections, we observe concept drift in Twitter data. Thus, to evaluate how the systems handle concept drift, we collected data in two different time periods. In this present initiative the corpus is available for three Indian languages: Hindi, Malayalam and Tamil. Table 1 shows the corpus statistics.

ANNOTATION TAGSET
The corpus for each language was annotated manually by trained experts. The event extraction task requires identifying the event trigger keyword and the full event predicate and representing it with a tag. In this work, the data is tagged with one single tag, “Event”, which marks a single phrase consisting of the event trigger and the event predicate. For example, “Governor for Tamil Nadu appointed”. We find that most of the works on event extraction in English use the Automatic Content Extraction (ACE) Event tag set. In the present work for this track we have focused only on the extraction of one event phrase, which consists of the event trigger and the whole event predicate, giving the information of where and when the event has happened and which participants are involved in the event. As there is not much work in this area in Indian languages, and to keep the task definition simple, in this edition we have not taken up the identification of event types, or of the where and who of the events individually.

DATA FORMAT
The participants were provided the data with annotation markup in a separate file called the annotation file. The raw tweets were to be separately downloaded using the Twitter API. The annotation file
is a column format file, where the columns are tab separated. It consists of the following columns:

    i) Tweet_ID
    ii) User_Id
    iii) Event string
    iv) Event Start_Index
    v) EventString_Length
For example:

          Tweet_ID:890123456782341
          User_Id:987654321
          EventString: TN Governor appointed
          Index:43
          Length:21

The Index column is the starting character position of the Event string, calculated for each tweet, and the count starts from ‘0’. The participants were also instructed to provide the test file annotations in the same format as given for the training data.
The dataset statistics are as follows:
          Table 1. Corpus Statistics
  Language            No. of Tweets   No. of Events
  Hindi               5476            1533
  Malayalam           7391            1733
  Tamil               9147            2074

The data has events of different types such as cyclones, floods, accidents, disease outbreaks and political events. The majority of the events were disasters and political events such as inaugurations/opening ceremonies by political leaders. The data also had events on movie or audio release functions.

4.        SUBMISSION OVERVIEWS
The evaluation metrics used for this task are Precision, Recall and F-measure, which are the widely used metrics for this task. A total of 16 teams registered for participating in the track. The final submission was done by 4 of the 16 teams. They submitted their test runs for evaluation with multiple runs. A total of 11 test runs were submitted for evaluation. Only 1 team participated for all the three languages. One team each participated for Hindi, Tamil and Malayalam.

We developed a base system without using any pre-processing or lexical resources. The baseline system was developed using a CRF classifier which marks whether a word is part of an event phrase or not. The baseline system was developed so that it would help in making a better comparative study. The system performance is: precision of 23.87% and recall of 29.67%. It is observed that all the teams outperformed the base system. In the following paragraphs we briefly describe the approaches used by each team. The results of the teams are given in Table 3.

     a)   The Alapan team used neural networks to develop their system, specifically a CNN in combination with an LSTM. They first remove URLs, emoticons, etc. from the tweets. No NLP pre-processing such as POS tagging or chunking is done on the tweets. This team participated in all three languages and submitted 2 runs for each language.

     b)   The Sharmila team used SVMs for developing their system. The data was preprocessed for tokenization and no cleaning was performed. The task is modeled as a simple binary classification task. The team participated for Tamil and submitted three runs.

     c)   The Nageshbhattu team used CRFs for the task. This team pre-processed the data with Part-of-Speech (POS) tagging. They used POS tags and words in a window of 5 as features for CRF learning. One interesting aspect is that a POS tagger built for general text was used on the tweet data. It will be interesting to know how well a general newswire POS engine performs on tweet data. This team participated for Hindi and submitted one run.

     d)   The Manju team used an open source tool called BeautifulSoup to identify the events. This tool is meant for web scraping, but here it was used for event classification. The choice of the tool is not appropriate for this task. In fact this method can be called a “blind method”, where almost all the input tweets are marked as events, and by default about 1/5th of them have come out correct. This team participated for Malayalam and submitted one run.

The different methodologies used by the teams are summarized in Table 2.

Evaluation

The evaluation metrics used are precision, recall and F-measure. All the systems have been evaluated automatically by comparing their output with the gold data. The results obtained for each participant are shown in Table 3.
One main condition in event phrase identification is related to the event span. The span or extent of the event phrase is to be optimally minimal; it should include the event trigger and the predicate. Consider the example below:

Hi: bahut dinom se kahi jA rahi rAjyapAl ki niyukti, tamilnadu me naye rAjyapAl ki niyukti huA.

Here the event phrase is “tamilnadu me naye rAjyapAl ki niyukti”. It cannot be just “rAjyapAl ki niyukti”. The event trigger is “niyukti”. The event predicate is “tamilnadu me naye rAjyapAl”, from which we get the information where and what.

So the participating systems need to identify this exact event phrase. Any system output which has tagged anything more than this extent is considered wrong.

Thus we define:
Precision, P = (No. of Events correctly identified by the system) / (Total No. of Events identified by the system)

Recall, R = (No. of Events correctly identified by the system) / (Total No. of Events identified in the Gold)

F-measure = (2*P*R) / (P+R)
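
The exact-span scoring described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the evaluation script used in the track: the names `Annotation`, `parse_annotations` and `score` are hypothetical, each annotation row is assumed to hold the five tab-separated columns of the data format, and an event is assumed to count as correct only when its tweet ID, start index and length all match a gold entry.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """One row of the annotation file (hypothetical container name)."""
    tweet_id: str
    user_id: str
    event_string: str
    start: int   # 0-based character offset of the event string in the tweet
    length: int  # number of characters in the event string


def parse_annotations(text: str) -> List[Annotation]:
    """Parse tab-separated rows: Tweet_ID, User_Id, Event string, index, length."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tweet_id, user_id, event, start, length = line.split("\t")
        rows.append(Annotation(tweet_id, user_id, event, int(start), int(length)))
    return rows


def score(gold: List[Annotation], system: List[Annotation]) -> Tuple[float, float, float]:
    """Return (P, R, F-measure), assuming an exact span match is required."""
    gold_spans = {(a.tweet_id, a.start, a.length) for a in gold}
    system_spans = {(a.tweet_id, a.start, a.length) for a in system}
    correct = len(gold_spans & system_spans)
    p = correct / len(system_spans) if system_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Gold row taken from the data format example; the system rows are made up.
gold = parse_annotations("890123456782341\t987654321\tTN Governor appointed\t43\t21\n")
system = parse_annotations(
    "890123456782341\t987654321\tTN Governor appointed\t43\t21\n"
    "890123456782341\t987654321\tGovernor appointed\t46\t18\n"  # span differs from gold
)
precision, recall, f_measure = score(gold, system)
print(precision, recall, round(f_measure, 4))
```

In this toy run one of the two system spans matches the gold span, so precision is 0.5 and recall is 1.0. Applying the same F-measure formula to the reported first Hindi run of IIT Kgp (precision 36.58, recall 79.02) gives 50.01, which agrees with Table 3.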
5.        CONCLUSION
The main objective of creating benchmark data representing a few of the popular Indian languages has been achieved. This data has been made available to the research community free of charge for research purposes. The data is user generated and is not specific to any genre. Efforts are still going on to standardize this data and make it a sound data set for future researchers. We observe that the results obtained are almost similar for all the languages. We hope to see more publications in this area in the coming days from the research groups who could not submit their results. We also expect that more groups will start using this data for their research work.
This EventXtract-IL track is one of the first efforts towards the creation of event annotated user generated data for Indian languages. The data being generic, it could be used for developing generic systems upon which a domain specific system could be built after customization. In the next edition of this track we plan to add more data and also include identification and extraction of event types, event cause-effects and event participants.

6.        ACKNOWLEDGMENTS
We thank the FIRE 2017 organizers for giving us the opportunity to conduct the evaluation exercise. We also thank the Language Editors in CLRG, AU-KBC Research Centre.

7.        REFERENCES
[1]      Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, Víctor Fresno. 2014. TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org 2014.
[2]      Mark Dredze, Tim Oates, and Christine Piatko. 2010. “We’re not in Kansas anymore: detecting domain changes in streams”. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 585–595. Association for Computational Linguistics.
[3]      Hege Fromreide, Dirk Hovy, and Anders Søgaard. 2014. “Crowdsourcing and annotating NER for Twitter #drift”. European language resources distribution agency.
[4]      H.T. Ng, C.Y. Lim, S.K. Foo. 1999. “A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation”. In Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources (SIGLEX99). Maryland. pp. 9-13.
[5]      Preslav Nakov, Torsten Zesch, Daniel Cer and David Jurgens. 2015. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).
[6]      Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).
[7]      Rajeev Sangal and M. G. Abbas Malik. 2011. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP).
[8]      Aravind K. Joshi and M. G. Abbas Malik. 2010. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP). (http://www.aclweb.org/anthology/W10-36)
[9]      Rajeev Sangal, Dipti Misra Sharma and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. (http://www.aclweb.org/anthology/I/I08/I08-03)
[10]     Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R and Sobha Lalitha Devi. 2014. Proceedings of Named-Entity Recognition Indian Languages track at FIRE 2014. http://au-kbc.org/nlp/NER-FIRE2014/
Table 2. Participant Team Overview - Summary

Team: Alapan – IIT-Kgp
  Languages & Submissions:            i) Hindi: 2 runs  ii) Malayalam: 2 runs  iii) Tamil: 2 runs
  System Approaches (ML method) Used: Run 1: Neural Networks – CNN architecture with LSTM, pipelined process flow
                                      Run 2: Neural Networks – CNN architecture with LSTM, non-pipelined process flow
  Pre-Processing Step:                Tweet preprocessor alone used to eliminate http links, emoticons
  Lexical Resources Used:             NIL
  Open Source NLP Tools Used:         CNN – ML tool
  Variation Between Runs:             Pipelined process flow and non-pipelined process flow

Team: Sharmila – Karpagam Eng. College (KEC)
  Languages & Submissions:            i) Tamil: 3 runs
  System Approaches (ML method) Used: SVMs – words, prefixes, suffixes and shape features used
  Pre-Processing Step:                Tweet cleaning and tokenization
  Lexical Resources Used:             NIL
  Open Source NLP Tools Used:         SVM tool kit
  Variation Between Runs:             Run 1: C-parameter of SVM is tuned
                                      Run 2: Without any parameter tuning of the SVM tool kit
                                      Run 3: Tuning of all other parameters of the SVM tool kit

Team: Nageshbhattu – IDRBT
  Languages & Submissions:            i) Hindi: 1 run
  System Approaches (ML method) Used: CRFs
  Pre-Processing Step:                NLP pre-processing – uses general text POS tagger
  Lexical Resources Used:             NIL
  Open Source NLP Tools Used:         POS tagger and CRFs tool kit
  Variation Between Runs:             N/A

Team: Manju – CEC, Cherthala
  Languages & Submissions:            i) Malayalam: 1 run
  System Approaches (ML method) Used: NIL
  Pre-Processing Step:                Tweet cleaning
  Lexical Resources Used:             NIL
  Open Source NLP Tools Used:         BeautifulSoup – a web scraping tool
  Variation Between Runs:             N/A

Table 3. Evaluation Results of Participating Systems

                              Submission 1              Submission 2              Submission 3
Team              Language    Prec %  Rec %   F-m %     Prec %  Rec %   F-m %     Prec %  Rec %   F-m %
IIT Kgp           Hindi       36.58   79.02   50.01     31.42   56.37   40.35     NA      NA      NA
IIT Kgp           Malayalam   32.98   90.20   48.29     39.98   57.50   47.17     NA      NA      NA
IIT Kgp           Tamil       43.16   64.77   51.80     39.73   49.33   44.01     NA      NA      NA
IDRBT Hyderabad   Hindi       31.56   71.39   43.77     NA      NA      NA        NA      NA      NA
KEC, Coimbatore   Tamil       39.10   62.28   48.04     38.05   51.81   43.88     38.44   61.14   47.20
CEC Cherthala     Malayalam   21.43   67.17   32.40     NA      NA      NA        NA      NA      NA