=Paper=
{{Paper
|id=Vol-2036/T5-1
|storemode=property
|title=EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 - An Overview
|pdfUrl=https://ceur-ws.org/Vol-2036/T5-1.pdf
|volume=Vol-2036
|authors=Pattabhi RK Rao,Sobha Lalitha Devi
|dblpUrl=https://dblp.org/rec/conf/fire/RaoD17
}}
==EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 - An Overview==
Pattabhi RK Rao and Sobha Lalitha Devi
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet,
Chennai, India
+91 44 22232711
{pattabhi, sobha}@au-kbc.org
ABSTRACT

Today, communication through social media platforms has become so fast that people across the world learn of an event happening in any nook and corner of the world within a fraction of a second. The penetration of smartphones, tablets, etc. has significantly changed the way people communicate. Facebook and Twitter are the two most popular social media platforms, where people post about events, their personal daily activities and plans. They also post their thoughts, responses or reactions to any public cause or issue. In recent times we have seen how Facebook posts and tweets have helped in mobilizing people in states such as Tamil Nadu (TN) and Jammu & Kashmir (J&K) in India. The mass public protests for the “Jallikattu” event in TN and the stone-pelting protests in J&K are prominent examples of how social media has impacted the common man. Information about events or happenings in real time is very valuable to the administration for disaster management, crowd control and public alerting. This information, when used in the development of recommender systems, adds value for the growth of business enterprises. Thus there is a great need to develop systems for automatic event extraction. This paper presents an overview of the task “Event extraction in Indian languages”, a track in FIRE 2017. The task of this track is to extract events from social media text, specifically Twitter. Some of the main issues in handling such social media texts are i) spelling errors, ii) abbreviated new-language vocabulary such as “gr8” for great, iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hash tags, and v) code mixing, though in this track we have not considered code mixing. Although event extraction from Indian language texts is gaining attention among the Indian research community, there is no benchmark data available for testing the systems. Hence we have organized the Event Extraction in social media text track for Indian languages (EventXtract-IL) in the Forum for Information Retrieval Evaluation (FIRE). The paper describes the corpus created for three languages, viz., Hindi, Malayalam and Tamil, and presents an overview of the approaches used by the participants.

CCS Concepts

• Computing methodologies ~ Artificial intelligence

• Computing methodologies ~ Natural language processing

• Information systems ~ Information extraction

Keywords

Event Extraction; Social Media Text; Twitter; Indian Languages; Tamil; Hindi; Malayalam; Event Annotated Corpora for Indian Language data.

1. INTRODUCTION

Over the past decade, Indian language content on various media types such as websites, blogs, email and chats has increased significantly, and it is observed that with the advent of smartphones more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations, governments, etc. Thus content growth is driven by people from non-metros and small cities, who are generally more comfortable with their own mother tongue than with English. Indian language content is expected to grow by more than 70% every year. Hence there is a great need to process these data automatically. This requires natural language processing software systems which extract events, entities or the associations among them. Thus an automatic event extraction system is required.

The objectives of the evaluation are:

• Creation of benchmark data for Event Extraction in Indian language Social Media text.

• To encourage development of Event extraction systems for Indian language Social Media text.

Event extraction has been actively researched for over a decade. Most of the research has, however, been focused on resource-rich languages such as English, French and Spanish. The scope of this work covers the task of event recognition and extraction in social media text (Twitter data) for Indian languages. In the past, events such as the Workshop on NER for South and South East Asian Languages (NER-SSEA, 2008) and the Workshop on South and South East Asian Natural Language Processing (SANLP, 2010 & 2011) were conducted to bring the various research works on NER onto a single platform. The NER-IL tracks at FIRE (Forum for Information Retrieval and Evaluation) in 2013, 2014 and 2015 and Code Mix Entity Extraction (CMEE-IL) in 2016 have contributed to the development of benchmark data and boosted the research towards NER for Indian languages. But it is observed that there is very little work on Indian language event extraction. User generated texts such as Twitter and Facebook texts are diverse and noisy. These texts contain non-standard spellings and abbreviations and unreliable punctuation styles. Apart from these writing style and language challenges, another challenge is concept drift (Dredze et al., 2010; Fromreide et al., 2014): the distribution of language and topics on Twitter and Facebook is constantly shifting, which leads to performance degradation of NLP tools over time.

Some of the main issues in handling such texts are i) spelling errors, ii) abbreviated new-language vocabulary such as “gr8” for great, iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hash tags, and v) code mixing.

For example:

“Muje kabi bhoolen gy to nhi na? :( Want ur sweet feedback about my FC ? mai dilli jaa rahi hoon”.

Research in analyzing social media data has been attempted for English through various shared tasks. The language identification in tweets (tweetLID) shared task held at SEPLN 2014 had the task of identifying tweets from six different languages. SemEval 2013, 2014 and 2015 held shared task tracks that focused on sentiment analysis in tweets. They conducted two sub-tasks, namely contextual polarity disambiguation and message polarity classification. In Indian languages, Amitav et al (2015) organized a shared task titled 'Sentiment Analysis in Indian languages' as a part of MIKE 2015, where sentiment analysis was done for tweets in Hindi, Bengali and Tamil. Named entity recognition in Twitter was explored through the shared task on noisy user-generated text organized by Microsoft as part of ACL-IJCNLP 2015, which had two sub-tasks, namely Twitter text normalization and named entity recognition for English. The ESM-IL track at FIRE 2015 came up with entity annotated benchmark data for social media text in which users use only one language. But there is no such shared task for event identification and extraction. Thus there is a need to develop systems that focus on social media texts for event extraction.

The paper is organized as follows: Section 2 describes the challenges in event extraction for Indian languages. Section 3 describes the corpus annotation, the tag set and corpus statistics. Section 4 gives an overview of the approaches used by the participants, and Section 5 concludes the paper.

2. GENERAL CHALLENGES IN INDIAN LANGUAGE EVENT EXTRACTION

The challenges in the development of event extraction systems for Indian languages from social media text arise due to several factors. One of the main factors is that there is no annotated data available for any of the Indian languages. Apart from the lack of annotated data, the other factors which differentiate Indian languages from European languages are the following:

a) Ambiguity – Ambiguity between common and proper nouns. E.g., a common word such as “Roja”, meaning rose flower, is also the name of a person.

b) Spell variations – One of the major challenges is that different people spell the same entity differently. For example, in Tamil the person name Roja is spelt as “rosa” or “roja”.

c) Less Resources – Most of the Indian languages are less-resourced languages. There are no automated tools that can handle social media text for the preprocessing tasks required for NER, such as part-of-speech tagging and chunking.

Apart from these challenges, we also find that the development of automatic event recognition systems is difficult due to the following reasons:

i) Tweets contain a huge range of distinct event types. Almost all these types are relatively infrequent, so even a large sample of manually annotated tweets will contain very few training examples.

ii) In comparison with English, Indian languages have more dialectal variations. These dialects are mainly influenced by different regions and communities.

iii) Indian language tweets are multilingual in nature and predominantly contain English words.

The following examples illustrate the usage of English words and spoken, dialectal forms in the tweets.

Example 1 (Tamil):

Ta: Stamp veliyittu ivaga ativaangi …..
En: stamp released these_people get_beaten ….
Ta: othavaangi …. kadasiya kovai
En: get_slapped … at_end kovai
Ta: pooyi pallakaatti kuththu vaangiyaachchu.
En: gone show_tooth punch got

(“They released stamp, got slapping and beating … at the end reached Kovai and got punched on the face”)

This example is a Tamil tweet written in a particular dialect, which also uses English words.

Similarly, in Hindi we find a lot of spelling variation. For words such as “mumbai”, “gaandhi”, “sambandh” and “thanda” there are at least three different spelling variations.

3. CORPUS DESCRIPTION

The corpus was collected using the Twitter API in two different time periods. The training partition of the corpus was collected during June 2017, and the test partition was collected during August 2017. As explained in the sections above, we observe concept drift in Twitter data. Thus, to evaluate how the systems handle concept drift, we collected data in two different time periods. In the present initiative the corpus is available for three Indian languages: Hindi, Malayalam and Tamil. The tables and figures show different aspects of the corpus statistics.

ANNOTATION TAGSET

The corpus for each language was annotated manually by trained experts. The event extraction task requires identifying the event trigger keyword and the full event predicate and representing them with a tag. In this work, the data is tagged with one single tag, “Event”, which marks a single phrase consisting of the event trigger and the event predicate. For example, “Governor for Tamil Nadu appointed”. We find that most of the work on event extraction in English has used the Automatic Content Extraction (ACE) event tag set. In the present work for this track we have focused only on the extraction of one event phrase, which consists of the event trigger and the whole event predicate, and which gives the information of where and when the event happened and which participants were involved in it. As there is not much work in this area in Indian languages, and to keep the task definition simple, in this edition we have not taken up identification of event types or of the where and who of the events individually.

DATA FORMAT

The participants were provided the data with annotation markup in a separate file called the annotation file. The raw tweets were to be downloaded separately using the Twitter API. The annotation file
is a column format file, where each column was tab separated. It consisted of the following columns:

i) Tweet_ID

ii) User_Id

iii) Event string

iv) Event Start_Index

v) EventString_Length

For example:

Tweet_ID:890123456782341
User_Id:987654321
EventString: TN Governor appointed
Index:43
Length:21

The Index column is the starting character position of the event string, calculated for each tweet, with the count starting from ‘0’. The participants were also instructed to provide the test file annotations in the same format as given for the training data.

The dataset statistics are as follows:

Table 1. Corpus Statistics

{| class="wikitable"
! Language !! No. of Tweets !! No. of Events
|-
| Hindi || 5476 || 1533
|-
| Malayalam || 7391 || 1733
|-
| Tamil || 9147 || 2074
|}

The data has events of different types such as cyclones, floods, accidents, disease outbreaks and political events. The majority of the types were disasters and political events such as inaugurations/opening ceremonies by political leaders. The data also had events on movie or audio release functions.

4. SUBMISSION OVERVIEWS

The evaluation metrics used for this task are Precision, Recall and F-measure, which are the widely used metrics for this task. A total of 16 teams registered for participating in the track. The final submission was done by 4 of the 16 teams. They submitted their test runs for evaluation, some with multiple runs. A total of 11 test runs were submitted for evaluation. Only 1 team participated for all three languages. One team each participated for Hindi, Tamil and Malayalam.

We developed a base system without using any pre-processing or lexical resources. The baseline system was developed using a CRF classifier which marks whether a word is part of an event phrase or not. The baseline system was developed so that it would help in making a better comparative study. The system performance is: precision of 23.87% and recall of 29.67%. It is observed that all the teams outperformed the base system. In the following paragraphs we briefly describe the approaches used by each team. The results of the teams are given in Table 3.

a) Alapan team used Neural Networks to develop the system. They used a CNN algorithm in combination with LSTM. They first remove the URLs, emoticons, etc. from the tweets. No NLP pre-processing such as POS tagging or chunking is done on the tweets. This team participated in all languages and submitted 2 runs for each language.

b) Sharmila team used SVMs for developing the system. The data was preprocessed for tokenization and no cleaning is performed. The task is modeled as a simple binary classification task. The team participated for Tamil and submitted three runs.

c) Nageshbhattu team used CRFs for the task. This team pre-processed the data for Part-of-Speech (POS) tagging. They used POS tags and words in a window of 5 as features for CRF learning. One interesting aspect is that a POS tagger for general texts has been used for the tweet data. It will be interesting to know how well a general newswire POS engine performs on tweet data. This team participated for Hindi and submitted one run.

d) Manju team used an open source tool called BeautifulSoup to identify the events. This tool is used for web scraping, but here they have used it for event classification. The choice of the tool is not appropriate for this task. In fact this method can be called a “blind method”, where almost all the input tweets are marked as events, and by default about 1/5th of them have come out correct. This team participated in Malayalam and submitted one run.

The different methodologies used by the teams are summarized in Table 2.

Evaluation

The evaluation metrics used are precision, recall and F-measure. All the systems have been evaluated automatically by comparing with the gold data. The results obtained for each participant are shown in Table 3.

One main condition in event phrase identification is related to the event span. The span or extent of the event phrase is to be optimally minimum: it should include the event trigger and the predicate. Consider the example below.

Hi: bahut dinom se kahi jA rahi rAjyapAl ki niyukti, tamilnadu me naye rAjyapAl ki niyukti huA.

Here the event phrase is “tamilnadu me naye rAjyapAl ki niyukti”. It cannot be just “rAjyapAl ki niyukti”. Here the event trigger is “niyukti”. The event predicate is “tamilnadu me naye rAjyapAl”, from which we get the information where and what.

So a participating system needs to identify this exact event phrase. Any system output which has tagged anything more than this extent is considered wrong.

Thus we define:

Precision, P = (No. of events correctly identified by the system) / (Total no. of events identified by the system)

Recall, R = (No. of events correctly identified by the system) / (Total no. of events in the gold data)

F-measure = (2 * P * R) / (P + R)
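
The annotation format and the metric definitions above can be tied together in a short scoring sketch. It is only an illustration, not the organizers' scorer: it assumes that each annotation row has exactly the five tab-separated columns listed under DATA FORMAT with no header row, and that an event counts as correct only when tweet ID, start index and length all match the gold annotation exactly. The names `EventSpan`, `read_annotations` and `score`, and the second sample row, are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class EventSpan(NamedTuple):
    tweet_id: str
    start: int   # 0-based character offset of the event string in the tweet
    length: int  # number of characters in the event string


def read_annotations(handle) -> set[EventSpan]:
    """Read rows of Tweet_ID, User_Id, event string, start index, length."""
    spans = set()
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 5:
            continue  # assumption: silently skip blank or malformed rows
        tweet_id, _user_id, _event_string, start, length = row
        spans.add(EventSpan(tweet_id, int(start), int(length)))
    return spans


def score(gold: set[EventSpan], system: set[EventSpan]) -> tuple[float, float, float]:
    """Precision, recall and F-measure under exact span matching (assumed)."""
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# First row is the example from the paper; second row is made up.
gold_text = (
    "890123456782341\t987654321\tTN Governor appointed\t43\t21\n"
    "890123456782342\t987654322\tcyclone hits coast\t0\t18\n"
)
# The system output over-extends the second span, so it is counted as wrong.
system_text = (
    "890123456782341\t987654321\tTN Governor appointed\t43\t21\n"
    "890123456782342\t987654322\tcyclone hits coast today\t0\t24\n"
)

gold = read_annotations(io.StringIO(gold_text))
system = read_annotations(io.StringIO(system_text))
precision, recall, f_measure = score(gold, system)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")  # P=0.50 R=0.50 F=0.50
```

Comparing spans as sets of (tweet ID, start, length) triples keeps the match strict, in line with the requirement that a system output tagging more than the gold extent is treated as wrong.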
5. CONCLUSION

The main objective of creating benchmark data representing a few of the popular Indian languages has been achieved. This data has been made available to the research community free of charge for research purposes. The data is user generated and is not specific to any genre. Efforts are still going on to standardize this data and make it a perfect data set for future researchers. We observe that the results obtained are almost similar for all the languages. We hope to see more publications in this area in the coming days from the different research groups who could not submit their results. We also expect that more groups will start using this data for their research work.

This EventXtract-IL track is one of the first efforts towards the creation of event annotated user generated data for Indian languages. The data being generic, it could be used for developing generic systems upon which a domain specific system could be built after customization. In the next edition of this track we plan to add more data and also include identification and extraction of event types, event cause-effects and event participants.

6. ACKNOWLEDGMENTS

We thank the FIRE 2017 organizers for giving us the opportunity to conduct the evaluation exercise. We also thank the Language Editors in CLRG, AU-KBC Research Centre.

7. REFERENCES

[1] Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, Víctor Fresno. 2014. TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org 2014.

[2] Mark Dredze, Tim Oates, and Christine Piatko. 2010. “We’re not in kansas anymore: detecting domain changes in streams”. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 585–595. Association for Computational Linguistics.

[3] Hege Fromreide, Dirk Hovy, and Anders Søgaard. 2014. “Crowdsourcing and annotating ner for twitter#drift”. European language resources distribution agency.

[4] H.T. Ng, C.Y. Lim, S.K. Foo. 1999. “A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation”. In Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources (SIGLEX99). Maryland. pp. 9-13.

[5] Preslav Nakov, Torsten Zesch, Daniel Cer and David Jurgens. 2015. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

[6] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).

[7] Rajeev Sangal and M. G. Abbas Malik. 2011. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP).

[8] Aravind K. Joshi and M. G. Abbas Malik. 2010. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP). (http://www.aclweb.org/anthology/W10-36)

[9] Rajeev Sangal, Dipti Misra Sharma and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. (http://www.aclweb.org/anthology/I/I08/I08-03)

[10] Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R and Sobha Lalitha Devi. 2014. Proceedings of Named-Entity Recognition Indian Languages track at FIRE 2014. http://au-kbc.org/nlp/NER-FIRE2014/
Table 2. Participant Team Overview - Summary

{| class="wikitable"
! Team !! Languages & Submissions !! System Approaches (ML method) Used !! Pre-Processing Step !! Lexical Resources Used !! Open Source NLP Tools Used !! Variation Between Runs
|-
| Alapan – IIT-Kgp
| i) Hindi: 2 runs; ii) Malayalam: 2 runs; iii) Tamil: 2 runs
| Run 1: Neural Networks – CNN architecture with LSTM, pipelined process flow. Run 2: Neural Networks – CNN architecture with LSTM, non-pipelined process flow
| Tweet Preprocessor alone used to eliminate http links, emoticons
| NIL
| CNN – ML tool
| Pipelined process flow and non-pipelined process flow
|-
| Sharmila – Karpagam Eng. College (KEC)
| i) Tamil: 3 runs
| SVMs – words, prefixes, suffixes and shape features used
| Tweet cleaning and Tokenization
| NIL
| SVM tool kit
| Run 1: C-parameter of SVM is tuned. Run 2: Without any parameter tuning of the SVM tool kit. Run 3: Tuning of all other parameters of the SVM tool kit
|-
| Nageshbhattu – IDRBT
| i) Hindi: 1 run
| CRFs
| NLP pre-processing – uses general text POS tagger
| NIL
| POS tagger and CRFs tool kit
| N/A
|-
| Manju – CEC, Cherthala
| i) Malayalam: 1 run
| NIL
| Tweet cleaning
| NIL
| BeautifulSoup – a web scraping tool
| N/A
|}
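Table 3. Evaluation Results of Participating Systems

{| class="wikitable"
! rowspan="2" | Team !! rowspan="2" | Language !! colspan="3" | Submission 1 !! colspan="3" | Submission 2 !! colspan="3" | Submission 3
|-
! Prec % !! Rec % !! F-m % !! Prec % !! Rec % !! F-m % !! Prec % !! Rec % !! F-m %
|-
| IIT Kgp || Hindi || 36.58 || 79.02 || 50.01 || 31.42 || 56.37 || 40.35 || NA || NA || NA
|-
| IIT Kgp || Malayalam || 32.98 || 90.20 || 48.29 || 39.98 || 57.50 || 47.17 || NA || NA || NA
|-
| IIT Kgp || Tamil || 43.16 || 64.77 || 51.80 || 39.73 || 49.33 || 44.01 || NA || NA || NA
|-
| IDRBT Hyderabad || Hindi || 31.56 || 71.39 || 43.77 || NA || NA || NA || NA || NA || NA
|-
| KEC, Coimbatore || Tamil || 39.10 || 62.28 || 48.04 || 38.05 || 51.81 || 43.88 || 38.44 || 61.14 || 47.20
|-
| CEC Cherthala || Malayalam || 21.43 || 67.17 || 32.40 || NA || NA || NA || NA || NA || NA
|}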
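
As a worked illustration of how the F-measure column relates to the definition F = 2PR/(P+R) used in this paper, the first IIT Kgp Hindi submission in Table 3 can be recomputed from its reported precision and recall. The second line applies the same formula to the baseline precision and recall stated in Section 4; the paper does not report a baseline F-measure, so that value is derived here rather than taken from the source.

```latex
\begin{align}
F_{\text{IIT Kgp, Hindi, Submission 1}}
  &= \frac{2 \times 36.58 \times 79.02}{36.58 + 79.02}
   = \frac{5781.10}{115.60} \approx 50.01 \\
F_{\text{baseline}}
  &= \frac{2 \times 23.87 \times 29.67}{23.87 + 29.67}
   = \frac{1416.45}{53.54} \approx 26.46
\end{align}
```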