<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EventXtract-IL: Event Extraction from Social Media Text in Indian Languages @ FIRE 2017 - An Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pattabhi RK Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sobha Lalitha Devi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AU-KBC Research Centre, MIT Campus of Anna University</institution>
          ,
          <addr-line>Chrompet, Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Today, through social media platforms, communication has become exceptionally fast: people across the world get to know about an event happening in any corner of the world within a fraction of a second. The penetration of smartphones, tablets, etc. has significantly changed the way people communicate. Facebook and Twitter are the two most popular social media platforms, where people post about events, their personal daily activities and plans, as well as their thoughts, responses and reactions to any public cause or issue. In recent times we have seen how Facebook posts and Twitter tweets have helped in mobilizing people in Indian states such as Tamil Nadu (TN) and Jammu &amp; Kashmir (J&amp;K). The mass public protests for the “Jallikattu” event in TN and the stone-pelting protests in J&amp;K are prominent examples of how social media has impacted the common man. Real-time information about events and happenings is very valuable to the administration for disaster management, crowd control and public alerting, and such information, when used in the development of recommender systems, adds value to the growth of business enterprises. Thus there is a great need to develop automatic systems for event extraction. This paper presents an overview of the task “Event Extraction in Indian Languages”, a track at FIRE 2017. The task of this track is to extract events from social media text, namely Twitter. Some of the main issues in handling such social media texts are i) spelling errors, ii) abbreviated new-language vocabulary such as “gr8” for great, iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hash tags, and v) code mixing, though in this track we have not considered code mixing. Although event extraction from Indian language texts is gaining attention in the Indian research community, there is no benchmark data available for testing the systems. Hence we have organized the Event Extraction in Social Media Text track for Indian languages (EventXtract-IL) at the Forum for Information Retrieval Evaluation (FIRE). The paper describes the corpus created for three languages, viz., Hindi, Malayalam and Tamil, and presents an overview of the approaches used by the participants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Computing methodologies ~ Artificial intelligence
• Computing methodologies ~ Natural language processing
• Information systems ~ Information extraction</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Over the past decade, Indian language content on various media
types such as websites, blogs, email and chats has increased
significantly, and it is observed that with the advent of smart
phones more people are using social media such as Twitter and
Facebook to comment on people, products, services, organizations,
governments, etc. Content growth is driven by people from
non-metros and small cities, who are generally more comfortable
with their own mother tongue than with English. The growth of
Indian language content is expected to increase by more than 70%
every year. Hence there is a great need to process these data
automatically, which requires natural language processing systems
that extract events, entities and the associations among them. Thus
an automatic event extraction system is required.</p>
      <sec id="sec-2-1">
        <title>The objectives of the evaluation are:</title>
        <p>• Creation of benchmark data for event extraction in
Indian language social media text.</p>
        <p>• To encourage development of event extraction systems
for Indian language social media text.</p>
        <p>
          Event extraction has been actively researched for over the last
decade. Most of the research has, however, focused on resource-rich
languages such as English, French and Spanish. The scope of this
work covers the task of event recognition and extraction in social
media text (Twitter data) for Indian languages. In the past, events
such as the Workshop on NER for South and South East Asian
Languages (NER-SSEA, 2008) and the Workshop on South and
Southeast Asian Natural Language Processing (SANLP,
2010 &amp; 2011) were conducted to bring the various research
works on NER onto a single platform. The NER-IL tracks at FIRE
(Forum for Information Retrieval Evaluation) in 2013, 2014 and
2015, and the Code Mix Entity Extraction track (CMEE-IL) in
2016, have contributed benchmark data and boosted research on
NER for Indian languages. However, there is very little work on
event extraction in Indian languages. User-generated texts such as
Twitter and Facebook posts are diverse and noisy: they contain
non-standard spellings, abbreviations and unreliable punctuation.
Apart from these writing-style and language challenges, another
challenge is concept drift
          <xref ref-type="bibr" rid="ref2 ref3">(Dredze et al., 2010; Fromreide et al.,
2014)</xref>
          ; the distribution of language and topics on Twitter and
Facebook is constantly shifting, leading to performance
degradation of NLP tools over time.
        </p>
        <p>Some of the main issues in handling such texts are i) spelling
errors, ii) abbreviated new-language vocabulary such as “gr8” for
great, iii) use of symbols such as emoticons/emojis, iv) use of meta
tags and hash tags, and v) code mixing.</p>
        <p>For example:
“Muje kabi bhoolen gy to nhi na? :(
Want ur sweet feedback about my FC ? mai
dilli jaa rahi hoon”.
(“You will never forget me, right? :( Want your sweet feedback
about my FC? I am going to Delhi.”)</p>
        <p>Research on analyzing social media data has been attempted for
English through various shared tasks. The language identification
in tweets (TweetLID) shared task held at SEPLN 2014 addressed
identifying tweets from six different languages. SemEval 2013,
2014 and 2015 held shared task tracks focused on sentiment
analysis in tweets, with two sub-tasks: contextual polarity
disambiguation and message polarity classification. For Indian
languages, Amitav et al. (2015) organized a shared task titled
“Sentiment Analysis in Indian Languages” as part of MIKE 2015,
where sentiment analysis was performed on tweets in Hindi,
Bengali and Tamil. Named entity recognition on Twitter was
explored through a shared task on noisy user-generated text,
organized by Microsoft as part of ACL-IJCNLP 2015, with two
subtasks: Twitter text normalization and named entity recognition
for English. The ESM-IL track at FIRE 2015 came up with
entity-annotated benchmark data for social media text, where the
data was in a single language. But there has been no such shared
task for event identification and extraction. Thus there is a need to
develop systems that focus on event extraction from social media
texts.</p>
        <p>The paper is organized as follows: Section 2 describes the
challenges in event extraction for Indian languages. Section 3
describes the corpus annotation, the tag set and corpus statistics.
Section 4 gives an overview of the approaches used by the
participants, and Section 5 concludes the paper.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. GENERAL CHALLENGES IN INDIAN LANGUAGE EVENT EXTRACTION</title>
      <p>The challenges in developing event extraction systems for
Indian languages from social media text arise from several
factors. The main one is that no annotated data is available for
any of the Indian languages. Apart from the lack of annotated
data, the other factors that differentiate Indian languages from
European languages are the following:</p>
      <p>a) Ambiguity – ambiguity between common and proper
nouns. For example, the common word “Roja”, meaning
rose flower, is also used as a person name.</p>
      <p>b) Spell variations – one of the major challenges is that
different people spell the same entity differently. For
example, the Tamil person name “Roja” is spelt as “rosa”
or “roja”.</p>
      <p>c) Less resources – most of the Indian languages are
low-resource languages. There are no automated tools
that can handle social media text for the preprocessing
tasks required for NER, such as part-of-speech tagging
and chunking.</p>
      <p>Apart from these challenges we also find that development of
automatic event recognition systems is difficult due to following
reasons:</p>
      <p>i) Tweets contain a huge range of distinct event types. Almost
all these types are relatively infrequent, so even a large sample of
manually annotated tweets will contain very few training
examples.</p>
      <p>ii) In comparison with English, Indian Languages have more
dialectal variations. These dialects are mainly influenced by
different regions and communities.
iii) Indian Language tweets are multilingual in nature and
predominantly contain English words.</p>
      <p>The following examples illustrate the usage of English words and
spoken, dialectal forms in the tweets.</p>
      <sec id="sec-5-1">
        <title>Example 1 (Tamil):</title>
        <p>Ta: Stamp veliyittu ivaga othavaangi …. kadasiya &lt;loc&gt;kovai&lt;/loc&gt; pooyi pallakaatti kuththu vaangiyaachchu ativaangi …
En: stamp released these_people get_beaten … get_slapped … at_end Kovai gone show_tooth punch got …
(“They released a stamp, got slapping and beating … at the end
reached Kovai and got punched on the face.”)</p>
        <p>This example is a Tamil tweet written in a particular dialect
that also uses English words.</p>
        <p>Similarly, in Hindi we find many spelling variations. For
example, the words “mumbai”, “gaandhi”, “sambandh” and
“thanda” each have at least three different spelling variants.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3. CORPUS DESCRIPTION</title>
      <p>The corpus was collected using the Twitter API in two different
time periods. The training partition of the corpus was collected
during June 2017, and the test partition was collected during
August 2017. As explained in the sections above, we observe
concept drift in Twitter data; to evaluate how the systems handle
concept drift, we collected data in two different time periods. In
this present initiative the corpus is available for three Indian
languages: Hindi, Malayalam and Tamil. The tables and figures
show different aspects of the corpus statistics.</p>
    </sec>
    <sec id="sec-7">
      <title>ANNOTATION TAGSET</title>
      <p>The corpus for each language was annotated manually by trained
experts. The event extraction task requires identifying the event
trigger keyword and the full event predicate and representing them
with a tag. In this work, the data is tagged with one single tag,
“Event”, covering a single phrase consisting of the event trigger
and the event predicate, for example “Governor for Tamil Nadu
appointed”. We find that most work on event extraction in English
uses the Automatic Content Extraction (ACE) event tag set. In the
present track we have focused only on the extraction of one event
phrase, which consists of the event trigger and the whole event
predicate, giving the information of where and when the event
happened and who the participants were. As there is not much
work in this area for Indian languages, and to keep the task
definition simple, in this edition we have not taken up the
identification of event types or of the where and who of the events
individually.</p>
    </sec>
    <sec id="sec-8">
      <title>DATA FORMAT</title>
      <p>The participants were provided the annotation markup in a
separate file called the annotation file. The raw tweets were to be
downloaded separately using the Twitter API. The annotation file
is a column-format file, where the columns are tab-separated. It
consists of the following columns: i) Tweet_ID ii) User_Id
iii) Event string iv) Event Start_Index v)</p>
      <p>The data has events of different types, such as cyclones, floods,
accidents, disease outbreaks and political events. The majority of
the events were disasters and political events, such as
inaugurations/opening ceremonies by political leaders. The data
also had events on movie or audio release functions.</p>
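      <p>The columns above can be read with a few lines of code. The sketch below is illustrative only: the sample row, the field names and the handling of any columns beyond the fourth (whose meaning is not spelled out here) are assumptions, not the track's official reader.</p>

```python
import csv
from io import StringIO

# Hypothetical sample row in the annotation-file layout described above:
# tab-separated Tweet_ID, User_Id, Event string, Event Start_Index.
sample = "123456\t7890\tnaye rAjyapAl ki niyukti\t42\n"

def read_annotations(text):
    """Parse the tab-separated annotation file into a list of dicts."""
    rows = []
    for cols in csv.reader(StringIO(text), delimiter="\t"):
        rows.append({
            "tweet_id": cols[0],
            "user_id": cols[1],
            "event": cols[2],          # the annotated event phrase
            "start": int(cols[3]),     # character offset of the event
            "extra": cols[4:],         # any further (unspecified) columns
        })
    return rows

rows = read_annotations(sample)
```

The raw tweet text would then be fetched by Tweet_ID and aligned with the event string via the start index.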
    </sec>
    <sec id="sec-9">
      <title>4. SUBMISSION OVERVIEWS</title>
      <p>The evaluation metrics used for this task are precision, recall and
F-measure, the most widely used metrics for this task. A total of
16 teams registered to participate in the track, of which 4 teams
made final submissions, several with multiple runs. A total of 11
test runs were submitted for evaluation. Only one team
participated in all three languages; one team each participated for
Hindi, Tamil and Malayalam.</p>
      <p>We developed a baseline system without using any
pre-processing or lexical resources. The baseline system uses a
CRF classifier that marks whether or not a word is part of an
event phrase. The baseline was developed to enable a better
comparative study. Its performance is a precision of 23.87% and a
recall of 29.67%. All the teams outperformed the baseline system.
In the following paragraphs we briefly describe the approaches
used by each team. The results of the teams are given in Table 3.</p>
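      <p>The baseline's per-token formulation can be illustrated with a small sketch. This is a hypothetical labeling scheme (the organizers' actual feature set and tag inventory are not given here): each token of a tweet is labeled as inside or outside the annotated event phrase, yielding the token/label sequence a CRF classifier would be trained on.</p>

```python
def tokens_with_labels(tweet, event_phrase):
    """Label each whitespace token of `tweet` as part of the annotated
    event phrase ('EVENT') or not ('O') -- an assumed two-label scheme."""
    tokens = tweet.split()
    event_tokens = event_phrase.split()
    labels = ["O"] * len(tokens)
    # mark the first occurrence of the event phrase in the tweet
    for i in range(len(tokens) - len(event_tokens) + 1):
        if tokens[i:i + len(event_tokens)] == event_tokens:
            for j in range(i, i + len(event_tokens)):
                labels[j] = "EVENT"
            break
    return list(zip(tokens, labels))

# Hindi example from the evaluation section below
pairs = tokens_with_labels(
    "tamilnadu me naye rAjyapAl ki niyukti huA",
    "tamilnadu me naye rAjyapAl ki niyukti")
```

A CRF would then learn from such sequences using surface features of each token and its neighbors.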
      <p>The Alapan team used neural networks to develop their
system, combining a CNN with an LSTM. They first remove
URLs, emoticons, etc. from the tweets; no NLP pre-processing
such as POS tagging or chunking is applied. This team
participated in all three languages and submitted two runs per
language.</p>
      <p>The Sharmila team used SVMs to develop their system. The
data was pre-processed for tokenization; no cleaning was
performed. The task is modeled as a simple binary classification
task. The team participated for Tamil and submitted three runs.</p>
      <p>The Nageshbhattu team used CRFs for the task. This team
pre-processed the data for part-of-speech (POS) tagging. They
used the POS tags and the words in a window of 5 as features for
CRF learning. One interesting aspect is that a POS tagger built
for general text was applied to the tweet data; it will be
interesting to see how well a general newswire POS engine
performs on tweets. This team participated for Hindi and
submitted one run.</p>
      <p>The Manju team used an open-source tool called
BeautifulSoup to identify the events. This tool is meant for web
scraping, but here it was used for event classification. The choice
of tool is not appropriate for this task; in fact the method can be
described as a “blind method”, where almost all input tweets are
marked as events, and by default about one fifth of them turn out
to be correct. This team participated in Malayalam and submitted
one run.</p>
      <p>The different methodologies used by the teams are summarized in
Table 2.</p>
      <sec id="sec-9-1">
        <title>Evaluation</title>
        <p>The evaluation metrics used are precision, recall and
F-measure. All the systems were evaluated automatically by
comparing their output with the gold data. The results obtained by
each participant are shown in Table 3.</p>
        <p>One main condition in event phrase identification concerns the
event span. The span or extent of the event phrase must be
optimally minimal: it should include the event trigger and the
predicate. Consider the example below.</p>
        <p>Hi: bahut dinom se kahi jA rahi rAjyapAl ki niyukti, tamilnadu
me naye rAjyapAl ki niyukti huA.
(“The appointment of the Governor, talked about for many days;
a new Governor has been appointed in Tamil Nadu.”)</p>
        <p>Here the event phrase is “tamilnadu me naye rAjyapAl ki
niyukti”; it cannot be just “rAjyapAl ki niyukti”. The event
trigger is “niyukti”, and the event predicate is “tamilnadu me naye
rAjyapAl”, from which we get the information of where and what.
The participating systems need to identify this exact event phrase;
any system output that tags anything beyond this extent is
considered wrong.</p>
        <sec id="sec-9-1-1">
          <title>Thus we define:</title>
          <p>Precision, P = (No. of correctly identified events by the
system) / (Total no. of events identified by the system)</p>
          <p>Recall, R = (No. of correctly identified events by the
system) / (Total no. of events in the gold data)</p>
          <p>F-measure = (2 × P × R) / (P + R)</p>
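          <p>The definitions above, with the exact-span matching condition, amount to the following sketch (the tweet IDs and event strings in the usage example are invented for illustration; the official scorer may differ in detail):</p>

```python
def prf(predicted, gold):
    """Precision, recall and F-measure over exact-match event spans:
    a prediction counts as correct only if it matches a gold span exactly."""
    correct = len(set(predicted) & set(gold))
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# (tweet_id, event phrase) pairs; over-extended spans score as wrong
p, r, f = prf(
    predicted=[(1, "naye rAjyapAl ki niyukti huA"), (2, "baaddh aayi")],
    gold=[(1, "tamilnadu me naye rAjyapAl ki niyukti"), (2, "baaddh aayi")])
# p = 0.5, r = 0.5, f = 0.5
```

Note that the first prediction is counted as wrong even though it overlaps the gold span, reflecting the exact-extent condition.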
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSION</title>
      <p>The main objective of creating benchmark data representing a few
of the popular Indian languages has been achieved, and this data
has been made available to the research community free of charge
for research purposes. The data is user generated and is not
specific to any genre. Efforts are ongoing to standardize this data
and make it a solid data set for future researchers. We observe
that the results obtained are broadly similar across the languages.
We hope to see more publications in this area in the coming days,
including from the research groups that could not submit their
results, and we expect more groups to start using this data for
their research work.</p>
      <p>The EventXtract-IL track is one of the first efforts towards the
creation of event-annotated user-generated data for Indian
languages. The data being generic, it could be used to develop
generic systems upon which domain-specific systems could be
built after customization. In the next edition of this track we plan
to add more data and also to include the identification and
extraction of event types, event cause-effects and event
participants.</p>
    </sec>
    <sec id="sec-11">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>We thank the FIRE 2017 organizers for giving us the opportunity
to conduct the evaluation exercise. We also thank the Language
Editors in CLRG, AU-KBC Research Centre.</p>
      <p>[Table 3: precision, recall and F-measure per team and
language (IIT Kgp, IDRBT Hyderabad, KEC Coimbatore, CEC
Cherthala); the table layout was lost in extraction.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza and Víctor Fresno. 2014. TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Mark Dredze, Tim Oates and Christine Piatko. 2010. “We're not in Kansas anymore: detecting domain changes in streams”. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 585-595. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Hege Fromreide, Dirk Hovy and Anders Søgaard. 2014. “Crowdsourcing and annotating NER for Twitter #drift”. European Language Resources Distribution Agency.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] H.T. Ng, C.Y. Lim and S.K. Foo. 1999. “A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation”. In Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources (SIGLEX99), Maryland, pp. 9-13.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Rajeev Sangal and M. G. Abbas Malik. 2011. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Aravind K. Joshi and M. G. Abbas Malik. 2010. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP). (http://www.aclweb.org/anthology/W10-36)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Rajeev Sangal, Dipti Misra Sharma and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. (http://www.aclweb.org/anthology/I/I08/I08-03)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R and Sobha Lalitha Devi. 2014. Proceedings of the Named-Entity Recognition for Indian Languages track at FIRE 2014. (http://aukbc.org/nlp/NER-FIRE2014/)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>