KCE_DAlab @ EventXtract-IL-FIRE2017: Event Extraction using
                Support Vector Machines
    SharmilaDevi V, Kannimuthu S                                  Safeeq G                              Anand Kumar M
        Department of Information                      Department of Information                   Center for Computational
              Technology,                                     Technology,                      Engineering and Networking (CEN)
     Karpagam College of Engineering,                  Sri Ramakrishna Institute of              Amrita School of Engineering,
            Coimbatore, India                                 Technology,                         Coimbatore, Amrita Vishwa
                                                            Coimbatore, India                        Vidyapeetham, India

ABSTRACT                                                                like risk analysis, monitoring systems and decision making sup-
Nowadays, Social media has become a major part to transfer the          porting tools. The event must be used in three methods that is data
message that must be shared with the people ideas and express the       to driven knowledge, extract knowledge through representation
information globally in our day-to-day life. Through social media       and exploitation of expert knowledge and hybrid event extraction.
can able to connect the people together, they are vulnerable to             With the enormous content of data and the impact of digital data
crimes like Identity thefts, false information, and identity masking    sources are easily extracted. Most of the data is in an unstructured
etc. Identifying the event from the social media messages and news      format that is human can easily understand the language.The data
headlines are the important area of research in the current era.        that are given here is to be converted to machine understandable
This paper illustrates work done on Event Extraction for Indian         language. The application that is mainly used in Information re-
language shared task which is conducted in Forum for Information        trieval and Information Extraction Methods. Information Extraction
Retrieval Evaluation (FIRE) 2017. For this Event extraction task,       is the method of automatically extracting structured information
organizers release the dataset with three languages Tamil, Hindi,       from the unstructured or semi-structured machine-readable docu-
and Malayalam. Each language dataset consists of two files Original     ments. In related work, the open dataset for event extraction for
Tweet file and Annotation files. We only participated in the Tamil      the English language is explored in [6]. Here the corpus raises two
event extraction task. In this task, we converted the original tweet    main issues. It was annotated with templates describing all events
file into Bio-format to apply the machine learning directly. Then       with the same set of slots. The methodology used in this type is
analyzing each chunk of the word is an event is said to [B] beginner    Annotation and ASTRE corpus. In this paper, the ASTRE corpus, a
and the other events will be given as Intermediate and the others are   new corpus dedicated to the evaluation of event schema induction.
assigned as O tag. Each word or chunk should be trained whether         Template-based Information Extraction without template is dis-
it is the event or not an event with the help of rich features and      cussed in [2]. The template defines a specific type of event with
SVM classifier. Here we also find out the Cross- Validation accuracy    a set of semantic roles for the typical entities involved in such an
using Natural language techniques.                                      event. The methodology in this paper is learning templates from
                                                                        raw text and clustering on event distance.
CCS CONCEPTS                                                                Distant supervision approach to template-based event extraction,
                                                                        focusing on the extraction of passenger counts, aircraft types, and
• Computing methodologies → Natural language process-
                                                                        other facts concerning airplane crash events is explored in [8]. They
ing; Language resources; Feature selection;
                                                                        also presented a publicly available dataset and event extraction
                                                                        task in the plane crash domain based on Wikipedia infoboxes and
KEYWORDS                                                                newswire text.
Indian Language, Event extraction, Social Media, Text Classification        Event extraction is treated as an dependency parsing in [5]. Here,
                                                                        authors proposed a simple approach for the extraction of such
1   INTRODUCTION                                                        structures by taking the tree of event-argument relation. This gives
Natural Language Processing (NLP) is a field that covers computer       the better performance in the extraction of a biomedical event.
understanding and manipulation of human languages. It focuses               In [9], Event Extraction from unstructured text data was ex-
on the interaction between human language and computer is called        plained. Authors extended the bootstrapping method that was ini-
Natural language processing. Event Extraction is an important           tially developed for extracting relations from web pages to the
stream of information extracted it has greatly gained in popularity     problem of content extraction from short unstructured text. The
due to the advent of big data and the developments in the related       event extraction method proposed in this paper attained less accu-
fields of text mining in Natural Language Processing. One common        racy for the Twitter dataset as compared to the enterprise dataset.
application of text mining is event extraction which encompasses            An overview of event extraction from text was described in [3].
deducing specific knowledge concerning incidents referred to in         This literature survey discussed the text mining techniques that are
texts. Most of the data is initially unstructured. Using NLP tech-      employed for various event extraction purposes. Here knowledge
niques, information is extracted from texts from various sources        driven event extraction and Hybrid driven even extraction methods
such as new messages and blogs that must be stored in a structured      are discussed elaborately.
way eg. Databases. The event can be useful in some applications
FIRE 2017, 8th - 10th December, Indian Institute of Science, Bangalore                                                   SharmilaDevi V et al.

            Table 1: EventXtract-IL Tamil Dataset

              Files               Training    Testing
              Annotations         1109        -
              News Headlines      3843        5304
              Unique Authors      1799        2509


   The rest of the paper is organized as follows. Section 2 presents
the overview of the shared task and the details regarding the dataset.
Section 3 describes the proposed system developed for the event
extraction task while Section 4 shows the evaluation results of three
submissions for Tamil event extraction shared task. Finally, Section
5 concludes the paper.

2   DATASET DETAILS
The task contains two files such as Original tweet file and Anno-
tation files. The first two column must contain Tweet ID and user                              Figure 1: Methodology
ID. The third column must represent the event phrase of the ID.
The Fourth column will mention the index where this phrase starts
                                                                                         Table 2: EventXtract-IL Results
in the tweet string and the fifth column is the string length of the
event phrase. The events are given as Natural disasters, Man-made
disasters, political events and cultural /social events.                                                         Tamil
                                                                                       KCE_DAlab        Prec %   Rec %     F-m%
3   EVENT EXTRACTION FOR TAMIL                                                         Submission1      39.1     62.28     48.04
                                                                                       Submission2      38.05    51.81     43.88
    LANGUAGE                                                                           Submission3      38.44    61.14     47.2
Normally, for Text mining and Information Extraction, preprocess-
ing is the mandatory step and it is necessary for the Twitter dataset.
The methodology which is followed in the entity extraction is fol-       3.1    Features for Event extraction
lowed in the event extraction too [7] [1] . The preprocessing step
                                                                         In this work, feature extraction is essential as this decides the accu-
encompass Normalization and Tokenisation methods. In Tokeni-
                                                                         racy of the machine learning based system. The traditional features
sation, based on the white spaces, sentences are partitioned into
                                                                         like words, prefixes, and suffixes of the word, binary feature, shape
tokens. These tokens are further normalized where superficial vari-
                                                                         features are used in the feature extraction step. Binary feature and
ations are extracted. However, normalization of Twitter messages
                                                                         shape feature is binary features where if it is present in the tweet
is desired to prevailing the non-standard words, spelling digression,
                                                                         then it is marked as ’1’ or else ’0’. For prefix and suffix feature max-
lengthen the unconstrained abbreviations (eg., tmrw for tomorrow),
                                                                         imum up to five characters before and after the current character
and prevailing the phonetic alternation. For English language, case
                                                                         are taken as features. The punctuation mark such as question mark,
folding is a relevant one where case variations must be obtained but
                                                                         exclamatory marks, comma, and full stop are also used as features.
it does not feel necessary for Indian language where no such varia-
tion exists.The methodology of the proposed system is illustrated
in Figure 1. The training dataset consists of two files such as raw
tweets and extracted type annotated entities. The tweet file will be
                                                                         4     RESULTS
expressed by "Tweet ID","User ID" and tweets. The entity file must       This section explains the submission details and the results obtained.
be expressed of "Tweet ID","User ID", Entity type, entity, starting      The results are shown in Table.2, Submission-2, is the baseline
index and length. We have merged these files and converted into          system and submission-1 undergone the C-parameter tuning of
conventional BIO formatted text in which B-XXX tag refers the            SVM. In, submission-3 the parameters are fixed based on 10-fold
Beginning word of the entity type and I-XXX is needed for the            cross-validation.
following chunks of an entity. The tag other than the event is repre-
sented as O. In tokenization the tweets are further partitioned into     5     CONCLUSION AND FUTURE SCOPE
small chunks called as tokens. Training and testing tweets must          The work is submitted as a part of Shared Task on Event Extraction
be tokenized properly in one token-per line format. Annotated            for the Tamil Language in FIRE 2017. The task organizer provided
events and tokenized training tweets are combined to create the          the twitter file and annotation file. Three submissions were sub-
BIO format. Features are extracted in Tamil and train the system         mitted for the task using the traditional features. The system was
with support vector machine-based classifier, SVMLight [4]. Finally,     trained and tested using SVM classifier. In future, POS tagging
the BIO format tokens are converted into the given annotation            and the NER features along with word embedding can be added to
format and the event is extracted.                                       improve the performance of the event extraction system.
KCE_DAlab @ EventXtract-IL-FIRE2017: Event Extraction using Support
                                                              FIRE 2017,
                                                                    Vector
                                                                         8th
                                                                           Machines
                                                                             - 10th December, Indian Institute of Science, Bangalore


ACKNOWLEDGEMENT
We would like to thank organizers of Forum for Information Re-
trieval Evaluation 2017 for providing the shared task platform to
the researchers. We would also like to thank the organizers of the
EventXtract-IL task.

REFERENCES
[1] M. Anand Kumar, S. Se, and K. Soman. Amrita-cen@fire 2015: Extracting entities
    for social media texts in indian languages. volume 1587, pages 85–88, 2015.
[2] N. Chambers and D. Jurafsky. Template-based information extraction without
    the templates. In Proceedings of the 49th Annual Meeting of the Association for
    Computational Linguistics: Human Language Technologies-Volume 1, pages 976–986.
    Association for Computational Linguistics, 2011.
[3] F. Hogenboom, F. Frasincar, U. Kaymak, and F. De Jong. An overview of event
    extraction from text. In Workshop on Detection, Representation, and Exploitation
    of Events in the Semantic Web (DeRiVE 2011) at Tenth International Semantic Web
    Conference (ISWC 2011), volume 779, pages 48–57, 2011.
[4] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges,
    and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning,
    chapter 11, pages 169–184. MIT Press, Cambridge, MA, 1999.
[5] D. McClosky, M. Surdeanu, and C. D. Manning. Event extraction as dependency
    parsing. In Proceedings of the 49th Annual Meeting of the Association for Com-
    putational Linguistics: Human Language Technologies-Volume 1, pages 1626–1635.
    Association for Computational Linguistics, 2011.
[6] K.-H. Nguyen, X. Tannier, O. Ferret, and R. BesanÃ̈ğon. A dataset for open event
    extraction in english. In Proceedings of the Tenth International Conference on
    Language Resources and Evaluation (LREC 2016), pages 1939–1943, 2016.
[7] G. Remmiya Devi, P. Veena, M. Anand Kumar, and K. Soman. Amrita-cen@fire
    2016: Code-mix entity extraction for hindi-english and tamil-english tweets. vol-
    ume 1737, pages 304–308, 2016.
[8] K. Reschke, M. Jankowiak, M. Surdeanu, C. D. Manning, and D. Jurafsky. Event
    extraction using distant supervision.
[9] C. Shang, A. Panangadan, and V. K. Prasanna. Event extraction from unstructured
    text data. In International Conference on Database and Expert Systems Applications,
    pages 543–557. Springer, 2015.