Rule Based Event Extraction System from Newswires and Social Media Text in Indian Languages (EventXtract-IL) for English and Hindi data

Anita Saroj, Rajesh kumar Munodtiya, and Sukomal Pal
Indian Institute of Technology (BHU), Uttar Pradesh 221005, India
{anitas.rs.cse16,rajeshkm.rs.cse16,spal.cse}@iitbhu.ac.in

Abstract. Due to today's information overload, users find it difficult to access the right information through the World Wide Web. The situation becomes worse when this information is spread across multiple languages. In this paper we present a model for information extraction. Our model mainly works on the concepts of part-of-speech tagging and named entity recognition. We represent each word by its POS tag and the entity identified for that word. We assume that the event appears in the first line of the document. If we do not find it in the first line, we take the help of sentiment analysis: if the sentence has negative polarity, it is associated with an unexpected event with a negative sense. We use NLTK for sentiment analysis.

1 Introduction

Today's information overload makes it enormously difficult to access the right information, especially through the World Wide Web. The situation becomes worse when the information is in multiple languages. When a user accesses a document written in multiple languages and has difficulty in finding facts, it is important to filter out all information from the data except the facts that match the user's interest. Information Extraction (IE) systems are primarily used to extract various types of information from documents pertaining to specific languages and domains [2, 3]. Existing IE techniques, however, sometimes remove even specific facts from documents that match the user's interest. Also, IE techniques sometimes return keywords that are irrelevant to the user's interests.
On the other hand, users can manually find more relevant information in their domain of interest than a system can provide [4]. Most IE systems process texts in sequential phases (or "steps") such as lexical and morphological processing, identification and typing of proper names, parsing of larger syntactic constituents, resolution of anaphora and co-reference, and finally extraction of domain-relevant events and relationships [2]. Information Extraction systems need to be easily adaptable to new domains in order to increase their suitability for end-user applications [4]. Rapid growth in information technology in the last two decades has increased the amount of information available online. Social media is a new style of sharing information: a platform of communication in which the public shares and exchanges information in virtual communities and networks (such as Facebook and Twitter) [6].

2 Related Work

Event detection and Topic Detection and Tracking are major components of a broad initiative. Written and spoken news stories primarily belong to the category of interest of Topic Detection and Tracking over broadcast news [1]. Driven by the MUC contests, work on Information Extraction, and especially on named entity recognition (NE), mostly concentrated on narrow subdomains, such as newspaper reports about terrorist attacks (MUC-3 and MUC-4) and reports on air vehicle launches (MUC-7) [7, 8]. A system is required to process different types of documents without much tuning to the document type. Manual or semi-automatic adjustment to new domains and applications has been successfully implemented in many existing IE systems, and there has been some progress in strengthening systems to overcome this requirement [5].
Recent research in this area starts with the notion that statistical machine learning is the best way to solve information extraction problems. Finding structured information in unstructured or semi-structured text is the primary objective of information extraction [6].

3 System overview

Our model primarily works on the concepts of part-of-speech tagging and named entity recognition. We represented each word by its POS tag and the entity identified for that word. We assumed that the event exists in the first line of the document. If we do not find it in the first line, we take the help of sentiment analysis: if the sentence has negative polarity, it is associated with an unexpected event that has a negative sense. We used NLTK for sentiment analysis. Most of the information, such as the Time argument (Date) and the Place argument (Location), is easily extracted by named entity recognition. For Speed-argument and Casualty extraction, candidates are first tagged "CD" (cardinal number) and further distinguished with the help of the NER tag.

4 Methodology

We developed a modular method for information extraction from Indian languages. The dataset provided by the organisers is collected from various social sources such as blogs, microblogs, social media and newswires, written either in Roman script or code-mixed, where an Indian language is mixed with English. We have worked on the English and Hindi languages. The key components of the proposed work (Figure 1) are described as follows.

Fig. 1. Rule Based Information Extraction Framework

The dataset provided by the task organiser contains three languages: English, Hindi and Tamil. The training dataset is in XML format and the test data is in raw text. Statistics of the given dataset are shown in Table 1.

Table 1.
Training dataset statistics per language (number of training files)

Language   Training files
English    100
Hindi      107
Tamil      64

4.1 POS Tagging

We used the Stanford POS Tagger for part-of-speech tagging of the English dataset. For Hindi, we used the Hindi treebank dataset to build a POS tagger based on the conditional random field algorithm. Our entire system relies on the POS tagger; part-of-speech tagging is the first step of the extraction task.

4.2 Named Entity Recognition

Similar to POS tagging, the Stanford NER Tagger was used for named entity recognition on the English dataset. For Hindi, as with the POS tagger, we used the Hindi treebank dataset to create an NER tagger. By default, the Stanford NER Tagger uses neither part-of-speech information nor a gazetteer to extract locations. The pair of POS tag and NER tag, along with the word itself, helps extract information about Person, Date, etc.

4.3 Chunking

When we want to extract full information related to the event, such as casualties, depth, etc., it is necessary to identify whole phrases. The complete phrase is therefore extracted with the Stanford Parser. Sentiment analysis is applied to an extracted phrase tagged as a verb followed by a cardinal number that is not a Date NER; such a phrase may carry information about casualties. This extracted pattern fails when a positive event follows the casualties' pattern. To avoid such situations we check the sentiment polarity; the threshold value for the sentiment check is 0.5.

The proposed method is a framework for information extraction from unstructured user-generated content on social media. Our information extraction system analyses human language text as linguistic structure in order to extract information about different types of events, time, place, casualties and speed.
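To illustrate the word/POS/NER pairing described above, here is a minimal sketch. The example sentence and its tags are hypothetical (hand-assigned for illustration, not produced by the Stanford taggers), and the helper function name is our own:

```python
# Sketch of the (word, POS tag, NER tag) representation used by the system.
# Tags here are hand-assigned; in the system they come from the taggers.
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, pos_tag, ner_tag)

sentence: List[Token] = [
    ("Cyclone", "NNP", "O"),
    ("Vardah",  "NNP", "O"),
    ("hit",     "VBD", "O"),
    ("Chennai", "NNP", "LOCATION"),
    ("on",      "IN",  "O"),
    ("Monday",  "NNP", "DATE"),
]

def extract_by_ner(tokens: List[Token], label: str) -> List[str]:
    """Collect the words whose NER tag matches the requested label."""
    return [word for (word, _pos, ner) in tokens if ner == label]

print(extract_by_ner(sentence, "LOCATION"))  # ['Chennai']
print(extract_by_ner(sentence, "DATE"))      # ['Monday']
```

With this representation, Place and Time arguments fall out of a single NER lookup, while Speed and Casualty candidates additionally consult the "CD" POS tag as described above.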
We selected sentences from the dataset, performed POS tagging, NER tagging and chunking, and then extracted phrases from the resulting POS, NER and chunk annotations.

– Time extraction: If a token carries the Date tag in the NER output, we take it as the correct date; otherwise we match the word against day names or day abbreviations from the calendar library. Apart from that, we also match the word against the strings Today, Tomorrow and Yesterday.
– Event extraction: We took an instance from the dataset, applied a lemmatizer to it, and extracted the 10 most frequent lemmas that are neither stopwords nor punctuation marks. Among those lemmas, the event is chosen as the one that is a noun and comes from the first sentence of the instance.
– Place extraction: We select phrases that contain noun, proper-noun and preposition POS tags. A selected phrase represents a place if it carries the Location NER tag and its words start with a capital letter.
– Casualties extraction: A phrase is selected only if it holds cardinal-number, verb, noun and preposition POS tags. From data analysis we found that the window between the cardinal number and the verb should be 1-5 tokens. Since some selected phrases represent time instead, we additionally check that the NER tag is not Date.
– Speed extraction: We select phrases that contain a cardinal number, a noun and a preposition, and that do not match the already-extracted Time and Casualties. For the word holding the cardinal number, we check the remaining phrase against measurement units. The measurement units come from a manually created dictionary that covers most common units.

5 Result

The evaluation metrics for the Event Extraction problem are shown in Table 2.
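As a quick illustration of how the reported scores relate to each other, the following sketch computes precision, recall and F-measure from true-positive, false-positive and false-negative counts. The counts used are illustrative only, not counts from the shared task:

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of extracted items that are correct.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of gold-standard items that were extracted.
    return tp / (tp + fn)

def f_measure(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Illustrative counts, not from the shared task:
p = precision(tp=40, fp=10)       # 0.8
r = recall(tp=40, fn=60)          # 0.4
print(round(f_measure(p, r), 4))  # 0.5333
```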
The task organiser uses Precision (P), Recall (R) and F-measure:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 * (Precision * Recall) / (Precision + Recall)

where TP is true positive, FP is false positive and FN is false negative. We extracted five of the seven pieces of information, i.e. Speed, Casualties, Time, Event and Place.

Table 2. Result of our system submissions for Hindi and English data

Language   Precision%   Recall%   F-measure%
Hindi      29.65        61.39     39.90
English    34.54        64.87     45.07

We can clearly see that our Hindi model does not perform as well as the English model, because no freely available Hindi POS tagger and NER tagger exist yet.

6 Conclusion

We have discussed our rule-based methodology used to solve the task of information extraction from newswires and social media text in Indian languages. We tested our methodology on the Hindi and English languages and derived some insights from the achieved results. The achieved F-measures are 39.90% and 45.07% for Hindi and English respectively. We believe that the incorporation of a probabilistic approach alongside the rule-based one will improve the results.

References

1. Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 37-45. ACM (1998)
2. Appelt, D.E.: Introduction to information extraction. AI Communications 12(3), 161-172 (1999)
3. Cowie, J., Lehnert, W.: Information extraction. Communications of the ACM 39(1), 80-91 (1996)
4. Karkaletsis, V., Spyropoulos, C.D., Petasis, G.: Named entity recognition from Greek texts: the GIE project. In: Advances in Intelligent Systems, pp. 131-142. Springer (1999)
5. Maynard, D., Tablan, V., Ursu, C., Cunningham, H., Wilks, Y.: Named entity recognition from diverse text types.
In: Recent Advances in Natural Language Processing 2001 Conference, pp. 257-274 (2001)
6. Morgan, M.B.H., Van Keulen, M.: Information extraction for social media. In: Proceedings of the Third Workshop on Semantic Web and Information Extraction, pp. 9-16 (2014)
7. Sundheim, B.: Proceedings of the Fifth Message Understanding Conference (MUC-5). Columbia, MD: ARPA, Morgan Kaufmann (1995)
8. Sundheim, B.: Proceedings of the Seventh Message Understanding Conference (MUC-7). ARPA, Morgan Kaufmann (1998)