<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rule Based Event Extraction System from Newswires and Social Media Text in Indian Languages (EventXtract-IL) for English and Hindi data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Uttar Pradesh 221005</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Because of today's information overload, users find it particularly difficult to access the right information on the World Wide Web. The situation becomes worse when this information is in multiple languages. In this paper we present a model for information extraction. Our model works mainly on the concepts of part-of-speech (POS) tagging and named entity recognition. We represent each word with its POS tag and the entity identified for that word. We assume that the event occurs in the first line of the document. If we do not find it in the first line, we fall back on sentiment analysis: text with negative polarity is associated with an unexpected event that has a negative meaning. We use NLTK for sentiment analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Today's information overload makes it enormously difficult to access the
right information, especially on the World Wide Web. The situation
becomes worse when the information is in multiple languages. When a user
accesses a document written in multiple languages and has difficulty
finding facts, it is important to filter out everything except
the facts that match the user's interest. Information
Extraction (IE) systems are primarily used to extract various types of information
from documents pertaining to specific languages and domains [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Existing IE techniques, however,
sometimes discard specific facts that match the user's interest,
and sometimes return keywords that are irrelevant to it.
Users, on the other hand, can manually find more relevant information
for their domain of interest than a system can provide [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Most IE systems process texts in sequential phases (or "steps") such as lexical
and morphological processing, recognizing and typing proper names,
parsing larger syntactic constituents, resolving anaphora and coreference, and
finally extracting domain-relevant events and relationships [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Information Extraction
systems need to be easily adaptable to new domains in order to increase their
suitability for end-user applications [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Rapid growth in information technology
over the last two decades has increased the amount of information available
online. Social media has emerged as a new way to share information: it is a
platform for communication in which people share and exchange
information in virtual communities and networks (such as Facebook and Twitter)
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Event detection is one of the major components of the broader Topic
Detection and Tracking (TDT) initiative, which is primarily concerned with
written and spoken broadcast news stories
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Driven by the Message Understanding Conference (MUC) competitions, work on Information Extraction, and especially
on named entity recognition (NER), has mostly concentrated on narrow subdomains,
such as newspaper articles about terrorist attacks (MUC-3 and MUC-4) and reports on
air vehicle launches (MUC-7) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. What is required is a system that can process different types of documents without
much tuning for each document type. Many existing IE systems have been successfully
adjusted, manually or semi-automatically, to new domains and applications,
but only some progress has been made
on making a single system robust enough to remove this
requirement [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Recent research in this area starts from the notion that statistical machine
learning is the best way to solve information extraction problems. The primary
objective of information extraction is to find structured information in
unstructured or semi-structured text [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>System overview</title>
      <p>Our model works primarily on the concepts of part-of-speech (POS) tagging and
named entity recognition (NER). We represent each word with its POS tag and the
entity identified for that word. We assume that the event occurs in the first line of the
document. If we do not find it in the first line, we fall back on sentiment
analysis: text with negative polarity is taken to be associated with an
unexpected event that has a negative sense. We use NLTK for sentiment analysis.</p>
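      <p>The decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the names <monospace>Token</monospace>, <monospace>toy_polarity</monospace> and <monospace>find_event_sentence</monospace> are hypothetical, the test for whether the first sentence contains an event is left to a caller-supplied function because the text does not define it, and the toy word-list scorer merely stands in for the NLTK-based sentiment analysis.</p>
      <preformat><![CDATA[
```python
from typing import Callable, List, NamedTuple, Optional


class Token(NamedTuple):
    """A word with its POS tag and NER tag (tag names are illustrative)."""
    word: str
    pos: str
    ner: str


# Toy lexicon standing in for a real sentiment analyser (assumption).
NEGATIVE_WORDS = {"killed", "injured", "dead", "destroyed", "collapsed"}


def toy_polarity(sentence: List[Token]) -> float:
    """Return a score in [-1, 0]: minus the fraction of negative words."""
    if not sentence:
        return 0.0
    hits = sum(1 for t in sentence if t.word.lower() in NEGATIVE_WORDS)
    return -hits / len(sentence)


def find_event_sentence(
    sentences: List[List[Token]],
    has_event: Callable[[List[Token]], bool],
    polarity: Callable[[List[Token]], float] = toy_polarity,
) -> Optional[int]:
    """Return the index of the sentence assumed to carry the event."""
    if not sentences:
        return None
    # Rule 1: the event is assumed to be in the first sentence.
    if has_event(sentences[0]):
        return 0
    # Rule 2 (assumed reading of the fallback): take the first later
    # sentence whose polarity is negative.
    for i, sent in enumerate(sentences[1:], start=1):
        if polarity(sent) < 0:
            return i
    return None
```
]]></preformat>
      <p>In this sketch a real sentiment scorer can be passed in through the <monospace>polarity</monospace> parameter without changing the rule itself.</p>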
      <p>Most of the information, such as the Time argument (Date) and the Place
argument (Location), is easily extracted by named entity recognition. Candidates for the
Speed argument and for Casualties are first identified by the "CD" (cardinal number) POS tag and then distinguished with
the help of the NER tag.</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>We developed a modular method for information extraction from text in Indian
languages. The dataset provided by the organisers was collected from various
sources such as blogs, microblogs, social media and newswires, and is either in
Roman script or code-mixed, that is, an Indian language mixed with English. We
worked on the English and Hindi data. The key components of the proposed
work (Figure 1) are described as follows.</p>
      <p>The dataset provided by the task organisers covers three languages: English, Hindi and
Tamil. The training data is in XML format and the test data is raw text. The
statistics of the dataset are shown in Table 1.</p>
      <p>We used the Stanford POS Tagger for part-of-speech tagging of the English
data. For Hindi, we trained a POS tagger on a Hindi treebank dataset using the
conditional random field (CRF) algorithm. Our entire system relies on the POS tagger:
part-of-speech tagging is the first step of the extraction task.</p>
      <sec id="sec-4-1">
        <title>Named Entity Recognition</title>
        <p>As with POS tagging, we used the Stanford NER Tagger for named entity
recognition on the English data. For Hindi, as with the POS tagger, we used the
Hindi treebank dataset to build an NER tagger. By default, the Stanford NER
Tagger uses neither part-of-speech tags nor gazetteers to extract locations. The pair of
POS tag and NER tag, together with the word itself, helps to extract information such as
Person and Date.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Chunking</title>
        <p>To extract full information related to the event, such as casualties or
depth, it is necessary to identify whole phrases. The complete
phrase is therefore extracted with the Stanford Parser. Sentiment analysis is applied to an extracted phrase
in which a verb is followed by a cardinal number that the NER tagger has not tagged as a date.
Such a phrase may carry information about casualties. The pattern fails
when a positive event follows the same pattern as casualties. To avoid this situation
we check the sentiment polarity, using a threshold value of
0.5.</p>
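        <p>The casualty pattern and the polarity check can be sketched as below. This is an illustration under stated assumptions rather than the system's implementation: the function names, the Penn-Treebank-style tags ("VB*", "CD") and the "DATE" label are assumed, the look-ahead window of five tokens is one possible reading of the window mentioned in the rule list, and the text does not say how the 0.5 threshold is compared, so the sketch keeps a phrase only when its negativity score is at least 0.5. The toy scorer is a stand-in for a real sentiment analyser.</p>
        <preformat><![CDATA[
```python
from typing import Callable, List, Tuple

Tagged = Tuple[str, str, str]  # (word, POS tag, NER tag); tag names are assumed

# Toy word list standing in for a real sentiment analyser (assumption).
NEGATIVE_WORDS = {"killed", "injured", "dead", "died"}


def toy_negativity(words: List[str]) -> float:
    """Fraction of words found in the toy negative word list."""
    if not words:
        return 0.0
    return sum(w.lower() in NEGATIVE_WORDS for w in words) / len(words)


def casualty_candidates(tokens: List[Tagged], max_window: int = 5) -> List[List[str]]:
    """Collect spans where a verb is followed, within max_window tokens,
    by a cardinal number that is not tagged as a date."""
    spans = []
    for i, (_, pos, _) in enumerate(tokens):
        if not pos.startswith("VB"):
            continue
        last = min(i + max_window, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            _, pos_j, ner_j = tokens[j]
            if pos_j == "CD" and ner_j != "DATE":
                spans.append([w for w, _, _ in tokens[i:j + 1]])
                break  # keep only the nearest number for this verb
    return spans


def keep_negative(
    spans: List[List[str]],
    negativity: Callable[[List[str]], float] = toy_negativity,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Drop spans whose negativity score is below the threshold."""
    return [s for s in spans if negativity(s) >= threshold]
```
]]></preformat>
        <p>With this filter, a phrase such as "won 3" matches the verb-number pattern but is discarded because it carries no negative polarity.</p>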
        <p>The proposed method is a framework for information extraction from
unstructured user-generated content on social media. Our information extraction
system analyses human-language text as a linguistic structure in order to extract
information about different types of events, time, place, casualties and speed.
We select sentences from the dataset, perform POS tagging, NER
tagging and chunking, and then extract phrases from the POS, NER
and chunk output.</p>
        <list list-type="bullet">
          <list-item><p>Time extraction: If the NER tagger assigns a Date tag, we take that
as the correct date. Otherwise, we match the word against the day names and day abbreviations
of the calendar library, and against the strings "today",
"tomorrow" and "yesterday".</p></list-item>
          <list-item><p>Event extraction: We take an instance from the dataset, apply the
lemmatizer to it, and extract its 10 most frequent lemmas,
excluding stopwords and punctuation marks. Among these lemmas, the event
is one that is a noun and occurs in the
first sentence of the instance.</p></list-item>
          <list-item><p>Place extraction: A candidate phrase must contain noun,
proper noun and preposition POS tags. If the phrase is tagged
as a location by the NER tagger and its words start with
a capital letter, it is taken to represent the place.</p></list-item>
          <list-item><p>Casualties extraction: A candidate phrase must
contain cardinal number, verb, noun and preposition POS tags. From our
data analysis, the cardinal number and the verb should lie within a window of 1 to 5
words. Because some such phrases
represent time, we additionally require that the NER tag is not
Date.</p></list-item>
          <list-item><p>Speed extraction: A candidate phrase must contain a
cardinal number, a noun and a preposition, and must not
match the extracted Time or Casualties. For the remaining phrases, we check
the word accompanying the cardinal number against a measurement-unit
dictionary, which was compiled manually
and covers most measurement units.</p></list-item>
        </list>
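        <p>Three of the rules listed above (time, event and speed) can be sketched as follows. This is a minimal illustration, not the system's implementation: all function names are hypothetical, the tag names ("DATE", "CD", "NN*") are assumed, the input is taken to be already tagged and lemmatized, and the small <monospace>UNITS</monospace> set only stands in for the manually built measurement-unit dictionary, whose contents the text does not list.</p>
        <preformat><![CDATA[
```python
import calendar
from collections import Counter
from typing import AbstractSet, List, Optional, Tuple

Tagged = Tuple[str, str, str]  # (word, POS tag, NER tag); tag names are assumed

# Day names and abbreviations from the standard calendar module (English in
# the default locale), plus the three relative-day strings named in the text.
DAY_WORDS = {d.lower() for d in list(calendar.day_name) + list(calendar.day_abbr)}
DAY_WORDS |= {"today", "tomorrow", "yesterday"}

# Tiny hypothetical stand-in for the manually built unit dictionary.
UNITS = {"kmph", "mph", "km/h", "knots"}


def extract_time(tokens: List[Tagged]) -> List[str]:
    """Prefer words tagged DATE by the NER tagger; otherwise match day words."""
    dates = [word for word, _, ner in tokens if ner == "DATE"]
    if dates:
        return dates
    return [word for word, _, _ in tokens if word.lower() in DAY_WORDS]


def select_event(
    sentences: List[List[Tuple[str, str]]],
    stopwords: AbstractSet[str],
    top_n: int = 10,
) -> Optional[str]:
    """Pick the most frequent lemma that is a noun in the first sentence.

    `sentences` holds (lemma, POS tag) pairs, one list per sentence.
    """
    if not sentences:
        return None
    # Count lemmas, skipping stopwords and tokens with no letter or digit.
    counts = Counter(
        lemma.lower()
        for sent in sentences
        for lemma, _ in sent
        if lemma.lower() not in stopwords and any(c.isalnum() for c in lemma)
    )
    first_nouns = {lemma.lower() for lemma, pos in sentences[0] if pos.startswith("NN")}
    for lemma, _ in counts.most_common(top_n):
        if lemma in first_nouns:
            return lemma
    return None


def extract_speed(
    tokens: List[Tagged], exclude: AbstractSet[str] = frozenset()
) -> List[str]:
    """Return 'number unit' pairs whose number is not already used
    as a Time or Casualties value (passed in through `exclude`)."""
    found = []
    for (word, pos, _), (nxt, _, _) in zip(tokens, tokens[1:]):
        if pos == "CD" and word not in exclude and nxt.lower() in UNITS:
            found.append(f"{word} {nxt}")
    return found
```
]]></preformat>
        <p>One consequence of matching day abbreviations case-insensitively, as this sketch does, is that ordinary words such as "sun" or "sat" also match, so a real system would probably need an additional check.</p>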
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Result</title>
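      <p>The evaluation metrics for the event extraction task are shown in Table 2.
The task organisers report Precision (P), Recall (R) and F-measure:</p>
      <disp-formula><tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}</tex-math></disp-formula>
      <disp-formula><tex-math>\mathrm{Recall} = \frac{TP}{TP + FN}</tex-math></disp-formula>
      <disp-formula><tex-math>F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math></disp-formula>
      <p>where TP is the number of true positives, FP the number of false positives and FN the number of false negatives. We
extracted five of the seven information types, namely Speed, Casualties, Time,
Event and Place.</p>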
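      <p>For concreteness, the three measures can be computed from raw counts as in the short sketch below. The function name is hypothetical, the F-measure is taken to be the usual harmonic mean of precision and recall, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified by the task.</p>
      <preformat><![CDATA[
```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative counts only; these are not results from the evaluation.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```
]]></preformat>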
      <p>The results show that our Hindi model performs worse than
the English model, which we attribute to the lack of freely available Hindi POS and
NER taggers.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We have discussed our rule-based methodology for the task of
information extraction from newswires and social media text in Indian languages. We
tested the methodology on Hindi and English and derived some
insights from the results. The F-measures achieved are 39.98% and 45.07%
for Hindi and English respectively. We believe that combining a
probabilistic approach with the rule-based one will improve the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavrenko</surname>
          </string-name>
          , V.:
          <article-title>On-line new event detection and tracking</article-title>
          .
          <source>In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <volume>37</volume>
          –
          <fpage>45</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Appelt</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          : Introduction to information extraction.
          <source>Ai Communications</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>161</volume>
          –
          <fpage>172</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cowie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehnert</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Information extraction</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>39</volume>
          (
          <issue>1</issue>
          ),
          <volume>80</volume>
          –
          <fpage>91</fpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Karkaletsis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spyropoulos</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petasis</surname>
          </string-name>
          , G.:
          <article-title>Named entity recognition from greek texts: the gie project</article-title>
          .
          <source>In: Advances in Intelligent Systems</source>
          , pp.
          <volume>131</volume>
          –
          <fpage>142</fpage>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ursu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition from diverse text types</article-title>
          .
          <source>In: Recent Advances in Natural Language Processing 2001 Conference</source>
          . pp.
          <volume>257</volume>
          –
          <issue>274</issue>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Morgan,
          <string-name>
            <surname>M.B.H.</surname>
          </string-name>
          , Van Keulen,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Information extraction for social media</article-title>
          .
          <source>In: Proceedings of the Third Workshop on Semantic Web and Information Extraction</source>
          . pp.
          <volume>9</volume>
          –
          <issue>16</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sundheim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the fifth message understanding conference (muc-5). Columbia, MD: ARPA</article-title>
          , Morgan Kaufmann (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sundheim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the seventh message understanding conference (muc7)</article-title>
          . ARPA, Morgan Kaufmann (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>