<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event Extraction From Radio News Bulletins</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kim van Putten</string-name>
          <email>ke.vanputten@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victor de Boer</string-name>
          <email>v.de.boer@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oana Inel</string-name>
          <email>oana.inel@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>lora.aroyo@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>We formulate the research question: "Can we find a more effective way to extract events from the KB radio news bulletins to improve linkage within the DIVE+ demonstrator?". We aim to find a better approach to extract events from the KB dataset rather than extracting the first 100 characters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Exploratory search systems provide information to users with an
unclear information need, by offering support for browsing
strategies through carefully designed interfaces that support interactive
forms of search [3]. DIVE+1 is a linked data digital cultural heritage
collection browser that organizes historical media objects and
facilitates exploratory search through event-centric linking of the data
[2]. The DIVE+ browser facilitates exploration and learning through
an intuitive and interactive interface which allows the end user to
browse media objects from four heritage institutions (Netherlands
Institute for Sound and Vision2, Dutch National Library (KB)3,
Amsterdam Museum4 and Tropenmuseum5). All objects have
metadata that includes descriptive text, related entities such as actors,
places and events, and links between them.</p>
      <p>In the DIVE+ project, event extraction proved to be particularly
difficult for KB media objects, i.e., radio news bulletins (see Figure
1). On the one hand, the media objects in the KB dataset deal with
several issues (see footnote 6) introduced by the OCR software: (1)
garbage strings, (2) misidentified characters and spelling errors.
Due to these errors, Natural Language Processing (NLP) tools also
struggle to extract meaningful entities [1]. On the other hand, the
NEs in the metadata are not always correct or identified (e.g.,
Staatsoourant was incorrectly classified as an actor, the place DEN HAAG
was not extracted). Considering these issues, and the fact that currently
the KB objects do not have a well-defined event, we formulate the
following research question: "Can we find a more effective way to
extract events from the KB radio news bulletins to improve linkage
within the DIVE+ demonstrator?"
1http://diveplus.frontwise.com/
2http://www.beeldengeluid.nl/
3https://www.kb.nl/
4https://www.amsterdammuseum.nl/
5http://www.opencultuurdata.nl/wiki/tropenmuseum/
© 2017 Copyright held by the author/owner(s).
      </p>
      <p>SEMANTiCS 2017 workshop proceedings: EVENTS
September 11-14, 2017, Amsterdam, Netherlands</p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>This section describes our research methodology for finding a more
suitable method to extract events from the radio bulletins. We apply
our research methodology on a subset of 215 news radio bulletins
from KB dating from April 1939.
</p>
    </sec>
    <sec id="sec-3">
      <title>Preprocessing</title>
      <p>As mentioned previously, the content of the bulletins contains
errors caused by OCR. Therefore, we first perform garbage removal
from the text of the bulletins by adopting a series of pattern-based
approaches from [4]. When a string is identified as a garbage string,
it is removed from the text. Second, we perform sentence boundary
detection by assuming that all sentences end with a period.
</p>
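The two preprocessing steps can be sketched as follows (Python; the garbage heuristics shown are an illustrative subset of our own choosing, not the exact pattern set of [4], which is larger and tuned to OCR output):

```python
import re

# A few pattern-based garbage heuristics in the spirit of Taghva et al. [4];
# the full rule set in that paper contains more patterns than shown here.
GARBAGE_RULES = [
    lambda s: len(s) > 40,                                        # implausibly long token
    lambda s: re.search(r"(.)\1\1\1", s) is not None,             # same char 4+ times in a row
    lambda s: len(re.findall(r"[^A-Za-z0-9]", s)) > len(s) // 2,  # mostly punctuation
    lambda s: re.fullmatch(r"[^aeiouAEIOU]{6,}", s) is not None,  # long vowel-less run
]

def is_garbage(token: str) -> bool:
    """A token is removed if any garbage rule fires."""
    return any(rule(token) for rule in GARBAGE_RULES)

def preprocess(text: str) -> list:
    """Garbage removal followed by naive sentence boundary detection,
    assuming every sentence ends with a period."""
    cleaned = " ".join(t for t in text.split() if not is_garbage(t))
    return [s.strip() for s in cleaned.split(".") if s.strip()]
```

Note that, as discussed in the results, such rule sets exhibit a seesaw behaviour: tightening them to catch more garbage also starts removing legitimate tokens.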
    </sec>
    <sec id="sec-4">
      <title>Event Extraction</title>
      <p>We distinguish two types of events: named events and unnamed
events. Named events are events which have a name, e.g., Olympische
Spelen (Olympic Games). Unnamed events are linguistic events, which do not have a
name, e.g., the sentence "functionarissen uit Spaansch Marokko is in
RABAT aangekomen." ("officials from Spanish Morocco arrived in RABAT")
describes the event of arriving in Rabat.</p>
      <p>Named Event Extraction: To extract named events from the
bulletins, we used the NLP system Frog6. When Frog recognizes a
token in the text as a NE, it assigns it a type (i.e., person,
organization, location, product, event or miscellaneous). To identify the
events, we extracted only the tokens which have been typed by
Frog as events.</p>
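Reading Frog's output is configuration-specific and omitted here; assuming we already have (token, tag) pairs in which Frog's NER module marks events with BIO-style labels (B-EVE for the first token of a mention, I-EVE for continuations), collecting the event-typed tokens can be sketched as:

```python
def extract_named_events(tagged_tokens):
    """Collect named-event mentions from (token, ner_tag) pairs.
    The tag scheme (B-EVE / I-EVE / O) reflects Frog's NER output;
    how the pairs are obtained from Frog is tool-specific."""
    events, current = [], []
    for token, tag in tagged_tokens:
        if tag == "B-EVE":               # a new event mention starts
            if current:
                events.append(" ".join(current))
            current = [token]
        elif tag == "I-EVE" and current:  # continuation of the current mention
            current.append(token)
        else:                             # any other tag closes the mention
            if current:
                events.append(" ".join(current))
                current = []
    if current:
        events.append(" ".join(current))
    return events
```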
      <p>Unnamed Event Extraction: Since unnamed events cannot be
detected in texts with typical NER tools, we first identify actions
by means of verbs, using the NLP tool TreeTagger7. We are
interested in identifying eventful sentences, i.e., sentences that contain
one or more unnamed events under the pattern someone, doing
something, somewhere. We attempt to extract unnamed events from
the bulletins using a knowledge-driven approach which exploits
the NEs already in the metadata of the bulletin and the actions
(i.e., verbs) identified by TreeTagger. Since not all events might
be associated with both an actor and a place, we introduce a tiered
method of sentence extraction: (1) Tier 1: Sentence contains at least
a verb, an actor and a place; (2) Tier 2: Sentence contains at least a
verb, and either at least an actor or a place; and (3) Tier 3: Sentence
contains at least a verb. A sentence matching tier 1 is more likely
to contain an unnamed event than a sentence of tier 2, and thus
tier 1 is preferred over tier 2 and tier 2 is preferred over tier 3. If
there are no sentences that have at least one verb, then we keep
the initial string of the first 100 characters as the event.
6https://languagemachines.github.io/frog/
7http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
      </p>
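The tiered selection can be sketched as follows (Python; the verb, actor and place sets are assumed to come from TreeTagger and the bulletin metadata respectively, and matching is simplified here to exact surface-form overlap):

```python
def tier(sentence_tokens, verbs, actors, places):
    """Return the tier of a sentence (1 is best, 3 is worst, None if no
    verb), following the pattern: someone, doing something, somewhere."""
    toks = set(sentence_tokens)
    has_verb = bool(toks & verbs)
    has_actor = bool(toks & actors)
    has_place = bool(toks & places)
    if not has_verb:
        return None                  # no action: cannot be an unnamed event
    if has_actor and has_place:
        return 1                     # verb + actor + place
    if has_actor or has_place:
        return 2                     # verb + (actor or place)
    return 3                         # verb only

def best_sentence(sentences, verbs, actors, places):
    """Pick the sentence with the best (lowest) tier; returns None when no
    sentence has a verb, in which case the caller keeps the initial
    first-100-characters string as the event."""
    ranked = [(tier(s, verbs, actors, places), s) for s in sentences]
    ranked = [(t, s) for t, s in ranked if t is not None]
    return min(ranked, default=(None, None))[1]
```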
    </sec>
    <sec id="sec-5">
      <title>RESULTS</title>
      <p>In this section we briefly present the results of all the intermediate
steps of the event extraction pipeline.
</p>
    </sec>
    <sec id="sec-6">
      <title>Preprocessing Results</title>
      <p>Overall, 2,574 garbage strings were removed from the data.
Despite the large number of strings removed, there are still
undetected garbage strings. All attempts at adding new rules, or changing
the current ones to recognize similar words, resulted in a seesaw
phenomenon, where the removal of garbage strings also led to the
removal of non-garbage strings.</p>
    </sec>
    <sec id="sec-7">
      <title>Event Extraction Analysis</title>
      <p>The extracted named events and eventful sentences are analyzed
to determine how well the extraction methods perform.</p>
      <p>Analysis of Named Event Extraction: Frog extracted a total of 57
events from the 215 bulletins in the data. Overall, it appears that
Frog performed very poorly on event extraction. Only 4 out of 57
extracted events are actually events, and 2 out of these 4 have an
incorrect span.</p>
      <p>Analysis of Unnamed Event Extraction: We extracted one sentence
per bulletin using the 3-tier extraction method: 92 sentences in tier
1, 85 sentences in tier 2, 15 sentences in tier 3, and for 23 bulletins
we found no sentence containing a verb, so we kept the initial event.
Further, we manually evaluated the sentences extracted with the
3-tier method and compared them with the baseline, i.e., the current
event strings in the metadata. A string was considered an event if
(1) it was reported as something that happened, is happening, or
will/may happen at a later date, (2) it is based on a verb or a set of
verbs, and (3) it has historic value.</p>
      <p>We found that, of the original events extracted by taking the
first 100 characters, only 8.4% contained unnamed events. Of the
sentences that were extracted with the new 3-tier method, 77.2%
were eventful. Thus, the new method of extraction provides
better-suited events mentioned in the bulletins. The event
strings in the metadata that were found eventful contained exactly
one event. The newly extracted eventful sentences contained more
than one event on average (1.5 events), which means that overall
they are more expressive. We identified the following reasons why
a sentence did not contain an unnamed event: (1) no new sentence
was extracted because TreeTagger did not recognize any verbs
in the text of the bulletins (bulletins without verbs or misspelled
verbs); (2) words were incorrectly tagged as verbs; (3) incorrect
sentence boundary detection; and (4) incorrect NEs in the metadata.
For the first two observations, the quality of the OCR negatively
impacts the performance of the event extraction. We address the
fourth observation in the next part.</p>
    </sec>
    <sec id="sec-8">
      <title>Improvements of the Event Extraction</title>
      <p>The extraction of unnamed events relies on finding relationships
between verbs and NEs. However, in Section 1 we saw that the NEs
in the metadata of the bulletins are not always correct. Overall, we
see that about a quarter of all the NEs are incorrect or mistyped.
Actors have the largest percentage of correct NEs, but
simultaneously the largest percentage of incorrectly extracted NEs
(15.4%). Next, we investigated whether we can improve the named
entities from the bulletins using Frog. Frog extracted a total of 5,807
NEs of type person, organization, location and event. However, we
see that Frog performs poorly on the extraction of all entity types
(only around 20% were correct) and hence we chose not to use the
NEs extracted by Frog in our pipeline.</p>
      <p>We further analyzed two assumptions: (1) Sentences that contain
a verb, an actor and a place are more likely to contain unnamed events
than sentences which do not have both an actor and a place. To test
this assumption, we ran the unnamed event extraction with a
2-tier method which is identical to the 3-tier method except that we
omit tier 1. We evaluated the sentences extracted by the 2-tier method
and compared them to the sentences extracted by the 3-tier method.
Results show that fewer of the sentences extracted by the 2-tier
method are eventful compared to the 3-tier method (0.65 compared
to 0.8). (2) The main event or most important events are mentioned at
the beginning of the text. We conclude that limiting the extraction to
a specific part of the text results in slightly worse event extraction
because (1) the extractor might be forced to extract a sentence that
matches a lower tier, and (2) if a text contains only one sentence
with a verb, an actor and a place, we may not choose the part of the
text where this sentence is placed.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>This paper presents a methodology to extract events from radio
news bulletins to improve the exploratory search offered by DIVE+,
using a NER tool and a pattern-based approach which exploits the
NE space in the metadata of the bulletins. The new events are full
sentences, less likely to be header information of the bulletin, and
more likely to include relevant NEs and terms that a user might
search for. Therefore, the bulletins are more likely to show up in
search results (see Figure 2). On the one hand, the NER tool Frog was
unsuccessful at extracting events from the radio bulletins. On the
other hand, the pattern-based method improved the events, which
was further beneficial for the searchability and the presentation of
the media objects. Overall, errors in the OCRed data turned out to
be problematic for sentence boundary detection, NE extraction and,
ultimately, the extraction of events. To achieve a finer granularity
of event extraction, future research is necessary to identify the
relationship between the NEs and the verbs that describe
an event. It might also be worthwhile to invest further research in
OCR post-correction and normalization to improve the quality of
the data so that better NER can be achieved.</p>
      <p>Figure 2 caption (fragment): "... with the new events. The left object shows a radio bulletin and the right object is the event associated with the bulletin."</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Beatrice</given-names>
            <surname>Alex</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Burns</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Estimating and rating the quality of optically character recognised text</article-title>
          .
          <source>In DATeCH. ACM</source>
          ,
          <fpage>97</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Victor</given-names>
            <surname>De Boer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Johan</given-names>
            <surname>Oomen</surname>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>DIVE into the event-based browsing of linked historical media</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on WWW 35</source>
          (
          <year>2015</year>
          ),
          <fpage>152</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gary</given-names>
            <surname>Marchionini</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Exploratory search: from finding to understanding</article-title>
          .
          <source>Commun. ACM 49</source>
          ,
          <issue>4</issue>
          (
          <year>2006</year>
          ),
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kazem</given-names>
            <surname>Taghva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom</given-names>
            <surname>Nartker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Allen</given-names>
            <surname>Condit</surname>
          </string-name>
          , et al.
          <year>2001</year>
          .
          <article-title>Automatic removal of “garbage strings” in OCR text: An implementation</article-title>
          .
          <source>In WMSCI.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>