Event Extraction From Radio News Bulletins

Kim van Putten, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands (ke.vanputten@gmail.com)
Oana Inel, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands (oana.inel@vu.nl)
Victor de Boer, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands (v.de.boer@vu.nl)
Lora Aroyo, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands (lora.aroyo@vu.nl)

© 2017 Copyright held by the author/owner(s). SEMANTiCS 2017 workshop proceedings: EVENTS, September 11-14, 2017, Amsterdam, Netherlands.

1 INTRODUCTION

Exploratory search systems provide information to users with an unclear information need by offering support for browsing strategies through carefully designed interfaces that support interactive forms of search [3]. DIVE+ (http://diveplus.frontwise.com/) is a linked data digital cultural heritage collection browser that organizes historical media objects and facilitates exploratory search through event-centric linking of the data [2]. The DIVE+ browser facilitates exploration and learning through an intuitive and interactive interface which allows the end user to browse media objects from four heritage institutions: the Netherlands Institute for Sound and Vision (http://www.beeldengeluid.nl/), the Dutch National Library (KB, https://www.kb.nl/), the Amsterdam Museum (https://www.amsterdammuseum.nl/) and the Tropenmuseum (http://www.opencultuurdata.nl/wiki/tropenmuseum/). All objects have metadata which includes descriptive text, related entities such as actors, places and events, and links between them.

In the DIVE+ project, event extraction proved to be particularly difficult for the KB media objects, i.e., radio news bulletins (see Figure 1). On the one hand, the media objects in the KB dataset suffer from several issues introduced by the OCR software: (1) garbage strings, and (2) misidentified characters and spelling errors. Due to these errors, Natural Language Processing (NLP) tools also struggle to extract meaningful entities [1]. On the other hand, the named entities (NEs) in the metadata are not always correct or identified (e.g., Staatsoourant was incorrectly classified as an actor, and the place DEN HAAG was not extracted). Considering these issues, and the fact that the KB objects currently do not have a well-defined event, we formulate the following research question: "Can we find a more effective way to extract events from the KB radio news bulletins to improve linkage within the DIVE+ demonstrator?". We aim to find a better approach to extract events from the KB dataset than taking the first 100 characters of each bulletin.

2 METHODOLOGY

This section describes our research methodology for finding a more suitable method to extract events from the radio bulletins. We apply our methodology to a subset of 215 radio news bulletins from the KB, dating from April 1939.

2.1 Preprocessing

As mentioned previously, the content of the bulletins contains errors caused by OCR. Therefore, we first perform garbage removal on the text of the bulletins by adopting a series of pattern-based approaches from [4]: when a string is identified as a garbage string, it is removed from the text. Second, we perform sentence boundary detection by assuming that all sentences end with a period.
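
For concreteness, the following Python sketch shows one way this preprocessing step could be implemented: a rule-based garbage filter in the spirit of [4], followed by the period-based sentence splitting described above. The function names and the specific rules and thresholds are illustrative assumptions, not the exact rule set used in our pipeline.

import re

def is_garbage(token: str) -> bool:
    # Heuristic garbage detector in the spirit of Taghva et al. [4].
    # The rules and thresholds below are illustrative, not the exact
    # rule set used in the pipeline described above.
    if len(token) > 20:                          # implausibly long string
        return True
    if re.search(r"[^\w]{3,}", token):           # 3+ consecutive punctuation characters
        return True
    alpha = sum(c.isalpha() for c in token)
    if alpha / len(token) < 0.5:                 # mostly non-alphabetic characters
        return True
    if alpha > 4 and not re.search(r"[aeiouyAEIOUY]", token):
        return True                              # long vowel-less strings
    return False

def preprocess(text: str) -> list[str]:
    # Drop garbage tokens, then split into sentences on periods,
    # mirroring the simplifying assumption that sentences end with ".".
    cleaned = " ".join(tok for tok in text.split() if not is_garbage(tok))
    return [s.strip() for s in cleaned.split(".") if s.strip()]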

2.2 Event Extraction

We distinguish two types of events: named events and unnamed events. Named events are events which have a name, e.g., Olymische Spelen (the Olympic Games). Unnamed events are linguistic events, which do not have a name; e.g., the sentence "functionarissen uit Spaansch Marokko is in RABAT aangekomen." (officials from Spanish Morocco arrived in RABAT) describes the event of arriving in Rabat.

Named Event Extraction: To extract named events from the bulletins, we used the NLP system Frog (https://languagemachines.github.io/frog/). When Frog recognizes a token in the text as a NE, it assigns it a type (i.e., person, organization, location, product, event or miscellaneous). To identify named events, we keep only the tokens that Frog has typed as events.
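
As an illustration of this filtering step, the sketch below collects event-typed tokens from Frog's tab-separated output. The column positions (TOKEN_COLUMN, NER_COLUMN) and the exact label casing (e.g., B-EVE/I-EVE) are assumptions that vary across Frog versions and models and should be checked against the actual output.

# Illustrative constants; adjust to the column layout of your Frog output.
TOKEN_COLUMN = 1    # assumed position of the token column
NER_COLUMN = 6      # assumed position of the named-entity (BIO) column

def extract_named_events(frog_output: str) -> list[str]:
    # Collect tokens tagged as events (B-EVE / I-EVE), merging the
    # consecutive tokens of one entity into a single named event string.
    events, current = [], []

    def flush():
        if current:
            events.append(" ".join(current))
            current.clear()

    for line in frog_output.splitlines():
        cols = line.split("\t")
        if len(cols) <= NER_COLUMN:   # blank line: sentence boundary
            flush()
            continue
        token, ner = cols[TOKEN_COLUMN], cols[NER_COLUMN].upper()
        if ner.endswith("-EVE"):
            if ner.startswith("B-"):
                flush()               # a new event entity starts here
            current.append(token)
        else:
            flush()
    flush()
    return events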

Unnamed Event Extraction: Since unnamed events cannot be detected in text with typical NER tools, we first identify actions, by means of verbs, using the NLP tool TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). We are interested in identifying eventful sentences, i.e., sentences that contain one or more unnamed events following the pattern someone, doing something, somewhere. We attempt to extract unnamed events from the bulletins using a knowledge-driven approach which exploits the NEs already present in the metadata of the bulletin and the actions (i.e., verbs) identified by TreeTagger. Since not all events might be associated with both an actor and a place, we introduce a tiered method of sentence extraction: (1) Tier 1: the sentence contains at least a verb, an actor and a place; (2) Tier 2: the sentence contains at least a verb and at least an actor or a place; (3) Tier 3: the sentence contains at least a verb. A sentence matching tier 1 is more likely to contain an unnamed event than a sentence of tier 2, and thus tier 1 is preferred over tier 2, and tier 2 over tier 3. If there is no sentence containing at least one verb, we keep the initial string of the first 100 characters as the event.
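
The tiered selection can be sketched as follows, assuming the verbs identified by TreeTagger and the actors and places from the bulletin metadata are available as plain strings; the case-insensitive substring matching and the choice of the first sentence within the best tier are illustrative simplifications.

def select_event_sentence(sentences: list[str],
                          verbs: set[str],
                          actors: set[str],
                          places: set[str]) -> str:
    # Pick one sentence per bulletin according to the tier preference
    # described above. Matching is naive case-insensitive substring
    # lookup, and the first sentence of the best non-empty tier is
    # returned; both are illustrative simplifications.
    def contains_any(sentence: str, terms: set[str]) -> bool:
        low = sentence.lower()
        return any(term.lower() in low for term in terms)

    tiers = ([], [], [])                  # tier 1, tier 2, tier 3
    for sentence in sentences:
        if not contains_any(sentence, verbs):
            continue                      # every tier requires at least one verb
        has_actor = contains_any(sentence, actors)
        has_place = contains_any(sentence, places)
        if has_actor and has_place:
            tiers[0].append(sentence)     # verb + actor + place
        elif has_actor or has_place:
            tiers[1].append(sentence)     # verb + (actor or place)
        else:
            tiers[2].append(sentence)     # verb only

    for tier in tiers:
        if tier:
            return tier[0]
    # Fallback from the paper: keep the first 100 characters as the event.
    return " ".join(sentences)[:100]

With this fallback, bulletins in which no verb is recognized keep their original first-100-characters string as the event, which corresponds to the 23 bulletins reported in Section 3.2.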

3 RESULTS

In this section we briefly present the results of all the intermediate steps of the event extraction pipeline.

3.1 Preprocessing Results

Overall, 2,574 garbage strings were removed from the data. Despite the large number of removed strings, some garbage strings remain undetected. All attempts to add new rules, or to change the current ones so that they recognize similar words, resulted in a seesaw phenomenon, where the removal of garbage strings also led to the removal of non-garbage strings.

3.2 Event Extraction Analysis

The extracted named events and eventful sentences are analyzed to determine how well the extraction methods perform.

Analysis of Named Event Extraction: Frog extracted a total of 57 events from the 215 bulletins in the data. Overall, Frog performed very poorly on event extraction: only 4 out of the 57 extracted events are actually events, and 2 of these 4 have an incorrect span.

Analysis of Unnamed Event Extraction: We extracted one sentence per bulletin using the 3-tier extraction method: 92 sentences in tier 1, 85 sentences in tier 2, 15 sentences in tier 3, and for 23 bulletins we found no sentence containing a verb, so we kept the initial event. Further, we manually evaluate the sentences extracted with the 3-tier method and compare them with the baseline, i.e., the current event strings in the metadata. A string was considered an event if (1) it was reported as something that happened, is happening, or will/may happen at a later date, (2) it is based on a verb or a set of verbs, and (3) it has historic value.

We found that of the original events, which were extracted by taking the first 100 characters, only 8.4% contained unnamed events, whereas 77.2% of the sentences extracted with the new 3-tier method were eventful. Thus, the new extraction method provides better-suited events from the bulletins. The event strings in the metadata that were found eventful contained exactly one event, while the newly extracted eventful sentences contained more than one event on average (1.5 events), which means that, overall, they are more expressive. We identified the following reasons why a sentence did not contain an unnamed event: (1) no new sentence was extracted because TreeTagger did not recognize any verbs in the text of the bulletin (bulletins without verbs or with misspelled verbs); (2) words were incorrectly tagged as verbs; (3) incorrect sentence boundary detection; and (4) incorrect NEs in the metadata. For the first two observations, the quality of the OCR negatively impacts the performance of the event extraction. We address the fourth observation in the next subsection.

3.3 Improvements of the Event Extraction

The extraction of unnamed events relies on finding relationships between verbs and NEs. However, as discussed in Section 1, the NEs in the metadata of the bulletins are not always correct. Overall, we see that about a quarter of all the NEs are incorrect or mistyped. Actors have the largest percentage of correct NEs but, at the same time, the largest percentage of incorrectly extracted NEs (15.4%). Next, we investigated whether we can improve the named entities of the bulletins using Frog. Frog extracted a total of 5,807 NEs of type person, organization, location and event. However, Frog performs poorly on the extraction of all entity types (only around 20% were correct) and hence we chose not to use the NEs extracted by Frog in our pipeline.

We further analyzed two assumptions. (1) Sentences that contain a verb, an actor and a place are more likely to contain unnamed events than sentences which do not have both an actor and a place. To test this assumption, we ran the unnamed event extraction with a 2-tier method, which is identical to the 3-tier method except that tier 1 is omitted. We evaluated the sentences extracted by the 2-tier method and compared them to the sentences extracted by the 3-tier method. The results show that fewer of the sentences extracted by the 2-tier method are eventful compared to the 3-tier method (0.65 compared to 0.8). (2) The main event or most important events are mentioned at the beginning of the text. We conclude that limiting the extraction to a specific part of the text results in slightly worse event extraction because (1) the extractor might be forced to extract a sentence that matches a lower tier, and (2) if a text contains only one sentence with a verb, an actor and a place, we may not choose the part of the text where this sentence is located.

4 CONCLUSION

This paper presents a methodology to extract events from radio news bulletins, in order to improve the exploratory search offered by DIVE+, using a NER tool and a pattern-based approach which exploits the NEs in the metadata of the bulletins. The new events are full sentences, less likely to be header information of the bulletin and more likely to include relevant NEs and terms that a user might search for. Therefore, the bulletins are more likely to show up in search results (see Figure 2). On the one hand, the NER tool Frog was unsuccessful at extracting events from the radio bulletins. On the other hand, the pattern-based method improved the events, which was further beneficial for the searchability and the presentation of the media objects. Overall, errors in the OCRed data turned out to be problematic for sentence boundary detection, NE extraction and, ultimately, for the extraction of events. To achieve a finer granularity of event extraction, future research is needed to identify the relationship between the NEs and the verbs that describe an event. It might also be worthwhile to invest further research in OCR post-correction and normalization to improve the quality of the data, so that better NER can be achieved.
                              Figure 1: An example of an ANP radio news bulletin in the DIVE+ demonstrator




Figure 2: The search result for the query "openingskoersen 1 punt lager" in the DIVE+ demonstrator after the data enrichment
with the new events. The left object shows a radio bulletin and the right object is the event associated with the bulletin.


REFERENCES

[1] Beatrice Alex and John Burns. 2014. Estimating and rating the quality of optically character recognised text. In DATeCH. ACM, 97–102.
[2] Victor De Boer, Johan Oomen, et al. 2015. DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015), 152–158.
[3] Gary Marchionini. 2006. Exploratory search: from finding to understanding. Commun. ACM 49, 4 (2006), 41–46.
[4] Kazem Taghva, Tom Nartker, Allen Condit, et al. 2001. Automatic removal of “garbage strings” in OCR text: An implementation. In WMSCI.