=Paper=
{{Paper
|id=Vol-2063/events-paper4
|storemode=property
|title=Event Extraction From Radio News Bulletins
|pdfUrl=https://ceur-ws.org/Vol-2063/events-paper4.pdf
|volume=Vol-2063
|authors=Kim van Putten,Oana Inel,Victor de Boer,Lora Aroyo
|dblpUrl=https://dblp.org/rec/conf/i-semantics/PuttenIBA17
}}
==Event Extraction From Radio News Bulletins==
Kim van Putten, Oana Inel, Victor de Boer and Lora Aroyo
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
ke.vanputten@gmail.com, oana.inel@vu.nl, v.de.boer@vu.nl, lora.aroyo@vu.nl

SEMANTiCS 2017 workshop proceedings: EVENTS, September 11-14, 2017, Amsterdam, The Netherlands. © 2017 Copyright held by the author/owner(s).

===1 Introduction===
Exploratory search systems provide information to users with an unclear information need by offering support for browsing strategies through carefully designed interfaces that support interactive forms of search [3]. DIVE+ (http://diveplus.frontwise.com/) is a linked data digital cultural heritage collection browser that organizes historical media objects and facilitates exploratory search through event-centric linking of the data [2]. The DIVE+ browser facilitates exploration and learning through an intuitive and interactive interface which allows the end user to browse media objects from four heritage institutions: the Netherlands Institute for Sound and Vision (http://www.beeldengeluid.nl/), the Dutch National Library (KB, https://www.kb.nl/), the Amsterdam Museum (https://www.amsterdammuseum.nl/) and the Tropenmuseum (http://www.opencultuurdata.nl/wiki/tropenmuseum/). All objects have metadata which includes descriptive text, related entities such as actors, places and events, and links between them.

In the DIVE+ project, event extraction proved to be particularly difficult for the KB media objects, i.e., radio news bulletins (see Figure 1). On the one hand, the media objects in the KB dataset suffer from several issues introduced by the OCR software: (1) garbage strings and (2) misidentified characters and spelling errors. Due to these errors, Natural Language Processing (NLP) tools also struggle to extract meaningful entities [1]. On the other hand, the named entities (NEs) in the metadata are not always correct or identified (e.g., Staatsoourant was incorrectly classified as an actor, and the place DEN HAAG was not extracted). Considering these issues and the fact that the KB objects currently do not have a well-defined event, we formulate the following research question: "Can we find a more effective way to extract events from the KB radio news bulletins to improve linkage within the DIVE+ demonstrator?". We aim to find a better approach to extract events from the KB dataset than taking the first 100 characters.

===2 Methodology===
This section describes our research methodology for finding a more suitable method to extract events from the radio bulletins. We apply our research methodology on a subset of 215 radio news bulletins from the KB, dating from April 1939.

====2.1 Preprocessing====
As mentioned previously, the content of the bulletins contains errors caused by OCR. Therefore, we first perform garbage removal on the text of the bulletins by adopting a series of pattern-based approaches from [4]: when a string is identified as a garbage string, it is removed from the text. Second, we perform sentence boundary detection by assuming that all sentences end with a period.
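To illustrate this preprocessing step, the sketch below combines a few simple garbage-string heuristics in the spirit of the pattern-based rules from [4] with the period-based sentence split described above. The specific rules and thresholds (maximum token length, alphanumeric ratio, consonant runs) are illustrative assumptions, not the exact rule set used in the paper.

```python
import re

def is_garbage(token: str) -> bool:
    """Illustrative garbage-string heuristics inspired by Taghva et al. [4];
    the thresholds below are assumptions, not the paper's exact rules."""
    if len(token) > 25:                                    # implausibly long token
        return True
    alnum = sum(ch.isalnum() for ch in token)
    if alnum / len(token) < 0.5:                           # mostly punctuation or symbols
        return True
    if re.search(r"[bcdfghjklmnpqrstvwxz]{6,}", token.lower()):  # long consonant run
        return True
    return False

def remove_garbage(text: str) -> str:
    return " ".join(t for t in text.split() if not is_garbage(t))

def split_sentences(text: str) -> list[str]:
    # Naive boundary detection: assume every sentence ends with a period.
    return [s.strip() for s in text.split(".") if s.strip()]

bulletin = "DE REGEERING te DEN HAAG ,;:-- heeft vandaag xqzpfkrt een verklaring afgelegd."
print(split_sentences(remove_garbage(bulletin)))
# ['DE REGEERING te DEN HAAG heeft vandaag een verklaring afgelegd']
```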
====2.2 Event Extraction====
We distinguish two types of events: named events and unnamed events. Named events are events which have a name, e.g., Olympische Spelen. Unnamed events are linguistic events which do not have a name; e.g., the sentence "functionarissen uit Spaansch Marokko is in RABAT aangekomen." describes the event of arriving in Rabat.

Named Event Extraction: To extract named events from the bulletins, we used the NLP system Frog (https://languagemachines.github.io/frog/). When Frog recognizes a token in the text as a NE, it assigns it a type (i.e., person, organization, location, product, event or miscellaneous). To identify the events, we extracted only the tokens which have been typed by Frog as events.

Unnamed Event Extraction: Since unnamed events cannot be detected in texts with typical NER tools, we first identify actions by means of verbs, using the NLP tool TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). We are interested in identifying eventful sentences, i.e., sentences that contain one or more unnamed events under the pattern someone, doing something, somewhere. We attempt to extract unnamed events from the bulletins using a knowledge-driven approach which exploits the NEs already present in the metadata of the bulletin and the actions (i.e., verbs) identified by TreeTagger. Since not all events can be associated with both an actor and a place, we introduce a tiered method of sentence extraction: (1) Tier 1: the sentence contains at least a verb, an actor and a place; (2) Tier 2: the sentence contains at least a verb and at least either an actor or a place; (3) Tier 3: the sentence contains at least a verb. A sentence matching tier 1 is more likely to contain an unnamed event than a sentence of tier 2, and thus tier 1 is preferred over tier 2 and tier 2 is preferred over tier 3. If there are no sentences that contain at least one verb, then we keep the initial string of the first 100 characters as the event.
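A minimal sketch of this tiered selection is given below, assuming that the verbs per sentence come from a POS tagger such as TreeTagger and that the actor and place NEs come from the bulletin metadata. The data structure, the tie-breaking rule and the example values are illustrative assumptions rather than the authors' exact implementation.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    text: str
    verbs: list[str]    # verbs detected by a POS tagger (e.g., TreeTagger)
    actors: list[str]   # actor NEs from the bulletin metadata found in this sentence
    places: list[str]   # place NEs from the bulletin metadata found in this sentence

def tier(s: Sentence) -> int | None:
    """Return 1 (best) to 3, or None if the sentence contains no verb."""
    if not s.verbs:
        return None
    if s.actors and s.places:
        return 1   # verb + actor + place
    if s.actors or s.places:
        return 2   # verb + (actor or place)
    return 3       # verb only

def extract_event(sentences: list[Sentence], fallback: str) -> str:
    """Pick one sentence per bulletin from the best available tier; if no
    sentence contains a verb, keep the first-100-characters string."""
    ranked = [(t, i, s) for i, s in enumerate(sentences) if (t := tier(s)) is not None]
    if not ranked:
        return fallback[:100]
    # Lower tier wins; the earlier sentence breaks ties (an illustrative choice).
    _, _, best = min(ranked, key=lambda r: (r[0], r[1]))
    return best.text

# Hypothetical example: the tier-1 sentence about the arrival in Rabat is selected.
sentences = [
    Sentence("Nader nieuws volgt.", verbs=["volgt"], actors=[], places=[]),
    Sentence("functionarissen uit Spaansch Marokko is in RABAT aangekomen.",
             verbs=["aangekomen"], actors=["functionarissen"], places=["RABAT"]),
]
print(extract_event(sentences, fallback="NIEUWSBERICHTEN A.N.P. 12 APRIL 1939"))
```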
===3 Results===
In this section we briefly present the results of all the intermediate steps of the event extraction pipeline.

====3.1 Preprocessing Results====
Overall, 2,574 garbage strings were removed from the data. Despite the large number of strings removed, some garbage strings remain undetected. All attempts to add new rules or to change the current ones to recognize similar words resulted in the seesaw phenomenon, where the removal of garbage strings also led to the removal of non-garbage strings.

====3.2 Event Extraction Analysis====
The extracted named events and eventful sentences are analyzed to determine how well the extraction methods perform.

Analysis of Named Event Extraction: Frog extracted a total of 57 events from the 215 bulletins in the data. Overall, Frog performed very poorly on event extraction: only 4 out of the 57 extracted events are actually events, and 2 of these 4 have an incorrect span.

Analysis of Unnamed Event Extraction: We extracted one sentence per bulletin using the 3-tier extraction method: 92 sentences in tier 1, 85 sentences in tier 2, 15 sentences in tier 3, and for 23 bulletins we found no sentence containing a verb, so we kept the initial event. Further, we manually evaluated the sentences extracted with the 3-tier method and compared them with the baseline, i.e., the current event strings in the metadata. A string was considered an event if (1) it was reported as something that happened, is happening, or will/may happen at a later date, (2) it is based on a verb or a set of verbs, and (3) it has historic value.

We found that of the original events, which were extracted by taking the first 100 characters, only 8.4% contained unnamed events. Of the sentences that were extracted with the new 3-tier method, 77.2% were eventful. Thus, the new extraction method provides better suited events mentioned in the bulletins. The event strings in the metadata that were found eventful contained exactly one event, whereas the newly extracted eventful sentences contained more than one event on average (1.5 events), which means that overall they are more expressive.

We identified the following reasons why a sentence did not contain an unnamed event: (1) no new sentence was extracted because TreeTagger did not recognize any verbs in the text of the bulletin (bulletins without verbs or with misspelled verbs); (2) words were incorrectly tagged as verbs; (3) incorrect sentence boundary detection; and (4) incorrect NEs in the metadata. For the first two observations, the quality of the OCR negatively impacts the performance of the event extraction. We address the fourth observation in the next section.

====3.3 Improvements of the Event Extraction====
The extraction of unnamed events relies on finding relationships between verbs and NEs. However, as noted in Section 1, the NEs in the metadata of the bulletins are not always correct. Overall, about a quarter of all the NEs are incorrect or mistyped. Actors have the largest percentage of correct NEs, but simultaneously the largest percentage of incorrectly extracted NEs (15.4%). Next, we investigated whether we can improve the named entities of the bulletins using Frog. Frog extracted a total of 5,807 NEs of type person, organization, location and event. However, Frog performs poorly on the extraction of all entity types (only around 20% were correct) and hence we chose not to use the NEs extracted by Frog in our pipeline.

We further analyzed two assumptions. (1) Sentences that contain a verb, an actor and a place are more likely to contain unnamed events than sentences which do not have both an actor and a place. To test this assumption, we evaluated the unnamed event extraction with a 2-tier method which is identical to the 3-tier method except that tier 1 is omitted, and compared the extracted sentences to those of the 3-tier method. Results show that fewer of the sentences extracted by the 2-tier method are eventful compared to the 3-tier method (0.65 compared to 0.8). (2) The main event or the most important events are mentioned at the beginning of the text. We conclude that limiting the extraction to a specific part of the text results in slightly worse event extraction, because (1) the extractor might be forced to extract a sentence that matches a lower tier and (2) if a text contains only one sentence with a verb, an actor and a place, we may not choose the part of the text where this sentence is placed.
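Building on the earlier sketch, the 2-tier variant used to test assumption (1) can be expressed by collapsing tiers 1 and 2, i.e., by no longer preferring sentences that contain both an actor and a place. This is again an illustrative reading of the method, not the authors' code.

```python
def tier_2tier(s: Sentence) -> int | None:
    """2-tier variant: the actor+place distinction of tier 1 is dropped."""
    if not s.verbs:
        return None
    return 1 if (s.actors or s.places) else 2   # verb + any NE, otherwise verb only
```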
===4 Conclusion===
This paper presents a methodology to extract events from radio news bulletins in order to improve the exploratory search offered by DIVE+, using a NER tool and a pattern-based approach which exploits the NE space in the metadata of the bulletins. The new events are full sentences, less likely to be header information of the bulletin and more likely to include relevant NEs and terms that a user might search for. Therefore, the bulletins are more likely to show up in search results (see Figure 2). On the one hand, the NER tool Frog was unsuccessful at extracting events from the radio bulletins. On the other hand, the pattern-based method improved the events, which was further beneficial for the searchability and the presentation of the media objects. Overall, errors in the OCRed data turned out to be problematic for sentence boundary detection, NE extraction and, ultimately, for the extraction of events. To achieve a finer granularity of event extraction, future research is necessary to identify the relationship between the NEs and the verbs that describe an event. It might also be worthwhile to invest in further research on OCR post-correction and normalization to improve the quality of the data so that better NER can be achieved.

Figure 1: An example of an ANP radio news bulletin in the DIVE+ demonstrator.

Figure 2: The search result for the query "openingskoersen 1 punt lager" in the DIVE+ demonstrator after the data enrichment with the new events. The left object shows a radio bulletin and the right object is the event associated with the bulletin.

===References===
[1] Beatrice Alex and John Burns. 2014. Estimating and rating the quality of optically character recognised text. In DATeCH. ACM, 97–102.

[2] Victor de Boer, Johan Oomen, et al. 2015. DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015), 152–158.

[3] Gary Marchionini. 2006. Exploratory search: from finding to understanding. Commun. ACM 49, 4 (2006), 41–46.

[4] Kazem Taghva, Tom Nartker, Allen Condit, et al. 2001. Automatic removal of "garbage strings" in OCR text: An implementation. In WMSCI.