=Paper=
{{Paper
|id=Vol-2063/events-paper3
|storemode=property
|title=Linking Dutch World War II Cultural Heritage Collections with Events Extracted by Machines and Crowds
|pdfUrl=https://ceur-ws.org/Vol-2063/events-paper3.pdf
|volume=Vol-2063
|authors=Jessie Both,Didi de Hooge,Ramses IJff,Oana Inel,Victor de Boer,Lora Aroyo
|dblpUrl=https://dblp.org/rec/conf/i-semantics/BothHIIBA17
}}
==Linking Dutch World War II Cultural Heritage Collections with Events Extracted by Machines and Crowds==
Jessie Both (University of Amsterdam, Amsterdam, The Netherlands), jessieboth@gmail.com
Didi de Hooge (University of Amsterdam, Amsterdam, The Netherlands), dididehooge@gmail.com
Ramses IJff (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), ramses.ijff@gmail.com
Oana Inel (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), oana.inel@vu.nl
Victor de Boer (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), v.de.boer@vu.nl
Lora Aroyo (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), lora.aroyo@vu.nl
SEMANTiCS 2017 workshop proceedings: EVENTS, September 11-14, 2017, Amsterdam, Netherlands. © 2017 Copyright held by the author/owner(s).

1 INTRODUCTION

The digital revolution is changing the way people access and use knowledge and information. By digitizing collections, cultural heritage institutions aim to fulfill their new role as information providers. However, the majority of the digitized collections lack sufficient metadata and provide no context. This makes the collections hard to access and to search through. Prior research [3] shows that enriching the metadata of digitized collection items with events helps to provide context, structure collections, and describe relationships between collection items.

In general, collections can be significantly enriched through the addition of events. Events can place collection items into context and create structure by linking related items. This improves the accessibility of the items and helps users to understand their context. Both personal events and historical events are relevant for this cause. Personal events can provide an understanding of the day-to-day lives of people involved in historical events, which can further enable large-scale data research. However, detecting and extracting such events is a difficult task, as events are very ambiguous: they lack a conclusive definition, and their status as events frequently depends on personal interpretation.

In this paper, we create a pipeline that detects, extracts and represents personal and historical events to improve the accessibility of cultural heritage items. We show the added value of enriching the metadata of media objects through events and links between them in the context of DIVE+ [1], a Linked Data event-centric collections browser. This research is done on behalf of the Institute for War, Holocaust and Genocide Studies (NIOD, http://www.niod.nl).

2 DATASET

For the current experiment we selected Wikipedia pages related to World War II (WWII), i.e., "Lijsten over de Tweede Wereldoorlog" and "Tweede Wereldoorlog in Nederland". We focused on three themes: deportation, arrest and resistance. This resulted in a corpus of 2,556 pages.

Person vocabulary: For this research we also used a person vocabulary in order to understand which are the most relevant people in our dataset. The person vocabulary consists of data from Online Begraafplaatsen, Erelijst, and the Oranjehotel Kartotheek, filtered based on our three research themes. Our person vocabulary consists of 35,739 names.

Personal and Historical Events: The NIOD thesaurus is used as a source of named historical events in our dataset. From the metadata of the aforementioned person vocabulary we identified four personal events: birth, death, arrival at the Oranjehotel and departure from the Oranjehotel.

3 EXPERIMENTAL METHODOLOGY

This section describes our experimental methodology.

3.1 Automatic Entity Extraction

The first step in the pipeline is to detect and extract entities through machines. The input data was processed by two pipelines. The first pipeline uses three free NER tools: DBpediaSpotlight (http://www.dbpedia-spotlight.org), Targeted Hypernym Discovery (THD, http://entityclassifier.eu/thd/) and SemiTags (http://nlp.vse.cz/SemiTags/). DBpediaSpotlight extracts DBpedia mentions in texts. The THD extractor uses lexico-syntactic patterns to find hypernyms in the text. The SemiTags extractor is an online tool that recognizes named entities and their meaning in a particular context. As this pipeline is specialized in extracting named entities, it is likely to provide historical named events. The second pipeline is the pipeline of BiographyNet [2], a project that uses NLP and Semantic Web Technology to support digital humanities scholars with historical research.
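As an illustration of how such an NER step can be invoked programmatically, the sketch below queries the public DBpedia Spotlight REST endpoint. This is our illustration, not the authors' actual pipeline code: the endpoint URL, the Dutch-language service, and the confidence threshold are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia Spotlight endpoint for Dutch text (assumed; not given in the paper).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/nl/annotate"

def annotate(text: str, confidence: float = 0.5) -> dict:
    """Send one sentence to DBpedia Spotlight and return the parsed JSON response."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    req = urllib.request.Request(
        f"{SPOTLIGHT_URL}?{params}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_entities(response: dict) -> list:
    """Reduce a Spotlight response to (surface form, DBpedia URI) pairs."""
    return [
        (res["@surfaceForm"], res["@URI"])
        for res in response.get("Resources", [])
    ]

if __name__ == "__main__":
    # Hypothetical example sentence; requires network access to run.
    reply = annotate("De Februaristaking vond plaats in Amsterdam in 1941.")
    for surface, uri in extract_entities(reply):
        print(f"{surface} -> {uri}")
```

The parsing step is separated from the HTTP call so the response handling can be reused for batch processing of a page corpus.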
The pipeline is a supervised machine learning system trained to extract biographical metadata such as date and place of birth and death, education, occupation, religion, and parents. This pipeline is more likely to provide personal events.

3.2 Crowdsourcing Experiments

To improve and enrich the results from the NLP tools described in Section 3.1, we created two crowdsourcing tasks on CrowdFlower. The aim of the first task is to validate and correct entities extracted by the two NLP pipelines and to annotate entities that they missed. As a proof of concept, for this task we selected 100 sentences from the Wikipedia pages. The aim of the second task is to create links between the events and the people, places and time periods identified by the two NLP pipelines and the crowd. For each task we gathered judgments from 15 workers. The task showed one row (sentence) per page and workers were paid 2 cents per row. The task was only distributed in the Netherlands and Belgium because of language constraints.

3.3 Modeling Events with the Simple Event Model

To model the events, actors, places and time periods extracted by the machines and the crowd we use the Simple Event Model (SEM). All data in the dataset is converted to RDF using the SEM classes and relationships. We create a first graph from the output of the second crowdsourcing experiment, which is converted to RDF. We create a second graph for the person vocabulary. The WWII thesaurus was already available in RDF, in SKOS. This format is also suitable for SEM, and therefore does not have to be converted. Only concepts which are classified as 'gebeurtenis' (event) in the thesaurus are used, plus one manually added named event. To enhance the model, we conduct some alignment procedures: for the core class place, we automatically generated "sameAs" relationships between identical places in the different graphs. For the core class actor, this cannot be done automatically, as an identical name does not necessarily refer to the same person. Therefore, these "sameAs" relationships are created manually.

4 RESULTS

This section describes the results for each step of the experimental methodology.

Automatic Entity Extraction: Table 1 shows the results of the two pipelines. The THD extractor extracted 145,617 unique entities, mostly locations and months, out of which 6,675 entities are events. However, when analyzing the results from THD there is a lot of noise, e.g., "handen namen" is classified as the Belgian city "Namen". DBpediaSpotlight extracted 115,984 unique entities, with a total number of 2,518 entity types (34,785 entities were extracted without type). While performing well on recognizing locations and persons, DBpediaSpotlight also extracted 3,434 events. The SemiTags extractor extracted 99,170 unique entities, but no entity was classified as an event. With the BiographyNet pipeline we extracted 260,244 unique entities. The BiographyNet pipeline is especially good at recognizing events, names and precise dates.

Crowdsourcing Results: The 100 sentences run through the crowdsourcing pipeline contained 431 automatically extracted entities: 132 events, 74 dates, 128 locations and 97 actors. In the first crowdsourcing task the workers annotated 167 extra entities and removed 34 entities. This resulted in a total of 564 entities, consisting of 102 dates, 228 events, 125 places and 61 actors. For the second crowdsourcing task we used 71 sentences to connect events and places, 72 sentences to connect events and actors and 77 sentences to connect events and time periods. From these sentences, a total of 94 events were linked to 58 places, 61 actors and 64 time periods.

Event Modeling: After we converted all the data into SEM classes we obtained a knowledge graph with 209,183 triples. After generating "owl:sameAs" relations between names and locations, the knowledge graph extended to 209,245 triples. Looking at the four core classes in SEM, our enrichment contains 18,316 personal and historical event triples, 9,300 actor triples, 875 place triples and 60 time period triples, as shown in Table 2.

5 ILLUSTRATIVE EXAMPLE

To illustrate the effectiveness of SEM for our dataset, this section describes an illustrative example through the interface of the DIVE+ demonstrator. When searching for the historical event "februaristaking" in DIVE+, all related events, objects, locations and actors are shown (see Figure 1). When looking at one of the objects we see all entities related to that specific object. For example, we can further explore the entity Communistische Partij Nederland (CPN), Figure 2, which is an important concept for the February Strike. As objects are linked to each other, we make the objects more easily accessible and create meaningful relations between them. With the use of the event model we tell stories around and about the object, and scholars can get a more complete overview of the event.

Figure 1: Media Objects, Events, Actors and Places related to 'Februaristaking'

For the personal event, we focus on Eduard Popko van Groningen. Without the model, information about him is fragmented over the collections of Erelijst and Online Begraafplaatsen. All this information is now available in DIVE+ through one query. Figure 3 shows that the query leads to an actor, which is related to five events. The first four are the events of birth and death as described
Table 1: Overview of Entities Extracted by each Tool

Tool              #Pages Processed  #Entities Extracted  % of Unique Entities  Avg. #Entities per Page  #Entity Types  #Entities without Type
THD               2,665             145,617              73.50%                64.64                    2,517          0
DBpediaSpotlight  2,665             115,984              37.69%                43.52                    193            34,785
SemiTags          2,665             99,170               31.37%                37.21                    4              0
BiographyNet      2,308             349,321              32.62%                151.35                   11             0

Table 2: Triples Overview per SEM Core Class

SEM core class  Triples
Event           18,317
Actor            9,300
Place              875
Time                60
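The SEM-based event modeling described in Section 3.3 can be sketched in a few lines of code. The sketch below is our illustration, not the authors' conversion pipeline: the example URIs are hypothetical and the hand-rolled N-Triples serialization is an assumption made to keep it self-contained, but the class and property names (sem:Event, sem:Actor, sem:Place, sem:hasActor, sem:hasPlace, sem:hasTimeStamp) come from the SEM vocabulary itself.

```python
# Minimal N-Triples emitter for SEM events (illustrative sketch, stdlib only).
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def sem_event_triples(event_uri, actor_uri, place_uri, timestamp):
    """Return N-Triples lines typing one event and linking its actor, place and time."""
    triples = [
        (event_uri, RDF_TYPE, f"{SEM}Event"),
        (actor_uri, RDF_TYPE, f"{SEM}Actor"),
        (place_uri, RDF_TYPE, f"{SEM}Place"),
        (event_uri, f"{SEM}hasActor", actor_uri),
        (event_uri, f"{SEM}hasPlace", place_uri),
    ]
    lines = [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]
    # The timestamp is a typed literal rather than a URI.
    lines.append(
        f'<{event_uri}> <{SEM}hasTimeStamp> '
        f'"{timestamp}"^^<http://www.w3.org/2001/XMLSchema#date> .'
    )
    return lines

if __name__ == "__main__":
    # Hypothetical URIs for a birth event of a person from the vocabulary.
    for line in sem_event_triples(
        "http://example.org/event/birth-1234",
        "http://example.org/actor/person-1234",
        "http://example.org/place/amsterdam",
        "1920-05-01",
    ):
        print(line)
```

Emitting plain N-Triples keeps the sketch dependency-free; the resulting file can be loaded into any triple store alongside the SKOS thesaurus graph.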
in their original source. The fifth event is an event extracted from a Wikipedia page. The related entities are, in turn, also related to other entities: in the case of the extracted event, to the actor, Eduard, and a location, Sachsenhausen. New opportunities for digital humanities scholars appear through the entity 'Sachsenhausen'. The location Sachsenhausen is linked to 21 events which are all connected with other actors. This means that potential fellow prisoners of Eduard are known. Another storyline could be that the actors of the other 21 events were also shot on the 3rd of May 1942, which would indicate that a bigger event took place.

Figure 2: Exploration of related concept 'CPN'

Figure 3: Events related to 'Eduard Popko van Groningen'

6 CONCLUSION AND FUTURE WORK

Digitization projects made various cultural heritage collections publicly available online. However, due to unstructured metadata and the distributed allocation of the objects in different databases, purely digitizing objects does not make them more accessible. In this research we used a hybrid machine-crowd approach to extract personal and historical events and use them for the enrichment of the Dutch digital WWII objects metadata within the NOB portal. We used automatic event extraction, making use of different information extraction tools, and manual event curation through the use of a crowdsourcing strategy. We applied our methodology to a set of Wikipedia pages related to WWII. We illustrated the historical and personal events in the DIVE+ demonstrator. With our created historical and personal event model we are able to enrich the metadata of the objects in the NOB portal. In this way the available digital resources of the NOB portal can be linked and enhanced with storylines told about and around the objects.

Future work can improve the pipeline. For example, the entity extraction could be improved or tested with other NLP tools better suited for Dutch texts. Another way to improve entity extraction is by tweaking the crowdsourcing tasks. Referring words, such as 'he' and 'that', were not extracted because the context around the sentence is missing; as a consequence, a lot of verbs could not be linked with an actor. We recommend a pilot that extracts entities from paragraphs to improve results. Lastly, we suggest enriching SEM by adding subtypes to the core classes actor, event, place and time. For example, nationalities can be added to names of people for a richer context.

REFERENCES

[1] Victor De Boer et al. 2015. DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015), 152–158.
[2] Antske Fokkens, Serge Ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and Guus Schreiber. 2014. BiographyNet: Methodological Issues when NLP supports historical research. In LREC.
[3] Walter Renteria-Agualimpia, Francisco J. Lopez-Pellicer, Javier Lacasta, F. Javier Zarazaga-Soria, and Pedro R. Muro-Medrano. 2016. Improving the geospatial consistency of digital libraries metadata. Journal of Information Science 42, 4 (2016), 507–523.