=Paper=
{{Paper
|id=Vol-2063/events-paper3
|storemode=property
|title=Linking Dutch World War II Cultural Heritage Collections with Events Extracted by Machines and Crowds
|pdfUrl=https://ceur-ws.org/Vol-2063/events-paper3.pdf
|volume=Vol-2063
|authors=Jessie Both,Didi de Hooge,Ramses IJff,Oana Inel,Victor de Boer,Lora Aroyo
|dblpUrl=https://dblp.org/rec/conf/i-semantics/BothHIIBA17
}}
==Linking Dutch World War II Cultural Heritage Collections with Events Extracted by Machines and Crowds==
Jessie Both (University of Amsterdam, Amsterdam, The Netherlands), jessieboth@gmail.com
Didi de Hooge (University of Amsterdam, Amsterdam, The Netherlands), dididehooge@gmail.com
Ramses IJff (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), ramses.ijff@gmail.com
Oana Inel (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), oana.inel@vu.nl
Victor de Boer (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), v.de.boer@vu.nl
Lora Aroyo (Vrije Universiteit Amsterdam, Amsterdam, The Netherlands), lora.aroyo@vu.nl
SEMANTiCS 2017 workshop proceedings: EVENTS, September 11-14, 2017, Amsterdam, Netherlands. © 2017 Copyright held by the author/owner(s).

1 INTRODUCTION

The digital revolution is changing the way people access and use knowledge and information. By digitizing collections, cultural heritage institutions aim to fulfill their new role as information providers. However, the majority of the digitized collections lack sufficient metadata and provide no context. This makes the collections hard to access and to search through. Prior research [3] shows that enriching the metadata of digitized collection items with events helps to provide context, structure collections, and describe relationships between collection items.

In general, collections can be significantly enriched through the addition of events. Events can place collection items into context and create structure by linking related items. This improves the accessibility of the items and helps users to understand their context. Both personal events and historical events are relevant for this cause. Personal events can provide an understanding of the day-to-day lives of people involved in historical events, which can further enable large-scale data research. However, detecting and extracting such events is a difficult task, as events are very ambiguous: they lack a conclusive definition, and their status as events frequently depends on personal interpretation.

In this paper, we create a pipeline that detects, extracts and represents personal and historical events to improve the accessibility of cultural heritage items. We show the added value of enriching the metadata of media objects through events and links between them in the context of DIVE+ [1], a Linked Data event-centric collections browser. This research is done on behalf of the Institute for War, Holocaust and Genocide Studies (NIOD, http://www.niod.nl).

2 DATASET

For the current experiment we selected Wikipedia pages related to World War II (WWII), i.e., "Lijsten over de Tweede Wereldoorlog" and "Tweede Wereldoorlog in Nederland". We focused on three themes: deportation, arrest and resistance. This resulted in a corpus of 2,556 pages.

Person vocabulary: For this research we also used a person vocabulary in order to understand which are the most relevant people in our dataset. The person vocabulary consists of data from Online Begraafplaatsen, Erelijst, and the Oranjehotel Kartotheek, filtered based on our three research themes. Our person vocabulary consists of 35,739 names.

Personal and Historical Events: The NIOD thesaurus is used as a source of named historical events in our dataset. From the metadata of the aforementioned person vocabulary we identified four personal events: birth, death, arrival at the Oranjehotel and departure from the Oranjehotel.

3 EXPERIMENTAL METHODOLOGY

This section describes our experimental methodology.

3.1 Automatic Entity Extraction

The first step in the pipeline is to detect and extract entities through machines. The input data was processed by two pipelines. The first pipeline uses three free NER tools: DBpediaSpotlight (http://www.dbpedia-spotlight.org), Targeted Hypernym Discovery (THD, http://entityclassifier.eu/thd/) and SemiTags (http://nlp.vse.cz/SemiTags/). DBpediaSpotlight extracts DBpedia mentions in texts. The THD extractor uses lexico-syntactic patterns to find hypernyms in the text. The SemiTags extractor is an online tool that recognizes named entities and their meaning in a particular context. As this pipeline is specialized in extracting named entities, it is likely to provide historical named events. The second pipeline is the pipeline of BiographyNet [2], a project that uses NLP and Semantic Web Technology to support digital humanities scholars with historical research.
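As an illustration of how such an NER step can be invoked programmatically, the sketch below queries the public DBpedia Spotlight REST endpoint. This is our illustration, not the authors' actual pipeline code: the endpoint URL, the Dutch-language service, and the confidence threshold are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia Spotlight endpoint for Dutch text (assumed; not given in the paper).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/nl/annotate"

def annotate(text: str, confidence: float = 0.5) -> dict:
    """Send one sentence to DBpedia Spotlight and return the parsed JSON response."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    req = urllib.request.Request(
        f"{SPOTLIGHT_URL}?{params}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_entities(response: dict) -> list:
    """Reduce a Spotlight response to (surface form, DBpedia URI) pairs."""
    return [
        (res["@surfaceForm"], res["@URI"])
        for res in response.get("Resources", [])
    ]

if __name__ == "__main__":
    # Hypothetical example sentence; requires network access to run.
    reply = annotate("De Februaristaking vond plaats in Amsterdam in 1941.")
    for surface, uri in extract_entities(reply):
        print(f"{surface} -> {uri}")
```

The parsing step is separated from the HTTP call so the response handling can be reused for batch processing of a page corpus.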
The pipeline is a supervised machine learning system trained to extract biographical metadata such as date and place of birth and death, education, occupation, religion, and parents. This pipeline is more likely to provide personal events.

3.2 Crowdsourcing Experiments

To improve and enrich the results from the NLP tools described in Section 3.1, we created two crowdsourcing tasks on CrowdFlower. The aim of the first task is to validate and correct entities extracted by the two NLP pipelines and to annotate entities that they missed. As a proof of concept, for this task we selected 100 sentences from the Wikipedia pages. The aim of the second task is to create links between the events and the people, places and time periods identified by the two NLP pipelines and the crowd. For each task we gathered judgments from 15 workers. The task showed one row (sentence) per page and workers were paid 2 cents per row. The task was only distributed in the Netherlands and Belgium because of language constraints.

3.3 Modeling Events with the Simple Event Model

To model the events, actors, places and time periods extracted by the machines and the crowd we use the Simple Event Model (SEM). All data in the dataset is converted to RDF using the SEM classes and relationships. We create a first graph from the output of the second crowdsourcing experiment, which is converted to RDF. We create a second graph for the person vocabulary. The WWII thesaurus was already available in RDF, in SKOS. This format is also suitable for SEM, and therefore does not have to be converted. Only concepts which are classified as 'gebeurtenis' (event) in the thesaurus are used, plus one manually added named event. To enhance the model, we conduct some alignment procedures: for the core class place, we automatically generated "sameAs" relationships between identical places in the different graphs. For the core class actor, this cannot be done automatically, as an identical name does not necessarily refer to the same person. Therefore, these "sameAs" relationships are created manually.

4 RESULTS

This section describes the results for each step of the experimental methodology.

Automatic Entity Extraction: Table 1 shows the results of the two pipelines. The THD extractor extracted 145,617 unique entities, mostly locations and months, out of which 6,675 entities are events. However, when analyzing the results from THD there is a lot of noise, e.g., "handen namen" is classified as the Belgian city "Namen". DBpediaSpotlight extracted 115,984 unique entities, with a total number of 2,518 entity types (34,785 entities were extracted without type). While performing well on recognizing locations and persons, DBpediaSpotlight also extracted 3,434 events. The SemiTags extractor extracted 99,170 unique entities, but no entity was classified as an event. With the BiographyNet pipeline we extracted 260,244 unique entities. The BiographyNet pipeline is especially good at recognizing events, names and precise dates.

Crowdsourcing Results: The 100 sentences run through the crowdsourcing pipeline contained 431 automatically extracted entities: 132 events, 74 dates, 128 locations and 97 actors. In the first crowdsourcing task the workers annotated 167 extra entities and removed 34 entities. This resulted in a total of 564 entities, consisting of 102 dates, 228 events, 125 places and 61 actors. For the second crowdsourcing task we used 71 sentences to connect events and places, 72 sentences to connect events and actors and 77 sentences to connect events and time periods. From these sentences, a total of 94 events were linked to 58 places, 61 actors and 64 time periods.

Event Modeling: After we converted all the data into SEM classes we obtained a knowledge graph with 209,183 triples. After generating "owl:sameAs" relations between names and locations, the knowledge graph extended to 209,245 triples. Looking at the four core classes in SEM, our enrichment contains 18,316 personal and historical event triples, 9,300 actor triples, 875 place triples and 60 time period triples, as shown in Table 2.

5 ILLUSTRATIVE EXAMPLE

To illustrate the effectiveness of SEM for our dataset, this section describes an illustrative example through the interface of the DIVE+ demonstrator. When searching for the historical event "februaristaking" in DIVE+, all related events, objects, locations and actors are shown (see Figure 1). When looking at one of the objects we see all entities related to that specific object. For example, we can further explore the entity Communistische Partij Nederland (CPN), Figure 2, which is an important concept for the February Strike. As objects are linked to each other, we make the objects more easily accessible and create meaningful relations between them. With the use of the event model we tell stories around and about the object, and scholars can get a more complete overview of the event.

Figure 1: Media Objects, Events, Actors and Places related to 'Februaristaking'

For the personal event, we focus on Eduard Popko van Groningen. Without the model, information about him is fragmented over the collections of Erelijst and Online Begraafplaatsen. All this information is now available in DIVE+ through one query. Figure 3 shows that the query leads to an actor, which is related to five events. The first four are the events of birth and death as described
Table 1: Overview of Entities Extracted by each Tool

Tool              #Pages Processed  #Entities Extracted  % of Unique Entities  Avg. #Entities per Page  #Entity Types  #Entities without Type
THD               2,665             145,617              73.50%                64.64                    2,517          0
DBpediaSpotlight  2,665             115,984              37.69%                43.52                    193            34,785
SemiTags          2,665             99,170               31.37%                37.21                    4              0
BiographyNet      2,308             349,321              32.62%                151.35                   11             0

Table 2: Triples Overview per SEM Core Class

SEM core class  Triples
Event           18,317
Actor            9,300
Place              875
Time                60
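The SEM-based event modeling described in Section 3.3 can be sketched in a few lines of code. The sketch below is our illustration, not the authors' conversion pipeline: the example URIs are hypothetical and the hand-rolled N-Triples serialization is an assumption made to keep it self-contained, but the class and property names (sem:Event, sem:Actor, sem:Place, sem:hasActor, sem:hasPlace, sem:hasTimeStamp) come from the SEM vocabulary itself.

```python
# Minimal N-Triples emitter for SEM events (illustrative sketch, stdlib only).
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def sem_event_triples(event_uri, actor_uri, place_uri, timestamp):
    """Return N-Triples lines typing one event and linking its actor, place and time."""
    triples = [
        (event_uri, RDF_TYPE, f"{SEM}Event"),
        (actor_uri, RDF_TYPE, f"{SEM}Actor"),
        (place_uri, RDF_TYPE, f"{SEM}Place"),
        (event_uri, f"{SEM}hasActor", actor_uri),
        (event_uri, f"{SEM}hasPlace", place_uri),
    ]
    lines = [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]
    # The timestamp is a typed literal rather than a URI.
    lines.append(
        f'<{event_uri}> <{SEM}hasTimeStamp> '
        f'"{timestamp}"^^<http://www.w3.org/2001/XMLSchema#date> .'
    )
    return lines

if __name__ == "__main__":
    # Hypothetical URIs for a birth event of a person from the vocabulary.
    for line in sem_event_triples(
        "http://example.org/event/birth-1234",
        "http://example.org/actor/person-1234",
        "http://example.org/place/amsterdam",
        "1920-05-01",
    ):
        print(line)
```

Emitting plain N-Triples keeps the sketch dependency-free; the resulting file can be loaded into any triple store alongside the SKOS thesaurus graph.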
in their original source. The fifth event is an event extracted from a Wikipedia page. The related entities are, in turn, also related to other entities: in the case of the extracted event, to the actor, Eduard, and a location, Sachsenhausen. New opportunities for digital humanities scholars appear through the entity 'Sachsenhausen'. The location Sachsenhausen is linked to 21 events which are all connected with other actors. This means that potential fellow prisoners of Eduard are known. Another storyline could be that the actors of the other 21 events were also shot on the 3rd of May 1942, which would indicate that a bigger event took place.

Figure 2: Exploration of related concept 'CPN'

Figure 3: Events related to 'Eduard Popko van Groningen'

6 CONCLUSION AND FUTURE WORK

Digitization projects made various cultural heritage collections publicly available online. However, due to unstructured metadata and the distributed allocation of the objects in different databases, purely digitizing objects does not make them more accessible. In this research we used a hybrid machine-crowd approach to extract personal and historical events and use them for the enrichment of the Dutch digital WWII objects metadata within the NOB portal. We used automatic event extraction, making use of different information extraction tools, and manual event curation through the use of a crowdsourcing strategy. We applied our methodology to a set of Wikipedia pages related to WWII. We illustrated the historical and personal events in the DIVE+ demonstrator. With our created historical and personal event model we are able to enrich the metadata of the objects in the NOB portal. In this way the available digital resources of the NOB portal can be linked and enhanced with storylines told about and around the objects.

Future work can improve the pipeline. For example, the entity extraction could be improved or tested with other NLP tools better suited for Dutch texts. Another way to improve entity extraction is by tweaking the crowdsourcing tasks. Referring words, such as 'he' and 'that', were not extracted because the context around the sentence is missing; as a consequence, a lot of verbs could not be linked with an actor. We recommend a pilot that extracts entities from paragraphs to improve results. Lastly, we suggest enriching SEM by adding subtypes to the core classes actor, event, place and time. For example, nationalities can be added to names of people for a richer context.

REFERENCES

[1] Victor De Boer et al. 2015. DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015), 152–158.
[2] Antske Fokkens, Serge Ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and Guus Schreiber. 2014. BiographyNet: Methodological Issues when NLP supports historical research. In LREC.
[3] Walter Renteria-Agualimpia, Francisco J. Lopez-Pellicer, Javier Lacasta, F. Javier Zarazaga-Soria, and Pedro R. Muro-Medrano. 2016. Improving the geospatial consistency of digital libraries metadata. Journal of Information Science 42, 4 (2016), 507–523.