<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linking Dutch World War II Cultural Heritage Collections with Events Extracted by Machines and Crowds</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jessie Both</string-name>
          <email>jessieboth@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oana Inel</string-name>
          <email>oana.inel@vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Didi de Hooge</string-name>
          <email>dididehooge@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victor de Boer</string-name>
          <email>v.de.boer@vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramses IJf</string-name>
          <email>ramses.ijf@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>lora.aroyo@vu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The digital revolution is changing the way people access and use
knowledge and information. By digitizing collections, cultural
heritage institutions aim to fulfill their new role as information providers.
However, the majority of digitized collections lack sufficient
metadata and provide no context. This makes the collections hard to
access and search through. Prior research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] shows that enriching
the metadata of digitized collection items with events helps
provide context, structure collections, and describe relationships
between collection items.
      </p>
      <p>In general, collections can be significantly enriched through the
addition of events. Events can place collection items into context
and create structure by linking related items. This improves the
accessibility of the items and helps users to understand their context.
Both personal events and historical events are relevant for this
purpose. Personal events can provide an understanding of the
day-to-day lives of people involved in historical events, which can further
enable large-scale data research. However, detecting and extracting
such events is a difficult task, as events are highly ambiguous: they
lack a conclusive definition, and whether something counts as an event
frequently depends on personal interpretation.</p>
      <p>
        In this paper, we create a pipeline that detects, extracts and
represents personal and historical events to improve the accessibility of
cultural heritage items. We show the added value of enriching the
metadata of media objects through events and links between them
in the context of DIVE+ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a Linked Data event-centric collections
browser. This research is done on behalf of the Institute for War,
Holocaust and Genocide Studies1 (NIOD).
      </p>
    </sec>
    <sec id="sec-2">
      <title>DATASET</title>
      <p>For the current experiment we selected Wikipedia pages related to
World War II (WWII), i.e., "Lijsten over de Tweede Wereldoorlog"
("Lists about the Second World War") and "Tweede Wereldoorlog in
Nederland" ("Second World War in the Netherlands"). We focused on three
themes: deportation, arrest and resistance. This resulted in a corpus
of 2,556 pages.</p>
      <p>Person vocabulary: For this research we also used a person
vocabulary in order to identify the most relevant
people in our dataset. The person vocabulary consists of data from
Online Begraafplaatsen, Erelijst, and the Oranjehotel Kartotheek,
filtered based on our three research themes. Our person vocabulary
consists of 35,739 names.</p>
      <p>Personal and Historical Events: The NIOD thesaurus is used
as a source of named historical events in our dataset. From the
metadata of the person vocabulary aforementioned we identified
four personal events: birth, death, arrival at the Oranjehotel and
departure from the Oranjehotel.</p>
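      <p>The mapping from a person-vocabulary record to these four personal events can be sketched as follows. This is a minimal illustration only: the field names and date formats are assumptions, not the project's actual schema.</p>
      <preformat>
```python
# Sketch: derive the four personal events (birth, death, arrival at and
# departure from the Oranjehotel) from one person-vocabulary record.
# All field names ("birth_date", "oranjehotel_arrival", ...) are hypothetical.

def personal_events(record):
    """Return (event_type, date) pairs for the fields present in the record."""
    field_to_event = {
        "birth_date": "birth",
        "death_date": "death",
        "oranjehotel_arrival": "arrival at the Oranjehotel",
        "oranjehotel_departure": "departure from the Oranjehotel",
    }
    return [(event, record[field])
            for field, event in field_to_event.items()
            if record.get(field)]

example = {
    "name": "Jan Jansen",            # hypothetical person
    "birth_date": "1910-02-01",
    "oranjehotel_arrival": "1941-06-15",
}
events = personal_events(example)
```
      </preformat>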
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTAL METHODOLOGY</title>
      <p>This section describes our experimental methodology.</p>
    </sec>
    <sec id="sec-4">
      <title>Automatic Entity Extraction</title>
      <p>
        The first step in the pipeline is to detect and extract entities through
machines. The input data was processed by two pipelines. The first
pipeline uses three free NER tools: DBpediaSpotlight2, Targeted
Hypernym Discovery3 (THD) and SemiTags4. DBpediaSpotlight
extracts DBpedia mentions in texts. The THD extractor uses
lexico-syntactic patterns to find hypernyms in the text. The SemiTags
extractor is an online tool that recognizes named entities and their
meaning in a particular context. As this pipeline is specialized in
extracting named entities, it is likely to provide historical named
events. The second pipeline is that of
BiographyNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a project that uses NLP and Semantic Web Technology
to support digital humanities scholars with historical research. The
pipeline is a supervised machine learning system trained to extract
biographical metadata such as date and place of birth and death,
education, occupation, religion, and parents. This pipeline is more
likely to provide personal events.
      </p>
      <p>© 2017 Copyright held by the author/owner(s). SEMANTiCS 2017 workshop proceedings: EVENTS,
September 11-14, 2017, Amsterdam, The Netherlands.
1: http://www.niod.nl 2: http://www.dbpedia-spotlight.org 3: http://entityclassifier.eu/thd/ 4: http://nlp.vse.cz/SemiTags/</p>
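      <p>To give an impression of the first pipeline's output, the snippet below pulls typed entities out of a DBpedia Spotlight-style annotation response. The response dict is a hand-made stand-in imitating the service's JSON format ("Resources", "@URI", "@types"), not actual output from the service.</p>
      <preformat>
```python
# Sketch: filter entities of a given type from a DBpedia Spotlight-style
# JSON response. The values below are invented for illustration.

response = {
    "Resources": [
        {"@surfaceForm": "Amsterdam",
         "@URI": "http://nl.dbpedia.org/resource/Amsterdam",
         "@types": "DBpedia:Place,DBpedia:City"},
        {"@surfaceForm": "Februaristaking",
         "@URI": "http://nl.dbpedia.org/resource/Februaristaking",
         "@types": "DBpedia:Event"},
    ]
}

def entities_of_type(resp, wanted="DBpedia:Event"):
    """Keep only the resources whose comma-separated type list contains wanted."""
    return [r["@URI"] for r in resp.get("Resources", [])
            if wanted in r.get("@types", "").split(",")]

events = entities_of_type(response)
```
      </preformat>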
    </sec>
    <sec id="sec-5">
      <title>Crowdsourcing experiments</title>
      <p>To improve and enrich the results from the NLP tools described in
Section 3.1, we created two crowdsourcing tasks on CrowdFlower.
The aim of the first task is to validate and correct entities extracted
by the two NLP pipelines and annotate entities that were missed by
them. As a proof of concept, for this task we select 100 sentences
from the Wikipedia pages. The aim of the second task is to create
links between the events and the people, places and time periods
identified by the two NLP pipelines and the crowd. For each task
we gathered judgments from 15 workers. The task showed one row
(sentence) per page and workers were paid 2 cents per row. The
task was only distributed in the Netherlands and Belgium because
of language constraints.</p>
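      <p>The paper does not state how the 15 judgments per row are combined; a common baseline is majority voting per annotated entity, sketched here under that assumption (the threshold and the toy judgments are not the paper's actual aggregation rule):</p>
      <preformat>
```python
from collections import Counter

# Sketch: majority-vote aggregation of crowd judgments for one sentence.
# Each worker submits the set of entity spans they marked; a span is kept
# when more than half of the workers marked it.

def majority_entities(judgments, threshold=0.5):
    counts = Counter(span for worker in judgments for span in worker)
    n = len(judgments)
    return {span for span, c in counts.items() if c / n > threshold}

judgments = [
    {"Oranjehotel", "1941"},
    {"Oranjehotel"},
    {"Oranjehotel", "Scheveningen"},
]
kept = majority_entities(judgments)   # only "Oranjehotel" passes 50%
```
      </preformat>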
    </sec>
    <sec id="sec-6">
      <title>Modeling events with Simple Event Model</title>
      <p>To model the events, actors, places and time periods extracted by
the machines and the crowd we use the Simple Event Model (SEM).
All data in the dataset is converted to RDF using the
SEM classes and relationships. We create a first graph from the
output of the second crowdsourcing experiment, which is converted
to RDF. We create a second graph for the person vocabulary.</p>
      <p>The WWII thesaurus was already available in RDF, in SKOS. This
format is also suitable for SEM, and therefore does not have to be
converted. Only concepts which are classified as ‘gebeurtenis’ (event) in
the thesaurus are used, plus one manually added named event. To
enhance the model, we conduct some alignment procedures: for the
core class place, we automatically generated "sameAs" relationships
between identical places in the different graphs. For the core class
actor, this cannot be done automatically, as an identical name does
not necessarily refer to the same person. Therefore, these "sameAs"
relationships are created manually.</p>
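      <p>The place alignment step can be illustrated with plain triples, treating URIs as strings. This is a sketch only: the URIs and labels are invented, and the real pipeline produces RDF using the SEM vocabulary (namespace http://semanticweb.cs.vu.nl/2009/11/sem/).</p>
      <preformat>
```python
# Sketch: SEM-style triples as plain tuples, plus automatic owl:sameAs
# generation for places whose labels match exactly across two graphs.
# All URIs below are invented for illustration.

SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
OWL = "http://www.w3.org/2002/07/owl#"

# Graph 1: an event from the crowdsourcing output.
crowd_graph = [
    ("ex:event/arrest1", SEM + "hasActor", "ex:crowd/person/EduardPopko"),
    ("ex:event/arrest1", SEM + "hasPlace", "ex:crowd/place/Scheveningen"),
]
crowd_places = {"ex:crowd/place/Scheveningen": "Scheveningen"}

# Graph 2: places from the person vocabulary.
vocab_places = {"ex:vocab/place/Scheveningen": "Scheveningen",
                "ex:vocab/place/Amsterdam": "Amsterdam"}

def align_places(places_a, places_b):
    """Emit owl:sameAs triples for places whose labels are identical."""
    return [(a, OWL + "sameAs", b)
            for a, label_a in places_a.items()
            for b, label_b in places_b.items()
            if label_a == label_b]

same_as = align_places(crowd_places, vocab_places)
```
      </preformat>
      <p>Actor alignment deliberately has no such automatic rule, since two people can share a name; those links are added by hand, as described above.</p>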
    </sec>
    <sec id="sec-7">
      <title>RESULTS</title>
      <p>This section describes the results for each step of the experimental
methodology.</p>
      <p>Automatic Entity Extraction: Table 1 shows the results of the
two pipelines. The THD extractor extracted 145,617 unique entities,
mostly locations and months, out of which 6,675 entities are events.
However, the results from THD contain a lot of
noise; e.g., "handen namen" is classified as the Belgian city "Namen".
DBpediaSpotlight extracted 115,984 unique entities, with a total
number of 2,518 entity types (34,785 entities were extracted without
type). While performing well on recognizing locations and persons,
DBpediaSpotlight also extracted 3,434 events. The SemiTags
extractor extracted 99,170 unique entities, but no entity was classified as
event. With the BiographyNet pipeline we extracted
260,244 unique entities. The BiographyNet pipeline is especially
good at recognizing events, names and precise dates.</p>
      <p>Crowdsourcing Results: From the 100 sentences run through the
crowdsourcing pipeline, 431 entities were automatically extracted: 132
events, 74 dates, 128 locations and 97 actors. In the first
crowdsourcing task the workers annotated 167 extra entities and removed 34
entities. This resulted in a total of 564 entities, consisting of 102
dates, 228 events, 125 places and 61 actors. For the second
crowdsourcing task we used 71 sentences to connect events and places, 72
sentences to connect events and actors and 77 sentences to connect
events and time periods. From these sentences, a total of 94 events
were linked to 58 places, 61 actors and 64 time periods.</p>
      <p>Event Modeling: After we converted all the data into SEM classes,
we obtained a knowledge graph with 209,183 triples. After
generating "owl:sameAs" relations between names and locations, the
knowledge graph grew to 209,245 triples. Looking at the four
core classes in SEM, our enrichment contains 18,316 personal and
historical event triples, 9,300 actor triples, 875 place triples and 60
time period triples, as shown in Table 2.</p>
    </sec>
    <sec id="sec-8">
      <title>ILLUSTRATIVE EXAMPLE</title>
      <p>To illustrate the effectiveness of SEM for our dataset, this section
describes an illustrative example through the interface of the DIVE+
demonstrator. When searching for the historical event
"februaristaking" (February Strike) in DIVE+, all related events, objects, locations and actors are
shown (see Figure 1). When looking at one of the objects, we see all
entities related to that specific object. For example, we can further
explore the entity Communistische Partij Nederland (CPN), Figure
2, which is an important concept for the February Strike. As objects
are linked to each other, we make the objects more easily accessible and
create meaningful relations between them. With the use of the
event model we can tell stories around and about the objects, and
scholars can get a more complete overview of the event.</p>
      <p>For the personal event, we focus on Eduard Popko van
Groningen. Without the model, information about him is fragmented over
the collections of Erelijst and Online Begraafplaatsen. All this
information is now available in DIVE+ through one query. Figure
3 shows that the query leads to an actor, which is related to five
events. The first four are the events of birth and death as described
in their original source. The fifth event is an event extracted from
a Wikipedia page. The related entities are, in turn, also related to
other entities. In the case of the extracted event, to the actor,
Eduard, and a location, Sachsenhausen. New opportunities for digital
humanities scholars appear through the entity ‘Sachsenhausen’.
The location Sachsenhausen is linked to 21 events, which are all
connected with other actors. This means that potential fellow
prisoners of Eduard are known. Another storyline could be that the
actors of the other 21 events were also shot on 3 May 1942,
which would indicate that a bigger event took place.</p>
      <p>[Table 1: per-extractor statistics: percentage of unique entities (73.50%, 37.69%, 31.37%, 32.62%); average number of entities per page (64.64, 43.52, 37.21, 151.35); number of entity types (2,517, 193, 4, 11); number of entities without type (0, 34,785, 0, 0).]</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>Digitization projects have made various cultural heritage collections
publicly available online. However, due to unstructured metadata
and the distribution of the objects over different databases,
merely digitizing objects does not make them more accessible. In
this research we used a hybrid machine-crowd approach to extract
personal and historical events and use them to enrich
the metadata of Dutch digital WWII objects within the NOB portal.
We used automatic event extraction, making use of different
information extraction tools, and manual event curation through the
use of a crowdsourcing strategy. We applied our methodology to a set
of Wikipedia pages related to WWII. We illustrated the historical
and personal events in the DIVE+ demonstrator. With our
historical and personal event model we are able to enrich the
metadata of the objects in the NOB portal. In this way the available
digital resources of the NOB portal can be linked and improved
with storylines told about and around the objects.</p>
      <p>Future work can improve the pipeline. For example, the entity
extraction could be improved or tested with other NLP tools better
suited for Dutch texts. Another way to improve entity extraction
is by tweaking the crowdsourcing tasks. Referring words, such
as ‘he’ and ‘that’, were not extracted, as the context around the
sentence is missing. As a consequence, many verbs could not be
linked with an actor. We recommend a pilot in which entities are
extracted from paragraphs to improve results. Lastly, we suggest
enriching SEM by adding subtypes to the core classes actor,
event, place and time. For example, nationalities can be added to
names of people for richer context.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Victor</given-names>
            <surname>De Boer</surname>
          </string-name>
          et al.
          <year>2015</year>
          .
          <article-title>DIVE into the event-based browsing of linked historical media</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web 35</source>
          (
          <year>2015</year>
          ),
          <fpage>152</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Antske</given-names>
            <surname>Fokkens</surname>
          </string-name>
          , Serge Ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and
          <string-name>
            <given-names>Guus</given-names>
            <surname>Schreiber</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>BiographyNet: Methodological Issues when NLP supports historical research</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Walter</given-names>
            <surname>Renteria-Agualimpia</surname>
          </string-name>
          , Francisco J Lopez-Pellicer,
          <string-name>
            <given-names>Javier</given-names>
            <surname>Lacasta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F Javier</given-names>
            <surname>Zarazaga-Soria</surname>
          </string-name>
          , and
          <string-name>
            <surname>Pedro R Muro-Medrano</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Improving the geospatial consistency of digital libraries metadata</article-title>
          .
          <source>Journal of Information Science</source>
          <volume>42</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ),
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>