<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extracting and Visualising Biographical Events from Wikipedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Russo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Monachini</string-name>
        </contrib>
        <aff>ILC-CNR “A. Zampolli”, Pisa, Computational Lexicology; <email>t.caselli@vu.nl</email>, <email>monica.monachini@ilc.cnr.it</email></aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>55</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>This work presents a proposal for the development of a natural language processing module for event and temporal analysis of biographies as available in Wikipedia. At the current level of development, we restricted the extraction to temporally anchored events as they represent salient information which can be further used to extract additional events and facilitate their chronological ordering and the representation of a person's timeline. Visualising data about basic facts concerning groups of people helps with historical reasoning and enables comparisons among them.</p>
      </abstract>
      <kwd-group>
        <kwd>mining biographies for structured information</kwd>
        <kwd>visualising biographical data</kwd>
        <kwd>temporal information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Historical reasoning concerns facts and stories of the past, used to describe, compare and explain historical phenomena. Six activities can be listed as part of historical reasoning: a.) asking historical questions; b.) the use of primary and secondary sources; c.) contextualisation; d.) argumentation; e.) the use of substantive concepts; and f.) the use of metaconcepts
        <xref ref-type="bibr" rid="ref3">(van Drie and van Boxtel, 2008)</xref>
        .
      </p>
      <p>
        The way information is presented has an impact on
historical reasoning: primary and secondary sources are today
widely digitized and computational processing of textual
and multimodal information (Novak et al., 2014) can meet
the needs of users from the humanities, both in research and
education. As a matter of fact, in many disciplines,
visualisation constitutes a graphical or cognitive aid to thinking
because it is based on interrelated textual content,
spatiotemporal data and metadata related to images and videos.
Data can be mainly visualised for presentation or
exploration but in well designed projects there is a continuum
between these two modalities
        <xref ref-type="bibr" rid="ref2">(Cairo, 2012)</xref>
        . A key aspect in Digital Humanities (DH) is to provide processed results in a way that is usable and easy to browse: data visualisation is today a device which enables the exploration, filtering and searching of data, bypassing direct interaction with databases.
      </p>
      <p>Narratives can be integrated in visual forms of presentation: the Spatial History Project at Stanford University (http://web.stanford.edu/group/spatialhistory) deals with visualisations that involve the geographic dimensions of the Holocaust, portraying, for instance, mobility in the Budapest ghetto or the arrests of the Italian Jews. In e-history projects, textual content is processed by Natural Language Processing (NLP) modules which take care of tasks such as Named Entity Recognition and Disambiguation (NERD). These modules identify different types of entities (e.g. Person, Organization, Location) and can link them to external knowledge repositories (e.g. DBpedia) by means of URIs, thus enriching the extracted information. The formulation of hypotheses about relations between entities (e.g. people, places and events) is supported once entities and the relations between them have been mined from texts. Along this line, data can be explored in novel ways through basic inference rules that find commonalities between two or more entities.</p>
      <p>Our work focuses on comparisons and questions that may
arise through visualisations of data automatically extracted
from corpora. We designed a simple visualisation tool
because visualising data about basic facts concerning groups
of people may help with historical reasoning and enables
comparisons.</p>
      <p>In our framework a fact is a notion where history and data
visualisation meet: if for historians a fact is an assertion
about people and events that can be located in the past (i.e. a
main predicate temporally and spatially grounded as in 1a),
for data visualisation, based on data structures like JSON,
the same fact, like the date of death, is an instantiated value
for a key, as in 1b:</p>
      <sec id="sec-1-1">
        <title>1a. Primo Levi died on 11 April 1987.</title>
        <p>1b. "nodes": [ {"name": "Primo Levi", "deathDate": "1987-04-11"} ]</p>
        <p>As a case study, we selected a coherent group of biographies concerning people who were deported to concentration camps under Nazism, including both those who died because of the deportation and the survivors. Data cleaning and tagging are reported in Section 2.1, while in Section 2.2 we describe how biographical information has been structured before automatic extraction.</p>
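<p>The mapping from the fact in 1a to the structure in 1b can be sketched in Python; the node layout shown is a minimal illustration of the JSON consumed by the visualisation, not the project's actual schema:</p>

```python
import json

# A temporally grounded fact about a person, keyed as in Table 1.
# The exact node schema here is illustrative, not the project's actual code.
node = {"name": "Primo Levi", "deathDate": "1987-04-11"}

# The visualisation consumes a JSON object with a "nodes" array.
data = {"nodes": [node]}
print(json.dumps(data))
```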
        <p>Choosing an operational notion of biographical event as an
activity that involve directly the person in question and can
be anchored on a timeline whenever a date is available, we
set up a basic biographical data array for each subject to
encode factual data presented in the visualisation (see Section
3). We end with conclusions and proposals for future work
in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and Tools</title>
      <sec id="sec-2-1">
        <title>2.1. Holocaust deportees dataset</title>
        <p>As a case study, 782 Wikipedia pages corresponding to the biographies of people deported to Nazi concentration camps were downloaded. The set includes 247 short biographies of people who survived. All these people have a key event in common, namely that they were deported to Nazi concentration camps, and several other events along which their lives differ and can be compared.</p>
        <p>The biographies are part of a Wikipedia category, namely people who died in Nazi concentration camps (http://en.wikipedia.org/wiki/Category:People_who_died_in_Nazi_concentration_camps_by_occupation), and of a list of Holocaust survivors (http://en.wikipedia.org/wiki/List_of_Holocaust_survivors).</p>
        <p>
The Wikipedia pages have been downloaded and saved as plain text files by removing all HTML tags. The downloaded data have been processed by two different NLP systems: the NewsReader pipeline
          <xref ref-type="bibr" rid="ref1">(Agerri et al., 2014)</xref>
          and the
Stanford CoreNLP (Manning et al., 2014).
        </p>
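<p>The HTML-stripping step can be approximated with Python's standard library; this generic parser is a sketch, not the cleaning code actually used:</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html: str) -> str:
    """Return the plain text of an HTML fragment (illustrative helper)."""
    p = TextExtractor()
    p.feed(html)
    return "".join(p.chunks)

plain = strip_html("<p>Primo Levi was an <b>Italian</b> chemist.</p>")
```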
        <p>
          The NewsReader pipeline is a set of 15 NLP modules that
generates a structure in the NLP Annotation Format (NAF)
          <xref ref-type="bibr" rid="ref4">(Fokkens et al., 2014)</xref>
          . The pipeline has been developed as part of the EU project NewsReader (http://www.newsreaderproject.eu). Apart from the basic processing modules, such as tokenization, part-of-speech tagging and lemmatization, the additional modules relevant for our project include: coreference resolution (COREF layer); named entity recognition and disambiguation (NERC and NERD layers); and semantic role labelling (SRL layer).
        </p>
        <p>Stanford CoreNLP is a Java natural language analysis
library that includes a part-of-speech (POS) tagger, a named
entity recognizer (NER), a dependency parser and a
coreference resolution system.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Structuring biographical events</title>
        <p>Events extracted from DBpedia are a subpart of
biographical events found in Wikipedia entries and other resources.
The aim of this section is to show how the integration
of information can be achieved by means of structured
information derived from the output of existing NLP
modules.</p>
        <p>From the NewsReader pipeline output we extracted the disambiguated DBpedia URIs from the NERD layer. Each DBpedia URI is used to query the corresponding DBpedia HTML page to collect the basic information which is summarized in Table 1.</p>
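<p>Collecting these values can be sketched as follows; the record below is a hand-made stand-in for a DBpedia response, and the property URIs are simplified for illustration:</p>

```python
# Minimal sketch of pulling selected keys out of a DBpedia-style record.
# The record used here is an invented stand-in for a real DBpedia response.
DBO = "http://dbpedia.org/ontology/"

def extract_basics(record, props=("birthDate", "deathDate", "birthPlace")):
    """Return a flat {key: value} dict for the requested ontology properties."""
    out = {}
    for p in props:
        values = record.get(DBO + p, [])
        if values:
            out[p] = values[0]["value"]
    return out

record = {
    DBO + "birthDate": [{"value": "1882-07-08"}],
    DBO + "deathDate": [{"value": "1945-03-20"}],
}
basics = extract_basics(record)
```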
        <p>Grounding each page on DBpedia URIs, we decided to combine the two sources, Wikipedia and DBpedia, to structure the array that will be the starting point for the visualisations, choosing the keys (see Table 1) that are basic for representing the lives of Holocaust deportees.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Keys of the biographical data array.</p></caption>
          <table>
            <thead>
              <tr><th>Key</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>name</td><td>name of the deportee</td></tr>
              <tr><td>nationality</td><td>nationality of the deportee (current nationality for survivors)</td></tr>
              <tr><td>birthDate</td><td>date of birth of the deportee</td></tr>
              <tr><td>birthPlace</td><td>place of birth of the deportee</td></tr>
              <tr><td>deathDate</td><td>date of death of the deportee</td></tr>
              <tr><td>deathPlace</td><td>place of death of the deportee</td></tr>
              <tr><td>deathInConcentrationCamp</td><td>YES/NO; reports whether the deportee died in a concentration camp</td></tr>
              <tr><td>deportationDate</td><td>date of deportation</td></tr>
              <tr><td>deportationCamp1</td><td>first camp of deportation</td></tr>
              <tr><td>deportationCamp2</td><td>second camp of deportation, if available</td></tr>
              <tr><td>locationCamp1</td><td>location of deportationCamp1</td></tr>
              <tr><td>locationCamp2</td><td>location of deportationCamp2</td></tr>
              <tr><td>deportationFromCity</td><td>the city where the deportee was living at deportationDate</td></tr>
              <tr><td>deportationAge</td><td>age of the deportee at deportationDate</td></tr>
              <tr><td>deathAge</td><td>age of the deportee at deathDate</td></tr>
              <tr><td>wikiOccupation</td><td>Wikipedia category</td></tr>
              <tr><td>wikiNationality</td><td>Wikipedia category</td></tr>
              <tr><td>wikiCamp</td><td>Wikipedia category</td></tr>
              <tr><td>gender</td><td>M/F</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In Table 2 we report the data extracted for Marceli Handelsman, a Polish historian who died in the Mittelbau-Dora concentration camp. Most of the values come from DBpedia (Lehmann et al., to appear). DBpedia obtains its data by processing the linguistic information contained in the first lines of Wikipedia entries, missing important information that is provided in the remaining text, such as, in this case, the birth place, the death place and the exact date of death, which we are able to include because we extracted them from the processed text. Moreover, DBpedia sets the date of death to the first day of the year when it finds just the year in the infobox of the Wikipedia page; we overwrite this value when the NLP tools extract the exact date of the event (i.e. deathDate in this case is 1945-03-20 and not 1945-01-01). deportationFromCity is an example of a potentially uncertain value: if the place where the arrest/deportation took place is not mentioned, we presume that it coincides with the place where the person was living at the time.</p>
        <p>The wikiOccupation, wikiNationality and wikiCamp values are imported, when available, from the Wikipedia taxonomy. These values allow deportees to be grouped according to three different modalities (their nationality, their occupation and the concentration camp to which they were deported). This information could look redundant with the deportationCamp and nationality values extracted from data processed by the NLP modules but, since it comes from the infobox of the Wikipedia page compiled by users, it has a higher degree of certainty and, in case of mismatch, should be considered the correct value.</p>
        <p>In Table 3, the data relative to Helen Berman, one of the survivors, make evident a different type of missing information. She was a child when she was deported, and her short biography on Wikipedia mainly refers to her artistic career as a painter, briefly mentioning the event of deportation without details about it. Some of the keys in the JSON structure will be without a value because the information is missing both in DBpedia and Wikipedia; this could be a problem for a set of persons in the dataset, and other sources of information should be found to integrate them. One possible source could be the Central Database of Shoah Victims' Names (http://www.yadvashem.org/), an international endeavor initiated and led by Yad Vashem; an estimated 4.3 million murdered Jews have been commemorated, and a database of Shoah survivors will be released soon.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Data extracted for Helen Berman (empty cells mark missing information).</p></caption>
          <table>
            <tbody>
              <tr><td>name</td><td>Helen Berman</td></tr>
              <tr><td>nationality</td><td>Dutch-Israeli</td></tr>
              <tr><td>birthDate</td><td>1936-04-06</td></tr>
              <tr><td>birthPlace</td><td>Amsterdam, Netherlands</td></tr>
              <tr><td>deathDate</td><td>living</td></tr>
              <tr><td>deathPlace</td><td>living</td></tr>
              <tr><td>deathInConcentrationCamp</td><td>no</td></tr>
              <tr><td>deportationDate</td><td/></tr>
              <tr><td>deportationCamp</td><td/></tr>
              <tr><td>deportationFromCity</td><td/></tr>
              <tr><td>lengthJourney</td><td/></tr>
              <tr><td>deportationAge</td><td/></tr>
              <tr><td>deathAge</td><td>living</td></tr>
              <tr><td>wikiOccupation</td><td>Painter</td></tr>
              <tr><td>wikiNationality</td><td/></tr>
              <tr><td>wikiCamp</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Some basic biographical events, such as birthDate and deportationDate among others, are by their nature anchored to time, i.e. they make sense only when a temporal expression fills their values. We consider event anchoring the first step in discovering commonalities between the biographies of different people, as it provides a way of comparing events with respect to a timeline. Time-anchored events have been extracted through a rule-based module built on top of the output of the Stanford CoreNLP. The rules take as input all dependency relations between a temporal expression, marked as DATE in the NER analysis of the Stanford CoreNLP, and a verb in the same sentence. We assume that the dependency relation between the temporal expression and the verb expresses a temporal relation of inclusion, i.e. it anchors the event on a timeline. More complex temporal relations, such as before, after, begins, ends, overlap or simultaneous, can also be expressed. In particular, when a temporal expression is introduced by a preposition, a further set of rules tailored to the temporal meaning (or relation) associated with the preposition has been developed. This set of rules is based on the manually annotated data of the English TimeBank (http://www.timeml.org/site/timebank/timebank.html) (Pustejovsky et al., 2003). After the time anchoring of the verb has been established, we also extracted the dependency relations concerning the subject, the direct object and, if available, locations. Subjects and objects are then mapped to the NewsReader output to resolve entity disambiguation (NERD) and pronominal coreference.</p>
        <p>Table 4 reports the output structure of the post-processing rules for event-anchoring extraction. The data obtained from the Stanford parser and their integration with the NewsReader output facilitate the comparison of biographies. Examples 2a and 2b show how, by looking for the same event lemma (e.g. “immigrate”) in the extracted data, we can easily aggregate people by the type of event and still find out differences (in this case, the fact that there are two emigration instances, one in 1978 in Israel and one in 1947 in the United States).</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Output structure of the post-processing rules for event anchoring.</p></caption>
          <table>
            <thead>
              <tr><th>PREP</th><th>EVENT</th><th>NSUBJ</th><th>DOBJ</th><th>LOCATION</th></tr>
            </thead>
            <tbody>
              <tr><td>in</td><td>write</td><td>Gokkes</td><td>composition</td><td/></tr>
              <tr><td>in</td><td>found</td><td>he</td><td>choir</td><td>Amsterdam</td></tr>
              <tr><td>in</td><td>marry</td><td>Gokkes</td><td>Winnik</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Anchoring in time is often paired with anchoring in space, because information about where the event took place is necessary for completeness; for this reason we integrated the data structure of the event lemma with information about the closest syntactic constituent that is a LOCATION according to the NER module of the Stanford parser (Israel and the United States, in the cases above).</p>
      </sec>
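<p>The anchoring rule described above can be sketched in Python; the token-tag and dependency-triple structures below are simplified stand-ins for the Stanford CoreNLP output, and only the default inclusion rule is shown (prepositions would add further rules):</p>

```python
# Each parsed sentence yields dependency triples (head, relation, dependent)
# plus NER/POS labels; this toy structure stands in for CoreNLP output.
def anchor_events(tokens, deps):
    """Pair each verb with a DATE expression it is directly related to."""
    anchored = []
    for head, rel, dep in deps:
        if tokens.get(head) == "VERB" and tokens.get(dep) == "DATE":
            # Default rule: a direct verb-DATE dependency is read as
            # temporal inclusion, i.e. the event is anchored on that date.
            anchored.append({"event": head, "date": dep, "relation": "includes"})
    return anchored

tokens = {"immigrated": "VERB", "1947": "DATE", "he": "PRON"}
deps = [("immigrated", "nsubj", "he"), ("immigrated", "obl", "1947")]
anchored = anchor_events(tokens, deps)
```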
    </sec>
    <sec id="sec-3">
      <title>3. Visualisation of data</title>
      <p>The implementation of the visualisation is based on D3.js,
a JavaScript library designed to display digital data in a
dynamic graphical form. We propose two interrelated
visualisation modalities:</p>
      <p>A force-directed graph (Figure 1), where each person is a node connected to other nodes when they share the same value. It allows the visualisation of clusters of persons based on the different values in Table 1, highlighting data that have been extracted from Wikipedia, DBpedia and the biographies parsed with the Stanford CoreNLP and the NewsReader pipeline. In this way, the name of a person becomes visible when moving the pointer over a node, and the source, i.e. the corresponding Wikipedia page in this work, opens in a separate window when clicking on the node.</p>
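<p>Preparing the node-link JSON that a D3.js force-directed graph consumes can be sketched in Python; the shared-value linking criterion follows the description above, while the field names are illustrative assumptions:</p>

```python
from itertools import combinations

def build_graph(people, key):
    """Link two persons whenever they share the same value for `key`."""
    nodes = [{"name": p["name"]} for p in people]
    links = [
        {"source": a["name"], "target": b["name"], "shared": key}
        for a, b in combinations(people, 2)
        if a.get(key) and a.get(key) == b.get(key)
    ]
    return {"nodes": nodes, "links": links}

# Invented sample records, keyed as in Table 1.
people = [
    {"name": "A", "wikiCamp": "Auschwitz"},
    {"name": "B", "wikiCamp": "Auschwitz"},
    {"name": "C", "wikiCamp": "Dachau"},
]
graph = build_graph(people, "wikiCamp")
```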
      <p>This type of visualisation will allow the user to directly
explore the source of information from which the data
have been extracted and thus verify if the proposed
clusters are relevant or if they are due to errors in data
extraction.</p>
      <p>A horizontal timeline, as illustrated in Figure 2. This visualisation also makes it possible to represent relevant dates concerning larger events, e.g. World War II, which have crossed the lives of the people in our dataset. The timeline visualisation reports, for each deportee, a set of dates and sentences extracted from Wikipedia describing a biographical event.</p>
    </sec>
    <sec id="sec-3b">
      <title>4. Conclusions and Future Work</title>
      <p>We have proposed a method and a preliminary development of an NLP module for extracting biographical events from biographical notes such as the ones available in Wikipedia. Focusing on temporally anchored events makes it possible to extract salient events which can be further used to identify other biographical events and to facilitate their chronological ordering and the representation of a person's storyline.</p>
      <p>One of the limitations of our visualisation as a tool for data
presentation is the lack of potential interactions with
historians: spatial and temporal data extracted from texts can be
ambiguous or uncertain and some events can be wrongly
extracted. Users should be able to validate the information
discovered, labelling the results as true, worth further
investigation or useless because noisy. We plan to make the
visualisations interactive, with the possibility to delete or
annotate each piece of information.</p>
      <p>As future work, we aim at labelling the biographical events as positive or negative, integrating a Sentiment Analysis component in our module by means of a psychologically grounded dataset (Lewinsohn and Amenson, 1978). So far this task has been conducted manually, by associating the predicate and semantic role information of the extracted events with the entries in the psychological dataset. The manual labelling aims at developing a reliable training set for the development of a learning algorithm.</p>
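<p>The manual association step can be sketched as a lexicon lookup; the polarity lexicon below is a tiny invented stand-in for the psychologically grounded dataset, not its actual content:</p>

```python
# Toy polarity lexicon standing in for the Lewinsohn and Amenson (1978)
# pleasant/unpleasant events dataset; entries here are invented examples.
POLARITY = {"marry": "positive", "graduate": "positive",
            "deport": "negative", "die": "negative"}

def label_event(event):
    """Attach a polarity label to an extracted event, if its lemma is known."""
    event = dict(event)  # copy so the input record is left untouched
    event["polarity"] = POLARITY.get(event["lemma"], "unknown")
    return event

labelled = label_event({"lemma": "deport", "date": "1944-03-01"})
```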
    </sec>
    <sec id="sec-4">
      <title>5. Acknowledgements</title>
      <p>One of the authors wishes to thank the NWO Spinoza Prize project Understanding Language by Machines (sub-track 3) for partially supporting this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Agerri</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Aldabe</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Beloki</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Laparra</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>De Lacalle</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Rigau</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Soroa</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>van Erp</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Vossen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Girardi</surname>, <given-names>C.</given-names></string-name>
          and
          <string-name><surname>Tonelli</surname>, <given-names>S.</given-names></string-name>
          (<year>2014</year>),
          <source>Event Detection, version 2</source>. D4.2.2, NewsReader Project Deliverable.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cairo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ),
          <article-title>The Functional Art: An introduction to information graphics and visualization</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>van Drie</surname>, <given-names>J.</given-names></string-name>
          and
          <string-name><surname>van Boxtel</surname>, <given-names>C.</given-names></string-name>
          (
          <year>2008</year>
          ),
          <article-title>Historical Reasoning: Towards a Framework for Analyzing Students' Reasoning about the Past</article-title>
          .
          <source>Educational Psychology Review</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ),
          <fpage>87</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fokkens</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Beloki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ockeloen</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Rigau</surname></string-name>
          ,
          <string-name><given-names>W. R.</given-names> <surname>van Hage</surname></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Vossen</surname></string-name>
          (
          <year>2014</year>
          ),
          <article-title>NAF and GAF: linking linguistic annotations</article-title>
          .
          <source>In: Proceedings 10th</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>