=Paper=
{{Paper
|id=Vol-1399/paper17
|storemode=property
|title=Extracting and Visualising Biographical Events from Wikipedia
|pdfUrl=https://ceur-ws.org/Vol-1399/paper17.pdf
|volume=Vol-1399
|dblpUrl=https://dblp.org/rec/conf/bd/RussoCM15
}}
==Extracting and Visualising Biographical Events from Wikipedia==
Irene Russo*, Tommaso Caselli**, Monica Monachini*
*ILC-CNR “A. Zampolli”, Pisa; **Computational Lexicology & Terminology Lab, Vrije Universiteit Amsterdam
irene.russo@ilc.cnr.it, t.caselli@vu.nl, monica.monachini@ilc.cnr.it
Abstract
This work presents a proposal for the development of a natural language processing module for event and temporal analysis of
biographies as available in Wikipedia. At the current level of development, we restricted the extraction to temporally anchored events as
they represent salient information which can be further used to extract additional events and facilitate their chronological ordering and
the representation of a person’s timeline. Visualising data about basic facts concerning groups of people helps with historical reasoning
and enables comparisons among them.
Keywords: mining biographies for structured information, visualising biographical data, temporal information
1. Introduction

Historical reasoning concerns facts and stories of the past to describe, compare and explain historical phenomena. Six activities can be listed as part of historical reasoning: a) historical questions; b) the use of primary and secondary sources; c) contextualisation; d) argumentation; e) the use of substantive concepts; and f) the use of meta-concepts (van Drie and van Boxtel, 2008).

The way information is presented has an impact on historical reasoning: primary and secondary sources are today widely digitized, and computational processing of textual and multimodal information (Novak et al., 2014) can meet the needs of users from the humanities, both in research and in education. As a matter of fact, in many disciplines visualisation constitutes a graphical or cognitive aid to thinking because it is based on interrelated textual content, spatio-temporal data, and metadata related to images and videos. Data can be visualised mainly for presentation or for exploration, but in well-designed projects there is a continuum between these two modalities (Cairo, 2012). A key aspect in Digital Humanities (DH) is to provide processed results in a way that is usable and easy to browse: data visualisation is today a device which enables the exploration, filtering and searching of data, skipping the interaction with databases.

Narratives can be integrated in visual forms of presentation: the Spatial History Project at Stanford University (http://web.stanford.edu/group/spatialhistory) deals with visualisations that involve the geographic dimensions of the Holocaust, portraying, for instance, mobility in the Budapest ghetto or the arrests of the Italian Jews.

In e-history projects, textual content is processed by Natural Language Processing (NLP) modules which take care of tasks such as Named Entity Recognition and Disambiguation (NERD). These modules identify different types of entities (e.g. Person, Organization, Location) and can link them to external knowledge repositories (e.g. DBpedia) by means of URIs, thus enriching the extracted information. The formulation of hypotheses about relations between entities (e.g. people, places and events) is supported once entities and the relations between them have been mined from texts. Along this line, data can be explored in novel ways through basic inference rules that find commonalities between two or more entities.

Our work focuses on comparisons and questions that may arise through visualisations of data automatically extracted from corpora. We designed a simple visualisation tool because visualising data about basic facts concerning groups of people may help with historical reasoning and enables comparisons.

In our framework, a fact is a notion where history and data visualisation meet: if for historians a fact is an assertion about people and events that can be located in the past (i.e. a main predicate temporally and spatially grounded, as in 1a), for data visualisation, based on data structures like JSON, the same fact, such as the date of death, is an instantiated value for a key, as in 1b:

1a. Primo Levi died on 11 April 1987.

1b. “nodes”: [
      {“name”: “Primo Levi”, “deathDate”: “1987-04-11”}
    ]

As a case study, we selected a coherent group of biographies concerning people who were deported to concentration camps under Nazism, including both those who died because of the deportation and the survivors. Data cleaning and tagging are reported in Section 2.1, while in Section 2.2 we describe the way biographical information has been structured before automatic extraction. Choosing an operational notion of biographical event as an activity that directly involves the person in question and can be anchored on a timeline whenever a date is available, we set up a basic biographical data array for each subject to encode the factual data presented in the visualisation (see Section 3). We end with conclusions and proposals for future work in Section 4.
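The key/value view of a fact in example 1b can be sketched in a few lines of Python (a minimal illustration of the data structure; the `fact_to_node` helper is ours, not part of the paper's tooling):

```python
import json

def fact_to_node(name, key, value):
    """Encode one temporally grounded assertion as a node for the
    visualisation: the fact becomes an instantiated value for a key."""
    return {"name": name, key: value}

# "Primo Levi died on 11 April 1987" as an entry of the "nodes" array (1b).
graph = {"nodes": [fact_to_node("Primo Levi", "deathDate", "1987-04-11")]}
print(json.dumps(graph["nodes"]))
# [{"name": "Primo Levi", "deathDate": "1987-04-11"}]
```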
2. Dataset and Tools

2.1. Holocaust deportees dataset

As a case study, 782 Wikipedia pages relative to the biographies of people deported to Nazi concentration camps were downloaded. The set includes 247 short biographies of people who survived. All these people have a key event in common, namely that they were deported to Nazi concentration camps, and several other events along which their lives differ and can be compared.

The biographies are part of a Wikipedia category, namely people who died in Nazi concentration camps (http://en.wikipedia.org/wiki/Category:People_who_died_in_Nazi_concentration_camps_by_occupation), and of a list of Holocaust survivors (http://en.wikipedia.org/wiki/List_of_Holocaust_survivors).

The Wikipedia pages have been downloaded and saved as plain text files by removing all HTML tags. The downloaded data have been processed by two different NLP systems: the NewsReader pipeline (Agerri et al., 2014) and the Stanford CoreNLP (Manning et al., 2014).

The NewsReader pipeline is a set of 15 NLP modules that generates a structure in the NLP Annotation Format (NAF) (Fokkens et al., 2014). The pipeline has been developed as part of the EU project NewsReader (http://www.newsreaderproject.eu). Apart from the basic processing modules, such as tokenization, part-of-speech tagging and lemmatization, the additional modules which are relevant for our project include:

• coreference resolution (COREF layer);

• named entity recognition and disambiguation (NERC and NERD layers);

• semantic role labelling (SRL layer).

Stanford CoreNLP is a Java natural language analysis library that includes a part-of-speech (POS) tagger, a named entity recognizer (NER), a dependency parser and a coreference resolution system.

name                      name of the deportee
nationality               nationality of the deportee (current nationality for survivors)
birthDate                 date of birth of the deportee
birthPlace                place of birth of the deportee
deathDate                 date of death of the deportee
deathPlace                place of death of the deportee
deathInConcentrationCamp  YES/NO; whether the deportee died in a concentration camp
deportationDate           date of deportation
deportationCamp1          first camp of deportation
deportationCamp2          second camp of deportation, if available
locationCamp1             location of deportationCamp1
locationCamp2             location of deportationCamp2
deportationFromCity       the city where the deportee was living at deportationDate
deportationAge            age of the deportee at deportationDate
deathAge                  age of the deportee at deathDate
wikiOccupation            Wikipedia category
wikiNationality           Wikipedia category
wikiCamp                  Wikipedia category
gender                    M/F

Table 1: Basic biographical events as keys.

2.2. Structuring biographical events

Events extracted from DBpedia are a subpart of the biographical events found in Wikipedia entries and other resources. The aim of this section is to show how the integration of information can be achieved by means of structured information derived from the output of existing NLP modules.

From the NewsReader pipeline output we extracted the disambiguated DBpedia URIs from the NERD layer. Each DBpedia URI is used to query the corresponding DBpedia HTML page to collect the basic information summarized in Table 1. Grounding each page on DBpedia URIs, we decided to combine the two sources, Wikipedia and DBpedia, to structure the array that will be the starting point for the visualisations, choosing the keys (see Table 1) that are basic for representing the lives of Holocaust deportees.

In Table 2 the data extracted for Marceli Handelsman, a Polish historian who died in the Mittelbau-Dora concentration camp, are reported. Most of them come from DBpedia (Lehmann et al., to appear). There, the data are obtained by processing the linguistic information contained in the first lines of Wikipedia entries, missing important information that is provided in the remaining text, such as, in this case, the birth place, the death place and the exact date of death, which we are able to include because we extracted them from the processed text. Moreover, DBpedia sets the date of death to the first day of the year when it finds only the year in the Wikipedia infobox; we overwrite this value when the NLP tools extract the exact date for the event (i.e. deathDate in this case is 1945-03-20 and not 1945-01-01). deportationFromCity is an example of a potentially uncertain value: if the place where the arrest/deportation took place is not mentioned, we presume that it coincides with the one where the person was living at the moment.

wikiOccupation, wikiNationality and wikiCamp values are imported, when they are available, from the Wikipedia taxonomy. These values allow deportees to be grouped according to three different modalities (their nationality, their job and the concentration camp where they have been deported).
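The overwrite and fallback rules described above, replacing DBpedia's year-only death dates with the exact date extracted from the text, and inducing deportationFromCity from the place of residence, can be sketched as follows (a minimal sketch; the helper names and the same-year check are our assumptions, not the paper's code):

```python
def merge_death_date(dbpedia_date, nlp_date):
    """DBpedia backs off to January 1st when the infobox gives only a
    year; prefer the exact date extracted from the processed text,
    assuming the two sources agree on the year."""
    if nlp_date and dbpedia_date and nlp_date[:4] == dbpedia_date[:4]:
        return nlp_date
    return dbpedia_date or nlp_date

def induce_deportation_city(mentioned, residence):
    """If no arrest/deportation place is mentioned, presume the city
    where the person was living at the time (an uncertain value)."""
    return mentioned if mentioned else residence + " (induced)"

# Marceli Handelsman: year-only DBpedia value vs. date from the text.
print(merge_death_date("1945-01-01", "1945-03-20"))  # 1945-03-20
print(induce_deportation_city(None, "Warsaw"))       # Warsaw (induced)
```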
DATE  PREP  EVENT  NSUBJ   DOBJ         LOCATION
1912  in    write  Gokkes  composition
1921  in    found  he      choir        Amsterdam
1923  in    marry  Gokkes  Winnik

Table 4: Output of the post-processing rules over the Stanford dependency parser.

name                      Marceli Handelsman
nationality               Polish
birthDate                 1882-01-01
birthPlace                Warsaw
deathDate                 1945-01-01 → 1945-03-20
deathPlace                Mittelbau-Dora
deathInConcentrationCamp  yes
deportationDate           1944
deportationCamp1          Gross-Rosen
deportationCamp2          Mittelbau-Dora
locationCamp1             Rogonica, Poland
locationCamp2             Nordhausen, Germany
deportationFromCity       Warsaw (induced)
lengthJourney1            282 km
lengthJourney2            681 km
deportationAge            62
deathAge                  63
wikiOccupation            historian
wikiNationality           Polish
wikiCamp                  Mittelbau-Dora

Table 2: Example of basic biographical events about Marceli Handelsman.

name                      Helen Berman
nationality               Dutch-Israeli
birthDate                 1936-04-06
birthPlace                Amsterdam, Netherlands
deathDate                 living
deathPlace                living
deathInConcentrationCamp  no
deportationDate
deportationCamp
deportationFromCity
lengthJourney
deportationAge
deathAge                  living
wikiOccupation            Painter
wikiNationality
wikiCamp

Table 3: Example of basic biographical events about Helen Berman.

This information could look redundant with the deportationCamp and nationality values extracted from data processed by NLP modules but, since it was part of the Wikipedia infobox compiled by users, it has a higher degree of certainty and, in case of mismatch, should be considered the right value.

In Table 3 the data relative to Helen Berman, one of the survivors, make evident a different type of missing information. She was a child when she was deported, and her short biography on Wikipedia mainly refers to her artistic career as a painter, briefly mentioning the event of deportation without details about it. Some of the keys in the JSON structure will be without a value because the information is missing both in DBpedia and in Wikipedia, and this could be a problem for a set of persons in the dataset; other sources of information should be found to integrate them (one possible source could be the Central Database of Shoah Victims' Names, http://www.yadvashem.org/, an international endeavor initiated and led by Yad Vashem: an estimated 4.3 million murdered Jews have been commemorated there, and a database of Shoah survivors will be released soon).

Some basic biographical events, such as birthDate and deportationDate among others, are by their nature anchored to time, i.e. they make sense only when a temporal expression fills their values. We consider event anchoring as the first step to discover commonalities between the biographies of different people, as it provides a way of comparing events with respect to a timeline.

Time-anchored events have been extracted through a rule-based module on top of the output of the Stanford CoreNLP. The rules take as input all dependency relations between a temporal expression, marked as DATE in the NER analysis of the Stanford CoreNLP, and a verb in the same sentence. We assume that the dependency relation between the temporal expression and the verb expresses a temporal relation of inclusion, i.e. it anchors the event on a timeline. More complex temporal relations, such as before, after, begins, ends, overlap or simultaneous, can also be expressed. In particular, when a temporal expression is introduced by a preposition, a further set of rules, crafted to express the temporal meaning (or relation) associated with the preposition, has been developed. This set of rules is based on the manually annotated data of the English TimeBank (http://www.timeml.org/site/timebank/timebank.html) (Pustejovsky et al., 2003). After the time anchoring of the verb has been established, we also extracted the dependency relations concerning the subject, the direct object and, if available, locations. Subjects and objects are then mapped to the NewsReader output to resolve entity disambiguation (NERD) and pronominal coreference.

Table 4 reports the output structure of the post-processing rules for event-anchoring extraction. The data obtained from the Stanford parser and their integration with the NewsReader output facilitate the comparison of biographies.
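The rule just outlined, a verb governing a DATE is anchored to it, after which its subject, direct object and LOCATION arguments are collected, can be sketched over a simplified dependency representation (a sketch only: the triple format, the `tags` dictionary and the UD-style labels are our simplification of the CoreNLP output, not the paper's module):

```python
def anchor_events(relations, tags):
    """Anchor each verb to a DATE it governs (temporal inclusion), then
    collect its subject, direct object and any LOCATION argument."""
    rows = []
    for head, label, dep in relations:
        if tags.get(head) == "VERB" and tags.get(dep) == "DATE":
            # preposition introducing the temporal expression, if any
            prep = next((d for h, l, d in relations
                         if h == dep and l == "case"), None)
            args = {l: d for h, l, d in relations if h == head}
            loc = next((d for h, l, d in relations
                        if h == head and tags.get(d) == "LOCATION"), None)
            rows.append((dep, prep, head, args.get("nsubj"),
                         args.get("dobj"), loc))
    return rows

# "In 1921 he founded a choir in Amsterdam" (cf. the second row of Table 4).
rels = [("found", "nmod", "1921"), ("1921", "case", "in"),
        ("found", "nsubj", "he"), ("found", "dobj", "choir"),
        ("found", "nmod", "Amsterdam"), ("Amsterdam", "case", "in")]
tags = {"found": "VERB", "1921": "DATE", "Amsterdam": "LOCATION"}
print(anchor_events(rels, tags))
# [('1921', 'in', 'found', 'he', 'choir', 'Amsterdam')]
```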
Examples 2a and 2b show how, by looking for the same event lemma (e.g. “emigrate”) in the extracted data, we can easily aggregate people by the type of event and still be able to find out differences (in this case, the fact that there are two emigration instances, one in 1978 to Israel and one in 1947 to the United States).

2a. Helen Berman: In 1978, she emigrated to Israel.
    1978, in, emigrate, she, Israel

2b. Ruth Klüger: In 1947 she emigrated to the United States.
    1947, in, emigrate, she, States

Figure 2: Horizontal timeline visualisation.

Anchoring in time is often paired with anchoring in space, because the information about where the event took place is necessary for completeness. For this reason, we integrated the data structure of the event lemma with information about the closest syntactic constituent that is a LOCATION according to the NER module of the Stanford parser (Israel and United States in the cases above).
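The aggregation illustrated by 2a and 2b amounts to grouping the extracted tuples by event lemma while keeping the date and the location visible; a minimal sketch (the exact tuple layout is our assumption):

```python
from collections import defaultdict

def aggregate_by_lemma(rows):
    """Group (person, DATE, PREP, EVENT, NSUBJ, LOCATION) tuples by the
    event lemma, keeping date and place so differences stay visible."""
    groups = defaultdict(list)
    for person, date, prep, lemma, subj, location in rows:
        groups[lemma].append((person, date, location))
    return dict(groups)

rows = [("Helen Berman", "1978", "in", "emigrate", "she", "Israel"),
        ("Ruth Klüger", "1947", "in", "emigrate", "she", "United States")]
print(aggregate_by_lemma(rows))
# {'emigrate': [('Helen Berman', '1978', 'Israel'),
#               ('Ruth Klüger', '1947', 'United States')]}
```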
3. Visualisation of data

The implementation of the visualisation is based on D3.js, a JavaScript library designed to display digital data in a dynamic graphical form. We propose two interrelated visualisation modalities:

• A force-directed graph (Figure 1), where each person is a node connected to other nodes when they share the same value. It allows the visualisation of clusters of persons based on the different values in Table 1, highlighting data that have been extracted from Wikipedia, DBpedia and the biographies parsed with the Stanford CoreNLP and the NewsReader pipeline. In this way, the name of a person becomes visible when moving the pointer over a node, and the source, i.e. the corresponding Wikipedia page in this work, opens in a different window when clicking on the node. This type of visualisation will allow the user to directly explore the source of information from which the data have been extracted and thus verify whether the proposed clusters are relevant or whether they are due to errors in data extraction.

Figure 1: Force-directed graph visualisation.

• A horizontal timeline, as illustrated in Figure 2. This visualisation also allows the representation of relevant dates concerning larger events, e.g. World War II, which have crossed the lives of the people in our dataset. The timeline visualisation reports, for each deportee, a set of dates and sentences extracted from Wikipedia and describing a biographical event.

4. Conclusions and Future Work

We propose a method and a preliminary development of an NLP module for extracting biographical events from biographical notes such as the ones available in Wikipedia. Focusing on temporally anchored events allows us to extract salient events which can be further used to identify other biographical events and to facilitate their chronological ordering and the representation of a person's storyline.

One of the limitations of our visualisation as a tool for data presentation is the lack of potential interactions with historians: spatial and temporal data extracted from texts can be ambiguous or uncertain, and some events can be wrongly extracted. Users should be able to validate the information discovered, labelling the results as true, worth further investigation, or useless because noisy. We plan to make the visualisations interactive, with the possibility to delete or annotate each piece of information.

As future work, we aim at labelling the biographical events as positive or negative, integrating a Sentiment Analysis component into our module by means of a psychologically grounded dataset (Lewinsohn and Amenson, 1978). So far this task was conducted manually, by associating the predicate and semantic role information of the extracted events with the entries in the psychological dataset. The manual labelling aims at developing a reliable training set for the development of a learning algorithm.

5. Acknowledgements

One of the authors wishes to thank the NWO Spinoza Prize project Understanding Language by Machines (sub-track 3) for partially supporting this work.

6. References

Agerri, R., I. Aldabe, Z. Beloki, E. Laparra, M. De Lacalle, G. Rigau, A. Soroa, M. van Erp, P. Vossen, C. Girardi and S. Tonelli (2014), Event Detection, version 2, D4.2.2. NewsReader Project Deliverable.

Cairo, A. (2012), The Functional Art: An Introduction to Information Graphics and Visualization.

van Drie, J. and C. van Boxtel (2008), Historical Reasoning: Towards a Framework for Analyzing Students' Reasoning about the Past. Educational Psychology Review, 20(2), 87-110.

Fokkens, A., A. Soroa, Z. Beloki, N. Ockeloen, G. Rigau, W. R. van Hage and P. Vossen (2014), NAF and GAF: Linking Linguistic Annotations. In: Proceedings of the 10th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, Reykjavik, Iceland, 2014, p. 9.
Lehmann, J., R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer and C. Bizer (to appear), DBpedia: A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. To appear in the Semantic Web Journal.

Lewinsohn, J. and C. S. Amenson (1978), Some Relations between Pleasant and Unpleasant Events and Depression. In: Journal of Abnormal Psychology, 87(6): 644-654.

Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. Bethard and D. McClosky (2014), The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Novak, J., I. Micheel, L. Wieneke, M. Düring, M. Melenhorst, J. Garcia Moron, C. Pasini, M. Tagliasacchi and P. Fraternali (2014), HistoGraph: A Visualization Tool for Collaborative Analysis of Networks from Historical Social Multimedia Collections. In: 2014 18th International Conference on Information Visualisation (IV).

Pustejovsky, J., P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro and M. Lazo (2003), The TimeBank Corpus. In: Proceedings of Corpus Linguistics 2003.