=Paper= {{Paper |id=Vol-1399/paper17 |storemode=property |title=Extracting and Visualising Biographical Events from Wikipedia |pdfUrl=https://ceur-ws.org/Vol-1399/paper17.pdf |volume=Vol-1399 |dblpUrl=https://dblp.org/rec/conf/bd/RussoCM15 }} ==Extracting and Visualising Biographical Events from Wikipedia== https://ceur-ws.org/Vol-1399/paper17.pdf
             Extracting and Visualising Biographical Events from Wikipedia
                             Irene Russo*,Tommaso Caselli**, Monica Monachini*
       *ILC-CNR “A. Zampolli” Pisa,** Computational Lexicology & Terminology Lab Vrije Universiteit Amsterdam
                        irene.russo@ilc.cnr.it, t.caselli@vu.nl, monica.monachini@ilc.cnr.it

                                                               Abstract
This work presents a proposal for the development of a natural language processing module for event and temporal analysis of
biographies as available in Wikipedia. At the current level of development, we restricted the extraction to temporally anchored events as
they represent salient information which can be further used to extract additional events and facilitate their chronological ordering and
the representation of a person’s timeline. Visualising data about basic facts concerning groups of people helps with historical reasoning
and enables comparisons among them.

Keywords: mining biographies for structured information, visualising biographical data, temporal information


1. Introduction

Historical reasoning concerns facts and stories of the past to describe, compare and explain historical
phenomena. Six activities can be listed as part of historical reasoning: a.) historical questions; b.) the
use of primary and secondary sources; c.) contextualisation; d.) argumentation; e.) the use of substantive
concepts; and, f.) the use of meta-concepts (van Drie and van Boxtel, 2008).

The way information is presented has an impact on historical reasoning: primary and secondary sources are
today widely digitized and computational processing of textual and multimodal information (Novak et al.,
2014) can meet the needs of users from the humanities, both in research and education. As a matter of fact,
in many disciplines, visualisation constitutes a graphical or cognitive aid to thinking because it is based
on interrelated textual content, spatio-temporal data and metadata related to images and videos. Data can be
mainly visualised for presentation or exploration but in well designed projects there is a continuum between
these two modalities (Cairo, 2012). A key aspect in Digital Humanities (DH) is to provide processed results
in a way that is usable and easy to browse: data visualisation is today a device which enables the
exploration, filtering and searching of data, skipping the interaction with databases.

Narratives can be integrated in visual forms of presentations: the Spatial History Project at Stanford
University [1] deals with visualisations that involve the geographic dimensions of the Holocaust, portraying,
for instance, the mobility in the Budapest ghetto or the arrests of the Italian Jews.

In e-history projects, textual content is processed by Natural Language Processing (NLP) modules which take
care of tasks such as Named Entity Recognition and Disambiguation (NERD). These modules identify different
types of entities (e.g. Person, Organization, Location etc.) and can link them to external knowledge
repositories (e.g. DBpedia) by means of URIs, thus enriching the extracted information. Hypothesis
formulation about relations between entities (e.g. people, places and events) is supported once entities and
relations between them have been mined from texts. Along this line, data can be explored in novel ways
through basic inference rules that find out commonalities between two or more entities.

Our work focuses on comparisons and questions that may arise through visualisations of data automatically
extracted from corpora. We designed a simple visualisation tool because visualising data about basic facts
concerning groups of people may help with historical reasoning and enables comparisons.

In our framework a fact is a notion where history and data visualisation meet: if for historians a fact is
an assertion about people and events that can be located in the past (i.e. a main predicate temporally and
spatially grounded as in 1a), for data visualisation, based on data structures like JSON, the same fact,
like the date of death, is an instantiated value for a key, as in 1b:

1a. Primo Levi died on 11 April 1987.
1b. “nodes”:[
        {“name”:“Primo Levi”,“deathDate”:“1987-04-11”}
    ]

As a case study, we selected a coherent group of biographies concerning people that have been deported to
concentration camps during Nazism, including both those who died because of the deportation and the
survivors. Data cleaning and tagging is reported in Section 2.1 while in Section 2.2 we describe the way
biographical information has been structured before automatic extraction.

Choosing an operational notion of biographical event as an activity that directly involves the person in
question and can be anchored on a timeline whenever a date is available, we set up a basic biographical data
array for each subject to encode factual data presented in the visualisation (see Section 3). We end with
conclusions and proposals for future work in Section 4.

   [1] http://web.stanford.edu/group/spatialhistory

2. Dataset and Tools

2.1. Holocaust deportees dataset

As a case study, 782 Wikipedia pages relative to the biographies of people deported to Nazi concentration
camps were

downloaded. The set includes 247 short biographies of people that survived. All these people have a key
event in common, namely that they have been deported to Nazi concentration camps, and several others along
which their lives differ and can be compared.

The biographies are part of a Wikipedia category, namely people who died in Nazi concentration camps [2],
and of a list of Holocaust survivors [3].

The Wikipedia pages have been downloaded and saved in plain text files by removing all HTML tags. The
downloaded data have been processed by two different NLP systems: the NewsReader pipeline (Agerri et al.,
2014) and the Stanford CoreNLP (Manning et al., 2014).

The NewsReader pipeline is a set of 15 NLP modules that generates a structure in the NLP Annotation Format
(NAF) (Fokkens et al., 2014). The pipeline has been developed as part of the EU project NewsReader [4].
Apart from the basic processing modules, such as tokenization, part-of-speech tagging and lemmatization, the
additional modules which are relevant for our project include:

  • coreference resolution (COREF layer);
  • named entity recognition and disambiguation (NERC and NERD layers);
  • semantic role labelling (SRL layer).

Stanford CoreNLP is a Java natural language analysis library that includes a part-of-speech (POS) tagger, a
named entity recognizer (NER), a dependency parser and a coreference resolution system.

2.2. Structuring biographical events

Events extracted from DBpedia are a subpart of biographical events found in Wikipedia entries and other
resources. The aim of this section is to show how the integration of information can be achieved by means of
structured information derived from the output of existing NLP modules.

name                        name of the deportee
nationality                 nationality of the deportee (current nationality for survivors)
birthDate                   date of the birth of the deportee
birthPlace                  place of birth of the deportee
deathDate                   date of death of the deportee
deathPlace                  place of death of the deportee
deathInConcentrationCamp    YES/NO; it reports if the deportee died in a concentration camp
deportationDate             date of deportation
deportationCamp1            first camp of deportation
deportationCamp2            second camp of deportation, if available
locationCamp1               location of deportationCamp1
locationCamp2               location of deportationCamp2
deportationFromCity         the city where the deportee was living at deportationDate
deportationAge              age of the deportee at deportationDate
deathAge                    age of the deportee at deathDate
wikiOccupation              Wikipedia category
wikiNationality             Wikipedia category
wikiCamp                    Wikipedia category
gender                      M/F

Table 1: Basic biographical events as keys.
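
As an illustration of how the keys in Table 1 can be used, the following minimal Python sketch stores one deportee as a key/value record and derives deportationAge and deathAge from the date keys. It is only a sketch under stated assumptions: the helper names (parse_partial_date, age_at, with_derived_ages) are hypothetical, dates are assumed to be ISO strings that may contain only the year, and the sample values are those reported for Marceli Handelsman and Helen Berman in Tables 2 and 3.

```python
import re
from typing import Optional, Tuple

# Accepts "YYYY" or "YYYY-MM-DD"; any other value (e.g. "living") is treated as missing.
DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})-(\d{2}))?")

PartialDate = Tuple[int, Optional[int], Optional[int]]


def parse_partial_date(value: Optional[str]) -> Optional[PartialDate]:
    """Return (year, month, day); month and day are None for year-only dates."""
    if not value:
        return None
    match = DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    if month is None:
        return (int(year), None, None)
    return (int(year), int(month), int(day))


def age_at(birth: Optional[str], event: Optional[str]) -> Optional[int]:
    """Age in whole years at the event date, or None if a date is missing."""
    b = parse_partial_date(birth)
    e = parse_partial_date(event)
    if b is None or e is None:
        return None
    age = e[0] - b[0]
    # Correct for the birthday only when both dates are complete.
    if None not in b and None not in e and (e[1], e[2]) < (b[1], b[2]):
        age -= 1
    return age


def with_derived_ages(record: dict) -> dict:
    """Copy the record and fill the two derived keys of Table 1."""
    enriched = dict(record)
    birth = record.get("birthDate")
    enriched["deportationAge"] = age_at(birth, record.get("deportationDate"))
    enriched["deathAge"] = age_at(birth, record.get("deathDate"))
    return enriched


# Sample values copied from Tables 2 and 3 (only a subset of the keys is shown).
handelsman = with_derived_ages({
    "name": "Marceli Handelsman",
    "birthDate": "1882-01-01",
    "deathDate": "1945-03-20",
    "deportationDate": "1944",
})
berman = with_derived_ages({
    "name": "Helen Berman",
    "birthDate": "1936-04-06",
    "deathDate": "living",
    "deportationDate": None,
})
print(handelsman["deportationAge"], handelsman["deathAge"])
print(berman["deportationAge"], berman["deathAge"])
```

Keys whose source value is missing or not a date simply stay empty (None), which mirrors the empty cells that some records contain.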
From the NewsReader pipeline output we extracted the disambiguated DBpedia URIs from the NERD layer. The
DBpedia URI is used to query the corresponding DBpedia HTML page to collect the basic information
summarized in Table 1.

Grounding each page on DBpedia URIs, we decided to combine the two sources, Wikipedia and DBpedia
respectively, to structure the array that will be the starting point for the visualisations, choosing the
keys (see Table 1) that are basic for representing the lives of Holocaust deportees. In Table 2 the data
extracted for Marceli Handelsman, a Polish historian that died in the Mittelbau-Dora concentration camp, are
reported. Most of them come from DBpedia (Lehmann et al., to appear). The data are obtained by processing
the linguistic information contained in the first lines of Wikipedia entries, missing important information
that is provided in the remaining text as, in this case, the birth place, the death place and the exact date
of death, that we are able to include because we extracted them from the processed text. Moreover, DBpedia
sets the date of death to the first day of the year when it finds just the year in the Wikipedia pages
infobox; we overwrite this value when the NLP tools extract the exact date for the event (i.e. deathDate in
this case is 1945-03-20 and not 1945-01-01). deportationFromCity is an example of a potentially uncertain
value: if the place where the arrest/deportation took place is not mentioned we presume that it coincides
with the one where the person was living at the moment.

   [2] http://en.wikipedia.org/wiki/Category:People_who_died_in_Nazi_concentration_camps_by_occupation
   [3] http://en.wikipedia.org/wiki/Lis_of_Holocaust_survivors
   [4] http://www.newsreaderproject.eu

wikiOccupation, wikiNationality and wikiCamp values are imported, when they are available, from the
Wikipedia taxonomy. These values allow deportees to be grouped according to three different modalities
(their nationality, their job and the concentration camp where they have been deported). This information
could look redundant with deportationCamp and nationality

extracted from data processed by NLP modules but, since it was part of Wikipedia pages infobox compiled by
users, has a higher degree of certainty and, in case of mismatch, should be considered the right value. In
Table 3 data relative to Helen Berman, one of the survivors, make evident a different type of information
that is missing. She was a child when she was deported and her short biography on Wikipedia mainly refers to
her artistic career as a painter, briefly mentioning the event of deportation without details about it. Some
of the keys in the JSON structure will be without a value because information is missing both in DBpedia and
Wikipedia and this could be a problem for a set of persons in the dataset; other sources of information
should be found to integrate them. [5]

name                        Marceli Handelsman
nationality                 Polish
birthDate                   1882-01-01
birthPlace                  Warsaw
deathDate                   1945-03-20 (not 1945-01-01)
deathPlace                  Mittelbau-Dora
deathInConcentrationCamp    yes
deportationDate             1944
deportationCamp1            Gross-Rosen
deportationCamp2            Mittelbau-Dora
locationCamp1               Rogonica, Poland
locationCamp2               Nordhausen, Germany
deportationFromCity         Warsaw (induced)
lengthJourney1              282 km
lengthJourney2              681 km
deportationAge              62
deathAge                    63
wikiOccupation              historian
wikiNationality             Polish
wikiCamp                    Mittelbau-Dora

Table 2: Example of basic biographical events about Marceli Handelsman.

name                        Helen Berman
nationality                 Dutch-Israeli
birthDate                   1936-04-06
birthPlace                  Amsterdam, Netherlands
deathDate                   living
deathPlace                  living
deathInConcentrationCamp    no
deportationDate
deportationCamp
deportationFromCity
lengthJourney
deportationAge
deathAge                    living
wikiOccupation              Painter
wikiNationality
wikiCamp

Table 3: Example of basic biographical events about Helen Berman.

Some basic biographical events, such as birthDate and deportationDate among others, are anchored to time
because of their nature, i.e. they make sense only when a temporal expression fills their values. We
consider event anchoring as the first step to discover commonalities between biographies of different people
as it provides a way of comparing events with respect to a timeline.

Time anchored events have been extracted through a rule-based module on top of the output of the Stanford
CoreNLP. The rules take as input all dependency relations between a temporal expression, marked as DATE in
the NER analysis of the Stanford CoreNLP, and a verb in the same sentence. We assume that the dependency
relation between the temporal expression and the verb expresses a temporal relation of inclusion, i.e.
anchors the event on a timeline. More complex temporal relations, such as before, after, begins, ends,
overlap or simultaneous, can be expressed. In particular, in case a temporal expression is introduced by a
preposition, a further set of rules designed to express the temporal meaning (or relation) associated with
the preposition has been developed. This set of rules is based on the manually annotated data from the
English TimeBank [6] (Pustejovsky et al., 2003). After the time anchoring of the verb has been established
we also extracted dependency relations concerning the subject, direct object and, if available, locations.
Subjects and objects are then mapped to the NewsReader output to solve entity disambiguation (NERD) and
pronominal coreference.

   [5] One possible source could be the Central Database of Shoah Victims Names (http://www.yadvashem.org/),
       an international endeavor initiated and led by Yad Vashem; an estimated 4.3 million murdered Jews have
       been commemorated and a database of Shoah Survivors will be released soon.
   [6] http://www.timeml.org/site/timebank/timebank.html

DATE      PREP      EVENT      NSUBJ      DOBJ           LOCATION
1912      in        write      Gokkes     composition
1921      in        found      he         choir          Amsterdam
1923      in        marry      Gokkes     Winnik

Table 4: Output of the post-processing rules over the Stanford dependency parser.

Table 4 reports the output structure of the post-processing rules for event anchoring extraction. The data
obtained from the Stanford parser and their integration with the NWR output facilitate the comparison of
biographies. Examples 2a and 2b show how by looking for the same event lemma (e.g. “emigrate”) in the
extracted data, we

can easily aggregate people by the type of event and still be able to find out differences (in this case,
the fact that there are two emigration instances, one in 1978 to Israel and one in 1947 to the United
States).

2a. Helen Berman: In 1978, she emigrated to Israel.
    1978, in, emigrate, she, Israel.
2b. Ruth Klüger: In 1947 she emigrated to the United States.
    1947, in, emigrate, she, States.

Figure 2: Horizontal timeline visualisation.
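
The aggregation illustrated by examples 2a and 2b can be sketched as a simple grouping of the extracted tuples by event lemma. The Python fragment below is a minimal sketch, not the actual implementation: the AnchoredEvent type and the group_by_lemma function are hypothetical names, and the last field of each tuple is assumed to hold the LOCATION value.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class AnchoredEvent(NamedTuple):
    """One time-anchored event, following the tuple layout of examples 2a and 2b."""
    person: str               # biography the event was extracted from
    date: str                 # temporal expression (DATE)
    prep: str                 # preposition introducing the date
    lemma: str                # event lemma
    subject: str              # NSUBJ as found in the sentence
    location: Optional[str]   # closest LOCATION, if any (assumed field)


def group_by_lemma(events: List[AnchoredEvent]) -> Dict[str, List[AnchoredEvent]]:
    """Aggregate events that share the same lemma."""
    groups: Dict[str, List[AnchoredEvent]] = defaultdict(list)
    for event in events:
        groups[event.lemma].append(event)
    return dict(groups)


events = [
    AnchoredEvent("Helen Berman", "1978", "in", "emigrate", "she", "Israel"),
    AnchoredEvent("Ruth Klüger", "1947", "in", "emigrate", "she", "States"),
]

groups = group_by_lemma(events)
# People who share the event type, ordered by year so that differences stand out.
emigrations = sorted(groups["emigrate"], key=lambda event: event.date)
for event in emigrations:
    print(event.person, event.date, event.location)
```

Grouping on the lemma keeps the individual dates and locations available, so people are aggregated by the type of event while the differences between the single instances remain visible.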
Anchoring in time is often paired with anchoring in space because the information about where the event took
place is necessary for completeness; for this reason we integrated the data structure of event lemma with
information about the closest syntactic constituent that is a LOCATION according to the NER module of the
Stanford parser (Israel and United States, in the cases above).

3. Visualisation of data

The implementation of the visualisation is based on D3.js, a JavaScript library designed to display digital
data in a dynamic graphical form. We propose two interrelated visualisation modalities:

  • A force-directed graph (Figure 1) where each person is a node connected to other nodes when they share
    the same value. It allows the visualisation of clusters of persons based on the different values in
    Table 1, highlighting data that have been extracted from Wikipedia, DBpedia and biographies parsed with
    the Stanford CoreNLP and the NewsReader pipeline. In this way, the names of the persons become visible
    when moving a pointer over a node and the source, i.e. the corresponding Wikipedia page in this work,
    will open in a different window when clicking on the node.
    This type of visualisation will allow the user to directly explore the source of information from which
    the data have been extracted and thus verify if the proposed clusters are relevant or if they are due to
    errors in data extraction.

  • A horizontal timeline as illustrated in Figure 2. This visualisation also allows relevant dates
    concerning larger events, e.g. World War II, which have crossed the lives of the people in our dataset,
    to be represented. The timeline visualisation reports, for each deportee, a set of dates and sentences
    extracted from Wikipedia and describing a biographical event.

Figure 1: Force-directed graph visualisation.

4. Conclusions and Future Work

We propose a method and a preliminary development of an NLP module for extracting biographical events from
biographical notes such as the ones that are available in Wikipedia. Focusing on temporally anchored events
will allow the extraction of salient events which can be further used to identify other biographical events
and facilitate their chronological ordering and the representation of a person's storyline.

One of the limitations of our visualisation as a tool for data presentation is the lack of potential
interactions with historians: spatial and temporal data extracted from texts can be ambiguous or uncertain
and some events can be wrongly extracted. Users should be able to validate the information discovered,
labelling the results as true, worth further investigation or useless because noisy. We plan to make the
visualisations interactive, with the possibility to delete or annotate each piece of information.

As future work we aim at labelling the biographical events as positive or negative, integrating a Sentiment
Analysis component in our module by means of a psychologically grounded dataset (Lewinsohn and Amenson,
1978). So far this task was conducted manually by associating the predicate and semantic role information of
the extracted events to the entries in the psychological dataset. The manual labelling aims at developing a
reliable training data set for the development of a learning algorithm.

5. Acknowledgements

One of the authors wants to thank the NWO Spinoza Prize project Understanding Language by Machines
(sub-track 3) for partially supporting this work.

6. References

   Agerri, R., I. Aldabe, Z. Beloki, E. Laparra, M. De Lacalle, G. Rigau, A. Soroa, M. van Erp, P. Vossen,
C. Girardi and S. Tonelli (2014), Event Detection, version 2 D4.2.2. NewsReader Project Deliverable.
   Cairo, A. (2012), The Functional Art: An introduction to information graphics and visualization.
   van Drie, J. and C. van Boxtel (2008), Historical Reasoning: Towards a Framework for Analyzing Students’
Reasoning about the Past. Educational Psychology Review, 20 (2), 87-110.
   Fokkens, A., A. Soroa, Z. Beloki, N. Ockeloen, G. Rigau, W. R. van Hage and P. Vossen (2014), NAF and
GAF: linking linguistic annotations. In: Proceedings 10th

Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, Reykjavik, Iceland, 2014, p. 9.
   Lehmann, J., R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van
Kleef, S. Auer and C. Bizer (to appear), DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from
Wikipedia. To appear in the Semantic Web Journal.
   Lewinsohn, J. and C.S. Amenson (1978), Some Relations between Pleasant and Unpleasant Events and
Depression. In: Journal of Abnormal Psychology, 87(6): 644-654.
   Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. Bethard and D. McClosky (2014), The Stanford CoreNLP
Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, pp. 55-60.
   Novak, J., I. Micheel, L. Wieneke, M. Düring, M. Melenhorst, J. Garcia Moron, C. Pasini, M. Tagliasacchi
and P. Fraternali (2014), HistoGraph, A visualization Tool for Collaborative Analysis of Networks from
Historical Social Multimedia Collections. 2014 18th International Conference on Information Visualisation
(IV).
   Pustejovsky, J., P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L.
Ferro and M. Lazo (2003), The Timebank Corpus, Proceedings of Corpus Linguistics 2003.



