Analyzing and Visualizing Prosopographical Linked Data Based on Biographies

Petri Leskinen¹, Eero Hyvönen¹,², and Jouni Tuominen¹,²
¹ Semantic Computing Research Group (SeCo), Aalto University, Finland
² HELDIG – Helsinki Centre for Digital Humanities, University of Helsinki, Finland
http://seco.cs.aalto.fi, http://heldig.fi

This paper shows how faceted search on biographical data can be utilized as a flexible basis for filtering target groups of people and, in particular, how generic data analysis and visualization tools can then be applied to the filtered data for solving prosopographical research questions. This idea is demonstrated and evaluated in practice by presenting two application case studies: 1) linked data extracted from a printed registry of over 10 000 alumni (1867–1992) of the prominent Finnish high school Norssi, and 2) a knowledge graph extracted from 13 000 short biographies of significant Finnish people (from the 3rd century to the present) in the National Biography of Finland. In both cases, the data is enriched by linking its entities to several external datasets.

Keywords: Linked Data, Data Visualization, Biography, Prosopography

1. Prosopographical Method

Biographies describe life stories of particular people of significance, with the aim of getting a better understanding of their personality and actions, e.g., to understand their motives (Roberts, 2002). In contrast, the focus of prosopography is to study life histories of groups of people in order to find out some kind of commonness or average in them (Verboven et al., 2007). For example, the research question may be to find out what happened to the students of a school before World War II in terms of social ranking, employment, or military involvement after their graduation.

The prosopographical research method (Verboven et al., 2007, p. 47) consists of two major steps. First, a target group of people is selected that share desired characteristics for solving the research question at hand. Second, the target group is analyzed, and possibly compared with other groups, in order to solve the research question.

In our earlier paper (Hyvönen et al., 2017) we presented an application case study where data from a printed collection of over 10,000 short biographies (registry entries) of Norssi high school alumni were extracted and transformed into Linked Open Data, enriched by data linking to 10 external data sources, and published in a SPARQL¹ endpoint. A semantic faceted search engine and browser was developed for searching and filtering people and biographies that were enriched with internal and external linking for biographical research. Application of the same idea to the dataset of the Semantic National Biography of Finland (2014–2017) was considered in Hyvönen et al. (2018), and the underlying data model was presented in Leskinen et al. (2017).

This paper extends this line of research by showing how the filtered target group of faceted search can be utilized as a basis for prosopographical research using different kinds of data-analytic tools for solving prosopographical research questions. Such tools may involve, e.g., methods of network analysis (Easley and Kleinberg, 2010; Hanneman and Riddle, 2005) and visualizations (Dadzie and Rowe, 2011; Kehrer and Hauser, 2013).

The main contribution of this paper is to test and demonstrate the prosopographical method in practice by presenting how various data visualization tools using Google Charts and Google Maps can be integrated with the SPARQL endpoint, allowing the end user to filter out target groups of people and biographies, and then to study them. In addition to providing statistical analyses of person groups, an interesting use case identified here is to compare analyses and visualizations based on different subgroups, e.g., people with the same profession during different eras.

The paper is organized as follows. First, prosopographical analyses and visualizations are presented and discussed for the two linked datasets and applications using the approach outlined above: the Norssi high school alumni on the Semantic Web and the Semantic National Biography of Finland. After this, the contributions of the work in relation to related research are summarized and directions for further research are outlined.

2. Norssi Alumni Application

The Norssi alumni data service is available as linked open data at the Linked Data Finland platform², including some 892,000 triples about 131,000 resources. The digitization, “lodification”, and the Vanhat Norssit Portal³ are described in more detail in Hyvönen et al. (2017). The datasets consist of 10 137 person resources, enriched with graphs of related career events and family relations, and vocabularies of titles, schools, companies, medals, and hobbies. These additional data were extracted automatically from the short biographical descriptions of a printed book using OCR and text extraction and cleaning tools based on regular expressions.

The ontology model representing people and their biographical information in the Norssit alumni knowledge

¹ SPARQL Protocol and RDF Query Language, https://www.w3.org/TR/sparql11-query/
² http://www.ldf.fi/dataset/norssit
³ http://www.norssit.fi/semweb

graph is based on the Bio CRM data model⁴ (Tuominen et al., 2018), which has been developed to facilitate and harmonize the representation of biographies and cultural heritage data on the Semantic Web. Bio CRM is a domain-specific extension of CIDOC CRM⁵ (Doerr, 2003), the event-based ISO standard for representing and harmonizing Cultural Heritage data. It includes structures for basic data of people, personal relations, professions, and events with participants in different qualified roles. Bio CRM makes a distinction between enduring unary roles of actors, their enduring binary relationships, and perduring events, where the participants can take different roles modeled as a role concept hierarchy. The ontology and data infrastructure used for the Norssi dataset are described in detail in Leskinen et al. (2017).

Figure 1: Faceted search for short biographies in the alumni register Norssit 1867–1992.

The Vanhat Norssit Portal contains two search interfaces, person pages, and two pages for statistical visualizations. The search interface (Fig. 1) is based on SPARQL Faceter (Koho et al., 2016), a tool for creating faceted search interfaces on a SPARQL endpoint. The interface allows the user to filter the results based on, e.g., people’s education, profession, place of birth, or the external databases to which he or she has been linked.

For analyzing and visualizing data statistics of a filtered target group of people, we created two views based on Google Chart⁶ diagrams. On the first visualization page⁷, the popularity of the most common educations (Fig. 2), universities and colleges, professions, and employers after the graduation of the alumni is shown as four pie charts. By making filtering selections on the facets, the graphics are updated accordingly. For example, by selecting “professor” on the profession facet, the employers of the 258 professors in the data can be seen on the employer pie chart. On the same page, there is also a Sankey diagram, depicted in Fig. 3, that shows a list of universities on the left side and the corresponding educational titles (e.g., MSc in Technology, Doctor of Medicine, etc.) on the right. From this visualization one can see which titles were obtained from which universities within the filtered target group. The highlighted path in Fig. 3 shows, e.g., the connection from the University of Helsinki to Bachelor of Arts when no filtering choices have been made.

Figure 2: Pie chart showing the most common educations among high school alumni.

On the second visualization page⁸, there are first two histograms showing years of enrollment and matriculation of the target group. Below these, three multi-column charts show the most popular universities and colleges, employers, and occupations of the filtered people on a decade-by-decade basis. For example, from the histogram representing the years of enrollment (Fig. 4) one can see that when education in Norssi was started, many pupils from other schools moved to Norssi (first high bar on the left). The changes made in the Finnish school system in the 1970s are also clearly visible as very low enrollment rates. Fig. 5 depicts the most popular employers. It shows a great and interesting variation of companies and organizations at different times: in the late 1800s the Finnish State Railways (Valtion Rautatiet, blue columns) was the most popular employer, but it soon declined, probably because the main railway connections in Finland were built in 1850–1900.⁹ The Finnish Defense Forces (Puolustusvoimat, green columns), on the other hand, has its highest peak during the Second World War. After this, the banking industry and the city of Helsinki became major employers for Norssi alumni.

The facet for links to external datasets also provides an interesting option for selecting target groups. For example, a student in the school may ask herself/himself the question: where should I work if I want to become famous and get an entry in the National Biography? By making the selection “National Biography” on the facet and then looking at the employer multi-column chart, one can get an idea of where to work in order to be included in the National Biography.

The official motto of the Norssi high school is Non scholae sed vitae (not for school, but for life). Data analytics based on the linked data service now provides new insights on what actually happened to the school alumni in life after graduation in a prosopographical sense.
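The two steps described in this section, filtering a target group with facet selections and then charting it decade by decade, can be illustrated with a minimal Python sketch. It is not the query generation used by SPARQL Faceter or the portal: the `ex:` namespace, the property names in `FACET_PROPERTIES`, and the sample rows in the usage note are hypothetical placeholders, since the actual schema is not given here.

```python
from collections import Counter, defaultdict

# Hypothetical facet-to-property mapping; the real Norssi schema may differ.
FACET_PROPERTIES = {
    "profession": "ex:profession",
    "employer": "ex:employer",
    "education": "ex:education",
}


def build_query(selections):
    """Build a SPARQL SELECT returning persons that match all facet selections."""
    patterns = ["?person a ex:Person ."]
    for facet, value in sorted(selections.items()):
        prop = FACET_PROPERTIES[facet]
        # Escape backslashes and quotes so the value stays a valid string literal.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        patterns.append(f'?person {prop} "{literal}" .')
    body = "\n  ".join(patterns)
    return (
        "PREFIX ex: <http://example.org/schema/>\n"
        f"SELECT DISTINCT ?person WHERE {{\n  {body}\n}}"
    )


def employers_by_decade(rows):
    """Count employers per decade from (employer, start_year) result rows."""
    counts = defaultdict(Counter)
    for employer, year in rows:
        decade = (year // 10) * 10  # e.g. 1885 -> 1880
        counts[decade][employer] += 1
    return counts
```

For instance, `build_query({"profession": "professor"})` yields a query restricted to professors, and `employers_by_decade([("Valtion Rautatiet", 1885), ("Puolustusvoimat", 1941)])` (made-up rows) groups the result into the 1880s and 1940s, which is the shape a multi-column chart needs: one series per employer, one column group per decade.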
⁵ http://cidoc-crm.org
⁶ https://developers.google.com/chart/
⁷ http://www.norssit.fi/semweb/#!/visualisointi
⁸ http://www.norssit.fi/semweb/#!/visualisointi2
⁹ https://en.wikipedia.org/wiki/History_of_rail_transport_in_Finland

Figure 3: Sankey diagram showing the linkage between the university and the education.

Figure 4: Column chart showing the number of pupils by enrollment year.

3. Semantic National Biography of Finland

The National Biography of Finland¹⁰ consists of biographies of notable Finnish people throughout history (200–2018). The biographies describe the lives and achievements of these historical and contemporary figures, containing a large number of references to notable Finnish and foreign figures, including internal links to other biographies of the National Biography of Finland. In addition, the text contains references to historical events, notable works (such as paintings, books, music, and acting), places (such as place of birth and death), organizations, and dates.

In this case, the texts and data were available in a database in a semi-structured form. As in the Norssi case above, the texts were transformed into RDF form by extracting entities from the semi-structured texts, and the result was uploaded into a SPARQL endpoint of the Linked Data Finland service.

The underlying ontology model represents people and their biographical information. A natural choice for modeling life stories is the event-based approach where a person’s life is seen as a sequence of spatio-temporal, possibly interlinked events from birth to death (and beyond). The events are modeled according to the Bio CRM model (Tuominen et al., 2018), and the person ontology is compatible with the Getty ULAN LOD¹¹ model.

The source data consists (at the moment) of fields extracted from the original database dump in CSV format. In the simplest cases, the value of a data field maps directly to the value of a property, e.g., date or place of birth. However, most of the structured knowledge was extracted from short snippets of text at the end of each biography describing major life events of the protagonist, such as graduation from a university, designing a building, publishing a book, getting an honorary medal, etc. The resulting knowledge graph includes 13 144 people with a biographical description in the National Biography, 51 243 related people mentioned in the biographies, and 977 authors of the biographies. At the moment, the data includes 37 730 births, 25 552 deaths, and 102 300 other biographical events. In addition, there are 51 937 family relations, 4953 places, 3101 occupational titles, and 2938 companies extracted from the source data (Hyvönen et al., 2018).

On top of the data service, a search interface (Fig. 6) using the SPARQL Faceter tool (Koho et al., 2016) and the AngularJS¹² framework was created. It can be used for finding individual biographies and for filtering out target groups for prosopography.

For biographical research, we created two tabs for each person entry page: one for the textual description of the person with additional data links, and one for a spatio-temporal visualization of the life events of the person using a map and a timeline. For prosopography, there is 1) a page for studying the events of the target group, and 2) a page for visualizing statistics of the filtered people. The application will be opened to the public in September 2018.

Fig. 7 depicts an example of a person’s map-timeline page.
¹⁰ https://kansallisbiografia.fi/english/national-biography
¹² http://angularjs.org

Figure 5: Column chart showing the most common employers.

Figure 6: Main page of the Finnish National Biography.

There is a chronological list of life events in the left column. Events with known locations are shown on the map, and below there is a timeline showing the timespan of the events. The timeline spans from a person’s birth to death, and shows when the career highlights have taken place. There are four horizontal lines in the timeline for separating different categories of biographical events, each represented in a different color: family events (e.g., getting married, having children), career events (e.g., education, professional experience), achievements, and mentions of honor. Corresponding markers on the map follow the same color scheme.

When the user hovers over an event in the event list or on the timeline, the corresponding marker on the map is highlighted. The size of the marker depends on the number of events related to that specific location, so the most important places for a person’s career are emphasized. In the example case, the visualization is based on the biography of architect Eliel Saarinen, and Helsinki and Michigan (where he lived his later years) are emphasized. Data about the places in Finland was extracted from the Finnish Gazetteer of Historical Places and Maps (Hipla) databases and data service¹³ (Ikkala et al., 2016; Hyvönen et al., 2016). Foreign placenames were linked using the Google Maps APIs¹⁴. For example, the locations of medieval universities in Europe, towns of the Hanseatic League¹⁵, Finnish mansions, churches, and other well-known buildings were added to the place ontology using the Google services. The place ontology includes locations at different scales, such as countries, towns, villages, and in some cases even buildings with a known specified address.

Figure 7: Map and timeline showing events related to the Finnish architect Eliel Saarinen.

As for prosopographical research, there are two different views available using Angular Google Maps¹⁶. The target group can be filtered by using a time span slider¹⁷ that is included as a facet for the user to specify a desired range of years of interest. Other filtering facets include the person’s profession, gender, dataset, related companies, related place, and linkage to external databases.

The visualizations depicted in Fig. 8 show the results of a SPARQL query corresponding to the facet selections on Angular Google Maps. The markers on the map show places of birth in blue and places of death in red. The size of the marker corresponds to the number of events that have taken place in that particular location. Clicking on a marker opens a modal window containing a list of people who were born or died at the location.
¹³ http://hipla.fi
¹⁴ http://developers.google.com/maps/
¹⁶ http://angular-ui.github.io/angular-google-maps/
¹⁷ https://github.com/angular-slider/angularjs-slider
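The marker logic of the prosopographical map view (births in blue, deaths in red, marker size following the number of events at a place) can be sketched in Python as follows. This is an illustration under stated assumptions, not the portal's implementation: the text does not say how counts are mapped to marker size, so the square-root scaling, the `base_radius` parameter, and the input shape of `(place, kind)` pairs are all hypothetical choices.

```python
import math
from collections import defaultdict


def aggregate_markers(events, base_radius=4.0):
    """Group (place, kind) events by place and derive one marker per place.

    `kind` must be "birth" or "death". The radius formula is an assumption:
    square-root scaling keeps the marker area proportional to the event count.
    """
    per_place = defaultdict(lambda: {"birth": 0, "death": 0})
    for place, kind in events:
        if kind not in ("birth", "death"):
            raise ValueError(f"unsupported event kind: {kind!r}")
        per_place[place][kind] += 1

    markers = {}
    for place, counts in per_place.items():
        total = counts["birth"] + counts["death"]
        markers[place] = {
            "births": counts["birth"],   # drawn in blue in the map view
            "deaths": counts["death"],   # drawn in red in the map view
            "radius": base_radius * math.sqrt(total),
        }
    return markers
```

With four events recorded for one town and one for another, the first marker gets twice the radius of the second, so a place with many births and deaths stands out without covering its neighbours on the map.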

The first selection (Fig. 8a) shows the places of birth and death of Finnish clergy 1554–1721. According to the resulting rendering, the most active areas lie along the Finnish coast, with the main focus on the town of Turku, which during that era was the capital of Finland, and some are scattered around Sweden. The second selection (Fig. 8b) shows the data of Finnish clergy in 1800–1920. The data does not clearly concentrate on the largest towns of Helsinki and Turku, but seems to be scattered evenly around Southern Finland. During that era Finland was a part of the Russian Empire, but there are only a few markers on the Russian side except at the city of St. Petersburg.

(a) The places of birth and death of Finnish clergy 1554–1721.
(b) The places of birth and death of Finnish clergy 1800–1920.
Figure 8: Two different views on the map application.

The Semantic National Biography demonstrator also includes a visualization page showing statistics as in the Norssit alumni case. The column charts in this case show (at the moment) five demographic histograms (with the mean value and standard deviation) of the target group: distribution of ages among the group, ages of marriage, ages of having the first child, the number of children, and the number of spouses.

Two examples of histograms are shown in Fig. 9. The upper one (a) shows the lifespans of people who lived in the 18th century, and the lower one (b) those of people who lived in 1900–1950. The two figures can be compared to see, e.g., how the number of deaths among young children has decreased and how the average lifespan has increased between the two time periods.

(a) Lifespan of people who lived in 1700–1800.
(b) Lifespan of people who lived in 1900–1950.
Figure 9: Two different views of statistical visualizations.

4. Discussion, Related Work, and Future Research

This paper demonstrated how Linked Data can be used as a basis for representing biographical registries and for filtering out target groups of persons of interest. Our particular goal was to show, through a series of examples, how a SPARQL endpoint can be used for data analysis and visualizations in biographical and prosopographical research. According to our practical experience, the technology is useful and convenient for this purpose once the basics of the Linked Data publishing principles have been learned.

Previous work on applying Linked Data technologies to biographical data includes, e.g., Larson (2010), Biographynet.nl¹⁸ (Ockeloen et al., 2013), and our own earlier work (Hyvönen et al., 2014). The conference proceedings (ter Braake et al., 2015) include several papers on bringing biographical data online, on analyzing biographies with computational methods, on group portraits and networks, and on visualizations. Applying Linked Data principles to cultural heritage data (Hyvönen, 2012) and historical research (Meroño-Peñuela et al., 2015) has been a promising approach to solving the problems of isolated and semantically heterogeneous data sources. A number of previous studies also exist on Linked Data visualization (Bikakis and Sellis, 2016; Dadzie and Rowe, 2011).

An important component in representing biographical data is representing people and their networks, so the next part of our work is applying the methods of computational network analysis to the data. Representing biographies as linked data provides several approaches for creating such networks. For example, the biographical texts can be analyzed and people mentioned in text descriptions can be used as links in the person interrelation graph.

¹⁸ http://www.biographynet.nl
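The idea of using people mentioned in biography texts as links in a person interrelation graph can be illustrated with a small Python sketch. It describes planned future work, so nothing here reflects an existing implementation: the input format (a mapping from each biography's subject to the people mentioned in it), the choice of an undirected graph, and ranking by degree are assumptions made for illustration only.

```python
def build_mention_graph(mentions):
    """Build an undirected adjacency map from biography mentions.

    `mentions` maps the subject of a biography to the people mentioned
    in its text (hypothetical input format).
    """
    graph = {}
    for subject, mentioned in mentions.items():
        graph.setdefault(subject, set())  # keep subjects without any links
        for other in mentioned:
            if other == subject:
                continue  # ignore self-mentions
            graph[subject].add(other)
            graph.setdefault(other, set()).add(subject)
    return graph


def degree_ranking(graph):
    """Return people ordered by number of distinct links, ties by name."""
    return sorted(graph, key=lambda person: (-len(graph[person]), person))
```

Given placeholder input such as `{"A": ["B", "C"], "B": ["A"], "C": []}`, the graph links A to both B and C, and `degree_ranking` places A first. Degree is only the simplest centrality measure; the network analysis methods cited in Section 1 offer richer alternatives.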

