Analyzing and Visualizing Prosopographical Linked Data Based on Biographies

Petri Leskinen (1), Eero Hyvönen (1, 2), and Jouni Tuominen (1, 2)
(1) Semantic Computing Research Group (SeCo), Aalto University, Finland
(2) HELDIG – Helsinki Centre for Digital Humanities, University of Helsinki, Finland
http://seco.cs.aalto.fi, http://heldig.fi
firstname.lastname@aalto.fi

Abstract

This paper shows how faceted search on biographical data can be utilized as a flexible basis for filtering target groups of people and, in particular, how generic data analysis and visualization tools can then be applied for solving prosopographical research questions based on the filtered data. This idea is demonstrated and evaluated in practice by presenting two application case studies: 1) linked data extracted from a printed registry of over 10 000 alumni (1867–1992) of the prominent Finnish high school Norssi, and 2) a knowledge graph extracted from 13 000 short biographies of significant Finnish people (from the 3rd century to present times) in the National Biography of Finland. In both cases, the data are enriched by linking their entities to several external datasets.

Keywords: Linked Data, Data Visualization, Biography, Prosopography

1. Prosopographical Method

Biographies describe life stories of particular people of significance, with the aim of getting a better understanding of their personality and actions, e.g., to understand their motives (Roberts, 2002). In contrast, the focus of prosopography is to study life histories of groups of people in order to find out some kind of commonness or average in them (Verboven et al., 2007). For example, the research question may be to find out what happened to the students of a school before World War II in terms of social ranking, employment, or military involvement after their graduation.

The prosopographical research method (Verboven et al., 2007, p. 47) consists of two major steps. First, a target group of people is selected that share desired characteristics for solving the research question at hand. Second, the target group is analyzed, and possibly compared with other groups, in order to solve the research question.

In our earlier paper (Hyvönen et al., 2017) we presented an application case study where data from a printed collection of over 10,000 short biographies (registry entries) of Norssi high school alumni were extracted and transformed into Linked Open Data, enriched by data linking to 10 external data sources, and published in a SPARQL [1] endpoint. A semantic faceted search engine and browser was developed for searching and filtering people and biographies that were enriched with internal and external linking for biographical research. Application of the same idea to the dataset of the Semantic National Biography of Finland (2014–2017) was considered in Hyvönen et al. (2018), and the underlying data model was presented in Leskinen et al. (2017).

This paper extends this line of research by showing how the filtered target group of faceted search can be utilized as a basis for prosopographical research using different kinds of data-analytic tools for solving prosopographical research questions. Such tools may involve, e.g., methods of network analysis (Easley and Kleinberg, 2010; Hanneman and Riddle, 2005) and visualizations (Dadzie and Rowe, 2011; Kehrer and Hauser, 2013).

The main contribution of this paper is to test and demonstrate the prosopographical method in practice by presenting how various data visualization tools using Google Charts and Google Maps can be integrated with the SPARQL endpoint, allowing the end user to filter out target groups of people and biographies, and then to study them. In addition to providing statistical analyses of person groups, an interesting use case identified here is to compare analyses and visualizations based on different subgroups, e.g., people with the same profession during different eras.

The paper is organized as follows. First, prosopographical analyses and visualizations are presented and discussed for the two linked datasets and applications using the approach outlined above: the Norssi high school alumni on the Semantic Web and the Semantic National Biography of Finland. After this, the contributions of the work in relation to related research are summarized and directions for further research are outlined.

2. Norssi Alumni Application

The Norssi alumni data service is available as linked open data at the Linked Data Finland platform [2], including some 892,000 triples about 131,000 resources. The digitization, "lodification", and the Vanhat Norssit Portal [3] are described in more detail in Hyvönen et al. (2017). The datasets consist of 10 137 person resources, enriched with graphs of related career events and family relations, and vocabularies of titles, schools, companies, medals, and hobbies. These additional data were extracted automatically from the short biographical descriptions of a printed book using OCR and text extraction and cleaning tools based on regular expressions.

The ontology model representing people and their biographical information in the Norssit alumni knowledge graph is based on the Bio CRM data model [4] (Tuominen et al., 2018), which has been developed to facilitate and harmonize the representation of biographies and cultural heritage data on the Semantic Web. Bio CRM is a domain-specific extension of CIDOC CRM [5] (Doerr, 2003), the event-based ISO standard for representing and harmonizing Cultural Heritage data. It includes structures for basic data of people, personal relations, professions, and events with participants in different qualified roles. Bio CRM makes a distinction between enduring unary roles of actors, their enduring binary relationships, and perduring events, where the participants can take different roles modeled as a role concept hierarchy. The ontology and data infrastructure used for the Norssi dataset are described in detail in Leskinen et al. (2017).

Figure 1: Faceted search for short biographies in the alumni register Norssit 1867–1992.

The Vanhat Norssit Portal contains two search interfaces, person pages, and two pages for statistical visualizations. The search interface (Fig. 1) is based on SPARQL Faceter (Koho et al., 2016), a tool for creating faceted search interfaces on a SPARQL endpoint. The interface allows the user to filter the results based on, e.g., people's education, profession, place of birth, or the external databases to which a person has been linked.

For analyzing and visualizing data statistics of a filtered target group of people, we created two views based on Google Chart [6] diagrams. On the first visualization page [7], the most common educations (Fig. 2), universities and colleges, professions, and employers of the alumni after graduation are shown as four pie charts. When filtering selections are made on the facets, the graphics are updated accordingly. For example, by selecting "professor" on the profession facet, the employers of the 258 professors in the data can be seen on the employer pie chart. On the same page, there is also a Sankey diagram, depicted in Fig. 3, that shows a list of universities on the left side and the corresponding educational titles (e.g., MSc in Technology, Doctor of Medicine, etc.) on the right. From this visualization one can see which titles were obtained from which universities within the filtered target group. The highlighted path in Fig. 3 shows, e.g., the connection from the University of Helsinki to Bachelor of Arts when no filtering choices have been made.

Figure 2: Pie chart showing the most common educations among high school alumni.

Figure 3: Sankey diagram showing the linkage between the university and the education.

On the second visualization page [8], there are first two histograms showing years of enrollment and matriculation of the target group. Below these, three multi-column charts show the most popular universities and colleges, employers, and occupations of the filtered people on a decade-by-decade basis. For example, from the histogram representing the years of enrollment (Fig. 4) one can see that when education in Norssi was started, a lot of pupils from other schools moved to Norssi (first high bar on the left). The changes made in the Finnish school system in the 1970s are also clearly visible as very low enrollment rates. Fig. 5 depicts the most popular employers. It shows considerable variation in companies and organizations over time: in the late 1800s the Finnish State Railways (Valtion Rautatiet, blue columns) was the most popular employer, but it soon declined, probably because the main railway connections in Finland were built in 1850–1900 [9]. The Finnish Defense Forces (Puolustusvoimat, green columns), on the other hand, peaks during the Second World War. After this, the banking industry and the city of Helsinki became major employers for Norssi alumni.

Figure 4: Column chart showing the number of pupils by enrollment year.

Figure 5: Column chart showing the most common employers.

The facet for links to external datasets also provides an interesting option for selecting target groups. For example, a student in the school may ask herself/himself the question: where should I work if I want to become famous and get an entry in the National Biography? By making the selection "National Biography" on the facet and then looking at the employer multi-column chart, one can get an idea of where to work in order to be included in the National Biography.

The official motto of the Norssi high school is Non scholae sed vitae (not for school, but for life). Data analytics based on the linked data service now provides new insights into what actually happened to the school alumni in life after graduation in a prosopographical sense.

3. Semantic National Biography of Finland

The National Biography of Finland [10] consists of biographies of notable Finnish people throughout history (200–2018). The biographies describe the lives and achievements of these historical and contemporary figures, containing vast amounts of references to notable Finnish and foreign figures, including internal links to other biographies of the National Biography of Finland. In addition, the text contains references to historical events, notable works (such as paintings, books, music, and acting), places (such as place of birth and death), organizations, and dates.

In this case, the texts and data were available in a database in a semi-structured form. As in the Norssi case above, the texts were transformed into RDF form by extracting entities from the semi-structured texts, and the result was uploaded into a SPARQL endpoint of the Linked Data Finland service.

The underlying ontology model represents people and their biographical information. A natural choice for modeling life stories is the event-based approach where a person's life is seen as a sequence of spatio-temporal, possibly interlinked events from birth to death (and beyond). The events are modeled according to the Bio CRM model (Tuominen et al., 2018), and the person ontology is compatible with the Getty ULAN LOD [11] model.

The source data consists (at the moment) of fields extracted from the original database dump in CSV format. In the simplest cases, a data field maps directly to the value of a property, e.g., date or place of birth. However, most of the structured knowledge was extracted from short snippets of text at the end of each biography describing major life events of the protagonist, such as graduation from a university, designing a building, publishing a book, getting an honorary medal, etc. The resulting knowledge graph includes 13 144 people with a biographical description in the National Biography, 51 243 related people mentioned in the biographies, and 977 authors of the biographies. At the moment, the data includes 37 730 births, 25 552 deaths, and 102 300 other biographical events. In addition, there are 51 937 family relations, 4953 places, 3101 occupational titles, and 2938 companies extracted from the source data (Hyvönen et al., 2018). On top of the data service, a search interface (Fig. 6) using the SPARQL Faceter tool (Koho et al., 2016) and the AngularJS [12] framework was created. It can be used for finding individual biographies and for filtering out target groups for prosopography.

Figure 6: Main page of the Finnish National Biography.

For biographical research, we created two tabs for each person entry page: one for the textual description of the person with additional data links, and one for a spatio-temporal visualization of the life events of the person using a map and a timeline. For prosopography, there is 1) a page for studying the events of the target group, and 2) a page for visualizing statistics of the filtered people. The application will be opened to the public in September 2018.

Fig. 7 depicts an example of a person's map-timeline page. There is a chronological list of life events in the left column. Events with known locations are shown on the map, and below there is a timeline showing the timespan of the events. The timeline spans from a person's birth to death, and shows when the career highlights have taken place. There are four horizontal lines in the timeline for separating different categories of biographical events, each represented in a different color: family events (e.g., getting married, having children), career events (e.g., education, professional experience), achievements, and mentions of honor. Corresponding markers on the map follow the same color scheme.

When an event is hovered over on the event list or on the timeline, the corresponding marker on the map gets highlighted. The size of the marker depends on the number of events related to that specific location, so the most important places for a person's career are emphasized. In the example case, the visualization is based on the biography of architect Eliel Saarinen, and Helsinki and Michigan (where he lived his later years) are emphasized. Data about the places in Finland was extracted from the Finnish Gazetteer of Historical Places and Maps (Hipla) databases and data service [13] (Ikkala et al., 2016; Hyvönen et al., 2016). Foreign placenames were linked using the Google Maps APIs [14]. For example, the locations of medieval universities in Europe, towns of the Hanseatic League [15], Finnish mansions, churches, and other well-known buildings were added to the place ontology using the Google services. The place ontology includes locations at different scales, such as countries, towns, villages, and in some cases even buildings with a known address.

Figure 7: Map and timeline showing events related to the Finnish architect Eliel Saarinen.

As for prosopographical research, there are two different views available using Angular Google Maps [16]. The target group can be filtered by using a time span slider [17] that is included as a facet for the user to specify a desired range of years of interest. Other filtering facets include a person's profession, gender, dataset, related companies, related place, and linkage to external databases.

The visualizations depicted in Fig. 8 show the results of a SPARQL query corresponding to the facet selections on Angular Google Maps. The markers on the map show places of birth in blue and places of death in red. The size of the marker corresponds to the number of events that have taken place in that particular location. Clicking on a marker opens a modal window containing a list of people who were born or died at the location.

The first selection (Fig. 8a) shows the places of birth and death of Finnish clergy 1554–1721. According to the resulting rendering, the most active areas lie along the Finnish coast, with the main focus on the town of Turku, which during that era was the capital of Finland, and some markers are scattered around Sweden. The second selection (Fig. 8b) shows the data of Finnish clergy in 1800–1920. The data does not clearly concentrate on the largest towns of Helsinki and Turku, but seems to be scattered evenly around Southern Finland. During that era Finland was a part of the Russian Empire, but there are only a few markers on the Russian side except at the city of St. Petersburg.

Figure 8: Two different views on the map application. (a) The places of birth and death of Finnish clergy 1554–1721. (b) The places of birth and death of Finnish clergy 1800–1920.

The Semantic National Biography demonstrator also includes a visualization page showing statistics as in the Norssit alumni case. The column charts in this case show (at the moment) five demographic histograms (with the mean value and standard deviation) of the target group: distribution of ages among the group, ages of marriage, ages of having the first child, the number of children, and the number of spouses.

Two examples of histograms are shown in Fig. 9. The upper one (a) shows the lifespan of people who lived in the 18th century, and the lower one (b) that of people who lived in 1900–1950. The two figures can be compared to see, e.g., how the number of deaths among young children has decreased and how the average age at death has increased between the two time periods.

Figure 9: Two different views of statistical visualizations. (a) Lifespan of people who lived in 1700–1800. (b) Lifespan of people who lived in 1900–1950.

4. Discussion, Related Work, and Future Research

This paper demonstrated how Linked Data can be used as a basis for representing biographical registries and for filtering out target groups of persons of interest. Our particular goal was to show, by a series of examples, how a SPARQL endpoint can be used for data analysis and visualizations in biographical and prosopographical research. In our practical experience, the technology is useful and convenient for this purpose once the basics of the Linked Data standards and publishing principles have been learned.

Previous works applying Linked Data technologies to biographical data include, e.g., Larson (2010), Biographynet.nl [18] (Ockeloen et al., 2013), and our own earlier work (Hyvönen et al., 2014). The conference proceedings (ter Braake et al., 2015) include several papers on bringing biographical data online, on analyzing biographies with computational methods, on group portraits and networks, and on visualizations. Applying Linked Data principles to cultural heritage data (Hyvönen, 2012) and historical research (Meroño-Peñuela et al., 2015) has been a promising approach to solving the problems of isolated and semantically heterogeneous data sources. A number of previous studies also exist on Linked Data visualization (Bikakis and Sellis, 2016; Dadzie and Rowe, 2011).

An important component in representing biographical data is representing people and their networks, so the next part of our work is applying the methods of computational network analysis to the data. Representing biographies as linked data provides several approaches for creating such networks. For example, the biographical texts can be analyzed and people mentioned in text descriptions can be used as links in the person interrelation graph.

Acknowledgements

The presented research is part of the Severi project [19], funded mainly by Business Finland. Developing the National Biography of Finland is also part of the Open Science and Research Programme [20], funded by the Ministry of Education and Culture of Finland.

Footnotes

[1] SPARQL Protocol and RDF Query Language, https://www.w3.org/TR/sparql11-query/
[2] http://www.ldf.fi/dataset/norssit
[3] http://www.norssit.fi/semweb
[4] http://seco.cs.aalto.fi/projects/biographies/
[5] http://cidoc-crm.org
[6] https://developers.google.com/chart/
[7] http://www.norssit.fi/semweb/#!/visualisointi
[8] http://www.norssit.fi/semweb/#!/visualisointi2
[9] https://en.wikipedia.org/wiki/History_of_rail_transport_in_Finland
[10] https://kansallisbiografia.fi/english/national-biography
[11] http://www.getty.edu/research/tools/vocabularies/lod
[12] http://angularjs.org
[13] http://hipla.fi
[14] http://developers.google.com/maps/
[15] https://www.britannica.com/topic/Hanseatic-League
[16] http://angular-ui.github.io/angular-google-maps/
[17] https://github.com/angular-slider/angularjs-slider
[18] http://www.biographynet.nl
[19] http://seco.cs.aalto.fi/projects/severi
[20] https://openscience.fi

5. References

Nikos Bikakis and Timos Sellis. 2016. Exploration and visualization in the web of big linked data: A survey of the state of the art. In Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference. CEUR Workshop Proceedings, Vol-1558.

Aba Sah Dadzie and Matthew Rowe. 2011. Approaches to visualising Linked Data: A survey. Semantic Web, 2(2):89–124.

Martin Doerr. 2003. The CIDOC CRM – an ontological approach to semantic interoperability of metadata. AI Magazine, 24(3):75–92.

David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press.

Robert A. Hanneman and Mark Riddle. 2005. Introduction to social network methods. University of California, Riverside, CA. http://faculty.ucr.edu/~hanneman/.

Eero Hyvönen, Miika Alonen, Esko Ikkala, and Eetu Mäkelä. 2014. Life stories as event-based linked data: Case Semantic National Biography. In Proceedings of ISWC 2014 Posters & Demonstrations Track. CEUR Workshop Proceedings, October.

Eero Hyvönen. 2012. Publishing and Using Cultural Heritage Linked Data on the Semantic Web. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, Palo Alto, CA, USA.

Eero Hyvönen, Esko Ikkala, and Jouni Tuominen. 2016. Linked data brokering service for historical places and maps. In Proceedings of the 1st Workshop on Humanities in the Semantic Web (WHiSe), pages 39–52. CEUR Workshop Proceedings, Vol-1608.

Eero Hyvönen, Petri Leskinen, Erkki Heino, Jouni Tuominen, and Laura Sirola. 2017. Reassembling and enriching the life stories in printed biographical registers: Norssi high school alumni on the Semantic Web. In Language, Technology and Knowledge. First International Conference, LDK 2017, Galway, Ireland, June 19-20, 2017. Springer-Verlag.

Eero Hyvönen, Petri Leskinen, Minna Tamper, Jouni Tuominen, and Kirsi Keravuori. 2018. Semantic National Biography of Finland. In Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference (DHN 2018), pages 372–385. CEUR Workshop Proceedings, Vol-2084, March.

Esko Ikkala, Jouni Tuominen, and Eero Hyvönen. 2016. Contextualizing historical places in a gazetteer by using historical maps and linked data. In Proceedings of Digital Humanities 2016, Krakow, Poland, short papers, pages 573–577.

Johannes Kehrer and Helwig Hauser. 2013. Visualization and visual analysis of multifaceted scientific data: A survey. IEEE transactions on visualization and computer graphics, 19(3):495–513.

Mikko Koho, Erkki Heino, and Eero Hyvönen. 2016. SPARQL Faceter – Client-side Faceted Search Based on SPARQL. In Raphaël Troncy, Ruben Verborgh, Lyndon Nixon, Thomas Kurz, Kai Schlegel, and Miel Vander Sande, editors, Joint Proc. of the 4th International Workshop on Linked Media and the 3rd Developers Hackshop. CEUR Workshop Proceedings, Vol-1615.

Ray Larson. 2010. Bringing lives to light: Biography in context. Final Project Report, University of Berkeley.

Petri Leskinen, Jouni Tuominen, Erkki Heino, and Eero Hyvönen. 2017. An ontology and data infrastructure for publishing and using biographical linked data. In Proceedings of the Workshop on Humanities in the Semantic Web (WHiSe II), pages 15–26. CEUR Workshop Proceedings, Vol-2014.

Albert Meroño-Peñuela, Ashkan Ashkpour, Marieke Van Erp, Kees Mandemakers, Leen Breure, Andrea Scharnhorst, Stefan Schlobach, and Frank Van Harmelen. 2015. Semantic technologies for historical research: A survey. Semantic Web, 6(6):539–564.

Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor De Boer, Guus Schreiber, and Susan Legêne. 2013. BiographyNet: Managing provenance at multiple levels and from different perspectives. In Proceedings of the 3rd International Conference on Linked Science (LISC'13), pages 59–71. CEUR Workshop Proceedings, Vol-1116.

Brian Roberts. 2002. Biographical Research. Understanding social research. Open University Press.

Serge ter Braake, Ronald Sluijter, Antske Fokkens, Thierry Declerck, and Eveline Wandl-Vogt, editors. 2015. BD2015 Biographical Data in a Digital World 2015. CEUR Workshop Proceedings, Vol-1399.

Jouni Tuominen, Eero Hyvönen, and Petri Leskinen. 2018. Bio CRM: A data model for representing biographical data for prosopographical research. In BD2017 Biographical Data in a Digital World 2017, Proceedings. CEUR Workshop Proceedings.

Koenraad Verboven, Myriam Carlier, and Jan Dumolyn. 2007. A short manual to the art of prosopography. In Prosopography Approaches and Applications. A Handbook, pages 35–70. University of Ghent.
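As a supplementary illustration of the two-step workflow described in Sections 2 and 3 (facet selections narrow the target group through a SPARQL query, and the filtered results are then aggregated for a decade-by-decade multi-column chart), the following Python sketch may be helpful. It is not the portal's implementation: the `ex:` namespace, the property names in `FACET_PROPERTIES`, and both function names are hypothetical placeholders, since the paper does not list the actual schema terms, and fetching results from the endpoint is left out.

```python
from collections import Counter, defaultdict

# Hypothetical facet-to-property mapping. The paper does not list the actual
# property names of the knowledge graphs, so the "ex:" terms are placeholders.
FACET_PROPERTIES = {
    "profession": "ex:profession",
    "employer": "ex:employer",
    "linked_dataset": "ex:linkedDataset",
}


def build_facet_query(selections):
    """Return a SPARQL query selecting people who match every facet selection.

    `selections` maps a facet name to one selected value, given as a
    prefixed name such as "ex:professor" (an assumed identifier).
    """
    patterns = ["?person a ex:Person ."]
    for facet in sorted(selections):
        patterns.append(f"?person {FACET_PROPERTIES[facet]} {selections[facet]} .")
    lines = [
        "PREFIX ex: <http://example.org/schema/>",
        "SELECT DISTINCT ?person WHERE {",
    ]
    lines += ["  " + pattern for pattern in patterns]
    lines.append("}")
    return "\n".join(lines)


def employers_by_decade(rows, top_n=3):
    """Aggregate (year, employer) pairs into per-decade employer counts.

    Returns a dict ordered by decade. Each value lists up to `top_n`
    (employer, count) pairs, most frequent first, which corresponds to
    one column group of a multi-column chart.
    """
    counts = defaultdict(Counter)
    for year, employer in rows:
        counts[year - year % 10][employer] += 1
    return {decade: counts[decade].most_common(top_n) for decade in sorted(counts)}
```

Each additional facet selection adds one triple pattern, so the selections are combined conjunctively, and the dictionary returned by `employers_by_decade` has the decade-keyed shape that a charting library would need for the employer chart.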