Interfaces: Accessing Biographical Data and Metadata

Matthias Reinert, Bernhard Ebneth
Historical Commission at the Bavarian Academy of Sciences and Humanities, Munich; Head of department “Deutsche Biographie”: Malte Rehbein (University of Passau)
reinert@hk.badw.de, ebneth@ndb.badw.de

Abstract
Based on the principles of interaction and cooperation in the field of biographical lexicography, the paper outlines the design and setup of interfaces. Interfaces enable access both to individual biographical entries and to sets of entries. The paper briefly describes how metadata provided by the historical biographical information system Deutsche Biographie is mapped to semantic formats (RDF) and graphs. It demonstrates exemplary uses of the APIs in the historical sciences.

Keywords: biographical metadata, interfaces, semantic metadata mapping, graph databases

1 Lexicography as interaction, cooperation, and exchange

Recent biographical dictionaries have emerged out of scientific debate and exchange on the international and interregional levels and in cross-disciplinary discourse. The majority of current biographical dictionaries have moved content (everything from indices to complete articles) to digital media or were set up to be born digital.[1]

In addition, digital media has made it easy to reference ongoing historiographical research, critical editions, documentary projects, library catalogs, and archival sources.

The common tasks of all lexicographers consist of extending, revising, and correcting the biographical database. Therefore, collaboration through the sharing and linking of information forms the backbone of future biographical dictionaries.

The digital transformation is confronted with the challenge of maintaining formal and scientific standards that have evolved in the past 100 years of lexicography. Current biographical dictionaries share a common set of elements or modules, e.g., names, dates, places of birth and death, family background, biographical description, and references to works, archival resources, secondary literature, portraits and authorship. These could probably be described as a common web ontology.

The similar structure of biographical articles in dictionaries allows similar strategies of visualization, ranging from static genealogical trees to dynamic relations to persons, institutions and places.

Authority files appear to be key to the interfaces of exchange and to ontology patterns and modes of visualization. They were conceived in the bibliographical field and proved valuable in interlinking both persons and concepts (Ebneth and Reinert, 2018).

[1] In Europe there are only 9 national dictionaries accessible freely online: Historisches Lexikon der Schweiz – Dizionario storico della Svizzera – Dictionnaire historique de la Suisse (HLS), Neue Deutsche Biographie (NDB), Österreichisches Biographisches Lexikon 1815–1950 (ÖBL) with a 2nd revised edition from 1815 onward, Slovenska biografija (SBL), Nationaal Biografisch Woordenboek (NBW), Dizionario Biografico degli Italiani (DBI), Biographie Nationale de Belgique (BNB), Nouvelle Biographie Nationale numérisées de Belgique (NBNB), Kansallisbiografia (Finland), Norsk Biografisk Leksikon (NBL), Internetowy Polski Słownik Biograficzny (iPSB). Two others offer paid access: the Oxford Dictionary of National Biography and the Dictionary of Irish Biography.

2 Data Resources

The Deutsche Biographie is a joint endeavor of the Historical Commission at the Bavarian Academy of Sciences and Humanities and the Bavarian State Library in Munich.

Since the late 1990s, several measures have been taken to digitize two series of biographical dictionaries and to establish a website providing access to them. The latest efforts have targeted cultural institutions, in order to enlarge the number of notable individuals in the corpus (Reinert et al., 2015).

Components

The aggregated database consists of parts differing in provenance, density, and granularity of information:

• Digitized biographical texts of the Allgemeine Deutsche Biographie (ADB, 55 vols. and index vol. published 1875–1912) and Neue Deutsche Biographie (NDB, 26 [A–Vocke] of 28 vols. published since 1953) provide information on individuals, exact dates of birth and death, places of birth, death, and burial, as well as partially encoded information on entities related to the individual. The full text of both series has been digitized and structured in XML (Reinert, 2010).

• The index of persons and families mentioned in ADB and NDB was compiled manually and provides names, years of birth and death, a hierarchy of professions, and references to pages in the printed volumes. Almost all index entries are aligned with the German authority file (Gemeinsame Normdatei GND),[2] and further information can be derived from this resource (Hockerts, 2008), (Ebneth, 2009), (Busch and Jordan, 2011).

• The main work base of the editorial office helps to curate the last two volumes that are to appear in print. A subset of this work base that includes all entries on deceased persons with a GND-identifier is merged with the Deutsche Biographie. The dataset includes differences in name spellings, pseudonyms, dates and places of birth and death. The places are rarely harmonized and aligned. As all entries are provided with a GND-identifier, further information can be derived from this resource (Hockerts, 2012).

• The last component is a set of persons and families (about 600,000 entries) provided by websites of cooperating partners.[3] The data for these entries are imported from the German national authority file GND (Ebneth, 2012), (Kraus et al., 2014).

Data enhancement

Metadata enhancement is crucial for extended search options. The main approach is to detect entities and determine the linguistic class they belong to, then identify entities against a given database, and finally detect relations and sentiments expressed with regard to them (Jurafsky and Martin, 2016).

The Deutsche Biographie’s approach relies heavily on authority files. The index of persons was completely equipped with GND-entries and -identifiers in cooperation with the Bavarian State Library / Munich Digitization Center (Hockerts, 2008), (Busch and Jordan, 2011). This close cooperation makes the Deutsche Biographie a reference for biographical entries in the GND too.

With the appearance of each volume in print, the index is enlarged and enhanced in terms of GND-metadata. As soon as a new volume is published in print form, the previous volume will be transformed to deeply structured XML and put online.

Entity detection

The structure of the articles converted from PDF to XML first covers the main parts of the article, namely the headline, the genealogy, the life summary, and the technical parts listing awards, works, sources, secondary literature, and portraits.

It then deals with entities, like personal names occurring in verb phrases (“interpersonal relations”) (Stotz and Reinert, 2013), (Stotz et al., 2015). The strategy of our choice, named “Local Grammars,” is described in (Gross, 1997), (Geierhos, 2007), (Geierhos, 2010) and relies on dictionaries as described by (Guenthner and Maier, 1994), (Guenthner and Maier-Meyer, 1996). We set up different dictionaries of partial and complete forms of well-known entities (first names, surnames, names of places, disciplines, fields of study, relevant adjectives and noun phrases).

The “Local Grammars” at hand are capable of describing institutional bodies (enterprises, educational institutions, theatre and music groups, political parties), administrative geographical regions (populated places, regions, countries, religious territories) and the proper names of certain individual places (monasteries, churches, castles).

Finally, biographical accounts refer to highly specific events (see (Rospocher et al., 2016)). Of these, only the proper names of the most prominent events like congresses, wars, and peace treaties have been directly detected with certain “Local Grammars.”

Figure 1: The web version of Albert Einstein’s article, https://www.deutsche-biographie.de/sfz68290.html, written by Max von Laue, first appeared in NDB 4 (1959). The RDF version is linked on the left: https://www.deutsche-biographie.de/downloadRDF?url=sfz68290.rdf.

Entity linking/disambiguation

To enhance this database, different strategies were applied to link named entities to the internal database of indexed names. One obvious strategy relied on index entries and calculated scores for identified personal names on a given page in a given section (genealogy or biography), depending on the length of the named entity and any given birth or death dates as compared with those accounted for in the index of names for that page (Reinert et al., 2015).

The second strategy drew on professional descriptors that preceded the named entity and compared them with the most well-known index entry, which assumed that celebrity implied being named or connected with other famous personalities who had also been portrayed in the biographical dictionary.

Another strategy consisted of detecting places of birth, death and burial and linking them to authority files. These place names were prominent in the headlines of each biography of an individual person and taggable by regular expressions. Twelve thousand distinct names were uniquely detectable.

By accessing the web services of OpenStreetMap (OSM)[4] and filtering the results by type of place and continent, about one third could be identified directly in OSM. The majority of remaining places occurred only once in the corpus and had to be aligned to OSM manually. Although OSM does not provide unique IDs, the coordinates and some additional information were reusable.

With the help of the geographical coordinates, about half of these entries could be identified / reverse-located in Geonames.[5] Geonames provides unique and stable identifiers, but the qualification of the type of the place differs. Even though over 36,000 GND-entries for place- or region-like administrative territories had been aligned to Geonames, the data was incoherent. The GND referred to administrative divisions in Geonames, while we were using populated places.

Figure 2: Albert Einstein’s network up to a distance of 2 with interconnections, circle pack layout (Mike Bostock, in Gephi) coloured by religion/denomination. The red/right circle contains unidentified names, the green/left circle represents (well connected) Protestants, the blue/center bottom circle represents Jewish persons, and the upper middle (not circled) contains persons who changed denomination or for whom none is given (like Einstein). Data: https://data.deutsche-biographie.de/graph-open/?id=sfz68290&depth=2 [3.11.2017].

Information extraction: summary

Up to now, the following have been recognized and identified (linked to index database) in the Deutsche Biographie:

• 340,000 personal names tagged in the corpus of ADB and NDB, of which about 145,000 are identified as database entries;
• 380,000 place names, of which 12,500 are places of study, 1,900 “places of worship” (religious buildings); over 163,000 place names were identified as an instance of a given place in the database;
• 103,000 relations found in 21,330 biographies (NDB), of which 6,200 are teacher-student relations, 1,500 friendship predicates, about 1,000 predecessor/successor predicates, and another 3,000 membership relations and about 3,100 leadership relations relating to institutions or organizations;
• 12,000 organizational names;
• 8,000 time expressions;
• 11,500 mentions of discipline/fields of study.

The efforts in geo-locating places led to new search options. Two start pages were introduced that provided a map search and a zoomable geographical distribution of places mentioned in the Deutsche Biographie.

[2] A complete dump is available under CC0 at http://www.dnb.de/lds (Pfeifer, 2015).
[3] The partnering institutions provide individual personal information on a selection of individuals who are representative for their collecting focus and documentary field. A complete list can be found at https://www.deutsche-biographie.de/partner
[4] http://nominatim.openstreetmap.org.
[5] http://www.geonames.org/.

3 Interfaces

The notion of what an interface is ranges from intuitive user interaction surface to machine readable bit-stream. In this paper, interface is understood as a programmable data-delivering web application.

Type, purpose, coverage of the APIs

In Deutsche Biographie we mapped metadata on the entry level. This means that the RDF is directly accessible from the web-version of an individual article (see fig. 1). Another way of mapping relations to a graph format is described below (see sect. 3).

The APIs provided by http://data.deutsche-biographie.de split up into two groups: the Beacon-interface / Beacon-Aggregator and the Solr-based search index cover about 750,000 aggregated entries in the Deutsche Biographie, whereas the RDF-based SPARQL-endpoint and the Neo4J-based graph-database cover only persons and families mentioned in the biographical dictionaries (about 100,000). The number of entries linked to others with an explicitly named relation shrinks to about 23,000, due to the limitations of named entity identification mentioned above.

The Beacon aggregation provided here offers a list of links to resources for a given GND-Identifier (see fig. 3).

Figure 3: The first lines of the Beacon aggregation for Albert Einstein (GND-ID: 118529579), http://data.deutsche-biographie.de/rest/bd/gnd/118529579/alle_de.

Beacons and their aggregation

The Beacon concept examined here was conceived by volunteers and Wikipedia-enthusiasts during a conference in Munich in 2010.[6] There is an informal description[7] that allows for aggregation of Beacons of different origin. Most Beacon files are announced at the given Wikipedia website, others are hosted by the Historical Commission or at Findbuch.de.[8]

#FORMAT: BEACON
#PREFIX: http://d-nb.info/gnd/
#TARGET: http://www.deutsche-biographie.de/pnd{ID}.html#ndbcontent
118643525
118500015
...

Figure 4: First lines of Beacon file.

Solr

The Solr[9] index is configured for combined searches, auto-complete and faceted searches for names, places, and professions. Relations between personal entries are stored in an abbreviated serialized form (see fig. 5). The decision to open the Solr index was supported by similar activities in the library sector,[10] reflecting the need for easily accessible APIs.

...
sfz112378@Verwandt@Einstein, Maria (Maja)
sfz68291@Verwandt@Einstein, Alfred
sfz58393@Verwandt@Maric, Mileva
sfz107772@Verwandt@Maric, Mileva
...

Figure 5: Serialised relations in index entries, http://data.deutsche-biographie.de/beta/solr-open/?q=defgnd:118529579.

RDF

Different approaches were proposed in the field of semantic data modelling for biographies:

• a property-based approach is one in which persons are assigned atomic properties that may be organized in classes (cf. the GND approach with (Litz et al., 2012), the “factoids” proposed by (Pasin and Bradley, 2013) and the “aspects” deployed at the Personendatenrepositorium (Walkowski, 2009)).
• event-based ontologies represent a person’s life as a set of timespans or events of varying duration. Examples for this approach are the Erlangen CRM,[11] (Trame et al., 2013) and the “life-spans” deployed at the Catalogus professorum Lipsiensis, see (Riechert et al., 2010).

The initial data-model of Deutsche Biographie was property based. It draws heavily on the concept of the GND element set,[12] see (Brümmer, 2011). We publish dumps of all generated RDF-triples periodically and serve a SPARQL-endpoint on request.

Graph APIs

The API provides access to a graph database based on Neo4J.[13] The graph is modelled according to the index of persons and the structure of the printed volumes and its articles. These hierarchically ordered units () of text refer to entries in the person and place databases by means of edges (). Persons may refer to a lineage (a group of family-like related individuals) or directly to others by named edges (). A person is connected to places () and to a profession classification ().

Figure 6: Model of NDB-Graph stored in Neo4J. Drawn with yEd.

A second option makes it possible to export an ego-centered graph, based on the detected relations to identified personal names in the Solr index.[14] The example presented in fig. 2 shows a data sample for Albert Einstein and all individuals in his network up to a distance of 2, together with their cross-relations amongst each other. The dataset was easily imported into Gephi,[15] colored by an indexed category and visualized using a circle packing layout (Mike Bostock).

[6] From Reference Work to Information System – Vom Nachschlagewerk zum Informationssystem. Wissenschaftliche Qualitätssicherung und Funktionalitätserweiterung historisch-biographischer Lexika in elektronischen Medien, Munich, 25.-27.2.2010. Cf. (Hockerts, 2012), (Ebneth, 2015).
[7] http://de.wikipedia.org/wiki/Wikipedia:BEACON/Format and a draft specification by Jakob Voss and Mathias Schindler, https://gbv.github.io/beaconspec/beacon.html (latest version of Nov. 2017).
[8] http://beacon.findbuch.de/pnd-aks.
[9] https://solr.apache.org, version 5.4.
[10] See https://www.lobid.org; lobid offers search and APIs based on linked open (library) data (LOD). Lobid is maintained by the Hochschulbibliothekszentrum des Landes NRW (North Rhine Westphalia).
[11] Its current version “Erlangen CRM 170309” is based on CIDOC-CRM 6.2.2, http://erlangen-crm.org/170309/.
[12] http://d-nb.info/standards/elementset/gnd
[13] https://www.neo4j.org
[14] http://data.deutsche-biographie.de/beta/open-graph This option requires the internal identifier and loads all identified relations to other persons from the Solr-index. Further options allow exports in json and graphml. It covers only biographical entries of the newer NDB-series.
[15] https://www.gephi.org.

4 Summary

Recent incoming e-mail requests have shown a growing interest in accessing the data. There were genuine research questions (e.g., the network of Martin Luther, investigating German street names), integration tasks (RDF query of metadata for identified individual entries) and bridging tasks (querying personal names in a back-end for an archival CMS).

A proposal has been launched to raise funding for a web-based research laboratory that would work with the metadata encoded and aggregated in the Deutsche Biographie.

5 Acknowledgements

The work was funded by grants from the Deutsche Forschungsgemeinschaft (DFG) for the project “Deutsche Biographie as historical biographical information system for German-speaking areas” in 2012 and 2015. The NDB received funding from the German Research Foundation (Deutsche Forschungsgemeinschaft DFG) for the years 2001-03, 2008-09, and 2010-11.[16]

The RDF mapping was initially conceived by Martin Brümmer and Thomas Riechert within the framework of the EU project “LOD2-PUBLINK” (see (Brümmer, 2011)).

[16] http://gepris.dfg.de/gepris/projekt/(53346764|165972532|213818920).

6 References

Martin Brümmer. 2011. Realisierung eines RDF-Interfaces für die Neue Deutsche Biographie. In Sören Auer, Johannes Schmidt, and Thomas Reichert, editors, SKIL 2011 – Studentenkonferenz Informatik Leipzig 2011, Leipziger Beiträge zur Informatik Band XXVII, Leipziger Informatik-Verbund (LIV), pages 31–42.

Thomas Busch and Stefan Jordan. 2011. Vernetzte Lebensläufe. Der Einsatz von Normdatenbanken zur Verlinkung biographischer und bibliographischer Angebote im Internet. Geschichte in Wissenschaft und Unterricht, (11/12):684–691.

Bernhard Ebneth and Matthias Reinert. 2018. Potentiale der Deutschen Biographie (www.deutsche-biographie.de) als historisch-biographisches Informationssystem. In Ágoston Zénó Bernád, Christine Gruber, and Maximilian Kaiser, editors, Europa baut auf Biographien. Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 283–295, Wien. new academic press.

Bernhard Ebneth. 2009. Vom digitalen Namenregister zum europäischen Biographie-Portal im Internet. In Martina Schattkowsky and Frank Metasch, editors, Biografische Lexika im Internet, volume 15 of Bausteine aus dem Institut für Sächsische Geschichte u. Volkskunde, pages 13–44. Dresden.

Bernhard Ebneth. 2012. Aktueller Stand der Genealogien in der Neuen Deutschen Biographie – Arbeit mit der Online-Version. In 64. Deutscher Genealogentag, Augsburg.

Bernhard Ebneth. 2015. Auf dem Weg zu einem Historisch-biographischen Informationssystem. Datenintegration und Einsatz von Normdaten am Beispiel der Deutschen Biographie und des Biographie-Portals. Jahrbuch für Universitätsgeschichte, pages 261–290.

Michaela Geierhos. 2007. Grammatik der Menschenbezeichner in biographischen Kontexten. Arbeiten zur Informations- und Sprachverarbeitung. Band 2. München.

Michaela Geierhos. 2010. BiographIE. Klassifikation und Extraktion karrierespezifischer Informationen. Number 05 in Linguistic Resources for Natural Language Processing. Lincom, München.

Maurice Gross. 1997. The Construction of Local Grammars. In E. Roche and Y. Schabès, editors, Finite-State Language Processing, pages 329–354. Cambridge, Mass.

Franz Guenthner and Petra Maier-Meyer. 1996. Das CISLEX-Wörterbuchsystem. In H. Feldweg and E. W. Hinrichs, editors, Lexikon und Text, pages 69–82. Tübingen.

Franz Guenthner and Petra Maier, editors. 1994. Das CISLEX Wörterbuchsystem. München.

Hans Günter Hockerts. 2008. Vom nationalen Denkmal zum biographischen Portal: ADB und NDB. Akademie Aktuell, 33(2):19–22.

Hans Günter Hockerts. 2012. Zertifiziertes biographisches Wissen im Netz. Forschungsnahe Informationsinfrastruktur. Die „Deutsche Biographie“ auf dem Weg zum zentralen historisch-biographischen Informationssystem für den deutschsprachigen Raum. Akademie Aktuell, 37(4):34–36.

Dan Jurafsky and James H. Martin. 2016. Speech and Language Processing.

Hans-Christof Kraus, Marco Jorio, Martina Schattkowsky, Bernhard Ebneth, Matthias Reinert, Thierry Declerck, Christine Gruber, and Eva Wandl-Vogt. 2014. Sektion: Vernetzung von historisch-biographischen Lexika und Fachportalen im Linked (Open) Data Framework. In 1. Jahrestagung der Digital Humanities im deutschsprachigen Raum (DHd 2014), Universität Passau, 25.-28.3.2014.

Berenike Litz, Aenne Löhden, Jan Hannemann, and Lars Svensson. 2012. AgRelOn – An Agent Relationship Ontology. In Juan Manuel Dodero, Manuel Palomo-Duarte, and Pythagoras Karampiperis, editors, Research Conference on Metadata and Semantic Research, volume 6, pages 202–212. https://de.slideshare.net/larsgsvensson/agrelon-an-agent-relationship-ontology.

Michele Pasin and John Bradley. 2013. Factoid-Based Prosopography and Computer Ontologies: Towards an Integrated Approach. Literary and Linguistic Computing.

Barbara Pfeifer. 2015. Über Zweck und Nutzen der Gemeinsamen Normdatei (GND). Jahrbuch für Universitätsgeschichte, 16:251–259.
Matthias Reinert, Maximilian Schrott, and Bernhard Ebneth. 2015. From Biographies to Data Curation – The Making of www.deutsche-biographie.de. In Biographical Data in a Digital World (BD), pages 13–19, Amsterdam.

Matthias Reinert. 2010. Biographisches Wissen auf einen Klick. Akademie Aktuell, 35(4):44–46.

Thomas Riechert, Ulf Morgenstern, Sören Auer, Sebastian Tramp, and Michael Martin. 2010. The Catalogus Professorum Lipsiensis – Semantics-Based Collaboration and Exploration for Historians. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), Shanghai/China.

Marco Rospocher, Marieke van Erp, Piek Vossen, Antske Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa, Thomas Ploeger, and Tessel Bogaard. 2016. Building Event-Centric Knowledge Graphs from News. Web Semantics: Science, Services and Agents on the World Wide Web, 37-38:132–151, March.

Sophia Stotz and Matthias Reinert. 2013. Detecting and Encoding Interpersonal Relations with Unitex/Local Grammars. Université Paris Est-Marne-la-Vallée.

Sophia Stotz, Valentina Stuß, Matthias Reinert, and Maximilian Schrott. 2015. Interpersonal Relations in Biographical Dictionaries. A Case Study. In Biographical Data in a Digital World (BD), pages 74–80.

Johannes Trame, Carsten Keßler, and Werner Kuhn. 2013. Linked Data and Time – Modeling Researcher Life Lines by Events. In Proceedings of the 11th International Conference on Spatial Information Theory - Volume 8116, COSIT 2013, pages 205–223, New York, NY, USA. Springer-Verlag New York, Inc.

Niels-Oliver Walkowski. 2009. Zur Problematik der Strukturierung und Abbildung von Personendaten in digitalen Systemen. In Workshop Personendateien der Arbeitsgruppe „Elektronisches Publizieren“ der Union der deutschen Akademien der Wissenschaften.
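
Section 3 shows two plain-text serializations: the Beacon header with one identifier per line (Figure 4) and the relation strings stored in the Solr index (Figure 5). Both can be read with a few lines of code. The following Python sketch is only an illustration under stated assumptions and is not the implementation used by the Deutsche Biographie: the names Beacon, parse_beacon and parse_relation are hypothetical, the handling of "|"-separated fields in Beacon lines is an assumption based on the draft specification, and the relation strings are assumed to consist of exactly three "@"-separated parts (internal identifier, relation label, name), as the samples in Figure 5 suggest. "Verwandt" is the German label for "related".

```python
from dataclasses import dataclass, field


@dataclass
class Beacon:
    """Header fields and identifiers read from a Beacon file (hypothetical helper)."""
    meta: dict = field(default_factory=dict)
    ids: list = field(default_factory=list)

    def link(self, ident: str) -> str:
        """Expand the #TARGET template for one identifier."""
        return self.meta["TARGET"].replace("{ID}", ident)


def parse_beacon(text: str) -> Beacon:
    """Read '#KEY: value' header lines and one identifier per data line."""
    beacon = Beacon()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Split at the first colon only, so URLs in the value stay intact.
            key, _, value = line[1:].partition(":")
            beacon.meta[key.strip().upper()] = value.strip()
        else:
            # Assumption: only the first "|"-separated field (the identifier) is needed.
            beacon.ids.append(line.split("|", 1)[0].strip())
    return beacon


def parse_relation(entry: str) -> tuple:
    """Split 'identifier@relation@name' (format assumed from Figure 5) into three parts."""
    target_id, relation, name = entry.split("@", 2)
    return target_id, relation, name.strip()


# Sample data copied from Figures 4 and 5.
SAMPLE = """#FORMAT: BEACON
#PREFIX: http://d-nb.info/gnd/
#TARGET: http://www.deutsche-biographie.de/pnd{ID}.html#ndbcontent
118643525
118500015
"""

beacon = parse_beacon(SAMPLE)
for ident in beacon.ids:
    print(beacon.meta["PREFIX"] + ident, "->", beacon.link(ident))
print(parse_relation("sfz58393@Verwandt@Maric, Mileva"))
```

Keeping the header in a plain dictionary means that header fields not shown in Figure 4 are retained rather than rejected, which seems a reasonable choice when aggregating Beacon files of different origin.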