Interfaces: Accessing Biographical Data and Metadata
Matthias Reinert, Bernhard Ebneth
Historical Commission at the Bavarian Academy of Sciences and Humanities, Munich;
Head of department “Deutsche Biographie”: Malte Rehbein (University of Passau)
reinert@hk.badw.de, ebneth@ndb.badw.de
Abstract
Based on the principles of interaction and cooperation in the field of biographical lexicography, the paper outlines the design and setup
of interfaces. Interfaces enable access to both individual biographical data entries and sets of entries. This paper briefly describes how
the metadata provided by the historical biographical information system Deutsche Biographie is mapped to semantic formats (RDF) and
graphs. It demonstrates exemplary uses of the APIs in the historical sciences.
Keywords: biographical metadata, interfaces, semantic metadata mapping, graph databases
1 Lexicography as interaction, cooperation, and exchange
Recent biographical dictionaries have emerged out of scientific debate and exchange on the international and interregional levels and in cross-disciplinary discourse. The majority of current biographical dictionaries have moved content – everything from indices to complete articles – to digital media or were set up to be born digital.1
In addition, digital media has made it easy to reference ongoing historiographical research, critical editions, documentary projects, library catalogs, and archival sources.
The common tasks of all lexicographers consist of extending, revising, and correcting the biographical database. Therefore, collaboration through the sharing and linking of information forms the backbone of future biographical dictionaries.
The digital transformation is confronted with the challenge of maintaining formal and scientific standards that have evolved in the past 100 years of lexicography. Current biographical dictionaries share a common set of elements or modules, e.g., names, dates, places of birth and death, family background, biographical description, and references to works, archival resources, secondary literature, portraits and authorship. These could probably be described as a common web ontology.
The similar structure of biographical articles in dictionaries allows similar strategies of visualization – ranging from static genealogical trees to dynamic relations to persons, institutions and places.
Authority files appear to be key to the interfaces of exchange and to ontology patterns and modes of visualization. They were conceived in the bibliographical field and proved valuable in interlinking both persons and concepts (Ebneth and Reinert, 2018).
1 In Europe there are only 9 national dictionaries accessible freely online: Historisches Lexikon der Schweiz – Dizionario storico della Svizzera – Dictionnaire historique de la Suisse (HLS), Neue Deutsche Biographie (NDB), Österreichisches Biographisches Lexikon 1815–1950 (ÖBL) with its second, revised edition from 1815, Slovenska biografija (SBL), Nationaal Biografisch Woordenboek (NBW), Dizionario Biografico degli Italiani (DBI), Biographie Nationale de Belgique (BNB), Nouvelle Biographie Nationale numérisées de Belgique (NBNB), Kansallisbiografia (Finland), Norsk Biografisk Leksikon (NBL), Internetowy Polski Słownik Biograficzny (iPSB). Two others offer paid access: the Oxford Dictionary of National Biography and the Dictionary of Irish Biography.
2 Data Resources
The Deutsche Biographie is a joint endeavor of the Historical Commission at the Bavarian Academy of Sciences and Humanities and the Bavarian State Library in Munich.
Since the late 1990s, several measures have been taken to digitize two series of biographical dictionaries and to establish a website providing access to them.
The latest efforts have targeted cultural institutions, in order to enlarge the number of notable individuals in the corpus (Reinert et al., 2015).
Components
The aggregated database consists of parts differing in provenance, density, and granularity of information:
• Digitized biographical texts of the Allgemeine Deutsche Biographie (ADB, 55 vols. and index vol. published 1875–1912) and Neue Deutsche Biographie (NDB, 26 [A–Vocke] of 28 vols. published since 1953) provide information on individuals, exact dates of birth and death, places of birth, death, and burial, as well as partially encoded information on entities related to the individual. The full text of both series has been digitized and structured in XML (Reinert, 2010).
• The index of persons and families mentioned in ADB and NDB was compiled manually and provides names, years of birth and death, a hierarchy of professions, and references to pages in the printed volumes. Almost all index entries are aligned with the German authority file (Gemeinsame Normdatei GND),2 and further information can be derived from this resource (Hockerts, 2008), (Ebneth, 2009), (Busch and Jordan, 2011).
2 A complete dump is available under CC0 http://www.dnb.de/lds (Pfeifer, 2015).
• The main work base of the editorial office helps to curate the last two volumes that are to appear in print. A subset of this work base that includes all entries on deceased persons with a GND-identifier is merged with the Deutsche Biographie. The dataset includes differences in name spellings, pseudonyms, dates and places of birth and death. The places are rarely harmonized and aligned. Since all entries are provided with a GND-identifier, further information can be derived from this resource (Hockerts, 2012).
• The last component is a set of persons and families – amounting to about 600,000 entries – provided by websites of cooperating partners.3 The data for these entries are imported from the German national authority file GND (Ebneth, 2012), (Kraus et al., 2014).
3 The partnering institutions provide individual personal information on a selection of individuals who are representative for their collecting focus and documentary field. A complete list can be found at https://www.deutsche-biographie.de/partner
Data enhancement
Metadata enhancement is crucial for extended search options. The main approach is to detect entities and determine the linguistic class they belong to, then identify entities against a given database, and finally detect relations and sentiments expressed with regard to them (Jurafsky and Martin, 2016).
The Deutsche Biographie's approach relies heavily on authority files. The index of persons was completely equipped with GND-entries and -identifiers in cooperation with the Bavarian State Library / Munich Digitization Center (Hockerts, 2008), (Busch and Jordan, 2011). This close cooperation makes the Deutsche Biographie a reference for biographical entries in the GND too.
With the appearance of each volume in print, the index is enlarged and enhanced in terms of GND-metadata. As soon as a new volume is published in print form, the previous volume will be transformed to deeply structured XML and put online.
Figure 1: The web version of Albert Einstein's article, https://www.deutsche-biographie.de/sfz68290.html, written by Max von Laue, first appeared in NDB 4 (1959). The RDF version is linked on the left to https://www.deutsche-biographie.de/downloadRDF?url=sfz68290.rdf.
Entity detection
The structure of the articles converted from PDF to XML first covers the main parts of the article, namely the headline, the genealogy, the life summary, and the technical parts listing awards, works, sources, secondary literature, and portraits.
It then deals with entities, like personal names occurring in verb phrases ("interpersonal relations") (Stotz and Reinert, 2013), (Stotz et al., 2015). The strategy of our choice, named "Local Grammars," is described in (Gross, 1997), (Geierhos, 2007), (Geierhos, 2010) and relies on dictionaries as described by (Guenthner and Maier, 1994), (Guenthner and Maier-Meyer, 1996). We set up different dictionaries of partial and complete forms of well-known entities (first names, surnames, names of places, disciplines, fields of study, relevant adjectives and noun phrases).
The "Local Grammars" at hand are capable of describing institutional bodies (enterprises, educational institutions, theatre and music groups, political parties), administrative geographical regions (populated places, regions, countries, religious territories) and the proper names of certain individual places (monasteries, churches, castles).
Finally, biographical accounts refer to highly specific events (see Rospocher et al., 2016). Of these, only the proper names of the most prominent events like congresses, wars, and peace treaties have been directly detected with certain "Local Grammars."
Entity linking/disambiguation
To enhance this database, different strategies were applied to link named entities to the internal database of indexed names. One obvious strategy relied on index entries and calculated scores for identified personal names on a given page in a given section (genealogy or biography), depending on the length of the named entity and any given birth or death dates as compared with those accounted for in the index of names for that page (Reinert et al., 2015).
The second strategy drew on professional descriptors that preceded the named entity and compared them with the most well-known index entry, which assumed that celebrity implied being named or connected with other famous personalities who had also been portrayed in the biographical dictionary.
Another strategy consisted of detecting places of birth, death and burial and linking them to authority files. These place names were prominent in the headlines of each biography of an individual person and taggable by regular expressions. Twelve thousand distinct names were uniquely detectable.
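The first linking strategy can be illustrated with a small sketch. It is not the project's implementation: the `IndexEntry` type, the token-based matching, and the weights are assumptions made for illustration. The source only states that scores depend on the length of the named entity and on birth or death dates agreeing with the index entries for the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index record for a person listed on a printed page."""
    identifier: str
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def score(mention, entry, birth_year=None, death_year=None):
    """Score how well a tagged name matches an index entry (assumed weights)."""
    mention_tokens = set(mention.lower().replace(",", " ").split())
    entry_tokens = set(entry.name.lower().replace(",", " ").split())
    shared = mention_tokens & entry_tokens
    if not shared:
        return 0.0
    # Longer matching names are less ambiguous, so they earn more points.
    points = float(sum(len(token) for token in shared))
    # Agreeing life dates count as strong evidence; the bonus of 10 is arbitrary.
    if birth_year is not None and birth_year == entry.birth_year:
        points += 10.0
    if death_year is not None and death_year == entry.death_year:
        points += 10.0
    return points


def link(mention, page_index, birth_year=None, death_year=None):
    """Return the best-scoring entry of the page index, or None if nothing matches."""
    scored = [(score(mention, e, birth_year, death_year), e) for e in page_index]
    best_score, best_entry = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return best_entry if best_score > 0 else None


page_index = [
    IndexEntry("sfz68290", "Einstein, Albert", 1879, 1955),
    IndexEntry("sfz68291", "Einstein, Alfred", 1880, 1952),
]
full_name = link("Albert Einstein", page_index)
by_date = link("Einstein", page_index, birth_year=1880)
no_match = link("Planck", page_index)
```

In this sketch a bare surname is ambiguous between the two entries, and a birth year mentioned next to the name decides the link.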
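Tagging places in headlines by regular expression might look like the following sketch. The headline layout, the symbols `*`, `†` and `⚰` for birth, death and burial, and the sample headline are assumptions for illustration; the actual headline markup in the XML may differ.

```python
import re

# Assumed layout: a marker, an optional date such as 14.3.1879, then the place name.
EVENT_PATTERN = re.compile(
    r"(?P<marker>[*†⚰])\s*"
    r"(?:(?P<date>\d{1,2}\.\d{1,2}\.\d{4})\s+)?"
    r"(?P<place>[^,;*†⚰(]+)"
)
LABELS = {"*": "birth", "†": "death", "⚰": "burial"}


def tag_places(headline):
    """Return a dict mapping 'birth', 'death' or 'burial' to the tagged place name."""
    places = {}
    for match in EVENT_PATTERN.finditer(headline):
        # Trim the whitespace or full stop left over at the end of the headline.
        place = match.group("place").strip(" .")
        if place:
            places[LABELS[match.group("marker")]] = place
    return places


# Hypothetical headline written in the style of a dictionary entry.
sample = "Einstein, Albert Physiker, * 14.3.1879 Ulm, † 18.4.1955 Princeton (New Jersey, USA)."
tagged = tag_places(sample)
```

The tagged names would then be looked up in an authority file or gazetteer, as the following paragraphs describe.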
Figure 2: Albert Einstein's network up to a distance of 2 with interconnections, circle pack layout (Mike Bostock, in Gephi), coloured by religion/denomination. The red circle on the right contains unidentified names, the green circle on the left represents (well connected) Protestants, the blue circle at the bottom center represents Jewish persons, and the upper middle (not circled) contains persons with a changing or unspecified denomination (like Einstein). Data: https://data.deutsche-biographie.de/graph-open/?id=sfz68290&depth=2 [3.11.2017].
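The idea of a network "up to a distance of 2 with interconnections" can be sketched as a depth-limited breadth-first traversal. This is an illustration under assumptions, not the service's code: relations are treated as undirected, and the identifiers in the sample data are hypothetical.

```python
from collections import deque


def ego_network(relations, ego, depth=2):
    """Collect all persons within `depth` steps of `ego` plus the edges among them.

    `relations` maps a person id to the set of ids it is related to.
    """
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        person = queue.popleft()
        if distance[person] == depth:
            continue  # do not expand beyond the requested distance
        for neighbour in relations.get(person, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    nodes = set(distance)
    # Keep every relation whose two ends are both inside the network,
    # which includes the cross-relations between the ego's contacts.
    edges = {
        tuple(sorted((a, b)))
        for a in nodes
        for b in relations.get(a, ())
        if b in nodes
    }
    return nodes, edges


# Hypothetical toy data: "d" is three steps away from "ego" and must be left out.
toy = {"ego": {"a", "b"}, "a": {"b", "c"}, "b": set(), "c": {"d"}}
nodes, edges = ego_network(toy, "ego", depth=2)
```

The resulting node and edge sets correspond to what a tool such as Gephi would receive for layout and colouring.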
By accessing the web services of OpenStreetMap (OSM)4 and filtering the results by type of place and continent, about one third could be identified directly in OSM.
The majority of remaining places occurred only once in the corpus and had to be aligned to OSM manually. Although OSM does not provide unique IDs, the coordinates and some additional information were reusable.
With the help of the geographical coordinates, about half of these entries could be identified / reverse-located in Geonames.5 Geonames provides unique and stable identifiers, but the qualification of the type of the place differs. Even though over 36,000 GND-entries for place- or region-like administrative territories had been aligned to Geonames, the data was incoherent. The GND referred to administrative divisions in Geonames, while we were using populated places.
4 http://nominatim.openstreetmap.org.
5 http://www.geonames.org/.
Information extraction – summary
Up to now, the following have been recognized and identified (linked to index database) in the Deutsche Biographie:
• 340,000 personal names tagged in the corpus of ADB and NDB, of which about 145,000 are identified as database entries;
• 380,000 place names, of which 12,500 are places of study, 1,900 "places of worship" (religious buildings); over 163,000 place names were identified as an instance of a given place in the database;
• 103,000 relations found in 21,330 biographies (NDB), of which 6,200 are teacher-student relations, 1,500 friendship predicates, about 1,000 predecessor/successor predicates, and another 3,000 membership relations and about 3,100 leadership relations relating to institutions or organizations;
• 12,000 organizational names;
• 8,000 time expressions;
• 11,500 mentions of discipline/fields of study.
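The OSM lookup described above, with its filtering by type of place and continent, can be sketched as follows. The sketch assumes the Nominatim search endpoint with `format=json`, whose records carry `lat`, `lon`, `class` and `type` keys. The bounding box standing in for "continent", the selection of place types, and the sample records are assumptions for illustration, and no network request is made.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

# Rough bounding box for Europe (an assumption made for this sketch).
EUROPE = {"south": 34.0, "north": 72.0, "west": -25.0, "east": 45.0}
# OSM "type" values treated as populated places (an assumed selection).
PLACE_TYPES = {"city", "town", "village", "hamlet"}


def search_url(place_name):
    """Build the query URL for a free-text place search."""
    return NOMINATIM_SEARCH + "?" + urlencode({"q": place_name, "format": "json"})


def filter_candidates(results, box=EUROPE, types=PLACE_TYPES):
    """Keep only populated places whose coordinates fall inside the bounding box."""
    kept = []
    for item in results:
        lat, lon = float(item["lat"]), float(item["lon"])
        in_box = (box["south"] <= lat <= box["north"]
                  and box["west"] <= lon <= box["east"])
        if item.get("class") == "place" and item.get("type") in types and in_box:
            kept.append(item)
    return kept


# Made-up records shaped like Nominatim JSON output.
sample_results = [
    {"lat": "48.40", "lon": "9.99", "class": "place", "type": "city", "display_name": "Ulm"},
    {"lat": "40.35", "lon": "-74.66", "class": "place", "type": "town", "display_name": "Princeton"},
    {"lat": "48.40", "lon": "9.90", "class": "highway", "type": "residential", "display_name": "Ulmer Straße"},
]
candidates = filter_candidates(sample_results)
```

A place name counts as directly identified only if exactly one candidate survives the filter; the coordinates of that candidate can then be used for reverse lookup in Geonames.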
The efforts in geo-locating places led to new search options. Two start pages were introduced that provided a map search and a zoomable geographical distribution of places mentioned in the Deutsche Biographie.
3 Interfaces
The notion of what an interface is ranges from intuitive user interaction surface to machine readable bit-stream. In this paper, interface is understood as a programmable data-delivering web application.
Type, purpose, coverage of the APIs
In Deutsche Biographie we mapped metadata on the entry level. This means that the RDF is directly accessible from the web-version of an individual article (see fig. 1). Another way of mapping relations to a graph format is described below (see sect. 3).
The APIs provided by http://data.deutsche-biographie.de split up into two groups: the Beacon-interface / Beacon-Aggregator and the Solr-based search index cover about 750,000 aggregated entries in the Deutsche Biographie, whereas the RDF-based SPARQL-endpoint and the Neo4J-based graph-database cover only persons and families mentioned in the biographical dictionaries (about 100,000). The number of entries linked to others with an explicitly named relation shrinks to about 23,000, due to the limitations of named entity identification mentioned above.
The Beacon aggregation provided here offers a list of links to resources for a given GND-Identifier (see fig. 3).
Figure 3: The first lines of the Beacon aggregation for Albert Einstein (GND-ID: 118529579) http://data.deutsche-biographie.de/rest/bd/gnd/118529579/alle_de.
Beacons and their aggregation
The Beacon concept examined here was conceived by volunteers and Wikipedia-enthusiasts during a conference in Munich in 2010.6 There is an informal description7 that allows for aggregation of Beacons of different origin. Most Beacon files are announced at the given Wikipedia website, others are hosted by the Historical Commission or at Findbuch.de.8
6 From Reference Work to Information System – Vom Nachschlagewerk zum Informationssystem. Wissenschaftliche Qualitätssicherung und Funktionalitätserweiterung historisch-biographischer Lexika in elektronischen Medien, Munich, 25.-27.2.2010. cf. (Hockerts, 2012), (Ebneth, 2015).
7 http://de.wikipedia.org/wiki/Wikipedia:BEACON/Format and a draft specification by Jakob Voss and Mathias Schindler (https://gbv.github.io/beaconspec/beacon.html, latest version of Nov. 2017).
8 http://beacon.findbuch.de/pnd-aks.
#FORMAT: BEACON
#PREFIX: http://d-nb.info/gnd/
#TARGET: http://www.deutsche-biographie.de/pnd{ID}.html#ndbcontent
118643525
118500015
...
Figure 4: First lines of Beacon file.
Solr
The Solr9 index is configured for combined searches, autocomplete and faceted searches for names, places, and professions. Relations between personal entries are stored in an abbreviated serialized form (see fig. 5). The decision to open the Solr index was supported by similar activities in the library sector,10 reflecting the need for easily accessible APIs.
9 https://solr.apache.org, version 5.4.
10 See https://www.lobid.org – lobid offers search and APIs based on linked open (library) data (LOD). Lobid is maintained by the Hochschulbibliothekszentrum des Landes NRW (North Rhine Westphalia).
...
sfz112378@Verwandt@Einstein, Maria (Maja)
sfz68291@Verwandt@Einstein, Alfred
sfz58393@Verwandt@Maric, Mileva
sfz107772@Verwandt@Maric, Mileva ...
Figure 5: Serialised relations in index entries http://data.deutsche-biographie.de/beta/solr-open/?q=defgnd:118529579.
RDF
Different approaches were proposed in the field of semantic data modelling for biographies:
• a property-based approach is one in which persons are assigned atomic properties that may be organized in classes (cf. GND approach with (Litz et al., 2012),
the "factoids" proposed by (Pasin and Bradley, 2013) and the "aspects" deployed at the Personendatenrepositorium (Walkowski, 2009)).
• event-based ontologies represent a person's life as a set of timespans or events of varying duration. Examples for this approach are the Erlangen CRM,11 (Trame et al., 2013) and the "life-spans" deployed at the Catalogus professorum Lipsiensis, see (Riechert et al., 2010).
The initial data-model of Deutsche Biographie was property based. It draws heavily on the concept of the GND ontology,12 see (Brümmer, 2011). We publish dumps of all generated RDF-triples periodically and serve a Sparql-endpoint on request.
11 Its current version "Erlangen CRM 170309" is based on CIDOC-CRM 6.2.2 http://erlangen-crm.org/170309/.
12 http://d-nb.info/standards/elementset/gnd
Figure 6: Model of NDB-Graph stored in Neo4J. Drawn with yEd.
Graph APIs
The API provides access to a graph database based on Neo4J.13 The graph is modelled according to the index of persons and the structure of the printed volumes and its articles. These hierarchically ordered units of text refer to entries in the person and place databases by means of edges. Persons may refer to a lineage (a group of family-like related individuals) or directly to others by named edges. A person is connected to places and to a profession classification.
A second option makes it possible to export an ego-centered graph, based on the detected relations to identified personal names in the Solr index.14
13 https://www.neo4j.org
14 http://data.deutsche-biographie.de/beta/open-graph This option requires the internal identifier and loads all identified relations to other persons from the Solr-index. Further options allow exports in json and graphml. It covers only biographical entries of the newer NDB-series.
The example presented in fig. 2 shows a data sample for
Albert Einstein and all individuals in his network up to a distance of 2, together with their cross-relations amongst each other. The dataset was easily imported into Gephi,15 colored by an indexed category and visualized using a circle packing layout (Mike Bostock).
15 https://www.gephi.org.
4 Summary
Recent incoming e-mail requests have shown a growing interest in accessing the data. There were genuine research questions (e.g., the network of Martin Luther, investigating German street names), integration tasks (RDF query of metadata for identified individual entries) and bridging tasks (querying personal names in a back-end for an archival CMS).
A proposal has been launched to raise funding for a web-based research laboratory that would work with the metadata encoded and aggregated in the Deutsche Biographie.
5 Acknowledgements
The work was funded by grants from the Deutsche Forschungsgemeinschaft (DFG) for the project "Deutsche Biographie as historical biographical information system for German-speaking areas" in 2012 and 2015. The NDB received funding from the German Research Foundation (Deutsche Forschungsgemeinschaft DFG) for the years 2001-03, 2008-09, and 2010-11.16
The RDF mapping was initially conceived by Martin Brümmer and Thomas Riechert within the framework of the EU project "LOD2-PUBLINK" (see Brümmer, 2011).
16 http://gepris.dfg.de/gepris/projekt/(53346764|165972532|213818920).
6 References
Martin Brümmer. 2011. Realisierung eines RDF-Interfaces für die Neue Deutsche Biographie. In Sören Auer, Johannes Schmidt, and Thomas Riechert, editors, SKIL 2011 – Studentenkonferenz Informatik Leipzig 2011, Leipziger Beiträge zur Informatik Band XXVII, Leipziger Informatik-Verbund (LIV), pages 31–42.
Thomas Busch and Stefan Jordan. 2011. Vernetzte Lebensläufe. Der Einsatz von Normdatenbanken zur Verlinkung biographischer und bibliographischer Angebote im Internet. Geschichte in Wissenschaft und Unterricht, (11/12):684–691.
Bernhard Ebneth and Matthias Reinert. 2018. Potentiale der Deutschen Biographie (www.deutsche-biographie.de) als historisch-biographisches Informationssystem. In Ágoston Zénó Bernád, Christine Gruber, and Maximilian Kaiser, editors, Europa baut auf Biographien. Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 283–295, Wien. new academic press.
Bernhard Ebneth. 2009. Vom digitalen Namenregister zum europäischen Biographie-Portal im Internet. In Martina Schattkowsky and Frank Metasch, editors, Biografische Lexika im Internet, volume 15 of Bausteine aus dem Institut für Sächsische Geschichte u. Volkskunde, pages 13–44. Dresden.
Bernhard Ebneth. 2012. Aktueller Stand der Genealogien in der Neuen Deutschen Biographie – Arbeit mit der Online-Version. In 64. Deutscher Genealogentag, Augsburg.
Bernhard Ebneth. 2015. Auf dem Weg zu einem Historisch-biographischen Informationssystem. Datenintegration und Einsatz von Normdaten am Beispiel der Deutschen Biographie und des Biographie-Portals. Jahrbuch für Universitätsgeschichte, pages 261–290.
Michaela Geierhos. 2007. Grammatik der Menschenbezeichner in biographischen Kontexten. Arbeiten zur Informations- und Sprachverarbeitung. Band 2. München.
Michaela Geierhos. 2010. BiographIE. Klassifikation und Extraktion karrierespezifischer Informationen. Number 05 in Linguistic Resources for Natural Language Processing. Lincom, München.
Maurice Gross. 1997. The Construction of Local Grammars. In E. Roche and Y. Schabès, editors, Finite-State Language Processing, pages 329–354. Cambridge, Mass.
Franz Guenthner and Petra Maier-Meyer. 1996. Das CISLEX-Wörterbuchsystem. In H. Feldweg and E. W. Hinrichs, editors, Lexikon und Text, pages 69–82. Tübingen.
Franz Guenthner and Petra Maier, editors. 1994. Das CISLEX Wörterbuchsystem. München.
Hans Günter Hockerts. 2008. Vom nationalen Denkmal zum biographischen Portal: ADB und NDB. Akademie Aktuell, 33(2):19–22.
Hans Günter Hockerts. 2012. Zertifiziertes biographisches Wissen im Netz. Forschungsnahe Informationsinfrastruktur. Die „Deutsche Biographie“ auf dem Weg zum zentralen historisch-biographischen Informationssystem für den deutschsprachigen Raum. Akademie Aktuell, 37(4):34–36.
Dan Jurafsky and James H. Martin. 2016. Speech and Language Processing.
Hans-Christof Kraus, Marco Jorio, Martina Schattkowsky, Bernhard Ebneth, Matthias Reinert, Thierry Declerck, Christine Gruber, and Eva Wandl-Vogt. 2014. Sektion: Vernetzung von historisch-biographischen Lexika und Fachportalen im Linked (Open) Data Framework. In 1. Jahrestagung der Digital Humanities im deutschsprachigen Raum (DHd 2014), Universität Passau, 25.-28.3.2014.
Berenike Litz, Aenne Löhden, Jan Hannemann, and Lars Svensson. 2012. AgRelOn – An Agent Relationship Ontology. In Juan Manuel Dodero, Manuel Palomo-Duarte, and Pythagoras Karampiperis, editors, Research Conference on Metadata and Semantic Research, volume 6, pages 202–212. https://de.slideshare.net/larsgsvensson/agrelon-an-agent-relationship-ontology.
Michele Pasin and John Bradley. 2013. Factoid-Based Prosopography and Computer Ontologies: Towards an Integrated Approach. Literary and Linguistic Computing.
Barbara Pfeifer. 2015. Über Zweck und Nutzen der Gemeinsamen Normdatei (GND). Jahrbuch für Universitätsgeschichte, 16:251–259.
Matthias Reinert, Maximilian Schrott, and Bernhard Ebneth. 2015. From Biographies to Data Curation – The Making of www.deutsche-biographie.de. In Biographical Data in a Digital World (BD), pages 13–19, Amsterdam.
Matthias Reinert. 2010. Biographisches Wissen auf einen
Klick. Akademie Aktuell, 35(4):44–46.
Thomas Riechert, Ulf Morgenstern, Sören Auer, Sebastian
Tramp, and Michael Martin. 2010. The Catalogus Pro-
fessorum Lipsiensis – Semantics-Based Collaboration
and Exploration for Historians. In Proceedings of the 9th
International Semantic Web Conference (ISWC2010),
Shanghai/China.
Marco Rospocher, Marieke van Erp, Piek Vossen, Antske
Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa,
Thomas Ploeger, and Tessel Bogaard. 2016. Building
Event-Centric Knowledge Graphs from News. Web
Semantics: Science, Services and Agents on the World
Wide Web, 37-38:132–151, March.
Sophia Stotz and Matthias Reinert. 2013. Detecting
and Encoding Interpersonal Relations with Unitex/Local
Grammars. Université Paris Est-Marne-la-Vallée.
Sophia Stotz, Valentina Stuß, Matthias Reinert, and
Maximilian Schrott. 2015. Interpersonal Relations
in Biographical Dictionaries. A Case Study. In
Biographical Data in a Digital World (BD), pages 74–
80.
Johannes Trame, Carsten Keßler, and Werner Kuhn.
2013. Linked Data and Time — Modeling Researcher
Life Lines by Events. In Proceedings of the 11th
International Conference on Spatial Information Theory
- Volume 8116, COSIT 2013, pages 205–223, New York,
NY, USA. Springer-Verlag New York, Inc.
Niels-Oliver Walkowski. 2009. Zur Problematik der
Strukturierung und Abbildung von Personendaten in
digitalen Systemen. In Workshop Personendateien der
Arbeitsgruppe „Elektronisches Publizieren“ der Union
der deutschen Akademien der Wissenschaften.