=Paper= {{Paper |id=Vol-2119/paper13 |storemode=property |title=Connecting People Across Borders: a Repository for Biographical Data Models |pdfUrl=https://ceur-ws.org/Vol-2119/paper13.pdf |volume=Vol-2119 |authors=Antske Fokkens,Serge ter Braake |dblpUrl=https://dblp.org/rec/conf/bd/FokkensB17 }} ==Connecting People Across Borders: a Repository for Biographical Data Models== https://ceur-ws.org/Vol-2119/paper13.pdf
Connecting People Across Borders: a Repository for Biographical Data Models
                                         Antske Fokkens and Serge ter Braake
                     CLTL, Vrije Universiteit Amsterdam / Media Studies, University of Amsterdam
           De Boelelaan 1105 1081 HV Amsterdam, the Netherlands / Turfdraagsterpad 9, 1012 XT Amsterdam
                                  antske.fokkens@vu.nl,sergeterbraake@gmail.com

                                                              Abstract
This paper proposes a practical approach for sharing knowledge about biographical data models while circumventing copyright issues.
We furthermore provide the main observations of a study analyzing the data structures of eight biographical resources, two platforms for
biographical information and four biographical data models. We outline an approach for designing a generic model that can be used for
linking information from different models despite differences in structure.

Keywords: Biographical data models, RDF


1 Introduction

The biography genre has a long history. Plutarch (45-ca. 120 AD) is often considered the father of the biography. He not only provided syntheses of people’s lives, but also tried to compare each of them to a similar person in a ‘double biography’. Apart from length, there is a further difference between such full length biographies and biographical entries in biographical dictionaries. Biographies in biographical dictionaries tend to be more factual. They provide a chronicle of the lives of noteworthy people, without necessarily giving much attention to social environment, political circumstances or comparisons to other people. Full length biographies paint a biographical narrative, while short biographical entries in biographical dictionaries or encyclopedias such as Wikipedia mostly provide ‘biographical data’. These biographical data can be the building blocks for full length biographies, or they can serve to build group portraits and systematically compare people (see also Harrison (2004)).

Over the past twenty years the amount of ‘biographical data’ available online has increased rapidly, with the advent of the Internet and large digitization projects. The potential for biographical research, network analysis and group portraits seems to be endless when all of this data can be linked and shared for analysis (e.g. Fokkens et al., 2017; Arthur, 2017).

Projects aiming at making biographical data available first need to address the question of how to represent this data. Individual projects have dealt with this issue in different ways. Where some introduced or reused formally defined models, others used basic approaches using comma-separated values to represent the information most commonly provided by the original resource.

Because many projects did not consider data representation a central issue in their digitization efforts, the number of publications about this part of the process remained limited and, as a consequence, knowledge about existing models and best practices for modeling biographical data is not sufficiently shared. This resulted in two challenges for researchers working with biographical data. First, researchers working on new digitization projects for biographical data are ‘reinventing the wheel’ and run into the same problems others have dealt with before them. Second, most biographical datasets have their own data representation, making it challenging to carry out research across datasets.

A few examples of successful integration of biographical data and standardization of metadata from different sources are the national Australian Dictionary of National Biography,¹ the Biography Portal of the Netherlands,² the transnational Biographie-Portal³ and the APIS (Austrian Prosopographical Information System) project (Gruber and Wandl-Vogt, 2017).

   ¹ http://adb.anu.edu.au
   ² http://www.biografischportaal.nl
   ³ http://www.biographie-portal.de

This paper proposes a practical approach that addresses the problems faced when integrating biographical data from different sources into one repository. We introduce a repository for biographical data models that provides examples and descriptions of existing data models. The repository provides illustrations of data models used in different projects using fictional biographies, accompanied by fictional biographical data. Researchers working with the models can add information about the process, why the model is designed in a particular way and problems and advantages they experienced from their modeling choices. In addition, the samples in the repository are used to design a generic overarching model that can combine data represented in different formats.

The main contributions of this paper are:
  1. We compare and classify the design of models for modeling biographical data from fourteen resources
  2. We introduce a repository that provides insight into the structure of one of these models
  3. We outline our approach for connecting models that use different frameworks, formats and structures

The remainder of this paper is structured as follows. Section 2 discusses related work. The comparative analysis of biographical data models is presented in Section 3. We describe the set-up and current status of the Repository for Biographical Data Models (BDM) and our proposal for designing a generic model for connecting data from various projects in Section 4. We conclude in Section 5. Appendix A describes the resources we studied for this paper.

2 Background and Related Work

Even though a handful of publicly available standards exist for biographical data and some initiatives define their models in RDF (Resource Description Framework) and make use of existing vocabularies, most projects have designed their own model. This can be for historical reasons, driven either by the desire to stay close to the structure of an original (non-)digital source or by the direct research goals that were outlined in early stages of the digitization process. It is however likely that this is at least partially due to lack of knowledge about existing resources. This lack of knowledge is not due to lack of interest, but to the fact that it is non-trivial to obtain this information. Experience in creating structured data often stays project internal: publications on formalizing biographical data are limited, biographical resources are often part of national projects written in a local language or their use is restricted by copyright.

Making use of other people’s experience in their digitization and enrichment projects not only saves work, it can also help avoid problems further down the line. It is difficult to foresee exactly what information various researchers interested in a resource may need later on. Investigating data structures that have already been used for various use cases can provide valuable insight into what works and what does not. Following examples from other projects has the additional advantage that it will be easier to make connections between different datasets, facilitating, for instance, comparative biographical research across borders.

The situation of biographical data models is far from unique and some efforts have been made to address this issue. Franzini et al. (2016) aim to provide an overview of properties of digital editions and RIDE⁴ offers a review journal for digital editions and resources. In the typical case, the data model used in digital humanities projects is determined by the structure of the original resource or specific research questions from the early phases of the project. This is only natural, because staying close to the original source minimizes loss of information and current research questions form a concrete set of requirements that can be used for designing the model.

   ⁴ https://ride.i-d-e.de

In the remainder of this section, we first provide background information on data structures and clarify how related terminology will be used in the remainder of this paper. We then introduce previous projects that provide a common model for multiple biographical resources.

2.1 Formal Modeling and Linking

Data can be unstructured (such as flat text), semi-structured (e.g. CSV (comma separated values) files containing descriptions in natural language) or fully structured (e.g. a representation in RDF). Note that an RDF representation can also contain unstructured elements (e.g. a literal value that is a text) and that CSV can also be used to provide fully structured information (e.g. only information that is numerical or ontologically defined). In this paper, we only deal with semi-structured and structured data representations.

Comparing data representations is complex, because formats and models are regularly confused. In particular, advantages and disadvantages of using RDF or XML (eXtensible Markup Language) and JSON (JavaScript Object Notation) are frequently discussed even though XML is a serialization format and RDF is a data model that can be represented in several formats including XML or JSON. Likewise, XML and JSON can be used to represent data models that are not RDF, specified in e.g. the DTD (Document Type Definition) of the XML. When comparing XML to RDF, people generally mean the possibility of capturing information through its structure when using XML (by embedding elements or placing them in some order), where RDF enforces making all information explicit.⁵ Even though we are aware of the fact that XML and RDF operate on a different level and thus cannot be compared, we distinguish between models using RDF and models using non-RDF-based XML or non-RDF-based JSON or CSV. Unless specified otherwise, the terms XML and JSON will refer to (semi-)structured representations that are not defined in RDF in the remainder of this paper, where we use RDF to refer to RDF models regardless of the format they are represented in.

   ⁵ See for instance Fokkens et al. (2014) for a more elaborate discussion on this matter.

Structured data forms the basis for applying digital models, but structure in itself does not provide the means to connect or compare data from various resources. In order to automate a process of connecting data, its category must be formally defined. In RDF, identifiers are used to refer to entities or their properties. These entities and properties can be formally defined, which also allows us to define correspondences between entities and properties. These correspondences can link data across resources. We therefore aim to work towards a generic model in RDF.

A full discussion of related work on linking data within the digital humanities is beyond the scope of this paper. We therefore limit this overview to projects that directly influenced the approach proposed in this paper. In our proposal, we follow de Boer et al. (2012), who outline a procedure for converting cultural heritage data structured in XML to RDF with a minimum of data loss. Their approach will be explained in detail in Section 4.2. They ultimately map their converted data to a common data model for cultural heritage data: the Europeana Data Model (EDM; Doerr et al., 2010). We propose to follow this example for biographical data, where we keep data representations as close as possible to their original form and then connect them by relating the categories occurring in individual models to a generic model for biographical data.

2.2 Work on Biographical Data Models

The BiographyNet project applied the procedure outlined by de Boer et al. (2012) to data from the Biography Portal of the Netherlands (BPN) as described in Ockeloen et al. (2013). The BPN forms a collection of biographical dictionaries describing people who are Dutch or lived in the Netherlands. It is one of the projects that already proposed an overarching generic structure for a heterogeneous dataset, resulting in an event-centric model for biographical data (Hoekstra, 2013).

The national Australian Dictionary of National Biography⁶ (ADNB) is part of a larger effort of data aggregation, collaboration and cooperation together with the Humanities Network Infrastructure (HuNI) (Arthur, 2017).

The transnational “Biographie-Portal”⁷ combines nine biographical resources from four countries (Germany, Austria, Switzerland and Slovenia) and can be searched by name and occupation. Richer developments for these resources, in particular the Austrian Biographical Lexicon (ÖBL), are carried out as part of the APIS project (Gruber and Wandl-Vogt, 2017).

   ⁶ http://adb.anu.edu.au
   ⁷ http://www.biographie-portal.de

A handful of projects have made use of linked data for enrichment and connecting biographical data to external resources. It is used for connecting data in the HuNI and ADNB data aggregation projects. The Deutsche Biographie (DB) also represents information in RDF. However, to our knowledge, neither of these resources represents all its metadata in RDF. The BPN was converted to linked data as part of the BiographyNet project, which also enriched the metadata by processing the biographical text automatically and linking extracted information to external sources (Fokkens et al., 2017). The model that is used to represent this data in RDF, including an elaborate schema for representing provenance in a detailed manner, can be found in Ockeloen et al. (2013).

To our knowledge, none of the projects discussed above make use of linked data to provide a generic overarching model. The work by Leskinen et al. (2017) comes closest to this idea. They provide a basic structure that can be used for prosopographical research defining name, lifespan and gender. More elaborate information can be defined using externally defined data models such as the Simple Event Model (SEM; van Hage et al., 2011).

The Biographical Data Model Repository proposed in this paper is intended to be complementary to all initiatives mentioned above. It does not provide a platform for aggregating the data itself like BPN, the ADNB or the transnational Biographie-Portal. Its primary goal is to provide examples of a wide variety of biographical data models. These can be collected across projects with relatively limited effort. To illustrate, the fourteen resources presented here were collected in a couple of weeks. The method we propose for converting and linking data aims to go beyond defining a basic generic model for representing biographical data as developed by Leskinen et al. (2017). We propose a bottom-up approach for representing various resources in RDF, which can subsequently be mapped on a high or fine-grained level to other sources.

3 A Comparative Analysis

We collected samples from two platforms for sharing biographical data, eight biographical databases and four data models, two of which were specifically designed as part of a digitization/enhancement project related to one of the databases. This total of fourteen resources was collected as part of the preparation for the Workshop on Biographical Data and Datamodels.⁸ A short description of each project can be found in Appendix A. The models we observed as part of this investigation come from a wide variety of projects. Some projects mainly focus on the digitization process or historical research where designing a model for presenting biographical data emerged as a by-product. Others specifically aimed at developing a formal model for biographical data.

   ⁸ http://www.biographynet.nl/dh-biographical-data-workshop/

We compare the models on the level of content (what kind of information is provided), the framework (is the model formalized and how) and formatting (how is data represented). In this investigation, we only consider components of the data that are (semi-)structured: raw text is not analyzed in depth.

3.1 General Observations

3.1.1 Content

We first examine what kind of information can be included in the models in a (semi-)structured manner. As expected, all models we examined represent the person’s name and lifespan (if known). When looking at richer models, we observe common themes in the kind of information that is provided. Most resources and models address the individual’s career, education, family relations and residence. Furthermore, several resources make the reason for including a person explicit by providing information labeled ‘kind of person’, ‘category’ or ‘claim to fame’.

The main differences lie in the level of granularity of the information provided. Where some only indicate the sector in which a person worked, others provide detailed information about the firm, dates and timelines of the employment. The same can be observed for education.

3.1.2 Framework and Structure

The level of formalization differs considerably from one model to another. The least formalized models make use of text fields for providing information. They use words represented as strings to define various categories of information and values are presented as descriptions. In these cases, minor differences can already be observed in the way dates are represented or the same location may appear using a different name. Other models use predefined classes and relations. This holds in particular for the models that are defined in RDF. Finally, a handful of models adapted their basic structure from TEI P5, which defines a generic XML structure.

Basic representations in strings have the advantage that unstructured and semi-structured data from the original sources can be represented in its surface form in a simple and straightforward manner. However, it may be worthwhile to invest in defining models and ontologies: predefined categories have the advantage that identical information is presented in a consistent manner. Formally defining information in RDF facilitates the process of connecting it to external resources.
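As a rough illustration of how formally defined categories support linking, the sketch below rewrites resource-specific property names to a shared generic property and then merges the statements. It is a minimal sketch in plain Python rather than an RDF toolkit; the prefixes `resA:`, `resB:` and `generic:`, the property names, the person identifier and all values are hypothetical and do not come from any of the models discussed here. It also assumes that both resources already use the same identifier for the person.

```python
from collections import defaultdict

# Hypothetical correspondences: local property -> generic property.
# None of these identifiers are taken from an existing data model.
CORRESPONDENCES = {
    "resA:birthDate": "generic:dateOfBirth",
    "resB:born": "generic:dateOfBirth",
    "resA:job": "generic:occupation",
    "resB:profession": "generic:occupation",
}

# Statements as (subject, predicate, object) triples about a fictional person.
TRIPLES_A = [
    ("person:1", "resA:birthDate", "1850-03-02"),
    ("person:1", "resA:job", "painter"),
]
TRIPLES_B = [
    ("person:1", "resB:born", "1850-03-02"),
    ("person:1", "resB:profession", "teacher"),
]


def to_generic(triples, correspondences):
    """Rewrite each predicate to its generic counterpart; keep unmapped ones."""
    return [(s, correspondences.get(p, p), o) for s, p, o in triples]


def merge(*triple_sets):
    """Group object values by (subject, predicate), dropping duplicates."""
    merged = defaultdict(set)
    for triples in triple_sets:
        for s, p, o in triples:
            merged[(s, p)].add(o)
    return {key: sorted(values) for key, values in merged.items()}


combined = merge(
    to_generic(TRIPLES_A, CORRESPONDENCES),
    to_generic(TRIPLES_B, CORRESPONDENCES),
)

for (subject, predicate), values in sorted(combined.items()):
    print(subject, predicate, values)
```

Keeping the original triples untouched and applying the correspondences as a separate step means the source data stays close to its original form, while the mapping itself remains easy to inspect and revise.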
              General                                           Categories
Repositories  Model   Framework  Event/    Accessi-  Metadata/  Lifespan  Gender  Faith  Claim-of-fame/  Education  Occupation  Residence  Personal   Further
                      or format  relation  bility    in-text                             person-type                                       relations  specifics
AINM          TEI P5  XML        relation  AFR       MD+IT      ✓         ✓       ✓                      ✓          ✓                                 -
ANB           TEI P5  XML        event     CRR       MD+IT      ✓         ✓                                         ✓                                 -
BPN           TEI P5  RDF/XML    event     OS/AFR    MD         ✓         ✓       ✓      ✓               ✓          ✓           ✓          ✓          -
CBD           own     RDB        relation  OS        MD         ✓                        ✓                          ✓           ✓          ✓          9
CBW           SNAC    CSV/JSON   n.a.      OS        MD         ✓         ✓              ✓                                                            -
DB            own     RDF/XML    relation  OS/AFR    MD+IT      ✓         ✓       ✓      ✓               ✓          ✓           ✓          ✓          -
ODNB          TEI P5  XML        event     CRR       MD+IT      ✓         ✓       ✓                      ✓          ✓           ✓          ✓          ✓
ÖBL           own     RDB        relation  AFR       MD         ✓         ✓                              ✓          ✓                      ✓

                                  Table 1: Overview of properties of individual biographical databases


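The ‘framework or format’ column of Table 1 ranges from CSV to RDF. As a hedged illustration of why a flat CSV row struggles with properties a person can have more than once, the sketch below stores the same fictional person as one CSV row and as a nested JSON structure. The person, the field names and the values are all invented for this example and do not reflect any of the resources in the table.

```python
import csv
import io
import json

# Fictional person with two occupations, each with its own period and place.
person = {
    "id": "p1",
    "name": "Jane Example",
    "birth": "1850",
    "death": "1910",
    "occupations": [
        {"label": "teacher", "start": "1872", "end": "1880", "place": "Utrecht"},
        {"label": "painter", "start": "1880", "end": "1905", "place": "Amsterdam"},
    ],
}


def to_flat_row(p):
    """Flatten to one CSV row; repeated values are squeezed into a single cell."""
    return {
        "id": p["id"],
        "name": p["name"],
        "birth": p["birth"],
        "death": p["death"],
        # Only the labels fit here; periods and places are lost.
        "occupations": "; ".join(o["label"] for o in p["occupations"]),
    }


# CSV: one row per person, one cell per category.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "name", "birth", "death", "occupations"]
)
writer.writeheader()
writer.writerow(to_flat_row(person))
csv_text = buffer.getvalue()

# JSON: a list keeps every occupation together with its own details.
json_text = json.dumps(person, indent=2)
restored = json.loads(json_text)

print(csv_text)
print(json_text)
```

The flat row could be extended with columns such as a numbered start date per occupation, but the number of columns then depends on the person with the most entries, which is one reason nested or graph-based representations are preferred for richer metadata.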
3.1.3 Representation

We compared choices of representation for various data models. The most basic form of structuring data is through CSV. Advantages of using CSV are clear: it is an easy-to-understand format that can be handled well by humans as well as machines. On the other hand, it provides little support for defining more complex relations. Most data entries consist of rows defining the identifier for the person described, name, dates of birth and death and possibly room for a ‘claim-to-fame’ category and parents. They become less convenient when defining properties of which a person may have more than one during their life: professions, schools attended, residence, children, etc. They also fall short when defining more complex relations, for instance, the start and end date of each profession together with the location of the position. It is therefore not surprising that CSV is mainly used for resources that only represent a relatively modest amount of metadata on the person.

Resources that do aim to define more complex relations either represent their data in RDF, which can be represented in e.g. XML, Turtle or JSON-LD, or they use some other XML format or JSON structure. XML and JSON both provide straightforward means to define multiple entries of the same categories (e.g. a list in JSON or sequence of XML elements) as well as the means to define more elaborate relations. It is possible to provide formal definitions of what constitutes valid XML for a given data structure, including the elements, attributes and values that are permitted. However, XML itself does not offer the means to formally define the meaning of these elements, attributes and values. To summarize, RDF models provide, in principle, the richest formal definitions and are most (explicitly) expressive, followed by (non-RDF defined) XML structures, JSON and finally CSV. The order in terms of the complexity of the model, the effort involved in defining it properly and possibly the learning curve for people starting to work with it is the inverse: CSV is the simplest, followed by JSON, XML and RDF.

3.2 Data Sample Analyses

We compared samples of fourteen biographical data resources outlined in Appendix A,⁹ paying attention to the level of formalization, the overall structure (relation-based, event-based or both) of the model as well as the categories provided for most entries or, for the four data models, which categories they specifically formalize. We also indicate the availability of the data itself for the eight databases.

   ⁹ The abbreviations used in our comparison are introduced in the Appendix as well.

3.2.1 Databases

Table 1 provides an overview of the properties of the databases. The left side of the table indicates general properties. The first column indicates the generic model that was used as a basis for the model employed by the database: three projects invented their own model from scratch, CBW makes use of representations developed as part of SNAC and all others have taken TEI P5 as a basis. The second column indicates whether the database makes use of the framework RDF and otherwise, which representation format is used. Both databases that have RDF representations also represent information in plain XML. ABD and CBDP are relational databases that can be queried using SQL. CBW uses CSV and JSON for data representations. The third column indicates whether the structure of the representation is event-centric or mainly relational. The model used for CBW is not rich enough to make this distinction. Two databases are copyright restricted (CRR), two databases can be made available for research purposes (AFR), two are open source (OS) and two are partially open source and can partially be made available for research (OS/AFR), as indicated in the fourth column. The fifth column indicates whether the database only provides structured data as metadata (MD) or whether it also provides structured data tagged in the biographical text (+IT).

The right side of the table indicates which categories of information are provided as specifically structured data. It should be noted that the absence of a checkmark does not necessarily mean that the information is not present in the resource. The information can be provided as standard in the biographical text or it can be provided in a semi-structured manner, rather than being part of the structured dataset. The last column indicates the extent to which alternative categories are provided in a structured way. The ÖBL has at least 36 additional relations defined, CBDP has 9 additional information fields and ODNB mainly provides relatively fine-grained subcategories.

3.2.2 Platforms and Data Models

What information is formally represented in the two platforms and four models is presented in Table 2. The information provided by APIS and BiographyNet (BNET) corresponds to that included in the respective databases they are related to (ABD and BPN). For reasons of space, we omitted categories that are only provided by one of these two resources.

process of connecting data, including a conversion step to representations in RDF.

4 The BDM Repository

As a practical approach to address the two main drawbacks of developing models independently outlined in Sections 1 and 2, we initiated a repository of biographical data models (the BDM repository). We first describe the process of collecting models in the BDM repository and then outline the process we intend to follow to connect the models collected in this repository.

4.1 Collecting Data

The Biographical Data Model (BDM) Repository is a place for collecting and connecting biographical data models. The BDM Repository serves three purposes: First, re-
APIS provides the same 36+ relations that are indicated for                                                                                        searchers faced with the task of representing biographical
the ABD. The other resources can provide richer structured                                                                                         data can find various examples of models used by other
information due to their ability to be combined with other                                                                                         projects in one place. Second, the repository forms a nat-
models. BNET, BCRM and DFKI are defined in RDF for                                                                                                 ural environment for comparing data models and recording
this exact reason. SNAC and EIBIO do not represent their                                                                                           advantages and disadvantages of various representations.
data in RDF, but do make use of external links to connect                                                                                          Third, the repository will support the process of represent-
information from various sources.                                                                                                                  ing models in RDF (for those that are not represented in
                                                                                                                                                   RDF already) and defining correspondences between mod-
                                                                                                                       external links/extensions




                                                                                                                                                   els. These correspondence definitions can be used to link
                                                                                                                                                   data from various models, which in turn, enables a wide
                                                                                                                                                   range of comparative research.
                framework/format




                                                                                                  personal relations




                                                                                                                                                   The first challenge this repository faces is that many bio-
                                    event/relation




                                                                                                                                                   graphical data collections are copy-righted. From the col-
                                                                                     occupation
                                                                         education




                                                                                                                                                   lections described above, only two are completely open
                                                     lifespan

                                                                gender




                                                                                                                                                   source and two are partially open source. Samples from
                                                                                                                                                   the other resources cannot be made openly available to ev-
                                                                                                                                                   eryone. To circumvent this problem, we wrote a handful of
   APIS       RDB                   rel.             3          3        3           3             3                   3                           biographies of fictional characters and make the texts and
   BNET       RDF                  event             3          3        3           3             3                   3                           metadata we (partially) invented available under the Cre-
   BCRM       RDF                  event             3          3                    3             3                   3                           ative Commons License. The idea is that the repository will
   DFKI       RDF                  event             3                   3           3                                 3                           ultimately include representations of these non-copyrighted
                                                                                                                                                   texts in all biographical data models we are aware of. This
   SNAC       JSON                 rel.              3                   3                                             3
                                                                                                                                                   allows us to illustrate the structure of the models without
   EIBIO       CSV                 rel.              3                                                                 3
                                                                                                                                                   sharing their copy-righted content. It has the additional
                                                                                                                                                   advantage that it becomes easier to compare information
Table 2: Overview of properties defined in models and plat-                                                                                        between models, since different samples provide the same
forms                                                                                                                                              information.
                                                                                                                                                   The BDM repository currently provides samples for all
                                                                                                                                                   21 dictionaries included in the Biographical Portal of the
3.2.3 Summarizing the analysis
                                                                                                                                                   Netherlands. They are illustrated by the biography of Mary
Overall, we observe that all resources provide ways for
                                                                                                                                                   Morstan, protagonist in one of the Sherlock Holmes books
specifying a person’s life span in a structured way. Al-
                                                                                                                                                   and later wife of dr. Watson. The biographies are written in
most all resources provide means to specify a person’s oc-
                                                                                                                                                   English, but otherwise follow the conventions of the orig-
cupation or gender, CBW being the only exception when it
                                                                                                                                                   inal resources (concerning abbreviation and semi-structure
comes down to education and ABD and CBDP being the
                                                                                                                                                   in text). The information provided on Morstan currently
only two sources that do not seem to have a field to specify
                                                                                                                                                   covers the categories included in the BPN models and will
gender. The other categories, faith, person-type/claim-to-
                                                                                                                                                   be extended accordingly as models with structure for ad-
fame, education, residence and personal relations each oc-
                                                                                                                                                   ditional information are added. The latest version of the
cur in four to eight resources. The division between event-
                                                                                                                                                   BDM repository can be found on github.10
based and relational based structures is about 50-50. No-
tably resources that make use of RDF seem to have a pref-
erence for event-centric structures. A probable reason for                                                                                           10
                                                                                                                                                      https://github.com/cltl/
this will be outlined in Section 4.2, where we describe the                                                                                        BiographicalDataModels
4.2 Connecting Biographical Data
Once multiple data models have been included in the BDM repository, we can investigate how to connect them. We plan to achieve this by representing all models in RDF. Once individual models have been formally defined, we can define correspondences between them. In this section, we outline this process.

4.2.1 From CSV or XML to RDF
The first step is to provide RDF representations for models that have not been defined in RDF so far. When converting from one representation format to another, there is always a risk of information loss. This particularly applies when the data is converted to a standardized model. We avoid this by following the procedure outlined in de Boer et al. (2012) for converting XML to RDF and adopting a similar approach for converting CSV and JSON files. The procedure consists of the following steps (adapted from de Boer et al. (2012), page 735): 1) XML/CSV/JSON ingestion. 2) Crude conversion to RDF. 3) RDF restructuring. 4) Design metadata mapping scheme. 5) Align vocabularies with external sources. 6) Publish as Linked Data.
In the first step, the original structure is interpreted. Then a direct conversion to RDF takes place that maintains the full original structure. As also explained by de Boer et al. (2012), data in XML can be complex: elements can be nested deeply within other elements, and they may be grouped in a specific manner or ordered by the structure. Some of these structural properties are meaningful (e.g. elements within a group are connected by some implicit link, or the order of elements indicates their order in time), but many do not express information that needs to be maintained in the RDF structure. If the original XML (or JSON) is complex, the resulting RDF structure is likely to be messy. The third step addresses this by restructuring the RDF so that structures containing implicit information are translated to flatter (non-embedded) representations that make this information explicit, and idiosyncratic complexities are removed. The first three steps ideally result in an RDF representation that is as simple as possible, but still provides all information from the original data.
In the fourth step, researchers explore which categories and relations expressed in the generated RDF correspond to definitions and classes defined in other vocabularies. Based on this exploration, correspondences between the resulting RDF and existing models and vocabularies can be defined.
In the fifth step, these correspondences are used to link the generated RDF to external sources, after which it is possible to publish the model as linked data. The BDM repository aims to help researchers carry out the first four steps. Since the repository only provides mock-up samples of data, the actual alignment of the resource and its publication as linked data are out of scope. In the next subsection, we explain how correspondences may be defined between a relation-based and an event-centric model.

4.2.2 Conversions and Linking
Figure 1 provides an illustration of the conversion of an event-centric representation to RDF. We illustrate the representation of the event after Step 3, before the step that maps it to other resources. The namespace nns: stands for a new namespace for the dataset. Conversion to RDF is relatively straightforward: a unique identifier is assigned to the event, the event is typed as an occupation, and all other information can be defined directly as properties of the event. In the next step, these relations can be mapped to other existing models. We can, for instance, use the Simple Event Model (van Hage et al., 2011) to define the location, the begin time and the end time. Categories that commonly occur in biographical data, such as occupations, should ideally also be defined by the same vocabulary across resources.

Figure 1: Illustration of conversion of event-centric data representation to RDF

Representing a relation-based structure in RDF requires more effort for relations that are temporally bound or tied to a specific location. Figure 2 provides an illustration. In principle, the relation itself can easily be translated into RDF by assigning a URI to the relation and specifying its meaning. However, we then need to decide how to specify the duration and location of the employment. The problem of making statements about a triple in RDF is well known, and several solutions have been proposed. Van Atteveldt et al. (2007) provide an in-depth analysis of these proposals. We illustrate two commonly used approaches in Figure 2.

Figure 2: Illustration of conversion of relational data representation to RDF

On the left-hand side, the statement about Mary's employment is taken as a unit that can receive its own identifier. This approach is used for defining context (e.g. Carroll et al., 2005; MacGregor and Ko, 2003). In our example, we use a named graph for assigning an identifier to the relation. Information about time and place is then linked to the identifier of the named graph. The advantage of this approach is that it remains close to the original data structure. Following a solution originally designed to define contexts also intuitively makes sense: the specific relation applied in a given time period and in a given place. On the other hand, we also want to define the context in which the information about time and place is provided: what is the original source of this information? How was it integrated in this database and by whom? What conversions and other operations were applied to this data? Modeling provenance is essential for research in the digital humanities (Ockeloen et al., 2013, among others). We can place the information in the left box of Figure 2 in a named graph as well and then define provenance information for this new named graph, but (potentially extensive) use of nested named graphs does not improve the usability of our data structure.
The solution on the right-hand side is called reification. In this case, a new node is introduced that splits the predicate "employed by" into two relations: one with the subject of the original triple and one with the object. Properties associated with the relation can then be linked to this new node. This solution changes the original structure, making the relation between, in this example, the employer and employee less direct: they are now connected to the same node rather than to each other. It also increases the number of relations. On the other hand, it avoids introducing an additional layer of nested named graphs. An additional advantage is that reification of relations that involve a state or event results in event-centric structures (compare the representation on the right-hand side of Figure 2 to the one in Figure 1). Reification thus facilitates the process of defining correspondences between information from these relation-based representations and information represented in event-centric models. We will therefore adopt this solution once we start connecting information from various models.

5 Conclusion
Many projects that involve digitizing or enriching biographical data develop their own data model. In addition to the inefficiency of not making use of knowledge acquired by other resources, this has led to differences between models, making it harder to make connections between various resources. We illustrated some of these differences through an analysis of fourteen resources collected as part of the Workshop on Biographical Datamodels held in Krakow, July 2016.
The problem of models being developed independently is partially due to the difficulties involved in finding detailed information on data representations used in various projects. In this paper, we have taken a first step in addressing the problem. We propose a practical approach in the form of a biographical data model repository where detailed examples of different models can be collected. The samples make use of biographical texts about fictional characters and invented data, written under the Creative Commons license, which avoids issues with copyright.
Once a number of resources have been collected, the repository can furthermore be used to start defining connections between models by mapping them to a generic biographical representation. We outlined a general procedure that starts by converting resources to linked data representations (if they are not provided in RDF already) and subsequently linking them to a generic model. We illustrated the process of converting event-centric and relationally structured resources to RDF. We showed that relational resources can be converted to event-centric representations in RDF when applying reification.
As of the moment of submission, the repository illustrates all 23 biographical dictionaries included in the Biography Portal of the Netherlands. In the near future, we plan to add illustrations of the other thirteen resources we collected, as well as encourage researchers involved in other projects with biographical data to add illustrations of their models to the repository. The repository is available on GitHub.11

11 https://github.com/cltl/BiographicalDataModels
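As a purely illustrative aside to Section 4.2.2, the two conversions discussed there can be sketched in Python with triples modelled as plain tuples. This is a minimal sketch under stated assumptions, not the implementation used for the repository: the function names, the property names (such as nns:subject and nns:object), the prefix strings and all sample values are hypothetical and are not taken from Figures 1 and 2.

```python
# Hypothetical prefix strings; "nns:" mirrors the new dataset namespace
# mentioned in Section 4.2.2. Triples are plain (subject, predicate, object)
# tuples, so no RDF library is assumed.
NNS = "nns:"
RDF_TYPE = "rdf:type"


def event_to_triples(event_id, event):
    """Turn one event-centric record into triples about a single event node."""
    node = NNS + event_id
    # The event node is typed first; every other field of the record
    # becomes a property attached directly to that node.
    triples = [(node, RDF_TYPE, NNS + event["type"])]
    for key, value in event.items():
        if key != "type":
            triples.append((node, NNS + key, value))
    return triples


def reify_relation(node_id, subject, predicate, obj, qualifiers):
    """Reify the triple (subject, predicate, obj) as a new node.

    The new node is linked to the original subject and object, and the
    qualifiers (e.g. time and place) are attached to the node itself.
    """
    node = NNS + node_id
    triples = [
        (node, RDF_TYPE, NNS + predicate),
        (node, NNS + "subject", NNS + subject),
        (node, NNS + "object", NNS + obj),
    ]
    for key, value in qualifiers.items():
        triples.append((node, NNS + key, value))
    return triples


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    occupation = {"type": "Occupation", "person": "nns:MaryMorstan",
                  "label": "governess", "begin": "1880", "end": "1887"}
    for triple in event_to_triples("event1", occupation):
        print(triple)

    employment = reify_relation("employment1", "MaryMorstan", "employedBy",
                                "EmployerX", {"begin": "1880", "end": "1887"})
    for triple in employment:
        print(triple)
```

In both cases every returned triple has the same new node as its subject, which is the sense in which a reified relation ends up with the shape of an event-centric record.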
6 Acknowledgements
This work was supported by the Amsterdam Academic Alliance Data Science (AAA-DS) Program Award to the UvA and VU Universities and NWO VENI grant 275-89-029 awarded to Antske Fokkens. We furthermore would like to thank researchers involved in the individual projects for providing samples of their data as well as the participants of the BDM workshop in Krakow for their input during discussions. We thank the audience of BD2017 and anonymous reviewers for their useful and detailed feedback. All remaining errors are our own.

7 References
Paul Arthur. 2017. Integrating biographical data in large-scale research resources: Current and future direction. In Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Europa baut auf Biographien: Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 193–224. New Academic Press, Vienna.
Peter K Bol, Robert M Hartwell, Michael A Fuller, et al. 2004. China biographical database project (cbdb).
Alison Booth. 1999. The lessons of the medusa: Anna jameson and collective biographies of women. Victorian Studies, 42(2):257–288.
Jeremy J Carroll, Christian Bizer, Pat Hayes, and Patrick Stickler. 2005. Named graphs, provenance and trust. In Proceedings of the 14th international conference on World Wide Web, pages 613–622. ACM.
Victor de Boer, Jan Wielemaker, Judith van Gent, Michiel Hildebrand, Antoine Isaac, Jacco van Ossenbruggen, and Guus Schreiber. 2012. Supporting linked data production for cultural heritage institutes: The amsterdam museum case study. In ESWC, volume 7295 of Lecture Notes in Computer Science, pages 733–747, Berlin and Heidelberg. Springer.
Thierry Declerck and Rachele Sprugnoli. 2018. Considerations about uniqueness and unalterability for the encoding of biographical data in ontologies. In Proceedings of the second Conference of Biographies in a Digital World BD2017.
Österreichische Akademie der Wissenschaften. 2013. Österreichisches biographisches lexikon 1815–1950. online edition. Online Publikation: http://www.biographien.ac.at/oebl.
Martin Doerr, Stefan Gradmann, Steffen Hennicke, Antoine Isaac, Carlo Meghini, and Herbert van de Sompel. 2010. The europeana data model (edm). In World Library and Information Congress: 76th IFLA general conference and assembly, pages 10–15.
Bernhard Ebneth and Matthias Reinert. 2017. Potentiale der deutschen biographie als historisch-biographisches informationssystem. In Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Europa baut auf Biographien: Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 283–295. New Academic Press, Vienna.
Antske Fokkens, Aitor Soroa, Zuhaitz Beloki, Niels Ockeloen, German Rigau, Willem Robert van Hage, and Piek Vossen. 2014. Naf and gaf: Linking linguistic annotations. In Proceedings 10th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, pages 9–16.
Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, Guus Schreiber, and Victor de Boer. 2017. Biographynet: Extracting relations between people and events. In Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Europa baut auf Biographien: Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 193–224. New Academic Press, Vienna.
Greta Franzini, Melissa Terras, and Simon Mahony. 2016. 9. a catalogue of digital editions. Digital Scholarly Editing, page 161.
Christine Gruber and Eveline Wandl-Vogt. 2017. Mapping historical networks: Building the new Austrian Prosopographical Biographical Information System (APIS). In Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Europa baut auf Biographien: Aspekte, Bausteine, Normen und Standards für eine europäische Biographik, pages 271–282. New Academic Press, Vienna.
Daniele Guido, Marten Düring, and Lars Wieneke. 2016. European integration biographies reference database (eibio). In DH Benelux.
Brian Harrison. 2004. The dictionary man in: M. bostridge ed. In Lives for sale. Biographers tales, pages 76–85.
Rik Hoekstra. 2013. Historische representativiteit in context. over het biografisch portaal als onderzoeksinstrument.
John Kendall. 2014. American national biography. Reference Reviews, 28(2):7–10.
Hans-Ulrich Krieger and Thierry Declerck. 2015. An owl ontology for biographical knowledge. representing time-dependent factual knowledge. In Serge ter Braake, Antske Fokkens, Ronald Sluijter, Thierry Declerck, and Eveline Wandl-Vogt, editors, Biographical Data in a Digital World. Proceedings of the First Conference on Biographical Data in a Digital World. Amsterdam, The Netherlands, April 9, 2015, pages 101–110.
Katalin Lejtovicz and Amelie Dorn. 2017. Connecting people digitally-a semantic web based approach to linking heterogeneous data sets. In Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017, pages 1–8.
Petri Leskinen, Jouni Tuominen, Erkki Heino, and Eero Hyvönen. 2017. An ontology and data infrastructure for publishing and using biographical linked data. In Proceedings of the Workshop on Humanities in the Semantic Web (WHiSe II). CEUR Workshop Proceedings (October 2017).
Tom J Lynch. 2014. Social networks and archival context project: A case study of emerging cyberinfrastructure. DHQ: Digital Humanities Quarterly, 8(3).
Robert M MacGregor and In-Young Ko. 2003. Representing contextualized data using semantic web tools. In PSSS.
Niels Ockeloen, Antske S. Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne. 2013. Biographynet: Managing provenance at multiple levels and from different perspectives. In Proceedings of the Workshop on Linked Science (LISC2013) at ISWC (2013).
Brian Ó Raghallaigh and Gearóid Ó Cleircín. 2015. Ainm.ie: Breathing new life into a canonical collection of irish-language biographies. In Serge ter Braake, Antske Fokkens, Ronald Sluijter, Thierry Declerck, and Eveline Wandl-Vogt, editors, Biographical Data in a Digital World. Proceedings of the First Conference on Biographical Data in a Digital World. Amsterdam, The Netherlands, April 9, 2015, pages 20–23.
Matthias Reinert, Maximilian Schrott, Bernhard Ebneth, and Team deutsche biographie.de. 2015. From biographies to data curation - the making of www.deutsche-biographie.de. In Serge ter Braake, Antske Fokkens, Ronald Sluijter, Thierry Declerck, and Eveline Wandl-Vogt, editors, Biographical Data in a Digital World. Proceedings of the First Conference on Biographical Data in a Digital World. Amsterdam, The Netherlands, April 9, 2015, pages 13–19.
Wouter Van Atteveldt, Stefan Schlobach, and Frank Van Harmelen. 2007. Media, politics and the semantic web. In European Semantic Web Conference, pages 205–219. Springer.
Willem Robert van Hage, Véronique Malaisé, Roxane Segers, Laura Hollink, and Guus Schreiber. 2011. Design and use of the Simple Event Model (SEM). Journal of Web Semantics, 9(2):128–136.

A Appendix: Biographical Databases
This appendix provides a brief description of all resources included in the comparative study (Section 3.2).

A.1 Data collections
AINM.IE (Raghallaigh and Cleircín, 2015, AINM) is a collection of biographies describing people who are in some way connected to the Irish language. It contains 1,749 biographies written in Irish of people dating from 1560 until

contain three or more short biographies describing only women. The collection was originally published as a book (Booth, 1999). The main metadata from this resource is available as CSV and it has been included in SNAC, which will be described below.
The Deutsche Biographie (Reinert et al., 2015, DB) (Ebneth and Reinert, 2017) consists of the old and new national German biographical dictionary online.17 It includes information about 730,000 individuals in German speaking areas covering a timespan from the early Middle Ages until present. The resource also includes approximately 50,000 biographical descriptions.
The Oxford Dictionary of National Biography (Harrison, 2004, ODNB) comprises an online version of the old biographical dictionary as well as the new digital born additions.18 In total, it contains over 60,000 biographies.
The Austrian Biographical Lexicon Online (Österreichische Akademie der Wissenschaften, 2013, ÖBL) describes noteworthy people who were born in the Austro-Hungarian Empire, worked there or lived there, and who died between 1815 and 1950. It currently contains more than 50,000 biographies.19

A.2 Platforms
Our study also included two platforms meant for sharing information. The European Integration Biographies reference database (Guido et al., 2016, EIBIO) is a structured repository for information about people. It combines structured data with free text, bringing together information from external repositories such as VIAF and Wikipedia that can be queried through an API. The data structure that is used is rather basic (data is shared as CSV and not enough information is provided to determine whether it is relational or event-centric).
The Social Networks and Archival Context project (Lynch, 2014, SNAC) provides data on people and organizations in their socio-historical context independently from the original resources that provided information about their lives.20 Data from the CBW is included in this resource, which uses JSON as an overall structure.
present.12
The American National Biography (Kendall, 2014, ANB)           A.3 Data models
covers the lives of 19,000 noteworthy American individu-       For our analysis we have looked at four data models. APIS
als.13                                                         provides rich structured data for the ÖBL (Gruber and
The Biographical Portal of the Netherlands (BNP) has been      Wandl-Vogt, 2017). Information comes from the original
introduced in the previous section. It is a collection of 23   metadata as well as from automated and manual annota-
different biographical dictionaries of Dutch people.14         tions (Lejtovicz and Dorn, 2017). Compared to the other
The China Biographical Database Project (Bol et al., 2004,     resources, it has a wide range of specifically defined rela-
CBD) provides biographical information about approxi-          tions between people, organizations and locations.
mately 360,000 persons15 most of whom lived between the        The BiographyNet project (BNET) aims to enhance the
7th and 19th century. It provides detailed information about   possibilities for historical research using the BPN by pro-
locations and has comparatively rich information about so-     viding structured information in RDF, extracting informa-
cial structures. It is the only resource in our sample that    tion from text and providing access to this information
specifies information about possessions.                       through a demonstrator (Fokkens et al., 2017). Among oth-
The Collective Biographies of Women16 (CBW) provides           ers, the project resulted in an RDF version of the BPN in-
annotated information on books written in English that         cluding an extensive model for representing provenance in-
                                                               formation (Ockeloen et al., 2013).
  12
     https://www.ainm.ie
  13                                                             17
     http://www.anb.org                                             http://www.deutsche-biographie.de
  14                                                             18
     http://www.biografischportaal.nl                               http://www.oxforddnb.com
  15                                                             19
     As of April 2015, indicated by the developers                  http://www.biographien.ac.at/oebl
  16                                                             20
     http://womensbios.lib.virginia.edu                             http://snaccooperative.org/?redirected=1
The BioCRM (BCRM) is designed for representing bio-
graphical information for supporting prosopographical re-
search in the context of the Republic of Letters.21 It is an
extension of CIDOC CRM so that it can easily be used in
a variety of digital humanities projects. The model pro-
vides the means for defining basic biographical information
and is mainly meant to complement or be complemented by
other models.
The final model we include in our comparative analysis
is the DFKI Biography Ontology (Krieger and Declerck,
2015). Contrary to all other resources included here, this
model does not provide specific relations for persons, but
rather a generic framework that can represent temporally
bound events and states as well as fixed properties of per-
sons. It can be seen as complementary to the other models.
The latest status of this ontology and a proposal for moving
forward can be found in Declerck and Sprugnoli (2018),
this volume.
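The distinction between time-bound events or states and fixed properties of
persons can be illustrated with a minimal sketch. This is not the vocabulary
of the DFKI Biography Ontology: the class names, field names and the example
person below are all hypothetical and only serve to show the general idea of
attaching a validity interval to some statements and not to others.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TimeBoundFact:
    """A state or event that holds only within a span of years (hypothetical)."""
    predicate: str
    value: str
    start_year: int
    end_year: Optional[int] = None  # None means the end is unknown or open

    def holds_in(self, year: int) -> bool:
        # A fact holds from start_year up to and including end_year.
        if year < self.start_year:
            return False
        return self.end_year is None or year <= self.end_year


@dataclass
class Person:
    """A person with fixed properties and a list of time-bound facts."""
    name: str
    fixed: Dict[str, object] = field(default_factory=dict)
    time_bound: List[TimeBoundFact] = field(default_factory=list)

    def facts_in(self, year: int) -> List[TimeBoundFact]:
        # Return the time-bound facts valid in the given year, in stored order.
        return [fact for fact in self.time_bound if fact.holds_in(year)]


# Invented example data, not taken from any of the resources described here.
person = Person(
    name="Jane Example",
    fixed={"birth_year": 1800},
    time_bound=[
        TimeBoundFact("occupation", "teacher", 1825, 1840),
        TimeBoundFact("residence", "Amsterdam", 1830),
    ],
)

for fact in person.facts_in(1835):
    print(fact.predicate, fact.value)
```

Keeping fixed properties apart from interval-stamped statements means that a
query for a given year only has to inspect the time-bound list, while
properties such as a birth year remain valid regardless of the year asked for.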




  21 http://www.republicofletters.net