=Paper=
{{Paper
|id=Vol-2119/paper13
|storemode=property
|title=Connecting People Across Borders: a Repository for Biographical Data Models
|pdfUrl=https://ceur-ws.org/Vol-2119/paper13.pdf
|volume=Vol-2119
|authors=Antske Fokkens,Serge ter Braake
|dblpUrl=https://dblp.org/rec/conf/bd/FokkensB17
}}
==Connecting People Across Borders: a Repository for Biographical Data Models==
Antske Fokkens and Serge ter Braake

CLTL, Vrije Universiteit Amsterdam / Media Studies, University of Amsterdam

De Boelelaan 1105, 1081 HV Amsterdam, the Netherlands / Turfdraagsterpad 9, 1012 XT Amsterdam

antske.fokkens@vu.nl, sergeterbraake@gmail.com

===Abstract===
This paper proposes a practical approach for sharing knowledge about biographical data models that circumvents issues with copyright. We furthermore provide the main observations of a study analyzing the data structures of eight biographical resources, two platforms for biographical information and four biographical data models. We outline an approach for designing a generic model that can be used for linking information from different models despite differences in structure.

Keywords: Biographical data models, RDF

===1 Introduction===
The biography genre has a long history. Plutarch (45-ca. 120 AD) is often considered the father of the biography. He did not only provide syntheses of people's lives, but he also tried to compare them to a similar person in a 'double biography'. Other than length, there is a difference between such full length biographies and biographical entries in biographical dictionaries. Biographies in biographical dictionaries tend to be more factual. They provide a chronicle of the lives of noteworthy people, without necessarily giving much attention to social environment, political circumstances or comparisons to other people. Full length biographies paint a biographical narrative, while short biographical entries in biographical dictionaries or encyclopedias such as Wikipedia mostly provide 'biographical data'. These biographical data can be the building blocks for full length biographies, or they can serve to build group portraits and systematically compare people (see also Harrison (2004)).

Over the past twenty years the amount of online available 'biographical data' has increased rapidly, with the advent of the Internet and large digitization projects. The potential for biographical research, network analysis and group portraits seems to be endless when all of this data can be linked and shared for analysis (e.g. Fokkens et al., 2017; Arthur, 2017).

Projects aiming at making biographical data available first need to address the question of how to represent this data. Individual projects have dealt with this issue in different ways. Where some introduced or reused formally defined models, others used basic approaches using comma-separated-values to represent the information most commonly provided by the original resource.

Because many projects did not consider data representation a central issue in their digitization efforts, the number of publications about this part of the process remained limited and, as a consequence, knowledge about existing models and best practice for modeling biographical data is not sufficiently shared. This resulted in two challenges for researchers working with biographical data. First, researchers working on new digitization projects for biographical data are 'reinventing the wheel' and run into the same problems others have dealt with before them. Second, most biographical datasets have their own data representation, making it challenging to carry out research across datasets.

A few examples of successful integration of biographical data and standardization of metadata from different sources are the national Australian Dictionary of National Biography (http://adb.anu.edu.au), the Biography Portal of the Netherlands (http://www.biografischportaal.nl), the transnational Biographie-Portal (http://www.biographie-portal.de) and the APIS (Austrian Prosopographical Information System) project (Gruber and Wandl-Vogt, 2017).

This paper proposes a practical approach that addresses the problems faced when integrating biographical data from different sources into one repository. We introduce a repository for biographical data models that provides examples and descriptions of existing data models. The repository provides illustrations of data models used in different projects using fictional biographies, accompanied by fictional biographical data. Researchers working with the models can add information about the process, why the model is designed in a particular way and problems and advantages they experienced from their modeling choices. In addition, the samples in the repository are used to design a generic overarching model that can combine data represented in different formats.

The main contributions of this paper are:
# We compare and classify the design of models for modeling biographical data from fourteen resources
# We introduce a repository that provides insight into the structure of one of these models
# We outline our approach for connecting models that use different frameworks, formats and structures

The remainder of this paper is structured as follows. Section 2 discusses related work. The comparative analysis of biographical data models is presented in Section 3. We describe the set-up and current status of the Repository for Biographical Data Models (BDM) and our proposal for designing a generic model for connecting data from various projects in Section 4. We conclude in Section 5. Appendix A describes the resources we studied for this paper.

===2 Background and Related Work===
Even though a handful of publicly available standards exist for biographical data and some initiatives define their models in RDF (Resource Description Framework) and make use of existing vocabularies, most projects have designed their own model. This can be for historical reasons, either by the desire to stay close to the structure of an original (non)-digital source or by the direct research goals that were outlined in early stages of the digitization process. It is however likely that this is at least partially due to lack of knowledge on existing resources. This lack of knowledge is not due to lack of interest, but to the fact that it is non-trivial to obtain this information. Experience in creating structured data often stays project internal: publications on formalizing biographical data are limited, biographical resources are often part of national projects written in a local language or their use is restricted by copyright.

Making use of other people's experience in their digitization and enrichment projects not only saves work, it can also help avoid problems further down the line. It is difficult to foresee exactly what information various researchers interested in a resource may need later on. Investigating data structures that have already been used for various use cases can provide valuable insight into what works and what does not. Following examples from other projects has the additional advantage that it will be easier to make connections between different datasets, facilitating, for instance, comparative biographical research across borders.

The situation of biographical data models is far from unique and some efforts have been made to address this issue. Franzini et al. (2016) aim to provide an overview of properties of digital editions and RIDE (https://ride.i-d-e.de) offers a review journal for digital editions and resources. In the typical case, the data model used in digital humanities projects is determined by the structure of the original resource or specific research questions from the early phases of the project. This is only natural, because staying close to the original source minimizes loss of information and current research questions form a concrete set of requirements that can be used for designing the model.

In the remainder of this section, we first provide background information on data structures and clarify how related terminology will be used in the remainder of this paper. We then introduce previous projects that provide a common model for multiple biographical resources.

====2.1 Formal Modeling and Linking====
Data can be unstructured (such as flat text), semi-structured (e.g. CSV (comma separated values) files containing descriptions in natural language) or fully structured (e.g. a representation in RDF). Note that an RDF representation can also contain unstructured elements (e.g. a literal value that is a text) and that CSV can also be used to provide fully structured information (e.g. only information that is numerical or ontologically defined). In this paper, we only deal with semi-structured and structured data representations.

Comparing data representations is complex, because formats and models are regularly confused. In particular, advantages and disadvantages of using RDF or XML (eXtensible Markup Language) and JSON (JavaScript Object Notation) are frequently discussed even though XML is a serialization format and RDF is a data model that can be represented in several formats including XML or JSON. Likewise, XML and JSON can be used to represent data models that are not RDF, specified in e.g. the DTD (Document Type Definition) of the XML. When comparing XML to RDF, people generally mean the possibility of capturing information through its structure when using XML (by embedding elements or placing them in some order), where RDF enforces making all information explicit (see for instance Fokkens et al. (2014) for a more elaborate discussion on this matter). Even though we are aware of the fact that XML and RDF operate on a different level and thus cannot be compared, we distinguish between models using RDF and models using non-RDF based XML or non-RDF based JSON or CSV. Unless specified otherwise, the terms XML and JSON will refer to (semi-)structured representations that are not defined in RDF in the remainder of this paper, where we use RDF to refer to RDF models regardless of the format they are represented in.

Structured data forms the basis for applying digital models, but structure in itself does not provide the means to connect or compare data from various resources. In order to automate a process of connecting data, its category must be formally defined. In RDF, identifiers are used to refer to entities or their properties. These entities and properties can be formally defined, which also allows us to define correspondences between entities and properties. These correspondences can link data across resources. We therefore aim to work towards a generic model in RDF.

A full discussion of related work on linking data within the digital humanities is beyond the scope of this paper. We therefore limit this overview to projects that directly influenced the approach proposed in this paper. In our proposal, we follow de Boer et al. (2012), who outline a procedure for converting cultural heritage data structured in XML to RDF with a minimum of data loss. Their approach will be explained in detail in Section 4.2. They ultimately map their converted data to a common data model for cultural heritage data: the Europeana Data Model (EDM; Doerr et al., 2010). We propose to follow this example for biographical data, where we keep data representations as close as possible to their original form and then connect them by defining categories occurring in individual models by relating them to a generic model for biographical data.

====2.2 Work on Biographical Datamodels====
The BiographyNet project applied the procedure outlined by de Boer et al. (2012) to data from the Biography Portal of the Netherlands (BPN) as described in Ockeloen et al. (2013). The BPN forms a collection of biographical dictionaries describing people who are Dutch or lived in the Netherlands. It is one of the projects that already proposed an overarching generic structure for a heterogeneous dataset, resulting in an event-centric model for biographical data (Hoekstra, 2013).

The national Australian Dictionary of National Biography (ADNB; http://adb.anu.edu.au) is part of a larger effort of data aggregation, collaboration and cooperation together with the Humanities Network Infrastructure (HuNI) (Arthur, 2017).

The transnational "Biographie-Portal" (http://www.biographie-portal.de) combines nine biographical resources from four countries (Germany, Austria, Switzerland and Slovenia) and can be searched on name and occupation. Richer developments for these resources, and in particular the Austrian Biographical Lexicon (ÖBL), are developed as part of the APIS project (Gruber and Wandl-Vogt, 2017).

A handful of projects have made use of linked data for enrichment and connecting biographical data to external resources. It is used for connecting data in the HuNI and ADNB data aggregation projects. The Deutsche Biographie (DB) also represents information in RDF. However, to our knowledge, neither of these resources represent all their metadata in RDF. The BPN was converted to linked data as part of the BiographyNet project, which also enriched the metadata by processing the biographical text automatically and linking extracted information to external sources (Fokkens et al., 2017). The model that is used to represent this data in RDF, including an elaborate schema for representing provenance in a detailed manner, can be found in Ockeloen et al. (2013).

To our knowledge, none of the projects discussed above make use of linked data to provide a generic overarching model. The work by Leskinen et al. (2017) comes closest to this idea. They provide a basic structure that can be used for prosopographical research defining name, lifespan and gender. More elaborate information can be defined using externally defined data models such as the Simple Event Model (SEM; van Hage et al., 2011).

The Biographical Data Model Repository proposed in this paper is intended to be complementary to all initiatives mentioned above. It does not provide a platform for aggregating the data itself like BPN, the ADNB or the transnational Biographie-Portal. Its goal is to primarily provide examples of a wide variation of biographical data models. These can be collected across projects with relatively limited effort. To illustrate, the fourteen resources presented here were collected in a couple of weeks. The method we propose for converting and linking data aims to go beyond defining a basic generic model for representing biographical data as developed by Leskinen et al. (2017). We propose a bottom-up approach for representing various resources in RDF, which can consequently be mapped on a high or fine-grained level to other sources.

===3 A comparative analysis===
We collected samples from two platforms for sharing biographical data, eight biographical databases and four data models, two of which were specifically designed as part of a digitization/enhancement project related to one of the databases. This total of fourteen resources was collected as part of the preparation for the Workshop on Biographical Data and Datamodels (http://www.biographynet.nl/dh-biographical-data-workshop/). A short description of each project can be found in Appendix A. The models we observed as part of this investigation come from a wide variety of projects. Some projects mainly focus on the digitization process or historical research where designing a model for presenting biographical data emerged as a by-product. Others specifically aimed at developing a formal model for biographical data.

We compare the models on the level of content (what kind of information is provided), the framework (is the model formalized and how) and formatting (how is data represented). In this investigation, we only consider components of the data that are (semi-)structured: raw text is not analyzed in depth.

====3.1 General Observations====
=====3.1.1 Content=====
We first examine what kind of information can be included in the models in a (semi-)structured manner. As expected, all models we examined represent the person's name and lifespan (if known). When looking at richer models, we observe common themes in the kind of information that is provided. Most resources and models address the individual's career, education, family relations and residence. Furthermore, several resources make the reason for including a person explicit by providing information labeled 'kind of person', 'category' or 'claim to fame'.

The main differences lie in the level of granularity of the information provided. Where some only indicate the sector in which a person worked, others provide detailed information about the firm, dates and time lines of the employment. The same can be observed for education.

=====3.1.2 Framework and Structure=====
The level of formalization highly differs from one model to another. The least formalized models make use of text fields for providing information. They use words represented as strings to define various categories of information and values are presented as descriptions. In these cases, minor differences can already be observed in the way dates are represented or the same location may appear using a different name. Other models use predefined classes and relations. This particularly holds to a large extent for the models that are defined in RDF. Finally, a handful of models adapted their basic structure from TEI P5, which defines a generic XML structure.

Basic representations in strings have the advantage that unstructured and semi-structured data from the original sources can be represented in its surface form in a simple and straightforward manner. However, it may be worthwhile to invest in defining models and ontologies: predefined categories have the advantage that identical information is presented in a consistent manner. Formally defining information in RDF facilitates the process of connecting it to external resources.

=====3.1.3 Representation=====
We compared choices of representation for various data models. The most basic form of structuring data is through CSV. Advantages of using CSV are clear: it is an easy to understand format that can be operated well by humans as well as machines. On the other hand, it provides little support for defining more complex relations. Most data entries consist of rows defining the identifier for the person described, name, dates of birth and death and possibly room for a 'claim-to-fame' category and parents. They become less convenient when defining properties of which a person may have more than one during their life: professions, schools attended, residence, children, etc. They also fall short when defining more complex relations, for instance, the start and end date of each profession together with the location of the position. It is therefore not surprising that CSV is mainly used for resources that only represent a relatively modest amount of metadata on the person.

Resources that do aim to define more complex relations either represent their data in RDF, which can be represented in e.g. XML, Turtle or JSON-LD, or they use some other XML format or JSON structure. XML and JSON both provide straightforward means to define multiple entries of the same categories (e.g. a list in JSON or sequence of XML elements) as well as the means to define more elaborate relations. It is possible to provide formal definitions of what constitutes well-formed XML of a given data structure, including the elements, attributes and values that are permitted. However, XML itself does not offer the means to formally define the meaning of these elements, attributes and values. To summarize, RDF models provide, in principle, the richest formal definitions and are most (explicitly) expressive, followed by (non-RDF defined) XML structures, JSON and finally CSV. The order of complexity of the model, the effort involved in defining them properly and possibly the order of the gentlest learning curve for people starting to work with them, is the inverse: CSV is the simplest, followed by JSON, XML and RDF.

====3.2 Data Sample Analyses====
We compared samples of fourteen biographical data resources outlined in Appendix A (the abbreviations used in our comparison are introduced in the Appendix as well), paying attention to the level of formalization, the overall structure (relation-based, event-based or both) of the model as well as the categories provided for most entries or, for the four datamodels, which categories they specifically formalize. We also indicate the availability of the data itself for the eight databases.

=====3.2.1 Databases=====
Table 1 provides an overview of the properties of the databases. The left side of the table indicates general properties. The first column indicates the generic model that was used as a basis for the model employed by the database: three projects invented their own model from scratch, CBW makes use of representations developed as part of SNAC and all others have taken TEI P5 as a basis. The second column indicates whether the database makes use of the framework RDF and otherwise, which representation format is used. Both databases that have RDF representations also represent information in plain XML. ABD and CBDP are relational databases that can be queried using SQL. CBW uses CSV and JSON for data representations.

The third column indicates whether the structure of the representation is event-centric or mainly relational. The model used for CBW is not rich enough to make this distinction. Two databases are copyright restricted (CRR), two databases can be made available for research purposes (AFR), two are open source (OS) and two are partially open source and can partially be made available for research (OS/AFR), as indicated in column five. The sixth column indicates whether the database only provides structured data as metadata (MD) or whether it also provides structured data tagged in the biographical text (+IT).

The right side of the table indicates which categories of information are provided as specifically structured data. It should be noted that lack of a checkbox does not necessarily mean that the information is not present in the resource. The information can standardly be provided in the biographical text or it can be provided in a semi-structured manner, rather than being part of the structured dataset. The last column indicates the extent to which alternative categories are provided in a structured way. The ÖBL has at least 36 additional relations defined, CBDP has 9 additional information fields and ODNB mainly provides relatively fine-grained subcategories.

{| class="wikitable"
|+ Table 1: Overview of properties of individual biographical databases
! Database !! Model !! Framework or format !! Event/relation !! Accessibility !! Metadata/in-text !! Category marks !! Further specifics
|-
| AINM || TEI P5 || XML || relation || AFR || MD+IT || ✓ ✓ ✓ ✓ ✓ || -
|-
| ANB || TEI P5 || XML || event || CRR || MD+IT || ✓ ✓ ✓ || -
|-
| BPN || TEI P5 || RDF/XML || event || OS/AFR || MD || ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ || -
|-
| CBD || own || RDB || relation || OS || MD || ✓ ✓ ✓ ✓ ✓ || 9
|-
| CBW || SNAC || CSV/JSON || n.a. || OS || MD || ✓ ✓ ✓ || -
|-
| DB || own || RDF/XML || relation || OS/AFR || MD+IT || ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ || -
|-
| ODNB || TEI P5 || XML || event || CRR || MD+IT || ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ||
|-
| ÖBL || own || RDB || relation || AFR || MD || ✓ ✓ ✓ ✓ ✓ ||
|}

The category columns of Table 1 are lifespan, gender, occupation, education, residence, faith, claim-of-fame/person-type and personal relations. The extracted table preserves only the marks per row, not the column each mark belongs to.

=====3.2.2 Platforms and Data models=====
What information is formally represented in the two platforms and four models is presented in Table 2. The information provided by APIS and BiographyNet (BNET) corresponds to that included in the respective databases they are related to (ABD and BPN). For reasons of space, we omitted categories that are only provided by one of these two resources.

APIS provides the same 36+ relations that are indicated for the ABD. The other resources can provide richer structured information due to their ability to be combined with other models. BNET, BCRM and DFKI are defined in RDF for this exact reason. SNAC and EIBIO do not represent their data in RDF, but do make use of external links to connect information from various sources.

{| class="wikitable"
|+ Table 2: Overview of properties defined in models and platforms
! Resource !! Framework/format !! Event/relation !! Category marks
|-
| APIS || RDB || rel. || ✓ ✓ ✓ ✓ ✓ ✓
|-
| BNET || RDF || event || ✓ ✓ ✓ ✓ ✓ ✓
|-
| BCRM || RDF || event || ✓ ✓ ✓ ✓ ✓
|-
| DFKI || RDF || event || ✓ ✓ ✓ ✓
|-
| SNAC || JSON || rel. || ✓ ✓ ✓
|-
| EIBIO || CSV || rel. || ✓ ✓
|}

The category columns of Table 2 are lifespan, gender, occupation, education, personal relations and external links/extensions. The extracted table preserves only the marks per row, not the column each mark belongs to.

=====3.2.3 Summarizing the analysis=====
Overall, we observe that all resources provide ways for specifying a person's life span in a structured way. Almost all resources provide means to specify a person's occupation or gender, CBW being the only exception when it comes down to education and ABD and CBDP being the only two sources that do not seem to have a field to specify gender. The other categories, faith, person-type/claim-to-fame, education, residence and personal relations each occur in four to eight resources. The division between event-based and relational based structures is about 50-50. Notably, resources that make use of RDF seem to have a preference for event-centric structures. A probable reason for this will be outlined in Section 4.2, where we describe the process of connecting data, including a conversion step to representations in RDF.

===4 The BDM Repository===
As a practical approach to address the two main drawbacks of developing models independently outlined in Sections 1 and 2, we initiated a repository of biographical data models (the BDM repository). We first describe the process of collecting models in the BDM repository and then outline the process we intend to follow to connect the models collected in this repository.

====4.1 Collecting Data====
The Biographical Data Model (BDM) Repository is a place for collecting and connecting biographical data models. The BDM Repository serves three purposes: First, researchers faced with the task of representing biographical data can find various examples of models used by other projects in one place. Second, the repository forms a natural environment for comparing data models and recording advantages and disadvantages of various representations. Third, the repository will support the process of representing models in RDF (for those that are not represented in RDF already) and defining correspondences between models. These correspondence definitions can be used to link data from various models, which in turn enables a wide range of comparative research.

The first challenge this repository faces is that many biographical data collections are copyrighted. From the collections described above, only two are completely open source and two are partially open source. Samples from the other resources cannot be made openly available to everyone. To circumvent this problem, we wrote a handful of biographies of fictional characters and make the texts and metadata we (partially) invented available under the Creative Commons License. The idea is that the repository will ultimately include representations of these non-copyrighted texts in all biographical data models we are aware of. This allows us to illustrate the structure of the models without sharing their copyrighted content. It has the additional advantage that it becomes easier to compare information between models, since different samples provide the same information.

The BDM repository currently provides samples for all 21 dictionaries included in the Biographical Portal of the Netherlands. They are illustrated by the biography of Mary Morstan, protagonist in one of the Sherlock Holmes books and later wife of Dr. Watson. The biographies are written in English, but otherwise follow the conventions of the original resources (concerning abbreviation and semi-structure in text). The information provided on Morstan currently covers the categories included in the BPN models and will be extended accordingly as models with structure for additional information are added. The latest version of the BDM repository can be found on GitHub (https://github.com/cltl/BiographicalDataModels).

====4.2 Connecting Biographical Data====
Once multiple data models have been included in the BDM repository, we can investigate how to connect them. We plan to achieve this by representing all models in RDF. Once individual models have been formally defined, we can define correspondence between them. In this section, we outline this process.

=====4.2.1 From CSV or XML to RDF=====
The first step is to provide RDF representations for models that have not been defined in RDF so far. When converting from one representation format to another, there is always a risk of loss in information. This particularly applies when the data is converted to a standardized model. We avoid this by following the procedure outlined in de Boer et al. (2012) for converting XML to RDF and adapting a similar approach for converting CSV and JSON files. The procedure consists of the following steps (adapted from de Boer et al. (2012), page 735): 1) XML/CSV/JSON ingestion. 2) Crude conversion to RDF. 3) RDF restructuring. 4) Design metadata mapping scheme. 5) Align vocabularies with external sources. 6) Publish as Linked Data.

In the first step, the original structure is interpreted. Then a direct conversion to RDF maintaining the full original structure takes place. As also explained by de Boer et al. (2012), data in XML can be complex: elements can be nested deeply within other elements, they may be grouped in a specific manner or ordered by the structure. Some of these structural properties are meaningful (e.g. elements within a group are connected by some implicit link, or the order of elements indicates their order in time), but many do not express information that needs to be maintained in the RDF structure. If the original XML (or JSON) is complex, the resulting RDF structure is likely to be messy. The third step addresses this by restructuring the RDF so that structures containing implicit information are translated to flatter (non-embedded) representations that make this information explicit and idiosyncratic complexities are removed. The first three steps ideally result in an RDF representation that is as simple as possible, but still provides all information from the original data.

In the fourth step, researchers explore which categories and relations expressed in the generated RDF correspond to definitions and classes defined in other vocabularies. Based on this exploration, correspondences between the resulting RDF and existing models and vocabularies can be defined. In the fifth step, these correspondences are used to link the generated RDF to external sources after which it is possible to publish the model as linked data. The BDM repository aims to help researchers carry out the first four steps. Since the repository only provides mock-up samples of data, the actual alignment of the resource and publication as linked data is out of scope. In the next subsection, we will explain how correspondences may be defined between a relational based and event-centric model.

=====4.2.2 Conversions and Linking=====
Figure 1 provides an illustration of the conversion of an event-centric representation to RDF. We illustrate the representation of the event after Step 3, before the step mapping it to other resources. The namespace nns: stands for a new namespace for the dataset. Conversion to RDF is relatively straightforward: a unique identifier is assigned to the event, this is typed as an occupation and all other information can be defined directly as properties of the event. In the next step, these relations can be mapped to other existing models. We can use the Simple Event Model (van Hage et al., 2011) for instance to define the location, the begin time and end time. Categories that commonly occur in biographical data, such as occupations, should ideally also be defined by the same vocabulary across resources.

Figure 1: Illustration of conversion of event-centric data representation to RDF

Representing a relational based structure in RDF requires more effort for relations that are temporally bound or tied to a specific location. Figure 2 provides an illustration. In principle, the relation itself can easily be translated into RDF by assigning a URI to the relation and specifying its meaning. However, we then need to decide how to specify the duration and location of the employment. The problem of making statements about a triple in RDF is well-known and several solutions have been proposed for solving this challenge. Van Atteveldt et al. (2007) provide an in-depth analysis of proposals. We illustrate two commonly used approaches in Figure 2.

Figure 2: Illustration of conversion of relational data representation to RDF

On the left-hand side, the statement about Mary's employment is taken as a unit that can receive its own identifier. This approach is used for defining context (e.g. Carroll et al., 2005; MacGregor and Ko, 2003). In our example, we use a named graph for assigning an identifier to the relation. Information about time and place is then linked to the identifier of the named graph. The advantage of this approach is that it remains close to the original data structure. Following a solution originally designed to define contexts also intuitively makes sense: the specific relation applied in a given time period and in a given place. On the other hand, we also want to define the context in which the information about time and place is provided: what is the original source of this information? How was it integrated in this database and by whom? What conversions and other operations were applied to this data? Modeling provenance is essential for research in the digital humanities (Ockeloen et al., 2013, among others). We can place the information in the left box of Figure 2 as well and then define provenance information for this new named graph, but (potentially extensive) use of nested named graphs does not improve the usability of our data structure.

The solution on the right-hand side is called reification. In this case, a new node is introduced that splits the predicate "employed by" into two relations: one with the subject of the original triple and one with the object. Properties associated with the relation can then be linked to this new node. This solution changes the original structure, making the relation between, in this example, the employer and employee less direct: they are now connected to the same node rather than each other. It also increases the number of relations. On the other hand, it avoids introducing an additional layer of nested named graphs. An additional advantage is that reification of relations that involve a state or event results in event-centric structures (compare the representation on the right-hand side of Figure 2 to the one in Figure 1). Reifica-

making it harder to make connections between various resources. We illustrated some of these differences through an analysis of fourteen resources collected as part of the Workshop on Biographical Datamodels held in Krakow, July 2016.

The problem of models being developed independently is partially due to the difficulties involved in finding detailed information on data representations used in various projects. In this paper, we have taken a first step in addressing the problem. We propose a practical approach in the form of a biographical data model repository where detailed examples of different models can be collected. The samples will make use of biographical texts of fictional characters and invented data written under the Creative Commons license, avoiding issues with copyright.

Once a number of resources have been collected, the repository can furthermore be used to start and define connections between models by mapping them to a generic biographical representation. We outlined a general procedure that starts by converting resources to linked data representations (if they are not provided in RDF already) and consequently linking them to a generic model. We illustrated the process of converting event-centric and relationally structured resources to RDF. We showed that relational resources can be converted to event-centric representations in RDF when applying reification.
tion thus facilitates the process of defining correspondences As of the moment of submission, the repository illustrates between information from these relational based represen- all 23 biographical dictionaries included in the Biography tations to information represented in event-centric models. Portal of the Netherlands. In the near future, we plan to add We will therefore adopt this solution once we start connect- illustrations of the other thirteen resources we collected, ing information from various models. as well as encourage researchers involved in other projects with biographical data to add illustrations of their models 5 Conclusion to the repository. The repository is available on github.11 Many projects that involve digitizing or enriching biograph- ical data develop their own data model. In addition to the inefficiency of not making use of knowledge acquired in by 11 https://github.com/cltl/ other resources, this has led to differences between models BiographicalDataModels 6 Acknowledgements notations. In Proceedings 10th Joint ISO-ACL SIGSEM This work was supported by the Amsterdam Academic Al- Workshop on Interoperable Semantic Annotation, pages liance Data Science (AAA-DS) Program Award to the UvA 9–16. and VU Universities and NWO VENI grant 275-89-029 Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek awarded to Antske Fokkens. We furthermore would like Vossen, Susan Legêne, Guus Schreiber, and Victor to thank researchers involved in the individual projects for de Boer. 2017. Biographynet: Extracting relations be- providing samples of their data as well as the participants tween people and events. In Á. Z. Bernád, C. Gruber, and of the BDM workshop in Krakow for their input during dis- M. Kaiser, editors, Europa baut auf Biographien: As- cussions. We thank the audience of BD2017 and anony- pekte, Bausteine, Normen und Standards fr eine europis- mous reviewers for their useful and detailed feedback. All che Biographik, pages 193–224. 
New Academic Press, remaining errors are our own. Vienna. Greta Franzini, Melissa Terras, and Simon Mahony. 2016. 7 References 9. a catalogue of digital editions. Digital Scholarly Edit- ing, page 161. Paul Arthur. 2017. Integrating biographical data in large- Christine Gruber and Eveline Wandl-Vogt. 2017. Mapping scale research resources: Current and future direction. historical networks: Building the new Austrian Prosopo- In Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Eu- graphical Biographical Information System (APIS). In ropa baut auf Biographien: Aspekte, Bausteine, Normen Á. Z. Bernád, C. Gruber, and M. Kaiser, editors, Europa und Standards fr eine europische Biographik, pages 193– baut auf Biographien: Aspekte, Bausteine, Normen und 224. New Academic Press, Vienna. Standards für eine europische Biographik, pages 271– Peter K Bol, Robert M Hartwell, Michael A Fuller, et al. 282. New Academic Press, Vienna. 2004. China biographical database project (cbdb). Daniele Guido, Marten Düring, and Lars Wieneke. 2016. Alison Booth. 1999. The lessons of the medusa: Anna European integration biographies reference database jameson and collective biographies of women. Victorian (eibio). In DH Benelux. Studies, 42(2):257–288. Brian Harrison. 2004. The dictionary man in: M. bostridge Jeremy J Carroll, Christian Bizer, Pat Hayes, and Patrick ed. In Lives for sale. Biographers tales, pages 76–85. Stickler. 2005. Named graphs, provenance and trust. In Proceedings of the 14th international conference on Rik Hoekstra. 2013. Historische representativiteit in con- World Wide Web, pages 613–622. ACM. text. over het biografisch portaal als onderzoeksinstru- ment. Victor de Boer, Jan Wielemaker, Judith van Gent, Michiel Hildebrand, Antoine Isaac, Jacco van Ossenbruggen, and John Kendall. 2014. American national biography. Refer- Guus Schreiber. 2012. Supporting linked data produc- ence Reviews, 28(2):7–10. 
tion for cultural heritage institutes: The amsterdam mu- Hans-Ulrich Krieger and Thierry Declerck. 2015. An seum case study. In ESWC, volume 7295 of Lecture owl ontology for biographical knowledge. representing Notes in Computer Science, pages 733–747, Berlin and time-dependent factual knowledge. In Serge ter Braake, Heidelberg. Springer. Antske Fokkens, Ronald Sluijter, Thierry Declerck, and Thierry Declerck and Rachele Sprugnoli. 2018. Consider- Eveline Wandl-Vogr, editors, Biographical Data in a ations about uniqueness and unalterability for the encod- Digital World. Proceedings of the First Conference on ing of biographical data in ontologies. In Proceedings of Biographical Data in a Digital World. Amsterdam, The the second Conference of Biographies in a Digital World Netherlands, April 9, 2015, pages 101–110. BD2017. Katalin Lejtovicz and Amelie Dorn. 2017. Connecting Österreichische Akademie der Wissenschaften. 2013. people digitally-a semantic web based approach to link- Österreichisches biographisches lexikon 1815–1950. on- ing heterogeneous data sets. In Proceedings of the Work- line edition. Online Publikation: http://www. biogra- shop Knowledge Resources for the Socio-Economic Sci- phien. ac. at/oebl. ences and Humanities associated with RANLP 2017, Martin Doerr, Stefan Gradmann, Steffen Hennicke, An- pages 1–8. toine Isaac, Carlo Meghini, and Herbert van de Som- Petri Leskinen, Jouni Tuominen, Erkki Heino, and Eero pel. 2010. The europeana data model (edm). In World Hyvönen. 2017. An ontology and data infrastructure for Library and Information Congress: 76th IFLA general publishing and using biographical linked data. In Pro- conference and assembly, pages 10–15. ceedings of the Workshop on Humanities in the Semantic Bernhard Ebneth and Matthias Reinert. 2017. Potentiale Web (WHiSe II). CEUR Workshop Proceedings (October der deutschen biographie als historisch-biographisches 2017). informationssystem. In Á. Z. Bernád, C. Gruber, and Tom J Lynch. 2014. 
Social networks and archival context M. Kaiser, editors, Europa baut auf Biographien: As- project: A case study of emerging cyberinfrastructure. pekte, Bausteine, Normen und Standards fr eine europis- DHQ: Digital Humanities Quarterly, 8(3). che Biographik, pages 283–295. New Academic Press, Robert M MacGregor and In-Young Ko. 2003. Represent- Vienna. ing contextualized data using semantic web tools. In Antske Fokkens, Aitor Soroa, Zuhaitz Beloki, Niels Ock- PSSS. eloen, German Rigau, Willem Robert van Hage, and Niels Ockeloen, Antske S. Fokkens, Serge ter Braake, Piek Vossen. 2014. Naf and gaf: Linking linguistic an- Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne. 2013. Biographynet: Managing provenance at contain three or more short biographies describing only multiple levels and from different perspectives. In Pro- women. The collection was originally published as a book ceedings of the Workshop on Linked Science (LISC2013) (Booth, 1999). The main metadata from this resource is at ISWC (2013). available as CSV and it has been included in SNAC, which Brian Ó Raghallaigh and Gearóid Ó Cleircı́n. 2015. will be described below. Ainm.ie: Breathing new life into a canonical collec- The Deutsche Biographie (Reinert et al., 2015, DB) tion of irish-language biographies. In Serge ter Braake, (Ebneth and Reinert, 2017) consists of the old and new na- Antske Fokkens, Ronald Sluijter, Thierry Declerck, and tional German biographical dictionary online.17 It includes Eveline Wandl-Vogt, editors, Biographical Data in a information about 730,000 individuals in German speaking Digital World. Proceedings of the First Conference on areas covering a timespan from the early Middle Ages until Biographical Data in a Digital World. Amsterdam, The present. The resources also includes approximately 50,000 Netherlands, April 9, 2015, pages 20–23. biographical descriptions. 
Matthias Reinert, Maximilian Schrott, Bernhard Ebneth, The Oxford Dictionary of National Biography (Harrison, and Team deutsche biographie.de. 2015. From biogra- 2004, ODNB) comprises an online version of the old bio- phies to data curation - the making of www.deutsche- graphical dictionary as well as the new digital born addi- biographie.de. In Serge ter Braake, Antske Fokkens, tions.18 In total, it contains over 60,000 biographies. Ronald Sluijter, Thierry Declerck, and Eveline Wandl- The Austrian Biographical Lexicon Online (der Wis- Vogr, editors, Biographical Data in a Digital World. Pro- senschaften, 2013, ÖBL) describes meaningful people born ceedings of the First Conference on Biographical Data in the Austrian-Hungarian Empire, worked there or lived in a Digital World. Amsterdam, The Netherlands, April there and died between 1815 and 1950. It currently con- 9, 2015, pages 13–19. tains more than 50,000 biographies.19 Wouter Van Atteveldt, Stefan Schlobach, and Frank A.2 Platforms Van Harmelen. 2007. Media, politics and the seman- tic web. In European Semantic Web Conference, pages Our study also included two platforms meant for sharing 205–219. Springer. information. The European Integration Biographies refer- Willem Robert van Hage, Véronique Malaisé, Roxane ence database (Guido et al., 2016, EIBIO) is a structured Segers, Laura Hollink, and Guus Schreiber. 2011. De- repository for information about people. It combines struc- sign and use of the Simple Event Model (SEM). Journal tured data with free text bringing information from exter- of Web Semantics, 9(2):128–136. nal repositories such as VIAF and Wikipedia together that can be queried by an API. The data structure that is used is A Appendix: Biographical Databases rather basic (data is shared as a CSV and not enough infor- This appendix provides a brief description of all resources mation is provided to determine whether it is relational or included in the comparative study (Section 3.2). 
event-centric). The Social Networks and Archival Context project (Lynch, A.1 Data collections 2014, SNAC) provides data of people and organizations in AINM.IE (Raghallaigh and Cleircı́n, 2015, AINM) is a col- their socio-historical context independently from the origi- lection of biographies describing people who are in some nal resources that provided information about their lives.20 way connected to the Irish language. It contains 1,749 bi- Data from the CBW is included in this resource which uses ographies written in Irish of people dating from 1560 until JSON as an overall structure. present.12 The American National Biography (Kendall, 2014, ANB) A.3 Data models covers the lives of 19,000 noteworthy American individu- For our analysis we have looked at four data models. APIS als.13 provides rich structured data for the ÖBL (Gruber and The Biographical Portal of the Netherlands (BNP) has been Wandl-Vogt, 2017). Information comes from the original introduced in the previous section. It is a collection of 23 metadata as well as from automated and manual annota- different biographical dictionaries of Dutch people.14 tions (Lejtovicz and Dorn, 2017). Compared to the other The China Biographical Database Project (Bol et al., 2004, resources, it has a wide range of specifically defined rela- CBD) provides biographical information about approxi- tions between people, organizations and locations. mately 360,000 persons15 most of whom lived between the The BiographyNet project (BNET) aims to enhance the 7th and 19th century. It provides detailed information about possibilities for historical research using the BPN by pro- locations and has comparatively rich information about so- viding structured information in RDF, extracting informa- cial structures. It is the only resource in our sample that tion from text and providing access to this information specifies information about possessions. through a demonstrator (Fokkens et al., 2017). 
Among oth- The Collective Biographies of Women16 (CBW) provides ers, the project resulted in an RDF version of the BPN in- annotated information on books written in English that cluding an extensive model for representing provenance in- formation (Ockeloen et al., 2013). 12 https://www.ainm.ie 13 17 http://www.anb.org http://www.deutsche-biographie.de 14 18 http://www.biografischportaal.nl http://www.oxforddnb.com 15 19 As of April 2015, indicated by the developers http://www.biographien.ac.at/oebl 16 20 http://womensbios.lib.virginia.edu http://snaccooperative.org/?redirected=1 The BioCRM (BCRM) is designed for representing bio- graphical information for supporting prosopographical re- search in the context of the Republic of Letters.21 It is an extension of CIDOC CRM so that it can easily be used in a variety of digital humanities projects. The model pro- vides the means for defining basic biographical information and is mainly meant to complement or be complemented by other models. The final model we include in our comparative analysis is the DFKI Biography Ontology (Krieger and Declerck, 2015). Contrary to all other resources included here, this model does not provide specific relations for persons, but rather a generic framework that can represent temporarily bound events and states as well as fixed properties of per- sons. It can be seen as complementary to the other models. The latest status of this ontology and a proposal for moving forward can be found in Declerck and Sprugnoli (2018), this volume. 21 http://www.republicofletters.net
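
As a supplementary illustration of the two encodings compared in Section 4.2.2 (named graph versus reification), the following Python sketch builds both from one relational statement. It is a minimal sketch under stated assumptions, not the representation used in the repository: triples and quads are plain tuples rather than the output of an RDF library, and every identifier (nns:Mary, nns:ACME, nns:employedBy, nns:Employment, nns:employee, nns:employer, nns:location, nns:begin, nns:end, and the values attached to them) is hypothetical.

```python
# Illustrative sketch only: all identifiers and values below are hypothetical.
NNS = "nns:"  # stands in for the new namespace of the dataset

# Context that has to be attached to the relation "Mary employed by ACME".
CONTEXT = {"location": "Amsterdam", "begin": "1901", "end": "1910"}


def as_named_graph(subject, predicate, obj, context, graph_id):
    """Keep the original triple intact inside a named graph.

    Returns quads (subject, predicate, object, graph). The context is stated
    about the graph identifier in the default graph (graph = None).
    """
    quads = [(subject, predicate, obj, graph_id)]
    for key, value in context.items():
        quads.append((graph_id, NNS + key, value, None))
    return quads


def as_reified_node(subject, obj, context, node_id, node_type,
                    subject_role, object_role):
    """Replace the direct relation by a new node linked to both participants.

    Returns plain triples that all share the new node as their subject.
    """
    triples = [
        (node_id, "rdf:type", node_type),
        (node_id, subject_role, subject),
        (node_id, object_role, obj),
    ]
    for key, value in context.items():
        triples.append((node_id, NNS + key, value))
    return triples


quads = as_named_graph(NNS + "Mary", NNS + "employedBy", NNS + "ACME",
                       CONTEXT, graph_id=NNS + "graph1")
triples = as_reified_node(NNS + "Mary", NNS + "ACME", CONTEXT,
                          node_id=NNS + "employment1",
                          node_type=NNS + "Employment",
                          subject_role=NNS + "employee",
                          object_role=NNS + "employer")

for quad in quads:
    print(quad)
for triple in triples:
    print(triple)
```

In the reified output, every statement hangs off a single typed node, which is the same shape as the event-centric representation described for Figure 1; the named-graph output keeps the direct link between employee and employer but needs a fourth element on each statement.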