Small Lives, Big Meanings. Expanding the Scope of Biographical Data through Entity Linkage and Disambiguation

Lodewijk Petram, Jelle van Lottum, Rutger van Koert, Sebastiaan Derks
Huygens ING, KNAW Humanities Cluster
Oudezijds Achterburgwal 185, 1012 DK Amsterdam
E-mail: {lodewijk.petram; jelle.van.lottum; sebastiaan.derks}@huygens.knaw.nl; rutger.van.koert@di.huc.knaw.nl

The Huygens institute for Dutch history and culture aims to facilitate and enhance collaborative research with and on biographical data.
We give a brief outline of the Huygens ING digital biographical data policy, describe how we share our data with the world, and explain
how we facilitate the exploration of similarities and interconnections between the Huygens data, external data collections and user-
uploaded datasets, without imposing selection criteria. Finally, we present a use case that shows how our policy and infrastructure enable
researchers to employ large collections of ambiguous biographical data, hitherto mainly used for genealogical reference, for addressing
innovative, challenging research questions.

Keywords: biographical data, entity matching, disambiguation, digital infrastructure, genealogical data, prosopography

1. Introduction

Daniel Engel was born in Danzig (present-day Gdańsk in Poland) and signed up with the Delft branch of the Dutch East India Company (VOC) on the first of October, 1766. He worked as an ordinary seaman during the seven-month journey to Batavia (now Jakarta), stayed there for nine months and then sailed back to Europe on the same ship. Engel was probably illiterate, as he signed with a cross.[1]

[1] These and other VOC employment data: VOC Opvarenden database ([…]OPVARENDEN and http://dutchshipsandsailors.nl/).

This is all we can infer about the life of Daniel Engel from his employment record – too little by far to deserve an entry in the Biography Portal of the Netherlands[2], the online collection of biographies of prominent people from Dutch history, maintained by the Huygens Institute for Dutch history and culture (Huygens ING). Engel was simply one of the many thousands of men from German lands who joined the ranks of the VOC in the seventeenth and eighteenth centuries.

[2] http://www.biografischportaal.nl/

But there is more on Daniel Engel. It seems he joined the VOC two more times, in 1788 and 1792, as a boatswain’s mate and able seaman, respectively. The latter employment record furthermore shows that Engel died in Asia, on the second of October, 1798. There is also mention of a Daniel Engel from Danzig in the interrogation transcripts of the English admiralty, dating from the Fourth Anglo-Dutch War (1780-1784), when the English seized many Dutch ships. This sailor worked as a boatswain on a merchant’s ship that was supposed to have brought cargo from Curaçao to Rotterdam in 1782. He was born in 1753 or 1754.[3]

[3] Prize Paper Dataset, cf. footnote 6.

It is likely that these four data observations refer to the same individual: together they form a logical career path of an eighteenth-century sailor, even though Daniel Engel would have been only twelve or thirteen years old when he first sailed to Asia. This mini-biography is hardly revolutionary – historians have pieced together bits of biographical information from multiple sources for ages (e.g. Ogborne, 2008) – and it is also still not worthy of an entry in the Biography Portal. However, advances in digital techniques now allow for (semi-)automated matching of large numbers of data entities. Disambiguating the just under 800,000 person entities in the VOC employment records has become feasible, and the same holds for data observations in other large, digitized source collections. This opens up possibilities for employing the many snapshots of persons’ lives that are available in e.g. genealogical sources and historical employment records in large-scale prosopographical analyses that may be instrumental in answering urgent, challenging research questions.

At Huygens ING, we seek to connect such collections of disambiguated data to our traditional, mostly highly curated sets of biographical data, with the intent to create an integrated environment that meets the needs of researchers working on a broad range of research questions. In the remainder of this paper, we outline the Huygens ING digital biographical data policy and how we aim to incorporate data on the lives of both prominent people and small fry in our new linked open data infrastructure, give a short overview of the technique we use for (semi-)automatically matching entities from one or multiple sources, and finally present a research use case.

2. Huygens ING and Biographical Data

The mission statement of Huygens ING reads: ‘Innovating history: unravelling history with new technology’. The institute tries to accomplish this mission by developing and applying new, advanced digital tools that help open up
historical sources, which are often difficult to access and use, and hence stimulate innovation in research. The institute’s updated digital biographical data policy reflects this mission.

Traditionally, biographical dictionaries have formed the heart of the Huygens ING biographical data collection. The institute has a long history of editing biographical dictionaries and publishing these as book series or, in more recent times, making the entries available digitally through separate web interfaces. The development of the Biography Portal, essentially an index to the various biographical dictionaries, was a first effort to bring together the available biographical data.

Huygens ING is now gradually entering a new stage, in which all biographical data are migrated to the institute’s new digital infrastructure. Structured data on person entities are interlinked with a text browser, in which the original texts of the biographical dictionaries and other book and source collections are made available. A user can thus easily search for a person and view related entries in biographical dictionaries and mentions in other texts.

So far, the new infrastructure largely resembles a re-fashioned Biography Portal. What is new, however, is that the structured data are ingested into a linked open data (LOD) environment, and can hence easily be linked with other datasets (both internal and external, national and international). To guarantee optimal findability and re-usability of our data on persons, we align all person entities to those linked open data ontologies that are most used in the Arts and Humanities, and by cultural heritage institutions, both within the Netherlands and internationally: CIDOC-CRM, Wikidata, schema.org and FOAF.

Furthermore, the new digital infrastructure is specifically designed as a humanities research environment. Whereas the traditional book volumes, web interfaces, and even the Biography Portal first and foremost served as reference works – a typical researcher would use them to look up information on one or a small number of persons – the new environment offers better search (Elasticsearch) and functionality to explore similarities and interconnections, thus allowing users who practice collective biography and prosopography to easily collect data on the groups of people that interest them (cf. Harders and Lipphardt, 2006). Researchers can furthermore link data elements across multiple sources and use data observations to enrich their own datasets. Finally, they can query the data through the API or download selections of data in various file formats, and then analyse the data offline or using tools for data analysis and visualisation that are available on the internet. In short, the data are ready to be used by researchers.

The Huygens ING digital infrastructure thus has an interactive character; the focus is not solely on making data available, but also, and especially, on allowing researchers to use and share them. In parallel to this, we aim to facilitate and enhance collaborative research with and on biographical data, which comprises, in our view, any biographical data that might be of interest to academia. Researchers are welcome to upload their own data, which they can link up to our data, or make connections between the data in the infrastructure and external datasets. Furthermore, to accommodate researchers’ needs, we are currently developing an entity matching tool, which will become available within the digital infrastructure, that allows researchers to easily find candidate matches between entities from multiple datasets. After validation, the matches will be linked to a resolved entity. We will go further into the details of this tool in the next section.

To gather and present the data in clear, domain-specific collections, our infrastructure consists of multiple, interconnected instances. The curated Huygens ING datasets on the history of knowledge, Dutch history and literary studies are available for reference and analysis in Data Huygens ING.[4] This data hub is directly linked to that of CLARIAH[5], the Dutch national digital infrastructure project for the Arts and Humanities. The benefit of this set-up is that it enables us to validate and manage the data within the domain context, and it also helps us implement our data provenance policy. Huygens ING provides comprehensive provenance information for all its datasets and presents this in a form that is both understandable for humans and interoperable with other data infrastructures within the semantic web. At dataset level, the provenance information consists of a short and general description of the dataset, a list of most-used sources, and information on selection criteria and information extraction techniques that were applied in the process of compiling the dataset. This information is available as an introductory text to the dataset and is also added, in short form and modelled using the P-PLAN Ontology (Garijo and Gil, 2012), to every record. As such it will enable researchers who see an isolated data observation in the LOD cloud to learn about the context in which the data observation came about. Additionally, at record level, we provide specific references to sources. We encourage users to provide the same information for user-uploaded datasets in the CLARIAH data hub. Furthermore, for all data in the infrastructure, technical provenance information is automatically retained. This allows users to see when a particular dataset was originally uploaded and by whom, and which edits were made on a particular data element, either manually or by built-in tooling, such as for entity matching.

[4] https://data.huygens.knaw.nl/
[5] https://www.clariah.nl/; https://anansi.clariah.nl/

3. Automated Record Linkage

The record linkage tool we are currently developing enables users to find matches between entities in one or more sets of data observations, selected from the structured data repository within our digital research environment or external LOD sources. For the time being, the tool is primarily intended for finding matches between person entities. It allows users to measure name similarity and refine candidate matches using rules that are e.g. based on geographical data or dates.

We chose to develop the tool in a PostgreSQL environment for the relatively speedy matching results it offers, especially when using trigram matching. The tool downloads selected RDF triples, automatically converts them
into CSV format and loads them into the PostgreSQL environment. In the matching process, it creates a new dataset with matched entities, which, after validation by the user, is returned to the LOD environment. This new dataset includes full provenance information about the matching parameters that were applied (algorithm and additional rules) and the user doing the final validation step. All provenance data are automatically retained during the process of candidate generation and validation.

The tool offers various methods for measuring string similarity, which can be used for matching names and toponyms: trigram matching (the preferred method, for speed reasons; it uses the similarity function in the PostgreSQL (9.5) pg_trgm module), Levenshtein distance, and (Double) Metaphone. When geocodes are available, locations can also be matched using the PostgreSQL extension PostGIS. This extension allows users to find matches based on either an exact geographical location or a user-set range around a geographic point.

To start the matching procedure, a user first manually selects data fields for matching and then creates a set of refinement rules, tailored to the data at hand, to improve matching results and/or exclude irrelevant matching candidates. For example, if a user wants to match entities from a birth register with a faculty list, he could create a rule that discards candidates that would have been under eighteen or over one hundred years of age when employed at university. Another rule could state that candidates who are between age 25 and 65 when employed at university should get higher scores.

The tool leads the user through an iterative matching procedure (cf. e.g. Efremova et al., 2014; Idrissou et al., 2017). Users are encouraged to set strict rules at first. This will yield a relatively small number of high-quality candidates, from which the user can then select matches for approval. After this first matching and validation round, the matched records are split from the original dataset and sent to a new dataset, which only contains validated data. The user can then let the tool iterate once or multiple times over the remaining original data using different sets of matching rules to generate additional candidate sets, from which approved matches can be added to the set of validated data. Taken together, the steps in the matching procedure yield results with high precision and recall.

4. Research Use Case: Sailors’ Careers

The research project ‘Human capital, immigration and the early modern Dutch economy: job mobility of native and immigrant workers in the maritime labour market, c.1700-1800 (HUMIGEC)’ illustrates the potential of our infrastructure and entity linkage tool for academic research. This project’s research question originates from a currently hotly debated topic in both the political and the public arena: what is the economic contribution of migrant workers to a recipient economy? This is a difficult question to answer for modern economies, let alone for economies from the past, since historical statistics on education or training levels of workers are largely lacking. Without these, simply having estimates of the size of the migrant influx is not sufficient. After all, it makes a huge difference whether migrants are unskilled, skilled or become skilled during their careers in the recipient economy.

Although for the pre-1800 period sources containing clear indicators of education or training levels are rare, we do have large numbers of historical employment records. However, such sources often provide no more than a snapshot of a person’s life and are therefore relatively limited in their use. But by matching entities from multiple source collections, these records become much more meaningful. Matching a sufficiently large number of entities was hitherto practically impossible, due to the simple fact that these data collections are large and manually finding matches takes a lot of time, but our automated entity matching tool enables us to do so – and in the near future other scholars as well. In the case of HUMIGEC, the tool helps us to reconstruct individual careers, which in turn makes it possible to compare the relative success of migrant and native workers. As the success of careers is a good indicator of skills, this assessment will allow us to address the central research question of the project.

We selected the maritime sector of the eighteenth-century Dutch Republic as a case study in HUMIGEC, because this was a key sector of the economy, characterised by a high level of migrant participation. Moreover, its workers were well documented: we have almost 800,000 employment records of the VOC, digitised by a number of archival institutions in the Netherlands, that cover the entire eighteenth century, and c. 15,500 records on Dutch mercantile marine crews from the Prize Paper Dataset compiled by HUMIGEC’s PI Jelle van Lottum.[6] Each record in both collections contains data on a sailor’s name, place of birth, rank on board and start date of the employment. For the sailors in the Prize Paper Dataset, we also know their age when questioned by the English admiralty.

[6] Data collected in the Economic and Social Research Council (ESRC) project no. RES-062-23-3339: Migration, human capital and labour productivity: The international maritime labour market in Europe, c. 1650–1815.

By matching entities within and between these datasets, as shown by the example of Daniel Engel in the introduction to this paper, we can (partially) reconstruct sailors’ careers, which we can then use to compare the level of job mobility (i.e. promotion or job switching) of non-migrant and migrant workers (Gibbons and Waldman, 1999). This will give us insight into the extent to which migrants succeeded in gaining skills (i.e. human capital) during their careers, and allow us to compare this with non-migrants.

We use the entity alignment tool introduced in the previous section to find data observations that are probably related to the same individual. We first look for data observations with a high level of name similarity, measured on the basis of trigram matching, and filter out irrelevant results by applying a set of rules based on dates (for example, a person cannot have sailed out before birth or after death, cannot have been employed on two ships at the same time, cannot have been in Asia and Europe at the same time, etc.) and domain expertise (it is e.g. unlikely that a person who had worked as an ordinary seaman on a
single trip rejoined the ranks of the VOC as a captain).

Next, we use the sailors’ places of birth as a check on matches. Since we have to deal with quite a bit of variation in toponym spelling – all records were written down by clerks who often did not speak the same language as the sailors in front of them and who were also frequently unfamiliar with the towns and villages, often in the German lands and Scandinavia, mentioned by the sailors – we decided to try to standardise place names and reconcile them to their modern-day GeoNames equivalents. We have so far standardised around 30,000 unique toponym attestations and aim to at least double this number before the end of the project. The standardised toponyms allow us to first look for exact matches. Thereafter, using the geo-coordinates returned by GeoNames, we geo-group locations to find possible additional matches. In this way, we also catch sailors who used their birth place and region interchangeably.

For all remaining person entity matches, suggested on the basis of name similarity, but not corroborated by matching places of birth, we perform a final birthplace check by measuring string similarity of the original place name attestations, so as to account for possible mistakes by clerks or transcribers of the original documents – Norden in East Frisia might easily have been misunderstood as Naarden close to Amsterdam.

The scope of our project does not allow for experiments with standardising person names. We therefore rely on the trigram matching algorithm to cope with spelling variations in names. However, for a follow-up project to HUMIGEC, we are thinking of also standardising person names, beginning with native workers’ names. To this end, we would use the Database of Surnames in The Netherlands[7] to standardise family names, and group variants of given names on the basis of data generated by Gerrit Bloothooft (e.g. Bloothooft and Schraagen, 2015).

[7] http://www.cbgfamilienamen.nl/

This paper is not well-suited for going deeply into socioeconomic analysis and statistical results – incidentally, HUMIGEC is still an ongoing research project and we currently only have very preliminary results – but a brief reflection on methodology is in order. First of all, it is important to stress that our method is far from perfect. At best, it gives us a limited view on career paths in the Dutch eighteenth-century maritime sector, for the available sources do not cover the entire sector and we have no ground truth for assessing the performance of the entity linkage process. We do have a set of manually-matched entities that we use for a superficial assessment of our matching method. These matches, however, are self-evidently incomplete and are furthermore likely to be biased towards non-standard names.

That same bias will be present in the automatically generated matching candidates: disambiguating employment records of sailors with common names, who were born in large towns and cities, is in many cases simply impossible, both for humans and computers. This gives reason for some concern about the representativeness of our study, but then again, sailors with non-standard names were not atypical because of their unusual names. Sixtus van den Hoek from Delft, for example, a sailor we could easily trace in eight different VOC employment records, is not unrepresentative because of his name – were his name Jan de Jong, he would not suddenly become a synecdoche for maritime life (cf. Van Lottum, Brock and Sumnall, 2015). Moreover, since we base our analysis on a large number of observations, we think the bias in our sample towards non-standard names will not have a significant influence on our results. However, to check whether the non-standard names that are likely to be overrepresented in our sample were not typical for a certain class of eighteenth-century society, we will compare them to the family names of Amsterdam’s highest-income tertile, derived from registers of a 1742 income tax (Oldewelt, 1945), and to the family names in Amsterdam’s birth, marriage and death registers from the mid-eighteenth century.[8]

[8] https://archief.amsterdam/indexen/

A discussion of the digital heuristics involved in our project will naturally also be included in the general description of the set of validated record matches and, in very short form, in each record’s P-PLAN provenance. So, if for example a future researcher of the Asian activities of the VOC would see that some person entities from the Official letters of the United East India Company – a Huygens ING digital resource that will be added to our LOD infrastructure in due course – were connected to records detailing sailors’ careers and others not, he would know that this could have as much to do with selection bias in the linkset as with the actual careers of these people.

5. Conclusions

Biography as a historical method has traditionally mainly been used as a means to illustrate qualitative themes, generally based on one or a small set of case studies. From around the turn of the century the online availability of national biographical dictionaries in e.g. the Netherlands, Germany, the United Kingdom and Australia allowed for larger-scale biographical research and the formation of collective biographies (cf. Arthur, 2015; Carter, 2012). But these were inevitably limited by the scope of the online biographical collection and influenced by the selection criteria (and biases) of its editors.

The Huygens research infrastructure and biographical data policy, however, allow researchers to go one step further. The institute makes available all biographical data contained in its collection, both highly curated data from biographical dictionaries and person data retrieved from various textual sources. Furthermore, as illustrated by the HUMIGEC research case, researchers can use the infrastructure to semi-automatically connect external datasets to the core data or disambiguate their own data. In HUMIGEC, we use the large number of mini-biographies obtained through digital methods as a means of illustrating wider social and economic processes. Indeed, as Paul Arthur predicted, this approach is ‘a demonstration of biography’s greatly increased capacity, in the digital era, to activate cross-disciplinary investigation, and become a dynamic agent for integrating and connecting individual lives and their historical contexts’ (Arthur, 2015).

Digital advances such as the one described in this paper are blurring the boundaries between (collective) biography, prosopography and other socioeconomic research methods. In parallel with this development, all biographical data observations, however insignificant they may seem at first sight, may become very meaningful and instrumental to answering important research questions when disambiguated and combined with other data. Huygens ING aims to facilitate and enhance the full range of biography methods by making available a digital infrastructure that welcomes all biographical data – be they on the lives of prominent people or small fry – and offering functionality for exploration of similarities and interconnections between data observations.

6. Acknowledgements

We thank Ania Ahamed and Jessica den Oudsten for their research assistance, and the Amsterdam City Archives for sharing their genealogical data with us. HUMIGEC received funding from CLARIAH.[9]

Migration and Human Capital in the Long Eighteenth Century: The Life of Joseph Anton Ponsaing. In: M. Fusaro et al. (Eds.), Law, Labour, and Empire. Comparative Perspectives on Seafarers, c. 1500-1800. Basingstoke: Palgrave Macmillan, pp. 158-176.
Ogborne, M. (2008). Global lives. Britain and the world, 1550-1800. Cambridge: Cambridge University Press.
Oldewelt, W.F.H. (1945). Kohier van de personeele quotisatie te Amsterdam over het jaar 1742. 2 vols. Amsterdam: Genootschap Amstelodamum.
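
As an illustration of the matching procedure described in Sections 3 and 4 (name similarity by trigram matching, a date-based refinement rule, and strict-to-loose iteration), the following is a minimal Python sketch. It is not the tool itself: the tool relies on the similarity function of the PostgreSQL pg_trgm module, which the trigrams and similarity functions below only approximate in plain Python. The Record class, the plausible rule with its minimum age of ten, the thresholds 0.8 and 0.5, the spelling variants "Engell" and "Angel", and the Jan de Jong record are all hypothetical and serve only to show the mechanics. The years for Daniel Engel are taken from the introduction.

```python
import re
from dataclasses import dataclass
from typing import Optional


def trigrams(text):
    """Approximate pg_trgm: lowercase, split into words, pad each word with
    two leading spaces and one trailing space, take 3-character windows."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class Record:  # hypothetical, simplified record layout
    name: str
    start_year: int
    birth_year: Optional[int] = None


def plausible(a, b, min_age=10):
    """Hypothetical date rule: discard pairs in which the person would have
    been younger than min_age at the earlier of the two employments."""
    birth = a.birth_year or b.birth_year
    if birth is None:
        return True  # no birth year known, so nothing to check
    return min(a.start_year, b.start_year) - birth >= min_age


def candidates(pool, reference, threshold):
    """Return (score, a, b) for pairs passing the name threshold and date rule."""
    hits = []
    for a in pool:
        for b in reference:
            score = similarity(a.name, b.name)
            if score >= threshold and plausible(a, b):
                hits.append((score, a, b))
    return hits


def iterate(pool, reference, thresholds):
    """Run strict-to-loose rounds; accepted records leave the pool each round."""
    remaining, validated = list(pool), []
    for threshold in thresholds:
        for score, a, b in candidates(remaining, reference, threshold):
            # Stand-in for manual validation: every candidate is accepted.
            validated.append((threshold, a, b))
        accepted = [a for _, a, _ in validated]
        remaining = [r for r in remaining if r not in accepted]
    return validated, remaining


# Hypothetical sample data; only the Engel years come from the paper.
voc = [Record("Daniel Engel", 1766), Record("Daniel Engell", 1788),
       Record("Daniel Angel", 1792), Record("Jan de Jong", 1770)]
prize = [Record("Daniel Engel", 1782, birth_year=1753)]

validated, remaining = iterate(voc, prize, thresholds=[0.8, 0.5])
for threshold, a, b in validated:
    print(threshold, a.name, a.start_year, "->", b.name, b.start_year)
```

With these assumed thresholds, the first round accepts the two closest spellings, the second round picks up "Daniel Angel", and "Jan de Jong" stays in the pool. In practice a researcher would inspect each candidate before approving it, and the rules would be tailored to the data at hand.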

