=Paper= {{Paper |id=Vol-3019/p54 |storemode=property |title=Link-Lives, Historical Big Data: Reconstructing Millions of Life Courses from Archival Records Using Domain Experts and Machine Learning |pdfUrl=https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_9.pdf |volume=Vol-3019 |authors=Barbara Revuelta-Eugercios,Olivia Robinson,Anne Løkke }} ==Link-Lives, Historical Big Data: Reconstructing Millions of Life Courses from Archival Records Using Domain Experts and Machine Learning== https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_9.pdf
  Link-Lives, Historical Big Data: reconstructing millions of life courses from archival
  records using domain experts and machine learning
 Bárbara A. Revuelta-Eugercios1, 2 [0000-0002-2449-037X], Olivia Robinson2 [0000-0003-3085-0025]
                                  and Anne Løkke2
  1 Rigsarkivet/National Archives of Denmark (Jernbanegade 36A, Odense, 5000, Denmark)
  2 University of Copenhagen (SAXO Institute, Department of History, Karen Blixens Plads 4,
    Copenhagen 2300, Denmark)
                                            bre@sa.dk



           Abstract. The Danish archives comprise some of the world's most comprehen-
           sive source coverage but despite large-scale digitization and transcription pro-
           jects by diverse actors, there are no shared standards or possibilities for data link-
           age. The Denmark-based Link-Lives research project (2019-2024) is tackling this
           disparity by linking individual-level Danish records in census and parish record
           sources from 1787-1968 to create a multigenerational database for research using
           a combination of domain expertise and machine learning techniques. In contrast
           to small-sample linking or fully automated processes, Link-Lives is creating its
           own manually-linked data to train machine learning as well as exploring the im-
           pacts of different approaches to linking. Due to personal data protection legisla-
           tion and proprietary agreements, the data cannot be fully open access, but data out-
           puts will be made available to both researchers and the general public via a web-
           site. The project's interdisciplinary team is based at the Danish National Archives
           and the University of Copenhagen, in partnership with Copenhagen City Ar-
           chives, and funded by Carlsberg and Innovation Fund Denmark.

           Keywords: archival record, multigenerational data, big data, record linkage


1          Introduction

The Danish archives comprise some of the world's most comprehensive source cover-
age of the lives of individuals but despite large-scale digitization and transcription pro-
jects by diverse actors, there are no shared standards or ready-made possibilities for
data linkage. Link-Lives is a cross-disciplinary research project that combines infor-
mation relating to any given person drawn from diverse archival sources, to build life
courses and family relations from 1787 to the present. We combine machine learning,
historical research, bioinformatics, and citizen involvement to transform Danish ar-
chival sources into multigenerational big linked data. The result will be a research in-
frastructure at the Danish National Archives, created in cooperation with the Copenha-
gen City Archives (Københavns Stadsarkiv) and the University of Copenhagen. It will
expand the scope of registry-based research from decades to centuries, and open up
new avenues for intergenerational research in the health and social sciences. Denmark
will be the first in the world to implement this level of digitalization at full functional scale
as part of the service of its national archives. The results will be made available to two
main user groups: to researchers across disciplines, from historians to health and social
scientists, who will use the life-course and multigenerational data to explore new research
areas; and to the public, via a website disseminating the part of the data that is
unrestricted by legislation or proprietary agreements. The links and life courses will be
freely available for anyone to search. Funded by the Innovation Fund Denmark and the
Carlsberg Foundation, the project started in 2019 and will be completed in 2024.

* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).

The paper presents the aims and scope of the project along with some preliminary re-
sults. The structure is as follows: Section 1 provides an overview of the Danish context
and how new projects are taking fresh advantage of the scope of archival digitization.
Section 2 discusses the methodological developments to deal with sources from widely
different provenances. Specifically, it focuses on the standardization and the innova-
tions that derive high quality linked data to train machine learning approaches. Section
3 discusses our decisions on disseminating the linked data in light of current constraints
and opportunities. Section 4 provides a conclusion and further perspectives.

1.1    Why do we need historical big data?
The biological and social life of humans is influenced by multigenerational mecha-
nisms. However, these mechanisms are poorly understood, not least due to the enor-
mous workload required to establish life courses and family relations spanning 4-5 gen-
erations [1]. In fact, for the era before the introduction of computerized civil registration
systems (in Denmark, the CPR in 1968), most of our knowledge is based on limited
samples: either long family pedigrees reconstructed manually within small geographic
areas or reconstructions of just a few generations for larger areas. Research using even
these limited datasets has nevertheless revealed promising findings: four hundred years
of parish records in Canada showed selective fertility advantages on the French Cana-
dian frontier [2]; rich contextual data revealed a transgenerational response to paternal
grandfathers' access to food in 19th century Northern Sweden [3]; and a correlation
was identified between political affiliation in the 2010s elections and cross-ethnic ex-
posure in the early twentieth century in some states in the US [4], to cite some examples.
In fact, the promise of new discoveries has led to new and renewed investments in the
construction of large-scale databases that are pioneering this type of research. National-
level work is underway in the Netherlands [5], Scotland [6], Norway [7], Sweden [8],
the US [9] and Canada [10], while others are harvesting somewhat unrepresentative
population-scale family trees from genealogy sites [11] with similar research goals.

1.2    The Danish context
The preconditions for this type of research have existed in Denmark for the last couple
of decades. Rich archival material documenting many dimensions of individual lives
over a period of more than 200 years has been preserved, and dedicated projects to
transcribe and/or digitize many of these have been underway for some time. However,
there has been relatively little research using these sources, as most have had to rely on
small manually-linked samples [12, 13]. This has changed due to advances in
record linkage technologies, increased access to computing power in both personal
computers and high performance computing facilities, and a growing awareness of the
opportunities offered by these large untapped collections of historical data for many
fields of research.

Revuelta-Eugercios, one of the Link-Lives PIs, was granted a project in 2014 that be-
came a methodological pilot for what would later become Link-Lives, with a focus on
both infrastructure building and research. More recently, the research potential of the
area of large-scale historical population databases has been acknowledged in Denmark
with three further projects attracting funding. Two of them will use and connect directly
to Link-Lives data (a PhD and a collaborative project); a third, the Multigenerational
Registry (MGR, funded by the Novo Nordisk Foundation in 2021), is independent from Link-
Lives. The MGR will make even more data available for the 20th century by transcribing
parish records for the period 1920-1968, using image recognition to establish parental
information for all individuals born after 1920, before linking them to the CPR regis-
try [14]. The synergies between these two project investments are clear. In a few years,
it will be possible to create a Danish Historical Population Registry, featuring life
courses and relationships, as well as social, economic, and medical conditions, from a
wide variety of sources. It will provide multigenerational data with long coverage, from
1787 to 1968, and high density, covering many dimensions of individual persons. Pro-
jects such as these are timely and poised to harness the power of both the depth of
source coverage and breadth of digitization, to unleash the full potential of their re-
search use.


2      Source material and methodological developments

2.1    Obtaining data through wide partnerships: old and new crowdsourcing
       data and collaboration with Ancestry
In the past twenty years, research projects, crowdsourcing initiatives, genealogists'
homepages, and private companies have transcribed parts of the Danish archival herit-
age for their own use. Each project has invented ways of transcribing, standardizing
and storing data, and developed their own ontologies, terminologies and data formats.
Millions of records have been digitized, though with no shared standards or possibilities
of linkage. These 'information islands' have excellent data, but it is both difficult and
expensive to navigate the many databases and websites in search of it, and any re-
use of the material for different purposes requires expensive re-processing. Among the
most important actors are volunteer-driven digitization (scanning/photographing from
analogue to facsimile) and datafication (transcription into machine-readable format)
initiatives which, in cooperation with archives, have made millions of records contain-
ing individual information available. The National Archives holds the oldest
crowdsourcing heritage data project in Denmark (founded in 1992) [15] but many other
archives, such as the Copenhagen City Archives [16] and Aarhus City Archives, have
started similar projects in the last decade. Genealogical associations have also partici-
pated in recent years and private companies have begun to enter the field with large
digitization ventures of their own, with varying types of collaboration and data sharing
agreements at both the national and municipal archive level. Additionally, there are
some key examples of historical sources transcribed for contemporary health research
[17, 18].
The current Link-Lives project (2019-2024) is linking four types of sources from four
different actors with overlapping and linkable chronologies: 1) more than 20 million
records, including the full count of 9 nationwide census years plus the local Copenha-
gen census of 1885 from the National Archives, plus 1921 and 1940, which will be
newly transcribed by commercial actors; 2) 300,000 burials for the period 1861-1920
from a crowdsourcing project at Copenhagen City Archives; 3) 22 million parish rec-
ords indexed by the genealogy company Ancestry; and 4) the Danish Civil Registration
System (CPR), established in 1968. Further collections can and will be added to extend
the range, subject to additional funding, since the infrastructure is being built to absorb
new datasets.

Given the different origins, aims, and context of each of the collections, a key feature
of the process is standardization to ensure that spellings and conventions are consistent
with one another. In a first phase, we have developed synonym catalogues for historical
names, places, civil status, etc. which ensure the maximal proximity to the source while
taking into account existing vocabularies. In a second phase, we are mapping variables
to existing classification schemes (and international ones where possible) and authority
lists (if they exist). To build these, we employ domain experts, who undertake the slow
yet important process of compiling synonym catalogues in combination with computer-
assisted processes. Our aim is to provide standardized and simpler access to the vast
collections, while still respecting the nature of the records and the context of their cre-
ation, preservation, and survival, through systematic documentation of metadata. This
will ensure users can clearly identify what version of a variable they are working with
and the process it has undergone prior to its coded form.
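
The synonym-catalogue step described above can be illustrated with a minimal sketch. It assumes a catalogue is a simple mapping from variant spellings to a standardized form; the entries, the version label and all names in the code are hypothetical and are not taken from the Link-Lives catalogues or software. The point of the sketch is only that the transcribed string is kept next to its standardized form, together with a note of which catalogue produced it.

```python
from dataclasses import dataclass

# Hypothetical synonym catalogue: variant spelling -> standardized form.
# The entries are illustrative, not taken from the Link-Lives catalogues.
GIVEN_NAME_SYNONYMS = {
    "kirstine": "kirsten",
    "kirstin": "kirsten",
    "pedr": "peder",
    "peer": "peder",
}

CATALOGUE_VERSION = "example-v1"  # assumed label, for illustration only


@dataclass(frozen=True)
class StandardizedValue:
    original: str      # the string exactly as transcribed from the source
    standardized: str  # catalogue form, or the cleaned string if unmatched
    matched: bool      # True if the catalogue contained the variant
    catalogue: str     # which catalogue version produced this value


def standardize(raw: str, synonyms: dict[str, str]) -> StandardizedValue:
    """Look up a transcribed string in a synonym catalogue, keeping the original."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    if key in synonyms:
        return StandardizedValue(raw, synonyms[key], True, CATALOGUE_VERSION)
    return StandardizedValue(raw, key, False, CATALOGUE_VERSION)
```

Unmatched strings are returned in cleaned form with `matched=False`, so that they can be queued for review by a domain expert rather than silently altered.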

As of August 2021, we have completed the standardization of several sets of categori-
zations (those for given names, surnames, geographic places and causes of death) and
we are currently working on coding occupations. We extracted millions of given and
family names from census data and coded the 6,233 most-often occurring, which rep-
resent c.95% of all names. As there is no authority list for names, we collaborated with
a scholar of onomastics to create our own. We have also coded over 600 unique causes
of death, which cover 7.6% of all the unique causes of death that occurred in Copen-
hagen in the years 1861-1911. Despite the small number, this has allowed us to code
90% of all the deaths in Copenhagen during the period. The classification we use is the
ICD10-h (International Classification of Diseases 10 adapted for Historical purposes),
which, as a partner in the SHiP project, we are currently involved in developing [22],
with contribution from history of medicine specialists. Places of birth appear in the
datasets in free text fields, in which places are recorded with all sorts of granularities
so there are more than 600,000 unique strings for places. Out of 13,261 place-name
components that appeared in the place of birth fields in the 1845-1901 censuses, we
have standardized 4,999 uniquely-spelled place-name words. These represent c.98% of
all place-name occurrences for this period. We have derived a reference list too, for
words that relate to existing places, from the DigDag project [21], which contains the
most up-to-date survey of geographical boundaries in Denmark from 1500 to today.
While the residence parishes have already been geocoded to DigDag, we aim to geo-
code the actual standardized places and will start coding occupations to HISCO [19] in
the fall of 2021.
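
The figures above rest on a simple observation: because a few spellings account for most occurrences, coding a small share of the unique strings can cover a large share of the records. A minimal sketch of that calculation follows; the function names are hypothetical and the code is an illustration of the arithmetic, not the procedure used in the project.

```python
from collections import Counter


def coverage_of_top_k(values, k):
    """Share of all occurrences accounted for by the k most frequent strings."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


def smallest_k_for_coverage(values, target):
    """Smallest number of distinct strings whose occurrences reach `target` coverage."""
    counts = Counter(values)
    total = sum(counts.values())
    running = 0
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= target:
            return k
    return len(counts)  # target not reached (or empty input)
```

Applied to a column of transcribed names or causes of death, `smallest_k_for_coverage(values, 0.95)` would indicate how many unique strings need expert coding to reach 95% of occurrences.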

2.2    Automatic methods in record linkage: domain-expert grounded
       algorithms
Until recently, a huge amount of manual effort was required to establish these linked
life courses, but in the past decade, researchers in different countries have pioneered
methods for automatic linkage on national scales (Norway, Sweden, Scotland, the Nether-
lands, Canada and the US) [7, 23-27]. The core challenge in creating intergenera-
tional data from historical sources is entity resolution: identifying the same person
across multiple sources (in a context of low identifiability of persons and high variation
in quality and availability of critical attributes). Many of the existing automatic linking
projects have mostly used rule-based linkage methods [7, 28]. In recent years, some
researchers have carried out promising experiments with supervised machine learning
techniques, including support vector machines and random forests [24-26, 29]. Some
authors have even argued for the development of truly "automatic" probabilistic meth-
ods with little input from the researcher, to create simple, transparent and replicable
methods that are mostly context-neutral [30]. However, their focus on the ease of im-
plementation comes at the expense of the contextual, archival, historical features of the
data. These overly-automated processes omit a thorough discussion of "ground truth"
or "training data" for either testing or training models, which is a deficit that also applies
to much of the historical record linkage literature. "Ground truth" is the term used to
refer to information or data known to be real or true, provided by direct observation
and/or measurement, and acts as a benchmark for any replication attempt. Training data
usually refers to the specific dataset that is used to train machine learning models, which
in the best of cases is ground truth. The challenge in developing historical record link-
age approaches is that ground truth data does not actually exist: there is no additional
source to verify whether a link between any two historical records is right or wrong, so
all links are in reality the result of an estimation. Each researcher thus has to find and/or
create their own "training data" to test or train their models.
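
To make the role of such reference data concrete, the following minimal sketch compares a set of automatically proposed links with a set of manually created links using precision and recall. The function name and the representation of a link as a pair of record identifiers are assumptions made for illustration. As argued above, the reference set is itself an estimate, so the resulting figures measure agreement with the manual links rather than correctness in an absolute sense.

```python
def precision_recall(predicted, reference):
    """Compare predicted links with reference links.

    Both arguments are iterables of (id_in_source_a, id_in_source_b) pairs.
    Precision: share of predicted links that also appear in the reference.
    Recall: share of reference links that the prediction recovered.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall
```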

The three most common practices for linking have been: harvesting linked records from
genealogy sources; using manually-linked data produced by a single historian; or cre-
ating a small number of sets as part of a semi-automatic algorithm. Whatever the
method chosen, though, researchers do not often provide information on the process of
data construction or its consequences. The Link-Lives approach differs in this. Not only
do we create our own manually-linked data, but we explore the impacts of different
approaches. We take into account the specificity of data sources and contexts as well
as designing a protocol to identify the variation in human decision-making. We believe
that putting history and historians at the center of linking methodologies creates data of
the highest quality, which is a prerequisite for any successful machine learning model.
We ensure that our training data is created by a three-domain-expert approach. This
gives us assurances that any algorithm we create is built on robust data that reflects in
a transparent way our best estimate of ground truth. Additionally, rather than commit-
ting to one specific type of model (which will always have its own strengths and limi-
tations and is suited to specific research questions), Link-Lives is committed to developing
or implementing multiple models, in order to provide a variety of options for researchers.

We have used Python and Kivik to develop a bespoke interface named ALA (Assisted
Linking Application) which requires no installation or programming competences, so
it can be widely distributed to non-experts in programming. Linkers are presented with
transcribed data from two sources and a subset of potential link candidates generated
by a rule-based algorithm, using some relatively simple rules and standardizations.
They then make a linking decision based on the data presented to them on the screen
and can further refine their searches manually outside of the potential cases proposed.
In total, over 30 people have been trained to link consistently using the ALA software
and a core team of nine linkers is currently using it to link parish records and census
returns to census data, to build high-quality training data. As of August 2021, the team
has created 31,880 training data records.
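
The kind of rule-based candidate generation that feeds the linkers can be sketched as follows. This is an illustration under stated assumptions, not the rules used in ALA: the record fields, the thresholds (`max_year_gap`, `min_name_sim`) and the use of the standard-library `difflib.SequenceMatcher` as a string-similarity measure are all choices made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    record_id: str
    first_name: str   # assumed to be standardized already
    surname: str      # assumed to be standardized already
    sex: str          # "m" or "f"
    birth_year: int


def name_similarity(a: str, b: str) -> float:
    """String similarity between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def candidates(target, pool, max_year_gap=2, min_name_sim=0.85):
    """Return records in `pool` that pass simple rules, best name match first."""
    scored = []
    for p in pool:
        if p.sex != target.sex:
            continue  # rule 1: sex must agree
        if abs(p.birth_year - target.birth_year) > max_year_gap:
            continue  # rule 2: birth years must be close
        # rule 3: both name parts must be similar; score by the weaker one
        sim = min(name_similarity(p.first_name, target.first_name),
                  name_similarity(p.surname, target.surname))
        if sim >= min_name_sim:
            scored.append((sim, p))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored]
```

In this sketch the rules only narrow the list shown on screen; the linking decision itself remains with the human linker, who can also search outside the proposed candidates.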

Although linkers are governed by a set of best practices and attend linker workshops to
guide and align their linking decisions, no two linkers link exactly the same way. We
require each record to be linked by two linkers, before a third resolves those cases where
linkers do not agree (c.5-15% of cases). We have also documented that linking rates
vary widely by factors unrelated to linker experience or abilities. For example, our do-
main-expert linked sets show that in some rural parishes we achieve 95% link rates
compared to 30-40% in urban areas (often explained by higher population mobility).
The preliminary results from our relatively simple rule-based algorithm are somewhat
lower but on a par with what we see in the international literature. On average, our early
linking of the 1845, 1850 and 1860 census years shows a 50% link rate, but we expect
other sources and later censuses to vary in this. We are now working on updated rule-
based models but also have reached a sufficient threshold of manually linked data to
start testing machine learning approaches. We are experimenting with variational re-
current neural networks, support vector machines and random forest specifications.
Given the high computing needs for comparing millions of records while handling per-
sonal data protected by GDPR (individuals born after 1901), we use a cloud cluster in
the Computerome 2.0 Supercomputer at Danish Technical University where we have a
reserved node with 40 cores. This allows us to run multiple variations of the algorithms
and test the effects on the linked data. For example, in the latest configuration, each run
of the comparison between two censuses (with c.1-2 million records each) takes 3-5
hours.
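
The double-linking protocol described above, in which two linkers link each record and a third resolves disagreements, can be sketched in a few lines. The data layout (a mapping from source record id to linked target id, with `None` for "no link") and the function name are assumptions for illustration only.

```python
def resolve_links(linker_a, linker_b, adjudicator):
    """Combine two linkers' decisions; a third linker settles disagreements.

    Each argument maps a source record id to the linked target id,
    or to None when the linker decided there was no link.
    Only records linked by both linkers are considered.
    """
    final, disputed = {}, []
    shared = linker_a.keys() & linker_b.keys()
    for record_id in shared:
        a, b = linker_a[record_id], linker_b[record_id]
        if a == b:
            final[record_id] = a
        else:
            disputed.append(record_id)
            # Disputed records stay unresolved until the third linker decides.
            if record_id in adjudicator:
                final[record_id] = adjudicator[record_id]
    disagreement_rate = len(disputed) / len(shared) if shared else 0.0
    return final, sorted(disputed), disagreement_rate
```

The returned disagreement rate corresponds to the share of cases passed to the third linker, which the text above reports as roughly 5-15%.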
3      Delivering data to researchers and the wider public

To reach everyone with the highest quality linked data, we will make it available in two
ways for our main target user groups: researchers and the general public. Researchers
will be able to access both the historical part (freely downloadable) as well as data
protected by the EU General Data Protection Regulation (GDPR) and its Danish im-
plementation (which offers an additional 10 years of posthumous protection) following
the same channels as any other registry-based research. For the public, the website link-
lives.dk will display the historical part, unaffected by these protections, where users
will be able to search and download limited amounts of data after logging in. By Feb-
ruary 2022 we will make available for researchers and the public all life courses for the
years 1787 to 1901 from our key sources. Between 2022 and 2024 we will expand the
connection to more contemporary sources, so researchers will have access (subject to
permission) to multigenerational data for the period 1787-1968.

We have chosen this dual dissemination strategy, instead of publishing open linked data
and taking advantage of the semantic web, because of two major constraints. First, a
part of the data is proprietary (Ancestry's indexing) and another part contains individ-
uals protected by GDPR/Danish law, so the whole collection could never be freely de-
livered or downloaded. Second, there is a low penetration of approaches to the semantic
web among the users we target, who are still tied to traditional forms of delivery (Excel
for historians, csv files or traditional SQL databases for social and health scientists)
even though there are promising projects publishing historical databases as open-linked
data [31, 32]. Thus, in order to maximize early and high usage of the collection, we
have prioritized formats and systems that are already in use, that do not require our
users to be trained further. The fit of this decision to the current project has been con-
firmed through user tests, teaching and other forms of dissemination with the two main
user groups (family historians and researchers). As of August 2021 we have received
no explicit requests outside of these standard formats but we are aware that there is a
growing trend in Digital Humanities in Denmark for working with the Semantic Web.
In any case, we could easily move into publishing part of the data as open data in the
future, provided that additional funding and collaborations are found.


4      Conclusion

This paper has presented the key aspects of the Link-Lives project, featuring con-
straints, opportunities, challenges, and results to date. We have shown how a project of
this complexity requires competences and expertise from different types of institutions,
from archival personnel, historians, and computer scientists. Moreover, a project like
this offers archives new ways to engage with research and public audiences and creates
the foundation on which to build future collaborative partnerships and funding oppor-
tunities. On the methodological side, we have highlighted how it is key to have an un-
derstanding of the provenance of historical data, as well as to transparently record the
processes it undergoes in metadata form. A clear overview of the ways in which human
and algorithm decisions affect the data is a prerequisite for transforming archival hold-
ings into new datasets. Embedding domain experts as key mediators of data ensures
that the data's potential and limitations are fully transmitted to the end users.



References

1. Branje, S., Geeraerts, S., de Zeeuw, E.L., Oerlemans, A.M., Koopman-Verhoeff, M.E.,
    Schulz, S., Nelemans, S., Meeus, W., Hartman, C.A., Hillegers, M.H.J., Oldehinkel, A.J.,
    Boomsma, D.I.: Intergenerational transmission: Theoretical and methodological issues and
    an introduction to four Dutch cohorts. Developmental Cognitive Neuroscience. 45, 100835
    (2020). https://doi.org/10.1016/j.dcn.2020.100835.
2. Moreau, C., Bhérer, C., Vézina, H., Jomphe, M., Labuda, D., Excoffier, L.: Deep Human
    Genealogies Reveal a Selective Advantage to Be on an Expanding Wave Front. Science. 334,
    1148-1150 (2011). https://doi.org/10.1126/science.1212880.
3. Vågerö, D., Pinger, P.R., Aronsson, V., van den Berg, G.J.: Paternal grandfather's access to
    food predicts all-cause and cancer mortality in grandsons. Nature Communications. 9, 5124
    (2018). https://doi.org/10.1038/s41467-018-07617-9.
4. Brown, J.R., Enos, R.D., Feigenbaum, J., Mazumder, S.: Childhood cross-ethnic exposure
    predicts political behavior seven decades later: Evidence from linked administrative data.
    Science Advances. 7, eabe8432 (2021). https://doi.org/10.1126/sciadv.abe8432.
5. Mandemakers, K., Kok, J.: Dutch Lives. The Historical Sample of the Netherlands (1987-):
    Development and Research. Historical Life Course Studies. (2020).
6. Digitising Scotland, https://digitisingscotland.ac.uk/, last accessed 2021/02/23.
7. Thorvaldsen, G., Andersen, T., Sommerseth, H.L.: Record Linkage in the Historical Popula-
    tion Register for Norway. In: Bloothooft, G., Christen, P., Mandemakers, K., and Schraagen,
    M. (eds.) Population Reconstruction. pp. 155-172. Springer, (online) (2015).
8. Swedpop, http://service.re3data.org/repository/r3d100010146, last accessed 2021/04/15.
9. Ruggles, S., Fitch, C., Goeken, R., Hacker, J.D., Helgertz, J., Roberts, E., Sobek, M., Thomp-
    son, K., Warren, J.R., Wellington, J.: IPUMS Multigenerational Longitudinal Panel.
10. The Canadian Peoples, https://thecanadianpeoples.com/, last accessed 2021/06/28.
11. Kaplanis, J., Gordon, A., Shor, T., Weissbrod, O., Geiger, D., Wahl, M., Gershovits, M.,
    Markus, B., Sheikh, M., Gymrek, M., Bhatia, G., MacArthur, D.G., Price, A.L., Erlich, Y.:
    Quantitative analysis of population-scale family trees with millions of relatives. Science. 360,
    171-175 (2018). https://doi.org/10.1126/science.aam9309.
12. Thomsen, A.R.: Lykkens smedje?, social mobilitet og social stabilitet over fem generationer
    i tre jyske landsogne 1750-1850, PhD dissertation, University of Copenhagen (2010).
13. Johansen, H.C.: Danish Population History. University Press of Southern Denmark, Odense
    (2002).
14. Novo Nordisk Fonden: Artificial intelligence will transcribe the family relationships of Danes
    and strengthen research, https://novonordiskfonden.dk/en/news/kunstig-intelligens-skal-
    kortlaegge-danskernes-stam-trae-og-styrke-forskning/, last accessed 2021/03/12.
15. Clausen, N.F.: The Danish Demographic Database: Principles and Methods for Cleaning
    and Standardisation of Data. In: Bloothooft, G., Christen, P., Mandemakers, K., and Schraa-
    gen, M. (eds.) Population Reconstruction. pp. 3-22. Springer, (online) (2015).
16. Van Zeeland, N., Gronemann, S.T.: Participatory Archives. In: Benoit, E., III and Eveleigh,
    A. (eds.) Participatory Archives. pp. 103-114 (2019).
17. Baker, J.L., Olsen, L.W., Andersen, I., Pearson, S., Hansen, B., Sørensen, T.I.: Cohort Profile:
    The Copenhagen School Health Records Register. Int J Epidemiol. 38, 656-662 (2009).
    https://doi.org/10.1093/ije/dyn164.
18. Juel, K., Helweg-Larsen, K.: The Danish registers of causes of death. Dan Med Bull. 46,
    354-357 (1999).
19. Van Leeuwen, M., Maas, I., Miles, A.: HISCO. Historical International Standard Classification
    of Occupations. Leuven University Press, Leuven (2002).
20. WHO: ICD-10, https://icd.who.int/browse10/2016/en, last accessed 2018/12/05.
21. DigDag.dk, http://digdag.dk/, last accessed 2020/03/26.
22. SHiP: Studying the history of Health in Port Cities, https://www.ru.nl/rich/our-research/re-
    search-groups/radboud-group-historical-demography-family-history/ship/, last accessed
    2021/08/31.
23. Wisselgren, M.J., Edvinsson, S., Berggren, M., Larsson, M.: Testing Methods of Record
    Linkage on Swedish Censuses. Historical Methods: A Journal of Quantitative and Interdisci-
    plinary History. 47, 138-151 (2014). https://doi.org/10.1080/01615440.2014.913967.
24. Ruggles, S., Fitch, C.A., Roberts, E.: Historical Census Record Linkage. Annual Review of
    Sociology. 44 (2018). https://doi.org/10.1146/annurev-soc-073117-041447.
25. Antonie, L., Inwood, K., Lizotte, D.J., Andrew Ross, J.: Tracking people over time in 19th
    century Canada for longitudinal analysis. Machine Learning. 95, 129-146 (2014).
    https://doi.org/10.1007/s10994-013-5421-0.
26. Christen, V., Groß, A., Fisher, J., Wang, Q., Christen, P., Rahm, E.: Temporal group linkage
    and evolution analysis for census data, https://openproceedings.org/2017/conf/edbt/paper-
    269.pdf, (2017). https://doi.org/10.5441/002/EDBT.2017.83.
27. CLARIAH/burgerLinker. CLARIAH (2021).
28. Edvinsson, S., Engberg, E.: A Database for the Future. Major Contributions from 47 Years
    of Database Development and Research at the Demographic Data Base. Historical Life
    Course Studies. (2020).
29. Feigenbaum, J.J.: A Machine Learning Approach to Census Record Linking.
30. Abramitzky, R., Mill, R., Pérez, S.: Linking individuals across historical sources: A fully
    automated approach. Historical Methods: A Journal of Quantitative and Interdisciplinary
    History. 1-18 (2019). https://doi.org/10.1080/01615440.2018.1543034.
31. Hoekstra, R., Meroño-Peñuela, A., Dentler, K., Rijpma, A., Zijdeman, R., Zandhuis, I.: An
    Ecosystem for Linked Humanities Data. In: Sack, H., Rizzo, G., Steinmetz, N., Mladenić, D.,
    Auer, S., and Lange, C. (eds.) The Semantic Web. pp. 425-440. Springer International Pub-
    lishing, Cham (2016). https://doi.org/10.1007/978-3-319-47602-5_54.
32. Hooland, S. van: Linked data for libraries, archives and museums: how to clean, link and
    publish your metadata. Facet Publishing, London (2014).