<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognising and Linking Entities in Old Dutch Text: A Case Study on VOC Notary Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barry Hendriks</string-name>
          <email>barryhendriks98@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Groth</string-name>
          <email>p.t.groth@uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marieke van Erp</string-name>
          <email>marieke.van.erp@dh.huc.knaw.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KNAW Humanities Cluster Amsterdam</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>The increased availability of digitised historical archives allows researchers to discover detailed information about people and companies from the past. However, the unconnected nature of these datasets presents a non-trivial challenge. In this paper, we present an approach and experiments to recognise person names in digitised notary records and link them to their job registration in the Dutch East India Company's records. Our approach shows that standard state-of-the-art language models have difficulties dealing with 18th-century texts. However, a small amount of domain adaptation can improve the connection of information on sailors from different archives.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity recognition</kwd>
        <kwd>maritime history</kwd>
        <kwd>domain adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Dutch East India Company (the VOC) is known as one of the first
multinational corporations, employing thousands of people from a variety of countries
during its existence (1601-1800) [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ]. The company held extensive records about
their employees, recording information about their place of origin, the ships they
sailed on, and the reason for their termination of employment [
        <xref ref-type="bibr" rid="ref22">21</xref>
        ]. Of these
records, 774,200 have been preserved and digitised, facilitating research into the
corporation (cf. [
        <xref ref-type="bibr" rid="ref21">20</xref>
        ]). Identifying which records within this collection refer to
the same person can provide more insight into the lives of VOC employees as
shown by [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ]. Being able to connect to other sources (e.g. notary records) would
provide another dimension to the analysis, enabling, for instance, research into
the lives of sailors, as such records can provide information on who a sailor's
beneficiaries were or whether they had any debts.
      </p>
      <p>The Amsterdam City Archive has undertaken large-scale digitisation projects.
An example is the Alle Amsterdamse Akten project,<sup>3</sup> which includes many
documents from notaries who are known to have dealt with VOC employees.</p>
      <p>
        In this paper, we present an approach and experiments for identifying and
linking sailors in both the VOC and Amsterdam notary records. We show the
importance of domain adaptation of state-of-the-art Dutch language models
(e.g. BERTje [
        <xref ref-type="bibr" rid="ref23">22</xref>
        ]) in order to achieve acceptable performance on the named
entity recognition (NER) task for this domain. Our contributions are threefold:
1) named entity recognition and linking software adapted to the 18th-century
maritime domain; 2) a gold standard dataset for evaluation in this domain; and
3) experimental insights into language technology for early-modern documents.
      </p>
      <p>This paper is structured as follows. In Section 2, we describe related work
and in Section 3, the datasets. Section 4 presents our approach and
experimental setup, followed by an evaluation in Section 5. We close with
conclusions and recommendations for future work (Section 6). The code and data of
all experiments performed can be found at https://github.com/barry98/VOC-project.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>We employ a combination of named entity recognition (NER), record linkage
(RL), and named entity linking (NEL). In this section, we give a brief overview
of the most important techniques from these three areas.</p>
      <p>
        Named Entity Recognition: Extensive research has been done into Named
Entity Recognition [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ]. Currently, neural networks [
        <xref ref-type="bibr" rid="ref14 ref3 ref8">7, 3, 13</xref>
        ] achieve top
performance in NER with F1 scores of around 0.81-0.82 as compared to 0.77 in previous
approaches. In particular, NER systems based on large scale language models
such as BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] perform well. Dutch versions of BERT have been created by
training on Dutch texts, resulting in the BERTje and RobBERT models [
        <xref ref-type="bibr" rid="ref2 ref23">2, 22</xref>
        ].
BERTje achieves an F1 score of 0.88 on standard benchmark datasets.
      </p>
      <p>
        Record Linkage: Record linkage, the finding of records that refer to the same
entity, has been a topic of interest for statisticians, historians, and computer
scientists alike [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. There are two main types of record linkage models,
deterministic and probabilistic. The older deterministic models are only able to find
exact matches whilst newer probabilistic models can use a threshold to
determine whether non-exact matches should be linked [
        <xref ref-type="bibr" rid="ref20">19</xref>
        ]. Previous research has
investigated the use of record linkage on text data from the Middle Ages and
the early modern period using an artificially created database [
        <xref ref-type="bibr" rid="ref5 ref7">6</xref>
        ]. Other work has
shown the advantages of probabilistic record linkage for humanities-related data
(e.g. to link entities of three different databases on Finnish soldiers in World
War II) [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ].
      </p>
      <p>
        Named Entity Linking: Named entity linking, the linking of entities to a
knowledge base, has been extensively researched [
        <xref ref-type="bibr" rid="ref19">18</xref>
        ]. Approaches using Wikipedia
as a knowledge base can achieve impressive performance with accuracy scores
ranging from 91.0 to 98.2 for linking to persons [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ]. Recent work has looked at the
use of graph-based methods that use the neighbourhood of an entity to improve
linking performance on knowledge bases other than Wikipedia [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ]. Other work
has tried to remove hand-crafted features by learning entity linking end-to-end
using neural networks [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]. A critical challenge to using entity linking methods
in this domain is the lack of free text associated with the entities under
consideration (i.e. sailors).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>Amsterdam Notary Archives: Jan Verleij. The Amsterdam City Archives
contain scans from various notaries. For this project, we focused on Jan Verleij,
as his office was situated near the harbour of Amsterdam and he is known to
have had many dealings with VOC personnel. All records were digitised with
the help of Handwritten Text Recognition (HTR) software. The dataset<sup>4</sup> consists of
annotated data detailing: date, names of those involved, entry type, description
of record, and the corresponding scans making up the record. The annotations
were performed by 938 volunteers and involved manual tagging of the names of
all clients and their associates found within each record. The names of the notary
and the professional witnesses that worked for the notary were not tagged. An
example of the data is shown in Figure 1.</p>
      <p>VOC data: The VOC data is a list of 774,200 entries describing personnel
sailing for the Dutch East India Company (VOC). As it was possible to
re-enlist after completing a tour, not all entries describe distinct individuals. The
information used for this project comprises name, birthplace, date of employment, date
of resignation, ships that were sailed on, and rank. An example of a single entry
can be found in Figure 2.<sup>5</sup></p>
      <p><sup>4</sup> The dataset can be found at: https://assets.amsterdam.nl/publish/pages/885135/saa_index_op_notarieel_archief_20191105.zip</p>
      <p>Data Annotation: To train and evaluate the proposed record linkage model,
links between individuals in the notary data and the VOC data were created.
The process consisted of selecting possible matches and subsequently confirming
or rejecting these matches. To reduce the amount of annotation work, two
conditions were specified that needed to be satisfied in order for two individuals
to be considered a possible match.</p>
      <p>
        <list list-type="bullet">
          <list-item><p>The first condition was a fuzzy match ratio of at least 80% between the name
of the notary entry and the name of the VOC employee;</p></list-item>
          <list-item><p>The second condition consisted of a notary entry date that was 90 days or
fewer before the leave date of the ship that the VOC employee left on, or a
notary entry date that was 90 days or fewer after the return date of the ship
that the VOC employee returned on.</p></list-item>
        </list>
      </p>
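      <p>The two conditions above can be sketched as a small filter. This is a minimal illustration and not the authors' code: the fuzzy matching library used in the project is not named here, so the standard-library difflib ratio stands in for the fuzzy match ratio, and the function names, argument names, and example dates are hypothetical.</p>
      <preformat>
```python
from datetime import date
from difflib import SequenceMatcher


def name_ratio(a, b):
    """Similarity in [0, 1] between two names, ignoring case (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_possible_match(notary_name, notary_date, voc_name, leave_date, return_date,
                      min_ratio=0.8, window_days=90):
    """Return True when both the name condition and the date condition hold."""
    # Condition 1: fuzzy match ratio of at least 80% between the two names.
    if not name_ratio(notary_name, voc_name) >= min_ratio:
        return False
    # Condition 2: notary entry at most 90 days before departure,
    # or at most 90 days after return.
    days_before_leave = (leave_date - notary_date).days
    days_after_return = (notary_date - return_date).days
    allowed = range(window_days + 1)  # 0 to 90 days inclusive
    return days_before_leave in allowed or days_after_return in allowed


# Hypothetical example: an entry recorded 45 days before departure.
print(is_possible_match("Pieter Jansen", date(1750, 3, 1),
                        "Pieter Janssen", date(1750, 4, 15), date(1752, 6, 1)))
```
      </preformat>
      <p>Only pairs that pass this filter would be shown to annotators, which keeps the manual review focused on plausible candidates.</p>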
      <p>If these conditions were satisfied, then the individual of the notary entry and
the VOC employee were reviewed by annotators. Annotators manually reviewed
possible matches by looking in the HTR text of the notary entry for keywords
obtained from the VOC employee data. Examples of keywords include: the name
of the ships that were sailed on, rank, and place of origin. Aside from these
keywords, some general keywords suggesting involvement with the VOC were
also looked for (e.g. 'Oostindische Compagnie', 'Kamer Amsterdam', and 'Oost
Indie'). If, based on the found keywords, the annotator believed the possible
match to be a true match, then this match was recorded as such.</p>
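      <p>
        Four annotators were used, one of whom was one of the authors of this study.
To check the effort and speed of the annotation process, the annotators were
asked to perform the task for an hour. The slowest annotator annotated 554
possible matches within the given hour, the fastest 991. Fleiss' Kappa was used
to measure inter-annotator agreement over the 554 cases that all four annotators
completed [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ]. The number of matches confirmed by each annotator and matches
disconfirmed by each annotator are found in Table 1a.
      </p>
      <p><sup>5</sup> The dataset can be found at: https://www.nationaalarchief.nl/onderzoeken/index/nt00444?searchTerm=</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>(a) The number of confirmed matches and non-matches from the test set. (b) Fleiss' Kappa with and without annotator D.</p></caption>
        <table>
          <thead>
            <tr><th colspan="3">(a)</th></tr>
            <tr><th>Annotator</th><th>Match</th><th>Non-match</th></tr>
          </thead>
          <tbody>
            <tr><td>A</td><td>6</td><td>548</td></tr>
            <tr><td>B</td><td>6</td><td>548</td></tr>
            <tr><td>C</td><td>8</td><td>546</td></tr>
            <tr><td>D</td><td>47</td><td>507</td></tr>
          </tbody>
        </table>
        <table>
          <thead>
            <tr><th colspan="2">(b)</th></tr>
            <tr><th>Annotators</th><th>Fleiss' Kappa</th></tr>
          </thead>
          <tbody>
            <tr><td>All</td><td>0.359</td></tr>
            <tr><td>Without D</td><td>0.899</td></tr>
          </tbody>
        </table>
      </table-wrap>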
      <p>
        Calculating Fleiss' Kappa results in a value of 0.359, which according to [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ]
equals a fair agreement. Many disagreements can be explained by the many
confirmed matches made by annotator D. After discussing the results with
annotator D, it became clear that they misunderstood the annotation guidelines,
assuming the data was far less imbalanced than it actually is. As a result, the
annotation guidelines were updated to clarify that partial matches of location,
rank, and ships alone are not enough to warrant a match. When annotator D is
excluded from the Fleiss' Kappa calculation, we find a value of 0.899, equalling
an almost perfect agreement (see Table 1b).
      </p>
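      <p>For readers unfamiliar with the statistic, the sketch below computes Fleiss' Kappa from per-item category counts. It is illustrative only: the per-annotator totals reported for this study do not contain the per-item decisions needed to recompute the published values, so the ratings in the example are invented toy data.</p>
      <preformat>
```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items.

    Each row holds, per category, how many annotators chose that category
    for the item; every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (not from the study): columns are [match, non-match], four annotators.
toy_ratings = [[4, 0], [0, 4], [0, 4], [1, 3]]
print(round(fleiss_kappa(toy_ratings), 3))
```
      </preformat>
      <p>With heavily imbalanced categories, as in this annotation task, a single annotator who confirms many more matches than the others lowers the observed agreement on exactly the rare category, which is why the statistic is sensitive to one outlier.</p>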
      <p>To further minimise mistakes made during the annotation process, we
reviewed all confirmed matches, resulting in re-annotating three matches as
non-matches and confirming three ambiguous matches. In total, 1,624 possible matches
were annotated, resulting in 101 confirmed matches.</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>Our approach starts with identifying names in the notary records, for which we
then try to find candidate matches and then the best match in the VOC records.
In the remainder of this section, we detail each step.</p>
      <sec id="sec-4-1">
        <title>Named Entity Recognition</title>
        <p>To identify individuals in the notary data, we first created a basic NER model.
We then adapted it for persons in 18th-century data, and finally other named
entities were recognised.</p>
        <p>
          Basic Model: Two different existing NER models were considered: 1) the spaCy
Dutch model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]; and 2) the Dutch BERT model BERTje [
          <xref ref-type="bibr" rid="ref23">22</xref>
          ]. BERTje was chosen
for its accuracy compared to multilingual BERT models, spaCy for its
simple API and fast processing time. We used recall and precision to evaluate
these models on the available notary data. As the HTR text contains many
misspellings, fuzzy matching is used. If a recognised person name matches at least
90% with an annotated name, the recognised entity is considered to be a true
positive. Since all names recorded in the notary texts are the full name of a
person, person entities consisting of a single token are discarded.
        </p>
        <p>
          Domain adaptation: As 18th-century Dutch differs in form from the
contemporary texts the models were trained on, our results were much lower than
the reported F1 scores of 0.8 or higher. We therefore adapted the model to the
18th-century domain. The first approach adds the named persons the model previously
recognised correctly to further train the model. This method ensures that all
annotated entities are correct; however, as we do not introduce new entities,
this method will probably not increase the recall of the model, only its
precision.
        </p>
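        <p>The fuzzy evaluation used for the basic models (a 90% match threshold, with single-token person entities discarded) can be sketched as follows. This is an assumption-laden illustration: difflib stands in for the unspecified fuzzy matcher, the function names are hypothetical, and the exact way true positives were counted in the project may differ.</p>
        <preformat>
```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """True when two names are at least `threshold` similar (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def evaluate_persons(predicted, annotated, threshold=0.9):
    """Return (precision, recall) for the person names of one notary record."""
    # Annotated names are always full names, so single-token entities are dropped.
    predicted = [name for name in predicted if len(name.split()) > 1]
    # A prediction counts as a true positive if it fuzzily matches any annotation.
    true_pos = sum(1 for p in predicted
                   if any(similar(p, a, threshold) for a in annotated))
    # An annotation counts as found if any prediction fuzzily matches it.
    found = sum(1 for a in annotated
                if any(similar(p, a, threshold) for p in predicted))
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = found / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical record: one name with an HTR-style misspelling, two spurious tags.
print(evaluate_persons(["Pieter Jansen", "Amsterdam", "Kamer Amsterdam"],
                       ["Pieter Janssen", "Evert Hendriks"]))
```
        </preformat>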
        <p>The second approach is to use fuzzy matching to find all instances of the
annotated names of the notary data in each HTR text. To accomplish this, for
every HTR text available in the notary data, the corresponding annotated names
are gathered. Fuzzy matching is then used to find each annotated name within
the text, requiring an 80% match with the name. This method not only allows
previously unrecognised names to be used for training, but it also removes many
falsely recognised persons from the training data.</p>
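        <p>One way to realise the fuzzy matching approach is to slide a window of the same number of tokens as the annotated name over the HTR text and keep windows that are at least 80% similar. The sketch below is an assumption about how such span generation could work, not the project's implementation; the function name, the sliding-window strategy, and the example sentence are hypothetical.</p>
        <preformat>
```python
from difflib import SequenceMatcher


def find_name_spans(text, names, threshold=0.8):
    """Return (start, end, "PERSON") character spans for fuzzy name occurrences."""
    # Record each whitespace-separated token together with its start offset.
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((tok, start))
        pos = start + len(tok)
    spans = []
    for name in names:
        width = len(name.split())
        # Compare every window of `width` tokens against the annotated name.
        for i in range(len(tokens) - width + 1):
            window = tokens[i:i + width]
            start = window[0][1]
            end = window[-1][1] + len(window[-1][0])
            score = SequenceMatcher(None, text[start:end].lower(), name.lower()).ratio()
            if score >= threshold:
                spans.append((start, end, "PERSON"))
    return spans


# Hypothetical HTR output with misspelled and uncapitalised names.
htr_text = "compareerde pieter Janssen varensgezel en Evert Hendrikz kruidenier"
for start, end, label in find_name_spans(htr_text, ["Pieter Jansen", "Evert Hendriks"]):
    print(label, htr_text[start:end])
```
        </preformat>
        <p>Character spans of this form are the kind of annotation that NER training tools generally expect, so the output could be fed into further training after manual spot checks.</p>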
      </sec>
      <sec id="sec-4-2">
        <title>Record Linkage</title>
        <p>We tested both record linkage (RL) and entity linking approaches for linking
individuals in the notary and VOC datasets. The RL method proved to be more
effective. This is likely due to the fact that the entries do not contain much free
text, thus hurting the performance of entity linking as mentioned in Section 2.</p>
        <p>Blocking: As is typical in record linkage, an initial selection of potential matching
records is performed (i.e. blocking). First, possible VOC candidates for each
individual are found within the notary data using fuzzy string matching. To speed
up the selection process, we only try to match individuals from the VOC data
that have a leave or return date within one year of the notary record date. To
consider an individual a possible candidate, there has to be at least an overlap of
80% in the spelling of their names in both datasets. We further narrow down the
initial selection based on the date of the notary record and the leave or return
date of the VOC individual: the date of the notary entry has to be either 90
days or fewer before the leave date or 90 days or fewer after the return date of the
VOC individual.</p>
        <p>
          Linking Records: Dedupe was used to link the records [
          <xref ref-type="bibr" rid="ref9">8</xref>
          ]. Dedupe uses machine
learning to perform fuzzy matching, deduplication, and entity resolution with the
help of active learning. A model trains itself by presenting the user with its
least certain match to judge. Using the user's judgment, the model recalculates
the weights for each feature and repeats the process until the user stops it. Each
match is provided with a certainty score, which allows a threshold to be set for discarding
or keeping matches. One drawback of active learning is that it can be very hard
to confirm and disconfirm the exact same matches for each model, leading to
variations in the optimal threshold for each model.
        </p>
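        <p>How a threshold on the certainty score trades precision against recall can be illustrated with a small, library-independent sketch. It does not use the Dedupe API; the function names, record identifiers, and scores are hypothetical, and choosing the threshold by F1 score is an assumption about what "optimal" means.</p>
        <preformat>
```python
def precision_recall_f1(scored_pairs, true_pairs, threshold):
    """Evaluate the matches kept at a given certainty threshold.

    scored_pairs maps a (notary_id, voc_id) pair to a certainty score;
    true_pairs is the set of annotated true matches.
    """
    kept = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_pos = len(kept.intersection(true_pairs))
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scored_pairs, true_pairs, candidates):
    """Pick the candidate threshold with the highest F1 score."""
    return max(candidates,
               key=lambda t: precision_recall_f1(scored_pairs, true_pairs, t)[2])


# Hypothetical scores for four candidate links, three of which are true matches.
scores = {("n1", "v1"): 0.95, ("n2", "v2"): 0.70, ("n3", "v9"): 0.40, ("n4", "v4"): 0.30}
truth = {("n1", "v1"), ("n2", "v2"), ("n4", "v4")}
print(best_threshold(scores, truth, [0.2, 0.5, 0.9]))
```
        </preformat>
        <p>Because each actively trained model assigns scores on its own scale, the best threshold has to be searched for separately per model, which is consistent with the variation described above.</p>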
        <p>We trained several different models, each containing a different number of
confirmed and disconfirmed matches. All models are trained and tested on a
subset of the notary data that was linked with the VOC data through annotation
as described in Section 3. For a match to be considered a true positive, the RL
model has to match a notary entity with its corresponding VOC entity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>In this section, we first present the results of the named entity recognition step,
followed by the record linkage step.</p>
      <sec id="sec-5-1">
        <title>Named Entity Recognition</title>
        <p>The NER models are evaluated on either the entire notary dataset in the case
of the basic models, or a test subset of the notary data in the case of domain
adaptation. For each model, the entities tagged as a person are compared to the
annotated names of those involved in the notary entry (Table 2a).</p>
        <p>Neither basic model performs very well, most likely due to a combination
of both HTR text and old Dutch being too different from the modern Dutch
the models were trained on. Furthermore, there is a significant difference in
processing time between these two models. Both models were tested on a single
computer possessing an Intel i5-6600 CPU, an NVIDIA GeForce GTX 1060 GPU,
and 16 GB of RAM. BERTje processes all 13,063 HTR texts in about two hours.
The spaCy model processes all texts in about 20 minutes. A possible explanation
for this difference could be that current BERT models, including BERTje, only
allow for a maximum of 512 tokens to be processed at once, requiring many of
the HTR texts to be split into smaller texts.</p>
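        <p>The splitting step implied by the 512-token limit can be sketched as below. This is a simplified illustration: whitespace tokens stand in for the subword tokens a BERT tokenizer would produce (which are usually more numerous, and special tokens also occupy positions), and the function name is hypothetical.</p>
        <preformat>
```python
def split_into_chunks(tokens, max_len=512):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Hypothetical long HTR text of 1,200 whitespace tokens.
tokens = ["woord"] * 1200
chunks = split_into_chunks(tokens)
print([len(chunk) for chunk in chunks])
```
        </preformat>
        <p>A plain split like this can cut a name in two at a chunk boundary, so overlapping windows may be preferable in practice, at the cost of processing some tokens twice.</p>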
        <p>We evaluated two domain adaptation methods for the spaCy NER model, the
previously recognised approach and the fuzzy matching approach, as explained in
Section 4.1. The data was split randomly into a training set containing 70% of
the data, with the remaining 30% held out for testing (Table 2b).</p>
        <p>Both adapted approaches far outperform the basic spaCy model. However,
the fuzzy matching approach also achieves a higher recall than the previously
recognised approach. To validate the performance of the model created by the
fuzzy matching approach, we perform a k-fold cross validation with 10 folds. As
Table 2c shows, the fuzzy matching approach delivers a reliable model
independent of the way the data has been split.</p>
        <p>Error Analysis: To gain more insight into the mistakes that the NER model
makes, we analysed 500 false negatives and 536 false positives. From this analysis,
we found that some errors were caused by flaws in the model, while others can be
attributed to flaws in the HTR or the annotation of the data.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption><p>Results of NER experiment. (a) Precision, recall, and F1 score for the basic NER models. (b) Precision, recall, and F1 score for the different approaches for further training. (c) Precision, recall, and F1 score for the worst fold, best fold, and the average over all folds from the trained NER model.</p></caption>
          <table>
            <thead>
              <tr><th colspan="4">(a)</th></tr>
              <tr><th>Model</th><th>Precision</th><th>Recall</th><th>F1 score</th></tr>
            </thead>
            <tbody>
              <tr><td>spaCy</td><td>0.101</td><td>0.416</td><td>0.163</td></tr>
              <tr><td>BERTje</td><td>0.072</td><td>0.538</td><td>0.127</td></tr>
            </tbody>
          </table>
          <table>
            <thead>
              <tr><th colspan="4">(b)</th></tr>
              <tr><th>Model</th><th>Precision</th><th>Recall</th><th>F1 score</th></tr>
            </thead>
            <tbody>
              <tr><td>Previously Recognised</td><td>0.732</td><td>0.491</td><td>0.588</td></tr>
              <tr><td>Fuzzy Matching</td><td>0.733</td><td>0.737</td><td>0.735</td></tr>
            </tbody>
          </table>
          <table>
            <thead>
              <tr><th colspan="4">(c)</th></tr>
              <tr><th>Model</th><th>Precision</th><th>Recall</th><th>F1 score</th></tr>
            </thead>
            <tbody>
              <tr><td>Worst Fold</td><td>0.694</td><td>0.745</td><td>0.719</td></tr>
              <tr><td>Average</td><td>0.732</td><td>0.736</td><td>0.732</td></tr>
              <tr><td>Best Fold</td><td>0.731</td><td>0.756</td><td>0.743</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The false negatives from the model come in three different types:
          <list list-type="bullet">
            <list-item><p>The name was not tagged (313 counts)</p></list-item>
            <list-item><p>The name was tagged but too dissimilar from the annotated name due to HTR (127 counts)</p></list-item>
            <list-item><p>The name was only partially tagged (60 counts)</p></list-item>
          </list>
        </p>
        <p>The majority of the mistakes occur when the NER model simply does not tag
certain names. There does not seem to be any pattern in the names themselves,
suggesting that the model was simply unable to tag them due to their position
within the text. Explanations for this could be the absence of capitalisation in
some names or a lack of punctuation between names. For example, for the name
'pieter Jansen Hendrik havens', the lack of capitalisation in the first and last
word and the lack of punctuation between 'pieter Jansen' and 'Hendrik havens'
causes the model to recognise 'Jansen Hendrik' as the name. The second most
common mistake is not directly a flaw in the NER model as much as it is a flaw
in the HTR software. A prominent mistake is confusion between the letter 'y' and the
digraph 'ij', as the letter 'y' was more common in the past, whereas
it has been replaced with the digraph 'ij' in many cases in modern Dutch. The
last and least common mistake is the partial tagging of a name. Since tagged
entities consisting of a single word are discarded before evaluation, all names
involving this mistake either include a family name affix or a middle name. The
most common mistake seems to be that the NER model mistakes the middle
name for the last name. This is the case for 41 of the 60 mistakes. An example of
this would be the name 'Pieter Hendrik Kornelisse'. Here the NER model would
recognise the middle name 'Hendrik' as the last name, resulting in tagging 'Pieter
Hendrik' as a person whilst 'Kornelisse' is discarded.</p>
        <p>The false positives can be divided into four different types:
          <list list-type="bullet">
            <list-item><p>The name was annotated but too dissimilar from the tagged name due to HTR (211 counts)</p></list-item>
            <list-item><p>The name was only partially tagged (136 counts)</p></list-item>
            <list-item><p>The name was not annotated (97 counts)</p></list-item>
            <list-item><p>The tagged entity is not an actual person (92 counts)</p></list-item>
          </list>
        </p>
        <p>Similarly to the false negatives, the false positives contain many mistakes due
to the HTR software distorting the name of a person. However, unlike the false
negatives, the use of the letter 'y' and digraph 'ij' seems to have far less of an
impact. Instead, it would seem that since many names can be found multiple
times within a single text, the HTR software has a higher chance of making just
enough mistakes for one of the occurrences of a name so that it is no longer
similar enough to be recognised as the same name. The second most common
mistake was that the name was only partially tagged. Similarly to the partially
tagged false negatives, the largest problem here seems to be that the middle name
is mistaken for the last name. The third most common mistake was the absence
of annotation for an entity that was confirmed to be a person. These mistakes
can be explained by the fact that professional witnesses were not annotated.
The least common mistake is that the tagged entity was simply not a person.
Common mistakes are the combination of a person's last name along with his or
her trade, place of origin, or the first name of a different person. The latter case can
be explained by the lack of punctuation in the texts. Examples of this would be
'Evert Hendriks Kruidenier', where kruidenier is the trade of the person Evert
Hendriks, or 'Wilhelmina Hugenoot van Volendam', where van Volendam is the
place of origin. It is important to note that it is common for Dutch last names
to be either a trade or the place of origin, making this a hard problem to solve.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Record Linkage</title>
        <p>We evaluated the RL models on a subset of the test data containing a
representative number of matches and non-matches for the entire dataset. We evaluated
ten different models, each with a different number of matches confirmed and
disconfirmed during active training. We decided to test a model for each increment
of ten disconfirmed matches, as the minimum recommended amount of confirmed
and disconfirmed matches is ten, according to the developers of Dedupe. The
results of these models can be found in Table 3.</p>
        <p>As expected, active learning causes some fluctuation in the optimal threshold
for each model. The models that have a very low optimal threshold, such as the
50 disconfirmed and 100 disconfirmed models, seem to have the best performance.
Despite the low threshold, these models still obtain a satisfactory precision score.
This suggests that while these models are hesitant to make matches, the matches that
they do make are accurate. Conversely, the models with high thresholds seem
to make more matches, which has to be compensated for with a high threshold,
resulting in a reduced recall. An important thing to keep in mind when looking
at these results is that the test data had only 26 actual matches. This means
that just a single true positive more or less can make a significant difference in
the recall. For example, the difference between the recall of the 50 disconfirmed
model and the 100 disconfirmed model can be explained by a single true positive.</p>
        <p>Error Analysis: As the test set was relatively small, we analysed all mistakes
made by the best performing RL model. In total, the model was responsible for
four false positives and four false negatives, with just a single type of mistake for
both categories. In the case of the false positives, the entity recognition of ship
names falsely recognised unrelated words as ship names. These were then also
deemed similar enough by the model to warrant a match. An example of this is
the word 'beiden', the Dutch word for 'both', that was falsely recognised as the
ship name 'Leiden'. Two of the four false positives had no matches in rank or
location, implying the model based the match solely on the presence of the ship
name.</p>
        <p>For the false negatives, the problem is the inverse of that of the false
positives. In these cases, a ship name was not found in the texts, causing the
model to not make any matches. Again, the model seems to value the presence
of a ship name far above the presence of a rank or location, as two of the four
false negatives did have a matching rank or location. However, if the model finds
a ship name, a matching rank or location does increase the certainty score of
the model. Matches based solely on a found ship name possess certainty scores
lower than 0.5. Meanwhile, matches with a matching rank or location all possess
scores of 0.65 or higher, with most obtaining a score between 0.85 and 0.99.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>We trained an NER model and an RL model to recognise and link entities between
notary records from the 18th-century notary Jan Verleij and the VOC employee
records. Our experiments show that readily available NER models, such as the
Dutch spaCy model and BERTje, perform poorly on HTR data of old Dutch
texts. However, if some annotated named entities are available, performance can
be improved from F1=0.163 to F1=0.743. This is still somewhat below the
state-of-the-art performance of these models on modern text (e.g. F1 = 0.883 on the
CoNLL-2002 benchmark dataset), but the notary records constitute far less training
data and are less conformant to spelling and punctuation standards.</p>
      <p>Although the precision of linking entities is quite high, the recall is still
somewhat lacking. In practice, this means that although the predicted matches
will almost always be a true match, only about 60-70% of the actual matches
are found. The usefulness of the current model depends on the use case and the
amount of data that is available. If enough data is available then the current
model can produce a sufficient number of actual matches without providing too
many false matches. However, if data is scarce then the matches not found by
the model might be necessary, reducing the usability of the model.</p>
      <p>Annotated data for the locations, ranks, and ships in the notary records
would be a valuable addition to the NER model. For the RL model, the lack
of annotated data greatly reduced the amount of data that could be trained and
tested on. It is clear that for the advancement of NLP on old text, more training
data is needed. Given the experience in this project, we believe that there is
further scope for finding latent training data in newly digitised historical data.
The dataset created for this project can provide a template for such initiatives.</p>
      <p>There are certainly areas of improvement possible for the language models.
For the NER model it would be interesting to fully train multilingual versions
of BERT on this type of text. Since these models perform extremely well on
modern texts, further training of these models for old texts might result in far
better models than those obtained in this project.<sup>6</sup></p>
      <p>Aside from improving the RL model, the linking of entities might also be
improved by instead opting to make use of named entity linking techniques.
Further research could be conducted into named entity linking with smaller
local knowledge bases instead of the large knowledge bases such as Wikipedia.
Combining named entity linking and record linkage is also an interesting avenue
of research given the semi-structured nature of much of this data.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The authors would like to thank their KNAW HuC colleagues from the maritime
careers project, in particular Lodewijk Petram and Jelle van Lottum for their
advice on all things VOC, Jirsi Reinders from Stadsarchief Amsterdam for access
to and explanations of the notary data, our annotators for helping create the
gold standard and the anonymous reviewers for their feedback and suggestions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection</article-title>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Delobelle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winters</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>RobBERT: a Dutch RoBERTa-based language model</article-title>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          . CoRR abs/1810.04805 (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Explosion: spacy (
          <year>2020</year>
          ), https://spacy.io/models/nl
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          6 The first steps of this have already begun with the MacBERTh project that aims to create natural language models trained on both English and Dutch historical text data: https://pdi-ssh.nl/en/2020/06/funded-projects-2020-call/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fleiss</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          (
          <issue>5</issue>
          ),
          <fpage>378</fpage>
          (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          6.
          <string-name>
            <surname>Georgala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Burgh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meeng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knobbe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Record linkage in medieval and early modern text</article-title>
          .
          <source>In: Population Reconstruction</source>
          , pp.
          <fpage>173</fpage>
          –
          <lpage>195</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gillick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunk</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multilingual language processing from bytes</article-title>
          (
          <year>2015</year>
          ), https://arxiv.org/pdf/1512.00103.pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gregg</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eder</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Dedupe</source>
          (
          <year>2019</year>
          ), https://github.com/dedupeio/dedupe
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hachey</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nothman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curran</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Evaluating entity linking with Wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>194</volume>
          ,
          <fpage>130</fpage>
          –
          <lpage>150</lpage>
          (
          <year>2013</year>
          ), http://www.sciencedirect.com/science/article/pii/S0004370212000446, Artificial Intelligence, Wikipedia and Semi-Structured Resources
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          10.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Collective entity linking in web text: A graph-based method</article-title>
          .
          <source>In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>765</fpage>
          –
          <lpage>774</lpage>
          . SIGIR '11, Association for Computing Machinery, New York, NY, USA (
          <year>2011</year>
          ), https://doi.org/10.1145/2009916.2010019
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          11.
          <string-name>
            <surname>Koho</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskinen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Hyvonen, E.:
          <article-title>Integrating historical person registers as linked open data in the warsampo knowledge graph</article-title>
          .
          <source>In: SEMANTiCs</source>
          <year>2020</year>
          ,
          <article-title>In the Era of Knowledge Graphs</article-title>
          , Proceedings. Springer-Verlag (
          <year>09 2020</year>
          ), accepted
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kolitsas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganea</surname>
            ,
            <given-names>O.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>End-to-end neural entity linking</article-title>
          . arXiv preprint arXiv:1808.07699
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>CoRR</source>
          abs/1603.01360 (
          <year>2016</year>
          ), http://arxiv.org/abs/1603.01360
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          14.
          <string-name>
            <surname>Landis</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G.G.</given-names>
          </string-name>
          :
          <article-title>The measurement of observer agreement for categorical data</article-title>
          .
          <source>Biometrics</source>
          pp.
          <fpage>159</fpage>
          –
          <lpage>174</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>26</lpage>
          (
          <year>2007</year>
          ), https://www.jbe-platform.com/content/journals/10.1075/li.30.1.03nad
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          16.
          <string-name>
            <surname>Petram</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Lottum</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Koert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Small lives, big meanings. Expanding the scope of biographical data through entity linkage and disambiguation</article-title>
          .
          <source>In: BD</source>
          . pp.
          <fpage>22</fpage>
          –
          <lpage>26</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          17.
          <string-name>
            <surname>Petram</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Lottum</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Maritime careers: The life and work of European seafarers, 1600-present</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Entity Linking: Finding Extracted Entities in a Knowledge Base</article-title>
          , pp.
          <fpage>93</fpage>
          –
          <lpage>115</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2013</year>
          ), https://doi.org/10.1007/978-3-642-28569-1_5
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sayers</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben-Shlomo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blom</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steele</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Probabilistic record linkage</article-title>
          .
          <source>International Journal of Epidemiology</source>
          <volume>45</volume>
          (
          <issue>3</issue>
          ),
          <fpage>954</fpage>
          –
          <lpage>964</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          20.
          <string-name>
            <surname>Van Bochove</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Velzen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Loans to salaried employees: the case of the Dutch East India Company, 1602–1794</article-title>
          .
          <source>European Review of Economic History</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>19</fpage>
          –
          <lpage>38</lpage>
          (02
          <year>2014</year>
          ). https://doi.org/10.1093/ereh/het021
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          21.
          <string-name>
            <surname>Velzen</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaastra</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>Thematische collectie: VOC opvarenden</article-title>
          (
          <year>2000</year>
          ). https://doi.org/10.17026/dans-xpp-abdp
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          22.
          <string-name>
            <surname>de Vries</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Cranenburgh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bisazza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caselli</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Noord</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>BERTje: A Dutch BERT model</article-title>
          . arXiv preprint arXiv:1912.09582
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>