Traces through time: a case-study of applying statistical methods to refine
                       algorithms for linking biographical data
                                                 Mark Bell, Sonia Ranade
                                                    The National Archives,
                                                         Kew, London
                               E-mail: { mark.bell; sonia.ranade }@nationalarchives.gsi.gov.uk

                                                               Abstract
The Traces through Time project, which ran at The UK National Archives in 2015, developed algorithms and tools to link people
appearing in historical records and to assign robust measures of confidence to the connections that are made. The method has application
across the digital humanities, including for biographical research.
Fuzzy matching relies on the availability of background statistics on the population, the distribution of data values, data quality and the
type and frequency of errors. This paper describes work to refine the original algorithms through implementation of a learning approach
in which insights arising from one analysis are fed back into the algorithm to improve the baseline statistics for subsequent analyses.
We find that this iterative approach delivers significant improvements over 'raw’ scoring mechanisms. It enables us to carefully target the
type and degree of fuzzy matching to be applied and can help balance the poor precision that results from allowing increased ‘fuzziness’
against the poor recall that arises from a more restrictive approach. Future work will extend the approach beyond names and dates of
birth, and will embed these enhancements into the Traces through Time framework and tools.

Keywords: Record Linkage, String Similarity, Statistical Inference

                                                                         between different individuals in the records, are
      1.    Background: Approach to Record                               problematic for historical data. Often we have neither a
           Linkage and confidence scoring                                date of birth nor an age. If only an age is provided it is not
The identification of a link between two occurrences of an               necessarily clear on which date that age applies - is it the
individual in the historical record is achieved through                  individual’s age at the date of the record? Or their age at
assessing the similarity between the individual attributes               the date of some other event mentioned in the record? And
of the two entities to be compared. During this project, we              dates are often estimated or rounded. When a date of birth
have worked extensively with data from World War One                     has been captured it is not necessarily accurate: consider
service records from The National Archives collections.1                 the case of under-18s claiming to be older in order to
The datasets in question were initially created by indexing              enlist for military service. So, we require new techniques
the original paper documents, and our analysis is limited                to derive confidence scores for dates, all based on
to those data attributes which were consistently captured                estimated distributions.
by previous digitisation and transcription projects. For                 In the case of a year being captured on the record, we
WW1 data we are generally restricted to linking records                  create a probability distribution of likely values. This
based only on names and either age or date of birth. Other               allows us to fuzzy match two different year values,
attributes such as place of birth and service number are                 adjusting for data quality and deriving a probability that
sometimes available but are not consistently captured                    the underlying values are the same. Instead of a single
across datasets.                                                         year, the record may state a range of years, possibly
Record Linkage is achieved using a probabilistic method                  derived from the age. In this case the calculation is the
based on the work of Fellegi and Sunter, (1969) and a                    same but the confidence scores returned will vary
variation suggested by Winkler (1990) to account for                     depending on the range of the stated ages. Finally, there
spelling differences between pairs of textual attributes.                are records with no indication of birth period. In this
The basic approach is to find, for each attribute, the ratio             situation, the best we can do is to derive a frequency
between the probability that a pair of records refer to the              distribution for the whole dataset, drawing on external,
same person and that of them referring to two different                  expert knowledge if this information is not available in
people. The Winkler variation allows the use of string                   the data. The calculation is then the probability that the
comparison algorithms to accommodate spelling                            person in dataset A could be in dataset B.
variations and applies a weighting to reduce the score for               The result of these individual attribute comparisons is a
an attribute comparison if the attributes lie within a                   score on a logarithmic scale which can be used to assess
certain threshold of similarity but are not identical.                   our confidence that the pair of occurrences represent the
Appendix A gives a brief outline of the calculation. Our                 same person.
work has taken this approach a step further and applied a
range of weightings to achieve more fine-grained fuzzy
matching.
Dates of birth, which are a key attribute for discriminating

1
    National Archives Discovery -
     http://discovery.nationalarchives.gov.uk/


                                                                    24
    2.   Learning from record linkage results                       2.2 Fuzzy name comparisons
                                                                    In the case of name comparisons the variation between
This paper focuses on methods for refining the statistical          name transcriptions for records representing the same
model described above by learning from the results of               person can be thought of as a function of several factors
matching many datasets. We describe an approach to                  (list not exhaustive):
identifying and incorporating common differences in                        Regional spelling variations.
textual information arising from factors such as:                          How the recorder hears the name, particularly
handwriting recognition errors, typographical errors and                       with unfamiliar names and regional accents.
phonetic errors made when names are recorded. A                            The recording medium – hand written vs. typed.
different approach is described for dates of birth, where                  Involuntary errors during data capture, spelling
the algorithm must accommodate inaccuracies in                                 mistakes while writing or typing the original
recording such as mis-representation of age or rounding                        document.
of declared ages 2 . In this case, the age distribution                    Involuntary errors         during transcription,
observed for each dataset is fed back into the algorithm to                    including those caused by difficult handwriting.
support a statistical approach to calculating the likelihood
that two occurrences of a person with different recorded            A commonly seen example of a transcription error caused
dates of birth, in fact, relate to the same individual. As          by handwriting is the cursive ‘T’ being misread as the
each incremental enhancement of the algorithm improves              letter ‘J’, due to the similarity between those letters in that
the results of the matching process, these in turn, reveal          style of writing. By analysing the frequency of high
further discrepancies in the data, from which the                   confidence matches which have this specific difference in
algorithm can learn. A number of distinct areas are being           their names we can refine the confidence scores returned
worked on, all building on previous research and reliant            when this difference is encountered. Without this more
on the gathering of statistics over time. Here we highlight         nuanced approach we could miss perfectly good matches
what has been done so far and which emerging ideas are              which only differ on a single initial or increase the rate of
being explored.                                                     false positives by allowing any single initial difference.
                                                                    The key to the approach is to capture the results not just
2.1 Learning from record linkage results                            for a single dataset but to associate the difference with
In our work so far we have discovered benefits in deriving          metadata connected to a collection as a whole – for
an age profile for a dataset which lacks dates of birth by          example, records in a particular format from some defined
linking to one which does. We have also improved linkage            time-period – allowing accumulation of generalised
results by allowing for discrepancies in ages.                      statistics based on many examples which are typical of a
In order to improve on this technique, we analyse the               type of record. The misreading of ‘T’ and ‘J’ is far less
dates of birth for high confidence matches to build a               likely in typed records since the typed letters have a
statistical profile of common differences. For example, in          distinct appearance but there are likely to be other typical
WW1 records this approach will highlight that for                   differences arising from keyboard layout.
soldiers in the 16-20 age range it is more common for two
records referring to the same person to have different              Simply informing the model of the probability that Ts and
years of birth than for those in, say, the 30-35 range.             Js have been interchanged has delivered good results, so
Therefore, if we had two records with years of birth 1882           the next step is to use the data to identify a wider range of
and 1883 (age 33-34 in 1916) then we would have less                commonly occurring transcription errors. We use the
confidence that they are a match than if they were 1897             Jaro-Winkler measure to find similar strings and
and 1898 (age 18-19 in 1916). This behaviour is particular          weightings to assign confidence depending on this
to WW1. Examination of another dataset such as the GRO              measure. Our aim now is to look at methods for using the
death registers 3 shows that it is quite common for the             difference itself to increase accuracy.
deceased’s age to have been guessed at the time of
registration. We would therefore want to accept a different         2.3 Name frequency statistics
profile of differences in this dataset, where there is a            In the absence of high volume name data, such as a
higher likelihood of discrepancies in dates of birth for            census, it can be difficult to accurately calculate the
older people as their deaths are less likely to have been           frequency of occurrence of a particular name in the
registered by a close relative.                                     population that appears in the records under
                                                                    consideration. This is especially the case if the data
                                                                    sources being matched are relatively small (< 10,000
2
  The ‘age heaping’ effect is observed in datasets which            records). Consider a dataset with 1,000 records including
record age (rather than date of birth). The resulting               Messrs. Taylor and Zephania. From this alone, we might
distribution is skewed, typically showing peaks at ‘round’          surmise that these surnames have equal probability of
ages (e.g. 10, 20, 30…). For an illustration, see the 1911          0.001 while, in reality, the former is more common and
census graphic from the ONS data visualisation centre
                                                                    the latter is rare. Accumulating match results over
( http://www.bbc.co.uk/news/uk-18854073).
3
  General Register Office death registrations supplied by           multiple datasets allows us to create a larger population of
http://www.gro.gov.uk/gro/content/                                  individuals from which to derive probabilities. The


                                                               25
difficulty arises from the fact that the most common                   transformation pattern Tx that Tx on S1 yields S2 if
names will, by nature, belong to lower confidence linked               applying the pattern Tx to S1 results in the string S2.
pairs, which are therefore more difficult to associate as
referring to the same individual. We have also identified a            DEFINITION 2: We consider a transformation pattern to
caveat to the assumption of attribute independence in the              be Common (a CTP) if it occurs above a certain
general linkage model. By clustering forenames and                     percentage of the time. More concretely, if we take n
surnames together we have identified groups of names                   record pairs where each pair has a record which contains
that typically occur together and which appear to align                an attribute, a1, containing the string ss1, then applying TP
with national or ethnic groups - e.g. Irish, Italian,                  to the equivalent attribute, a2, on the linked record will
Hispanic/Portuguese. Although the names themselves are                 yield a1 at least c% of the time. An attribute pair is
still independent of one another, there is an implicit                 defined as having attributes a1 and a2, so the set of n pairs
dependence with a third variable, nationality, which is not            P where either a1 or a2 contain the string ss1 is:
directly expressed in the data. As a result, name matches
such as Patrick Murphy, a common Irish name, and Angus                     𝑷 = {𝒑𝟏<𝒂𝟏,𝒂𝟐 > ,      𝒑𝟐<𝒂𝟏,𝒂𝟐 > ,      ⋯ , 𝒑𝒏<𝒂𝟏,𝒂𝟐> }
MacDonald, common in Scotland, are assigned higher
confidence scores than is warranted because each name                  Tx is the transformation pattern that transforms ss1 to ss2.
part, considered individually, is not particularly common
in the population as a whole. We mitigate this type of                                                    𝒚𝒊𝒆𝒍𝒅𝒔
association in the Traces through Time approach by                             ∑𝒏𝒊=𝟎 {𝟏 𝒊𝒇 𝑻𝒙 ∶ 𝒑𝒊 〈𝒂𝟏 〉 →         𝒑𝒊 〈𝒂𝟐 〉
arbitrarily reducing the population used in the probability                           𝟎 𝒐𝒕𝒉𝒆𝒓𝒘𝒊𝒔𝒆                             ≥𝒄
calculation by a factor of ten. This has some basis in the                                       𝒏
data, as Ireland has 10% of the population of England, for
example. However, it is a blunt tool. We are now working               In order to find common patterns of transformation we
on refining this technique, again by gathering statistics              must calculate all n-grams around each character
through linking multiple datasets.                                     difference between 𝒂𝟏 and 𝒂𝟐 , where the corresponding
                                                                       n-grams in the two attributes are different, and the n-gram
      3.    Identifying common differences                             length is up to the length of the longest string. In order to
                                                                       normalize different length strings we use the
3.1 Differences in names                                               Needleman-Wunsch (NW) (Needleman and Wunsch,
                                                                       1970) alignment function to find the maximal alignment
Our approach for identifying common differences in                     of two strings and then pad any gaps in alignment with
transcription and spelling is to look at matched records               ‘-‘ symbols or an ‘@’ symbol at the end of a string to
which differ in a single attribute. For example, the                   differentiate between characters being inserted within a
matched pair “Robert Adrian Gardner, born 17/11/1898”                  string and those added to the end.
and “Bob Adrian Gardner, born 17/11/1898” have
different first names but are otherwise identical. If the first        For example, NW(‘needle’, ‘nedle’) produces the
names were the same we would consider this to be a high                aligned strings:
confidence match. However, in our existing statistical                 needle
model the first name would contribute a negative
                                                                       ne-dle
weighting to the calculation. By looking at record pairings
which only differ on a single attribute and which would
score above a certain threshold, T, indicating a high                  And the resulting set of n-gram pairs is: { (‘e’, ‘-‘), (‘nee’,
confidence match, if the difference was not there we can               ‘ne-‘), (‘need’, ‘ne-d’), (‘needl’, ‘ne-dl’), (‘needle’,
ascertain patterns in these differences. So in the example             ‘ne-dle’), (‘ee’, ‘e-‘), (‘eed’, ‘e-d’), (‘eedl’, ‘e-dl’),
above, if we see a number of record pairings with the                  (‘eedle’, ‘e-dle’), (‘ed’, ‘-d’), (‘edl’, ‘-dl’), (‘edle’,
same pattern, we may deduce that Bob is an alternative                 ‘-dle’) }
form of Robert.
                                                                       The next step is to represent these n-gram pairs in a tree
In the following definition, when we refer to a                        structure where the parents of a pair are the pairs which
transformation we mean some difference in spelling has                 are produced by adding one character to each n-gram in
been encountered between two attributes which could be                 the original pair. So (‘ed’, ‘-d’) is a parent of (‘e’, ‘-‘)
due to common spelling variations (‘Phillip’ and ‘Philip’),            where ‘d’ has been appended to each entry in the child
spelling errors (‘Roland’, ‘Rolend’) or diminutive forms               pair. The resulting tree is shown in figure 1.
(‘Bob’, ‘Robert’).

DEFINITION 1: We define a Transformation Pattern (TP)
as a function which transforms a string S1 to a string S2 by
substituting any substring ss1 of length l in S1 with another
string ss2 also of length l. We shall also say that for a


                                                                  26
                                                                      in the ADM188 series which was not present in ADM339.
                                                                      Linking ADM337 and ADM339 returned results that we
                                                                      would have expected for WW1 records – the rate of 1 or 2
                                                                      year differences was between 0.17% and 1.44% for years
                                                                      of birth up to 1897 (taking the higher year of birth of any
                                                                      record pair), increasing to 11.69% and 25.58% for years
                                                                      of birth 1898 and 1899 respectively. Additionally we
                                                                      found that in the case of 1898 7.78% of pairs had a
                                                                      difference of 1 year, while for 1899 23.26% of pairs had a
                                                                      difference of 2 years. This tallies with our expectation of
                                                                      16 and 17 year olds inflating their ages in order to join the
    Figure 1: n-gram tree for name pair “needle, ne-dle”              war effort from 1916.

We can now generate n-gram trees for every pair of                    When we linked to ADM188 we discovered a different
different attributes from a list of matched pairs of records.         pattern. There was still a peak of 28.91% of 1897 births
These are then stored in a Directed Graph structure with              with 1 year difference. However, we saw a consistent
each node representing an n-gram pair and edges having a              10-20% 1 year difference rate for all other years of birth.
weight equal to the number of instances of their parent               One theory for why this should happen is that perhaps the
<𝒂𝟏 , 𝒂𝟐 > (the root node in the n-gram tree).                        application form asked for day and month of birth plus
                                                                      age at application. The year of birth was then calculated
We have made the graph generated from the results of                  from the age which introduced a high proportion of errors
linking a number of WW1 collections together available                in the year of birth.
to view online.4
                                                                              4.     Results of CTP identification
The reasoning behind loading the n-gram trees into a                  Figure 2 shows the graph of attributes which have
graph is that we are aiming not just to identify single letter        undergone the transformation p = “e” → “i“.
transcription     errors    but     also    multi-character
transformations, which may be phonetic in nature – for
example, ‘f’ for ‘ph’, or the prefix ‘Mc’ for ‘Mac’.

The final processing stage is to coalesce nodes with only
one parent as, if we have found a pattern of
transformation, we do not need to see that pattern repeated
in longer n-grams which are not encountered in other
attribute pairs. For the “needle” example, if there is no
other pairing <𝒂𝒊 , 𝒂𝒋 > where

      𝒚𝒊𝒆𝒍𝒅𝒔
𝒑: 𝒂𝒊 →        𝒂𝒋 for p = “e” → “-“

then we can connect the root node “needle, ne-dle” to the
node “e, -“ and remove all intermediary nodes in the tree.

3.2 Differences in years of birth
Applying the n-gram method to years of birth identified
lots of common differences but did not unearth any
                                                                      Figure 2: n-gram tree for attribute pairs where an e is
patterns in transcription errors, such as 1 for 7, as we may
have expected. We found a more effective approach was                 replace by i
to use the arithmetic differences between years.
                                                                      This diagram highlights a number of patterns of interest.
We compared one series of naval records, ADM337,                      The node sizes represent the weighted degree of the node
against two other naval series – ADM339 and ADM188.                   so we can see that the pairing ‘Wilfred, Wilfrid’ is the
The method was to analyse pairs of records which were                 most common. If we look at the most common spelling
identical in every way apart from the year of birth. The              differences overall, we find that “e” → “i“ occurs very
results were intriguing and suggest a pattern of behavior             frequently, but figure 2 suggests this result may be skewed
                                                                      by the very common spelling variation of Wilfred/Wilfrid.
4
                                                                      An even stronger example is given by figure 3 which
https://www.dropbox.com/s/8xndg8o26g9096d/ngram_tr
                                                                      demonstrates that the seemingly frequent transformation
ees_2.svg?dl=0

                                                                 27
“i” → “y“ is almost entirely due to the variant spelling of          have undergone the transformation against the number
Sidney/Sydney which accounts for 100 of the 108                      which are un-transformed , as shown in table 1:
occurrences of this transformation.


                                                                     TP              S1 == S2            𝑦𝑖𝑒𝑙𝑑𝑠        %
                                                                                                   𝑡𝑝: 𝑆1 →       𝑆2
                                                                                                                       transformed
                                                                     1               4482         43                   00.95
                                                                     2               1800         78                   04.15
                                                                     3               19076        163                  00.85
                                                                     4               424          31                   06.81
                                                                     Table    1:     Percentage    of   names     undergoing   the
                                                                     transformations. TP column refers to the numberings at top

Figure 3: n-gram tree for attribute pairs where an ‘i’ is            of section 5.

replaced by ‘y’
                                                                     These percentages are fed into our statistical model as
There are a number of CTPs which are worthy of further               probabilities. We will test the effectiveness of this by
investigation and we will examine the effect of capturing            comparing three methods:
four of these patterns in our statistical model below.
                                                                              Winkler - Use the current process of applying a
         5.   Using CMTs in record linkage                                     weighting to the probability score based on a
                                                                               string similarity.
5.1 Approach                                                                  Probability - When one of the 3 TPs is
                                                                               encountered we multiply the probability score by
We will further analyse a method for building four
                                                                               the appropriate percentage according to table 1.
patterns, which emerged from the method described
                                                                              Equivalent - We treat any string S2 which is the
above, into our probabilistic linkage model.
                                                                               result of applying the TP to S1 as equivalent to
The patterns are:
                                                                               S2 and therefore consider the strings to be equal
     1. “Henry” → “Harry“                                                      in our linkage algorithm – i.e. we do not apply
    2.    “Mac” → “Mc“                                                         the weighting in the first scenario.

    3.    “ll” → “-l”                                                Ideally we would use a golden record set with known
                                                                     results to compare the results of applying each method.
    4.    “J” → “T”                                                  Due to the resource intensive nature of creating golden
 In our existing record linkage process the names “Henry”            record sets of sufficient volume for record linkage, we ran
and “Harry” are considered different enough that a                   linkage exercises using records which had both a name
negative weighting is applied to our confidence score.               and date of birth. Only name was used to derive links, date
The effect of this is that the score changes from +1.84              of birth was used for later verification of these links. Since
(where both records use the name “Henry”), to -2.06 (for             most of testing was with files of circa 13k records there
“Henry”/“Harry”), a swing of -3.9. We consider a total               are unlikely to be enough pairs of different people with
score above 7.5 to be a high confidence match, so we will            exactly the same name and date of birth to have a
look for records where the names are identical and have a            significant effect on results. With this method we have at
score of > 7.5 or the names are different only as a result of        least a pseudo-golden result set.
the transformation “Henry” → “Harry“ and have a score
of > 3.6.                                                            5.2 Results and discussion
We follow a similar process for patterns 2 and 3, but this           The use of probabilities in our record linkage algorithm
time taking the swing to be only -1.5 as the resulting               gave good results, especially for patterns 1 and 4 which,
strings under these transformations are only one character           under our existing method, have a large negative impact
different and this has a smaller negative effect in our              on the match score. If we consider a strong match to have
calculations.                                                        a score above 7.0, then in a model where an incorrect
Finally for pattern 4, which is the scenario where an initial        transcription of the letter ‘J’ to ‘T’ causes the score to be
‘J’ or ‘T’ has been incorrectly transcribed (as a ‘T’ or a           lowered by 4, only records where the other attributes total
‘J’), we calculate using a swing of -6. This is a default            at least 11.0 would pass this threshold. In the absence of a
value in our model for different initials.                           date of birth, that individual would have to have quite an
We can then derive the probability of each transformation            unusual name for the link to be discovered.
occurring by comparing the number of records which
                                                                     Table 2 shows the results for each experiment. We ran

                                                                28
each experiment with a scoring threshold of >5.0. Since               investigation, so these results might be more reasonably
the ‘equivalent’ method treated every transformation as an            interpreted as the Equivalent method scoring some
agreement it will consistently score higher than the other            matches too highly. This is an example of the
two methods, so for the purposes of this test we consider it          recall-precision trade-off. The threshold allows the user of
to provide a baseline of results. This explains why it                the application to choose between seeing many possible
produces no false negatives in our results table.                     results, high recall, and restricting the results to only the
                                                                      most likely matches, high precision.
The first thing to note is that both the Winkler and
Probability methods produce far fewer false positives                 The probability method is particularly strong when tested
than the Equivalent method in all tests. This is to be                on patterns TP 1 and TP 4. Here Winkler performs badly
expected but is important to note since a high number of              since it treats “Henry” and “Harry”, and “J” and “T”, as
false positives could waste a considerable amount of time             different strings. We could pick up “Henry” and “Harry”
and effort if the linking approach is used to identify                as similar strings by lowering the Jaro-Winkler threshold
potential matches for further research. We can find all               in our string matching but this has a knock-on effect of
True Positive links by lowering the scoring threshold but             creating more false matches in our overall linkage results
there is always a balance to be made with the False                   and additionally reduces performance by generating more
Positive rate. In this respect the Winkler method was the             candidate pairs for matching. For TP 1 we can provide a
best performer for patterns 2 and 3, but the Probability              good example of the effect of lowering the threshold.
method was a close second.                                            Reducing it to >4.8 results in a 100% True Positive rate,
                                                                      albeit at the expense of 7 extra False Positives. Again the
Both Winkler and Probability failed to find some of the
                                                                      Equivalent method creates a high number of False
links but, as discussed above, they could always be found
                                                                      Positives, although performs better on TP 1 than for other
by reducing the scoring threshold. We should also
                                                                      patterns.
remember that the threshold chosen was an arbitrary one
for the purpose of comparing the methods under


               True Positives                        False Positives                            False Negatives
TP             Winkler     Prob.       Equiv.        Winkler      Prob.          Equiv.         Winkler     Prob.        Equiv.
1              1           5           7             0            1              17             6           2            0
2              8           8           11            7            10             82             3           3            0
3              11          10          12            10           10             33             1           2            0
4              0           5           6             0            17             71             6           1            0
Table     2:     Results   of      testing   three     scoring    methods        for      the    four   transformation     patterns


6.      Name Independence                                                 born in Scotland could be called Angus and only 20% of
                                                                          our population are from Scotland, then this probability
6.1 The independence assumption                                           becomes 0.0019. When fed into our logarithmic scoring
                                                                          algorithm, this represents a difference of 0.7 in the scores
                                                                          obtained. Thus, if a name is strongly dependent on
The probability model we use in TTT assumes that the                      country of origin then it makes sense to calculate the
attributes within a record are independent of each other.
                                                                          probability of that name based on the population of that
It can easily be shown that this isn’t always correct by
                                                                          country, not the population of the entire United
considering the relationship between forename and                         Kingdom.
gender, for example, –a ‘Mary’ is far more likely to be
female than male. However, for record linkage, using
several variables it has been found to be a reasonable
assumption which maintains simplicity in the model                        6.2 Calculating dependence
without compromising the accuracy of results. In the                      In the example above, external knowledge would readily
case of matching historical records, where we often only                  identify ‘Angus’ as a Scottish name. However, these
have the name of the person as a linking key, we have                     associations are also evident in the data. We took a
found that the assumption does not hold. In particular we                 selection of the most common forenames and surnames
have identified a relationship between the                                from our series of 582k records and classified them as
national/cultural background of a person and their name.                  English, Scottish, Irish, Hispanic or Italian. Then we
In the matching results this is manifested in the form of                 selected records from the series where the person’s name
unexpectedly high scores for some names. Consider the                     was comprised only of names in these forename and
name Angus which is typical to Scotland. In a series of                   surname lists. Figure 4 shows the result of cross
582k naval service records there are 218 Anguses which                    referencing the classifications of forename and surname
suggests a probability of 0.00037 of being called Angus.                  for these people. The x-axis represents the classification
If we were to imagine for a moment that only people                       of the forename, each bar represents that of the surname,


                                                                 29
and the height of the bars represents the percentage of
people. So we can interpret the tallest bar as – “93% of                    𝑓(𝑋)             𝑓(𝑌)               𝑑
people with an Italian forename also have an Italian              − log10 (      ) − log10 (      ) − log10 (       )
                                                                              𝑃                𝑃              𝑃𝑐⁄
surname”.                                                                                                         𝑃
                                                                Since we have an estimate of d we only need to estimate
                                                                Pc for each nationality group to adjust the score to
                                                                account for dependence.


                                                                6.4 Estimating national populations
                                                                In order to estimate the population Pc for nationality C
                                                                we need to look at names which are common enough to
                                                                be linked to several candidates. We matched together
                                                                two lists of names, A and B, with 50k and 582k records
                                                                respectively, and filtered out the matches for a single
                                                                instance of each unique, two item (i.e. forename,
                                                                surname) name in A. We then further filtered the results
                                                                to include only names comprised of the most common
                                                                English, Irish and Scottish forenames and surnames.
                                                                For English only names (English forename and surname)
                                                                which attracted at least 4 possible matches we performed
                                                                a linear regression, as seen in Figure 5:


Figure 4: The relationship between forename and surname
nationalities


6.3 Incorporating dependence into the model
Consider a simplified form of our model for all names in
a population P which are comprised of one forename and
one surname.
The score we calculate for a link between two records           Figure 5: Plot of matches against score for English names
having name “X Y”, assuming independence, is:
                      𝑓(𝑋)            𝑓(𝑌)                      This regression provides a mechanism for calculating the
            − log10 (      ) − log10 (     )
                       𝑃               𝑃                        expected score based on the number of matches. We can
                                                                use this to calculate an expected score for the Irish and
                                                                Scottish names which, in turn, allows us to estimate the
f(n) being the frequency of name “n” in population P.
In order to incorporate dependence on the cultural              population-sizes to be used to adjust our scoring for Irish
provenance of names into the calculation we will use the        and Scottish names. Our adjusted score is derived from
conditional probability formula:                                the intercept (4.34) and slope (-0.006) from the linear
                             𝑃(𝐴 𝑎𝑛𝑑 𝐵)                         regression and the numbers of matches from B for each
                 𝑃(𝐵 | 𝐴) =                                     person in A with an Irish or Scottish name. We then
                                𝑃(𝐴)
                                                                compare this to the actual score for the match and
For our person “X Y” where both “X” and “Y” originate           calculate the difference, D
from country C, with population Pc, we can revise the
formula to:                                                                 𝐷 = 𝑆𝑛 − 4.34 + (−0.006). 𝑀𝑛
            𝑓(𝑋)                 𝑑.𝑓(𝑌)
− log10 (          ) − log10 (            )
             𝑃                    𝑃𝑐                            𝑴𝒏 being the number of matches for name n and 𝑺𝒏
where d is a multiplier to give us the probability of a         being the actual score for exact matches on name n.
person from country C having the name “Y”. To simplify          By averaging these differences we were able to calculate
                                                                the ratio 𝑃𝑐 ⁄ 𝑃 to feed into the dependence formula.
we can use the average percentage from Figure 3 for
                                                                Taking the probabilities from Figure 4 for Irish/Irish and
nationality C as the multiplier d.                              Scottish/Scottish we arrive at figures of 1.58 for Irish
We can now put this formula into the same form as our           names and 1.54 for Scottish names.
original formula to obtain:                                     This means that whenever we come across a person with


                                                           30
Irish forename and surname or Scottish forename and               the adjustment this score would be 7.9 indicating a very
surname we will subtract these adjustment figures from            high confidence match. In reality this isn’t such an
the score.                                                        uncommon name so we shouldn’t consider our match to
                                                                  be quite so definite and therefore the adjusted score of
6.5 Results of adjusting for name dependence                      6.4 seems more appropriate.


                                                                          7.   Conclusion and future work
                                                                  We have discussed two enhancements to the Traces
                                                                  through Time record linkage model. The first was the use
                                                                  of comprehensive statistics of common differences in the
                                                                  spellings of names to incorporate the probability of a
                                                                  name being spelled two different ways between a pair of
                                                                  candidate records. This proved to be an effective
                                                                  addition to our model, especially for variations which
                                                                  can not necessarily be captured by standard string
                                                                  similarity measures, such as errors in transcribing initials
                                                                  or name variants which are very different, like ‘Jack’ and
                                                                  ‘John’. We found an advantage in compiling statistics
                                                                  from matching many different datasets in that the use of
                                                                  initials is uncommon enough in many of the datasets that
                                                                  no CTPs for initials were found until we matched one
                                                                  particular series that had a high incidence of initials. We
                                                                  can now apply the statistics derived from matching that
                                                                  one series to matching any series in the same format and
                                                                  from the same period. Unfortunately we didn’t have
Figure 6: Number of matches by score, with and without            enough examples of typed records to find any patterns
adjustment                                                        which were specific to that medium, but we hope to
                                                                  explore this further in the future. We also plan to apply
                                                                  the pattern detection algorithm to records from different
To test the nationality adjustment outlined above, we             historical periods to see how this effect varies through
linked together two de-duplicated sets of records, A and
                                                                  time.
B, with and without the adjustment for Irish and Scottish
names. We then counted the number of matches from B               Our investigations into year of birth differences returned
against each unique name in record set A. Figure 5 shows          very interesting results about how the forms in one
a plot of the match counts against the integer score, with        particular series were filled in. This is another avenue for
and without adjustment.                                           further exploration.
The effect of using the adjustment has been to lower the          The second enhancement was to incorporate an
scores of many records which have multiple matches.               adjustment to match scores depending on the national or
Without adjustment 85% of records with 4 matches had a            cultural origin of names. This is something we already
match score of <5, whereas with adjustment this                   do in our model but only by applying an arbitrary
increased to 96%. We had one instance of a record with a          adjustment. We demonstrated a data driven method for
score above 6, a score suggesting a medium confident              calculating an expected score based on the number of
match, with 4 matches which was for a person with a
                                                                  matches a particular name attracts. This seems to work
Scottish name. For records with 3 matches 5% had a
score of 6, reducing to 2.3% with the adjustment.                 well for Irish and Scottish names. We would now like to
As with the CTP experiments, the score provides a                 extend the model to names which originate further afield,
means of balancing precision and recall in the results. In        which are likely to have smaller populations within our
our record linkage results, a score above 7.0 suggests a          data. This also will involve the development of a more
high confidence match where we wouldn’t expect to see             robust method for identifying such names, as it will be
two different people with the same name occurring in the          time consuming to manually compile lists. We have
same context. Below 7.0 we begin to see more names                already explored a clustering approach which we will
shared by two different individuals, and below 6 more             continue to develop.
names shared by three different people. In our
experience, when we see more matches than we expect
for a particular score, these tend to be for people with
names not originating in England. Using this technique
of adjusting scores based on a population size derived
from nationality, which is in turn derived from a person’s
name, we have reduced the number with more matches
than we would expect for the score that is observed.
As an example, in the match results we found a single
match to “Angus McLeod” with a score of 6.4. Without

                                                             31
                   8.    References

Fellegi, I. P., and Sunter, A B. (1969), A Theory for
  Record Linkage, Journal of the American Statistical
  Association, Vol. 64, No. 328 (Dec., 1969), pp. 1183-
  1210
Winkler, W. E. (1990a), String Comparator Metrics and
  Enhanced Decision Rules in the Fellegi-Sunter Model
  of Record Linkage, Proceedings of the Section on
  Survey Research Methods, American Statistical
  Association., 354-359.
Needleman, Saul B.; and Wunsch, Christian D. (1970). A
  general method applicable to the search for
  similarities in the amino acid sequence of two
  proteins. Journal of Molecular Biology 48 (3): 443–
  53.


                     Appendix A
We present here a brief description of the probabilistic
linkage method used and how the Jaro-Winkler score is
used to cater for inexact string matching. We refer to this
as the ‘Winkler’ method in our paper.

The Fellegi-Sunter method calculates the ratio of the
probability of two records representing the same person
versus that of them representing two different people.
These are referred to a P(M) and P(U), for ‘Matched’ and
‘Unmatched’, respectively. Furthermore they calculate
this ratio differently depending on whether the attributes
being compared are the same or different, giving two
scores PA (for agreement) and PD (for disagreement).
When comparing two attributes a1 and a2 we calculate a
score, S, based on the following equations:
                          𝑃(𝑀)
              𝑆 = 𝑃𝐴 =          𝑖𝑓 𝑎1 == 𝑎2
                          𝑃(𝑈)
                        1 − 𝑃(𝑀)
          𝑆 = 𝑃𝐷 =               𝑖𝑓 𝑎1 <> 𝑎2
                        1 − 𝑃(𝑈)


In order to handle spelling errors, Winkler proposed
finding a point somewhere between PA and PD
depending on the Jaro-Winkler score for a1 and a2.
If J is the result of passing a1 and a2 into the
Jaro-Winkler algorithm then our calculation becomes:

       𝑆 = max(𝑃𝑎 − (𝑃𝐴 − 𝑃𝐷 ). (1 − 𝐽). 𝜌) , 𝑃𝐷 )

The constant 𝜌effectively controls how much tolerance
to string difference is allowed before the disagreement
score is reached.


                                                              32