Traces through time: a case-study of applying statistical methods to refine algorithms for linking biographical data

Mark Bell, Sonia Ranade
The National Archives, Kew, London
E-mail: { mark.bell; sonia.ranade }@nationalarchives.gsi.gov.uk

Abstract

The Traces through Time project, which ran at The UK National Archives in 2015, developed algorithms and tools to link people appearing in historical records and to assign robust measures of confidence to the connections that are made. The method has application across the digital humanities, including for biographical research. Fuzzy matching relies on the availability of background statistics on the population, the distribution of data values, data quality and the type and frequency of errors. This paper describes work to refine the original algorithms through implementation of a learning approach in which insights arising from one analysis are fed back into the algorithm to improve the baseline statistics for subsequent analyses. We find that this iterative approach delivers significant improvements over 'raw' scoring mechanisms. It enables us to carefully target the type and degree of fuzzy matching to be applied and can help balance the poor precision that results from allowing increased 'fuzziness' against the poor recall that arises from a more restrictive approach. Future work will extend the approach beyond names and dates of birth, and will embed these enhancements into the Traces through Time framework and tools.

Keywords: Record Linkage, String Similarity, Statistical Inference

1. Background: Approach to Record Linkage and confidence scoring

The identification of a link between two occurrences of an individual in the historical record is achieved through assessing the similarity between the individual attributes of the two entities to be compared. During this project, we have worked extensively with data from World War One service records from The National Archives collections.¹ The datasets in question were initially created by indexing the original paper documents, and our analysis is limited to those data attributes which were consistently captured by previous digitisation and transcription projects. For WW1 data we are generally restricted to linking records based only on names and either age or date of birth. Other attributes, such as place of birth and service number, are sometimes available but are not consistently captured across datasets.

¹ National Archives Discovery - http://discovery.nationalarchives.gov.uk/

Record linkage is achieved using a probabilistic method based on the work of Fellegi and Sunter (1969) and a variation suggested by Winkler (1990) to account for spelling differences between pairs of textual attributes. The basic approach is to find, for each attribute, the ratio between the probability that a pair of records refers to the same person and the probability that it refers to two different people. The Winkler variation allows the use of string comparison algorithms to accommodate spelling variations, and applies a weighting to reduce the score for an attribute comparison if the attributes lie within a certain threshold of similarity but are not identical. Appendix A gives a brief outline of the calculation. Our work has taken this approach a step further and applied a range of weightings to achieve more fine-grained fuzzy matching.
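As an illustration of this ratio calculation, the following minimal Python sketch computes the contribution of a single attribute comparison on a log10 scale. It is not the project's implementation, and the agreement probabilities are invented for the example:

    import math

    def attribute_score(a1, a2, p_m, p_u):
        """Log10 contribution of one attribute comparison.
        p_m: P(values agree | same person); p_u: P(values agree | different
        people). Both values here are invented placeholders."""
        if a1 == a2:
            return math.log10(p_m / p_u)          # agreement weight
        return math.log10((1 - p_m) / (1 - p_u))  # disagreement weight

    print(attribute_score("Gardner", "Gardner", p_m=0.95, p_u=0.01))   # ~ +1.98
    print(attribute_score("Gardner", "Gardiner", p_m=0.95, p_u=0.01))  # ~ -1.30

Summing such contributions across attributes yields the logarithmic confidence score described below.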
Dates of birth, which are a key attribute for discriminating between different individuals in the records, are problematic for historical data. Often we have neither a date of birth nor an age. If only an age is provided, it is not necessarily clear on which date that age applies - is it the individual's age at the date of the record? Or their age at the date of some other event mentioned in the record? And dates are often estimated or rounded. When a date of birth has been captured it is not necessarily accurate: consider the case of under-18s claiming to be older in order to enlist for military service. So, we require new techniques to derive confidence scores for dates, all based on estimated distributions.

In the case of a year being captured on the record, we create a probability distribution of likely values. This allows us to fuzzy match two different year values, adjusting for data quality and deriving a probability that the underlying values are the same. Instead of a single year, the record may state a range of years, possibly derived from the age. In this case the calculation is the same, but the confidence scores returned will vary depending on the range of the stated ages. Finally, there are records with no indication of birth period. In this situation, the best we can do is to derive a frequency distribution for the whole dataset, drawing on external, expert knowledge if this information is not available in the data. The calculation is then the probability that the person in dataset A could be in dataset B.
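A minimal sketch of this year-matching idea, assuming an invented, symmetric error distribution (in practice the distribution would be estimated for each dataset):

    # Each recorded year is treated as a distribution over plausible true
    # years. The error distribution below is invented for illustration.
    ERROR_DIST = {-2: 0.02, -1: 0.08, 0: 0.80, 1: 0.08, 2: 0.02}

    def true_year_distribution(recorded_year):
        """P(true year | recorded year) under the assumed error model."""
        return {recorded_year + d: p for d, p in ERROR_DIST.items()}

    def p_same_year(year_a, year_b):
        """Probability that two recorded years share one underlying value."""
        dist_a = true_year_distribution(year_a)
        dist_b = true_year_distribution(year_b)
        return sum(p * dist_b.get(year, 0.0) for year, p in dist_a.items())

    print(p_same_year(1897, 1897))  # ~0.65: identical recorded years
    print(p_same_year(1897, 1898))  # ~0.13: adjacent years remain plausible

The result of these individual attribute comparisons is a score on a logarithmic scale which can be used to assess our confidence that the pair of occurrences represents the same person.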
2. Learning from record linkage results

This paper focuses on methods for refining the statistical model described above by learning from the results of matching many datasets. We describe an approach to identifying and incorporating common differences in textual information arising from factors such as handwriting recognition errors, typographical errors and phonetic errors made when names are recorded. A different approach is described for dates of birth, where the algorithm must accommodate inaccuracies in recording, such as mis-representation of age or rounding of declared ages.² In this case, the age distribution observed for each dataset is fed back into the algorithm to support a statistical approach to calculating the likelihood that two occurrences of a person with different recorded dates of birth in fact relate to the same individual. As each incremental enhancement of the algorithm improves the results of the matching process, the improved results in turn reveal further discrepancies in the data, from which the algorithm can learn.

² The 'age heaping' effect is observed in datasets which record age (rather than date of birth). The resulting distribution is skewed, typically showing peaks at 'round' ages (e.g. 10, 20, 30...). For an illustration, see the 1911 census graphic from the ONS data visualisation centre (http://www.bbc.co.uk/news/uk-18854073).

A number of distinct areas are being worked on, all building on previous research and reliant on the gathering of statistics over time. Here we highlight what has been done so far and which emerging ideas are being explored.

2.1 Learning from dates of birth

In our work so far we have discovered benefits in deriving an age profile for a dataset which lacks dates of birth by linking it to one which does. We have also improved linkage results by allowing for discrepancies in ages. In order to improve on this technique, we analyse the dates of birth for high confidence matches to build a statistical profile of common differences. For example, in WW1 records this approach will highlight that for soldiers in the 16-20 age range it is more common for two records referring to the same person to have different years of birth than for those in, say, the 30-35 range. Therefore, if we had two records with years of birth 1882 and 1883 (age 33-34 in 1916), we would have less confidence that they are a match than if the years were 1897 and 1898 (age 18-19 in 1916). This behaviour is particular to WW1. Examination of another dataset, such as the GRO death registers,³ shows that it is quite common for the deceased's age to have been guessed at the time of registration. We would therefore want to accept a different profile of differences in that dataset, with a higher likelihood of discrepancies in dates of birth for older people, as their deaths are less likely to have been registered by a close relative.

³ General Register Office death registrations supplied by http://www.gro.gov.uk/gro/content/

2.2 Fuzzy name comparisons

In the case of name comparisons, the variation between name transcriptions for records representing the same person can be thought of as a function of several factors (the list is not exhaustive):

- Regional spelling variations.
- How the recorder hears the name, particularly with unfamiliar names and regional accents.
- The recording medium - handwritten vs. typed.
- Involuntary errors during data capture: spelling mistakes while writing or typing the original document.
- Involuntary errors during transcription, including those caused by difficult handwriting.

A commonly seen example of a transcription error caused by handwriting is the cursive 'T' being misread as the letter 'J', due to the similarity between those letters in that style of writing. By analysing the frequency of high confidence matches which have this specific difference in their names, we can refine the confidence scores returned when this difference is encountered. Without this more nuanced approach we could either miss perfectly good matches which differ only on a single initial, or increase the rate of false positives by allowing any single-initial difference.

The key to the approach is to capture the results not just for a single dataset but to associate the difference with metadata connected to a collection as a whole - for example, records in a particular format from some defined time-period - allowing the accumulation of generalised statistics based on many examples which are typical of a type of record. The misreading of 'T' and 'J' is far less likely in typed records, since the typed letters have a distinct appearance, but there are likely to be other typical differences arising from keyboard layout.

Simply informing the model of the probability that Ts and Js have been interchanged has delivered good results, so the next step is to use the data to identify a wider range of commonly occurring transcription errors. We use the Jaro-Winkler measure to find similar strings, and particular weightings to assign confidence depending on this measure. Our aim now is to look at methods for using the difference itself to increase accuracy.
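As a sketch of how such difference statistics might be gathered, the following Python fragment tallies single-initial differences (such as T/J) among otherwise-identical, high-confidence matched pairs; the pairs and the threshold are invented for illustration:

    from collections import Counter

    # Sketch: tally specific initial-letter swaps among otherwise identical,
    # high-confidence matched pairs. The pairs and threshold are invented.
    matched_pairs = [("Thomas", "Jhomas", 8.2),  # plausible cursive T/J misreading
                     ("Taylor", "Jaylor", 7.9),
                     ("James",  "James",  9.1),
                     ("Tobias", "Jobias", 7.7)]

    THRESHOLD = 7.5  # assumed high-confidence cut-off
    initial_swaps = Counter()
    for name_a, name_b, score in matched_pairs:
        if score >= THRESHOLD and name_a[1:] == name_b[1:] and name_a[0] != name_b[0]:
            initial_swaps[tuple(sorted((name_a[0], name_b[0])))] += 1

    print(initial_swaps)  # Counter({('J', 'T'): 3})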
2.3 Name frequency statistics

In the absence of high volume name data, such as a census, it can be difficult to accurately calculate the frequency of occurrence of a particular name in the population that appears in the records under consideration. This is especially the case if the data sources being matched are relatively small (< 10,000 records). Consider a dataset with 1,000 records including Messrs. Taylor and Zephania. From this alone, we might surmise that these surnames have an equal probability of 0.001 while, in reality, the former is common and the latter is rare. Accumulating match results over multiple datasets allows us to create a larger population of individuals from which to derive probabilities. The difficulty arises from the fact that the most common names will, by nature, belong to lower confidence linked pairs, which are therefore more difficult to associate as referring to the same individual.

We have also identified a caveat to the assumption of attribute independence in the general linkage model. By clustering forenames and surnames together we have identified groups of names that typically occur together and which appear to align with national or ethnic groups - e.g. Irish, Italian, Hispanic/Portuguese. Although the names themselves are still independent of one another, there is an implicit dependence on a third variable, nationality, which is not directly expressed in the data. As a result, name matches such as Patrick Murphy, a common Irish name, and Angus MacDonald, common in Scotland, are assigned higher confidence scores than is warranted, because each name part, considered individually, is not particularly common in the population as a whole. We mitigate this type of association in the Traces through Time approach by arbitrarily reducing the population used in the probability calculation by a factor of ten. This has some basis in the data - Ireland has 10% of the population of England, for example - but it is a blunt tool. We are now working on refining this technique, again by gathering statistics through linking multiple datasets.

3. Identifying common differences

3.1 Differences in names

Our approach for identifying common differences in transcription and spelling is to look at matched records which differ in a single attribute. For example, the matched pair "Robert Adrian Gardner, born 17/11/1898" and "Bob Adrian Gardner, born 17/11/1898" have different first names but are otherwise identical. If the first names were the same we would consider this to be a high confidence match; however, in our existing statistical model the first name would contribute a negative weighting to the calculation. By looking at record pairings which differ only on a single attribute, and which would score above a certain threshold, T, indicating a high confidence match if the difference were not there, we can ascertain patterns in these differences. So, in the example above, if we see a number of record pairings with the same pattern, we may deduce that Bob is an alternative form of Robert.

In the following definitions, when we refer to a transformation we mean that some difference in spelling has been encountered between two attributes, which could be due to common spelling variations ('Phillip' and 'Philip'), spelling errors ('Roland', 'Rolend') or diminutive forms ('Bob', 'Robert').

DEFINITION 1: We define a Transformation Pattern (TP) as a function which transforms a string S1 to a string S2 by substituting any substring ss1 of length l in S1 with another string ss2, also of length l. We shall also say, for a transformation pattern Tx, that Tx on S1 yields S2 if applying the pattern Tx to S1 results in the string S2.

DEFINITION 2: We consider a transformation pattern to be Common (a CTP) if it occurs above a certain percentage of the time. More concretely, if we take n record pairs, where each pair has a record which contains an attribute, a1, containing the string ss1, then applying the TP to a1 will yield the equivalent attribute, a2, on the linked record at least c% of the time. An attribute pair is defined as having attributes a1 and a2, so the set of n pairs P where either a1 or a2 contains the string ss1 is:

    $P = \{ p_1\langle a_1, a_2 \rangle, p_2\langle a_1, a_2 \rangle, \dots, p_n\langle a_1, a_2 \rangle \}$

If Tx is the transformation pattern that transforms ss1 to ss2, then Tx is a CTP when:

    $\frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } T_x : p_i\langle a_1 \rangle \xrightarrow{\text{yields}} p_i\langle a_2 \rangle \\ 0 & \text{otherwise} \end{cases} \;\geq\; c$

In order to find common patterns of transformation we must calculate all n-grams around each character difference between a1 and a2 - that is, every n-gram, up to the length of the longest string, in which the corresponding n-grams in the two attributes differ. In order to normalise strings of different lengths we use the Needleman-Wunsch (NW) alignment function (Needleman and Wunsch, 1970) to find the maximal alignment of the two strings, and then pad any gaps in the alignment with '-' symbols, or with an '@' symbol at the end of a string, to differentiate between characters inserted within a string and those added to the end.
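A minimal Needleman-Wunsch sketch in Python (the match/mismatch/gap scores are assumed, as the paper does not specify a scoring scheme). Note that where the gap can sit on either of two identical letters, either placement is an equally maximal alignment:

    def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
        """Globally align two strings, padding gaps with '-'."""
        n, m = len(s1), len(s2)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
        # Trace back from the bottom-right corner to recover the alignment.
        a1, a2, i, j = [], [], n, m
        while i > 0 or j > 0:
            diag_ok = (i > 0 and j > 0 and
                       score[i][j] == score[i - 1][j - 1]
                       + (match if s1[i - 1] == s2[j - 1] else mismatch))
            if diag_ok:
                a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                a1.append(s1[i - 1]); a2.append('-'); i -= 1
            else:
                a1.append('-'); a2.append(s2[j - 1]); j -= 1
        return ''.join(reversed(a1)), ''.join(reversed(a2))

    def mark_end_gaps(aligned):
        """Replace trailing gap symbols with '@', per the convention above."""
        stripped = aligned.rstrip('-')
        return stripped + '@' * (len(aligned) - len(stripped))

    print(needleman_wunsch('needle', 'nedle'))
    # ('needle', 'n-edle'): the gap may legally sit on either doubled 'e';
    # the paper's 'ne-dle' is an equally maximal alignment.
    print(mark_end_gaps(needleman_wunsch('roland', 'rola')[1]))  # 'rola@@'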
For example, NW('needle', 'nedle') produces the aligned strings:

    needle
    ne-dle

The resulting set of n-gram pairs is: { ('e', '-'), ('nee', 'ne-'), ('need', 'ne-d'), ('needl', 'ne-dl'), ('needle', 'ne-dle'), ('ee', 'e-'), ('eed', 'e-d'), ('eedl', 'e-dl'), ('eedle', 'e-dle'), ('ed', '-d'), ('edl', '-dl'), ('edle', '-dle') }

The next step is to represent these n-gram pairs in a tree structure, where the parents of a pair are the pairs produced by adding one character to each n-gram in the original pair. So ('ed', '-d') is a parent of ('e', '-'), where 'd' has been appended to each entry in the child pair. The resulting tree is shown in figure 1.

Figure 1: n-gram tree for the name pair "needle, ne-dle"

We can now generate n-gram trees for every pair of different attributes from a list of matched pairs of records. These are then stored in a directed graph structure, with each node representing an n-gram pair and edges having a weight equal to the number of instances of their parent ⟨a1, a2⟩ (the root node in the n-gram tree). We have made the graph generated from the results of linking a number of WW1 collections together available to view online.⁴

⁴ https://www.dropbox.com/s/8xndg8o26g9096d/ngram_trees_2.svg?dl=0

The reasoning behind loading the n-gram trees into a graph is that we aim to identify not just single-letter transcription errors but also multi-character transformations, which may be phonetic in nature - for example, 'f' for 'ph', or the prefix 'Mc' for 'Mac'. The final processing stage is to coalesce nodes with only one parent: if we have found a pattern of transformation, we do not need to see that pattern repeated in longer n-grams which are not encountered in other attribute pairs. For the "needle" example, if there is no other pairing ⟨ai, aj⟩ where p: ai yields aj for p = "e" → "-", then we can connect the root node ("needle", "ne-dle") to the node ("e", "-") and remove all intermediary nodes in the tree.
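The n-gram pair generation step can be sketched as follows, given a pair of aligned, equal-length strings; this reproduces the twelve pairs listed above for 'needle'/'ne-dle':

    def differing_ngram_pairs(a1, a2):
        """All n-gram pairs covering a character difference between two
        aligned, equal-length strings."""
        assert len(a1) == len(a2), "align the strings first"
        diffs = [i for i in range(len(a1)) if a1[i] != a2[i]]
        pairs = set()
        for n in range(1, len(a1) + 1):           # n-gram length
            for start in range(len(a1) - n + 1):  # window position
                end = start + n
                if any(start <= p < end for p in diffs):
                    pairs.add((a1[start:end], a2[start:end]))
        return pairs

    pairs = differing_ngram_pairs('needle', 'ne-dle')
    print(len(pairs))  # 12, matching the set listed for 'needle'/'ne-dle'
    print(sorted(pairs, key=lambda p: (len(p[0]), p))[:3])
    # [('e', '-'), ('ed', '-d'), ('ee', 'e-')]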
3.2 Differences in years of birth

Applying the n-gram method to years of birth identified many common differences, but did not unearth any patterns in transcription errors, such as 1 for 7, as we might have expected. We found a more effective approach was to use the arithmetic differences between years.

We compared one series of naval records, ADM337, against two other naval series - ADM339 and ADM188. The method was to analyse pairs of records which were identical in every way apart from the year of birth. The results were intriguing, and suggest a pattern of behaviour in the ADM188 series which was not present in ADM339.

Linking ADM337 and ADM339 returned results that we would have expected for WW1 records: the rate of 1 or 2 year differences was between 0.17% and 1.44% for years of birth up to 1897 (taking the higher year of birth of any record pair), increasing to 11.69% and 25.58% for years of birth 1898 and 1899 respectively. Additionally, we found that in the case of 1898, 7.78% of pairs had a difference of 1 year, while for 1899, 23.26% of pairs had a difference of 2 years. This tallies with our expectation of 16 and 17 year olds inflating their ages in order to join the war effort from 1916.

When we linked to ADM188 we discovered a different pattern. There was still a peak, of 28.91%, of 1897 births with a 1 year difference. However, we saw a consistent 10-20% rate of 1 year differences for all other years of birth. One theory for why this should happen is that perhaps the application form asked for day and month of birth plus age at application; the year of birth was then calculated from the age, which introduced a high proportion of errors in the year of birth.
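This kind of year-difference profiling can be sketched as a simple tally over linked pairs; the pairs below are invented for illustration:

    from collections import defaultdict

    # Sketch: profile arithmetic year-of-birth differences among linked pairs
    # that agree on every other attribute. The pairs below are invented.
    linked_year_pairs = [(1897, 1898), (1898, 1898), (1899, 1897),
                         (1885, 1885), (1898, 1897), (1899, 1899)]

    diff_counts = defaultdict(lambda: defaultdict(int))
    for y1, y2 in linked_year_pairs:
        year = max(y1, y2)  # take the higher year of the pair, as above
        diff_counts[year][abs(y1 - y2)] += 1

    for year in sorted(diff_counts):
        total = sum(diff_counts[year].values())
        rate = sum(c for d, c in diff_counts[year].items() if d in (1, 2)) / total
        print(year, f"{rate:.0%} of pairs differ by 1-2 years")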
4. Results of CTP identification

Figure 2 shows the graph of attributes which have undergone the transformation p = "e" → "i".

Figure 2: n-gram tree for attribute pairs where an 'e' is replaced by an 'i'

This diagram highlights a number of patterns of interest. The node sizes represent the weighted degree of each node, so we can see that the pairing 'Wilfred, Wilfrid' is the most common. If we look at the most common spelling differences overall, we find that "e" → "i" occurs very frequently, but figure 2 suggests this result may be skewed by the very common spelling variation of Wilfred/Wilfrid. An even stronger example is given by figure 3, which demonstrates that the seemingly frequent transformation "i" → "y" is almost entirely due to the variant spelling of Sidney/Sydney, which accounts for 100 of the 108 occurrences of this transformation.

Figure 3: n-gram tree for attribute pairs where an 'i' is replaced by a 'y'

There are a number of CTPs which are worthy of further investigation, and we examine the effect of capturing four of these patterns in our statistical model below.

5. Using CTPs in record linkage

5.1 Approach

We will further analyse a method for building four patterns, which emerged from the method described above, into our probabilistic linkage model. The patterns are:

1. "Henry" → "Harry"
2. "Mac" → "Mc"
3. "ll" → "-l"
4. "J" → "T"

In our existing record linkage process the names "Henry" and "Harry" are considered different enough that a negative weighting is applied to our confidence score. The effect of this is that the score changes from +1.84 (where both records use the name "Henry") to -2.06 (for "Henry"/"Harry"), a swing of -3.9. We consider a total score above 7.5 to be a high confidence match, so we look for records where the names are identical and have a score of > 7.5, or where the names differ only as a result of the transformation "Henry" → "Harry" and have a score of > 3.6.

We follow a similar process for patterns 2 and 3, but this time taking the swing to be only -1.5, as the resulting strings under these transformations are only one character different and this has a smaller negative effect in our calculations. Finally, for pattern 4, which is the scenario where an initial 'J' or 'T' has been incorrectly transcribed (as a 'T' or a 'J'), we calculate using a swing of -6. This is a default value in our model for differing initials.

We can then derive the probability of each transformation occurring by comparing the number of records which have undergone the transformation against the number which are untransformed, as shown in table 1:

    TP | S1 == S2 | Tx: S1 yields S2 | % transformed
    1  |    4482  |        43        |     0.95
    2  |    1800  |        78        |     4.15
    3  |   19076  |       163        |     0.85
    4  |     424  |        31        |     6.81

Table 1: Percentage of names undergoing each transformation. The TP column refers to the numbering at the top of section 5.

These percentages are fed into our statistical model as probabilities. We will test the effectiveness of this by comparing three methods:

- Winkler: use the current process of applying a weighting to the probability score based on string similarity.
- Probability: when one of the TPs is encountered, multiply the probability score by the appropriate percentage according to table 1.
- Equivalent: treat any string S2 which is the result of applying a TP to S1 as equivalent to S1, and therefore consider the strings to be equal in our linkage algorithm - i.e. do not apply the weighting of the first scenario.

Ideally we would use a golden record set with known results to compare the results of applying each method. Due to the resource-intensive nature of creating golden record sets of sufficient volume for record linkage, we instead ran linkage exercises using records which had both a name and a date of birth. Only the name was used to derive links; the date of birth was used for later verification of these links. Since most of the testing was with files of circa 13k records, there are unlikely to be enough pairs of different people with exactly the same name and date of birth to have a significant effect on results. With this method we have at least a pseudo-golden result set.
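A sketch of how a table 1 rate might enter the scoring as a probability; the exact form of the adjustment in the production model is not spelled out above, so the scaling below is an assumption for illustration:

    import math

    # Sketch of the 'Probability' method: when a known CTP explains the only
    # difference between two names, scale the agreement probability by the
    # observed transformation rate from table 1.
    CTP_RATES = {frozenset(("Henry", "Harry")): 0.0095}

    def name_score(n1, n2, p_m=0.95, p_u=0.01):
        if n1 == n2:
            return math.log10(p_m / p_u)
        rate = CTP_RATES.get(frozenset((n1, n2)))
        if rate is not None:
            return math.log10((p_m * rate) / p_u)
        return math.log10((1 - p_m) / (1 - p_u))

    print(round(name_score("Henry", "Henry"), 2))   # 1.98: full agreement
    print(round(name_score("Henry", "Harry"), 2))   # -0.04: softened penalty
    print(round(name_score("Henry", "George"), 2))  # -1.3: flat disagreement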
This is to be on patterns TP 1 and TP 4. Here Winkler performs badly expected but is important to note since a high number of since it treats “Henry” and “Harry”, and “J” and “T”, as false positives could waste a considerable amount of time different strings. We could pick up “Henry” and “Harry” and effort if the linking approach is used to identify as similar strings by lowering the Jaro-Winkler threshold potential matches for further research. We can find all in our string matching but this has a knock-on effect of True Positive links by lowering the scoring threshold but creating more false matches in our overall linkage results there is always a balance to be made with the False and additionally reduces performance by generating more Positive rate. In this respect the Winkler method was the candidate pairs for matching. For TP 1 we can provide a best performer for patterns 2 and 3, but the Probability good example of the effect of lowering the threshold. method was a close second. Reducing it to >4.8 results in a 100% True Positive rate, albeit at the expense of 7 extra False Positives. Again the Both Winkler and Probability failed to find some of the Equivalent method creates a high number of False links but, as discussed above, they could always be found Positives, although performs better on TP 1 than for other by reducing the scoring threshold. We should also patterns. remember that the threshold chosen was an arbitrary one for the purpose of comparing the methods under True Positives False Positives False Negatives TP Winkler Prob. Equiv. Winkler Prob. Equiv. Winkler Prob. Equiv. 1 1 5 7 0 1 17 6 2 0 2 8 8 11 7 10 82 3 3 0 3 11 10 12 10 10 33 1 2 0 4 0 5 6 0 17 71 6 1 0 Table 2: Results of testing three scoring methods for the four transformation patterns 6. Name Independence born in Scotland could be called Angus and only 20% of our population are from Scotland, then this probability 6.1 The independence assumption becomes 0.0019. When fed into our logarithmic scoring algorithm, this represents a difference of 0.7 in the scores obtained. Thus, if a name is strongly dependent on The probability model we use in TTT assumes that the country of origin then it makes sense to calculate the attributes within a record are independent of each other. probability of that name based on the population of that It can easily be shown that this isn’t always correct by country, not the population of the entire United considering the relationship between forename and Kingdom. gender, for example, –a ‘Mary’ is far more likely to be female than male. However, for record linkage, using several variables it has been found to be a reasonable assumption which maintains simplicity in the model 6.2 Calculating dependence without compromising the accuracy of results. In the In the example above, external knowledge would readily case of matching historical records, where we often only identify ‘Angus’ as a Scottish name. However, these have the name of the person as a linking key, we have associations are also evident in the data. We took a found that the assumption does not hold. In particular we selection of the most common forenames and surnames have identified a relationship between the from our series of 582k records and classified them as national/cultural background of a person and their name. English, Scottish, Irish, Hispanic or Italian. Then we In the matching results this is manifested in the form of selected records from the series where the person’s name unexpectedly high scores for some names. 
6.2 Calculating dependence

In the example above, external knowledge would readily identify 'Angus' as a Scottish name. However, these associations are also evident in the data. We took a selection of the most common forenames and surnames from our series of 582k records and classified them as English, Scottish, Irish, Hispanic or Italian. We then selected records from the series where the person's name was comprised only of names in these forename and surname lists. Figure 4 shows the result of cross-referencing the classifications of forename and surname for these people. The x-axis represents the classification of the forename, each bar represents that of the surname, and the height of the bars represents the percentage of people. So we can interpret the tallest bar as: "93% of people with an Italian forename also have an Italian surname".

Figure 4: The relationship between forename and surname nationalities

6.3 Incorporating dependence into the model

Consider a simplified form of our model for all names in a population P which are comprised of one forename and one surname. The score we calculate for a link between two records having the name "X Y", assuming independence, is:

    $-\log_{10}\left(\frac{f(X)}{P}\right) - \log_{10}\left(\frac{f(Y)}{P}\right)$

where f(n) is the frequency of name "n" in population P. In order to incorporate dependence on the cultural provenance of names into the calculation, we use the conditional probability formula:

    $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$

For our person "X Y", where both "X" and "Y" originate from country C with population Pc, we can revise the formula to:

    $-\log_{10}\left(\frac{f(X)}{P}\right) - \log_{10}\left(\frac{d \cdot f(Y)}{P_c}\right)$

where d is a multiplier giving the probability of a person from country C having the name "Y". To simplify, we can use the average percentage from Figure 4 for nationality C as the multiplier d. We can now put this formula into the same form as our original formula to obtain:

    $-\log_{10}\left(\frac{f(X)}{P}\right) - \log_{10}\left(\frac{f(Y)}{P}\right) - \log_{10}\left(\frac{d}{P_c / P}\right)$

Since we have an estimate of d, we only need to estimate Pc for each nationality group to adjust the score to account for dependence.

6.4 Estimating national populations

In order to estimate the population Pc for nationality C, we need to look at names which are common enough to be linked to several candidates. We matched together two lists of names, A and B, with 50k and 582k records respectively, and filtered the matches down to a single instance of each unique two-part (i.e. forename, surname) name in A. We then further filtered the results to include only names comprised of the most common English, Irish and Scottish forenames and surnames. For English-only names (English forename and surname) which attracted at least 4 possible matches, we performed a linear regression, as seen in Figure 5.

Figure 5: Plot of matches against score for English names

This regression provides a mechanism for calculating the expected score based on the number of matches. We can use it to calculate an expected score for the Irish and Scottish names which, in turn, allows us to estimate the population sizes to be used to adjust our scoring for Irish and Scottish names. Our adjusted score is derived from the intercept (4.34) and slope (-0.006) of the linear regression, together with the number of matches from B for each person in A with an Irish or Scottish name. We then compare this to the actual score for the match and calculate the difference, D:

    $D = S_n - (4.34 + (-0.006) \cdot M_n)$

where Mn is the number of matches for name n and Sn is the actual score for exact matches on name n. By averaging these differences we were able to calculate the ratio Pc/P to feed into the dependence formula. Taking the probabilities from Figure 4 for Irish/Irish and Scottish/Scottish, we arrive at adjustment figures of 1.58 for Irish names and 1.54 for Scottish names.
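A sketch of the difference calculation, using the intercept and slope quoted above with invented observations:

    # Sketch: derive a score adjustment for a nationality group by comparing
    # actual match scores with the scores the English-name regression
    # predicts. Intercept and slope are from the text; the observations
    # (name, number of matches, actual score) are invented.
    INTERCEPT, SLOPE = 4.34, -0.006

    def expected_score(matches):
        return INTERCEPT + SLOPE * matches

    irish_observations = [("Patrick Murphy", 12, 5.9),
                          ("Sean Kelly", 9, 5.8),
                          ("Michael Byrne", 15, 5.7)]

    differences = [s - expected_score(m) for _, m, s in irish_observations]
    adjustment = sum(differences) / len(differences)
    print(f"subtract {adjustment:.2f} from scores for this group")  # ~1.53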
This means that whenever we come across a person with an Irish forename and surname, or a Scottish forename and surname, we subtract the relevant adjustment figure from the score.

6.5 Results of adjusting for name dependence

To test the nationality adjustment outlined above, we linked together two de-duplicated sets of records, A and B, with and without the adjustment for Irish and Scottish names. We then counted the number of matches from B against each unique name in record set A. Figure 6 shows a plot of the match counts against the integer score, with and without adjustment.

Figure 6: Number of matches by score, with and without adjustment

The effect of using the adjustment has been to lower the scores of many records which have multiple matches. Without the adjustment, 85% of records with 4 matches had a match score of < 5, whereas with the adjustment this increased to 96%. We had one instance of a record with 4 matches and a score above 6 (a score suggesting a medium-confidence match); this was for a person with a Scottish name. For records with 3 matches, 5% had a score of 6, reducing to 2.3% with the adjustment.

As with the CTP experiments, the score provides a means of balancing precision and recall in the results. In our record linkage results, a score above 7.0 suggests a high confidence match, where we would not expect to see two different people with the same name occurring in the same context. Below 7.0 we begin to see more names shared by two different individuals, and below 6, more names shared by three different people. In our experience, when we see more matches than we expect for a particular score, these tend to be for people with names not originating in England. Using this technique of adjusting scores based on a population size derived from nationality - which is in turn derived from a person's name - we have reduced the number of names with more matches than we would expect for the score that is observed.

As an example, in the match results we found a single match to "Angus McLeod" with a score of 6.4. Without the adjustment this score would be 7.9, indicating a very high confidence match. In reality this is not such an uncommon name, so we should not consider our match to be quite so definite, and the adjusted score of 6.4 therefore seems more appropriate.

7. Conclusion and future work

We have discussed two enhancements to the Traces through Time record linkage model. The first was the use of comprehensive statistics of common differences in the spellings of names, used to incorporate the probability of a name being spelled two different ways between a pair of candidate records. This proved to be an effective addition to our model, especially for variations which cannot necessarily be captured by standard string similarity measures, such as errors in transcribing initials, or name variants which are very different, like 'Jack' and 'John'. We found an advantage in compiling statistics from matching many different datasets: the use of initials is uncommon enough in many of the datasets that no CTPs for initials were found until we matched one particular series with a high incidence of initials. We can now apply the statistics derived from matching that one series to matching any series in the same format and from the same period. Unfortunately we did not have enough examples of typed records to find any patterns specific to that medium, but we hope to explore this further in the future. We also plan to apply the pattern detection algorithm to records from different historical periods to see how this effect varies through time.

Our investigations into year of birth differences returned very interesting results about how the forms in one particular series were filled in. This is another avenue for further exploration.

The second enhancement was to incorporate an adjustment to match scores depending on the national or cultural origin of names. This is something we already do in our model, but only by applying an arbitrary adjustment. We demonstrated a data-driven method for calculating an expected score based on the number of matches a particular name attracts. This seems to work well for Irish and Scottish names.
We would now like to extend the model to names which originate further afield, which are likely to have smaller populations within our data. This will also involve the development of a more robust method for identifying such names, as it would be time-consuming to compile lists manually. We have already explored a clustering approach, which we will continue to develop.

8. References

Fellegi, I. P. and Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64(328), pp. 1183-1210.

Winkler, W. E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), pp. 443-453.

Appendix A

We present here a brief description of the probabilistic linkage method used, and of how the Jaro-Winkler score is used to cater for inexact string matching. We refer to this as the 'Winkler' method in our paper.

The Fellegi-Sunter method calculates the ratio of the probability of two records representing the same person versus that of them representing two different people. These are referred to as P(M) and P(U), for 'Matched' and 'Unmatched' respectively. Furthermore, the ratio is calculated differently depending on whether the attributes being compared are the same or different, giving two scores, PA (for agreement) and PD (for disagreement). When comparing two attributes a1 and a2, we calculate a score, S, based on the following equations:

    $S = P_A = \frac{P(M)}{P(U)} \quad \text{if } a_1 = a_2$

    $S = P_D = \frac{1 - P(M)}{1 - P(U)} \quad \text{if } a_1 \neq a_2$

In order to handle spelling errors, Winkler proposed finding a point somewhere between PA and PD depending on the Jaro-Winkler score for a1 and a2. If J is the result of passing a1 and a2 into the Jaro-Winkler algorithm, the calculation becomes:

    $S = \max\big(P_A - (P_A - P_D) \cdot (1 - J) \cdot \rho,\; P_D\big)$

The constant ρ effectively controls how much tolerance to string difference is allowed before the disagreement score is reached.
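For completeness, the appendix calculation as a runnable sketch; the Jaro-Winkler similarity J is taken as an input (any implementation can supply it) and the value of ρ is assumed:

    def winkler_score(p_agree, p_disagree, j, rho=2.0):
        """Appendix A scoring: interpolate between the agreement and
        disagreement scores according to the Jaro-Winkler similarity J.
        rho is an assumed value; J can come from any Jaro-Winkler
        implementation."""
        s = p_agree - (p_agree - p_disagree) * (1 - j) * rho
        return max(s, p_disagree)

    P_A, P_D = 95.0, 0.05  # illustrative agreement/disagreement ratios
    print(winkler_score(P_A, P_D, 1.00))  # identical strings: full P_A
    print(winkler_score(P_A, P_D, 0.94))  # close spelling: partly reduced
    print(winkler_score(P_A, P_D, 0.50))  # very different: floored at P_D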