<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahin Ramezani</string-name>
          <email>mahin@tamu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guru Ilangovan</string-name>
          <email>ilan50_guru@tamu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hye-Chung Kum</string-name>
          <email>kum@tamu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Texas A&amp;M University</institution>
          ,
          <addr-line>400 Bizzell St</addr-line>
          ,
          <institution>College Station</institution>
          ,
          <addr-line>TX 77843</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In A. Martin, K. Hinkelmann</institution>
          ,
          <addr-line>H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Population Informatics Lab, Department of Health Policy and Management, Texas A&amp;M University School of Public Health, 212 Adriance Lab Rd, College Station</institution>
          ,
          <addr-line>TX 77843</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Record linkage, often called entity resolution or de-duplication, refers to identifying the same entities across one or more databases. As the amount of data that is generated grows at an exponential rate, it becomes increasingly important to be able to integrate data from several sources to perform richer analysis. In this paper, we present an open source, comprehensive, end-to-end hybrid record linkage framework that combines automatic linkage with manual review. Using this framework, we train several models based on different machine learning algorithms (random forest, linear SVM, radial SVM, and dense neural network) and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on Recall, F1-score (quality of linkages), and the number of uncertain pairs, that is, the number of pairs that need manual review. We also apply the trained models to a new dataset to test how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding of names, on model performance. Using n2v results in a smaller manual review set with a slightly lower F1-score. Overall, the SVM models performed best in all experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Record Linkage</kwd>
        <kwd>deduplication</kwd>
        <kwd>entity resolution</kwd>
        <kwd>machine learning</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>patient matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As the amount of data that is generated grows at an exponential rate, it becomes increasingly
important to be able to integrate data from several sources to perform richer analyses. For example,
research on COVID-19 could be accelerated if all fragmented patient data could be integrated. In
error-free, clean databases with unique identifiers common to all the databases, integration
can be easily accomplished with simple joins [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, such identifiers are often not available
in real-world data. In that case, the fields common to the databases are compared and
a decision has to be made on whether the two records refer to the same real-world entity.
The problem of finding records in heterogeneous databases that refer to the same entities
is referred to as record linkage (RL) or entity resolution. When the records are in a single
database, so that the database is linked to itself, the problem is also called de-duplication.
      </p>
      <p>
        Automated record linkage methods have been studied extensively in many fields since the
problem was first introduced by Newcombe [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The best results may be obtained by a hybrid
human-computer linkage process that augments the results of automatic algorithms with human
judgement[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It involves a small team of well-trained human experts who review the
uncertain pairs generated by the algorithms, first making independent decisions and then comparing notes
on disagreements to reach consensus [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Probabilistic methods and rule-based
approaches have been the most common automated approaches, but machine learning (ML) approaches
are rapidly gaining traction and are proving to be the preferred automatic linkage methods.
      </p>
      <p>
        In this paper, we present a comprehensive end-to-end hybrid record linkage framework that
combines manual review and the automated process to achieve both scalability and high-quality
linkage results. Quality control in any record linkage project is critical because all approaches
produce some incorrect matches, which generate erroneous integrated data, and
miss some correct matches, which leaves the integrated dataset fragmented. We achieve the best of both
worlds by allowing the automated algorithms to resolve the majority of the linkages, those that have a high
probability of being either a match or a non-match, while keeping the option to send ambiguous
pairs to human experts for final determination to improve the linkage quality[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Hence, the
goals of this hybrid record linkage process are to achieve optimum linkage quality, in terms of both
no mismatches and no missed true matches, while still minimizing the amount of manual review
required to achieve this quality. This paper focuses on comparing how well different ML
algorithms meet these goals. We also investigate how well different ML models trained on one dataset
transfer to other settings within the USA. Determining which ML models transfer better to other
settings is important because one of the difficulties in using ML methods on real projects is
building a training set that is comprehensive enough to produce good models. In addition,
we studied how adding a letter embedding [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] of names affects the performance of these models.
In sum, the contributions of this paper are:
        <list list-type="bullet">
          <list-item><p>A hybrid open source RL framework that can achieve scalability and high-quality results</p></list-item>
          <list-item><p>A comparison of four different RL ML algorithms in meeting the goals of the hybrid system</p></list-item>
          <list-item><p>A comparison of how well ML RL models trained on one dataset transfer to different settings</p></list-item>
          <list-item><p>An evaluation of the impact of using a letter embedding of names in RL ML algorithms</p></list-item>
        </list>
The rest of the paper is laid out as follows. In Section 2, we briefly review the RL literature on
using ML algorithms. Section 3 describes the hybrid record linkage framework and Section 4
describes the experimental design of our evaluation. Sections 5 and 6 then describe the results of
the individual experiments and discuss the main insights. Finally, Section 7 presents our conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        The use of ML algorithms for RL has been studied in different contexts. For instance, a radial basis
kernel SVM was used successfully to link genealogy records from 19th century Canada[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Random
forests for RL in financial entity recognition[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and author disambiguation[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] demonstrated the
efficiency of random forests for this task. With the recent re-emergence of neural networks, a growing
body of research shows their potential for entity resolution. Structured neural
networks for genealogical RL were shown to give very reliable results[
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Feigenbaum[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
compared the performance of ML algorithms (SVM, random forest), logistic regression, and other
heuristic approaches on a US census dataset and showed that SVM did slightly better than RF in
both true positive rate (TPR) and positive predictive value (PPV). Ilangovan[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] studied the
effectiveness and efficiency of different ML algorithms (SVM, random forest, and neural networks) in
a controlled experiment with different levels of heterogeneity in the data and different training set sizes.
The study found that RF and SVM performed very well, both in terms of traditional metrics like F1 score
and in terms of manual review set size, for error rates from 0% to 60%. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] the performance of SVM
and random forest was compared to investigate the upper bound achievable in the linkage rate
and the conditions required to achieve it. The results of that study illustrated that RF
produced high-quality results at threshold values &gt;= 0.85, while for cases where quantity is the
main concern, SVM with a lower threshold is recommended.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        In the health sector, incorrectly linking records that belong to different patients with similar
identifiers, such as twins or family members, may lead to serious harm due to incorrect health
information (e.g., medication allergies). Thus, algorithms are often tuned to minimize false matches, which
inevitably increases the rate of missed true matches and leaves the health records fragmented. This
also leads to its own problems (e.g., incomplete medical history). More problematic is that the
unlinked true matches are often biased, because issues with identifying information, such as with
ethnic names, are more common in lower socioeconomic populations[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. On the other hand, manual RL
methods may be prohibitively time-consuming. One solution is to use a hybrid record linkage
framework, which combines the automated process and the manual process (see Figure 1). First,
the automated algorithm handles the records that have a high probability of being either a match or
a non-match, which for most applications is the majority of the data, and then a human expert
resolves the remaining records that the algorithm was uncertain about. Thus, automated record
linkage algorithms use one threshold to divide the data into two classes (match, non-match),
while hybrid methods use two thresholds to form three classes (match, uncertain, non-match).
      </p>
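      <p>The three-class decision described above can be sketched as follows. This is a minimal illustration and not the code of the released framework: the function name, the threshold names and values, and the choice to treat the boundaries as inclusive are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
def classify_pair(prob: float, t_match: float, t_nonmatch: float) -> str:
    """Assign a pair to match / uncertain / non-match from its predicted match probability.

    t_match and t_nonmatch are hypothetical names for the upper and lower thresholds.
    """
    if t_nonmatch > t_match:
        raise ValueError("t_nonmatch must not exceed t_match")
    if prob >= t_match:
        return "match"        # resolved automatically
    if prob <= t_nonmatch:
        return "non-match"    # resolved automatically
    return "uncertain"        # sent to manual review


# Hypothetical match probabilities produced by any classifier
probs = [0.99, 0.72, 0.40, 0.03]
labels = [classify_pair(p, t_match=0.95, t_nonmatch=0.05) for p in probs]
print(labels)
```
]]></preformat>
      <p>With a single threshold the "uncertain" branch disappears and every pair is forced into one of the two automatic classes, which is the difference between a purely automated system and the hybrid one.</p>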
      <sec id="sec-3-1">
        <title>3.1. Pair generation</title>
        <p>The first step in the hybrid RL process is the creation of candidate pairs from one or more databases,
such as pairs of records that share some common identifier. Often referred to as blocking, the idea
is to use the identifier fields only to generate candidate pairs, which reduces computation. Once
the pairs are generated, features are extracted from each pair and fed into the ML models.
In this study, we generated the pairs files by blocking on appropriate combinations of fields, for example first
name and last name, first name and date of birth, and last name and date of birth.</p>
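        <p>As an illustration of blocking, the sketch below generates candidate pairs within a single toy database by requiring exact agreement on at least one blocking key. The records, field names, and function name are hypothetical and chosen only for the example; the released framework may implement this step differently.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import combinations

# Toy records; field names and values are hypothetical.
records = [
    {"id": 1, "first": "ann", "last": "lee", "dob": "1980-01-02"},
    {"id": 2, "first": "ann", "last": "lee", "dob": "1980-02-01"},
    {"id": 3, "first": "anne", "last": "lee", "dob": "1980-01-02"},
    {"id": 4, "first": "bob", "last": "ray", "dob": "1975-07-09"},
]

# Each blocking key is a combination of fields that must agree exactly.
BLOCKING_KEYS = [("first", "last"), ("first", "dob"), ("last", "dob")]


def generate_pairs(records, blocking_keys):
    """Return the set of candidate id pairs that agree on at least one blocking key."""
    pairs = set()
    for key in blocking_keys:
        blocks = defaultdict(list)
        for rec in records:
            blocks[tuple(rec[f] for f in key)].append(rec["id"])
        for ids in blocks.values():
            # Every pair of records inside the same block is a candidate
            for a, b in combinations(sorted(ids), 2):
                pairs.add((a, b))
    return pairs


candidate_pairs = sorted(generate_pairs(records, BLOCKING_KEYS))
print(candidate_pairs)
```
]]></preformat>
        <p>Of the six possible pairs among the four toy records, only two survive blocking, so features need to be computed for far fewer pairs than the full cross product.</p>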
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature extraction in pairs</title>
        <p>
          After generating the pairs, we extract features to feed to the ML
algorithms. For each pair of first and last names, we calculated the Jaro-Winkler (JW) distance,
Damerau–Levenshtein (DL) distance, Longest Common Substitution (LCS) distance, and Name2Vec
(n2v) distance [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Name2Vec is a name embedding based on the Doc2Vec methodology, in which each name
is treated as a document and each letter of the name as a word. Using code from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and all names extracted from two public US datasets, the
North Carolina Voter Registration data (https://www.ncsbe.gov/results-data/voter-registration-data)
and the ONC Patient Matching Algorithm Challenge data (https://linkagelibrary.icpsr.umich.edu/linkagelibrary/project/111962),
we trained two separate models, one for first names and one for last names. Using these models, we calculated the n2v
distance between the names in each pair. We also created a boolean feature to detect when the first
name and last name were swapped. Finally, the normalized frequency of the first and last names
in their respective databases was added to capture how rare each name is.
        </p>
        <p>For each pair of dates of birth, we calculated the DL distance, the DL distance for the year,
month, and day components individually, and a boolean feature to detect when month and day
were swapped. We also used the raw birth years as a feature for age. Besides the name and date
of birth features, DL and LCS distances were calculated for address, phone, and SSN. We also
engineered a binary feature to capture a possible last name change due to marriage for females over
eighteen, because this is one of the common issues in RL in the USA. Finally, we coded gender as
three categories based on the gender of the pair: ff, mm, or different.</p>
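        <p>A few of the pair features described above can be sketched in plain Python. This is only an illustration under stated assumptions: the edit distance shown is the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit), which may differ from the implementation used in the study; dates are assumed to be formatted as YYYY-MM-DD; and all function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # adjacent transposition
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def names_swapped(first_a: str, last_a: str, first_b: str, last_b: str) -> bool:
    """True when first and last name appear exchanged between the two records."""
    return first_a != last_a and first_a == last_b and last_a == first_b


def month_day_swapped(dob_a: str, dob_b: str) -> bool:
    """True when two YYYY-MM-DD dates share a year and have month and day exchanged."""
    year_a, month_a, day_a = dob_a.split("-")
    year_b, month_b, day_b = dob_b.split("-")
    return year_a == year_b and month_a != day_a and month_a == day_b and day_a == month_b


# Hypothetical pair of records
features = {
    "last_dl": osa_distance("smith", "smiht"),
    "dob_dl": osa_distance("1980-01-02", "1980-02-01"),
    "name_swap": names_swapped("lee", "ann", "ann", "lee"),
    "md_swap": month_day_swapped("1980-01-02", "1980-02-01"),
}
print(features)
```
]]></preformat>
        <p>The swap features are useful precisely because a swap looks like a large edit distance to a character-level measure, even though it is a common and easily recognized data entry error.</p>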
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Machine Learning algorithms</title>
        <p>After extracting the features, the next step is to train the ML algorithms using those features. For
this study, we used four ML algorithms to generate different models: random forest, linear
and radial support vector machines, and a dense neural network.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Random Forest</title>
          <p>To train the random forest model, a grid search with 10-fold cross-validation was used on
the training set to tune one hyper-parameter, the maximum number of features considered at each split, with
candidate values 3, 5, 7, 9, 11, 13, and 15. As the number of estimators goes up, the performance typically goes up
initially and plateaus after a point. The performance of the random forest started plateauing at
about 250 estimators. Thus, the number of estimators was fixed at 350, which allowed a margin
of 100 to ensure optimal performance. Once the best performing hyper-parameter value was identified,
the random forest model was rebuilt on all of the training data using that value.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Radial SVM and Linear SVM</title>
          <p>Two support vector machines were built, one with a radial basis function kernel and the other with a
linear kernel. The two key parameters for the radial basis kernel SVM were the penalty
parameter, C, and the kernel coefficient, sigma. Those two parameters were tuned using 10-fold
cross-validation and grid search on the training data. The grid covered all combinations
of C (0.1, 0.5, 1, 10) and sigma (0.03, 0.5, 0.9). The penalty parameter, C, was tuned similarly for
the linear SVM. The models were retrained on all of the training data once the hyper-parameters
were fixed.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Dense Neural Network</title>
          <p>The neural net had one input layer, two hidden layers, and one output layer, with dense connections
between all layers. The input layer had 33 units (from the feature vector discussed in Section 3.2).
The two hidden layers had 64 units each with ReLU activation functions. After the first hidden layer, there
was batch normalization and 0.1% dropout. There was batch normalization after the second
hidden layer but no dropout, as is the standard practice for layers preceding the output layer. The output
layer had 1 unit with a sigmoid activation that returned the probability of a match. An RMSprop
optimizer with a learning rate of 0.001 was used with binary cross-entropy as the loss function.
The batch size was kept at the default of 32 and the network was trained for 20 epochs. 20%
of the data provided was used as the validation set for monitoring the validation performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>
        4.1. Data
In this study, we use two different real-world datasets. The first is a large gold standard RL dataset
used to train the ML models[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]; the second is used to evaluate how well the models transfer to another dataset.
      </p>
      <sec id="sec-4-1">
        <title>4.1.1. Hospital EHR data</title>
        <p>
          For training our models, we used EHR data from a large academic hospital. This dataset is based on
10,000,000 pairs that were generated from the hospital EHR dataset by blocking on first name and
last name, first name and date of birth, last name and date of birth, and social security number. A
gold standard dataset[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] was developed by randomly selecting 20,000 pairs, which were then labeled
by a team of 4 people through independent review followed by consensus. In our study, we randomly
split the gold standard data into 10,000 pairs for training and 10,000 pairs for testing. The test data
had a total of 613 linkages that needed to be identified. 55% of the records were female, 44% male,
and the remaining 1% undesignated.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.2. NC voter registry data</title>
        <p>
          The North Carolina State Board of Elections curates large amounts of data on state elections
and voter registration. With several exceptions, this data is public and can be downloaded from
https://www.ncsbe.gov/results-data/voter-registration-data. We link data from two time points
(May 2017 and July 2020) using the voter registry number as the gold standard. Note that this
dataset is only used to test the trained models. We perturb this data by randomly generating
month and day of birth to account for twins as in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Using the records from Yancey County,
we generated 10,000 pairs by blocking on first name and last name, first name and date of birth,
and last name and date of birth. There were a total of 1,773 linkages that needed to be identified.
There were 12,700 unique records, of which 52.8% were female, 46.1% male, and
the remaining 1.1% undesignated. Since the data was from a voter registry, the minimum age
at the time of the data pull was 18 years. People aged 65 or above formed the largest share of
the records, followed by the middle-aged population (45 to 65).
        </p>
        <sec id="sec-4-2-1">
          <title>4.2. Evaluation criteria</title>
          <p>To evaluate the models we used three measures: (1) the number of pairs that need manual
review, (2) F1-score for automated results only, and (3) Recall over all results. In the hybrid RL
system, the number of pairs that need manual review is determined by two thresholds, T1 and T2,
that delimit the uncertain pairs. In this study, we defined T1 and T2 in terms of Positive
Predictive Value (PPV) and Negative Predictive Value (NPV), which are calculated as in equation
(1), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and
false negatives, respectively. Since accurate results are very important in the health domain, we selected T1
and T2 such that all predictions with probabilities above T1 and all predictions with probabilities
below T2 were correct on the training data (that is, PPV = 1 above T1 and NPV = 1 below T2).</p>
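          <p>
            <disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}</tex-math></disp-formula>
          </p>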
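          <p>One way to choose two thresholds that are perfect on the training data is sketched below. It is a hedged illustration rather than the procedure of the released code: the function name and toy data are hypothetical, the boundaries are treated as inclusive, and both classes are assumed to be present in the training pairs.</p>
          <preformat><![CDATA[
```python
def select_thresholds(probs, labels):
    """Pick (t1, t2) so that PPV = 1 at or above t1 and NPV = 1 at or below t2.

    probs  : predicted match probabilities on the training pairs
    labels : 1 for a true match, 0 for a true non-match
    Assumes at least one match and one non-match are present.
    """
    match_probs = [p for p, y in zip(probs, labels) if y == 1]
    nonmatch_probs = [p for p, y in zip(probs, labels) if y == 0]
    highest_nonmatch = max(nonmatch_probs)
    lowest_match = min(match_probs)
    # t1: lowest match score that still lies above every non-match
    t1 = min((p for p in match_probs if p > highest_nonmatch), default=None)
    # t2: highest non-match score that still lies below every match
    t2 = max((p for p in nonmatch_probs if p < lowest_match), default=None)
    return t1, t2


# Hypothetical training scores and gold standard labels
train_probs = [0.02, 0.10, 0.35, 0.40, 0.55, 0.80, 0.97, 0.99]
train_labels = [0, 0, 1, 0, 0, 1, 1, 1]

t1, t2 = select_thresholds(train_probs, train_labels)
n_uncertain = sum(1 for p in train_probs if t2 < p < t1)
print(t1, t2, n_uncertain)
```
]]></preformat>
          <p>In this toy example the three pairs scored between the two thresholds are the ones where matches and non-matches overlap, so they are exactly the pairs that would be sent to manual review. Relaxing either threshold shrinks that set at the price of admitting errors in the automated step.</p>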
          <p>F1-score is a common measure of linkage quality. It is the harmonic mean of precision
and recall (equation (2)) and hence is very useful for evaluating how well the linkage
balances false positives against missed true links. We use the F1-score of the ML RL models
(F1), which measures effectiveness on only the subset of pairs that are labeled by the
automated methods outside T1 and T2. In addition, we use Recall, which takes into account
the full set of pairs of the whole hybrid system and shows how much of the overall TP have been
correctly identified by the automated methods. If Recall is too low, the
automated methods are mainly good at reducing the manual review work and rely mostly on the
manual process to detect many of the correct links; Recall is also roughly correlated with the manual
review set size. On the other hand, a very high Recall shows that the ML models captured
most of the linkages and that little time may be needed for manual review.
            <disp-formula id="eq2"><label>(2)</label><tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}</tex-math></disp-formula>
          </p>
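          <p>The three evaluation measures can be computed as in the following sketch. It reflects our reading of the definitions above and is an assumption rather than the evaluation code of the study: F1 is computed only over pairs labeled automatically, while Recall over all results divides the automatically found true matches by all true matches, including those left in the manual review set. Function and variable names are hypothetical.</p>
          <preformat><![CDATA[
```python
def hybrid_metrics(probs, labels, t1, t2):
    """Manual review size, F1 on automated labels, and recall over all true matches."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        if p >= t1:                # labeled a match automatically
            tp += int(y == 1)
            fp += int(y == 0)
        elif p <= t2:              # labeled a non-match automatically
            fn += int(y == 1)
        # pairs strictly between t2 and t1 are left for manual review
    manual_review = sum(1 for p in probs if t2 < p < t1)
    total_matches = sum(labels)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall_auto = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall_auto
    f1 = 2 * precision * recall_auto / denom if denom else 0.0
    recall_all = tp / total_matches if total_matches else 0.0
    return {"manual_review": manual_review, "f1": f1, "recall_all": recall_all}


# Hypothetical test scores and gold standard labels
test_probs = [0.99, 0.97, 0.90, 0.60, 0.45, 0.30, 0.08, 0.02]
test_labels = [1, 1, 0, 1, 0, 1, 1, 0]

result = hybrid_metrics(test_probs, test_labels, t1=0.85, t2=0.10)
print(result)
```
]]></preformat>
          <p>In the toy data, two of the five true matches fall in the manual review set, so Recall over all results is lower than the recall computed on the automated subset alone. This is the situation in which the automated step mostly reduces workload and the manual step finds the remaining links.</p>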
        </sec>
        <sec id="sec-4-2-2">
          <title>4.3. Study design</title>
          <p>Three experiments were designed to answer three study questions:
            <list list-type="order">
              <list-item><p>How well do the four ML RL algorithms meet the two goals of the hybrid RL system?</p></list-item>
              <list-item><p>How well do the four ML RL models trained on one dataset transfer to different settings?</p></list-item>
              <list-item><p>How does adding the n2v features affect the results of these experiments?</p></list-item>
            </list>
In the first experiment, we used only the hospital gold standard data. We trained on the
10,000 pairs in the training dataset and then evaluated the performance on the other 10,000 test
pairs. For each pair, we calculated the 33 features described above (excluding the two n2v features).
In the training phase, 9,000 pairs were used for training and 1,000 pairs were used as the validation
set to tune the hyper-parameters. After hyper-parameter tuning, we trained the models with the
selected hyper-parameters one more time on the whole training set. In our second experiment,
we tested the trained models from the first experiment on 10,000 pairs randomly generated from
the NC voter dataset to see how well different ML models transfer to different settings. We ran
this experiment 100 times and report the mean and standard deviation for each model (Table 2).
Finally, as our last experiment, the first and second experiments were repeated, but this time
the two n2v features, one for first name and one for last name, were added. Thus, in this
experiment, each pair had 35 features. The main purpose of this experiment was to observe the
effect of the n2v distance on the hybrid record linkage process.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. The performance of different ML models</title>
        <p>Table 1-Exp1 compares the four algorithms. Both training and test datasets are from the hospital
data. The results demonstrate that although F1 scores were comparable across all methods,
random forest and linear SVM did worst in Recall, at slightly over 60%. Close to 40%
of the linkages have to be found through manual review. This is consistent with the considerably
bigger manual review size for these two algorithms compared to radial SVM and DNN.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. The performance of different ML models on a new setting</title>
        <p>Experiment 2 studied how well the ML models trained in one setting transfer to different data.
We applied the models trained on the EHR data to a different
dataset (voter registry data). As seen in Table 1-Exp2, the two SVM models transfer to the new
setting best, followed by RF, then DNN. Although the manual review sets are bigger than in the
previous experiment, F1 is still reasonable for the three models, which is important because
the manual phase can then fill in the gap even if Recall is somewhat low. The DNN model is
not usable because there are too many FP and FN, which cannot be corrected in the manual review
phase. Table 2 shows the mean and standard deviation of the 100 repeated experiments.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Effect of n2v feature on model performance</title>
        <p>For the last experiment, we added two more features, the n2v distance for first name and for last name, to
see how they change the results of the first and second experiments.
As seen in Table 1-Exp3a, adding n2v caused a big reduction in the number of pairs that needed
manual review and significant improvements to Recall, with only a slight change in F1, for
all models except DNN. Results were somewhat similar for the second experiment, presented in
Table 1-Exp3b, although the impact on Recall was not seen. It seems that adding n2v
increases the confidence (higher probabilities) of the model predictions, which means that potentially
more matches and non-matches are detected in the automated step and fewer
observations are classified as uncertain. However, in the DNN model the impact was the opposite, with
a larger manual review size and slightly higher F1-scores. Figure 2 depicts the effect of adding n2v
features on ML model performance in the first experiment.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>This research sought to systematically study the performance of different ML algorithms
(random forest, linear SVM, radial SVM, and dense neural network) in different settings for a
hybrid record linkage method in terms of F1 score, Recall, and size of manual review. The
automatic ML-based RL code and models can be downloaded from
https://github.com/pinformatics/hybridRL_code_and_models. Users can use the trained models to conduct record linkage on their
data or train a new model using their own dataset.</p>
      <p>
        In the first experiment, the most interesting finding was that although RF had the best F1
score, it was not a good model overall for the hybrid system. The manual review size for radial
SVM was only 61% of that of the random forest model, at the cost of more false labels but with little impact
on F1 scores. Radial SVM missed 16 true matches while DNN had 11 false links. Both of
these models were able to identify many more of the linkages, giving much better Recall
scores. It is very clear that in a hybrid system the price for perfect performance in the first pass has
to be paid by a lot of manual review in the second pass. The performance requirements
for the algorithms depend on the importance of the linkage task. When medical databases
are to be linked[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the process is often critical and requires thresholds that give perfect (100%)
results on the training set. However, in some domains, such as genealogy records, small error rates
are often acceptable. In that case, users can relax the thresholds to meet project requirements,
which can result in a smaller manual review set. Thus, the performance requirements
for the automatic linkage can be defined according to the task at hand. This
can potentially save a lot of time and effort on the manual linkage.
      </p>
      <p>The results of the second experiment demonstrate that the RF, linear SVM, and radial SVM
models transfer to a new setting much better than DNN. For these three models, the F1 score was
comparable; however, the manual review size is much bigger than in the previous experiment, which
impacts the Recall score. The results indicate that the models built on EHR data can be used
to identify clear non-matches and some of the true matches, but manual review is
required to identify most of the true matches. The SVM models perform much better than RF,
reducing the manual review size to less than 42% while also identifying over 60% of the linkages.
In comparison, the RF model had over 40% manual review size and identified only 20% of the linkages. DNN
performance was not acceptable and seems to indicate that the model may be overfitting
to the data it was trained on.</p>
      <p>The goal of the last experiment was to see how adding the n2v features, a letter
embedding of names, affects the results. Clearly, adding n2v features can reduce the size of manual
review significantly on all models except DNN in the first experiment (Table 1-Exp3a). As
expected, the results indicate that the impact of using the n2v distance for first name and last name is similar to the
impact of adding approximate name matching: it can increase the automatic identification of true
linkages, but at the cost of also increasing false linkages. More concretely, Recall
improved noticeably to over 90% when n2v features were added, but at the same time the number of
FP went up from 1, 2, and 0 to 15, 16, and 24 for RF, radial SVM, and linear SVM, respectively. Recall
that these errors cannot be corrected during the manual review phase, so the choice of whether to use n2v
will depend on the error rate that is acceptable in the given application. Adding n2v features
had little impact on the DNN model performance, which was low anyway.</p>
      <p>The impact of adding n2v features on model performance when applied to a different dataset (voter
data), shown in Table 1-Exp3b, was seen most in the two SVM models. RF and DNN had
comparably low results on all measures, except that the manual review size of DNN increased, making
things worse. In comparison, the radial SVM model reduced the manual review size by more than
half (from 1,857 to 927). The interesting finding was that this reduction did not translate directly into
improvements in Recall, where radial SVM had a significant reduction to 77%. Upon closer
inspection, most of the reduction in manual review was due to pairs that were confirmed
correctly as TN by the radial SVM model. In comparison, the linear SVM dropped many TP (from
1,057 to 594) when n2v was added, reducing Recall. Thus, the radial SVM model benefited
the most from adding the n2v features, with negligible impact on Recall and F1 score.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and future works</title>
      <p>Automatic record linkage methods have made significant progress during the last few decades;
however, they still may not have the high degree of reliability of manual record linkage. On the
other hand, manual record linkage is very expensive and time-consuming. Thus, in
this paper, we presented and evaluated an open source hybrid record linkage framework that
combines the manual process and the automated process to achieve both scalability and high-quality
linkage results. More work is needed to test whether more complex and effective neural network models
may have better performance. In addition, future work is needed to systematically quantify the
biases in RL by race to study the impact of RL on health disparities database studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fosbøl</surname>
          </string-name>
          , et al.,
          <article-title>Prehospital system delay in ST-segment elevation myocardial infarction care: A novel linkage of emergency med. svcs in hospital registry data</article-title>
          ,
          <source>AHJ</source>
          <volume>165</volume>
          (
          <year>2013</year>
          )
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Newcombe</surname>
          </string-name>
          , et al.,
          <article-title>Automatic linkage of vital records</article-title>
          ,
          <source>Science</source>
          <volume>130</volume>
          (
          <year>1959</year>
          )
          <fpage>954</fpage>
          -
          <lpage>959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karim</surname>
          </string-name>
          , et al.,
          <article-title>View: a framework for organization level interactive record linkage to support reproducible data science</article-title>
          ,
          <year>2021</year>
          . arXiv:2102.08273.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Kum</surname>
          </string-name>
          , et al.,
          <article-title>Enhancing privacy through an interactive on-demand incremental information disclosure interface: applying privacy-by-design to rl</article-title>
          , in:
          <source>SOUPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Ragan</surname>
          </string-name>
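          ,
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Kum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilangovan</surname>
          </string-name>
          ,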
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Balancing privacy and information disclosure in interactive record linkage with visual masking</article-title>
          ,
          in:
          <source>SIGCHI 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , et al.,
          <article-title>How Well Do Automated Linking Methods Perform in Historical Data? Evidence from New US Ground Truth</article-title>
          ,
          <source>Technical Report, Mimeo</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Kum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Machanavajjhala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahalt</surname>
          </string-name>
          ,
          <article-title>Privacy preserving interactive record linkage (ppirl)</article-title>
          ,
          <source>JAMIA</source>
          <volume>21</volume>
          (
          <year>2014</year>
          )
          <fpage>212</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Foxcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Alessandro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antonie</surname>
          </string-name>
          ,
          <article-title>Name2vec: Personal names embeddings</article-title>
          ,
          <source>in: Canadian Conference on Artificial Intelligence</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Mumma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Diercks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Danielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <article-title>Probabilistic linkage of prehospital &amp; outcomes data in out-of-hospital cardiac arrest</article-title>
          ,
          <source>Prehospital Emerg. Care</source>
          <volume>19</volume>
          (
          <year>2015</year>
          )
          <fpage>358</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          ,
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          ,
          <source>in: Soviet physics doklady</source>
          , volume
          <volume>10</volume>
          ,
          <year>1966</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>JMLR</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Hettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Cushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Noyes</surname>
          </string-name>
          ,
          <article-title>Emergency medical dispatch codes association with emergency department outcomes</article-title>
          ,
          <source>Prehospital Emerg. Care</source>
          <volume>17</volume>
          (
          <year>2013</year>
          )
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Waien</surname>
          </string-name>
          ,
          <article-title>Linking large administrative databases: a method for conducting emergency medical services cohort studies using existing data</article-title>
          ,
          <source>Ac. Emerg. Med</source>
          .
          <volume>4</volume>
          (
          <year>1997</year>
          )
          <fpage>1087</fpage>
          -
          <lpage>1095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Feigenbaum</surname>
          </string-name>
          ,
          <article-title>Automated census record linking: A machine learning approach</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilangovan</surname>
          </string-name>
          ,
          <article-title>Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage</article-title>
          ,
          <source>Master's thesis</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaur</surname>
          </string-name>
          , et al.,
          <article-title>A comparison of machine learning classifiers for use on historical record linkage</article-title>
          ,
          <source>Master's thesis</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Bronstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Lomatsch</surname>
          </string-name>
          , et al.,
          <article-title>Issues and biases in matching medicaid pregnancy episodes to vital records data: the arkansas experience</article-title>
          ,
          <source>MCHJ</source>
          <volume>13</volume>
          (
          <year>2009</year>
          )
          <fpage>250</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Joffe</surname>
          </string-name>
          , et al.,
          <article-title>A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation</article-title>
          ,
          <source>JAMIA</source>
          <volume>21</volume>
          (
          <year>2014</year>
          )
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>