<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying term frequency-based indexing to improve scalability and accuracy of probabilistic data linkage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robespierre Pita</string-name>
          <email>robespierredrp@dcc.ufba.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luan Menezes</string-name>
          <email>luanmenezes@dcc.ufba.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos E. Barreto</string-name>
          <email>marcoseb@dcc.ufba.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Data and Knowledge Integration for Health (CIDACS), Oswaldo Cruz Foundation (FIOCRUZ)</institution>
          ,
          <addr-line>41.940-220, Salvador, BA</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Mathematics and Statistics, Computer Science Department, Federal University of Bahia (UFBA)</institution>
          ,
          <addr-line>40.170-110, Salvador, BA</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>65</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>Record or data linkage is a technique frequently used in diverse domains to aggregate data stored in different sources that presumably pertain to the same real-world entity. Data linkage can be implemented with deterministic (key-based) or probabilistic (rule-based) methods, the latter being suitable when no common link attributes exist amongst the data sources involved. Depending on the volume of data being linked, indexing (or blocking) techniques should be used to reduce the number of pairwise comparisons that need to be executed to decide whether a given pair of records matches. In this paper, we discuss a new indexing scheme, based on term-frequency counts, deployed in our data linkage tool (AtyImo). We present our algorithm design and some metrics related to accuracy and efficiency (reduction ratio achieved during block construction), as well as a comparative analysis with a predicate-based technique also used in AtyImo. Our results show a very high level of accuracy and a large reduction in the number of pairwise comparisons.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In Public Health, record linkage is frequently used to aggregate data from different
electronic health records (EHR) usually managed by governmental bodies within the
public health system. Throughout their life, an individual (or patient) has access to a diversity
of EHR systems that store specific data on clinical care episodes, symptoms and diseases,
prescribed medication and treatments, and associated outcomes. Being able to aggregate
these data is a crucial step to build the patient’s history, as well as to support different types
of studies, such as quasi-experimental analysis, clinical trials and longitudinal
(cohort-based) evaluations [
        <xref ref-type="bibr" rid="ref7">Newcombe et al. 1959</xref>
        ]. The Brazilian Public Health System (SUS)
comprises dozens of freely accessible databases providing anonymized data on live
births, mortality, notifiable diseases, nutritional growth, hospital episodes etc. These
databases present a high degree of heterogeneity in structure and coverage periods, although
most of them are managed by a central department (DATASUS). This heterogeneity
includes the lack of common key attributes amongst the databases, which hinders the use of
a deterministic approach to integrate data pertaining to the same individual. In such
situations, we should rely on probabilistic approaches to retrieve as many records as possible,
as well as on effective ways to validate the retrieved records as true positive pairs (i.e. to make
sure recovered pairs actually belong to the same individual).
      </p>
      <p>
        The scope of our work comprises the usage of term frequency-based indexing
to improve the accuracy and the scalability of our probabilistic data linkage tool —
AtyImo [
        <xref ref-type="bibr" rid="ref8">Pita et al. 2018</xref>
        ]. We started designing AtyImo in 2013 as a solution to aggregate
data from several Brazilian governmental databases and generate specific data sets for
diverse epidemiological studies. These studies are being conducted within joint Brazil-UK
projects aiming to build large population-based cohorts and assess the effectiveness of
public health policies<sup>1</sup>.
      </p>
      <p>AtyImo is a Python-based data linkage tool implemented over Spark and CUDA,
able to exploit highly distributed and hybrid (multicore CPUs + multi-GPU) parallel
architectures, respectively. It is used at CIDACS to support data linkage tasks involving
a huge population-based cohort (the 100 million cohort) and public health databases, as
well as the setup of a live birth cohort (around 80 million records) comprising children
diagnosed with microcephaly due to Zika infection. It is also used at UFBA to aggregate
data from different sources within the Brazilian malaria ecosystem to support the design
and validation of forecasting models applied to malaria epidemics.</p>
      <p>
        AtyImo is structured as a 4-step pipeline implementing data quality assessment,
data pre-processing, record linkage (pairwise comparison) and accuracy assessment. The
data pre-processing step is responsible for data cleansing and harmonization, block
construction and anonymization. Blocking is mandatory to reduce the number of
comparisons, especially in scenarios involving huge databases (cohorts) such as ours. The basic
idea is to define some criteria to group records into blocks and to compare only
records that fall into the same block, as these are presumably similar. Measuring the relative
reduction in comparisons indicates the effectiveness of a blocking solution [
        <xref ref-type="bibr" rid="ref5">Christen and Goiser 2007</xref>
        ].
      </p>
      <p>
        Besides scaling record linkage solutions over huge data sources by reducing
comparisons, another issue is keeping a good accuracy level. To meet
this requirement, a good indexing solution must apply criteria that increase pair
completeness, i.e. criteria that decide which records are included in each block so that
true matches are still compared and accuracy is preserved [
        <xref ref-type="bibr" rid="ref6">Elfeky et al. 2002</xref>
        ].
      </p>
      <p>In this work, we discuss a term frequency-based indexing technique implemented
in our Spark-based version of AtyImo. We rely on this technique as an alternative to the
existing approach based on predicates. Building a ranking of the records with the most terms in
common can result in fewer and better comparisons than the predicate-based solution.
We calculated reduction ratio and accuracy measures to compare the efficiency of our
proposal with the predicate-based indexing. The results show that term frequency indexing
outperforms the predicate-based solutions, decreasing the number of pairwise comparisons and
the execution time, while preserving good accuracy levels.</p>
      <p>This paper is structured as follows: Section 2 briefly describes some existing
techniques used for indexing. Section 3 presents our term-frequency indexing
implementation and compares it with the predicate-based method used by the production version of
AtyImo. Section 4 discusses some experimental results. Some related works are mentioned
in Section 5. Finally, Section 6 brings some conclusions and current work directions.</p>
      <p><sup>1</sup> This work was supported by CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation
(OPP1161996), The Royal Society (NF160879), National Institute for Health Research
(RP-PG040710314) and also supported by the Wellcome Trust (086091/Z/08/Z).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Indexing techniques for record linkage</title>
      <p>
        There are several solutions to scale up record linkage by reducing unnecessary
comparisons [
        <xref ref-type="bibr" rid="ref4">Christen 2012</xref>
        ]. In this work we focus on those already used in the AtyImo tool, to make
our proposal easier to follow.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Traditional blocking</title>
      <p>Traditional blocking techniques are generally based on a blocking key which is created
using attributes like first name, last name and date of birth, or some combination of them,
to split the data into buckets. Records from each database that exactly match on the blocking key
are compared, on the assumption that they may be a true link. The reduction ratio
of this approach depends on the discriminative power of the attributes used to build the
blocking key.</p>
      <p>However, this method can be very biased due to the nature of these attributes and
to input errors that can prevent true matches from being compared. To soften this, phonetic
codes can be applied to nominal data and some normalization can be done on numerical
data. The low complexity and the easy implementation make this strategy very useful in
most situations.</p>
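      <p>As a minimal sketch of traditional blocking, the Python snippet below groups records by an exact blocking key and pairs only the records whose keys coincide. The field names (first_name, last_name, birth_date), the key layout and the toy records are assumptions made for illustration; they are not taken from AtyImo.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def blocking_key(record):
    """Build an exact blocking key from (hypothetical) name and birth-date fields."""
    return "|".join(
        [record["first_name"].lower(), record["last_name"].lower(), record["birth_date"]]
    )


def candidate_pairs(db_a, db_b):
    """Pair only the records of db_a and db_b that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in db_a:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for rec in db_b:
        for id_a in blocks.get(blocking_key(rec), []):
            pairs.append((id_a, rec["id"]))
    return pairs


# Toy data (invented): b2 is the same person as a2, but with a typo in the first name.
db_a = [
    {"id": "a1", "first_name": "Maria", "last_name": "Silva", "birth_date": "1990-05-17"},
    {"id": "a2", "first_name": "Joao", "last_name": "Souza", "birth_date": "1985-11-02"},
]
db_b = [
    {"id": "b1", "first_name": "maria", "last_name": "silva", "birth_date": "1990-05-17"},
    {"id": "b2", "first_name": "Joa", "last_name": "Souza", "birth_date": "1985-11-02"},
]

pairs = candidate_pairs(db_a, db_b)
full_comparisons = len(db_a) * len(db_b)
print(len(pairs), "of", full_comparisons, "possible pairs are compared")
```
]]></preformat>
      <p>With these toy records, the typo in b2 keeps a true match out of the candidate set, which illustrates the bias that input errors introduce into exact-key blocking.</p>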
    </sec>
    <sec id="sec-4">
      <title>2.2. Blocking with predicates</title>
      <p>This method is presented as an alternative to multi-pass sorted neighborhood and consists of
building predicates from fields or portions of them. A function then combines these
predicates into a disjunction of conjunctions.</p>
      <p>To illustrate, suppose a patient record includes attributes like name,
mother’s name, date of birth and gender. One predicate can use the first name, birth
year and gender. A second may contain the last name, the mother’s first name and the date
of birth. Records that agree on the expression
<inline-formula><tex-math>(\mathit{firstname} \land \mathit{birthyear} \land \mathit{gender}) \lor (\mathit{lastname} \land \mathit{mothersfirstname} \land \mathit{birthdate})</tex-math></inline-formula>
will then be part of the same block.</p>
      <p>
        The use of predicate-based indexing can prevent input errors from separating true
matches from their correct blocks, thus increasing pair completeness. Other variants can learn the
best predicates to use [
        <xref ref-type="bibr" rid="ref1">Bilenko et al. 2006</xref>
        ] or cluster nominal values into phonetic codes.
However, the best use of this technique relies on choosing very discriminative attributes
in order to get smaller and more effective blocks.
      </p>
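      <p>The disjunction of conjunctions described above can be sketched as follows. Each predicate yields one blocking key per record, and two records become candidates as soon as they share the key of any predicate. The field names, the two predicates and the toy records are hypothetical and only mirror the example in the text; this is not AtyImo's code.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Each predicate is a conjunction of fields (hypothetical field names).
PREDICATES = [
    ("p1", lambda r: (r["first_name"], r["birth_date"][:4], r["gender"])),
    ("p2", lambda r: (r["last_name"], r["mother_first_name"], r["birth_date"])),
]


def predicate_keys(record):
    """Return one blocking key per predicate for the given record."""
    return [(name, extract(record)) for name, extract in PREDICATES]


def candidate_pairs(db_a, db_b):
    """Records are paired if they agree on at least one predicate (the disjunction)."""
    index = defaultdict(set)
    for rec in db_a:
        for key in predicate_keys(rec):
            index[key].add(rec["id"])
    pairs = set()
    for rec in db_b:
        for key in predicate_keys(rec):
            for id_a in index.get(key, ()):
                pairs.add((id_a, rec["id"]))
    return pairs


db_a = [
    {"id": "a1", "first_name": "maria", "last_name": "silva",
     "mother_first_name": "ana", "birth_date": "1990-05-17", "gender": "F"},
]
db_b = [
    # Same person with a typo in the first name: recovered through p2.
    {"id": "b1", "first_name": "mria", "last_name": "silva",
     "mother_first_name": "ana", "birth_date": "1990-05-17", "gender": "F"},
    # Different person who agrees on first name, birth year and gender (p1).
    {"id": "b2", "first_name": "maria", "last_name": "souza",
     "mother_first_name": "rita", "birth_date": "1990-01-02", "gender": "F"},
    # Unrelated record: agrees on no predicate.
    {"id": "b3", "first_name": "pedro", "last_name": "lima",
     "mother_first_name": "clara", "birth_date": "1979-03-09", "gender": "M"},
]

pairs = candidate_pairs(db_a, db_b)
print(sorted(pairs))
```
]]></preformat>
      <p>In this toy case the second predicate recovers the record with the misspelled first name, while the first predicate also admits a non-matching candidate, which is why discriminative attributes matter.</p>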
    </sec>
    <sec id="sec-5">
      <title>3. AtyImo’s implementation of term frequency-based indexing</title>
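      <p>The current version of AtyImo implements indexing through a predicate-based
approach. This implementation comprises two predicates:
<disp-formula><tex-math>pr_1 = (\mathit{firstname} \land (\mathit{birthday} \lor \mathit{birthmonth} \lor \mathit{birthyear})) \lor (\mathit{lastname} \land (\mathit{birthday} \lor \mathit{birthmonth} \lor \mathit{birthyear}))</tex-math></disp-formula>
and
<disp-formula><tex-math>pr_2 = ((\mathit{firstname} \land \mathit{firstmothersname}) \land (\mathit{birthday} \lor \mathit{birthmonth} \lor \mathit{birthyear})) \lor ((\mathit{lastname} \land \mathit{lastmothersname}) \land (\mathit{birthday} \lor \mathit{birthmonth} \lor \mathit{birthyear}))</tex-math></disp-formula></p>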
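      <p>Figure 1. Predicate-based blocking.</p>
      <p>Let <inline-formula><tex-math>B_i</tex-math></inline-formula> be a database with a finite set of objects <inline-formula><tex-math>X</tex-math></inline-formula>, each described by attributes
<inline-formula><tex-math>r = x_1, \ldots, x_n</tex-math></inline-formula>. The term frequency-based solution inserts three new steps into AtyImo’s pipeline: indexing,
exploration, and ranking, as illustrated in Figure 2:</p>
      <p>Figure 2. Term frequency-based approach used in AtyImo.</p>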
    </sec>
    <sec id="sec-6">
      <title>3.1. Indexing step</title>
      <p>
        The indexing step builds a key-value structure by visiting every <inline-formula><tex-math>X \in B_1</tex-math></inline-formula>
(the larger database) and collecting all terms as keys. The value for each term is the list of primary keys
(<inline-formula><tex-math>x_1</tex-math></inline-formula>) of the records that contain the term. To cluster terms with the same phonetics,
we use a custom implementation of the Metaphone algorithm [
        <xref ref-type="bibr" rid="ref2">Binstock and Rex 1995</xref>
        ] for
Brazilian Portuguese. Every term has a prefix to indicate which attribute it came from:
n for the name, nm for the mother’s name, and bd, bm and by for the day, month and year of birth,
respectively. The prefix g stands for gender and m
represents the municipality of birth. The prefix together with the phonetically encoded term forms
the key of the proposed structure and the name of the JSON file that stores it.
      </p>
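      <p>A single-machine sketch of this indexing step is given below. It is only an illustration under stated assumptions: phonetic_code is a crude placeholder and not the Brazilian Portuguese Metaphone used by AtyImo, the record field names and the underscore between prefix and term are hypothetical, and only the attribute prefixes (n, nm, bd, bm, by, g, m) come from the text.</p>
      <preformat><![CDATA[
```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Attribute prefixes from the text; the record field names are hypothetical.
PREFIXES = {"name": "n", "mother_name": "nm", "birth_day": "bd",
            "birth_month": "bm", "birth_year": "by", "gender": "g",
            "municipality": "m"}
PHONETIC_FIELDS = ("name", "mother_name")


def phonetic_code(term):
    """Placeholder encoder (NOT Metaphone): keep first letter, drop later vowels."""
    term = term.upper()
    return term[:1] + "".join(c for c in term[1:] if c not in "AEIOU")


def record_terms(record):
    """Return the prefixed terms of a record, phonetically encoding name fields."""
    terms = []
    for field, prefix in PREFIXES.items():
        for token in str(record.get(field, "")).split():
            code = phonetic_code(token) if field in PHONETIC_FIELDS else token
            terms.append(prefix + "_" + code)  # separator is an assumption
    return terms


def build_index(records, out_dir):
    """Map every term to the primary keys containing it; one JSON file per term."""
    index = defaultdict(list)
    for rec in records:
        for term in set(record_terms(rec)):
            index[term].append(rec["id"])
    for term, ids in index.items():
        (Path(out_dir) / (term + ".json")).write_text(json.dumps(ids))
    return index


larger_db = [
    {"id": 1, "name": "Maria Silva", "mother_name": "Ana Silva", "birth_day": "17",
     "birth_month": "05", "birth_year": "1990", "gender": "F", "municipality": "Salvador"},
    {"id": 2, "name": "Mario Souza", "mother_name": "Rita Souza", "birth_day": "02",
     "birth_month": "11", "birth_year": "1988", "gender": "M", "municipality": "Salvador"},
]

with tempfile.TemporaryDirectory() as tmp:
    index = build_index(larger_db, tmp)
    stored = json.loads((Path(tmp) / "n_MR.json").read_text())
```
]]></preformat>
      <p>Here "Maria" and "Mario" collapse onto the same placeholder code, so both primary keys end up in the same posting list, which is the clustering effect the phonetic encoding is meant to provide.</p>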
    </sec>
    <sec id="sec-7">
      <title>3.2. Exploration step</title>
      <p>For every term extracted from an <inline-formula><tex-math>X \in B_2</tex-math></inline-formula> (the smaller database this time), we look up the
respective JSON file and concatenate its values into a vector of indexes <inline-formula><tex-math>v_t</tex-math></inline-formula>. Very
popular terms can make this phase too onerous. To address this issue, we distribute the
processing of these JSON files using pySpark (Algorithm 1).</p>
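      <p>The lookup performed in the exploration step can be sketched as below. The sketch runs in a single process and does not reproduce the pySpark distribution described above; the helper names, the term strings and the toy posting lists are hypothetical.</p>
      <preformat><![CDATA[
```python
import json
import tempfile
from pathlib import Path


def load_postings(index_dir, term):
    """Read the primary keys stored for a term; unknown terms yield an empty list."""
    path = Path(index_dir) / (term + ".json")
    if not path.exists():
        return []
    return json.loads(path.read_text())


def explore(terms, index_dir):
    """Concatenate the posting lists of all terms of one record into v_t."""
    v_t = []
    for term in terms:
        v_t.extend(load_postings(index_dir, term))
    return v_t


# Toy index files (invented), as if written by a previous indexing step.
toy_index = {"n_MR": [1, 2], "by_1990": [1], "g_F": [1, 3]}

with tempfile.TemporaryDirectory() as tmp:
    for term, ids in toy_index.items():
        (Path(tmp) / (term + ".json")).write_text(json.dumps(ids))
    # Terms of one record from the smaller database; "n_XYZ" has no file.
    query_terms = ["n_MR", "by_1990", "g_F", "n_XYZ"]
    v_t = explore(query_terms, tmp)
```
]]></preformat>
      <p>Primary keys appear in v_t once per shared term, so repeated keys already signal stronger candidates before any ranking is applied.</p>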
    </sec>
    <sec id="sec-8">
      <title>3.3. Ranking step</title>
      <p>
        The term frequency is calculated by
<inline-formula><tex-math>\frac{f(t, v_t)}{\sum_{t' \in v_t} f(t', v_t)}</tex-math></inline-formula>. After that, we sort <inline-formula><tex-math>v_t</tex-math></inline-formula> to find the most
frequent primary keys of <inline-formula><tex-math>B_1</tex-math></inline-formula> and set up a ranking with the best N candidates to be compared
to a given record of <inline-formula><tex-math>B_2</tex-math></inline-formula>. This approach aims to eliminate the variance in the number of comparisons,
even for records that contain very popular terms. The scope of this work is limited to
indexing the data, searching for the best candidates and providing a comparison structure for the
later phases of AtyImo (explained in detail in [
        <xref ref-type="bibr" rid="ref8">Pita et al. 2018</xref>
        ]).
      </p>
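      <p>One possible reading of the ranking step is sketched below: the frequency of each primary key in the vector of indexes is normalised by the total number of entries, and the N most frequent keys are kept. The function name, the tie-breaking rule (smaller key first) and the toy vector are assumptions, not details given in the text.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def rank_candidates(v_t, n):
    """Return the n most frequent primary keys in v_t with their relative frequency."""
    counts = Counter(v_t)
    total = sum(counts.values())
    # Sort by descending count; ties are broken by the key itself (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(key, count / total) for key, count in ranked[:n]]


# Toy vector of indexes (invented): key 1 shares three terms with the query record.
v_t = [1, 2, 1, 1, 3, 2]
ranking = rank_candidates(v_t, n=2)
for key, frequency in ranking:
    print(key, round(frequency, 3))
```
]]></preformat>
      <p>Because at most N candidates are kept per record of the smaller database, the number of comparisons per record is bounded regardless of how popular its terms are.</p>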
      <sec id="sec-8-1">
        <title>Algorithm 1: indexing step</title>
        <preformat>Data: larger dataset
Result: JSON files indexed
initialization;
while not at end of this document do
    read current;
    for each l in current.split(;) do
        term ← metaPTBR(l);
        createJsonFile(term)
    end
end</preformat>
      </sec>
      <sec id="sec-8-2">
        <title>Algorithm 2: exploration step</title>
        <preformat>Data: smaller dataset
Result: terms ranking
initialization;
while not at end of this document do
    read current;
    for each l in current.split(;) do
        result ← searchJsonFile(l);
        ranking.append(result)
    end
end
mostFrequencyTerms(ranking)</preformat>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. Experimental results</title>
      <p>
        To evaluate our term frequency-based implementation, we used samples from two
Brazilian governmental databases: SIM (mortality data, the smaller
database) and SINASC (live births data, the larger database), as summarized in
Table 1. These data are used as a “gold standard” to validate our linkage and deduplication
routines, since they share a common key which refers to a death certificate. We expect
to retrieve 3,030 true match pairs among these databases after around 84 million pairwise
comparisons. Regarding the number of known true matches, this is a
real-world dataset with some impure data, which prevents the identification of all true matches
without a trade-off between specificity and sensitivity. Without blocking, AtyImo retrieves 3,018
true matches. Further discussion on AtyImo’s accuracy is found in [
        <xref ref-type="bibr" rid="ref8">Pita et al. 2018</xref>
        ].
      </p>
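      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Gold standard data set used for validation.</p>
        </caption>
        <table>
          <tbody>
            <tr><td>SIM</td><td>6,458 records</td></tr>
            <tr><td>SINASC</td><td>13,046 records</td></tr>
            <tr><td>Total of comparisons</td><td>84,251,068 pairs</td></tr>
            <tr><td>Expected true positives</td><td>3,030 pairs</td></tr>
          </tbody>
        </table>
      </table-wrap>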
      <p>
        We use reduction ratio and pair completeness as the main measures to evaluate the
quality of our proposed indexing technique [
        <xref ref-type="bibr" rid="ref4">Christen 2012</xref>
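        ]. Both measures are given
by Equations 1 and 2, where <inline-formula><tex-math>BL_{cs}</tex-math></inline-formula> is the number of candidate record pairs generated by blocking (both true
and false matches), <inline-formula><tex-math>BL_{tm}</tex-math></inline-formula> is the number of true matches among them, <inline-formula><tex-math>T_{rs}</tex-math></inline-formula> is the total number
of record pairs in the dataset and <inline-formula><tex-math>T_{tm}</tex-math></inline-formula> is the total number of true matches.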
      </p>
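      <p>
        <disp-formula><label>(1)</label><tex-math>RR = 1 - \frac{BL_{cs}}{T_{rs}}</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>PC = \frac{BL_{tm}}{T_{tm}}</tex-math></disp-formula>
      </p>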
      <p>We have compared the predicate-based approach used by AtyImo (in its
production version) against the new term frequency-based approach proposed in this work.
Table 2 presents the distribution of block sizes for each indexing method. Predicate 2
achieves smaller blocks than predicate 1 due to the discriminative power of the attributes used. This result
reinforces the need to choose carefully which portions of the data are submitted to the indexing
technique. In spite of the steady improvement between the predicates, the term frequency
approach outperforms both of them. Since we used N = 100 to
define how many pairwise comparisons are made for every record of the smaller database,
the block sizes became homogeneous.</p>
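      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Size of generated blocks for each indexing technique.</p>
        </caption>
        <table>
          <thead>
            <tr><th>method</th><th colspan="2">predicate 1</th><th colspan="2">predicate 2</th><th colspan="2">term frequency</th></tr>
            <tr><th>database</th><th>sb</th><th>lb</th><th>sb</th><th>lb</th><th>sb</th><th>lb</th></tr>
          </thead>
          <tbody>
            <tr><td>min</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>100</td></tr>
            <tr><td>med</td><td>24</td><td>51</td><td>2</td><td>2</td><td>1</td><td>100</td></tr>
            <tr><td>mean</td><td>43</td><td>88.38</td><td>8.289</td><td>11.57</td><td>1</td><td>100</td></tr>
            <tr><td>max</td><td>1855</td><td>41528</td><td>87</td><td>611</td><td>1</td><td>100</td></tr>
          </tbody>
        </table>
      </table-wrap>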
      <p>Table 3 summarizes the results obtained for each method. Predicate 1 got
a poor reduction ratio, still performing about half of all pairwise comparisons.
Even with a higher average block size, predicate 1 only retrieves 2,382 of the
3,030 expected true matches, which decreases its pair completeness to 0.786. Better results were
obtained by predicate 2, which reaches a pair completeness of 0.996 with more
discriminative and smaller blocks. Predicate 2 retrieved 3,018 true matches with a
reduction ratio of 0.654. The best metrics were achieved by our indexing technique based on
phonetically encoded term frequency. Despite the larger number of blocks, this approach
retrieved 3,020 true matches while doing fewer comparisons. These results are reflected in a reduction
ratio of 0.992 and a pair completeness of 0.996.</p>
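      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Results of each indexing technique used.</p>
        </caption>
        <table>
          <thead>
            <tr><th></th><th>predicate 1</th><th>predicate 2</th><th>term frequency</th></tr>
          </thead>
          <tbody>
            <tr><td>true matches retrieved</td><td>2,382</td><td>3,018</td><td>3,020</td></tr>
            <tr><td>number of blocks</td><td>5,806</td><td>6,432</td><td>6,458</td></tr>
            <tr><td>number of comparisons</td><td>44,406,049</td><td>29,111,755</td><td>645,800</td></tr>
            <tr><td>reduction ratio</td><td>0.472</td><td>0.654</td><td>0.992</td></tr>
            <tr><td>pair completeness</td><td>0.786</td><td>0.996</td><td>0.996</td></tr>
          </tbody>
        </table>
      </table-wrap>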
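      <p>As a worked check of Equations 1 and 2, the short Python sketch below recomputes reduction ratio and pair completeness from the counts reported in Tables 1 and 3. The function and variable names are ours and purely illustrative.</p>
      <preformat><![CDATA[
```python
def reduction_ratio(bl_cs, t_rs):
    """Equation 1: share of all record pairs that blocking avoids comparing."""
    return 1 - bl_cs / t_rs


def pair_completeness(bl_tm, t_tm):
    """Equation 2: share of the true matches that survive blocking."""
    return bl_tm / t_tm


# Counts taken from Table 1.
T_RS = 6458 * 13046  # all SIM x SINASC pairs
T_TM = 3030          # expected true matches

# (number of comparisons, true matches retrieved) taken from Table 3.
results = {
    "predicate 1": (44406049, 2382),
    "predicate 2": (29111755, 3018),
    "term frequency": (645800, 3020),
}

metrics = {
    method: (reduction_ratio(comparisons, T_RS), pair_completeness(matches, T_TM))
    for method, (comparisons, matches) in results.items()
}
for method, (rr, pc) in metrics.items():
    print(f"{method}: RR={rr:.4f} PC={pc:.4f}")
```
]]></preformat>
      <p>The recomputed values agree with the reported reduction ratio and pair completeness to within about 0.001; the small differences in the third decimal place are consistent with the reported figures having been truncated rather than rounded. The term frequency count of 645,800 comparisons is also exactly 6,458 records times N = 100 candidates.</p>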
      <p>Our proposal of using phonetically encoded term frequency to establish the best candidate
pairs for linkage comparisons proved efficient in terms of accuracy and execution
time. Another major contribution of this work is that indexing is performed while still allowing the
comparison methods to encode or encrypt the blocks, in line with the privacy-preserving
concern of the AtyImo tool.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Related Work</title>
      <p>
        Record linkage tools frequently offer a set of methods to reduce the amount of
pairwise comparisons potentially appearing in big data scenarios. Most popular
tools [
        <xref ref-type="bibr" rid="ref9">Schnell et al. 2004</xref>
        ,
        <xref ref-type="bibr" rid="ref6">Elfeky et al. 2002</xref>
        ,
        <xref ref-type="bibr" rid="ref3">Christen 2008</xref>
        ] can provide indexing based
on a single attribute (traditional indexing), some sorted neighborhood strategies, canopy
clustering and string-map based approaches. For each of these indexing methods, several
parameters are available to be tested on user data in order to choose the best approach.
      </p>
      <p>
        Several indexing techniques usually employed to decrease record linkage
complexity were described and compared in [
        <xref ref-type="bibr" rid="ref4">Christen 2012</xref>
        ]. The author applied different methods to
link some real-world and synthetic datasets. Measures like reduction ratio, pair
completeness, pair quality, and accuracy were used to assess the results. The experiments
showed that the number of parameters to be configured and the quality of the data to be
linked make it difficult to apply any indexing technique successfully. Some of these
techniques are also explained and evaluated in [
        <xref ref-type="bibr" rid="ref11">Yeddula and Lakshmaiah 2016</xref>
        ].
      </p>
      <p>
        Traditional blocking techniques, as well as locality-based ones, are discussed and
compared in [
        <xref ref-type="bibr" rid="ref10">Steorts et al. 2014</xref>
        ]. The authors have used synthetic data sets and evaluated
different metrics (recall, reduction ratio, and complexity). They also discussed some
privacy-preserving requisites related to blocking and indexing techniques.
      </p>
      <p>The Python recordlinkage library implements a full index approach
based on the MultiIndex object provided by the pandas library. This approach returns all
pairwise combinations (the product of the records present in both data sets). It is possible to
provide a blocking key (column name) in order to reduce the number of candidate pairs generated.
The library also implements a sorted neighborhood approach.</p>
      <p>
        In order to meet the requirements imposed by big data scenarios, AtyImo runs over
the pySpark library and offers two predicate-based indexing approaches to split the data
into blocks. These predicates put in the same block those records which agree on some
rules, like first name and birth date, or last name and mother’s name. Experiments show
a good accuracy level using these approaches [
        <xref ref-type="bibr" rid="ref8">Pita et al. 2018</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>6. Conclusion and future work</title>
      <p>Record linkage is a technique widely used in data mining and data warehousing
applications to allow for the aggregation of data coming from disparate data sources. Big data
scenarios impose significant challenges for data linkage tools. The growth of data sources
and the need for increasing levels of accuracy leave room for the development of novel
solutions and the improvement of classical tools.</p>
      <p>This work presented an extension of a cluster-based record linkage tool for
indexing data using phonetically encoded term frequency. Results for accuracy and pairwise
comparison are promising and outperform the existing, competing solutions. As future
work, we plan to develop other techniques such as TF-IDF and unsupervised machine
learning approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bilenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamath</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Adaptive blocking: Learning to scale up record linkage</article-title>
          .
          <source>In Data Mining, 2006. ICDM'06. Sixth International Conference on</source>
          , pages
          <fpage>87</fpage>
          -
          <lpage>96</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Binstock</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rex</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Metaphone: a modern soundex</article-title>
          .
          <source>Practical Algorithms for Programmers</source>
          . Addison Wesley.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Febrl: a freely available record linkage system with a graphical user interface</article-title>
          .
          <source>In Proceedings of the second Australasian workshop on Health data and knowledge management-</source>
          Volume
          <volume>80</volume>
          , pages
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
          . Australian Computer Society, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A survey of indexing techniques for scalable record linkage and deduplication</article-title>
          .
          <source>IEEE transactions on knowledge and data engineering</source>
          ,
          <volume>24</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1537</fpage>
          -
          <lpage>1555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Goiser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Quality and complexity measures for data linkage and deduplication</article-title>
          .
          <source>In Quality Measures in Data Mining</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>151</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Elfeky</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verykios</surname>
            ,
            <given-names>V. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Tailor: A record linkage toolbox</article-title>
          .
          <source>In Data Engineering, 2002. Proceedings. 18th International Conference on</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>28</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Newcombe</surname>
            ,
            <given-names>H. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Axford</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>James</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          (
          <year>1959</year>
          ).
          <article-title>Automatic linkage of vital records</article-title>
          .
          <source>Science</source>
          , pages
          <fpage>954</fpage>
          -
          <lpage>959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Pita</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fiaccone</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amorim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barreto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denaxas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barreto</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>On the accuracy and scalability of probabilistic data linkage over the Brazilian 114 million cohort</article-title>
          .
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          ,
          <volume>22</volume>
          (
          <issue>2</issue>
          ):
          <fpage>346</fpage>
          -
          <lpage>353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Schnell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachteler</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bender</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>A toolbox for record linkage</article-title>
          .
          <source>Austrian Journal of Statistics</source>
          ,
          <volume>33</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Steorts</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadinle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fienberg</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>A comparison of blocking methods for record linkage</article-title>
          . In Domingo-Ferrer, J., editor,
          <source>Privacy in Statistical Databases</source>
          , pages
          <fpage>253</fpage>
          -
          <lpage>268</lpage>
          , Cham. Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Yeddula</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lakshmaiah</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Investigation of techniques for efficient and accurate indexing for scalable record linkage and deduplication</article-title>
          .
          <source>International Journal of Computer &amp; Communication Technology</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>