LADaS 2018 - Latin America Data Science Workshop


        Applying term frequency-based indexing to improve
        scalability and accuracy of probabilistic data linkage
                 Robespierre Pita1,2 , Luan Menezes1,2 , Marcos E. Barreto1,2
         1
          Institute of Mathematics and Statistics, Computer Science Department,
          Federal University of Bahia (UFBA), 40.170-110, Salvador, BA, Brazil
           Centre for Data and Knowledge Integration for Health (CIDACS),
             2

        Oswaldo Cruz Foundation (FIOCRUZ), 41.940-220, Salvador, BA, Brazil
                 [robespierredrp, luanmenezes, marcoseb]@dcc.ufba.br

     Abstract. Record or data linkage is a technique frequently used in diverse do-
     mains to aggregate data stored in different sources that presumably pertain to
     the same real world entity. Deterministic (key-based) or probabilistic (rule-
     based) linkage methods can be used to implement data linkage, being the se-
     cond approach suitable when no common link attributes exist amongst the data
     sources involved. Depending on the volume of data being linked, indexing (or
     blocking) techniques should be used to reduce the number of pairwise compari-
     sons that need to be executed to decide if a given pair of records match or not. In
     this paper, we discuss a new indexing scheme, based on term-frequency counts,
     deployed in our data linkage tool (AtyImo). We present our algorithm design
     and some metrics related to accuracy and efficiency (reduction ratio achieved
     during blocking construction), as well a comparative analysis with a predicate-
     based technique also used in AtyImo. Our results shows a very high level of
     accuracy and reduction in terms of pairwise comparison tasks.

1. Introduction
Record linkage is a well-known technique used to integrate data by finding records po-
tentially belonging to the same entity on distinct sources [Newcombe et al. 1959]. It is
considered an important approach in several domains as the data aggregated from diffe-
rent data sources can be used to build a complete view of a given scenario, as well support
diverse types of studies and decision-making processes.
        In Public Health, this technique is frequently used to aggregate data from different
electronic health records (EHR) usually managed by governmental bodies within the pu-
blic health system. Throughout his life, an individual (or patient) has access to a diversity
of EHR systems that store specific data on clinical care episodes, symptoms and diseases,
prescribed medication and treatments, and associated outcomes. Being able to aggregate
these data is a crucial step to build the patient’s history, as well support different types
of studies, such as quasi-experimental analysis, clinical trials and longitudinal (cohort-
based) evaluations [Newcombe et al. 1959]. The Brazilian Public Health System (SUS)
is comprised by dozens of freely accessible databases providing anonymized data on live
births, mortality, notifiable diseases, nutritional growth, hospital episodes etc. These da-
tabases present a high degree of structural heterogeneity and coverage periods although
being managed, most of them, by a central department (DATASUS). Heterogeneity is re-
lated to the lack of common key attributes amongst all databases that hinder the usage of

                                                   65
LADaS 2018 - Latin America Data Science Workshop


a deterministic approach to integrate data pertaining to the same individual. In such situa-
tions, we should rely on probabilistic approaches to retrieve as much records as possible,
as well on effective ways to validate the retrieved records as truly positive pairs (i.e. make
sure recovered pairs actually belong to the same individual).
        The scope of our work comprises the usage of term frequency-based indexing
to improve the accuracy and the scalability of our probabilistic data linkage tool —
AtyImo [Pita et al. 2018]. We started designing AtyImo in 2013 as a solution to aggregate
data from several Brazilian governmental databases and generate specific data sets for di-
verse epidemiological studies. These studies are being conducted within joint Brazil-UK
projects aiming to build large population-based cohorts and assess the effectiveness of
public health policies 1 .
        AtyImo is a Python-based data linkage tool implemented over Spark and CUDA
able to explore highly distributed and hybrid (multicore CPUs + multi-GPU) parallel ar-
chitectures, respectively. It is used at CIDACS to support data linkage tasks involving
a huge population-based cohort (the 100 million cohort) and public health databases, as
well the setup of a live birth cohort (around 80 million records) comprised by children
diagnosed with microcephaly due to Zika infection. It is also used at UFBA to aggregate
data from different sources within the Brazilian malaria ecosystem to support the design
and validation of forecasting models applied to malaria epidemics.
         AtyImo is structured as a 4-step pipelining implementing data quality assessment,
data pre-processing, record linkage (pairwise comparison) and accuracy assessment. The
data pre-processing step is responsible for data cleansing and harmonization, block cons-
truction and anonymization. Blocking is a mandatory approach to reduce the number of
comparisons, specially in scenarios involving huge databases (cohorts) as ours. The basic
idea is to define some criteria to group records into blocks and perform comparisons only
among blocks presumably similar. Measuring the relative reduction of comparisons may
evidence the effectiveness of a blocking solution [Christen and Goiser 2007].
        Besides scaling record linkage solutions over huge data sources by reducing com-
parisons, another issue concerns the capacity of keeping a good accuracy level. To comply
with this requisite, a good indexing solution must apply criteria that increase pair com-
pleteness, i.e. which records should be included in each block to keep comparisons and
increase accuracy [Elfeky et al. 2002].
        In this work, we discuss a term frequency-based indexing technique implemented
in our Spark-based version of AtyImo. We rely on this technique as an alternative to the
existing approach based on predicates. Building a ranking of records with most terms in
common can result in fewer and better comparisons than the predicates-based solution.
We calculated reduction ratio and accuracy measures to compare the efficiency of our
proposal with other indexing algorithms. The results show that term frequency indexing
outperforms other existing solutions, decreasing the number of pairwise comparisons and
the execution tume, while preserving good accuracy levels.
        This paper is structured as follows: Section 2 briefly describe some existing tech-
   1
    This work was supported by CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation
(OPP1161996), The Royal Society (NF160879), National Institute for Health Research (RP-PG-
040710314) and also supported by the Wellcome Trust (086091/Z/08/Z)


                                                   66
LADaS 2018 - Latin America Data Science Workshop


niques used for indexing. Section 3 presents our term-frequency indexing implementa-
tion and compare it with the predicates-based method used by the production version of
AtyImo. Section 4 discuss some experimental results. Some related works are metioned
in Section 5. Finally, section 6 brings some conclusions and current work directions.

2. Indexing techiniques for record linkage
There are several solutions to scale up record linkage by reducing unnecessary compari-
sons [Christen 2012]. In this work we focus on those already used in AtyImo tool in order
to ensure the understanding of our proposal.

2.1. Traditional blocking
Traditional blocking techniques are generally based on a blocking key which is created
using attributes like first name, last name and date of birth, or some combination of them
to split the data into buckets. Records that exactly match with the blocking key on each
database are compared assuming their potential to be a true link. The reduction ratio
of this approach depends on the discriminative power of the attributes used to build the
blocking key.
        However, this method can be very biased due to the nature of these attributes and
imputation errors that can prevent true matches to be compared. To soften this, phonetic
codes can be applied to nominal data and some normalization can be done on numerical
data. The low complexity and the easy implementation make this strategy very useful in
most situations.

2.2. Blocking with predicates
This method is presented as multi-pass sorted neighborhood alternative and consists of
building some predicate from fields or portions of them. Finally, a function uses this
predicate to make a junction of disjunctions.
        To illustrate, we can suppose a patient record including attributes like name,
mother name, date of birth and gender. Some predicates can use the first name, birth
year and gender. A second may contains the last name, first mother’s name and date
of birth. So every record that agrees with the expression pf irstname ^ birthyear ^
genderq _ plastname ^ mothersf irstname ^ birthdateq will be part of the same block.
        The use of predicate-based indexing can prevent input errors to separate true mat-
ches from right blocks, thus increasing pair completeness. Other variants can learn the
best predicates to use [Bilenko et al. 2006] or cluster nominal values into phonetic codes.
However, the best use of this technique relies on choosing very discriminative attributes
in order to get smaller and powerful blocks.

3. AtyImo’s implementation of term frequency-based indexing
The current version of AtyImo implements indexing through a predicate-based ap-
proach. This implementation comprises two predicates: pr1 “ pf irstname ^
pbirthday _ birthmonth _ birthyearqq _ plastname ^ pbirthday _ birthmonth _
birthyearqq and the second as pr2 “ ppf irstname ^ f irstmothersnameq ^
pbirthday_birthmonth_birthyearqq_pplastname^lastmothersnameq^pbirthday_
birthmonth _ birthyearqq.

                                                   67
LADaS 2018 - Latin America Data Science Workshop


                              Figura 1. Predicate-based blocking.


        Figure 1 shows how the blocks are build based on predicate pr1. The smaller data-
base are used to build an auxiliar file which stores all blocking keys related to each block.
After that, a second phase runs over both databases and the auxiliar file to determine in
which block a given record must be inserted.
       Let consider Bi a database with a finite set o X objects, described by attributes
r “ x1 , ..., xn . This solution inserted three new steps into AtyImo’s pipeline: indexing,
exploration, and ranking procedures, as illustrated on Figure 2:


                  Figura 2. Term frequency-based approach used in AtyImo.


                                                   68
LADaS 2018 - Latin America Data Science Workshop


3.1. Indexing step
The indexing step intend to build a key´value structure by visiting every X P B1 (lar-
ger database) and collecting all terms as keys. The values correspond to every record’s
primary key (x1 ) whose contains the term. To cluster the terms with the same phonetics,
we use a custom implementation of Metaphone algorithm [Binstock and Rex 1995] for
Brazillian Portuguese. Every term has a prefix to indicate which attribute it came from,
this prefix can be n for the name, nm for mother’s name, the day, month and year of birth
are represented by bd, bm and by, respectively. The prefix g stands for gender and m
represents the municipality of birth. Either prefix and phonetic encoded term are set to be
the key of the proposed structure and the name of JSON file to store it.

3.2. Exploration step
To every term extracted from an X P B2 (smaller database this time), we search the
respective JSON file and concatenate their values into a vector of indexes vt. Very po-
pular terms can make this phase too onerous, in order to meet this issue, we submit the
processing of these JSON files to a distributed processing supported by pySpark, the Al-
gorithm 1.

3.3. Ranking step
The term frequency is calculated by ∞ f pt,vtq    1     . After that, we sort vt to find the most
                                         tPvt f pt ,vtq
frequent B1 ’s primary keys to set up a ranking with the best N candidates to be compared
to some B2 ’s record. This approach aims to eliminate the variance of the comparisons
amount even to records which contain very popular terms. This work scope is limited to
indexing the data, search the best candidates and provide a comparison structure for the
further phases of AtyImo (well explained in [Pita et al. 2018].
                                                          Data: smaller dataset
                 Data: larger dataset                     Result: terms ranking
                 Result: json files indexed               initialization;
                 initialization;                          while not at end of this
                 while not at end of this                   document do
                   document do                                 read current;
                      read current;                            for current.split(;) to
                      for current.split(;) to                   l do
                       l do                                         result–searchJsonFile(l);
                           term–metaPTBR(l);
                                                                    ranking.append(result)
                             createJsonFile(term)             end
                       end
                                                          end
                 end                                      mostFrequencyTerms(ranking)
          Algorithm 1: indexing step
                                                         Algorithm 2: exploration step

4. Experimental results
To evaluate our term frequency-based implementation, we have used samples from two
Brazilian governmental databases: SIM (mortality data, presented as the smaller data-
base) and SINASC (live births data, presented as the larger database), as summarized in
Table 1. These data are used as “gold standard” to validate our linkage and deduplication

                                                    69
LADaS 2018 - Latin America Data Science Workshop


routines, since they share a common key which refers to a death certificate. We expect
to retrieve 3.030 true match pairs among these databases after around 84 million pairwise
comparisons. Regarding the known true matches numbers, we understand it is a real-
world dataset with some impure data, which limits the identification of all true matches
without a specificity and sensitivity trade-off. Without blocking, AtyImo retrieves 3018
true matches. Further discussion on AtyImo’s accuracy is found in [Pita et al. 2018].

                     Tabela 1. Gold standard data set used for validation.
                            SIM                     6,458 records
                            SINASC                  13,046 records
                            Total of comparisons    84,251,068 records
                            Expected true positives 3,030 records

         We use reduction ratio and pair completeness as the main measures to evaluate the
quality of our proposed indexing technique [Christen 2012]. Both measures are presented
by the Equations 1 and 2. Where, BLcs is the number of records sets in the block, true
and false matchs. On the other hand BLtm is only true matchs. T rs is the total number
of sets in the dataset and T tm is the total of true matches.
                              BLcs                              BLtm
                  RR “ 1 ´                 (1)            PC “               (2)
                               T rs                             T tm
         We have compared the predicate-based approach used by AtyImo (in its produc-
tion version) against the new term frequency-based approach proposed in this work. Ta-
ble 2 presents the distribution of blocks sizes for each indexing method. The predicate1
achieve smaller blocks due the discriminative power of attributes used. This result rein-
force the need of well choose which portions of data will be submitted to the indexing
technique. In spite of predicates steady improvement, the results of term frequency ap-
proach outperforms all the other compared solution. Since we have used N “ 100 to
define how much pairwise comparisons will be made to every record on smaller database,
the size of blocks became homogeneous.

               Tabela 2. Size of generated blocks for each indexing technique
                     method      predicate 1  predicate 2    term frequecy
                    database     sb      lb    sb      lb    sb       lb
                      min         1       1     1      1      1      100
                      med        24      51    2       2      1      100
                      mean       43    88.38 8.289 11.57 1           100
                      max      1855 41528      87     611     1      100

        Table 3 summarizes the results obtained for each method. The predicate1 got
poor results of reduction ratio by performing almost half of all pairwise comparisons.
Even with higher average block size, predicate1 execution only retrieve 2,382 from the
3,030 expected, which decrease it pair completeness to 0.786. Better results has been
made by predicate2, which manage to get 0.996 of pair completeness with more discri-
minative and smaller blocks. The predicate2 achieved 3,018 true matches and 0,654 of
reductions ratio. The best metrics were achieved by our indexing techinique based on
phonetic encoded term frequency. Despite the major number of blocks, this approach
obtained 3,020 true matches doing less comparisons. These results reflected on reduction
ratio and pair completeness, getting 0.992 and 0.996, respectively.

                                                   70
LADaS 2018 - Latin America Data Science Workshop


                     Tabela 3. Results of each indexing techinique used.
                                       predicate 1 predicate 2 term frequency
               true matches retrieved     2,382       3,018          3,020
                  number of blocks        5,806       6,432          6,458
               number of comparisons 44,406,049 29,111,755          645,800
                   reduction ratio        0.472       0,654          0,992
                  pair completness        0.786       0.996          0.996


        Our proposal of phonetic encoded term frequency to establish the best candidade
pairs for linkage comparisons revealed efficient in terms of accuracy and runtime execu-
tion. Another major contribution of this work refers to perform indexing while allow the
comparison methods to encode or encript the blocks, considering the privacy-preserving
concern of AtyImo tool.


5. Related Work

Record linkage tools frequently offer a set of methods to reduce the amount of
pairwise comparisons potentially appearing in big data scenarios. Most popular to-
ols [Schnell et al. 2004, Elfeky et al. 2002, Christen 2008] can provide a indexing based
on a single attribute (traditional indexing), some sorted neighborhood strategies, canopy
clustering and string-map based approaches. To each of these indexing methods, several
parameters are available to be tested on user data in order to choose the best approach.
        Several indexing techniques usually employed to decrease record linkage comple-
xity were described and compared on [Christen 2012]. They applied different methods to
link some real-world and synthetic datasets. Measures like reduction ratio, pair comple-
teness, pair quality, and accuracy were utilized to assess the results. Their experiments
showed that the number of parameters to be configured and the quality of the data to be
linked make difficult a successful application of any indexing technique. Some of these
techniques are also explained and evaluated in [Yeddula and Lakshmaiah 2016].
        Traditional blocking techniques, as well locality-based ones are discussed and
compared in [Steorts et al. 2014]. The authors have used synthetic data sets and evaluated
different metrics (recall, reduction ratio, and complexity). They also discussed some
privacy-preserving requisites related to blocking and indexing techniques.
       Python provides the recordlinkage library that implements a full index approach
based on the MultiIndex object provided in the Pandas library. This approch returns all
pairwise combinations (product of the records present in both data sets). It is possible to
provide a blocking key (column name) in order to reduce the number of blocks generated.
The library also implements a sorted neighborhood approach.
         In order to meet the requisites imposed by big data scenarios, AtyImo runs over
the pySpark library and offer two predicate-based indexing approaches to split the data
into blocks. These predicates put in the same block those records which agree with some
rules, like first name and birth date, or last name and mother’s name. Experiments show
a good accuracy level using this approaches [Pita et al. 2018].

                                                   71
LADaS 2018 - Latin America Data Science Workshop


6. Conclusion and future work
Record linkage is a technique widely used in data mining and data warehousing applica-
tions to allow for the aggregation of data coming from disparate data sources. Big data
scenarios impose significant challenges for data linkage tools. The growth of data sources
and the need for increasing levels of accuracy open room to the development of novel
solutions and the improvement of classical tools.
       This work presented an extension of a cluster-based record linkage tool for inde-
xing data using phonetic enconded term frequency. Results for accuracy and pairwise
comparison are promising and outperforms existing, concurrent solutions. As future
work, we plan to develop other techniques such as TF-IDF and unsupervised machine
learning approaches.

Referências
Bilenko, M., Kamath, B., and Mooney, R. J. (2006). Adaptive blocking: Learning to scale
   up record linkage. In Data Mining, 2006. ICDM’06. Sixth International Conference
   on, pages 87–96. IEEE.
Binstock, A. and Rex, J. (1995). Metaphone: a modern soundex. Practical Algorithms
  for Programmers. Addison Wesley.
Christen, P. (2008). Febrl: a freely available record linkage system with a graphical user
  interface. In Proceedings of the second Australasian workshop on Health data and
  knowledge management-Volume 80, pages 17–25. Australian Computer Society, Inc.
Christen, P. (2012). A survey of indexing techniques for scalable record linkage and
  deduplication. IEEE transactions on knowledge and data engineering, 24(9):1537–
  1555.
Christen, P. and Goiser, K. (2007). Quality and complexity measures for data linkage and
  deduplication. In Quality Measures in Data Mining, pages 127–151. Springer.
Elfeky, M. G., Verykios, V. S., and Elmagarmid, A. K. (2002). Tailor: A record linkage
   toolbox. In Data Engineering, 2002. Proceedings. 18th International Conference on,
   pages 17–28. IEEE.
Newcombe, H. B., Kennedy, J. M., Axford, S., and James, A. P. (1959). Automatic
  linkage of vital records. Science, pages 954–959.
Pita, R., Pinto, C., Sena, S., Fiaccone, R., Amorim, L., Reis, S., Barreto, M., Denaxas,
   S., and Barreto, M. E. (2018). On the accuracy and scalability of probabilistic data
   linkage over the Brazilian 114 million cohort. IEEE Journal of Biomedical and Health
   Informatics, 22(2):346–353.
Schnell, R., Bachteler, T., and Bender, S. (2004). A toolbox for record linkage. Austrian
  Journal of Statistics, 33(1-2):125–133.
Steorts, R. C., Ventura, S. L., Sadinle, M., and Fienberg, S. E. (2014). A comparison
   of blocking methods for record linkage. In Domingo-Ferrer, J., editor, Privacy in
   Statistical Databases, pages 253–268, Cham. Springer International Publishing.
Yeddula, S. and Lakshmaiah, K. (2016). Investigation of techniques for efficient and
  accurate indexing for scalable record linkage and deduplication. International Journal
  of Computer & Communication Technology, pages 24–30.

                                                   72