Reference Sources in Clearing Customer Data:
Conclusions from a R&D Project
Mariusz Sienkiewicz
Poznan University of Technology, Poznań, Poland


Abstract

The digitization and virtualization of many aspects of life pose a question for many organizations regarding customer identification. The problem is extremely important in the context of financial institutions (FIs), where customer identification is related to a number of aspects of the company's operation and of the products and services provided. The problem of unambiguous customer identification stems from data errors, dirty data, and duplicate records describing the customer. It is estimated that between 1% and approximately 5% of FI data are affected by errors. The scope of data collected by institutions about their clients is enormous and results from many needs. Each of these needs may require a different scope of data and may expect a different level of quality. Regardless of the needs for data collection and processing, certain data are particularly important: the data allowing for unambiguous customer identification. In this article, we pay special attention to the data set that allows for unambiguous customer identification.

Keywords
data cleaning, deduplication, dictionary cleaning, geocoding



Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
mariusz.sienkiewicz@doctorate.put.poznan.pl (M. Sienkiewicz)
ORCID: 0000-0002-1665-4928 (M. Sienkiewicz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

The issue of data cleaning has been raised many times in the literature [1, 2, 3, 4, 5, 6, 7]. There are various suggestions for data error detection and deduplication based, e.g., on methods of text comparison, crowdsourcing, or classification. Based on the experience from a project implemented for a large financial institution, we present a concept of detecting and correcting customer identification data. This is the first article based on work carried out on a large (over 2 million records) database of real natural and legal persons.

Due to the nature of their business, financial institutions pay special attention to unambiguous and complete identification of clients. Customer identification is of great importance both for the FI collecting data and providing products and services, and for customers, e.g., due to the security of their funds. Financial institutions and financial market regulators pay great attention to the quality of the collected and processed data [8, 9]. Many financial institutions have extensive data management and data quality management systems, and use various techniques to detect and correct errors in the collected and processed data. These mechanisms are based on specialized software supporting the detection of defects and on customer service procedures focused on the correctness of data, which can be treated as crowdsourcing.

Despite the mechanisms used, errors are identified in the data describing clients collected and processed by financial institutions. This is due to: 1) the long history of IT systems, 2) numerous system migrations, 3) acquisitions on the financial market, 4) human errors, 5) intentional actions (e.g., attempts to extort financial resources). The effectiveness of the procedures created for the sales force is limited and depends on many factors; it is not the subject of this article.

Clean and standardized data are required in many areas of data processing in financial institutions, including 1) risk models, 2) security mechanisms, 3) offer and sales support models, 4) ML-based solutions, 5) data deduplication.

The standard data deduplication pipeline [10, 11, 12, 13] assumes that the data delivered to the pipeline is cleaned (e.g., no null values, no spelling errors, unified hashes). Unfortunately, this assumption cannot be guaranteed in real projects, especially in the financial sector. There are typos, missing values, and inconsistent values in the attributes that store personal data, institution names, and addresses. Moreover, not all natural identifiers are reliable. It should also be taken into account that the financial market is largely regulated by law. Interpretation of the current legal regulations and the security practices applied by FIs limit the possibility of making changes to customer data, and thus the possibility of improving the data. In addition, the client, in accordance with the provisions of contracts concluded with financial institutions, is obliged to ensure that the data made available to the financial institution is up-to-date and correct. Despite the efforts made by financial institutions and the obligations imposed on clients, observations from a project carried out at a large financial institution show that errors in the data occur and constitute a significant obstacle for the organization.
This article is a continuation of [14], focusing on error detection and improvement of identification data. In particular, we present our experiences and conclusions from the use of reference data sources for error detection and cleaning of customer data (Section 2). We present conclusions regarding cleaning and standardization of address data (Section 3). Final conclusions are drawn in Section 4. Note that this article presents the results of an actual research and development project, and therefore not all details may be disclosed, as they are treated as company know-how.
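To illustrate why the cleanliness assumption of the deduplication pipeline matters, here is a minimal sketch (with invented records) of how unnormalized values defeat naive exact-match deduplication, and how a simple normalization step repairs it:

```python
import unicodedata

def normalize_name(value: str) -> str:
    """Trim, collapse whitespace, case-fold, and strip diacritics."""
    value = " ".join(value.split()).casefold()
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Two spellings of the same (invented) customer defeat exact matching...
raw = ["Anna Kowalska", "  anna KOWALSKA ", "Jan Nowak"]
assert len(set(raw)) == 3
# ...but collapse to one representation after normalization.
assert len({normalize_name(r) for r in raw}) == 2
```

Real pipelines face the harder cases discussed below (typos, unreliable identifiers), which simple normalization cannot fix.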
2. Customer identification data

The scope of data collected by financial institutions is wide. A data vector describing a single customer can include more than 1000 features. These can be contact, socioeconomic, and behavioral data, e.g., product use, transaction data, property ownership status, communication channels used, etc. Data relating to individual features may be dirty. Regardless of the length of the collected data vector, the basic identification data are the most important. Of course, the design of IT systems most often ensures the existence of a unique artificial system key that distinguishes records, but from the point of view of a financial institution, it is important to identify all instances of a customer in order to consolidate the knowledge about them.

As a result of the project work, based on the knowledge of the financial institution's experts, a small subset of data that is particularly important in identifying the client was determined. The basic identification data include: 1) the natural key from the population or business registration system, 2) the name and surname or the name of the entity, 3) the document ID, 4) the legal form of the entity.
2.1. Detection of identification data errors

Basic identification data errors can be detected using a range of algorithms and tools, i.e., regular expressions, patterns, dictionaries, and calculation rules (standard data cleaning mechanisms). However, legal and regulatory constraints significantly limit the use of cleaning mechanisms. Most often, modification is possible only after confirming the correctness of the identification data with the customer. Data requirements and availability change over time. In the project implemented for a large financial institution, the researched database has over 2 million records. The entire production customer database is much larger. Verifying the correctness of identification data for the entire customer base is: 1) costly, 2) burdened with reputational risk, 3) long-term, 4) burdened with human error. Moreover, due to the lack of up-to-date contact details, it may not be possible to reach some of the customers.

As a result of the work on deduplication, pairs of records with a very high degree of similarity, but with differences in the customer identification data, are identified. Determining whether an identification data set is correct, and which data set is the valid one for the customer, may consist of: 1) verification of the data set with the customer (as indicated earlier, this is not an attractive solution at enterprise scale), 2) checking the data against reference data sources.
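The rule-based checks listed above (regular expressions, dictionaries, calculation rules) can be sketched as follows. This is an illustrative sketch, not the institution's actual rules: the checksum is the publicly documented control-digit scheme of the Polish PESEL number, while the record fields and the legal-form dictionary are hypothetical.

```python
import re

# Weights of the publicly documented PESEL control-digit scheme
# (Polish population register number, 11 digits).
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_is_valid(pesel: str) -> bool:
    """Check the format (regular expression) and the control digit (calculation rule)."""
    if not re.fullmatch(r"\d{11}", pesel):
        return False
    digits = [int(c) for c in pesel]
    control = (10 - sum(w * d for w, d in zip(PESEL_WEIGHTS, digits)) % 10) % 10
    return control == digits[10]

# A tiny dictionary check, e.g. for the legal form of an entity (hypothetical values).
LEGAL_FORMS = {"sole proprietorship", "limited liability company", "joint-stock company"}

def detect_errors(record: dict) -> list[str]:
    """Return the rule violations found in a (hypothetical) customer record."""
    errors = []
    if not pesel_is_valid(record.get("pesel", "")):
        errors.append("invalid natural key (PESEL)")
    if not record.get("surname", "").strip():
        errors.append("missing surname")
    if record.get("legal_form") and record["legal_form"] not in LEGAL_FORMS:
        errors.append("unknown legal form")
    return errors
```

As the section notes, detecting an error this way is the easy part; correcting it is constrained by law and usually requires confirmation with the customer.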
2.2. Verification of identification data

The possibility of mass verification of identification data seems to be an attractive solution for a large financial institution. Access to state registers containing the basic data describing the client makes it possible to verify whether a set of identification data is correct. Verification of identification data with the use of registry data allows errors in the data to be indicated unambiguously, and thus the data to be improved precisely, which in the case of financial institutions is extremely important.

State registers, such as 1) the population registration system and 2) the register of business activity records, should be treated as the reference sources of data allowing for the verification of the correctness of the identification data held. Access to individual registers is regulated by law, not all entities can use them equally, and access may require a fee. In the project conducted for a large financial institution, 1) the population register and 2) the register of economic activities were used.

2.2.1. Contents of the population register

The population register contains a number of personal data of the citizens of a given country. In the case of the Polish register, it is about 30 items. The following are particularly useful for the verification of identification data:

    • ID number,
    • previous ID number (if changed),
    • surname and first names,
    • family name,
    • previous surnames and first names, with the date of their change and the name of the office that made the change,
    • names and surnames of parents (in the case of a data change: the date and the name of the office that made the change),
    • date and place of birth (in the case of a data change: the date and the name of the office that made the change),
    • country of birth,
    • sex (in the case of a data change: the date and the name of the office that made the change),
    • series and number of the last ID card, its expiry date, and the name of the office that issued it,
    • series and number of the last passport and its expiry date,
    • date of death or the date the body was found, the number of the death certificate, and the registry office which drew up the record.
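A minimal sketch of such a register-based verification, assuming hypothetical field names and a simple per-field match status:

```python
import unicodedata

def normalize(value: str) -> str:
    """Case-fold and strip diacritics so that, e.g., 'Kraków' and 'KRAKOW' compare equal."""
    decomposed = unicodedata.normalize("NFKD", value.strip().casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def verify_against_register(held: dict, register: dict, fields: tuple) -> dict:
    """Compare the identification data held by the FI with a reference register record.

    Returns, per field: 'match', 'mismatch', or 'missing' (absent in either source).
    """
    result = {}
    for field in fields:
        a, b = held.get(field), register.get(field)
        if a is None or b is None:
            result[field] = "missing"
        elif normalize(a) == normalize(b):
            result[field] = "match"
        else:
            result[field] = "mismatch"
    return result
```

A record could then be marked as confirmed when all verified fields match, and routed to customer contact otherwise.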
2.2.2. Contents of the register of economic activities

The register of economic activities contains a number of data concerning the entities operating in a given country. In the case of the Polish register, it is about 60 items. Particularly useful for the verification of legal entities' identification data are:

    • ID number,
    • name,
    • short name,
    • date of creation,
    • date of commencement of activities,
    • registered office address,
    • legal form,
    • type of business,
    • termination date.

On the basis of the indicated registers, the correctness of the identification data held by the financial institution was tested. In the case of natural persons, these were: 1) the number from the population registration system, 2) the first name, 3) the surname. For business entities, the following were examined: 1) the business registration number, 2) the name of the entity, 3) the legal form of the business. In the case of legal entities, the access to data is wide and it was possible to verify all entities subject to registration.

With regard to the verification results:

    • records where the set of identification data was correct were marked,
    • in the case of identified pairs of similar records, where one of the records was confirmed in the reference database, it was possible to decide to create a pair despite differences in the identification data, e.g., a different value of one of the compared features,
    • a limited set of records requiring verification in contact with the customer was designated.

The obtained results were verified by experts of the financial institution and proved that the applied cleaning method was adequate to the cleaning problem under consideration. On a representative sample of the records of natural persons from the created pairs, where there was one difference in the identification data, approx. 87% were confirmed to be correct based on the population register.
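The pair decision described by the bullets above can be sketched as follows; the similarity measure (difflib's ratio), the threshold, and the decision labels are illustrative assumptions, not the project's actual method:

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on the longest matching subsequences."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def decide_pair(rec_a: dict, rec_b: dict, confirmed_in_register: bool,
                threshold: float = 0.9) -> str:
    """Decide what to do with a highly similar pair of customer records."""
    fields = sorted(set(rec_a) & set(rec_b))
    sims = [text_similarity(str(rec_a[f]), str(rec_b[f])) for f in fields]
    avg = sum(sims) / len(sims) if sims else 0.0
    if avg == 1.0:
        return "merge"                 # identical on the compared fields
    if avg >= threshold and confirmed_in_register:
        return "merge"                 # small difference, but one record is confirmed
    if avg >= threshold:
        return "verify with customer"  # similar, yet no reference confirmation
    return "keep separate"
```

The point of the reference register is the middle branch: without the confirmation flag, every near-duplicate would fall back to costly customer contact.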
3. Cleaning of address data

Address data, right after customer identification data, constitute an important element of the data in many enterprises, especially in financial institutions (mainly due to the numerous information obligations that FIs must fulfill in letter form). For various reasons, application interfaces for entering addresses very often do not have data validation mechanisms implemented. Failure to implement validation rules causes numerous errors in the data. The existence of validation rules does not free the system from the problems related to the cleanliness of address data, as the names of towns and streets may change.

3.1. Address reference data

There are reference databases for addresses. These are dictionary systems describing the territorial division of a given country. Most often, they are organized in the form of hierarchical dictionaries, from the largest (province) to the smallest (street) territorial unit. These dictionaries, combined with one of the similarity methods, can be used to detect errors in address data. Due to possible abbreviations, renamings, and data errors, the use of territorial dictionaries for the validation and improvement of addresses is difficult, especially when we are dealing with a large database of an institution with a long history of functioning and numerous system migrations.

3.2. Geocoder as a tool for standardization of address data

There are geocoder tools on the market that allow the correctness of an address to be verified efficiently. The most common result of address geocoding is a standardized record of the geocoded address, along with the geographic position (longitude, latitude) and the quality of the match. Geocoders work on the basis of text parsing mechanisms and similarity algorithms, hence the measure of matching, which shows how exactly the geocoded address matches the pattern. Territorial dictionaries are often used as the pattern.

It would seem that, since cleaned and standardized data are required for the deduplication process, a geocoder-type tool is an ideal solution for data cleansing. As mentioned before, a geocoder operates on the basis of text similarity testing methods that the geocoder supplier treats as a trade secret. The project uses a commercial solution provided by a supplier who has, for many years, been developing the address base of the territorial area covered by the project and the geocoding algorithms, providing solutions for business and individual customers. The tools used by the FI constitute a trade secret and cannot be disclosed. In addition, record similarity measures based on text comparison are often used to compare data in the deduplication process. When performing deduplication on geocoded data, one should be aware that the compared data was established, in the previous cleansing and standardization step, on the basis of some unknown measure of text similarity. Since the geocoder returns some measure of match and is not always able to match the correct address, it is questionable whether comparing records with a text similarity measure after geocoding is appropriate.

In the implemented project, we decided to use the address data without geocoding them, testing their similarity with a text similarity measure. Thanks to this approach, the obtained result of comparing records in the deduplication process is not disturbed by the use of indirect, rough search processes.
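The dictionary-plus-similarity approach of Section 3.1 can be sketched with a standard fuzzy lookup; the tiny town list below stands in for a real territorial dictionary, which is hierarchical (province, county, town, street):

```python
from typing import Optional
from difflib import get_close_matches

# A fragment of a (hypothetical) territorial dictionary: official town names.
TOWNS = ["Poznań", "Poznań-Jeżyce", "Warszawa", "Wrocław", "Kraków"]

def standardize_town(raw: str, cutoff: float = 0.75) -> Optional[str]:
    """Map a possibly misspelled town name to its dictionary form, or None."""
    # Exact hit first: cheaper and unambiguous.
    if raw in TOWNS:
        return raw
    candidates = get_close_matches(raw, TOWNS, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None
```

The cutoff trades precision for recall, and abbreviations such as "Wwa" for "Warszawa" fall below any reasonable cutoff, which is exactly the difficulty with abbreviations and renamings noted above.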
4. Conclusion and Future work

Based on a project implemented for a large financial institution:

    • The possibility and usefulness of data from state registers for data correctness verification has been positively verified.
    • Access to the data contained in the population register is difficult, and it may turn out to be impossible for entities from outside the financial market or public services.
    • Business registers are open, but there may be a fee to access them.
    • Confirmation of identification data on the basis of state registers is possible for an entity operating on a national scale; international entities would require access to the state registers of various countries.
    • Not all economic entities (some forms of activity) are included in the business records (they do not require registration).
    • There is no confirmation of non-resident data in the population register database.
    • The use of territorial dictionaries as a source of reference data requires building mechanisms based on comparing the similarity of text. Due to the occurrence of abbreviations of names, errors in data, and the renaming of address names, the usefulness of a solution built solely on the basis of reference data may not be satisfactory.
    • The use of a geocoder for cleaning and standardizing address data seems to be justified if, in the subsequent steps, the data obtained as a result of geocoding does not take part in text similarity comparison. The availability of the numerical data obtained as a result of geocoding allows for easy comparison of the obtained address points based on their geographic position. The use of a geocoder in the preparation of data for deduplication requires further research, due to the lack of knowledge about the error of geocoding results related to the undisclosed methods of comparing texts used in the geocoding engines of different suppliers.

Acknowledgements. The work of Mariusz Sienkiewicz is supported by the Applied Doctorate Scholarship no. DWD/4/24/2020 from the Ministry of Education and Science; additionally, the project is supported by a grant from the National Center for Research and Development no. POIR.01.01.01-00-0287/19.

References

 [1] X. Chu, Data cleaning, in: S. Sakr, A. Y. Zomaya (Eds.), Encyclopedia of Big Data Technologies, Springer, 2019. URL: https://doi.org/10.1007/978-3-319-63962-8_3-1. doi:10.1007/978-3-319-63962-8_3-1.
 [2] E. K. Rezig, Data cleaning in the era of data science: Challenges and opportunities, in: 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings, www.cidrdb.org, 2021. URL: http://cidrdb.org/cidr2021/papers/cidr2021_abstract09.pdf.
 [3] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning: Overview and emerging challenges, in: F. Özcan, G. Koutrika, S. Madden (Eds.), Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 2201–2206. URL: https://doi.org/10.1145/2882903.2912574. doi:10.1145/2882903.2912574.
 [4] G. Y. Lee, L. Alzamil, B. Doskenov, A. Termehchy, A survey on data cleaning methods for improved machine learning model performance, CoRR abs/2109.07127 (2021). URL: https://arxiv.org/abs/2109.07127. arXiv:2109.07127.
 [5] E. Rahm, H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull. 23 (2000) 3–13. URL: http://sites.computer.org/debull/A00DEC-CD.pdf.
 [6] O. Azeroual, Data wrangling in database systems: Purging of dirty data, Data 5 (2020) 50. URL: https://doi.org/10.3390/data5020050. doi:10.3390/data5020050.
 [7] M. A. Hernández, S. J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem, Data Min. Knowl. Discov. 2 (1998) 9–37. URL: https://doi.org/10.1023/A:1009761603038. doi:10.1023/A:1009761603038.
 [8] The Polish Financial Supervision Authority, Recommendation D concerning the management of information technology areas and security of the ICT environment in banks, https://www.knf.gov.pl/knf/pl/komponenty/img/Rekomendacja_D_8_01_13_uchwala_7_33016.pdf, 2013.
 [9] Official Journal of the European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016.
[10] A. Colyer, The Morning Paper on An overview of end-to-end entity resolution for big data, https://blog.acolyer.org/2020/12/14/entity-resolution/, 2020.
[11] A. Simitsis, P. Vassiliadis, T. K. Sellis, State-space
     optimization of ETL workflows, IEEE Transactions
     on Knowledge and Data Engineering 17 (2005) 1404–
     1419.
[12] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas,
     Blocking and filtering techniques for entity reso-
     lution: A survey, ACM Comput. Surv. 53 (2020)
     31:1–31:42.
[13] G. Papadakis, L. Tsekouras, E. Thanos, G. Gian-
     nakopoulos, T. Palpanas, M. Koubarakis, Domain-
     and structure-agnostic end-to-end entity resolution
     with jedai, SIGMOD Record 48 (2019) 30–36.
[14] M. Sienkiewicz, R. Wrembel, Managing data in a
     big financial institution: Conclusions from a r&d
     project, in: C. Costa, E. Pitoura (Eds.), Proceedings
     of the Workshops of the EDBT/ICDT 2021 Joint
     Conference, Nicosia, Cyprus, March 23, 2021, vol-
     ume 2841 of CEUR Workshop Proceedings, CEUR-
     WS.org, 2021. URL: http://ceur-ws.org/Vol-2841/
     DARLI-AP_9.pdf.