=Paper=
{{Paper
|id=Vol-3135/darliap_paper7
|storemode=property
|title=Reference Sources in Clearing Customer Data: Conclusions from a R&D Project
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper7.pdf
|volume=Vol-3135
|authors=Mariusz Sienkiewicz
|dblpUrl=https://dblp.org/rec/conf/edbt/Sienkiewicz22
}}
==Reference Sources in Clearing Customer Data: Conclusions from a R&D Project==
Mariusz Sienkiewicz
Poznan University of Technology, Poznań, Poland
Abstract
The digitization and virtualization of many aspects of life pose a question for many organizations regarding customer identification. The problem is particularly important in the context of financial institutions (FI), where customer identification is related to a number of aspects of the company's operation and of the products and services provided. The problem of unambiguous customer identification stems from data errors, dirty data, and duplicate records describing the customer. It is estimated that 1% to approximately 5% of FI data are affected by errors. The scope of data collected by institutions about their clients is enormous and results from many needs. Each of these needs may require a different scope of data and expect a different level of quality. Regardless of the needs for data collection and processing, certain data are particularly important - namely, data allowing for unambiguous customer identification. In this article, we pay special attention to the data set that allows for unambiguous customer identification.
Keywords
data cleaning, deduplication, dictionary cleaning, geocoding
Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
mariusz.sienkiewicz@doctorate.put.poznan.pl (M. Sienkiewicz)
ORCID: 0000-0002-1665-4928 (M. Sienkiewicz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The issue of data cleaning has been raised many times in the literature [1, 2, 3, 4, 5, 6, 7]. There are various suggestions for data error detection and deduplication based, e.g., on methods of text comparison, crowdsourcing, or classification. Based on the experience from a project implemented for a large financial institution, we present a concept of detecting and correcting customer identification data. This is the first article based on the work carried out on a large (over 2 million records) database of real natural and legal persons.

Due to the nature of their business, financial institutions pay special attention to unambiguous and complete identification of clients. Customer identification is of great importance both for the FI collecting data and providing products and services, and for customers, e.g., due to the security of funds. Financial institutions and financial market regulators pay great attention to the quality of the collected and processed data [8, 9]. Many financial institutions have extensive data management and data quality management systems, and use various techniques to detect and correct errors in the collected and processed data. These are mechanisms based on specialized software supporting the detection of defects, and on customer service procedures focused on the correctness of data, which can be treated as crowdsourcing. Despite the mechanisms used, errors are still identified in the data describing clients collected and processed by financial institutions. This is due to: 1) the long history of IT systems, 2) numerous system migrations, 3) acquisitions on the financial market, 4) human errors, 5) intended actions (e.g., attempts to extort financial resources). The effectiveness of the procedures created for the sales force is limited and depends on many factors; it is not the subject of this article.

Clean and standardized data are required in many areas of data processing in financial institutions, including 1) risk models, 2) security mechanisms, 3) offer and sales support models, 4) ML-based solutions, 5) data deduplication.

The standard data deduplication pipeline [10, 11, 12, 13] assumes that the data delivered to the pipeline is already cleaned (e.g., no null values, no spelling errors, unified formats). Unfortunately, this assumption cannot be guaranteed in real projects, especially in the financial sector. There are typos, missing values, and inconsistent values in the attributes that store personal data, institution names, and addresses. Moreover, not all natural identifiers are reliable. It should also be taken into account that the financial market is largely regulated by law. Interpretation of the current legal regulations, and the security practices applied by FIs, limit the possibility of making changes to customer data and thus the possibility of improving the data. In addition, the client, in accordance with the provisions of contracts concluded with financial institutions, is obliged to ensure that the data made available to the financial institution is up-to-date and correct. Despite the efforts made by financial institutions and the obligations imposed on clients, observations from a project carried out at a large financial institution show that errors in data occur and constitute a significant obstacle for the organization.
This article is a continuation of [14], focusing on error detection and improvement of identification data. In particular, we present our experiences and conclusions from the use of reference data sources for error detection and cleaning of customer data (Section 2). We present conclusions regarding the cleaning and standardization of address data (Section 3). Final conclusions are drawn in Section 4. Note that this article presents the results of an actual research and development project, and therefore not all details may be disclosed, as they are treated as company know-how.

2. Customer identification data

The scope of data collected by financial institutions is wide. A data collection vector describing a single customer can include more than 1000 features. These can be contact, socioeconomic, or behavioral data, e.g., product use, transaction data, property ownership status, communication channels used, etc. Data relating to individual features may be dirty. Regardless of the length of the collected data vector, basic identification data are the most important. Of course, the design of IT systems most often ensures the existence of a unique artificial system key that distinguishes records, but from the point of view of a financial institution, it is important to identify all instances of a customer in order to consolidate knowledge about them.

As a result of the project work, based on the knowledge of the financial institution's experts, a small subset of data was determined that is particularly important in identifying the client. The basic identification data include: 1) natural key from the population or business registration system, 2) name and surname or name of the entity, 3) document ID, 4) legal form of the entity.

2.1. Detection of identification data errors

Basic identification data errors can be detected using a range of algorithms and tools, e.g., regular expressions, patterns, dictionaries, and calculation rules (standard data cleaning mechanisms). However, legal and regulatory constraints significantly limit the use of cleaning mechanisms. Most often, modification is possible only after confirming the correctness of the identification data with the customer. Data requirements and availability change over time. In a project implemented for a large financial institution, the researched project database has over 2 million records. The entire production customer database is much larger. Verifying the correctness of identification data for the entire customer base is: 1) costly, 2) burdened with image risk, 3) long-term, 4) burdened with human error. Moreover, due to the lack of up-to-date contact details, it may not be possible to reach some of the customers.
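As an illustration of such rule-based checks, the sketch below validates a Polish PESEL number (the natural key from the population registration system) with a regular expression plus the publicly documented checksum rule. This is a minimal example and not the financial institution's actual validation mechanism:

```python
import re

PESEL_RE = re.compile(r"^\d{11}$")
# Publicly documented weights for the first 10 digits of a PESEL number
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_valid(pesel: str) -> bool:
    """Syntactic check (regex) plus checksum rule for a Polish PESEL number."""
    if not PESEL_RE.match(pesel):
        return False
    digits = [int(c) for c in pesel]
    checksum = (10 - sum(w * d for w, d in zip(WEIGHTS, digits[:10])) % 10) % 10
    return checksum == digits[10]

# A checksum-valid test number vs. a copy with a single-digit typo
print(pesel_valid("44051401359"))  # True
print(pesel_valid("44051401358"))  # False
```

Such a check flags an error without any contact with the customer, but it cannot tell which digit is wrong; that still requires a reference source or the customer.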
As a result of the work on deduplication, pairs with a very high degree of similarity are identified for which there are differences in customer identification data. Determining whether the identification data set is correct, and which data set is the valid one for the customer, may consist of: 1) verification of the data set with the customer - as indicated earlier, this is not an attractive solution at enterprise scale, 2) checking the data against reference data sources.

2.2. Verification of identification data

For a large financial institution, mass verification of identification data seems to be an attractive solution. Access to state registers containing basic data describing the client makes it possible to verify whether the set of identification data is correct. Verification of identification data with the use of registry data allows errors in the data to be clearly indicated, and thus the data to be precisely improved, which in the case of financial institutions is extremely important.

State registers, such as 1) the population registration system and 2) the register of business activity records, should be treated as the reference source of data allowing for the verification of the correctness of the identification data held. Access to individual registers is regulated by law, not all entities can use them equally, and access may be payable. In the project conducted for a large financial institution, 1) the population register and 2) the register of economic activities were used.

2.2.1. Contents of the population register

The population register contains a number of personal data of citizens of a given country. In the case of the Polish register, it is about 30 items. The following are particularly useful for the verification of identification data:

• number ID,
• previous number ID (if changed),
• surname and first names,
• family name,
• the previous surnames and first names with the date of their change and the name of the office that made the change,
• names and surnames of parents (in the case of data change: date and name of the office that made the change),
• date and place of birth (in the case of data change: date and name of the office that made the change),
• country of birth,
• sex (in the case of data change: date and name of the office that made the change),
• series and number of the last ID card, its expiry date and the name of the office that issued the ID card,
• series and number of the last passport and its expiry date,
• date of death or the date the body was found, the number of the death certificate and the registry office which drew up the record.
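Verification against such a register can be sketched as an attribute-by-attribute comparison of the record held by the FI with the register entry. The field names below are hypothetical, since the actual register interface and schema are access-controlled and not disclosed:

```python
from dataclasses import dataclass

@dataclass
class IdRecord:
    # Hypothetical subset of identification attributes, for illustration only
    number_id: str   # natural key from the population register
    first_name: str
    surname: str

def normalize(s: str) -> str:
    # Collapse whitespace and ignore letter case before comparing
    return " ".join(s.split()).upper()

def verify(held: IdRecord, register: IdRecord) -> dict:
    """Compare the identification data held by the FI with the register entry,
    attribute by attribute; True means the attribute is confirmed."""
    fields = ("number_id", "first_name", "surname")
    return {f: normalize(getattr(held, f)) == normalize(getattr(register, f))
            for f in fields}

held = IdRecord("44051401359", "Jan", "KOWALSKI ")
register = IdRecord("44051401359", "Jan", "Kowalski")
print(verify(held, register))  # every attribute confirmed after normalization
```

The per-attribute result matters: a record confirmed on all attributes needs no customer contact, while a single mismatch points at exactly the attribute to correct.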
2.2.2. Contents of the register of economic activities

The register of economic activities contains a number of data concerning entities operating in a given country. In the case of the Polish register, it is about 60 items. Particularly useful for the verification of legal entities' identification data are:

• number ID,
• name,
• short name,
• date of creation,
• date of commencement of activities,
• registered office address,
• legal form,
• type of business,
• termination date.
On the basis of the indicated registers, the correctness of the identification data held by the financial institution was tested. In the case of natural persons, these were: 1) the number of the population registration system, 2) first name, 3) surname. For business entities, the following were examined: 1) business registration number, 2) name of the entity, 3) legal form of the business. In the case of legal entities, access to the data is wide, and it was possible to verify all entities subject to registration.

With regard to the verification results:

• records were marked where the set of identification data was correct,
• in the case of identified pairs of similar records, where one of the records was confirmed in the reference database, it was possible to decide to create a pair despite differences in identification data, e.g., a different value of one of the compared features,
• a limited set was designated that requires verification in contact with the customer.

The obtained results were verified by experts of the financial institution and proved that the applied cleaning method was adequate to the cleaning problem under consideration. On a representative sample of the records of natural persons from the created pairs, where there was one difference in the identification data, approx. 87% were confirmed to be correct based on the population register.

3. Cleaning of address data

Address data, right after customer identification data, constitute an important element of data in many enterprises, especially in financial institutions (mainly due to the numerous information obligations that FIs must fulfill in letter form). Application interfaces for entering addresses very often, for various reasons, do not have data validation mechanisms implemented. Failure to implement validation rules causes numerous errors in the data. The existence of validation rules does not free the system from problems related to the cleanliness of address data, as the names of towns and streets may change.

3.1. Address reference data

There are reference databases: dictionary systems describing the territorial division of a given country. Most often they are organized in the form of hierarchical dictionaries, from the largest (province) to the smallest (street) territorial unit. These dictionaries, combined with one of the similarity methods, can be used to detect errors in the address data. Due to possible abbreviations, renaming, and data errors, the use of territorial dictionaries for validation and improvement of addresses is difficult, especially when dealing with a large database of an institution with a long history and numerous system migrations.
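The dictionary-plus-similarity approach can be sketched as follows; the tiny town list and the similarity cutoff are illustrative assumptions, not the project's actual territorial dictionary or matching method:

```python
import difflib

# Toy fragment of a territorial dictionary (town level); real dictionaries,
# such as the Polish TERYT register, are far larger and hierarchical.
towns = ["Poznań", "Warszawa", "Wrocław", "Kraków", "Gdańsk"]

def standardize_town(raw: str, cutoff: float = 0.6):
    """Map a possibly misspelled town name onto the dictionary entry with the
    highest textual similarity, or None if nothing is close enough."""
    matches = difflib.get_close_matches(raw, towns, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(standardize_town("Poznan"))    # 'Poznań'  (missing diacritic)
print(standardize_town("Warzsawa"))  # 'Warszawa' (transposed letters)
print(standardize_town("Lisbon"))    # None (not in the dictionary)
```

The cutoff embodies exactly the difficulty named above: too low and abbreviations or renamed streets match the wrong entry, too high and legitimate variants go unmatched.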
3.2. Geocoder as a tool for standardization of address data

There are geocoder tools on the market that allow the correctness of an address to be verified efficiently. The most common result of address geocoding is a standardized record of the geocoded address, along with the geographic position (longitude, latitude) and the quality of the match. Geocoders work on the basis of text parsing mechanisms and similarity algorithms - hence the measure of matching, which shows how exactly the geocoded address matches the pattern. Territorial dictionaries are often used as the pattern.

It would seem that, since cleaned and standardized data are required for the deduplication process, a geocoder-type tool is an ideal solution for data cleansing. As mentioned before, the geocoder operates on the basis of text similarity testing methods that the geocoder supplier treats as a trade secret. The project uses a commercial solution provided by a supplier who has been developing the address base of the territorial area covered by the project, and its geocoding algorithms, for many years, providing solutions for business and individual customers. The tools used by the FI constitute a trade secret and cannot be disclosed. In addition, record similarity measures based on text comparison are often used to compare data in the deduplication process. When performing deduplication on geocoded data, one should be aware that the compared data were established, in the previous cleansing and standardization step, on the basis of some unknown measure of text similarity. Since the geocoder returns some measure of match and is not always able to match the correct address, it is questionable whether comparing records with a text similarity measure after geocoding is appropriate.

In the implemented project, we decided to use the address data without geocoding them, and to test their similarity with a text similarity measure. Thanks to this approach, the obtained result of comparing records in the deduplication process is not disturbed by the use of indirect, approximate search processes.
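If geocoded output were used downstream, address points could instead be compared numerically by geographic distance rather than by text similarity. A minimal sketch using the haversine formula, with illustrative coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points."""
    r = 6_371_000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Two geocoded variants of the same street address: a small distance suggests
# the same address point despite textual differences in the raw records.
d = haversine_m(52.4064, 16.9252, 52.4066, 16.9255)  # two nearby points in Poznań
print(f"{d:.0f} m")  # a few tens of metres apart
```

A distance threshold replaces the text similarity cutoff here; as discussed above, its reliability still depends on the undisclosed matching quality of the geocoding engine.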
4. Conclusion and future work

Based on a project implemented for a large financial institution:

• The possibility and usefulness of data from state registers for verifying data correctness has been positively verified.
• Access to the data contained in the population register is difficult, and it may turn out to be impossible for entities from outside the financial market or public services.
• Business registers are open, but there may be a fee to access them.
• Confirmation of identification data on the basis of state registers is possible for an entity operating on a national scale; international entities would require access to the state registers of various countries.
• Not all economic entities (some forms of activity) are included in the business records (they do not require registration).
• There is no confirmation of non-resident data in the population register database.
• The use of territorial dictionaries as a source of reference data requires building mechanisms based on comparing the similarity of the text. Due to the occurrence of abbreviations of names, errors in data, and renaming of address names, the usefulness of a solution built solely on the basis of reference data may not be satisfactory.
• The use of a geocoder for cleaning and standardizing address data seems to be justified if, in the subsequent steps, the data obtained as a result of geocoding do not take part in text similarity comparison. The availability of numerical data obtained as a result of geocoding allows for easy comparison of the obtained address points based on geographic position. The use of a geocoder in the preparation of data for deduplication requires further research, due to the lack of knowledge about the error of geocoding results related to the undisclosed text comparison methods used in the geocoding engines of different suppliers.

Acknowledgements. The work of Mariusz Sienkiewicz is supported by the Applied Doctorate Scholarship no. DWD/4/24/2020 from the Ministry of Education and Science, and additionally the project is supported by a grant from the National Center for Research and Development no. POIR.01.01.01-00-0287/19.

References

[1] X. Chu, Data cleaning, in: S. Sakr, A. Y. Zomaya (Eds.), Encyclopedia of Big Data Technologies, Springer, 2019. URL: https://doi.org/10.1007/978-3-319-63962-8_3-1. doi:10.1007/978-3-319-63962-8_3-1.
[2] E. K. Rezig, Data cleaning in the era of data science: Challenges and opportunities, in: 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings, www.cidrdb.org, 2021. URL: http://cidrdb.org/cidr2021/papers/cidr2021_abstract09.pdf.
[3] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data cleaning: Overview and emerging challenges, in: F. Özcan, G. Koutrika, S. Madden (Eds.), Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 2201–2206. URL: https://doi.org/10.1145/2882903.2912574. doi:10.1145/2882903.2912574.
[4] G. Y. Lee, L. Alzamil, B. Doskenov, A. Termehchy, A survey on data cleaning methods for improved machine learning model performance, CoRR abs/2109.07127 (2021). URL: https://arxiv.org/abs/2109.07127. arXiv:2109.07127.
[5] E. Rahm, H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull. 23 (2000) 3–13. URL: http://sites.computer.org/debull/A00DEC-CD.pdf.
[6] O. Azeroual, Data wrangling in database systems: Purging of dirty data, Data 5 (2020) 50. URL: https://doi.org/10.3390/data5020050. doi:10.3390/data5020050.
[7] M. A. Hernández, S. J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem, Data Min. Knowl. Discov. 2 (1998) 9–37. URL: https://doi.org/10.1023/A:1009761603038. doi:10.1023/A:1009761603038.
[8] The Polish Financial Supervision Authority, Recommendation D concerning the management of information technology areas and security of the ICT environment in banks, https://www.knf.gov.pl/knf/pl/komponenty/img/Rekomendacja_D_8_01_13_uchwala_7_33016.pdf, 2013.
[9] Official Journal of the European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016.
[10] A. Colyer, The morning paper on An overview of end-to-end entity resolution for big data, https://blog.acolyer.org/2020/12/14/entity-resolution/, 2020.
[11] A. Simitsis, P. Vassiliadis, T. K. Sellis, State-space optimization of ETL workflows, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 1404–1419.
[12] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Comput. Surv. 53 (2020) 31:1–31:42.
[13] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with JedAI, SIGMOD Record 48 (2019) 30–36.
[14] M. Sienkiewicz, R. Wrembel, Managing data in a big financial institution: Conclusions from a R&D project, in: C. Costa, E. Pitoura (Eds.), Proceedings of the Workshops of the EDBT/ICDT 2021 Joint Conference, Nicosia, Cyprus, March 23, 2021, volume 2841 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: http://ceur-ws.org/Vol-2841/DARLI-AP_9.pdf.