<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Reference Sources in Clearing Customer Data: Conclusions from a R&amp;D Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariusz Sienkiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The digitization and virtualization of many aspects of life pose a question for many organizations regarding customer identification. The problem is particularly important in the context of financial institutions (FIs), where customer identification is related to a number of aspects of the company's operation and of the products and services provided. Unambiguous customer identification is hindered by data errors, dirty data, and duplicate records describing the customer. It is estimated that 1% to approximately 5% of FI data are affected by errors. The scope of data collected by institutions about their clients is enormous and results from many needs. Each of these needs may require a different scope of data and expect a different level of quality. Regardless of the needs for data collection and processing, certain data are especially important: the data allowing for unambiguous customer identification. In this article, we pay special attention to this data set.</p>
      </abstract>
      <kwd-group>
        <kwd>data cleaning</kwd>
        <kwd>deduplication</kwd>
        <kwd>dictionary cleaning</kwd>
        <kwd>geocoding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The issue of data cleaning has been raised many times in the literature [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1, 2, 3, 4, 5, 6, 7</xref>
        ]. There are various suggestions for data error detection and deduplication, based e.g. on methods of text comparison, crowdsourcing, or classification. Based on the experience from a project implemented for a large financial institution, we present a concept of detecting and correcting customer identification data. This is the first article based on work carried out on a large (over 2 million records) database of real natural and legal persons.
      </p>
      <p>Due to the nature of their business, financial institutions pay special attention to the unambiguous and complete identification of clients. Customer identification is of great importance both for the FI collecting the data and providing products and services, and for the customers, e.g. due to the security of funds. Financial institutions and financial market regulators pay great attention to the quality of the collected and processed data [8, 9]. Many financial institutions have extensive data management and data quality management systems and use various techniques to detect and correct errors in the collected and processed data. These are mechanisms based on specialized software supporting the detection of defects and on customer service procedures focused on the correctness of data, which can be treated as crowdsourcing. Despite the mechanisms used, errors are identified in the data describing clients collected and processed by financial institutions. This is due to: 1) the long history of IT systems, 2) numerous system migrations, 3) acquisitions on the financial market, 4) human errors, 5) intentional actions (e.g. attempts to extort financial resources). The effectiveness of the procedures created for the sales force is limited and depends on many factors, and it is not the subject of this article.</p>
      <p>Clean and standardized data are required in many areas of data processing in financial institutions, including: 1) risk models, 2) security mechanisms, 3) offer and sales support models, 4) ML-based solutions, 5) data deduplication. The standard data deduplication pipeline [10, 11, 12, 13] assumes that the data delivered to the pipeline is cleaned (e.g. no null values, no spelling errors, unified hashes). Unfortunately, in real projects this assumption cannot be guaranteed, especially in the financial sector. There are typos, missing values, and inconsistent values in the attributes that store personal data, institution names, and addresses. Moreover, not all natural identifiers are reliable. It should also be taken into account that the financial market is largely regulated by law. Interpretation of the current legal regulations and the security practices applied by FIs limit the possibility of making changes to customer data, and thus the possibility of improving the data. In addition, the client, in accordance with the provisions of contracts concluded with financial institutions, is obliged to ensure that the data made available to the financial institution is up to date and correct. Despite the efforts made by financial institutions and the obligations imposed on clients, observations from a project carried out at a large financial institution show that errors in the data occur and constitute a significant obstacle for the organization.</p>
      <p>This article is a continuation of [14], focusing on error detection and the improvement of identification data. In particular, we present our experiences and conclusions from the use of reference data sources for error detection and cleaning of customer data (Section 2). We present conclusions regarding the cleaning and standardization of address data (Section 3). Final conclusions are drawn in Section 4. Note that this article presents the results of an actual research and development project, and therefore not all details may be disclosed, as they are treated as company know-how.</p>
      <sec id="sec-1-1">
        <title>2.2. Verification of identification data</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Customer identification data</title>
      <sec id="sec-2-1">
        <title>The possibility of mass verification of identification data</title>
        <p>for a large financial institution seems to be an attractive
The scope of data collected by financial institutions is solution. Access to state registers containing basic data
wide. A data collection vector describing a single cus- describing the client allows you to verify whether the set
tomer can include more than 1000 features. These can of identification data is correct. Verification of
identificabe contact, socioeconomic, behavioral data, e.g. product tion data with the use of registry data allows to clearly
use, transaction data, property ownership status, commu- indicate errors in the data and thus to precisely improve
nication channels used, etc. Data relating to individual the data, which in the case of financial institutions is
features may be dirty. Regardless of the length of the extremely important.
collected data vector, basic identification data are the State registers such as 1) population registration
sysmost important. Of course, the design of IT systems most tem, 2) register of business activity records should be
often ensures the existence of a unique artificial system treated as the reference source of data allowing for the
key that distinguishes records, but from the point of view verification of the correctness of the identification data
of a financial institution, it is important to identify all held. Access to individual registers is regulated by law
customer instances in order to consolidate knowledge and not all entities can use them equally, and access may
about it. be payable. In the project conducted for a large financial</p>
        <p>As a result of the project work, based on the knowledge institution, 1) the population register and 2) the register
of the financial institution’s experts, a small subset of of economic activities were used.
data was determined, which is particularly important
in identifying the client. The basic identification data
include: 1) natural key from the population or business 2.2.1. Contents of the population register
registration system, 2) name and surname or name of the
entity, 3) document ID, 4) legal form of the entity.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Detection of identification data errors</title>
          <p>Basic identification data errors can be detected using a
range of algorithms and tools, i.e. regular expressions,
patterns, dictionaries, calculation rules (standard data
cleaning mechanisms). However, legal and regulatory
constraints significantly limit the use of cleaning
mechanisms. Most often, modification is possible after
conifrming the correctness of these identification data with
the customer. Data requirements and availability change
over time. In a project implemented for a large financial
institution, the researched database of projects has over 2
million records. The entire production customer database
is much larger. Verifying the correctness of identification
data for the entire customer base is: 1) costly, 2)
burdened with image risk, 3) long-term, 4) burdened with
human error. Moreover, due to the lack of up-to-date
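        <p>As an illustration of these standard cleaning mechanisms, the sketch below combines a regular expression (format check) with a calculation rule (checksum) for the number from the Polish population registration system (PESEL). It is a minimal sketch of the technique, not the validation code used in the project.</p>
        <preformat>
import re

# Checksum weights of the Polish population register number (PESEL).
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_is_valid(pesel: str) -> bool:
    """Format check (regular expression) plus checksum calculation rule."""
    if re.fullmatch(r"\d{11}", pesel) is None:
        return False
    digits = [int(c) for c in pesel]
    s = sum(d * w for d, w in zip(digits, PESEL_WEIGHTS))
    # The 11th digit is a check digit derived from the first ten.
    return (10 - s % 10) % 10 == digits[10]
        </preformat>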
        <p>As a result of the work on deduplication, pairs of records with a very high degree of similarity are identified for which there are differences in the customer identification data. Determining whether an identification data set is correct, and which data set is valid for the customer, may consist of: 1) verification of the data set with the customer, which, as indicated earlier, is not an attractive solution at enterprise scale, 2) checking the data against reference data sources.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Verification of identification data</title>
        <p>The possibility of mass verification of identification data seems to be an attractive solution for a large financial institution. Access to state registers containing the basic data describing the client makes it possible to verify whether a set of identification data is correct. Verification of identification data with the use of registry data makes it possible to clearly indicate errors in the data, and thus to precisely improve the data, which in the case of financial institutions is extremely important. State registers such as 1) the population registration system and 2) the register of business activity should be treated as the reference sources of data for verifying the correctness of the identification data held. Access to individual registers is regulated by law: not all entities can use them equally, and access may be payable. In the project conducted for a large financial institution, 1) the population register and 2) the register of economic activities were used.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Contents of the population register</title>
          <p>The population register contains a number of personal data of the citizens of a given country. In the case of the Polish register, it is about 30 items. The following are particularly useful for the verification of identification data:
• ID number,
• previous ID number (if changed),
• surname and first names,
• family name,
• previous surnames and first names, with the date of their change and the name of the office that made the change,
• names and surnames of parents (in the case of a data change: the date and the name of the office that made the change),
• date and place of birth (in the case of a data change: the date and the name of the office that made the change),
• country of birth,
• sex (in the case of a data change: the date and the name of the office that made the change),
• series and number of the last ID card, its expiry date, and the name of the office that issued the ID card,
• series and number of the last passport and its expiry date,
• date of death or the date the body was found, the number of the death certificate, and the registry office which drew up the record.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Contents of the register of economic activities</title>
          <p>The register of economic activities contains a number of data concerning the entities operating in a given country. In the case of the Polish register, it is about 60 items. Particularly useful for the verification of legal entities' identification data are:
• ID number,
• name,
• short name,
• date of creation,
• date of commencement of activities,
• registered office address,
• legal form,
• type of business,
• termination date.</p>
          <p>On the basis of the indicated registers, the correctness of the identification data held by the financial institution was tested. In the case of natural persons, these were: 1) the number from the population registration system, 2) the first name, 3) the surname. For business entities, the following were examined: 1) the business registration number, 2) the name of the entity, 3) the legal form of the business. In the case of legal entities, access to the data is wide, and it was possible to verify all entities subject to registration.</p>
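          <p>A minimal sketch of such a register-based check is given below, under the assumption of an in-memory snapshot of the population register; the field names and statuses are illustrative and do not reflect the interfaces actually used in the project.</p>
          <preformat>
# Illustrative snapshot of the population register: ID number mapped to
# (first name, surname). The real register is accessed under legal constraints.
REGISTER = {
    "00000000000": ("JAN", "KOWALSKI"),
}

def verify_person(id_number: str, first_name: str, surname: str) -> str:
    entry = REGISTER.get(id_number)
    if entry is None:
        return "not_found"      # e.g. a non-resident (see Section 4)
    if entry == (first_name.strip().upper(), surname.strip().upper()):
        return "confirmed"
    return "mismatch"           # precise indication of an error in the data
          </preformat>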
          <p>With regard to the verification results:
• records where the set of identification data was correct were marked,
• in the case of identified pairs of similar records where one of the records was confirmed in the reference database, it was possible to decide to create a pair despite differences in the identification data, e.g. a different value of one of the compared features (see the sketch after this list),
• a limited set requiring verification in contact with the customer was designated.</p>
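          <p>The decision rule for similar pairs can be sketched as follows; the threshold and the status values are assumptions made for illustration, not the project's parameters.</p>
          <preformat>
def decide_pair(similarity: float, status_a: str, status_b: str) -> str:
    """Map a highly similar record pair to one of the outcomes above."""
    if similarity == 1.0:
        return "create_pair"                # identical identification data
    if similarity >= 0.95:                  # assumed similarity threshold
        if status_a == "confirmed" or status_b == "confirmed":
            return "create_pair"            # one side backed by the register
        return "verify_with_customer"       # the limited set for manual contact
    return "no_pair"
          </preformat>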
          <p>The obtained results were verified by experts of the financial institution and proved that the applied cleaning method was adequate to the cleaning problem under consideration. On a representative sample of the records of natural persons from the created pairs where there was one difference in the identification data, approx. 87% were confirmed to be correct based on the population register.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Cleaning of address data</title>
      <p>Address data, right after customer identification data, constitute an important element of data in many enterprises, especially in financial institutions (mainly due to the numerous information obligations that FIs must fulfill in letter form). Application interfaces for entering addresses very often, for various reasons, lack data validation mechanisms. The failure to implement validation rules causes numerous errors in the data. The existence of validation rules does not free the system from problems related to the cleanliness of address data, as the names of towns and streets may change.</p>
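      <p>As an example of a simple validation rule of this kind, the sketch below normalizes and checks a Polish postal code (the dd-ddd pattern); it is an assumed, illustrative rule rather than one taken from the project.</p>
      <preformat>
import re

# Polish postal codes follow the dd-ddd pattern, e.g. "61-138".
POSTAL_CODE = re.compile(r"\d{2}-\d{3}")

def normalize_postal_code(raw: str):
    """Return the normalized code, or None when the value cannot be repaired."""
    candidate = raw.strip().replace(" ", "")
    if len(candidate) == 5 and candidate.isdigit():
        candidate = candidate[:2] + "-" + candidate[2:]   # reinsert the dash
    return candidate if POSTAL_CODE.fullmatch(candidate) else None
      </preformat>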
      <sec id="sec-3-1">
        <title>3.1. Address reference data</title>
        <p>Reference databases exist for addresses. These are dictionary systems describing the territorial division of a given country. Most often they are organized in the form of hierarchical dictionaries, from the largest (province) to the smallest (street) territorial unit. These dictionaries, combined with one of the similarity methods, can be used to detect errors in address data. Due to possible abbreviations, renaming, and data errors, the use of territorial dictionaries for the validation and improvement of addresses is difficult, especially when we are dealing with a large database of an institution with a long history and numerous system migrations.</p>
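        <p>A minimal sketch of such a dictionary lookup with a text similarity method is given below, using a toy list of towns; a real territorial dictionary (e.g. the Polish TERYT register) is hierarchical and far larger.</p>
        <preformat>
from difflib import get_close_matches

# Toy fragment of a territorial dictionary (town level only).
TOWNS = ["POZNAŃ", "WARSZAWA", "WROCŁAW", "GDAŃSK", "KRAKÓW"]

def match_town(dirty_name: str, cutoff: float = 0.8):
    """Return the closest dictionary entry, or None below the cutoff."""
    hits = get_close_matches(dirty_name.strip().upper(), TOWNS, n=1, cutoff=cutoff)
    return hits[0] if hits else None

match_town("Poznan")   # returns "POZNAŃ" despite the missing diacritic
        </preformat>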
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Geocoder as a tool for standardization of address data</title>
        <p>There are geocoder tools on the market that allow the correctness of an address to be verified efficiently. The most common result of address geocoding is a standardized record of the geocoded address, along with the geographic position (longitude, latitude) and the quality of the match. Geocoders work on the basis of text parsing mechanisms and similarity algorithms, hence the match measure, which shows how exactly the geocoded address matches the pattern. Territorial dictionaries are often used as the pattern.</p>
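        <p>The shape of a typical geocoder answer, as described above, can be sketched as follows; the field names and the acceptance threshold are illustrative assumptions, not a particular supplier's API.</p>
        <preformat>
from dataclasses import dataclass

@dataclass
class GeocodeResult:
    standardized_address: str   # the pattern the input was matched to
    longitude: float
    latitude: float
    match_score: float          # quality of match, 0.0 to 1.0

def accept(result: GeocodeResult, threshold: float = 0.9) -> bool:
    # Low-scoring matches go back to manual or dictionary-based cleaning.
    return result.match_score >= threshold
        </preformat>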
        <p>It would seem that, since cleaned and standardized data are required for the deduplication process, a geocoder-type tool is an ideal solution for data cleansing. As mentioned before, a geocoder operates on the basis of text similarity testing methods that the geocoder supplier treats as a trade secret. The project used a commercial solution from a supplier who has for many years been developing both the address base of the territorial area covered by the project and the geocoding algorithms, providing solutions for business and individual customers. The tools used by the FI constitute a trade secret and cannot be disclosed. In addition, record similarity measures based on text comparison are often used to compare data in the deduplication process. When performing deduplication on geocoded data, one should be aware that the compared data was established in the previous cleansing and standardization step on the basis of some unknown measure of text similarity. Since the geocoder returns some measure of match and is not always able to match the correct address, it is questionable whether comparing records with a text similarity measure after geocoding is appropriate.</p>
        <p>In the implemented project, we decided to use the address data without geocoding, testing their similarity with a text similarity measure. Thanks to this approach, the obtained result of comparing records in the deduplication process is not disturbed by the use of intermediate, approximate search processes.</p>
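        <p>A transparent text similarity measure applied directly to the raw addresses can be sketched as follows; the measure actually used in the project is not disclosed, so the standard-library ratio below serves only as an example of the approach.</p>
        <preformat>
from difflib import SequenceMatcher

def address_similarity(addr_a: str, addr_b: str) -> float:
    """Compare two raw (non-geocoded) address strings."""
    a = addr_a.strip().upper()
    b = addr_b.strip().upper()
    return SequenceMatcher(None, a, b).ratio()

# Scores approx. 0.89 despite punctuation and diacritic differences:
address_similarity("ul. Polna 3, Poznań", "ul Polna 3 Poznan")
        </preformat>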
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future work</title>
      <p>Based on a project implemented for a large financial institution:
• The possibility and usefulness of data from state registers for verifying data correctness has been positively verified.
• Access to the data contained in the population register is difficult, and it may turn out to be impossible for entities from outside the financial market or public services.
• Business registers are open, but there may be a fee to access them.
• Confirmation of identification data on the basis of state registers is possible for an entity operating on a national scale; international entities would require access to the state registers of various countries.
• Not all economic entities (some forms of activity) are included in the business records (they do not require registration).
• There is no confirmation of non-resident data in the population register database.
• The use of territorial dictionaries as a source of reference data requires building mechanisms based on comparing the similarity of text. Due to the occurrence of abbreviations of names, errors in data, and the renaming of addresses, the usefulness of a solution built solely on the basis of reference data may not be satisfactory.
• The use of a geocoder for cleaning and standardizing address data seems to be justified if, in subsequent steps, the data obtained as a result of geocoding does not take part in text similarity comparisons. The availability of numerical data obtained as a result of geocoding allows for an easy comparison of the obtained address points based on geographic position (see the sketch after this list). The use of a geocoder in the preparation of data for deduplication requires further research, due to the lack of knowledge about the error of geocoding results related to the undisclosed methods of text comparison used in the geocoding engines of different suppliers.</p>
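      <p>As mentioned in the last point, geocoded records can be compared by geographic position instead of by text; a minimal sketch with an assumed distance tolerance is given below.</p>
      <preformat>
from math import asin, cos, radians, sin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two geocoded points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2.0 * 6371.0 * asin(sqrt(a))   # mean Earth radius approx. 6371 km

def same_address_point(p1, p2, tolerance_km=0.05):
    """p1, p2: (longitude, latitude) pairs; tolerance is an assumed value."""
    return tolerance_km >= haversine_km(p1[0], p1[1], p2[0], p2[1])
      </preformat>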
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>The work of Mariusz Sienkiewicz is supported by the Applied Doctorate Scholarship no. DWD/4/24/2020 from the Ministry of Education and Science. Additionally, the project is supported by a grant from the National Center for Research and Development no. POIR.01.01.01-00-0287/19.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <article-title>Data cleaning</article-title>
          , in: S. Sakr, A. Y. Zomaya (Eds.),
          <source>Encyclopedia of Big Data Technologies</source>
          , Springer,
          <year>2019</year>
          . URL: https://doi. org/10.1007/978-3-
          <fpage>319</fpage>
          -63962-
          <issue>8</issue>
          _
          <fpage>3</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>319</fpage>
          -63962-8\_
          <fpage>3</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Rezig</surname>
          </string-name>
          ,
          <article-title>Data cleaning in the era of data science: Challenges and opportunities</article-title>
          ,
          <source>in: 11th Conference on Innovative Data Systems Research, CIDR</source>
          <year>2021</year>
          ,
          <string-name>
            <given-names>Virtual</given-names>
            <surname>Event</surname>
          </string-name>
          ,
          <source>January 11-15</source>
          ,
          <year>2021</year>
          ,
          <string-name>
            <given-names>Online</given-names>
            <surname>Proceedings</surname>
          </string-name>
          , www.cidrdb.org,
          <year>2021</year>
          . URL: http://cidrdb.org/ cidr2021/papers/cidr2021_abstract09.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Data cleaning: Overview and emerging challenges</article-title>
          , in: F. Özcan, G. Koutrika, S. Madden (Eds.),
          <source>Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2016</year>
          , San Francisco, CA, USA, June 26 - July 01,
          <year>2016</year>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>2201</fpage>
          -
          <lpage>2206</lpage>
          . URL: https://doi.org/10.1145/2882903. 2912574. doi:
          <volume>10</volume>
          .1145/2882903.2912574.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzamil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Doskenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Termehchy</surname>
          </string-name>
          ,
          <article-title>A survey on data cleaning methods for improved machine learning model performance</article-title>
          ,
          <source>CoRR abs/2109</source>
          .07127 (
          <year>2021</year>
          ). URL: https://arxiv.org/ abs/2109.07127. arXiv:
          <volume>2109</volume>
          .
          <fpage>07127</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <article-title>Data cleaning: Problems and current approaches</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>23</volume>
          (
          <year>2000</year>
          )
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
          . URL: http://sites.computer.org/debull/ A00DEC-CD.
          <year>pdf</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Azeroual</surname>
          </string-name>
          ,
          <article-title>Data wrangling in database systems: Purging of dirty data</article-title>
          ,
          <source>Data</source>
          <volume>5</volume>
          (
          <year>2020</year>
          )
          <article-title>50</article-title>
          . URL: https://doi.org/10.3390/data5020050. doi:
          <volume>10</volume>
          .3390/ data5020050.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>