<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Managing Data in a Big Financial Institution: Conclusions from a R&amp;D Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariusz Sienkiewicz</string-name>
          <email>mariusz.sienkiewicz@doctorate.put.poznan.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Wrembel</string-name>
          <email>robert.wrembel@cs.put.poznan.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Financial institutions (FIs) use state-of-the-art data management and data engineering solutions to support their day-to-day business. They follow strict data governance policies as well as national and international regulations. In spite of these facts, the quality of their data is not perfect. Experts in the field estimate that from 1% to about 5% of the data owned by financial institutions are dirty. Typically, FIs include in their IT architectures from dozens to a few hundred data sources that are integrated in multiple data warehouses. Such complex architectures generate substantial monetary costs and are difficult to manage. FIs compete in the financial services market. One way to gain a competitive advantage is to apply the latest technologies for the purpose of data management and shortening the software development cycle. A promising direction is to migrate on-premise infrastructures into private, public, or private-public cloud architectures. In this paper we present our experience from preparing and running a project for a big financial institution in Poland. The project is run in two stages: (1) building a central repository of customers data and (2) developing a data lake architecture in a private-public cloud.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Financial institutions (FIs) use state-of-the-art data
management and data engineering solutions to store and process their
data. They apply strict data governance policies, defined by
national (e.g., the Financial Supervision Commission in Poland) and
international (e.g., the European Banking Authority) financial regulatory
authorities. User applications, before being deployed in
operational IT architectures, undergo thorough testing. FIs strive to
gain a competitive advantage by constantly providing new
products and services, necessitating the use of the latest technologies
and shortening the software development process. Despite the
care for the quality of the produced software and data governance
processes, the quality of data in financial databases is not perfect.
Experts in the field estimate that from 1% to about 5% of the data
owned by FIs are dirty, mainly containing missing or erroneous values,
duplicates, and outdated entries.</p>
      <p>Duplicates mainly concern customers data, for the following
reasons. First, banks buy other banks, together with their information
systems and the data stored there. Second, some banking products
(e.g., a checking account and a stockbroker account) require a
separate customer instance in a system for each product, even
if the real customer is the same. Third, the imperfection of the
software and processes used in data governance allows creating
separate customer instances in an information system for the
same physical customer even when it is not necessary. Fourth, FIs
often function in capital groups, in which individual entities have
their own customer databases. In order to manage the
relationship with a customer at the level of a capital group, deduplication
of customer data is necessary. It is worth mentioning that core
customers data are related to other data, like contact addresses.
This way, duplicated customers data cause duplicates of the related
data.</p>
      <p>Outdated data are the second issue impacting the quality of data
in a FI. This problem concerns customers' last names (typically
changed after getting married), postal addresses
(changed after moving to another location), phone numbers, and
email addresses, to name the most typical cases.</p>
      <p>Duplicated and outdated data cause economic losses and
deteriorate the reputation of a FI. Thus, clean data are necessary for
efficiently conducting a business and for the proper functioning of
artificial intelligence technologies (prediction models, natural
language processing, chatbots), which are becoming increasingly
important in sales and service processes.</p>
      <p>The data and IT architectures of a FI must ensure
unambiguous identification of customers, meet security
requirements and risk management needs, and meet the requirements imposed by
institutions regulating the market, including counteracting
terrorist financing and money laundering. Another important aspect
influencing an IT architecture and data governance policies in a
FI is virtualization and service automation. The above aspects, on
the one hand, require correct, up-to-date data, but on the other
hand, the security policies adopted by a FI limit the possibility of
entering or modifying data via remote channels.</p>
      <p>
        Typically, FIs include in their IT architectures from dozens to
hundreds of data sources (DSs). Efficient processing of data in
different database structures, distributed among multiple DSs,
requires the application of an integration architecture. An industry
standard is a data warehouse (DW) architecture. In this
architecture, DSs are integrated in a central repository (a data warehouse)
by so-called Extract-Transform-Load (ETL) processes or their
alternative ELT variants (a.k.a. data processing workflows, data
processing pipelines, or data wrangling [
        <xref ref-type="bibr" rid="ref18 ref29">18, 29</xref>
        ]).
      </p>
      <p>
        An ETL process first extracts data of interest from multiple
DSs. Second, it transforms, cleans, and homogenizes the data.
Finally, it loads the data into a DW. The ELT alternative extracts
data from DSs, loads them in their original formats into an
intermediate storage (called an operational data store, data stage, or
staging area), and then transforms the data and loads them into
a DW. Traditional IT architectures are built from multiple
stand-alone servers and/or data warehouse appliances (e.g., IBM
PureData for Analytics (Netezza), Oracle Exadata, SAP HANA,
Teradata). ETL processes are run by dedicated engines, e.g., [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
typically deployed on dedicated hardware.
      </p>
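      <p>For illustration, below is a minimal sketch of the ETL pattern described above, in Python, with plain functions standing in for a dedicated ETL engine; the table and column names are hypothetical.</p>
      <preformat>
# A minimal ETL sketch (assumed schema: customers(id, name, city)).
import sqlite3

def extract(conn):
    # Extract data of interest from a source DS.
    return conn.execute("SELECT id, name, city FROM customers").fetchall()

def transform(rows):
    # Clean and homogenize: trim whitespace, normalize letter case.
    return [(i, name.strip().title(), city.strip().title())
            for i, name, city in rows]

def load(conn, rows):
    # Load the cleaned data into the target DW table.
    conn.executemany("INSERT INTO dw_customers VALUES (?, ?, ?)", rows)
    conn.commit()

# In the ELT variant, raw rows would first be loaded into a staging
# area and the transform step would run inside the target system.
      </preformat>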
      <p>
        Such complex architectures are difficult to manage from a
technological point of view and generate substantial monetary costs.
In this context, the following two business trends can be observed.
• Leading FIs initiate projects aiming at transforming their
on-premise infrastructures into novel architectures - either
hybrid or cloud. A hybrid architecture includes on-premise
databases integrated with databases in a cloud (either
private or private-public). Data in both on-premise databases
and in a cloud can be accessed and dynamically integrated
via a dedicated software layer (with the functionality of a
mediated architecture and ETL processes). A pure cloud
architecture assumes that all on-premise databases have
been migrated into a cloud eco-system. In this eco-system,
cloud data warehouses are built as well, based on dedicated
systems (e.g., Amazon Redshift, Snowflake, Presto, Google
BigQuery, Azure Synapse Analytics, Oracle Autonomous
Data Warehouse, IBM Db2 Warehouse on Cloud) [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ].
• FIs are building cloud repositories of heterogeneous data,
typically ingested from sources external to a company,
including among others open data published by (local)
governments, professional portals, and social media. The data
are stored in the repository in their original formats. Such
a repository is typically called a data lake (DL) [
        <xref ref-type="bibr" rid="ref23 ref38">23, 38</xref>
        ].
Further, these data are unified either on-the-fly in the
so-called logical data warehouse (LDW) [
        <xref ref-type="bibr" rid="ref11 ref20 ref30">11, 20, 30</xref>
        ] or
are homogenized and uploaded into a physical cloud data
warehouse (CDW) [
        <xref ref-type="bibr" rid="ref16 ref31 ref37">16, 31, 37</xref>
        ]. Another possible
architecture to store and query heterogeneous data is a polystore.
In this architecture, a few physical data repositories (each
of which stores unified data represented in the same data
model) are virtually integrated and queried by means of a
mediated architecture [
        <xref ref-type="bibr" rid="ref10 ref17 ref41 ref5">5, 10, 17, 41</xref>
        ].
      </p>
      <p>Cloud technologies are inevitable in the financial sector as well,
and their usage is already supported by
national financial regulatory authorities. For example, in January 2020,
the Financial Supervision Commission in Poland approved a
document with guidelines and recommendations for deploying cloud
services in FIs, thus giving a green light to IT projects based on
cloud technologies in the financial sector. In particular,
Recommendation D.10.6 Cooperation with external service providers (by
the Financial Supervision Commission) defines five requirements
that an external cloud provider must fulfill to be able to offer
services to a financial institution.</p>
      <p>In this paper, we present our initial experience and challenges
in launching a project for a big financial institution in Poland
(for this publication, we are not authorized to reveal the name
of the institution). Since the project has just started, it is too
early to present solutions to the challenges mentioned in this
paper. The project is divided into two stages. The first one aims
at building a Central Repository of Customers Data, integrated
from several data sources (cf. Section 2). The second stage aims at
building a Cloud Data Repository architecture in the Polish National
Cloud (https://chmurakrajowa.pl/en/), a private-public cloud operated
by Microsoft and Google. Next, a few on-premise databases will be
migrated into the repository (cf. Section 3).</p>
    </sec>
    <sec id="sec-2">
      <title>STAGE 1: BUILDING CENTRAL</title>
    </sec>
    <sec id="sec-3">
      <title>REPOSITORY OF CUSTOMERS DATA</title>
      <p>In this stage, data about customers and data related to customers,
coming from multiple data sources, are integrated into the Central
Repository of Customers Data (CRCD).
The overall technical architecture being built in this stage is
shown in Figure 2.2. It is a standard data warehouse architecture,
where goals G1-G4 are implemented by means of ETL processes.
Source customers data and related data are stored in relational
databases, denoted as 1-6; the CRCD is also a relational
database (Oracle DBMS). In order to support data cleaning and
standardization, ETL processes use open data sources (denoted
as openDS) and paid data sources (denoted as paidDS). Both types
of DSs are made available by the public administration. These
sources include reference data on, among others: citizen unique
IDs, the administrative division of the country, zip codes, and
companies.</p>
      <p>An important functionality of the ETL layer is the data
deduplication pipeline (DDP). The DDP realizes goal G5. Having profiled
the customers data in the available source systems, we conclude
that the DDP cannot be fully automated and that there are cases
where expert knowledge is required to support the process;
however, the DDP is designed to minimize the number of rows requiring
manual work.</p>
      <p>After being processed by the ETL processes, customers and related
data are uploaded into the CRCD. The usage of the CRCD is
twofold. First, its content is to be analyzed by data mining
algorithms in order to discover models for data aging (goal G6).
The overall idea behind this component is to be able to discover
classes of data and their properties that share similar (or identical)
aging characteristics. Second, the CRCD will become a source of
truth for the whole data infrastructure of the FI; thus it will be
accessed by other internal systems of the FI.</p>
    </sec>
    <sec id="sec-4">
      <title>Challenges</title>
      <p>While designing the aforementioned architecture, even though it
is a standard one, we encountered a few issues. The most
challenging ones include: (1) designing a data deduplication pipeline
and (2) developing ML models for predicting data aging.</p>
      <p>
        2.3.1 Designing a data deduplication pipeline. An important
and challenging task in the ETL layer is the data deduplication pipeline.
The state-of-the-art pipeline [
        <xref ref-type="bibr" rid="ref26 ref8">8, 26</xref>
        ] is shown in Figure 2.3.1. It is
composed of four main tasks, namely blocking, block processing,
entity matching, and entity clustering. Blocking divides records
into groups (blocks) such that records likely to represent duplicates
fall into the same block, with the final goal of reducing the number
of records that need to be compared. In this DDP, records are
compared only within the blocks they belong to, which can be done
in parallel. Block processing reorganizes records in blocks, in order
to further reduce the number of needed comparisons. Entity matching
consists in computing similarity measures between records. Finally,
in the entity clustering step, matching records (i.e., those whose
similarity measures, e.g., [
        <xref ref-type="bibr" rid="ref24 ref42">24, 42</xref>
        ] exceed a given threshold) are
grouped together and merged.
      </p>
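      <p>To make the four steps concrete, below is a minimal, illustrative Python sketch of the DDP; the blocking key, similarity measure, and threshold are hypothetical simplifications of the dedicated algorithms surveyed in the cited works.</p>
      <preformat>
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Anna Kowalska", "city": "Poznan"},
    {"id": 2, "name": "Ana Kowalska", "city": "Poznan"},
    {"id": 3, "name": "Jan Nowak", "city": "Warszawa"},
]

# 1) Blocking: group records by a cheap key (here: city).
blocks = {}
for r in records:
    blocks.setdefault(r["city"], []).append(r)

# 2) Block processing: discard blocks that cannot contain duplicates.
blocks = {k: v for k, v in blocks.items() if len(v) > 1}

# 3) Entity matching: pairwise similarity within each block only.
def similarity(a, b):
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

matches = [(a["id"], b["id"])
           for block in blocks.values()
           for a, b in combinations(block, 2)
           if similarity(a, b) > 0.8]      # threshold is tunable

# 4) Entity clustering: naive union-find over the matches; records in
# one cluster would then be merged into a single entity.
parent = {r["id"]: r["id"] for r in records}
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x
for a, b in matches:
    parent[find(a)] = find(b)

clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["id"])
print(clusters)   # {2: [1, 2], 3: [3]}
      </preformat>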
      <p>
        Even though research on deduplication has been conducted for
decades, and multiple algorithms have been developed for each
step of the DDP, the whole pipeline has to be constructed
manually. The main problem lies in selecting the most adequate
algorithm for each of the four tasks, so that the pipeline
maximizes precision and recall. To fully solve this problem, one would
need to apply a brute-force approach of testing the results of all
possible combinations of these algorithms. However, this is not feasible,
as it is an optimization problem of exponential complexity.
For example, in block building there are at least 14 popular
algorithms, in block processing there are at least 18 algorithms, in
entity matching there are at least 20 algorithms, and in entity
clustering there are at least 7 algorithms [
        <xref ref-type="bibr" rid="ref25 ref26 ref27 ref9">9, 25–27</xref>
        ], resulting in
at least 35280 combinations.
      </p>
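      <p>The size of this search space follows directly from the per-step counts; a short sketch (with evaluate_pipeline as a hypothetical scoring function) makes the infeasibility tangible:</p>
      <preformat>
from itertools import product

# One algorithm per step, counts as cited above.
blocking, block_processing, matching, clustering = 14, 18, 20, 7
print(blocking * block_processing * matching * clustering)  # 35280

# Exhaustive evaluation would require scoring every pipeline:
# for combo in product(range(14), range(18), range(20), range(7)):
#     precision, recall = evaluate_pipeline(combo, labeled_sample)
      </preformat>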
      <p>The algorithms in the DDP assume that the processed data
have been cleaned and standardized earlier. In the case of
customer data, such cleaning and standardization is not possible for
first and last names, as there are multiple variants of standard
names. For example, names like 'Anna' and 'Ana' may be treated
as being the same, assuming that a typo was made, or as being
two different names; a non-Latin name may be transliterated into
the Latin alphabet in multiple ways.</p>
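      <p>A short sketch of why such cases are ambiguous for similarity-based matching (the threshold is illustrative):</p>
      <preformat>
from difflib import SequenceMatcher

print(SequenceMatcher(None, "Anna", "Ana").ratio())  # ~0.857
# With a 0.8 merge threshold, 'Anna' and 'Ana' are treated as
# duplicates: correct if 'Ana' is a typo, wrong if they are two
# genuinely different names.
      </preformat>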
      <p>Once the entity clustering step has identified records
representing the same customer, a reference customer record needs
to be built based on these records. If, however, the records
differ in their address values, a problem arises in figuring out
which address is the current one. Typically, changes in data are
timestamped, and in such a case the most recently timestamped
address and other contact data may be used as the current ones.
Unfortunately, in reality, the most recent timestamp may
be old. In such a case, it is probable that some data have
changed since then. This observation led us to the conclusion
that data aging models could help solve this problem as well (cf.
Section 2.3.2).</p>
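      <p>A minimal sketch of the timestamp-based survivorship rule described above (field names are hypothetical):</p>
      <preformat>
from datetime import date

# A cluster of records identified as the same customer.
cluster = [
    {"address": "Polna 1, Poznan", "updated": date(2015, 3, 1)},
    {"address": "Lesna 5, Poznan", "updated": date(2019, 7, 12)},
]

# Pick the most recently timestamped record as the reference one.
reference = max(cluster, key=lambda r: r["updated"])
print(reference["address"])   # Lesna 5, Poznan
# Caveat from the text: even the newest timestamp may be old, which
# is where a data aging model could estimate the staleness risk.
      </preformat>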
      <p>To conclude, having analyzed the state of the art
in designing DDPs, we observe that, to the best of our knowledge,
an automatic approach to constructing an end-to-end data
deduplication pipeline has not been proposed yet for traditional
record-like data. The problem gets even more difficult for the
deduplication of big data, mainly because big data are
represented in a plethora of different formats. Before applying the
DDP, all these data must be unified into the same format, which
is a challenge in itself. We encountered this problem in the second
stage of our project (cf. Section 3).</p>
      <p>2.3.2 Developing ML models for data aging. An intrinsic
feature of some types of data is their aging. Four main types of
such data are of special interest to FIs, namely: (1) customers'
last names, (2) customers' identification documents, (3) postal
addresses, and (4) contact addresses (phone numbers, emails). Such
data become outdated mainly as the result of: last names
changing after marriage or divorce, identification documents reaching
their expiry dates, customers moving to other locations,
as well as changes of phone numbers and email addresses.</p>
      <p>Outdated data decrease the reliability of data and cause financial
losses. For these reasons, FIs so far have been using analog
methods to keep customers data up to date (e.g., checking data upon
a customer's arrival at a FI branch, checking data via delivered and
undelivered mailings, calling a customer to verify her/his data).
Moreover, FIs strive to ensure the highest possible level of
customer self-service by means of remote services via the Internet,
which results in limited contact between employees of a branch
network and customers. Therefore, the analog methods of
verifying the correctness and timeliness of customer data either do not
work or are inefficient in terms of costs, speed, and the amount
of data that can be updated.</p>
      <p>
        For this reason, in this project we aim at developing data
aging models based on ML techniques. We assume that dedicated
models will be built at least for a few age groups of customers. To
the best of our knowledge, such models have not been proposed
yet in the context addressed by our project. The only approach addressing a
related problem is [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ], but in the context of temporal data. Other
approaches address the problem of moving cold data from a hot
to a cold storage, e.g., [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] or to an external storage, e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>At this stage of the project development, we plan to start
experimenting with classification for building data aging models.
It is very likely that data imbalance will require the application
of other ML techniques as well, like neural networks.</p>
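      <p>A hedged sketch of such a first experiment: a binary classifier predicting whether an attribute (e.g., a postal address) is outdated, on synthetic placeholder features, with class weighting as one standard response to data imbalance:</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: customer age group, years since last update,
# number of products, remote-channel activity score.
X = rng.random((1000, 4))
y = (X[:, 1] > 0.9).astype(int)   # imbalanced: ~10% outdated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)
model = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
      </preformat>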
    </sec>
    <sec id="sec-5">
      <title>STAGE 2: BUILDING CLOUD DATA</title>
    </sec>
    <sec id="sec-6">
      <title>REPOSITORY</title>
      <p>In this stage, the so-called Cloud Data Repository (CDR) is built
(notice that such a repository has the features of a data lake). The
CDR will include: (1) the CRCD developed in Stage 1 (cf. Section 2), (2)
the data sources storing data related to customers records, and (3)
external data sources augmenting views on customers, to achieve
the functionality of customer 360. This stage pursues the following
goals:</p>
      <p>G7 - to build the CDR for storing data from selected internal
company DSs;
G8 - to augment the view on customers with data from external DSs
(professional portals and social portals);
G9 - to design data retention models in a cloud eco-system;
G10 - to develop a data governance method in a cloud
eco-system; the method must be compliant with national and
international regulations in the financial sector;
G11 - to develop an end-to-end method for designing and
deploying a data repository in a cloud eco-system.</p>
    </sec>
    <sec id="sec-7">
      <title>Architecture</title>
      <p>The aforementioned goals will be achieved in the architecture
that we have designed and will be implementing. The architecture
is shown in Figure 3.2. It is a hybrid cloud eco-system, where
some DSs are stored in an on-premise architecture (the internal
company DSs and the CRCD) and some DSs are stored in a
private-public cloud eco-system.</p>
      <p>The core component of the hybrid cloud eco-system is the
Cloud Data Repository. It includes:
• the internal company DSs and the CRCD; these DSs will be
migrated from the on-premise architecture into the CDR;
the sources are relational databases
and they store customers and related data; notice that the
CRCD developed in the first stage is also integrated into
the CDR;
• three external DSs: the first stores data ingested from open
data provided by the public administration; the second stores
data ingested from commercial (paid) repositories provided by
the public administration; the third stores data ingested from
professional and social media sources on the Internet as well as
internal customer behavioral data; all three external DSs store
data related to customers (both individuals and companies);
these DSs are non-relational, typically HTML, XML, JSON, and RDF;
• a data warehouse, denoted as DW, integrating customers
and related data from: (1) the internal company DSs, (2) the CRCD, and
(3) the external DSs, cleaned, deduplicated, and unified
into a common data format; the content of the DW will
become the source of truth about customers for the
on-premise databases in the FI.</p>
      <p>Data from DSs in the Cloud Data Repository will be integrated
into the DW by means of ETL/ELT processes implemented in
the cloud eco-system. The DW will be periodically refreshed
with customers data from the CRCD by means of ETL processes
implemented in Informatica. Since the CRCD is replicated into the
DW, the CRCD will be maintained in the on-premise architecture
only until the CDR is fully operational. After achieving a fully
operational architecture, the CRCD will be replaced by the DW
in the CDR.
</p>
    </sec>
    <sec id="sec-8">
      <title>Challenges</title>
      <p>While designing the aforementioned architecture, we encountered
the following challenges: (1) designing efficient ETL/ELT
processes, (2) handling the evolution of data sources, (3) designing
logical and physical data lake schemas for storing unstructured
data, and (4) provisioning optimal cloud resources.</p>
      <p>
        3.3.1 Designing efficient ETL processes. Typically, in a
standard DW architecture, all ETL processes must finish within a
given time window, usually within a few hours, to make a DW
available for analytics. Therefore, assuring efficient execution
of ETL processes (typically measured by throughput and
overall execution time) is challenging [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In a DL architecture, this
task is much more challenging for two main reasons. First,
ETL processes in a DL architecture ingest and process larger data
volumes than in a standard DW architecture. Second, the variety
and complexity of the data models and formats that ETL processes
must handle is much greater in a DL architecture.
      </p>
      <p>
        In practice, the performance of an ETL process may be
improved by: (1) scaling up or out the hardware on which the process
is run, (2) running the process in parallel, and (3) reordering
its tasks [
        <xref ref-type="bibr" rid="ref19 ref34">19, 34</xref>
        ]. Existing commercial ETL/ELT tools provide
means of parallelizing ETL tasks, but it is the responsibility of
an ETL developer to select tasks for parallelization and to apply
an appropriate parallelization skeleton. Reordering ETL tasks
may reduce the execution time of an ETL process, but finding
a reordering that would yield the shortest execution time is of
exponential complexity [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. Again, neither commercial nor free
ETL engines offer any means for automatic optimization
of ETL processes by task reordering, with the exception of IBM
InfoSphere and Informatica PowerCenter, which allow moving
some tasks into a data source to be executed there [
        <xref ref-type="bibr" rid="ref14 ref21">14, 21</xref>
        ].
      </p>
      <p>To sum up, the performance of an ETL process largely depends
on the manual orchestration and parallelization of tasks within an
ETL process by an ETL developer.</p>
      <p>
        At this stage of the project, we plan to apply a known heuristic
of placing the most restrictive tasks, i.e., those which filter data, at
the beginning of an ETL process, in order to reduce the data volume
processed by the ETL process as soon as possible. Another
promising direction would be to allocate adequate cloud resources to
run a given ETL process under a monetary or time budget
constraint, in the spirit of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
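      <p>A sketch of the filter-first heuristic on a toy task list; selectivities are assumed to be known (e.g., from data profiling), and reordering is only valid for tasks whose swap preserves the process semantics:</p>
      <preformat>
tasks = [
    {"name": "enrich_with_geo", "kind": "transform", "selectivity": 1.0},
    {"name": "drop_inactive", "kind": "filter", "selectivity": 0.4},
    {"name": "standardize_names", "kind": "transform", "selectivity": 1.0},
    {"name": "drop_test_records", "kind": "filter", "selectivity": 0.95},
]

# Stable sort: filters first, most selective (smallest output) first.
ordered = sorted(tasks,
                 key=lambda t: (t["kind"] != "filter", t["selectivity"]))
print([t["name"] for t in ordered])
# ['drop_inactive', 'drop_test_records',
#  'enrich_with_geo', 'standardize_names']
      </preformat>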
      <p>
        3.3.2 Handling evolution of DSs. An intrinsic feature of
internal company DSs and external DSs is the evolution of their
structures (schemas) in time [
        <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
        ]. Internet data sources evolve
much more frequently than internal company DSs. Such evolution has
an impact on the data integration layer, i.e., ETL processes ingesting
data from evolving DSs stop working and report errors.
      </p>
      <p>As part of the preparation for the project described in this
paper, we ran a pilot micro-project aiming at creating a
micro-DL with data ingested from Internet data sources. The
micro-DL stored: (1) data about companies being customers of the
FI and (2) descriptions of financial products offered by banks in
Poland. To this end, we integrated: (1) a few professional portals
providing data about companies (including LinkedIn, Glassdoor,
GoldenLine) and (2) several portals of banks offering services in
Poland, to compare their product offers.</p>
      <p>
        We applied an ELT architecture in two alternative cloud
eco-systems, namely GCP and AWS. In both cases we experienced
problems with fast (sometimes day-to-day) changes in the structures
of data provided by the aforementioned Internet data sources. As
a consequence, previously designed and deployed ELT processes
generated errors and needed to be repaired, i.e., adjusted to the new
structures of the changed DSs. This had to be done manually,
as none of the commercial or free ETL/ELT tools supports
automatic repair of such processes. It is one of the still open
problems in DW/DL research [
        <xref ref-type="bibr" rid="ref43 ref6">6, 43</xref>
        ].
      </p>
      <p>The second problem caused by evolving DSs is related to
detecting data changes and structural changes. A DW or a DL is
typically refreshed incrementally. To this end, an ETL/ELT
process has to construct the increment (a.k.a. delta). However, in
multiple architectures, the only way to access a DS is to use
a data export (a.k.a. snapshot) provided by the DS. Frequently,
such a snapshot includes the whole content of the DS. From this
content, the ETL/ELT process has to extract the increment. This
is done by comparing two consecutive snapshots. When a DS
changes its structure, two consecutive snapshots may not
be comparable, and thus an increment cannot be constructed.</p>
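      <p>A minimal sketch of increment (delta) construction from two consecutive flat snapshots, keyed by a stable identifier; it is exactly this comparison that breaks when the DS changes its structure:</p>
      <preformat>
def delta(old, new, key="id"):
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    inserts = [r for k, r in new_by_key.items() if k not in old_by_key]
    deletes = [r for k, r in old_by_key.items() if k not in new_by_key]
    updates = [r for k, r in new_by_key.items()
               if k in old_by_key and r != old_by_key[k]]
    return inserts, updates, deletes

snap_t1 = [{"id": 1, "phone": "111"}, {"id": 2, "phone": "222"}]
snap_t2 = [{"id": 2, "phone": "999"}, {"id": 3, "phone": "333"}]
print(delta(snap_t1, snap_t2))
# inserts: id 3; updates: id 2; deletes: id 1
      </preformat>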
      <p>Furthermore, snapshot comparison is challenging for DSs
that use complex data structures (e.g., graphs, semi-structured,
NoSQL) as the comparison algorithm has to be able to traverse
nested structures and handle cases when new nested objects
appear.</p>
      <p>The micro-project revealed that while dealing with Internet
data sources, one cannot use off-the-shelf solutions (as they do
not exist). In the project, ELT processes were repaired manually.
Data increments were constructed by comparing two consecutive
snapshots of nested data, at the cost of expensive processing. The
snapshot comparison algorithm had to be changed manually
whenever a DS changed its structure.</p>
      <p>3.3.3 Designing logical and physical DL schemas. Designing
logical and physical schemas for relational DWs is a very well
researched topic, supported by mature relational database
management systems as well as by design and development tools.
DW modeling is a fundamental task done in an early phase
of DW development. Modeling data lakes still needs to be
researched. Dealing with unstructured data tempts a designer to
apply NoSQL storage, in order to model a flexible schema. Unfortunately,
NoSQL storage servers are not mature yet and they do not offer
a rich SQL syntax, advanced indexing, or advanced cost-based
query optimization. On the contrary, relational database
management systems (especially the commercial ones) offer such
features at the cost of rigid schemas.</p>
      <p>
        Despite the fact that research is being conducted on data
lake modeling (i.e., for unstructured data), e.g., [
        <xref ref-type="bibr" rid="ref13 ref15">13, 15</xref>
        ], there is
no commonly accepted modeling method. Research and
development in physical storage, physical data structures, e.g., [
        <xref ref-type="bibr" rid="ref28 ref45">28, 45</xref>
        ],
and query optimization techniques for NoSQL storage, e.g., [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
is being conducted as well, but there is no single product that
would offer all three of these important functionalities, which
mature commercial RDBMSs do offer.
      </p>
      <p>To conclude, designing logical DL schemas, physical storage,
and physical data structures are not guided by any method (like
for relational databases) and must be done case by case (ad-hoc).</p>
      <p>3.3.4 Provisioning optimal cloud resources. While deploying
a cloud architecture, one typically aims at achieving the best
possible performance while minimizing the monetary costs paid for the
cloud infrastructure. Different types of processing (e.g., analytical,
transactional, ETL) require different amounts of cloud resources
(e.g., the number of virtual machines, the number of CPUs and
cores, main memory, disk storage).</p>
      <p>
        Provisioning optimal resources for a given type of processing
to maximize performance is contradictory to minimizing
monetary costs, and the joint optimization of these goals is a combinatorial
problem of exponential complexity, e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite the fact that
some research has been and still is conducted in this area, e.g.,
[
        <xref ref-type="bibr" rid="ref29 ref36 ref4">4, 29, 36</xref>
        ], it is considered, for the time being, an open problem.
      </p>
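      <p>An illustrative brute-force sketch of the provisioning problem: maximize an estimated performance measure subject to a monetary budget. The pricing and performance models and all numbers are hypothetical placeholders; the configuration space grows exponentially with the number of knobs:</p>
      <preformat>
from itertools import product

vms = [2, 4, 8, 16]       # number of virtual machines
cpus = [4, 8, 16]         # vCPUs per machine
storage = [1, 2, 4]       # TB of disk per machine
budget = 50.0             # monetary units per hour

def cost(v, c, s):        # assumed linear pricing model
    return v * (0.5 * c + 1.0 * s)

def perf(v, c, s):        # assumed throughput estimate
    return 0.9 * v * c    # with a sub-linear scaling factor

best = max((cfg for cfg in product(vms, cpus, storage)
            if budget >= cost(*cfg)),
           key=lambda cfg: perf(*cfg))
print(best, cost(*best), perf(*best))   # (4, 16, 1) 36.0 57.6
      </preformat>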
    </sec>
    <sec id="sec-9">
      <title>SUMMARY</title>
      <p>In this paper we outlined a project being carried out for a big financial
institution in Poland. Its goal is to build a system providing clean
data about customers and their related data, augmented with data from
external data sources. To this end, a hybrid architecture is being built for
the FI. The architecture is composed of on-premise databases
and a private-public cloud eco-system. The cloud eco-system
will include customers data and data related to customers,
migrated from the old on-premise architecture, and data ingested
from Internet data sources.</p>
      <p>The focus of this paper is on presenting the hybrid architecture
that we designed and on challenges that we encountered while
realizing the project (since the project has just started, we are
not able to provide solutions to the challenges yet).</p>
      <p>Based on the early experience gained and on the state-of-the-art
analysis in research and technology, we can draw the
following conclusions:
• commonly shared knowledge on building data lakes in
a cloud for a FI is not available (most probably because it is
considered a company’s asset);
• comprehensive methods for guiding the process of
efficient data migration from an on-premise to a cloud
architecture are not available either;
• a comprehensive method for building logical and physical
DL models is not available either;
• a reference DL architecture in a cloud for a FI is not
available (again, most probably because it is considered a
company’s asset);
• finally, an end-to-end method for designing and deploying
a data lake for a FI (from conceptual, logical, and
physical modeling, through data migration and ingestion, data
governance, to performance optimization) still needs to
be developed.</p>
      <p>Moreover, running a project for a big FI requires the usage of
commercial tools at every stage and for every task
of DW and DL development. For a big FI, the only
applicable solution is to use powerful commercial software. For
example, commercial ETL engines offer rich functionality,
including parsing addresses and names, ready-to-use algorithms,
cleaning techniques and some deduplication techniques, parallel
processing either on a single server or in a cloud, and access to
non-relational data. Also, technical support from a software house
is of great importance. In the described project we use
Informatica for data profiling and ETL; Oracle databases for storing
internal company data, including the CRCD; and Microsoft Azure and
Google Cloud Platform as private-public cloud eco-systems in
the Polish National Cloud.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Syed Muhammad Fawad</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>From conceptual design to performance optimization of ETL workflows: current state of research and open problems</article-title>
          .
          <source>The VLDB Journal 26</source>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>777</fpage>
          -
          <lpage>801</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Syed Muhammad Fawad</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics</article-title>
          .
          <source>In European Conf. Advances in Databases and Information Systems (LNCS</source>
          , Vol.
          <volume>11695</volume>
          ). Springer,
          <fpage>441</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Syed Muhammad Fawad</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Framework to Optimize Data Processing Pipelines Using Performance Metrics</article-title>
          .
          <source>In Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK) (LNCS</source>
          , Vol.
          <volume>12393</volume>
          ). Springer,
          <fpage>131</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Abdullah Khalid A.</given-names>
            <surname>Almasaud</surname>
          </string-name>
          , Agresh Bharadwaj, Sandra Sampaio, and
          <string-name>
            <given-names>Rizos</given-names>
            <surname>Sakellariou</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Challenges in Resource Provisioning for the Execution of Data Wrangling Workflows on the Cloud: A Case Study</article-title>
          .
          <source>In Int. Conf. on Database and Expert Systems Applications (DEXA)</source>
          .
          <source>LNCS 12392</source>
          ,
          <fpage>66</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Rana</given-names>
            <surname>Alotaibi</surname>
          </string-name>
          , Damian Bursztyn, Alin Deutsch, Ioana Manolescu, and
          <string-name>
            <given-names>Stamatis</given-names>
            <surname>Zampetakis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue</article-title>
          .
          <source>In SIGMOD</source>
          .
          <volume>1660</volume>
          -
          <fpage>1677</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Judith</given-names>
            <surname>Awiti</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Rule Discovery for (Semi-)automatic Repairs of ETL Processes</article-title>
          .
          <source>In Int. Baltic Conf. on Databases and Information Systems (CCIS</source>
          , Vol.
          <volume>1243</volume>
          ). Springer,
          <fpage>250</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Arun</given-names>
            <surname>Balasubramanyan</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Data warehouse augmentation, Part 3: Use big data technology for an active archive</article-title>
          . https://www.ibm.com/developerworks/library/ba-augment-data-warehouse3/index.html. IBM DeveloperWorks.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Vassilis</given-names>
            <surname>Christophides</surname>
          </string-name>
          , Vasilis Efthymiou, Themis Palpanas, George Papadakis, and
          <string-name>
            <given-names>Kostas</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>End-to-End Entity Resolution for Big Data: A Survey</article-title>
          . CoRR abs/1905.06397 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Colyer</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>The Morning Paper: An overview of end-to-end entity resolution for big data</article-title>
          . https://blog.acolyer.org/2020/12/14/entity-resolution/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jennie</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aaron J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          , Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Mattson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stanley B.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The BigDAWG Polystore System</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>44</volume>
          ,
          <issue>2</issue>
          (
          <year>2015</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ted</given-names>
            <surname>Friedman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Heudecker</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Data Hubs, Data Lakes and Data Warehouses: How They Are Different and Why They Are Better Together</article-title>
          . Gartner.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Gartner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Magic Quadrant for Data Integration Tools</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Corinna</given-names>
            <surname>Giebler</surname>
          </string-name>
          , Christoph Gröger, Eva Hoos, Holger Schwarz, and
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Mitschang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Modeling Data Lakes with Data Vault: Practical Experiences, Assessment, and Lessons Learned</article-title>
          .
          <source>In Int. Conf. on Conceptual Modeling ER (LNCS</source>
          , Vol.
          <volume>11788</volume>
          ). Springer,
          <fpage>63</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Informatica</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>How to Achieve Flexible, Cost-effective Scalability and Performance through Pushdown Processing</article-title>
          . Whitepaper.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Jarke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Quix</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>On Warehouses, Lakes, and Spaces: The Changing Role of Conceptual Modeling for Data Integration</article-title>
          . In Conceptual Modeling Perspectives. Springer,
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sean Michael</given-names>
            <surname>Kerner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Top 8 Cloud Data Warehouses</article-title>
          . https://www.datamation.com/cloud-computing/top-cloud-data-warehouses.html. Datamation.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Boyan</given-names>
            <surname>Kolev</surname>
          </string-name>
          , Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jiménez-Peris,
          <string-name>
            <given-names>Raquel</given-names>
            <surname>Pau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>José</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The CloudMdsQL Multistore System</article-title>
          .
          <source>In SIGMOD</source>
          .
          <volume>2113</volume>
          -
          <fpage>2116</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Norman W.</given-names>
            <surname>Paton</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Feedback driven improvement of data preparation pipelines</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>92</volume>
          (
          <year>2020</year>
          ),
          <fpage>101480</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Nitin</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. Sreenivasa</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>An Efficient Heuristic for Logical Optimization of ETL Workflows</article-title>
          .
          <source>In VLDB Workshop on Enabling Real-Time Business Intelligence</source>
          .
          <fpage>68</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Alice</given-names>
            <surname>LaPlante</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Building a Unified Data Infrastructure. O'Reilly whitepaper</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Rao</given-names>
            <surname>Lella</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Optimizing BDFS jobs using InfoSphere DataStage Balanced Optimization</article-title>
          .
          <source>IBM DeveloperWorks.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Divya</given-names>
            <surname>Mahajan</surname>
          </string-name>
          , Cody Blakeney, and
          <string-name>
            <given-names>Ziliang</given-names>
            <surname>Zong</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Improving the energy efficiency of relational and NoSQL databases via query optimizations</article-title>
          .
          <source>Sustain. Comput. Informatics Syst</source>
          .
          <volume>22</volume>
          (
          <year>2019</year>
          ),
          <fpage>120</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Fatemeh</given-names>
            <surname>Nargesian</surname>
          </string-name>
          , Erkang Zhu,
          <string-name>
            <given-names>Renée J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ken Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Patricia C.</given-names>
            <surname>Arocena</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Data Lake Management: Challenges and Opportunities</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>12</volume>
          ,
          <issue>12</issue>
          (
          <year>2019</year>
          ),
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Similarity measures</article-title>
          .
          <source>Hasso Plattner Institut.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Themis</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Web-scale, Schema-Agnostic, End-to-End Entity Resolution</article-title>
          . In Tutorial at World Wide Web.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , Dimitrios Skoutas, Emmanouil Thanos, and
          <string-name>
            <given-names>Themis</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Blocking and Filtering Techniques for Entity Resolution: A Survey</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          ,
          <issue>2</issue>
          (
          <year>2020</year>
          ),
          <volume>31</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          :
          <fpage>42</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, and
          <string-name>
            <given-names>Manolis</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI</article-title>
          .
          <source>SIGMOD Record 48</source>
          ,
          <issue>4</issue>
          (
          <year>2019</year>
          ),
          <fpage>30</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Philipp D.</given-names>
            <surname>Rohde</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maria-Esther</given-names>
            <surname>Vidal</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Optimizing Federated Queries Based on the Physical Design of a Data Lake</article-title>
          .
          <source>In Proc. of EDBT/ICDT Workshops (CEUR Workshop Proceedings</source>
          , Vol.
          <volume>2578</volume>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Data Engineering for Data Science: Two Sides of the Same Coin</article-title>
          .
          <source>In Proc. of Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK) (Lecture Notes in Computer Science</source>
          , Vol.
          <volume>12393</volume>
          ). Springer,
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Russom</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Modernizing the Logical Data Warehouse</article-title>
          . https://tdwi.org/articles/2019/10/14/dwt-all-modernizing-the-logical-data-warehouse.aspx. TDWI.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>John</given-names>
            <surname>Ryan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Uli</given-names>
            <surname>Bethke</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Comparison of Cloud Data Warehouse Platforms</article-title>
          . https://www.datamation.com/cloud-computing/top-cloud-data-warehouses.html.
          <source>Sonra Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32] SAP. [n.d.].
          <article-title>Data aging. SAP Help portal</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33] ScienceSoft. [n.d.].
          <article-title>Data Warehouse in the Cloud: Features, Important Integrations, Success Factors, Benefits and More</article-title>
          . https://www.scnsoft.com/analytics/data-warehouse/cloud.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Alkis</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timos K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Optimizing ETL Processes in Data Warehouses</article-title>
          .
          <source>In Int. Conf. on Data Engineering (ICDE)</source>
          .
          <fpage>564</fpage>
          -
          <lpage>575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Alkis</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timos K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>State-Space Optimization of ETL Workflows</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>17</volume>
          ,
          <issue>10</issue>
          (
          <year>2005</year>
          ),
          <fpage>1404</fpage>
          -
          <lpage>1419</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Sukhpal</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Inderveer</given-names>
            <surname>Chana</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Cloud resource provisioning: survey, status and future research directions</article-title>
          .
          <source>Knowl. Inf. Syst</source>
          .
          <volume>49</volume>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>1005</fpage>
          -
          <lpage>1069</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Mohamed A.</given-names>
            <surname>Soliman</surname>
          </string-name>
          , Lyublena Antova,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          , Michael Duller, Amirhossein Aleyasen, Gourab Mitra, Ehab Abdelhamid, Mark Morcos, Michele Gage, Dmitri Korablev, and
          <string-name>
            <given-names>Florian M.</given-names>
            <surname>Waas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Framework for Emulating Database Operations in Cloud Data Warehouses</article-title>
          .
          <source>In Proc. of SIGMOD. ACM</source>
          ,
          <fpage>1447</fpage>
          -
          <lpage>1461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Ignacio</given-names>
            <surname>Terrizzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mary</given-names>
            <surname>Roth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John E.</given-names>
            <surname>Colino</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Data Wrangling: The Challenging Journey from the Wild to the Lake</article-title>
          .
          <source>In Conf. on Innovative Data Systems Research (CIDR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Apostolos V.</given-names>
            <surname>Zarras</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Schema Evolution Survival Guide for Tables: Avoid Rigid Childhood and You're En Route to a Quiet Life</article-title>
          .
          <source>J. Data Semant.</source>
          <volume>6</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>221</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Apostolos V.</given-names>
            <surname>Zarras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Skoulis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Gravitating to rigidity: Patterns of schema evolution - and its absence - in the lives of tables</article-title>
          .
          <source>Inf. Systems</source>
          <volume>63</volume>
          (
          <year>2017</year>
          ),
          <fpage>24</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Vogt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Stiemer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Schuldt</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Polypheny-DB: Towards a Distributed and Self-Adaptive Polystore</article-title>
          .
          <source>In BigData</source>
          .
          <fpage>3364</fpage>
          -
          <lpage>3373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Jiannan</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guoliang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey Xu</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianhua</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Entity Matching: How Similar Is Similar</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          <volume>4</volume>
          ,
          <issue>10</issue>
          (
          <year>2011</year>
          ),
          <fpage>622</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Artur</given-names>
            <surname>Wojciechowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>On Case-Based Reasoning for ETL Process Repairs: Making Cases Fine-Grained</article-title>
          .
          <source>In Int. Baltic Conf. on Databases and Information Systems (CCIS</source>
          , Vol.
          <volume>1243</volume>
          ). Springer,
          <fpage>235</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Anita</given-names>
            <surname>Zakrzewska</surname>
          </string-name>
          and
          <string-name>
            <given-names>David A.</given-names>
            <surname>Bader</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Aging data in dynamic graphs: A comparative study</article-title>
          .
          <source>In Int. Conf. on Advances in Social Networks Analysis and Mining, ASONAM. IEEE Computer Society</source>
          ,
          <fpage>1055</fpage>
          -
          <lpage>1062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Dongfang</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yong</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhenling</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shijie</given-names>
            <surname>Dai</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Improving NoSQL Storage Schema Based on Z-Curve for Spatial Vector Data</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          ),
          <fpage>78817</fpage>
          -
          <lpage>78829</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>