Managing Data in a Big Financial Institution: Conclusions from a R&D Project
Industrial paper

Mariusz Sienkiewicz, Poznan University of Technology, Poznań, Poland, mariusz.sienkiewicz@doctorate.put.poznan.pl
Robert Wrembel, Poznan University of Technology, Poznań, Poland, robert.wrembel@cs.put.poznan.pl

ABSTRACT
Financial institutions (FIs) use state-of-the-art data management and data engineering solutions to support their day-to-day businesses. They follow strict data governance policies as well as country and international regulations. In spite of these facts, the quality of their data is not perfect. Experts in the field estimate that from 1% to about 5% of the data owned by financial institutions are dirty. Typically, FIs include in their IT architectures from dozens to a few hundred data sources that are integrated in multiple data warehouses. Such complex architectures generate substantial monetary costs and are difficult to manage. FIs compete in the financial services market. One way to gain a competitive advantage is to apply the latest technologies for the purpose of data management and to shorten the software development cycle. A promising direction is to migrate on-premise infrastructures into private, public, or private-public cloud architectures. In this paper we present our experience from preparing and running a project for a big financial institution in Poland. The project is run in two stages: (1) building a central repository of customers data and (2) developing a data lake architecture in a private-public cloud.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Financial institutions (FIs) use state-of-the-art data management and data engineering solutions to store and process their data. They apply strict data governance policies, defined by country (e.g., the Financial Supervision Commission in Poland) and international (e.g., the European Banking Authority) financial regulatory authorities. User applications, before being deployed in operational IT architectures, undergo thorough testing. FIs strive to gain a competitive advantage by constantly providing new products and services, necessitating the use of the latest technologies and the shortening of the software development process. Despite the care for the quality of the produced software and the data governance processes, the quality of data in financial databases is not perfect. Experts in the field estimate that from 1% to about 5% of the data owned by FIs are dirty - mainly with missing or erroneous values, duplicated, and outdated.

Duplicates mainly concern customers data, for the following reasons. First, banks buy other banks, together with their information systems and the data stored there. Second, some banking products (e.g., a checking account and a stockbroker account) require a separate customer instance in a system for each product, even if the real customer is the same. Third, the imperfection of the software and processes used in data governance allows separate customer instances to be created in an information system for the same physical customer even when it is not necessary. Fourth, FIs often function in capital groups, in which individual entities have their own customer databases. In order to manage the relationship with a customer at the level of a capital group, deduplication of customer data is necessary. It is worth mentioning that core customers data are related to other data, like contact addresses. This way, duplicate customers data cause duplicates of the related data.

Outdated data is the second issue impacting the quality of data in a FI. This problem concerns customers last names (typically caused by name changes after getting married), postal addresses (caused by moving to another location), phone numbers, and email addresses, to name the most typical cases.

Duplicated and outdated data cause economic losses and deteriorate the reputation of a FI. Thus, clean data are necessary for efficiently conducting a business and for the proper functioning of artificial intelligence technologies (prediction models, natural language processing, chat bots), which are becoming increasingly important in sales and service processes.

The data and IT architectures of a FI must ensure unambiguous identification of customers, meet security requirements and risk needs, and meet the requirements imposed by institutions regulating the market, including counteracting terrorist financing and money laundering. Another important aspect influencing an IT architecture and data governance policies in a FI is virtualization and service automation. The above aspects, on the one hand, require correct, up-to-date data, but on the other hand, the security policies adopted by a FI limit the possibility of entering or modifying data via remote channels.

Typically, FIs include in their IT architectures from dozens to hundreds of data sources (DSs). Efficient processing of data stored in different database structures, distributed among multiple DSs, requires the application of an integration architecture. An industry standard is a data warehouse (DW) architecture. In this architecture, DSs are integrated in a central repository - a data warehouse - by the so-called Extract Transform Load (ETL) processes or their alternative ELT variants (a.k.a. data processing workflows, data processing pipelines, or data wrangling [18, 29]).

An ETL process first extracts data of interest from multiple DSs. Second, it transforms, cleans, and homogenizes the data. Finally, it loads the data into a DW. The ELT alternative extracts data from DSs, loads them in their original formats into an intermediate storage (called an operational data store, data stage, or staging area), and then transforms the data and loads them into a DW. Traditional IT architectures are built based on multiple stand-alone servers and/or data warehouse appliances (e.g., IBM Pure Data for Analytics - Netezza, Oracle Exadata, SAP Hana, Teradata). ETL processes are run by dedicated engines, e.g., [12], typically deployed on dedicated hardware.
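For illustration only, the following minimal Python sketch contrasts the task ordering of the two variants on toy in-memory data; the single "last_name" attribute, the list-based storage, and the transformation step are assumptions of this sketch, not a description of the systems used in the project.

# Minimal, illustrative sketch of ETL vs. ELT ordering on in-memory data.
# The attribute names and list-based "storage" are assumptions made only for
# illustration; they do not describe the FI's actual systems.

def transform(row):
    # Hypothetical homogenization: trim and upper-case a name attribute.
    return {**row, "last_name": row.get("last_name", "").strip().upper()}

def etl(sources, dw):
    extracted = [row for src in sources for row in src]   # Extract from all sources
    dw.extend(transform(r) for r in extracted)            # Transform, then Load into the DW

def elt(sources, staging, dw):
    for src in sources:
        staging.extend(src)                               # Load raw data into the staging area first
    dw.extend(transform(r) for r in staging)              # Transform close to the target, then load

if __name__ == "__main__":
    ds1 = [{"last_name": " kowalski "}]
    ds2 = [{"last_name": "nowak"}]
    dw_etl, dw_elt, staging = [], [], []
    etl([ds1, ds2], dw_etl)
    elt([ds1, ds2], staging, dw_elt)
    print(dw_etl == dw_elt)   # True: both variants end with the same homogenized rows

The only difference that matters here is where the transformation happens: before loading (ETL) or after raw data have already landed in a staging area (ELT).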
Such complex architectures are difficult to manage from a technological point of view and generate substantial monetary costs. In this context, the following two business trends can be observed.

• Leading FIs initiate projects aiming at transforming their on-premise infrastructures into novel architectures - either hybrid or cloud. A hybrid architecture includes on-premise databases integrated with databases in a cloud (either private or private-public). Data in both on-premise databases and in a cloud can be accessed and dynamically integrated via a dedicated software layer (with the functionality of a mediated architecture and ETL processes). A pure cloud architecture assumes that all on-premise databases have been migrated into a cloud eco-system. In this eco-system, cloud data warehouses are built as well, based on dedicated systems (e.g., Amazon Redshift, Snowflake, Presto, Google BigQuery, Azure Synapse Analytics, Oracle Autonomous Data Warehouse, IBM Db2 Warehouse on Cloud) [33].
• FIs are building cloud repositories of heterogeneous data, typically ingested from sources external to a company, including among others open data published by (local) governments, professional portals, and social media. The data are stored in the repository in their original formats. Such a repository is typically called a data lake (DL) [23, 38]. Further, these data are unified either on-the-fly in the so-called logical data warehouse (LDW) [11, 20, 30] or are homogenized and uploaded into a physical cloud data warehouse (CDW) [16, 31, 37]. Another possible architecture to store and query heterogeneous data is a polystore. In this architecture, a few physical data repositories (each of which stores unified data represented in the same data model) are virtually integrated and queried by means of a mediated architecture [5, 10, 17, 41].

Cloud technologies are also inevitable in the financial sector, and the usage of these technologies is already supported by country financial regulatory authorities. For example, in January 2020, the Financial Supervision Commission in Poland approved a document with guidelines and recommendations for deploying cloud services in FIs, thus giving a green light for IT projects based on cloud technologies in the financial sector. In particular, Recommendation D.10.6 Cooperation with external service providers (by the Financial Supervision Commission) defines 5 requirements that an external cloud provider must fulfill to be able to offer services for a financial institution.

In this paper, we present our initial experience and challenges in launching a project for a big financial institution in Poland (for this publication, we are not authorized to reveal the name of the institution). Since the project has just started, it is too early to present solutions to the challenges mentioned in this paper. The project is divided into two stages. The first one aims at building a Central Repository of Customers Data, integrated from several data sources (cf. Section 2). The second stage aims at building a Cloud Data Repository architecture in the Polish National Cloud (https://chmurakrajowa.pl/en/), a private-public cloud operated by Microsoft and Google. A few on-premise databases will next be migrated into the repository (cf. Section 3).

2 STAGE 1: BUILDING CENTRAL REPOSITORY OF CUSTOMERS DATA
In this stage, data about customers and data related to customers, coming from multiple data sources, are integrated into the Central Repository of Customers Data (CRCD).

2.1 Stage goals
The goals to achieve within this stage are as follows:
G1 - to integrate customers data and related data in the CRCD;
G2 - to homogenize the structures of records describing customers and the structures of related data, coming from multiple data sources;
G3 - to homogenize values of the integrated data;
G4 - to clean data;
G5 - to develop data deduplication pipelines, in two variants: based on statistical models and based on machine learning models;
G6 - to develop machine learning models for customers data aging.

2.2 Architecture
The overall technical architecture built in this stage is shown in Figure 1. It is a standard data warehouse architecture, where goals G1-G4 are implemented by means of ETL processes. Source customers data and related data are stored in relational databases, denoted as DS1-DS6; the CRCD is also a relational database (Oracle DBMS). In order to support data cleaning and standardization, ETL processes use open data sources (denoted as openDS) and paid data sources (denoted as paidDS). Both types of DSs are made available by the public administration. These sources include reference data, among others on: citizen unique IDs, the administrative division of the country, zip codes, and companies.

Figure 1: The overall architecture of the Central Repository of Customers Data

An important functionality of the ETL layer is a data deduplication pipeline (DDP). The DDP realizes goal G5. Having profiled the customers data in the available source systems, we concluded that the DDP cannot be fully automated and that there are cases where expert knowledge is required to support the process; however, the DDP is designed to minimize the number of rows requiring manual work.

After being processed by the ETL processes, customers and related data are uploaded into the CRCD. The usage of the CRCD is twofold. First, its content is to be analyzed by data mining algorithms in order to discover models for data aging (goal G6). The overall idea behind this component is to be able to discover classes of data and their properties that share similar (or identical) aging characteristics. Second, the CRCD will become a source of truth for the whole data infrastructure of the FI, thus it will be accessed by other internal systems of the FI.

2.3 Challenges
While designing the aforementioned architecture, even though it is a standard one, we encountered a few issues. The most challenging ones include: (1) designing a data deduplication pipeline and (2) developing ML models for predicting data aging.
2.3.1 Designing a data deduplication pipeline. An important and challenging task in the ETL layer is the data deduplication pipeline. The state-of-the-art pipeline [8, 26] is shown in Figure 2. It is composed of four main tasks, namely blocking, block processing, entity matching, and entity clustering. Blocking aims at dividing records into groups that are likely to represent duplicates, with the final goal of reducing the number of records that need to be compared. In this DDP, records are compared only within the blocks they belong to, which can be done in parallel. Block processing aims at reorganizing records in blocks, in order to further reduce the number of needed comparisons. Entity matching consists in computing similarity measures between records. Finally, in the entity clustering step, matching records (i.e., those whose similarity measures, e.g., [24, 42], are higher than a given threshold) are grouped together and merged.

Figure 2: The state of the art data deduplication pipeline
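To make the structure of the pipeline concrete, the following is a minimal sketch of three of its steps (blocking, entity matching, entity clustering) on toy customer records; block processing is omitted, and the blocking key (zip code), the difflib-based similarity measure, the threshold, and the record layout are assumptions chosen for this sketch only, not the algorithms selected in the project.

# Minimal illustration of the DDP structure (blocking -> matching -> clustering)
# on toy records. Blocking key, similarity measure, and threshold are assumptions
# made for this sketch; the project evaluates many alternative algorithms.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "last_name": "Kowalski", "zip": "61-001"},
    {"id": 2, "last_name": "Kowalsky", "zip": "61-001"},
    {"id": 3, "last_name": "Nowak",    "zip": "00-950"},
]

def blocking(recs):
    # Group records by a cheap key (here: zip code) so that only records
    # within the same block are compared.
    blocks = defaultdict(list)
    for r in recs:
        blocks[r["zip"]].append(r)
    return blocks.values()

def similarity(a, b):
    return SequenceMatcher(None, a["last_name"].lower(), b["last_name"].lower()).ratio()

def entity_matching(blocks, threshold=0.85):
    # Compare pairs only inside blocks; keep pairs whose similarity exceeds the threshold.
    return [(a["id"], b["id"])
            for block in blocks
            for a, b in combinations(block, 2)
            if similarity(a, b) >= threshold]

def entity_clustering(pairs, recs):
    # Union-find: transitively group matched record ids into clusters.
    parent = {r["id"]: r["id"] for r in recs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for r in recs:
        clusters[find(r["id"])].add(r["id"])
    return list(clusters.values())

print(entity_clustering(entity_matching(blocking(records)), records))
# Records 1 and 2 form one cluster; record 3 stays alone.

Each of these steps can be instantiated with many alternative algorithms, which is exactly the source of the selection problem discussed next.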
Even though research on deduplication has been conducted for decades, and multiple algorithms have been developed for each step of the DDP, the whole pipeline still has to be constructed manually. The main problem lies in selecting the most adequate algorithm for each of the four tasks, so that the pipeline maximizes precision and recall. To fully solve this problem, one would need to apply a brute-force approach of testing the results of all possible combinations of these algorithms. However, this is not feasible, as it is an optimization problem of exponential complexity. For example, in block building there are at least 14 popular algorithms, in block processing there are at least 18 algorithms, in entity matching there are at least 20 algorithms, and in entity clustering there are at least 7 algorithms [9, 25–27], resulting in at least 35280 combinations.

The algorithms in the DDP assume that the processed data have been cleaned and standardized earlier. In the case of customer data, such cleaning and standardization is not possible for first and last names, as there are multiple variants of standard names. For example, names like 'Anna' and 'Ana' may be treated as being the same, assuming that a typo was made, or as being two different names; a non-Latin name may be transliterated into the Latin alphabet in multiple ways.
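This ambiguity is visible with any generic string similarity measure; the short sketch below uses Python's difflib purely as an example measure (the project considers dedicated similarity measures, cf. [24, 42]) to show that a pair like 'Anna'/'Ana' scores as high as a typical typo would, so the score alone cannot decide whether the two values denote the same name.

# Illustration of why name standardization is ambiguous: a generic string
# similarity (difflib here, used only as an example measure) cannot tell a
# typo from a legitimately different name.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in [("Anna", "Ana"), ("Anna", "Maria")]:
    print(f"{a!r} vs {b!r}: {sim(a, b):.2f}")
# 'Anna' vs 'Ana' scores high (about 0.86), exactly like a typo would,
# even though they may be two distinct legal names.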
Once the entity clustering step has identified records representing the same customer, a reference customer record needs to be built based on these records. If, however, the records differ in the values of addresses, a problem arises in figuring out which address is the current one. Typically, changes in data are timestamped and, in such a case, the most recently timestamped address and other contact data may be used as the current ones. Unfortunately, in reality, the most recent timestamp may be old. In such a case, it is probable that some data have changed since then. This observation led us to the conclusion that data aging models could also help solve this problem (cf. Section 2.3.2).

To conclude, having analyzed the state of the art in designing the DDP, to the best of our knowledge, an automatic approach to constructing an end-to-end data deduplication pipeline has not been proposed yet for traditional record-like data. The problem becomes even more difficult for the deduplication of big data, mainly because big data are represented in a plethora of different formats. Before applying the DDP, all these data must be unified into the same format, which is a challenge in itself. We encountered this problem in the second stage of our project (cf. Section 3).

2.3.2 Developing ML models for data aging. An intrinsic feature of some types of data is their aging. Four main types of such data are of special interest to FIs, namely: (1) customers last names, (2) customers identification documents, (3) postal addresses, and (4) contact addresses (phone numbers, emails). Such data become outdated mainly as a result of: last name changes after getting married or divorced, expiry dates reached by identification documents, customers moving to other locations, as well as changes of phone numbers and email addresses.

Outdated data decrease the reliability of data and cause financial losses. For these reasons, FIs have so far been using analog methods to keep customers data up to date (e.g., checking data upon a customer's arrival at a FI branch, checking data on delivered and undelivered mailings, calling a customer to verify her/his data). Moreover, FIs strive to ensure the highest possible level of customer self-service by means of remote services via the Internet, which results in limited contact between employees of a branch network and customers. Therefore, the analog methods of verifying the correctness and timeliness of customer data either do not work or are inefficient in terms of costs, speed, and the amount of data that can be updated.

For this reason, in this project we aim at developing data aging models based on ML techniques. We assume that dedicated models will be built at least for a few age groups of customers. To the best of our knowledge, such models have not been proposed yet in the context addressed by our project. The only approach addressing a related problem is [44], but in the context of temporal data. Other approaches address the problem of moving cold data from a hot to a cold storage, e.g., [32], or to an external storage, e.g., [7].

At this stage of the project development, we plan to start experimenting with classification for building data aging models. It is very likely that data imbalance will require the application of other ML techniques as well, like neural networks.
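As a purely illustrative sketch of how data aging could be framed as binary classification (a contact attribute is still valid vs. outdated), the snippet below fits a logistic regression on synthetic features; the feature set, the synthetic labels, and the choice of scikit-learn's logistic regression are assumptions of this sketch only and do not reflect the models actually planned in the project.

# Illustrative framing of data aging as binary classification: predict whether a
# customer's contact attribute is outdated. Features, synthetic data, and the
# choice of logistic regression are assumptions for this sketch only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [age_group, years_since_last_update, recent_interactions]
X = np.array([
    [1, 0.5, 12], [2, 1.0, 8], [3, 4.0, 0], [2, 6.0, 1],
    [1, 0.2, 20], [3, 7.5, 0], [2, 3.0, 2], [1, 5.0, 1],
])
# Label: 1 = the attribute turned out to be outdated, 0 = still valid (synthetic).
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)
# Probability that an address updated 4 years ago, for a customer in age group 2
# with no recent interactions, is outdated:
print(model.predict_proba([[2, 4.0, 0]])[0, 1])

Under the class imbalance mentioned above, class weighting (e.g., scikit-learn's class_weight='balanced') or resampling would be natural first adjustments to such a baseline before moving to more complex models.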
3 STAGE 2: BUILDING CLOUD DATA REPOSITORY
In this stage, the so-called Cloud Data Repository (CDR) is built (notice that such a repository has the features of a data lake). The CDR will include: (1) the CRCD developed in Stage 1 (cf. Section 2), (2) the data sources storing data related to customer records, and (3) external data sources augmenting the views on customers, to achieve customer 360 functionality.

3.1 Stage goals
The goals to achieve in this stage are as follows:
G7 - to build the CDR for storing data from selected internal company DSs;
G8 - to augment the customer view with data from external DSs (professional portals and social portals);
G9 - to design data retention models in a cloud eco-system;
G10 - to develop a data governance method in a cloud eco-system; the method must be compliant with national and international regulations in the financial sector;
G11 - to develop an end-to-end method for designing and deploying a data repository in a cloud eco-system.

3.2 Architecture
The aforementioned goals will be achieved in the architecture that we have designed and will be implementing. The architecture is shown in Figure 3. It is a hybrid cloud eco-system, where some DSs are stored in an on-premise architecture (DS1, ..., DSi and the CRCD) and some DSs are stored in a private-public cloud eco-system.

Figure 3: The overall architecture of the Cloud Data Repository

The core component of the hybrid cloud eco-system is the Cloud Data Repository. It includes:
• internal company DSs, denoted as IDSj, ..., IDSm, and the CRCD; these DSs will be migrated from the on-premise architecture into the CDR; the sources are relational databases and they store customers and related data; notice that the CRCD developed in the first stage is also integrated into the CDR;
• external DSs, denoted as EDSn, EDSo, and EDSp; EDSn stores data ingested from open data provided by the public administration; EDSo stores data ingested from commercial (paid) repositories provided by the public administration; EDSp stores data ingested from professional and social media sources in the Internet as well as internal customer behavioral data; EDSn, EDSo, and EDSp store data related to customers (both individuals and companies); these DSs are non-relational, typically HTML, XML, JSON, and RDF;
• a data warehouse, denoted as DW, integrating customers and related data from: (1) IDSj, ..., IDSm, (2) the CRCD, and (3) EDSn, EDSo, EDSp, cleaned, deduplicated, and unified into a common data format; the content of the DW will become the source of truth about customers for the on-premise databases in the FI.

Data from DSs in the Cloud Data Repository will be integrated into the DW by means of ETL/ELT processes implemented in the cloud eco-system. The DW will be periodically refreshed with customers data from the CRCD by means of ETL processes implemented in Informatica. Since the CRCD is replicated into the DW, the CRCD will be maintained in the on-premise architecture only until the CDR is fully operational. After achieving a fully operational architecture, the CRCD will be replaced by the DW in the CDR.

3.3 Challenges
While designing the aforementioned architecture we encountered the following challenges: (1) designing efficient ETL/ELT processes, (2) handling the evolution of data sources, (3) designing logical and physical data lake schemas for storing unstructured data, and (4) provisioning optimal cloud resources.

3.3.1 Designing efficient ETL processes. Typically, in a standard DW architecture, all ETL processes must finish within a given time window, usually within a few hours, to make a DW available for analytics. Therefore, assuring efficient executions of ETL processes (typically measured by throughput and an overall execution time) is challenging [1]. In a DL architecture, this task is much more challenging for two main reasons. First, ETL processes in a DL architecture ingest and process larger data volumes than in a standard DW architecture. Second, the variety and complexity of data models and formats that ETL processes must handle is much larger in a DL architecture.

In practice, the performance of an ETL process may be improved by: (1) scaling up or out the hardware on which the process is run, (2) running the process in parallel, and (3) reordering its tasks [19, 34]. Existing commercial ETL/ELT tools provide means of parallelizing ETL tasks, but it is the responsibility of an ETL developer to select tasks for parallelization and to apply an appropriate parallelization skeleton. Reordering ETL tasks may reduce the execution time of an ETL process, but finding a reordering that would yield the shortest execution time is of exponential complexity [35]. Again, neither commercial nor free ETL engines offer any means for automatic optimization of ETL processes by task reordering, with the exception of IBM InfoSphere and Informatica PowerCenter, which allow some tasks to be moved into a data source and executed there [14, 21].

To sum up, the performance of an ETL process largely depends on the manual orchestration and parallelization of tasks within an ETL process by an ETL developer.

At this stage of the project, we plan to apply a known heuristic to place the most restrictive tasks, i.e., those which filter data, at the beginning of an ETL process, in order to reduce the data volume ingested by an ETL process as early as possible. Another promising direction would be to allocate adequate cloud resources to run a given ETL process under a monetary or time budget constraint, in the spirit of [2].
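The "filters first" heuristic can be illustrated with a toy task list: selective (filtering) tasks are moved ahead of costly ones whenever the reordering preserves the dependencies between tasks. The task model, the attribute-based dependency check, and the task names below are simplifying assumptions of this sketch, not the representation used by the commercial engines employed in the project.

# Toy illustration of the "filters first" heuristic: among the tasks whose
# inputs are already available, filters are scheduled before other tasks.
# The task model below is a simplifying assumption made for this sketch.

tasks = [
    {"name": "join_addresses", "kind": "join",   "needs": {"cust", "addr"}, "produces": {"cust_addr"}},
    {"name": "filter_active",  "kind": "filter", "needs": {"cust"},         "produces": {"cust"}},
    {"name": "enrich_geo",     "kind": "lookup", "needs": {"cust_addr"},    "produces": {"cust_geo"}},
    {"name": "filter_country", "kind": "filter", "needs": {"cust_addr"},    "produces": {"cust_addr"}},
]

def reorder_filters_first(tasks):
    ordered, available = [], {"cust", "addr"}   # inputs assumed available at the start
    remaining = tasks[:]
    while remaining:
        ready = [t for t in remaining if t["needs"] <= available]
        # Prefer a ready filter; otherwise take any ready task (original order breaks ties).
        ready.sort(key=lambda t: t["kind"] != "filter")
        chosen = ready[0]
        ordered.append(chosen)
        available |= chosen["produces"]
        remaining.remove(chosen)
    return [t["name"] for t in ordered]

print(reorder_filters_first(tasks))
# ['filter_active', 'join_addresses', 'filter_country', 'enrich_geo']

Pushing filter_active before the join reduces the volume flowing through the remaining tasks, which is the effect the heuristic aims for; a full optimizer would additionally use selectivity and cost estimates.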
3.3.2 Handling the evolution of DSs. An intrinsic feature of internal company DSs and external DSs is the evolution of their structures (schemas) in time [39, 40]. Internet data sources evolve much more frequently than internal company DSs. Such evolution has an impact on the data integration layer, i.e., ETL processes ingesting data from evolving DSs stop working and produce errors.

As part of the preparation for the project described in this paper, we ran a pilot micro-project aiming at creating a micro-DL with data ingested from Internet data sources. The micro-DL stored: (1) data about companies that are customers of the FI and (2) descriptions of financial products offered by banks in Poland. To this end, we integrated: (1) a few professional portals providing data about companies (including LinkedIn, Glassdoor, GoldenLine) and (2) several portals of banks offering services in Poland, to compare their product offers.

We applied an ELT architecture in two alternative cloud eco-systems, namely GCP and AWS. In both cases we experienced problems with the fast (sometimes day-to-day) changing structures of data provided by the aforementioned Internet data sources. As a consequence, previously designed and deployed ELT processes generated errors and needed to be repaired, i.e., adjusted to the new structures of the changed DSs. This had to be done manually, as neither commercial nor free ETL/ELT tools support an automatic repair of such processes. It is one of the still open problems in DW/DL research [6, 43].

The second problem caused by evolving DSs is related to detecting data changes and structural changes. A DW or a DL is typically refreshed incrementally. To this end, an ETL/ELT process has to construct the increment (a.k.a. delta). However, in multiple architectures, the only way to access a DS is to use its data export (a.k.a. snapshot) provided by the DS. Frequently, such a snapshot includes the whole content of the DS. From this content, the ETL/ELT process has to extract the increment. This is done by comparing two consecutive snapshots. When a DS changes its structure, two consecutive snapshots may not be comparable, and thus an increment cannot be constructed.

Furthermore, snapshot comparison is challenging for DSs that use complex data structures (e.g., graphs, semi-structured data, NoSQL), as the comparison algorithm has to be able to traverse nested structures and handle cases when new nested objects appear.

The micro-project revealed that while dealing with Internet data sources, one cannot use off-the-shelf solutions (as they do not exist). In the project, ELT processes were repaired manually. Data increments were constructed by comparing two consecutive snapshots of nested data, at the cost of expensive processing. The snapshot comparison algorithm had to be changed manually when a DS changed its structure.
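For illustration, the construction of an increment from two consecutive snapshots can be sketched as follows; keying records by a business identifier ("id") and flattening nested dictionaries into attribute paths are assumptions of this sketch, one possible way of making nested snapshots comparable, and not the algorithm implemented in the micro-project.

# Illustrative construction of an increment (delta) from two consecutive
# snapshots of a nested data source. Keying by "id" and flattening nested
# dictionaries are assumptions made for this sketch only.

def flatten(obj, prefix=""):
    # Turn nested dictionaries into flat attribute paths so snapshots can be compared.
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def delta(old_snapshot, new_snapshot, key="id"):
    old = {r[key]: flatten(r) for r in old_snapshot}
    new = {r[key]: flatten(r) for r in new_snapshot}
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes

snap_t1 = [{"id": 1, "contact": {"email": "a@x.pl"}}, {"id": 2, "contact": {"email": "b@x.pl"}}]
snap_t2 = [{"id": 2, "contact": {"email": "b@y.pl"}}, {"id": 3, "contact": {"email": "c@x.pl"}}]
print(delta(snap_t1, snap_t2))
# inserts: id 3; updates: id 2 (changed e-mail); deletes: id 1

The fragility discussed above appears exactly here: if the source renames or restructures attributes between the two snapshots, the flattened paths no longer line up and every record looks changed, so the comparison logic itself has to be adjusted.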
3.3.3 Designing logical and physical DL schemas. Designing logical and physical schemas for relational DWs is a very well researched topic, supported by mature relational database management systems as well as by design and development tools. DW modeling is a fundamental task done in an early phase of a DW development. Modeling data lakes still needs to be researched. Dealing with unstructured data tempts a designer to apply NoSQL storage, to model a flexible schema. Unfortunately, NoSQL storage servers are not mature yet and they do not offer a rich SQL syntax, advanced indexing, and advanced cost-based query optimization. On the contrary, relational database management systems (especially the commercial ones) offer such features at the cost of rigid schemas.

Despite the fact that research is being conducted on data lake modeling (i.e., for unstructured data), e.g., [13, 15], there is no commonly accepted modeling method. Research and development in physical storage, physical data structures, e.g., [28, 45], and query optimization techniques for NoSQL storage, e.g., [22], is being conducted as well, but there is no single product that would offer all these three important functionalities, which mature commercial RDBMSs do offer.

To conclude, designing logical DL schemas, physical storage, and physical data structures is not guided by any method (like for relational databases) and must be done case by case (ad hoc).

3.3.4 Provisioning optimal cloud resources. While deploying a cloud architecture, one typically aims at achieving the best possible performance with minimized monetary costs paid for the cloud infrastructure. Different types of processing (e.g., analytical, transactional, ETL) require different amounts of cloud resources (e.g., the number of virtual machines, the number of CPUs and cores, main memory, disk storage).

Provisioning optimal resources for a given type of processing to maximize performance is contradictory to minimizing monetary costs, and the joint optimization of these goals is a combinatorial problem of exponential complexity, e.g., [3]. Despite the fact that some research has been and still is conducted in this area, e.g., [4, 29, 36], it is considered for the time being as an open problem.
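The shape of this trade-off can be illustrated with a tiny exhaustive search over candidate configurations; the VM sizes, prices, the idealized runtime model, and the deadline below are invented numbers used only to show the structure of the problem, which in realistic settings involves far too many configuration combinations to enumerate.

# Toy illustration of cloud resource provisioning as a constrained search:
# pick the cheapest configuration whose estimated runtime meets a deadline.
# Prices, runtimes, and the estimation model are invented for this sketch.
from itertools import product

VM_SIZES = {"small": (2, 0.10), "medium": (4, 0.22), "large": (8, 0.48)}  # (vCPUs, $/hour)
WORKLOAD_CPU_HOURS = 64   # hypothetical total work of an ETL run
DEADLINE_HOURS = 3.0      # time window in which the ETL must finish

def best_configuration():
    best = None
    for size, n_vms in product(VM_SIZES, range(1, 17)):
        vcpus, price = VM_SIZES[size]
        runtime = WORKLOAD_CPU_HOURS / (vcpus * n_vms ** 0.9)  # sub-linear parallel scaling
        cost = runtime * price * n_vms
        if runtime <= DEADLINE_HOURS and (best is None or cost < best[0]):
            best = (cost, size, n_vms, runtime)
    return best

print(best_configuration())
# With these invented numbers the search happens to prefer many small VMs;
# changing prices, the workload, or the deadline shifts the optimum.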
4 SUMMARY
In this paper we outlined a project being run for a big financial institution in Poland. Its goal is to build a system providing clean data about customers and their related data, augmented from external data sources. To this end, a hybrid architecture is built for the FI. The architecture is composed of the on-premise databases and the private-public cloud eco-system. The cloud eco-system will include customers data and data related to customers, migrated from the old on-premise architecture, and data ingested from Internet data sources.

The focus of this paper is on presenting the hybrid architecture that we designed and on the challenges that we encountered while realizing the project (since the project has just started, we are not able to provide solutions to the challenges yet).

Based on the early experience gained and on the analysis of the state of the art in research and technology, we can draw the following conclusions:
• commonly shared knowledge on building data lakes in a cloud for a FI is not available (as most probably it is considered a company's asset);
• comprehensive methods for guiding the process of efficient data migration from an on-premise to a cloud architecture are not available either;
• a comprehensive method for building logical and physical DL models is not available either;
• a reference DL architecture in a cloud for a FI is not available (as, again, most probably it is considered a company's asset);
• finally, an end-to-end method for designing and deploying a data lake for a FI (from conceptual, logical, and physical modeling, through data migration and ingestion, data governance, to performance optimization) still needs to be developed.

Moreover, running a project for a big FI requires the usage of commercial tools at every stage and for every task of the DW and DL development process. For a big FI, the only applicable solution is to use powerful commercial software.
For example, commercial ETL engines offer rich functionalities, including parsing of addresses and names, ready-to-use algorithms, cleaning techniques and some deduplication techniques, parallel processing either on a single server or in a cloud, and access to non-relational data. Technical support from a software house is also of great importance. In the described project we use Informatica for data profiling and ETL; Oracle databases for storing internal company data, including the CRCD; and Microsoft Azure and Google Cloud Platform as private-public cloud eco-systems in the Polish National Cloud.
REFERENCES
[1] Syed Muhammad Fawad Ali and Robert Wrembel. 2017. From conceptual design to performance optimization of ETL workflows: current state of research and open problems. The VLDB Journal 26, 6 (2017), 777–801.
[2] Syed Muhammad Fawad Ali and Robert Wrembel. 2019. Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics. In European Conf. Advances in Databases and Information Systems (LNCS, Vol. 11695). Springer, 441–456.
[3] Syed Muhammad Fawad Ali and Robert Wrembel. 2020. Framework to Optimize Data Processing Pipelines Using Performance Metrics. In Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK) (LNCS, Vol. 12393). Springer, 131–140.
[4] Abdullah Khalid A. Almasaud, Agresh Bharadwaj, Sandra Sampaio, and Rizos Sakellariou. 2020. Challenges in Resource Provisioning for the Execution of Data Wrangling Workflows on the Cloud: A Case Study. In Int. Conf. on Database and Expert Systems Applications (DEXA) (LNCS, Vol. 12392). 66–75.
[5] Rana Alotaibi, Damian Bursztyn, Alin Deutsch, Ioana Manolescu, and Stamatis Zampetakis. 2019. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. In SIGMOD. 1660–1677.
[6] Judith Awiti and Robert Wrembel. 2020. Rule Discovery for (Semi-)automatic Repairs of ETL Processes. In Int. Baltic Conf. on Databases and Information Systems (CCIS, Vol. 1243). Springer, 250–264.
[7] Arun Balasubramanyan. 2014. Data warehouse augmentation, Part 3: Use big data technology for an active archive. https://www.ibm.com/developerworks/library/ba-augment-data-warehouse3/index.html. IBM DeveloperWorks.
[8] Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2019. End-to-End Entity Resolution for Big Data: A Survey. CoRR abs/1905.06397 (2019).
[9] Adrian Colyer. 2020. The Morning Paper: An overview of end-to-end entity resolution for big data. https://blog.acolyer.org/2020/12/14/entity-resolution/.
[10] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (2015), 11–16.
[11] Ted Friedman and Nick Heudecker. 2020. Data Hubs, Data Lakes and Data Warehouses: How They Are Different and Why They Are Better Together. Gartner.
[12] Gartner. 2019. Magic Quadrant for Data Integration Tools.
[13] Corinna Giebler, Christoph Gröger, Eva Hoos, Holger Schwarz, and Bernhard Mitschang. 2019. Modeling Data Lakes with Data Vault: Practical Experiences, Assessment, and Lessons Learned. In Int. Conf. on Conceptual Modeling (ER) (LNCS, Vol. 11788). Springer, 63–77.
[14] Informatica. 2007. How to Achieve Flexible, Cost-effective Scalability and Performance through Pushdown Processing. Whitepaper.
[15] Matthias Jarke and Christoph Quix. 2017. On Warehouses, Lakes, and Spaces: The Changing Role of Conceptual Modeling for Data Integration. In Conceptual Modeling Perspectives. Springer, 231–245.
[16] Sean Michael Kerner. 2019. Top 8 Cloud Data Warehouses. https://www.datamation.com/cloud-computing/top-cloud-data-warehouses.html. Datamation.
[17] Boyan Kolev, Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. The CloudMdsQL Multistore System. In SIGMOD. 2113–2116.
[18] Nikolaos Konstantinou and Norman W. Paton. 2020. Feedback driven improvement of data preparation pipelines. Inf. Syst. 92 (2020), 101480.
[19] Nitin Kumar and P. Sreenivasa Kumar. 2010. An Efficient Heuristic for Logical Optimization of ETL Workflows. In VLDB Workshop on Enabling Real-Time Business Intelligence. 68–83.
[20] Alice LaPlante. 2020. Building a Unified Data Infrastructure. O'Reilly whitepaper.
[21] Rao Lella. 2014. Optimizing BDFS jobs using InfoSphere DataStage Balanced Optimization. IBM DeveloperWorks.
[22] Divya Mahajan, Cody Blakeney, and Ziliang Zong. 2019. Improving the energy efficiency of relational and NoSQL databases via query optimizations. Sustain. Comput. Informatics Syst. 22 (2019), 120–133.
[23] Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu, and Patricia C. Arocena. 2019. Data Lake Management: Challenges and Opportunities. Proc. VLDB Endow. 12, 12 (2019), 1986–1989.
[24] Felix Naumann. 2013. Similarity measures. Hasso Plattner Institut.
[25] George Papadakis and Themis Palpanas. 2018. Web-scale, Schema-Agnostic, End-to-End Entity Resolution. In Tutorial at World Wide Web.
[26] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2020), 31:1–31:42.
[27] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, and Manolis Koubarakis. 2019. Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI. SIGMOD Record 48, 4 (2019), 30–36.
[28] Philipp D. Rohde and Maria-Esther Vidal. 2020. Optimizing Federated Queries Based on the Physical Design of a Data Lake. In Proc. of EDBT/ICDT Workshops (CEUR Workshop Proceedings, Vol. 2578).
[29] Oscar Romero and Robert Wrembel. 2020. Data Engineering for Data Science: Two Sides of the Same Coin. In Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK) (LNCS, Vol. 12393). Springer, 157–166.
[30] Philip Russom. 2019. Modernizing the Logical Data Warehouse. https://tdwi.org/articles/2019/10/14/dwt-all-modernizing-the-logical-data-warehouse.aspx. TDWI.
[31] John Ryan and Uli Bethke. 2019. A Comparison of Cloud Data Warehouse Platforms. https://www.datamation.com/cloud-computing/top-cloud-data-warehouses.html. Sonora Intelligence.
[32] SAP. [n.d.]. Data aging. SAP Help portal.
[33] ScienceSoft. [n.d.]. Data Warehouse in the Cloud: Features, Important Integrations, Success Factors, Benefits and More. https://www.scnsoft.com/analytics/data-warehouse/cloud.
[34] Alkis Simitsis, Panos Vassiliadis, and Timos K. Sellis. [n.d.]. Optimizing ETL Processes in Data Warehouses. In Int. Conf. on Data Engineering (ICDE). 564–575.
[35] Alkis Simitsis, Panos Vassiliadis, and Timos K. Sellis. 2005. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering 17, 10 (2005), 1404–1419.
[36] Sukhpal Singh and Inderveer Chana. 2016. Cloud resource provisioning: survey, status and future research directions. Knowl. Inf. Syst. 49, 3 (2016), 1005–1069.
[37] Mohamed A. Soliman, Lyublena Antova, Marc Sugiyama, Michael Duller, Amirhossein Aleyasen, Gourab Mitra, Ehab Abdelhamid, Mark Morcos, Michele Gage, Dmitri Korablev, and Florian M. Waas. 2020. A Framework for Emulating Database Operations in Cloud Data Warehouses. In Proc. of SIGMOD. ACM, 1447–1461.
[38] Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino. 2015. Data Wrangling: The Challenging Journey from the Wild to the Lake. In Conf. on Innovative Data Systems Research (CIDR).
[39] Panos Vassiliadis and Apostolos V. Zarras. 2017. Schema Evolution Survival Guide for Tables: Avoid Rigid Childhood and You're En Route to a Quiet Life. J. Data Semant. 6, 4 (2017), 221–241.
[40] Panos Vassiliadis, Apostolos V. Zarras, and Ioannis Skoulis. 2017. Gravitating to rigidity: Patterns of schema evolution - and its absence - in the lives of tables. Inf. Systems 63 (2017), 24–46.
[41] Marco Vogt, Alexander Stiemer, and Heiko Schuldt. 2018. Polypheny-DB: Towards a Distributed and Self-Adaptive Polystore. In BigData. 3364–3373.
[42] Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity Matching: How Similar Is Similar. Proc. VLDB Endow. 4, 10 (2011), 622–633.
[43] Artur Wojciechowski and Robert Wrembel. 2020. On Case-Based Reasoning for ETL Process Repairs: Making Cases Fine-Grained. In Int. Baltic Conf. on Databases and Information Systems (CCIS, Vol. 1243). Springer, 235–249.
[44] Anita Zakrzewska and David A. Bader. 2016. Aging data in dynamic graphs: A comparative study. In Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM). IEEE Computer Society, 1055–1062.
[45] Dongfang Zhang, Yong Wang, Zhenling Liu, and Shijie Dai. 2019. Improving NoSQL Storage Schema Based on Z-Curve for Spatial Vector Data. IEEE Access 7 (2019), 78817–78829.