<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimizing Data Integration Processes with the Support of Machine Learning - Is it Really Possible?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert Wrembel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Poznan University of Technology and Interdisciplinary Centre for Artificial Intelligence and Cybersecurity</institution>
          ,
          <addr-line>pl. Sklodowskiej-Curie 5, 60965, Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this panel session I address two research questions in the area of the optimization of data integration (DI) processes (a.k.a. ETL processes), which (in my opinion) still need substantial research. The questions are: (1) how to efficiently push down the execution of DI tasks to non-relational data sources and (2) how to handle user-defined functions (especially those treated as black boxes) when optimizing the performance of DI processes. The discussion to be initiated during the panel is whether sound answers to these questions can be found with the support of machine learning techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>data integration process</kwd>
        <kwd>ETL process</kwd>
        <kwd>optimizing data integration process</kwd>
        <kwd>user-defined functions</kwd>
        <kwd>resource usage time series</kwd>
        <kwd>machine learning</kwd>
        <kwd>time series similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>For years, the widespread adoption of complex, data-driven systems
has been observed, e.g., medical systems, smart agriculture,
and smart cities. These systems produce huge volumes of
highly heterogeneous data (a.k.a. big data) that need to be
integrated to feed various applications providing descriptive
analytics or prediction models. Thus, data integration (DI)
architectures are indispensable in modern information systems,
and they constantly face new challenges caused by
complex, fast-arriving, and voluminous data as well as emerging
data engineering technologies.</p>
      <p>
        A common goal of DI is to make heterogeneous and
typically distributed data available to an end user in a unified
format. Research and development work has resulted in a few
standard DI architectures, namely: (1) federated [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
mediated [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], (2) data warehouse (DW) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], (3) lambda [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
(4) data lake (DL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], (5) data lake house (DLH) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], (6)
polystore [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and (7) data mesh/data fabric [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In all of the
aforementioned architectures, data from heterogeneous and
distributed data sources (DSs) are made available in an
integrated system (either by virtual or materialized integration)
by means of an integration layer. This layer is implemented
by sophisticated software, which runs the so-called DI
processes (a.k.a. ETL in data warehouse architectures, data
processing pipelines in data science, data wrangling, or data
processing workflows [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]).
      </p>
      <p>DI processes are core elements of all DI architectures.
DI processes are complex workflows composed of dozens
to thousands of tasks. These tasks are responsible for
extracting data from DSs, transforming data into a common
model and data structures, cleaning data, removing missing,
inconsistent, and redundant data items, integrating data,
and loading them into a central repository (i.e., DW, DL,
or DLH) or making them available in virtual integration
architectures (i.e., federated, mediated, polystore, or data
mesh). DI processes are managed by dedicated software,
called a DI engine (an ETL engine in a DW architecture).</p>
      <p>
        Most DI engines support a set of predefined (out-of-the-box)
tasks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Even though methods for developing DI processes have
been researched and developed for decades (see [
        <xref ref-type="bibr" rid="ref10 ref12">10, 12</xref>
        ])
and have been included in commercial (and some open-license)
DI design environments and DI engines [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the task of
designing and managing DI processes is still difficult and
time-consuming. Moreover, the support from these design tools
for optimizing such designs and optimizing the execution
of DI processes is very limited.
      </p>
      <p>
        In this context, with the rapid advances in machine
learning (ML) techniques, the application of such techniques to
designing and optimizing DI processes may sound attractive.
However, research on applying ML to DI focuses mainly on mappings
between values [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or schemas [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], data cleaning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and
data deduplication [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. Moreover, even though
multiple providers of DI technologies and consulting companies
advocate applying ML techniques in data integration, a clear
step-by-step and end-to-end approach has not been
proposed yet.
      </p>
      <p>
        ML techniques have already been successfully applied to
optimizing system performance, e.g., [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23 ref24 ref25">19, 20, 21, 22, 23, 24,
25</xref>
        ]. They typically rely on
performance characteristics (typically CPU, I/O,
and memory usage) collected during normal system
operation or during extensive testing phases.
Performance models are then learned from these characteristics.
The works reported in [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ] focus on applying ML
techniques to provide auto-tuning capabilities in the so-called
self-driving database management systems.
      </p>
      <p>In this panel talk, I will focus on selected challenges
related to the performance of ETL processes. My subjective
point of view on the presented open issues/challenges
results from a cooperation with IBM Software Lab in Kraków
(Poland) on a data integration project.</p>
    </sec>
    <sec>
      <title>2. Performance optimization of DI processes</title>
      <p>In order to reduce the execution time of a DI process, a few
classes of solutions have been proposed. First, a business
approach is to scale-up or scale-out a DI server. Second, DI
engines existing on the market support parallel processing
of DI tasks. This is also a trend in research. Third, some
DI engines support moving the execution of some DI tasks
close to storage. One technique of this class is called the
push-down optimization. Fourth, re-ordering of DI tasks
has been well researched and resulted in a few approaches.</p>
      <p>Scaling refers to adding computing power to a DI
architecture. Two types of scaling are common, namely: (1)
vertical scaling of a DI server, i.e., increasing the number
of CPUs or the size of RAM, or adding specialized hardware like
FPGAs, and (2) horizontal scaling of a DI architecture, i.e.,
adding new computing nodes.</p>
      <p>
        Parallel processing consists of executing tasks in
parallel OS processes or threads. This technique has also been well
researched, e.g., [
        <xref ref-type="bibr" rid="ref28 ref29 ref30">28, 29, 30</xref>
        ]. In the simplest case
(available in commercial DI engines, see [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), uploading data
into a data warehouse is executed in parallel. A challenge
in applying parallelism is to figure out the most efficient
parallelization schemes for a given DI task or the whole DI
process [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <p>Data processing close to storage - in the context of
databases, there are numerous implementations that support
moving data-intensive processing from an application layer
to storage. Examples of such systems are IBM PureData System for
Analytics and Oracle Exadata. Both of them use dedicated
hardware to perform operations on data read from disks,
like decompression, filtering, and projection.</p>
      <p>Push-down - the principle of push-down optimization
is to move some DI tasks into a data source, to be executed
there. Push-down is available in IBM InfoSphere DataStage
and Informatica, but only for relational DSs.</p>
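      <p>As a rough illustration of the push-down principle for a relational DS, the following minimal Python sketch contrasts a filter executed inside the DI engine with the same filter pushed into the source as a SQL predicate. The in-memory SQLite database, the orders table, and all function names are assumptions made for this example only; they do not reflect how IBM InfoSphere DataStage or Informatica implement push-down.</p>
      <preformat>
```python
import sqlite3


def make_source():
    # Hypothetical relational data source holding 1000 order rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, float(i)) for i in range(1000)],
    )
    return conn


def filter_in_engine(conn, threshold):
    # Without push-down: every row is transferred to the engine,
    # and the filter task runs there.
    rows = conn.execute("SELECT id, amount FROM orders").fetchall()
    kept = [row for row in rows if row[1] >= threshold]
    return kept, len(rows)


def filter_pushed_down(conn, threshold):
    # With push-down: the filter task runs inside the source as a
    # WHERE clause, so only qualifying rows are transferred.
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE amount >= ?", (threshold,)
    ).fetchall()
    return rows, len(rows)


conn = make_source()
engine_rows, engine_transferred = filter_in_engine(conn, 990.0)
pushed_rows, pushed_transferred = filter_pushed_down(conn, 990.0)
print("rows transferred without push-down:", engine_transferred)
print("rows transferred with push-down:", pushed_transferred)
```
      </preformat>
      <p>Both variants return the same rows; they differ only in how many rows cross the boundary between the source and the engine, which is the quantity push-down aims to reduce.</p>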
      <p>
        Task reordering is the most researched technique for DI
process optimization. A group of approaches draws upon
the idea of changing the order of tasks in an original DI
process, such that the reordered process is more efficient than the
original one. Finding a (sub-)optimal order of tasks is
computationally complex, and for this reason, some heuristics
have to be used [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ].
      </p>
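      <p>To make the idea of heuristic reordering concrete, the sketch below applies one well-known rule from query optimization: tasks are sorted by the rank (selectivity - 1) / cost, so that cheap and highly selective tasks run first. This is an illustrative assumption and not necessarily one of the heuristics proposed in the cited works. The Task class, the cost and selectivity figures, and the assumption that all tasks can be freely swapped (no dependencies between them) are hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cost_per_row: float  # assumed processing cost per input row (must be positive)
    selectivity: float   # assumed fraction of input rows passed downstream


def pipeline_cost(tasks, input_rows):
    # Total cost of running the tasks in the given order: each task
    # pays its per-row cost for every row that reaches it.
    total = 0.0
    rows = float(input_rows)
    for task in tasks:
        total += rows * task.cost_per_row
        rows *= task.selectivity
    return total


def reorder(tasks):
    # Rank heuristic: ascending (selectivity - 1) / cost puts cheap,
    # highly selective tasks first. Assumes the tasks commute.
    return sorted(tasks, key=lambda t: (t.selectivity - 1.0) / t.cost_per_row)


original = [
    Task("enrich", cost_per_row=5.0, selectivity=1.0),
    Task("filter_country", cost_per_row=1.0, selectivity=0.1),
    Task("deduplicate", cost_per_row=2.0, selectivity=0.8),
]
reordered = reorder(original)
print([task.name for task in reordered])
print(pipeline_cost(original, 1000), pipeline_cost(reordered, 1000))
```
      </preformat>
      <p>In a real DI process, dependencies between tasks restrict which swaps are legal, which is one reason why the general problem is computationally complex.</p>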
    </sec>
    <sec id="sec-2">
      <title>3. Still open research challenges</title>
      <p>This section outlines my subjective view on two still open
research challenges in the optimization of DI processes. They
include: the push-down optimization on non-relational data
sources and DI processes with black-box user-defined
functions (BBUDFs).</p>
      <p>As stated above, the push-down technique in commercial
systems is available only for relational DSs,
and it is typically applicable to tasks at the beginning of a
DI process, i.e., filtering or simple pre-processing. Moreover,
push-down may also be applicable to tasks that migrate
large volumes of data between systems. The first example
of such a task is data anonymization. According to GDPR,
sensitive data cannot 'leave' a source system before being
anonymized. This means that a DI engine cannot run the
anonymization and this task has to be pushed down into
the source system. The second example is enforcing data
access policies. Sensitive data that cannot be accessed in the
source system must be filtered out directly in the system.
In this case, a data access policy originally included in a DI
process must be pushed down into the DS.</p>
      <p>
        With the widespread adoption of big data storage systems, a natural
step is to extend push-down to non-relational DSs. To the
best of our knowledge, the applicability and efficiency of
this technique for non-relational (a.k.a. NoSQL) DSs have not
been studied yet (with the exception of [
        <xref ref-type="bibr" rid="ref34 ref35 ref36">34, 35, 36</xref>
        ]). The
issues that have to be investigated include: (1) which DI
tasks can be pushed down to improve the
performance of a DI process and (2) how
to efficiently implement a given pushed-down task in a DS,
leveraging the functionality and internal structures of the
DS.
      </p>
      <p>
        DI processes use not only predefined tasks (e.g., [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ])
available in design, development, and management tools
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], but also require the deployment of user-defined
functions (UDFs), in order to implement specific tasks [
        <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
        ].
UDFs can be implemented in various programming
languages and are treated by a DI engine as black boxes.
Typically, the most advanced commercial engines allow a UDF
to be implemented in any programming language and called
from the engine as an external program (i.e., as a pure
black box). For this reason, optimizing the execution of DI
processes with BBUDFs is more than challenging. To be
able to apply the aforementioned optimization techniques,
one must know the performance characteristics of BBUDFs and
(if possible) their semantics.
      </p>
    </sec>
    <sec id="sec-3">
      <title>4. Research hypothesis</title>
      <p>The research hypothesis stated in this panel talk is threefold.</p>
      <p>First, we expect that the push-down technique applied
to non-relational DSs will increase the performance of
DI processes (i.e., reduce their execution time). Based on
the developed execution cost models and implementation
skeletons, it will be possible to push down typical DI tasks
into non-relational DSs. The question, however, is how
push-down could benefit from machine learning (ML) techniques
in the course of: (1) deciding whether a given task should
be pushed down into a non-relational DS and (2) providing
an efficient implementation of the task in the DS.</p>
      <p>
        Second, we expect that it will be possible to build
performance models of basic and complex BBUDFs (like those
listed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) by applying ML techniques. The models of
BBUDFs will be assigned to performance classes provided
by prediction models built from analyzing the performance
models of known UDFs. For BBUDFs, their performance
characteristics will be collected and they will be classified
into one of the already known performance classes, thus
allowing us to reason at least about an expected BBUDF
performance. Our initial work [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] shows that the proposed
approach is feasible on basic BBUDFs.
      </p>
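      <p>One possible way to assign a BBUDF to a known performance class is to compare its resource usage time series with those of known UDFs. The sketch below uses dynamic time warping (DTW) as the similarity measure and a 1-nearest-neighbour rule. Both choices, as well as the class labels and the CPU usage values, are assumptions made for illustration; the sketch is not a description of the approach evaluated in the initial work mentioned above.</p>
      <preformat>
```python
def dtw_distance(a, b):
    # Classic dynamic time warping with absolute difference as the
    # local cost; d[i][j] is the best alignment cost of a[:i] and b[:j].
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def classify(series, labelled_profiles):
    # 1-nearest-neighbour: return the label of the known profile whose
    # usage series is closest to the observed one under DTW.
    best_label, _ = min(
        ((label, dtw_distance(series, profile))
         for label, profile in labelled_profiles),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical CPU usage profiles (percent) of known UDFs.
known_profiles = [
    ("cpu-bound", [90, 95, 92, 94, 93, 91]),
    ("io-bound", [10, 12, 60, 11, 9, 58]),
    ("idle-then-burst", [5, 5, 5, 5, 80, 85]),
]

# Hypothetical CPU usage observed while running a black-box UDF.
bbudf_cpu = [88, 93, 96, 90, 92]
print(classify(bbudf_cpu, known_profiles))
```
      </preformat>
      <p>DTW tolerates series of different lengths and small shifts in time, which is convenient when the same function is observed on inputs of different sizes; whether it is the most suitable measure for BBUDFs remains an open question.</p>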
      <p>Third, we expect that it will be possible to build semantic
models of basic BBUDFs by means of machine learning
techniques - possibly by applying deep neural networks.
Here, the open questions are: whether ML techniques could be
used to build the models, what kind of techniques would be
suitable, and what input data would be required.</p>
      <p>
        The discussion on the aforementioned challenges should
be extended towards a broader scope of DI: (1) whether ML
techniques can revolutionize the development and
deployment methods of efficient DI pipelines, (2) how to build
an end-to-end DI pipeline with the support of ML, (3) how
to assure and verify the quality of data produced by such
pipelines, (4) how to leverage ML techniques for
building complex DI architectures with appropriately designed
software and hardware, and (5) how to mitigate bias in the
ML techniques used to build DI pipelines. Furthermore,
another question is whether ML could help solve the still
unsolved challenge of ETL evolution, e.g., [
        <xref ref-type="bibr" rid="ref40 ref41">40, 41</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouguettaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Benatallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <source>Interconnecting Heterogeneous Information Systems</source>
          , Kluwer Academic Publishers, ISBN
          <isbn>0792382161</isbn>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brezany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tjoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wöhrer</surname>
          </string-name>
          ,
          <article-title>Mediators in the architecture of grid information systems</article-title>
          ,
          <source>in: Int. Conf. Parallel Processing and Applied Mathematics (PPAM)</source>
          , volume
          <volume>3019</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2003</year>
          , pp.
          <fpage>788</fpage>
          -
          <lpage>795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Errami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A. E.</given-names>
            <surname>Kadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Badir</surname>
          </string-name>
          ,
          <article-title>Spatial big data architecture: From data warehouses and data lakes to the lakehouse</article-title>
          ,
          <source>Journal of Parallel and Distributed Computing</source>
          <volume>176</volume>
          (
          <year>2023</year>
          )
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Munshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A. I.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>Data lake lambda architecture for smart grids big data analytics</article-title>
          ,
          <source>IEEE Access</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>40463</fpage>
          -
          <lpage>40471</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <article-title>Data lakes: A survey of functions and systems</article-title>
          ,
          <year>2023</year>
          . arXiv:2106.09592.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Harby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Zulkernine</surname>
          </string-name>
          ,
          <article-title>From data warehouse to lakehouse: A comparative review</article-title>
          ,
          <source>in: IEEE Big Data</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chirkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gadepally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Mattson</surname>
          </string-name>
          ,
          <article-title>Enabling query processing across heterogeneous data models: A survey</article-title>
          ,
          <source>in: IEEE Big Data</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3211</fpage>
          -
          <lpage>3220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <source>Data Mesh: Delivering Data-Driven Value at Scale</source>
          , O'Reilly, ISBN
          <isbn>1492092398</isbn>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Furche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Orsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <article-title>Data wrangling for big data: Challenges and opportunities</article-title>
          ,
          <source>in: Int. Conf. on Extending Database Technology (EDBT)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>473</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <article-title>The history, present, and future of ETL technology (invited)</article-title>
          ,
          <source>in: Int. Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) @EDBT/ICDT</source>
          , volume
          <volume>3369</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          ,
          <source>Product documentation: InfoSphere Information Server 11.3</source>
          ,
          <uri>https://www.ibm.com/docs/en/iis/11.3?topic=jobs-processing-data</uri>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. M. F.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          ,
          <article-title>From conceptual design to performance optimization of ETL workflows: current state of research and open problems</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>26</volume>
          (
          <year>2017</year>
          )
          <fpage>777</fpage>
          -
          <lpage>801</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Gartner</surname>
          </string-name>
          ,
          <article-title>Magic quadrant for data integration tools</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Birgersson</surname>
          </string-name>
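          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Franke</surname>
          </string-name>
          ,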
          <article-title>Data integration using machine learning</article-title>
          ,
          <source>in: IEEE Int. Enterprise Distributed Object Computing Workshop (EDOC)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <article-title>Data integration and machine learning: a natural synergy</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>2094</fpage>
          -
          <lpage>2097</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <article-title>Machine learning and data cleaning: Which serves the other?</article-title>
          ,
          <source>ACM Journal of Data and Information Quality</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>13:1</fpage>
          -
          <lpage>13:11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Barlaug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gulla</surname>
          </string-name>
          ,
          <article-title>Neural networks for entity matching: a survey</article-title>
          ,
          <source>ACM Transactions on Knowledge Discovery from Data</source>
          <volume>15</volume>
          (
          <year>2021</year>
          )
          <fpage>52:1</fpage>
          -
          <lpage>52:37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Skoutas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Pre-trained embeddings for entity resolution: An experimental analysis</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>2225</fpage>
          -
          <lpage>2238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Aken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Automatic database management system tuning through large-scale machine learning</article-title>
          ,
          <source>in: Int. Conf. on Management of Data (SIGMOD)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1009</fpage>
          -
          <lpage>1024</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Graziani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <article-title>An active learning approach to build adaptive cost models for web services</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>119</volume>
          (
          <year>2019</year>
          )
          <fpage>89</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Á. B.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muntés-Mulero</surname>
          </string-name>
          ,
          <article-title>Using machine learning to optimize parallelism in big data applications</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>86</volume>
          (
          <year>2018</year>
          )
          <fpage>1076</fpage>
          -
          <lpage>1092</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pumma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Phunchongharn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chapeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Achalakul</surname>
          </string-name>
          ,
          <article-title>A runtime estimation framework for ALICE</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>72</volume>
          (
          <year>2017</year>
          )
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sellami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Defude</surname>
          </string-name>
          ,
          <article-title>Complex queries optimization and evaluation over relational and NoSQL data stores in cloud environments</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <fpage>217</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Taheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Zomaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kassler</surname>
          </string-name>
          ,
          <article-title>vmbbprofiler: a black-box profiling approach to quantify sensitivity of virtual machines to shared cloud resources</article-title>
          ,
          <source>Computing</source>
          <volume>99</volume>
          (
          <year>2017</year>
          )
          <fpage>1149</fpage>
          -
          <lpage>1177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Witt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gusew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <article-title>Predictive performance modeling for distributed batch processing using black box monitoring and machine learning</article-title>
          ,
          <source>Information Systems</source>
          <volume>82</volume>
          (
          <year>2019</year>
          )
          <fpage>33</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Butrovich</surname>
          </string-name>
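          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,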
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Aken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Make your database system dream of electric sheep: Towards self-driving operation</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>3211</fpage>
          -
          <lpage>3221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Markakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ngom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Check out the big brain on BRAD: simplifying cloud data processing with learned automated data meshes</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>3293</fpage>
          -
          <lpage>3301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S. M. F.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thiele</surname>
          </string-name>
          ,
          <article-title>Parallelizing user-defined functions in the ETL workflow using orchestration style sheets</article-title>
          ,
          <source>Int. Journal of Applied Mathematics and Computer Science</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>69</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <article-title>Scheduling strategies for efficient ETL execution</article-title>
          ,
          <source>Information Systems</source>
          <volume>38</volume>
          (
          <year>2013</year>
          )
          <fpage>927</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iftikhar</surname>
          </string-name>
          ,
          <article-title>An ETL optimization framework using partitioning and parallelization</article-title>
          ,
          <source>in: ACM Symposium on Applied Computing (SAC)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S. M. F.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          ,
          <article-title>Framework to optimize data processing pipelines using performance metrics</article-title>
          ,
          <source>in: Int. Conf. on Big Data Analytics and Knowledge Discovery (DAWAK)</source>
          ,
          <source>LNCS 12393</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          ,
          <article-title>State-space optimization of ETL workflows</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>17</volume>
          (
          <year>2005</year>
          )
          <fpage>1404</fpage>
          -
          <lpage>1419</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Predicate pushdown for data science pipelines</article-title>
          ,
          <source>Int. Conf. on Management of Data (SIGMOD)</source>
          <volume>1</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bodziony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morawski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          ,
          <article-title>Evaluating push-down on NoSQL data sources: experiments and analysis paper</article-title>
          ,
          <source>in: Int. Workshop on Big Data in Emergent Distributed Environments (BiDEDE) @ ACM SIGMOD/PODS Conference</source>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>4:1</fpage>
          -
          <lpage>4:6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bodziony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roszyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          ,
          <article-title>On evaluating performance of balanced optimization of ETL processes for streaming data sources</article-title>
          ,
          <source>in: Int. Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) @EDBT/ICDT</source>
          , volume
          <volume>2572</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Forresi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <article-title>Cost-based optimization of multistore query plans</article-title>
          ,
          <source>Information Systems Frontiers</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>1925</fpage>
          -
          <lpage>1951</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Crotty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galakatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dursun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Cetintemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          ,
          <article-title>An architecture for compiling UDF-centric workflows</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>1466</fpage>
          -
          <lpage>1477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Extend core UDF framework for GPU-enabled analytical query evaluation</article-title>
          ,
          <source>in: Int. Database Engineering and Applications Symposium (IDEAS)</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lehnhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ciesielski</surname>
          </string-name>
          ,
          <article-title>Designing and implementing a method for assessing similarities between time series on computer resources consumed by data processing tasks</article-title>
          ,
          <source>Master's thesis</source>
          , Poznan University of Technology,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>D.</given-names>
            <surname>Butkevicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Freiberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Halberg</surname>
          </string-name>
          ,
          <article-title>MAIME: A maintenance manager for ETL processes</article-title>
          ,
          <source>in: Workshops of the EDBT/ICDT Joint Conference</source>
          , volume
          <volume>1810</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papastefanatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vassiliou</surname>
          </string-name>
          ,
          <article-title>Policy-regulated management of ETL evolution</article-title>
          ,
          <source>Journal on Data Semantics</source>
          <volume>13</volume>
          (
          <year>2009</year>
          )
          <fpage>147</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>