<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Evaluating Performance of Balanced Optimization of ETL Processes for Streaming Data Sources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michał Bodziony</string-name>
          <email>michal.bodziony@pl.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Szymon Roszyk</string-name>
          <email>szymon.roszyk@student.put.poznan.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Wrembel</string-name>
          <email>robert.wrembel@cs.put.poznan.pl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Poland, Software Lab Kraków</institution>
          ,
          <addr-line>Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A core component of each data warehouse architecture is the Extract-Transform-Load (ETL) layer. The processes in this layer integrate, transform, and move large volumes of data from data sources into a data warehouse (or data lake). For this reason, efficient execution of ETL processes is of high practical importance. Research approaches propose task re-ordering and parallel processing. In practice, companies apply scale-up or scale-out techniques to increase the performance of the ETL layer. ETL tools existing on the market mainly support parallel processing of ETL tasks. Additionally, the IBM DataStage ETL engine applies the so-called balanced optimization, which allows ETL tasks to be executed either in a data source or in a data warehouse, thus optimizing overall resource consumption. With the widespread adoption of non-relational technologies for storing big data, it is essential to apply ETL processes to this type of data. In this paper, we present an initial assessment of the balanced optimization applied to IBM DataStage ingesting data from a Kafka streaming data source.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        A core component of a data warehouse (DW) [25] or a data lake (DL) [22] is the Extract-Transform-Load (ETL) software. It is responsible for ingesting data from data sources (DSs), integrating and cleaning data, as well as uploading them into a DW/DL. ETL is implemented as a workflow (process) composed of tasks, managed by an ETL engine, and frequently run on a dedicated server or on a cluster. An ETL process moves large volumes of data between DSs and a DW. Therefore, its execution may take hours to complete. For example, simply moving 1TB of data between a DS and a DW (which use magnetic disks with a typical 200MB/s throughput) takes an ETL process about 2.7 hours (roughly 5,000 seconds to read the data from the source disk and another 5,000 seconds to write them to the target disk). Resource consumption and the overall processing time are further increased by complex tasks executed within the process, including integrating, cleaning, and de-duplicating data. For this reason, providing means for the optimization of ETL processes is of high practical importance. Big data adds further complexity to the problem, resulting from: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) bigger data volumes, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) much more complex
and diverse data models and formats that need to be processed
by ETL. These characteristics call for even more efficient execution of ETL processes.
      </p>
      <p>Unfortunately, this problem has been only partially addressed by technology and research, cf. Section 2. In practice, companies apply scale-up or scale-out to their ETL servers to reduce processing time. ETL tools existing on the market mainly support parallel processing of ETL tasks. Some of them, on top of that, apply task movement within an ETL process. One such technique is called balanced optimization [17]. Research approaches propose task re-ordering and parallel processing.</p>
      <p>
        This paper presents the first findings from an R&amp;D project, carried out jointly by IBM Software Lab and Poznan University of Technology, on assessing the performance of the balanced optimization on a streaming data source. In particular, we were interested in assessing: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) whether the balanced optimization can increase the performance of ETL processes, run by IBM DataStage and ingesting data from a stream DS run by Kafka, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) which ETL tasks may
benefit from the balanced optimization, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) what parameters of the DS may affect performance. The findings are summarized in Section 4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In this section, we give an overview of solutions for optimizing ETL processes: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) applied in practice by companies, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) offered by ETL engines existing on the market, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) developed by researchers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Industrial approaches</title>
      <p>
        Industrial projects utilizing ETL reduce the processing time of ETL processes by increasing the computational power of an ETL server and by applying parallel processing of data. This can be achieved by: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) scaling-up an ETL server, i.e., by increasing the number of CPUs and the amount of memory, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) scaling-out an ETL
architecture, i.e., by means of increasing the number of
workstations in a cluster running the ETL engine. All the commercial
ETL engines, cf. [8], support this kind of optimization. Parallel
processing is also applicable to uploading data into a data
warehouse, e.g., command nzload (IBM Pure Data for Analytics, a.k.a.
Netezza) or import (Oracle).
      </p>
      <p>On top of this, ETL task re-ordering may be used to produce more efficient processes. In the simplest case, called push down, the most selective tasks are moved towards the beginning of an ETL process (towards the data sources) to reduce the data volume (I/O) as soon as possible [26]. IBM extended this technique into the balanced optimization, where some tasks are moved into DSs and some are moved into a DW, to be executed there, cf. Section 3.</p>
    </sec>
    <sec id="sec-4">
      <title>Research approaches</title>
      <p>
        Research approaches to optimizing ETL execution can be
categorized as: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) optimization by task re-ordering, augmented with the estimation of execution costs and with parallelization, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
optimization of ETL processes with user-defined functions (UDFs).
      </p>
      <p>2.2.1 Cost-based, task re-ordering, and parallelization.
In [19, 20] the authors apply the MapReduce framework to process ETL workflows that feed dimensional schemas (star, snowflake, and SCDs) in parallel. The authors provide processing skeletons for MapReduce for dimensions of different sizes and structures. A system for optimizing MapReduce workloads was presented in [11]. The optimization uses: sharing of data flows, materialization, and physical data layouts. However, neither an optimization technique nor a cost function was explained.</p>
      <p>[18] proposes to parallelize an ETL process P in three steps. First, P is partitioned into linear sub-processes LPi. Second, in each LPi, input data are partitioned horizontally into n disjoint partitions (n is a parameter) and each partition is processed in parallel by a separate thread. Third, for tasks with a heavy computational load, an internal multi-threading parallelization may be applied, if needed. A minimal sketch of the second step is given below.</p>
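      <p>The following Java sketch illustrates only the general idea of the second step of [18] (horizontal partitioning with one thread per partition); it is our simplified illustration, not code from [18], and the Row type and both helper methods are hypothetical.</p>
      <preformat><![CDATA[
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HorizontalPartitioning {
    // Hypothetical row type; [18] does not prescribe a concrete API.
    record Row(String payload) {}

    // Split the input into n disjoint partitions (round-robin).
    static List<List<Row>> partitionHorizontally(List<Row> input, int n) {
        List<List<Row>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < input.size(); i++) parts.get(i % n).add(input.get(i));
        return parts;
    }

    // Placeholder for the chain of ETL tasks of one linear sub-process LPi.
    static void runLinearSubProcess(List<Row> partition) { }

    public static void main(String[] args) throws Exception {
        int n = 4;                                   // parameterized number of partitions
        List<Row> input = new ArrayList<>();         // rows ingested from the data source
        ExecutorService pool = Executors.newFixedThreadPool(n);
        List<Future<?>> futures = new ArrayList<>();
        for (List<Row> part : partitionHorizontally(input, n)) {
            futures.add(pool.submit(() -> runLinearSubProcess(part)));  // one thread per partition
        }
        for (Future<?> f : futures) f.get();         // wait for all partitions to finish
        pool.shutdown();
    }
}
]]></preformat>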
      <p>[10] assumes that an ETL process is assigned an estimated execution cost and that a more efficient variant of this process is produced by task re-ordering, but the authors do not provide any method for the re-ordering. The main goal of [10] is to identify a set of statistics to collect for this kind of optimization. The constraint is that this set must be minimal and applicable to estimating the costs of all possible task re-orderings. Moreover, the cost of collecting
statistics must be minimal. The statistics include: a cardinality
of table Ti , attribute histograms of Ti , and a number of distinct
values of attributes of Ti . A proposed cost function includes: data
statistics, CPU speed, disk-access speed, and memory usage. The
authors proposed cost functions for the following ETL tasks (all
expressed via SQL): select, project, join, group-by, and transform.
To find the set of statistics, which is an NP-hard problem, the
authors use linear programming.</p>
      <p>The approach presented in [24] goes one step further, as it
proposes a specific performance optimization technique by task
re-ordering. To this end, each workload gets assigned an
execution cost. The main components of the cost formula are time
and data volume. The five following workflow transformation
operations were proposed: swap - changing the order of two
tasks, factorize - combining tasks that execute the same
operation on different flows, distribute - splitting the execution of
a task into n parallel tasks, merge - merging two consecutive
tasks, and its reverse operation - split [23]. For each of these
operations, criteria for correct workflow transformations were
proposed. Finally, a heuristic for pruning the search space of
all possible workflow transformations (by task re-ordering) was
proposed, with the main goal to filter data as soon as possible.
In [15], the re-ordering of operators is based on their semantics,
e.g., a highly selective operator would be placed (re-ordered) at
the beginning of a workflow.</p>
      <p>In the same spirit, workflow transformations were proposed in
[14] for the purpose of being able to reuse existing data processing
workflows and integrate them into other workflows.</p>
      <p>The approach proposed in [16] draws upon the contributions
of [24]. In [16], possible task re-orderings are constrained by
means of a dedicated structure called a dependency graph. This
approach optimizes only linear workflows. To this end, a
nonlinear workflow is split into linear ones (called groups), by means
of pre-defined split rules. Next, parallel groups are optimized
independently by task re-ordering - tasks can be moved between
adjacent groups, and adjacent groups can be merged. The drawback of this approach is, however, that the most selective tasks
can be moved towards the end of a workflow as the result of a
re-ordering.</p>
      <p>2.2.2 ETL with UDFs.</p>
      <p>
        None of the approaches described in Sections 2.2.1 and 2.1 supports the optimization of ETL processes with user-defined functions. In the techniques based on the re-ordering of operators, the semantics of the operators must be known. Such semantics is known and understood for traditional operators based on the relational algebra. However, the semantics of user-defined functions is typically unknown. For this reason, UDFs have been handled by solutions that either: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) require manual annotations of UDFs, or (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) analyze the UDF code to explore some options for optimization, or (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) use explicitly defined parallel skeletons.
      </p>
      <p>
        In [12, 13], UDFs are treated as black boxes. The framework
consists of the Nephele execution engine and the PACT compiler
to execute UDFs, based on the PACT programming model [5]
(PACT is a generalization of MapReduce). PACT can use hints to
execute a given task in parallel. Such hints are later exploited by
a cost-based optimizer to generate parallel execution plans. The
optimization is based on: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) a re-ordering of UDFs in a workflow
and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) an execution of UDFs in a parallel environment. The re-ordering takes into account properties of UDFs. The properties are discovered by means of static code analysis. A cost-based optimizer (model) is used to construct some possible re-orderings. Also in [9], UDF annotations are used to generate an optimized query plan, but only for relational operators.
      </p>
      <p>A framework proposed in [7], called SQL/MR, enables the parallelization of UDFs in a massively-parallel shared-nothing database. To achieve parallelism, SQL/MR requires a definition of
the Row and Partition functions and corresponding execution
models for the SQL/MR function instances. The Row function is
described as an equivalent to a Map function in MapReduce. Row
functions perform row-level transformation and processing. The
execution model of the Row function allows independent
processing of each input row by exactly one instance of the SQL/MR
function. The Partition function is similar to the Reduce function
in MapReduce. Exactly one instance of a SQL/MR function is
used to independently process each group of rows defined by
the PARTITION BY clause in a query. Independent processing of
each partition allows the execution engine to achieve parallelism
at the level of a partition. The dynamic cost-based optimization
is enabled for re-orderings of UDFs.</p>
      <p>
        An optimizer proposed in [21] rewrites an execution plan
based on: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) automatically inferring the semantics of a
MapReduce-style UDF and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) a small set of rewrite rules. The semantics
can be provided by means of manual UDF annotations or an
automatic discovery. The manual annotations include: a cost function, resource consumption, and the numbers of input and output rows. The automatically discovered annotations include: the parallelization function of a given operator, schema information, and
      </p>
      <p>The aforementioned approaches require understanding the semantics of a UDF (a black box) by means of either parsing the UDF code, applying a certain coding style, using certain keywords, or using parallelization hints. Moreover, the approaches do not provide means of analyzing an optimal architecture configuration (e.g., the number of nodes in a cluster) or a degree of parallelism.</p>
      <p>In [2] the authors developed a framework for using pre-defined
generic and specific code skeletons for writing programs to be
executed in a parallel environment. The code is then generated
automatically and the process is guided by configuration
parameters that define the number of nodes a program is executed
on.</p>
      <p>In [1, 3] the authors proposed an overall architecture for
executing UDFs in a parallel environment. The architecture was
further extended in [4] with a model for optimizing a cluster
configuration, to provide a sub-optimal performance of a given ETL process with UDFs. The applied model is based on the multiple choice knapsack problem and the lp_solve library is used to solve the problem (its implementation is available at https://github.com/fawadali/MCKPCostModel).</p>
      <p>
        The work described in this paper applies: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the balanced
optimization, which uses task re-orderings to move some ETL
tasks into a data source, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) parallel processing of ETL tasks, by
standard parallelization mechanisms of IBM DataStage run in
a micro-cluster, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) parallel processing of some ETL tasks moved into and executed in a data source, by means of the standard parallelization mechanisms available in Kafka.
      </p>
    </sec>
    <sec id="sec-5">
      <title>BALANCED OPTIMIZATION</title>
      <p>
        The balanced optimization [17, 27] is implemented in IBM
InfoSphere DataStage. Its overall goals are to: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) minimize data
movement, i.e., I/O, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) use optimization techniques available in
the source or target data servers, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) maximize parallel
processing. The principle of this optimization is to balance processing
between a data source, an ETL server, and a destination (typically
a data warehouse). To this end, a given ETL task can be moved
into a data source (operation push down), to be executed there. It
is typically applied to tasks at the beginning of an ETL process,
i.e., ingesting, filtering, or pre-processing data. ETL tasks that
cannot be moved into a data source are executed in the DataStage
engine. Finally, tasks that profit from processing in a data
warehouse are moved and executed there (operation push up). This
way, specialized features of a DW management system can be
applied to processing these tasks, e.g., dedicated indexes,
multidimensional clusters, partitioning schemes, and materialized
views.
      </p>
      <p>Moving tasks is controlled by DataStage preferences [17] that guide the DataStage engine to move certain pre-defined tasks into either a DS or a DW, or to execute them in the engine. The following tasks are controlled by the preferences: transformation, filtering, aggregation, join, look-up, merge, project, bulk I/O, and temporary staging tables. Once a given ETL process has been designed, a user specifies the preferences. The ETL process is then optimized by DataStage, taking the preferences into account. An executable version of the process is generated and deployed.</p>
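      <p>To make the push down idea concrete for a relational DS, the following minimal JDBC sketch (our illustration, not DataStage-generated code; the JDBC URL, table, and column names are hypothetical) contrasts a filter executed inside the source database with the same filter executed in the ETL engine.</p>
      <preformat><![CDATA[
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PushDownIllustration {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://source-host/sales";  // hypothetical source DS

        // Variant A: filter pushed down into the data source; only matching rows
        // leave the source, so I/O between the DS and the ETL server is reduced.
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                 "SELECT ss_item, ss_net_paid FROM store_sales WHERE ss_net_paid > ?")) {
            ps.setDouble(1, 100.0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { /* pass the row to the next ETL task */ }
            }
        }

        // Variant B: no push down; every row is transferred and the ETL engine
        // evaluates the same predicate itself.
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT ss_item, ss_net_paid FROM store_sales")) {
            while (rs.next()) {
                if (rs.getDouble("ss_net_paid") > 100.0) { /* pass the row on */ }
            }
        }
    }
}
]]></preformat>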
      <p>So far, the balanced optimization has proven to be profitable when applied to structured data sources. Since numerous NoSQL and stream DSs are becoming first-class citizens, assessing the applicability of this type of optimization to such data sources may have real business value. The work presented in this paper is the first to assess the balanced optimization on a stream data source.</p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>
        The goal of the experiments was to assess whether the balanced optimization in IBM DataStage may increase the performance of an
ETL process ingesting data from a stream data source, run by
Kafka. Such a software architecture is frequently used by IBM
customers. In particular, we were interested in figuring out: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
what parameters of Kafka may affect performance, cf. Sections 4.1 and 4.2, as well as (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) which ETL tasks may benefit from the
balanced optimization. In this phase of the project we evaluated filtering (Section 4.3), data flow split (Section 4.4), and aggregation (Section 4.5).
      </p>
      <p>To this end, a pico-cluster composed of four physical workstations was built. Each workstation included a 4-core 3GHz CPU, 16GB RAM, and a 256GB HDD, and ran under Linux RedHat. Two workstations ran Kafka and the other two ran the IBM InfoSphere DataStage ETL server. The ETL processes were designed using DataStage Designer. The system was fed with rows from table Store_Sales (1.5GB) from the TPC-D benchmark. The performance statistics were measured on each node by iostat. In each scenario, 12 experiments were run. The lowest and the highest measurements were discarded, and the average of the remaining 10 was computed and is shown in all the figures. Due to space constraints, we present only selected experimental scenarios.</p>
      <p>Notice that the goal of these experiments was to assess the behaviour of the system only at the border between Kafka and DataStage. For this reason, the ETL processes used in the experiments included elementary tasks/components, like: the Kafka connector, column import, filtering, aggregation, and storing the output in a file. An example process splitting a data flow is shown in Figure 1.</p>
      <p>RecordCount is a parameter of the connector from DataStage to Kafka. It defines the number of rows (a rowset) that are read from a topic. After a rowset is read, the ETL connector outputs the rowset to the ETL process for further processing.</p>
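      <p>For readers unfamiliar with the connector, the effect of RecordCount can be approximated with a plain Kafka consumer: the sketch below (our illustration, not the connector's implementation; topic name and broker address are assumptions) bounds the size of each rowset handed to the downstream tasks via the standard max.poll.records consumer setting.</p>
      <preformat><![CDATA[
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RowsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-node1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-ingest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Upper bound on the number of rows returned by a single poll(),
        // playing a role similar to the connector's RecordCount parameter.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("store_sales"));
            while (true) {
                ConsumerRecords<String, String> rowset = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> row : rowset) {
                    // hand each row of the rowset over to the downstream ETL tasks
                }
            }
        }
    }
}
]]></preformat>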
      <p>
        The value of RecordCount in the experiments ranged from 1 to
’ALL’. In the latter case, the whole test data set was read before
starting the ETL process (11 million rows in the experiment).
The elapsed times of processing the whole data set are shown
in Figure 2. The value of the standard deviation ranges from
0.5 to 2.5. From the chart we observe that: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the performance
strongly depends on the value of the parameter and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) the
performance does not improve for values of RecordCount greater than 50. Having analyzed detailed CPU and I/O statistics (not shown in this paper), we conclude that an optimal value of RecordCount is within the range (400, 600).
      </p>
    </sec>
    <sec id="sec-7">
      <title>The number of Kafka partitions</title>
      <p>Kafka allows streaming data to be split into partitions, each of which is read by a consumer. As the number of partitions may strongly influence the performance of the whole ETL process (cf. https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster), figuring out how performance is affected by the number of partitions is of great interest.</p>
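      <p>As a point of reference, the number of partitions is fixed when the topic is created; a minimal sketch using Kafka's AdminClient is shown below (topic name, partition count, replication factor, and broker address are illustrative assumptions, not the exact setup used in the experiments).</p>
      <preformat><![CDATA[
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-node1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1; consumers in one consumer group
            // then read the partitions in parallel (at most one consumer per partition).
            NewTopic topic = new NewTopic("store_sales", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
]]></preformat>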
      <p>To assess the impact of the number of partitions on processing time, in our experiments the number of partitions ranged from 1 to 8. The number of partitions equaled the number of Kafka consumers. The performance results are shown in Figure 3. Here we show only the results for RecordCount = ALL. As we can observe, the elapsed processing time of the ETL process varies, and the best performance was achieved for 2 and 3 partitions. The standard deviation ranges from 0.5 to 1.3.</p>
      <p>A detailed analysis of I/O (not shown here due to space constraints) reveals that the number of I/O operations increases with the increasing number of partitions. A similar increase was also observed in CPU time. These may be some of the factors that influence the characteristics shown in Figure 3.</p>
    </sec>
    <sec id="sec-8">
      <title>Selectivity of data ingest</title>
      <p>In this experiment we assessed the elapsed ETL processing time w.r.t. the selectivity of the data ingest from Kafka (the selectivity is defined as the number of rows ingested divided by the total number of rows). Two scenarios were implemented. In the first one, operation push down was applied to move data filtering into Kafka (available in the Kafka Streams library). In the second one, data were filtered by a standard pre-defined task in DataStage. The results for RecordCount=10 and RecordCount=ALL are shown in Figure 4. The results labeled with suffix PD denote the scenario with operation push down applied. The standard deviation ranges from 0.1 to 2.8.</p>
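      <p>The pushed-down filter of the first scenario can be sketched with the Kafka Streams API roughly as follows; this is a simplified illustration under assumed topic names and an assumed row format, not the exact job used in the experiments.</p>
      <preformat><![CDATA[
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterPushDown {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-push-down");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-node1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rows = builder.stream("store_sales");
        // Keep only rows that satisfy the ingest predicate; only these rows
        // reach the topic read by the DataStage Kafka connector.
        rows.filter((key, row) -> satisfiesPredicate(row))
            .to("store_sales_filtered");

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical predicate standing in for the selectivity-controlled filter.
    static boolean satisfiesPredicate(String row) {
        return Math.floorMod(row.hashCode(), 100) < 10;  // e.g., ~10% selectivity
    }
}
]]></preformat>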
      <p>The results clearly show that a wrong combination of RecordCount with the applied push down can decrease the overall performance of an ETL process, cf. RecordCount=ALL PD (with push down) vs. RecordCount=ALL (without push down), for selectivity=90%. From other experiments (not presented in this paper) we observed that for values of RecordCount ≤ 20, push down offered consistently better performance for the whole range of selectivities.</p>
    </sec>
    <sec id="sec-9">
      <title>Flow split</title>
      <p>
        The purpose of a flow split is to divide a data flow into multiple
flows. In our case, we split the input flow into two, using a
parameterized split ratio: from 10% of data in one flow to an even
split. The flow split was executed in: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Kafka, as the result of
push down, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) DataStage. The obtained elapsed execution
times for RecordCount=10 and RecordCount=ALL are shown in
Figure 5. The standard deviation ranged from 1.3 to 3.3.
      </p>
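      <p>When the split is pushed down, it can be expressed in Kafka Streams roughly as below; this is an illustrative sketch with assumed topic names, and it uses the classic branch() call (newer Kafka versions offer the equivalent split() API).</p>
      <preformat><![CDATA[
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FlowSplitPushDown {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "flow-split-push-down");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-node1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rows = builder.stream("store_sales");

        // Split the input flow into two flows according to a parameterized ratio,
        // e.g., roughly 10% of the rows into the first flow and 90% into the second.
        int ratio = 10;
        KStream<String, String>[] flows = rows.branch(
            (key, row) -> Math.floorMod(row.hashCode(), 100) < ratio,  // first flow
            (key, row) -> true);                                       // remaining rows

        flows[0].to("store_sales_flow_a");
        flows[1].to("store_sales_flow_b");

        new KafkaStreams(builder.build(), props).start();
    }
}
]]></preformat>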
      <p>The characteristics show a performance improvement when push down is applied for RecordCount=10, cf. bars labeled RecordCount=10 PD vs. RecordCount=10. From other experiments (not discussed in this paper) we observed that the maximum value of RecordCount that improved the performance is 10. From the analysis of detailed data on CPU and I/O, we tentatively conclude that the observed behaviour is caused by Kafka not being able to deliver data to DataStage on time.</p>
    </sec>
    <sec id="sec-10">
      <title>Aggregation</title>
      <p>
        In this experiment, count is used to count records in groups.
The number of groups ranges from 10 to 90. The aggregation is
executed in two variants: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) in Kafka (available in the Kafka Streams library), as the result of push down, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) in DataStage. The
results are shown in Figure 6 for RecordCount=10 and
RecordCount=ALL. The standard deviation ranged from 0.5 to 2.2. As
we can see, the execution time does not profit from push down
in either of these cases, cf. bars labeled as RecordCount=10 PD
(push down applied) vs. RecordCount=10 (without push down) and
RecordCount=ALL PD vs. RecordCount=ALL.
      </p>
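      <p>The pushed-down aggregation (a count per group) can be sketched in Kafka Streams roughly as follows; this is an illustrative sketch in which the grouping-key extraction and the topic names are our assumptions.</p>
      <preformat><![CDATA[
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class AggregationPushDown {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-push-down");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-node1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rows = builder.stream("store_sales");

        // Re-key each row by its (assumed) grouping attribute and count rows per group.
        KTable<String, Long> counts = rows
            .groupBy((key, row) -> extractGroupKey(row))
            .count();

        // Emit one (group, count) record per update to the output topic read by DataStage.
        counts.toStream().to("store_sales_counts",
            Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical extraction of the grouping attribute from a delimited row.
    static String extractGroupKey(String row) {
        return row.split("\\|")[0];
    }
}
]]></preformat>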
      <p>The analysis of CPU and I/O reveals that push down caused
higher CPU usage times and higher I/O, which resulted in much
worse performance of the aggregation in Kafka than in DataStage.
</p>
    </sec>
    <sec id="sec-11">
      <title>SUMMARY</title>
      <p>In practice, data integration architectures are frequently built using IBM DataStage, NoSQL DSs (HBase, Cassandra), and stream DSs (Kafka). The balanced optimization available in DataStage offers performance improvements for relational data sources; however, its characteristics on NoSQL and stream DSs are unknown. For this reason, a project was launched at IBM to discover these characteristics, with the goal of building an execution optimizer for standard and multi-cloud architectures [6], based on a learning recommender.</p>
      <p>In this paper we presented an experimental evaluation of the balanced optimization applied to a stream DS run by Kafka, within a joint project of IBM and Poznan University of Technology. To the best of our knowledge, it is the first project analyzing the possibility of using this type of ETL optimization on non-relational DSs.</p>
      <p>
        From this evaluation, the most interesting observations of real business value are as follows. First, Kafka turned out to be
a bottleneck for push down applied to: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) filtering, for
RecordCount=ALL and a selectivity of 50%; for other tested operations
push down increased performance; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) split, for RecordCount=ALL,
for all split ratios; (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) aggregation, for RecordCount in {10, ALL}.
Second, the overall performance of an ETL process strongly
depends on specific parameters of Kafka, e.g., RecordCount and the
number of partitions. Third, the characteristics of CPU and I/O
usage may suggest that in order to increase the performance of
Kafka, more hardware needs to be allocated for Kafka than for
DataStage.
      </p>
      <p>Even though the aforementioned observations cannot be generalized (they apply to this particular experimental setting), they turned out to be of practical value. First, they were very well received by the Executive Management at IBM. Second, the observations were integrated into a knowledge base of IBM Software Lab and have already been used for multiple proofs of concept.</p>
      <p>
        The next phase of this project will consist in: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) extending the evaluation of push down to Kafka (other cluster sizes, other parameters of Kafka), (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) evaluating push down for HBase and
Cassandra, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) designing and building a learning recommender for DataStage. The recommender will analyze experimental metadata to discover patterns and build recommendation models, with the goal of proposing configuration parameters for a given ETL process and orchestration scenarios of tasks within the balanced optimization.
      </p>
      <p>
        Acknowledgements. The work of Robert Wrembel is supported
by: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the grant of the Polish National Agency for Academic
Exchange, within the Bekker Programme, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) IBM Shared
University Reward 2019.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Muhammad Fawad Ali</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Next-generation ETL Framework to Address the Challenges Posed by Big Data</article-title>
          .
          <source>In DOLAP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Muhammad Fawad Ali</surname>
          </string-name>
          , Johannes Mey, and
          <string-name>
            <given-names>Maik</given-names>
            <surname>Thiele</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Parallelizing user-defined functions in the ETL workflow using orchestration style sheets</article-title>
          .
          <source>Int. J. of Applied Mathematics and Comp. Science (AMCS)</source>
          (
          <year>2019</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Muhammad Fawad Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>From conceptual design to performance optimization of ETL workflows: current state of research and open problems</article-title>
          .
          <source>The VLDB Journal</source>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Muhammad Fawad Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics</article-title>
          .
          <source>In ADBIS. LNCS 11695</source>
          ,
          <fpage>441</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Battré</surname>
          </string-name>
          , Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Warneke</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Nephele/PACTs: a programming model and execution framework for web-scale analytical processing</article-title>
          .
          <source>In ACM Symposium on Cloud Computing</source>
          .
          <fpage>119</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michal</given-names>
            <surname>Bodziony</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>ETL in Big Data Architectures: IBM Approach to Design and Optimization of ETL Workflows (Invited talk)</article-title>
          .
          <source>In DOLAP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Eric</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Pawlowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and John</given-names>
            <surname>Cieslewicz</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions</article-title>
          .
          <source>PVLDB 2</source>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>1402</fpage>
          -
          <lpage>1413</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Gartner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Magic Quadrant for Data Integration Tools</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Große</surname>
          </string-name>
          , Norman May, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A study of partitioning and parallel UDF execution with the SAP HANA database</article-title>
          .
          <source>In SSDBM. 36.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ramanujam</given-names>
            <surname>Halasipuram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prasad M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sriram</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Determining Essential Statistics for Cost Based Optimization of an ETL Workflow</article-title>
          . In EDBT.
          <volume>307</volume>
          -
          <fpage>318</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Herodotos</surname>
            <given-names>Herodotou</given-names>
          </string-name>
          , Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and
          <string-name>
            <given-names>Shivnath</given-names>
            <surname>Babu</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Starfish: A Self-tuning System for Big Data Analytics</article-title>
          .
          <source>In CIDR</source>
          , Vol.
          <volume>11</volume>
          .
          <fpage>261</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Fabian</surname>
            <given-names>Hueske</given-names>
          </string-name>
          , Mathias Peters, Aljoscha Krettek, Matthias Ringwald, Kostas Tzoumas, Volker Markl, and
          <string-name>
            <surname>Johann-Christoph Freytag</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Peeking into the optimization of data flow programs with mapreduce-style udfs</article-title>
          .
          <source>In ICDE</source>
          .
          <volume>1292</volume>
          -
          <fpage>1295</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Fabian</surname>
            <given-names>Hueske</given-names>
          </string-name>
          , Mathias Peters, Matthias J Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and
          <string-name>
            <given-names>Kostas</given-names>
            <surname>Tzoumas</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Opening the black boxes in data flow optimization</article-title>
          .
          <source>PVLDB 5</source>
          ,
          <issue>11</issue>
          (
          <year>2012</year>
          ),
          <fpage>1256</fpage>
          -
          <lpage>1267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Petar</surname>
            <given-names>Jovanovic</given-names>
          </string-name>
          , Oscar Romero, Alkis Simitsis, and
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Incremental Consolidation of Data-Intensive Multi-Flows</article-title>
          .
          <source>IEEE TKDE 28</source>
          ,
          <issue>5</issue>
          (
          <year>2016</year>
          ),
          <fpage>1203</fpage>
          -
          <lpage>1216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Anastasios</surname>
            <given-names>Karagiannis</given-names>
          </string-name>
          , Panos Vassiliadis, and
          <string-name>
            <given-names>Alkis</given-names>
            <surname>Simitsis</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Scheduling strategies for efficient ETL execution</article-title>
          .
          <source>Information Syst</source>
          .
          <volume>38</volume>
          ,
          <issue>6</issue>
          (
          <year>2013</year>
          ),
          <fpage>927</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Nitin</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. Sreenivasa</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>An Efficient Heuristic for Logical Optimization of ETL Workflows</article-title>
          .
          <source>In VLDB Workshop on Enabling Real-Time Business Intelligence</source>
          .
          <fpage>68</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Rao</given-names>
            <surname>Lella</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Optimizing BDFS jobs using InfoSphere DataStage Balanced Optimization</article-title>
          .
          <article-title>IBM white paper: Developer Works</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Xiufeng</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nadeem</given-names>
            <surname>Iftikhar</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An ETL optimization framework using partitioning and parallelization</article-title>
          .
          <source>In ACM SAC</source>
          .
          <volume>1015</volume>
          -
          <fpage>1022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Xiufeng</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Thomsen</surname>
          </string-name>
          , and Torben Bach Pedersen.
          <year>2012</year>
          .
          <article-title>MapReduce-based Dimensional ETL Made Easy</article-title>
          .
          <source>PVLDB 5</source>
          ,
          <issue>12</issue>
          (
          <year>2012</year>
          ),
          <fpage>1882</fpage>
          -
          <lpage>1885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Xiufeng</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Thomsen</surname>
          </string-name>
          , and Torben Bach Pedersen.
          <year>2013</year>
          .
          <article-title>ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce</article-title>
          .
          <source>Trans. Large-Scale Data- and Knowledge-Centered Systems</source>
          8 (
          <year>2013</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Astrid</surname>
            <given-names>Rheinländer</given-names>
          </string-name>
          , Arvid Heise, Fabian Hueske, Ulf Leser, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SOFA: An extensible logical optimizer for UDF-heavy data flows</article-title>
          .
          <source>Information Syst</source>
          .
          <volume>52</volume>
          (
          <year>2015</year>
          ),
          <fpage>96</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Russom</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Data Lakes: Purposes, Practices, Patterns, and Platforms</article-title>
          .
          <source>TDWI white paper</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Alkis</surname>
            <given-names>Simitsis</given-names>
          </string-name>
          , Panos Vassiliadis, and
          <string-name>
            <given-names>Timos</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Optimizing ETL Processes in Data Warehouses</article-title>
          .
          <source>In ICDE</source>
          .
          <volume>564</volume>
          -
          <fpage>575</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Alkis</surname>
            <given-names>Simitsis</given-names>
          </string-name>
          , Panos Vassiliadis, and
          <string-name>
            <surname>Timos</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Sellis</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>State-Space Optimization of ETL Workflows</article-title>
          .
          <source>IEEE TKDE 17</source>
          ,
          <issue>10</issue>
          (
          <year>2005</year>
          ),
          <fpage>1404</fpage>
          -
          <lpage>1419</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Alejandro A.</given-names>
            <surname>Vaisman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Esteban</given-names>
            <surname>Zimányi</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <source>Data Warehouse Systems - Design and Implementation</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          Informatica white paper.
          <year>2007</year>
          .
          <article-title>How to Achieve Flexible, Cost-efective Scalability and Performance through Pushdown Processing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <article-title>IBM white paper</article-title>
          .
          <year>2008</year>
          .
          <article-title>IBM InfoSphere DataStage Balanced Optimization</article-title>
          .
          <source>Information Management Software.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>