<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Next-generation ETL Framework to address the challenges posed by Big Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Syed Muhammad Fawad Ali</string-name>
          <email>fawadali.ali@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Poznan University of Technology, Poznan, Poland</institution>
          <institution>trivago N.V.</institution>
          <addr-line>Leipzig</addr-line>
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The specific features of Big Data, i.e., variety, volume, and velocity, call for special measures to create ETL data pipelines and data warehouses. The rapidly growing need for analyzing Big Data calls for novel architectures for warehousing the data, such as data lakes or polystores. In both architectures, ETL processes serve similar purposes as in traditional data warehouse architectures, except that the data to process come in a multitude of formats and the relationships between data are often very complex. Furthermore, data transformations are often required on the fly and have to be executed and completed in near real time. For these reasons, designing and optimizing ETL workflows for Big Data is much more difficult than for traditional data. In this paper, we focus on the ETL aspect of Big Data and propose an extendable ETL framework that addresses the aforementioned challenges posed by Big Data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Since the early 2000s, the volume of produced, collected, and stockpiled digital data has been growing exponentially. It is expected that by 2020 there will be more than 16 zettabytes (16 trillion GB) of useful data. This Big Data provides great opportunities, and harnessing it leads to great benefits in science and business. Thus, having the right technological basis to exploit the potential of Big Data is essential for most organizations to gain a competitive advantage or even survive in today’s world.</p>
      <p>
        Big Data itself calls for substantial scientific contributions: most of such data cannot be easily accessed or processed by existing technologies, which poses many challenges for academics and leaves many questions to be answered. For example, traditional ETL or data integration tools work well with clean and consistent data but are not capable of efficiently dealing with the variety of Big Data. Although there are already numerous Big Data, IoT, and analytics solutions that enable people to obtain valuable insights from vast amounts of data, such solutions are still in their early stages of development [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Therefore,
there is a need for developing new and advanced methods and
technologies to extract, transform, load, analyze, and visualize
such data in order to obtain valuable insights from it.
      </p>
      <p>This paper focuses on the ETL aspect of Big Data. Traditional ETL frameworks, methods, and built-in operators provided by the existing ETL tools are obsolete in the case of Big Data due to its volume, variety, and velocity. The existing ETL tools and frameworks were designed for creating a traditional Data Warehouse (DW), which efficiently supports light-weight computations on smaller data sets. However, Big Data demands new and advanced computations. For example, on the data cleansing side, the messy and noisy nature of Big Data requires new types of cleansing operators, such as outlier detection or de-duplication, that specifically fit the ever-changing characteristics of the data. The same applies to the data analytics side, where we find a zoo of algorithms such as classification, regression, clustering, collaborative filtering, and many more.</p>
      <p>
        We carried out an extensive study on the current practices, shortcomings, limitations, and open issues of existing ETL methodologies and tools [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. According to this study, the still open issues in ETL development become much more difficult to solve in the field of Big Data. Therefore, in this paper, we propose a next-generation extendable ETL framework to address the challenges posed by Big Data. The proposed framework is based on the outcome of our aforementioned study.
      </p>
      <p>In Section 2 we present our motivation for the new ETL
framework. We then introduce and explain the proposed extendable
ETL framework in Section 3. In Section 4 we discuss the related
work in the same field. Section 5 contains the conclusion and the
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>MOTIVATION</title>
      <p>
        As mentioned in Section 1, we carried out an intensive study [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
on the existing methods for designing, implementing, and optimizing ETL workflows. We analyzed several techniques w.r.t. their pros, cons, and challenges in the context of metrics such as: autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. The following is a summary of conclusions on open research and technological issues in the field of ETL:
(1) The support for semi-structured and unstructured data is very limited, whereas the variety of data formats, especially unstructured and raw data, is growing rapidly. Therefore, there is a need to extend the support for processing unstructured data in an ETL workflow along with other data formats (e.g., video, audio, binary).
(2) There is little or no support for user-defined functions (UDFs) as ETL activities, whereas the volume and variety of Big Data require custom functionality in order to perform complex and intensive computations. The reason is that traditional DWs are not optimized to store huge volumes and varieties of data; therefore, novel data warehousing architectures like a data lake [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or a
polystore [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are introduced. These DWs support different kinds of data formats, which eventually lead to complex ETL workflows to populate such DWs. Designing ETL workflows for Big Data is a challenging task because traditional ETL operators are not suitable for processing Big Data, and such tasks have to be implemented as UDFs. Therefore, there is a need to consolidate and fully support UDFs in an ETL workflow along with traditional ETL operators.
(3) Only a few methods emphasized the issues of efficient, reliable, and improved execution of an ETL workflow, whereas today’s need for real-time availability of data requires efficient ETL workflows that can quickly process and analyze huge amounts of data. Therefore, to improve the execution performance of an entire ETL workflow, techniques based on task parallelism, data parallelism, and a combination of both for traditional ETL operators as well as UDFs are required.
(4) Most of the design methods require ETL developers to provide extensive input during the modeling and design phase of an ETL workflow, which can be error-prone, time-consuming, and inefficient. Hence, there is a need for an ETL framework that reduces the work of the ETL developer from a design and performance-optimization perspective. The framework should provide recommendations on: (1) an efficient design for an ETL workflow according to the business requirements, and (2) how and when to improve the performance of an ETL workflow without compromising other quality metrics.
      </p>
      <p>The consequence of the aforementioned observations is that designing and optimizing ETL workflows for Big Data is much more difficult than for traditional data, and support for these tasks is much needed at this point in time.</p>
    </sec>
    <sec id="sec-3">
      <title>THE EXTENDABLE ETL FRAMEWORK</title>
      <p>On the basis of conclusions discussed in Section 2, we present an
extendable theoretical ETL Framework. A three-layered
architecture of the ETL Framework is shown in Figure 1.</p>
      <p>The bottom layer is an ETL Workflow Designer, which may be any standard open-source ETL tool for designing ETL workflows. This layer communicates with the middle layer, which is extendable and consists of four components: (1) a UDFs Component, (2) a Recommender, (3) a Cost Model, and (4) a Monitoring Agent, described in detail in the following sub-sections.</p>
      <p>The top layer in the architecture is the Distributed Framework. Its task is to execute the parallel code of UDFs in a distributed environment, in order to improve the overall execution performance of an ETL workflow.</p>
    </sec>
    <sec id="sec-4">
      <title>A UDFs Component</title>
      <p>The idea behind introducing this component is to assist the ETL
developer in writing a parallelizable UDF by separating
parallelization concerns from the code.</p>
      <p>A UDF is a software program written in any programming, scripting, or procedural language. It allows the ETL developer to extend the functionality of an ETL tool beyond the scope of the already provided built-in ETL operators. For example, the messy and noisy nature of Big Data demands new types of cleansing operators, such as outlier detection or de-duplication, that specifically fit the ever-changing characteristics of the data. The same applies to the data analytics side, where we find a zoo of algorithms such as classification, regression, clustering, collaborative filtering, and many more. A UDF can be used to implement the aforementioned operators in a Big Data setup or to perform aggregations or any kind of run-time-intensive computations on data that may be necessary before loading it into a data warehouse.</p>
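      <p>To make this concrete, the following is a minimal, illustrative sketch of such a cleansing UDF, written in Python under the assumption that the ETL tool can call out to Python code; the function name and threshold are ours, not operators of any specific tool.</p>
      <preformat><![CDATA[
# A hypothetical cleansing UDF: z-score based outlier detection.
from statistics import mean, stdev

def remove_outliers(values, z_max):
    """Drop numeric values whose z-score exceeds z_max."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= z_max]

# With few points the z-scores stay small, so a tight threshold is used here.
print(remove_outliers([1.0, 1.2, 0.9, 1.1, 50.0], z_max=1.5))  # 50.0 is dropped
]]></preformat>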
      <p>The UDFs component contains a library of Parallel Algorithmic Skeletons (PASs), i.e., parallelizable code templates. These PASs are designed to be executed in a distributed environment (e.g., a template for MapReduce or Spark to be executed on Hadoop).</p>
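      <p>As a rough sketch of what a generic PAS might look like (assuming Python as the host language; the skeleton below is illustrative and not taken from a concrete template library), a MapReduce-style skeleton hides partitioning and worker management and asks the developer only for the Map and Reduce functions:</p>
      <preformat><![CDATA[
# A minimal sketch of a generic MapReduce PAS: the ETL developer supplies
# only map_fn and reduce_fn; worker management stays inside the skeleton.
from collections import defaultdict
from multiprocessing import Pool

def map_reduce(records, map_fn, reduce_fn, workers=4):
    """Apply map_fn to records in parallel, group by key, then reduce."""
    with Pool(workers) as pool:
        mapped = pool.map(map_fn, records)   # data parallelism over records
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

# Illustrative usercode: de-duplication by counting identical rows.
def emit_row(row):
    return [(row, 1)]

def count(row, ones):
    return sum(ones)

if __name__ == "__main__":
    print(map_reduce(["a,1", "b,2", "a,1"], emit_row, count))
    # {'a,1': 2, 'b,2': 1}
]]></preformat>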
      <p>The UDFs component requires a basic knowledge of distributed
computing and parallelization aspects from the ETL developer.</p>
      <p>Figure 2 shows the working of the UDFs Component. The component provides already parallelized code for a list of commonly used Big Data operators (case-based PASs, e.g., sentiment analysis, de-duplication of rows, outlier detection) and a list of generic PASs (e.g., worker-farm model, divide and conquer, branch and bound, systolic, MapReduce). The ETL developer chooses either a case-based PAS or a generic PAS based on his/her requirements.
As shown in Figure 2, a generic input to the UDFs component is depicted as [{usercode, case based PAS}, {(input format, output format)}, {max execution time constraint}, {distributed machine specifications}]. For example, in the case of a case-based PAS, the ETL developer only has to provide the input and output data formats {(input format, output format)}, an execution time constraint to run the ETL workflow (e.g., the ETL job must complete execution within ’x’ hours {max execution time}), and distributed machine specifications {distributed machine specifications}, if known. In the case of a generic PAS, the ETL developer has to provide the basic program for the chosen PAS {usercode}, an execution time constraint to run the ETL workflow {max execution time}, and distributed machine specifications {distributed machine specifications}. That is, for the MapReduce paradigm as a PAS, only the Map and Reduce functions would be required; the MapReduce configurations (i.e., partitioning parameters, number of nodes) will be provided by the UDFs component. The Code Generator then generates the configuration and parallelizable code based on the ETL developer’s input to the component about the distributed machine specifications and the time constraints on the completion of the ETL workflow, and on the recommendation of the Recommender component in the proposed ETL framework. The specific configurations provided by this component are critical to achieve the right degree of parallelism.</p>
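      <p>A hedged sketch of this generic input as a Python structure, whose field names mirror the notation above (the class itself is illustrative, not a defined API of the framework):</p>
      <preformat><![CDATA[
# Illustrative container for the UDFs component input described above:
# [{usercode, case based PAS}, {(input format, output format)},
#  {max execution time constraint}, {distributed machine specifications}]
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class UDFRequest:
    usercode: Optional[Callable] = None    # required only for a generic PAS
    case_based_pas: Optional[str] = None   # e.g., "de-duplication"
    input_format: str = "csv"
    output_format: str = "parquet"
    max_execution_time_h: float = 1.0      # must finish within 'x' hours
    machine_specs: dict = field(default_factory=dict)  # e.g., {"nodes": 8}

# Case-based usage: only formats, time budget, and machine specs are given.
req = UDFRequest(case_based_pas="outlier detection",
                 input_format="json", output_format="parquet",
                 max_execution_time_h=2.0, machine_specs={"nodes": 8})
]]></preformat>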
      <p>Once the configurations are generated, the code provided by the ETL developer together with the distributed environment configurations will be executed in the Distributed Framework. The computed results are then returned to the ETL workflow for the next steps in the workflow.</p>
    </sec>
    <sec id="sec-5">
      <title>A Recommender</title>
      <p>
        A Recommender includes an extendable set of machine learning algorithms to optimize a given ETL workflow (based on metadata collected during past ETL executions) and to generate a more efficient version of the workflow. Metadata may be collected with the help of the Monitoring Agent, which gathers various performance statistics of different ETL workflows and provides them to the Recommender. Since there are several algorithms that can be applied to optimize a workflow, e.g., the dependency-graph approach [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] or scheduling strategies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the ETL developer is able to experiment with alternative algorithms and compare their optimization outcomes, as sketched below.
      </p>
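      <p>A minimal sketch of such experimentation, assuming a pluggable registry of optimization algorithms (all names below are ours and purely illustrative): each algorithm maps a workflow plus past execution statistics to a rewritten workflow, so that alternatives can be run side by side and their outcomes ranked.</p>
      <preformat><![CDATA[
# Hypothetical pluggable Recommender: optimization algorithms are
# registered by name; each maps (workflow, stats) to a rewritten workflow.
from typing import Callable, Dict, List

OPTIMIZERS: Dict[str, Callable] = {}

def register(name):
    def deco(fn):
        OPTIMIZERS[name] = fn
        return fn
    return deco

@register("cheapest-first")
def cheapest_first(workflow: List[str], stats: Dict[str, float]) -> List[str]:
    # Toy stand-in for a real rewriting algorithm such as those of [10, 11]
    # or [7]: order independent activities by their past average runtime.
    return sorted(workflow, key=lambda act: stats.get(act, 0.0))

def compare(workflow, stats, estimated_cost):
    """Run every registered optimizer and rank the rewritten workflows."""
    results = {name: opt(workflow, stats) for name, opt in OPTIMIZERS.items()}
    return sorted(results.items(), key=lambda kv: estimated_cost(kv[1]))
]]></preformat>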
      <p>The Recommender component also helps the ETL developer to choose the best possible PAS from the UDFs component, based on the developer’s input (cf. Section 3.1) to the Recommender. To provide the optimal PAS to the UDFs component, it uses the Cost Model component.</p>
    </sec>
    <sec id="sec-6">
      <title>A Cost Model</title>
      <p>The algorithms used by the Recommender need cost models. The Recommender can choose the appropriate cost model from a library of cost models in order to make optimal decisions based on the ETL developer’s input.</p>
      <p>
        The library of cost models may include cost models for monetary cost, for execution performance, and for combined cost-and-performance optimization. Since most Big Data ETL workflows and UDFs for Big Data are executed in a cloud or a distributed framework, there would be cost models to evaluate the performance of workflows in a cloud computing environment [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]
and also to determine the best possible configuration of virtual
machines both in terms of execution time and monetary cost
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Since the Recommender uses the Cost Model component to provide the optimal PAS to the UDFs component, the cost model would be able to select the optimal PAS based on the Multiple-Choice Knapsack Problem (MCKP) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, suppose an
ETL workflow consists of n different computationally intensive UDFs and the UDFs component may generate m parallel variants of each UDF; then there are m<sup>n</sup> combinations of code variants. Therefore, finding an optimal variant for each UDF may be mapped to MCKP, as in the sketch below.
      </p>
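      <p>The following is a minimal sketch of this mapping (an illustrative dynamic program, not the algorithm of [4]): each UDF forms a class, each parallel variant is an item with a resource cost (e.g., the monetary cost of a VM configuration) and a value (e.g., the speedup estimated by the cost model), and exactly one variant per UDF is selected so that the total cost stays within a budget while the total value is maximal.</p>
      <preformat><![CDATA[
# Illustrative MCKP solver: choose exactly one variant per UDF, maximizing
# total value subject to an integer cost budget.
def mckp(classes, budget):
    """classes: list of lists of (cost, value) variants, one list per UDF."""
    NEG = float("-inf")
    best = [0] + [NEG] * budget        # best[b] = max value at total cost b
    for variants in classes:           # one class per UDF
        nxt = [NEG] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == NEG:
                continue
            for cost, value in variants:
                if b + cost <= budget:
                    nxt[b + cost] = max(nxt[b + cost], best[b] + value)
        best = nxt                     # exactly one item per class taken
    return max(best)

# Two UDFs, three parallel variants each, given as (cost, speedup):
print(mckp([[(2, 3), (4, 6), (5, 7)], [(1, 2), (3, 5), (4, 6)]], budget=7))
# -> 11: variant (4, 6) for the first UDF and (3, 5) for the second
]]></preformat>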
    </sec>
    <sec id="sec-7">
      <title>A Monitoring Agent</title>
      <p>The Monitoring Agent allows one to:
• monitor ETL workflow executions, e.g., the number of input rows, the number of output rows, the execution time of each step, and the number of rows processed per second;
• identify performance bottlenecks, e.g., which tasks are being delayed or aborted and which tasks need to be optimized;
• report errors, e.g., task or workflow failures and their possible reasons;
• schedule executions, e.g., execution times of ETL workflows, and create a dependency chart for ETL tasks and workflows;
• gather various performance statistics, e.g., the execution time of each ETL activity w.r.t. rows processed per second, the execution time of the entire ETL workflow w.r.t. rows processed per second, and the memory consumption of each ETL activity (a sketch of such a record follows).</p>
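      <p>A sketch of the per-activity statistics record that the Monitoring Agent might store in the ETL framework repository (the field names are illustrative assumptions, not a fixed schema):</p>
      <preformat><![CDATA[
# Hypothetical statistics record collected per ETL activity; these records
# later feed the Recommender and the Cost Model.
from dataclasses import dataclass

@dataclass
class ActivityStats:
    workflow_id: str
    activity: str
    rows_in: int
    rows_out: int
    seconds: float
    memory_mb: float
    status: str = "ok"               # "ok", "delayed", "aborted", ...

    @property
    def rows_per_second(self) -> float:
        return self.rows_in / self.seconds if self.seconds else 0.0

rec = ActivityStats("wf-42", "de-duplicate", rows_in=1_000_000,
                    rows_out=910_000, seconds=37.5, memory_mb=2048.0)
print(round(rec.rows_per_second))  # ~26667 rows/s
]]></preformat>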
      <p>This is a standard component of any ETL engine. However, we would store all of the aforementioned collected information in an ETL framework repository to be later utilized by the Recommender and the Cost Model in order to make recommendations to the ETL developer and to generate optimal ETL workflows.</p>
    </sec>
    <sec id="sec-8">
      <title>RELATED WORK</title>
      <p>There is not much research work in the literature on ETL frameworks specifically for Big Data, besides some cloud-based distributed frameworks (e.g., the Amazon Web Services stack (https://aws.amazon.com), Google Cloud Platform (https://cloud.google.com/products), and Microsoft Azure (https://azure.microsoft.com)). These cloud-based distributed platforms provide several products that help in creating Big Data ETL data pipelines and solutions. However, the provided products are not fully autonomous, nor do they provide recommendations to the ETL developer for creating optimized data pipelines at run-time.</p>
      <p>In research, there exist a few stand-alone methods, data warehousing architectures, and utilities for the extraction and transformation phases of an ETL workflow for Big Data.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors proposed a method to semantically extract data from a variety of data sources (e.g., text, video, email, audio) in an ETL workflow. The discussed approach focuses on extracting the data and does not cover the compute-intensive transformation phase required for the 3Vs of Big Data. Furthermore, to define the semantics of data it requires a human expert to define an ontology, which is a tedious and time-consuming task.
      </p>
      <p>
        A data warehousing architecture for Big Data is discussed
in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The proposed architecture uses HDFS for data storage,
Talend Open Studio for ETL transformations, and Hive as a data
warehouse. However, the presented work did not address the difficulties of tackling the 3Vs of Big Data. In our paper, we addressed issues such as complex ETL workflows due to the 3Vs of Big Data and how to solve the issues of compute-intensive UDFs. Finally, we also provided a fully automated framework to create ETL workflows.
      </p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>In this paper, we presented an extendable ETL framework in
order to address the challenges posed by Big Data. We proposed
this ETL framework on the basis of limitations and
shortcomings in the currently existing ETL methodologies and tools. We
proposed a UDFs Component to address the issue of no or minimal support for UDFs and their optimization in currently existing ETL frameworks, which are an integral part of developing ETL transformations for Big Data. Furthermore, we proposed a recommendation module that utilizes a library of cost models and retrieves information from a monitoring agent in order to provide recommendations to the ETL developer. The monitoring agent module is proposed to assist the recommendation module as well as to provide end-to-end monitoring of ETL workflows.</p>
      <p>We believe that the proposed ETL framework is a step forward towards a fully automated ETL framework that helps ETL developers optimize ETL tasks and the overall ETL workflow for Big Data with the help of the recommendations, the monitoring agent, and the UDFs provided by the tool.</p>
      <p>Currently, we are working on the first steps towards building the complete ETL Framework, i.e., (1) the UDFs Component, to provide the library of reusable parallel algorithmic skeletons for the ETL developer, and (2) the Cost Model, to generate the most efficient execution plan for an ETL workflow.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>The research of Syed Ali has been funded by the European
Commission through the Erasmus Mundus Joint Doctorate
"Information Technologies for Business Intelligence Doctoral College"
(IT4BI-DC) and trivago N.V.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. M. F.</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          .
          <article-title>From conceptual design to performance optimization of ETL workflows: current state of research and open problems</article-title>
          .
          <source>The VLDB Journal</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Bansal</surname>
          </string-name>
          .
          <article-title>Towards a semantic extract-transform-load (ETL) framework for big data integration</article-title>
          .
          <source>In Proceedings of International Congress on Big Data</source>
          , pages
          <fpage>522</fpage>
          -
          <lpage>529</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kepner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mattson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          .
          <article-title>The BigDAWG Polystore System</article-title>
          .
          <source>SIGMOD Record</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ibaraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hasegawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Teranaka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Iwase</surname>
          </string-name>
          .
          <article-title>The multiple choice knapsack problem</article-title>
          .
          <source>Journal of the Operations Research Society of Japan</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iosup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ostermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Yigitbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prodan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fahringer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Epema</surname>
          </string-name>
          .
          <article-title>Performance analysis of cloud computing services for many-tasks scientific computing</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          , pages
          <fpage>931</fpage>
          -
          <lpage>945</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muriki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Canon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cholia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shalf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>Performance analysis of high performance computing applications on the amazon web services cloud</article-title>
          .
          <source>In International Conference on Cloud Computing Technology and Science</source>
          , pages
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          .
          <article-title>Scheduling strategies for efficient ETL execution</article-title>
          .
          <source>Information Systems</source>
          , pages
          <fpage>927</fpage>
          -
          <lpage>945</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Marjani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nasaruddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A. T.</given-names>
            <surname>Hashem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddiqa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Yaqoob</surname>
          </string-name>
          .
          <article-title>Big IoT data analytics: Architecture, opportunities, and open research challenges</article-title>
          .
          <source>IEEE Access</source>
          , pages
          <fpage>5247</fpage>
          -
          <lpage>5261</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Martinho</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Santos</surname>
          </string-name>
          .
          <article-title>An architecture for data warehousing in big data environments</article-title>
          .
          <source>In Proceedings of Research and Practical Issues of Enterprise Information Systems</source>
          , pages
          <fpage>237</fpage>
          -
          <lpage>250</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>State-space optimization of ETL workflows</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          , pages
          <fpage>1404</fpage>
          -
          <lpage>1419</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Dayal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Castellanos</surname>
          </string-name>
          .
          <article-title>Optimizing ETL workflows for fault-tolerance</article-title>
          .
          <source>In Proceedings of IEEE International Conference on Data Engineering (ICDE)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Terrizzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Colino</surname>
          </string-name>
          .
          <article-title>Data Wrangling: The Challenging Journey from the Wild to the Lake</article-title>
          .
          <source>In Proceedings of Conference on Innovative Data Systems Research (CIDR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Viana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>De Oliveira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mattoso</surname>
          </string-name>
          .
          <article-title>Towards a cost model for scheduling scientific workflows activities in cloud environments</article-title>
          .
          <source>In Proceedings of IEEE World Congress on Services</source>
          , pages
          <fpage>216</fpage>
          -
          <lpage>219</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>