MAIME: A Maintenance Manager for ETL Processes∗

Dariuš Butkevičius (Vilnius University), darius.butkeviciu@gmail.com
Philipp D. Freiberger, Frederik M. Halberg, Jacob B. Hansen, Søren Jensen, Michael Tarp (Aalborg University), {pfreib15, fhalbe12, jh12, sajens12, mtarp09}@student.aau.dk

∗Work done at Aalborg University

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
The proliferation of business intelligence applications moves most organizations into an era where data becomes an essential part of the success factors. More and more business focus has thus been put on the integration and processing of data in the enterprise environment. Developing and maintaining Extract-Transform-Load (ETL) processes becomes critical in most data-driven organizations. External Data Sources (EDSs) often change their schema, which potentially leaves the ETL processes that extract data from those EDSs invalid. Repairing these ETL processes is time-consuming and tedious. As a remedy, we propose MAIME as a tool to (semi-)automatically maintain ETL processes. MAIME works with SQL Server Integration Services (SSIS) and uses a graph model as a layer of abstraction on top of SSIS Data Flow tasks (ETL processes). We introduce a graph alteration algorithm which propagates detected EDS schema changes through the graph. Modifications done to a graph are directly applied to the underlying ETL process. It can be configured how MAIME handles EDS schema changes for different SSIS transformations. For the considered set of transformations, MAIME can maintain SSIS Data Flow tasks (semi-)automatically. Compared to doing this manually, the amount of user input is decreased by a factor of 9.5 and the time spent is reduced by a factor of 9.8 in an evaluation.

Categories and Subject Descriptors
Information systems [Information integration]: Extraction, transformation and loading

1. INTRODUCTION
Business Intelligence (BI) is an essential set of techniques and tools for a business to provide analytics and reporting in order to give decision support. The BI techniques and tools use a Data Warehouse (DW) containing the data for the decision making of a business. A DW contains transformed data from one or more External Data Sources (EDSs) [4]. In order to populate a DW, an Extract-Transform-Load (ETL) process is used. An ETL process extracts data from one or more EDSs, transforms the data into the desired format by cleansing it (i.e., correcting or removing corrupt/inaccurate data), conforming it from multiple sources, deriving new values, etc., and finally loads it into the target DW.

Construction of an ETL process is very time-consuming [4]. Maintaining ETL processes after deployment is, however, also very time-consuming. The challenges in the maintenance of ETL processes are well demonstrated by the following examples, which we have obtained from Danish organizations. The data warehouse solution at a pension and insurance firm needs to cope with unknown changes in the main pension system when weekly hotfixes are deployed. The database administrator at a facility management company has to manage more than 10,000 ETL jobs that execute daily. The ETL team at an online gaming-engine vendor has to manage the daily format changes in the web services delivering sales and marketing data. The data warehouse team at a financial institution struggles with both the variance of source data formats and a fixed monthly release window for ETL programs. To summarize, when the number of ETL programs becomes overwhelming for the IT team, management of these programs becomes time-consuming. In such a situation, unexpected changes in source system databases or source system deliveries add more complexity and risk to the maintenance of the ETL programs.

The impact of an EDS schema change depends on both the type of the EDS schema change and how ETL processes use the changed EDSs. Maintenance of ETL processes thus requires manual work, is very time-consuming, and is quite error-prone. To remedy these problems, we propose the tool MAIME, which can (1) detect schema changes in EDSs and (2) semi-automatically repair the affected ETL processes. MAIME works with SQL Server Integration Services (SSIS) [5] and supports Microsoft SQL Server. SSIS is chosen as the ETL platform since it is in the top three of the most used tools by businesses for data integration [12], has an easy-to-use graphical tool (SSIS Designer [2]), and includes an API which allows third-party programs like MAIME to modify ETL processes. To maintain ETL processes, we formalize and implement a graph model as a layer of abstraction on top of SSIS Data Flow tasks. By doing so, we only have to modify the graph, which then automatically modifies the corresponding SSIS Data Flow task. Running ETL processes is often an extensive and time-consuming task, and a single ETL process failing to execute can waste a considerable amount of time, since the administrator would have to repair the ETL process and run the ETL processes once again. Based on this observation, MAIME can be configured to repair ETL processes even if this requires deletion of parts/transformations of the ETL processes. In an evaluation, MAIME is shown to be able to successfully repair ETL processes in response to EDS schema changes.
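MAIME's first task, detecting EDS schema changes, amounts to comparing a previously stored metadata snapshot with a freshly captured one (Section 2 describes how MAIME captures snapshots from SQL Server's Information Schema Views and stores them as JSON). As a rough, hypothetical sketch of such a comparison, with invented table contents and an invented change-record format:

```python
# Toy sketch of schema change detection by snapshot comparison.
# A snapshot maps table name -> {column name: data type}.  This is NOT
# MAIME's implementation; it only illustrates the snapshot-diff idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaChange:
    kind: str    # "Addition", "Deletion", or "Data Type Change"
    table: str
    column: str

def diff_snapshots(old, new):
    changes = []
    for table in old.keys() & new.keys():
        old_cols, new_cols = old[table], new[table]
        for col in old_cols.keys() - new_cols.keys():
            changes.append(SchemaChange("Deletion", table, col))
        for col in new_cols.keys() - old_cols.keys():
            changes.append(SchemaChange("Addition", table, col))
        for col in old_cols.keys() & new_cols.keys():
            if old_cols[col] != new_cols[col]:
                changes.append(SchemaChange("Data Type Change", table, col))
    return changes

previous = {"Person": {"ID": "int", "Name": "nvarchar(30)", "Age": "int"}}
current  = {"Person": {"ID": "int", "Name": "nvarchar(60)"}}
changes = diff_snapshots(previous, current)
# Detects that Age was deleted and that the data type of Name changed.
```

Note that in a naive diff like this, a Rename is indistinguishable from a Deletion plus an Addition; detecting renames requires additional metadata beyond two name/type maps.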
For the implemented set of transformations, a comparison between resolving EDS schema changes in MAIME and doing it manually in the SSIS Designer tool shows that MAIME is on average 9.8 times faster and requires 9.5 times less input from the user.

The rest of the paper is organized as follows. Section 2 provides an overview of MAIME and how it detects EDS schema changes. Section 3 formalizes the graph model and describes its usage. Section 4 describes how graph alterations are done. Section 5 describes the implementation. Section 6 shows how MAIME compares to doing the maintenance manually. Section 7 covers related work. Section 8 concludes and provides directions for future work.

Figure 1: Architecture of MAIME. (The figure shows 1...n EDSs, the EDS Change Manager, the Metadata Store, the Maintenance Manager, the GUI, the Log, and the Configuration.)

2. OVERVIEW OF MAIME
In this section, we give an overview of the SSIS components covered by our solution. Then, we explain how MAIME detects EDS schema changes. Finally, we give a description of the architecture of MAIME.

Knight et al. [5] provide an overview of the SSIS architecture. To briefly summarize, a package is a core component of SSIS which connects all of its tasks together. The package is stored in a file which the SSIS engine can load to execute the contained control flows. A control flow of a package controls the order and execution of its contained tasks. This work focuses on Data Flow tasks. A Data Flow task can extract data from one or more sources, transform the data, and load it into one or more destinations. The actions performed in a Data Flow task thus resemble those of the general notion of an ETL process; we therefore regard a Data Flow task as an ETL process. A Data Flow task can make use of several types of transformations, referred to as Data Flow components in SSIS. The current prototype of MAIME covers the following subset of common transformations: Aggregate, Conditional Split, Data Conversion, Derived Column, Lookup, Sort, and Union All. To extract and load data, we use OLE DB Source and OLE DB Destination. For convenience, we extend the term "transformations" to also cover OLE DB Source and OLE DB Destination even though they do not transform data.

We now consider how to maintain ETL processes using these transformations. The first problem is to detect EDS schema changes. We do not want MAIME to rely on third-party tools for this. One possibility is then to use triggers. They would, however, need to be created first, and for many source systems there would be many of them to handle deletions, renames, and additions for hundreds of tables. Further, they would fire for each individual modification and not only after all modifications are done. Instead, we therefore just extract metadata from SQL Server's Information Schema Views with plain SQL statements (other DBMSs typically offer something similar such that support for them could also be added). We thus store a snapshot of how the source databases look, and when MAIME is run, it creates a new snapshot and compares that to the previous snapshot to detect which changes have occurred.

Figure 1 shows the architecture of MAIME. We first provide short descriptions of MAIME's surrounding elements that are used by the core MAIME components. EDSs correspond to the external data sources whose schema changes are detected. In Figure 1, the Metadata Store corresponds to a local directory. Each captured EDS metadata snapshot is stored as a JSON file in the Metadata Store. The Log is a collection of files that log every change done by MAIME so that an administrator can analyze the maintenance. The Configuration is a JSON file which stores the user-defined configurations.

The EDS Change Manager (ECM) captures metadata from the EDSs by using the Information Schema Views. The metadata snapshots are stored in the Metadata Store. Before repairing an ETL process, the current snapshot is compared with the previous snapshot. As a result, a list of EDS schema changes is produced which is accessible to the Maintenance Manager (MM). The core logic of MAIME resides in the MM. The MM loads specifications of ETL processes (i.e., SSIS Data Flow tasks) and creates corresponding graphs. One graph is responsible for updating one underlying ETL process. The MM contains multiple graphs and operates on them when the graph alteration algorithm (explained in Section 4.2) is called for the found EDS schema changes. Modifications made by the graph alteration algorithm to the graphs are done semi-automatically since the user can be prompted. How exactly this is done depends on the configurations described in Section 4.1. The GUI is accessed by an administrator to maintain ETL processes. It involves: (1) selecting ETL processes to maintain, (2) adjusting administrator configurations, (3) confirming or denying changes by answering prompts, (4) visualizing applied changes to the ETL processes, and (5) displaying log entries for the applied changes.

Table 1: Transformations, their properties, and number of allowed incoming and outgoing edges.

Transformation      Specific properties                  In    Out
OLE DB Source       database, table, and columns         0     1
OLE DB Destination  database, table, and columns         1     0
Aggregate           aggregations                         1     many
Conditional Split   conditions                           1     many
Data Conversion     conversions                          1     1
Derived Column      derivations                          1     1
Lookup              database, table, joins, columns,     1     2
                    and outputcolumns
Sort                sortings and passthrough             1     1
Union All           inputedges and unions                many  1

3. GRAPH MODEL
To analyze ETL processes and the effects of EDS schema changes, the ETL processes are modeled using graphs which provide a level of abstraction over the SSIS packages. When we make a change in our graph, corresponding changes are also applied to the underlying SSIS package. Afterwards, we save the SSIS package, which is then in an executable state. An advantage of using graphs for modeling an ETL process lies in the capability of easily representing dependencies between columns for all transformations and handling cascading changes using graph traversal. In this section, we describe our graph model in detail.
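The cascading changes mentioned above can be sketched with a minimal, hypothetical dependency mapping. The vertex and column names below are invented, and the real model (formalized below) stores dependencies per vertex as a mapping from an output column to the set of input columns it is computed from:

```python
# Sketch of cascading-change detection via column dependencies.
# Each vertex maps an output column to the set of input columns it
# depends on.  Visiting vertices in topological order, a deleted EDS
# column "kills" every column that transitively depends on it.
# Vertex/column names are invented for illustration.
dependencies = {
    "Source":    {"Amount": set(), "Age": set()},        # EDS columns
    "Derived":   {"Amount": {"Amount"}, "Age": {"Age"},
                  "AmountTimes10": {"Amount"}},          # Amount * 10
    "Aggregate": {"Age": {"Age"},
                  "AvgAmount": {"AmountTimes10"}},       # AVG, GROUP BY Age
}
topological_order = ["Source", "Derived", "Aggregate"]

def affected_columns(deleted_column):
    """All columns that can no longer be produced after the deletion."""
    dead = {deleted_column}
    for vertex in topological_order:
        for out, ins in dependencies[vertex].items():
            if ins & dead:              # an input of `out` is gone
                dead.add(out)
    return dead
```

Deleting Amount, for example, cascades through AmountTimes10 to AvgAmount, which is exactly the kind of ripple effect the graph alteration algorithm of Section 4 has to repair.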
Each ETL process is represented as an acyclic property graph G = (V, E) where a vertex v ∈ V represents a transformation and an edge e ∈ E represents that columns are transferred from one transformation to another. A property graph is a labeled, attributed, and directed multi-graph [10]. Each vertex and edge can have multiple properties. A property is a key-value pair. To refer to the property name of vertex v1, we use the notation v1.name. The set C represents all columns used in an ETL process (i.e., columns from an EDS and columns created during the ETL process). Each column c ∈ C is a 3-tuple (id, name, type) where id is a unique number, name is the name of the column, and type holds the data type of the column. id is included because name and type are not enough to uniquely identify a column. Each edge e ∈ E is a 3-tuple (v1, v2, columns) where v1, v2 ∈ V and columns ⊆ C is the set of the columns being transferred from v1 to v2. Putting columns on the edge is particularly advantageous for transformations which can have multiple outgoing edges where each edge can transfer a different set of columns, such as Aggregate in SSIS.

A vertex v representing a transformation has a set of properties depending on the type of the transformation. The properties all vertices have in common are name, type, and dependencies. The only exception is a vertex for an OLE DB Destination, which does not have the dependencies property. name is a unique name used to identify vertex v. type denotes which kind of transformation v represents (e.g., Conditional Split or Aggregate). dependencies shows how columns depend on each other. If, for example, we extract the column c from an EDS and use an Aggregate transformation that takes the average of c and outputs c′, we say that c′ ↦ {c}. In the case of the Aggregate transformation, c′ originates from c and is therefore dependent on c. Any modification such as deletion of c or a change of its data type can result in a similar change to c′. Formally, dependencies is a mapping from an output column o ∈ C to a set of input columns {c1, . . . , cn} ⊆ C. We say that o is dependent on {c1, . . . , cn} and denote this as o ↦ {c1, . . . , cn}. The output columns are defined as the columns that a vertex sends to another vertex through an outgoing edge, such as an OLE DB Source transferring columns to an Aggregate. The input columns are defined as the columns that a vertex receives from another vertex through an incoming edge. The main purpose of dependencies is to detect whether an EDS schema change has any cascading effects. Due to this, the graph contains some trivial dependencies if a transformation does not affect a column, e.g., a dependency such as c ↦ {c} where c is dependent on itself. One example of an output column depending on multiple input columns is a Derived Column transformation where a derived output column o is dependent on the input columns i1 and i2 if they were used to derive the value of o. Thus o ↦ {i1, i2}, showing that if an EDS schema change affects i1 or i2 then it may also affect o.

Additional properties of a vertex are defined by the type of the vertex. For instance, v.aggregations is a necessary property for Aggregate, whereas OLE DB Source does not use aggregations and does not have this property. Table 1 shows all properties specific to each transformation, and the number of allowed incoming and outgoing edges. The following describes the definitions of the properties in more detail and how dependencies are specified for each type of transformation.

An OLE DB Source is used to extract data from a table or view of an EDS. It provides data for upcoming transformations and is thus represented by a vertex with no incoming edges. An OLE DB Source has some additional properties. database is the name of the database data is extracted from. table is the name of the table data is extracted from. columns is the list of columns extracted from the table. dependencies for OLE DB Source is trivial as for each column c ∈ columns, c ↦ ∅. Each column is dependent on nothing in the OLE DB Source, as this is the first time the columns appear in the graph.

An OLE DB Destination represents an ending point in the graph and is responsible for loading data into a DW. OLE DB Destination has the same properties as OLE DB Source, here representing where columns are loaded to. Since OLE DB Destination does not have any outgoing edges, dependencies does not exist for OLE DB Destination.

An Aggregate applies aggregate functions such as SUM to values of columns, which results in new outputs with the aggregated values. Formally, the only property specific to an Aggregate vertex is defined as aggregations = {(f1, input1, output1, dest1), . . . , (fn, inputn, outputn, destn)} where fi ∈ {COUNT, COUNT DISTINCT, GROUP BY, SUM, AVG, MIN, MAX} is the aggregate function, inputi ∈ C is the input column that the ith output column is computed from, and outputi ∈ C is the result of the ith aggregation. desti ∈ V is the vertex receiving the output column outputi. A given destination dest can appear in multiple tuples of aggregations, which shows that the dest receives multiple columns through a single edge. For each tuple i in aggregations, we have the dependency outputi ↦ {inputi}.

A Conditional Split applies expressions to rows and can thereby route them to different destinations based on which conditions they satisfy. Expressions are for example SUBSTRING(title, 1, 3) == "Mr." or salary > 30000. Formally, the only property specific to the Conditional Split vertex is defined as conditions = {(expr1, p1, dest1), . . . , (exprn, pn, destn)} where expri is a predicate that specifies which rows are routed to desti and pi ∈ N+ is a priority which indicates in which order the conditions are evaluated (where 1 is the highest priority). The set of priorities {p1, . . . , pn} is a gap-free series of numbers that range from 1 to n, where n is the number of conditions. desti ∈ V is the vertex that receives the rows (that have not been taken by a higher-priority condition) for which expri holds. In Conditional Split, the set of output columns on each outgoing edge is equal to the set of input columns. This is because a Conditional Split transformation is not able to add, delete, or otherwise change columns, but is only able to filter rows. Thus, each column depends on itself.

A Data Conversion casts the value of a column into a new column with a new data type. Formally, the only specific property of the Data Conversion vertex is defined as conversions = {(input1, output1), . . . , (inputn, outputn)} where inputi ∈ C is the input column of the vertex and outputi ∈ C is a new column that has been created through the Data Conversion from inputi. The specific conversions of data types are present in the type of outputi. A Data Conversion creates a new column rather than replacing the existing column, and thus both inputi and outputi are output columns. dependencies is simple for Data Conversion, as a dependency is made for each new outputi such that outputi ↦ {inputi} for all tuples in conversions, and since all input columns are sent to the next vertex, each depends on itself.

A Derived Column creates new columns based on input columns and expressions applied to those inputs. Formally, the only property specific to the Derived Column vertex is defined as derivations = {(expr1, output1), . . . , (exprn, outputn)} where expri is an expression used for computing the new values of outputi, and outputi ∈ C is a newly created column. dependencies for Derived Column is similar to Data Conversion in that each input column is dependent on itself. For all output columns in derivations we define the dependency outputi ↦ {ci,1, . . . , ci,j} where {ci,1, . . . , ci,j} is the set of all columns used in expri.

A Lookup extracts additional data from a database by equi-joining with given input columns. Formally, the properties of a Lookup vertex are defined as follows. database is the name of the database to look up additional columns from. table is the name of the table used. columns ⊆ C represents all columns in the table. outputcolumns ⊆ columns is the columns extracted from the table. joins = {(input1, lookup1), . . . , (inputn, lookupn)} represents the equi-join of the Lookup, where inputi is a column from the preceding vertex used for the join condition and lookupi is a column from columns. The equi-join is used to extract the columns in outputcolumns. Each input column has a dependency to itself. Each new column in outputcolumns is dependent on all the input columns used in the join conditions of the Lookup. In other words, for each output column output ∈ outputcolumns, we have that output ↦ {input1, . . . , inputn} where inputi is the input column of the ith tuple in joins.

A Sort sorts the input rows in either ascending or descending order and creates new output columns for the sorted rows. It is possible to sort on multiple columns, in which case a priority has to be given. Formally, the properties of a Sort vertex are defined as follows. sortings = {(input1, output1, sorttype1, order1), . . . , (inputn, outputn, sorttypen, ordern)} where inputi, outputi ∈ C, sorttypei ∈ {ascending, descending}, and orderi ∈ N+ indicates in which order the columns are sorted (where 1 is the highest priority). The set of orders {order1, . . . , ordern} is a gap-free series of numbers from 1 to n. passthrough = {c1, . . . , cj} is the set of columns passed through the transformation. This does not include columns that are used for sorting. For dependencies, every output column is dependent on itself.

A Union All combines rows from multiple vertices. Formally, the properties of a Union All vertex are defined as follows. inputedges = (e1, . . . , ej) where ei is the ith incoming edge of the Union All and j is the number of incoming edges. unions = {(output1, input1), . . . , (outputn, inputn)} where inputi = (c1, . . . , cj). Each element in unions shows how multiple input columns are combined into a single output column. ci ∈ C ∪ {ε} where ε is a convention used to indicate that, for a given union, no input is taken from the corresponding edge. For example, the tuple (outputi, inputi) where inputi = (c1, ε, c3) indicates that the unioned outputi uses c1 from e1 and c3 from e3, but no column is used from e2. outputi is the column generated by unioning the columns in the ith input tuple. A column from a table can only be part of a single union. For the ith tuple of unions, outputi ↦ {c1, . . . , cj} such that the new column, which is a result of the union, is dependent on all columns that were used for the union. This dependency is derived for all tuples in unions.

Store                    Sale                      Person
ID: int (PK)             ID: int (PK)              ID: int (PK)
Address: nvarchar(60)    CustomerID: int (FK)      Name: nvarchar(30)
                         StoreID: int (FK)         Age: int
                         TotalAmount: money

Figure 2: Example of an EDS schema.

Figure 3: An ETL process. (The Data Flow task: OLE DB Source, Lookup, Derived Column, and Conditional Split, which routes to Aggregate and OLE DB Destination; Aggregate feeds OLE DB Destination 1.)

Example 1. Figure 2 shows the schema of an EDS called SourceDB. Now consider the SSIS Data Flow task shown in Figure 3 where
– OLE DB Source extracts the columns ID, Name, and Age from the Person table.
– Lookup extracts TotalAmount from the Sale table by joining Sale.CustomerID with Person.ID.
– Derived Column derives the new column AmountTimes10 from the derivation (TotalAmount * 10, AmountTimes10).
– Conditional Split splits the rows into two directions based on {(Age > 40, 1, Aggregate), (TotalAmount > 10000, 2, OLE DB Destination)}.
– Aggregate computes into AvgAmount the average of AmountTimes10 when grouping by Age.
– OLE DB Destination loads ID, Age, and AmountTimes10 into the DW table PersonSalesData.
– OLE DB Destination 1 loads AvgAmount and Age into the DW table SalesData.

Figure 4: A MAIME graph for the Data Flow task in Figure 3.

Figure 4 shows the corresponding MAIME graph where there are seven vertices, i.e., one for each transformation (the types are not shown since they are obvious from the names of the vertices). As an example of a dependency, we have AmountTimes10 ↦ {TotalAmount} in Derived Column. This comes from the derivation TotalAmount * 10. It shows that any change to TotalAmount can affect AmountTimes10, and if TotalAmount is deleted, AmountTimes10 can no longer be derived and therefore also needs to be deleted.

4. GRAPH ALTERATION
In this section, we describe how the MAIME graph is updated when changes happen at the EDSs. First, we describe MAIME's user-configurable settings. Then we describe the algorithm that updates the graph (which in turn updates the underlying SSIS Data Flow task).

4.1 Administrator Configurations
This section describes how an administrator can configure which modifications MAIME is allowed to perform in case of EDS schema changes. As an example, an administrator can define that only renames in the EDS should be propagated to the SSIS Data Flow tasks automatically, and that MAIME should block (i.e., not propagate) any other changes in the EDS, which then have to be handled manually. If MAIME is given free rein, it can, however, make any number of modifications to ensure that ETL processes execute successfully. In the following, we describe the configurations that an administrator can specify before executing MAIME. There are two types of configurations: (1) EDS schema change configurations and (2) advanced configurations. To properly understand the configurations, we first explain the different kinds of EDS schema changes. The schema changes that we consider are the following: Addition, which represents a new column in a schema; Deletion, which represents a deleted column in a schema; Rename, which represents a renamed column in a schema; and Data Type Change, which represents a column where the data type has changed (incl. properties such as length).

For an EDS schema change ch ∈ {Addition, Deletion, Rename, Data Type Change} and vertex type t, a policy p(t, ch) ∈ {Propagate, Block, Prompt} can be specified by the administrator. The Propagate policy defines that when p(t, ch) = Propagate, reparation of vertices of type t is allowed for EDS schema changes of type ch. The Block policy says that when p(t, ch) = Block, then for every vertex v where v.type = t, the alteration algorithm (explained in Section 4.2) is not allowed to make modifications to v or to any successor vsucc of v, even if p(vsucc.type, ch) = Propagate. The Prompt policy defines that the choice of whether to block or propagate the change is deferred to runtime, where the user is prompted to make the choice. Figure 5 shows how the administrator can set a policy for a type of EDS schema change, or be more specific and choose a policy for each type of vertex for a given EDS change. For example, she can configure that for all Conditional Split vertices, deletion of columns should be propagated, but all other EDS changes should be blocked.

Figure 5: Configuration of MAIME. (The dialog lists a policy per EDS schema change type, e.g., Propagate for Addition and Rename and Block for Deletion and Data Type Change, optionally refined per vertex type, next to the advanced configurations Allow deletion of transformations, Allow modification of expressions, and Use global blocking semantics.)

As can be seen in Figure 5, advanced configurations can also be specified. MAIME is based on the idea that ETL processes should be repaired such that all of them can execute afterwards. We now explain some more difficult cases of maintaining ETL processes and which configurations the administrator can use for such cases. Considering the case of a deletion of a column, we need to take into account that both columns and vertices can be dependent on the deleted column. Columns such as a derived column are dependent on the columns they were derived from. Vertices are dependent on columns if the columns are part of that vertex's input, expression, or condition, or are used in some other way by the vertex.

The Use global blocking semantics option dictates whether the alteration algorithm should terminate if the policy for any vertex is Block. If Use global blocking semantics is enabled and p(t, ch) = Block, then an ETL process containing a vertex of type t will not be considered for reparation whenever the EDS schema change is of type ch.

The Allow deletion of transformations option allows deletion of vertices in a graph (and thus transformations in an SSIS Data Flow task). With this option enabled, MAIME can modify graphs to such an extent that large portions of a graph are removed, but the corresponding updated ETL process can execute. Consider, for example, a Sort vertex that uses column a for sorting. If a is deleted from the EDS, this would render the vertex invalid. With the Allow deletion of transformations option enabled, the alteration algorithm attempts to repair the process by deleting the Sort vertex and any successive vertices, thereby making the process executable. The algorithm described in Section 4.2 further explains this principle.

The Allow modification of expressions option allows the alteration algorithm to modify expressions in vertices (such as Conditional Split and Derived Column) in the event of deletions or data type changes of columns. Consider, for example, a Conditional Split with input columns a and b and one condition with the expression a < 20 && b < 40, and assume that an EDS schema change of b being deleted occurs. If Allow modification of expressions is enabled, the alteration algorithm attempts to update the expression into a < 20. This would make the process run successfully without removing the whole expression, but the semantics is different. If Allow modification of expressions is disabled, the alteration algorithm would instead remove the expression. Modification of expressions is always allowed if the EDS schema change is the renaming of a column. This is because it is very simple to preserve the original semantics by replacing the old name of the renamed column with the new name in the expression.

4.2 The Graph Alteration Algorithm
This section details how our graph model adapts to EDS schema changes. As described above, the administrator can choose between the Propagate, Block, and Prompt policies. Only a propagation (which can also occur through a prompt) alters the graph. This section goes into depth on how to propagate an EDS schema change through the graph. How propagation is handled for a given type of EDS schema change is specific to each type of transformation. For space reasons, details about the different actions are not provided here, but are available elsewhere [3]. Table 2 shows for which transformations EDS schema changes are currently supported. However, just applying actions to each vertex independently is not enough, as a change in one vertex can also affect successive vertices. The Alter-graph algorithm takes in a single EDS schema change ch and our property graph G. To be able to handle multiple EDS schema changes, the Alter-graph algorithm is called for each EDS schema change. For simplicity, we only present the case where Allow deletion of transformations is enabled and neither Prompt nor data type change is used.

Algorithm 1: Algorithm for propagating EDS schema changes to the graph
Name: Alter-graph
Input: EDS-Change ch, Graph G
Output: Graph G
 1  List L = topological-sort(G)
 2  if Use global blocking semantics is enabled then
 3      foreach Vertex v ∈ L do
 4          Policy p = lookup-policy(v.type, ch.type)
 5          if p = Block then
 6              return G
 7  foreach Vertex v ∈ L (in topological order) do
 8      Policy p = lookup-policy(v.type, ch.type)
 9      if p = Block then
10          Remove all successors of v from L
11          continue
12      Update v's dependencies to not include deleted columns
13      foreach Column c ∈ v's outgoing edges do
14          if v.dependencies(c) = ∅ and v.type ≠ OLE DB Source then
15              if v is fully dependent on c then
16                  Delete v and v's incoming and outgoing edges from G
17                  Break inner loop
18              else
19                  Delete c from v's outgoing edges
20      if v.type = OLE DB Destination and v has no incoming edge then
21          Delete v from G
22      if v was not deleted then
23          Invoke the propagation action for ch.type on v
24  return G

Table 2: Currently supported EDS changes.

                     Deletion   Rename
OLE DB Source           ✓          ✓
OLE DB Destination      ✓          ✓
Aggregate               ✓          ✓
Conditional Split       ✓          ✓
Data Conversion         ✗          ✗
Derived Column          ✓          ✓
Lookup                  ✓          ✓
Sort                    ✗          ✗
Union All               ✗          ✗

The order in which we traverse the graph matters, as visiting a vertex also affects successive vertices. For this, topological sorting is used on Line 1 such that when a vertex is visited, it is guaranteed that all its predecessors have been visited beforehand. The list of topologically sorted vertices is referred to as L. On Lines 2–6, the algorithm handles the case where Use global blocking semantics is enabled. This entails checking if there exists a vertex in L with the Block policy defined for the EDS change type ch. In case such a vertex exists, the algorithm returns an unchanged graph. On Line 7, we start traversing each vertex v of L. On Lines 8–11, we check if the policy for v given ch is Block. If this is the case, we can disregard this branch of the graph for the traversal, which is why we remove all successors of v from the sorted list of vertices. Line 12 updates the dependencies of a vertex. This is done by going through each dependency in v.dependencies for a given vertex v and checking that all of the involved columns are still present in the incoming edges. The reason for this is that some columns might have been removed from v's incoming edges when traversing the preceding vertices, such that v.dependencies refers to columns that no longer exist in v's input.

As stated before, it is not sufficient to just look at vertices independently when going through the output columns of each vertex. Lines 13–19 show the case for deleting columns with no dependencies. As an example, consider a vertex v whose preceding transformation is a Derived Column with the input column c, which is used to derive a new column d through the expression d = c + 42. Now, v will receive c and d as input columns, but if the EDS schema change (ch) is a deletion of c, then not only is c no longer available to v, but d can no longer be derived either and must also be removed. If v is fully dependent on c, v is deleted together with its incoming and outgoing edges (Lines 15–16), and the inner loop is broken (Line 17) to avoid iterating over the output columns of a vertex that has been deleted. The other case of only deleting the given column is shown in Lines 18–19. Since a vertex of type OLE DB Destination does not have any outgoing edges, it is not considered in the previous loop on Lines 13–19. However, we still want to delete the vertex if it is invalid, i.e., if it has no incoming edge. This is performed on Lines 20–21. Afterwards, on Lines 22–23, if v was not deleted at an earlier point in the algorithm, the corresponding propagation action for the EDS schema change and transformation type is invoked. Finally, the fully altered graph is returned. Recall that the underlying SSIS Data Flow tasks get updated automatically when the graph is updated.
18 else 19 Delete c from v’s corresponding 5. IMPLEMENTATION outgoing edges and dependencies This section describes the implementation of the proto- type. MAIME was developed in C# for Microsoft SQL 20 if v.type is OLE DB Destination AND v has no Server 2014 Developer Edition and with the SSIS Designer incoming edges then as the data integration tool. The SSIS Designer is part of 21 Delete v the SQL Server Data Tools (v. 14.0.60203.0). The code- 22 if v was not deleted AND ch affects v then base for the implementation includes around 7,700 lines of 23 G = alter(G, v, ch) code, 85 classes, and 410 members. The implementation is open source and is available from https://github.com/ 24 return G sajens/MAIME. The advanced configurations, Allow deletion of transfor- mations and Use global blocking semantics are implemented. anymore, d will also be deleted, as it is no longer possible to We did not implement Allow modification of expressions derive it from the Derived Column without c. The way to yet. We focused our attention primarily on Allow dele- find out that d is no longer computable is by seeing that d is tion of transformations and Use global blocking semantics, dependent on ∅ (Line 14). This signifies that d was depen- since they seemed to be the most impactful configurations dent on column(s) which have been deleted. The exception for MAIME. Currently, Allow deletion of transformations is is an OLE DB Source, which is the only vertex for which always enabled. The Propagate and Block policies are imple- dependencies maps columns to ∅, as explained in Section mented as described previously. Table 2 shows which trans- 3 (OLE DB Source is the only transformation allowing no formations have been implemented with respect to deletion incoming edges). Beginning on Line 15, it is considered if and renaming of EDS columns in the current prototype. 
it is possible to delete only the given column d or if it is We now consider how the graph is represented internally necessary to delete the entire vertex v. This narrows down in MAIME. Our graph is implemented as a layer on top of to whether v is fully dependent on d or not. We say that a SSIS. Therefore, our graph constructs have references to the vertex v is fully dependent on some column c, when v would corresponding SSIS constructs. To be more exact, each ver- be rendered invalid if c was deleted. For instance, if v is tex refers to a SSIS component (transformation), each edge an Aggregation and c is the only remaining column that is refers to a SSIS path, and each column has a corresponding being used for aggregations, then v is fully dependent on c. reference to a so-called SSIS IDTS object. The SSIS IDTS The deletion of c would result in v.aggregations = ∅, which object covers both input and output columns in SSIS. By is not a valid transformation. What qualifies a vertex as having these references, we can easily propagate changes to being fully dependent on a column is specific for each type the underlying ETL process during execution of the graph of transformation, for details see [3]. Lines 16–17 show the alteration algorithm. For the graph, we have implemented case of deleting a whole vertex and all of its incoming and the classes Graph, Vertex, Edge, Column, and a class ex- outgoing edges, if they are no longer used. This iteration of tending Vertex for each supported transformation such that the loop breaks because it would not make sense to continue specific dependencies and properties can be represented. In order to instantiate the graph, we have to translate the starts when the user begins to use MAIME and ends when components from SSIS1 . Initially, a SSIS package is loaded the SSIS package has been repaired. MAIME is also used to by parsing its .dtsx file conforming to XML specification. 
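To make the internal representation concrete, the following Python sketch is an illustration only: the actual implementation is in C#, and apart from the names Graph, Vertex, and the dependencies mapping taken from the paper, all identifiers and simplifications are ours. It models each vertex with a per-output-column dependency set and applies a stripped-down, deletion-only version of Algorithm 1 (per-vertex Block handling and edge objects are omitted):

```python
from collections import defaultdict, deque

class Vertex:
    """A transformation; dependencies maps each output column to the set of
    input columns it is computed from (mapped to the empty set only for an
    OLE DB Source)."""
    def __init__(self, name, vtype, dependencies):
        self.name, self.type = name, vtype
        self.dependencies = {out: set(ins) for out, ins in dependencies.items()}

class Graph:
    def __init__(self):
        self.vertices, self.succ = [], defaultdict(list)

    def add_edge(self, u, v):
        self.succ[u].append(v)

    def topological_sort(self):
        # Kahn's algorithm: every predecessor comes before its successors.
        indeg = {v: 0 for v in self.vertices}
        for u in self.vertices:
            for w in self.succ[u]:
                indeg[w] += 1
        queue = deque(v for v in self.vertices if indeg[v] == 0)
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in self.succ[u]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
        return order

def alter_graph(graph, deleted_column, policies):
    """Deletion-only sketch of Algorithm 1. policies maps a transformation
    type to 'Propagate' or 'Block'; with global blocking semantics, a single
    Block policy leaves the graph unchanged."""
    if any(policies.get(v.type) == "Block" for v in graph.vertices):
        return graph                               # Lines 2-6
    removed = {deleted_column}
    for v in graph.topological_sort():             # Line 7
        if v.type == "OLE DB Source":              # alter action (Lines 22-23):
            v.dependencies.pop(deleted_column, None)  # drop the EDS column
            continue
        for ins in v.dependencies.values():        # Line 12: prune deleted
            ins -= removed                         # upstream columns
        for out in list(v.dependencies):           # Lines 13-19: outputs that
            if not v.dependencies[out]:            # can no longer be computed
                del v.dependencies[out]
                removed.add(out)                   # cascade downstream
        if not v.dependencies:                     # fully dependent vertex,
            graph.vertices.remove(v)               # cf. Lines 15-17
    return graph
```

Run on a source-to-aggregate chain, deleting a column removes derivations that depend on it while keeping vertices that still have valid outputs, and deletes vertices left with none, roughly mirroring the behavior the evaluation section describes for test case 1.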
In order to instantiate the graph, we have to translate the components from SSIS (see msdn.microsoft.com/en-us/library/ms403344.aspx). Initially, a SSIS package is loaded by parsing its .dtsx file, which conforms to an XML specification. Then, we extract its SSIS Data Flow tasks. Note that every Data Flow task is stored as a graph in a .dtsx file. Construction of a MAIME graph is done in two iterations. During the first iteration, we extract one Data Flow task and go through its components to create corresponding vertices with references to the Data Flow task components. With the set of vertices V and references to the SSIS components, we can extract information for each vertex such as v.name and v.type. Afterwards, every SSIS path is translated into an edge. This gives us the set of edges E of our graph. It is important to note that columns are not stored in the edges in this iteration, since they do not exist on SSIS paths. After the first iteration is complete, we have created the basic structure of the graph of MAIME. In the second iteration, we deduce dependencies, assign columns to edges, and create transformation-specific properties.

6. EVALUATION
We now evaluate the efficiency of MAIME. In the evaluation, we compare how much time and how many user inputs are required to resolve a series of EDS schema changes when using MAIME and when repairing the ETL flow manually. A user input is defined to be a mouse click or a keystroke done by the user (both are recorded separately). We discussed and validated this approach with practitioners in the field of ETL development.

We consider three SSIS packages, each with a single Data Flow task. For each considered SSIS package, both MAIME and the manual work end up having the same end result, i.e., the same state of the maintained SSIS package. Only the process of achieving that end result differs. The EDS schema changes used for the evaluation include deletion and renaming.

For the manual approach, the user performing the evaluation (one of the authors) is given the EDS schema changes and two versions of the ETL process: one in its initial state, and another in the desired result state. The user then performs maintenance of each package three times using Visual Studio 2015 Community edition and SQL Server Data Tools. The first evaluation gives an indication of how long it takes to maintain an ETL process without knowing every step needed to maintain it. The second and third attempts are performed in order to provide a best-case scenario, since the user learns the quickest way to maintain the ETL process through repetition. The time of the evaluation starts when the user begins repairing the ETL process after having seen the given information (i.e., EDS schema changes and the two versions of the ETL process). During the test, the duration and amount of user inputs used are recorded using a software tool.

When considering MAIME, the user interacts with MAIME to (1) accept the configurations, currently loaded ETL processes, and connected EDSs, and (2) start the maintenance process. The EDS schema change configurations of MAIME for all evaluations are all set to Propagate. For the advanced configurations, Allow deletion of transformations is enabled, Allow modification of expressions is disabled, and Use global blocking semantics is disabled. The time of the evaluation starts when the user begins to use MAIME and ends when the SSIS package has been repaired. MAIME is also used to repair the package three times, and the duration and amount of user inputs are again recorded using a software tool.

We acknowledge that the user knows MAIME very well and thus can easily use it. That is why the user is also given the desired resulting SSIS package for the manual approach, such that he does not have to think about what needs to be changed. This is thus an overly optimistic way to measure the needed time for the manual approach (the amounts of needed clicks and keystrokes are not affected, though).

We now elaborate on one of our test cases and a given list of EDS schema changes. In test case 1, we consider the scenario from Example 1 and assume the following changes to the EDS: (1) Age is renamed to RenamedAge in the Person table and (2) TotalAmount is deleted from the Sale table. The state of the graph after it has been maintained is shown in Figure 6 in the SSIS designer tool, and with MAIME's graph in Figure 7. The latter shows that Age is successfully renamed on all edges. The deletion of TotalAmount is slightly more complicated since the condition TotalAmount > 10000 in the Conditional Split involves the column. This condition is no longer valid, which results in the removal of the outgoing edge containing it. We therefore delete OLE DB Destination, as it no longer has any incoming edges. Derived Column derives the column AmountTimes10 from TotalAmount, so AmountTimes10 can no longer be derived. We delete this derivation but still retain our Derived Column transformation, since it can exist without doing any derivations. Another possibility would be to let MAIME delete the Derived Column and connect the Lookup and Conditional Split transformations. The Aggregate transformation takes the average of AmountTimes10, which no longer exists, and this aggregation is therefore also deleted. OLE DB Destination 1 thus no longer loads the aggregated AvgAmount from the Aggregate transformation into the DW. TotalAmount, AmountTimes10, and AvgAmount are all deleted from the edges.

Figure 6: The Data Flow task of the SSIS package for test case 1 after the EDS schema changes.

Test cases 2 and 3 are not shown here for space reasons, but are only briefly described. In test case 2, TotalAmount is deleted, and both Address and Person.ID are renamed. In test case 3, Name is deleted and Age is renamed.
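The EDS schema change lists driving these test cases can be thought of as the output of a simple schema diff. The following Python sketch is purely illustrative (MAIME's actual change detection against SQL Server metadata is not described at this level of detail; the snapshots below merely restate the test-case-1 changes, with column ids as in Figure 7). Using a stable column identifier tells a rename apart from a delete-plus-add:

```python
def diff_schema(old, new):
    """Compare two snapshots of {column_id: column_name} and emit EDS schema
    changes. A rename keeps the column id but changes the name; a deletion
    drops the id entirely; an addition introduces a new id."""
    changes = []
    for col_id, name in old.items():
        if col_id not in new:
            changes.append(("Deletion", name))
        elif new[col_id] != name:
            changes.append(("Rename", name, new[col_id]))
    for col_id, name in new.items():
        if col_id not in old:
            changes.append(("Addition", name))
    return changes

# Test case 1: Age is renamed to RenamedAge in Person,
# and TotalAmount is deleted from Sale.
person_old = {1: "ID", 2: "Name", 3: "Age"}
person_new = {1: "ID", 2: "Name", 3: "RenamedAge"}
sale_old = {1: "ID", 2: "CustomerID", 3: "StoreID", 4: "TotalAmount"}
sale_new = {1: "ID", 2: "CustomerID", 3: "StoreID"}

print(diff_schema(person_old, person_new))  # [('Rename', 'Age', 'RenamedAge')]
print(diff_schema(sale_old, sale_new))      # [('Deletion', 'TotalAmount')]
```

Each emitted change would then be handed to Alter-graph in turn, as Section 4.2 describes for handling multiple EDS schema changes.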
For each test case, the used time is shown in Table 3 and the amounts of user inputs are shown in Table 4. For time consumption, MAIME took on average 4 seconds across all 3 test cases, while manually resolving the changes took 39.3 seconds on average across all 3 cases for the third attempt. This means MAIME was on average 9.8 times faster at resolving EDS schema changes. For user input, MAIME required on average 4 user inputs, while manually resolving the changes required 38 user inputs on average for the third attempt across all 3 cases. This means MAIME on average required 9.5 times fewer inputs for resolving EDS schema changes.

Table 3: Time needed for handling EDS schema changes for the three test cases.

  Time Elapsed   MAIME                 Manual
  Test Case 1    4 sec, 4 sec, 4 sec   187 sec, 159 sec, 59 sec
  Test Case 2    4 sec, 4 sec, 4 sec   154 sec, 60 sec, 49 sec
  Test Case 3    4 sec, 4 sec, 4 sec   23 sec, 13 sec, 10 sec

Table 4: User inputs needed for handling EDS schema changes for the three test cases.

  User Inputs    MAIME                   Manual
  Test Case 1    Keystrokes: 0, 0, 0     Keystrokes: 23, 15, 12
                 Mouse clicks: 4, 4, 4   Mouse clicks: 88, 85, 38
  Test Case 2    Keystrokes: 0, 0, 0     Keystrokes: 9, 9, 8
                 Mouse clicks: 4, 4, 4   Mouse clicks: 92, 48, 45
  Test Case 3    Keystrokes: 0, 0, 0     Keystrokes: 0, 0, 0
                 Mouse clicks: 4, 4, 4   Mouse clicks: 16, 11, 11

[Figure 7 shows the MAIME graph for test case 1 after maintenance: the OLE DB Source, Lookup, Derived Column, Conditional Split, Aggregate, and OLE DB Destination vertices with their columns and dependencies. Age has been renamed to RenamedAge on all edges, and TotalAmount, AmountTimes10, and AvgAmount have been removed.]

Figure 7: The MAIME graph for the Data Flow task in Figure 6.

For the manual work, it took significantly more time the first time the user had to maintain the ETL process. The user did, however, get progressively quicker at maintaining the ETL processes in the second and third attempts, with the best result being from the third attempt. The numbers of both mouse clicks and keystrokes showed similar results. For these evaluations, MAIME is thus able to significantly reduce the amount of time and user input needed for maintaining ETL processes.

In a real-life scenario, a company would have to maintain a multitude of ETL processes, each of which could take on average 39.3 seconds to repair, as our evaluation showed. However, the timings shown for MAIME in Table 3 (4 seconds) include the time to first load the ETL processes, detect EDS schema changes, and the time the user spent on clicking buttons in the GUI. If there were many ETL processes, it would still be enough to do this once. The reparation algorithm of MAIME uses less than a second in all the cases, and we thus argue that the evaluation setup actually gives the manual process an advantage.

7. RELATED WORK
Related work exists on the topic of maintaining ETL processes when the schema of EDSs changes. The framework Hecataeus by Papastefanatos et al. [7, 8, 9] analyzes ETL processes by detecting changes to the schema of the EDSs (called evolution events) and proposes changes to the ETL processes based on defined policies. The Hecataeus framework abstracts ETL processes as SQL queries and views which are used in a graph to represent the activities. An activity resembles an ETL transformation. Hecataeus's graph is captured by an ETL Summary which can have multiple subgraphs for each activity. Each activity can further be broken down, as it includes nodes or entire subgraphs for relations, SQL queries, conditions, and views, e.g., for input and output. The types of evolution events taken into account are addition, deletion, and modification. An administrator can annotate nodes and edges with policies for each type of evolution event, while in MAIME policies are provided for each EDS change type and not specifically for each node or edge. The approach taken by MAIME would be preferable if the administrator had to repair a lot of ETL processes. The three Hecataeus policies dictate how ETL processes should be adjusted when an evolution event occurs: (1) Propagate readjusts the graph to reflect the new semantics according to the evolution event throughout the rest of the graph, (2) Block retains the old semantics, and (3) Prompt asks the administrator to choose Propagate or Block. Propagate and Prompt are similar in MAIME, whereas Hecataeus's Block attempts to retain the semantics. An extension of Hecataeus has been made for what-if analysis of the evolutions of relations, views, or queries in a data-intensive ecosystem [6]. While MAIME models columns on edges, Hecataeus models input and output schemata as subgraphs.

Another framework is E-ETL [14, 15] by Wojciechowski, which is also able to semi-automatically adapt ETL processes to EDS schema changes. In order to adapt the ETL processes, there are different methods which define how reparations are propagated. E-ETL has the same three policies as Hecataeus, with the exception of the Block policy: in E-ETL, the Block policy ignores the EDS schema change and does not attempt to modify the graph. The E-ETL framework has multiple ways to handle an EDS schema change. These include Defined rules, in which the administrator can define rules himself for nodes and edges; Standard rules, which are the default rules used if the administrator did not define anything; and Alternative scenarios, where case-based reasoning is used to adjust the ETL process based on solutions to similar problems experienced previously. How these methods are used together is not specified. Like Hecataeus, E-ETL models ETL processes through SQL queries.

Related work also exists on how to conceptually model ETL processes using transformations that are commonly used in ETL processes. Trujillo et al. [13] do this by defining a small set of transformations modeled through UML diagrams. Examples of the mechanisms include Aggregation, Conversion, and Join. This is in contrast to other frameworks, such as Hecataeus and E-ETL, where ETL processes are described by SQL queries and views, and represented with graphs. Using transformations rather than SQL queries can make conceptual modeling simpler and more maintainable, as each transformation has a clear responsibility and provides a higher level of abstraction compared to SQL queries, but at the cost of expressiveness. Trujillo et al. provide a set of general ETL processes that are not specific to any platform, while our transformations are specific to SSIS. Furthermore, instead of using UML diagrams, we use property graphs to more closely relate to the internal graphs of SSIS. In [11], conceptual models are specified with a kind of diagram which is then mapped to logical models where graphs are also used.

Business Intelligence Markup Language (BIML) [1] is an XML dialect that can be used to generate SSIS packages. It can be used to create templates and reusable code snippets to make the creation of SSIS packages easier and less time-consuming. BIML does, however, not handle changes, but can rather re-generate packages when a change occurs.

8. CONCLUSION AND FUTURE WORK
ETL processes can be complex and time-consuming to repair manually. We presented MAIME as a tool to reduce the amount of errors and time spent on maintaining ETL processes. To accomplish this task, we introduced and implemented a graph model as a layer on top of SSIS Data Flow tasks, which simplifies the handling of EDS schema changes.

As we have seen in the evaluation, MAIME required 9.5 times fewer inputs from the user and was 9.8 times faster compared to doing it manually with the SSIS Designer tool. MAIME can thus ease the burden of maintaining ETL processes.

There are a number of possible directions for future work. The paper presents an adaptive way of handling source data changes in ETL programs. One direction for future work is to make MAIME more robust against the variance of source types and change types. The adaptivity very often depends on manual decisions made by the ETL administrator. Thus, an interesting extension of this work is to model or formalize the pattern of the manual decisions and predict the next decisions such that the system execution process involves fewer manual interruptions. Since the system starts to make automatic changes to ETL programs, it is very natural that the ETL administrators consider "rolling back" the execution of certain groups of ETL programs to the previous state. How to define the transaction process in such an ETL program and how to ensure the possibility of the roll-back action is another direction to explore.

We have so far disregarded propagating changes to the DW, e.g., adding a column to the relevant DW schema. However, an administrator might want to do proper adjustments to the DW in order to capture the correct semantics of an EDS schema change. For example, when a column is inserted in an EDS, it might be reasonable to insert a corresponding column into the DW. This is another direction to investigate. It could also be interesting to investigate how to maintain ETL programs when the target DW is changed.

Another direction for future work is to extend the current prototype. Currently, we extract only one SSIS Data Flow task from the current SSIS package. An improvement would be to extract all SSIS Data Flow tasks from each SSIS package. Furthermore, support for more SSIS transformations could be implemented, as well as support for SSIS Control Flow tasks. It would also be interesting to have support for more kinds of data sources, such as XML (where a schema can be explicitly given) and CSV (where a schema typically has to be inferred). We note that a few local software vendors and IT service providers have been in the process of adopting and extending MAIME into their product and service offerings.

9. ADDITIONAL AUTHORS
Harry Xuegang Huang (Danmarks Nationalbank, email: xh@nationalbanken.dk) and Christian Thomsen (Dept. of Comp. Science, Aalborg University, email: chr@cs.aau.dk).

10. REFERENCES
[1] Business Intelligence Markup Language. https://www.varigence.com/biml (Last accessed 2016-10-31).
[2] SSIS Designer. https://msdn.microsoft.com/en-us/library/ms137973(v=sql.120).aspx (Last accessed 2016-10-31).
[3] D. Butkevičius et al. MAIME - Maintenance Manager for ETL. Student project d802f16, Dept. of Comp. Science, Aalborg University, August 2016. https://github.com/sajens/MAIME/blob/master/Technical-Report.pdf.
[4] R. Kimball and M. Ross. The Data Warehouse Toolkit. John Wiley & Sons, 2011.
[5] B. Knight et al. Professional Microsoft SQL Server 2012 Integration Services. John Wiley & Sons, 2012.
[6] P. Manousis, P. Vassiliadis, and G. Papastefanatos. Impact Analysis and Policy-Conforming Rewriting of Evolving Data-Intensive Ecosystems. Journal on Data Semantics, 4(4), 2015.
[7] G. Papastefanatos et al. Rule-Based Management of Schema Changes at ETL Sources. ADBIS, 2010.
[8] G. Papastefanatos et al. What-If Analysis for Data Warehouse Evolution. DaWaK, 2007.
[9] G. Papastefanatos et al. Policy-Regulated Management of ETL Evolution. Journal on Data Semantics XIII, 2009.
[10] M. A. Rodriguez and P. Neubauer. Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology, 2010.
[11] A. Simitsis. Mapping Conceptual to Logical Models for ETL Processes. DOLAP, 2005.
[12] E. Thoo and M. A. Beyer. Gartner's Magic Quadrant for Data Integration Tools. 2014.
[13] J. Trujillo and S. Luján-Mora. A UML Based Approach for Modeling ETL Processes in Data Warehouses. ER, 2003.
[14] A. Wojciechowski. E-ETL: Framework for Managing Evolving ETL Workflows. Foundations of Computing and Decision Sciences, 38(2), 2013.
[15] A. Wojciechowski. E-ETL Framework: ETL Process Reparation Algorithms Using Case-Based Reasoning. ADBIS Short Papers and Workshops, 2015.