MAIME: A Maintenance Manager for ETL Processes∗

Dariuš Butkevičius (Vilnius University), darius.butkeviciu@gmail.com
Philipp D. Freiberger, Frederik M. Halberg, Jacob B. Hansen, Søren Jensen, Michael Tarp (Aalborg University), {pfreib15, fhalbe12, jh12, sajens12, mtarp09}@student.aau.dk

∗Work done at Aalborg University

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
The proliferation of business intelligence applications moves most organizations into an era where data becomes an essential part of the success factors. More and more business focus has thus been put on the integration and processing of data in the enterprise environment. Developing and maintaining Extract-Transform-Load (ETL) processes becomes critical in most data-driven organizations. External Data Sources (EDSs) often change their schema, which potentially leaves the ETL processes that extract data from those EDSs invalid. Repairing these ETL processes is time-consuming and tedious. As a remedy, we propose MAIME as a tool to (semi-)automatically maintain ETL processes. MAIME works with SQL Server Integration Services (SSIS) and uses a graph model as a layer of abstraction on top of SSIS Data Flow tasks (ETL processes). We introduce a graph alteration algorithm which propagates detected EDS schema changes through the graph. Modifications done to a graph are directly applied to the underlying ETL process. It can be configured how MAIME handles EDS schema changes for different SSIS transformations. For the considered set of transformations, MAIME can maintain SSIS Data Flow tasks (semi-)automatically. Compared to doing this manually, the amount of user input is decreased by a factor of 9.5 and the time spent is reduced by a factor of 9.8 in an evaluation.

Categories and Subject Descriptors
Information systems [Information integration]: Extraction, transformation and loading

1. INTRODUCTION
Business Intelligence (BI) is an essential set of techniques and tools for a business to provide analytics and reporting in order to give decision support. The BI techniques and tools use a Data Warehouse (DW) containing the data for the decision making of a business. A DW contains transformed data from one or more External Data Sources (EDSs) [4]. In order to populate a DW, an Extract-Transform-Load (ETL) process is used. An ETL process extracts data from one or more EDSs, transforms the data into the desired format by cleansing it (i.e., correcting or removing corrupt/inaccurate data), conforming it from multiple sources, deriving new values, etc., and finally loads it into the target DW.

Construction of an ETL process is very time-consuming [4]. Maintaining ETL processes after deployment is, however, also very time-consuming. The challenges in the maintenance of ETL processes are well demonstrated by the following examples, which we have obtained from Danish organizations. The data warehouse solution at a pension and insurance firm needs to cope with unknown changes in the main pension system when weekly hotfixes are deployed. The database administrator at a facility management company has to manage more than 10,000 ETL jobs that execute daily. The ETL team at an online gaming-engine vendor has to manage the daily format changes in the web services delivering sales and marketing data. The data warehouse team at a financial institution struggles with both the variance of source data formats and a fixed monthly release window for ETL programs. To summarize, when the number of ETL programs becomes overwhelming for the IT team, management of these programs becomes time-consuming. In such a situation, unexpected changes in source system databases or source system deliveries add more complexity and risk to the maintenance of the ETL programs.

The impact of an EDS schema change depends on both the type of the EDS schema change and how ETL processes use the changed EDSs. Maintenance of ETL processes thus requires manual work, is very time-consuming, and is quite error-prone. To remedy these problems, we propose the tool MAIME, which can (1) detect schema changes in EDSs and (2) semi-automatically repair the affected ETL processes. MAIME works with SQL Server Integration Services (SSIS) [5] and supports Microsoft SQL Server. SSIS is chosen as the ETL platform since it is in the top three of the most used tools by businesses for data integration [12], has an easy-to-use graphical tool (SSIS Designer [2]), and includes an API which allows third-party programs like MAIME to modify ETL processes. To maintain ETL processes, we formalize and implement a graph model as a layer of abstraction on top of SSIS Data Flow tasks. By doing so, we only have to modify the graph, which then automatically modifies the corresponding SSIS Data Flow task. Running ETL processes is often an extensive and time-consuming task, and a single ETL process failing to execute can waste a considerable amount of time, since the administrator would have to repair the ETL process and run the ETL processes once again. Based on this observation, MAIME can be configured to repair ETL processes even if this requires deletion of parts/transformations of the ETL processes. In an evaluation, MAIME is shown to be able to successfully repair ETL processes in response to EDS schema changes.
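MAIME's first task, detecting EDS schema changes, amounts to comparing a previously stored metadata snapshot with a freshly captured one (Section 2 describes how MAIME captures snapshots from SQL Server's Information Schema Views and stores them as JSON). As a rough, hypothetical sketch of such a comparison, with invented table contents and an invented change-record format:

```python
# Toy sketch of schema change detection by snapshot comparison.
# A snapshot maps table name -> {column name: data type}.  This is NOT
# MAIME's implementation; it only illustrates the snapshot-diff idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaChange:
    kind: str    # "Addition", "Deletion", or "Data Type Change"
    table: str
    column: str

def diff_snapshots(old, new):
    changes = []
    for table in old.keys() & new.keys():
        old_cols, new_cols = old[table], new[table]
        for col in old_cols.keys() - new_cols.keys():
            changes.append(SchemaChange("Deletion", table, col))
        for col in new_cols.keys() - old_cols.keys():
            changes.append(SchemaChange("Addition", table, col))
        for col in old_cols.keys() & new_cols.keys():
            if old_cols[col] != new_cols[col]:
                changes.append(SchemaChange("Data Type Change", table, col))
    return changes

previous = {"Person": {"ID": "int", "Name": "nvarchar(30)", "Age": "int"}}
current  = {"Person": {"ID": "int", "Name": "nvarchar(60)"}}
changes = diff_snapshots(previous, current)
# Detects that Age was deleted and that the data type of Name changed.
```

Note that in a naive diff like this, a Rename is indistinguishable from a Deletion plus an Addition; detecting renames requires additional metadata beyond two name/type maps.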
For the implemented set of transformations, a comparison between resolving EDS schema changes in MAIME and doing it manually in the SSIS Designer tool shows that MAIME is on average 9.8 times faster and requires 9.5 times less input from the user.

The rest of the paper is organized as follows. Section 2 provides an overview of MAIME and how it detects EDS schema changes. Section 3 formalizes the graph model and describes its usage. Section 4 describes how graph alterations are done. Section 5 describes the implementation. Section 6 shows how MAIME compares to doing the maintenance manually. Section 7 covers related work. Section 8 concludes and provides directions for future work.

Figure 1: Architecture of MAIME. (The figure shows 1...n EDSs, the EDS Change Manager, the Metadata Store, the Maintenance Manager, the GUI, the Log, and the Configuration.)

2. OVERVIEW OF MAIME
In this section, we give an overview of the SSIS components covered by our solution. Then, we explain how MAIME detects EDS schema changes. Finally, we give a description of the architecture of MAIME.

Knight et al. [5] provide an overview of the SSIS architecture. To briefly summarize, a package is a core component of SSIS which connects all of its tasks together. The package is stored in a file which the SSIS engine can load to execute the contained control flows. A control flow of a package controls the order and execution of its contained tasks. This work focuses on Data Flow tasks. A Data Flow task can extract data from one or more sources, transform the data, and load it into one or more destinations. The actions performed in a Data Flow task thus resemble those of the general notion of an ETL process; we therefore regard a Data Flow task as an ETL process. A Data Flow task can make use of several types of transformations, referred to as Data Flow components in SSIS. The current prototype of MAIME covers the following subset of common transformations: Aggregate, Conditional Split, Data Conversion, Derived Column, Lookup, Sort, and Union All. To extract and load data, we use OLE DB Source and OLE DB Destination. For convenience, we extend the term "transformations" to also cover OLE DB Source and OLE DB Destination even though they do not transform data.

We now consider how to maintain ETL processes using these transformations. The first problem is to detect EDS schema changes. We do not want MAIME to rely on third-party tools for this. One possibility is then to use triggers. They would, however, need to be created first, and for many source systems there would be many of them to handle deletions, renames, and additions for hundreds of tables. Further, they would fire for each individual modification and not only after all modifications are done. Instead, we therefore just extract metadata from SQL Server's Information Schema Views with plain SQL statements (other DBMSs typically offer something similar such that support for them could also be added). We thus store a snapshot of how the source databases look, and when MAIME is run, it creates a new snapshot and compares that to the previous snapshot to detect which changes have occurred.

Figure 1 shows the architecture of MAIME. We first provide short descriptions of MAIME's surrounding elements that are used by the core MAIME components. EDSs correspond to the external data sources whose schema changes are detected. In Figure 1, the Metadata Store corresponds to a local directory. Each captured EDS metadata snapshot is stored as a JSON file in the Metadata Store. The Log is a collection of files that log every change done by MAIME so that an administrator can analyze the maintenance. The Configuration is a JSON file which stores the user-defined configurations.

The EDS Change Manager (ECM) captures metadata from the EDSs by using the Information Schema Views. The metadata snapshots are stored in the Metadata Store. Before repairing an ETL process, the current snapshot is compared with the previous snapshot. As a result, a list of EDS schema changes is produced which is accessible to the Maintenance Manager (MM). The core logic of MAIME resides in the MM. The MM loads specifications of ETL processes (i.e., SSIS Data Flow tasks) and creates corresponding graphs. One graph is responsible for updating one underlying ETL process. The MM contains multiple graphs and operates on them when the graph alteration algorithm (explained in Section 4.2) is called for the found EDS schema changes. Modifications made by the graph alteration algorithm to the graphs are done semi-automatically since the user can be prompted. How exactly this is done depends on the configurations described in Section 4.1. The GUI is accessed by an administrator to maintain ETL processes. It involves: (1) selecting ETL processes to maintain, (2) adjusting administrator configurations, (3) confirming or denying changes by answering prompts, (4) visualizing applied changes to the ETL processes, and (5) displaying log entries for the applied changes.

Table 1: Transformations, their properties, and number of allowed incoming and outgoing edges.

Transformation      Specific properties                  In    Out
OLE DB Source       database, table, and columns         0     1
OLE DB Destination  database, table, and columns         1     0
Aggregate           aggregations                         1     many
Conditional Split   conditions                           1     many
Data Conversion     conversions                          1     1
Derived Column      derivations                          1     1
Lookup              database, table, joins, columns,     1     2
                    and outputcolumns
Sort                sortings and passthrough             1     1
Union All           inputedges and unions                many  1

3. GRAPH MODEL
To analyze ETL processes and the effects of EDS schema changes, the ETL processes are modeled using graphs which provide a level of abstraction over the SSIS packages. When we make a change in our graph, corresponding changes are also applied to the underlying SSIS package. Afterwards, we save the SSIS package, which is then in an executable state. An advantage of using graphs for modeling an ETL process lies in the capability of easily representing dependencies between columns for all transformations and handling cascading changes using graph traversal. In this section, we describe our graph model in detail.
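The cascading changes mentioned above can be sketched with a minimal, hypothetical dependency mapping. The vertex and column names below are invented, and the real model (formalized below) stores dependencies per vertex as a mapping from an output column to the set of input columns it is computed from:

```python
# Sketch of cascading-change detection via column dependencies.
# Each vertex maps an output column to the set of input columns it
# depends on.  Visiting vertices in topological order, a deleted EDS
# column "kills" every column that transitively depends on it.
# Vertex/column names are invented for illustration.
dependencies = {
    "Source":    {"Amount": set(), "Age": set()},        # EDS columns
    "Derived":   {"Amount": {"Amount"}, "Age": {"Age"},
                  "AmountTimes10": {"Amount"}},          # Amount * 10
    "Aggregate": {"Age": {"Age"},
                  "AvgAmount": {"AmountTimes10"}},       # AVG, GROUP BY Age
}
topological_order = ["Source", "Derived", "Aggregate"]

def affected_columns(deleted_column):
    """All columns that can no longer be produced after the deletion."""
    dead = {deleted_column}
    for vertex in topological_order:
        for out, ins in dependencies[vertex].items():
            if ins & dead:              # an input of `out` is gone
                dead.add(out)
    return dead
```

Deleting Amount, for example, cascades through AmountTimes10 to AvgAmount, which is exactly the kind of ripple effect the graph alteration algorithm of Section 4 has to repair.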
Each ETL process is represented as an acyclic property graph G = (V, E) where a vertex v ∈ V represents a transformation and an edge e ∈ E represents that columns are transferred from one transformation to another. A property graph is a labeled, attributed, and directed multi-graph [10]. Each vertex and edge can have multiple properties. A property is a key-value pair. To refer to the property name of vertex v1, we use the notation v1.name. The set C represents all columns used in an ETL process (i.e., columns from an EDS and columns created during the ETL process). Each column c ∈ C is a 3-tuple (id, name, type) where id is a unique number, name is the name of the column, and type holds the data type of the column. id is included because name and type are not enough to uniquely identify a column. Each edge e ∈ E is a 3-tuple (v1, v2, columns) where v1, v2 ∈ V and columns ⊆ C is the set of the columns being transferred from v1 to v2. Putting columns on the edge is particularly advantageous for transformations which can have multiple outgoing edges where each edge can transfer a different set of columns, such as Aggregate in SSIS.

A vertex v representing a transformation has a set of properties depending on the type of the transformation. The properties all vertices have in common are name, type, and dependencies. The only exception is a vertex for an OLE DB Destination, which does not have the dependencies property. name is a unique name used to identify vertex v. type denotes which kind of transformation v represents (e.g., Conditional Split or Aggregate). dependencies shows how columns depend on each other. If, for example, we extract the column c from an EDS and use an Aggregate transformation that takes the average of c and outputs c′, we say that c′ ↦ {c}. In the case of the Aggregate transformation, c′ originates from c and is therefore dependent on c. Any modification such as deletion of c or a change of its data type can result in a similar change to c′. Formally, dependencies is a mapping from an output column o ∈ C to a set of input columns {c1, . . . , cn} ⊆ C. We say that o is dependent on {c1, . . . , cn} and denote this as o ↦ {c1, . . . , cn}. The output columns are defined as the columns that a vertex sends to another vertex through an outgoing edge, such as an OLE DB Source transferring columns to an Aggregate. The input columns are defined as the columns that a vertex receives from another vertex through an incoming edge. The main purpose of dependencies is to detect whether an EDS schema change has any cascading effects. Due to this, the graph contains some trivial dependencies if a transformation does not affect a column, e.g., a dependency such as c ↦ {c} where c is dependent on itself. One example of an output column depending on multiple input columns is a Derived Column transformation where a derived output column o is dependent on the input columns i1 and i2 if they were used to derive the value of o. Thus o ↦ {i1, i2}, showing that if an EDS schema change affects i1 or i2 then it may also affect o.

Additional properties of a vertex are defined by the type of the vertex. For instance, v.aggregations is a necessary property for Aggregate, whereas OLE DB Source does not use aggregations and does not have this property. Table 1 shows all properties specific to each transformation, and the number of allowed incoming and outgoing edges. The following describes the definitions of the properties in more detail and how dependencies are specified for each type of transformation.

An OLE DB Source is used to extract data from a table or view of an EDS. It provides data for upcoming transformations and is thus represented by a vertex with no incoming edges. An OLE DB Source has some additional properties. database is the name of the database data is extracted from. table is the name of the table data is extracted from. columns is the list of columns extracted from the table. dependencies for OLE DB Source is trivial as for each column c ∈ columns, c ↦ ∅. Each column is dependent on nothing in the OLE DB Source, as this is the first time the columns appear in the graph.

An OLE DB Destination represents an ending point in the graph and is responsible for loading data into a DW. OLE DB Destination has the same properties as OLE DB Source, here representing where columns are loaded to. Since OLE DB Destination does not have any outgoing edges, dependencies does not exist for OLE DB Destination.

An Aggregate applies aggregate functions such as SUM to values of columns, which results in new outputs with the aggregated values. Formally, the only property specific to an Aggregate vertex is defined as aggregations = {(f1, input1, output1, dest1), . . . , (fn, inputn, outputn, destn)} where fi ∈ {COUNT, COUNT DISTINCT, GROUP BY, SUM, AVG, MIN, MAX} is the aggregate function, inputi ∈ C is the input column that the ith output column is computed from, and outputi ∈ C is the result of the ith aggregation. desti ∈ V is the vertex receiving the output column outputi. A given destination dest can appear in multiple tuples of aggregations, which shows that the dest receives multiple columns through a single edge. For each tuple i in aggregations, we have the dependency outputi ↦ {inputi}.

A Conditional Split applies expressions to rows and can thereby route them to different destinations based on which conditions they satisfy. Expressions are for example SUBSTRING(title, 1, 3) == "Mr." or salary > 30000. Formally, the only property specific to the Conditional Split vertex is defined as conditions = {(expr1, p1, dest1), . . . , (exprn, pn, destn)} where expri is a predicate that specifies which rows are routed to desti and pi ∈ N+ is a priority which indicates in which order the conditions are evaluated (where 1 is the highest priority). The set of priorities {p1, . . . , pn} is a gap-free series of numbers that range from 1 to n, where n is the number of conditions. desti ∈ V is the vertex that receives the rows (that have not been taken by a higher-priority condition) for which expri holds. In Conditional Split, the set of output columns on each outgoing edge is equal to the set of input columns. This is because a Conditional Split transformation is not able to add, delete, or otherwise change columns, but is only able to filter rows. Thus, each column depends on itself.

A Data Conversion casts the value of a column into a new column with a new data type. Formally, the only specific property of the Data Conversion vertex is defined as conversions = {(input1, output1), . . . , (inputn, outputn)} where inputi ∈ C is the input column of the vertex and outputi ∈ C is a new column that has been created through the Data Conversion from inputi. The specific conversions of data types are present in the type of outputi. A Data Conversion creates a new column rather than replacing the existing column, and thus both inputi and outputi are output columns. dependencies is simple for Data Conversion, as a dependency is made for each new outputi such that outputi ↦ {inputi} for all tuples in conversions, and since all input columns are sent to the next vertex, each depends on itself.

A Derived Column creates new columns based on input columns and expressions applied to those inputs. Formally, the only property specific to the Derived Column vertex is defined as derivations = {(expr1, output1), . . . , (exprn, outputn)} where expri is an expression used for computing the new values of outputi, and outputi ∈ C is a newly created column. dependencies for Derived Column is similar to Data Conversion in that each input column is dependent on itself. For all output columns in derivations we define the dependency outputi ↦ {ci,1, . . . , ci,j} where {ci,1, . . . , ci,j} is the set of all columns used in expri.

A Lookup extracts additional data from a database by equi-joining with given input columns. Formally, the properties of a Lookup vertex are defined as follows. database is the name of the database to look up additional columns from. table is the name of the table used. columns ⊆ C represents all columns in the table. outputcolumns ⊆ columns is the columns extracted from the table. joins = {(input1, lookup1), . . . , (inputn, lookupn)} represents the equi-join of the Lookup, where inputi is a column from the preceding vertex used for the join condition and lookupi is a column from columns. The equi-join is used to extract the columns in outputcolumns. Each input column has a dependency to itself. Each new column in outputcolumns is dependent on all the input columns used in the join conditions of the Lookup. In other words, for each output column output ∈ outputcolumns, we have that output ↦ {input1, . . . , inputn} where inputi is the input column of the ith tuple in joins.

A Sort sorts the input rows in either ascending or descending order and creates new output columns for the sorted rows. It is possible to sort on multiple columns, in which case a priority has to be given. Formally, the properties of a Sort vertex are defined as follows. sortings = {(input1, output1, sorttype1, order1), . . . , (inputn, outputn, sorttypen, ordern)} where inputi, outputi ∈ C, sorttypei ∈ {ascending, descending}, and orderi ∈ N+ indicates in which order the columns are sorted (where 1 is the highest priority). The set of orders {order1, . . . , ordern} is a gap-free series of numbers from 1 to n. passthrough = {c1, . . . , cj} is the set of columns passed through the transformation. This does not include columns that are used for sorting. For dependencies, every output column is dependent on itself.

A Union All combines rows from multiple vertices. Formally, the properties of a Union All vertex are defined as follows. inputedges = (e1, . . . , ej) where ei is the ith incoming edge of the Union All and j is the number of incoming edges. unions = {(output1, input1), . . . , (outputn, inputn)} where inputi = (c1, . . . , cj). Each element in unions shows how multiple input columns are combined into a single output column. ci ∈ C ∪ {ε} where ε is a convention used to indicate that, for a given union, no input is taken from the corresponding edge. For example, the tuple (outputi, inputi) where inputi = (c1, ε, c3) indicates that the unioned outputi uses c1 from e1 and c3 from e3, but no column is used from e2. outputi is the column generated by unioning the columns in the ith input tuple. A column from a table can only be part of a single union. For the ith tuple of unions, outputi ↦ {c1, . . . , cj} such that the new column, which is a result of the union, is dependent on all columns that were used for the union. This dependency is derived for all tuples in unions.

Store                    Sale                      Person
ID: int (PK)             ID: int (PK)              ID: int (PK)
Address: nvarchar(60)    CustomerID: int (FK)      Name: nvarchar(30)
                         StoreID: int (FK)         Age: int
                         TotalAmount: money

Figure 2: Example of an EDS schema.

Figure 3: An ETL process. (The Data Flow task: OLE DB Source, Lookup, Derived Column, and Conditional Split, which routes to Aggregate and OLE DB Destination; Aggregate feeds OLE DB Destination 1.)

Example 1. Figure 2 shows the schema of an EDS called SourceDB. Now consider the SSIS Data Flow task shown in Figure 3 where
– OLE DB Source extracts the columns ID, Name, and Age from the Person table.
– Lookup extracts TotalAmount from the Sale table by joining Sale.CustomerID with Person.ID.
– Derived Column derives the new column AmountTimes10 from the derivation (TotalAmount * 10, AmountTimes10).
– Conditional Split splits the rows into two directions based on {(Age > 40, 1, Aggregate), (TotalAmount > 10000, 2, OLE DB Destination)}.
– Aggregate computes into AvgAmount the average of AmountTimes10 when grouping by Age.
– OLE DB Destination loads ID, Age, and AmountTimes10 into the DW table PersonSalesData.
– OLE DB Destination 1 loads AvgAmount and Age into the DW table SalesData.

Figure 4: A MAIME graph for the Data Flow task in Figure 3.

Figure 4 shows the corresponding MAIME graph where there are seven vertices, i.e., one for each transformation (the types are not shown since they are obvious from the names of the vertices). As an example of a dependency, we have AmountTimes10 ↦ {TotalAmount} in Derived Column. This comes from the derivation TotalAmount * 10. It shows that any change to TotalAmount can affect AmountTimes10, and if TotalAmount is deleted, AmountTimes10 can no longer be derived and therefore also needs to be deleted.

4. GRAPH ALTERATION
In this section, we describe how the MAIME graph is updated when changes happen at the EDSs. First, we describe MAIME's user-configurable settings. Then we describe the algorithm that updates the graph (which in turn updates the underlying SSIS Data Flow task).

4.1 Administrator Configurations
This section describes how an administrator can configure which modifications MAIME is allowed to perform in case of EDS schema changes. As an example, an administrator can define that only renames in the EDS should be propagated to the SSIS Data Flow tasks automatically, and that MAIME should block (i.e., not propagate) any other changes in the EDS, which then have to be handled manually. If MAIME is given free rein, it can, however, make any number of modifications to ensure that ETL processes execute successfully. In the following, we describe the configurations that an administrator can specify before executing MAIME. There are two types of configurations: (1) EDS schema change configurations and (2) advanced configurations. To properly understand the configurations, we first explain the different kinds of EDS schema changes. The schema changes that we consider are the following: Addition, which represents a new column in a schema; Deletion, which represents a deleted column in a schema; Rename, which represents a renamed column in a schema; and Data Type Change, which represents a column where the data type has changed (incl. properties such as length).

For an EDS schema change ch ∈ {Addition, Deletion, Rename, Data Type Change} and vertex type t, a policy p(t, ch) ∈ {Propagate, Block, Prompt} can be specified by the administrator. The Propagate policy defines that when p(t, ch) = Propagate, reparation of vertices of type t is allowed for EDS schema changes of type ch. The Block policy says that when p(t, ch) = Block, then for every vertex v where v.type = t, the alteration algorithm (explained in Section 4.2) is not allowed to make modifications to v or to any successor vsucc of v, even if p(vsucc.type, ch) = Propagate. The Prompt policy defines that the choice of whether to block or propagate the change is deferred to runtime, where the user is prompted to make the choice. Figure 5 shows how the administrator can set a policy for a type of EDS schema change, or be more specific and choose a policy for each type of vertex for a given EDS change. For example, she can configure that for all Conditional Split vertices, deletion of columns should be propagated, but all other EDS changes should be blocked.

Figure 5: Configuration of MAIME. (The dialog lists a policy per EDS schema change type, e.g., Propagate for Addition and Rename and Block for Deletion and Data Type Change, optionally refined per vertex type, next to the advanced configurations Allow deletion of transformations, Allow modification of expressions, and Use global blocking semantics.)

As can be seen in Figure 5, advanced configurations can also be specified. MAIME is based on the idea that ETL processes should be repaired such that all of them can execute afterwards. We now explain some more difficult cases of maintaining ETL processes and which configurations the administrator can use for such cases. Considering the case of a deletion of a column, we need to take into account that both columns and vertices can be dependent on the deleted column. Columns such as a derived column are dependent on the columns they were derived from. Vertices are dependent on columns if the columns are part of that vertex's input, expression, or condition, or are used in some other way by the vertex.

The Use global blocking semantics option dictates whether the alteration algorithm should terminate if the policy for any vertex is Block. If Use global blocking semantics is enabled and p(t, ch) = Block, then an ETL process containing a vertex of type t will not be considered for reparation whenever the EDS schema change is of type ch.

The Allow deletion of transformations option allows deletion of vertices in a graph (and thus transformations in an SSIS Data Flow task). With this option enabled, MAIME can modify graphs to such an extent that large portions of a graph are removed, but the corresponding updated ETL process can execute. Consider, for example, a Sort vertex that uses column a for sorting. If a is deleted from the EDS, this would render the vertex invalid. With the Allow deletion of transformations option enabled, the alteration algorithm attempts to repair the process by deleting the Sort vertex and any successive vertices, thereby making the process executable. The algorithm described in Section 4.2 further explains this principle.

The Allow modification of expressions option allows the alteration algorithm to modify expressions in vertices (such as Conditional Split and Derived Column) in the event of deletions or data type changes of columns. Consider, for example, a Conditional Split with input columns a and b and one condition with the expression a < 20 && b < 40, and assume that an EDS schema change of b being deleted occurs. If Allow modification of expressions is enabled, the alteration algorithm attempts to update the expression into a < 20. This would make the process run successfully without removing the whole expression, but the semantics is different. If Allow modification of expressions is disabled, the alteration algorithm would instead remove the expression. Modification of expressions is always allowed if the EDS schema change is the renaming of a column. This is because it is very simple to preserve the original semantics by replacing the old name of the renamed column with the new name in the expression.

4.2 The Graph Alteration Algorithm
This section details how our graph model adapts to EDS schema changes. As described above, the administrator can choose between the Propagate, Block, and Prompt policies. Only a propagation (which can also occur through a prompt) alters the graph. This section goes into depth on how to propagate an EDS schema change through the graph. How propagation is handled for a given type of EDS schema change is specific to each type of transformation. For space reasons, details about the different actions are not provided here, but are available elsewhere [3]. Table 2 shows for which transformations EDS schema changes are currently supported. However, just applying actions to each vertex independently is not enough, as a change in one vertex can also affect successive vertices. The Alter-graph algorithm takes in a single EDS schema change ch and our property graph G. To be able to handle multiple EDS schema changes, the Alter-graph algorithm is called for each EDS schema change. For simplicity, we only present the case where Allow deletion of transformations is enabled and neither Prompt nor data type change is used.

Algorithm 1: Algorithm for propagating EDS schema changes to the graph
Name: Alter-graph
Input: EDS-Change ch, Graph G
Output: Graph G
 1  List L = topological-sort(G)
 2  if Use global blocking semantics is enabled then
 3      foreach Vertex v ∈ L do
 4          Policy p = lookup-policy(v.type, ch.type)
 5          if p = Block then
 6              return G
 7  foreach Vertex v ∈ L (in topological order) do
 8      Policy p = lookup-policy(v.type, ch.type)
 9      if p = Block then
10          Remove all successors of v from L
11          continue
12      Update v's dependencies to not include deleted columns
13      foreach Column c ∈ v's outgoing edges do
14          if v.dependencies(c) = ∅ and v.type ≠ OLE DB Source then
15              if v is fully dependent on c then
16                  Delete v and v's incoming and outgoing edges from G
17                  Break inner loop
18              else
19                  Delete c from v's outgoing edges
20      if v.type = OLE DB Destination and v has no incoming edge then
21          Delete v from G
22      if v was not deleted then
23          Invoke the propagation action for ch.type on v
24  return G

Table 2: Currently supported EDS changes.

                     Deletion   Rename
OLE DB Source           ✓          ✓
OLE DB Destination      ✓          ✓
Aggregate               ✓          ✓
Conditional Split       ✓          ✓
Data Conversion         ✗          ✗
Derived Column          ✓          ✓
Lookup                  ✓          ✓
Sort                    ✗          ✗
Union All               ✗          ✗

The order in which we traverse the graph matters, as visiting a vertex also affects successive vertices. For this, topological sorting is used on Line 1 such that when a vertex is visited, it is guaranteed that all its predecessors have been visited beforehand. The list of topologically sorted vertices is referred to as L. On Lines 2–6, the algorithm handles the case where Use global blocking semantics is enabled. This entails checking if there exists a vertex in L with the Block policy defined for the EDS change type ch. In case such a vertex exists, the algorithm returns an unchanged graph. On Line 7, we start traversing each vertex v of L. On Lines 8–11, we check if the policy for v given ch is Block. If this is the case, we can disregard this branch of the graph for the traversal, which is why we remove all successors of v from the sorted list of vertices. Line 12 updates the dependencies of a vertex. This is done by going through each dependency in v.dependencies for a given vertex v and checking that all of the involved columns are still present in the incoming edges. The reason for this is that some columns might have been removed from v's incoming edges when traversing the preceding vertices, such that v.dependencies refers to columns that no longer exist in v's input.

As stated before, it is not sufficient to just look at vertices independently when going through the output columns of each vertex. Lines 13–19 show the case for deleting columns with no dependencies. As an example, consider a vertex v whose preceding transformation is a Derived Column with the input column c, which is used to derive a new column d through the expression d = c + 42. Now, v will receive c and d as input columns, but if the EDS schema change (ch) is a deletion of c, then not only is c no longer available to v, but d can no longer be derived either and must also be removed. If v is fully dependent on c, v is deleted together with its incoming and outgoing edges (Lines 15–16), and the inner loop is broken (Line 17) to avoid iterating over the output columns of a vertex that has been deleted. The other case of only deleting the given column is shown in Lines 18–19. Since a vertex of type OLE DB Destination does not have any outgoing edges, it is not considered in the previous loop on Lines 13–19. However, we still want to delete the vertex if it is invalid, i.e., if it has no incoming edge. This is performed on Lines 20–21. Afterwards, on Lines 22–23, if v was not deleted at an earlier point in the algorithm, the corresponding propagation action for the EDS schema change and transformation type is invoked. Finally, the fully altered graph is returned. Recall that the underlying SSIS Data Flow tasks get updated automatically when the graph is updated.
18 else 19 Delete c from v’s corresponding 5. IMPLEMENTATION outgoing edges and dependencies This section describes the implementation of the proto- type. MAIME was developed in C# for Microsoft SQL 20 if v.type is OLE DB Destination AND v has no Server 2014 Developer Edition and with the SSIS Designer incoming edges then as the data integration tool. The SSIS Designer is part of 21 Delete v the SQL Server Data Tools (v. 14.0.60203.0). The code- 22 if v was not deleted AND ch affects v then base for the implementation includes around 7,700 lines of 23 G = alter(G, v, ch) code, 85 classes, and 410 members. The implementation is open source and is available from https://github.com/ 24 return G sajens/MAIME. The advanced configurations, Allow deletion of transfor- mations and Use global blocking semantics are implemented. anymore, d will also be deleted, as it is no longer possible to We did not implement Allow modification of expressions derive it from the Derived Column without c. The way to yet. We focused our attention primarily on Allow dele- find out that d is no longer computable is by seeing that d is tion of transformations and Use global blocking semantics, dependent on ∅ (Line 14). This signifies that d was depen- since they seemed to be the most impactful configurations dent on column(s) which have been deleted. The exception for MAIME. Currently, Allow deletion of transformations is is an OLE DB Source, which is the only vertex for which always enabled. The Propagate and Block policies are imple- dependencies maps columns to ∅, as explained in Section mented as described previously. Table 2 shows which trans- 3 (OLE DB Source is the only transformation allowing no formations have been implemented with respect to deletion incoming edges). Beginning on Line 15, it is considered if and renaming of EDS columns in the current prototype. 
it is possible to delete only the given column d or if it is We now consider how the graph is represented internally necessary to delete the entire vertex v. This narrows down in MAIME. Our graph is implemented as a layer on top of to whether v is fully dependent on d or not. We say that a SSIS. Therefore, our graph constructs have references to the vertex v is fully dependent on some column c, when v would corresponding SSIS constructs. To be more exact, each ver- be rendered invalid if c was deleted. For instance, if v is tex refers to a SSIS component (transformation), each edge an Aggregation and c is the only remaining column that is refers to a SSIS path, and each column has a corresponding being used for aggregations, then v is fully dependent on c. reference to a so-called SSIS IDTS object. The SSIS IDTS The deletion of c would result in v.aggregations = ∅, which object covers both input and output columns in SSIS. By is not a valid transformation. What qualifies a vertex as having these references, we can easily propagate changes to being fully dependent on a column is specific for each type the underlying ETL process during execution of the graph of transformation, for details see [3]. Lines 16–17 show the alteration algorithm. For the graph, we have implemented case of deleting a whole vertex and all of its incoming and the classes Graph, Vertex, Edge, Column, and a class ex- outgoing edges, if they are no longer used. This iteration of tending Vertex for each supported transformation such that the loop breaks because it would not make sense to continue specific dependencies and properties can be represented. In order to instantiate the graph, we have to translate the starts when the user begins to use MAIME and ends when components from SSIS1 . Initially, a SSIS package is loaded the SSIS package has been repaired. MAIME is also used to by parsing its .dtsx file conforming to XML specification. 
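To make the internal representation concrete, the following Python sketch is an illustration only: the actual implementation is in C#, and apart from the names Graph, Vertex, and the dependencies mapping taken from the paper, all identifiers and simplifications are ours. It models each vertex with a per-output-column dependency set and applies a stripped-down, deletion-only version of Algorithm 1 (per-vertex Block handling and edge objects are omitted):

```python
from collections import defaultdict, deque

class Vertex:
    """A transformation; dependencies maps each output column to the set of
    input columns it is computed from (mapped to the empty set only for an
    OLE DB Source)."""
    def __init__(self, name, vtype, dependencies):
        self.name, self.type = name, vtype
        self.dependencies = {out: set(ins) for out, ins in dependencies.items()}

class Graph:
    def __init__(self):
        self.vertices, self.succ = [], defaultdict(list)

    def add_edge(self, u, v):
        self.succ[u].append(v)

    def topological_sort(self):
        # Kahn's algorithm: every predecessor comes before its successors.
        indeg = {v: 0 for v in self.vertices}
        for u in self.vertices:
            for w in self.succ[u]:
                indeg[w] += 1
        queue = deque(v for v in self.vertices if indeg[v] == 0)
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in self.succ[u]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
        return order

def alter_graph(graph, deleted_column, policies):
    """Deletion-only sketch of Algorithm 1. policies maps a transformation
    type to 'Propagate' or 'Block'; with global blocking semantics, a single
    Block policy leaves the graph unchanged."""
    if any(policies.get(v.type) == "Block" for v in graph.vertices):
        return graph                               # Lines 2-6
    removed = {deleted_column}
    for v in graph.topological_sort():             # Line 7
        if v.type == "OLE DB Source":              # alter action (Lines 22-23):
            v.dependencies.pop(deleted_column, None)  # drop the EDS column
            continue
        for ins in v.dependencies.values():        # Line 12: prune deleted
            ins -= removed                         # upstream columns
        for out in list(v.dependencies):           # Lines 13-19: outputs that
            if not v.dependencies[out]:            # can no longer be computed
                del v.dependencies[out]
                removed.add(out)                   # cascade downstream
        if not v.dependencies:                     # fully dependent vertex,
            graph.vertices.remove(v)               # cf. Lines 15-17
    return graph
```

Run on a source-to-aggregate chain, deleting a column removes derivations that depend on it while keeping vertices that still have valid outputs, and deletes vertices left with none, roughly mirroring the behavior the evaluation section describes for test case 1.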
In order to instantiate the graph, we have to translate the components from SSIS (see msdn.microsoft.com/en-us/library/ms403344.aspx). Initially, a SSIS package is loaded by parsing its .dtsx file, which conforms to an XML specification. Then, we extract its SSIS Data Flow tasks. Note that every Data Flow task is stored as a graph in a .dtsx file. Construction of a MAIME graph is done in two iterations. During the first iteration, we extract one Data Flow task and go through its components to create corresponding vertices with references to the Data Flow task components. With the set of vertices V and references to the SSIS components, we can extract information for each vertex such as v.name and v.type. Afterwards, every SSIS path is translated into an edge. This gives us the set of edges E of our graph. It is important to note that columns are not stored in the edges in this iteration, since they do not exist on SSIS paths. After the first iteration is complete, we have created the basic structure of the graph of MAIME. In the second iteration, we deduce dependencies, assign columns to edges, and create transformation-specific properties.

6. EVALUATION
We now evaluate the efficiency of MAIME. In the evaluation, we compare how much time and how many user inputs are required to resolve a series of EDS schema changes when using MAIME and when repairing the ETL flow manually. A user input is defined to be a mouse click or a keystroke done by the user (both are recorded separately). We discussed and validated this approach with practitioners in the field of ETL development.

We consider three SSIS packages, each with a single Data Flow task. For each considered SSIS package, both MAIME and the manual work end up having the same end result, i.e., the same state of the maintained SSIS package. Only the process of achieving that end result differs. The EDS schema changes used for the evaluation include deletion and renaming.

For the manual approach, the user performing the evaluation (one of the authors) is given the EDS schema changes and two versions of the ETL process: one in its initial state, and another in the desired result state. The user then performs maintenance of each package three times using Visual Studio 2015 Community edition and SQL Server Data Tools. The first evaluation gives an indication of how long it takes to maintain an ETL process without knowing every step needed to maintain it. The second and third attempts are performed in order to provide a best-case scenario, since the user learns the quickest way to maintain the ETL process through repetition. The time of the evaluation starts when the user begins repairing the ETL process after having seen the given information (i.e., EDS schema changes and the two versions of the ETL process). During the test, the duration and amount of user inputs used are recorded using a software tool.

When considering MAIME, the user interacts with MAIME to (1) accept the configurations, currently loaded ETL processes, and connected EDSs, and (2) start the maintenance process. The EDS schema change configurations of MAIME for all evaluations are all set to Propagate. For the advanced configurations, Allow deletion of transformations is enabled, Allow modification of expressions is disabled, and Use global blocking semantics is disabled. The time of the evaluation starts when the user begins to use MAIME and ends when the SSIS package has been repaired. MAIME is also used to repair the package three times, and the duration and amount of user inputs are again recorded using a software tool.

We acknowledge that the user knows MAIME very well and thus can easily use it. That is why the user is also given the desired resulting SSIS package for the manual approach, such that he does not have to think about what needs to be changed. This is thus an overly optimistic way to measure the needed time for the manual approach (the amounts of needed clicks and keystrokes are not affected, though).

We now elaborate on one of our test cases and a given list of EDS schema changes. In test case 1, we consider the scenario from Example 1 and assume the following changes to the EDS: (1) Age is renamed to RenamedAge in the Person table and (2) TotalAmount is deleted from the Sale table. The state of the graph after it has been maintained is shown in Figure 6 in the SSIS designer tool, and with MAIME's graph in Figure 7. The latter shows that Age is successfully renamed on all edges. The deletion of TotalAmount is slightly more complicated since the condition TotalAmount > 10000 in the Conditional Split involves the column. This condition is no longer valid, which results in the removal of the outgoing edge containing it. We therefore delete OLE DB Destination, as it no longer has any incoming edges. Derived Column derives the column AmountTimes10 from TotalAmount, so AmountTimes10 can no longer be derived. We delete this derivation but still retain our Derived Column transformation, since it can exist without doing any derivations. Another possibility would be to let MAIME delete the Derived Column and connect the Lookup and Conditional Split transformations. The Aggregate transformation takes the average of AmountTimes10, which no longer exists, and this aggregation is therefore also deleted. OLE DB Destination 1 thus no longer loads the aggregated AvgAmount from the Aggregate transformation into the DW. TotalAmount, AmountTimes10, and AvgAmount are all deleted from the edges.

Figure 6: The Data Flow task of the SSIS package for test case 1 after the EDS schema changes.

Test cases 2 and 3 are not shown here for space reasons, but are only briefly described. In test case 2, TotalAmount is deleted, and both Address and Person.ID are renamed. In test case 3, Name is deleted and Age is renamed.
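The EDS schema change lists driving these test cases can be thought of as the output of a simple schema diff. The following Python sketch is purely illustrative (MAIME's actual change detection against SQL Server metadata is not described at this level of detail; the snapshots below merely restate the test-case-1 changes, with column ids as in Figure 7). Using a stable column identifier tells a rename apart from a delete-plus-add:

```python
def diff_schema(old, new):
    """Compare two snapshots of {column_id: column_name} and emit EDS schema
    changes. A rename keeps the column id but changes the name; a deletion
    drops the id entirely; an addition introduces a new id."""
    changes = []
    for col_id, name in old.items():
        if col_id not in new:
            changes.append(("Deletion", name))
        elif new[col_id] != name:
            changes.append(("Rename", name, new[col_id]))
    for col_id, name in new.items():
        if col_id not in old:
            changes.append(("Addition", name))
    return changes

# Test case 1: Age is renamed to RenamedAge in Person,
# and TotalAmount is deleted from Sale.
person_old = {1: "ID", 2: "Name", 3: "Age"}
person_new = {1: "ID", 2: "Name", 3: "RenamedAge"}
sale_old = {1: "ID", 2: "CustomerID", 3: "StoreID", 4: "TotalAmount"}
sale_new = {1: "ID", 2: "CustomerID", 3: "StoreID"}

print(diff_schema(person_old, person_new))  # [('Rename', 'Age', 'RenamedAge')]
print(diff_schema(sale_old, sale_new))      # [('Deletion', 'TotalAmount')]
```

Each emitted change would then be handed to Alter-graph in turn, as Section 4.2 describes for handling multiple EDS schema changes.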
For each test case, the used time is shown in Table 3 and the amounts of user inputs are shown in Table 4. For time consumption, MAIME took on average 4 seconds across all 3 test cases, while manually resolving the changes took 39.3 seconds on average across all 3 cases for the third attempt. This means MAIME was on average 9.8 times faster at resolving EDS schema changes. For user input, MAIME required on average 4 user inputs, while manually resolving the changes required 38 user inputs on average for the third attempt across all 3 cases. This means MAIME on average required 9.5 times fewer inputs for resolving EDS schema changes.

Table 3: Time needed for handling EDS schema changes for the three test cases.

  Time Elapsed   MAIME                 Manual
  Test Case 1    4 sec, 4 sec, 4 sec   187 sec, 159 sec, 59 sec
  Test Case 2    4 sec, 4 sec, 4 sec   154 sec, 60 sec, 49 sec
  Test Case 3    4 sec, 4 sec, 4 sec   23 sec, 13 sec, 10 sec

Table 4: User inputs needed for handling EDS schema changes for the three test cases.

  User Inputs    MAIME                   Manual
  Test Case 1    Keystrokes: 0, 0, 0     Keystrokes: 23, 15, 12
                 Mouse clicks: 4, 4, 4   Mouse clicks: 88, 85, 38
  Test Case 2    Keystrokes: 0, 0, 0     Keystrokes: 9, 9, 8
                 Mouse clicks: 4, 4, 4   Mouse clicks: 92, 48, 45
  Test Case 3    Keystrokes: 0, 0, 0     Keystrokes: 0, 0, 0
                 Mouse clicks: 4, 4, 4   Mouse clicks: 16, 11, 11

[Figure 7 shows the MAIME graph for test case 1 after maintenance: the OLE DB Source, Lookup, Derived Column, Conditional Split, Aggregate, and OLE DB Destination vertices with their columns and dependencies. Age has been renamed to RenamedAge on all edges, and TotalAmount, AmountTimes10, and AvgAmount have been removed.]

Figure 7: The MAIME graph for the Data Flow task in Figure 6.

For the manual work, it took significantly more time the first time the user had to maintain the ETL process. The user did, however, get progressively quicker at maintaining the ETL processes in the second and third attempts, with the best result being from the third attempt. The numbers of both mouse clicks and keystrokes showed similar results. For these evaluations, MAIME is thus able to significantly reduce the amount of time and user input needed for maintaining ETL processes.

In a real-life scenario, a company would have to maintain a multitude of ETL processes, each of which could take on average 39.3 seconds to repair, as our evaluation showed. However, the timings shown for MAIME in Table 3 (4 seconds) include the time to first load the ETL processes, detect EDS schema changes, and the time the user spent on clicking buttons in the GUI. If there were many ETL processes, it would still be enough to do this once. The reparation algorithm of MAIME uses less than a second in all the cases, and we thus argue that the evaluation setup actually gives the manual process an advantage.

7. RELATED WORK
Related work exists on the topic of maintaining ETL processes when the schema of EDSs changes. The framework Hecataeus by Papastefanatos et al. [7, 8, 9] analyzes ETL processes by detecting changes to the schema of the EDSs (called evolution events) and proposes changes to the ETL processes based on defined policies. The Hecataeus framework abstracts ETL processes as SQL queries and views which are used in a graph to represent the activities. An activity resembles an ETL transformation. Hecataeus's graph is captured by an ETL Summary which can have multiple subgraphs for each activity. Each activity can further be broken down, as it includes nodes or entire subgraphs for relations, SQL queries, conditions, and views, e.g., for input and output. The types of evolution events taken into account are addition, deletion, and modification. An administrator can annotate nodes and edges with policies for each type of evolution event, while in MAIME policies are provided for each EDS change type and not specifically for each node or edge. The approach taken by MAIME would be preferable if the administrator had to repair a lot of ETL processes. The three Hecataeus policies dictate how ETL processes should be adjusted when an evolution event occurs: (1) Propagate readjusts the graph to reflect the new semantics according to the evolution event throughout the rest of the graph, (2) Block retains the old semantics, and (3) Prompt asks the administrator to choose Propagate or Block. Propagate and Prompt are similar in MAIME, whereas Hecataeus's Block attempts to retain the semantics. An extension of Hecataeus has been made for what-if analysis of the evolutions of relations, views, or queries in a data-intensive ecosystem [6]. While MAIME models columns on edges, Hecataeus models input and output schemata as subgraphs.

Another framework is E-ETL [14, 15] by Wojciechowski, which is also able to semi-automatically adapt ETL processes to EDS schema changes. In order to adapt the ETL processes, there are different methods which define how reparations are propagated. E-ETL has the same three policies as Hecataeus, with the exception of the Block policy: in E-ETL, the Block policy ignores the EDS schema change and does not attempt to modify the graph. The E-ETL framework has multiple ways to handle an EDS schema change. These include Defined rules, in which the administrator can define rules himself for nodes and edges; Standard rules, which are the default rules used if the administrator did not define anything; and Alternative scenarios, where case-based reasoning is used to adjust the ETL process based on solutions to similar problems experienced previously. How these methods are used together is not specified. Like Hecataeus, E-ETL models ETL processes through SQL queries.

Related work also exists on how to conceptually model ETL processes using transformations that are commonly used in ETL processes. Trujillo et al. [13] do this by defining a small set of transformations modeled through UML diagrams. Examples of the mechanisms include Aggregation, Conversion, and Join. This is in contrast to other frameworks, such as Hecataeus and E-ETL, where ETL processes are described by SQL queries and views, and represented with graphs. Using transformations rather than SQL queries can make conceptual modeling simpler and more maintainable, as each transformation has a clear responsibility and provides a higher level of abstraction compared to SQL queries, but at the cost of expressiveness. Trujillo et al. provide a set of general ETL processes that are not specific to any platform, while our transformations are specific to SSIS. Furthermore, instead of using UML diagrams, we use property graphs to more closely relate to the internal graphs of SSIS. In [11], conceptual models are specified with a kind of diagram which is then mapped to logical models where graphs are also used.

Business Intelligence Markup Language (BIML) [1] is an XML dialect that can be used to generate SSIS packages. It can be used to create templates and reusable code snippets to make the creation of SSIS packages easier and less time-consuming. BIML does, however, not handle changes, but can rather re-generate packages when a change occurs.

8. CONCLUSION AND FUTURE WORK
ETL processes can be complex and time-consuming to repair manually. We presented MAIME as a tool to reduce the amount of errors and time spent on maintaining ETL processes. To accomplish this task, we introduced and implemented a graph model as a layer on top of SSIS Data Flow tasks, which simplifies the handling of EDS schema changes.

As we have seen in the evaluation, MAIME required 9.5 times fewer inputs from the user and was 9.8 times faster compared to doing it manually with the SSIS Designer tool. MAIME can thus ease the burden of maintaining ETL processes.

There are a number of possible directions for future work. The paper presents an adaptive way of handling source data changes in ETL programs. One direction for future work is to make MAIME more robust against the variance of source types and change types. The adaptivity very often depends on manual decisions made by the ETL administrator. Thus, an interesting extension of this work is to model or formalize the pattern of the manual decisions and predict the next decisions such that the system execution process involves fewer manual interruptions. Since the system starts to make automatic changes to ETL programs, it is very natural that the ETL administrators consider "rolling back" the execution of certain groups of ETL programs to the previous state. How to define the transaction process in such an ETL program and how to ensure the possibility of the roll-back action is another direction to explore.

We have so far disregarded propagating changes to the DW, e.g., adding a column to the relevant DW schema. However, an administrator might want to do proper adjustments to the DW in order to capture the correct semantics of an EDS schema change. For example, when a column is inserted in an EDS, it might be reasonable to insert a corresponding column into the DW. This is another direction to investigate. It could also be interesting to investigate how to maintain ETL programs when the target DW is changed.

Another direction for future work is to extend the current prototype. Currently, we extract only one SSIS Data Flow task from the current SSIS package. An improvement would be to extract all SSIS Data Flow tasks from each SSIS package. Furthermore, support for more SSIS transformations could be implemented, as well as support for SSIS Control Flow tasks. It would also be interesting to have support for more kinds of data sources, such as XML (where a schema can be explicitly given) and CSV (where a schema typically has to be inferred). We note that a few local software vendors and IT service providers have been in the process of adopting and extending MAIME into their product and service offerings.

9. ADDITIONAL AUTHORS
Harry Xuegang Huang (Danmarks Nationalbank, email: xh@nationalbanken.dk) and Christian Thomsen (Dept. of Comp. Science, Aalborg University, email: chr@cs.aau.dk).

10. REFERENCES
[1] Business Intelligence Markup Language. https://www.varigence.com/biml (Last accessed 2016-10-31).
[2] SSIS Designer. https://msdn.microsoft.com/en-us/library/ms137973(v=sql.120).aspx (Last accessed 2016-10-31).
[3] D. Butkevičius et al. MAIME - Maintenance Manager for ETL. Student project d802f16, Dept. of Comp. Science, Aalborg University, August 2016. https://github.com/sajens/MAIME/blob/master/Technical-Report.pdf.
[4] R. Kimball and M. Ross. The Data Warehouse Toolkit. John Wiley & Sons, 2011.
[5] B. Knight et al. Professional Microsoft SQL Server 2012 Integration Services. John Wiley & Sons, 2012.
[6] P. Manousis, P. Vassiliadis, and G. Papastefanatos. Impact Analysis and Policy-Conforming Rewriting of Evolving Data-Intensive Ecosystems. Journal on Data Semantics, 4(4), 2015.
[7] G. Papastefanatos et al. Rule-Based Management of Schema Changes at ETL Sources. ADBIS, 2010.
[8] G. Papastefanatos et al. What-If Analysis for Data Warehouse Evolution. DaWaK, 2007.
[9] G. Papastefanatos et al. Policy-Regulated Management of ETL Evolution. Journal on Data Semantics XIII, 2009.
[10] M. A. Rodriguez and P. Neubauer. Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology, 2010.
[11] A. Simitsis. Mapping Conceptual to Logical Models for ETL Processes. DOLAP, 2005.
[12] E. Thoo and M. A. Beyer. Gartner's Magic Quadrant for Data Integration Tools. 2014.
[13] J. Trujillo and S. Luján-Mora. A UML Based Approach for Modeling ETL Processes in Data Warehouses. ER, 2003.
[14] A. Wojciechowski. E-ETL: Framework for Managing Evolving ETL Workflows. Foundations of Computing and Decision Sciences, 38(2), 2013.
[15] A. Wojciechowski. E-ETL Framework: ETL Process Reparation Algorithms Using Case-Based Reasoning. ADBIS Short Papers and Workshops, 2015.