Enhancing ER Diagrams to View Data Transformations Computed with Queries

Carlos Ordonez (University of Houston, USA)
Ladjel Bellatreche (LIAS/ISAE-ENSMA, France)

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

ABSTRACT
Transforming relational tables to build a data set takes most of the time in a machine learning (ML) project centered around a relational database. The explanation is simple: a relational database has a collection of tables that are joined and aggregated with complex relational queries, and whose columns are transformed with complex SQL expressions, in order to build the required data set. In general, such data is wide, gathering many ML variables together. Such complicated data pre-processing results in a large set of SQL queries that are developed independently from each other for different ML models. The database grows with important tables and views that are absent in the original ER diagram. More importantly, similar SQL queries tend to be written multiple times, creating problems in database evolution, disk space utilization and software maintenance. In this paper, we go in the opposite direction, from a physical level (tables) to a logical level (entities) representation, providing a unifying diagram of both levels. Specifically, we propose minimal, but powerful, extensions to an ER diagram in UML notation to represent the most common database transformations. Our “transformation” ER diagram helps analytic users understand complex transformations, consolidate columns representing analytic variables into fewer tables (i.e. eliminating redundant tables), reuse existing SQL queries (i.e. avoiding forking new queries) and explain data provenance (where data originated from).
1 INTRODUCTION
The entity-relationship (ER) model provides methods (step by step) and diagram notation to design a database, defining its structure before storing data values. On the other hand, the relational database model provides a precise mathematical definition to store and query data in the form of tables (relations) linked by foreign keys (FKs), whose basic structure is sketched in an ER model. We are concerned with exploiting a DBMS to analyze data sets with exploratory analysis, machine learning and statistics [2]. We are interested in modeling all potential database transformations via relational queries to build a specific data set that can be analyzed by machine learning (ML) algorithms. In general, such a data set has a tabular form, where every row corresponds to an observation, instance or point and every column is associated with a variable or feature. The main side effect is that relational queries produce many new tables, which do not appear as entities in the existing ER diagram. Such a collection of transformed tables and disconnected queries complicates database management, software development, and computing further ML models.
Based on the motivation discussed above, we introduce minor, but powerful, changes to ER diagrams to represent data transformations computed by queries, used to build data sets for ML models. Our main contributions are the following: (1) labeling ER entities to distinguish source and transformation entities; (2) numbering entities to understand a sequence of transformations; (3) showing SPJA queries as an optional zoom-in exploratory feature; (4) an algorithm to automate extending and maintaining the transformation ER diagram. We believe our ER diagram has potential usefulness. Existing tuned queries can be reused and extended. Redundant copies of data can be eliminated by reusing existing queries. Provenance of attributes can be tracked. Nevertheless, we do not intend to represent every query in a single ER diagram, since that would create a huge diagram. Instead, we propose to represent each set of tuned queries to build each data set separately. Each set of queries is similar to a data mart, with the main difference being that the table structure is gradually refined by queries, not by database design.
2 DEFINITIONS

2.1 ER Diagram for a Relational Database
In modern database design tools, an ER diagram is refined and consolidated into a collection of “structured” entities represented in a simplified UML diagram, closer to a physical model. In the resulting relational database schema, entities and relationships are mapped to tables linked by referential integrity constraints. As is standard in most modern ER tools, each entity corresponds to a table and each relationship is represented by foreign keys (FKs).

We assume 1:1 relationships are consolidated into one entity because they share the same primary key (PK). Moreover, we assume M:N (many to many) relationships get mapped to a “link” entity, which connects 2+ entities, merging their respective FKs into a new PK. Therefore, a refined database ER diagram has only 1:N (1 to many) and N:1 (many to 1) relationships. In short, we assume there exists an ER diagram ready to be converted and deployed as physical relational tables, with a 1-1 correspondence between entities and tables. Such a restricted ER diagram allows representing tables returned by queries as entities.

A database is defined as D(T, I), where T = {T1, . . . , Tn} is a set of n tables and I is a set of integrity constraints (entity for primary key (PK), referential for foreign key (FK)). We use the term “table” to make an explicit connection to the physical level and SQL. Each table has a set of columns, coming from different domains. The entity constraint asserts that every table should have a primary key. Referential integrity is denoted by Ti(K) → Tj(K), where K is the primary key (PK) of Tj and K is a foreign key (FK) in Ti. The common key attribute K is assumed to have the same name in both tables.
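As a minimal sketch (hypothetical tables Ti and Tj, not part of the running example), both kinds of constraints in I map directly to SQL DDL:

   -- Tj(K, B) and Ti(I, K, A), with referential integrity Ti(K) -> Tj(K).
   CREATE TABLE Tj (
     K INT PRIMARY KEY,       -- entity constraint: PK of Tj
     B VARCHAR(30)
   );
   CREATE TABLE Ti (
     I INT PRIMARY KEY,       -- entity constraint: PK of Ti
     K INT,                   -- FK, same attribute name as in Tj
     A FLOAT,
     FOREIGN KEY (K) REFERENCES Tj (K)   -- referential integrity constraint
   );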
2.2 Relational Queries
Database transformations are assumed to be computed only with SQL queries, mixing joins (⋈), aggregations (extended π operator) and selection (σ). Relational queries combine aggregate functions (built-in or UDFs), scalar functions (built-in or UDFs), mathematical operators, string/date operators and the powerful SQL CASE statement. However, in order to have a precise mathematical foundation, we study data transformations with relational queries, which combine π, σ and ⋈ operators, where π is used
as a GROUP BY aggregation and derivation operator. Notice that the SQL PIVOT and UNPIVOT operators available in some DBMSs are great to manipulate matrices, but they are not relational. Fortunately, these operators can be represented with tables with extra columns storing subscripts generated by the DBMS. We assume well formed queries are used to create data sets, which include a primary key to identify rows and at least one non-key attribute (variable, feature).
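For instance (a sketch with a hypothetical matrix table M, using a common workaround rather than a notation from this paper), a matrix can be stored relationally with subscript columns i and j, and a pivoted layout can still be derived with plain GROUP BY and CASE, avoiding the non-relational PIVOT operator:

   -- M(i, j, v) stores entry v at row subscript i and column subscript j.
   SELECT i,
          SUM(CASE WHEN j = 1 THEN v END) AS v1,   -- matrix column 1
          SUM(CASE WHEN j = 2 THEN v END) AS v2    -- matrix column 2
   FROM M
   GROUP BY i;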
We consider left outer join as a prominent operator needed to merge partial tables sharing the same PK to build the data set for analysis. In query terms, referential integrity between two tables Ti, Tj means πK(Ti) ⊆ πK(Tj).
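A sketch of this usage, assuming two hypothetical partial tables X1(K, A1) and X2(K, A2) sharing the primary key K:

   -- The left outer join preserves every row of X1; rows without a
   -- match in X2 get NULL for A2, so no observation is lost.
   SELECT X1.K, X1.A1, X2.A2
   FROM X1
   LEFT OUTER JOIN X2 ON X1.K = X2.K;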

2.3 Running Example
Our example is based on the benchmark TPC-H database, shown in Figure 1. Consider the following representative ML problem: predicting whether a product will be returned or not, based on historic sales information. A long database transformation process is needed to build the data set to be used as input in a regression or classification model. Our objective is to extend the existing ER diagram to capture this data transformation process.
                                                                         called transformation entities (computed with SPJA queries). We
[Figure 1: Input database ER diagram for the example, showing the TPC-H tables PART (PK P_PARTKEY), SUPPLIER (PK S_SUPPKEY), ORDERS (PK O_ORDERKEY) and LINEITEM (PK L_ORDERKEY, L_LINENUMBER; FKs L_PARTKEY, L_SUPPKEY, L_ORDERKEY).]
3 DATA TRANSFORMATIONS

3.1 Data Set
The main objective is to create a single table that will be the input for an ML algorithm. This table will be the output of a sequence of SPJA queries filtering, merging, transforming and aggregating data.

The data set is represented by an entity with two sets of attributes X(K, A), where K is the primary key (generally simple, but it could be composite) and A is a set of p non-key attributes (numerical & categorical variables). From a database ER modeling standpoint, the data set corresponds to a weak entity that depends on some strong entity [1], defining an "is-a" relationship (i.e. the data set is a sub-type of some strong entity). In a machine learning context, X will be the input for a mathematical model like linear regression, PCA dimensionality reduction, Bayesian classification, decision trees or clustering.
The first consideration is K, which can be simple or composite. The most common case is that K is simple, with a 1-1 correspondence to some existing source entity (e.g. customer id). In such a case, we will use i instead of K to make reference to the ith point or the ith instance. On the other hand, if K is composite then it takes primary keys from several entities (e.g. customer id, product id), similar to a data cube with a composite key, or it is the result of adding a key attribute that was not originally part of any entity (e.g. month id). In the latter case, there is likely already a “dimension” table where the new key attribute is available (e.g. a cube dimension table). Evidently, given k source tables (entities), there should be k − 1 foreign keys linking them in order to enable k − 1 joins. In other words, the primary key of the data set comes from putting together the primary keys of a subset of the k source tables.
We now discuss A, the set of non-key attributes. Assume T = T1 ⋈ T2 ⋈ · · · ⋈ Tk gathers all raw attributes. The goal is to further transform T into X. We emphasize that T is not in 3NF. Moreover, if T has a composite key it will not be in 2NF either. Each attribute in A can come from a table, an aggregation or a scalar expression (arithmetic, logical, string). Therefore, a transformed attribute must either be the result of an aggregation summarizing several rows into one, or of computing some derived expression (denormalization). Below we explore data transformations in more depth.
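As a sketch on the tables of Figure 1 (the non-key TPC-H attributes below, such as o_OrderDate, are assumptions not shown in the figure), k = 4 source tables are gathered with k − 1 = 3 joins:

   -- T = LineItem ⋈ Orders ⋈ Part ⋈ Supplier gathers raw attributes;
   -- o_OrderDate depends only on part of the composite key
   -- (l_OrderKey, l_LineNumber), so T is not even in 2NF.
   SELECT L.l_OrderKey, L.l_LineNumber,
          O.o_OrderDate, P.p_RetailPrice, S.s_AcctBal
   FROM LINEITEM L
   JOIN ORDERS O ON L.l_OrderKey = O.o_OrderKey
   JOIN PART P ON L.l_PartKey = P.p_PartKey
   JOIN SUPPLIER S ON L.l_SuppKey = S.s_SuppKey;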
3.2 Representing Data Transformations
We explain database transformations at two levels: entity (table) and attribute (column), in order to define an ER diagram at two levels of abstraction.

Entity Level. We extend an existing ER diagram with new entities, providing a diagram representation of data transformations. We call existing entities source entities, and added entities are called transformation entities (computed with SPJA queries). We emphasize that a single SQL query can create several temporary tables during its evaluation, with one table per relational operator. In order to define a faithful representation, each nested SQL query will correspond to one temporary table, which in turn will be represented by one relational algebra query and one transformation entity. By generalization, the data transformation process will create a sequence of transformation tables that will be successively joined and aggregated. The data set X will be the final output. Such a database transformation process with nested queries (based on views or temporary tables) will result in a tree (similar to a parse tree for a query), where internal nodes are joins or projections (aggregations), leaves are tables and the root is one final query returning the data set. In the extended ER diagram, the new entities are linked with existing entities or previous transformation entities.
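For example (a sketch; the names T1 and T2 are ours, and CREATE TABLE ... AS syntax varies by DBMS), a query nesting an aggregation inside a join is decomposed into one temporary table per nested query, each becoming a transformation entity:

   -- Nested form: one SQL statement, two relational operators.
   SELECT L.l_OrderKey, L.l_LineNumber, C.SuppCount
   FROM LINEITEM L
   JOIN (SELECT l_SuppKey, COUNT(*) AS SuppCount
         FROM LINEITEM
         GROUP BY l_SuppKey) C ON L.l_SuppKey = C.l_SuppKey;

   -- Decomposed form: T1 is an aggregation entity and T2 a
   -- denormalization entity; the ER diagram links T2 to T1 and LINEITEM.
   CREATE TABLE T1 AS
     SELECT l_SuppKey, COUNT(*) AS SuppCount
     FROM LINEITEM
     GROUP BY l_SuppKey;
   CREATE TABLE T2 AS
     SELECT L.l_OrderKey, L.l_LineNumber, T1.SuppCount
     FROM LINEITEM L JOIN T1 ON L.l_SuppKey = T1.l_SuppKey;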
The π operator eliminates duplicates, if any. Even though SQL has bag semantics, allowing duplicate rows, such semantics do not make sense from a machine learning perspective. The reason is simple: each represented object must be identifiable. Assume a well formed query T = T1 ⋈ T2 ⋈ · · · ⋈ Tk on appropriate foreign keys. We propose the following rules to enable query composition. If we project columns from T, they must either include the primary key of T (being the primary key of some table Ti) or they must include a GROUP BY key to compute aggregations. When computing data transformations with SQL queries, the output table cannot be used in future queries if it does not include the primary key of T and it does not have aggregations to derive non-key attributes. That is, π must include both GROUP BY attributes and a list of aggregation functions.
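The sketches below contrast the two admissible projections with one that breaks composition:

   -- Not well formed: neither a primary key nor a GROUP BY, so the
   -- output rows are not identifiable and the table cannot be reused.
   SELECT l_Discount FROM LINEITEM;

   -- Well formed: the projection keeps the primary key of T.
   SELECT l_OrderKey, l_LineNumber, l_Discount FROM LINEITEM;

   -- Well formed: a GROUP BY key plus a list of aggregation functions.
   SELECT l_SuppKey, SUM(l_ExtendedPrice) AS SuppTotal
   FROM LINEITEM
   GROUP BY l_SuppKey;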
Attribute Level. There exist two mutually exclusive data transformation classes: (1) denormalization (via join), which gathers attributes from several entities into a single transformation entity, or simply derives new attributes from existing attributes; (2) aggregation (via projection with aggregation), which creates a new aggregation value attribute, grouping rows by a primary key and computing some summarization. An aggregation cannot be considered denormalization because, in general, we cannot reason about functional dependencies beyond 1NF between the grouping column(s) and the aggregated column.

Denormalization: When denormalization brings attributes from other entities, the attribute values themselves are not transformed. That is, the table structure changes, but the column values remain the same. When attributes come directly, without transforming values, from other entities, they can have three roles: being part of the primary key, being part of a foreign key, or being non-key. Attributes being part of the PK or a FK can later be used to group rows to compute aggregations, which is our second class of data transformations. In general, non-key attributes interact together using all SQL operators, programming constructs and functions (scalar, aggregate, UDFs). Arithmetic expressions and string expressions fall into this category. The only non-key denormalization data transformation that deserves special attention is the SQL CASE statement. There are two important issues introduced by the CASE statement: (1) it may create new values, not present in any previous column; (2) it may introduce nulls which were not present before. Therefore, an SQL query with CASE statements cannot be evaluated with an equivalent SPJA query.
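Both issues appear in the sketch below, a variant of the Discount_Category expression of Figure 2 where the ELSE branch is omitted on purpose:

   -- (1) 'low' and 'high' are new values, absent from every source column;
   -- (2) rows with l_Discount = 0.1 match no branch and yield NULL.
   SELECT l_OrderKey, l_LineNumber,
          CASE WHEN l_Discount < 0.1 THEN 'low'
               WHEN l_Discount > 0.1 THEN 'high'
          END AS Discount_Category
   FROM LINEITEM;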
Aggregation: Aggregations are expressed with the π operator having grouping attribute(s) and a list of functions (e.g. sum(), count()). In SQL, the GROUP BY clause partitions output attributes into key attributes and aggregated attributes. In general, aggregations return numbers, but notice that "max()" in SQL aggregations can have strings or dates as arguments. Global aggregations without a GROUP BY clause (e.g. a count() of rows, a total sum()) can be represented with a π operator with an artificial column with a constant value. Representing such an aggregation in the ER diagram is problematic because there is no given primary key; this is a research issue in database modeling. In the meantime, we propose to "pin" such aggregations to the entity name as a small bag of aggregations (e.g. counts, sums, statistics).
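A sketch of this representation: an artificial constant column turns a global aggregation into an ordinary π with a grouping key:

   -- Without a GROUP BY the result has no given primary key; the
   -- constant column grp acts as an artificial grouping key.
   SELECT grp, COUNT(*) AS n, SUM(l_ExtendedPrice) AS total
   FROM (SELECT 1 AS grp, l_ExtendedPrice FROM LINEITEM) T
   GROUP BY grp;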
3.3 Extending ER Diagram
We propose extensions to ER diagram notation to represent database transformations to build one data set, which can be exploited by multiple ML models. We clarify that there will be multiple ER diagrams, corresponding to different data sets (i.e. having different PKs), and that we aim to represent only polished, tuned queries. That is, we do not propose to build a single ER diagram representing all temporary tables, since that would create a huge ER diagram, impossible to interpret. We emphasize that an ER diagram works at a conceptual/logical level, whereas SQL works at a physical level. The ER diagram helps designing a coherent database structure, whereas SQL queries help processing the database. Therefore, our proposed notation brings both paradigms closer, enriching the ER diagram with the ability to represent transformations on database attributes, but mixing database modeling with database processing.

The first consideration is notation for queries. SQL has bag semantics and SQL queries tend to be verbose. On the other hand, given its precise mathematical definition and conciseness, relational algebra is the best choice. The second aspect is defining our diagram notation, a thorny and subjective aspect. After exploring several alternatives, keeping in mind the large number of temporary tables and columns created by SQL queries, we propose two ER diagrams: (1) a high level “abstract” ER diagram displaying only entities and (2) a low level (fine granularity) “query” ER diagram displaying all attributes, the data transformations and the corresponding query. The high level diagram can be used to explore the connection among tables and understand the overall flow of transformations. On the other hand, the low level diagram provides a specific idea of how transformations make up a table. The low level view can be selectively displayed in a “zoom in” view on each entity.

ER diagram notation in the literature has many variants: the classical, but highly intuitive, notation with small boxes for entities and external ellipses for attributes [1] is inadequate and cumbersome, as it does not scale to a large number of entities and much less to a large number of attributes. That is why we defend the idea of using UML notation, as most modern ER tools do. A transformation entity represents a weak entity in a strict sense, since each object's existence depends on the source entity. The classical ER notation uses dotted lines; since we intend to use UML notation and to extend it further, we prefer to show transformation entities with solid lines. As introduced above, we classify entities as source or transformation, where transformation entities are either GROUP BY aggregations or denormalizations. Therefore, each entity will be labeled as “source:”, “denormalization:”, or “aggregation:”. A complicated aspect is where to place attribute transformations. This is not an easy choice, since a typical modern ER diagram may have hundreds of entities and thousands of attributes, exploding further with the analytic database transformations. Given the column-oriented programming approach in SQL and the fact that modern ER diagrams display one attribute per line, we decided to show denormalization expressions and aggregations to the right of the attribute name. Putting all these elements together, our proposal is to show database transformations as an additional column inside an ER entity box. Such a column can substitute the data type definitions in a typical physical model (convertible to SQL Data Definition Language). A major requirement is understanding where columns come from in a given transformation entity. This information must necessarily come from the ⋈ operator in a query. We believe the best place to display a query is next to the entity name, simply showing which tables participate in ⋈. Attribute provenance will be shown with single dot or double dot notation (e.g. T.A or T..B).

Finally, we must consider attributes (table columns). A major requirement is to track provenance. We propose the following notation: a single dot indicates the source table if such table participates in a join in the corresponding query (e.g. T.A); otherwise, a double dot indicates the column originates from a previous temporary table (e.g. T..A). Based on the fact that SPJA queries cannot express CASE computations, it is necessary to define a special notation to represent CASE statements. Basically, there are two elements: a predicate combining comparisons with and/or/not, and the value(s) to be assigned. Therefore, we can think of the CASE statement as a functional form which returns a value based on the predicate being true or false. Given its similarity to the C++ "?:" operator, that is the notation we use. For the example mentioned above, it will be shown as "(A >= 0) ? 'positive':'negative'", which is shorter than the CASE syntax and intuitive to C++ or Java programmers.

The actual SQL queries, including each temporary table, are selectively displayed on each transformation entity with a "zoom in" low level view. That is, we avoid showing all SQL queries in the ER diagram.

Extended ER Diagram Properties. From a theory perspective, since we use relational algebra to represent database transformations, our extended ER diagram is guaranteed to be: (1) complete and (2) consistent. From a database design perspective, our model
can improve database evolution: it can track attribute provenance, it can help reuse existing tables and it motivates reusing transformation queries.

[Figure 2: Transformation ER diagram for the TPC-H database (low level). The diagram shows the source entity LineItem; the denormalization entities Discount_Category, LinePrice_Category and LineItem_Category, with CASE expressions in "?:" notation such as (LineItem.l_Discount < 0.1) ? 1:2; the aggregation entities Supp_Count and Supp_R_Count, both π l_SuppKey, COUNT(*) with GROUP BY l_SuppKey, the latter filtered by σ l_ReturnFlag = 'R'; the denormalization entity Supplier (Supp_Count ⋈ Supp_R_Count); and the final entity Dataset:Customer (LineItem_Category ⋈ Supplier), whose attributes carry provenance annotations such as Supplier.SuppCount and LineItem..l_ReturnFlag.]
                                                                                                                             years. Our program, written in C#, using 12 SQL queries as in-
3.4 Extended ER Diagram Example
We now illustrate our extended ER diagram notation and the algorithm with the sample database introduced in Section 2. Figure 2 presents the low level view with entities, data transformations at the attribute level and the query.
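For instance, the final query behind the Dataset:Customer entity of Figure 2 can be reconstructed as the sketch below (our reading of the diagram; the denormalization entity Supplier is renamed Supplier_Counts to avoid clashing with the source table):

   -- Dataset:Customer = LineItem_Category ⋈ Supplier in Figure 2.
   SELECT LC.l_OrderKey, LC.l_LineNumber, LC.l_SuppKey, LC.l_PartKey,
          SC.SuppCount, SC.SuppRCount,
          LC.LinePrice_Category, LC.Discount_Category, LC.l_ReturnFlag
   FROM LineItem_Category LC
   LEFT OUTER JOIN Supplier_Counts SC ON LC.l_SuppKey = SC.s_SuppKey;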
                                                                                                                             regression. Using a standard CSV format, these entities can be vi-
3.5 Algorithm to Extend an ER Diagram
Given an ER diagram (generally corresponding to a tuned OLTP or OLAP database) and an existing set of polished SQL queries that create data set X, the following steps help build an extended ER diagram.
                                                                                                                             study, which we will conduct in the future.
(1) Initialize the extended ER diagram with the original ER diagram, labeling each entity as a “source” entity (S).
(2) Create a transformation (T) entity for every intermediate table; consider nested queries and views as additional temporary tables. Label each intermediate table as T<99>, where 99 is an increasing integer, resulting from an incremental computation.
(3) Label each attribute as key or non-key.
(4) Associate each non-key attribute with either a derived expression or an aggregation. Indicate provenance (lineage) of attributes coming from the denormalization process. For aggregations, use the same function name provided by SQL in the relational algebra expression.
(5) Add a final main data set entity joining all intermediate tables; this data set entity will be highlighted in the ER diagram and labeled “data set”.
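A compact trace of these steps on the queries of Figure 2 (the labels are ours), written as SQL comments:

   -- Step (1): LINEITEM, ORDERS, PART, SUPPLIER               -> source (S)
   -- Step (2): T<1> = Supp_Count, T<2> = Supp_R_Count,
   --           T<3> = LineItem_Category, T<4> = Supplier      -> transformation (T)
   -- Step (3): l_SuppKey -> key; SuppCount, l_ReturnFlag, ... -> non-key
   -- Step (4): SuppCount         <- COUNT(*) GROUP BY l_SuppKey (aggregation)
   --           Discount_Category <- (l_Discount < 0.1) ? 1:2    (derived expression)
   -- Step (5): Dataset:Customer = LineItem_Category ⋈ Supplier, labeled "data set"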
                                                                                                                             attributes in the data set are built later, during the iterative data
4 CASE STUDY
Due to lack of space we do not present experiments. Instead, we summarize a case study with a real database with bike sales information, which comes as a test database for the Microsoft SQL Server DBMS (AdventureWorks). We manually wrote SQL queries to build a data set to predict whether a customer will buy any product in the next 6 months, given past sales history. This database consists of approximately 70 tables with bike store sales over 4 years. Our program, written in C#, using 12 SQL queries as input, produced a sequence of relational schemas with appropriate primary and foreign keys, which were labeled as transformation entities (with our algorithm) in a couple of seconds (since this involves only schema data). The data set was used as input in several ML predictive models, including Naive Bayes and logistic regression. Using a standard CSV format, these entities can be visualized in automated ER diagram drawing tools (e.g. SmartDraw, Lucidchart).

Measuring disk space savings, reduction in development time, time to automatically produce an ER diagram and ease of maintaining the extended ER diagram requires a long-term case study, which we will conduct in the future.

5 RELATED WORK
Research on ER modeling can be classified into models for transactional databases and models for analytic databases (data warehouses). Models for transactional databases ensure the database can be maintained in a valid state (complete and consistent), whereas models for database analysis enable multidimensional analysis on cubes and basic machine learning. Since we are concerned with analyzing the database rather than updating it, our extended ER model is more closely related to models for data warehouses. However, there are fundamental differences between both kinds of models. Cube dimensions and measures are identified early in the database design to denormalize tables, whereas attributes in the data set are built later, during the iterative data mining process. Generally speaking, data transformations for ML analysis are more complex, since they also involve mathematical functions and complex logic in the CASE statement. On the other hand, provenance identifies the sources of information in a data warehouse. We have adapted provenance to trace the
source tables an attribute in the data set comes from. A closely related work that studied the problem of transforming tables from a modeling perspective is [5], where the authors study the transformation of a database schema into another equivalent one in order to facilitate information exchange with another database; [5] compares alternative schemas, but does not consider queries combining aggregations, denormalization and math functions on attributes to transform tables like we do. The closest research we found that studies how to represent a data set to compute a machine learning model is [7], where the authors exploit UML notation to extend an existing data warehouse model to represent attributes used in classification, a prominent statistical learning problem; we emphasize that [7] focuses more on the final data set, rather than studying the data transformations to build it. The idea to represent data transformations in the ER model was proposed in [3], where simple data transformations are classified and SQL queries are shown in the ER diagram. Our paper takes a step further by focusing on the ER diagram (coining the term “transformation” ER diagram), formally studying the problem with relational notation instead of SQL, considering a broader class of queries (e.g. representing σ and views), and introducing an algorithm to automate ER diagram construction.

The closest approach in the literature is reverse data engineering, where the main purpose is to produce an ER diagram using a database schema as input. In contrast, we assume queries use tables already stored in the database. Moreover, we consider attribute-level data transformations beyond denormalization with joins. ML models generally require many tables to be joined, aggregated and transformed to derive variables (features). In [4] there is a proposal to analyze inclusion and functional dependencies in denormalized tables built with relational queries. However, the goal is to transform tables into a 3NF database model representable in ER form, rather than providing an abstract representation of data transformation queries.
Our work shares similarities with research on ETL (Extract-Transform-Load) [6], where records are transformed and cleaned with traditional programming languages (including UDFs) before being loaded; this includes conceptual modeling for ETL [?, ?] and ELT, where data records are transformed with queries after being loaded. Important similarities include representing data processing with a diagram, capturing a comprehensive set of data transformations and meeting user requirements. On the other hand, important differences include the following. We assume referential integrity problems have been solved. The input to our diagram is a set of queries, which work strictly on tables, not on files or documents. We draw a clear boundary between data transformations based on joins (only derivation) and on aggregation (some aggregation function). Instead of using a flow diagram to represent data processing, we propose to represent such processing with “processing” entity boxes. We believe our “zoom-in” view of transformation entities with relational queries is novel and meaningful.


6 CONCLUSIONS
We introduced minimal ER diagram extensions to represent data transformations in abstract form, bridging the gap between a logical database model and a physical database. We focused on studying database transformations to build a data set for machine learning analysis. We assume such a data set is wide, gathering many ML variables (features) in one place. Our extended ER model has two kinds of entities: source entities and transformation entities, which correspond to standard tables coming from an ER diagram and temporary “analytic” tables created with SQL queries, respectively. Data transformations are further classified into two main categories: denormalization and aggregations. Due to the large number of entities and attributes involved, we use modern UML diagram notation, where all entities are boxes and entities are linked by 1:1 and 1:N relationships. We extended ER diagram entity boxes with a new box where transformations are expressed with extended relational algebra notation (SPJA queries). We introduced an algorithm to extend an existing ER model, using data transformation queries as input, where the data set is the final database goal. We emphasize that an extended ER diagram is automatically created for each ML project, to remain relevant and useful. That is, different data sets, having different primary keys, have separate extended ER diagrams.

Our proposed ER diagram extended with data transformations offers many opportunities for future research. It is necessary to study data transformations deeper, considering coding, pivoting and cubes. We want to further explore UML diagram constructs to enrich ER diagrams to represent database transformations. We want to determine when it is feasible to design transformation entities before writing SQL queries, to close the loop. From a normalization perspective, it is necessary to study aggregations, choosing the best table schema to minimize storage of redundant information. The SQL CASE statement is powerful, but it introduces many issues from a theoretical and database modeling perspective. We plan to explore provenance aspects in depth. We intend to develop metrics to quantify savings in software development effort (e.g. lines of source code, number of functions or classes) and database maintenance (number of tables that can be reused, number of attributes that can be used by multiple ML models, or number of tables that can serve multiple analytic goals).

REFERENCES
[1] Z. Akkaoui, J.N. Mazón, A.A. Vaisman, and E. Zimányi. BPMN-based conceptual modeling of ETL processes. In Proc. DaWaK Conference, pages 1–14, 2012.
[2] H. Garcia-Molina, J.D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd edition, 2008.
[3] B. Oliveira and O. Belo. Using REO on ETL conceptual modelling: a first approach. In Proc. ACM DOLAP, pages 55–60, 2013.
[4] C. Ordonez. Can we analyze big data inside a DBMS? In Proc. ACM DOLAP Workshop, 2013.
[5] C. Ordonez, S. Maabout, D.S. Matusevich, and W. Cabrera. Extending ER models to capture database transformations to build data sets for data mining. Data & Knowledge Engineering, 2013.
[6] J.M. Petit, F. Toumani, J.F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proc. ICDE Conference, pages 218–227, 1996.
[7] A. Poulovassilis and P. McBrien. A general formal framework for schema transformation. Data & Knowledge Engineering, 28(1):47–71, 1998.
[8] P. Vassiliadis, A. Simitsis, and E. Baikousi. A taxonomy of ETL activities. In Proc. ACM DOLAP, pages 25–32, 2009.
[9] J. Zubcoff and J. Trujillo. Conceptual modeling for classification mining in data warehouses. In Proc. DaWaK Conference, LNCS, pages 566–575. Springer, 2006.