Enhancing ER Diagrams to View Data Transformations Computed with Queries

Carlos Ordonez (University of Houston, USA)
Ladjel Bellatreche (LIAS/ISAE-ENSMA, France)

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

ABSTRACT
Transforming relational tables to build a data set takes most of the time in a machine learning (ML) project centered around a relational database. The explanation is simple: a relational database has a collection of tables that are joined and aggregated with complex relational queries, and whose columns are transformed with complex SQL expressions, in order to build the required data set. In general, such data is wide, gathering many ML variables together. Such complicated data pre-processing results in a large set of SQL queries that are developed independently from each other for different ML models. The database grows with important tables and views that are absent in the original ER diagram. More importantly, similar SQL queries tend to be written multiple times, creating problems in database evolution, disk space utilization and software maintenance. In this paper, we go in the opposite direction, from a physical level (tables) to a logical level (entities) representation, providing a unifying diagram of both levels. Specifically, we propose minimal, but powerful, extensions to an ER diagram in UML notation to represent the most common database transformations. Our “transformation” ER diagram helps analytic users understand complex transformations, consolidate columns representing analytic variables into fewer tables (i.e. eliminating redundant tables), reuse existing SQL queries (i.e. avoiding forking new queries) and explain data provenance (where data originated from).
1 INTRODUCTION
The entity-relationship (ER) model provides methods (step by step) and diagram notation to design a database, defining its structure before storing data values. On the other hand, the relational database model provides a precise mathematical definition to store and query data in the form of tables (relations) linked by foreign keys (FKs), whose basic structure is sketched in an ER model. We are concerned with exploiting a DBMS to analyze data sets with exploratory analysis, machine learning and statistics [2]. We are interested in modeling all potential database transformations via relational queries to build a specific data set that can be analyzed by machine learning (ML) algorithms. In general, such a data set has a tabular form, where every row corresponds to an observation, instance or point and every column is associated with a variable or feature. The main side effect is that relational queries produce many new tables, which do not appear as entities in the existing ER diagram. Such a collection of transformed tables and disconnected queries complicates database management, software development, and computing further ML models.
Based on the motivation discussed above, we introduce minor, but powerful, changes to ER diagrams to represent data transformations computed by queries, used to build data sets for ML models. Our main contributions are the following: (1) labeling ER entities to distinguish source and transformation entities; (2) numbering entities to understand a sequence of transformations; (3) showing SPJA queries as an optional zoom-in exploratory feature; (4) an algorithm to automate extending and maintaining the transformation ER diagram. We believe our ER diagram has potential usefulness. Existing tuned queries can be reused and extended. Redundant copies of data can be eliminated by reusing existing queries. Provenance of attributes can be tracked. Nevertheless, we do not intend to represent every query in a single ER diagram, since that would create a huge diagram. Instead, we propose to represent each set of tuned queries to build each data set separately. Each set of queries is similar to a data mart, with the main difference being that the table structure is gradually refined by queries, not by database design.
2 DEFINITIONS

2.1 ER Diagram for a Relational Database
In modern database design tools, an ER diagram is refined and consolidated into a collection of “structured” entities represented in a simplified UML diagram, closer to a physical model. In the resulting relational database schema, entities and relationships are mapped to tables linked by referential integrity constraints. As is standard in most modern ER tools, each entity corresponds to a table and each relationship is represented by foreign keys (FKs).

We assume 1:1 relationships are consolidated into one entity because they share the same primary key (PK). Moreover, we assume M:N (many to many) relationships get mapped to a “link” entity, which connects 2+ entities, merging their respective FKs into a new PK. Therefore, a refined database ER diagram has only 1:N (1 to many) and N:1 (many to 1) relationships. In short, we assume there exists an ER diagram ready to be converted and deployed as physical relational tables, with a 1-1 correspondence between entities and tables. Such a restricted ER diagram allows representing tables returned by queries as entities.

A database is defined as D(T, I), where T = {T1, . . . , Tn} is a set of n tables and I is a set of integrity constraints (entity for primary key (PK), referential for foreign key (FK)). We use the term “table” to make an explicit connection to the physical level and SQL. Each table has a set of columns, coming from different domains. The entity constraint asserts that every table should have a primary key. Referential integrity is denoted by Ti(K) → Tj(K), where K is the primary key (PK) of Tj and K is a foreign key (FK) in Ti. The common key attribute K is assumed to have the same name in both tables.
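As a minimal sketch (hypothetical tables Ti and Tj, not part of the running example), both kinds of constraints in I map directly to SQL DDL:

   -- Tj(K, B) and Ti(I, K, A), with referential integrity Ti(K) -> Tj(K).
   CREATE TABLE Tj (
     K INT PRIMARY KEY,       -- entity constraint: PK of Tj
     B VARCHAR(30)
   );
   CREATE TABLE Ti (
     I INT PRIMARY KEY,       -- entity constraint: PK of Ti
     K INT,                   -- FK, same attribute name as in Tj
     A FLOAT,
     FOREIGN KEY (K) REFERENCES Tj (K)   -- referential integrity constraint
   );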
2.2 Relational Queries
Database transformations are assumed to be computed only with SQL queries, mixing joins (⋈), aggregations (extended π operator) and selection (σ). Relational queries combine aggregate functions (built-in or UDFs), scalar functions (built-in or UDFs), mathematical operators, string/date operators and the powerful SQL CASE statement. However, in order to have a precise mathematical foundation, we study data transformations with relational queries, which combine π, σ and ⋈ operators, where π is used
as a GROUP BY aggregation and derivation operator. Notice that the SQL PIVOT and UNPIVOT operators available in some DBMSs are great to manipulate matrices, but they are not relational. Fortunately, these operators can be represented with tables with extra columns storing subscripts generated by the DBMS. We assume well formed queries are used to create data sets, which include a primary key to identify rows and at least one non-key attribute (variable, feature).
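For instance (a sketch with a hypothetical matrix table M, using a common workaround rather than a notation from this paper), a matrix can be stored relationally with subscript columns i and j, and a pivoted layout can still be derived with plain GROUP BY and CASE, avoiding the non-relational PIVOT operator:

   -- M(i, j, v) stores entry v at row subscript i and column subscript j.
   SELECT i,
          SUM(CASE WHEN j = 1 THEN v END) AS v1,   -- matrix column 1
          SUM(CASE WHEN j = 2 THEN v END) AS v2    -- matrix column 2
   FROM M
   GROUP BY i;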
We consider left outer join as a prominent operator needed to merge partial tables sharing the same PK to build the data set for analysis. In query terms, referential integrity between two tables Ti, Tj means πK(Ti) ⊆ πK(Tj).
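A sketch of this usage, assuming two hypothetical partial tables X1(K, A1) and X2(K, A2) sharing the primary key K:

   -- The left outer join preserves every row of X1; rows without a
   -- match in X2 get NULL for A2, so no observation is lost.
   SELECT X1.K, X1.A1, X2.A2
   FROM X1
   LEFT OUTER JOIN X2 ON X1.K = X2.K;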

2.3 Running Example
Our example is based on the benchmark TPC-H database, shown in Figure 1. Consider the following representative ML problem: predicting whether a product will be returned or not, based on historic sales information. A long database transformation process is needed to build the data set to be used as input in a regression or classification model. Our objective is to extend the existing ER diagram to capture this data transformation process.
                                                                         called transformation entities (computed with SPJA queries). We
[Figure 1: Input database ER diagram for the example, showing the TPC-H tables PART (PK P_PARTKEY), SUPPLIER (PK S_SUPPKEY), ORDERS (PK O_ORDERKEY) and LINEITEM (PK L_ORDERKEY, L_LINENUMBER; FKs L_PARTKEY, L_SUPPKEY, L_ORDERKEY).]
3 DATA TRANSFORMATIONS

3.1 Data Set
The main objective is to create a single table that will be the input for an ML algorithm. This table will be the output of a sequence of SPJA queries filtering, merging, transforming and aggregating data.

The data set is represented by an entity with two sets of attributes X(K, A), where K is the primary key (generally simple, but it could be composite) and A is a set of p non-key attributes (numerical & categorical variables). From a database ER modeling standpoint, the data set corresponds to a weak entity that depends on some strong entity [1], defining an "is-a" relationship (i.e. the data set is a sub-type of some strong entity). In a machine learning context, X will be the input for a mathematical model like linear regression, PCA dimensionality reduction, Bayesian classification, decision trees or clustering.
The first consideration is K, which can be simple or composite. The most common case is that K is simple, with a 1-1 correspondence to some existing source entity (e.g. customer id). In such a case, we will use i instead of K to make reference to the ith point or the ith instance. On the other hand, if K is composite then it takes primary keys from several entities (e.g. customer id, product id), similar to a data cube with a composite key, or it is the result of adding a key attribute that was not originally part of any entity (e.g. month id). In the latter case, there is likely already a “dimension” table where the new key attribute is available (e.g. a cube dimension table). Evidently, given k source tables (entities), there should be k − 1 foreign keys linking them in order to enable k − 1 joins. In other words, the primary key of the data set comes from putting together the primary keys of a subset of the k source tables.
We now discuss A, the set of non-key attributes. Assume T = T1 ⋈ T2 ⋈ · · · ⋈ Tk gathers all raw attributes. The goal is to further transform T into X. We emphasize that T is not in 3NF. Moreover, if T has a composite key it will not be in 2NF either. Each attribute in A can come from a table, an aggregation or a scalar expression (arithmetic, logical, string). Therefore, a transformed attribute must either be the result of an aggregation summarizing several rows into one, or of computing some derived expression (denormalization). Below we explore data transformations in more depth.
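As a sketch on the tables of Figure 1 (the non-key TPC-H attributes below, such as o_OrderDate, are assumptions not shown in the figure), k = 4 source tables are gathered with k − 1 = 3 joins:

   -- T = LineItem ⋈ Orders ⋈ Part ⋈ Supplier gathers raw attributes;
   -- o_OrderDate depends only on part of the composite key
   -- (l_OrderKey, l_LineNumber), so T is not even in 2NF.
   SELECT L.l_OrderKey, L.l_LineNumber,
          O.o_OrderDate, P.p_RetailPrice, S.s_AcctBal
   FROM LINEITEM L
   JOIN ORDERS O ON L.l_OrderKey = O.o_OrderKey
   JOIN PART P ON L.l_PartKey = P.p_PartKey
   JOIN SUPPLIER S ON L.l_SuppKey = S.s_SuppKey;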
3.2 Representing Data Transformations
We explain database transformations at two levels: entity (table) and attribute (column), in order to define an ER diagram at two levels of abstraction.

Entity Level. We extend an existing ER diagram with new entities, providing a diagram representation of data transformations. We call existing entities source entities, and added entities are called transformation entities (computed with SPJA queries). We emphasize that a single SQL query can create several temporary tables during its evaluation, with one table per relational operator. In order to define a faithful representation, each nested SQL query will correspond to one temporary table, which in turn will be represented by one relational algebra query and one transformation entity. By generalization, the data transformation process will create a sequence of transformation tables that will be successively joined and aggregated. The data set X will be the final output. Such a database transformation process with nested queries (based on views or temporary tables) will result in a tree (similar to a parse tree for a query), where internal nodes are joins or projections (aggregations), leaves are tables and the root is one final query returning the data set. In the extended ER diagram, the new entities are linked with existing entities or previous transformation entities.
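For example (a sketch; the names T1 and T2 are ours, and CREATE TABLE ... AS syntax varies by DBMS), a query nesting an aggregation inside a join is decomposed into one temporary table per nested query, each becoming a transformation entity:

   -- Nested form: one SQL statement, two relational operators.
   SELECT L.l_OrderKey, L.l_LineNumber, C.SuppCount
   FROM LINEITEM L
   JOIN (SELECT l_SuppKey, COUNT(*) AS SuppCount
         FROM LINEITEM
         GROUP BY l_SuppKey) C ON L.l_SuppKey = C.l_SuppKey;

   -- Decomposed form: T1 is an aggregation entity and T2 a
   -- denormalization entity; the ER diagram links T2 to T1 and LINEITEM.
   CREATE TABLE T1 AS
     SELECT l_SuppKey, COUNT(*) AS SuppCount
     FROM LINEITEM
     GROUP BY l_SuppKey;
   CREATE TABLE T2 AS
     SELECT L.l_OrderKey, L.l_LineNumber, T1.SuppCount
     FROM LINEITEM L JOIN T1 ON L.l_SuppKey = T1.l_SuppKey;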
The π operator eliminates duplicates, if any. Even though SQL has bag semantics, allowing duplicate rows, such semantics do not make sense from a machine learning perspective. The reason is simple: each represented object must be identifiable. Assume a well formed query T = T1 ⋈ T2 ⋈ · · · ⋈ Tk on appropriate foreign keys. We propose the following rules to enable query composition. If we project columns from T, they must either include the primary key of T (being the primary key of some table Ti) or they must include a GROUP BY key to compute aggregations. When computing data transformations with SQL queries, the output table cannot be used in future queries if it does not include the primary key of T and it does not have aggregations to derive non-key attributes. That is, π must include both GROUP BY attributes and a list of aggregation functions.
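The sketches below contrast the two admissible projections with one that breaks composition:

   -- Not well formed: neither a primary key nor a GROUP BY, so the
   -- output rows are not identifiable and the table cannot be reused.
   SELECT l_Discount FROM LINEITEM;

   -- Well formed: the projection keeps the primary key of T.
   SELECT l_OrderKey, l_LineNumber, l_Discount FROM LINEITEM;

   -- Well formed: a GROUP BY key plus a list of aggregation functions.
   SELECT l_SuppKey, SUM(l_ExtendedPrice) AS SuppTotal
   FROM LINEITEM
   GROUP BY l_SuppKey;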
Attribute Level. There exist two mutually exclusive data transformation classes: (1) denormalization (via join), which gathers attributes from several entities into a single transformation entity, or simply derives new attributes from existing attributes; (2) aggregation (via projection with aggregation), which creates a new aggregation value attribute, grouping rows by a primary key and computing some summarization. An aggregation cannot be considered denormalization because, in general, we cannot reason about functional dependencies beyond 1NF between the grouping column(s) and the aggregated column.

Denormalization: When denormalization brings attributes from other entities, the attribute values themselves are not transformed. That is, the table structure changes, but the column values remain the same. When attributes come directly, without transforming values, from other entities, they can have three roles: being part of the primary key, being part of a foreign key, or being non-key. Attributes being part of the PK or a FK can later be used to group rows to compute aggregations, which is our second class of data transformations. In general, non-key attributes interact together using all SQL operators, programming constructs and functions (scalar, aggregate, UDFs). Arithmetic expressions and string expressions fall into this category. The only non-key denormalization data transformation that deserves special attention is the SQL CASE statement. There are two important issues introduced by the CASE statement: (1) it may create new values, not present in any previous column; (2) it may introduce nulls which were not present before. Therefore, an SQL query with CASE statements cannot be evaluated with an equivalent SPJA query.
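Both issues appear in the sketch below, a variant of the Discount_Category expression of Figure 2 where the ELSE branch is omitted on purpose:

   -- (1) 'low' and 'high' are new values, absent from every source column;
   -- (2) rows with l_Discount = 0.1 match no branch and yield NULL.
   SELECT l_OrderKey, l_LineNumber,
          CASE WHEN l_Discount < 0.1 THEN 'low'
               WHEN l_Discount > 0.1 THEN 'high'
          END AS Discount_Category
   FROM LINEITEM;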
Aggregation: Aggregations are expressed with the π operator having grouping attribute(s) and a list of functions (e.g. sum(), count()). In SQL, the GROUP BY clause partitions output attributes into key attributes and aggregated attributes. In general, aggregations return numbers, but notice that "max()" in SQL aggregations can have strings or dates as arguments. Global aggregations without a GROUP BY clause (e.g. a count() of rows, a total sum()) can be represented with a π operator with an artificial column with a constant value. Representing such an aggregation in the ER diagram is problematic because there is no given primary key; this is a research issue in database modeling. In the meantime, we propose to "pin" such aggregations to the entity name as a small bag of aggregations (e.g. counts, sums, statistics).
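A sketch of this representation: an artificial constant column turns a global aggregation into an ordinary π with a grouping key:

   -- Without a GROUP BY the result has no given primary key; the
   -- constant column grp acts as an artificial grouping key.
   SELECT grp, COUNT(*) AS n, SUM(l_ExtendedPrice) AS total
   FROM (SELECT 1 AS grp, l_ExtendedPrice FROM LINEITEM) T
   GROUP BY grp;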
3.3 Extending ER Diagram
We propose extensions to ER diagram notation to represent database transformations to build one data set, which can be exploited by multiple ML models. We clarify that there will be multiple ER diagrams, corresponding to different data sets (i.e. having different PKs), and that we aim to represent only polished, tuned queries. That is, we do not propose to build a single ER diagram representing all temporary tables, since that would create a huge ER diagram, impossible to interpret. We emphasize that an ER diagram works at a conceptual/logical level, whereas SQL works at a physical level. The ER diagram helps designing a coherent database structure, whereas SQL queries help processing the database. Therefore, our proposed notation brings both paradigms closer, enriching the ER diagram with the ability to represent transformations on database attributes, but mixing database modeling with database processing.

The first consideration is notation for queries. SQL has bag semantics and SQL queries tend to be verbose. On the other hand, given its precise mathematical definition and conciseness, relational algebra is the best choice. The second aspect is defining our diagram notation, a thorny and subjective aspect. After exploring several alternatives, keeping in mind the large number of temporary tables and columns created by SQL queries, we propose two ER diagrams: (1) a high level “abstract” ER diagram displaying only entities and (2) a low level (fine granularity) “query” ER diagram displaying all attributes, the data transformations and the corresponding query. The high level diagram can be used to explore the connection among tables and understand the overall flow of transformations. On the other hand, the low level diagram provides a specific idea of how transformations make up a table. The low level view can be selectively displayed in a “zoom in” view on each entity.

ER diagram notation in the literature has many variants: the classical, but highly intuitive, notation with small boxes for entities and external ellipses for attributes [1] is inadequate and cumbersome, as it does not scale to a large number of entities and much less to a large number of attributes. That is why we defend the idea of using UML notation, as most modern ER tools do. A transformation entity represents a weak entity in a strict sense, since each object's existence depends on the source entity. The classical ER notation uses dotted lines; since we intend to use UML notation and to extend it further, we prefer to show transformation entities with solid lines. As introduced above, we classify entities as source or transformation, where transformation entities are either GROUP BY aggregations or denormalizations. Therefore, each entity will be labeled as “source:”, “denormalization:”, or “aggregation:”. A complicated aspect is where to place attribute transformations. This is not an easy choice, since a typical modern ER diagram may have hundreds of entities and thousands of attributes, exploding further with the analytic database transformations. Given the column-oriented programming approach in SQL and the fact that modern ER diagrams display one attribute per line, we decided to show denormalization expressions and aggregations to the right of the attribute name. Putting all these elements together, our proposal is to show database transformations as an additional column inside an ER entity box. Such a column can substitute the data type definitions in a typical physical model (convertible to SQL Data Definition Language). A major requirement is understanding where columns come from in a given transformation entity. This information must necessarily come from the ⋈ operator in a query. We believe the best place to display a query is next to the entity name, simply showing which tables participate in ⋈. Attribute provenance will be shown with single dot or double dot notation (e.g. T.A or T..B).

Finally, we must consider attributes (table columns). A major requirement is to track provenance. We propose the following notation: a single dot indicates the source table if such table participates in a join in the corresponding query (e.g. T.A); otherwise, a double dot indicates the column originates from a previous temporary table (e.g. T..A). Based on the fact that SPJA queries cannot express CASE computations, it is necessary to define a special notation to represent CASE statements. Basically, there are two elements: a predicate combining comparisons with and/or/not, and the value(s) to be assigned. Therefore, we can think of the CASE statement as a functional form which returns a value based on the predicate being true or false. Given its similarity to the C++ "?:" operator, that is the notation we use. For the example mentioned above, it will be shown as "(A >= 0) ? 'positive':'negative'", which is shorter than the CASE syntax and intuitive to C++ or Java programmers.

The actual SQL queries, including each temporary table, are selectively displayed on each transformation entity with a "zoom in" low level view. That is, we avoid showing all SQL queries in the ER diagram.

Extended ER Diagram Properties. From a theory perspective, since we use relational algebra to represent database transformations, our extended ER diagram is guaranteed to be: (1) complete and (2) consistent. From a database design perspective, our model
can improve database evolution: it can track attribute provenance, it can help reuse existing tables and it motivates reusing transformation queries.

[Figure 2: Transformation ER diagram for the TPC-H database (low level). The diagram shows the source entity LineItem; the denormalization entities Discount_Category, LinePrice_Category and LineItem_Category, with CASE expressions in "?:" notation such as (LineItem.l_Discount < 0.1) ? 1:2; the aggregation entities Supp_Count and Supp_R_Count, both π l_SuppKey, COUNT(*) with GROUP BY l_SuppKey, the latter filtered by σ l_ReturnFlag = 'R'; the denormalization entity Supplier (Supp_Count ⋈ Supp_R_Count); and the final entity Dataset:Customer (LineItem_Category ⋈ Supplier), whose attributes carry provenance annotations such as Supplier.SuppCount and LineItem..l_ReturnFlag.]
                                                                                                                             years. Our program, written in C#, using 12 SQL queries as in-
3.4 Extended ER Diagram Example
We now illustrate our extended ER diagram notation and the algorithm with the sample database introduced in Section 2. Figure 2 presents the low level view with entities, data transformations at the attribute level and the query.
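For instance, the final query behind the Dataset:Customer entity of Figure 2 can be reconstructed as the sketch below (our reading of the diagram; the denormalization entity Supplier is renamed Supplier_Counts to avoid clashing with the source table):

   -- Dataset:Customer = LineItem_Category ⋈ Supplier in Figure 2.
   SELECT LC.l_OrderKey, LC.l_LineNumber, LC.l_SuppKey, LC.l_PartKey,
          SC.SuppCount, SC.SuppRCount,
          LC.LinePrice_Category, LC.Discount_Category, LC.l_ReturnFlag
   FROM LineItem_Category LC
   LEFT OUTER JOIN Supplier_Counts SC ON LC.l_SuppKey = SC.s_SuppKey;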
                                                                                                                             regression. Using a standard CSV format, these entities can be vi-
3.5 Algorithm to Extend an ER Diagram
Given an ER diagram (generally corresponding to a tuned OLTP or OLAP database) and an existing set of polished SQL queries that create data set X, the following steps help build an extended ER diagram.
                                                                                                                             study, which we will conduct in the future.
(1) Initialize the extended ER diagram with the original ER diagram, labeling each entity as a “source” entity (S).
(2) Create a transformation (T) entity for every intermediate table; consider nested queries and views as additional temporary tables. Label each intermediate table as T<99>, where 99 is an increasing integer, resulting from an incremental computation.
(3) Label each attribute as key or non-key.
(4) Associate each non-key attribute with either a derived expression or an aggregation. Indicate provenance (lineage) of attributes coming from the denormalization process. For aggregations, use the same function name provided by SQL in the relational algebra expression.
(5) Add a final main data set entity joining all intermediate tables; this data set entity will be highlighted in the ER diagram and labeled “data set”.
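A compact trace of these steps on the queries of Figure 2 (the labels are ours), written as SQL comments:

   -- Step (1): LINEITEM, ORDERS, PART, SUPPLIER               -> source (S)
   -- Step (2): T<1> = Supp_Count, T<2> = Supp_R_Count,
   --           T<3> = LineItem_Category, T<4> = Supplier      -> transformation (T)
   -- Step (3): l_SuppKey -> key; SuppCount, l_ReturnFlag, ... -> non-key
   -- Step (4): SuppCount         <- COUNT(*) GROUP BY l_SuppKey (aggregation)
   --           Discount_Category <- (l_Discount < 0.1) ? 1:2    (derived expression)
   -- Step (5): Dataset:Customer = LineItem_Category ⋈ Supplier, labeled "data set"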
                                                                                                                             attributes in the data set are built later, during the iterative data
4 CASE STUDY
Due to lack of space we do not present experiments. Instead, we summarize a case study with a real database with bike sales information, which comes as a test database for the Microsoft SQL Server DBMS (AdventureWorks). We manually wrote SQL queries to build a data set to predict whether a customer will buy any product in the next 6 months, given past sales history. This database consists of approximately 70 tables with bike store sales over 4 years. Our program, written in C#, using 12 SQL queries as input, produced a sequence of relational schemas with appropriate primary and foreign keys, which were labeled as transformation entities (with our algorithm) in a couple of seconds (since this involves only schema data). The data set was used as input in several ML predictive models, including Naive Bayes and logistic regression. Using a standard CSV format, these entities can be visualized in automated ER diagram drawing tools (e.g. SmartDraw, Lucidchart).

Measuring disk space savings, reduction in development time, time to automatically produce an ER diagram and ease of maintaining the extended ER diagram requires a long-term case study, which we will conduct in the future.

5 RELATED WORK
Research on ER modeling can be classified into models for transactional databases and models for analytic databases (data warehouses). Models for transactional databases ensure the database can be maintained in a valid state (complete and consistent), whereas models for database analysis enable multidimensional analysis on cubes and basic machine learning. Since we are concerned with analyzing the database rather than updating it, our extended ER model is more closely related to models for data warehouses. However, there are fundamental differences between both kinds of models. Cube dimensions and measures are identified early in the database design to denormalize tables, whereas attributes in the data set are built later, during the iterative data mining process. Generally speaking, data transformations for ML analysis are more complex, since they also involve mathematical functions and complex logic in the CASE statement. On the other hand, provenance identifies the sources of information in a data warehouse. We have adapted provenance to trace the
source tables an attribute in the data set comes from. A closely related work that studied the problem of transforming tables from a modeling perspective is [5], where the authors study the transformation of a database schema into another equivalent one in order to facilitate information exchange with another database; [5] compares alternative schemas, but does not consider queries combining aggregations, denormalization and math functions on attributes to transform tables like we do. The closest research we found that studies how to represent a data set to compute a machine learning model is [7], where the authors exploit UML notation to extend an existing data warehouse model to represent attributes used in classification, a prominent statistical learning problem; we emphasize that [7] focuses more on the final data set, rather than studying the data transformations to build it. The idea to represent data transformations in the ER model was proposed in [3], where simple data transformations are classified and SQL queries are shown in the ER diagram. Our paper takes a step further by focusing on the ER diagram (coining the term “transformation” ER diagram), formally studying the problem with relational notation instead of SQL, considering a broader class of queries (e.g. representing σ and views), and introducing an algorithm to automate ER diagram construction.

The closest approach in the literature is reverse data engineering, where the main purpose is to produce an ER diagram using a database schema as input. In contrast, we assume queries use tables already stored in the database. Moreover, we consider attribute-level data transformations beyond denormalization with joins. ML models generally require many tables to be joined, aggregated and transformed to derive variables (features). In [4] there is a proposal to analyze inclusion and functional dependencies in denormalized tables built with relational queries. However, the goal is to transform tables into a 3NF database model representable in ER form, rather than providing an abstract representation of data transformation queries.
Our work shares similarities with research on ETL (Extract-Transform-Load) [6], where records are transformed and cleaned with traditional programming languages (including UDFs) before being loaded; this includes conceptual modeling for ETL [?, ?] and ELT, where data records are transformed with queries after being loaded. Important similarities include representing data processing with a diagram, capturing a comprehensive set of data transformations and meeting user requirements. On the other hand, important differences include the following. We assume referential integrity problems have been solved. The input to our diagram is a set of queries, which work strictly on tables, not on files or documents. We draw a clear boundary between data transformations based on joins (only derivation) and on aggregation (some aggregation function). Instead of using a flow diagram to represent data processing, we propose to represent such processing with “processing” entity boxes. We believe our “zoom-in” view of transformation entities with relational queries is novel and meaningful.


6 CONCLUSIONS
We introduced minimal ER diagram extensions to represent data transformations in abstract form, bridging the gap between a logical database model and a physical database. We focused on studying database transformations to build a data set for machine learning analysis. We assume such a data set is wide, gathering many ML variables (features) in one place. Our extended ER model has two kinds of entities: source entities and transformation entities, which correspond to standard tables coming from an ER diagram and temporary “analytic” tables created with SQL queries, respectively. Data transformations are further classified into two main categories: denormalization and aggregations. Due to the large number of entities and attributes involved, we use modern UML diagram notation, where all entities are boxes and entities are linked by 1:1 and 1:N relationships. We extended ER diagram entity boxes with a new box where transformations are expressed with extended relational algebra notation (SPJA queries). We introduced an algorithm to extend an existing ER model, using data transformation queries as input, where the data set is the final database goal. We emphasize that an extended ER diagram is automatically created for each ML project, to remain relevant and useful. That is, different data sets, having different primary keys, have separate extended ER diagrams.

Our proposed ER diagram extended with data transformations offers many opportunities for future research. It is necessary to study data transformations deeper, considering coding, pivoting and cubes. We want to further explore UML diagram constructs to enrich ER diagrams to represent database transformations. We want to determine when it is feasible to design transformation entities before writing SQL queries, to close the loop. From a normalization perspective, it is necessary to study aggregations, choosing the best table schema to minimize storage of redundant information. The SQL CASE statement is powerful, but it introduces many issues from a theoretical and database modeling perspective. We plan to explore provenance aspects in depth. We intend to develop metrics to quantify savings in software development effort (e.g. lines of source code, number of functions or classes) and database maintenance (number of tables that can be reused, number of attributes that can be used by multiple ML models, or number of tables that can serve multiple analytic goals).

REFERENCES
[1] Z. Akkaoui, J.N. Mazón, A.A. Vaisman, and E. Zimányi. BPMN-based conceptual modeling of ETL processes. In Proc. DaWaK Conference, pages 1–14, 2012.
[2] H. Garcia-Molina, J.D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd edition, 2008.
[3] B. Oliveira and O. Belo. Using REO on ETL conceptual modelling: a first approach. In Proc. ACM DOLAP, pages 55–60, 2013.
[4] C. Ordonez. Can we analyze big data inside a DBMS? In Proc. ACM DOLAP Workshop, 2013.
[5] C. Ordonez, S. Maabout, D.S. Matusevich, and W. Cabrera. Extending ER models to capture database transformations to build data sets for data mining. Data & Knowledge Engineering, 2013.
[6] J.M. Petit, F. Toumani, J.F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proc. ICDE Conference, pages 218–227, 1996.
[7] A. Poulovassilis and P. McBrien. A general formal framework for schema transformation. Data & Knowledge Engineering, 28(1):47–71, 1998.
[8] P. Vassiliadis, A. Simitsis, and E. Baikousi. A taxonomy of ETL activities. In Proc. ACM DOLAP, pages 25–32, 2009.
[9] J. Zubcoff and J. Trujillo. Conceptual modeling for classification mining in data warehouses. In Proc. DaWaK Conference, LNCS, pages 566–575. Springer, 2006.