    ER4ML: An ER Modeling Tool to Represent
      Data Transformations in Data Science

    Dihia Lanasri1, Carlos Ordonez2, Ladjel Bellatreche3, and Selma Khouri1

    1 ESI, Algiers, Algeria
    2 University of Houston, USA
    3 LIAS/ISAE-ENSMA, Poitiers, France

       Abstract. Transforming several relational tables into a data set that
       serves as input to a Machine Learning (ML) model is a complex task:
       the data scientist has to derive many intermediate tables, queries,
       and views in a disorganized manner. This process produces many SQL
       queries written to support the data scientist's exploratory work. Because
       the provenance of the intermediate results is not recorded, similar SQL
       queries tend to be written multiple times, causing repeated manual work.
       In this paper, we propose a tool, "ER4ML", that assists data scientists
       in modeling and visualizing the transformations applied to a relational
       database before obtaining the data set that feeds an ML model. ER4ML
       produces a flow diagram based on a conceptual view of the database schema
       and its transformations, using minimal but powerful extensions of the ER
       diagram in UML notation. In addition, ER4ML tracks data provenance,
       improving query and data set reuse.


Keywords: ER diagram · Provenance · Data transformation · Query · Machine
Learning.

Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

1    Introduction
In order to feed their Machine Learning (ML) models, data scientists collect,
clean, aggregate, and transform large amounts of relational data to build a
tabular data set with many features. Pre-processing relational data is a complex
process consuming more than 80% of the project time. This pre-processing
requires executing many SQL queries in a disorganized manner outside the
DBMS, without modeling the resulting entity-relationship (ER) modifications.
These transformations produce several intermediate temporary tables or views
that are not modeled in the ER diagram. The transformations are valuable
knowledge for future analytics reusing similar sets of entities. However, reusing
them is challenging for data scientists because of the lack of modeling and GUI
tools helping them understand what has been performed and the intermediate
results of the SQL queries that build the input to an ML model. We defend the
idea of creating an extended ER model to understand such a set of disconnected
tables and views at a high level.
    To overcome this issue, we propose a tool, "ER4ML", that aims to assist
data scientists in visualizing and understanding the different transformations,
expressed as complex SQL queries (including joins, selections, projections,
aggregations, and the powerful CASE statement), applied to the relational data
set given as input to an ML model. ER4ML is based on the "Extended ER" model
detailed in [5], which proposes minimal but powerful extensions to the ER
diagram as a conceptual flow representation. It helps to understand a chain of
transformations based on SQL queries, going from the physical level (tables) to
the logical level (entities), in UML notation. The importance of SQL queries
for transforming relational data to build a data set for ML is identified
in [4]. ER4ML provides a graphical web interface that shows the transformations
applied to the ER model and the provenance of each attribute (primary key,
foreign key, or non-key), with a color code to distinguish between source and
transformation entities. The transformation entities can be of two main types,
denormalization or aggregation, while the final data set is the result of outer
joins (denormalization), which tend to create non-3NF tables. Our system shares
some similarities with data reverse engineering and ETL tools, but there are
many differences [5]. We establish a connection with the existing ER diagram,
instead of building a separate diagram.
    This paper is organized as follows: Section 2 discusses the related work. Sec-
tion 3 presents the architecture of our tool and details its main modules. Section
4 presents our demonstration scenario and Section 5 concludes our paper.


2     Related Work
Based on reverse engineering processes for extracting a database's conceptual
model from its physical model, some studies define a conceptual view of
relational data sets [1], extend ER modeling by defining schema
transformations [6], or extend the data warehouse conceptual model for
classification mining [8]. Our paper focuses on a unique extended ER diagram
that unifies the physical and conceptual levels. Other solutions propose to
model ETL/ELT processes at a conceptual level [3], based on the BPMN
model [2]. In contrast, our solution represents data processing with
transformation entity boxes and SQL queries.


3     Research Application Contribution
3.1. Extended ER model for Data Transformation: The proposed tool
is based on the Extended ER model, which defines a database as D(T, I), where
T = {T1, ..., Tn} is a set of n tables and I is a set of integrity constraints
(entity constraints for primary keys (PK), referential constraints for foreign
keys (FK)), and which represents the chain of transformations as a conceptual
flow diagram. We use the term "table" to make a connection to the physical
level and SQL querying. SQL queries can be classified as either aggregation or
denormalization transformations. We assume well-formed queries are used to
create data sets, each including a PK to identify rows and at least one non-key
attribute.
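To make the two query classes concrete, the following is a minimal sketch
showing one denormalization and one aggregation over a hypothetical two-table
schema; the table and column names (team, player, and their attributes) are
illustrative assumptions, not part of our demo database.

   -- Hypothetical source tables: team(tid PK, name) and
   -- player(pid PK, tid FK, points).

   -- Denormalization: an outer join widens rows into a non-3NF table;
   -- the PK of the result (pid) still identifies each row.
   CREATE TABLE denormalization_t1 AS
   SELECT p.pid, p.tid, t.name AS team_name, p.points
   FROM player p
   LEFT OUTER JOIN team t ON p.tid = t.tid;

   -- Aggregation: GROUP BY collapses rows; the grouping key (tid) becomes
   -- the PK of the result, with at least one derived non-key attribute.
   CREATE TABLE aggregation_t2 AS
   SELECT tid, SUM(points) AS total_points, COUNT(*) AS num_players
   FROM player
   GROUP BY tid;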
3.2. Layered System Architecture: Figure 1 illustrates the architecture of
our system, which is designed following a three-tier architecture based on the
MVC paradigm, covering three layers:
Presentation layer: the GUI presented to the user for visualizing the database
schema, the different transformations, and the SQL queries used. The system
distinguishes between source and transformation entities. The type of each
entity is indicated in its name (e.g., Source:EntityName, Denormalization:Ti,
and Aggregation:Tj). The final data set is named "Dataset:Tx", where i, j, and
x are sequence numbers. The flow diagram is automatically generated based on
physical model notation complementary to UML notation. Our Extended ER
notation does not change the UML notation for ER; instead, we show
transformation entities in green, red, or blue and ER entities in white. The
arrows represent data flow (not relationships), but we still link entities
(source and transformation) with PKs/FKs. This layer is developed using HTML,
Bootstrap, CSS, JavaScript (JS), and the GoJS library, used to build
interactive ER diagrams where the user can selectively hide or show the
transformation entities.
Functional Logic layer: the core of our solution. It contains the three main
modules of the system: the Model extraction, ER initialization, and
Transformation modules, responsible for generating an extended ER model for
each data set to be given to an ML model. (i) Model extraction module:
generates an ER model from a relational data source using a specific DBMS
connector. Our solution works with any DBMS, but in our demonstration scenario
we use the sqlite3 module for Node.js to connect to the SQLite DBMS, thanks to
its simplicity and low overhead. This module extracts the structure of the
database (tables, their columns, constraints, and relationships) and generates
two JSON files that define the model and that can be read by the GoJS library.
The first file contains the relationships and the second the entities, as
sketched below.
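As a rough sketch of these two files (the field names below are our assumptions
for illustration, not the exact format produced by the tool), each entity node
lists its attributes with key flags, and each link records a PK-FK relationship
with its cardinalities:

   entities.json (one node per table):
   [ { "key": "Source:PLAYER",
       "attributes": [ { "name": "pid", "iskey": true },
                       { "name": "tid", "isfk": true },
                       { "name": "points" } ] } ]

   relationships.json (one link per PK-FK relationship):
   [ { "from": "Source:PLAYER", "to": "Source:TEAM",
       "fromText": "many", "toText": "1" } ]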
(ii) ER initialization module: takes the two JSON files as input and extracts
the detailed structures of nodes and links to initialize the ER model, to be
displayed by the presentation layer. (iii) Transformation module: we assume the
user is responsible for specifying the components and results of each
transformation, has the prior knowledge to distinguish between a fact table and
an analytic data set, and checks the validity of the ER models. This module
enriches the extended ER model for each new SQL query by: (a) defining the
transformation type (Aggregation or Denormalization); (b) recording the SQL
query that generates the intermediate table; the query is displayed as a
comment next to the transformation entity so the user can zoom in on the
transformation applied; (c) creating the transformation entities (named
"TYPE_Ti", where i is an auto-incremented sequence number), displayed in
different colors; (d) associating the PK, FK, and other attributes with the
transformation entity; (e) identifying, from the tables joined in the FROM
clause of the query, the relationships to existing entities that show the
provenance of the transformation entities and their attributes, drawn as arrow
links between entities, which lets the user follow the chain of
transformations; and (f) creating the cardinalities of the relationships
(one-to-many and many-to-one), as illustrated by the sketch below.
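To illustrate steps (a) through (f), consider a hypothetical second
transformation that aggregates the result of the denormalization sketched in
Section 3.1 (the table and column names remain illustrative assumptions):

   -- (a) The user selects type = Aggregation; (c) the module creates entity
   -- "Aggregation_T3" in a distinct color and (b) attaches this query to it
   -- as a comment next to the entity.
   CREATE TABLE aggregation_t3 AS
   SELECT tid, COUNT(*) AS num_players, SUM(points) AS total_points
   FROM denormalization_t1  -- (e) the FROM clause names Denormalization_T1,
   GROUP BY tid;            --     so an arrow links T1 to T3 (provenance)
   -- (d) tid becomes the PK of Aggregation_T3, with num_players and
   -- total_points as its non-key attributes; (f) the T1-to-T3 relationship
   -- is many-to-one.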
Data layer: after each valid transformation step, the newly generated extended
ER model is stored as JSON files. The tool was developed in the WebStorm IDE,
with GitHub used for versioning. The source code of our solution is available
at: https://github.com/dihiaselma/ER4ML




                     Fig. 1. The Overall Architecture of ER4ML



4     Demonstration Overview
In our system demonstration, we show our extended, elegant, and intuitive ER
model, obtained in a few seconds from typical SQL queries; it helps in
understanding complex data transformations. Our demonstration is based on a
basketball data set containing statistics about players, coaches, and teams in
men's professional basketball leagues from 1937 to 2012, extracted from the
Open Source Sports website. We constructed an SQLite database from these files
and verified the integrity constraints and PK-FK links. We use 14
transformations (available at
https://drive.google.com/file/d/1A29k9wjIfp0QpYgtf8dbooQH3qmbDqy6/view) that
derive features using complex joins, aggregations, and the powerful CASE
statement; they come from previous work [7] and produce the final data set used
by an ML model aiming to predict winning and losing teams. Using our tool, the
user first sees the ER model corresponding to the database, displayed as a
dynamic graph. Let us take as an example the first query, which is a
denormalization. The user chooses the 'Denormalization' transformation type and
specifies the query and the set of included tables. Then, she specifies the
list of attributes; the cardinalities between entities can also be specified. A
new diagram extending the existing one is generated, displaying the first
transformation entity, as shown in Figure 2. The new transformation table is in
light green, named "Denormalization_T1", next to the SQL query shown as a
comment bubble that helps the user understand the transformation executed.
Furthermore, the PK and FK columns are labeled, and the cardinalities are
displayed above the arrow link, colored in dark green. In the case of
aggregation, the transformation entities are in light red while the
relationship (arrow) is in dark red. Our tool helps data scientists refine the
data source over multiple iterations by adding, removing, and changing
features, based on joins and aggregations via the described SQL queries, until
obtaining the final data set containing offensive and defensive features.
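To give a flavor of the feature-derivation queries, the following is a hedged
sketch of a CASE-based transformation in the spirit of [7]; the team_season
table and its columns are plausible assumptions, not the exact queries used in
the demo:

   -- Hypothetical feature derivation: label each team-season as winning or
   -- losing, a typical use of CASE to derive the class label for the ML model.
   CREATE TABLE dataset_tx AS
   SELECT tid, year,
          o_pts, d_pts,  -- offensive & defensive features
          CASE WHEN won > lost THEN 1 ELSE 0 END AS winning_season
   FROM team_season;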
    A detailed demonstration video is available at https://youtu.be/nJkZ8aBa6co,
which gives an overview of the different functionalities of our tool. It
emphasizes the following points: (1) ER representation of temporary tables;
(2) flow of a chain of data transformations; (3) conceptual separation between
aggregation and denormalization; (4) complex SQL queries; (5) reuse of existing
queries and tables; and (6) a connected view of the data via primary and
foreign keys.

               Fig. 2. Denormalization & Aggregation transformations.

5    Conclusions
In order to assist data scientists in managing and understanding the different
transformations applied to a relational database to create a tabular data set
serving as input to their ML models, we developed a new tool, "ER4ML", that
graphically generates an extended ER model based on UML notation. It models all
the intermediate temporary tables or views resulting from SQL queries as
entities enriching the existing database schema. This tool can be seen as a
response to a major question discussed at Dagstuhl seminar 18471, titled Next
Generation Domain Specific Conceptual Modeling: Principles and Methods
(November 2018), regarding the integration of conceptual modeling in ML.
Nevertheless, our tool has some limitations: i) the user cannot view the ML
model as entities; ii) the user cannot view data transformations done in
external languages such as Java, R, or Python; iii) our tool targets data
represented as tables, not text or images. As future work, we are extending our
tool to handle data sets coming from different database formats.
References
1. Boyd, M., McBrien, P.: Comparing and transforming between data models via an
   intermediate hypergraph data model. In: JoDS IV, pp. 69–109. Springer (2005)
2. El Akkaoui, Z., Mazón, J.N., Vaisman, A., Zimányi, E.: BPMN-based conceptual
   modeling of ETL processes. In: DaWaK. pp. 1–14. Springer (2012)
3. Oliveira, B., Belo, O.: Using Reo on ETL conceptual modelling: a first approach. In:
   DOLAP. pp. 55–60. ACM (2013)
4. Ordonez, C.: Data set preprocessing and transformation in a database system. In-
   telligent Data Analysis (IDA) 15(4) (2011)
5. Ordonez, C., Bellatreche, L.: Enhancing ER diagrams to view data transformations
   computed with queries. In: DOLAP (2019)
6. Poulovassilis, A., McBrien, P.: A general formal framework for schema transforma-
   tion. DKE 28(1), 47–71 (1998)
7. Shi, Z., Moorthy, S., Zimmermann, A.: Predicting NCAAB match outcomes using ML
   techniques - some results and lessons learned. In: ECML/PKDD (2013)
8. Zubcoff, J., Trujillo, J.: Conceptual modeling for classification mining in data ware-
   houses. In: DaWaK. pp. 566–575. Springer (2006)