Just-In-Time Modeling with DataMingler

Damianos Chatziantoniou (1) and Verena Kantere (2)
(1) Dept. of Management Science and Technology, Athens University of Economics and Business, damianos@aueb.gr
(2) School of Electrical and Computer Engineering, National Technical University of Athens, verena@mail.ntua.gr

Abstract. DataMingler is a prototype tool that implements a novel conceptual model, the Data Virtual Machine (DVM), and can be used for agile, just-in-time modeling of data from diverse sources. The DVM provides easy-to-understand semantics and fast and flexible schema manipulations. An important and useful class of queries in analytics environments, dataframes, is defined in the context of DVMs. These queries can be expressed either visually or through a novel query language, DVM-QL. We demonstrate DataMingler's capabilities to map relational sources and queries on the latter into a DVM schema and to augment it with information from semi-structured and unstructured sources. We also show how to easily express on the DVM complex relational queries, or queries combining structured, semi-structured and unstructured sources.

Keywords: Data virtualization · just-in-time modeling.

1 Introduction

In a recent analytics project for a telecom provider, a customers' feature store had to be built and used by data scientists to develop churn prediction models. The features had to be rapidly defined and presented to the analysts in a manner that is easy to understand and use. Rapid development usually suggests the use of Python or R to build dataframes. This approach, however, leads to implementations that lack data semantics, offer limited data exploration capabilities and require the involvement of data engineers end-to-end. A classic alternative is the deployment of an application-specific data warehouse, which is quite expensive both in terms of resources and schema changes. However, this is the approach most consulting firms (e.g. Accenture, EY, PwC) follow for analytics projects.
The question was whether we can combine the semantics offered by data modeling with the agility of programming languages. The challenge was how to bind an attribute to a customer entity quickly, in a just-in-time manner. To tackle the above, we propose the Data Virtual Machine (DVM) (briefly introduced in [3]), a conceptual model based on entities and attributes, concepts that users understand well. A DVM is a graph where nodes represent attribute domains and edges represent mappings between these. The mappings are expressed by data processing tasks (e.g. a query) that provide their output as pairs of values. It is a form of data virtualization, a rapidly emerging trend in the business world [9], the main function of which is to hide technical details from users and provide a single view of any entity. Since nodes of a DVM can be attributes and entities at the same time, the "single view of any entity" can be trivially served by a DVM. The DVM provides intuitive semantics and fast and flexible schema manipulations. Dataframes, an important class of queries in analytics environments, can be expressed either through a novel query language, DVM-QL, or visually, and are evaluated within a formal algebraic framework.

DataMingler is a prototype tool [3] that implements DVM design and querying. It enables (a) efficient DVM management, (b) visual and textual specification of dataframe queries, (c) optimization and evaluation of dataframe queries, and (d) materialization of a DVM into different logical models (model polymorphism). In this paper we demonstrate how information in relational sources and queries on the latter can be mapped to a DVM, and, further, combined with information in semi-structured and unstructured sources, like JSON, Excel and csv files.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
We also show how we can very easily express on a DVM queries that are either very complex to declare on the relational sources or impossible to declare on the ensemble of structured, semi-structured and unstructured sources.

2 Concepts and Current Landscape

Modern organizations collect, store and analyze a wealth of data from different sources and applications, used in a variety of analysis projects to provide a competitive advantage to business. This data has to be integrated to provide data scientists with a "holistic" view of the enterprise's data infrastructure, a term that encompasses much more than data persistently stored in a DBMS, SQL, NoSQL, or otherwise. It also involves flat files, spreadsheets and transient data handled by stream engines. Furthermore, it includes processes (i.e. programs) that produce output useful in the analysis phase, e.g. a Python program that computes the social influence or the churn category of a customer. It is important to represent all these easily and agilely in a conceptual data layer, e.g. by using data virtualization, a new business trend that is closely related to mediators and virtual databases (a form of data integration), if not a reinvention of these: it allows people who own, contribute, manage and analyse the data, and who assume several roles, to manipulate the data without requiring technical details about it, such as how it is formatted at source or where it is physically located.

Current Situation in Industry. The industrial scene is dominated by companies that offer practical data virtualization. Oracle [10], Denodo [7] and other database technology companies [9] already offer data virtualization products. Most implement a virtual relational model, defining views over data sources having a relational interface.
While the relational model and its query language, SQL, are well-known and well-understood by the IT community, they present shortcomings when used as a data virtualization technique, more or less the same shortcomings existing in data warehousing environments. The relational model is not a high-level conceptual model and is cumbersome for many data analysts and contributors. Schematic modifications and extensions are very rigid and may depend on complex constraints. As the relational model is flat, focusing on a conceptual entity and collecting all relevant data may be tedious and require the expression and execution of complex SQL queries. Lastly, visualization of schema manipulation is not intuitive due to the tabular format.

Current Situation in Research. Classical data integration deals with the problem of defining a global schema on heterogeneous relational data sources using a mapping between the global and the local schemata [8]. The issue of query containment and equivalence is of great importance, and is investigated with respect to the open- and closed-world assumptions. In contrast, in the DVM approach our attention is not on the original queries on the data, neither with respect to definition nor evaluation. Furthermore, there are works that create Virtual Knowledge Graphs as a means of integrating and accessing autonomous and heterogeneous sources. The focus has been on the definition of mappings [1] with a well-balanced trade-off of expressiveness and complexity: the mappings need to be adequately expressive in order to allow for complex querying, but queries should be tractable, especially in the size of the data. Unlike OWL 2.0 and semantic reasoners, the DVM is not meant for the expression of complex conceptual relations or for reasoning. In the DVM approach we are not interested in defining sophisticated mappings and balancing expressiveness with complexity.
Enabling query processing across heterogeneous data models by federating specialized data stores, an interesting research topic, is discussed in [11]. A system that supports this can be classified as a federated database, a polyglot, a multistore or a polystore system according to a taxonomy presented in [12]. Our work is motivated by similar concerns and could be considered a multistore.

3 Data Virtual Machines

In this section we briefly present the definition of the Data Virtual Machine (DVM). For a more extended discussion the reader is referred to [2, 3]. In simple terms, a DVM represents a collection of mappings (edges) between attribute domains (nodes), where mappings are manifested as data processes with a 2-dimensional output (the attribute domains) over existing data.

Definition 1. [Key-list Structure] A key-list structure (KL-structure) K is a set of (key, list) pairs, K = {(k, Lk)}, where Lk is a list of elements or the empty list and for all (k1, Lk1), (k2, Lk2) ∈ K, k1 ≠ k2. Both keys and elements of the lists are strings. The set of keys of KL-structure K is denoted as keys(K); the list of key k of KL-structure K is denoted as list(k, K).

Definition 2. [Data Virtual Machines] Assume a collection A of n domains A1, A2, ..., An, called attributes. Assume a collection S of m multisets, S1, S2, ..., Sm, where each multiset S has the form S = {(u, v) : u ∈ Ai, v ∈ Aj, i, j ∈ {1, 2, ..., n}}, called data processing tasks. For each such S ∈ {S1, S2, ..., Sm} we define two key-list structures, K^S_ij and K^S_ji, as:

K^S_ij: for each u in the set {u : (u, v) ∈ S} we define the list Lu = {v : (u, v) ∈ S} and (u, Lu) is appended to K^S_ij. K^S_ji is similarly defined.

Fig. 1. Key-list structures to represent edges of DVMs
Fig. 2. DataMingler's Architecture
The data virtual machine corresponding to these attributes and data processing tasks is a multi-graph G = (A, S) constructed as follows:
– each attribute becomes a node in G;
– for each data processing task S we draw two edges Ai → Aj and Aj → Ai, labeled with K^S_ij and K^S_ji respectively.

The key-list structure that corresponds to an edge e : Ai → Aj is denoted as KL(e), with schema (Ai, Aj).

Example 1. Assume the SQL query "SELECT custID, transID FROM Customers" that maps transactions to customers and vice versa. The attributes, edges and the respective key-list structures are shown in Figure 1.

4 DataMingler: An Overview

The architecture of DataMingler is shown in Figure 2. Data Canvas is the module that enables the creation and manipulation of a DVM graph by mapping data and processes onto the graph and extending it with new nodes and edges. The resource types that DataMingler currently handles are: relational databases, csv files, Excel files and stand-alone programs (Java and Python). Descriptions of resources (connection strings, database names, file paths and names, etc.) can be found in the data dictionary, stored in XML. A DVM is kept in a Neo4j graph database. Nodes have as properties the name and the description of the attribute. Edges have as properties the data source name, the query string (if any), and the positions in the output that correspond to the head and tail nodes of the edge.

The user can formulate dataframe queries either textually, or visually using Query Builder. In both cases, queries are represented in an XML-based intermediate representation, and parsed and transformed into a key-list algebraic expression, which is given to the optimizer, and an execution plan is generated. Redis is currently used as the key-value engine for loading and manipulating key-list structures. The transformations(), rollupJoin() and thetaCombine() operators have been implemented in Python, Java and R. The output at the top level is a dataframe in Python/R, i.e.
columns may contain atomic values or lists.

Employee (SSN, Gender, LastName, DeptCode)
Dept (DeptCode, Name, Budget, MgrSSN)
Assigned (SSN, Code, Hours)
Project (Code, Description, Location)
Supervision (SSN, Code, Order, Supervisor)
Founder (SSN, Title)

Fig. 3. A relational database schema to be mapped to a DVM
Fig. 4. The DVM schema from a relational source augmented with attributes from queries and an Excel file

An association between nodes is defined via the use of a data source: the user specifies (a) the column/output position in the source that corresponds to the head node of the edge, the root (e.g. custID), and (b) the column/output position in the source that corresponds to the tail node of the edge, the child (e.g. Age). If these nodes do not exist in the DVM, they are created. The required data processing tasks (e.g. "SELECT custID, Age FROM Customers", "SELECT Age, custID FROM Customers") are derived automatically and attached to the edges between the nodes. The creation of a DVM using a relational database, an Excel file, a csv file and a program is depicted in video [4].

Using Query Builder over a DVM, complex, conceptually difficult queries become simple and intuitive to express. Video [6] provides a step-by-step construction of two dataframe queries. Furthermore, DataMingler can be used to instantiate model-specific databases. The user selects a node and a breadth-first-search tree rooted on this node is defined. The system generates a collection of JSON documents corresponding to this tree. It implements attributes as lists or strings, depending on the cardinality of the lists (whether they contain multiple or single values), as demonstrated in video [5].
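The key-list structures of Definitions 1 and 2, which are derived from the two-column output of a data processing task, can be sketched as follows. This is a minimal Python sketch, not DataMingler code; the function name and the sample pairs are invented for illustration, using the query of Example 1.

```python
from collections import defaultdict

def build_kl_structures(pairs):
    """Given one data processing task S, i.e. a multiset of (u, v)
    pairs, build the two key-list structures K^S_ij (schema (Ai, Aj))
    and K^S_ji (schema (Aj, Ai)). Keys and list elements are strings
    (Definition 1)."""
    k_ij = defaultdict(list)  # u -> list of all v paired with u
    k_ji = defaultdict(list)  # v -> list of all u paired with v
    for u, v in pairs:
        k_ij[str(u)].append(str(v))
        k_ji[str(v)].append(str(u))
    return dict(k_ij), dict(k_ji)

# Hypothetical output of "SELECT custID, transID FROM Customers"
pairs = [("c1", "t1"), ("c1", "t2"), ("c2", "t3")]
cust_to_trans, trans_to_cust = build_kl_structures(pairs)
# cust_to_trans: {"c1": ["t1", "t2"], "c2": ["t3"]}
# trans_to_cust: {"t1": ["c1"], "t2": ["c1"], "t3": ["c2"]}
```

The two structures label the pair of opposite edges drawn between the custID and transID nodes, as in Figure 1.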
5 Demonstration Description: Just-In-Time Modeling

In this demonstration we show how to use DataMingler in a real use-case scenario to create a schema that coalesces information from multiple sources and how to visually form queries that are otherwise very complex to declare in SQL or impossible to declare on multiple sources. We show in detail how we map relational data sources and existing queries on them to a DVM, as well as how we augment the latter with nodes from semi-structured sources, like JSON, and unstructured sources, like collections of csv or Excel files.

Let us assume a company with many departments that runs projects in multiple locations. Employees are managers, supervisors or founders of the company. The schema of a respective relational database is shown in Figure 3. There is also an Excel file that stores information on KPIs for some projects (e.g. the projects of a specific department). The DVM can be enriched with columns from the Excel file, concerning e.g. funding already spent ('MoneySpent') and the percentage of the project that is completed ('PrctCompleted'). Figure 4 shows the produced DVM. The screenshot also shows the query that produced the node 'LHours'. Note that the nodes 'ProjectLocation' and 'ProjectCode', which correspond to attributes in the relational schema, become entities in the DVM after the addition of further information from queries (entities shown in blue and attributes in pink). Also, attributes comprised in a primary key, such as 'SSN', 'Code' and 'Order' in relation 'Supervision', do not appear as nodes in the DVM, but are represented by the node 'Supervision:PK'. The latter is connected with 'Assigned:PK', the node representing the primary key of 'Assigned' (i.e. the combination of attributes 'SSN' and 'Code'), with a pair of edges that represents the respective foreign key constraint.
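To give a flavor of the SQL complexity that the DVM hides, the following sketch loads a toy instance of the Figure 3 schema into SQLite (via Python's standard sqlite3 module) and finds the employees who participate in at least one project together with their department's manager. Only the table and column names come from Figure 3; the rows and the exact query formulation are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Employee (SSN TEXT, Gender TEXT, LastName TEXT, DeptCode TEXT);
CREATE TABLE Dept (DeptCode TEXT, Name TEXT, Budget REAL, MgrSSN TEXT);
CREATE TABLE Assigned (SSN TEXT, Code TEXT, Hours REAL);
""")
cur.executemany("INSERT INTO Employee VALUES (?,?,?,?)", [
    ("e1", "F", "Doe",   "d1"),
    ("e2", "M", "Smith", "d1"),
    ("m1", "F", "Jones", "d1"),   # m1 manages department d1
])
cur.executemany("INSERT INTO Dept VALUES (?,?,?,?)",
                [("d1", "R&D", 100.0, "m1")])
cur.executemany("INSERT INTO Assigned VALUES (?,?,?)", [
    ("e1", "p1", 10), ("m1", "p1", 5),  # e1 shares project p1 with m1
    ("e2", "p2", 8),                    # e2 shares no project with m1
])
# SSNs of employees assigned to at least one project that their
# department's manager is also assigned to (the manager excluded)
rows = cur.execute("""
    SELECT DISTINCT e.SSN
    FROM Employee e
    JOIN Dept d      ON d.DeptCode = e.DeptCode
    JOIN Assigned a  ON a.SSN = e.SSN
    JOIN Assigned am ON am.SSN = d.MgrSSN AND am.Code = a.Code
    WHERE e.SSN <> d.MgrSSN
""").fetchall()
# rows == [("e1",)]
```

The four-way self-joining query above is the kind of formulation a relational user must write by hand; on the DVM the same result is obtained by selecting nodes and building a tree in Query Builder.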
Finally, the nodes 'Dept:Code' and 'Employee:SSN' are connected with two pairs of edges, representing the two foreign key constraints between the respective relations.

The demonstration delves into the details of expressing on a DVM, in a simple, intuitive and seamless manner, queries that are complex and non-intuitive to structure in SQL, as well as combinations of complex SQL queries and processing tasks on unstructured data. One such query returns the SSNs of employees that participate in at least one project with their manager. This is a complex query to express on the relational database, but a very easy one to express on the DVM, in which the user needs just to select the nodes and processing tasks for the query and build a tree. Another such query asks for the supervisors of projects together with the number of employees they supervise in these and the sum of their hours for all projects the employees work for. This can be expressed by a combination of SQL queries. Due to lack of space, we do not show the screenshots of the queries.

References

1. Calvanese, D., Gal, A., Lanti, D., Montali, M., Mosca, A., Shraga, R.: Mapping patterns for virtual knowledge graphs (12 2020)
2. Chatziantoniou, D., Kantere, V.: Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures. In: Workshops of EDBT 2020 (2020)
3. Chatziantoniou, D., Kantere, V.: DataMingler: A Novel Approach to Data Virtualization. In: ACM SIGMOD. pp. 2681–2685 (2021)
4. DataMingler: Data canvas. https://drive.google.com/file/d/1hArnqc9HjSWobDWFbdjQZ1L_e9WU-eQQ/view?usp=sharing (2020)
5. DataMingler: JSON export. https://drive.google.com/file/d/1Ruv356HQxtoZyRunBPuf-aA-NHkhGIlF/view?usp=sharing (2020)
6. DataMingler: Query builder. https://drive.google.com/file/d/1FQyuXQ8BDzcoDnHy0aQcW0FEKR2FapUZ/view?usp=sharing (2020)
7. Denodo: Data Virtualization: The Modern Data Integration Solution (2019)
8. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann (2012)
9. Gartner: Market Guide for Data Virtualization (2018)
10. Oracle Corp.: Oracle Data Service Integrator (2020)
11. Stonebraker, M.: The Case for Polystores. ACM SIGMOD Blog (July 13, 2015), http://wp.sigmod.org/?p=1629
12. Tan, R., Chirkova, R., Gadepally, V., Mattson, T.G.: Enabling Query Processing Across Heterogeneous Data Models: A Survey. In: IEEE BigData. pp. 3211–3220 (2017)