Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

Damianos Chatziantoniou (Athens University of Economics and Business, damianos@aueb.gr)
Verena Kantere (National Technical University of Athens, verena@dblab.ece.ntua.gr)

ABSTRACT

In this paper we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and then utilized in the production of a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. With the term "data infrastructure" we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce output useful in decision making. For example, a Python program that "does something" and computes for each customer her probability to churn is an essential component of the organization's data landscape and has to be made available to the user, e.g. a data scientist, in a manner as easy to understand and intuitive to use as the age or gender of a customer. In fact, a DVM depicts only associations between attributes (nodes). An association is given by some computation on the underlying data that relates values of these attributes. In this respect, a DVM is model-agnostic; nevertheless, it can be quite helpful in a variety of crucial tasks carried out by statisticians and data engineers.

KEYWORDS

data virtualization, data virtual machines, big data infrastructures

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Modern organizations collect, store and analyze a wealth of data from different sources and applications, used in a variety of data analysis projects, such as traditional BI, data exploration and data mining, to provide a competitive advantage to the business. This data has to be integrated to provide the data scientist with a "holistic" view of the enterprise's data infrastructure. The term data infrastructure encompasses much more than data persistently stored in data management systems. It also involves processes that can be useful during analysis, such as a Python program that computes the social influence of each customer.

In a recent project at a major telecom provider, we had to predict churn in the presence of structured and unstructured data residing in different systems, relational and non-relational. For this project, a predictive model had to be designed and implemented taking into account the many possible variables (features) characterizing the customer (demographics, interactions with the call center, emails, social data, etc.). The goal was to equip the data scientist with a simple tool that would allow her to choose and experiment in an ad hoc manner with multiple tabular views of customer-related data. We wanted to create a "virtual data desktop", where schema designers (IT people) could rapidly map customers' attributes, and data scientists could simply define (possibly in a polyglot manner) transformations over attributes and combine them into dataframes. The evaluation of dataframes should be efficient and based on a solid theoretical framework.

Data integration can be seen as constructing a data warehouse, or creating a virtual database [6]. It is worth mentioning that defining global views over heterogeneous data sources is not a big data-era issue and has been extensively discussed in the past (e.g. [1]). While data warehousing was the way to go in the past – mainly due to the dominance of relational systems in data management – there are well-thought arguments to reconsider a virtual database approach, a trend rapidly emerging in the business world as data virtualization [9]. There are needs to accommodate data regulations, manage schemas in an agile manner, integrate (parts of) data sources rapidly, and perform ad hoc data preparation without the rigidity of data warehouses. All these requirements can be served well with a virtual approach. The focus of this era is on "schematic flexibility/versatility" rather than "querying performance". Similar arguments can be found in [10], which proposes a data virtualization approach, and in Polystores [7].

In this paper we discuss our vision for the creation of a data virtual machine, a graph-based conceptual model which is built bottom-up.
The high-level goals of this work are:

Support for end-to-end processing. This is a well-known research mandate for the data analysis pipeline [8], stating the need for the "development of multiple tools, each solving some piece of the raw-data-to-knowledge puzzle, which can be seamlessly integrated and be easy to use for both lay and expert users." Seamless integration requires a high-level conceptual layer where data, processes and models can be easily mapped and manipulated.

Coping with diversity in the data management landscape. This is another well-known research mandate [4], [8]. Multiple big data systems and analysis platforms need to coexist, and query support spanning such systems necessitates that platforms are integrated and federated. While data warehousing is the de facto approach, it is too rigid for a rapidly changing data environment.

Redefining data infrastructure. An organization's data infrastructure includes data stored persistently (possibly adhering to different data models) but also programs that produce output useful in the analysis phase. Both data and programs should be treated as first-class citizens, and both should be mapped in a high-level conceptual model.

Polyglotism. A data scientist can choose from a range of functions/methods in different programming languages to perform a specific task (e.g. extraction of sentiment, scoring, aggregation). She should be able to use these in the same query, and the query evaluation engine should handle them efficiently.

Visual schema management and query formulation. Data scientists do not necessarily understand relational modeling or know SQL, both of which can become quite complex for large schemas. They want to see visualizations that they can understand and explore; they need to visually define transformations over attributes (variables); they need to navigate the schema to choose inputs for learning algorithms; and they want to easily extend the schema using wizards. Such needs lead to the requirement of a model that natively supports visual exploration and schema management and is amenable to the development of graphical query languages. Graph-based models lend themselves to GUI implementations more than other models.

Shareable. The model or parts of it must be easily available to third parties within or outside the enterprise, since the value of data explodes when the latter can be linked with other data [8].

Crawlable. Feature selection is a well-known process in statistics: the data scientist selects and places attributes in a dataframe. We need an automated way to generate such attributes related to specific entities. For this, we need a model that supports crawling, e.g. starting from an entity, an algorithm collects or defines relevant attributes. The web is an example of a crawlable model.
Figure 1: DVM-modeling vs traditional ER-modeling

2 DATA VIRTUAL MACHINES

A Data Virtual Machine describes entities and their attributes in a graphical way, much like the traditional Entity-Relationship Model (ER). A conceptual model like the ER is simple to understand, succinct, and depicts entities at a higher level. However, developing a conceptual model as a DVM is the reverse of the process followed in a traditional ER design: while ER uses a top-down approach, a DVM uses a bottom-up approach, from existing data – stored in model-specific implementations – back to a conceptual model.¹ Figure 1 contrasts DVM-modeling with traditional ER-modeling.

¹ We note that the conceptual model that the DVM follows is not the ER model, but an ER-like model, i.e. a model based on the notions of entities and attributes. As such, it can also be characterized as an RDF-like or network-like model. For simplicity, in this paper we make references to the ER model only.

In the past there has been some interest in bottom-up approaches for the construction of an RDF graph from the data. Some of these focus on the creation of RDF views on top of relational data, e.g. [12]. In this case, there is already a user-defined mapping between the relational schema and a target ontology, which is employed for the creation of a representation of relational schema concepts in terms of RDF classes and properties. The work in [11] considers the same problem, i.e. given a relational database schema and its integrity constraints, a mapping to an OWL ontology is produced, which provides the basis for generating RDF instances. Such works are orthogonal to our vision, as they assume that the starting point is a given relational schema, and the goal is to translate this schema into RDF (via the employment of ontologies). The notion of DVM that we discuss focuses on, first, creating an integrated conceptual model that can accommodate various data models and, second, producing the conceptual schema based on the processing of the data, rather than the data itself.

The key idea in a DVM is to make it easy to add an entity, or an attribute to an entity, from a variety of data sources (relational databases, flat files, Excel files, NoSQL systems, etc.). For instance, for a customer entity, examples of attributes include her age, gender and income, but also her emails, images, locations and transactions. An attribute of an entity could have one or more values – for example, the age of a customer is a single value, but the emails of a customer can be many; in ER theory these are called multi-valued attributes. In addition, attributes can be derived. A derived attribute is an attribute whose value is produced by some computational process, i.e. there exists a process that maps one or more values to the entity. In a DVM, since we map existing data to entities, we can only have derived attributes. For example, the query "SELECT custID, age FROM Customers" can be used to bind an age to the customer entity (using the primary key of the entity, custID). The computational process that "defines" the attribute (in this case, the SQL statement) accompanies, as semantics, the edge connecting the entity and the attribute. In this way, one can semantically represent any data processing task onto the conceptual model. Examples include the SQL statement mentioned above, but also a MongoDB query, a Cypher query, programs that read from a flat or an Excel file, even programs that assign a churn probability to a customer. The only requirement is that the data process maps one or more values to an entity, i.e. has a two-column (id, value) output. An important observation is that this computation can be intra- or inter-organizational. Figure 2 shows additional attributes for the customer entity (for simplicity we draw attributes with a solid line rather than a dashed line, as in traditional ER design).

Figure 2: A customer entity with several attributes
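To make the two-column (id, value) requirement concrete, here is a minimal Python sketch (ours, not part of any DVM specification) of two such data processing tasks; the database path, the `Customers` table and the `model` scorer are hypothetical stand-ins:

```python
import sqlite3

def age_dpt(db_path):
    """A DPT wrapping the SQL statement of the text: it emits the
    two-column (custID, age) output that a DVM edge requires."""
    with sqlite3.connect(db_path) as conn:
        # Any computation qualifies, as long as it maps values to entity ids.
        yield from conn.execute("SELECT custID, age FROM Customers")

def churn_dpt(model, customers):
    """A program (not a query) is an equally valid DPT: it maps each
    custID to a churn probability -- again a stream of (id, value) pairs."""
    for c in customers:                       # iterable of dicts (hypothetical)
        yield (c["custID"], model.predict(c))  # `model` is any scorer (hypothetical)
```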
Let us assume that all entities have a primary key (a quite realistic assumption in most real-life implementations), so an entity (rectangle) can be represented by its primary key, which is also an attribute. In Figure 3, the customer entity is represented by the custID attribute. The transactions of a customer (consisting of transIDs) form another, multi-valued attribute of the customer entity, but at the same time transactions are an entity themselves, with their own set of attributes; this means that there is no need for relationships, as in traditional ER theory. This is also shown in Figure 3.

Finally, let us consider once again the query "SELECT custID, age FROM Customers". While this query maps an age to a custID, it also maps one or more custIDs to a specific age value. In other words, a data processing task with an output {(u, v) : u ∈ U, v ∈ V} (multiset semantics) provides two mappings, one from U to V and one from V to U. This means that edges in a DVM graph are bidirectional (Figure 3). In that respect, all nodes in this graph are equal, i.e. there is no hierarchy, and all connections are symmetrical, i.e. there are no primary keys. However, one can consider a node with degree > 1 as a "primary" key, shown in a different color. A data virtual machine is a graph-based arrangement of data processing tasks whose outputs are pairs of values, namely mappings between two attribute domains.

Figure 3: A simple DVM example

Definition 2.1 (Key-list Structure). A key-list structure K is a set of (key, list) pairs, K = {(k, L_k)}, where L_k is a list of elements or the special value null, and for all (k_1, L_{k_1}), (k_2, L_{k_2}) ∈ K, k_1 ≠ k_2. Both keys and the elements of the lists are strings. □

Definition 2.2 (Data Virtual Machine). A Data Virtual Machine (DVM) is a (multi)graph constructed as follows:

• Assume n attributes A_1, A_2, ..., A_n drawn from domains D_1, D_2, ..., D_n respectively. Each attribute becomes a node in the graph.
• Assume one or more data processing tasks (DPTs), where each DPT P has as output a multiset S = {(u, v) : u ∈ D_i, v ∈ D_j}. Based on S, one can define two key-list structures, denoted KL_ij(S) and KL_ji(S), as:
  K = {k : (k, v) ∈ S} (a set),
  ∀k ∈ K, L_k = [v : (k, v) ∈ S] (a list),
  KL_ij(S) = {(k, L_k) : k ∈ K}.
KL_ji(S) is defined similarly, treating as key the second constituent of the value pairs of S. For P we define two edges A_i → A_j and A_j → A_i, labeled with KL_ij(S) and KL_ji(S) respectively. □

Example 2.3. Assume the SQL query "SELECT custID, transID FROM Customers", which maps transactions to customers and vice versa. The attributes, edges and the respective key-list structures are shown in Figure 4. □

Figure 4: Key-list structures to represent edges of DVMs
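To illustrate Definitions 2.1 and 2.2, the following is a minimal Python sketch (ours) that derives both key-list structures from a DPT's output; dictionaries of lists stand in for the sets of (key, list) pairs:

```python
from collections import defaultdict

def key_list(S, key_pos=0):
    """Build a key-list structure from a multiset S of (u, v) pairs.
    key_pos=0 yields KL_ij(S); key_pos=1 yields KL_ji(S) (Definition 2.2)."""
    kl = defaultdict(list)
    for pair in S:
        k, v = pair[key_pos], pair[1 - key_pos]
        kl[str(k)].append(str(v))  # keys and elements are strings (Definition 2.1)
    return dict(kl)

# The query of Example 2.3 might output the multiset:
S = [("c1", "t1"), ("c1", "t2"), ("c2", "t3")]
print(key_list(S, 0))  # {'c1': ['t1', 't2'], 'c2': ['t3']}          custID -> transIDs
print(key_list(S, 1))  # {'t1': ['c1'], 't2': ['c1'], 't3': ['c2']}  transID -> custID
```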
The next section discusses why the concept of the DVM is an appropriate high-level model for a big data environment.

3 CHALLENGES AND OPPORTUNITIES

DVMs allow the agile construction of graph-based schemas over existing data within an organization. We discuss below how DVMs contribute to the high-level goals set out in Section 1.

3.1 Model-agnostic Data Sharing and Exchange

Any computation that produces as output a collection of pairs (programs, queries, web services) can be represented in a data virtual machine as an edge between two nodes. In real-world environments people need to share parts of spreadsheets, flat files, JSON documents or relations, and usually specific columns of these. This involves Excel formulas, database columns, or flat-file fields. In most cases sharing is done manually, by exporting to a CSV file and moving the file around; there is no principled way to describe these exchanges formally in an intermediate representation. DVMs can become the medium for data sharing in a standardized, collaborative, distributed manner. For example, a data engineer can select a pair of columns in an Excel file and represent them as nodes in the DVM, via some wizard that generates the necessary computation. The data scientist can then use these within a data model. This is very similar to what ETL and visualization tools do, when the output of the ETL task is binary (i.e. two columns).

3.2 Visual Query Formulation – Dataframes

What kind of queries can we have on top of DVMs? There is a large number of research papers on visual query formulation over ERs, dating back to the 80s, that are relevant here. But let us consider what data scientists usually do, since they are the target group of this work. They usually form dataframes in Python, R or Spark. A dataframe is a table that is built incrementally, column-by-column. The first column(s) is some kind of key (customer ID, transaction ID, etc.) and the remaining ones are "attached" to the dataframe via a left-outer join on the key, denoting related "attributes". These columns may come from different data sources and can be transformed before being "glued" to the dataframe. A dataframe provides a tabular representation of an entity and usually serves as input to ML algorithms. We have extensively discussed this class of queries (termed multi-feature queries, not dataframes), both in terms of syntax (by proposing SQL extensions [2]) and evaluation (by proposing a relational operator relying on parallel processing and in-memory evaluation techniques [3]). It is important to facilitate this process in a simple, intuitive, visual manner.

One can easily pick and transform attributes (possibly along a path) to form a dataframe over a DVM. For example, using the DVM of Figure 2, one may want to form a dataframe using the custID as key (first column) and, as additional columns, the customer's age, gender, the average sentiment of her comments containing the keyword "google", the count of her friends and the total amount of her transactions in May 2019. Graphically, the user selects a node as the key of the dataframe and one or more of that node's descendants as the additional columns. Aggregation is used to reduce multi-valued nodes to a single value. Figure 5 shows a visual representation of this query.

Figure 5: Visual representation of a dataframe query
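As an illustration only, this dataframe could be assembled in pandas roughly as follows; `customers`, `comments`, `friends`, `transactions` and `sentiment_fn` are hypothetical stand-ins for already-fetched DVM edges and a polyglot transformation:

```python
import pandas as pd

# Key column plus single-valued attributes.
df = customers[["custID", "age", "gender"]]

# Multi-valued attributes are transformed and aggregated first ...
sentiment = (comments[comments["text"].str.contains("google")]
             .assign(score=lambda d: d["text"].map(sentiment_fn))
             .groupby("custID")["score"].mean().rename("avg_sentiment"))
friend_cnt = friends.groupby("custID").size().rename("friend_count")
txn_total = (transactions[transactions["month"] == "2019-05"]
             .groupby("custID")["amount"].sum().rename("txn_total"))

# ... and then "glued" to the dataframe via left-outer joins on the key.
for col in (sentiment, friend_cnt, txn_total):
    df = df.merge(col.reset_index(), how="left", on="custID")
```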
The research questions focus on (i) what kind of dataframe queries one can express on top of a DVM, (ii) visual interfaces, and (iii) how a system can efficiently evaluate these queries. Regarding (i), there exists a well-defined grammar that specifies what is allowed and what is not (omitted here). In terms of efficient evaluation, a dataframe query is a tree rooted at a node of the DVM. Recall from Section 2 that edges in a DVM correspond to key-list structures. One can define a set of operators having key-list structures as input and output, thus forming an algebra over key-list structures. For example, a filtering operator could take a key-list structure and a condition θ and filter the list of each key based on this condition, producing a new key-list structure. Another operator could take as input a key-list structure and a function and apply the function to each member of each list (mapping). A dataframing operator takes two or more key-list structures and joins them on the key, unioning the lists of matching keys. As a result, dataframe queries can be expressed, optimized and evaluated within an algebraic framework.
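A minimal sketch (ours; the names are illustrative, not a fixed API) of these three operators over the dictionary representation of key-list structures used earlier:

```python
def kl_filter(kl, theta):
    """Filtering: keep in each key's list only the elements satisfying theta."""
    return {k: [v for v in lst if theta(v)] for k, lst in kl.items()}

def kl_map(kl, f):
    """Mapping: apply a function f to each member of each list."""
    return {k: [f(v) for v in lst] for k, lst in kl.items()}

def kl_dataframe(*kls):
    """Dataframing: join two or more key-list structures on the key,
    unioning the lists of matching keys (null where a key is missing)."""
    keys = set().union(*(kl.keys() for kl in kls))
    return {k: sum((kl.get(k, [None]) for kl in kls), []) for k in keys}
```

A dataframe query then becomes a composition of such operators, which is what makes algebraic optimization possible.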
3.3 Polyglot Data Preparation

The dataframe example of Section 3.2 involves a function written in some programming language (Python) to compute the sentiment of each comment. For a different attribute transformation in the same query, we could use a function in R. Finally, a user-defined aggregate function could be in yet another programming language. The query evaluation engine should support this kind of polyglotism within the algebraic framework. For example, while key-list structures could materialize within a single key-value system, the set of operators manipulating these structures could be implemented (redundantly) in different programming languages, i.e. all operators could be implemented in both R and Python, and the query engine would select the PL-specific version of the operator(s) to apply, depending on the function used.

3.4 Accommodating Data Regulations

The EU General Data Protection Regulation (GDPR), driven by privacy concerns, dictates that the data generated by the activity of an individual using a service can be deleted or exploited by the individual. Thus, one can ask the service owner to hand over all of her data. For example, a user could request her activity data from Facebook, Google or Wal-Mart. One question is in what format these data will be handed to her; another is how the user will create her data portfolio, i.e. how she will represent and integrate these data, and in which model: relational? Semi-structured? Graph-based? Some sort of self-service data integration is necessary. The model-agnostic exchange and integration capability of DVMs can naturally serve this cause. The last question is what the user can do with these data. Can she give them to a credit bureau to obtain a specific evaluation? People already discuss micro-evaluation services on specific datasets. Also, she could simply sell her data. For this, the data model (or part of it) has to be shareable, e.g. available by a link. The DVM seems a good candidate to model, represent and share personal data. It is a graph-based conceptual model, focused on entities and attributes. Given a single entity, people easily understand the concept of an attribute: my age, my emails, my transactions, etc. A conceptual model also makes visualization easier, and is thus appropriate for some kind of self-service data integration.

3.5 Data Virtualization

Data virtualization is a relatively new business trend [5]. Companies like Denodo, Oracle, SAS and others already offer relevant products. Data virtualization is closely related to mediators and virtual databases, if not a reinvention of these. According to Wikipedia, "data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data. Data virtualization may also be considered as an alternative to ETL and data warehousing. It is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage." Existing platforms of this kind usually implement a relational model. A DVM provides a virtual layer where the data engineer can easily map data and processes related to an entity. In this respect, it can be considered a data virtualization platform.

3.6 Model-specific Database Instantiations

A data virtual machine is a conceptual model. While in a traditional database design the data model is predefined and determines the storage model, in a conceptual design one can create database instances in different data models (e.g. relational, semi-structured, multi-dimensional) – and possibly use that model's query language to run queries on top of the instance. For example, one can define a collection of JSON documents rooted on CustID for one application (containing the customer's transactions within each document), while another user can define a collection of JSON documents rooted on TransID. Recall the research question posed in Section 3.4, regarding the format in which an individual's data is delivered under GDPR compliance. Using the DVM approach, the service owner can instantiate a database containing the individual's data in the data model the user prefers.
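As a sketch of such an instantiation (ours; the attribute names are hypothetical), one could materialize a JSON collection rooted on CustID from the key-list structures of the edges around that node:

```python
import json

def instantiate_json(root, edges):
    """Materialize one JSON document per key of the root node, nesting each
    edge's list under its attribute name. `edges` maps attribute names to
    key-list structures reachable from the root, e.g. {"age": ..., "transID": ...}."""
    keys = set().union(*(kl.keys() for kl in edges.values()))
    return [json.dumps({root: k, **{a: kl.get(k) for a, kl in edges.items()}})
            for k in sorted(keys)]

# A different user could call instantiate_json("transID", ...) to obtain a
# collection rooted on transactions instead -- same DVM, different instance.
```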
4 CONCLUSIONS

We introduce a graph-based model to depict the data and data processing tasks of an organization's data infrastructure at a conceptual layer. We argue that this abstraction is useful in a plethora of analytics tasks performed by analysts and data engineers alike. We are currently developing the operators of the algebra over key-list structures in Python. Dataframe queries are translated to an algebraic expression, and a simple (unoptimized) plan is generated. Key-list structures are handled by Redis; Neo4j is used to store DVMs. We are also developing a tool called DataMingler that allows the management of data sources and the respective DVM, as well as query formulation, in a visual manner.

REFERENCES

[1] Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. 2001. Global Viewing of Heterogeneous Data Sources. IEEE Trans. Knowl. Data Eng. 13, 2 (2001), 277–297. https://doi.org/10.1109/69.917566
[2] Damianos Chatziantoniou. 1999. The PanQ Tool and EMF SQL for Complex Data Management. In Proceedings of ACM SIGKDD, 1999. 420–424.
[3] Damianos Chatziantoniou, Michael Akinde, Ted Johnson, and Samuel Kim. 2001. The MD-Join: An Operator for Complex OLAP. In IEEE International Conference on Data Engineering. 524–533.
[4] Damianos Chatziantoniou and Florents Tselai. 2014. Introducing Data Connectivity in a Big Data Web. In Proceedings of the Third Workshop on Data Analytics in the Cloud, DanaC 2014. 7:1–7:4. https://doi.org/10.1145/2627770.2627773
[5] Denodo. 2019. Data Virtualization: The Modern Data Integration Solution. White Paper.
[6] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann. http://research.cs.wisc.edu/dibook/
[7] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44, 2 (2015), 11–16. https://doi.org/10.1145/2814710.2814713
[8] Daniel Abadi et al. 2016. The Beckman Report on Database Research. Commun. ACM 59, 2 (2016), 92–99. https://doi.org/10.1145/2845915
[9] Gartner. 2018. Data Virtualization Market Guide. White Paper.
[10] Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, and Anastasia Ailamaki. 2015. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.
[11] Juan F. Sequeda, Marcelo Arenas, and Daniel P. Miranker. 2012. On Directly Mapping Relational Databases to RDF and OWL. In WWW. 649–658.
[12] Vania Maria P. Vidal et al. 2013. Incremental Maintenance of RDF Views of Relational Data. In On the Move to Meaningful Internet Systems.