Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

Damianos Chatziantoniou (Athens University of Economics and Business, damianos@aueb.gr)
Verena Kantere (National Technical University of Athens, verena@dblab.ece.ntua.gr)

ABSTRACT

In this paper we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and then utilized in the production of a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. With the term "data infrastructure" we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce output useful in decision making. For example, a Python program that "does something" and computes for each customer her probability to churn is an essential component of the organization's data landscape and has to be made available to the user, e.g. a data scientist, in a manner as easy to understand and intuitive to use as the age or gender of a customer. In fact, a DVM depicts only associations between attributes (nodes). An association is given by some computation on the underlying data that relates values of these attributes. In this respect, a DVM is model-agnostic; nevertheless, it can be quite helpful in a variety of crucial tasks carried out by statisticians and data engineers.

KEYWORDS

data virtualization, data virtual machines, big data infrastructures

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Modern organizations collect, store and analyze a wealth of data from different sources and applications, used in a variety of data analysis projects, such as traditional BI, data exploration and data mining, to provide a competitive advantage to the business. This data has to be integrated to provide the data scientist with a "holistic" view of the enterprise's data infrastructure. The term data infrastructure encompasses much more than data persistently stored in data management systems. It also involves processes that can be useful during analysis, such as a Python program that computes the social influence of each customer.

In a recent project at a major telecom provider, we had to predict churn in the presence of structured and unstructured data residing in different systems, relational and non-relational. For this project, a predictive model had to be designed and implemented taking into account the many possible variables (features) characterizing the customer (demographics, interactions with the call center, emails, social data, etc.). The goal was to equip the data scientist with a simple tool that would allow her to choose and experiment in an ad hoc manner with multiple tabular views of customer-related data. We wanted to create a "virtual data desktop", where schema designers (IT people) could rapidly map customers' attributes, and data scientists could simply define (possibly in a polyglot manner) transformations over attributes and combine them into dataframes. The evaluation of dataframes should be efficient and based on a solid theoretical framework.

Data integration can be seen as constructing a data warehouse, or creating a virtual database [6]. It is worth mentioning that defining global views over heterogeneous data sources is not a big data-era issue and has been extensively discussed in the past (e.g. [1]). While data warehousing was the way to go in the past – mainly due to the dominance of relational systems in data management – there are well-thought arguments to reconsider a virtual database approach, a trend rapidly emerging in the business world as data virtualization [9]. There are needs to accommodate data regulations, manage schemas in an agile manner, integrate (parts of) data sources rapidly, and perform ad hoc data preparation without the rigidity of data warehouses. All these requirements can be served well with a virtual approach. The focus of this era is on "schematic flexibility/versatility" rather than "querying performance". Similar arguments can be found in [10], which proposes a data virtualization approach, and in Polystores [7].

In this paper we discuss our vision for the creation of a data virtual machine, a graph-based conceptual model which is built bottom-up.
The high-level goals of this work are:

Support for end-to-end processing. This is a well-known research mandate for the data analysis pipeline [8], stating the need for the "development of multiple tools, each solving some piece of the raw-data-to-knowledge puzzle, which can be seamlessly integrated and be easy to use for both lay and expert users." Seamless integration requires a high-level conceptual layer where data, processes and models can be easily mapped and manipulated.

Coping with diversity in the data management landscape. This is another well-known research mandate [4], [8]. Multiple big data systems and analysis platforms need to coexist, and query support spanning such systems necessitates that platforms are integrated and federated. While data warehousing is the de facto approach, it is too rigid for a rapidly changing data environment.

Redefining data infrastructure. An organization's data infrastructure includes data stored persistently (possibly adhering to different data models) but also programs that produce output useful in the analysis phase. Both data and programs should be treated as first-class citizens, and both should be mapped in a high-level conceptual model.

Polyglotism. A data scientist can choose from a range of functions/methods in different programming languages to perform a specific task (e.g. extraction of sentiment, scoring, aggregation). She should be able to use these in the same query, and the query evaluation engine should handle them efficiently.

Visual schema management and query formulation. Data scientists do not necessarily understand relational modeling or know SQL, both of which can become quite complex for large schemas. They want to see visualizations that they can understand and explore; they need to visually define transformations over attributes (variables); they need to navigate the schema to choose inputs for learning algorithms; and they want to easily extend the schema using wizards. Such needs lead to the requirement of a model that natively supports visual exploration and schema management and is amenable to the development of graphical query languages. Graph-based models lend themselves to GUI implementations more than other models.

Shareable. The model or parts of it must be easily available to third parties within or outside the enterprise, since the value of data explodes when the latter can be linked with other data [8].

Crawlable. Feature selection is a well-known process in statistics: the data scientist selects and places attributes in a dataframe. We need an automated way to generate such attributes related to specific entities. For this, we need a model that supports crawling, e.g. starting from an entity, an algorithm collects or defines relevant attributes. The web is an example of a crawlable model.
Figure 1: DVM-modeling vs traditional ER-modeling

2 DATA VIRTUAL MACHINES

A Data Virtual Machine describes entities and their attributes in a graphical way, much like the traditional Entity-Relationship Model (ER). A conceptual model like the ER is simple to understand, succinct, and depicts entities at a higher level. However, developing a conceptual model as a DVM is the reverse of the process followed in a traditional ER design: while ER uses a top-down approach, a DVM uses a bottom-up approach, from existing data – stored in model-specific implementations – back to a conceptual model.¹ Figure 1 contrasts DVM-modeling with traditional ER-modeling.

¹ We note that the conceptual model that the DVM follows is not the ER model, but an ER-like model, i.e. a model based on the notions of entities and attributes. As such, it can also be characterized as an RDF-like or network-like model. For simplicity, in this paper we make references to the ER model only.

In the past there has been some interest in bottom-up approaches for the construction of an RDF graph from the data. Some of these focus on the creation of RDF views on top of relational data, e.g. [12]. In this case, there is already a user-defined mapping between the relational schema and a target ontology, which is employed for the creation of a representation of relational schema concepts in terms of RDF classes and properties. The work in [11] considers the same problem, i.e. given a relational database schema and its integrity constraints, a mapping to an OWL ontology is produced, which provides the basis for generating RDF instances. Such works are orthogonal to our vision, as they assume that the starting point is a given relational schema, and the goal is to translate this schema into RDF (via the employment of ontologies). The notion of DVM that we discuss focuses on, first, creating an integrated conceptual model that can accommodate various data models and, second, producing the conceptual schema based on the processing of the data, rather than the data itself.

The key idea in a DVM is to make it easy to add an entity, or an attribute to an entity, from a variety of data sources (relational databases, flat files, Excel files, NoSQL systems, etc.). For instance, for a customer entity, examples of attributes include her age, gender and income, but also her emails, images, locations and transactions. An attribute of an entity could have one or more values – for example, the age of a customer is a single value, but the emails of a customer can be many; in ER theory these are called multi-valued attributes. In addition, attributes can be derived. A derived attribute is an attribute whose value is produced by some computational process, i.e. there exists a process that maps one or more values to the entity. In a DVM, since we map existing data to entities, we can only have derived attributes. For example, the query "SELECT custID, age FROM Customers" can be used to bind an age to the customer entity (using the primary key of the entity, custID). The computational process that "defines" the attribute (in this case, the SQL statement) accompanies, as semantics, the edge connecting the entity and the attribute. In this way, one can semantically represent any data processing task onto the conceptual model. Examples include the SQL statement mentioned above, but also a MongoDB query, a Cypher query, programs that read from a flat or an Excel file, even programs that assign a churn probability to a customer. The only requirement is that the data process maps one or more values to an entity, i.e. has a two-column (id, value) output. An important observation is that this computation can be intra- or inter-organizational. Figure 2 shows additional attributes for the customer entity (for simplicity we draw attributes with a solid line rather than a dashed line, as in traditional ER design).

Figure 2: A customer entity with several attributes
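To make the two-column (id, value) requirement concrete, here is a minimal Python sketch (ours, not part of any DVM specification) of two such data processing tasks; the database path, the `Customers` table and the `model` scorer are hypothetical stand-ins:

```python
import sqlite3

def age_dpt(db_path):
    """A DPT wrapping the SQL statement of the text: it emits the
    two-column (custID, age) output that a DVM edge requires."""
    with sqlite3.connect(db_path) as conn:
        # Any computation qualifies, as long as it maps values to entity ids.
        yield from conn.execute("SELECT custID, age FROM Customers")

def churn_dpt(model, customers):
    """A program (not a query) is an equally valid DPT: it maps each
    custID to a churn probability -- again a stream of (id, value) pairs."""
    for c in customers:                       # iterable of dicts (hypothetical)
        yield (c["custID"], model.predict(c))  # `model` is any scorer (hypothetical)
```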
Let us assume that all entities have a primary key (a quite realistic assumption in most real-life implementations), so an entity (rectangle) can be represented by its primary key, which is also an attribute. In Figure 3, the customer entity is represented by the custID attribute. The transactions of a customer (consisting of transIDs) form another, multi-valued attribute of the customer entity, but at the same time transactions are an entity themselves, with their own set of attributes; this means that there is no need for relationships, as in traditional ER theory. This is also shown in Figure 3.

Finally, let us consider once again the query "SELECT custID, age FROM Customers". While this query maps an age to a custID, it also maps one or more custIDs to a specific age value. In other words, a data processing task with an output {(u, v) : u ∈ U, v ∈ V} (multiset semantics) provides two mappings, one from U to V and one from V to U. This means that edges in a DVM graph are bidirectional (Figure 3). In that respect, all nodes in this graph are equal, i.e. there is no hierarchy, and all connections are symmetrical, i.e. there are no primary keys. However, one can consider a node with degree > 1 as a "primary" key, shown in a different color. A data virtual machine is a graph-based arrangement of data processing tasks whose outputs are pairs of values, namely mappings between two attribute domains.

Figure 3: A simple DVM example

Definition 2.1 (Key-list Structure). A key-list structure K is a set of (key, list) pairs, K = {(k, L_k)}, where L_k is a list of elements or the special value null, and for all (k_1, L_{k_1}), (k_2, L_{k_2}) ∈ K, k_1 ≠ k_2. Both keys and the elements of the lists are strings. □

Definition 2.2 (Data Virtual Machine). A Data Virtual Machine (DVM) is a (multi)graph constructed as follows:

• Assume n attributes A_1, A_2, ..., A_n drawn from domains D_1, D_2, ..., D_n respectively. Each attribute becomes a node in the graph.
• Assume one or more data processing tasks (DPTs), where each DPT P has as output a multiset S = {(u, v) : u ∈ D_i, v ∈ D_j}. Based on S, one can define two key-list structures, denoted KL_ij(S) and KL_ji(S), as:
  K = {k : (k, v) ∈ S} (a set),
  ∀k ∈ K, L_k = [v : (k, v) ∈ S] (a list),
  KL_ij(S) = {(k, L_k) : k ∈ K}.
KL_ji(S) is defined similarly, treating as key the second constituent of the value pairs of S. For P we define two edges A_i → A_j and A_j → A_i, labeled with KL_ij(S) and KL_ji(S) respectively. □

Example 2.3. Assume the SQL query "SELECT custID, transID FROM Customers", which maps transactions to customers and vice versa. The attributes, edges and the respective key-list structures are shown in Figure 4. □

Figure 4: Key-list structures to represent edges of DVMs
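To illustrate Definitions 2.1 and 2.2, the following is a minimal Python sketch (ours) that derives both key-list structures from a DPT's output; dictionaries of lists stand in for the sets of (key, list) pairs:

```python
from collections import defaultdict

def key_list(S, key_pos=0):
    """Build a key-list structure from a multiset S of (u, v) pairs.
    key_pos=0 yields KL_ij(S); key_pos=1 yields KL_ji(S) (Definition 2.2)."""
    kl = defaultdict(list)
    for pair in S:
        k, v = pair[key_pos], pair[1 - key_pos]
        kl[str(k)].append(str(v))  # keys and elements are strings (Definition 2.1)
    return dict(kl)

# The query of Example 2.3 might output the multiset:
S = [("c1", "t1"), ("c1", "t2"), ("c2", "t3")]
print(key_list(S, 0))  # {'c1': ['t1', 't2'], 'c2': ['t3']}          custID -> transIDs
print(key_list(S, 1))  # {'t1': ['c1'], 't2': ['c1'], 't3': ['c2']}  transID -> custID
```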
The next section discusses why the concept of the DVM is an appropriate high-level model for a big data environment.

3 CHALLENGES AND OPPORTUNITIES

DVMs allow the agile construction of graph-based schemas over existing data within an organization. We discuss below how DVMs contribute to the high-level goals set out in Section 1.

3.1 Model-agnostic Data Sharing and Exchange

Any computation that produces as output a collection of pairs (programs, queries, web services) can be represented in a data virtual machine as an edge between two nodes. In real-world environments people need to share parts of spreadsheets, flat files, JSON documents or relations, and usually specific columns of these. This involves Excel formulas, database columns, or flat-file fields. In most cases sharing is done manually, by exporting to a CSV file and moving the file around; there is no principled way to describe these exchanges formally in an intermediate representation. DVMs can become the medium for data sharing in a standardized, collaborative, distributed manner. For example, a data engineer can select a pair of columns in an Excel file and represent them as nodes in the DVM, via some wizard that generates the necessary computation. The data scientist can then use these within a data model. This is very similar to what ETL and visualization tools do, when the output of the ETL task is binary (i.e. two columns).

3.2 Visual Query Formulation – Dataframes

What kind of queries can we have on top of DVMs? There is a large number of research papers on visual query formulation over ERs, dating back to the 80s, that are relevant here. But let us consider what data scientists usually do, since they are the target group of this work. They usually form dataframes in Python, R or Spark. A dataframe is a table that is built incrementally, column-by-column. The first column(s) is some kind of key (customer ID, transaction ID, etc.) and the remaining ones are "attached" to the dataframe via a left-outer join on the key, denoting related "attributes". These columns may come from different data sources and can be transformed before being "glued" to the dataframe. A dataframe provides a tabular representation of an entity and usually serves as input to ML algorithms. We have extensively discussed this class of queries (termed multi-feature queries, not dataframes), both in terms of syntax (by proposing SQL extensions [2]) and evaluation (by proposing a relational operator relying on parallel processing and in-memory evaluation techniques [3]). It is important to facilitate this process in a simple, intuitive, visual manner.

One can easily pick and transform attributes (possibly along a path) to form a dataframe over a DVM. For example, using the DVM of Figure 2, one may want to form a dataframe using the custID as key (first column) and, as additional columns, the customer's age, gender, the average sentiment of her comments containing the keyword "google", the count of her friends and the total amount of her transactions in May 2019. Graphically, the user selects a node as the key of the dataframe and one or more of that node's descendants as the additional columns. Aggregation is used to reduce multi-valued nodes to a single value. Figure 5 shows a visual representation of this query.

Figure 5: Visual representation of a dataframe query
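As an illustration only, this dataframe could be assembled in pandas roughly as follows; `customers`, `comments`, `friends`, `transactions` and `sentiment_fn` are hypothetical stand-ins for already-fetched DVM edges and a polyglot transformation:

```python
import pandas as pd

# Key column plus single-valued attributes.
df = customers[["custID", "age", "gender"]]

# Multi-valued attributes are transformed and aggregated first ...
sentiment = (comments[comments["text"].str.contains("google")]
             .assign(score=lambda d: d["text"].map(sentiment_fn))
             .groupby("custID")["score"].mean().rename("avg_sentiment"))
friend_cnt = friends.groupby("custID").size().rename("friend_count")
txn_total = (transactions[transactions["month"] == "2019-05"]
             .groupby("custID")["amount"].sum().rename("txn_total"))

# ... and then "glued" to the dataframe via left-outer joins on the key.
for col in (sentiment, friend_cnt, txn_total):
    df = df.merge(col.reset_index(), how="left", on="custID")
```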
The research questions focus on (i) what kind of dataframe queries one can express on top of a DVM, (ii) visual interfaces, and (iii) how a system can efficiently evaluate these queries. Regarding (i), there exists a well-defined grammar that specifies what is allowed and what is not (omitted here). In terms of efficient evaluation, a dataframe query is a tree rooted at a node of the DVM. Recall from Section 2 that edges in a DVM correspond to key-list structures. One can define a set of operators having key-list structures as input and output, thus forming an algebra over key-list structures. For example, a filtering operator could take a key-list structure and a condition θ and filter the list of each key based on this condition, producing a new key-list structure. Another operator could take as input a key-list structure and a function and apply the function to each member of each list (mapping). A dataframing operator takes two or more key-list structures and joins them on the key, unioning the lists of matching keys. As a result, dataframe queries can be expressed, optimized and evaluated within an algebraic framework.
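A minimal sketch (ours; the names are illustrative, not a fixed API) of these three operators over the dictionary representation of key-list structures used earlier:

```python
def kl_filter(kl, theta):
    """Filtering: keep in each key's list only the elements satisfying theta."""
    return {k: [v for v in lst if theta(v)] for k, lst in kl.items()}

def kl_map(kl, f):
    """Mapping: apply a function f to each member of each list."""
    return {k: [f(v) for v in lst] for k, lst in kl.items()}

def kl_dataframe(*kls):
    """Dataframing: join two or more key-list structures on the key,
    unioning the lists of matching keys (null where a key is missing)."""
    keys = set().union(*(kl.keys() for kl in kls))
    return {k: sum((kl.get(k, [None]) for kl in kls), []) for k in keys}
```

A dataframe query then becomes a composition of such operators, which is what makes algebraic optimization possible.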
3.3 Polyglot Data Preparation

The dataframe example of Section 3.2 involves a function written in some programming language (Python) to compute the sentiment of each comment. For a different attribute transformation in the same query, we could use a function in R. Finally, a user-defined aggregate function could be in yet another programming language. The query evaluation engine should support this kind of polyglotism within the algebraic framework. For example, while key-list structures could materialize within a single key-value system, the set of operators manipulating these structures could be implemented (redundantly) in different programming languages, i.e. all operators could be implemented in both R and Python, and the query engine would select the PL-specific version of the operator(s) to apply, depending on the function used.

3.4 Accommodating Data Regulations

The EU General Data Protection Regulation (GDPR), driven by privacy concerns, dictates that the data generated by the activity of an individual using a service can be deleted or exploited by the individual. Thus, one can ask the service owner to hand over all of her data. For example, a user could request her activity data from Facebook, Google or Wal-Mart. One question is in what format these data will be handed to her; another is how the user will create her data portfolio, i.e. how she will represent and integrate these data, and in which model: relational? Semi-structured? Graph-based? Some sort of self-service data integration is necessary. The model-agnostic exchange and integration capability of DVMs can naturally serve this cause. The last question is what the user can do with these data. Can she give them to a credit bureau to obtain a specific evaluation? People already discuss micro-evaluation services on specific datasets. Also, she could simply sell her data. For this, the data model (or part of it) has to be shareable, e.g. available by a link. The DVM seems a good candidate to model, represent and share personal data. It is a graph-based conceptual model, focused on entities and attributes. Given a single entity, people easily understand the concept of an attribute: my age, my emails, my transactions, etc. A conceptual model also makes visualization easier, and is thus appropriate for some kind of self-service data integration.

3.5 Data Virtualization

Data virtualization is a relatively new business trend [5]. Companies like Denodo, Oracle, SAS and others already offer relevant products. Data virtualization is closely related to mediators and virtual databases, if not a reinvention of these. According to Wikipedia, "data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data. Data virtualization may also be considered as an alternative to ETL and data warehousing. It is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage." Existing platforms of this kind usually implement a relational model. A DVM provides a virtual layer where the data engineer can easily map data and processes related to an entity. In this respect, it can be considered a data virtualization platform.

3.6 Model-specific Database Instantiations

A data virtual machine is a conceptual model. While in a traditional database design the data model is predefined and determines the storage model, in a conceptual design one can create database instances in different data models (e.g. relational, semi-structured, multi-dimensional) – and possibly use that model's query language to run queries on top of the instance. For example, one can define a collection of JSON documents rooted on CustID for one application (containing the customer's transactions within each document), while another user can define a collection of JSON documents rooted on TransID. Recall the research question posed in Section 3.4, regarding the format in which an individual's data is delivered under GDPR compliance. Using the DVM approach, the service owner can instantiate a database containing the individual's data in the data model the user prefers.
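As a sketch of such an instantiation (ours; the attribute names are hypothetical), one could materialize a JSON collection rooted on CustID from the key-list structures of the edges around that node:

```python
import json

def instantiate_json(root, edges):
    """Materialize one JSON document per key of the root node, nesting each
    edge's list under its attribute name. `edges` maps attribute names to
    key-list structures reachable from the root, e.g. {"age": ..., "transID": ...}."""
    keys = set().union(*(kl.keys() for kl in edges.values()))
    return [json.dumps({root: k, **{a: kl.get(k) for a, kl in edges.items()}})
            for k in sorted(keys)]

# A different user could call instantiate_json("transID", ...) to obtain a
# collection rooted on transactions instead -- same DVM, different instance.
```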
4 CONCLUSIONS

We introduce a graph-based model to depict the data and data processing tasks of an organization's data infrastructure at a conceptual layer. We argue that this abstraction is useful in a plethora of analytics tasks performed by analysts and data engineers alike. We are currently developing the operators of the algebra over key-list structures in Python. Dataframe queries are translated to an algebraic expression, and a simple (unoptimized) plan is generated. Key-list structures are handled by Redis; Neo4j is used to store DVMs. We are also developing a tool called DataMingler that allows the management of data sources and the respective DVM, as well as query formulation, in a visual manner.

REFERENCES

[1] Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. 2001. Global Viewing of Heterogeneous Data Sources. IEEE Trans. Knowl. Data Eng. 13, 2 (2001), 277–297. https://doi.org/10.1109/69.917566
[2] Damianos Chatziantoniou. 1999. The PanQ Tool and EMF SQL for Complex Data Management. In Proceedings of ACM SIGKDD, 1999. 420–424.
[3] Damianos Chatziantoniou, Michael Akinde, Ted Johnson, and Samuel Kim. 2001. The MD-Join: An Operator for Complex OLAP. In IEEE International Conference on Data Engineering. 524–533.
[4] Damianos Chatziantoniou and Florents Tselai. 2014. Introducing Data Connectivity in a Big Data Web. In Proceedings of the Third Workshop on Data Analytics in the Cloud, DanaC 2014. 7:1–7:4. https://doi.org/10.1145/2627770.2627773
[5] Denodo. 2019. Data Virtualization: The Modern Data Integration Solution. White Paper.
[6] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann. http://research.cs.wisc.edu/dibook/
[7] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44, 2 (2015), 11–16. https://doi.org/10.1145/2814710.2814713
[8] Daniel Abadi et al. 2016. The Beckman Report on Database Research. Commun. ACM 59, 2 (2016), 92–99. https://doi.org/10.1145/2845915
[9] Gartner. 2018. Data Virtualization Market Guide. White Paper.
[10] Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, and Anastasia Ailamaki. 2015. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.
[11] Juan F. Sequeda, Marcelo Arenas, and Daniel P. Miranker. 2012. On Directly Mapping Relational Databases to RDF and OWL. In WWW. 649–658.
[12] Vania Maria P. Vidal et al. 2013. Incremental Maintenance of RDF Views of Relational Data. In On the Move to Meaningful Internet Systems.