
Parameter Curation and Data Generation for Benchmarking Multi-model Queries


Chao Zhang
Supervised by Jiaheng Lu
University of Helsinki, Finland

Unlike traditional database management systems, which are organized around a single data model, a multi-model database is designed to support multiple data models against a single, integrated backend. For instance, document, graph, relational, and key-value models are examples of data models that may be supported by a multi-model database. As more and more platforms are proposed to deal with multi-model data, it becomes important to have benchmarks that can be used to evaluate the performance and usability of the next generation of multi-model database systems. In this paper, we discuss the motivations and challenges for benchmarking multi-model databases, and then present our current research on data generation and parameter curation for benchmarking multi-model queries. Our benchmark can be found at http://udbms.cs.helsinki.fi/bench/.

1. INTRODUCTION

Recently, a new trend [12, 11, 13, 3] has emerged in data management, namely the multi-model approach, which aims to use a single platform to manage data in different models, e.g., key-value, document, table, and graph. Compared to the polyglot persistence technology in the NoSQL world, which entails managing separate data stores to satisfy various use cases, the multi-model approach has been considered the next generation of data management technology, combining flexibility, scalability, and consistency.

The multi-model query is a unique operation in multi-model databases that allows users to retrieve multi-model data with a single query language. Figure 1 depicts an example of a typical multi-model query in social commerce [17]: for a given person p (id=56) and product brand b ("Nike"), find p's friends who have bought products of brand b, and return their feedback that contains product reviews with a 5-star rating. This query involves three data models: customers with friends (Graph), orders embedded with an item list (JSON), and customers' feedback (Relation).
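As an illustration only, the example query can be evaluated over small in-memory stand-ins for the three models. The sample values follow Figure 1. The assumption that person 56 is a friend of persons 33, 101, and 145, and the names `FRIENDS`, `ORDERS`, `FEEDBACK`, and `friends_feedback`, are hypothetical and are not part of the benchmark or of any database API.

```python
# Toy stand-ins for the three data models in Figure 1 (not the benchmark's API).
# Assumption: person 56 is connected to persons 33, 101, and 145 by friend edges.
FRIENDS = {56: {33, 101, 145}}                      # Graph: adjacency sets

ORDERS = [                                          # JSON: orders with embedded items
    {"id": 1, "customer_id": 33, "total_price": 135,
     "items": [{"product_id": 85, "brand": "Nike"},
               {"product_id": 86, "brand": "Adidas"}]},
]

FEEDBACK = [                                        # Relation: (custID, productID, rating)
    (33, 85, 5), (56, 86, 4), (101, 87, 4),
    (145, 87, 3), (145, 88, 5), (145, 89, 4),
]


def friends_feedback(person_id, brand, rating=5):
    """Feedback rows with `rating` left by friends of `person_id` on `brand` products they ordered."""
    friends = FRIENDS.get(person_id, set())                       # Graph step
    bought = {(o["customer_id"], item["product_id"])              # JSON step
              for o in ORDERS if o["customer_id"] in friends
              for item in o["items"] if item["brand"] == brand}
    return [row for row in FEEDBACK                               # Relation step
            if (row[0], row[1]) in bought and row[2] == rating]


print(friends_feedback(56, "Nike"))  # [(33, 85, 5)]
```

Each step touches exactly one model, which is why the size of each intermediate result can differ so much between parameter values.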

[Figure 1: An example of a multi-model query in social commerce. The figure shows three data models: a social graph of Person nodes (_id 101, 33, 145, 56) connected by friend edges, an Order (JSON) document, and a Feedback (Relation) table.]

Order (JSON):
{ "id": 1, "customer_id": 33, "total_price": 135, "items": [ {"product_id": 85, "brand": "Nike"}, {"product_id": 86, "brand": "Adidas"} ] }

Feedback (Relation):
custID  productID  Rating
33      85         5
56      86         4
101     87         4
145     87         3
145     88         5
145     89         4

Database benchmarks have been an essential tool for the evaluation and comparison of DBMSs since the advent of the Wisconsin benchmark [5] in the early 1980s. Since then, many database benchmarks have been proposed by academia and industry for various evaluation goals, such as the TPC-X series for RDBMSs and data warehouses, the OO7 benchmark [2] for object-oriented DBMSs, and XMark [16] for XML DBMSs. More recently, the NoSQL and big data movements of the late 2000s brought the next generation of benchmarks, such as the YCSB benchmark [4] for cloud serving systems, the LDBC [6] and Rbench [15] benchmarks for graph and RDF DBMSs, and the BigBench benchmark [8] for big data systems. Unfortunately, these benchmarks are not well suited to the evaluation of multi-model databases because they do not consider multiple data models, e.g., multi-model storage, multi-model query processing, and multi-model query evaluation. Motivated by this, my Ph.D. dissertation will focus on the holistic evaluation of multi-model databases. In general, there are three main challenges in evaluating multi-model databases:

First, existing data generators that involve only a single model cannot be directly adopted to evaluate multi-model databases, and how to design meaningful data models that mimic most multi-model applications remains an open question. In this regard, we develop a new data generator to generate correlated data in diverse data models, including Graph, JSON, XML, key-value, and tabular. Furthermore, to simulate real-life data distributions, we propose a three-phase framework to generate the data in the scenario of social commerce. Our data generator also scales well because it is implemented on top of Hadoop and Spark, which enables us to generate the data in parallel.

[Figure 2: Runtime (secs) versus substitution parameters (Pair A, Pair B). Pair A: @PersonId=33, @BrandName="Adidas".]

The second benchmarking challenge is the problem of parameter curation [9], whose goal is to select the substitution parameters for a multi-model query template so that they yield stable runtime behavior. The rationale is that different parameter values for the same query template can result in high runtime variance. For instance, in Figure 1, PersonId/56 and BrandName/"Nike", shown in orange in the AQL query, are two substitution parameters that can be replaced by other values for the example query template. Figure 2 illustrates our experimental results with two different pairs of substitution parameters on two representative multi-model databases: ArangoDB [1] and OrientDB [14]. As shown in Figure 2, the two pairs lead to opposite conclusions when comparing the performance of ArangoDB and OrientDB. Interestingly, we observed that the query runtime mainly depends on which data model dominates. For example, pair A involves relatively larger intermediate results in JSON, while pair B involves a larger portion of the Graph. Therefore, the problem of parameter curation for benchmarking multi-model queries requires answering three interesting questions: (i) how to select parameters from the data model perspective, (ii) how to cover different workloads concerning the data models, and (iii) how to guarantee a stable distribution of substitution parameters. In light of this, we formalize this problem as top-k parameter groups curation, and then propose a new algorithm, MJFast, to select the ideal parameter groups.

The third challenge concerns the metrics of the benchmark. As expected, metrics are needed both for evaluating the multi-model dataset (e.g., how closely the data mimic real heterogeneous datasets) and for the multi-model queries (e.g., to what extent the queries capture diverse multi-model patterns). However, the disparity between data models in data structure and workload complexity is a major hurdle when trying to define these new metrics. For the dataset evaluation metrics, we intend to use dataset coherence and relationship specialty [15], as well as multi-model complexity. Regarding the metrics for multi-model queries, we define a unified metric that characterizes query processing with respect to the data models. For instance, the metric can be used either to measure the cost of a nested-loop join for the relational model or to assess the cost of shortest-path matching for the graph model.

The rest of this paper is organized as follows. Section 2 presents an overview of our approach. Section 3 introduces the methods and techniques for data generation. Section 4 gives our method for parameter curation. Section 5 shows the preliminary experimental results. Finally, Section 6 summarizes this paper and outlines our future work.

2. OVERVIEW OF OUR APPROACH

Figure 3 gives an overview of our benchmarking approach, which consists of three key components for evaluating multi-model queries. The metadata in the repository is first passed to the Data Generation component (Section 3), which generates the data in a unified multi-model form using our data generator. Next, the Workload Generation component generates the multi-model queries against the data models. These multi-model queries consist of a set of complex read-only queries that involve at least two data models, aiming to cover different business cases and technical perspectives. More specifically, as for business cases, these queries fall into four main layers [10]: individual, conversation, community, and commerce. Within these four layers, commonly used business cases at different granularities are covered. Regarding technical perspectives, these queries are designed based on the choke-point concept, which combines the usual technical challenges of processing data in multiple data models, ranging from conjunctive queries (OLTP) to analysis (OLAP) workloads. The final part is the Parameter Curation component (Section 4). It first characterizes the multi-model query by identifying the parameters and the corresponding data models involved. Then the model vectors corresponding to each parameter value are generated. Finally, the top-k parameters for the multi-model query are selected by the proposed MJFast algorithm.

3. DATA GENERATION

Data generation is the cornerstone of our benchmark and comprises two main parts: social network generation and e-commerce data generation. The former generates the social graph, including the person entities and knows relations, as well as their activities such as posts, comments, and likes. This generation is based on the LDBC SNB [6] data generator, a representative tool for generating social network data with rich semantics and scalability. The latter generates the e-commerce data. Specifically, we propose a three-phase framework to generate the transactions by taking into account a person's interests, friendships, and social engagement. The three phases are purchase, propagation-purchase, and re-purchase in the context of social commerce.

Purchase. In this phase, we consider two factors when generating the transaction data. First, persons usually buy products based on their interests. Second, persons with more interests are more likely to buy products than others. This phase is implemented on top of Spark SQL using Scala, which provides plentiful APIs and UDFs to output the various models simultaneously without any additional operations. Consequently, our data include five models: social network (Graph), vendor and feedback (Relation), order (JSON), invoice (XML), and product (Key-value).
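The two factors of the purchase phase can be sketched in a few lines. This is a plain-Python toy, not the Spark SQL/Scala implementation described above. The rule that the purchase probability equals the number of interests divided by `max_interests`, and all names and sample data, are assumptions made for illustration.

```python
import random


def generate_purchases(interests, catalog, max_interests=5, seed=42):
    """Toy purchase phase.

    interests: dict person_id -> set of interest tags (hypothetical input).
    catalog:   dict interest tag -> list of product ids (hypothetical input).
    A person buys with probability len(interests) / max_interests (capped at 1),
    and only buys a product that matches one of their interests.
    """
    rng = random.Random(seed)          # local generator keeps the sketch reproducible
    purchases = []
    for person, tags in sorted(interests.items()):
        usable = sorted(t for t in tags if catalog.get(t))
        if not usable:
            continue                   # no interest maps to a product
        buy_probability = min(1.0, len(tags) / max_interests)
        if rng.random() < buy_probability:
            tag = rng.choice(usable)
            purchases.append((person, rng.choice(catalog[tag])))
    return purchases


interests = {33: {"running", "football"}, 56: {"running"}, 101: set(),
             145: {"running", "football", "tennis", "golf", "yoga"}}
catalog = {"running": [85, 87], "football": [86], "tennis": [88], "golf": [89]}
purchases = generate_purchases(interests, catalog)
print(purchases)
```

In this toy, person 145 has five interests and therefore always buys, while person 101 has none and never does.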

Propagation-Purchase. In this phase, we incorporate two ingredients from the previous data generation: (i) the person's basic demographic data, e.g., gender, age, and location; (ii)

[Figure 3: Overview of the benchmarking approach. Labels in the figure: Metadata Repository; LDBC; Interests, Purchase; Friendships, Propagation-Purchase; CLVSC Model, Re-Purchase; Multi-Model Data (Graph, Relation, JSON, XML, Key-value); Multi-Model Query (Choke Points, Business Factors); Metrics.]

where $\sum_k k \cdot \Pr(R_{ui} = k \mid A = a_u)$ is the expectation of the probability distribution of the target user u's rating on the target item i, and $A = \{a_1, a_2, \ldots, a_m\}$ is the user attribute set; we compute this part with the naive Bayes method. The latter part, $E(R_{vi} : \forall v \in N(u))$, is the expectation of the rating distribution of u's friends on the target item, in which $N(u)$ is the friend set of user u, and the item i comes from the purchase transactions of friends.
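The two parts described above can each be computed on toy data as follows. This is a sketch under stated assumptions: the Laplace smoothing, the sample history, and the function names are hypothetical. The two parts are kept separate because the text here does not show how they are combined.

```python
from collections import Counter, defaultdict

RATINGS = (1, 2, 3, 4, 5)


def expected_rating_naive_bayes(history, attributes, smoothing=1.0):
    """Return sum_k k * Pr(R = k | A = attributes) under a naive Bayes model.

    history: list of (attribute_tuple, rating) pairs for the target item (toy input).
    Laplace smoothing is an assumption of this sketch.
    """
    rating_counts = Counter(r for _, r in history)
    value_counts = defaultdict(Counter)      # (position, rating) -> Counter of values
    domains = defaultdict(set)               # position -> distinct attribute values
    for attrs, r in history:
        for j, a in enumerate(attrs):
            value_counts[(j, r)][a] += 1
            domains[j].add(a)

    scores = {}
    for k in RATINGS:
        # Prior Pr(R = k), smoothed over the five rating values.
        score = (rating_counts[k] + smoothing) / (len(history) + smoothing * len(RATINGS))
        for j, a in enumerate(attributes):
            # Likelihood Pr(a_j | R = k), smoothed over the domain of attribute j.
            domain_size = len(domains[j] | {a})
            score *= (value_counts[(j, k)][a] + smoothing) / (
                rating_counts[k] + smoothing * domain_size)
        scores[k] = score
    total = sum(scores.values())             # normalize to a posterior distribution
    return sum(k * s / total for k, s in scores.items())


def friends_expected_rating(friend_ratings):
    """Mean of the ratings that the friends N(u) gave to the item."""
    return sum(friend_ratings) / len(friend_ratings)


history = [(("f", "20s"), 5), (("f", "20s"), 5), (("m", "30s"), 2), (("m", "20s"), 4)]
print(expected_rating_naive_bayes(history, ("f", "20s")))
print(friends_expected_rating([5, 4, 3]))
```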

Re-Purchase. To make fine-grained predictions by incorporating the customer's social activities, we propose a new probabilistic model, CLVSC (Customer Lifetime Value in Social Commerce), to generate the transactions based on the history of the customer's purchases and social activities:

$$\mathrm{CLVSC}_{ib} = E(X \mid n^{*}, x', n, m, \ldots) \times \bigl(E(M \mid p, q, \ldots, m_x, x) + E(S \mid s, \ldots)\bigr) \qquad (2)$$

where i and b are the customer and brand index, respectively; $E(X \mid \cdot)$ is the expected number of behaviors and $E(M \mid \cdot)$ is the expected monetary value, whose parameters are those of the beta-geometric/beta-binomial model [7]; and $E(S \mid \cdot)$ is the expected number of the customer's social activities, whose parameters are those of the Poisson-gamma model.

4. PARAMETER CURATION

4.1 Preliminaries

In this section, we describe the preliminaries for the parameter curation problem. We assume that the selection of parameters for a multi-model query should guarantee the following properties: (i) the query results should correspond to the involved data models and their combinations, (ii) the size of the involved data models should be bounded in each class, and (iii) the selected parameters should cover different classes in the whole parameter space.

To satisfy property (i), we propose a vector-based approach to represent parameter values from the multi-model perspective. Specifically, we compute the sizes of all intermediate results corresponding to a parameter value based on the permutation of data models. For instance, given the query in Section 1 and a parameter pair (p, b), we compute a non-zero vector (G, J, GJ, GJR), where G stands for Graph, J for JSON, and R for Relation; GJ refers to the combination of the Graph and JSON models, i.e., persons who are p's friends and have bought products of brand b. This method allows us to represent the model-oriented results independently of the databases. The definition of the model vector is as follows:

Definition 1. Model Vector: In a multi-model query, each model vector is defined as $\{c_1, \ldots, c_k, \ldots, c_n\}$, where $c_k$ is the k-th intermediate result size against the involved data models or their combinations, and $c_n$ is the final result size. The length of the model vector is in the range $[3, 2^m - 1]$, where m is the number of data models.
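A model vector for the example query of Section 1 can be computed as in the sketch below. The Figure 1 sample values are reused, while the friend edges of person 56, the second order (customer 77), and the name `model_vector` are hypothetical additions for illustration. Counting distinct persons for J and GJ is likewise an assumption of this sketch.

```python
def model_vector(person_id, brand, friends, orders, feedback, rating=5):
    """Return the intermediate result sizes (G, J, GJ, GJR) for one parameter pair."""
    g = friends.get(person_id, set())                       # G: friends of p
    bought = {(o["customer_id"], item["product_id"])
              for o in orders for item in o["items"] if item["brand"] == brand}
    j = {customer for customer, _ in bought}                # J: buyers of brand b
    gj = g & j                                              # GJ: friends who bought b
    gjr = [row for row in feedback                          # GJR: final result rows
           if row[0] in gj and (row[0], row[1]) in bought and row[2] == rating]
    return (len(g), len(j), len(gj), len(gjr))


friends = {56: {33, 101, 145}}                              # assumed friend edges
orders = [
    {"id": 1, "customer_id": 33,
     "items": [{"product_id": 85, "brand": "Nike"},
               {"product_id": 86, "brand": "Adidas"}]},
    {"id": 2, "customer_id": 77,                            # hypothetical extra order
     "items": [{"product_id": 90, "brand": "Nike"}]},
]
feedback = [(33, 85, 5), (56, 86, 4), (101, 87, 4),
            (145, 87, 3), (145, 88, 5), (145, 89, 4)]

print(model_vector(56, "Nike", friends, orders, feedback))    # (3, 2, 1, 1)
print(model_vector(56, "Adidas", friends, orders, feedback))  # (3, 1, 1, 0)
```

Because the vector only counts result sizes over the data, it does not depend on which database later executes the query.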

Regarding property (ii), we assume that a representative class in the whole parameter space has two traits: a considerable number of model vectors, and a bounded distance between these vectors. Hence, we define the qualified class as the candidate parameter group:

Definition 2. Candidate parameter group: In the parameter space, a candidate parameter group is a region with a given radius that covers at least a given minimum number of model vectors.

To fulfill property (iii), we find the k farthest candidate parameter groups. Therefore, the parameter curation problem boils down to finding the top-k candidate parameter groups with the maximum number of model vectors and the maximum distance between groups.

4.2 Problem Definition and Algorithm

We now formalize the problem as follows. Top-k Parameter Groups Curation: Given a multi-model query MQ with parameter space P, a set of N points in $\mathbb{R}^d$ in which each point is a d-dimensional model vector, and taking the distance between two groups to be the Euclidean distance between their centroids, the objective is to select k disjoint candidate parameter groups $S_k \subseteq P$ such that the score

$$S(S_k) = \alpha \sum_{1}^{k} \mathrm{Density}(S_k)/N_1 + \beta \sum_{1}^{k} \mathrm{Distance}(S_k)/N_2 \qquad (3)$$

is maximized, where $\alpha$ and $\beta$ reflect the importance of density and distance, respectively, with $\alpha + \beta = 1$. $N_1$ and $N_2$ are normalization constants that normalize the sum of densities and the sum of distances to the range from 0 to 1, respectively.
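The score can be evaluated for a given selection of groups as in the following sketch. Several choices here are assumptions, not definitions from the text: density is taken as the number of model vectors in a group, the distance term sums the Euclidean distances between all pairs of group centroids, the two weights are called `alpha` and `beta`, and the normalization constants `n1` and `n2` are supplied by the caller.

```python
import math
from itertools import combinations


def centroid(group):
    """Component-wise mean of the model vectors in a group."""
    return tuple(sum(column) / len(group) for column in zip(*group))


def score(groups, n1, n2, alpha=0.5, beta=0.5):
    """Weighted sum of normalized density and normalized inter-group distance."""
    density = sum(len(group) for group in groups)
    centroids = [centroid(group) for group in groups]
    distance = sum(math.dist(a, b) for a, b in combinations(centroids, 2))
    return alpha * density / n1 + beta * distance / n2


# Two toy groups of 2-dimensional model vectors.
groups = [[(0, 0), (0, 2)], [(3, 5), (5, 5), (4, 5)]]
print(score(groups, n1=5, n2=10))
```

With equal weights, a selection scores well only if its groups are both populous and far apart, which is the trade-off the objective encodes.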

The problem of parameter curation is non-trivial because it includes two NP-complete problems: the top-k highest dense regions problem and the top-k weighted maximum vertex cover problem. Therefore, we propose a greedy algorithm, called MJFast, to tackle this problem. The main idea of MJFast is first to gather similar candidate parameter groups into clusters and then to choose the top-k densest candidate parameter groups from each cluster. Finally, the top-k farthest groups among all groups are returned. Specifically, we propose a new data structure, the snowball, to store candidate parameter groups that are strongly close to each other. A snowball starts with an arbitrary candidate parameter group whose centroid is a model vector. Then it recursively rolls as long as other qualified centroid vectors exist within the group. Since each snowball maintains only the top-k densest groups at each iteration, the search space is reduced dramatically. In MJFast, we build a k-d tree to speed up nearest-neighbor searches, so the average search time is O(log n). In the worst case, when all points of P are within the same group, the time complexity is O(n log n).
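To make the density-then-distance idea concrete, the sketch below shows a much simplified greedy selection. It is not MJFast: it has no snowball structure, it uses a brute-force neighbor scan instead of a k-d tree, and the parameter names `radius` and `min_pts` are assumptions. It only illustrates picking dense, disjoint, mutually distant candidate groups.

```python
import math


def candidate_groups(points, radius, min_pts):
    """Each point whose radius-neighborhood holds at least min_pts points yields a group."""
    groups = []
    for center in points:
        members = frozenset(p for p in points if math.dist(center, p) <= radius)
        if len(members) >= min_pts:
            groups.append((center, members))
    return groups


def select_top_k(points, radius, min_pts, k):
    """Greedy pick: densest group first, then the disjoint group farthest from those chosen."""
    remaining = sorted(candidate_groups(points, radius, min_pts),
                       key=lambda g: (-len(g[1]), g[0]))
    chosen = []
    while remaining and len(chosen) < k:
        if not chosen:
            best = remaining[0]
        else:
            best = max(remaining,
                       key=lambda g: min(math.dist(g[0], c) for c, _ in chosen))
        chosen.append(best)
        # Keep only groups that share no model vector with the group just chosen.
        remaining = [g for g in remaining if not (g[1] & best[1])]
    return chosen


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (50, 50)]
selection = select_top_k(points, radius=1.5, min_pts=3, k=2)
for center, members in selection:
    print(center, sorted(members))
```

The brute-force scan costs O(n) per neighborhood query, which is the step a k-d tree is meant to speed up.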

5. EXPERIMENT RESULTS

5.1 Data Generation

In terms of efficiency, the experimental results suggest that the data generator can generate a 1G multi-model dataset in 5 minutes on a single 8-core machine running MapReduce and Spark in "pseudo-distributed" mode. In terms of scalability, we successfully generated a 10G dataset within 20 minutes on a cluster with three nodes.

5.2 Parameter Curation

We use two metrics to compare MJFast with the random method. First, to measure the stability of each method, we compare the KL-divergence $D_{KL}$ between two groups of parameters for a fixed k. The distribution is the discrete distribution of the dominating models. For example, when k is 5 and the two model-dominating distributions are (G: 4, J: 1, R: 0) and (G: 1, J: 1, R: 3), respectively, $D_{KL}$ is 1.11. Second, we run experiments on ArangoDB to compare the total runtime variance (TRV) between two groups of parameters. As shown in Table 1b, the parameters curated by MJFast not only yield a small $D_{KL}$ but also result in low runtime variance as k increases.
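The worked example for k = 5 can be checked with a short computation. The direction D_KL(P || Q), with P the first distribution, and the use of the natural logarithm are assumptions of this sketch; together they reproduce the value 1.11 given above. The function name is hypothetical.

```python
import math


def kl_divergence(p_counts, q_counts):
    """D_KL(P || Q) in nats for two count vectors over the same categories.

    Assumes q is non-zero wherever p is non-zero.
    """
    p_total, q_total = sum(p_counts), sum(q_counts)
    result = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p, q = p_count / p_total, q_count / q_total
        if p > 0:                      # terms with p = 0 contribute nothing
            result += p * math.log(p / q)
    return result


# Model-dominating counts (G, J, R) from the k = 5 example in the text.
print(round(kl_divergence([4, 1, 0], [1, 1, 3]), 2))  # 1.11
```

Only the Graph term contributes here: 0.8 × ln(0.8 / 0.2) ≈ 1.11, since the JSON shares are equal and the first distribution has no Relation-dominated parameters.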

5.3 Preliminary Benchmarking Results

Table 1a presents the preliminary results of benchmarking the multi-model queries on ArangoDB and OrientDB. We conducted the experiments on a quad-core Xeon E5540 server with 32GB of RAM and 500GB of disk. All benchmark queries involve at least two models; in particular, Q1, Q2, and Q5 are Graph-dominating workloads, and Q3 and Q4 are JSON-dominating workloads. The results show that, in the multi-model context, OrientDB outperforms ArangoDB on the Graph-dominating workloads, while ArangoDB is better on the JSON-dominating workloads. This also suggests that the multi-model capabilities of these two databases depend on their main models, i.e., ArangoDB is originally a document-oriented database, and OrientDB is natively a graph database.

6. CONCLUSION AND FUTURE WORK

Benchmarking multi-model databases is a challenging task because current public data and workloads cannot match well the various cases of real applications. To date, we have developed a scalable data generator that provides data in multiple data models, including Graph, JSON, XML, key-value, and tabular. The MJFast algorithm, which is proposed to address the problem of parameter curation, ensures that our performance analysis is holistic and valid.

[Table 1: (a) Mean runtime of multi-model queries (s). D_KL at k=5, k=10, k=20: 0.89, 0.19, 0.11.]

The general plan to complete my Ph.D. dissertation is to focus on the three components shown in Figure 3. First, because the data schema and the corresponding models in real applications may change, we will introduce this process into data generation. Second, we will optimize the MJFast algorithm by incorporating a sampling-based method to avoid computing over the whole parameter space. Finally, we will finalize the multi-model query templates and the unified metric, and then conduct a set of experimental studies on multi-model databases. Another extension is to investigate the ACID guarantees of multi-model transactions.

Acknowledgement

This work is supported by Academy of Finland (310321), China Scholarship and CIMO Fellowship.

[1] ArangoDB. Highly available multi-model NoSQL database. https://www.arangodb.com/, 2017.

[2] M. J. Carey, D. J. DeWitt, and J. F. Naughton. The OO7 benchmark. In ACM SIGMOD, pages 12–21, 1993.

[3] Chen, Du, Li, Lu, Zhao, and Zhou. Big data challenge: a data management perspective. Frontiers Comput. Sci., 7(2):157–164, 2013.

[4] B. F. Cooper, Silberstein, E. Tam, Ramakrishnan, and Sears. Benchmarking cloud serving systems with YCSB. In ACM SoCC, pages 143–154, 2010.

[5] D. J. DeWitt. The Wisconsin benchmark: Past, present, and future. In The Benchmark Handbook, pages 119–165. 1991.

[6] Erling, Averbuch, Larriba-Pey, Chafi, Gubichev, Prat-Perez, Pham, and P. A. Boncz. The LDBC Social Network Benchmark: Interactive Workload. In SIGMOD 2015.

[7] P. S. Fader. Customer-base analysis with discrete-time transaction data. PhD thesis, University of Auckland, 2004.

[8] Ghazal, Rabl, Hu, Raab, Poess, Crolotte, and Jacobsen. BigBench: Towards an Industry Standard Benchmark for Big Data Analytics. In ACM SIGMOD, 2013.

[9] Gubichev and Boncz. Parameter curation for benchmark queries. In TPCTC, pages 113–129, 2014.

[10] Huang and Benyoucef. From e-commerce to social commerce: A close look at design features. ECRA, 2013.

[11] Lu. Towards Benchmarking Multi-Model Databases. In CIDR, 2017.

[12] Lu and Holubova. Multi-model data management: What's new and what's next? In EDBT, 2017.

[13] Lu, Z. H. Liu, Xu, and C. Zhang. UDBMS: road to unification for multi-model data management. CoRR, abs/1612.08050, 2016.

[14] OrientDB. Distributed Multi-model and Graph Database. http://orientdb.com/orientdb/, 2017.

[15] Qiao and Z. M. Ozsoyoglu. Rbench: Application-specific RDF benchmarking. In ACM SIGMOD, 2015.

[16] Schmidt, Waas, M. L. Kersten, M. J. Carey, I. Manolescu, and Busse. XMark: A Benchmark for XML Data Management. In VLDB, pages 974–985, 2002.

[17] K. Z. Zhang. Consumer behavior in social commerce: A literature review. Decision Support Systems, 2016.