An Evaluation of VIG with the BSBM Benchmark

Davide Lanti, Guohui Xiao, and Diego Calvanese
Free University of Bozen-Bolzano, Italy

Abstract. We present an experimental evaluation of VIG, a data scaler for OBDA benchmarks. Data scaling is a relatively recent approach, proposed in the database community, that allows for scaling an input data instance to s times its size, while preserving certain application-specific characteristics. A data scaler is a “general” generator, in the sense that it can be reused on different database schemas and that users are not required to manually input the data characteristics. VIG lifts the scaling approach from the database level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. To evaluate VIG, in this paper we use it to generate data for the Berlin SPARQL Benchmark (BSBM) and compare it with the official BSBM data generator.

1 Introduction

An important research problem in Big Data is how to provide end-users with transparent access to the data, abstracting from storage details. In Ontology-based Data Access (OBDA) [6], a solution is realized by presenting the data stored in a relational database to the end-users as a virtual RDF graph over which SPARQL queries can be posed. This solution is realized through mappings that link classes and properties in the ontology to queries over the database. Benchmarking of OBDA systems requires scalability analyses that take into account data instances of increasing volumes. Such instances are often provided by generators of synthetic data, which are either complex schema-specific implementations or require considerable manual input from the end-user. Either approach can be challenging in a setting like OBDA, where database schemas tend to be particularly big and complex (e.g., 70 tables, some with more than 80 columns, in [3]).
Data scaling [7] is a recent approach that tries to overcome this problem by automatically tuning the generation parameters through statistics collected over an initial data instance. Hence, the same generator can be reused in different contexts, as long as an initial data instance is available. A measure of quality for the produced data is defined in terms of the results for the available queries, which should be similar to those observed for real data of comparable volume. In the context of OBDA, taking an initial data instance as the only generation parameter does not produce data of acceptable quality, since the data has to comply with constraints deriving from the structure of the mappings and the ontology, which in turn derive from the application domain. VIG is a data scaler for OBDA benchmarks designed to address these limitations. In the VIG system, the scaling approach is lifted from the instance level to the OBDA level by analyzing the structure of the mapping component. VIG is efficient and suitable for generating huge amounts of data, as tuples are generated in constant time without the need to retrieve previously generated values. Furthermore, the generation can be parallelized up to the number of columns in the schema, without communication overhead. The system is maintained by the Ontop team [2], is released [5] under the Apache 2.0 license, and comes with documentation in the form of Wiki pages.

The rest of the paper is structured as follows: in Section 2, we describe the similarity measures according to which VIG scales data. In Section 3, we evaluate VIG over the BSBM benchmark and compare it against the official BSBM instance generator. Section 4 concludes the paper.

2 Data Scaling for OBDA Benchmarks: VIG Approach

The data scaling problem [7] is the problem of producing, starting from an initial dataset D over a schema Σ, a dataset D′ which is also over Σ and similar to D, but s times its size.
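VIG's actual generation algorithm is described in [4]; the following is only a minimal sketch, under our own simplifying assumptions, of why constant-time, lookup-free generation is possible. If each row index is mapped to a distinct value by a fixed bijection (here, multiplication modulo the column size by a number coprime to it, a hypothetical choice for illustration), then the i-th value of a column can be computed directly from i, with no record of previously generated values, and columns can be filled independently in parallel.

```python
from math import gcd

def make_column_generator(n_values, multiplier, prefix="val"):
    """Map a row index to a distinct column value in constant time.

    Because row i is mapped directly to a value, tuples can be produced
    in any order, without storing previously generated values, which is
    what makes per-column parallel generation possible.  The specific
    permutation used here is illustrative, not VIG's.
    """
    # gcd(multiplier, n_values) == 1 guarantees i -> (i * multiplier) % n_values
    # is a bijection on {0, ..., n_values - 1}, i.e., values never repeat.
    assert gcd(multiplier, n_values) == 1

    def value_at(i):
        return f"{prefix}{(i * multiplier) % n_values}"

    return value_at

gen = make_column_generator(10, 7)
print([gen(i) for i in range(10)])  # a permutation of val0 .. val9
```

Each call to `value_at` touches only its own index, so two processes generating disjoint index ranges (or different columns) never need to communicate.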
Concerning the size, similarly to other approaches, VIG scales each table in D by a factor of s. The notion of similarity, instead, is application-based. Since our goal is benchmarking, we define similarity in terms of the execution times for the queries in the benchmark. In VIG we slightly modify the scaling problem to make it better suited for the OBDA context: in addition to D, VIG also takes the mappings as input and uses them to estimate the workload for the OBDA system. This allows for a more realistic and OBDA-tuned generation. In order to generate “similar” instances, VIG adopts the following three similarity measures. More details can be found in [4].

Primary and Foreign Key Constraints. VIG is, to the best of our knowledge, the only data scaler able to generate in constant time tuples that satisfy multi-attribute primary keys for weakly-identified entities (i.e., entities that cannot be uniquely identified by their attributes alone). The current implementation of VIG does not support multi-attribute foreign keys.

Column-based Duplicates and NULL Ratios. These respectively measure the ratio of duplicates and of NULLs in a given column. We consider them meaningful similarity measures, as they are common parameters for the cost estimation performed by query planners in databases. These measures are maintained in D′ for all non-fixed-domain columns (i.e., columns in which the number of distinct values depends on s). Fixed-domain columns will only contain values observed in D. Common cases of fixed-domain columns can be automatically detected by VIG through mapping analysis, and additional ones can be manually specified.

Cost of Joins between Semantically Related Columns. Through mapping analysis, VIG identifies those columns for which a join operation is semantically meaningful (i.e., columns for which a join could occur during the evaluation of a user query).
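To make the two column-based measures concrete, the sketch below computes them from a column sample. The names and exact formulas are our illustrative reading of the measures, not VIG's published definitions.

```python
def column_stats(values):
    """Duplicate ratio and NULL ratio of a column.

    duplicate ratio: fraction of non-distinct entries among non-NULL ones
                     (0.0 means all non-NULL values are distinct);
    NULL ratio:      fraction of NULL (None) entries.
    Illustrative definitions, not necessarily VIG's exact ones.
    """
    n = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    dup_ratio = 1 - distinct / len(non_null) if non_null else 0.0
    null_ratio = (n - len(non_null)) / n
    return dup_ratio, null_ratio

dup, null = column_stats(["a", "a", "b", None])
# dup == 1 - 2/3 (one repeated value among three non-NULLs), null == 0.25
```

A scaler that preserves these two ratios in the scaled column gives the database's query planner cardinality estimates comparable to those it would derive on real data of the same volume.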
Generating data that guarantees the correct selectivity for these joins is crucial in order to deliver a realistic evaluation. In the NPD Benchmark, for instance, the SQL join between the pair of key columns realizing individuals for the classes ShallowWellbore and Exploration returns an empty result on D. In fact, these two classes are declared to be disjoint in the ontology. VIG can generate data that preserves the selectivity for this join, and therefore that satisfies the disjointness constraint specified over the ontology.

3 Evaluation with the BSBM Benchmark

The well-known Berlin SPARQL Benchmark (BSBM) [1] comes with a set of parametric queries, an ontology, mappings, an automated test platform, and an ad-hoc data generator (GEN) that can generate data according to a scale parameter given in terms of the number of products. The queries contain placeholders that the testing platform instantiates with values chosen from the generated ones. We used the two generators to create six data instances, denoted as BSBM-s-g, where s = 1, 10, 100 indicates the scale factor with respect to an initial data instance of 10000 products (produced by GEN), and g ∈ {VIG, GEN} indicates the generator used to produce the instance. The details of the experiment, as well as the material needed for reproducing it, can be found online [5]. The experiments were run on an HP Proliant server with 2 Intel Xeon X5690 CPUs (24 cores @3.47GHz), 106 GB of RAM, and a 1 TB 15K RPM HD. The OS is Ubuntu 12.04 LTS.

[Fig. 1: Generation Time and Memory Comparison — time in sec. and memory in MB, for BSBM-1, BSBM-10, and BSBM-100, VIG vs. GEN]

Resources Consumption Comparison. Figure 1 shows the resources (time and memory) used by the two generators for creating the instances.
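One simple way to see how join selectivity can be preserved at generation time is to plan, before generating, which integer value intervals two semantically related columns may draw from. The sketch below is our own illustration under this interval-based assumption, not VIG's published algorithm: the scaled columns share exactly s times the observed overlap, so a join that was empty on D (as for the disjoint classes ShallowWellbore and Exploration) stays empty on the scaled instance.

```python
def plan_value_sets(distinct_a, distinct_b, shared, s):
    """Plan integer value intervals for two joinable columns.

    distinct_a, distinct_b: distinct values of each column in D;
    shared: size of the intersection of their value sets in D;
    s: scale factor.
    Returns, for each column, the (common, exclusive) intervals to draw
    from, so that the scaled value sets overlap in exactly shared * s
    values.  An illustrative sketch, not VIG's exact algorithm.
    """
    sh = shared * s                            # scaled size of the shared set
    a_end = sh + (distinct_a - shared) * s     # end of A-exclusive interval
    b_end = a_end + (distinct_b - shared) * s  # end of B-exclusive interval
    common = range(0, sh)
    a_only = range(sh, a_end)
    b_only = range(a_end, b_end)
    # Column A draws from common + a_only, column B from common + b_only,
    # so a join between them can match only on the sh common values.
    return (common, a_only), (common, b_only)

# Disjoint classes (shared = 0): the scaled join is still guaranteed empty.
(common_a, a_only), (common_b, b_only) = plan_value_sets(100, 80, 0, 10)
```

With `shared = 0` the common interval is empty, while each column still receives its full scaled number of distinct values, so both the disjointness and the per-column statistics survive scaling.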
For both generators the execution time grows approximately linearly with the scale factor, which suggests that the generation of a single column value is in both cases independent of the size of the data instance to be generated. Observe that GEN is on average 5 times faster than VIG, but it also requires increasingly more memory as the amount of data to generate increases, contrary to VIG, which always requires the same amount of memory. We point out that VIG supports parallel generation up to the number of columns in the schema for D, so we expect the execution time to be substantially lower on a multi-node machine.

Benchmark Queries Comparison. The top portion of Table 1 compares the execution times for the queries in the BSBM benchmark evaluated over the instances produced by VIG and GEN. Queries were run on the Ontop OBDA system [2], over a MySQL back-end. The testing platform of the BSBM benchmark instantiates the queries with concrete values coming from configuration files produced by GEN. This does not allow for a fair comparison between the two generators, because it is biased towards the specific values produced by GEN. To run a fair comparison, we use the testing platform of the NPD benchmark, which is independent of the specific generator used and instantiates the queries only with values found in the provided database instance. Finally, we slightly cleaned the mappings by removing redundancies, and modified the queries by removing the LIMIT modifiers or relaxing excessively restrictive filter conditions. Since these modifications do not affect the size of the produced SQL translation, the tested queries are at least as hard as the original ones.

Deviation for Predicates Growth. The bottom part of Table 1 shows the deviation, in terms of the number of elements for each predicate (class, object or data property) in the ontology, between the instances generated by VIG and those generated by GEN.
The first column reports the average deviation, and the last two columns report the absolute number and the relative percentage of predicates for which the deviation was greater than 5%.

Table 1: Query Evaluation Results, BSBM Benchmark.

Comparison over BSBM Benchmark Queries (10 runs, 1 warm-up run)

db             avg(ex time)  avg(out time)  avg(res size)  qmpH  avg(mix time) [σ²]
               msec.         msec.                                msec.
BSBM-1-GEN     87            6              1425           4285  840 [101.747]
BSBM-1-VIG     77            3              841            4972  724 [85.387]
BSBM-10-GEN    628           29             11175          608   5916 [444.489]
BSBM-10-VIG    681           29             13429          563   6388 [606.057]
BSBM-100-GEN   6020          271            122169         63    56620 [5946.65]
BSBM-100-VIG   6022          212            83875          64    56117 [4508.37]

Predicates Growth Results

type   db-scale   avg(dev)   dev > 5% (absolute)  dev > 5% (relative)
CLASS  BSBM-1     0%         0                    0%
CLASS  BSBM-10    23.72%     2                    25%
CLASS  BSBM-100   250.74%    2                    25%
OBJ    BSBM-1     0%         0                    0%
OBJ    BSBM-10    7.46%      2                    20%
OBJ    BSBM-100   82.35%     2                    20%
DATA   BSBM-1     < 0.01%    0                    0%
DATA   BSBM-10    2.84%      2                    6.67%
DATA   BSBM-100   5.74%      2                    6.67%

The high deviation in the CLASS and OBJ rows for instances of scale factors 10 and 100 is due to a small number of outliers whose elements are built from tables that GEN, contrary to VIG, does not scale according to the scale factor.

4 Conclusion and Development Plan

In this work we evaluated VIG on the task of generating data for the BSBM benchmark. More precisely, we measured how similar the data produced by VIG is to that produced by the native BSBM generator, obtaining encouraging results. The current work plan is to improve the quality of the produced data by adding more similarity measures to the generation process, such as multi-attribute foreign keys, non-uniform distributions, or joint-degree distributions [7]. Unfortunately, we expect that some of these measures will conflict with constant-time tuple generation (e.g., joint-degree distributions require access to previously generated tuples).
Acknowledgment. This paper is supported by the EU project Optique FP7-318338.

References

1. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. on Semantic Web and Information Systems 5(2), 1–24 (2009)
2. Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web J. (2016), to appear
3. Lanti, D., Rezk, M., Xiao, G., Calvanese, D.: The NPD benchmark: Reality check for OBDA systems. In: Proc. of EDBT. pp. 617–628. OpenProceedings.org (2015)
4. Lanti, D., Xiao, G., Calvanese, D.: Fast and simple data scaling for OBDA benchmarks. In: Proc. of BLINK (2016), to appear
5. Lanti, D., Xiao, G., Calvanese, D.: VIG. https://github.com/ontop/vig (2016)
6. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. J. on Data Semantics X, 133–173 (2008)
7. Tay, Y., Dai, B.T., Wang, D.T., Sun, E.Y., Lin, Y., Lin, Y.: UpSizeR: Synthetically scaling an empirical relational database. Information Systems 38(8), 1168–1183 (2013)