Fast and Simple Data Scaling for OBDA Benchmarks

Davide Lanti, Guohui Xiao, and Diego Calvanese
Free University of Bozen-Bolzano, Italy

Abstract. In this paper we describe VIG, a data scaler for OBDA benchmarks. Data scaling is a relatively recent approach, proposed in the database community, that allows for quickly scaling an input data instance to n times its size, while preserving certain application-specific characteristics. The advantages of the scaling approach are that the same generator is general, in the sense that it can be re-used on different database schemas, and that users are not required to manually input the data characteristics. In the VIG system, we lift the scaling approach from the pure database level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. VIG is efficient: notably, each tuple is generated in constant time. VIG has been successfully used in the NPD benchmark, but it provides a general approach that can be re-used to scale any data instance in any OBDA setting.

1 Introduction

An important research problem in Big Data is how to provide end-users with transparent access to the data, abstracting from storage details. The paradigm of Ontology-based Data Access (OBDA) [6] provides an answer to this problem that is very close to the spirit of the Semantic Web. In OBDA the data stored in a relational database is presented to the end-users as a virtual RDF graph over which SPARQL queries can be posed. This solution is realized through mappings that link classes and properties in the ontology to queries over the database. Proper benchmarking of query answering systems, such as OBDA systems, requires scalability analyses that take into account data instances of increasing volume. Such instances are often provided by generators of synthetic data.
However, such generators are either complex ad-hoc implementations working for a specific schema, or require considerable manual input by the end-user. The latter problem is exacerbated in the OBDA setting, where database schemas tend to be particularly big and complex (e.g., 70 tables, some with more than 80 columns, in [12]). This contributes to the slow creation of new benchmarks, and to the same old benchmarks being misused over long periods of time. For instance, evaluations of OBDA systems are usually performed on benchmarks originally designed for triple stores, although these two types of systems are totally different and present different challenges [12].

Data scaling [19] is a recent approach that tries to overcome this problem by automatically tuning the generation parameters through statistics collected over an initial data instance. Hence, the same generator can be reused in different contexts, as long as an initial data instance is available. A measure of quality for the produced data is defined in terms of the results for the available queries, which should be similar to those observed for real data of comparable volume.

In the context of OBDA, however, taking an initial data instance as the only parameter for generation does not produce data of acceptable quality, since the generated data has to comply with constraints deriving from the structure of the mappings and the ontology, which in turn derive from the application domain. In this work we present the VIG system, a data scaler for OBDA benchmarks that addresses these issues. In VIG, the scaling approach is lifted from the instance level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. VIG is very efficient and suitable to generate huge amounts of data, as tuples are generated in constant time, without disk accesses or the need to retrieve previously generated values.
Furthermore, different instances of VIG can be delegated to different machines, and parallelization can scale up to the number of columns in the schema, without communication overhead. Finally, VIG produces data in the form of CSV files that can be easily imported by any relational database system. VIG is a Java implementation licensed under Apache 2.0, and its source code is available on GitHub in the form of a Maven project [13]. The code is maintained by the Ontop team at the Free University of Bozen-Bolzano, and it comes with documentation in the form of Wiki pages.

We have evaluated VIG in the task of generating data for both the Berlin SPARQL Benchmark (BSBM) [2] and the NPD Benchmark [12]. For the former, we measured how the data produced by VIG compares to that produced by the native ad-hoc BSBM generator. For the latter, we provide an empirical evaluation of the impact that the most distinctive feature of VIG, namely the mappings analysis, has on the shape of the produced instances, and how it affects the measured performance of benchmark queries.

The rest of the paper is structured as follows. In Section 2, we introduce the basic notions and notation needed to understand this paper. In Section 3, we define the scaling problem and discuss important measures on the produced data that define the quality of instances in a given OBDA setting. In Section 4, we discuss the VIG algorithm, and how it ensures that data conforming to the identified measures is produced. In Section 5, we provide an empirical evaluation of VIG, in terms of both resource consumption and quality of the produced data. Sections 6 and 7 contain related work and conclusions, respectively.

2 Basic Notions and Notation

We assume that the reader has moderate knowledge of OBDA, and refer to the abundant literature on the subject, like [5]. Moreover, we assume familiarity with basic notions from probability calculus and statistics.

The W3C standard ontology language in OBDA is OWL 2 QL [14].
For the sake of conciseness, we consider here its mathematical underpinning DL-Lite_R [7]. Table 1 shows a portion of the ontology from the NPD benchmark, which is the foundation block of our running example.

Table 1: Portion of the ontology for the NPD benchmark. The first three axioms (left to right) state that the classes "ShallowWellbore", "ExplorationWellbore", and "SuspendedWellbore" are subclasses of the class "Wellbore". The fourth axiom states that the classes "ExplorationWellbore" and "ShallowWellbore" are disjoint.

ShallowWellbore ⊑ Wellbore
ExplorationWellbore ⊑ Wellbore
SuspendedWellbore ⊑ Wellbore
ExplorationWellbore ⊓ ShallowWellbore ⊑ ⊥

The W3C standard query language in OBDA is SPARQL [10], with queries evaluated under the OWL 2 QL entailment regime [11]. Intuitively, under this semantics each basic graph pattern (BGP) can be seen as a single conjunctive query (CQ) without existentially quantified variables. As in our examples we will only refer to SPARQL queries containing exactly one BGP, we will use the more concise syntax of CQs rather than the SPARQL syntax. Table 2 shows two queries used in our running example.

Table 2: Queries for our running example.

q1(y) ← Wellbore(y), shallowWellboreForField(x, y)
q2(x, n, y) ← Wellbore(x), name(x, n), completionYear(x, y)

The mapping component links predicates in the ontology to queries over the underlying relational database. To present our techniques, we need to introduce this component in a formal way. The standard W3C language for mappings is R2RML [8]; however, here we use a more concise syntax that is common in the OBDA literature. Formally, a mapping assertion m is an expression of the form X(f, x) ← conj(y), consisting of a target part X(f, x), which is an atom over function symbols f (also called templates) and variables x ⊆ y, and a source part conj(y), which is a CQ whose output variables are y.
We say that m defines the predicate X if X is in the target of m. A basic mapping is a mapping whose source part contains exactly one atom. Table 3 contains the mappings for our running example, as well as a short description of how these mappings are used in order to create a (virtual) set of assertions.

For the rest of this paper we fix an OBDA instance (T, M, Σ, D), where T is an OWL 2 QL ontology, Σ is a database schema with foreign and primary key dependencies, M is a set of mappings linking predicates in T to queries over Σ, and D is a database instance that satisfies the dependencies in Σ and the disjointness axioms in T. We denote by col(Σ) the set of all columns in Σ. Given a column C ∈ col(Σ), we denote by C^D the set of values for C in D. Finally, given a term f(x), where x = (x1, ..., xp, ..., xn), we denote the argument xp at position p by f(x)|p.

3 Data Scaling for OBDA Benchmarks: VIG Approach

The data scaling problem introduced in [19] is formulated as follows:

Definition 1 (Data Scaling Problem). Given a dataset D, produce a dataset D′ which is similar to D but s times its size.

The notion of similarity is application-based. Since our goal is benchmarking, we define similarity in terms of the query results for the queries at hand. In [19], the authors do not consider such queries to be available to the generator, since their goal is broader than benchmarking over a pre-defined set of queries. In OBDA benchmarking, however, the (SQL) workload for the database can be estimated from the mapping component. Therefore, VIG includes the mappings in the analysis, so as to obtain a more realistic and OBDA-tuned generation.

Table 3: Mappings from the NPD benchmark. Results from the evaluation of the queries in the source part build predicates in the ontology. For example, each triple (a, b, c) in a relation for shallow wellbores corresponds to an assertion ShallowWellbore(w(a)) in the ontology.
In the R2RML mappings for the original NPD benchmark, the term w(id) corresponds to the URI template npd:wellbore/{id}. Columns named id are primary keys, and the column fid in shallow_wellbores is a foreign key for the primary key fid of the table fields.

ShallowWellbore(w(id)) ← shallow_wellbores(id,name,year,fid)
ExplorationWellbore(w(id)) ← exploration_wellbores(id,name,year,state)
SuspendedWellbore(w(id)) ← exploration_wellbores(id,name,year,state), state='suspended'
Field(f(fid)) ← fields(fid,name)
completionYear(w(id), year) ← shallow_wellbores(id,name,year,fid)
name(w(id), name) ← shallow_wellbores(id,name,year,fid)
completionYear(w(id), year) ← exploration_wellbores(id,name,year)
name(w(id), name) ← exploration_wellbores(id,name,year)
shallowWellboreForField(w(id), f(fid)) ← shallow_wellbores(id,name,year,fid), fields(fid,fname)

Concerning the size, similarly to other approaches, VIG scales each table in D by a factor of s.

3.1 Similarity Measures for OBDA and Their Rationale

We overview the similarity measures used by VIG, and why they are important in the scenario of OBDA benchmarking.

Schema Dependencies. D′ should be a valid instance for Σ. VIG is, to the best of our knowledge, the only data scaler able to generate in constant time tuples that satisfy multi-attribute primary keys for weakly-identified entities¹. The current implementation of VIG does not support multi-attribute foreign keys.

Column-based Duplicates and NULL Ratios. These respectively measure the ratio of duplicates and of nulls in a given column, and are common parameters for the cost estimation performed by query planners in databases. By default, VIG maintains them in D′ to preserve the cost of joining columns in a key-foreign key relationship (e.g., the join in the last mapping of our running example). This default behavior, however, is not applied to fixed-domain columns, which are columns whose content does not depend on the size of the database instance.
The column state in the table exploration_wellbores is fixed-domain, because it partitions the elements of id into a fixed number of classes². VIG analyzes the mappings to detect such cases of fixed-domain columns, and additional fixed-domain columns can be manually specified by the user. To generate values for a fixed-domain column, VIG reuses the values found in D, so as to prevent empty answers for the SQL queries in the mappings. For instance, a value 'suspended' must be generated for the column state in order to produce objects for the class SuspendedWellbore.

VIG generates values in columns according to a uniform distribution, that is, all values in a column have the same probability of being repeated. Replication of the distributions from D will be included in the next releases of VIG.

¹ In a relational database, a weak entity is an entity that cannot be uniquely identified by its attributes alone.
² The number of classes in the ontology does not depend on the size of the data instance.

Size of Columns Clusters, and Disjointness. Query q1 from our running example returns an empty set of answers, regardless of the considered data instance. This is because the function w used to build objects for the class Wellbore does not match the function f used to build objects for the class Field. Indeed, fields and wellbores are two different entities for which a join operation would be meaningless. On the other hand, a standard OBDA translation of q2 into SQL produces a union of CQs containing several joins between the two tables shallow_wellbores and exploration_wellbores. This is possible only because the mappings for Wellbore, name, and completionYear all use the same unary function symbol w to define wellbores. Intuitively, every pair of terms over the same function symbol appearing in the target of two distinct basic mappings identifies sets of columns for which the join operation is semantically meaningful³.
Generating data that guarantees the correct cost for these joins is crucial in order to deliver a realistic evaluation. In our example, the join between shallow_wellbores and exploration_wellbores over id is empty under D (because ExplorationWellbore and ShallowWellbore are disjoint classes). VIG is able to replicate this fact in D′. This implies that VIG can generate data satisfying disjointness constraints declared over classes whose individuals are constructed from a unary template in a basic mapping, provided D satisfies those constraints.

4 The VIG Algorithm

We now show how VIG realizes the measures described in the previous section. The building block of VIG is a pseudo-random number generator, that is, a sequence of integers (s_i)_{i∈N} defined through a transition function s_k := f(s_{k−1}). The authors in [9] discuss a particular class of pseudo-random generators based on multiplicative groups modulo a prime number. Let n be the number of distinct values to generate, and let g be a generator for the multiplicative group modulo a prime number p, with p > n. Consider the sequence S := ⟨g^i mod p | i = 1, ..., p−1 and (g^i mod p) ≤ n⟩. Then S is a permutation of the values in the interval [1, ..., n]. Here we show how this generator is used in VIG to quickly produce data complying with foreign and primary key constraints.

The VIG algorithm can be split into two phases: the intervals creation phase and the generation phase. In the first phase, the input data instance D is analyzed to create a set of intervals ints(C) associated to each column C in the database schema. Each interval I ∈ ints(C) for a column C is a structure [min, max] keeping track of a minimum and a maximum integer index. In the generation phase, the pseudo-random number generator is used to randomly choose indexes from each of these intervals, and each chosen index i ∈ I, with I ∈ ints(C), is transformed by an injective function g_C into a database value according to the datatype of C.
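To make the construction concrete, the following is a minimal sketch of the multiplicative-group generator described above (in Python, for illustration only; VIG itself is implemented in Java). The concrete values of p and g are toy choices satisfying the stated requirements: p prime with p > n, and g a generator of the multiplicative group modulo p.

```python
def permuted_indexes(n, g, p):
    """Enumerate a permutation of 1..n via the multiplicative group modulo p.

    Assumes p is prime with p > n, and g is a generator (primitive root) of
    the multiplicative group modulo p, so g^1, ..., g^(p-1) visit every value
    in 1..p-1 exactly once. Values above n are skipped, which leaves a
    permutation of [1..n]; each step costs one modular multiplication.
    """
    v = 1
    for _ in range(p - 1):
        v = (v * g) % p
        if v <= n:
            yield v

# Toy choice: p = 11 (prime), g = 2 (a generator modulo 11), n = 8.
perm = list(permuted_indexes(8, 2, 11))
assert sorted(perm) == list(range(1, 9))  # a permutation of 1..8, no repeats
```

An index picked this way is then mapped to an actual database value by the injective, datatype-aware function g_C mentioned above; since the state is a single integer, no previously generated values need to be stored.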
From now on, let s be a scale factor, let dist(C, D) denote the number of distinct non-null values in a column C in the database instance D, and let size(T, D) denote the number of tuples occurring in the table T in the database instance D. To illustrate the algorithm, we consider the source instance D to contain the values illustrated in Figure 1, and a scaling factor s = 2.

³ That is, columns for which a join could occur during the evaluation of a user query.

exploration_wellbores (abbr. ew)   shallow_wellbores (abbr. sw)   wellbores_overview (abbr. wo)
  id   active  ...                   id   ...                       id   ...
  2    true    ...                   1    ...                       1    ...
  4    false   ...                   3    ...                       .    ...
  6    true    ...                   5    ...                       .    ...
  8    false   ...                   7    ...                       .    ...
  10   false   ...                   9    ...                       10   ...

Fig. 1: Input data instance D. The columns id of tables exploration_wellbores and shallow_wellbores are foreign keys referencing the column id of wellbores_overview.

4.1 Intervals Creation Phase

Initialization Phase. For each table T, VIG sets the number size(T, D′) of tuples to generate to size(T, D) ∗ s. Then, VIG calculates the number of non-null distinct values that need to be generated for each column, given s and D. That is, for each column C, if C is not fixed-domain, then VIG sets dist(C, D′) := dist(C, D) ∗ s. Otherwise, dist(C, D′) is set to dist(C, D). To determine whether a column is fixed-domain, VIG searches in M for mappings of the shape

ClassName(f(a)) ← table_name(b), b1 = l1, ..., bn = ln

where l1, ..., ln are literal values, and marks the columns b1, ..., bn as fixed-domain.

Example 1. For the tables in Figure 1, VIG sets size(ew, D′) = 5 ∗ 2 = 10 = size(sw, D′), and size(wo, D′) = 10 ∗ 2 = 20. The statistics dist(T.id, D′), where T is one of {ew, sw, wo}, are set in the same way, because the id columns do not contain duplicate values. The column ew.active is marked as fixed-domain, because of the third mapping in Table 3. Therefore, VIG sets dist(ew.active, D′) = 2.

Creation of Intervals.
When C is a numerical column, VIG initializes ints(C) with the interval IC := [min(C, D), min(C, D) + dist(C, D′) − 1] of distinct values to be generated, where min(C, D) denotes the minimum value occurring in C^D. Otherwise, if C is non-numerical, ints(C) is initialized with the interval IC := [1, dist(C, D′)]. The elements in ints(C) will be transformed into values of the desired datatype by a suitable injective function in the final generation step.

Example 2. Following on our running example, VIG creates in this phase the intervals Iew.id = [2, 11], Iew.active = [1, 2], Isw.id = [1, 10], and Iwo.id = [1, 20]. The intervals are then associated to the respective columns, namely ints(ew.id) = {Iew.id}, ..., ints(wo.id) = {Iwo.id}.

Primary Keys Satisfaction. Let K = {C1, ..., Cn} be the primary key of a table T. In order to ensure that the values generated for each column through the pseudo-random generator will not lead to duplicate tuples in K, the least common multiple lcm(dist(C1, D′), ..., dist(Cn, D′)) of the numbers of distinct values to be generated in each column must be greater than the number size(T, D′) of tuples to generate for the table T. If this condition is not satisfied, then VIG ensures it by slightly increasing dist(Ci, D′) for some column Ci in K. Observe that the only side effect of this is a small deviation in the number of distinct values to generate for a column. Once the condition holds, data can be generated independently for each column without risk of generating duplicate tuples for K.

Columns Cluster Analysis. In this phase, VIG analyzes M in order to identify columns that could be joined in a translation to SQL, and groups them together into pre-clusters. Formally, let X1(f1, x1), ..., Xm(fm, xm) be the atoms defined by basic mappings in M, where variables correspond to qualified column names⁴.
Consider the set F = ∪_{i=1,...,m} {f(x) | f(x) is a term in Xi(fi, xi)} of all the terms occurring in such atoms. A set of columns pc is a pre-cluster if there exist a function f and a valid position p in f such that pc = {f(x)|p | f(x) ∈ F}.

Example 3. For our running example, we have

F = {w(sw.id), w(ew.id), f(fields.fid)}

There are two pre-clusters, namely pcw|1 = {sw.id, ew.id} and pcf|1 = {fields.fid}.

VIG evaluates on D all combinations of joins between columns in a pre-cluster pc, and produces values in D′ so that the selectivities of these joins are maintained. In order to do so, the intervals for the columns in pc are modified. This modification must be propagated to all the columns related via a foreign key relationship to some column in pc. In particular, the modification might propagate up to columns belonging to different pre-clusters, inducing a clash. VIG groups together such pre-clusters in order to avoid this issue. Formally, let PC denote the set of pre-clusters for M. Two pre-clusters pc1, pc2 ∈ PC are in merge relation, denoted pc1 ⇝ pc2, iff C(pc1) ∩ C(pc2) ≠ ∅, where C(pc) = {D ∈ col(Σ) | there is a C ∈ pc such that D ↔* C}, and ↔* is the reflexive, symmetric, and transitive closure of the single-column foreign key relation between pairs of columns⁵. Given a pre-cluster pc, the set of columns {c ∈ pc′ | pc′ ⇝* pc} is called a columns cluster, where ⇝* is the transitive closure of ⇝. Columns clusters group together those pre-clusters for which columns cannot be generated independently.

Example 4. In our example we have that pcw|1 ̸⇝ pcf|1. In fact,

(C(pcw|1) = pcw|1 ∪ {wo.id}) ∩ (pcf|1 = C(pcf|1)) = ∅

Therefore, the pre-clusters pcw|1 and pcf|1 are also columns clusters.

After identifying columns clusters, VIG analyzes the number of shared elements between the columns in each cluster, and creates new intervals accordingly. Formally, consider a columns cluster cc.
Let H ⊆ cc be a non-empty set of columns, and let KH := {K | H ⊂ K ⊆ cc} be the set of strict super-sets of H in cc. For each such H, VIG creates an interval IH such that

|IH| := |⋂_{C∈H} C^D \ (⋃_{K∈KH} ⋂_{C∈K} C^D)| ∗ s,

and adds IH to ints(C) for all C ∈ H. Boundaries for all intervals IH are set in such a way that they do not overlap.

⁴ A qualified column name is a string of the form T.C, where T/C is a table/column name.
⁵ Remember that VIG does not allow for multi-attribute foreign keys.

Table 4: CSP Program for the Choco Solver. In the following, S is the set of intervals for the columns in the columns cluster cc, plus one extra disjoint interval. Each interval I in a column C is encoded as a pair of variables X⟨C,I⟩, Y⟨C,I⟩, keeping respectively the lower and the upper limit of the interval.

Create Program Variables:
  ∀I ∈ S. ∀C ∈ C(cc). X⟨C,I⟩, Y⟨C,I⟩ ∈ [I.min, I.max]
Set Boundaries for Known Intervals:
  ∀I ∈ S. ∀C ∈ C(cc). I ∈ ints(C) ⇒ X⟨C,I⟩ = I.min, Y⟨C,I⟩ = I.max
Set Boundaries for Known Empty Intervals:
  ∀I ∈ S. ∀C ∈ cc. I ∉ ints(C) ⇒ X⟨C,I⟩ = Y⟨C,I⟩
The Y's should be greater than the X's:
  ∀I ∈ S. ∀C ∈ C(cc). X⟨C,I⟩ ≤ Y⟨C,I⟩
Foreign Keys (denoted by ⊆):
  ∀I ∈ S. ∀C1 ∈ (C(cc) \ cc). ∀C1 ⊆ C2. X⟨C1,I⟩ ≥ X⟨C2,I⟩
  ∀I ∈ S. ∀C1 ∈ (C(cc) \ cc). ∀C2 ⊆ C1. X⟨C2,I⟩ ≥ X⟨C1,I⟩
  ∀I ∈ S. ∀C1 ∈ (C(cc) \ cc). ∀C1 ⊆ C2. Y⟨C1,I⟩ ≤ Y⟨C2,I⟩
  ∀I ∈ S. ∀C1 ∈ (C(cc) \ cc). ∀C2 ⊆ C1. Y⟨C2,I⟩ ≤ Y⟨C1,I⟩
Width of the Intervals:
  ∀C ∈ (C(cc) \ cc). Σ_{I∈S} (Y⟨C,I⟩ − X⟨C,I⟩) = |C|

Example 5. Consider the columns cluster pcw|1. There are three non-empty subsets, namely E = {ew.id}, S = {sw.id}, and ES = {ew.id, sw.id}. Accordingly, we identify the sets KE = {ES} = KS and KES = ∅. Thus, VIG needs to create three disjoint intervals IE, IES, IS such that

– |IE| = |ew.id^D \ ∅| ∗ 2 = 5 ∗ 2 = 10,
– |IS| = |sw.id^D \ ∅| ∗ 2 = 5 ∗ 2 = 10,
– |IES| = |(ew.id^D ∩ sw.id^D) \ ∅| ∗ 2 = 0.
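The interval sizes of Example 5 follow directly from the formula above. The following is a small illustrative sketch (a hypothetical helper, not VIG's actual code) that computes |IH| for every non-empty H ⊆ cc from the column contents:

```python
from itertools import combinations

def interval_sizes(cluster, s):
    """Compute |I_H| for each non-empty subset H of a columns cluster.

    cluster: dict mapping a column name to the set of its values in D.
    A value is counted for H if it is shared by all columns in H and by no
    column outside H (otherwise it belongs to a strict super-set of H).
    The result is scaled by the factor s.
    """
    cols = list(cluster)
    sizes = {}
    for r in range(1, len(cols) + 1):
        for H in combinations(cols, r):
            shared = set.intersection(*(cluster[c] for c in H))
            for c in cols:
                if c not in H:
                    shared -= cluster[c]  # value also lies in a strict super-set
            sizes[frozenset(H)] = len(shared) * s
    return sizes

# Running example (Figure 1): disjoint id sets, scale factor s = 2.
cc = {"ew.id": {2, 4, 6, 8, 10}, "sw.id": {1, 3, 5, 7, 9}}
sizes = interval_sizes(cc, 2)
# |I_E| = 10, |I_S| = 10, |I_ES| = 0, as in Example 5
```

Subtracting the values of every column outside H is equivalent to subtracting the union over the strict super-sets K ∈ KH in the formula, since any value of H that also occurs in a further column belongs to the intersection of some strict super-set.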
Without loss of generality, we assume that the intervals generated by VIG satisfying the constraints above are IE = [2, 11] and IS = [12, 21]. These intervals are assigned to the columns ew.id and sw.id, respectively. Intervals assigned to the same columns in the initialization phase are deleted.

Foreign Keys Satisfaction. At this point, foreign key columns D for which there is no columns cluster cc such that D ∈ C(cc) have a single interval, whose boundaries have to be aligned to the (single) interval of the parent. Foreign keys relating pairs of columns in a cluster, instead, are already satisfied by construction of the intervals in the columns cluster. More work, instead, is necessary for columns belonging to C(cc) \ cc, for some columns cluster cc. VIG encodes the problem of finding intervals for these columns that satisfy the number of distinct values and the foreign key constraints into a constraint satisfaction problem (CSP) [1], which can be solved by any off-the-shelf constraint solver. In VIG, we use Choco [15].

Example 6. At this phase, the intervals found by VIG (w.r.t. the portion of D we are considering) are:

– Iwo.id = [1, 20] and Iew.active = [1, 2], found in the initialization phase;
– IE = [2, 11] and IS = [12, 21], found during the columns cluster analysis.

Observe that these intervals violate the foreign key sw.id ⊆ wo.id, because the index 21 belongs to I{sw.id} but not to I{wo.id}. Moreover, wo.id ∈ C(pcw|1) \ pcw|1.

Table 5: CSP instance for the running example.
Create Program Variables:
  X⟨ew.id,Iew.id⟩, Y⟨ew.id,Iew.id⟩, X⟨sw.id,Iew.id⟩, Y⟨sw.id,Iew.id⟩, X⟨wo.id,Iew.id⟩, Y⟨wo.id,Iew.id⟩ ∈ [2, 11]
  X⟨ew.id,Isw.id⟩, Y⟨ew.id,Isw.id⟩, X⟨sw.id,Isw.id⟩, Y⟨sw.id,Isw.id⟩, X⟨wo.id,Isw.id⟩, Y⟨wo.id,Isw.id⟩ ∈ [12, 21]
  X⟨ew.id,Iaux⟩, Y⟨ew.id,Iaux⟩, X⟨sw.id,Iaux⟩, Y⟨sw.id,Iaux⟩, X⟨wo.id,Iaux⟩, Y⟨wo.id,Iaux⟩ ∈ [22, 41]
Set Boundaries for Known Intervals:
  X⟨ew.id,Iew.id⟩ = 2, Y⟨ew.id,Iew.id⟩ = 11
  X⟨sw.id,Isw.id⟩ = 12, Y⟨sw.id,Isw.id⟩ = 21
Set Boundaries for Known Empty Intervals:
  X⟨ew.id,Isw.id⟩ = Y⟨ew.id,Isw.id⟩    X⟨ew.id,Iaux⟩ = Y⟨ew.id,Iaux⟩
  X⟨sw.id,Iew.id⟩ = Y⟨sw.id,Iew.id⟩    X⟨sw.id,Iaux⟩ = Y⟨sw.id,Iaux⟩
The Y's should be greater than the X's:
  X⟨ew.id,Iew.id⟩ ≤ Y⟨ew.id,Iew.id⟩, X⟨sw.id,Iew.id⟩ ≤ Y⟨sw.id,Iew.id⟩, X⟨wo.id,Iew.id⟩ ≤ Y⟨wo.id,Iew.id⟩
  X⟨ew.id,Isw.id⟩ ≤ Y⟨ew.id,Isw.id⟩, X⟨sw.id,Isw.id⟩ ≤ Y⟨sw.id,Isw.id⟩, X⟨wo.id,Isw.id⟩ ≤ Y⟨wo.id,Isw.id⟩
  X⟨ew.id,Iaux⟩ ≤ Y⟨ew.id,Iaux⟩, X⟨sw.id,Iaux⟩ ≤ Y⟨sw.id,Iaux⟩, X⟨wo.id,Iaux⟩ ≤ Y⟨wo.id,Iaux⟩
Foreign Keys:
  X⟨ew.id,Iew.id⟩ ≥ X⟨wo.id,Iew.id⟩   ...   X⟨sw.id,Isw.id⟩ ≥ X⟨wo.id,Isw.id⟩   ...
  Y⟨ew.id,Iew.id⟩ ≤ Y⟨wo.id,Iew.id⟩   ...   Y⟨sw.id,Isw.id⟩ ≤ Y⟨wo.id,Isw.id⟩
Width of the Intervals:
  (Y⟨wo.id,Iew.id⟩ − X⟨wo.id,Iew.id⟩) + (Y⟨wo.id,Isw.id⟩ − X⟨wo.id,Isw.id⟩) + (Y⟨wo.id,Iaux⟩ − X⟨wo.id,Iaux⟩) = 20

Therefore, VIG encodes the problem of finding the right boundaries for Iwo.id into the CSP in Table 5. Any solution for the above program sets X⟨ew.id,Iew.id⟩ = 2, Y⟨ew.id,Iew.id⟩ = 11, X⟨sw.id,Isw.id⟩ = 12, and Y⟨sw.id,Isw.id⟩ = 21. The last four constraints also imply that, for any solution, X⟨C,Iaux⟩ = Y⟨C,Iaux⟩ for any column C. A solution for the program is:

X⟨ew.id,Iew.id⟩ = 2    Y⟨ew.id,Iew.id⟩ = 11    X⟨sw.id,Isw.id⟩ = 12    Y⟨sw.id,Isw.id⟩ = 21
X⟨wo.id,Iew.id⟩ = 2    Y⟨wo.id,Iew.id⟩ = 11    X⟨wo.id,Isw.id⟩ = 12    Y⟨wo.id,Isw.id⟩ = 21

and arbitrary values for the other variables such that X⟨C,I⟩ = Y⟨C,I⟩.
From this solution, VIG creates two new intervals I{wo.id,ew.id} = [2, 11] and I{wo.id,sw.id} = [12, 21], and sets them as the intervals for the column wo.id.

4.2 Generation Phase

Generation. At this point, each column in col(Σ) is associated to a set of intervals. The elements in the intervals are associated to values in the column datatype, and to values from C^D in case C is fixed-domain. VIG uses the pseudo-random number generator to randomly pick elements from the intervals, which are then transformed into database values. NULL values are generated according to the detected NULL ratio. Observe that the generation of a value in a column takes constant time and can happen independently for each column, thanks to the previous phases in which the intervals were calculated.

exploration_wellbores (abbr. ew)   shallow_wellbores (abbr. sw)   wellbores_overview (abbr. wo)
  id   active  ...                   id   ...                       id   ...
  2    true    ...                   12   ...                       2    ...
  3    false   ...                   13   ...                       .    ...
  .    ...     ...                   .    ...                       .    ...
  10   true    ...                   20   ...                       .    ...
  11   false   ...                   21   ...                       21   ...

Fig. 2: The scaled instance D′.

Example 7. At this stage, VIG has available all the information necessary to proceed with the generation. With respect to our running example, such information is:

– Association of Columns to Intervals. The columns in the considered tables are associated to intervals in the following way:
  ints(ew.id) = {[2, 11]}
  ints(sw.id) = {[12, 21]}
  ints(wo.id) = {[2, 11], [12, 21]}
  ints(ew.active) = {[1, 2]}
– Number of Tuples to Generate for Each Table. From Example 1, we know that size(ew, D′) = 10, size(sw, D′) = 10, and size(wo, D′) = 20.
– Association of Indexes to Database Values. Without loss of generality, we assume that the injective function used by VIG to associate elements in the intervals to database values is the identity function for all non-fixed-domain columns.
For the column ew.active, we assume the function g : {1, 2} → {'true', 'false'} such that g(1) = 'true' and g(2) = 'false'.

Figure 2 contains the generated data instance D′. Observe that D′ satisfies all the constraints discussed in the previous paragraphs. For clarity, the generated tuples are sorted on the primary key; in a real execution, however, the values would be produced in a random order by means of the multiplicative group modulo a prime number.

5 VIG in Action

The data generation techniques presented in the previous sections have been implemented in the VIG system, which is available on GitHub as a Java Maven project, and comes with extensive documentation in the form of wiki pages. VIG is a mature implementation, delivered for two years together with the NPD benchmark; it is licensed under Apache 2.0 and maintained at the Free University of Bozen-Bolzano.

Fig. 3: Generation time (sec.) and memory (MB) comparison for VIG and GEN on BSBM-1, BSBM-10, and BSBM-100.

In this section we present two evaluations of VIG. In the first evaluation we use the BSBM benchmark to compare VIG with the native ad-hoc BSBM data generator. In the second evaluation we use the NPD benchmark to show the impact of the mappings analysis discussed in the previous section on the quality of the produced data.

5.1 The Berlin SPARQL Benchmark

The BSBM benchmark is built around an e-commerce use case in which different vendors offer products that can be reviewed by customers. It comes with a set of parametric queries, an ontology, mappings, and an ad-hoc data generator (GEN) that can generate data according to a scale parameter given in terms of the number of products. The queries contain placeholders that are instantiated with actual values during the test phase.
We used the two generators to create six data instances, denoted as BSBM-s-g, where s ∈ {1, 10, 100} indicates the scale factor with respect to an initial data instance of 10000 products (produced by GEN), and g ∈ {VIG, GEN} indicates the generator used to produce the instance. The details of the experiment, as well as the material to reproduce it, are available online⁶. Queries were executed through the Ontop OBDA system [4], over a MySQL back-end. All the experiments have been run on an HP ProLiant server with 2 Intel Xeon X5690 CPUs (24 cores @3.47GHz), 106 GB of RAM, and a 1 TB 15K RPM HD. The OS is Ubuntu 12.04 LTS.

Resources Consumption Comparison. Figure 3 shows the resources (time and memory) used by the two generators for creating the instances. For both generators the execution time grows approximately as the scale factor, which suggests that the generation of a single column value is in both cases independent of the size of the data instance to be generated. Observe that GEN is on average 5 times faster than VIG (in single-thread mode), but it also requires increasingly more memory as the amount of data to generate increases, contrary to VIG, which always requires the same amount of memory.

Benchmark Queries Comparison. In this paragraph we compare the execution times for the queries in the BSBM benchmark evaluated over the instances produced by VIG/GEN. Additionally, we here consider three additional data instances created with

⁶ https://github.com/ontop/ontop-examples/tree/master/blink-2016-vig

Table 6: Overview of the BSBM experiment.
db              avg(ex time)  avg(out time)  avg(res size)  qmpH   avg(mix time) [σ]
                (msec)        (msec)                               (msec)
BSBM -1- GEN 87 6 1425 4285 840 [101.747] BSBM -1- VIG 77 3 841 4972 724 [85.387] BSBM -1- RAND 40 4 887 9090 396 [94.695] BSBM -10- GEN 628 29 11175 608 5916 [444.489] BSBM -10- VIG 681 29 13429 563 6388 [606.057] BSBM -10- RAND 175 19 8322 2061 1746 [551.06] BSBM -100- GEN 6020 271 122169 63 56620 [5946.65] BSBM -100- VIG 6022 212 83875 64 56117 [4508.37] BSBM -100- RAND 1676 242 117259 208 17254 [3974.39] a third generator (RAND) that only considers database integrity constraints (primary and foreign keys) as similarity measures, so as to quantify the impact of the measures maintained by VIG on the task of approximating (w.r.t. task of benchmarking) the data produced by GEN. The experiment was ran on a variation of the BSBM benchmark over the testing platform, the mappings and the considered queries. We now briefly discuss and motivate the variations, before introducing the results. The testing platform of the BSBM benchmark instantiates the queries with concrete values coming from (binary) configuration files produced by GEN. This does not allow a fair comparison between the three generators, because it is biased towards the spe- cific values produced by GEN. Therefore, we reused the testing platform of the NPD benchmark, which is independent from the specific generator used as it instantiates the queries only with values found in the provided database instance. Another important difference regards the mapping component. The BSBM mapping contains some binary URI templates where one of the two arguments is a unary primary key. This is commonly regarded as a bad practice in OBDA, as it is likely to introduce redundancies in terms of retrieved information. For instance, consider the template bsbm-inst:dataFromRatingSite{publisher}/Reviewer{nr} used to construct objects for the class bsbm:Person. The template has as arguments the primary key nr for the table person, plus an additional attribute publisher. 
Observe that, since nr is a primary key, the information about publisher does not contribute to the identification of specific persons. Additionally, the relation between persons and publishers is already realized in the mappings by a specific mapping assertion for the property dc:publisher. This mapping assertion poses a challenge to data generation, because query results are influenced by inclusion dependencies between binary tuples stored in different tables. VIG cannot correctly reproduce such inclusions, because it only supports inclusions between single columns (even when these are not explicitly declared in the schema). Observe that this problem cannot be addressed even by supporting multi-attribute foreign keys, because foreign keys must always refer to a primary key. For these reasons, we have changed the problematic URI template into a unary template by removing the redundant column, so as to build individuals only out of primary keys. Observe that this change influences neither the semantics of the considered queries, nor their complexity.

Table 7: Predicates Growth Comparison

type-db-scale   avg(dev)  dev > 5% (absolute)  dev > 5% (relative)
CLASS-BSBM-1    0%        0                    0%
CLASS-BSBM-10   23.72%    2                    25%
CLASS-BSBM-100  250.74%   2                    25%
OBJ-BSBM-1      0%        0                    0%
OBJ-BSBM-10     7.46%     2                    20%
OBJ-BSBM-100    82.35%    2                    20%
DATA-BSBM-1     < 0.01%   0                    0%
DATA-BSBM-10    2.84%     2                    6.67%
DATA-BSBM-100   5.74%     2                    6.67%

Table 8: Selectivity Analysis

joins           NPD  NPD-1 (db)  NPD-1 (obda)  NPD-5 (db)  NPD-5 (obda)  NPD-50 (db)  NPD-50 (obda)
|sw ⋈ ew|       0    841         0             5046        0             42891        0
|sw ⋈ dw|       0    841         0             5046        0             42891        0
|ew ⋈ dw|       0    1560        0             9344        0             79814        0
|sw ⋈ ew ⋈ dw|  0    841         0             5046        0             42891        0

Finally, we slightly modified the queries by removing the LIMIT modifiers, so as to compare the number of retrieved results, and in a couple of cases by relaxing an excessively restrictive filter condition.
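The redundancy of the binary URI template discussed above can be seen with a toy instantiation. This is a sketch with made-up rows; the unary template shown is a hypothetical simplification, not necessarily the exact one used in the experiment:

```python
# Toy illustration: when 'nr' is a primary key, dropping 'publisher'
# from the URI template does not change which individuals are built.
def instantiate(template, row):
    return template.format(**row)

rows = [{"nr": 1, "publisher": 5}, {"nr": 2, "publisher": 5}]

binary_tpl = "bsbm-inst:dataFromRatingSite{publisher}/Reviewer{nr}"
unary_tpl = "bsbm-inst:Reviewer{nr}"   # hypothetical simplified template

binary_uris = [instantiate(binary_tpl, r) for r in rows]
unary_uris = [instantiate(unary_tpl, r) for r in rows]

# One distinct URI per row under both templates: publisher is redundant
# for identification (it is exposed separately via dc:publisher).
assert len(set(binary_uris)) == len(set(unary_uris)) == len(rows)
```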
We point out that these modifications only slightly change the size of the produced SQL translation, and that the modified queries are at least as hard as the original ones.

Table 6 contains the results of the experiment in terms of various performance measures, namely the average execution/output time and the average number of results for the queries in a run (mix), the number of query mixes per hour (qmpH), and the average mix time. Observe that the performance measured for queries executed over the instances produced by VIG is close to that measured over the instances (of comparable size) produced by GEN. Moreover, it differs significantly from the performance measured over the instances produced by RAND. We argue that this difference is due to the fact that RAND does not maintain any similarity measure on the generated data instance, apart from the schema constraints.

Deviation for Predicates Growth. Table 7 shows the deviation, in terms of the number of elements for each predicate (class, object property, or data property) in the ontology, between the instances generated by VIG and those generated by GEN. The first column reports the average deviation, and the last two columns report the absolute number and relative percentage of predicates for which the deviation was greater than 5%. Apart from a few outliers (due to some elements that are built from tables that GEN, contrary to VIG, does not scale according to the scale factor), the deviation in predicate growth is below 5% for the majority of classes and properties in the ontology.

5.2 The NPD Benchmark

The query discussed in our running example is at the basis of the three hardest real-world queries in the NPD Benchmark, namely queries 6, 11, and 12. In this section we compare these queries on two modalities of VIG: one in which only the input database is taken as input (DB mode), and for which the columns-cluster analysis cannot be performed, and the one discussed in this paper (OBDA mode), where the mapping is also taken into account.

Table 9: Evaluations for queries 6, 11, and 12.

query  NPD   NPD-1 (db)  NPD-1 (obda)  NPD-5 (db)  NPD-5 (obda)  NPD-50 (db)  NPD-50 (obda)
q6     787   597         456           10689       1494          17009        6961
q11    661   1020        364           2647        1487          37229        15807
q12    1190  2926        714           8059        3363          38726        17830

Table 8 contains the selectivities (i.e., the numbers of results) for the joins between the tables shallow wellbore (abbr. sw), exploration wellbore (abbr. ew), and development wellbore (abbr. dw), over the source NPD dataset as well as over its scaled versions of factors 1, 5, and 50. Observe that the instances created through the DB mode produce joins of non-null selectivity; together with the definitions of these classes in the mappings (see Table 3 for the mapping assertions of ExplorationWellbore and ShallowWellbore; the class DevelopmentWellbore is mapped in a similar way), this produces a violation of the disjointness constraints in the ontology between the classes ShallowWellbore, ExplorationWellbore, and DevelopmentWellbore.

Table 9 shows the impact of the wrong selectivities on the performance (response time in milliseconds) of the evaluation of the queries under consideration. Observe that the performance measured over the DB instances differs considerably from that measured over the OBDA instances, or over the original NPD instance. This is due to the higher cost of the join operations in the DB instances, which in turn derives from the wrong selectivities discussed in the previous paragraph.

Limitations. Observe that VIG considers only a limited set of similarity measures, and that the produced instances will be similar only in terms of these measures. For instance, we have already discussed how VIG is not able to reproduce constraints such as multi-attribute foreign keys or non-uniform data distributions.
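The effect of replacing a skewed column distribution with uniform generation can be illustrated by a toy sketch. The column contents and frequencies below are made up for the illustration:

```python
import collections
import random

random.seed(0)  # fixed seed: this toy example is reproducible

# Hypothetical source column, heavily skewed toward one value (90/10).
source = ["gas"] * 90 + ["oil"] * 10

# Uniform regeneration over the distinct values: the set of distinct
# values is preserved, but the 90/10 skew is not.
distinct = sorted(set(source))
scaled = [random.choice(distinct) for _ in range(len(source))]

src_freq = collections.Counter(source)   # skewed: 90 vs. 10
out_freq = collections.Counter(scaled)   # roughly balanced: ~50 vs. ~50
```

The regenerated column has the same active domain and cardinality as the source column, yet any query whose selectivity depends on the skew will behave differently over the two instances.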
Thus, although we show here that VIG seems to suffice for the BSBM benchmark (under our assumptions on the queries, mappings, and testing platform), we expect it not to perform as well in more complex scenarios, where the non-supported measures become significant. Moreover, an intrinsic weakness of the scaling approach in general is that it considers only a single source data instance: in case certain measures depend on the size of the instance, as seems to be the case for a couple of classes and properties in Table 7, the scaled instances might significantly diverge from the real ones.

6 Related Work

In this section we discuss the relation between VIG and other data scalers, as it makes little sense to compare it to the classic data generators used in OBDA benchmarks, such as, for instance, the one found in the Texas Benchmark (http://obda-benchmark.org/).

UpSizeR [19] replicates two kinds of distributions observed on the values of the key columns, called the joint degree distribution and the joint distribution over co-clusters^8. However, this requires several assumptions on the schema: for instance, tables can have at most two foreign keys, primary keys cannot be multi-attribute, etc. Moreover, generating values for the foreign keys requires reading previously generated values, which is not required in VIG. A closely related approach is ReX [3], which provides, through the use of dictionaries, a better handling of the content of non-key columns.

In terms of similarity measures, the approach closest to VIG is RSGen [18], which also considers measures like NULL ratios or the number of distinct values. Moreover, values are generated according to a uniform distribution, as in VIG. However, the approach only works on numerical data types, and it seems not to support multi-attribute primary keys. A related approach, with the ability to generate data also for non-numerical fields, has been proposed in [17].
Notably, this approach is able to produce realistic text fields by relying on machine-learning techniques based on Markov chains.

In RDF graph scaling [16], an additional parameter, called the node degree scaling factor, is provided as input to the scaler. The approach is able to replicate the densification phenomena that have been observed for certain types of networks. We see this as a meaningful extension for VIG, and we are currently studying how it could be applied in an OBDA context.

Observe that none of the approaches above considers ontologies or mappings. Therefore, many measures that are important in a context with mappings and ontologies, and that were discussed here, such as the selectivities of joins in a co-cluster, class disjointness, or the reuse of values for fixed-domain columns, are not taken into consideration by any of them. This leads to problems like the one we discussed through our running example, for which we showed in Section 5 how it affects the benchmarking analysis.

7 Conclusion and Development Plan

In this work we presented VIG, a data scaler for OBDA benchmarks. VIG integrates some of the measures used by database query optimizers and existing data scalers with OBDA-specific measures, in order to deliver better data generation in the context of OBDA benchmarks.

We have evaluated VIG on the task of generating data for both the BSBM and NPD benchmarks. In the former, we measured how similar the data produced by VIG is to that produced by the native BSBM generator, obtaining encouraging results. In the latter, we provided an empirical evaluation of the impact that the most distinguishing feature of VIG, namely the mappings analysis, has on the shape of the produced instances, and of how it affects the measured performance of the benchmark queries.
The current work plan is to enrich the quality of the produced data by adding support for multi-attribute foreign keys, joint degree and value distributions, and intra-row correlations (e.g., objects from SuspendedWellbore might not have a completionYear). Unfortunately, we expect some of these measures to conflict with the current guarantee of constant-time tuple generation. Moreover, many of them require access to previously generated tuples in order to be calculated (e.g., the joint-degree distribution [19]).

^8 The notion of co-cluster has nothing to do with the notion of columns-cluster introduced here.

A related problem is how to extend the notion of "scaling" to the other components that form the input of an OBDA system, such as the mappings or the ontology. We are not aware of any work in this direction, but we see it as an interesting research problem to be addressed in the future.

Acknowledgment. This paper is supported by the EU project Optique FP7-318338.

References

1. Apt, K.: Principles of Constraint Programming. Cambridge University Press, New York, NY, USA (2003)
2. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. on Semantic Web and Information Systems 5(2), 1–24 (2009)
3. Buda, T., Cerqueus, T., Murphy, J., Kristiansen, M.: ReX: Extrapolating relational data in a representative way. In: Maneth, S. (ed.) Data Science, LNCS, vol. 9147, pp. 95–107 (2015)
4. Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web J. (2016), to appear
5. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M., Rosati, R.: Ontologies and databases: The DL-Lite approach. In: Tessaris, S., Franconi, E. (eds.) RW 2009 Tutorial Lectures, LNCS, vol. 5689, pp. 255–356. Springer (2009)
6. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R.: Linking data to ontologies: The description logic DL-Lite_A. In: Proc. of OWLED 2006. CEUR, ceur-ws.org, vol. 216 (2006)
7. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning 39(3), 385–429 (2007)
8. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C Recommendation, W3C (Sep 2012), available at http://www.w3.org/TR/r2rml/
9. Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly generating billion-record synthetic databases. In: Proc. of ACM SIGMOD. pp. 243–252. ACM (1994)
10. Harris, S., Seaborne, A.: SPARQL 1.1 query language. W3C Recommendation, W3C (Mar 2013), available at http://www.w3.org/TR/sparql11-query
11. Kontchakov, R., Rezk, M., Rodriguez-Muro, M., Xiao, G., Zakharyaschev, M.: Answering SPARQL queries over databases under OWL 2 QL entailment regime. In: Proc. of ISWC 2014. LNCS, vol. 8796, pp. 552–567. Springer (2014)
12. Lanti, D., Rezk, M., Xiao, G., Calvanese, D.: The NPD benchmark: Reality check for OBDA systems. In: Proc. of EDBT 2015. pp. 617–628. OpenProceedings.org (2015)
13. Lanti, D., Xiao, G., Calvanese, D.: VIG. https://github.com/ontop/vig (2016)
14. Motik, B., Cuenca Grau, B., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language profiles (second edition). W3C Recommendation, W3C (Dec 2012), available at http://www.w3.org/TR/owl2-profiles/
15. Prud'homme, C., Fages, J.G., Lorca, X.: Choco Documentation. TASC, INRIA Rennes, LINA CNRS UMR 6241, COSLING S.A.S. (2015), available at http://www.choco-solver.org/
16. Qiao, S., Özsoyoğlu, Z.M.: RBench: Application-specific RDF benchmarking. In: Proc. of ACM SIGMOD. pp. 1825–1838 (2015)
17. Rabl, T., Danisch, M., Frank, M., Schindler, S., Jacobsen, H.A.: Just can't get enough: Synthesizing big data. In: Proc. of ACM SIGMOD. pp. 1457–1462 (2015)
18. Shen, E., Antova, L.: Reversing statistics for scalable test databases generation. In: Proc. of DBTest. pp. 7:1–7:6 (2013)
19. Tay, Y., Dai, B.T., Wang, D.T., Sun, E.Y., Lin, Y., Lin, Y.: UpSizeR: Synthetically scaling an empirical relational database. Information Systems 38(8), 1168–1183 (2013)