<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fast and Simple Data Scaling for OBDA Benchmarks</article-title>
      </title-group>
      <abstract>
        <p>In this paper we describe VIG, a data scaler for OBDA benchmarks. Data scaling is a relatively recent approach, proposed in the database community, that allows for quickly scaling an input data instance to n times its size, while preserving certain application-specific characteristics. The advantages of the scaling approach are that the same generator is general, in the sense that it can be re-used on different database schemas, and that users are not required to manually input the data characteristics. In the VIG system, we lift the scaling approach from the pure database level to the OBDA level, where the domain information of ontologies and mappings has to be taken into account as well. VIG is efficient and notably each tuple is generated in constant time. VIG has been successfully used in the NPD benchmark, but it provides a general approach that can be re-used to scale any data instance in any OBDA setting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        An important research problem in Big Data is how to provide end-users with transparent
access to the data, abstracting from storage details. The paradigm of Ontology-based
Data Access (OBDA) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provides an answer to this problem that is very close to the
spirit of the Semantic Web. In OBDA the data stored in a relational database is presented
to the end-users as a virtual RDF graph over which SPARQL queries can be posed. This
solution is realized through mappings that link classes and properties in the ontology to
queries over the database.
      </p>
      <p>
        Proper benchmarking of query answering systems, such as OBDA systems, requires
scalability analyses taking into account for data instances of increasing volume. Such
instances are often provided by generators of synthetic data. However, such generators
are either complex ad-hoc implementations working for a specific schema, or require
considerable manual input by the end-user. The latter problem is exacerbated in the
OBDA setting, where database schemas tend to be particularly big and complex (e.g.,
70 tables, some with more of 80 columns in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). This contributes to the slow creation
of new benchmarks, and the same old benchmarks become more and more misused
over a long period of time. For instance, evaluations on OBDA systems are usually
performed on benchmarks originally designed for triple stores, although these two types
of systems are totally different and present different challenges [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Data scaling [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is a recent approach that tries to overcome this problem by
automatically tuning the generation parameters through statistics collected over an initial
data instance. Hence, the same generator can be reused in different contexts, as long
as an initial data instance is available. A measure of quality for the produced data is
defined in terms of results for the available queries, that should be similar to the one
observed for real data of comparable volume.
      </p>
      <p>In the context of OBDA, however, taking as the only parameter for generation an
initial data instance does not produce data of acceptable quality, since the generated
data has to comply with constraints deriving from the structure of the mappings and the
ontology, that in turn derive from the application domain.</p>
      <p>In this work we present the VIG system, a data scaler for OBDA benchmarks that
addresses these issues. In VIG, the scaling approach is lifted from the instance level to
the OBDA level, where the domain information of ontologies and mappings has to be
taken into account as well. VIG is very efficient and suitable to generate huge amounts
of data, as tuples are generated in constant time without disk accesses or need to retrieve
previously generated values. Furthermore, different instances of VIG can be delegated
to different machines, and parallelization can scale up to the number of columns in the
schema, without communication overhead. Finally, VIG produces data in the form of
csv files that can be easily imported by any relational database system.</p>
      <p>
        VIG is a Java implementation licensed under Apache 2.0, and its source code is
available on GitHub in the form of a Maven project [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The code is maintained by the
Ontop team at the Free University of Bozen-Bolzano, and it comes with documentation
in the form of Wiki pages.
      </p>
      <p>
        We have evaluated VIG in the task of generating data for both the Berlin SPARQL
Benchmark (BSBM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the NPD Benchmark [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In the first, we measured how
the data produced by VIG compares to the one produced by the native ad-hoc BSBM
generator. In the latter we provided an empirical evaluation of the impact that the most
distinguished feature of VIG, namely the mappings analysis, has on the shape of the
produced instances, and how it affects the measured performance of benchmark queries.
      </p>
      <p>The rest of the paper is structured as follows. In Section 2, we introduce the
basic notions and notation to understand this paper. In Section 3, we define the scaling
problem and discuss important measures on the produced data that define the quality
of instances in a given OBDA setting. In Section 4, we discuss the VIG algorithm, and
how it ensures that data conforming to the identified measures is produced. In Section 5
we provide an empirical evaluation of VIG, in terms of both resource consumption and
quality of produced data. Sections 6 and 7 contain related work and conclusions,
respectively.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Basic Notions and Notation</title>
      <p>
        We assume that the reader has moderate knowledge of OBDA, and refer for it to the
abundant literature on the subject, like [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, we assume familiarity with basic
notions from probability calculus and statistics.
      </p>
      <p>
        The W3C standard ontology language in OBDA is OWL 2 QL [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For the sake
of conciseness, we consider here its mathematical underpinning DL-LiteR [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Table 1
shows a portion of the ontology from the NPD benchmark, which is the foundation
block of our running example.
      </p>
      <p>
        The W3C standard query language in OBDA is SPARQL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with queries
evaluated under the OWL 2 QL entailment regime [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Intuitively, under these semantics
each basic graph pattern (BGP) can be seen as a single conjunctive query (CQ)
without existentially quantified variables. As in our examples we will only refer to SPARQL
queries containing exactly one BGP, we will use the more concise syntax for CQs rather
than the SPARQL syntax. Table 2 shows two queries used in our running example.
      </p>
      <p>
        The mapping component links predicates in the ontology to queries over the
underlying relational database. To present our techniques, we need to introduce this
component in a formal way. The standard W3C Language for mappings is R2RML [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
however here we use a more concise syntax that is common in the OBDA literature.
Formally, a mapping assertion m is an expression of the form X(f ; x) conj(y),
consisting of a target part X(f ; x), which is an atom over function symbols f (also
called templates) and variables x y, and a source part conj(y), which is a CQ whose
output variables are y. We say that m defines the predicate X if X is in the target of m.
A basic mapping is a mapping whose source part contains exactly one atom. Table 3
contains the mappings for our running example, as well as a short description of how
these mappings are used in order to create a (virtual) set of assertions.
      </p>
      <p>For the rest of this paper we fix an OBDA instance (T ; M; ; D), where T is an
OWL 2 QL ontology, is a database schema with foreign and primary key
dependencies, M is a set of mappings linking predicates in T to queries over , and D is a
database instance that satisfies the dependencies in and the disjointness axioms in
T . We denote by col( ) the set of all columns in . Given a column C 2 col( ),
we denote by CD the set of values for C in D. Finally, given a term f (x), where
x = (x1; : : : ; xp; : : : ; xn), we denote the argument xp at position p by f (x)jp.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Data Scaling for OBDA Benchmarks: VIG Approach</title>
      <p>
        The data scaling problem introduced in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is formulated as follows:
Definition 1 (Data Scaling Problem). Given a dataset D, produce a dataset D0 which
is similar to D but s times its size.
      </p>
      <p>
        The notion of similarity is application-based. Being our goal benchmarking, we
define similarity in terms of query results for the queries at hand. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the authors
do not consider such queries to be available to the generator, since their goal is broader
than benchmarking over a pre-defined set of queries. In OBDA benchmarking, however,
the (SQL) workload for the database can be estimated from the mapping component.
Therefore, VIG includes the mappings in the analysis, so as to obtain a more realistic
and OBDA-tuned generation.
      </p>
      <p>ShallowWellbore(w(id))
ExplorationWellbore(w(id))
SuspendedWellbore(w(id))
Field(f(fid))
completionYear(w(id); year)
name(w(id); name)
completionYear(w(id); year)
name(w(id); name)
shallowWellboreForField(w(id); f(fid))
shallow wellbores(id,name,year,fid)
exploration wellbores(id,name,year,state)
exploration wellbores(id,name,year,state),
state=’suspended’
fields(fid,name)
shallow wellbores(id,name,year,fid)
shallow wellbores(id,name,year,fid)
exploration wellbores(id,name,year)
exploration wellbores(id,name,year)
shallow wellbores(id,name,year,fid),
fields(fid,fname)</p>
      <p>Concerning the size, similarly to other approaches, VIG scales each table in D by a
factor of s.
3.1</p>
      <sec id="sec-3-1">
        <title>Similarity Measures for OBDA and Their Rationale</title>
        <p>We overview the similarity measures used by VIG, and why they are important in the
scenario of OBDA benchmarking.</p>
        <p>Schema Dependencies. D0 should be a valid instance for . VIG is, to the best of our
knowledge, the only data scaler able to generate in constant time tuples that satisfy
multi-attribute primary keys for weakly-identified entities1. The current implementation
of VIG does not support multi-attribute foreign keys.</p>
        <p>Column-based Duplicates and NULL Ratios. They respectively measure the ratio
of duplicates and of nulls in a given column, and are common parameters for the
cost estimation performed by query planners in databases. By default, VIG maintains
them in D0 to preserve the cost of joining columns in a key-foreign key relationship
(e.g., the join from the last mapping in our running example). This default behavior,
however, is not applied with fixed-domain columns, which are columns whose content
does not depend on the size of the database instance. The column state in the
table exploration wellbore is fixed-domain, because it partitions the elements of
id into a fixed number of classes2. VIG analyzes the mappings to detect such cases of
fixed-domain columns, and additional fixed-domain columns can be manually specified
by the user. To generate values for a fixed-domain column, VIG reuses the values found
in D so as to prevent empty answers for the SQL queries in the mappings. For instance,
a value ‘suspended’ must be generated for the column state in order to produce
objects for the class SuspendedWellbore.</p>
        <p>VIG generates values in columns according to a uniform distribution, that is, values
in columns have all the same probability of being repeated. Replication of the
distributions from D will be included the next releases of VIG.
1 In a relational database, a weak entity is an entity that cannot be uniquely identified by its
attributes alone.
2 The number of classes in the ontology does not depend on the size of the data instance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Size of Columns Clusters, and Disjointness. Query q1 from our running example</title>
        <p>returns an empty set of answers, regardless of the considered data instance. This is
because the function w used to build objects for the class Wellbore does not match with
the function f used to build objects for Fields. Indeed, fields and wellbores are two
different entities for which a join operation would be meaningless.</p>
        <p>On the other hand, a standard OBDA translation of q2 into SQL produces a union
of CQs containing several joins between the two tables shallow wellbores and
exploration wellbores. This is possible only because the mappings for
Wellbore, name, and completionYear all use the same unary function symbol w to define
wellbores. Intuitively, every pair of terms over the same function symbol and appearing
on the target of two distinct basic mappings identifies sets of columns for which the join
operation is semantically meaningful3. Generating data that guarantees the correct cost
for these joins is crucial in order to deliver a realistic evaluation. In our example, the join
between shallow wellbore and exploration wellbore over id is empty
under D (because ExplorationWellbore and ShallowWellbore are disjoint classes). VIG
is able to replicate this fact in D0. This implies that VIG can generate data satisfying
disjointness constraints declared over classes whose individuals are constructed from a
unary template in a basic mapping, if D satisfies those constraints.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The VIG Algorithm</title>
      <p>
        We now show how VIG realizes the measures described in the previous section. The
building block of VIG is a pseudo-random number generator, that is a sequence of
integers (si)i2N defined through a transition function sk := f (sk 1). The authors in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
discuss a particular class of pseudo-random generators based on multiplicative groups
modulo a prime number. Let n be the number of distinct values to generate. Let g be a
generator for the multiplicative group modulo a prime number p, with p &gt; n. Consider
the sequence S := hgi mod p j i = 1; : : : ; p and (gi mod p) ni. Then S is a
permutation of values in the interval [1; : : : ; n]. Here we show how this generator is used
in VIG to quickly produce data complying with foreign and primary key constraints.
      </p>
      <p>The VIG algorithm can be split into two phases: the intervals creation phase, and the
generation phase. In the first phase the input data instance D is analyzed to create a set
of intervals ints(C) associated to each column C in the database schema. Each interval
I 2 ints(C) for a column C is a structure [min; max] keeping track of a minimum and
maximum integer index. In the generation phase, the pseudo random number generator
is used to randomly choose indexes from each of these intervals, and for each chosen
index i 2 I, I 2 ints(C), an injective function gC is applied to transform i into a
database value according to the datatype of C.</p>
      <p>From now on, let s be a scale factor, and let dist(C; D) denote the number of distinct
non-null values in a column C in the database instance D. Let size(T; D) denote the
number of tuples occurring in the table T in the database instance D. To illustrate the
algorithm, we consider the source instance D to contain values as illustrated in Figure 1,
and a scaling factor s = 2.
3 Therefore, for which a join could occur during the evaluation of a user query.
exploration wellbores (abbr. ew) shallow wellbores (abbr. sw) wellbores overview (abbr. wo)</p>
      <sec id="sec-4-1">
        <title>4.1 Intervals Creation Phase</title>
        <p>Initialization Phase. For each table T , VIG sets the number size(T; D0) of tuples to
generate to size(T; D) s. Then, VIG calculates the number of non-null distinct values
that need to be generated for each column, given s and D. That is, for each column
C, if C is not fixed-domain then VIG sets dist(C; D0) := dist(C; D) s. Otherwise,
dist(C; D0) is set to dist(C; D). To determine whether a column is fixed-domain, VIG
searches in M for mappings of shape</p>
        <p>ClassName(f (a))</p>
        <p>table name(b); b1 = l1; : : : ; bn = ln
where l1; : : : ; ln are literal values, and marks the columns b1; : : : ; bn as fixed-domain.
Example 1. For the tables in Figure 1, VIG sets size(ew; D0) = 5 2 = 10 = size(sw; D0),
and size(wo; D0) = 10 2 = 20. Values for statistics dist(T:id; D0), where T is one
of few, sw, wog, are set in the same way, because the id columns do not contain
duplicate values. The column ew.active is marked as fixed-domain, because of the
third mapping in Table 3. Therefore, VIG sets dist(ew:active) = 2.</p>
        <p>Creation of Intervals. When C is a numerical column, VIG initializes ints(C) by the
interval IC := [min(C; D); min(C; D) + dist(C; D0) 1] of distinct values to be
generated, where min(C; D) denotes the minimum value occurring in CD. Otherwise,
if C is non-numerical, ints(C) is initialized to the interval IC := [1; dist(C; D0)]. The
elements in ints(C) will be transformed into values of the desired datatype by a suitable
injective function in the final generation step.</p>
        <p>Example 2. Following on our running example, VIG creates in this phase the intervals
Iew.id = [2; 11], Iew.active = [1; 2], Isw.id = [1; 10], and Iwo.id = [1; 20]. Then, the
intervals are associated to the respective columns. Namely, ints(ew.id) = fIew.idg,
: : :, ints(wo.id) = fIwo.idg.</p>
        <p>Primary Keys Satisfaction. Let K = fC1; : : : ; Cng be the primary key of a
table T . In order to ensure that values generated for each column through the
pseudorandom generator will not lead to duplicate tuples in K, the least common multiple
lcm(dist(C1; D0); : : : ; dist(Cn; D0)) of the number of distinct values to be generated
in each column must be greater than the number tuples(T; D0) of tuples to generate for
the table T . If this condition is not satisfied, then VIG ensures the condition by slightly
increasing dist(Ci; D0) for some column Ci in K. Observe that the only side effect of
this is a small deviation on the number of distinct values to generate for a column. Once
the condition holds, data can be generated independently for each column without risk
of generating duplicate tuples for K.</p>
        <p>Columns Cluster Analysis. In this phase, VIG analyzes M in order to identify columns
that could be joined in a translation to SQL, and groups them together into pre-clusters.
Formally, let X1(f1; x1); : : : ; Xm(fm; xm) be the atoms defined by basic mappings
in M, where variables correspond to qualified column names4. Consider the set F =
Si=1:::m ff (x) j f (x) is a term in Xi(fi; xi)g of all the terms occurring in such atoms.
A set of columns pc is a pre-cluster if there exists a function f and a valid position p in
f such that pc = ff(x)jp j f (x) 2 F g.</p>
        <p>Example 3. For our running example, we have</p>
        <p>F = fw(sw.id); w(ew.id); f (fields.fid)g
There are two pre-clusters, namely pcwj1 = fsw.id; ew.idg and pcf j1 = ffields.fidg.</p>
        <p>VIG evaluates on D all combinations of joins between columns in a pre-cluster pc,
and produces values in D0 so that the selectivities for these joins are maintained. In
order to do so, the intervals for the columns in pc are modified. This modification must
be propagated to all the columns related via a foreign key relationship to some column
in pc. In particular, the modification might propagate up to columns belonging to
different pre-clusters, inducing a clash. VIG groups together such pre-clusters in order to
avoid this issue. Formally, let P C denote the set of pre-clusters for M. Two pre-clusters
pc1; pc2 2 P C are in merge relation, denoted as pc1 ! pc2, iff C(pc1) \ C(pc2) 6= ;,
where C(pc) = fD 2 col( ) j there is a C 2 pc : D $ Cg, where $ is the
reflexive, symmetric, and transitive closure of the single column foreign key relation between
pairs of columns5. Given a pre-cluster pc, the set of columns fc 2 pc0 j pc0!pcg is
called a columns cluster, where ! is the transitive closure of !. Columns clusters
group together those pre-clusters for which columns cannot be generated independently.
Example 4. In our example we have that pcwj1 6! pcf j1 . In fact,</p>
        <p>(C(pcwj1 ) = pcwj1 [ fwo.idg) \ (pcf j1 = C(pcf j1 )) = ;
Therefore, the pre-clusters pcwj1 and pcf j1 are also columns clusters.</p>
        <p>After identifying columns clusters, VIG analyzes the number of shared elements
between the columns in the cluster, and creates new intervals accordingly. Formally,
consider a columns cluster cc. Let H cc be a set of columns, and the set KH :=
fK j H K ccg of strict super-sets of H . For each such H , VIG creates an interval
IH such that jIH j := j TC2H CD n (SK2KH TC2K CD)j s, and adds IH to ints(C)
for all C 2 H . Boundaries for all intervals IH are set in a way that they do not overlap.
4 A qualified column name is a string of the form T.C, where T/C is a table/column name.
5 Remember that VIG does not allow for multi-attribute foreign keys.
Example 5. Consider the columns cluster pcwj1 . There are three non-empty subsets,
namely E = few.idg ; S = fsw.idg, and ES = few.id; sw.idg. Accordingly,
we identify the sets KE = fESg = KS and KES = ;.</p>
        <p>Thus, VIG needs to create three disjoint intervals IE ; IES ; IS such that
– jIE j = jew.idD n ;j 2 = 5 2 = 10,
– jIS j = jsw.idD n ;j 2 = 5 2 = 10,
– jIES j = j(ew.idD \ sw.idD) n ;j</p>
        <p>Without loss of generality, we assume that the intervals generated by VIG satisfying
the constraints above are IE = [2; 11] and IS = [12; 21]. These intervals are assigned to
columns ew.id and sw.id, respectively. Intervals assigned in the initialization phase
to the same columns are deleted.</p>
        <p>
          Foreign Keys Satisfaction. At this point, foreign key columns D for which there is
no columns cluster cc such that D 2 C(cc), have a single interval whose boundaries
have to be aligned to the (single) interval of the parent. Foreign keys relating pairs of
columns in a cluster, instead, are already satisfied by construction of the intervals in the
columns cluster. More work, instead, is necessary for columns belonging to C(cc) n cc,
for some columns cluster cc. VIG encodes the problem of finding intervals for these
columns that satisfy the number of distinct values and the foreign key constraints into
a constraint satisfaction problem (CSP) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which can be solved by any off-the-shelf
constraint solver. In VIG, we use Choco [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>Example 6. At this phase, the intervals found by VIG (w.r.t. the portion of D we are
considering) are:
– Iwo.id = [1; 20] and Iew.active = [1; 2], found in the initialization phase;
– IE = [2; 11] and IS = [12; 21], found during the columns cluster analysis.
Observe that these intervals violate the foreign key so.id wo.id, because the
index 21 belongs to Ifso.idg but not Ifwo.idg. Moreover, wo.id 2 C(pcwj1 ) n pcwj1 .</p>
        <p>Create Program Variables:</p>
        <sec id="sec-4-1-1">
          <title>Xhew.id;Iew.idi; Yhew.id;Iew.idi; Xhsw.id;Iew.idi; Yhsw.id;Iew.idi; Xhwo.id;Iew.idi; Yhwo.id;Iew.idi 2 [2; 11]</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Xhew.id;Isw.idi; Yhew.id;Isw.idi; Xhsw.id;Isw.idi; Yhsw.id;Isw.idi; Xhwo.id;Isw.idi; Yhwo.id;Isw.idi 2 [12; 21]</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Xhew.id;Iauxi; Yhew.id;Iauxi; Xhsw.id;Iauxi; Yhsw.id;Iauxi; Xhwo.id;Iauxi; Yhwo.id;Iauxi 2 [22; 41]</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Yhew.id;Iew.idi, Xhsw.id;Iew.idi Yhsw.id;Iew.idi, Xhwo.id;Iew.idi Yhwo.id;Iew.idi</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Yhew.id;Isw.idi, Xhsw.id;Isw.idi Yhsw.id;Isw.idi, Xhwo.id;Isw.idi Yhwo.id;Isw.idi</title>
        </sec>
        <sec id="sec-4-1-6">
          <title>Yhew.id;Iauxi, Xhsw.id;Iauxi Yhsw.id;Iauxi, Xhwo.id;Iauxi Yhwo.id;Iauxi</title>
          <p>Set Boundaries for Known Intervals:
Xhew.id;Iew.idi = 2; Yhew.id;Iew.idi = 11
Xhsw.id;Isw.idi = 12; Yhsw.id;Isw.idi = 21
Set Boundaries for Known Empty Intervals:
Xhew.id;Isw.idi = Yhew.id;Isw.idi
Xhew.id;Iauxi = Yhew.id;Iauxi
Xhsw.id;Iew.idi = Yhsw.id;Iew.idi
Xhsw.id;Iauxi = Yhsw.id;Iauxi
The Y’s should be greater than the X’s:</p>
        </sec>
        <sec id="sec-4-1-7">
          <title>Xhwo.id;Iew.idi</title>
        </sec>
        <sec id="sec-4-1-8">
          <title>Xhew.id;Iew.idi</title>
        </sec>
        <sec id="sec-4-1-9">
          <title>Xhew.id;Isw.idi</title>
        </sec>
        <sec id="sec-4-1-10">
          <title>Xhew.id;Iauxi</title>
          <p>Foreign Keys:</p>
        </sec>
        <sec id="sec-4-1-11">
          <title>Xhew.id;Iew.idi</title>
          <p>Therefore, VIG encodes the problem of finding the right boundaries for Iwo.id into the
CSP in Table 5.</p>
          <p>Any solution for the above program sets Xhew.id;Iew.idi = 2; Yhew.id;Iew.idi = 11,
Xhsw.id;Isw.idi = 12; and Yhsw.id;Isw.idi = 21. The last four constraint imply also that,
for any solution, XhC;Iaux i = YhC;Iaux i, for any column C . A solution for the program
is:</p>
          <p>Xhew.id;Iew.idi
Xhsw.id;Isw.idi
Xhwo.id;Iew.idi
Xhwo.id;Isw.idi
= 2
= 12
= 2
= 12</p>
          <p>Yhew.id;Iew.idi
Yhsw.id;Isw.idi
Yhwo.id;Iew.idi
Yhwo.id;Iew.idi
= 11
= 21
= 11
= 21
Ifwo.id;sw.idg
and arbitrary values for the other variables so that XC;I = YC;I .</p>
          <p>From this solution, VIG creates two new intervals Ifwo.id;ew.idg = [2; 11] and
= [12; 21] and sets them as intervals for column wo.id.
4.2</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Generation Phase</title>
        <p>Generation. At this point, each column in col( ) is associated to a set of intervals. The
elements in the intervals are associated to values in the column datatype, and to values
exploration wellbores (abbr. ew) shallow wellbores (abbr. sw) wellbores overview (abbr. wo)
id active ... id ... id ...
.23.. fftaarllusseee ......... 11..32. ......... 11.2.12. ............
10 true ... 20 ... ... ...
11 false ... 21 ... 21 ...
from CD in case C is fixed-domain. VIG uses the pseudo-random number generator to
randomly pick elements from the intervals that are then transformed into database
values. NULL values are generated according to the detected NULLS ratio. Observe that
the generation of a value in a column takes constant time and can happen independently
for each column, thanks to the previous phases in which intervals were calculated.
Example 7. At this stage, VIG has available all the information necessary to proceed
with the generation. With respect to our running example, such information is:
– Association Columns to Intervals. The columns in the considered tables are
associated to intervals in the following way:
ints(ew.id) = f[2; 11]g
ints(sw.id) = f[12; 21]g
ints(wo.id) = f[2; 11]; [12; 21]g
ints(ew.active) = f1; 2g)
– Number of tuples to generate for each table. From Example 4.1, we know that
size(ew; D0) = 10, size(sw; D0) = 10, and size(wo; D0) = 20.
– Association Indeces to Database Values. Without loss of generality, we assume that
the injective function used by VIG to associate elements in the intervals to database
values is the identity function for all non fixed-domain columns. For the column
ew.active, we assume the function g : f1; 2g ! f’true’, ’false’g such
that g(1) = ’true’ and g(2) = ’false’.</p>
        <p>Figure 2 contains the generated data instance D0. Observe that D0 satisfies all the
constraints discussed in the previous paragraphs. For clarity, the generated tuples are
sorted on the primary key, however in a real execution the values would be randomly
generated by means of the multiplicative group modulo a prime number.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>VIG in Action</title>
      <p>The data generation techniques presented in previous sections have been implemented
in the VIG system, which is available on GitHub as a Java maven project, and comes
with extensive documentation in form of wiki pages. VIG, a mature implementation
delivered since two years together with the NPD benchmark, is licensed under Apache 2.0,
and is maintained at the Free University of Bozen-Bolzano.
Time Comparison (sec.)
Memory Comparison (Mb)
2000
1000
0
BSBM-1 BSBM-10 BSBM-100</p>
      <p>In this section we present two evaluations of VIG. In the first evaluation we use the
BSBM benchmark to compare VIG with the native ad-hoc BSBM date generator. In the
second evaluation we use the NPD benchmark to show the impact, on the quality of the
produced data, of the mappings analysis discussed in the previous section.
The BSBM benchmark is built around an e-commerce use case in which different
vendors offer products that can be reviewed by customers. It comes with a set of parametric
queries, an ontology, mappings, and an ad-hoc data generator (GEN) that can generate
data according a scale parameter given in terms of number of products. The queries
contain placeholders that are instantiated by actual values during the test phase.</p>
      <p>We used the two generators to create six data instances, denoted as BSBM-s-g,
where s 2 f1; 10; 100g indicates the scale factor with respect to an initial data instance
of 10000 products (produced by GEN), and g 2 fVIG,GENg indicates the generator used
to produce the instance.</p>
      <p>
        The details of the experiment, as well as the material to reproduce it, are available
online6. Queries were executed through the Ontop OBDA system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], over a MySQL
back-end. All the experiments have been ran on a HP Proliant server with 2 Intel Xeon
X5690 CPUs (24 cores @3.47GHz), 106 GB of RAM and a 1 TB 15K RPM HD. The
OS is Ubuntu 12.04 LTS.
      </p>
      <p>Resources Consumption Comparison. Figure 3 shows the resources (time and
memory) used by the two generators for creating the instances. For both generators the
execution time grows approximately as the scale factor, which suggests that the generation
of a single column value is in both cases independent from the size of the data instance
to be generated. Observe that GEN is on average 5 times faster than VIG (in
singlethread mode), but it also requires increasingly more memory as the amount of the data
to generate increase, contrary to VIG that always requires the same amount of memory.
Benchmark Queries Comparison. In this paragraph we compare the execution times
for the queries in the BSBM benchmark evaluated over the instances produced by
VIG/GEN. Additionally, we here consider three additional data instances created with
6 https://github.com/ontop/ontop-examples/tree/master/
blink-2016-vig
db
BSBM-1-GEN
BSBM-1-VIG
BSBM-1-RAND
BSBM-10-GEN
BSBM-10-VIG
BSBM-10-RAND
BSBM-100-GEN
BSBM-100-VIG
BSBM-100-RAND
a third generator (RAND) that only considers database integrity constraints (primary
and foreign keys) as similarity measures, so as to quantify the impact of the measures
maintained by VIG on the task of approximating (w.r.t. task of benchmarking) the data
produced by GEN.</p>
      <p>The experiment was ran on a variation of the BSBM benchmark over the testing
platform, the mappings and the considered queries. We now briefly discuss and motivate
the variations, before introducing the results.</p>
      <p>The testing platform of the BSBM benchmark instantiates the queries with concrete
values coming from (binary) configuration files produced by GEN. This does not allow
a fair comparison between the three generators, because it is biased towards the
specific values produced by GEN. Therefore, we reused the testing platform of the NPD
benchmark, which is independent from the specific generator used as it instantiates the
queries only with values found in the provided database instance.</p>
      <p>Another important difference regards the mapping component. The BSBM mapping
contains some binary URI templates where one of the two arguments is a unary primary
key. This is commonly regarded as a bad practice in OBDA, as it is likely to introduce
redundancies in terms of retrieved information. For instance, consider the template
used to construct objects for the class bsbm:Person. The template has as arguments
the primary key nr for the table person, plus an additional attribute publisher.
Observe that, being nr a primary key, the information about publisher does not
contribute in the identification of specific persons. Additionally, the relation between
persons and publishers is already realized in the mappings by a specific mapping assertion
for the property dc:publisher. This mapping assertion poses a challenge to data
generation, because query results are influenced by inclusion dependencies between
binary tuples stored in different tables. VIG cannot correctly reproduce such inclusions,
because it only supports inclusions (even not explicitly declared in the schema)
between single columns. Observe that this problem cannot be addressed even by
supporting multi-attribute foreign keys, because foreign keys must always refer to a primary
key. For these reasons, we have changed the problematic URI template into an unary
template by removing the redundant column, so as to build individuals only out of
primary keys. Observe that this change does not influence the semantics of the considered
queries, nor their complexity.</p>
      <p>type-db-scale
CLASS-BSBM-1
CLASS-BSBM-10
CLASS-BSBM-100
OBJ-BSBM-1
OBJ-BSBM-10
OBJ-BSBM-100
DATA-BSBM-1
DATA-BSBM-10
DATA-BSBM-100
avg(dev)</p>
      <p>Finally, we slightly modified the queries by removing the LIMIT modifiers, so as
to compare the number of retrieved results, and in a couple of cases by modifying an
excessively restricting filter condition. We point out that this modification only slightly
change the size of the produced SQL translation, and that the modified queries are at
least as hard as the original ones.</p>
      <p>Table 6 contains the results of the experiment in terms of various performance
measures, namely the average execution/output time and average number of results
for queries in a run (mix), the number of query mixes per hour, and the average mix
time. Observe that the measured performance for queries executed over the instances
produced by VIG is close to the measured performance for queries executed over the
instances (of comparable size) produced by GEN. Moreover, it significantly differs from
the performance measured over instances produced by RAND. We argue that this
difference is due to the fact that RAND does not maintain any similarity measure, apart from
schema constraints, on the generated data instance.</p>
      <p>Deviation for Predicates Growth. Table 7 shows the deviation, in terms of number
of elements for each predicate (class, object or data property) in the ontology, between
the instances generated by VIG and those generated by GEN. The first column reports
the average deviation, and last two columns report the absolute number and relative
percentage of predicates for which the deviation was greater than 5%. Apart from a
few outliers (due to some elements which are built from tables that GEN, contrary to
VIG, does not scale according to the scale factor) the deviation for predicates growth is
inferior to 5% for the majority of classes and properties in the ontology.
5.2</p>
      <sec id="sec-5-1">
        <title>The NPD Benchmark</title>
        <p>The query discussed in our running example is at the basis of the three hardest
realworld queries in the NPD Benchmark, namely queries 6; 11 and 12. In this section we
compare these queries on two modalities of VIG; one in which only the input database
is taken as input (DB mode), and for which the columns cluster analysis cannot be
performed, and the one (OBDA mode) discussed in this paper where the mapping is also
taken into account.</p>
        <p>Table 8 contains the selectivities (i.e., number of results) for the joins between
tables shallow wellbore (abbr. sw), exploration wellbore (abbr. ew), and
development wellbore (abbr. dw), over the source NPD dataset as well as its
scaled versions of factors 1; 5 and 50. Observe that the instances created through the
DB mode produce joins of non-null selectivities, and this fact together with the
definitions for these classes in the mappings (see Table 3 for the mapping assertions of
ExplorationWellbore and ShallowWellbore; class DevelopmentWellbore is mapped in
a similar way) produce a violation of the disjointness constraints in the ontology
between the classes ShallowWellbore, ExplorationWellbore, and DevelopmentWellbore.</p>
        <p>Table 9 shows the impact of the wrong selectivities on the performance (response
time in milliseconds) of evaluation for the queries under consideration. Observe that
the performance measured over the DB instances differ sensibly from the one measured
over OBDA instances, or over the original NPD instance. This is due to the higher costs
for the join operations in DB instances, that in turn derive from the wrong selectivities
discussed in the previous paragraph.</p>
        <p>Limitations. Observe that VIG considers only a limited set of similarity measures, and
that the produced instances will be similar only in terms of these measures. For
instance, we have already discussed how VIG is not able to reproduce constraints such as
multi-attribute foreign keys or non-uniform data distributions. Thus, although we show
here how VIG seems to suffice for the BSBM benchmark (under our assumptions for
queries, mappings, and testing platform), we expect it not to perform as good in more
complex scenarios, where the non-supported measures become significant. Moreover,
an intrinsic weakness of the scaling approach in general is that it only considers a single
source data instance: in case certain measures depend on the size of the instance, as it
seems to be the case for two classes and properties in Table 7, then the scaled instances
might significantly diverge from the real ones.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>In this section we discuss the relation between VIG and other data scalers, as it makes
little sense to compare it to classic data generators used in OBDA benchmarks as, for
instance, the one found in the Texas Benchmark 7.
7 http://obda-benchmark.org/</p>
      <p>
        UpSizeR [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] replicates two kinds of distributions observed on the values for the
key columns, called joint degree distribution and joint distribution over co-clusters8.
However, this requires several assumptions to be made on the , for instance tables can
have at most two foreign keys, primary keys cannot be multi-attribute, etc. Moreover,
generating values for the foreign keys require reading of previously generated values,
which is not required in VIG. A strictly related approach is Rex [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which provides,
through the use of dictionaries, a better handling of the content for non-key columns.
      </p>
      <p>
        In terms of similarity measures, the approach closest to VIG is RSGen [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], that also
considers measures like NULL ratios or number of distinct values. Moreover, values are
generated according to a uniform distribution, as in VIG. However, the approach only
works on numerical data types, and it seems not to support multi-attribute primary keys.
A related approach, but with the ability of generating data for non-numerical fields, has
been proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Notably, this approach is able to produce realistic text fields by
relying on machine-learning techniques based on Markov chains.
      </p>
      <p>
        In RDF graph scaling [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], an additional parameter, called node degree scaling
factor, is provided as input to the scaler. The approach is able to replicate the phenomena
of densification that have been observed for certain types of networks. We see this as a
meaningful extension for VIG, and we are currently studying the problem of how this
could be applied in an OBDA context.
      </p>
      <p>Observe that all the approaches above do not consider ontologies nor mappings.
Therefore, many measures important in a context with mappings and ontologies and
discussed here, like selectivities for joins in a co-cluster, class disjointness, or reuse
of values for fixed-domain columns, cannot are not taken into consideration by any of
them. This leads to problems like the one we discussed through our running example,
and for which we showed how it affects the benchmarking analysis in Section 5.
7</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Development Plan</title>
      <p>In this work we presented VIG, a data-scaler for OBDA benchmarks. VIG integrates
some of the measures used by database query optimizers and existing data scalers with
OBDA-specific measures, in order to deliver a better data generation in the context of
OBDA benchmarks.</p>
      <p>We have evaluated VIG in the task of generating data for both the BSBM and NPD
benchmarks. In the first, we measured how similar is the data produced by VIG to the
one produced by the native BSBM generator, obtaining encouraging results. In the
latter, we provided an empirical evaluation of the impact that the most distinguished
feature of VIG, namely the mappings analysis, has on the shape of the produced instances,
and how it affects the measured performance of benchmark queries.</p>
      <p>
        The current work plan is to enrich the quality of the data produced by adding
support for multi-attribute foreign keys, joint degree and value distributions, and intra-row
correlations (e.g., objects from SuspendedWellbore might not have a completionYear).
Unfortunately, we expect that some of these measures conflict with the current feature
of constant time for generation of tuples. Moreover, many of them require access to
previously generated tuples in order to be calculated (e.g., joint-degree distribution [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]).
8 The notion of co-cluster has nothing to do with the notion of columns-cluster introduced here.
      </p>
      <p>A related problem is how to extend the notion of “scaling” to the other components
forming an input for the OBDA system, like the mappings or the ontology. We are not
aware of any work in this direction, but we see it as an interesting research problem to
be addressed in the future.</p>
      <p>Acknowledgment This paper is supported by the EU project Optique FP7-318338.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Apt</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Principles of Constraint Programming</article-title>
          . Cambridge University Press, New York, NY, USA (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Berlin SPARQL benchmark</article-title>
          .
          <source>Int. J. on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerqueus</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kristiansen</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ReX: Extrapolating relational data in a representative way</article-title>
          . In: Maneth,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <source>(ed.) Data Science, LNCS</source>
          , vol.
          <volume>9147</volume>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>107</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cogrel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Komla-Ebri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontchakov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>RodriguezMuro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
          </string-name>
          , G.:
          <article-title>Ontop: Answering SPARQL queries over relational databases</article-title>
          .
          <source>Semantic Web J</source>
          . (
          <year>2016</year>
          ), to appear
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez-Muro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
          </string-name>
          , R.:
          <article-title>Ontologies and databases: The DL-Lite approach</article-title>
          . In: Tessaris,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Franconi</surname>
          </string-name>
          , E. (eds.)
          <article-title>RW 2009 Tutorial Lectures</article-title>
          , LNCS, vol.
          <volume>5689</volume>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>356</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
          </string-name>
          , R.:
          <article-title>Linking data to ontologies: The description logic DL-Litea</article-title>
          .
          <source>In: Proc. of OWLED</source>
          <year>2006</year>
          .
          <article-title>CEUR, ceur-ws.org</article-title>
          , vol.
          <volume>216</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tractable reasoning and efficient query answering in description logics: The DL-Lite family</article-title>
          .
          <source>J. of Automated Reasoning</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>385</fpage>
          -
          <lpage>429</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>R2RML: RDB to RDF mapping language</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <source>W3C (Sep</source>
          <year>2012</year>
          ), available at http://www.w3.org/TR/r2rml/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundaresan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Englert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baclawski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>P.J.:</given-names>
          </string-name>
          <article-title>Quickly generating billion-record synthetic databases</article-title>
          .
          <source>In: Proc. of ACM SIGMOD</source>
          . pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          . ACM (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL 1.1 query language</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <source>W3C (Mar</source>
          <year>2013</year>
          ), available at http://www.w3.org/TR/sparql11-query
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kontchakov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez-Muro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zakharyaschev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Answering SPARQL queries over databases under OWL 2 QL entailment regime</article-title>
          .
          <source>In: Proc. of ISWC 2014. LNCS</source>
          , vol.
          <volume>8796</volume>
          , pp.
          <fpage>552</fpage>
          -
          <lpage>567</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lanti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The NPD benchmark: Reality check for OBDA systems</article-title>
          .
          <source>In: Proc. of EDBT 2015</source>
          . pp.
          <fpage>617</fpage>
          -
          <lpage>628</lpage>
          . OpenProceedings.org (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lanti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : VIG. https://github.com/ontop/vig (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fokoue</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>OWL 2 Web Ontology Language profiles (second edition)</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <source>W3C (Dec</source>
          <year>2012</year>
          ), available at http://www.w3.org/TR/owl2-profiles/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Prud</surname>
            'homme,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fages</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorca</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Choco Documentation</article-title>
          . TASC, INRIA Rennes,
          <source>LINA CNRS UMR 6241</source>
          ,
          <string-name>
            <surname>COSLING</surname>
            <given-names>S.A.S.</given-names>
          </string-name>
          (
          <year>2015</year>
          ), http://www.choco-solver. org/, available at http://www.choco-solver.org/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Qiao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , O¨ zsoyog˘lu,
          <string-name>
            <surname>Z.M.:</surname>
          </string-name>
          <article-title>RBench: Application-specific RDF benchmarking</article-title>
          .
          <source>In: Proc. of ACM SIGMOD</source>
          . pp.
          <fpage>1825</fpage>
          -
          <lpage>1838</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rabl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danisch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacobsen</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          :
          <article-title>Just can't get enough: Synthesizing big data</article-title>
          .
          <source>In: Proc. of ACM SIGMOD</source>
          . pp.
          <fpage>1457</fpage>
          -
          <lpage>1462</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Reversing statistics for scalable test databases generation</article-title>
          .
          <source>In: Proc. of DBTest</source>
          . pp.
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <issue>7</issue>
          :
          <issue>6</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Tay</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>E.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>UpSizeR: Synthetically scaling an empirical relational database</article-title>
          .
          <source>Information Systems</source>
          <volume>38</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1168</fpage>
          -
          <lpage>1183</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>