<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The BAY-HIST Prediction Model for RDF Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edna Ruckhaus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mar´ıa-Esther Vidal</string-name>
          <email>mvidalg@ldc.usb.ve</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Simo ́n Bol ́ıvar Caracas</institution>
          ,
          <country country="VE">Venezuela</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In real-world RDF documents, property subject and object values are often correlated. The identification of these relationships is of significant relevance to many applications, e.g., query evaluation planning and linking analysis. In this paper we present the BAY-HIST Prediction Model, a combination of Bayesian networks and multidimensional histograms which is able to identify the probability of these dependencies. In general, Bayesian networks assume a small number of discrete values for each of the variables considered in the network. However, in the context of the Semantic Web, variables that represent the concepts in large-sized RDF documents may contain a very large number of values; thus, BAY-HIST implements multidimensional histograms in order to aggregate the data associated with each node in the network. We illustrate the benefits of applying BAY-HIST to the problem of query selectivity estimation as part of costbased query optimization. We report initial experimental results on the predictive capability of this model and the e ectiveness of our optimization techniques when used together with BAY-HIST. The results suggest that the quality of the optimal evaluation plan has improved over the plan identified by existing cost models that assume independence and uniform distribution of the data values.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The number of controlled vocabularies and annotated data sources in the Web has
exploded in the last few years. Individually, many of these documents contain a large
number of concepts and instances, and additionally their growth rate is very high. Thus,
in order to be capable of scaling up, Web architectures have to be tailored for query
processing on large number of resources and instances. We apply BAY-HIST to the
problem of query selectivity estimation as part of cost-based query optimization.</p>
      <p>The Prediction Model BAY-HIST is a framework that combines Bayesian networks
and multidimensional histograms with the purpose of determining dependencies
between properties in RDF documents and the distribution of their values. Bayesian
Networks are probabilistic models that allow a compact representation of the joint
distribution of the concepts defined in an RDF document. In general, Bayesian networks
assume a small number of discrete values for each of the variables considered in the
network. However, in the context of RDF documents in the Semantic Web, variables
that represent the concepts in large-sized RDF documents may contain a very large
number of values; thus, BAY-HIST implements multidimensional histograms in order
to aggregate the data associated with each node in the Bayesian network that represents
the RDF document.</p>
      <p>
        BAY-HIST has been included as a component of the OneQL System, an
Ontology System that provides optimization and query evaluation techniques that scale up
to large RDF/RDF(S) documents [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ]. We report initial experimental results on the
predictive capability of this model and the e ectiveness of our optimization techniques
when used together with BAY-HIST. The results suggest that the quality of the optimal
evaluation plan has improved compared to the plan identified by existing cost models
that assume independence and uniform distribution of the data values, by up to two
orders of magnitude.
      </p>
      <p>The structure of this paper is as follows: first, we will give a motivating example.
Following this, we will present the syntax and semantics of BAY-HIST. Next, we will
explain the architecture of the BAY-HIST Prediction Model and its application to
costbased query optimization. Then, the experimental study will be described, and finally,
the conclusions and future work will be presented.
2</p>
    </sec>
    <sec id="sec-2">
      <title>A Motivating Example</title>
      <p>The example that follows shows a query to the RDF repository published at
http://www.govtrack.us/. In this example, besides information concerning the U.S.
congress bills voting process, we consider information of the census such as religion
and gender, and political information such as the party and the state that is represented
by each representative that participates in the voting process. Consider the relationships
between party, gender, religion, state and the way a representative votes. To discover if
there is any correlation among the values of these five properties, we will try to
determine if for di erent instantiations of the following query, di erent number of tuples are
obtained: Names of all the representatives of state ?S , that belong to party ?P, are of gender ?G,
are of religion ?R and have voted for the winning option in the voting process of Bill ?B. The
SPARQL representation of this query is illustrated in Figure 1.</p>
      <p>PREFIX pol:&lt;tag:http://www.rdfabout.com/rdf/schema/politico/&gt;
PREFIX vote:&lt;tag:http://www.rdfabout.com/rdf/schema/vote/&gt;
PREFIX foaf:&lt;tag:http://xmlns.com/foaf/0.1/&gt;
SELECT ?X
FROM &lt;tag:http://www.examples.org/votesdataset/&gt;
WHERE
f?X pol:forO ce ?S . ?X pol:party ?P . ?Z pol:hasRole ?X . ?Z foaf:gender ?G .
?Z foaf:religion ?R . ?O vote:votedBy ?X . ?B vote:winner ?Og</p>
      <p>This query may have di erent subject and object instantiations (constants). For
instance, we may want to explore for a certain Bill, the di erent combinations of
instantations for party, religion, gender and state. While for a certain set of instantiations the
query has 18 answers, for another one it has no answers. This behavior is due to the lack
of uniformity in the property value distribution and the dependency between properties.
For example, the probability that a representative has voted for the winning option in
the voting process of Bill 1998-173 if he is Catholic, male, belongs to the Democratic
party and represents the state of Massachussets is much higher than the probability that
a representative has voted for the winning option in the voting process of Bill 1998-173
if he is Jewish, male, Republican and represents Oklahoma. The identification of these
relationships is of significant relevance to many applications. For instance, in query
evaluation planning, this information may provide the basis for the optimizer to
discriminate between bad or good query plans.
3</p>
    </sec>
    <sec id="sec-3">
      <title>The BAY-HIST Prediction Model</title>
      <p>Consider the RDF repository presented in the previous example. Let us assume that
there are certain causal relationships between the subjects and objects of properties that
are represented as an RDF Bayesian Network (RBN), as shown in Figure 2. In this
RBN, there are nodes that represent property subjects or objects. For example, node
o-religion represents the values (objects) of property religion. We also represent the
event of a combination between subjects or objects of related properties. Such is the
case of node s-s-foroffice-party that represents the event that a subject that is
representing a certain state, belongs to a certain party. The arcs in this network represent
dependencies between nodes. In this network we model that the combination of voter
and gender is conditioned not only by the gender itself, but also by the state he
represents and the party to which he belongs to; thus, the probability that a person’s gender is
‘male’, the state is ‘Oklahoma’ and that he belongs to the ‘Republican’ party is 0.033.
This probability is related to the probabilities of all the rest of combinations of gender,
state and party. Tables 1(a) and 1(b) show a portion of the conditional probability tables
(CPT) of this RBN. An RBN represents all the conditional dependencies among
propDefinition 1 (RDF Bayesian Network) Given an RDF directed graph
OR = (VR; ER) where VR and ER are the nodes and arcs in the RDF graph. An RDF
Bayesian Network RB for OR, is a pair RB = hOB; CPTBi, where OB = (VB; EB) is a
DAG. VB are the nodes in OB and EB are the arcs in OB. CPTB are the Conditional
Probability Tables for each node. The homomorphism f : P(ER) ! P(VB) establishes
mappings between OR and OB:
f (f(sub; pro; ob j)g) = fs-pro; o-prog
f (f(sub1; pro1; ob j); (sub2; pro2; ob j)g) = fo-o-pro1-pro2; o-o-pro2-pro1g
f (f(sub; pro1; ob j1); (sub; pro2; ob j2)g) = fs-s-pro1-pro2; s-s-pro2-pro1g
f (f(sub; pro1; ob j1); (sub2; pro2; sub)g) = fs-o-pro1-pro2; o-s-pro2-pro1g
(Mapping 1)
(Mapping 2)
(Mapping 3)
(Mapping 4)</p>
      <p>VC VB, where VC is the union of the sets of nodes established by mappings 2 to
4, and it is comprised of all the nodes that represent property combinations.</p>
      <p>EB VB VC is the set of arcs. An arc (v1; v2) 2 EB i there exist two sets of nodes
in the RBN, V1 VB and V2 VC such that, v1 2 V1 and v2 2 V2 and when f 1 is
applied to these sets, a subset of arcs in the RDF graph is obtained.</p>
      <p>CPTB is the probability Pr(v=predecessors(v)) for each node v 2 VB, i.e., the
distribution on the values of v for each possible value assignment of its predecessors. The
CPTB are multidimensional histograms ordered by value. If a node v is a source node,
the histogram will be one-dimensional, because in this case the CPTB only represents
the distribution of values taken up by the variable represented by the node. For each
node v, according to the properties of the distribution of the values of v, CPTB can be
represented as an equi-width histogram or as an equi-height histogram.
Example 1 Next, we illustrate the use of the homomorphism f . Figure 3 shows a
portion of an RDF graph (OR) and its corresponding RBN graph (OB). Mapping 1 is
applied to the sets of RDF arcs f(rep1,foroffice,va)g and f(rep2,party,democratic)g:
f (f(rep1,foroffice,va)g)=fs-foroffice,o-forofficeg
f (f(rep2,party,democratic)g)=fs-party,o-partyg
politico:rep2
OR
politico:foroffice
politico:party
va
ma
Democrat
s-foroffice
Then, Mapping 3 is applied to the set of RDF arcs f(rep2,foroffice,ma),
(rep2,party,democratic)g</p>
      <p>f (f(rep2,foroffice,ma),(rep2,party,democratic)g)=fs-s-foroffice-party,s-s-party-forofficeg
The arc (o-foroffice,s-s-foroffice-party) belongs to EB because the arcs
obtained by applying the inverse of f are subsets of ER:
f 1(fs-foroffice,o-forofficeg)[ f 1(fs-s-foroffice-party,s-s-party-forofficeg)=
Intuitively, an RBN is semantically valid if its arcs have been established between nodes
that map to properties whose subjects and objects are of the same type, i.e., have some
type of matching instantiations, subject-subject, subject-object or object-object. For
example, an arc from node o-s-hasrole-party to node s-s-gender-religion is
semantically valid because there are matching subject-subject instantiations between triples
of property hasrole and triples of religion, i.e., both are “persons”.</p>
      <p>Given the symmetry property of the combinations between triple patterns, the set
VB may contain only one of the nodes in the sets defined with mappings 2, 3 and 4 in
Definition 1; thus, the resulting RBN is minimal:
Definition 2 (Minimal RBN) Given an RBN RB = hOB; CPTBi. RB is a Minimal RBN
if the set VB contains exactly one node in sets fs-s-pro1-pro2, s-s-pro2-pro1g,
fs-o-pro1pro2, o-s-pro2-pro1g and fo-o-pro1-pro2, o-o-pro2-pro1g.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Architecture</title>
      <p>
        is generated for each node in the RBN structure through the stored procedures and the
histogram option implemented by the DBMS Oracle [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Both, the RBN structure and
CPT’s are fed to the SamIam network editor, and a Bayesian network is generated in
one of the internal formats recognized by the SamIam tool.
      </p>
      <p>When a query is received, the RBN Inference Engine constructs the corresponding
probability query (e.g., marginal probability and posterior marginal probability) and
passes this query on to the SamIam inference engine which then returns an answer.</p>
      <p>RDF
Documents</p>
      <p>RDF
Relational</p>
      <p>Database
RDF Graph</p>
      <p>RBN Structure
RBN Analyzer</p>
      <p>Query</p>
      <p>Answer
RBN Query Engine</p>
      <p>Bayesian
Network
Editor
SAMIAM</p>
      <p>Bayesian
Inference Tool</p>
      <p>Bayesian
Inference
Engine</p>
      <p>DBMS</p>
      <p>CPT</p>
      <p>Multi
Dimensional</p>
      <p>Histogram
RDF Bayesian</p>
      <p>
        Network
The BAY-HIST Prediction Model is applied to query selectivity estimation. These
estimates are used within the cost model of a cost-based query optimizer as part of the
formulas that compute the cost and cardinality of query sub-plans. We have developed
a randomized optimization strategy based on the Simulated Annealing algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
This algorithm explores execution plans of any shape (bushy trees) in contrast to other
optimization algorithms that explore a smaller portion, e.g., left-linear plans. Random
walks are performed in stages that consist of an initial random plan generation step
followed by one or more plan transformation steps. An equilibrium condition or a number
of iterations determines the number of transformation steps in each stage.
      </p>
      <p>The probability of transforming a current plan p into a new plan p0 is specified by
an acceptance probability function P(p; p0; T ) that depends on a global time-varying
parameter T called the temperature which reflects the number of stages to be executed.
Function P may be nonzero when cost(p0) &gt; cost(p), meaning that the optimizer can
produce a new plan even when it has a higher cost than the current one. This feature
prevents the optimizer from becoming stuck in a local minimum. Temperature T is
decreased during each stage, and the optimizer concludes when T = 0. Transformations
applied to the plan during the random walks correspond to SPARQL axioms, e.g.,
commutativity and associativity of the ‘.’ operator. The optimizer is able to identify near
optimal solutions because of the precision of estimates that take into account
correlations of values and non uniform distribution.</p>
      <p>Using BAY-HIST, the selectivity of an RDF query execution plan that joins A and
B over join arguments J (A onJ B) is expressed in terms of a probability query against
the corresponding RBN:
fs(AonJ B) =Y Pr(JoinEventJ =(JoinEvidJ A ^ JoinEvidJ B ^ instEvidIA ^ instEvidIB))</p>
      <p>J2J
This is a posterior marginal probability query, i.e., the probability that two pattern
instantiations are combined, given the evidence of the instantiations and the joins in its
left and right sub-trees.</p>
      <p>The probability queries associated with an RDF pattern (the base case) correspond
to marginal probabilities, i.e., to the probability that the value of subjects or objects
of the property in the pattern is equal to the instantiation in the pattern: Pr(o-pro=obj),
Pr(s-pro=sub) or Pr(s-pro=sub ^ o-pro=obj).</p>
      <p>An estimate of the selectivity of an RDF pattern A, carried out by using a probability
query on the RBN is more precise than an estimate carried out by using the traditional
cost model. The traditional cost model defines the following selectivity formula:
(1)
(2)
fs(A; J ) =</p>
      <p>Y 1=nKeys(A; J)</p>
      <p>J2J
where nKeys(A; J) is the number of di erent values taken up by J in pattern A.
Likewise, an estimate of the selectivity of a sub-plan A onJ B carried out through a
probability query on the RBN is more precise than an estimate carried out through the traditional
cost model. The selectivity formula in the traditional cost model is as follows:
fs(A; B; J ) = Y 1=max(nKeys(A; J); nKeys(B; J))</p>
      <p>J2J
These traditional formulas do not compute a precise estimate of the query evaluation
costs because they are based on the following assumptions: (a) the values of the subjects
and objects in a triple pattern are uniformly distributed, (b) the values of the subjects
and objects in a pattern are independent, and (c) the values of the subjects and objects
in properties of the patterns that are combined in a query, are independent.</p>
      <p>The example that follows shows the motivating example query with two di erent
sets of instantiations:
– Names of all the male representatives of the state of Massachussets that belong to the
Democratic party, are Catholic and have voted for the winning option in the voting process of Bill
1998-173.
– Names of all the male representatives of the state of Oklahoma that belong to the Republican
party, are Jewish and have voted for the winning option in the voting process of Bill
1998173.</p>
      <p>PREFIX pol:&lt;tag:http://www.rdfabout.com/politico/&gt;
PREFIX vote:&lt;tag:http://www.rdfabout.com/vote/&gt;
PREFIX foaf:&lt;tag:http://xmlns.com/foaf/0.1/&gt;
SELECT ?X
FROM &lt;tag:http://www.examples.org/votesdataset/&gt;
WHERE
PREFIX pol:&lt;tag:http://www.rdfabout.com/politico/&gt;
PREFIX vote:&lt;tag:http://www.rdfabout.com/vote/&gt;
PREFIX foaf:&lt;tag:http://xmlns.com/foaf/0.1/&gt;
SELECT ?X
FROM &lt;tag:http://www.examples.org/votesdataset/&gt;</p>
      <p>WHERE
f?X pol:forO ce senate:ma .
?X pol:party ’Democratic’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Catholic’ .
?O vote:votedBy ?X .
’1998-173’ vote:winner ?Og
(a) SPARQL Query 1
f?X pol:forO ce senate:ok .
?X pol:party ’Republican’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Jewish’ .
?O vote:votedBy ?X .
’1998-173’ vote:winner ?Og</p>
      <p>(b) SPARQL Query 2
The SPARQL representation of these two queries is illustrated in Figure 5. Query 1 and
Query 2 di er in their subject and object instantiations (constants), and their answers
are di erent: while the first query has 18 answers, the second one has no answers.
This behavior is due to the lack of uniformity in the property value distribution and
the dependencies between properties. Based on this observation, we use an RBN to
di erentiate the selectivity of the sub-plans of each query execution plan taking into
account the existing correlation between the various RDF properties. To estimate the
selectivity of the sub-plan shown in Figure 6(a), a posterior marginal probability query
is carried out in the RBN and the result of this probability query is 0:0275.
f?X pol:forO ce senate:ma .
?X pol:party ’Democratic’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Catholic’g
(a) Query Sub-Plan</p>
      <p>Pr(s-s-hasrole-gender=true /\
o-s-hasrole-foroffice=true /\ 3.182.941
o-s-haoss--rgsoe-lfeno-drp-eaorrf=tfiy"cm=eta-rpuleae"rt//y\=true /\ _Z,X
o-for-office="ma"
o-party="Democrat1ic6")4.720
For the corresponding sub-plan in the second query, i.e., the same sub-plan with
di erent instantiations, the result of the inference on the RBN is 0, which is consistent
with the expectation that the cardinality of the first query is higher than the cardinality
of the second query. Figure 6(b) shows the tree representation of the sub-plan in Figure
6(a). Each node is annotated with the probability query corresponding to the sub-plan
(sub-tree) selectivity estimate, and with its cardinality. The cost estimate of the
subplan, is equivalent to the total number of intermediate results that must be estimated to
obtain the answer:
cost(P) = 62 + 854 + 2:059 + 80 + 164:720 + 1:104 + 3:182:941 + 2 = 3:351:822
6</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Bayesian networks are applied to the problem of imprecise estimates of the
selectivity of a relational query; this framework is known as the Probabilistic Relational
Model (PRM). This imprecision stems from the assumption of uniform distribution of
values for attributes in a table, attribute independence in one table, and attribute
independence in tables that are semantically related. The proposed solution uses a
probabilistic model to represent the distribution of values of each attribute and the
correlations between attributes. Thus, instead of computing the query selectivity in terms of
the number of di erent values of each attribute in the select condition of the query,
the selectivity is computed using the result of a probability query to the model. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
Statistical Relational Models (SRM) were developed. They are di erent from PRM
because they represent a statistical model of a particular database state instead of
representing any state. Thus, Conditional Probability Table (CPT) construction in SRMs
is done through queries to the database whereas the structure and CPT construction in
PRMs is conducted by using machine-learning techniques.
      </p>
      <p>
        The di erence between the solution proposed by Getoor, et. al. [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and the
solution presented in our paper, is the scalability to large-sized RDF repositories by means
of multidimensional histograms. The SRM, developed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] assume a low number of
values for each variable in the model. On the other hand, although in our work, an
RDF document is modeled similarly to an SRM, its nodes and arcs have a particular
semantics based on the RDF graph semantics, i.e., subject, property and object triples.
Besides this, in our proposed RBN model, there are also Join variables, but restricted
to the possible combinations between subjects and objects. Additionally, the purpose of
the Bayesian network proposed by Getoor, et. al., is the estimation of query selectivity.
In our work, Bayesian networks are applied to RDF documents in order to estimate the
selectivity of query evaluation plans and sub-plans.
      </p>
      <p>
        The work described in [
        <xref ref-type="bibr" rid="ref11 ref12 ref9">9, 11, 12</xref>
        ] extends the Ontology Web Language (OWL) with
constructs that allow the annotation of an ontology with probabilities and causal
relationships. These annotations are done with the purpose of reasoning on uncertainty in
ontologies. Once an ontology is annotated, it is translated to a Bayesian network, and
Bayesian inference queries may be answered. The main di erence between these
models and our research is that since the information on subject an object values are kept
in an aggregated form, our combinated approach of Bayesian networks and
multidimensional histograms scales up to large RDF documents. Besides this, in our work we
define random variables that represent the event that a property may be combined (Join)
with another property; these type of variables are not considered in these approaches.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Study</title>
      <p>The goal of the experimental study was to analyze the benefits of the proposed
predictive model when applied to the problem of query optimization. First, the
predictive capacity of the model was studied and then, the quality of the optimal query was
compared to the original query and to the optimal plan identified by a cost model that
assumes independence between properties and uniform distribution of values.</p>
      <p>We used the real-world dataset on the US Congress bills voting process for the years
1998, 1999 and 2000 published at http://www.govtrack.us/. Besides the election
results, we also consider census information about representatives such as religion and
gender, and political information such as the party and their state. The number of triples
in the dataset for years 1998, 1998-1999 and 1998-1999-2000 is 50; 860, 94; 590 and
128; 852, respectively.</p>
      <p>The query benchmark is comprised of 112 queries with five instantiated patterns.
The properties in the patterns and their ordering are the same for all queries, but the
instantiations are di erent. Previously, we determined that these properties are correlated
and thus, queries with di erent instantiations will have di erent selectivity.</p>
      <p>
        We use the Bayesian inference tool, SamIam [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], to build the RBN based on the
graph structure, and the CPT which is represented as a multidimensional histogram.
Currently, the graph structure is built by hand, but this could be done semi-automatically.
The graph in the RBN was built according to the properties represented in the
ontology. Then, the CPT were developed using multidimensional histograms to aggregate
the node values. The structure of this RBN was illustrated before as Figure 2. Each
CPT for a target node is a multidimensional histogram, where the first dimension
corresponds to a node itself, and the rest of the dimensions correspond to the predecessors
of the node. The algorithm for multidimensional histogram generation constructs a
histogram for the first dimension, and then for each bucket, it generates a histogram for
the second dimension, and so on, until all dimensions are completed. These histograms
were generated through the histogram options provided by the Oracle DBMS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
default histogram option generates equal-width or equal-height histograms according
to the number of di erent values of an attribute and its distribution.
      </p>
      <p>In order to exploit the DBMS histogram mechanisms, we loaded a relational table
for each property in the ontology. For each target node, we created a relational table
that is a combination of the subject or object of the property that is represented by
the node, with the subjects or objects of all its predecessors. We used methods in the
Oracle package DBMS STATS to generate an histogram on the column that represents
the target node in the “combination” table. Then, for each bucket we created a table and
again used DBMS STATS to generate an histogram on the second dimension, and so
on until all the dimensions had been covered. The histogram was completed with the
computation of the frequency of each value of the target node given the di erent sets of
values of its predecessors.</p>
      <p>
        Bayesian inference queries are posed to the network through the SamIam tool in
order to estimate the selectivity of each query based on the instantiations of its
patterns. We use one of the algorithms implemented by SamIam, the Shenoy-Shafer exact
inference algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each query was also evaluated and we obtained the number
of results. Thus, we compared the estimate of the selectivity with the actual number
of answers. The correlation value is 0.95. This result indicates that there exists a
linear relationship between the estimates and the actual values, so we may assert that the
BAY-HIST model is capable of predicting the selectivity of a query plan or sub-plan,
and therefore, we can have a precise estimate of this plan’s evaluation cost.
      </p>
      <p>The purpose of our next experiment was to study the e ectiveness of our
optimization techniques when used with the BAY-HIST prediction model. Given that the
BAY-HIST model is capable of considering dependencies between properties and its
distribution of values, the quality of the optimal plan identified by the optimizer using
BAY-HIST should be better than the quality of a plan identified by an optimizer that
uses a cost model that does not consider dependencies between properties and
nonuniform distribution of values. We report on runtime performance, which corresponds
to the user time produced by the time command of the Unix operation system.</p>
      <p>We used the same dataset and RBN as the previous experiment. We also used the
same query benchmark, but we shu ed the queries, evaluated them and chose the 21
queries that had the worst evaluation time. The experiment was performed using these
21 queries. The Simulated Annealing optimization algorithm was configured with an
initial temperature of 700, and 20 iterations in the initial stage.</p>
      <p>We compared the performance of the original query, the optimal plan identified by
the optimizer with the model that assumes property independence and uniform
distribution, and the optimal plan identified by the optimizer with the BAY-HIST model. These
plans were evaluated with and without index structures1.</p>
      <p>The average evaluation time is reported in Figure 7. We can observe that the
performance of the optimal plans without index structures exceeds the performance of the
original queries by up to one order of magnitude. The improvement with the use of the
index strucures with respect to the original plans is up to two orders of magnitude, but
the improvement is even greater when the optimizer uses the BAY-HIST model. We
also observed that this di erence is proportional to the incremental size of the datasets.</p>
      <p>These results indicate that the quality of the plan identified by the optimizer and
the BAY-HIST model, is better than the quality of the optimal plan identified by the
optimizer with the traditional prediction model and the benefits are even greater when
index structures are used.
8</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>We present the BAY-HIST Prediction Model, a combination of Bayesian networks and
multidimensional histograms, which is able to estimate correlations between data values
in an RDF document as well as their distribution. We study the benefits of applying
BAY-HIST to the problem of query selectivity estimation as part of cost-based query
optimization; also, we report initial experimental results that suggest that the quality of
the optimal evaluation plans can be improved when selectivity is estimated using the
BAY-HIST Prediction Model.</p>
      <p>
        In the future we plan to use BAY-HIST on the RDF(S) and OWL formalisms; also,
we will study the benefits of this prediction model when it is used to discover links
be1 Denoted as Bhyper according to the hypergraph RDF model that these index structures
implement [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
0.01 
votes 98 
votes 98‐99  votes 98‐99‐00 
Original w/Bhyper 
Op3mal  
Op3mal w/Bhyper 
Op3mal w/RBN 
Op3mal w/Bhyper and RBN 
RDF Document 
tween data terms. Currently, the optimization algorithm queries the RBN for the
selectivity of all the sub-plans in each execution plan. Future work will also include keeping
track of probability queries posed against an RBN in each execution plan, in order to
improve the e ciency of the cost model.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. SamIam - Sensitivity
          <source>Analysis Modeling Inference and More. Automated Reasoning Group</source>
          , University of California, Los Angeles. http://reasoning.cs.ucla.edu/samiam/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Darwiche</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Modeling and Reasoning with Bayesian Networks</article-title>
          . Cambridge University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Martinez</surname>
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vidal M. A Directed Hypergraph</surname>
          </string-name>
          <article-title>Model for RDF</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>KWEPSY</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ruckhaus</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz</surname>
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vidal</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Query evaluation and optimization in the semantic web</article-title>
          .
          <source>Theory and Practice of Logic Programming - TPLP</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>393</fpage>
          -
          <lpage>409</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Getoor L.
          <article-title>Learning statistical models from relational data</article-title>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Getoor</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            <given-names>B.</given-names>
          </string-name>
          , and Koller D.
          <article-title>Selectivity estimation using probabilistic models</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>461</fpage>
          -
          <lpage>472</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Vidal</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruckhaus</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampo</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sierra</surname>
            <given-names>J.</given-names>
          </string-name>
          , and
          <article-title>Polleres A. E ciently joining group patterns in SPARQL queries</article-title>
          .
          <source>In Proceedings ESWC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. ORACLE.
          <article-title>Oracle Database Management System</article-title>
          . http://www.oracle.com/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Da</surname>
            <given-names>Costa P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laskey</surname>
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Laskey</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>PR-OWL</surname>
          </string-name>
          :
          <article-title>A bayesian ontology language for the semantic web</article-title>
          .
          <source>In ISWC-URSW</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lampo</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruckhaus</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sierra</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
            <given-names>M.</given-names>
          </string-name>
          , and
          <article-title>Martinez A. OnEQL: An Ontology-based Architecture to E ciently Query Resources on the Semantic web</article-title>
          .
          <source>In Proceedings of SSWS, collocated with ISWC</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Yi</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Calmet</surname>
          </string-name>
          .
          <article-title>Ontobayes: An ontology-driven uncertainty model</article-title>
          .
          <source>In Proceedings CIMCA '05</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Ding
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Peng</surname>
          </string-name>
          <string-name>
            <given-names>Y.</given-names>
            , and
            <surname>Pan R. A</surname>
          </string-name>
          <article-title>Bayesian Approach to Uncertainty Modeling in OWL Ontology</article-title>
          .
          <source>In Proceedings of the International Conference on Advances in Intelligent Systems - Theory and Applications</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>