<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Triple Pattern Join Cardinality Estimations over HDT with Enhanced Metadata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Wössner</string-name>
          <email>elena.woessner@student.kit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chang Qin</string-name>
          <email>chang.qin@student.kit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier D. Fernández</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maribel Acosta</string-name>
          <email>maribel.acosta@kit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Complexity Science Hub Vienna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute AIFB, Karlsruhe Institute of Technology (KIT)</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vienna University of Economics and Business</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we present hdt-stats, an extension to the hdt operations, to compute further metadata when evaluating triple patterns over RDF graphs represented with hdt. Then, we propose a novel model that relies on the hdt-stats metadata, as well as the distinct position of SPARQL variables, to estimate the cardinality of joins between triple patterns. Our preliminary results suggest that our approach produces more accurate cardinality estimations than existing solutions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        hdt (Header, Dictionary, Triples) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a compact serialization format for RDF
graphs. hdt supports triple-pattern-based querying and provides metadata about
the estimated number of matching RDF triples. Since hdt natively evaluates only
individual triple patterns, executing SPARQL queries over hdt requires
combining the results of those evaluations. Current query engines that operate over hdt interfaces
(e.g., nLDE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) implement optimization techniques that rely on simple
cardinality estimations to devise query plans that reduce execution time. However,
estimating the cardinality of join operators with insufficient metadata is rather
challenging and, in most cases, leads to inefficient query plans. To assist query
engines in devising more effective query plans, in this work we propose
hdt-stats, a novel interface that enhances the metadata provided by hdt. Besides
cardinality estimations for triple patterns, our approach provides the number
of distinct subjects, predicates, and objects in result sets. Then, we propose a
novel model for estimating the cardinality of joins of triple patterns, based on
the available metadata provided by hdt-stats. This work covers the estimation
of the cardinality of subject-subject, object-object, and subject-object joins. In
contrast to state-of-the-art cost models for SPARQL query optimization, e.g.,
CostFed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], our approach combines the hdt-stats metadata in a novel way
that produces more accurate estimates. To evaluate our solution, we conducted
an empirical study using 287 joins between triple patterns over DBpedia.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>
        Evaluation of Triple Patterns with HDT-STATS.
hdt supports rank and select operations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which allow for accessing triples in
the compressed RDF graph. With these operations, hdt is able to evaluate triple
patterns and to compute cardinality estimates of the number of triples that belong
to the solution. In this work, we propose hdt-stats, an interface that supports the
retrieval of further metadata about the evaluation of triple patterns over hdt
data structures. Formally, the evaluation of a triple pattern over an hdt graph
when using hdt-stats is defined as follows.1
      </p>
      <p>Given a triple pattern tp and an hdt RDF graph G, the solution of evaluating
tp over G with hdt-stats is a 5-tuple (Ω, card, ds, dp, do), where:
– Ω: the result set, i.e., the subset of RDF triples in the graph G that match
the triple pattern tp, i.e., Ω = {µ(tp) | dom(µ) = vars(tp) and µ(tp) ∈ G},
– card: estimated value for the total number of solutions |Ω|,
– ds: estimated number of distinct subjects in Ω, i.e., |{s | (s, p, o) ∈ Ω}|,
– dp: estimated number of distinct predicates in Ω, i.e., |{p | (s, p, o) ∈ Ω}|,
– do: estimated number of distinct objects in Ω, i.e., |{o | (s, p, o) ∈ Ω}|.</p>
      <p>hdt-stats is able to compute the metadata (ds, dp, do) efficiently, as it relies
on light-weight extensions to the hdt operations. This metadata is then exploited
by the cardinality estimation model for joins between triple patterns.</p>
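      <p>As an illustration of this metadata, the following sketch (ours, for exposition only; hdt-stats computes these values via light-weight extensions to the compressed hdt structures rather than over materialized triples) derives card, ds, dp, and do exactly from a result set of (s, p, o) triples:

```python
# Illustrative only: compute the hdt-stats metadata (card, ds, dp, do)
# exactly from a materialized result set of (s, p, o) triples. The real
# hdt-stats interface derives estimates from the compressed hdt structures.

def stats_of(result_set):
    """Return (card, ds, dp, do) for a list of (s, p, o) triples."""
    card = len(result_set)                       # total number of solutions
    ds = len({s for s, _, _ in result_set})      # distinct subjects
    dp = len({p for _, p, _ in result_set})      # distinct predicates
    do = len({o for _, _, o in result_set})      # distinct objects
    return card, ds, dp, do

# Toy result set: 3 triples, 2 distinct subjects/predicates/objects.
triples = [("ex:a", "ex:type", "ex:City"),
           ("ex:b", "ex:type", "ex:City"),
           ("ex:b", "ex:label", "'B'")]
```

      </p>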
      <sec id="sec-2-1">
        <title>Cardinality Estimation Model</title>
        <p>One of the key components in cost-based query optimization techniques is the
estimation of the number of intermediate results produced by the operators in a
query plan. In this work, we focus on the estimation of cardinalities when joining
triple patterns, i.e., ĉard(tpi ⋈ tpj) with tpi and tpj triple patterns.</p>
        <p>To estimate the cardinality of joins, we propose a light-weight model that
considers the novel metadata retrieved with hdt-stats, i.e., distinct subjects,
distinct predicates, and distinct objects contained in a result set. Considering
this metadata, our proposed model distinguishes between the different types
of joins that may occur when joining triple patterns. In this work, we focus
on subject-subject, object-object, and subject-object joins. In the following, we
propose separate estimations for each join type.</p>
        <p>Subject-Subject Joins. Subject-subject joins occur when two triple patterns
exclusively share a common variable in their subject position. To estimate the
cardinality of subject-subject joins, our proposed model considers the number
of distinct subjects contained in the solutions of the joined triple patterns. The
number of distinct subjects allows for estimating the selectivity of each triple
pattern in terms of subjects. In this case, the selectivity of subjects represents
the number of times (on average) that an arbitrary subject occurs in the solution
of a triple pattern. For example, a subject selectivity equal to 1 indicates that
every subject occurs exactly once in the solution set. To calculate this
selectivity, we simply divide the number of solutions by the number of distinct
subjects. Then, to calculate the cardinality of the join, our model divides the
total number of possible answers by the maximum number of distinct subjects,
which represents an upper bound on the number of subjects that will match.
Based on this, we define our proposed model in the following.</p>
        <p>1 We assume the terminology of SPARQL query evaluation as in the literature.</p>
        <p>Given triple patterns tpi = (?x, pi, oi) and tpj = (?x, pj, oj), assume that
(Ωi, cardi, dsi, dpi, doi) and (Ωj, cardj, dsj, dpj, doj) are the results of evaluating
triple patterns tpi and tpj over a graph G, respectively. The estimated cardinality
of performing tpi ⋈ tpj, denoted ĉard(tpi ⋈ tpj), is defined as follows:</p>
        <p>ĉard(tpi ⋈ tpj) = (cardi · cardj) / max(dsi, dsj) (1)</p>
        <p>Object-Object Joins. Object-object joins occur when two triple patterns
exclusively share a common variable in their object position, i.e., tpi = (si, pi, ?x)
and tpj = (sj, pj, ?x). Analogous to the subject-subject estimation, we propose
a model that considers the selectivity of objects in the result sets, as follows:</p>
        <p>ĉard(tpi ⋈ tpj) = (cardi · cardj) / max(doi, doj) (2)</p>
        <p>Subject-Object Joins. Subject-object joins occur when two triple patterns
exclusively share a common variable in the subject position of one triple pattern
and in the object position of the other, i.e., tpi = (?x, pi, oi) and
tpj = (sj, pj, ?x). Similarly to the previous cases, our model considers the number
of distinct subjects and the number of distinct objects involved in the join. The
maximum number of subjects or objects corresponds to an upper bound on the
number of resources that match. The proposed estimation is as follows:</p>
        <p>ĉard(tpi ⋈ tpj) = (cardi · cardj) / max(dsi, doj) (3)</p>
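        <p>The three estimations can be sketched in a few lines; the following Python rendering is ours (an illustration of Equations (1)-(3), not the authors' implementation), and the 5-tuple layout mirrors the hdt-stats definition above:

```python
# Illustrative sketch of Equations (1)-(3): join cardinality estimation
# from hdt-stats metadata. stats_* are 5-tuples (omega, card, ds, dp, do);
# omega may be None when only the metadata is available.

def estimate_join_cardinality(stats_i, stats_j, join_type):
    _, card_i, ds_i, _, do_i = stats_i
    _, card_j, ds_j, _, do_j = stats_j
    if join_type == "ss":      # shared variable in both subject positions
        denom = max(ds_i, ds_j)
    elif join_type == "oo":    # shared variable in both object positions
        denom = max(do_i, do_j)
    elif join_type == "so":    # subject of tp_i joins object of tp_j
        denom = max(ds_i, do_j)
    else:
        raise ValueError("unsupported join type: " + join_type)
    return card_i * card_j / denom

# Made-up metadata: tp_i has 1000 solutions over 100 distinct subjects,
# tp_j has 500 solutions over 250 distinct subjects.
est = estimate_join_cardinality((None, 1000, 100, 1, 800),
                                (None, 500, 250, 1, 400), "ss")  # → 2000.0
```

        The denominator is the maximum number of distinct join-key values, the upper bound on the number of resources that can match, as described in the text.</p>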
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Study</title>
      <p>
        Dataset and Queries. We used the benchmark queries over DBpedia (v.2015)
presented in the work by Acosta and Vidal [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Per query, we generated all
possible pairs of triple patterns that share variables in common. This resulted
in 287 triple pattern joins, with the following distributions: 255 subject-subject
joins, 23 object-object joins, and 9 subject-object joins.
      </p>
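      <p>Each generated pair can be classified into one of the three join types by locating the shared variable; a hypothetical helper (ours, with variables written with a leading '?') might look like:

```python
# Hypothetical helper (ours): classify the join type of two triple patterns
# that share exactly one variable. Patterns are (s, p, o) tuples in which
# variables start with '?'.

def join_type(tp_i, tp_j):
    shared = ({t for t in tp_i if t.startswith("?")} &
              {t for t in tp_j if t.startswith("?")})
    if len(shared) != 1:
        return None  # not a single shared-variable join
    var = shared.pop()
    pos_i = "s" if tp_i[0] == var else ("o" if tp_i[2] == var else "p")
    pos_j = "s" if tp_j[0] == var else ("o" if tp_j[2] == var else "p")
    return pos_i + pos_j  # e.g., "ss", "oo", "so"

jt = join_type(("?x", "ex:type", "ex:City"), ("?x", "ex:mayor", "?y"))  # "ss"
```

      </p>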
      <p>
        Approaches. To measure the effectiveness of our proposed cardinality
estimation model, we compare our approach with the model computed by CostFed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. CostFed is a SPARQL federated engine
that implements a cost model based on cardinality estimations, which relies on
similar statistics; therefore, a direct comparison with our techniques is possible.
Metrics. We report on the absolute error (AE) and the relative error (RE)
of the models. AE is the absolute difference between the estimated cardinality
and the actual number of answers produced by the join. RE is computed as AE
divided by the actual number of answers. In addition, we report on the Pearson
correlation coefficient between the estimated and real cardinalities.
      </p>
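      <p>The two error metrics are straightforward to state in code (the helper names are ours, for illustration):

```python
# Illustrative definitions of the two error metrics used in the study.

def absolute_error(estimated, actual):
    """AE: absolute difference between estimate and actual answer count."""
    return abs(estimated - actual)

def relative_error(estimated, actual):
    """RE: AE divided by the actual number of answers."""
    return abs(estimated - actual) / actual

# e.g., estimating 2000 answers for a join that actually produces 1600:
ae = absolute_error(2000, 1600)   # 400
re = relative_error(2000, 1600)   # 0.25
```

      </p>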
      <p>Table 1 reports on the effectiveness of the studied approaches, measured by
absolute and relative errors. The results indicate that, on average, our proposed
model produces more accurate estimations for all types of joins than the ones
implemented by CostFed. In particular, our approach performs best in predicting
the cardinalities of object-object joins, followed by subject-subject joins.</p>
      <p>A closer look into the raw results reveals that there is a particular case when
both models produce the same estimation: when the selectivity of subjects or
objects in a result set is equal to 1. In the other cases, our proposed cost model
typically produces estimates smaller than the ones by CostFed. This indicates that
even when both models overestimate the cardinality of a join, our prediction is
still closer to the real number of answers. This is confirmed by the mean relative
errors achieved in all types of joins (cf. Table 1).</p>
      <p>Lastly, our study reveals that both CostFed and our approach produce more
accurate estimates for the cardinality of object-object joins than for
subject-subject joins. This could be a result of the topology of the DBpedia graph, where
the number of subjects that match a triple pattern changes drastically
when different constants are used in the pattern. Still, as the number of
object-object joins in our evaluation is rather low (23 joins), we need to conduct
further experiments to derive concrete conclusions about the behavior of the
approaches in the object-object case.</p>
      <sec id="sec-3-1">
        <title>Analysis per Join Type</title>
        <p>In this study, we focus on understanding the behavior of our proposed cardinality
estimation model. We compare the estimated cardinality of our approach with
the actual number of answers produced by the joins. Figure 1 reports on the
actual number of answers vs. the estimated cardinalities by our approach.</p>
        <p>For subject-subject joins, we first observe that a few data points allow our
approach to achieve a very high correlation. This is the case for joins that produce
a large number of results, which our model was able to estimate accordingly.
To properly analyze the results on the majority of the triple patterns,
we focused on the results for joins that produce fewer than 1M answers (cf.
Figure 1(a)). In this case, the Pearson correlation achieved is 0.543. Despite the
moderate positive correlation, we observe that our model tends to overestimate
the cardinality of subject-subject joins over DBpedia.</p>
        <p>Figure 1(b) reports the results of object-object joins. Despite achieving
the lowest errors (AE and RE) for this type of join, our proposed model is still
volatile, i.e., there are no clear patterns of under- or over-estimation. This is
confirmed by the Pearson correlation coefficient (0.175).</p>
        <p>The results for subject-object joins are reported in Figure 1(c). We observe
that the model also tends to overestimate the cardinalities. In this case, the
correlation achieved is 0.863. Still, these results are considered preliminary, as
the benchmark did not have sufficient occurrences of this type of join.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>We have presented hdt-stats to support the retrieval of further metadata during
the evaluation of triple patterns over RDF graphs compressed with hdt. Based
on this metadata, we propose a novel cardinality estimation model that considers
the type of the join between pairs of triple patterns. Our results on over 287
joins indicate that our proposed model outperforms state-of-the-art solutions.
For future work, we plan to extend our model to cover further join types and to
conduct further experimental studies over other RDF graphs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Maribel</given-names>
            <surname>Acosta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maria-Esther</given-names>
            <surname>Vidal</surname>
          </string-name>
          .
          <article-title>Networks of Linked Data Eddies: An adaptive web query processing engine for RDF data</article-title>
          .
          <source>In ISWC</source>
          , pages
          <fpage>111</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Javier D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Miguel A.</given-names>
            <surname>Martínez-Prieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Axel</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mario</given-names>
            <surname>Arias</surname>
          </string-name>
          .
          <article-title>Binary RDF representation for publication and exchange (HDT)</article-title>
          .
          <source>J. Web Semant.</source>
          ,
          <volume>19</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Saleem</surname>
          </string-name>
          , Alexander Potocki, Tommaso Soru, Olaf Hartig, and
          <string-name>
            <given-names>Axel-Cyrille</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>CostFed: Cost-based query optimization for SPARQL endpoint federation</article-title>
          .
          <source>In SEMANTICS</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>