Speeding up RDF aggregate discovery through sampling

Ioana Manolescu
Inria and LIX (UMR 7161 and Ecole polytechnique), France
ioana.manolescu@inria.fr

Mirjana Mazuran*
Inria and LIX (UMR 7161 and Ecole polytechnique), France
mirjana.mazuran@inria.fr

* M. Mazuran is supported by the H2020 research program under grant nr. 800192.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal), on CEUR-WS.org.

ABSTRACT
RDF graphs can be large and complex; finding interesting information within them is challenging. One easy method for users to discover such graphs is to be shown interesting aggregates (in the form of two-dimensional graphs, i.e., bar charts), where interestingness is evaluated through statistical criteria. Dagger [5] pioneered this approach; however, it is quite inefficient, in particular due to the need to evaluate numerous, expensive aggregation queries. In this work, we describe Dagger+, which builds upon Dagger and leverages sampling to speed up the evaluation of potentially interesting aggregates. We show that Dagger+ achieves very significant execution time reductions, while reaching results very close to those of the original, less efficient system.

1 INTRODUCTION
RDF graphs are oftentimes large and complex; first-time users have a hard time understanding them. Exploration methods investigated in the past are based on keyword search, or on RDF summaries, which give users a first idea of a graph's content and structure. A different method of exploring RDF graphs was introduced in Dagger [5], based on aggregation. Starting from an RDF graph, a set of aggregate queries are automatically identified and evaluated; the most interesting ones (in a sense to be outlined shortly below) are chosen and shown to human users as two-dimensional plots, in particular bar charts.

Figure 1 shows two examples. Above, we see the distribution of the number of ingredients appearing in food recipes in an RDF version of the foodista.com Web site, going from 1 to 25 with a peak at 8; we consider this distribution interesting as it has a clearly identified peak and quite a narrow interval. The graph below shows the distribution of Nobel prizes between female (left) and male (right) recipients; we find this interesting as it is very unbalanced.

Figure 1: Sample interesting aggregates.

As these examples illustrate, in Dagger+, we consider an aggregate query interesting if its result set has a high variance (second statistical moment). Other criteria can be considered, e.g., we have also experimented with the third statistical moment (skewness); more generally, while an RDF graph may have many interesting features, we seek to find aggregates over the graph having the highest variance. The running time of Dagger was quite high: it took hours to identify the 10 most interesting aggregates on a dataset of 20 million triples. In this paper, we show how to speed it up, in particular through novel sampling techniques.

The remainder of this article is organized as follows. Section 2 introduces the state of the art. Section 3 describes the main concepts and algorithms of Dagger [5], together with a set of improvements we brought through re-engineering. Then, Section 4 presents our performance-enhancing sampling techniques.

2 STATE OF THE ART
The problem of data exploration [8] has received much attention; the automatic extraction of interesting aggregates is just one among many proposed techniques. Multidimensional data is particularly interesting in this respect; however, most works assume a fixed relational schema, which is not available for RDF graphs. More recent works [2] consider graphs, but (unlike Dagger) assume a very regular and simple structure.

In [9], the authors show how to automatically extract the top-k insights from multi-dimensional relational data, with fixed dimensions and measures. An insight is an observation derived from aggregation in multiple steps; it is considered interesting when it is remarkably different from others, or when it exhibits a rising or falling trend. SeeDB [10] recommends visualizations in high-dimensional relational data by means of a phased execution framework. The focus is to detect the visualizations with a large deviation with respect to a reference (e.g., another dataset, historical data, or the rest of the data) from a database with a snowflake schema. In [6], multi-structural databases are proposed; their schema (i.e., possible dimensions) is known, and three analytical operations are defined to: (i) determine how data is distributed across a particular set of dimensions, (ii) compare two sets of data with respect to given dimensions, and (iii) separate the data into cohesive groups with respect to the known dimensions. In contrast, Dagger (and Dagger+) start directly from RDF and, lacking schema information, automatically derive dimensions and measures that are good candidates to produce insights.

There is no universally accepted definition of interestingness; frequently, something unexpected (which differs from a reference) is considered interesting. The reference might be known a priori, come from historical data, or be the average behavior. So far we have experimented with variance, skewness and kurtosis; we are also working to use the entropy. Many different RDF visualization techniques can be used, e.g., [1].

3 DAGGER OVERVIEW
We consider three pairwise disjoint sets: the set of URIs U, the set of literals L, and the set of blank nodes B. An RDF graph G is a finite set of triples of the form (s, p, o), called subject, property and object, such that s ∈ (U ∪ B), p ∈ U and o ∈ (U ∪ B ∪ L). The RDF property rdf:type is used to attach types (i.e., classes) to an RDF resource, which can have zero, one or several types. G can have an ontology stating relationships between its classes and properties, e.g., any UndergraduateStudent is a Student; its presence may lead to implicit triples, which are part of G even if they do not appear explicitly. A graph containing all the triples which may be derived from it is said to be saturated. Without loss of generality, we consider that our input graph is saturated.

Dagger identifies unidimensional aggregate queries to be evaluated in an RDF graph G. However, unlike a traditional relational data warehouse, an RDF graph comes without an identified set of facts; nor are dimensions and measures known in advance. Therefore, Dagger must enumerate candidates for each of these roles.

A set of candidate facts (cf, in short) is a set of G resources which are deemed interesting for the analysis. Dagger considers as cf (i) all the resources of a given class C, e.g., all the Students; (ii) all the resources having a certain set P of properties, e.g., all those having title and author. A simple way to pick such a property set is to compute the support of the properties in G and select property sets whose support is above a certain threshold.

A candidate dimension (denoted d) is used for grouping the candidate facts. Dagger supports as candidate dimensions (i) properties present on at least a certain fraction t_thresh of resources from cf; (ii) derived properties, computed by Dagger in order to find potentially interesting insights. The derived properties supported in [5] are count(p), where p is a property that some cf resources have. For instance, if resources of type Student are stated to takeCourse, Dagger derives the property takeCourse#. Further, Dagger considers a candidate dimension only if the number of distinct values of this property on the cf resources is smaller than t_dist × |cf|, for a certain ratio 0 < t_dist < 1.

A candidate measure (denoted m) is something to be evaluated or measured for each candidate fact. Dagger considers candidate measures among the (original or derived) properties of candidate facts whose support is above t_thresh. Dimension-measure combinations such that one is a property a and the other is the count a# of this property are excluded.

An aggregation function ⊕ is chosen among min, max, avg, sum, count; the first four are considered only if the measure is numeric. Given that RDF data is often untyped or only partially typed, Dagger implements a type detection mechanism by trying to convert the property values to different types.

A Dagger aggregate agg is a tuple (cf, d, m, ⊕). To specify how interesting an aggregate is, let f(V) be a function which inputs a set of numerical values V and returns a number. Given f, the interestingness of agg is computed as follows:
• let d1, d2, ... be the distinct values that d may take for a resource in cf;
• for each di, let cfi be the set of cf resources for which d takes the value di; observe that the cfi sets may overlap, as a resource with more than one value for d belongs to several such sets. For instance, students can be grouped by the courses they take, and each student takes many courses;
• for each cfi, let Mi be the set of m values of cfi resources, and mi = ⊕(Mi);
• the interestingness of agg is f({m1, m2, ...}).
The problem considered by Dagger is: given a graph G, a set of value thresholds, a function f and an integer k, find the k most interesting aggregates.

Architecture. Dagger leverages the robust query evaluation capabilities of an RDBMS to store the RDF graph, and to enumerate and evaluate candidate aggregates. SQL queries are used to: determine the resources part of a candidate fact set; evaluate the suitability of their properties as dimensions and, respectively, measures; and compute and aggregate the group measures mi. The remaining operations, e.g., the computation of the interestingness score function, are done in Java.

Aggregate recommendation cost. The most expensive computation steps are: (i) finding the candidate dimensions and (ii) evaluating the aggregation operations. Indeed, when looking for candidate dimensions, several queries are issued over cf (e.g., find all distinct properties, count the number of subjects that have each property, find the values of the derived properties, etc.). Moreover, many candidate aggregates are generated, also leading to a high number of potentially expensive SQL queries.

4 DAGGER+: SPEEDING UP DAGGER
A set of re-engineering changes were brought to Dagger since the original demonstration. In particular, it has been ported on top of OntoSQL (https://ontosql.inria.fr), a Java-based platform developed at Inria, providing efficient RDF storage, saturation, and query processing algorithms [3, 4]. OntoSQL encodes space-consuming URIs and literals into compact integers, together with a dictionary table which allows going from one to the other. For a given class c, all triples of the form x type c are stored in a single-column table tc holding the codes of the subjects x; for each property p other than type, a table tp stores (s_code, o_code) pairs for each (s, p, o) triple in G. This re-engineering has led to a very significant reduction in Dagger's running time, e.g., on a 20 million triples graph, from several hours to 20 minutes.

Further, we adapted an optimization previously proposed in [10]: we evaluate all candidate aggregates that share both dimension and measure by a single SQL query. In a relational context, aggregates can be evaluated together as soon as they have the same dimension (even if the measures are different), because one relational tuple usually contains all the attributes used in the different measures. In contrast, RDF candidate facts need to be joined with the td table corresponding to the property d chosen as dimension, and with the tm table for the property chosen as measure; distinct measures require distinct joins, thus sharing is more limited. This is not due to the storage model, but to RDF heterogeneity: the candidate facts having the measure property m1 may be different from those having property m2. This forces evaluating (d, m1) aggregates separately from (d, m2) ones.

Below, we focus on two novel optimization techniques we applied subsequently: using an RDF graph summary to speed up candidate dimension enumeration (Section 4.1), and using sampling to accelerate candidate dimension enumeration, aggregate evaluation and ranking (Section 4.2).

4.1 Summary-based dimension enumeration
RDFQuotient [7] is a structural RDF graph summarization tool based on graph quotients. Given an equivalence relation over graph nodes, the summary contains one node for each equivalence class in the original graph. Moreover, each a-labeled edge from n to m in the original graph leads to an a-labeled edge from rep(n) to rep(m) in the summary, where rep(n) is the summary representative of node n. A particular equivalence relation groups the nodes by their set of types. Dagger+ uses it to find: (i) all the types, (ii) for each type, the number of resources of that type, and (iii) the set of possible properties these resources might have. These questions can be answered exactly, directly from the RDFQuotient summary, reducing dimension enumeration time.

4.2 Sampling-based aggregate selection
We introduced two sampling strategies to trade some accuracy in aggregate selection for running time:
• CFSampling: the cf is sampled (draw n1 samples of size n2), and candidate dimensions and measures are found for each sample independently. For each of the n1 samples, candidate aggregates are generated, and their interestingness is evaluated on the sample.
• ESampling: candidate dimensions and measures are computed on the whole cf, as in Dagger. Then, n1 samples of size n2 are drawn from the cf, and the candidate aggregates are evaluated on the samples.
With CFSampling, the aggregates recommended for different samples may differ, e.g., certain properties might be found to be frequent in some samples but not in all of them. After the evaluation, a ranked list of aggregates is obtained for each sample. In ESampling, instead, the set of aggregates is unique, as it is derived directly from the original data. However, given that all the aggregates are evaluated on samples, it also yields a ranked list of aggregates for each sample.

To be able to compare the results found without sampling with those found through sampling, we need to reconcile the results found on different samples into a single ranked list. We can do this by taking (i) the union or (ii) the intersection of the lists obtained from all the samples. Then, we re-rank the aggregates according to an estimation of their global interestingness (based on their interestingness on each sample), e.g., through pooled variance; at the end of this process, we obtain a unique ranked list of candidate aggregates.

Figure 3: Running times with sampling.
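To make the notions above concrete, the following sketch mimics, over a toy in-memory representation, the interestingness computation of Section 3 and the two sampling strategies of Section 4.2. This is our own illustrative code, not part of Dagger+: the actual system evaluates these steps through SQL queries over OntoSQL tables, also considers derived properties, and reconciles samples through pooled variance (approximated here by a plain per-sample average). All names, default thresholds, and the choice of count as the aggregation function ⊕ are assumptions of the sketch.

```python
import random
from itertools import chain
from statistics import pvariance

# Toy stand-in for the SQL-side data: a candidate fact is a dict mapping
# property names to lists of values (properties are multi-valued in RDF).

def interestingness(cf, d, m, agg=sum):
    """f({m1, m2, ...}) with f = variance: group facts by each value of the
    dimension d (groups may overlap), aggregate the measure m per group."""
    groups = {}
    for fact in cf:
        for value in fact.get(d, []):
            groups.setdefault(value, []).extend(fact.get(m, []))
    ms = [agg(M) for M in groups.values() if M]
    return pvariance(ms) if len(ms) > 1 else 0.0

def candidate_dimensions(cf, t_thresh=0.5, t_dist=0.9):
    """Keep properties frequent enough (support >= t_thresh * |cf|) with
    fewer than t_dist * |cf| distinct values; thresholds are illustrative."""
    dims = []
    for p in set(chain.from_iterable(cf)):
        support = sum(1 for f in cf if p in f)
        distinct = len({v for f in cf for v in f.get(p, [])})
        if support >= t_thresh * len(cf) and distinct < t_dist * len(cf):
            dims.append(p)
    return dims

def cf_sampling(cf, n1, n2, rng):
    """CFSampling: dimensions and measures are re-derived on each sample."""
    per_sample = []
    for _ in range(n1):
        sample = rng.sample(cf, n2)
        dims = candidate_dimensions(sample)
        aggs = [(d, m) for d in dims for m in dims if d != m]
        # count aggregation (agg=len) keeps the sketch safe for non-numeric measures
        per_sample.append({a: interestingness(sample, *a, agg=len) for a in aggs})
    return per_sample

def e_sampling(cf, n1, n2, rng):
    """ESampling: dimensions and measures come from the whole cf; only the
    aggregate evaluation runs on the samples."""
    dims = candidate_dimensions(cf)
    aggs = [(d, m) for d in dims for m in dims if d != m]
    return [{a: interestingness(rng.sample(cf, n2), *a, agg=len) for a in aggs}
            for _ in range(n1)]

def reconcile(per_sample_scores):
    """Union of the per-sample lists, re-ranked by an estimated global score
    (mean of per-sample scores; Dagger+ uses pooled variance instead)."""
    union = set(chain.from_iterable(per_sample_scores))
    score = {a: sum(s.get(a, 0.0) for s in per_sample_scores)
                / sum(1 for s in per_sample_scores if a in s)
             for a in union}
    return sorted(union, key=lambda a: -score[a])
```

For instance, grouping three facts by a "dept" property and summing an "age" measure yields group aggregates {42, 30}, whose variance is the interestingness score; `reconcile` then merges the ranked dictionaries produced by either strategy into a single list.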
5 EXPERIMENTAL RESULTS
We measure the running time and the accuracy of the results obtained with our sampling techniques. We used a 2.7 GHz Intel Core i7 with 16 GB of RAM. We describe experiments on the articles from the DBLP dataset (http://www.rdfhdt.org/datasets/); we use 20,132,491 triples describing information about 888,183 distinct articles.

Dagger+ without sampling. First, we evaluate how much the use of the summary, and the shared evaluation, improve performance. Figure 2 shows the two main components (finding dimensions and aggregate evaluation), as well as a third, very small bar representing the other operations. Note that the property analysis which allows selecting candidate dimensions also enables selecting candidate measures (thus candidate measure selection time is negligible and wrapped under "Other").

Figure 2: Impact of summary and sharing.

Without any optimization ("NoSummary, NoShared" bar in Figure 2), 21% of the time is spent looking for dimensions, while 78% of the time goes to evaluating aggregates. Summary use introduces a modest performance improvement, while shared aggregate evaluation has a much more significant effect (see the "NoSummary, Shared" and "Summary, Shared" bars). Here, Dagger+ generates a total of 371 candidate aggregates; without shared evaluation, each of them is evaluated as one query, whereas with shared evaluation, only 131 queries are executed, as this is the number of distinct (dimension, measure) pairs.

Sampling. We measure the running time and the accuracy of the results by varying three parameters: (i) the sampling strategy (CFSampling or ESampling), (ii) the number of samples (2, 3, 4 or 5), and (iii) the sample size, as a ratio of the candidate fact set size (from 1% to 10% of the number of distinct cf resources). Figure 3 shows the running times: one plot for each number of samples (from top to bottom: 2, 3, 4, 5), each of which varies the strategy and the sample size. Ten runs are averaged for each parameter combination (different samples are drawn in each run).

Figure 3 shows that ESampling is slower than CFSampling, as the former looks for candidate dimensions on the whole candidate fact set, whereas the latter does this on smaller-size samples. ESampling enumerates all the candidate aggregates also found without sampling, while CFSampling might enumerate a different set of aggregates: some properties that are overall frequent (infrequent) might be found infrequent (frequent) in a specific sample, and thus not be considered (be considered) as candidate dimensions and measures. However, even though ESampling enumerates all the no-sampling candidate aggregates, it evaluates them on samples, and may end up considering them interesting (uninteresting) differently from the way they are on the complete cf.

To evaluate the accuracy of the results obtained through sampling, we compare the Top-5, Top-10 and Top-20 aggregates found without sampling with those found through sampling. The ranked list of aggregates found with sampling is built by taking the union of all those found across the samples and re-ranking them according to the interestingness measure (in our experiments, we use the pooled variance to rank all the results). Aggregates whose interestingness is zero are not considered. Notice that we do not break ties, that is: if we search, e.g., for the Top-5 most interesting aggregates, but find that the sixth element in the list has the same interestingness value as the fifth, we also include it in the Top-5, as we did not find a meaningful way to break such ties. When no sampling is used, Top-5 returned 5 aggregates, Top-10 returned 11, while Top-20 returned 20. We compute the accuracy as the percentage of aggregates in the sampling result that are also in the result without sampling.

Table 1 shows the accuracy of CFSampling, while Table 2 shows that of ESampling.

#samples  Top-K    1%    3%    5%    7%    9%   10%
   2      Top5    28%   36%   28%   24%   24%   36%
          Top10   41%   55%   54%   55%   62%   75%
          Top20   57%   69%   80%   81%   88%   96%
   3      Top5    32%   16%   28%   24%   36%   20%
          Top10   49%   55%   60%   55%   74%   61%
          Top20   57%   68%   74%   80%   85%   83%
   4      Top5    32%   22%   16%   12%   20%   12%
          Top10   46%   50%   39%   42%   50%   55%
          Top20   63%   68%   72%   76%   81%   85%
   5      Top5    40%   20%   24%   20%   16%   20%
          Top10   55%   53%   50%   51%   49%   55%
          Top20   67%   63%   68%   74%   78%   79%

Table 1: Accuracy of results with CFSampling.

#samples  Top-K    1%    3%    5%    7%    9%   10%
   2      Top5    40%   36%   24%   32%   28%   32%
          Top10   49%   55%   63%   57%   63%   71%
          Top20   65%   69%   81%   84%   89%   93%
   3      Top5    28%   28%   32%   24%   16%   28%
          Top10   45%   68%   71%   60%   59%   72%
          Top20   67%   76%   75%   83%   83%   87%
   4      Top5    36%   16%   12%   24%   24%   20%
          Top10   49%   55%   27%   41%   60%   53%
          Top20   70%   71%   66%   73%   87%   84%
   5      Top5    36%   20%   12%   24%   20%   24%
          Top10   50%   49%   38%   45%   52%   60%
          Top20   68%   63%   68%   77%   80%   88%

Table 2: Accuracy of results with ESampling.

In 57% of the cases, the accuracy obtained using ESampling is higher; on average, the two accuracy values differ by 6%. The accuracy is quite low when searching for the Top-5 aggregates; in contrast, both Top-10 and Top-20 can be well approximated (accuracy above 80% for the Top-20 even with few samples). In general, the bigger the samples, the better the accuracy; however, the results show situations where the Top-10 and Top-20 have better accuracy with a lower number of samples: there is a 10% difference in accuracy on average, i.e., the top-k lists differ by 2 to 3 aggregates. This does not mean that such aggregates were not found, though; they have simply been ranked differently.

Figure 4 plots the accuracy with respect to the running times. Each line of graphs represents the Top-5, Top-10 and Top-20 for a different number of samples; red x's indicate CFSampling, while green +'s indicate ESampling, for different sample sizes. Clearly, the accuracy increases with K (the number of top aggregates) and with the size of the samples. Increasing the number of samples does not significantly improve accuracy.

Figure 4: Accuracy of results (%) with respect to running times (sec).

6 CONCLUSION
Automatic selection of interesting aggregates in an RDF graph is challenging, as candidate dimensions and measures must be "guessed" from the data, and candidate aggregates must be evaluated to assess their interestingness. After re-engineering Dagger [5] to exploit an existing efficient RDF platform, we have shown that sharing work and (to a lesser extent) using a graph summary can reduce its running time by a factor of 2. Our main focus has been on sampling, which, for Top-10 and Top-20 search, achieves good accuracy (above 70%) while reducing running time by another factor of 2. Overall, CFSampling appears to be the most interesting strategy. Our current work stretches in several directions: (i) generalizing Dagger to more complex dimension and measure selection, (ii) adopting more interestingness metrics, and (iii) introducing new types of derived properties, e.g., extracting meaningful keywords from textual properties to be used as potential dimensions and/or measures.

REFERENCES
[1] F. Benedetti, S. Bergamaschi, and L. Po. 2015. LODeX: A Tool for Visual Querying Linked Open Data. In ISWC 2015 Posters & Demonstrations.
[2] D. Bleco and Y. Kotidis. 2018. Using entropy metrics for pruning very large graph cubes. Information Systems (2018). To appear.
[3] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2015. Optimizing Reformulation-based Query Answering in RDF. In EDBT.
[4] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2016. Teaching an RDBMS about ontological constraints. PVLDB 9, 12 (2016).
[5] Y. Diao, I. Manolescu, and S. Shang. 2017. Dagger: Digging for Interesting Aggregates in RDF Graphs. In ISWC Posters & Demonstrations Track.
[6] R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. 2005. Multi-structural databases. In PODS. ACM, 184–195.
[7] F. Goasdoué, P. Guzewicz, and I. Manolescu. 2019. Incremental structural summarization of RDF graphs (demo). In EDBT.
[8] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. 2015. Overview of Data Exploration Techniques. In SIGMOD. 277–281.
[9] B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. 2017. Extracting Top-K Insights from Multi-dimensional Data. In SIGMOD. ACM, 1509–1524.
[10] M. Vartak, S. Rahman, S. Madden, A. G. Parameswaran, and N. Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182–2193.
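As an illustrative endnote, the comparison protocol used in the experiments (tie-aware Top-k selection and the accuracy measure) can be sketched in a few lines. This is a toy sketch under our own naming, not code from Dagger+; it only assumes what the text states: zero-interestingness aggregates are dropped, ties with the k-th score are kept, and accuracy is the fraction of the sampled Top-k also present in the exact Top-k.

```python
def top_k_with_ties(scores, k):
    """Tie-aware Top-k: aggregates with zero interestingness are dropped,
    and any aggregate tied with the k-th score is also kept."""
    ranked = sorted((p for p in scores.items() if p[1] > 0), key=lambda p: -p[1])
    if len(ranked) <= k:
        return [a for a, _ in ranked]
    cut = ranked[k - 1][1]  # interestingness of the k-th aggregate
    return [a for a, s in ranked if s >= cut]

def accuracy(sampled_topk, exact_topk):
    """Percentage of the sampled Top-k that also appears in the exact Top-k."""
    return 100.0 * len(set(sampled_topk) & set(exact_topk)) / len(sampled_topk)
```

This reproduces, e.g., the behavior reported in Section 5 where a no-sampling Top-10 returned 11 aggregates: a tie at the boundary extends the returned list.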