Speeding up RDF aggregate discovery through sampling

Ioana Manolescu
Inria and LIX (UMR 7161 and Ecole polytechnique), France
ioana.manolescu@inria.fr

Mirjana Mazuran∗
Inria and LIX (UMR 7161 and Ecole polytechnique), France
mirjana.mazuran@inria.fr
ABSTRACT
RDF graphs can be large and complex; finding interesting information within them is challenging. One easy method for users to discover such graphs is to be shown interesting aggregates (in the form of two-dimensional graphs, i.e., bar charts), where interestingness is evaluated through statistical criteria. Dagger [5] pioneered this approach; however, it is quite inefficient, in particular due to the need to evaluate numerous, expensive aggregation queries. In this work, we describe Dagger+, which builds upon Dagger and leverages sampling to speed up the evaluation of potentially interesting aggregates. We show that Dagger+ achieves very significant execution time reductions, while reaching results very close to those of the original, less efficient system.

∗ M. Mazuran is supported by the H2020 research program under grant nr. 800192.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal), on CEUR-WS.org.

1 INTRODUCTION
RDF graphs are oftentimes large and complex; first-time users have a hard time understanding them. Exploration methods investigated in the past are based on keyword search, or on RDF summaries, which give users a first idea of a graph's content and structure. A different method of exploring RDF graphs was introduced in Dagger [5], based on aggregation. Starting from an RDF graph, a set of aggregate queries are automatically identified and evaluated; the most interesting ones (in a sense to be outlined shortly below) are chosen and shown to human users as two-dimensional plots, in particular bar charts.

Figure 1 shows two examples. Above, we see the distribution of the number of ingredients appearing in food recipes in an RDF version of the foodista.com Web site, going from 1 to 25 with a peak at 8; we consider this distribution interesting as it has a clearly identified peak and quite a narrow interval. The graph below shows the distribution of Nobel prizes between female (left) and male (right) recipients; we find this interesting as it is very unbalanced.

Figure 1: Sample interesting aggregates.

As these examples illustrate, in Dagger+, we consider an aggregate query interesting if its result set has a high variance (second statistical moment). Other criteria can be considered; e.g., we have also experimented with the third statistical moment (skewness). More generally, while an RDF graph may have many interesting features, we seek to find aggregates over the graph having the highest variance. The running time of Dagger was quite high: it took hours to identify the 10 most interesting aggregates on a dataset of 20 million triples. In this paper, we show how to speed it up, in particular through novel sampling techniques.

The remainder of this article is organized as follows. Section 2 introduces the state of the art. Section 3 describes the main concepts and algorithms of Dagger [5], together with a set of improvements we brought through re-engineering. Then, Section 4 presents our performance-enhancing sampling technique.

2 STATE OF THE ART
The problem of data exploration [8] has received much attention; the automatic extraction of interesting aggregates is just one among many proposed techniques. Multidimensional data is particularly interesting in this respect; however, most works assume a fixed relational schema, which is not available for RDF graphs. More recent works [2] consider graphs, but (unlike Dagger) assume a very regular and simple structure.

In [9], the authors show how to automatically extract the top-k insights from multi-dimensional relational data, with fixed dimensions and measures. An insight is an observation derived from aggregation in multiple steps; it is considered interesting when it is remarkably different from others, or when it exhibits a rising or falling trend. SeeDB [10] recommends visualizations in high-dimensional relational data by means of a phased execution framework. The focus is to detect the visualizations with a large deviation with respect to a reference (e.g., another dataset, historical data, or the rest of the data) from a database with a snowflake schema. In [6], multi-structural databases are proposed; their schema (i.e., possible dimensions) is known, and three analytical operations are defined to: (i) determine how data is distributed across a particular set of dimensions, (ii) compare two sets of data with respect to given dimensions, and (iii) separate the data into cohesive groups with respect to the known dimensions.

In contrast, Dagger (and Dagger+) starts directly from RDF and, lacking schema information, automatically derives dimensions and measures that are good candidates to produce insights.

There is no universally accepted definition of interestingness; frequently, something unexpected (which differs from a reference) is considered interesting. The reference might be known a priori, come from historical data, or be the average behavior. So far we have experimented with variance, skewness and kurtosis; we are also working on using entropy. Many different RDF visualization techniques can be used, e.g., [1].

3 DAGGER OVERVIEW
We consider three pairwise disjoint sets: the set of URIs U, the set of literals L, and the set of blank nodes B. An RDF graph G is a finite set of triples of the form (s, p, o), called subject, property
and object, such that s ∈ (U ∪ B), p ∈ U and o ∈ (U ∪ B ∪ L). The RDF property rdf:type is used to attach types (i.e., classes) to an RDF resource, which can have zero, one or several types. G can have an ontology stating relationships between its classes and properties, e.g., any UndergraduateStudent is a Student; its presence may lead to implicit triples, which are part of G even if they do not appear explicitly. A graph containing all the triples which may be derived from it is said to be saturated. Without loss of generality, we consider that our input graph is saturated.

Dagger identifies unidimensional aggregate queries to be evaluated in an RDF graph G. However, unlike a traditional relational data warehouse, an RDF graph comes without an identified set of facts; nor are dimensions and measures known in advance. Therefore, Dagger must enumerate candidates for each of these roles.

A set of candidate facts (cf, in short) is a set of G resources which are deemed interesting for the analysis. Dagger considers as cf (i) all the resources of a given class C, e.g., all the Students; (ii) all the resources having a certain set P of properties, e.g., all those having title and author. A simple way to pick such a property set is to compute the support of the properties in G and select property sets whose support is above a certain threshold.

A candidate dimension (denoted d) is used for grouping the candidate facts. Dagger supports as candidate dimensions (i) properties present on at least a certain fraction t_thres of resources from cf; (ii) derived properties, computed by Dagger in order to find potentially interesting insights. The derived properties supported in [5] are count(p), where p is a property that some cf resources have. For instance, if resources of type Student are stated to takeCourse, Dagger derives the property takeCourse#. Further, Dagger will consider a candidate dimension only if the number of distinct values of this property on the cf resources is smaller than t_dist × |cf| for a certain ratio 0 < t_dist < 1.

A candidate measure (denoted m) is something to be evaluated or measured for each candidate fact. Dagger considers candidate measures among the (original or derived) properties of candidate facts whose support is above t_thres. Dimension-measure combinations such that one is a property a and the other is the count a# of this property are excluded.

An aggregation function ⊕ is chosen among min, max, avg, sum, count; the first four are considered only if the measure is numeric. Given that RDF data is often untyped or only partially typed, Dagger implements a type detection mechanism by trying to convert the property values to different types.

A Dagger aggregate agg is a tuple (cf, d, m, ⊕). To specify how interesting an aggregate is, let f(V) be a function which inputs a set of numerical values V and returns a number. Given f, the interestingness of agg is computed as:
    • let d_1, d_2, ... be the distinct values that d may take for a resource in cf;
    • for each d_i, let cf_i be the set of cf resources for which d takes the value d_i; observe that the cf_i sets may overlap, as a resource with more than one value for d belongs to several such sets. For instance, students can be grouped by the courses they take, and each student takes many courses;
    • for each cf_i, let M_i be the set of m values of cf_i resources, and m_i = ⊕(M_i);
    • the interestingness of agg is f({m_1, m_2, ...}).
The problem considered by Dagger is: given a graph G, a set of value thresholds, a function f and an integer k, find the k most interesting aggregates.

Architecture. Dagger leverages the robust query evaluation capabilities of an RDBMS to store the RDF graph, and to enumerate and evaluate candidate aggregates. SQL queries are used to: determine the resources that are part of a candidate fact set; evaluate the suitability of their properties as dimensions, respectively, measures; and compute and aggregate the group measures m_i. The remaining operations, e.g., the computation of the interestingness score function, are done in Java.

Aggregate recommendation cost. The most expensive computation steps are: (i) finding the candidate dimensions and (ii) evaluating the aggregation operations. Indeed, when looking for candidate dimensions, several queries are issued over cf (e.g., find all distinct properties, count the number of subjects that have each property, find the values of the derived properties, etc.). Moreover, many candidate aggregates are generated, also leading to a high number of potentially expensive SQL queries.

4 DAGGER+: SPEEDING UP DAGGER
A set of re-engineering changes were brought to Dagger since the original demonstration. In particular, it has been ported on top of OntoSQL (https://ontosql.inria.fr), a Java-based platform developed at Inria, providing efficient RDF storage, saturation, and query processing algorithms [3, 4]. OntoSQL encodes space-consuming URIs and literals into compact integers, together with a dictionary table which allows going from one to the other. For a given class c, all triples of the form x type c are stored in a single-column table tc holding the codes of the subjects x; for each property p other than type, a table tp stores (s code, o code) pairs for each (s, p, o) triple in G. This re-engineering has led to a very significant reduction in Dagger's running time, e.g., from several hours to 20 minutes on a 20 million triples graph.

Further, we adapted an optimization previously proposed in [10]: we evaluate all candidate aggregates that share both dimension and measure through a single SQL query. In a relational context, aggregates can be evaluated together as soon as they have the same dimension (even if the measures are different), because one relational tuple usually contains all the attributes used in the different measures. In contrast, RDF candidate facts need to be joined with the td table corresponding to the property d chosen as dimension, and with the tm table for the property chosen as measure; distinct measures require distinct joins, thus sharing is more limited. This is not due to the storage model, but to RDF heterogeneity: the candidate facts having the measure property m_1 may differ from those having property m_2. This forces evaluating (d, m_1) aggregates separately from (d, m_2) ones.

Below, we focus on two novel optimization techniques we applied subsequently: using an RDF graph summary to speed up candidate dimension enumeration (Section 4.1), and using sampling to accelerate candidate dimension enumeration, aggregate evaluation and ranking (Section 4.2).

4.1 Summary-based dimension enumeration
RDFQuotient [7] is a structural RDF graph summarization tool based on graph quotients. Given an equivalence relation over graph nodes, the summary contains one node for each equivalence class in the original graph. Moreover, each edge n --a--> m in the original graph leads to an edge rep(n) --a--> rep(m) in the summary, where rep(n) is the summary representative of node n. A particular equivalence relation groups the nodes by their set of types. Dagger+ uses it to find: (i) all the types, (ii) for each type, the number of resources of that type, and (iii) the set of possible properties these resources might have. These questions can be answered exactly and directly from the RDFQuotient summary, reducing dimension enumeration time.
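The per-aggregate evaluation described in Section 3 (group the cf resources by their dimension values, apply ⊕ to each group's measures, then score the resulting set with f) can be restated as a minimal sketch. The dictionary-based property tables and all names below are illustrative stand-ins for the SQL-based implementation, not Dagger+'s actual code:

```python
from collections import defaultdict
from statistics import pvariance

def evaluate_aggregate(cf, d_table, m_table, oplus, f=pvariance):
    """Evaluate one candidate aggregate agg = (cf, d, m, oplus).

    d_table and m_table map each resource to its (possibly several)
    values for the dimension and measure properties; they stand in
    for the per-property (subject, object) tables queried in SQL.
    f defaults to the variance, the second statistical moment.
    """
    groups = defaultdict(list)  # d_i -> measure values M_i
    for res in cf:
        # a resource with several dimension values falls into
        # several groups, so the cf_i sets may overlap
        for d_val in d_table.get(res, []):
            groups[d_val].extend(m_table.get(res, []))
    # one aggregated measure m_i per group, then the global score f
    m_values = [oplus(M_i) for M_i in groups.values()]
    return f(m_values) if len(m_values) > 1 else 0.0

# Toy run: students grouped by the course they take, measuring the
# average number of publications (all identifiers are invented).
cf = ["s1", "s2", "s3"]
d_table = {"s1": ["db", "ai"], "s2": ["db"], "s3": ["ai"]}
m_table = {"s1": [2], "s2": [8], "s3": [14]}
avg = lambda M: sum(M) / len(M)
score = evaluate_aggregate(cf, d_table, m_table, oplus=avg)
```

Note how s1, having two values for the dimension, contributes its measure to both groups, which is exactly why distinct measures cannot share joins in the SQL translation.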
4.2 Sampling-based aggregate selection
We introduced two sampling strategies to trade some accuracy in aggregate selection for running time:
    • CFSampling: the cf is sampled (draw n_1 samples of size n_2), and candidate dimensions and measures are found for each sample independently. For each of the n_1 samples, candidate aggregates are generated, and their interestingness is evaluated on the sample.
    • ESampling: candidate dimensions and measures are computed on the whole cf, as in Dagger. Then, n_1 samples of size n_2 are drawn from the cf, and the candidate aggregates are evaluated on the samples.

With CFSampling, the aggregates recommended for different samples may be different, e.g., certain properties might be found to be frequent in some samples but not in all of them. After the evaluation, a ranked list of aggregates is obtained for each sample. In ESampling, instead, the set of aggregates is unique, as it is derived directly from the original data. However, given that all the aggregates are evaluated on samples, it also yields a ranked list of aggregates for each sample.

To be able to compare the results found without sampling with those found through sampling, we need to reconcile the results found on different samples into a single ranked list. We can do this by taking (i) the union or (ii) the intersection of the lists obtained from all the samples. Then, we re-rank the aggregates according to an estimate of their global interestingness measure (based on their interestingness on each sample), e.g., through pooled variance; at the end of this process, we obtain a unique ranked list of candidate aggregates.

5 EXPERIMENTAL RESULTS
We measure the running time performance and test the accuracy of the results obtained with the sampling techniques. We used a 2.7 GHz Intel Core i7 with 16 GB of RAM. We describe experiments on the articles from the DBLP dataset¹; we use 20,132,491 triples describing information about 888,183 distinct articles.

Dagger+ without sampling. First, we evaluate how much the use of the summary, and the shared evaluation, improve performance. Figure 2 shows the two main components (finding dimensions and aggregate evaluation), as well as a third, very small bar representing the other operations. Note that the property analysis which allows selecting candidate dimensions also enables selecting candidate measures (thus candidate measure selection time is negligible and wrapped under "Other").

Figure 2: Impact of summary and sharing.

Without any optimization (the "NoSummary, NoShared" bar in Figure 2), 21% of the time is spent looking for dimensions, while 78% of the time goes to evaluating aggregates. Summary use introduces a modest performance improvement, while shared aggregate evaluation has a much more significant effect (see the "NoSummary, Shared" and "Summary, Shared" bars). Here, Dagger+ generates a total of 371 candidate aggregates; without shared evaluation, each of them is evaluated as one query; with shared evaluation, only 131 queries are executed, as this is the number of distinct dimension-measure pairs.

¹ http://www.rdfhdt.org/datasets/

Figure 3: Running times with sampling.

Sampling. We measure the running time and the accuracy of the results by varying 3 parameters: (i) the sampling strategy (CFSampling or ESampling), (ii) the number of samples (2, 3, 4 or 5), and (iii) the sample size, as a ratio of the candidate fact set size (from 1% to 10% of the number of distinct cf resources). Figure 3 shows the running times: one plot for each number of samples (from top to bottom: 2, 3, 4, 5), each of which varies the strategy and the sample size. Ten runs are averaged for each parameter combination (different samples are drawn in each run).

Figure 3 shows that ESampling is slower than CFSampling, as the former looks for candidate dimensions on the whole candidate fact set, whereas the latter does this on smaller-size samples. ESampling enumerates all the candidate aggregates also found without sampling, while CFSampling might enumerate a different set of aggregates: some properties that are overall frequent (infrequent) might be found infrequent (frequent) in a specific sample, and thus not be considered (be considered) as candidate dimensions and measures. However, even though ESampling enumerates all the no-sampling candidate aggregates, it evaluates them on samples, and may end up considering them interesting (uninteresting) differently from the way they are on the complete cf.

To evaluate the accuracy of the results obtained through sampling, we compare the Top-5, Top-10 and Top-20 aggregates found without sampling with those found through sampling. The ranked list of aggregates found with sampling is built by taking the union of all those found across the samples and re-ranking them according to the interestingness measure (in our experiments we use the pooled variance to rank all the results).
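To make the union-based reconciliation concrete, the ESampling pipeline with pooled-variance re-ranking could be sketched as follows. The evaluate callback and all identifiers are hypothetical; this illustrates the strategy, not Dagger+'s implementation:

```python
import random
from statistics import pvariance

def esampling_rank(cf, evaluate, n1, n2, k, seed=0):
    """Sketch of ESampling with union + pooled-variance re-ranking.

    evaluate(sample) returns {aggregate_id: list of group measures}
    for one sample; candidate enumeration on the full cf is assumed
    to have been done already, so the aggregate set is fixed.
    """
    rng = random.Random(seed)
    per_agg = {}  # aggregate id -> [(sample size, variance), ...]
    for _ in range(n1):  # draw n1 samples of size n2
        sample = rng.sample(cf, n2)
        for agg, measures in evaluate(sample).items():
            if len(measures) > 1:
                per_agg.setdefault(agg, []).append(
                    (len(measures), pvariance(measures)))

    def pooled(stats):
        # pooled variance: variances weighted by (n_i - 1)
        den = sum(n - 1 for n, _ in stats)
        return sum((n - 1) * v for n, v in stats) / den if den else 0.0

    # union of all per-sample results, re-ranked by the pooled score;
    # zero-interestingness aggregates are dropped
    scored = {a: pooled(s) for a, s in per_agg.items()}
    return sorted((a for a, v in scored.items() if v > 0),
                  key=lambda a: -scored[a])[:k]
```

CFSampling would differ only in that the aggregate set itself is re-derived per sample, so the union across samples may contain aggregates never considered on the full cf.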
                                Figure 4: Accuracy of results (%) with respect to running times (sec).
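The accuracy reported in Figure 4 and in Tables 1 and 2 is the percentage of aggregates in the sampled top-k that also appear in the no-sampling top-k, with ties at the k-th rank kept rather than broken, as described in the text. A small sketch of this measure, with illustrative names:

```python
def top_k_with_ties(ranked_scored, k):
    """Take the top k aggregates, extending past k while scores tie.

    ranked_scored is a list of (aggregate, score) pairs sorted by
    decreasing score; every aggregate tied with the k-th score is
    kept, since ties are not broken.
    """
    if len(ranked_scored) <= k:
        return [a for a, _ in ranked_scored]
    cutoff = ranked_scored[k - 1][1]
    return [a for a, s in ranked_scored if s >= cutoff]

def accuracy(sampled_top, exact_top):
    """Percentage of the sampled top-k also present in the exact top-k."""
    exact = set(exact_top)
    hits = sum(1 for a in sampled_top if a in exact)
    return 100.0 * hits / len(sampled_top) if sampled_top else 0.0
```

For example, with scores [5, 4, 3, 3, 2] and k = 3, four aggregates are returned, which matches the observation that the no-sampling Top-10 contained 11 aggregates.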

 #samples   Top-K     1%     3%     5%     7%     9%    10%
            Top-5    28%    36%    28%    24%    24%    36%
     2      Top-10   41%    55%    54%    55%    62%    75%
            Top-20   57%    69%    80%    81%    88%    96%
            Top-5    32%    16%    28%    24%    36%    20%
     3      Top-10   49%    55%    60%    55%    74%    61%
            Top-20   57%    68%    74%    80%    85%    83%
            Top-5    32%    22%    16%    12%    20%    12%
     4      Top-10   46%    50%    39%    42%    50%    55%
            Top-20   63%    68%    72%    76%    81%    85%
            Top-5    40%    20%    24%    20%    16%    20%
     5      Top-10   55%    53%    50%    51%    49%    55%
            Top-20   67%    63%    68%    74%    78%    79%
Table 1: Accuracy of results with CFSampling.

 #samples   Top-K     1%     3%     5%     7%     9%    10%
            Top-5    40%    36%    24%    32%    28%    32%
     2      Top-10   49%    55%    63%    57%    63%    71%
            Top-20   65%    69%    81%    84%    89%    93%
            Top-5    28%    28%    32%    24%    16%    28%
     3      Top-10   45%    68%    71%    60%    59%    72%
            Top-20   67%    76%    75%    83%    83%    87%
            Top-5    36%    16%    12%    24%    24%    20%
     4      Top-10   49%    55%    27%    41%    60%    53%
            Top-20   70%    71%    66%    73%    87%    84%
            Top-5    36%    20%    12%    24%    20%    24%
     5      Top-10   50%    49%    38%    45%    52%    60%
            Top-20   68%    63%    68%    77%    80%    88%
Table 2: Accuracy of results with ESampling.

Aggregates whose interestingness is zero are not considered. Notice that we do not break ties, that is: if we search for, e.g., the Top-5 most interesting aggregates, but find that the sixth element in the list has the same interestingness value as the fifth, we also include it in the Top-5, as we did not find a meaningful way to break such ties. When no sampling is used, Top-5 returned 5 aggregates, Top-10 returned 11, while Top-20 returned 20. We compute the accuracy as the percentage of aggregates in the sampling result that are also in the result without sampling.

Table 1 shows the accuracy of CFSampling, while Table 2 shows that of ESampling. In 57% of the cases, the accuracy obtained using ESampling is higher; on average, the two accuracy values differ by 6%. The accuracy is quite low when searching for the Top-5 aggregates; in contrast, both Top-10 and Top-20 can be well approximated (accuracy above 80% for the Top-20, even with few samples). In general, the bigger the samples, the better the accuracy; however, the results show situations where the Top-10 and Top-20 have better accuracy with a lower number of samples: there is a 10% difference in accuracy on average, i.e., the top-k lists differ by 2 to 3 aggregates. This does not mean that such aggregates were not found, but rather that they were ranked differently.

Figure 4 plots the accuracy with respect to the running times. Each line of graphs represents the Top-5, Top-10 and Top-20 for a different number of samples; red ×'s indicate CFSampling, while green +'s indicate ESampling, for different sample sizes. Clearly, the accuracy increases with K (the number of top aggregates) and with the size of the samples. Increasing the number of samples does not significantly improve accuracy.

6 CONCLUSION
Automatic selection of interesting aggregates in an RDF graph is challenging, as candidate dimensions and measures must be "guessed" from the data, and candidate aggregates must be evaluated to assess their interestingness. After re-engineering Dagger [5] to exploit an existing efficient RDF platform, we have shown that sharing work and (to a lesser extent) using a graph summary can reduce its running time by a factor of 2. Our main focus has been on sampling, which, for Top-10 and Top-20 search, achieves good accuracy (above 70%) while reducing running time by another factor of 2. Overall, CFSampling appears the most interesting strategy. Our current work proceeds in several directions: (i) generalizing Dagger to more complex dimension and measure selection, (ii) adopting more interestingness metrics, and (iii) introducing new types of derived properties, e.g., extracting meaningful keywords from textual properties to be used as potential dimensions and/or measures.

REFERENCES
[1] Fabio Benedetti, Sonia Bergamaschi, and Laura Po. 2015. LODeX: A Tool for Visual Querying Linked Open Data. In ISWC 2015 Posters & Demonstrations.
[2] D. Bleco and Y. Kotidis. 2018. Using entropy metrics for pruning very large graph cubes. Information Systems (2018). To appear.
[3] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2015. Optimizing Reformulation-based Query Answering in RDF. In EDBT.
[4] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2016. Teaching an RDBMS about ontological constraints. PVLDB 9, 12 (2016).
[5] Y. Diao, I. Manolescu, and S. Shang. 2017. Dagger: Digging for Interesting Aggregates in RDF Graphs. In ISWC Posters & Demonstrations Track.
[6] R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. 2005. Multi-structural databases. In PODS. ACM, 184–195.
[7] François Goasdoué, Pawel Guzewicz, and Ioana Manolescu. 2019. Incremental structural summarization of RDF graphs (demo). In EDBT.
[8] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. 2015. Overview of Data Exploration Techniques. In SIGMOD. 277–281.
[9] B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. 2017. Extracting Top-K Insights from Multi-dimensional Data. In SIGMOD. ACM, 1509–1524.
[10] M. Vartak, S. Rahman, S. Madden, A. G. Parameswaran, and N. Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182–2193.