Speeding up RDF aggregate discovery through sampling

Ioana Manolescu
Inria and LIX (UMR 7161 and Ecole polytechnique), France
ioana.manolescu@inria.fr

Mirjana Mazuran*
Inria and LIX (UMR 7161 and Ecole polytechnique), France
mirjana.mazuran@inria.fr

* M. Mazuran is supported by the H2020 research program under grant nr. 800192.

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal), on CEUR-WS.org.

ABSTRACT
RDF graphs can be large and complex; finding interesting information within them is challenging. One easy method for users to discover such graphs is to be shown interesting aggregates (in the form of two-dimensional graphs, i.e., bar charts), where interestingness is evaluated through statistical criteria. Dagger [5] pioneered this approach; however, it is quite inefficient, in particular due to the need to evaluate numerous, expensive aggregation queries. In this work, we describe Dagger+, which builds upon Dagger and leverages sampling to speed up the evaluation of potentially interesting aggregates. We show that Dagger+ achieves very significant execution time reductions, while reaching results very close to those of the original, less efficient system.

1 INTRODUCTION
RDF graphs are oftentimes large and complex; first-time users have a hard time understanding them. Exploration methods investigated in the past are based on keyword search, or on RDF summaries, which give users a first idea of a graph's content and structure. A different method of exploring RDF graphs was introduced in Dagger [5], based on aggregation. Starting from an RDF graph, a set of aggregate queries are automatically identified and evaluated; the most interesting ones (in a sense to be outlined shortly below) are chosen and shown to human users as two-dimensional plots, in particular bar charts.

Figure 1 shows two examples. Above, we see the distribution of the number of ingredients appearing in food recipes in an RDF version of the foodista.com Web site, going from 1 to 25 with a peak at 8; we consider this distribution interesting as it has a clearly identified peak and quite a narrow interval. The graph below shows the distribution of Nobel prizes between female (left) and male (right) recipients; we find this interesting as it is very unbalanced.

Figure 1: Sample interesting aggregates.

As these examples illustrate, in Dagger+, we consider an aggregate query interesting if its result set has a high variance (second statistical moment). Other criteria can be considered, e.g., we have also experimented with the third statistical moment (skewness); more generally, while an RDF graph may have many interesting features, we seek to find aggregates over the graph having the highest variance. The running time of Dagger was quite high: it took hours to identify the 10 most interesting aggregates on a dataset of 20 million triples. In this paper, we show how to speed it up, in particular through novel sampling techniques.

The remainder of this article is organized as follows. Section 2 introduces the state of the art. Section 3 describes the main concepts and algorithms of Dagger [5], together with a set of improvements we brought through re-engineering. Then, Section 4 presents our performance-enhancing sampling techniques.

2 STATE OF THE ART
The problem of data exploration [8] has received much attention; the automatic extraction of interesting aggregates is just one among many proposed techniques. Multidimensional data is particularly interesting in this respect; however, most works assume a fixed relational schema, which is not available for RDF graphs. More recent works [2] consider graphs, but (unlike Dagger) assume a very regular and simple structure.

In [9], the authors show how to automatically extract the top-k insights from multi-dimensional relational data, with fixed dimensions and measures. An insight is an observation derived from aggregation in multiple steps; it is considered interesting when it is remarkably different from others, or when it exhibits a rising or falling trend. SeeDB [10] recommends visualizations in high-dimensional relational data by means of a phased execution framework. The focus is to detect the visualizations with a large deviation with respect to a reference (e.g., another dataset, historical data, or the rest of the data) from a database with a snowflake schema. In [6], multi-structural databases are proposed; their schema (i.e., possible dimensions) is known, and three analytical operations are defined to: (i) determine how data is distributed across a particular set of dimensions, (ii) compare two sets of data with respect to given dimensions, and (iii) separate the data into cohesive groups with respect to the known dimensions. In contrast, Dagger (and Dagger+) start directly from RDF and, lacking schema information, automatically derive dimensions and measures that are good candidates to produce insights.

There is no universally accepted definition of interestingness; frequently, something unexpected (which differs from a reference) is considered interesting. The reference might be known a priori, come from historical data, or be the average behavior. So far we have experimented with variance, skewness and kurtosis; we are also working to use the entropy. Many different RDF visualization techniques can be used, e.g., [1].

3 DAGGER OVERVIEW
We consider three pairwise disjoint sets: the set of URIs U, the set of literals L, and the set of blank nodes B. An RDF graph G is a finite set of triples of the form (s, p, o), called subject, property and object, such that s ∈ (U ∪ B), p ∈ U and o ∈ (U ∪ B ∪ L). The RDF property rdf:type is used to attach types (i.e., classes) to an RDF resource, which can have zero, one or several types. G can have an ontology stating relationships between its classes and properties, e.g., any UndergraduateStudent is a Student; its presence may lead to implicit triples, which are part of G even if they do not appear explicitly. A graph containing all the triples which may be derived from it is said to be saturated. Without loss of generality, we consider that our input graph is saturated.

Dagger identifies unidimensional aggregate queries to be evaluated in an RDF graph G. However, unlike a traditional relational data warehouse, an RDF graph comes without an identified set of facts; nor are dimensions and measures known in advance. Therefore, Dagger must enumerate candidates for each of these roles.

A set of candidate facts (cf, in short) is a set of G resources which are deemed interesting for the analysis. Dagger considers as cf (i) all the resources of a given class C, e.g., all the Students; (ii) all the resources having a certain set P of properties, e.g., all those having title and author. A simple way to pick such a property set is to compute the support of the properties in G and select property sets whose support is above a certain threshold.

A candidate dimension (denoted d) is used for grouping the candidate facts. Dagger supports as candidate dimensions (i) properties present on at least a certain fraction t_thresh of resources from cf; (ii) derived properties, computed by Dagger in order to find potentially interesting insights. The derived properties supported in [5] are count(p), where p is a property that some cf resources have. For instance, if resources of type Student are stated to takeCourse, Dagger derives the property takeCourse#. Further, Dagger considers a candidate dimension only if the number of distinct values of this property on the cf resources is smaller than t_dist × |cf|, for a certain ratio 0 < t_dist < 1.

A candidate measure (denoted m) is something to be evaluated or measured for each candidate fact. Dagger considers candidate measures among the (original or derived) properties of candidate facts whose support is above t_thresh. Dimension-measure combinations such that one is a property a and the other is the count a# of this property are excluded.

An aggregation function ⊕ is chosen among min, max, avg, sum, count; the first four are considered only if the measure is numeric. Given that RDF data is often untyped or only partially typed, Dagger implements a type detection mechanism by trying to convert the property values to different types.

A Dagger aggregate agg is a tuple (cf, d, m, ⊕). To specify how interesting an aggregate is, let f(V) be a function which inputs a set of numerical values V and returns a number. Given f, the interestingness of agg is computed as follows:
• let d1, d2, ... be the distinct values that d may take for a resource in cf;
• for each di, let cfi be the set of cf resources for which d takes the value di; observe that the cfi sets may overlap, as a resource with more than one value for d belongs to several such sets. For instance, students can be grouped by the courses they take, and each student takes many courses;
• for each cfi, let Mi be the set of m values of cfi resources, and mi = ⊕(Mi);
• the interestingness of agg is f({m1, m2, ...}).
The problem considered by Dagger is: given a graph G, a set of value thresholds, a function f and an integer k, find the k most interesting aggregates.

Architecture. Dagger leverages the robust query evaluation capabilities of an RDBMS to store the RDF graph, and to enumerate and evaluate candidate aggregates. SQL queries are used to: determine the resources part of a candidate fact set; evaluate the suitability of their properties as dimensions and, respectively, measures; and compute and aggregate the group measures mi. The remaining operations, e.g., the computation of the interestingness score function, are done in Java.

Aggregate recommendation cost. The most expensive computation steps are: (i) finding the candidate dimensions and (ii) evaluating the aggregation operations. Indeed, when looking for candidate dimensions, several queries are issued over cf (e.g., find all distinct properties, count the number of subjects that have each property, find the values of the derived properties, etc.). Moreover, many candidate aggregates are generated, also leading to a high number of potentially expensive SQL queries.

4 DAGGER+: SPEEDING UP DAGGER
A set of re-engineering changes were brought to Dagger since the original demonstration. In particular, it has been ported on top of OntoSQL (https://ontosql.inria.fr), a Java-based platform developed at Inria, providing efficient RDF storage, saturation, and query processing algorithms [3, 4]. OntoSQL encodes space-consuming URIs and literals into compact integers, together with a dictionary table which allows going from one to the other. For a given class c, all triples of the form x type c are stored in a single-column table tc holding the codes of the subjects x; for each property p other than type, a table tp stores (s_code, o_code) pairs for each (s, p, o) triple in G. This re-engineering has led to a very significant reduction in Dagger's running time, e.g., on a 20 million triples graph, from several hours to 20 minutes.

Further, we adapted an optimization previously proposed in [10]: we evaluate all candidate aggregates that share both dimension and measure by a single SQL query. In a relational context, aggregates can be evaluated together as soon as they have the same dimension (even if the measures are different), because one relational tuple usually contains all the attributes used in the different measures. In contrast, RDF candidate facts need to be joined with the td table corresponding to the property d chosen as dimension, and with the tm table for the property chosen as measure; distinct measures require distinct joins, thus sharing is more limited. This is not due to the storage model, but to RDF heterogeneity: the candidate facts having the measure property m1 may be different from those having property m2. This forces evaluating (d, m1) aggregates separately from (d, m2) ones.

Below, we focus on two novel optimization techniques we applied subsequently: using an RDF graph summary to speed up candidate dimension enumeration (Section 4.1), and using sampling to accelerate candidate dimension enumeration, aggregate evaluation and ranking (Section 4.2).

4.1 Summary-based dimension enumeration
RDFQuotient [7] is a structural RDF graph summarization tool based on graph quotients. Given an equivalence relation over graph nodes, the summary contains one node for each equivalence class in the original graph. Moreover, each a-labeled edge from n to m in the original graph leads to an a-labeled edge from rep(n) to rep(m) in the summary, where rep(n) is the summary representative of node n. A particular equivalence relation groups the nodes by their set of types. Dagger+ uses it to find: (i) all the types, (ii) for each type, the number of resources of that type, and (iii) the set of possible properties these resources might have. These questions can be answered exactly, directly from the RDFQuotient summary, reducing dimension enumeration time.

4.2 Sampling-based aggregate selection
We introduced two sampling strategies to trade some accuracy in aggregate selection for running time:
• CFSampling: the cf is sampled (draw n1 samples of size n2), and candidate dimensions and measures are found for each sample independently. For each of the n1 samples, candidate aggregates are generated, and their interestingness is evaluated on the sample.
• ESampling: candidate dimensions and measures are computed on the whole cf, as in Dagger. Then, n1 samples of size n2 are drawn from the cf, and the candidate aggregates are evaluated on the samples.
With CFSampling, the aggregates recommended for different samples may differ, e.g., certain properties might be found to be frequent in some samples but not in all of them. After the evaluation, a ranked list of aggregates is obtained for each sample. In ESampling, instead, the set of aggregates is unique, as it is derived directly from the original data. However, given that all the aggregates are evaluated on samples, it also yields a ranked list of aggregates for each sample.

To be able to compare the results found without sampling with those found through sampling, we need to reconcile the results found on different samples into a single ranked list. We can do this by taking (i) the union or (ii) the intersection of the lists obtained from all the samples. Then, we re-rank the aggregates according to an estimation of their global interestingness (based on their interestingness on each sample), e.g., through pooled variance; at the end of this process, we obtain a unique ranked list of candidate aggregates.

Figure 3: Running times with sampling.
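To make the notions above concrete, the following sketch mimics, over a toy in-memory representation, the interestingness computation of Section 3 and the two sampling strategies of Section 4.2. This is our own illustrative code, not part of Dagger+: the actual system evaluates these steps through SQL queries over OntoSQL tables, also considers derived properties, and reconciles samples through pooled variance (approximated here by a plain per-sample average). All names, default thresholds, and the choice of count as the aggregation function ⊕ are assumptions of the sketch.

```python
import random
from itertools import chain
from statistics import pvariance

# Toy stand-in for the SQL-side data: a candidate fact is a dict mapping
# property names to lists of values (properties are multi-valued in RDF).

def interestingness(cf, d, m, agg=sum):
    """f({m1, m2, ...}) with f = variance: group facts by each value of the
    dimension d (groups may overlap), aggregate the measure m per group."""
    groups = {}
    for fact in cf:
        for value in fact.get(d, []):
            groups.setdefault(value, []).extend(fact.get(m, []))
    ms = [agg(M) for M in groups.values() if M]
    return pvariance(ms) if len(ms) > 1 else 0.0

def candidate_dimensions(cf, t_thresh=0.5, t_dist=0.9):
    """Keep properties frequent enough (support >= t_thresh * |cf|) with
    fewer than t_dist * |cf| distinct values; thresholds are illustrative."""
    dims = []
    for p in set(chain.from_iterable(cf)):
        support = sum(1 for f in cf if p in f)
        distinct = len({v for f in cf for v in f.get(p, [])})
        if support >= t_thresh * len(cf) and distinct < t_dist * len(cf):
            dims.append(p)
    return dims

def cf_sampling(cf, n1, n2, rng):
    """CFSampling: dimensions and measures are re-derived on each sample."""
    per_sample = []
    for _ in range(n1):
        sample = rng.sample(cf, n2)
        dims = candidate_dimensions(sample)
        aggs = [(d, m) for d in dims for m in dims if d != m]
        # count aggregation (agg=len) keeps the sketch safe for non-numeric measures
        per_sample.append({a: interestingness(sample, *a, agg=len) for a in aggs})
    return per_sample

def e_sampling(cf, n1, n2, rng):
    """ESampling: dimensions and measures come from the whole cf; only the
    aggregate evaluation runs on the samples."""
    dims = candidate_dimensions(cf)
    aggs = [(d, m) for d in dims for m in dims if d != m]
    return [{a: interestingness(rng.sample(cf, n2), *a, agg=len) for a in aggs}
            for _ in range(n1)]

def reconcile(per_sample_scores):
    """Union of the per-sample lists, re-ranked by an estimated global score
    (mean of per-sample scores; Dagger+ uses pooled variance instead)."""
    union = set(chain.from_iterable(per_sample_scores))
    score = {a: sum(s.get(a, 0.0) for s in per_sample_scores)
                / sum(1 for s in per_sample_scores if a in s)
             for a in union}
    return sorted(union, key=lambda a: -score[a])
```

For instance, grouping three facts by a "dept" property and summing an "age" measure yields group aggregates {42, 30}, whose variance is the interestingness score; `reconcile` then merges the ranked dictionaries produced by either strategy into a single list.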
5 EXPERIMENTAL RESULTS
We measure the running time and the accuracy of the results obtained with our sampling techniques. We used a 2.7 GHz Intel Core i7 with 16 GB of RAM. We describe experiments on the articles from the DBLP dataset (http://www.rdfhdt.org/datasets/); we use 20,132,491 triples describing information about 888,183 distinct articles.

Dagger+ without sampling. First, we evaluate how much the use of the summary, and the shared evaluation, improve performance. Figure 2 shows the two main components (finding dimensions and aggregate evaluation), as well as a third, very small bar representing the other operations. Note that the property analysis which allows selecting candidate dimensions also enables selecting candidate measures (thus candidate measure selection time is negligible and wrapped under "Other").

Figure 2: Impact of summary and sharing.

Without any optimization ("NoSummary, NoShared" bar in Figure 2), 21% of the time is spent looking for dimensions, while 78% of the time goes to evaluating aggregates. Summary use introduces a modest performance improvement, while shared aggregate evaluation has a much more significant effect (see the "NoSummary, Shared" and "Summary, Shared" bars). Here, Dagger+ generates a total of 371 candidate aggregates; without shared evaluation, each of them is evaluated as one query, whereas with shared evaluation, only 131 queries are executed, as this is the number of distinct (dimension, measure) pairs.

Sampling. We measure the running time and the accuracy of the results by varying three parameters: (i) the sampling strategy (CFSampling or ESampling), (ii) the number of samples (2, 3, 4 or 5), and (iii) the sample size, as a ratio of the candidate fact set size (from 1% to 10% of the number of distinct cf resources). Figure 3 shows the running times: one plot for each number of samples (from top to bottom: 2, 3, 4, 5), each of which varies the strategy and the sample size. Ten runs are averaged for each parameter combination (different samples are drawn in each run).

Figure 3 shows that ESampling is slower than CFSampling, as the former looks for candidate dimensions on the whole candidate fact set, whereas the latter does this on smaller-size samples. ESampling enumerates all the candidate aggregates also found without sampling, while CFSampling might enumerate a different set of aggregates: some properties that are overall frequent (infrequent) might be found infrequent (frequent) in a specific sample, and thus not be considered (be considered) as candidate dimensions and measures. However, even though ESampling enumerates all the no-sampling candidate aggregates, it evaluates them on samples, and may end up considering them interesting (uninteresting) differently from the way they are on the complete cf.

To evaluate the accuracy of the results obtained through sampling, we compare the Top-5, Top-10 and Top-20 aggregates found without sampling with those found through sampling. The ranked list of aggregates found with sampling is built by taking the union of all those found across the samples and re-ranking them according to the interestingness measure (in our experiments, we use the pooled variance to rank all the results). Aggregates whose interestingness is zero are not considered. Notice that we do not break ties, that is: if we search, e.g., for the Top-5 most interesting aggregates, but find that the sixth element in the list has the same interestingness value as the fifth, we also include it in the Top-5, as we did not find a meaningful way to break such ties. When no sampling is used, Top-5 returned 5 aggregates, Top-10 returned 11, while Top-20 returned 20. We compute the accuracy as the percentage of aggregates in the sampling result that are also in the result without sampling.

Table 1 shows the accuracy of CFSampling, while Table 2 shows that of ESampling.

#samples  Top-K    1%    3%    5%    7%    9%   10%
   2      Top5    28%   36%   28%   24%   24%   36%
          Top10   41%   55%   54%   55%   62%   75%
          Top20   57%   69%   80%   81%   88%   96%
   3      Top5    32%   16%   28%   24%   36%   20%
          Top10   49%   55%   60%   55%   74%   61%
          Top20   57%   68%   74%   80%   85%   83%
   4      Top5    32%   22%   16%   12%   20%   12%
          Top10   46%   50%   39%   42%   50%   55%
          Top20   63%   68%   72%   76%   81%   85%
   5      Top5    40%   20%   24%   20%   16%   20%
          Top10   55%   53%   50%   51%   49%   55%
          Top20   67%   63%   68%   74%   78%   79%

Table 1: Accuracy of results with CFSampling.

#samples  Top-K    1%    3%    5%    7%    9%   10%
   2      Top5    40%   36%   24%   32%   28%   32%
          Top10   49%   55%   63%   57%   63%   71%
          Top20   65%   69%   81%   84%   89%   93%
   3      Top5    28%   28%   32%   24%   16%   28%
          Top10   45%   68%   71%   60%   59%   72%
          Top20   67%   76%   75%   83%   83%   87%
   4      Top5    36%   16%   12%   24%   24%   20%
          Top10   49%   55%   27%   41%   60%   53%
          Top20   70%   71%   66%   73%   87%   84%
   5      Top5    36%   20%   12%   24%   20%   24%
          Top10   50%   49%   38%   45%   52%   60%
          Top20   68%   63%   68%   77%   80%   88%

Table 2: Accuracy of results with ESampling.

In 57% of the cases, the accuracy obtained using ESampling is higher; on average, the two accuracy values differ by 6%. The accuracy is quite low when searching for the Top-5 aggregates; in contrast, both Top-10 and Top-20 can be well approximated (accuracy above 80% for the Top-20 even with few samples). In general, the bigger the samples, the better the accuracy; however, the results show situations where the Top-10 and Top-20 have better accuracy with a lower number of samples: there is a 10% difference in accuracy on average, i.e., the top-k lists differ by 2 to 3 aggregates. This does not mean that such aggregates were not found, though; they have simply been ranked differently.

Figure 4 plots the accuracy with respect to the running times. Each line of graphs represents the Top-5, Top-10 and Top-20 for a different number of samples; red x's indicate CFSampling, while green +'s indicate ESampling, for different sample sizes. Clearly, the accuracy increases with K (the number of top aggregates) and with the size of the samples. Increasing the number of samples does not significantly improve accuracy.

Figure 4: Accuracy of results (%) with respect to running times (sec).

6 CONCLUSION
Automatic selection of interesting aggregates in an RDF graph is challenging, as candidate dimensions and measures must be "guessed" from the data, and candidate aggregates must be evaluated to assess their interestingness. After re-engineering Dagger [5] to exploit an existing efficient RDF platform, we have shown that sharing work and (to a lesser extent) using a graph summary can reduce its running time by a factor of 2. Our main focus has been on sampling, which, for Top-10 and Top-20 search, achieves good accuracy (above 70%) while reducing running time by another factor of 2. Overall, CFSampling appears to be the most interesting strategy. Our current work stretches in several directions: (i) generalizing Dagger to more complex dimension and measure selection, (ii) adopting more interestingness metrics, and (iii) introducing new types of derived properties, e.g., extracting meaningful keywords from textual properties to be used as potential dimensions and/or measures.

REFERENCES
[1] F. Benedetti, S. Bergamaschi, and L. Po. 2015. LODeX: A Tool for Visual Querying Linked Open Data. In ISWC 2015 Posters & Demonstrations.
[2] D. Bleco and Y. Kotidis. 2018. Using entropy metrics for pruning very large graph cubes. Information Systems (2018). To appear.
[3] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2015. Optimizing Reformulation-based Query Answering in RDF. In EDBT.
[4] D. Bursztyn, F. Goasdoué, and I. Manolescu. 2016. Teaching an RDBMS about ontological constraints. PVLDB 9, 12 (2016).
[5] Y. Diao, I. Manolescu, and S. Shang. 2017. Dagger: Digging for Interesting Aggregates in RDF Graphs. In ISWC Posters & Demonstrations Track.
[6] R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. 2005. Multi-structural databases. In PODS. ACM, 184–195.
[7] F. Goasdoué, P. Guzewicz, and I. Manolescu. 2019. Incremental structural summarization of RDF graphs (demo). In EDBT.
[8] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. 2015. Overview of Data Exploration Techniques. In SIGMOD. 277–281.
[9] B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. 2017. Extracting Top-K Insights from Multi-dimensional Data. In SIGMOD. ACM, 1509–1524.
[10] M. Vartak, S. Rahman, S. Madden, A. G. Parameswaran, and N. Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182–2193.
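As an illustrative endnote, the comparison protocol used in the experiments (tie-aware Top-k selection and the accuracy measure) can be sketched in a few lines. This is a toy sketch under our own naming, not code from Dagger+; it only assumes what the text states: zero-interestingness aggregates are dropped, ties with the k-th score are kept, and accuracy is the fraction of the sampled Top-k also present in the exact Top-k.

```python
def top_k_with_ties(scores, k):
    """Tie-aware Top-k: aggregates with zero interestingness are dropped,
    and any aggregate tied with the k-th score is also kept."""
    ranked = sorted((p for p in scores.items() if p[1] > 0), key=lambda p: -p[1])
    if len(ranked) <= k:
        return [a for a, _ in ranked]
    cut = ranked[k - 1][1]  # interestingness of the k-th aggregate
    return [a for a, s in ranked if s >= cut]

def accuracy(sampled_topk, exact_topk):
    """Percentage of the sampled Top-k that also appears in the exact Top-k."""
    return 100.0 * len(set(sampled_topk) & set(exact_topk)) / len(sampled_topk)
```

This reproduces, e.g., the behavior reported in Section 5 where a no-sampling Top-10 returned 11 aggregates: a tie at the boundary extends the returned list.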