=Paper= {{Paper |id=Vol-1963/paper539 |storemode=property |title=Dagger: Digging for Interesting Aggregates in RDF Graphs |pdfUrl=https://ceur-ws.org/Vol-1963/paper539.pdf |volume=Vol-1963 |authors=Yanlei Diao,Ioana Manolescu,Shu Shang |dblpUrl=https://dblp.org/rec/conf/semweb/DiaoMS17 }} ==Dagger: Digging for Interesting Aggregates in RDF Graphs== https://ceur-ws.org/Vol-1963/paper539.pdf
    Dagger: Digging for Interesting Aggregates in RDF
                         Graphs

                      Yanlei Diao∗ , Ioana Manolescu∗ , and Shu Shang∗
               ∗
                   Ecole Polytechnique, Inria, and Université Paris Saclay, France


       Abstract. RDF is the format of choice for representing Semantic Web data. RDF
       graphs may be large and their structure is heterogeneous and complex, making
       them very hard to explore and understand. To help users discover valuable in-
       sights from RDF graph, we have developed Dagger, a tool which automatically
       recommends interesting aggregation queries over the RDF graphs; Dagger eval-
       uates the queries and graphically shows their results to the user, in the ranked
       order of their interestingness. We propose to demonstrate Dagger to the ISWC
       audience, based on popular real and synthetic RDF graphs.

1   Introduction
RDF (the Resource Description Framework) is the W3C standard for describing inter-
connected resources, by specifying the values of their properties. Given that property
values may also be resources, RDF data is organized in graphs; within a graph, some
resources may have one or more types, or they may lack types.
    RDF graphs may be very large and their structure complex and heterogeneous, thus
finding interesting information they comprise may be hard. We developed Dagger a
system which automatically identifies the most interesting insights in a given RDF graph
G, defined as RDF aggregate queries evaluated over G; Dagger ranks such insights in the
decreasing order of their interest, evaluates and plots the most interesting ones as bar
charts, and shows them to the user. Figure 1 exemplifies interesting aggregates selected
and visualized by Dagger in a popular RDF graph.
    Dagger is the first system to support RDF graph discovery by recommending in-
teresting aggregation queries. Below, we formalize the problem and present Dagger’s
solution, then outline the demonstration scenario, and briefly discuss related works.

2   The insight selection problem
An RDF graph G is a set of triples, typically denoted (s, p, o), where s is called subject,
p is the property, and o is the object, or value of the property p of s. s and p are unique
resource identifiers (URIs), or blank nodes (a special form of unknown/unspecified
nodes). The special property rdf :type (τ , for short) allows a form of resource typ-
ing, e.g., the triple (s1 , τ, Student) states that s1 is a person; a resource may have
one or several types, or it may lack types. Property values can be URIs, blank nodes,
strings, integers, floating point numbers, dates etc. Values may be typed explicitly, as in
“12.5ˆˆxsd:decimal”, but may also appear simply as “12.5”. An ontology is sometimes
associated to an RDF graph, to describe its semantics, e.g., stating that any Student is a
Person; this may lead to implicit triples, e.g., (s1 , τ, Person) in our example. An RDF
graph is saturated if it contains all its implicit triples. Below, we assume G is saturated.
2        Yanlei Diao, Ioana Manolescu, and Shu Shang




Fig. 1: Dagger-selected aggregates in DBLP (http://dblp.uni-trier.de/) bibliographic data. From
left to right: the average number of authors of an article, per year (from 2 to 5); the number of
conferences per year (up to 22.000); the number of books per publisher (highest for Springer).
We consider RDF aggregation queries which are a subset of the RDF analytical queries
introduced in [3] and also of W3C’s SPARQL 1.1 aggregation queries. On a bibliog-
raphy RDF graph, sample queries are (q1 ) : “For each year, how many authors have
published conference papers that year” and (q2 ) : “For each year, what is the average
number of authors of conference papers published that year”.
     Generally, an RDF aggregation query q = hf, d, m, ⊕i consists of three RDF
queries f, d, m and an aggregation function ⊕. f determines the facts, or resources
over which the aggregate is computed (in q1 and q2 , the conference papers); d de-
fines the dimension (in q1 and q2 , the year); m specifies a measure (the authors in
q1 , and the number of authors in q2 ); while the latter is typically not present in publi-
cation graphs, Dagger derives such information. Finally, the aggregation function ⊕ is
count(distinct(·)) in q1 and average in q2 ; ⊕ ranges over the set Ω = {count, avg, sum,
min, max}, possibly combined with a distinct. The result of q on a graph G, denoted
q(G), is the set of pairs {(x, ⊕{mj | ∃fi ∈ f (G), fi .d = x, fi .m = mj })}, where fi
is a resource obtained by evaluating f on G, x = fi .d, respectively mj = fi .m is a
result of evaluating d, respectively m on fi , and ⊕{·} is the result of calling ⊕ on an
input set. For instance, q1 (G) may be {(2015, 1200), (2016, 2420), (2017, 3760)} while
q2 (G) may be {(2015, 2.5), (2016, 2.4), (2017, 2.5)}.
     While relational aggregation queries are well-known, RDF graph heterogeneity
makes our queries different in many respects. (i) A fact may lack the dimension (for
instance, a paper may lack its publication year): such a fact does not contribute to the
query. (ii) A fact may have several values along d, e.g., if the dimension is author affil-
iation and authors have multiple affiliations; such a fact contributes to the aggregation
group of each of its dimension values. (iii) A fact may lack a measure, e.g., the institu-
tion may be unspecified for an author. If ⊕ is count, such a fact contributes with a count
of 0; otherwise, it does not contribute to the query result. (iv) A fact may have several
values for m, e.g., a paper with several authors; ⊕ then applies over the complete set
of measure values obtained for the facts having the same value for the dimension [3].
The interestingness I(q, G) of a query q on a graph G, where q(G) is the set of pairs
{(x1 , ⊕1 ), . . . , (xn , ⊕n )} for some n ≥ 1, is a measure of the set of values {⊕1 , . . . , ⊕n }.
We have mostly experimented with I1 defined as the variance (second statistic mo-
ment) [1] of the value set; high variance reflects a strong difference among values,
which may correspond to a trend, and/or to spikes in the data, as illustrated in Fig-
ure 1. In our example, q1 (G) is more interesting than q2 (G), as there is more variation in
{1200, 2420, 3760} (strong growing trend) than in {2.5, 2.4, 2.5}. Dagger also supports
the interestingness measures skew I2 and kurtosis I3 [1] of the measure value set.
                          Dagger: Digging for Interesting Aggregates in RDF Graphs       3

The problem considered by Dagger is: given a graph G, an integer k, and an interesting
measure I, find the k most interesting aggregate queries q(G). The search space is huge,
as many f, m, d queries can be asked on a graph G. Below, we present Dagger’ practical
approach, which has allowed finding many interesting aggregates efficiently.

3   Dagger’s architecture and algorithms
We have built Dagger as a Java and Python application on top of the PostgreSQL 9.6
relational database server; the GUI is based on Jupyter notebooks (http://jupyter.org).
Dagger stores G as a collection of tables in PostgreSQL, one table for each class and
property. While PostgreSQL bears the brunt of aggregate query evaluation, Dagger
carefully translates between the RDF world and PostgreSQL, to faithfully reflect the
RDF-specific features of our aggregation queries we explained above.
     Dagger selects interesting insights through the following steps: 1. identify several
candidate fact sets; 2. for each candidate fact set, explore a set of candidate dimensions;
3. for each (f, d) pair, explore candidate measures m; 4. for each (f, d, m) combination,
pick applicable ⊕ functions; 5. evaluate the interestingness of the queries resulting from
steps 1. to 4. and retain those with top k values. In the sequel, we outline each step; for
simplicity, we use f, d, m to refer to the respective queries or to their results on G.
     1. Candidate fact sets (i) We identify the classes of
G (that is, the set of all URIs c such that (x, τ, c) ∈ G),
and for each such c, the set fC of all resources of type
c in G is a candidate fact set; (ii) given a user-specified
support threshold tsupp between 0 and 1, for any set of
properties P = {p1 , p2 , . . . , pn } such that at least tsupp of
the triple subjects in G have (at least once) each property in
P , the set fP of G resources having the properties in P is
a candidate fact set. tsupp can be tuned using a J3S GUI (see sample screenshot above
for illustration) showing the fraction of G resources having various set of properties.
Internally, these are computed by cube SQL queries evaluated by PostgreSQL.
     2. Candidate dimensions We look for dimensions among properties which all facts
in f have, as facts lacking dimensions do not contribute to q(G). Further, such a prop-
erty should have much fewer distinct values than there are facts, otherwise aggregation
groups may be too many and too small. We use a distinct-value threshold tdist = 0.05
to retain only properties whose values are shared on average by at least 1/tdist facts.
Next, for each property pd thus identified, we materialize as a derived dimension the
number (count) of pd values of each candidate fact. We denote this #pd , and store it in
a separate derived property table; both pd and #pd serve as candidate dimensions.
     An interesting question is the order among the values of a candidate dimension.
For instance, the visible yearly trends in Figure 1 (left and center) would be lost with a
different bar order. To determine the best order, for each candidate dimension property
pd , we detect the majoritary data type (string, integer, date, double) among the values
of pd for all candidate facts. If a value is typed, we use that information, otherwise we
attempt to cast it to integers, floats etc. and record the most specific type to which the
cast succeeded. Again we use a threshold ttype , and if at least ttype of the values of
facts f match a type (other than string), we store the typed values of f.pd in a separate
4       Yanlei Diao, Ioana Manolescu, and Shu Shang

PostgreSQL table, with the appropriate types. In Figure 1, Dagger found that years are
integers, and plotted query results in a visually compelling fashion.
     3. Candidate measures Given f and d, a candidate measure is a property pm which
all f facts have; which is not d; and such that d is not #pm , and pm is not #d. To prepare
the ground for aggregation function selection (see below), we detect the type of each
pm property thus identified, just like we did for pd in step 2.
     4. Candidate aggregate functions Given (f, d, m), candidate functions ⊕ are cho-
sen based on the type of m values: all Ω functions apply to numeric types; min, max,
count and count(distinct) apply to dates; count and count(distinct) apply to strings.
     5. Ranking by interestingness Dagger evaluates each q = (f, d, m, ⊕) using Post-
greSQL, computes I(q) and plots the k most interesting query results.
4   Demonstration scenario
Users interacting with Dagger may: (i) select an RDF dataset among those pre-loaded;
we plan to use a dozen datasets, real (DBLP, Linked Clinical Trials, BBC Music Pro-
grams, DBPedia etc.) and synthetic (LUBM and BSBM benchmark data); users may
also point us to datasets of their choice; (ii) select an interestingness measure I (vari-
ance, skew, or kurtosis); (iii) visualize the resulting top-k insights, either pre-computed,
or computed on the spot; (iv) modify the pre-set values of tsupp , tdist and ttype and trig-
ger the recomputation of the top-k insights.
    A video of our demo appears at https://team.inria.fr/cedar/projects/dagger.
5   Related work
Tools comparable to Dagger have been developed in relational data warehouses, when
the possible aggregation dimensions and measures are known [6,7]; [6] also investigates
how to efficiently derive an interesting aggregate from another. In contrast, Dagger ap-
plies on a heterogeneous RDF graph in which no facts, dimensions, and measures are
known, and recommends candidate queries through a set of effective heuristics. [2]
selects interesting cuboids from an “aggregated graph”, pre-built by analysts from re-
lational or any other kind of data, counting associations between two resources of the
same kind (e.g., social network interactions between users), while Dagger applies di-
rectly on (any) RDF graph. Dagger complements many other RDF exploration tech-
niques such as mining, summarization, faceted search, statistics extraction etc. [4,5].
Ongoing work on Dagger aims to optimize insight computations along the lines of [7].
Acknowledgements Zheng Zhang has developed the J3S property set analysis GUI.
References
1. Encyclopedia of Statistical Sciences. Wiley, 2014.
2. D. Bleco and Y. Kotidis. Entropy-based selection of graph cuboids. In GRADES, 2017.
3. D. Colazzo, F. Goasdoué, I. Manolescu, and A. Roatiş. RDF analytics: lenses over semantic
   graphs. In WWW, 2014.
4. A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In ESWC, 2013.
5. S. Khatchadourian and M. P. Consens. Understanding billions of triples with usage sum-
   maries. Semantic Web Challenge, 2011.
6. B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. Extracting top-k insights from multi-
   dimensional data. In SIGMOD, 2017.
7. M. Vartak, S. Rahman, S. Madden, A. G. Parameswaran, and N. Polyzotis. SEEDB: efficient
   data-driven visualization recommendations to support visual analytics. PVLDB, 8(13), 2015.