Exploring RDF Graphs through Summarization
                             and Analytic Query Discovery
                                                                           Ioana Manolescu
                                                                    ioana.manolescu@inria.fr
                                                              Inria, Institut Polytechnique de Paris
                                                                         Palaiseau, France
ABSTRACT                                                                                 Such a broad graph analysis may hide (or obscure within a larger
Graph data is central to many applications, ranging from social                          group) interesting values or outliers, but it has the advantage of
networks to scientific databases. Graph formats maximize the                             enabling a global, top-down view, which can be gained as one
flexibility offered to data designers, as they are mostly schema-                        starts working with the graph.
less and thus can be used to capture very heterogeneous-structure                           The research highlighted below takes this second path. The
content. RDF, the W3C’s format for sharing open (linked) data,                           problems to be solved are: how to efficiently build meaningful
adds the possibility to attach semantics to data, describing application-                summaries of large RDF graphs (Section 2); and how to analyze
domain constraints by means of ontologies; in turn, this leads to                        and explore RDF graphs by means of aggregate queries (Sec-
implicit data that is also part of a graph even if it is not explicitly                  tion 3). Each problem raises specific conceptual and algorithmic
in it.                                                                                   challenges; we motivate the solutions we found, and point to
   In this paper, we present a structured walk through the prob-                         interesting areas where the work could continue.
lem of analyzing and exploring RDF graphs by finding groups of                              Are node groups an interesting metaphor for exploring RDF
structurally similar nodes, and by automatically identifying inter-                      graphs? Figure 1 (from [8]) tends to suggest it. It depicts the
esting aggregates theirein. We outline the challenges raised by                          properties of the subjects in a graph describing publications listed
such processing in large, complex RDF graphs, outline the basic                          in the DBLP server. Each ring represents the frequency of a given
principles behind existing solutions, and highlight opportunities                        RDF property among these subjects (or resources). Thus, the
for future research.                                                                     central blue ring reflects the property rdf:type, which clearly
                                                                                         all the subjects have; the second one, dark blue, is date, which
                                                                                         publications have, but authors do not; we can see a set of other
1     INTRODUCTION                                                                       properties present on almost all publications (their frequency
Graph data is increasingly popular, thanks to the flexibility it                         diminishing as we move away from the center of the graph),
allows to its designers: it enables representing varying-structure                       while another set of resources have the name property but none
entities together with their rich attributes and the relationships                       of the properties that publications have; these are the authors.
interconnecting them.
    In particular, RDF graphs are abundantly present on today’s
Web, as RDF is the recommended format for sharing Open Data.
The Linked Open Data Cloud Web site (https://lod-cloud.net/)
lists numerous examples of RDF databases. Nevertheless, the
multiplication of data sources is not sufficient to enable the con-
struction of applications that take advantage of it. An important
obstacle is rooted in the very advantages of RDF: its flexibility
and the heterogeneity it tolerates in the data make it hard for
users to understand what a graph is about, and potentially even
harder to detect what is interesting within the graph.
    Two approaches can be seen for analyzing and exploring a
graph’s content. On one hand, node-focused exploration could al-
low for instance users to identify a few nodes and/or edges they
are interested in. This could be achieved by allowing them to
search, e.g., through keywords, or by some statistical analysis, e.g.,
identifying nodes that are somehow outliers, through their con-
tent or through their structural properties. Such fine-granularity
exploration enables gaining detailed knowledge about relatively                          Figure 1: Property frequency analysis in an RDF graph [8].
small part of the graph. On the other hand, group- or class-focused
exploration seeks to identify interesting subgraphs, or (most typ-
ically) groups of nodes, which are in a certain sense similar or                         2   SUMMARIZING RDF GRAPHS THROUGH
comparable. The first step is thus to simplify the cognitive task
                                                                                             STRUCTURAL QUOTIENTS
of getting acquainted with a graph, by reducing it to the (sim-
pler) task of understanding a smaller, abstract version thereof,                         The problem of summarizing RDF graphs has been extensively
where each group of nodes represents a “class” or “meta-node”.                           studied, in particular drawing upon ideas and solutions proposed
                                                                                         for summarizing generic graphs, or XML documents; RDF sum-
© Copyright 2020 for this paper held by its author(s). Published in the proceedings of   marization approaches are surveyed in [3]. A brand of summaries
DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT
2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution           well-established in database research is that of structural quo-
4.0 International (CC BY 4.0).                                                           tients: an equivalence relation is identified between the nodes
Figure 2: Sample Strong Summary [10] of a Berlin SPARQL Benchmark (BSBM) graph of 100 million triples. Other sum-
mary examples are depicted on the RDFQuotient project Web site, where our summarization tool is also available in open
source.


of a graph, typically based on their incoming/outgoing edges.                 First, RDF nodes may have types; this is encoded by graph
A quotient summary node has one node per equivalence class,                edges, connecting the typed nodes to special kinds of resources
and one edge between two summary nodes if and only if a corre-             in the graph, namely the type nodes themselves. A node may
sponding edge connects, in the original graph, a pair of nodes             have zero, one, or more types, which may or may not be logi-
they represent. Quotient summaries have been introduced as basis           cally connected. Further, some nodes in a graph may have types,
for structural indexing, in OEM databases [11] and subsequently            while others lack them. Types complicate summarization, since
for XML, e.g. [15].                                                        on one hand, they encapsulate precious application knowledge
    The choice of an equivalence relation, thus, fully determines          when present, but on the other hand, summarization must be
a summary. Which equivalence relation to pick? In [4, 10], we              able to make sense of a graph even in their absence. Therefore,
have proposed two novel relations, based on the transitive closure         we have distinguished data-first summarization, which groups
of sharing incoming, respectively, outgoing properties. Thus, if           nodes according to their types first and foremost, and then carries
n 1, n 2 both have titles, n 2 and n 3 both have authors, while n 3, n 4   the type of a node to its representative in the summary. This is
both have publication years, we say n 1 to n 4 share the same outgo-       suited for graphs where types are mostly absent, or not sufficient
ing property clique, comprising the properties (edge labels) “title”,      to distinguish classes of nodes from each other. The opposite
“author”, and “year”. This outgoing clique (set of properties) is de-      strategy is type-first; it groups nodes by their types, and only
fined based on their (transitively) co-occurring on common nodes.          uses property cliques to differentiate between the untyped ones.
Observe that n 1 and n 4 may be quite different from each other;           Depending on the graph, data-first or type-first summarization
in particular, they may have no property in common. Incoming               may be more suitable in order to produce summaries easy to
property cliques are symetrically defined. Based on the notions of         understand.
incoming, respectively, outgoing property cliques, we introduce               A second, more subtle aspect is due to the presence of an
two notions of equivalence, so-called weak and strong [10], and            ontology, which may make part of the graph implicit, that is,
show that they lead to very compact summaries of RDF graphs for            triples may hold in the graph, which are not explicitly present
which previously proposed quotients lead to summaries having               there. In this case, summarizing the graph of explicit triples may
more nodes by orders of magnitude. Figure 2 illustrates this: the          not account for the implicit ones. We have proposed in [10] a
summary of a BSBM graph of 100 million triples has only 5 nodes            sufficient condition under which one can compute the summary
and 11 edges, the size of a relatively simple Entity-Relationship          of a saturated graph (including all its implicit and explicit data),
diagram.                                                                   without actually saturating the graph; we also show that our
    What sets the summarization of RDF graphs apart from related           Weak and Strong summaries, in their data-first incarnation, sat-
graph summarization problems? Several features concur.                     isfy this condition, whereas any type-first summarization does
                                                                           not.
   The summaries we devised, like many others, strive to separate         As this example shows, a fact contributes to the answer of
nodes of a large graph in groups that simplify its understanding.      an RDF analytical query iff it has values for all dimensions and
Compared with other works, our goal has been to facilitate under-      for the measure; a fact may contribute to several cells, if it has
standing at first sight the major groups of nodes in a graph. We       multiple values for one or several dimensions. This flexible model
made the hypothesis that accepting “transitively similar” nodes        has numerous advantages for analyzing RDF graphs:
in a same group allows identifying such groups; our experiments
                                                                       • There may be several fact sets in an RDF graph. One could, for
bear out this claim. Another strong advantage of our summaries
                                                                          instance, in a publication dataset, consider the articles to be the
is that they can be all built in time linear in the size of the in-
                                                                          facts, and aggregate them according to their topics, their year
put [9, 10], including in incremental mode, that is, deriving the
                                                                          of publication etc.; on the opposite - or rather, at the same time
summary equivalence relation and summarizing the graph at the
                                                                         - one could consider the authors to be the facts, and articles (or
same time.
                                                                          the articles’ years, or topics, or venues) as dimensions.
   The compactness of our summaries comes at a cost of precision.
                                                                       • As explained above, it flexibly accomodates the absence of
For instance, they provide very poor support for indexing, since
                                                                          a dimension or a measure, as well as their possible multiple
they are unable to guarantee that graph nodes represented by a
                                                                         values.
certain summary node have, a certain property. More generally,
they (and any other quotient summaries) reflect the structure,           The formal semantics of such analytical queries [1, 6] is com-
but not the values (leaf nodes) present in the graph.                  patible with the SPARQL 1.1. aggregation semantics; the latter,
                                                                       however, is only concerned with the syntactic level, not with the
3     EXPLORING RDF GRAPHS BY MEANS OF                                 more conceptual one where facts, dimensions and measures are
      INTERESTING AGGREGATES                                           specified.
While aggregation is well-established as a way to analyze, aggre-
gate and summarize relational data, the very meaning of aggre-         3.2    Automatically identifying interesting
gation has been slow-coming for graphs, and in particular for                 RDF aggregates
RDF. In March 2013, the SPARQL 1.1 specification introduced a          As previously explained, RDF analytical queries enable express-
Group-By primitive together with aggregation operators; their          ing a large set of questions which enable characterizing, in a
semantics is essentially lifted from the relational database world,    flexible manner, the nodes of an RDF graph. But what queries to
and applied to the tuples of bindings resulting from the matches       ask?
of a Where SPARQL block. Below we outline a path we started                A well-explored branch of research in relational data analytics
from devising an RDF counterpart to relational (data warehouse)        concerns the automated identification of interesting analytical
relational queries, formalizing RDF analytical (aggregate) queries     queries [17–19]. These works are placed in a typical relational
(Section 3.1), and (in subsequent, currently ongoing work) ex-         data warehouse scenario, where a large number of dimensions
ploring RDF graphs by automatically identifying interesting ag-        exist, and seek to automatically proposed to the users the an-
gregates (Section 3.2).                                                alytical queries that are likely to bring them most insight. For
                                                                       instance, in [18], a query is interesting (brings a useful insight) if
3.1    RDF aggregate queries                                           it exhibits, on a subset of the facts, a trend that is different from
Our research [1, 6] considered, at about the same time, RDF            the one that holds on the complete fact set.
aggregation at a conceptual: what should an RDF analytical query           In our Dagger project [8], we initiated an approach to auto-
look like? The well-known concepts of facts, dimension and             matically identify interesting analytical queries in RDF graphs.
measure from the relational literature hardly fit. To start with,      This was based on a set of simple choices:
RDF graphs lack a previously-defined schema, and thus the facts
at the heart of analytical processing are not defined; irregularity    • Chosing as facts all nodes of a given RDF type, or, alternatively,
in the data may lead to a dimension or measure being absent, or          asking users to specify the fact query;
being multiply defined. We proposed to define RDF analytical           • Chosing as dimensions the properties that sufficiently many
(aggregate) queries as a combination of a fact query, defining the       facts have, and whose number of distinct values does not ex-
set of resources to be treated like facts and analyzed together, a       ceed a certain threshold; we also introduced derived properties,
set of dimension queries, associating to each fact zero or more          such as the number of authors that a paper has, which we treat
values against each dimension, a measure query, specifying what          like a new propertu attached to the paper fact;
to use as a measurable property of each fact, finally an aggregation   • Chosing a measure among the other (original or derived) prop-
function among the usual ones (sum, max, average etc.) A sample          erties of the facts;
aggregate query can be composed as follows:                            • Considering an aggregate interesting if it maximizes a certain
                                                                         statistical measure of the aggregate query result.
• Facts are all the articles published between 2000 and 2020;
• A dimension is a country to which an authors’ institutions are          Figure 3 illustrates the kinds of aggregates Dagger identified,
   affiliated; many papers have authors from multiple countries,       in a set of DBLP publications from 1936 to 2006. At the top, the
   naturally leading to multiple values for a dimension;               average number of authors of a published paper; we see the
• Another dimension is the year;                                       rise of co-authorship along the years. At the center, the number
• A measure of a paper is a keyword in the paper abstract;             of published papers grouped by year; this graph really gives
• The aggregation function counts the different keywords.              flesh to the concern that as an academic community we may be
The analytical query described above groups papers by the year         publishing too much! Last but not least, the graph at the bottom
and author country, and for each paper group, it counts the            counts the books listed in DBLP and grouped by their publisher.
keywords associated to papers published in that years with an          The dominating bar corresponds to Springer; Infix Verlag comes
author from that country.                                              second, and a set of bars at the left of the graph show different
                                                                    • We enlarge the exploration space to multidimensional (not just
                                                                       mono-dimensional) aggregates;
                                                                    • We introduce more derived properties, for instance by means of
                                                                       topic extractions from text; also, moving toward the generality
                                                                       of the analytical queries introduced in [6] we allow dimensions
                                                                       and measures to be defined by paths of a certain length starting
                                                                       in the facts;
                                                                    • To cope with the expensive exploration of multidimensional
                                                                       aggregates while remaining efficient, we have devised a novel
                                                                       version of a well-known algorithm [20], capable of evaluat-
                                                                       ing in a single pass all the aggregates determined by a set of
                                                                       dimensions, a measure and an aggregation function;
                                                                    • Still toward the goal of scalability, we have devised novel early-
                                                                       stop techniques, capable of estimating the interestingness of an
                                                                       aggregation query while it is computed, and stop the computa-
                                                                       tion as soon as it becomes clear that other aggregates, whose
                                                                       computation is ongoing, are more interesting.
                                                                        Figure 4 illustrates a multidimensional aggregate Spade iden-
                                                                    tified. It shows, for instance, that the “system” and “machine”
                                                                    keywords have been present from the early days of DBLP publi-
                                                                    cations, whereas the “web” newcomer started its history in the
                                                                    1990s; we see “system” making an important comeback (light
                                                                    yellow area toward the top right), idem for “network” etc.
                                                                        Our work on Spade is ongoing at the time of this writing.
                                                                    We are still working to improve its performance, and to under-
                                                                    stand the interplay of the early-stop and of the multidimensional
                                                                    algorithms used to explore and estimate the interestingness of
                                                                    various RDF analytical queries.

                                                                    4   CONCLUSION AND OUTLOOK
                                                                    The field of graph analytics is by nature very broad, given the
                                                                    extreme diversity of data modeled as graphs. This paper summa-
                                                                    rizes a set of recent work carried with the global goal of helping
                                                                    users grasp the content of a large and potentially complex RDF
                                                                    graph. Our key findings can be summarized as follows:
                                                                    • Identifying interesting node groups is an intuitive first step
                                                                       toward gaining an understanding of the graph, of its semantics,
    Figure 3: Interesting insights found by Dagger [8].                structure and content.
spellings for Addison-Wesley, leading to an artificial separation   • The properties incoming and outgoing RDF resources can be
of their books over what would appear to be several publishers.        used as a good basis for identifying such groups, provided that
                                                                       good measures are taken to avoid the extreme fragmentation
3.3    Scaling up the exploration of interesting                       which would result from requiring all nodes in a group to
       aggregates                                                      have exactly the same structure. Instead, summaries such as
Dagger identifies interesting aggregates through exhaustive            we introduced in [4, 9, 10] accept some heterogeneity among
search: it explores and evaluates aggregates subject to a given        the nodes, which generally leads to easy-to-read summaries.
time limit, before returning the most interesting ones. This made   • If one also takes into account the values, that structural sum-
its exploration process lengthy. We explored in [14] the use of        maries completely disregard, there are many ways to explore
sampling, both to select the dimensions and measures to use for        how groups of nodes in RDF graphs compare among them-
the facts, and to decide which aggregates are interesting. While       selves, and countless combinations of facts, dimensions, and
this did reduce the running time, it provided no guarantee of the      measures one could use. In Dagger [8] and its successor
accuracy of the exploration thus abridged.                             Spade [7], we are working to identify as quickly as possible
   In the domain of relational data analytical processing, a key       interesting aggregates, with an interesting measure currently
ingredient to the automated selection of interesting queries is        defined as the variance of the set of values that are part of the
the ability to explore many candidates, and discard as early as        aggregate query result.
possible those queries which can be determined quickly enough          Many avenues for future research are open.
to be not sufficiently interesting. Online aggregation, pioneered   • Personalization, user input, or query by example [13] could be
in [12] has been a crucial ingredient here: it allows to derive,       blended with exploration such as we envisioned it, in order
while aggregate queries are being computed, an approximation           to help users get as soon as possible to the information they
(with a given confidence interval) of these queries’ results. We       need for a specific task, in the spirit of [16].
have explored that path in Spade [7], our follow-up project on      • RDF graph semantics has not yet been fully taken into account
Dagger, where we make several new contributions:                       in the exploration. It could be incorporated as a facet, or as a
Figure 4: Sample interesting aggregate identified by Spade: number of DBLP articles by years and keyword appearing in
their titles. The darker the collor, the fewer articles there are.

  way of navigating from one interesting insight to another one,                        Connections Across Heterogeneous Data Sources. Proceedings of the VLDB
  on a closely (semantically) related set of items.                                     Endowment (PVLDB) 11 (2018), 4. https://doi.org/10.14778/3229863.3236252
                                                                                    [6] Dario Colazzo, François Goasdoué, Ioana Manolescu, and Alexandra Roatis.
   A related, if more mundane, question is which platform (or                           2014. RDF Analytics: Lenses over Semantic Graphs. In 23rd International World
                                                                                        Wide Web Conference. Seoul, South Korea. https://doi.org/10.1145/2566486.
back-end) should best support such analytics; the competition                           2567982
among (RDF) graph processing platforms is currently hot, with                       [7] Yanlei Diao, Pawel Guzewicz, Ioana Manolescu, and Mirjana Mazuran. [n. d.].
                                                                                        Spade: A Modular Framework for Analytical Exploration of RDF Graphs. In
no clear winner in sight. While many contenders exist, the very                         VLDB 2019 - 45th International Conference on Very Large Data Bases. https:
different kinds of processing envisioned, say, in Semantic Web                          //hal.inria.fr/hal-02152844
integration queries, on one hand, and in social network analysis                    [8] Yanlei Diao, Ioana Manolescu, and Shu Shang. 2017. Dagger: Digging for In-
                                                                                        teresting Aggregates in RDF Graphs. In International Semantic Web Conference
with the goal of influence maximization, on the other hand, make                        (ISWC). Vienna, Austria. https://hal.inria.fr/hal-01577464
comparisons difficult, and convergence unlikely.                                    [9] François Goasdoué, Pawel Guzewicz, and Ioana Manolescu. 2019. Incremental
   Going beyond exploration of RDF graphs, one could envision                           structural summarization of RDF graphs. In EDBT 2019 - 22nd International
                                                                                        Conference on Extending Database Technology. Lisbon, Portugal. https://hal.
tools blending more strongly extraction of information from un-                         inria.fr/hal-01978784
structured content, and structured data under one of its many                      [10] François Goasdoué, Paweł Guzewicz, and Ioana Manolescu. 2020. RDF graph
                                                                                        summarization for first-sight structure discovery. The VLDBJ Journal (2020).
forms. This kind of graphs are encountered, for instance, when                          To appear.
integrating heterogeneous data sources such as those available                     [11] Roy Goldman and Jennifer Widom. 1997. DataGuides: Enabling Query For-
to journalists. We have outlined such a graph-based integration                         mulation and Optimization in Semistructured Databases. In Proceedings of
                                                                                        23rd International Conference on Very Large Data Bases, 1997, Athens, Greece.
framework in the ConnectionLens [5] system. Such hetero-                                436–445.
geneous graphs exhibit even more structural and content het-                       [12] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. 1997. Online Aggre-
erogeneity; higher-levels abstraction methods are needed as a                           gation. In SIGMOD. 171–182.
                                                                                   [13] Matteo Lissandrini, Davide Mottin, Themis Palpanas, and Yannis Velegrakis.
first step towards facilitating their understanding [2]. We plan                        2018. Data Exploration Using Example-Based Methods. Morgan & Claypool
to continue work on these topics, within the ANR SourcesSay                             Publishers. https://doi.org/10.2200/S00881ED1V01Y201810DTM053
                                                                                   [14] Ioana Manolescu and Mirjana Mazuran. 2019. Speeding up RDF aggregate
project (2020-2024).                                                                    discovery through sampling. In BigVis 2019 - 2nd International Workshop on
Acknowledgments This research has been partially funded by                              Big Data Visual Exploration and Analytics. Lisbon, Portugal. https://hal.inria.
ANR-16-CE23-0010-01 and the H2020 research program under                                fr/hal-02065993
                                                                                   [15] Tova Milo and Dan Suciu. 1999. Index structures for path expressions. In
grant agreement nr. 800192. supported                                                   International Conference on Database Theory. Springer, 277–295.
                                                                                   [16] Amit Somech, Tova Milo, and Chai Ozeri. 2019. Predicting "What is Interesting"
                                                                                        by Mining Interactive-Data-Analysis Session Logs. In Advances in Database
REFERENCES                                                                              Technology - 22nd International Conference on Extending Database Technology,
[1] Elham Akbari-Azirani, François Goasdoué, Ioana Manolescu, and Alexandra             EDBT 2019, Lisbon, Portugal, March 2019. 456–467. https://doi.org/10.5441/
    Roatis. 2015. Efficient OLAP Operations For RDF Analytics. In International         002/edbt.2019.42
    Workshop on Data Engineering meets the Semantic Web (DESWeb). Seoul, South     [17] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. 2017. Ex-
    Korea. https://doi.org/10.1109/ICDEW.2015.7129548                                   tracting Top-K Insights from Multi-dimensional Data. In SIGMOD. 1509–1524.
[2] Irène Burger, Ioana Manolescu, Emmanuel Pietriga, and Fabian Suchanek. 2020.   [18] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran,
    Toward Visual Interactive Exploration of Heterogeneous Graphs. (2020).              and Neoklis Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization
[3] Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos,          Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182–
    Ioana Manolescu, Georgia Troullinou, and Mussab Zneika. 2019. Summarizing           2193.
    Semantic Graphs: A Survey. The VLDB Journal 28, 3 (June 2019). https:          [19] Yuhao Wen, Xiaodan Zhu, Sudeepa Roy, and Jun Yang. 2018. QAGView: Inter-
    //hal.inria.fr/hal-01925496                                                         actively Summarizing High-Valued Aggregate Query Answers. In SIGMOD.
[4] Šejla Čebirić, François Goasdoué, and Ioana Manolescu. 2015. Query-Oriented         1709–1712.
    Summarization of RDF Graphs. In Proceedings of the VLDB Endowment, Vol. 8.     [20] Yihong Zhao, Prasad Deshpande, and Jeffrey F. Naughton. 1997. An Array-
    Kohala Coast, Hawaii, United States. https://hal.inria.fr/hal-01178140              Based Algorithm for Simultaneous Multidimensional Aggregates. In SIGMOD.
[5] Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh-             159–170.
    Huong Le Nguyen, and Ioana Manolescu. 2018. ConnectionLens: Finding