Exploring RDF Graphs through Summarization and Analytic Query Discovery Ioana Manolescu ioana.manolescu@inria.fr Inria, Institut Polytechnique de Paris Palaiseau, France ABSTRACT Such a broad graph analysis may hide (or obscure within a larger Graph data is central to many applications, ranging from social group) interesting values or outliers, but it has the advantage of networks to scientific databases. Graph formats maximize the enabling a global, top-down view, which can be gained as one flexibility offered to data designers, as they are mostly schema- starts working with the graph. less and thus can be used to capture very heterogeneous-structure The research highlighted below takes this second path. The content. RDF, the W3C’s format for sharing open (linked) data, problems to be solved are: how to efficiently build meaningful adds the possibility to attach semantics to data, describing application- summaries of large RDF graphs (Section 2); and how to analyze domain constraints by means of ontologies; in turn, this leads to and explore RDF graphs by means of aggregate queries (Sec- implicit data that is also part of a graph even if it is not explicitly tion 3). Each problem raises specific conceptual and algorithmic in it. challenges; we motivate the solutions we found, and point to In this paper, we present a structured walk through the prob- interesting areas where the work could continue. lem of analyzing and exploring RDF graphs by finding groups of Are node groups an interesting metaphor for exploring RDF structurally similar nodes, and by automatically identifying inter- graphs? Figure 1 (from [8]) tends to suggest it. It depicts the esting aggregates theirein. We outline the challenges raised by properties of the subjects in a graph describing publications listed such processing in large, complex RDF graphs, outline the basic in the DBLP server. Each ring represents the frequency of a given principles behind existing solutions, and highlight opportunities RDF property among these subjects (or resources). Thus, the for future research. central blue ring reflects the property rdf:type, which clearly all the subjects have; the second one, dark blue, is date, which publications have, but authors do not; we can see a set of other 1 INTRODUCTION properties present on almost all publications (their frequency Graph data is increasingly popular, thanks to the flexibility it diminishing as we move away from the center of the graph), allows to its designers: it enables representing varying-structure while another set of resources have the name property but none entities together with their rich attributes and the relationships of the properties that publications have; these are the authors. interconnecting them. In particular, RDF graphs are abundantly present on today’s Web, as RDF is the recommended format for sharing Open Data. The Linked Open Data Cloud Web site (https://lod-cloud.net/) lists numerous examples of RDF databases. Nevertheless, the multiplication of data sources is not sufficient to enable the con- struction of applications that take advantage of it. An important obstacle is rooted in the very advantages of RDF: its flexibility and the heterogeneity it tolerates in the data make it hard for users to understand what a graph is about, and potentially even harder to detect what is interesting within the graph. Two approaches can be seen for analyzing and exploring a graph’s content. On one hand, node-focused exploration could al- low for instance users to identify a few nodes and/or edges they are interested in. This could be achieved by allowing them to search, e.g., through keywords, or by some statistical analysis, e.g., identifying nodes that are somehow outliers, through their con- tent or through their structural properties. Such fine-granularity exploration enables gaining detailed knowledge about relatively Figure 1: Property frequency analysis in an RDF graph [8]. small part of the graph. On the other hand, group- or class-focused exploration seeks to identify interesting subgraphs, or (most typ- ically) groups of nodes, which are in a certain sense similar or 2 SUMMARIZING RDF GRAPHS THROUGH comparable. The first step is thus to simplify the cognitive task STRUCTURAL QUOTIENTS of getting acquainted with a graph, by reducing it to the (sim- pler) task of understanding a smaller, abstract version thereof, The problem of summarizing RDF graphs has been extensively where each group of nodes represents a “class” or “meta-node”. studied, in particular drawing upon ideas and solutions proposed for summarizing generic graphs, or XML documents; RDF sum- © Copyright 2020 for this paper held by its author(s). Published in the proceedings of marization approaches are surveyed in [3]. A brand of summaries DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution well-established in database research is that of structural quo- 4.0 International (CC BY 4.0). tients: an equivalence relation is identified between the nodes Figure 2: Sample Strong Summary [10] of a Berlin SPARQL Benchmark (BSBM) graph of 100 million triples. Other sum- mary examples are depicted on the RDFQuotient project Web site, where our summarization tool is also available in open source. of a graph, typically based on their incoming/outgoing edges. First, RDF nodes may have types; this is encoded by graph A quotient summary node has one node per equivalence class, edges, connecting the typed nodes to special kinds of resources and one edge between two summary nodes if and only if a corre- in the graph, namely the type nodes themselves. A node may sponding edge connects, in the original graph, a pair of nodes have zero, one, or more types, which may or may not be logi- they represent. Quotient summaries have been introduced as basis cally connected. Further, some nodes in a graph may have types, for structural indexing, in OEM databases [11] and subsequently while others lack them. Types complicate summarization, since for XML, e.g. [15]. on one hand, they encapsulate precious application knowledge The choice of an equivalence relation, thus, fully determines when present, but on the other hand, summarization must be a summary. Which equivalence relation to pick? In [4, 10], we able to make sense of a graph even in their absence. Therefore, have proposed two novel relations, based on the transitive closure we have distinguished data-first summarization, which groups of sharing incoming, respectively, outgoing properties. Thus, if nodes according to their types first and foremost, and then carries n 1, n 2 both have titles, n 2 and n 3 both have authors, while n 3, n 4 the type of a node to its representative in the summary. This is both have publication years, we say n 1 to n 4 share the same outgo- suited for graphs where types are mostly absent, or not sufficient ing property clique, comprising the properties (edge labels) “title”, to distinguish classes of nodes from each other. The opposite “author”, and “year”. This outgoing clique (set of properties) is de- strategy is type-first; it groups nodes by their types, and only fined based on their (transitively) co-occurring on common nodes. uses property cliques to differentiate between the untyped ones. Observe that n 1 and n 4 may be quite different from each other; Depending on the graph, data-first or type-first summarization in particular, they may have no property in common. Incoming may be more suitable in order to produce summaries easy to property cliques are symetrically defined. Based on the notions of understand. incoming, respectively, outgoing property cliques, we introduce A second, more subtle aspect is due to the presence of an two notions of equivalence, so-called weak and strong [10], and ontology, which may make part of the graph implicit, that is, show that they lead to very compact summaries of RDF graphs for triples may hold in the graph, which are not explicitly present which previously proposed quotients lead to summaries having there. In this case, summarizing the graph of explicit triples may more nodes by orders of magnitude. Figure 2 illustrates this: the not account for the implicit ones. We have proposed in [10] a summary of a BSBM graph of 100 million triples has only 5 nodes sufficient condition under which one can compute the summary and 11 edges, the size of a relatively simple Entity-Relationship of a saturated graph (including all its implicit and explicit data), diagram. without actually saturating the graph; we also show that our What sets the summarization of RDF graphs apart from related Weak and Strong summaries, in their data-first incarnation, sat- graph summarization problems? Several features concur. isfy this condition, whereas any type-first summarization does not. The summaries we devised, like many others, strive to separate As this example shows, a fact contributes to the answer of nodes of a large graph in groups that simplify its understanding. an RDF analytical query iff it has values for all dimensions and Compared with other works, our goal has been to facilitate under- for the measure; a fact may contribute to several cells, if it has standing at first sight the major groups of nodes in a graph. We multiple values for one or several dimensions. This flexible model made the hypothesis that accepting “transitively similar” nodes has numerous advantages for analyzing RDF graphs: in a same group allows identifying such groups; our experiments • There may be several fact sets in an RDF graph. One could, for bear out this claim. Another strong advantage of our summaries instance, in a publication dataset, consider the articles to be the is that they can be all built in time linear in the size of the in- facts, and aggregate them according to their topics, their year put [9, 10], including in incremental mode, that is, deriving the of publication etc.; on the opposite - or rather, at the same time summary equivalence relation and summarizing the graph at the - one could consider the authors to be the facts, and articles (or same time. the articles’ years, or topics, or venues) as dimensions. The compactness of our summaries comes at a cost of precision. • As explained above, it flexibly accomodates the absence of For instance, they provide very poor support for indexing, since a dimension or a measure, as well as their possible multiple they are unable to guarantee that graph nodes represented by a values. certain summary node have, a certain property. More generally, they (and any other quotient summaries) reflect the structure, The formal semantics of such analytical queries [1, 6] is com- but not the values (leaf nodes) present in the graph. patible with the SPARQL 1.1. aggregation semantics; the latter, however, is only concerned with the syntactic level, not with the 3 EXPLORING RDF GRAPHS BY MEANS OF more conceptual one where facts, dimensions and measures are INTERESTING AGGREGATES specified. While aggregation is well-established as a way to analyze, aggre- gate and summarize relational data, the very meaning of aggre- 3.2 Automatically identifying interesting gation has been slow-coming for graphs, and in particular for RDF aggregates RDF. In March 2013, the SPARQL 1.1 specification introduced a As previously explained, RDF analytical queries enable express- Group-By primitive together with aggregation operators; their ing a large set of questions which enable characterizing, in a semantics is essentially lifted from the relational database world, flexible manner, the nodes of an RDF graph. But what queries to and applied to the tuples of bindings resulting from the matches ask? of a Where SPARQL block. Below we outline a path we started A well-explored branch of research in relational data analytics from devising an RDF counterpart to relational (data warehouse) concerns the automated identification of interesting analytical relational queries, formalizing RDF analytical (aggregate) queries queries [17–19]. These works are placed in a typical relational (Section 3.1), and (in subsequent, currently ongoing work) ex- data warehouse scenario, where a large number of dimensions ploring RDF graphs by automatically identifying interesting ag- exist, and seek to automatically proposed to the users the an- gregates (Section 3.2). alytical queries that are likely to bring them most insight. For instance, in [18], a query is interesting (brings a useful insight) if 3.1 RDF aggregate queries it exhibits, on a subset of the facts, a trend that is different from Our research [1, 6] considered, at about the same time, RDF the one that holds on the complete fact set. aggregation at a conceptual: what should an RDF analytical query In our Dagger project [8], we initiated an approach to auto- look like? The well-known concepts of facts, dimension and matically identify interesting analytical queries in RDF graphs. measure from the relational literature hardly fit. To start with, This was based on a set of simple choices: RDF graphs lack a previously-defined schema, and thus the facts at the heart of analytical processing are not defined; irregularity • Chosing as facts all nodes of a given RDF type, or, alternatively, in the data may lead to a dimension or measure being absent, or asking users to specify the fact query; being multiply defined. We proposed to define RDF analytical • Chosing as dimensions the properties that sufficiently many (aggregate) queries as a combination of a fact query, defining the facts have, and whose number of distinct values does not ex- set of resources to be treated like facts and analyzed together, a ceed a certain threshold; we also introduced derived properties, set of dimension queries, associating to each fact zero or more such as the number of authors that a paper has, which we treat values against each dimension, a measure query, specifying what like a new propertu attached to the paper fact; to use as a measurable property of each fact, finally an aggregation • Chosing a measure among the other (original or derived) prop- function among the usual ones (sum, max, average etc.) A sample erties of the facts; aggregate query can be composed as follows: • Considering an aggregate interesting if it maximizes a certain statistical measure of the aggregate query result. • Facts are all the articles published between 2000 and 2020; • A dimension is a country to which an authors’ institutions are Figure 3 illustrates the kinds of aggregates Dagger identified, affiliated; many papers have authors from multiple countries, in a set of DBLP publications from 1936 to 2006. At the top, the naturally leading to multiple values for a dimension; average number of authors of a published paper; we see the • Another dimension is the year; rise of co-authorship along the years. At the center, the number • A measure of a paper is a keyword in the paper abstract; of published papers grouped by year; this graph really gives • The aggregation function counts the different keywords. flesh to the concern that as an academic community we may be The analytical query described above groups papers by the year publishing too much! Last but not least, the graph at the bottom and author country, and for each paper group, it counts the counts the books listed in DBLP and grouped by their publisher. keywords associated to papers published in that years with an The dominating bar corresponds to Springer; Infix Verlag comes author from that country. second, and a set of bars at the left of the graph show different • We enlarge the exploration space to multidimensional (not just mono-dimensional) aggregates; • We introduce more derived properties, for instance by means of topic extractions from text; also, moving toward the generality of the analytical queries introduced in [6] we allow dimensions and measures to be defined by paths of a certain length starting in the facts; • To cope with the expensive exploration of multidimensional aggregates while remaining efficient, we have devised a novel version of a well-known algorithm [20], capable of evaluat- ing in a single pass all the aggregates determined by a set of dimensions, a measure and an aggregation function; • Still toward the goal of scalability, we have devised novel early- stop techniques, capable of estimating the interestingness of an aggregation query while it is computed, and stop the computa- tion as soon as it becomes clear that other aggregates, whose computation is ongoing, are more interesting. Figure 4 illustrates a multidimensional aggregate Spade iden- tified. It shows, for instance, that the “system” and “machine” keywords have been present from the early days of DBLP publi- cations, whereas the “web” newcomer started its history in the 1990s; we see “system” making an important comeback (light yellow area toward the top right), idem for “network” etc. Our work on Spade is ongoing at the time of this writing. We are still working to improve its performance, and to under- stand the interplay of the early-stop and of the multidimensional algorithms used to explore and estimate the interestingness of various RDF analytical queries. 4 CONCLUSION AND OUTLOOK The field of graph analytics is by nature very broad, given the extreme diversity of data modeled as graphs. This paper summa- rizes a set of recent work carried with the global goal of helping users grasp the content of a large and potentially complex RDF graph. Our key findings can be summarized as follows: • Identifying interesting node groups is an intuitive first step toward gaining an understanding of the graph, of its semantics, Figure 3: Interesting insights found by Dagger [8]. structure and content. spellings for Addison-Wesley, leading to an artificial separation • The properties incoming and outgoing RDF resources can be of their books over what would appear to be several publishers. used as a good basis for identifying such groups, provided that good measures are taken to avoid the extreme fragmentation 3.3 Scaling up the exploration of interesting which would result from requiring all nodes in a group to aggregates have exactly the same structure. Instead, summaries such as Dagger identifies interesting aggregates through exhaustive we introduced in [4, 9, 10] accept some heterogeneity among search: it explores and evaluates aggregates subject to a given the nodes, which generally leads to easy-to-read summaries. time limit, before returning the most interesting ones. This made • If one also takes into account the values, that structural sum- its exploration process lengthy. We explored in [14] the use of maries completely disregard, there are many ways to explore sampling, both to select the dimensions and measures to use for how groups of nodes in RDF graphs compare among them- the facts, and to decide which aggregates are interesting. While selves, and countless combinations of facts, dimensions, and this did reduce the running time, it provided no guarantee of the measures one could use. In Dagger [8] and its successor accuracy of the exploration thus abridged. Spade [7], we are working to identify as quickly as possible In the domain of relational data analytical processing, a key interesting aggregates, with an interesting measure currently ingredient to the automated selection of interesting queries is defined as the variance of the set of values that are part of the the ability to explore many candidates, and discard as early as aggregate query result. possible those queries which can be determined quickly enough Many avenues for future research are open. to be not sufficiently interesting. Online aggregation, pioneered • Personalization, user input, or query by example [13] could be in [12] has been a crucial ingredient here: it allows to derive, blended with exploration such as we envisioned it, in order while aggregate queries are being computed, an approximation to help users get as soon as possible to the information they (with a given confidence interval) of these queries’ results. We need for a specific task, in the spirit of [16]. have explored that path in Spade [7], our follow-up project on • RDF graph semantics has not yet been fully taken into account Dagger, where we make several new contributions: in the exploration. It could be incorporated as a facet, or as a Figure 4: Sample interesting aggregate identified by Spade: number of DBLP articles by years and keyword appearing in their titles. The darker the collor, the fewer articles there are. way of navigating from one interesting insight to another one, Connections Across Heterogeneous Data Sources. Proceedings of the VLDB on a closely (semantically) related set of items. Endowment (PVLDB) 11 (2018), 4. https://doi.org/10.14778/3229863.3236252 [6] Dario Colazzo, François Goasdoué, Ioana Manolescu, and Alexandra Roatis. A related, if more mundane, question is which platform (or 2014. RDF Analytics: Lenses over Semantic Graphs. In 23rd International World Wide Web Conference. Seoul, South Korea. https://doi.org/10.1145/2566486. back-end) should best support such analytics; the competition 2567982 among (RDF) graph processing platforms is currently hot, with [7] Yanlei Diao, Pawel Guzewicz, Ioana Manolescu, and Mirjana Mazuran. [n. d.]. Spade: A Modular Framework for Analytical Exploration of RDF Graphs. In no clear winner in sight. While many contenders exist, the very VLDB 2019 - 45th International Conference on Very Large Data Bases. https: different kinds of processing envisioned, say, in Semantic Web //hal.inria.fr/hal-02152844 integration queries, on one hand, and in social network analysis [8] Yanlei Diao, Ioana Manolescu, and Shu Shang. 2017. Dagger: Digging for In- teresting Aggregates in RDF Graphs. In International Semantic Web Conference with the goal of influence maximization, on the other hand, make (ISWC). Vienna, Austria. https://hal.inria.fr/hal-01577464 comparisons difficult, and convergence unlikely. [9] François Goasdoué, Pawel Guzewicz, and Ioana Manolescu. 2019. Incremental Going beyond exploration of RDF graphs, one could envision structural summarization of RDF graphs. In EDBT 2019 - 22nd International Conference on Extending Database Technology. Lisbon, Portugal. https://hal. tools blending more strongly extraction of information from un- inria.fr/hal-01978784 structured content, and structured data under one of its many [10] François Goasdoué, Paweł Guzewicz, and Ioana Manolescu. 2020. RDF graph summarization for first-sight structure discovery. The VLDBJ Journal (2020). forms. This kind of graphs are encountered, for instance, when To appear. integrating heterogeneous data sources such as those available [11] Roy Goldman and Jennifer Widom. 1997. DataGuides: Enabling Query For- to journalists. We have outlined such a graph-based integration mulation and Optimization in Semistructured Databases. In Proceedings of 23rd International Conference on Very Large Data Bases, 1997, Athens, Greece. framework in the ConnectionLens [5] system. Such hetero- 436–445. geneous graphs exhibit even more structural and content het- [12] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. 1997. Online Aggre- erogeneity; higher-levels abstraction methods are needed as a gation. In SIGMOD. 171–182. [13] Matteo Lissandrini, Davide Mottin, Themis Palpanas, and Yannis Velegrakis. first step towards facilitating their understanding [2]. We plan 2018. Data Exploration Using Example-Based Methods. Morgan & Claypool to continue work on these topics, within the ANR SourcesSay Publishers. https://doi.org/10.2200/S00881ED1V01Y201810DTM053 [14] Ioana Manolescu and Mirjana Mazuran. 2019. Speeding up RDF aggregate project (2020-2024). discovery through sampling. In BigVis 2019 - 2nd International Workshop on Acknowledgments This research has been partially funded by Big Data Visual Exploration and Analytics. Lisbon, Portugal. https://hal.inria. ANR-16-CE23-0010-01 and the H2020 research program under fr/hal-02065993 [15] Tova Milo and Dan Suciu. 1999. Index structures for path expressions. In grant agreement nr. 800192. supported International Conference on Database Theory. Springer, 277–295. [16] Amit Somech, Tova Milo, and Chai Ozeri. 2019. Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs. In Advances in Database REFERENCES Technology - 22nd International Conference on Extending Database Technology, [1] Elham Akbari-Azirani, François Goasdoué, Ioana Manolescu, and Alexandra EDBT 2019, Lisbon, Portugal, March 2019. 456–467. https://doi.org/10.5441/ Roatis. 2015. Efficient OLAP Operations For RDF Analytics. In International 002/edbt.2019.42 Workshop on Data Engineering meets the Semantic Web (DESWeb). Seoul, South [17] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. 2017. Ex- Korea. https://doi.org/10.1109/ICDEW.2015.7129548 tracting Top-K Insights from Multi-dimensional Data. In SIGMOD. 1509–1524. [2] Irène Burger, Ioana Manolescu, Emmanuel Pietriga, and Fabian Suchanek. 2020. [18] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, Toward Visual Interactive Exploration of Heterogeneous Graphs. (2020). and Neoklis Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization [3] Sejla Cebiric, François Goasdoué, Haridimos Kondylakis, Dimitris Kotzinos, Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182– Ioana Manolescu, Georgia Troullinou, and Mussab Zneika. 2019. Summarizing 2193. Semantic Graphs: A Survey. The VLDB Journal 28, 3 (June 2019). https: [19] Yuhao Wen, Xiaodan Zhu, Sudeepa Roy, and Jun Yang. 2018. QAGView: Inter- //hal.inria.fr/hal-01925496 actively Summarizing High-Valued Aggregate Query Answers. In SIGMOD. [4] Šejla Čebirić, François Goasdoué, and Ioana Manolescu. 2015. Query-Oriented 1709–1712. Summarization of RDF Graphs. In Proceedings of the VLDB Endowment, Vol. 8. [20] Yihong Zhao, Prasad Deshpande, and Jeffrey F. Naughton. 1997. An Array- Kohala Coast, Hawaii, United States. https://hal.inria.fr/hal-01178140 Based Algorithm for Simultaneous Multidimensional Aggregates. In SIGMOD. [5] Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh- 159–170. Huong Le Nguyen, and Ioana Manolescu. 2018. ConnectionLens: Finding