RDFDigest+: A Summary-driven System for KBs Exploration Georgia Troullinou1 , Haridimos Kondylakis1 , Kostas Stefanidis2 , and Dimitris Plexousakis1 1 FORTH-ICS, Heraklion, GR, {troulin,kondylak,dp}@ics.forth.gr 2 University of Tampere, Tampere, FI, kostas.stefanidis@uta.fi Abstract. In this paper, we present RDFDigest+, a novel tool that en- ables effective and efficient RDF/S Knowledge Base (KB) exploration using summaries. The tool employs a diverse set of algorithms for iden- tifying the most important nodes, offering a wide range of possibilities to capture importance. The selected nodes can be combined using multi- ple state of the art algorithms to generate a complete schema summary graph. In addition, we present a new approach enabling the dynamic ex- ploration of summaries through two novel operations zoom and extend. Extend focuses on a specific subgraph of the initial summary, whereas zoom on the whole graph, both providing granular information access to the end-user. 1 Introduction The recent explosion of the Web data and the associated Linked Open Data initiative have led to an enormous amount of widely available RDF datasets [1]. These datasets often have extremely complex schemas, which are difficult to comprehend, limiting the exploitation potential of the information they con- tain. As a result, there is an increasing need to develop methods and tools that facilitate the quick understanding and exploration of these data sources [3]. In our work, we design and develop RDFDigest+(http://rdfdigest.ics. forth.gr) that focuses on three aspects: a) how to identify the most important nodes of an RDF/S KB, b) how to link those nodes in order to produce a valid sub-schema graph and c) how to enable the active exploration of the KB, pre- senting various statistical information and enabling zooming operators. Multiple importance measures [2] have been adapted from graph theory, offering a wide range of alternatives for identifying the importance of a node. To locate the proper paths connecting those nodes, we model the problem either as a graph Steiner-tree problem or as a maximum cost-spanning tree one [4]. Finally, over these generated summaries, we enable zoom-in and zoom-out operations, in or- der to get granular information, by adding more important nodes or removing existing ones from the generated summary. In addition, through the extend op- erator, we allow selecting a subset of the presented nodes in order to visualize other dependent nodes. Both operators can be progressively applied to make the whole process more efficient (more details in [5]). 2 Approach In this work, we separate between the schema and instances of an RDF/S KB, represented in separate graphs, GS and GI . The schema graph contains all classes and the properties they are associated with. The instance graph contains all individuals, and the instantiations of schema properties. Identification of the most important nodes. For ranking the available schema nodes based on their importance, we use the following measures: Degree, Betweeness, Bridging Centrality, Harmonic Centrality, Radiality, Betweenness, PageRank and HITS [2] [5]. As the aforementioned measures have been de- veloped for generic graphs, we adapt them to be used for RDF/S graphs. To achieve that first we normalize each importance measure and then the number of instances that belong to a schema node. As such, the adapted importance mea- sure (AIM) of each node is the sum of the normalized values of the importance measures and the instances. Overall, our platform is flexible enough to enable the addition of new measures by just adding new function calls. Linking most important nodes Having a way to rank the schema nodes of an RDF/S KB according to the perceived importance, we then select the top-k ones and focus on the paths that link those nodes, aiming to produce a valid sub-schema graph. RDFDigest+ offers two methods for achieving this. The first one focuses on identifying a maximum cost spanning tree (MST) in the graph, and on linking the selected nodes using paths from the selected tree. The corresponding algorithm is computationally efficient. However, although MST identifies the paths with the maximum weight in the whole graph, the paths selected out of the MST might not maximize overall the weight of the selected summary. Moreover, many additional nodes might be introduced in the result, since there is only one path to be selected between two nodes and in this path many other not important nodes might appear as well. Alternatively, we model the problem of linking the most important nodes as a variation of the well-known Graph Steiner-Tree problem where the goal is to minimize the additional nodes introduced for connecting the top-k nodes. However, the problem is NP-hard, and as such approximation algorithms should be explored. Exploration through summaries. With summaries, users can better un- derstand the KB contents. However, they might find the presented information overwhelming and may want to see complementary information. Extend: The extend operator gets as input a subgraph of the summary schema graph and identifies other nodes that depend on the selected nodes. Dependence has not only to do with distance. Like TF-IDF, the basic hypothe- sis is that the greater the influence of a property on identifying a corresponding instance is, the less times it is repeated. Specifically, we define the dependence between two classes as a combination of their cardinality closeness CC, the AIM of the classes, and the number of nodes appearing inPthe path Y connecting these AIM (i) AIM (v)āˆ’ iāˆˆY CC((iāˆ’1),i) two classes. That is: Dependence(u, v) = |Yu,v | , where CC is defined for a pair of classes as the number of distinct edges over the number of all edges between them. As we move away from a node, the dependence be- comes smaller by calculating the differences of AIM across a selected path in the graph. In addition, we penalize dependence dividing by the distance of the two examined nodes. The corresponding algorithm calculates the dependence of the adjacent nodes expanding progressively the range until it reaches a specified number of nodes. Then it links the nodes to be added using three approxima- tions: a) CHINS, which in essence starts with a partial solution consisting of a single schema graph node and then finds one additional node each time; b) Shortest Paths, which proceeds similar to CHINS but starts with all nodes avail- able in the summary in the first partial solution and c) Dependent Paths, which uses the nodes already visited when computing Dependence. Zoom: Differently, we focus on zooming, by exploiting the schema graph as a whole. That is, we introduce the zoom-out and zoom-in operators to produce more detailed or coarse summary schema graphs. To this end, we consider the n0 schema nodes with the highest importance in GS , where n0 can be either greater than n, for achieving a zoom-out, or smaller than n, for achieving a zoom-in, where n represents the size of the given summary. The simplest approach for zooming-in/out, is to calculate from scratch the T OPn0 schema nodes and then to use the Steiner-Tree algorithm from scratch to link the selected nodes. However, since we already have an existing summary as a basis for our zoom operations we explore the following approximations: a) Zoom- in: Remove the nodes in T OPn \T OPn0 and their connections without recalculat- ing the Steiner-Tree algorithm for T OPn0 - this might leave additional nodes in the result summary. b) Zoom-out - CHINS : Add the nodes in T OPn0 \T OPn and link them with existing summary using the CHINS approximation algorithm. c) Zoom-out - Shortest Paths: Add the nodes in T OPn0 \T OPn and link them with existing summary using the Shortest Paths approximation algorithm. 3 Architecture & Demonstration The high-level architecture of the system is shown in Figure 1 (left). The sum- marization process starts by uploading an RDF/S file, or by providing the URL of an online file. The file is then stored to a Virtuoso triple store. Our engine preprocesses the available information and stores statistical and metadata infor- mation in a different Virtuoso graph. As long as existing information is available for a specific file, it can be reused and not recomputed. To demonstrate the functionalities of the RDFDigest+, shown in Figure 1 (right)), we will use 5 KBs: the BIOSPHERE ontology, the Financial ontology, the Aktors Portal ontology, the CIDOC-CRM ontology and DBPedia. The vari- ety on the size, the domain and the structure of these ontologies offers an inter- esting test case for our demonstration. The demonstration will start by selecting one of the aforementioned ontologies and producing a visual summary identify- ing and linking the most important nodes in the KB. In the presented summary graph, the size of a node depends on the its importance. By clicking on a node, additional metadata (e.g. the number of instances, and the connected properties and instances) are provided to enhance ontology understanding. Further explo- Fig. 1. System architecture (left), Extend and zoom operators (right). ration of the data source is allowed by clicking on the details (on the left) of the selected class and properties. When clicked, its instances and connections appear in a pop-up window. Double-clicking on a node extends the summary of that specific node providing more detailed information regarding the dependent nodes. In addition, the summary can be zoomed-in and zoomed-out to present more detailed or more generic information regarding the whole summary. Finally, the user can download the summary as a valid RDF/S document. The demonstration will proceed in four phases: a) Selecting important nodes. The whole idea of exploring RDF/S KBs through summaries will be described and the basics behind each importance measure will be highlighted. According to the user decisions, the results will be discussed. b) Selecting linking algorithms. We will demonstrate the differences between the two linking algorithms, focusing on different perspectives for linking the most important nodes. c) Exploration operators. We will demonstrate how the two operators help to focus in more detailed or coarser information. d)Hands-on phase. Conference participants can directly interact with the system, exploring through summaries the aforemen- tioned ontologies. To our knowledge, no other system today is available on the Web, enabling the summarization of RDF/S KB, providing functionalities for active data exploration through summaries. References 1. V. Christophides, V. Efthymiou, and K. Stefanidis. Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers, 2015. 2. A. Pappas, G. Troullinou, G. Roussakis, H. Kondylakis, and D. Plexousakis. Ex- ploring importance measures for summarizing RDF/S KBs. In ESWC, 2017. 3. Y. Roussakis, I. Chrysakis, K. Stefanidis, G. Flouris, and Y. Stavrakas. A flexible framework for understanding the dynamics of evolving RDF datasets. In ISWC, 2015. 4. G. Troullinou, H. Kondylakis, E. Daskalaki, and D. Plexousakis. Ontology under- standing without tears: The summarization approach. Semantic Web, 8(6):797ā€“815, 2017. 5. G. Troullinou, H. Kondylakis, K. Stefanidis, and D. Plexousakis. Exploring RDF/S KBs using summaries. In ISWC, 2018.