RDF Digest: Ontology Exploration Using Summaries

Georgia Troullinou, Haridimos Kondylakis, Evangelia Daskalaki, Dimitris Plexousakis
Institute of Computer Science, FORTH, N. Plastira 100, Heraklion, Greece
{troulin, kondylak, eva, dp}@ics.forth.gr

Abstract. Ontology summarization aspires to produce an abridged version of the original ontology that highlights its most representative concepts. In this paper, we present RDF Digest, a novel platform that automatically produces and visualizes summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS document/graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph, our algorithm exploits the semantics and the structure of the schema and the distribution of the corresponding data/instances. A novel feature of our platform is that it allows summary exploration through extensible summaries. The aim of this demonstration is to dive into the exploration of the sources using summaries and to enhance the understanding of the various algorithms used.

1 Introduction

Given the explosive growth in both data size and schema complexity, data sources are becoming increasingly difficult to understand and use. Ontologies often have extremely complex schemas which are difficult to comprehend, limiting the exploration and exploitation potential of the information they contain. Besides the schema, the large amount of data in these sources increases the effort required for exploring them.

Over recent years, various techniques have been proposed for constructing overviews of ontologies [1-4] that retain the most important ontology elements. These overviews are provided by means of an ontology summary. Ontology summarization [4] is defined as the process of distilling knowledge from an ontology in order to produce an abridged version. While summaries are useful, creating a “good” summary is a non-trivial task. A summary should be concise, yet it needs to convey enough information to enable a decent understanding of the original schema. Moreover, the summarization should be coherent and should provide extensive coverage of the entire ontology. So far, although a reasonable number of research works have tried to address the problem of summarization from different angles, a solution that simultaneously exploits the semantics of the schemas and the data instances is still missing.

In this demonstration, we focus on RDF/S KBs and demonstrate for the first time the implementation of the algorithms introduced in [5]. Our system constructs summaries that constitute “valid” sub-ontologies and provide an overview of the ontology schema considering a) the semantics of the schema, b) the structure of the graph and c) the distribution of the corresponding data/instances. Extending our previous work [5], we also demonstrate an efficient and effective method to explore these KBs using schema summaries that can be extended according to user selections. In addition, we provide more meta-data to enhance ontology understanding. To the best of our knowledge, our approach is the first, in the context of ontologies, to combine both schema and data to allow ontology exploration through a high-quality graph summary.

2 Approach

In this section we present the properties that a sub-graph of our schema is required to have in order to be considered a high-quality summary of an RDF/S KB.
Specifically, we are interested in important schema nodes that can describe efficiently the whole schema and, at the same time, reflect the distribution of the data instances. To capture these properties, we use the notions of relevance and coverage. Relevance is used for identifying the most important nodes, and coverage is used for extracting paths which cover the whole spectrum of the RDF/S document.

In our approach, initially, we determine the importance of a node/edge, judging from the instances it contains, by calculating its relative cardinality. The Relative Cardinality RC(e(vi, vj)) of an edge e(vi, vj) is the number of the specific instance connections divided by the total number of the connections of the instances of the two nodes vi, vj.

After that, in order to combine the notion of centrality in the schema with the distribution of the corresponding dataset, we define a variation of the degree centrality, called in/out centrality (Cin/Cout), as the sum of the weighted relative cardinalities of the incoming/outgoing edges. The weights are experimentally defined and depend on the types of the properties, giving priority to user-defined properties. The algorithm is flexible enough to focus on the available instances when they exist; if they are not available, it exploits only the semantics and the structure of the schema.

The notion of centrality, as defined previously, is a measure that gives an intuition about how central a schema node is in an RDF/S KB. However, its importance should be determined by considering the centrality of the other nodes as well. To achieve this, the relevance of a node is affected by its surrounding neighbors and, more specifically, by the number and the connections of its adjacent nodes.

Definition 2.1 (Relevance of a node). Let npin be the number of incoming nodes vi connected to v with ea(vi, v) and npout be the number of outgoing nodes vj connected to v with eb(v, vj). The relevance of v, i.e. Relevance(v), is the sum of the in- and out-centrality of v, each multiplied by the corresponding number of nodes, divided by the sum of the out-centrality of the incoming nodes vi and the in-centrality of the outgoing nodes vj:

Relevance(v) = \frac{C_{in}(v) \cdot np_{in} + C_{out}(v) \cdot np_{out}}{\sum_{i=1}^{np_{in}} C_{out}(v_i) + \sum_{j=1}^{np_{out}} C_{in}(v_j)}

Obviously, the relevance of a schema node in an RDF/S KB is determined by both its connectivity in the schema and the cardinality of the instances. In addition, the produced summary should be a valid schema graph, so the paths should be chosen so as to collect the most relevant nodes while minimizing overlaps. As a consequence, the main criteria for estimating the level of coverage of a specific path are: a) the relevance of each node in the path, b) its relevant instances in the dataset and c) the length of the path. As a result, similar to [3], we define the notion of coverage.

Definition 2.2 (Coverage of a path). The coverage of a path from vs to vi, i.e. Coverage(vs⟶vi), is the sum of the relevance of the sequential nodes vj contained between the nodes vs and vi, multiplied by the relative cardinality of each edge e(vj-1, vj) contained in the path. The result is divided by the length of the path d_{vs⟶vi} in order to penalize longer paths:

Coverage(v_s \rightarrow v_i) = \frac{1}{d_{v_s \rightarrow v_i}} \sum_{j=2}^{d_{v_s \rightarrow v_i}} Relevance(v_j) \cdot RC(e(v_{j-1}, v_j))

The above formula aims to select the schema nodes that are most relevant while avoiding having nodes (or paths) in the summary that cover one another.
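To illustrate how these measures interact, the following is a minimal Python sketch over a toy in-memory schema graph. The edge table, node names, per-property weights and helper names are purely illustrative assumptions; the actual RDF Digest implementation computes these measures over an RDF/S KB stored in Virtuoso, with experimentally defined weights per property type.

```python
# Minimal sketch of the relative cardinality, in/out centrality, relevance and
# coverage computations described above, on a toy schema graph. All names and
# numbers here are illustrative, not the RDF Digest internals.

# Each schema edge e(vi, vj) carries the number of instance connections it has
# and a weight depending on the property type (user-defined properties get priority).
edges = {
    ("Person", "Paper"):      {"instances": 40, "weight": 1.0},  # e.g. a user-defined property
    ("Person", "University"): {"instances": 10, "weight": 1.0},
    ("Paper",  "Conference"): {"instances": 30, "weight": 0.5},
}

def relative_cardinality(vi, vj):
    """RC(e(vi, vj)): instance connections of e(vi, vj) over all instance
    connections touching vi or vj (one possible reading of the definition)."""
    total = sum(d["instances"] for (a, b), d in edges.items()
                if vi in (a, b) or vj in (a, b))
    return edges[(vi, vj)]["instances"] / total if total else 0.0

def c_in(v):
    """In-centrality: weighted sum of relative cardinalities of incoming edges."""
    return sum(d["weight"] * relative_cardinality(a, b)
               for (a, b), d in edges.items() if b == v)

def c_out(v):
    """Out-centrality: weighted sum of relative cardinalities of outgoing edges."""
    return sum(d["weight"] * relative_cardinality(a, b)
               for (a, b), d in edges.items() if a == v)

def relevance(v):
    """Definition 2.1: centrality of v scaled against its neighbourhood's centrality."""
    incoming = [a for (a, b) in edges if b == v]
    outgoing = [b for (a, b) in edges if a == v]
    numerator = c_in(v) * len(incoming) + c_out(v) * len(outgoing)
    denominator = sum(c_out(u) for u in incoming) + sum(c_in(u) for u in outgoing)
    return numerator / denominator if denominator else 0.0

def coverage(path):
    """Definition 2.2: coverage of a path [v_s, ..., v_i], normalised by its length.
    The sum runs over the nodes after the start node (the j = 2..d range)."""
    d = len(path) - 1  # number of edges in the path
    if d == 0:
        return 0.0
    total = sum(relevance(path[j]) * relative_cardinality(path[j - 1], path[j])
                for j in range(1, len(path)))
    return total / d

print(relevance("Paper"))
print(coverage(["Person", "Paper", "Conference"]))
```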
The higher the coverage of a path, the more appropriate it is considered for representing the original graph or a part of it. For more information on the aforementioned formulas, the interested reader is referred to the relevant publication [5].

According to the aforementioned formula, each selected node represents/covers a part of its neighborhood in the summary graph. In order to enable further exploration, we allow the extension of the summary on a node of interest. Our algorithm identifies the neighbors that are not included in the current summary and have so far been represented/covered by the selected node. Having calculated the coverage of all paths starting from the selected node to all its neighbors, our algorithm includes in the summary those nodes contained in the paths whose coverage is minimal compared to the paths (sets of nodes) already inserted in the existing summary, as sketched below.

3 Architecture & Demonstration Highlights

Based on the aforementioned metrics, the RDF Digest prototype has been implemented. The architecture of the system is shown in Fig. 1 and a beta version of the platform is currently available online (http://www.ics.forth.gr/isl/rdf-digest). RDF Digest is composed of two major components, the Summarizer and the Visualizer. Using the interface, a user can select or give the URL of an online RDF/S document she would like to have summarized, and is optionally able to define the expected length of the summary. The Summarizer gets the input RDF/S document and preprocesses it (using the RDF Preprocessor module) by computing the corresponding RDF/S KB. The result is stored in a Virtuoso instance to enable efficient data access. Then, the RDF Assessor module calculates the relevance of each node. The RDF Summary Builder generates the final summary of the schema, based on the rankings produced by the RDF Assessor and the requested size of the summary. The result and additional meta-data are returned to the Visualizer, which enables effective visualization of the summary and exploration of the data source, as shown on the right of Fig. 1.

In our demonstration, example ontologies will be used for generating summaries, and their exploration through extensible summaries will be demonstrated. In the presented summary graph, the size of a node depends on the node’s relevance. In addition, by clicking on a node, additional meta-data (its relevance and centrality, the number of instances, the connected properties, instances etc.) are provided to enhance ontology understanding. Besides meta-data, further exploration of the data source is possible by clicking on the details (on the left) of the selected class and its properties; when clicked, its instances and connections appear in a pop-up window. Moreover, the user can double-click on a node to extend the summary on that specific node. Finally, the user is able to download the summary as a valid RDFS document.

Fig. 1. The architecture of the RDF Digest (left) and a screenshot of the Visualizer (right).

Our immediate plans comprise the extension of RDF Digest to handle multi-ontology KBs, possibly by using external SPARQL endpoints, and to evaluate the summaries produced by checking whether they can answer the most frequent queries issued to these KBs. As the size and complexity of schemas and data increase, ontology summarization is becoming more and more important, and several challenges remain to be investigated in the near future.
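The summary-extension step triggered by double-clicking a node can be sketched as follows. This is a hypothetical illustration of one reading of the description above (least-covered paths are surfaced first); the function, argument names (summary, candidate_paths, coverage_fn) and the toy scoring are assumptions, not the actual RDF Digest code.

```python
# Illustrative sketch of extending a summary at a selected (double-clicked) node.
# `coverage_fn` stands in for Coverage(v_s -> v_i) of Definition 2.2.

def extend_summary(summary, selected, candidate_paths, coverage_fn, k=2):
    """Extend `summary` (a set of schema nodes) at the `selected` node.

    `candidate_paths` maps each neighbor of `selected` that is not yet in the
    summary to a path (list of nodes) from `selected` to that neighbor. The
    nodes of the k least-covered paths -- those least represented by what is
    already in the summary -- are added to the summary.
    """
    assert all(path[0] == selected for path in candidate_paths.values())
    scored = [(coverage_fn(path), path)
              for neighbor, path in candidate_paths.items()
              if neighbor not in summary]
    scored.sort(key=lambda item: item[0])  # minimal coverage first
    for _, path in scored[:k]:
        summary.update(path)
    return summary

# Usage: double-clicking "Person" reveals the neighbors it was covering,
# scored here with a trivial placeholder for the coverage measure.
summary = {"Person", "Paper"}
candidate_paths = {
    "University": ["Person", "University"],
    "Conference": ["Person", "Paper", "Conference"],
}
toy_coverage = lambda path: 1.0 / len(path)  # placeholder scoring only
print(extend_summary(summary, "Person", candidate_paths, toy_coverage))
```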
Acknowledgments: This work was partially supported by the EU projects iManageCancer (H2020-643529), MyHealthAvatar (FP7-600929) and EURECA (FP7-288048).

4 References

1. Peroni, S., Motta, E., d’Aquin, M.: Identifying Key Concepts in an Ontology, Through the Integration of Cognitive Principles with Statistical and Topological Measures. In: The Semantic Web (ASWC 2008), LNCS 5367, pp. 242–256 (2008)
2. Queiroz-Sousa, P.O., Salgado, A.C., Pires, C.E.: A Method for Building Personalized Ontology Summaries. Journal of Information and Data Management (JIDM) 4(3), p. 236 (2013)
3. Yu, C., Jagadish, H.V.: Schema Summarization. In: VLDB, pp. 319–330 (2006)
4. Zhang, X., Cheng, G., Qu, Y.: Ontology Summarization Based on RDF Sentence Graph. In: WWW, pp. 707–716 (2007)
5. Troullinou, G., Kondylakis, H., Daskalaki, E., Plexousakis, D.: RDF Digest: Efficient Summarization of RDF/S KBs. In: ESWC, pp. 119–134 (2015)