ELLIS: Interactive Exploration of Linked Data on the Level of Induced Schema Patterns

Thomas Gottron1, Malte Knauf2, Ansgar Scherp3, Johann Schaible4

1 Innovation Lab, SCHUFA Holding AG, Wiesbaden, Germany
Thomas.Gottron@schufa.de
2 Institute for Web Science and Technologies, University of Koblenz-Landau, Germany
mknauf@uni-koblenz.de
3 ZBW – Leibniz Information Center for Economics, Kiel University, Kiel, Germany
asc@informatik.uni-kiel.de
4 GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
johann.schaible@gesis.org

Abstract. We present ELLIS, a demo to browse the Linked Data cloud on the level of induced schema patterns. To this end, we define schema-level patterns of RDF types and properties to identify how entities described by type sets are connected by property sets. We show that schema-level patterns can be aggregated and extracted from large Linked Data sets using efficient algorithms for mining frequent item sets. A subsequent visualisation of such patterns enables users to quickly understand which type of information is modelled on the Linked Data cloud and how this information is interconnected.

1 Introduction

The Linked Open Data (LOD) cloud does not have a fixed or pre-defined schema. However, the use of RDF types and properties to describe the data provides an emerging schema. This implicit schema can be induced from data observations on the Web and, thereby, can be made explicit. A subsequent visualisation of the induced schema information enables users to investigate the structure of Linked Data in an interactive and exploratory way. The insights and understanding of the data gained in this way are beneficial for several applications. They can help users find relevant vocabulary terms when modelling data as LOD [10] or when programming a Linked Data application that needs to obtain data of a specific type and with specific properties [11].
Furthermore, it allows users to understand what type of information is available on the LOD cloud and how this information is interconnected on the Web of Data. In this paper, we present ELLIS, a graph-based approach for visualising and exploring induced schema information for Linked Data on the basis of schema-level patterns.

2 Schema-level Patterns

There are various approaches of different granularity for inferring schema information from observations made on the Linked Data cloud. For the purpose of providing a consistent and browsable view of schema-level information, we need to describe (at least) two aspects: an aggregated representation of the entities modelled in the Linked Data graph as well as a notion of the relations connecting them. The entities can be grouped together on the basis of the sets of RDF types associated with them. Likewise, the sets of RDF properties interlinking the entities can serve to describe the relations between groups of entities of the same type.

Hence, we model schema-level patterns (SLP) as a combination of subject type sets sts and object type sets ots (i.e., sets of RDF types T of entities modelled on the Linked Data cloud) which are connected by property sets ps (i.e., sets of predicates P). Formally, an SLP is defined as a triple

\[
(\mathit{sts}, \mathit{ps}, \mathit{ots}) \in \mathcal{P}(T) \times \mathcal{P}(P) \times \mathcal{P}(T) \tag{1}
\]

This schema-level representation of Linked Data lends itself to a graph-based interpretation and visualisation. As the subject and object type sets follow the same formal definition, they can be seen as nodes connected by edges consisting of property sets.

When computing SLPs for a (potentially distributed) segment R of the RDF data graph on the LOD cloud, we consider all URIs appearing in the subject position and object position of RDF triples (s, p, o), and extract their RDF types and the unified set of all predicates used to model a relation between them.
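The extraction step just described can be illustrated with a minimal sketch. It assumes that the triples fit into memory and that every object is a URI, and it records only the full type sets and the unified property set per subject-object pair. It does not reproduce the Apriori-based aggregation used for large data sets. The function name extract_slps and all ex: identifiers are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"


def extract_slps(triples):
    """Count (sts, ps, ots) patterns over all linked subject-object pairs."""
    types = defaultdict(set)   # entity -> set of RDF types
    links = defaultdict(set)   # (subject, object) -> predicates linking them
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            links[(s, o)].add(p)

    slps = Counter()
    for (s, o), ps in links.items():
        # Entities without any rdf:type statement get an empty type set.
        sts = frozenset(types.get(s, ()))
        ots = frozenset(types.get(o, ()))
        slps[(sts, frozenset(ps), ots)] += 1
    return slps


# Toy data, made up for this example.
triples = [
    ("ex:Plato", RDF_TYPE, "ex:Philosopher"),
    ("ex:Plato", RDF_TYPE, "ex:Person"),
    ("ex:Kant", RDF_TYPE, "ex:Philosopher"),
    ("ex:Kant", RDF_TYPE, "ex:Person"),
    ("ex:Hegel", RDF_TYPE, "ex:Philosopher"),
    ("ex:Hegel", RDF_TYPE, "ex:Person"),
    ("ex:Plato", "ex:influenced", "ex:Kant"),
    ("ex:Plato", "ex:influenced", "ex:Hegel"),
    ("ex:Kant", "ex:influencedBy", "ex:Plato"),
]

slps = extract_slps(triples)
for (sts, ps, ots), count in slps.items():
    print(sorted(sts), sorted(ps), sorted(ots), count)
```

In this toy graph, the two ex:influenced links collapse into a single pattern that is observed twice, which is the kind of condensation the schema-level view aims for.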
Formally, we define the set of observed SLPs over an RDF data set R:

\[
\begin{aligned}
\mathrm{SLP}(R) = \{ (\mathit{sts}, \mathit{ps}, \mathit{ots}) \mid \exists s, o :\ & (\forall t_s \in \mathit{sts} : (s, \texttt{rdf:type}, t_s) \in R) \\
\land\ & (\forall p \in \mathit{ps} : (s, p, o) \in R) \\
\land\ & (\forall t_o \in \mathit{ots} : (o, \texttt{rdf:type}, t_o) \in R) \}
\end{aligned} \tag{2}
\]

The set SLP(R) can be computed with relatively little overhead from large data sets using the Apriori algorithm for frequent item set mining. As a result, we obtain the above-mentioned graph structure over induced schema-level patterns.

3 ELLIS

Based on the definition of SLPs, we implemented the ELLIS prototype for visualising and navigating the LOD cloud on a schema level. The system provides four essential functionalities: (a) a visualisation of SLPs as a graph, (b) browsable rendering of the graph nodes together with annotations of the relevant schema information, (c) a history trace to keep track of previous steps in the exploration path, and (d) a search functionality to find relevant entry points for browsing the SLP graph. A screencast of ELLIS is publicly available at https://www.youtube.com/watch?v=q47YFKyf32I&feature=youtu.be.

Fig. 1. Visualisation of an initial query over induced schema-level patterns in ELLIS.

The graph visualisation represents the type set information as well as the property set information as nodes in a graph as shown in Figure 1. The edges connect the nodes in a directed way to indicate the order of the triple in an SLP starting from the subject type set over the connecting property set to the object type set. Representing all relevant information as nodes in a browsable graph has two advantages. First, it condenses information on a high level. This enables users to quickly grasp the structure of the data. When needed and requested, additional information can be revealed and displayed. In ELLIS we use hover info boxes and an additional info field in the menu to indicate the type and property sets associated with nodes of the SLP graph.
Second, the graph can easily be navigated by selecting any of the displayed type set nodes. Upon selection of a node, the visualisation interface updates the graph by retrieving all connected property sets and type sets as given by the SLPs.

A history trace [1] allows the users to identify the path they took in the exploration of the LOD cloud on a schema level. SLPs in the history trace older than the last three steps are removed from the visualisation. This provides orientation and context without overloading the interface with all previously visited schema-level patterns.

Finally, a search functionality permits the users to search for specific RDF types. Subsequently, ELLIS lists all type sets containing these types. In this way, it is possible to flexibly choose an entry point type set and the embedding SLPs for starting to browse the schema graph.

ELLIS is designed following a classical three-tier architecture. The Web front end visualises the graph constructed from SLPs, displays additional information, and provides interaction functionality. Figure 1 illustrates the graph visualisation in ELLIS. The middle tier encapsulates functions for search and navigation. In particular, for a given type set node it resolves all relevant SLPs containing this type set as subject type set and object type set. The backend tier consists of a database containing all SLPs obtained from a Linked Data set. In our ELLIS demo, we constructed the SLPs from the BTC 2012 dataset, containing approximately 1.4 billion triples.

Fig. 2. Navigation of SLPs and history trace.

Figure 1 shows the result of an initial query about Greek philosophers to ELLIS. The best matching type set of the query is marked in red and shown in the middle of the graph. The related sets of RDF resources with a similar set of properties and types are connected via relations. In the example shown in Figure 1, these are properties like dbpo:influencedBy and dbpo:influenced.
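The node lookup of the middle tier and the bounded history trace described above can be sketched as follows. This is a toy illustration under stated assumptions, not the ELLIS implementation. The class SlpBrowser, its method select, and the ex: identifiers are hypothetical. The sketch assumes that a selected type set matches an SLP when it occurs in either the subject or the object position, and it keeps only the last three selected type sets.

```python
from collections import deque


class SlpBrowser:
    """In-memory stand-in for SLP lookup with a bounded history trace."""

    def __init__(self, slps, history_size=3):
        self.slps = list(slps)
        # A deque with maxlen silently drops the oldest entry when full.
        self.history = deque(maxlen=history_size)

    def select(self, type_set):
        """Record the step and return all SLPs that touch the type set."""
        self.history.append(type_set)
        return [slp for slp in self.slps if type_set in (slp[0], slp[2])]


# Toy type sets and patterns, made up for this example.
greek = frozenset({"ex:GreekPhilosopher"})
german = frozenset({"ex:GermanPhilosopher"})
place = frozenset({"ex:Place"})
slps = [
    (greek, frozenset({"dbpo:influenced"}), german),
    (german, frozenset({"dbpo:influencedBy"}), greek),
    (german, frozenset({"ex:birthPlace"}), place),
]

browser = SlpBrowser(slps)
from_greek = browser.select(greek)
from_german = browser.select(german)
browser.select(place)
browser.select(greek)
print(len(from_greek), len(from_german), len(browser.history))
```

After the fourth selection, the first step has dropped out of the history, so the trace holds only the three most recent type sets.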
The user hovers with the mouse over a type set TS1195275161. It mainly contains German philosophers who are dbpo:influenced by the Greek philosophers. Subsequently, the user clicks on this type set of German philosophers in order to further navigate through the induced SLPs in ELLIS. The result is shown in Figure 2. The clicked type set is now indicated in red and moved to the centre of the graph visualisation. Further properties of this node, such as the birthplace and place of death of the philosophers, are shown and can be explored further.

4 Related Work

There are numerous approaches for inducing schema information from Linked Open Data. The applications vary from statistical schema inferencing [12] over cardinality estimation for query result sets [9] and analytics of the dynamics of Linked Data sources [4] to schema-level indices [6,8]. Most similar to the presented SLPs are the equivalence classes in SchemEX [8] or the Node-Collection Layer from the RDF graph summary [2], which capture even more fine-grained schema information. Regarding the visualisation of Linked Data, most approaches address visualisation on an instance level [3]. In contrast, Katifori et al. [7] present a survey of different approaches for visualising ontologies, i.e., schema-level information (so-called T-Box). A more recent visualisation approach involving schematic information on the LOD cloud is LODSight [5]. It uses a dataset summarisation algorithm which induces the schema from a dataset via SPARQL queries. Such SPARQL queries can get quite complicated. ELLIS induces the schema via SLPs which are computed in a less complicated manner by using the Apriori algorithm for mining frequent item sets.

5 Conclusion

With schema-level patterns, we have defined a structure which is suitable for inducing and aggregating schema-level information from Linked Data.
The ELLIS demo visualises schema-level patterns as a graph structure and allows for an interactive exploration and browsing of the schema information induced from the Linked Data cloud. As future work, we plan to integrate the visualisation technique with a novel tool for modelling data as LOD [10]. It will allow data engineers not only to conduct textual queries to find relevant vocabulary terms for reuse but also to visually explore terms that are related to the model they are working on.

References

1. Campbell, I.: Interactive evaluation of the ostensive model using a new test collection of images with multiple relevance assessments. Information Retrieval 2(1), 89–114 (2000)
2. Campinas, S., Perry, T.E., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing RDF graph summary with application to assisted SPARQL formulation. In: 23rd International Workshop on Database and Expert Systems Applications. pp. 261–266. IEEE (2012)
3. Dadzie, A.S., Rowe, M.: Approaches to visualising linked data: A survey. Semant. Web 2(2), 89–124 (Apr 2011), http://dx.doi.org/10.3233/SW-2011-0037
4. Dividino, R., Gottron, T., Scherp, A.: Strategies for efficiently keeping local linked open data caches up-to-date. In: The Semantic Web - ISWC 2015, pp. 356–373. Springer (2015)
5. Dudáš, M., Svátek, V., Mynarz, J.: Dataset summary visualization with LODSight. In: The Semantic Web: ESWC 2015 Satellite Events, pp. 36–40. Springer (2015)
6. Gottron, T., Gottron, C.: Perplexity of Index Models over Evolving Linked Data. In: ESWC'14: Proceedings of the Extended Semantic Web Conference. pp. 161–175 (2014)
7. Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.: Ontology visualization methods: a survey. ACM Comput. Surv. 39(4) (Nov 2007)
8. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX: Efficient Construction of a Data Catalogue by Stream-based Indexing of Linked Data.
Web Semantics: Science, Services and Agents on the World Wide Web 16(5), 52–58 (2012), the Semantic Web Challenge 2011
9. Neumann, T., Moerkotte, G.: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011. pp. 984–994. IEEE Computer Society (2011)
10. Schaible, J., Gottron, T., Scheglmann, S., Scherp, A.: LOVER: support for modeling data using linked open vocabularies. In: Joint 2013 EDBT/ICDT Conferences, EDBT/ICDT '13, Genoa, Italy, March 22, 2013, Workshop Proceedings. pp. 89–92. ACM (2013)
11. Scheglmann, S., Leinberger, M., Gottron, T., Staab, S., Lämmel, R.: SEPAL: Schema enhanced programming for linked data. KI - Künstliche Intelligenz pp. 1–4 (2015)
12. Völker, J., Niepert, M.: Statistical schema induction. In: The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29-June 2, 2011, Proceedings, Part I. Lecture Notes in Computer Science, vol. 6643, pp. 124–138. Springer (2011)