SemFacet: Faceted Search over Ontology Enhanced Knowledge Graphs*

B. Cuenca Grau (1), E. Kharlamov (1), Š. Marciuška (2), D. Zheleznyakov (1), M. Arenas (3)
(1) University of Oxford, (2) Microsoft Bing, (3) Pontificia Universidad Catolica de Chile

Abstract. In this demo we present the SemFacet system for faceted search over ontology enhanced Knowledge Graphs (KGs) stored in RDF. SemFacet allows users to query KGs with relatively complex SPARQL queries via an intuitive Amazon-like interface. SemFacet can compute faceted interfaces over large-scale RDF datasets by relying on incremental algorithms, and over large ontologies by exploiting ontology projection techniques. SemFacet relies on an in-memory triple store, and the current implementation bundles JRDFox, Sesame, Stardog, and PAGOdA. During the demonstration, attendees can try SemFacet by exploring the Yago KG.

1 Introduction

Knowledge graphs (KGs) such as Yago and Freebase have become a powerful asset for enhancing search and are being intensively used in both academia and industry. Many existing KGs are either available as Linked Open Data, or they can be exported as RDF datasets enhanced with background knowledge in the form of an OWL 2 ontology. Formulating queries over data in this format is, however, a challenging task for end users, as has been observed in, e.g., industry [9, 10, 12] and the life sciences [6].

Faceted search is the de facto approach for exploratory search in many online applications such as Amazon and Booking.com. Faceted search allows for querying collections of entities, where users narrow down the search results by progressively applying filters, called facets [14]. A facet typically consists of a predicate (e.g., ‘gender’ or ‘occupation’ when querying entities about people) and a set of possible string values (e.g., ‘female’ or ‘research’), and entities in the collection are annotated with predicate-value pairs.
During faceted search, users iteratively select facet values, and the entities annotated according to the selection are returned as the search result. Faceted search in the context of RDF has received significant attention and a number of systems have been developed (see [3] for an overview of these systems).

In our previous studies [2, 3, 7] we developed rigorous theoretical underpinnings for faceted search in the context of RDF-based KGs enhanced with OWL 2 ontologies: we identified well-defined fragments of SPARQL that can be naturally captured using faceted search as a query paradigm, and established the computational complexity of answering such queries; we also studied the problem of updating faceted interfaces (FIs), which is critical for guiding users in the formulation of meaningful queries during exploratory search. In this demonstration we present our faceted search system SemFacet, which implements these techniques, and demonstrate it on the Yago KG [13].

* This demo accompanies our ISWC’16 presentation selected for the journal papers track [3]. We note that the version of SemFacet presented here extends the one demonstrated earlier [1, 4, 5]: we have extended the back-end with new reasoners, implemented our own keyword search engine, improved the front-end, and implemented incremental update of faceted interfaces. This research was supported by the Royal Society, the EPSRC projects Score!, DBOnto, MaSI3, and ED3, and the EU FP7 project Optique (n. 318338).

[Figure 1. Workflow steps: user enters keywords; relevant object IDs are computed; facets of the initial FI and snippets are computed; initial FI and snippets are displayed; user (un)selects facet values; facets of the FI and snippets are updated; updated FI and snippets are displayed; user refocusses. Architecture: a client with a query converter, a snippet composer, and a composer of faceted interfaces; a server with a search engine (inverted index), query answering, reasoners, a facet generator, a snippet generator, and a triple store. Screenshot: a keyword search for ‘politicians’ with facets such as ‘type’, ‘has child’, and ‘grad from’, and a snippet for Bill Clinton.]
Fig. 1: Left: SemFacet workflow; Right: SemFacet architecture and screenshot

2 SemFacet System

SemFacet is implemented in Java and available for download under an academic license. The system can be obtained from our project website [11], where we also provide a collection of test data and detailed installation and configuration instructions. SemFacet is also available on GitHub [8].

System Architecture. SemFacet is based on a modular architecture, which is depicted in Figure 1 (Left). On the client side, SemFacet implements a GUI developed using HTML 5 and consisting of three main parts: a free-text search box for keywords, a hierarchically organised faceted interface, and a scrollable panel containing snippet-shaped answers. In the upper part of Figure 1 (Left) we present a screenshot of the SemFacet GUI that corresponds to the following search scenario: the user is looking for politicians who are US presidents, graduated from Harvard or Georgetown, and whose children graduated from Stanford.

User keywords such as ‘politicians’ are sent by the client to the server, where they are processed by the search engine.
For efficiency reasons, we implemented our own simple engine based on an inverted index, and we also allow keyword search to be delegated to Lucene. User selections in the faceted interface are compiled into a SPARQL query by the query converter and then sent to the back-end reasoner for evaluation. The snippet and interface composers receive information about the facets and answers that should be displayed to the user, and update the currently displayed interface and query answers. In Figure 1 (Left) the answer is Bill Clinton, and it is displayed in the form of a snippet with a photograph and wiki description. The system updates the faceted interface incrementally: only the parts of the interface that are affected by the user's actions are updated, which allows for a significantly faster response time.

On the server side, the system relies on an in-memory triple store to hold the inverted index, the input data, the ontology, and other auxiliary information. The current implementation bundles JRDFox, Sesame, Stardog, PAGOdA, and HermiT. Any other in-memory triple store providing similar functionality can be seamlessly integrated with SemFacet. The facet generator is the back-end component responsible for constructing the interface in response to user actions, while the query answering component of the back-end executes the SPARQL query obtained from the query converter using the reasoning engine selected by the user.

SemFacet Workflow. SemFacet’s workflow is summarised in Figure 1 (Right). The user initiates the search by entering a set of keywords (e.g., ‘politicians’), which are then matched to textual information associated with URIs in the data (such as labels and descriptions), resulting in an initial set of relevant URIs.¹ SemFacet then computes the initial interface (with no value selections) based on these relevant URIs, which constitutes the starting point for faceted navigation.
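
The keyword-matching step of the workflow can be illustrated with a minimal sketch. It is not SemFacet's code: the function names, the token-based index, the prefixed names, and the sample triples are all assumptions made for illustration. The sketch builds an inverted index over the string values of chosen annotation properties and returns every URI matched by at least one keyword; an empty keyword set is treated as matching all URIs.

```python
import re
from collections import defaultdict


def build_inverted_index(triples, annotation_properties):
    """Map each lower-cased token of an annotation string to the URIs it annotates."""
    index = defaultdict(set)
    for subject, predicate, value in triples:
        if predicate in annotation_properties:
            for token in re.findall(r"\w+", value.lower()):
                index[token].add(subject)
    return index


def relevant_uris(index, all_uris, keywords):
    """Return the URIs matched by at least one keyword.

    An empty keyword collection is treated as matching every URI.
    """
    if not keywords:
        return set(all_uris)
    result = set()
    for keyword in keywords:
        result |= index.get(keyword.lower(), set())
    return result


# Hypothetical sample data (prefixed names used for brevity).
triples = [
    ("ex:Bill_Clinton", "rdfs:label", "Bill Clinton"),
    ("ex:Bill_Clinton", "rdfs:comment", "American politician and 42nd President"),
    ("ex:Stanford", "rdfs:label", "Stanford University"),
    ("ex:Bill_Clinton", "ex:graduatedFrom", "ex:Georgetown"),  # not an annotation
]
index = build_inverted_index(triples, {"rdfs:label", "rdfs:comment"})
all_uris = {subject for subject, _, _ in triples}

print(relevant_uris(index, all_uris, ["politician"]))
print(relevant_uris(index, all_uris, []))
```

Note that this sketch matches whole tokens only, so the keyword ‘politicians’ would not match the token ‘politician’; a production engine would need substring matching, stemming, or a library such as Lucene to handle such cases.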
The main tasks performed by SemFacet are realised in the system as follows:

– Matching of keywords. SemFacet exploits the values of annotation properties to determine whether a URI is relevant to a set of keywords. Intuitively, a URI u is relevant to a keyword k w.r.t. an annotation property R if the input data has a triple of the form (u, R, w), where w is a string containing k; and u is relevant to a set of keywords if at least one of them occurs in w.

– Interface generation and update. In order to generate and update faceted interfaces, SemFacet relies on a so-called facet graph that consists of the input RDF data and a projection of the input ontology on a graph structure (see [3] for details). The part of this graph that corresponds to entailed RDF data is materialised offline at loading time, while the part corresponding to the ontology is computed in the online phase by querying the materialised graph. Faceted interfaces are updated incrementally in response to user actions; moreover, SemFacet minimises faceted interfaces by hiding facet values whose selection leads to empty answer sets or does not affect the currently computed answer set. Finally, the current version of the system can be customised so that facet values are hierarchically arranged according to a user-specified predicate, which greatly facilitates navigation in the presence of a large number of values per facet.

– Query generation and execution. SemFacet compiles user selections from the faceted interface into SPARQL queries, which are then evaluated using a reasoner. Our system currently bundles several reasoning engines with different capabilities, and users can select the reasoner that is most appropriate for the application at hand. Answers to SPARQL queries are typically returned by reasoners in the form of URIs. These may not be very informative for end users; hence, SemFacet also retrieves the annotations associated with the answer URIs and displays them in the form of a snippet.
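
The query-generation step can be illustrated with a simplified sketch. It is not SemFacet's query converter: the function name, the shape of the selection dictionary, and the example.org IRIs are hypothetical. The sketch assumes that selected values within one facet are combined disjunctively unless the facet is explicitly marked as conjunctive, that different facets are combined conjunctively, and that facets are flat, so nested selections (such as a property of an entity's child) are not covered.

```python
def compile_selections(selections, conjunctive_facets=frozenset()):
    """Build a SPARQL SELECT query from {predicate IRI: [selected value IRIs]}.

    Facets listed in `conjunctive_facets` require every selected value;
    all other facets require at least one of their selected values.
    """
    patterns = []
    counter = 0
    for predicate, values in selections.items():
        if not values:
            continue  # a facet without selected values imposes no constraint
        if predicate in conjunctive_facets:
            # Conjunctive facet: one triple pattern per selected value.
            for value in values:
                patterns.append(f"  ?x <{predicate}> <{value}> .")
        else:
            # Disjunctive facet: bind a fresh variable to any selected value.
            var = f"?v{counter}"
            counter += 1
            value_list = " ".join(f"<{value}>" for value in values)
            patterns.append(f"  ?x <{predicate}> {var} .")
            patterns.append(f"  VALUES {var} {{ {value_list} }}")
    body = "\n".join(patterns)
    return f"SELECT DISTINCT ?x WHERE {{\n{body}\n}}"


# Hypothetical selection: US presidents who graduated from Harvard or Georgetown.
selections = {
    "http://example.org/type": ["http://example.org/USPresident"],
    "http://example.org/graduatedFrom": [
        "http://example.org/Harvard",
        "http://example.org/Georgetown",
    ],
}
query = compile_selections(selections)
print(query)
```

The disjunction is expressed here with the SPARQL 1.1 VALUES clause; a UNION of triple patterns would be an equivalent alternative. The resulting query string would then be handed to whichever reasoner or triple store is configured.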
Configuring the System. SemFacet offers a range of options for system administrators to deploy and configure the system. These include (i) the reasoning engine of choice (JRDFox, PAGOdA, Sesame, Stardog, or HermiT); (ii) the annotation properties relevant for keyword search and for displaying query answers; and (iii) the facet that is first displayed to the user. By default, values within a facet are interpreted disjunctively; however, SemFacet provides advanced configuration capabilities for specifying which facets must be interpreted conjunctively. Additionally, the hierarchical display of facet values can be configured by specifying the property used to construct the hierarchy (typically rdfs:subClassOf or a property capturing a partonomy relation).

¹ If the given set of keywords is empty, the system considers all URIs in the data as relevant.

3 Demonstration Scenarios

Demo attendees will explore Yago with SemFacet. They will be able to search Yago either using their own search tasks or using the tasks that we prepared for them. We now discuss the dataset that we prepared for the demonstration.

Yago comes in several slices available for download, and since the current version of SemFacet relies on main-memory triple stores, we took only those slices of Yago that could fit in the main memory of our machine. In particular, we took the Taxonomy slice, which consists of domain and range restrictions as well as subclass relations, and the Core slice, which contains instances of object and annotation properties. The axioms from Taxonomy constitute the ontology that we prepared for the demo. We refer to this ontology together with the Taxonomy and Core data slices as FYago. In order to generate snippets, we included in both slices DBpedia abstracts, thumbnails, and links to Wikipedia articles.
FYago contains 97 million triples involving over 3 million URIs; of these triples, 55% relate entities via object properties, 16% relate entities to numbers, 2% relate entities to dates, and 27% relate entities to other kinds of strings. FYago has 89 predicate URIs, which gives an upper bound on the number of facets that the user will see during faceted search. We analysed the popularity of these predicates, that is, how often users will see them during faceted search. We found that 8 facet predicates have popularity exceeding 1 million entities, including hasLongitude, hasLatitude, and rdf:type; thus, a facet involving such a predicate will occur in most search sessions. A further 12 facet predicates have popularity between 100,000 and 1 million, which implies that they will occur rather often. The remaining 69 facet predicates have popularity below 100,000, and thus they will occur rarely; e.g., a facet predicate with popularity 1,000 is relevant to 1,000 entities only, and hence only to 0.025% of all data triples.

4 References

[1] M. Arenas, B. Cuenca Grau, E. Kharlamov, Š. Marciuška, and D. Zheleznyakov. Enabling Faceted Search over OWL 2 with SemFacet. In: OWLED. 2014.
[2] M. Arenas, B. Cuenca Grau, E. Kharlamov, Š. Marciuška, and D. Zheleznyakov. Faceted Search over Ontology-Enhanced RDF Data. In: CIKM. 2014.
[3] M. Arenas, B. Cuenca Grau, E. Kharlamov, Š. Marciuška, and D. Zheleznyakov. Faceted Search over RDF-based Knowledge Graphs. In: J. Web Sem. 37 (2016).
[4] M. Arenas, B. Cuenca Grau, E. Kharlamov, Š. Marciuška, and D. Zheleznyakov. Towards Semantic Faceted Search. In: WWW (Companion Volume). 2014.
[5] M. Arenas, B. Cuenca Grau, E. Kharlamov, Š. Marciuška, D. Zheleznyakov, and E. Jiménez-Ruiz. SemFacet: Semantic Faceted Search over Yago. In: WWW (Companion Volume). 2014.
[6] B. Cuenca Grau, E. Kharlamov, Š. Marciuška, D. Zheleznyakov, and Y. Zhou. Querying Life Science Ontologies with SemFacet. In: SWAT4LS. 2014.
[7] B. Cuenca Grau, E. Kharlamov, D. Zheleznyakov, M. Arenas, and Š. Marciuška. On Faceted Search over Knowledge Bases. In: DL. 2014.
[8] GitHub of SemFacet. https://github.com/semfacet.
[9] E. Kharlamov, D. Hovland, E. Jimenez-Ruiz, D. Lanti, H. Lie, et al. Ontology Based Access to Exploration Data at Statoil. In: ISWC. 2015.
[10] E. Kharlamov, N. Solomakhina, Ö. L. Özçep, D. Zheleznyakov, T. Hubauer, et al. How Semantic Technologies Can Enhance Data Access at Siemens Energy. In: ISWC. 2014.
[11] SemFacet Project Page. http://www.cs.ox.ac.uk/isg/tools/SemFacet/.
[12] A. Soylu, E. Kharlamov, D. Zheleznyakov, E. Jiménez-Ruiz, M. Giese, and I. Horrocks. Ontology-Based Visual Query Formulation: An Industry Experience. In: ISVC. 2015.
[13] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In: WWW. 2007.
[14] D. Tunkelang. Faceted Search. Morgan & Claypool Publishers, 2009.