Exploring Large RDF Datasets using a Faceted Search Juan Francisco Garcia Navarro, Victor Antonio Lopez Villamar Jagannathan Srinivasan, Matthew Perry, Souripriya Das, Zhe Wu Oracle Corporation {juan.f.garcia, victor.a.lopez, jagannathan.srinivasan}@oracle.com { matthew.perry, souripriya.das, alan.wu}@oracle.com Abstract. We propose a facet-based RDF data exploration mechanism that lets the user visualize large RDF datasets by successively refining a query. The novel aspects of our work are: i) the SPARQL query pattern is visualized as a query graph, ii) the successive refinements are visualized in a query refinement graph, and iii) the result triples are visualized as a result RDF graph. The scheme is scalable and it visualizes RDF graphs stored in Oracle Database 12c Spatial and Graph Option with Cytoscape, a graph visualization tool. 1 Introduction For large RDF datasets, we propose a facet-based RDF data exploration mecha- nism that lets the user visualize data through a query refinement process. To describe our proposal, we use the GovTrack RDF data [1].The user starts with a conjunctive SPARQL triple pattern1 (Fig. 1a), for example to get bill and voting information for sessions of the U.S Congress. From this SPARQL triple pattern, a query graph is formed as a directed-graph (Fig. 1b), where the subject and the objects (including variables, IRIs, or literals) are represented as nodes, and the predicates (both IRIs and variables) are represented as edges. Thus, the seven triple patterns would result in a directed graph with 6 nodes and 7 edges (the pre-recorded viewlet available at [6]). The starting query provides the most general form (shown in Fig. 1a and 1b). We treat each variable in the query as a facet, which can have many possible values. Thus, subsequent refined queries are derived by replacing the variable in the SPARQL query with the value selected from the corresponding facet. The refined query can be executed on demand to get the resulting RDF triples. This process can be repeated until all variables are substituted. Furthermore, a context menu on each node (Fig. 1c) allows the user to list and choose among the possible values for the selected variable. This list also shows the number of solutions that would result after selecting a specific value thereby allowing the user to determine how much the result space will be reduced before performing the substitution. 1 In general, our scheme is applicable to an arbitrarily complex SPARQL query since a query can be represented as a directed graph using its abstract syntax tree. adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011 (a) (b) (c) Fig. 1. a) The SPARQL query triple pattern b) the query graph, and c) a context menu Fig. 2. Query Refinement graph (node: the original/refined query; edge: the substitution) To capture the substitution steps, a query refinement graph is created. This graph is populated with nodes representing each of the refined queries that are being created. Figures 2a-2d show the different branches of the query graph, where each of the nodes represent a specific query with the facet values already replaced. Note that in addition to refining the last resulting query, which would represent a leaf node, the refinement can as well proceed from the root or any intermediate node. Figure 2a shows the creation of the new node Q2, resulting from substituting the variable ?name with the value "Robin Hayes". Figure 2b shows the new query Q3 resulting from replacing variable ?option_uri in query Q2. The user can further explore from the root node (Q1) to create another branch (node Q4 in Figure 2c) by replacing a different variable. Similarly, the user can explore from the root node (Q1) and create a new branch (node Q5 shown in Figure 2d). Fig. 3. The sequence of modified query graphs on substitutions Once a substitution is made, and a new node is added into the query-refinement graph, the query- graph (shown in Fig. 1b) is also updated. The node counts are up- dated with the modified possible values on the nodes, according to the replaced varia- ble. This sequence is illustrated in Figures 3a, 3b, and 3c, which correspond to their counterpart steps of the sequence shown on Figures 2a, 2b, and 2c. At this point, we have applied the concept of facets to refine a SPARQL query suc- cessively. The result is a set of sub-queries, identified in Figures 2a-2d that can be independently executed giving a smaller graph that users may find easier to visualize and analyze, compared to the starting query. Fig. 4 shows the result of executing que- ry Q5, depicted on Figure 2d, resulting in a graph having 270 nodes and 269 edges, with a total of 132 sub-graphs that match as result of the SPARQL query. Note that queries are executed as SPARQL CONSTRUCT WHERE queries to return a graph rather than bindings. Analogously, each of the nodes depicted in Figures 2a, 2b and 2c and labeled as Q2, Q3, and Q4, which represent sub-queries derived from the original query, can also be executed. This way of successive refinement helps users explore the query solution space in an incremental manner. At each step, the facet counts for remaining variables give an indication of the solution space, prior to the actual visualization of any of the queries. The above scheme is used to visualize RDF graphs stored in Oracle Database 12c Spatial and Graph Option [2] with Cytoscape [3], a graph visualization tool. The idea of hierarchical faceted navigation has been presented as early as 2002 in [4]. In [5], a combination of graph visualization and facet based filtering is used. However, our scheme is scalable. It requires materialization of only the first result set in a compact integer id-based format, on which facet counts are computed using full-scans or with bitmap index scans. We leverage Oracle’s Parallel DML and query, and In-Memory capabilities to achieve interactive response times (Fig 4). For extremely large datasets (over billions of triples), sampling is used to limit the initial materialized result size. Sample Performance on 8-core AMD FX-8350 CPU, 32 GB of RAM, Two 10000 rpm SATA disks, Oracle Enterprise Linux 5 #vars (of 9) Materialize Facet Computation substituted Results No Index Bitmap Index In-Memory 0 35s (18M rows) 87s 3.73s 3.90s 1 N/A (11M rows) 85s 38.00s 2.13s 2 N/A (0.2M rows) 85s 0.59s 0.90s Fig. 4. Sample Performance Results and the Result Graph of a refined query. 2 Conclusions We described a facet-based approach to effectively explore large RDF datasets. Our proposal includes a visual graph-based representation in order to make the query- ing process easier. The solution described includes creation of a query graph to repre- sent the original SPARQL query and query refinement graph to keep track of what has been explored. The query refinement graph presents a hierarchy of sub-queries that are derived from the original query. Furthermore, each node in the query refine- ment graph, which represents a sub-query, can be executed to generate a smaller re- sult graph that is easier for the user to visualize, analyze and thus understand. 3 References 1. Tauberer, J.,GovTrack, http://www.xml.com/lpt/a/1643 2. Oracle DB 12c Spatial & Graph , http://www.oracle.com/us/products/database/038407.htm 3. Cytoscape: An Open Source Platform for Network Data, http://www.cytoscape.org/ 4. Hearst, M. A., Elliott, A., English, J., Sinha, R. R.,Swearingen, K., Yee. K.: Finding the flow in web site search. Commun. ACM 45(9): 42-49 (2002) . 5. Heim, P., Ertl T., and Ziegler, J., Facet Graphs: Complex Semantic Querying Made Easy, ESWC 2010. 6. Exploring Large RDF Datasets using a Faceted search Viewlet, In http://download.oracle.com/otndocs/tech/semantic_web/viewlets/iswc2015_faceted_search .zip