Introduction

Lattice-Based View Access: A way to Create Views over SPARQL Query for Knowledge Discovery

Mehwish Alam

Amedeo Napoli

0 Universite de Lorraine) BP 239, Vandoeuvre-les-Nancy , F-54506 , France

The data published in the form of RDF resources is increasing day by day. This mode of data sharing facilitates the exchange of information across the domains. Although it provides easier ways in the use of data such as through SPARQL queries. These queries over semantic web data usually produce list of tuples as answers which may be huge in number or may require further manipulation so that it can be understood and interpreted. Accordingly, this paper introduces a new clause View By in the SPARQL query for creating semantic views over the raw SPARQL query answers. This approach namely, Lattice-Based View Access (LBVA), is a framework based on Formal Concept Analysis (FCA). It provides a classi cation of the answers of SPARQL queries based on a concept lattice, that can be navigated for retrieving or mining speci c patterns in query results w.r.t. user constraints. In this way, the concept lattice can be considered as a materialized view of the data resulting from a SPARQL query.

Formal Concept Analysis SPARQL Query Views Lattice-Based Views SPARQL Classi cation

Introduction

A considerable amount of Semantic Web (SW) data is already available on the web. Thus many agents are looking for more and more data present in the form of ontologies and RDF triples. Linked Open Data (LOD) [ 2 ] is a huge source of RDF resources interlinked with each other to form a cloud. SPARQL queries are used in order to make these data usable by domain experts and software agents. Sometimes queries are executed which may generate huge amount of results giving rise to the problem of information overload [ 4 ]. A typical example is the answers retrieved by search engines, which may mix between several meanings of one keyword. In case of huge results, many results are navigated to nd the interesting links, which may be overwhelming without any navigation tool. Same is the case with the answers obtained by SPARQL queries, which may be huge in number and it may be harder to extract the most interesting patterns. This problem of information overload raises new challenges for data modeling and analysis and calls for improving data access, information retrieval and knowledge discovery w.r.t web querying.

In order to deal with the problem of information overload, this paper proposes a new approach based on Formal Concept Analysis (FCA [ 5 ]), which provides a lattice-based classi cation of the results obtained by SPARQL queries w.r.t user constraints. This new framework, namely Lattice Based View Access (LBVA), allows the classi cation of SPARQL query results into a concept lattice, referred to as views, for data analysis, knowledge discovery and information retrieval purposes. Based on one SPARQL query several views can be generated from di erent perspectives. In addition, LBVA allows for navigation over SPARQL query results. Here after, we describe how the views (a view corresponds to a concept lattice) can be designed from a SPARQL query and the result which is returned. Moreover, the analysis and the interpretation of the views is totally supported by the concept lattice. In case of large data only a part of the lattice can be considered for the analysis using the technique of iceberg lattices.

The intuition of classifying results obtained by querying LOD is inspired by web clustering engines [ 3 ] such as Carrot21. The general idea behind web clustering engines is to group the results obtained by query posed by the user based on the di erent meanings of the terms related to a query. Such systems deal with unstructured textual data on web. However, there are some studies conducted to deal with structured RDF data. In [ 4 ], the authors target the problem of managing large amounts of results obtained by conjunctive queries with the help of subsumption hierarchy present in the knowledge base. On the other hand, latticebased views provide classi cation based on the formal concepts and a partially ordered organization of the results. It also opens possibilities for navigation or information retrieval by traversing the concept lattice and for data analysis by allowing the extraction of association rules from the lattice. Additionally, unlike [ 4 ], LBVA also deals with data that has no schema (which is often the case with linked data).

The concept lattice provides a well founded structure on which a series of interpretations can be carried out. This framework is general and does not depend on any particular domain and may be used in addition with external resources, e.g. domain knowledge.

The paper is structured as follows: Section 1 gives a brief overview of Linked Open Data and gives the motivating example. Section 2 de nes LBVA and gives the overall architecture of the framework. Section 3 brie y described the experimentation setting. Finally, Section 4 concludes the paper. 1

Linked Open Data

Linked Open Data (LOD) [ 2 ] is the way of publishing structured data which helps in the connection between several resources through their schema. LOD

1 http://project.carrot2.org/index.html

represents data in the form of RDF graphs. Given a set of URIs U , blank nodes B and literals L, an RDF triple is represented as t = (s; p; o) 2 (U [ B) U (U [ B [ L), where s is a subject, p is a predicate and o is an object. A nite set of RDF triples is called as RDF Graph G such that G = (V; E), where V is a set of vertices and E is a set of labeled edges and G 2 G, such that G = (U [ B) U (U [ B [ L). Each pair of vertices connected through a labeled edge keeps the information of a statement. Each statement is represented as hsubject; predicate; objecti referred to as an RDF Triple. V includes subject and object while E includes the predicate.

SPARQL2 is the standard query language for RDF. In the current work we will focus more on the type of queries whose output performs value selection over the variables matching the patterns (queries containing SELECT clause). Now let us assume that there exists a set of variables V disjoint from U in the above de nition of RDF, then (U [ V ) (U [ V ) (U [ V ) is a graph pattern called a triple pattern. If a variable ?X 2 V and ?X = c then c 2 U . Given U , V and a triple pattern t a mapping (t) would be the triple obtained by replacing variables in t with U . [[:]]G takes an expression of patterns and returns a set of mappings. Given a mapping : V ! U and a set of variables W V , is represented as jW , which is described as a mapping such that dom( jW ) = dom( ) \ W and jW (?X) = (?X) for every ?X 2 dom( ) \ W . Finally, the SPARQL SELECT query is de ned as follows: De nition 1. A SPARQL SELECT query is a tuple (W; P ), where P is a graph pattern and W is a set of variables such that W var(P ). The answer of (W; P ) over an RDF graph G, denoted by [[(W; P )]]G , is the set of mappings: [[(W; P )]]G = f jW j 2 [[P ]]Gg

In the above de nition var(P ) is the set of variables in pattern P and variables W in SELECT clause of SPARQL query3. Further details on the formalization and foundations of RDF databases are discussed in [ 1 ].

Example 1. Consider a query all the bands which play di erent stringed instruments along with their origin. This example will continue in the rest of this paper. Let us name this query Q, then Q can not be answered by standard search engines as it generates a separate list of bands and stringed instruments requiring multiple resources to be integrated. However, Q can be answered by SPARQL queries over LOD. For example, let us consider the SPARQL query in Listing 1.1 over DBpedia4. DBpedia is the central hub of LOD which extracts data from Wikipedia and makes it available in the structured format. 2 http://www.w3.org/TR/rdf-sparql-query/ 3 In the rest of the paper we denote W as V to avoid overlap between the attribute values W in many-valued context. 4 http://dbpedia.org/sparql (a) Classes of Bands w.r.t. Musical Instruments and Countries, e.g., the concept on the top right corner with the attribute Cuatro contains all the bands which play Cuatro. (b) Classes of Musical Instruments w.r.t Bands and their Origin

Listing 1.1: SPARQL Query Q SELECT ? band ? i n s t r u m e n t ? o r i g i n WHERE f ? band r d f : t y p e dbpedia owl : Band . ? band dbpprop : o r i g i n ? o r i g i n . ? band dbpedia owl : bandMember ?member . ?member dbpedia owl : i n s t r u m e n t ? i n s t r u m e n t .

? i n s t r u m e n t d c t e r m s : s u b j e c t C a t e g o r y : S t r i n g i n s t r u m e n t s . GROUP BY ? i n s t r u m e n t ? o r i g i n

The above SPARQL query returns a list of bands along with the instruments they play and their origin as an answer. An excerpt of the answers is shown below: dbpedia5:RHCP6 dbpedia:Banjo dbpedia:US dbpedia:Disturbed dbpedia:Bass Guitar dbpedia:US, dbpedia:The Solution dbpedia:Banjo dbpedia:Sweden.

In case of too many origins GROUP BY clause will lead to many small groups which would be hard for the user to observe with respect to origin or instrument, failing in the task of grouping. A classi cation technique can be used for navigation or interpretation. For example, Figure 1(a) shows a concept lattice for a small part of query answers. Here we can see classes such as the concept which contains all the bands which play Cuatro. If the search is more speci ed then the origin of each of the bands can also be retrieved. It is possible to retrieve bands which play Cuatro and are from UK, here Chrome Hoof is the band which plays Cuatro in the current small example. On the other hand, Figure 1(b) shows a concept lattice where musical instruments are classi ed with respect to bands and their origin, giving a totally di erent perspective over the same set of answers.

5 http://dbpedia.org/resource/ 6 Red Hot Chilli Peppers Lattice-Based View Access

In this paper, we propose an approach called Lattice-Based View Access which generates a concept lattice referred to as view. This view provides users with classi cation, navigation and analysis capabilities over these results. In the scenario of LOD, query processing procedure can not be controlled, so, in our algorithm we do not process the SPARQL query. The views are de ned over RDF data by processing the set of tuples returned by the SPARQL query. 2.1

SPARQL Queries with Classi cation Capabilities

The idea of introducing a VIEW BY clause is to provide classi cation of the results and add a knowledge discovery aspect to the results w.r.t the variables appearing in VIEW BY clause. For example, we have a SPARQL SELECT query Q = SELECT ?X ?Y ?Z WHERE fpattern Pg VIEW BY ?X then the set of variables V = f?X; ?Y; ?Zg. According to the de nition 1 the answer of the tuple (V; P ) is represented as [[(f?X; ?Y; ?Zg; P )]] = i where i 2 f1; : : : ; kg and k is the number of mappings obtained for the query Q. Here, dom( i) = f?X; ?Y; ?Zg which means that (?X) = Xi, (?Y ) = Yi and (?Z) = Zi. Finally, a complete set of mappings can be given as ff?X ! Xi; ?Y ! Yi; ?Z ! Zigg.

Now, variables appearing in the VIEW BY clause are referred to as object variable7 and is denoted as Ov such that Ov 2 V . In the current scenario Ov = f?Xg. The remaining variables are referred to as attribute variables and are denoted as Av where Av 2 V such that Ov [ Av = V and Ov \ Av = ;. Example 2. An alternate query for the query in Listing 1.1 with the VIEW BY clause can be given as: SELECT ?band ?instrument ?origin WHERE f ?band rdf:type dbpedia-owl:Band. ?band dbpprop:origin ?origin. ?band dbpedia-owl:bandMember ?member . ?member dbpedia-owl:instrument ?instrument .

?instrument dcterms:subject dbpedia8:Category:String instruments.g VIEW BY ?band Here, V=f?band, ?instrument, ?origing then the evaluation of the SELECT query [[(f?band; ?instrument; ?origing; P )]] will generate the mappings shown in Table 1. Accordingly, dom( i) = f?band; ?instrument; ?origing. Here, 1(?band) = RHCP , 1(?instrument) = Banjo and 1(?origin) = U S. In the current example, we have, Ov = f?bandg because it appears in the VIEW BY clause and Av = f?instrument; ?origing. Figure 1a shows the generated view when Ov = f?bandg and in Figure 1b, we have; Ov = f?instrumentg. 7 The object here refers to the object in FCA. 8 http://dbpedia.org/resource/ ... ...

?band ?instrument ?origin 1 RHCP Banjo US 2 Disturbed Bass Guitar US

... ... The results obtained by the query are in the form of set of tuples, which are then organized as a many-valued context. If Ov = f?Xg then (?X) = fXigi2f1;:::;kg, where Xi denote the values obtained for the object variable and the corresponding mapping is given as ff?X ! Xigg. Finally, G = (?X) = fXigi2f1;:::;kg. Let Av = f?Y; ?Zg then M = Av and the attribute values W = f (?Y ); (?Z)g = ffYig; fZiggi2f1;:::;kg. The corresponding mapping for attribute variables are ff?Y ! Yi; ?Z ! Zigg Consider an object value gi 2 G and an attribute value wi 2 W then we have (gi; \?Y 00; wi) 2 I i ?X(gi) = wi, i.e., the value of gi for attribute ?Y is wi, i 2 f1; : : : ; kg as we have k values for ?Y . Obtaining Binary Context (G; M; I): Afterwards, a conceptual scaling used for binarizing the many-valued context, in the form of (G; M; I). Finally, we have G = fXigi2f1;:::;kg, M = fYig [ fZig where i 2 f1; : : : ; kg for object variable Ov = f?Xg.

Example 3. In the example Ov = fbandg, Av = finstrument; origing. The answers obtained by this query are organized into a many-valued context as follows: the distinct values of the object variable ?band are kept as a set of objects, so G = fRHCP; Disturbed; : : : g, attribute variables provide M = finstrument; origing, W1 = fBanjo; BassGuitar; : : : g and W2 = fU S; U K; F rance : : : g in a many-valued context. The obtained many-valued context is shown in Table 2. Following the above de ned procedure a many-valued context is conceptually scaled to obtain a binary context shown in Table 3. The corresponding concept lattice is shown in Figure 1(a). Once the context is designed, the concept lattice can be built using an FCA algorithm. This step is straight forward as soon as the context is provided. In the current implementation we use AddIntent [ 8 ] which is an incremental concept lattice construction algorithm. In case of large data iceberg lattices can be considered [ 6 ]. The use of VIEW BY clause activates the process of LBVA, which transforms the SPARQL query answers (tuples) to a formal context Kanswers through which a concept lattice is obtained which is referred to as a Lattice-Based View. A view on SPARQL query in section 1, i.e, a concept lattice corresponding to Table 3 is shown in Figure 1a. At the end of this step the concept lattice is built and the interpretation step can be considered.

2.4 Interpretation Operations over a Concept Lattice

Navigation Operation and Knowledge Discovery: The obtained concept lattice can be navigated for searching and accessing particular LOD elements. It is possible to drill down from general to speci c concepts according to some constraints. For example, in order to search for bands in US playing Banjo, the concept lattice in Figure 1(a) is explored levelwise. First the broader concept contains all the bands from US, RHCP, The Solution, Disturbed. Then, the children concepts contain more speci c concepts with the instruments Banjo and Bass Guitar. According to the initial constraint, the attribute concept of Banjo can be selected returning two objects namely RHCP, The Solution. Next, to check which instruments are played in music originating from US, another concept lattice can be explored, where objects correspond to instruments shown in Figure 1(b). The results in this case is the set of objects Bass Guitar, Banjo.

FCA provides a powerful means for data analysis and knowledge discovery. Iceberg lattices provide the top most part of the lattice ltering out only general concepts. The concept lattice is still explored levelwise depending on a given threshold. Then, only concepts whose extent is su ciently large are explored, i.e., the support of a concept corresponds to the cardinal of the extent. If further speci c concepts are required the support threshold of the iceberg lattices can be lowered and the resulting concept lattice can be explored levelwise.

Another way of interpreting the data is provided by Duquenne-Guigues basis of implications which takes into account a minimal set of implications which represent all the association rules that can be generated for a given formal context. For example, DG-basis of implications according to the formal context in Table 3 state that all the bands which play Banjo are from US (rule: Banjo ! US). Moreover, the rule Venezuela ! Cuatro suggests that all the bands from Venezuela play Cuatro. This rule states that Cuatro is widely used in the folk music of Venezuela. 3

Experimentation

Several experiments were conducted on real datasets. These datasets include DBpedia, Yago [ 7 ], which is a knowledge base automatically extracted from Wikipedia (infoboxes, categories), Wordnet and Geonames. The experiment was also tested on the biomedical data such as Sider9 and Drugbank10. Sider keeps the information about the medicines along with their side e ects. Drugbank keeps the detailed information about the drugs such as drug category and target proteins. These experiments provide qualitative and quantitative evaluation to our approach. These experiments are not discussed in the current paper due to lack of space. However, the software, a detailed technical report along with the visualization of the SPARQL query views can be accessed online11. 4

Conclusion and Discussion

In LBVA, we introduce a classi cation framework for the set of tuples obtained as a result of SPARQL queries over LOD. We introduce a classi cation framework based on FCA for organizing a view, i.e., the set of tuples resulting from a SPARQL query. In this way, the view is organized as a concept lattice that can be navigated where information retrieval and knowledge discovery can be performed. For future work, we are interested in working with several object variable allowing to deal with more complex relations, with the help of Relational Concept Analysis (RCA). In addition, here only binary contexts are taken into account. It is possible to go beyond this limitation in using another variation of FCA which is the formalism of pattern structures. 9 http://sideeffects.embl.de/ 10 http://www.drugbank.ca/ 11 http://webloria.loria.fr/~alammehw/lbva/

Marcelo

Arenas , Claudio Gutierrez, and

Jorge

Perez . Foundations of rdf databases . In Reasoning Web , pages 158 { 204 , 2009 .

Christian

Bizer , Tom Heath, and Tim Berners-Lee. Linked data - the story so far . Int. J. Semantic Web Inf. Syst. , 5 ( 3 ):1{ 22 , 2009 .

Claudio

Carpineto , Stanislaw Osinski, Giovanni Romano, and

Dawid

Weiss . A survey of web clustering engines . ACM Comput. Surv. , 41 ( 3 ): 17 :1{ 17 : 38 , July 2009 .

4. Claudia d'Amato , Nicola

Fanizzi , and Agnieszka

Lawrynowicz . Categorize by: Deductive aggregation of semantic web query results . In ESWC (1) , 2010 .

Bernhard

Ganter and

Rudolf

Wille . Formal Concept Analysis: Mathematical Foundations . Springer, Berlin/Heidelberg, 1999 .

Stumme ,

Taouil ,

Bastide , and

Lakhal . Conceptual clustering with iceberg concept lattices . In Proc. GI-Fachgruppentre en Maschinelles Lernen (FGML'01) , 2001 .

7. Fabian

Suchanek , Gjergji Kasneci, and Gerhard

Weikum . Yago: A core of semantic knowledge . In Proceedings of the 16th International Conference on World Wide Web, WWW '07 , pages 697 { 706 , New York, NY, USA, 2007 . ACM.

8. Dean van der Merwe , Sergei A. Obiedkov , and Derrick

Kourie. Addintent : A new incremental algorithm for constructing concept lattices . In ICFCA , 2004 .