=Paper=
{{Paper
|id=None
|storemode=property
|title=An Approximation Approach for Semantic Queries of Naïve Users by a New
Query Language
|pdfUrl=https://ceur-ws.org/Vol-867/Paper6.pdf
|volume=Vol-867
|dblpUrl=https://dblp.org/rec/conf/icwit/DjeddaiSK12
}}
==An Approximation Approach for Semantic Queries of Naïve Users by a New
Query Language==
An approximation approach for semantic queries of naïve users by a new query language Ala Djeddai, Hassina Seridi-Bouchelaghem and Med Tarek Khadir LABGED Laboratory, University Badji Mokhtar Annaba, Po-Box 12, 23000, Algeria {djeddai, seridi, khadir}@labged.net Abstract. This paper focuses on querying semi structured data such as RDF da- ta, using a proposed query language for the non-expert user, in the context of a lack knowledge structure. This language is inspired from the semantic regular path queries. The problem appears when the user specifies concepts that are not in the structure, as approximation approaches, operations based on query modi- fications and concepts hierarchies only are not able to find valuable solutions. Indeed, these approaches discard concepts that may have common meaning, therefore for a better approximation; the approach must better understand the user in order to obtain relevant answers. Starting from this, an approximation approach using a new query language, based on similarity meaning obtained from WordNet is proposed. A new similarity measure is then defined and calcu- lated from the concepts synonyms in WordNet, the measure is then used in eve- ry step of the approach for helping to find relations between graph nodes and user concepts. The new proposed similarity can be used for enhancing the pre- vious approximate approaches. The approach starts by constructing a graph pat- tern ( ) from the query and finalized by outputting a set of approximate graph patterns containing the results ranked in decreasing order of the approximation value level. Keywords. Graph matching, RDF, Naïve user, Graph pattern, Semantic Que- ries, Regular Path Queries, Approximation, Similarity, Ranking and WordNet 1 Introduction In recent years, the amount of information on the web grows increasingly and the classic information retrieval is not able to find the answer which satisfies the user queries, therefore, the semantic search may be a proposed solution for such situations. Most users have not much knowledge about the querying language in the semantic web, they are not aware of target knowledge base; so the user query does not match necessary the data structure. It is very hard and difficult to understand intend of naïve users. In this paper we propose an approach for answering a new query language in- spired from the conjunctive regular path queries [1], the user query is transformed to a graph pattern. We use a new method to calculate the approximation level between the paths of the graph data and the query paths; approximation is enhanced using the Proceedings ICWIT 2012 50 WordNet database so the method is based on a proposed meaning similarity between concepts from WordNet We consider the problem of querying the semi-structured data such RDF data which is modeled by a graph , and an ontology , . Where each node in is labeled with a constant and each edge is labeled with a label drawn from a finite set of symbols , contains nodes representing entity classes or instanc- es or data values (values of properties), the blank nodes are not considered, the edges between the class nodes and the instance nodes is labeled by ‘type’, represents the relations between the nodes in , and . Users specify their request by a proposed language inspired from the conjunctive regular path queries CRP which have the next format: 1… – 1 1 1 ,…, (1) • Each Yi or Zi is a variable or a constant. The variable is specified by? , we make a simple modification to the constants for specifying the choices so the user is able to specify constants which are not necessarily appearing in G and he is able to use many constants by using the symbol ‘|‘so Yi or Zi is a variable or a constant or ex- pression (in our approach). • Regular path expressions {R1,…,Rn}, which are defined by the grammar: : å| |_| | 1. 2 | 1| 2 | , (2) Where ε is the empty string, “a” is a label constant, “_” denotes any label and L is a label variable. • 1… are head variables and the result is returned in these variables. In this paper, for helping the naïve users, we propose a new simple query lan- guage, we focus on the regular expression which has a simple format (using only the ‘.’ and the ‘|’), the query 1 is an example of the proposed language, We construct from the user query a graph patterns for finding a set of sub graphs in (approx- imate graph patterns) whose nodes matches the nodes in and its paths have a level of approximation to the paths in . Example1. We assume that a user writes a query 1 for finding the publications and the authors in ‘California’ university or ‘Harvard’ university in the ‘ESWC 2012’conference: ? ,? ? , ,? , ? , _ . | , | , ? , . , 2012 . Figure 1 shows a constructed from 1, the separate points between symbols rep- resented by non-labeled nodes, the query paths 1 2 3 correspond to user paths 1 2 3 of 1. The variable nodes are specified with ‘?’ to indicate that only these nodes are shown in the answer. In our work, the answers for the query is a set of approximated graph patterns ranked in order of decreasing the approximation level value, every one contains nodes that correspond to the user variables, the paths in every approximate graph pattern are an approximation of the paths in (every path in is corre- sponding to a single conjuncts query [4]). We use the graph patterns as answers, for Proceedings ICWIT 2012 51 giving too the user the ability to exxplore the resu ults for moree information about the result noddes. F Fig.1. A graph pattern p GP consstructed from Q1 Q In secction 2 relatedd works are discussed and Section 3 preesents WordN Net and the new propposed similaritty meaning. Inn section 4 thee approximatioon approach is detailed. Section 5 is dedicated to the approaach implemen ntation and exxperimentationn, whereas the concluusion and futuure works are presented in section s 6. 2 Reelated work ks Many appproaches, meethods and quuery languagee are proposedd for the searrch in the semantic web search, and a may be claassified as folllows: 1. Approaaches consideer structured query q languagees, such as: Coorese [9], Swooogle [11] and ONTTOSEARCH2 [4]. 2. Approaaches for naivve users, thesee approaches can c themselves be divided innto: 9 Keyw word-based appproaches, succh as QUICK [8], where quueries consist of lists of keywoords; 9 Naturral-language approaches, a w where users caan express queries using naatural lan- guagee, such that PoowerAqua [100]. In this work we aree interested byy using the reegular path quueries with sim mple regu- lar expresssion, this hellps the naive users u to use th he query languuage as they aare able to write simmple regular expression. e O approach combines thee two previouusly cited Our classes, so s the naïve user u queries thhe system usin ng simple struucture and the user con- stants aree seen as keywwords. Many worksw are pro oposed for thee approximatioon such as [1] and [2], where the t approximmation is appllied to the conjuncts c queeries. The ISPARQL L [3] is a sim milarity based approach which added thee notion of sim milarity to SPARQL L queries, wheere another teechnique in [7 7] calculates thhe approximaate answer from RD DF graph usinng an evolutioonary algorith hm. Despite thheir efficiencyy, the ap- proaches discard the user u influencee and opinion. The obtained results do nnot, there- fore, ofteen satisfy the latter. l In addittion to the aboove approachees, our work ppropose a new querry language innspired from conjunctive c queries, q using a technique ffor the ap- proximatiion based on meaning m simiilarity from WordNet W for a better understtanding of the user query as welll as finding thhe correspond dences betweeen its conceptts and the graph datta. The answeers are a set off approximatee graph patternns ranked in ddecreasing Proceedinngs ICWIT 20012 52 order approximation level so the user can explore these results in order to acquire more knowledge. 3 Using WordNet WordNet [5] is a lexical resource for the English language; it groups terms (name, verbs, adjectives etc.) in sets of synonyms called Synsets. Approaches based on char- acters strings become insufficient when concepts are systematically close to each other and when their names are different (example: « car » and « automobile »), the interrogation of a linguistic resource such as WordNet may indicates that two con- cepts are similar . For the calculation of the linguistic similarity, the function Syn(c) calculates the set of WordNet Synsets related to the concept c. 3.1 Definition of a new WordNet Meaning Similarity In this section we define a new WordNet meaning similarity, this measure is used in the process of discovering the nodes mapping from the user query and graph data. Let _ 1 2 the set of common senses between 1 and 2 to be compared, the cardinality of _ is : ë _ | 1 2 | , we use the following measure: Let min |Syn c1 |, | Syn c2 | be the minimum cardinality between the two sets Syn(c1) and Syn(c2) for the concept c1 and c2 respectively, thus our similarity meas- ure is constructed from analyzing of the next metric [7]: S_ Sim1 c1, c2 = |S |,| S | (3) This metric based on common senses of c1 and c2, it return 1.0 if c1 is synonym of c2 but if the set of senses for c1 (or c2) are including in the set of senses of c2 (or c1) so this metric return again 1.0, for example the concept “machine” has 8 senses and “motorcar” have 1 sense (included in the 8 sense of “machine”), utilizing this metric: 1 , 1 , so “machine” is the synonym of “motor- , car” but this is wrong because “machine” is the generalization of “motorcar”, so from this idea we propose the next new measure which is based on the different senses between two concepts: Let _ 1 2 – 1 2 : the set of different senses between c1 and c2, so _ . | _ | =| Syn c1 Syn c2 | Syn c1 Syn c2 , the set of union is defined as: 1 2 , our met- ric is: | | | |–| | 2 1, 2 1 | | 1 | | (4) | _ | | |–| _ | 2 1, 2 1 | | 1 | | (5) If 1 2 (no different senses c1 is synonym of c2) then 2 1, 2 7 1 0 1. 2 , 1 1 0.87 0.13 (7 common senses). 8 Proceedings ICWIT 2012 53 In this paper we use the next measure which takes advantage of Sim1 (common sens- es) and Sim2 (deferent senses): _ 1, 2 ù1 Sim1 ù2 Sim2 (6) where, ù1 and ù2 are the widths associated to Sim1 and Sim2 respectively, ù1 0.5, ù2 0.5 by default i.e. same importance, we adjust ù1 and ù2 according the preference of the user. Example 2.Table 1 shows values of similarity for some pair of concepts. We cannot find a significant similarity between these concepts if we use a metric based on syntax only, the similarity indicates that “house” and “mouse” are similar but this is wrong, this highlights the importance of the proposed measure as it is used to find relationships between terms of the semantic regular path queries and the nodes of the graph data. Concept1 Concept2 _ Car Automobile 0.5 0.16 0.33 0.0 Location Placement 0.33 0.16 0.245 0.22 House mouse 0.0 0.0 0.0 0.86 Table 1. Some similarity values calculated using _ and 4 Approximating the naïve user queries We start by defining the problem that is: how to satisfy the user in case if he specifies concepts that do not exist in the graph data? This is a big difficulty, as the approxima- tion is the solutions for finding results and approximating the user query. However, it must take into account the concept meaning, this is the goal of the new proposed query language and the meaning similarity. This helps to better understand the user and helps the discovery a set of concepts in the structure which are relevant to user concepts in order to begin the process of exploration and finding the responses for the variables. The proposed approach may be divided in three steps: 1- Discovering nodes which correspond to discovering user concepts in . 2- Finding for every query path its approximate paths in the graph data. 3- Generation of the results which are a set of approximate graph patterns with its approximation level value, these graph patterns contain the nodes results corre- sponding to the projection of the user variables. The procedure is based on the following objectives: 9 Giving to ability to the naïve user to take advantage from the power of semantic search, in this case we let him specify his needs by writing simple regular paths. 9 Understanding the naïve user query by finding relationships between the user paths and the knowledge base (RDF graph). Most user concepts do not appear in the structure, for this reason, we propose a new query language and a meaning similari- ty leading to a better understanding of the user needs on one hand and discovering the correspondences between the query concepts and the graph nodes on the other hand. The user, however, still plays an important role in the query answer para- digm. 9 The outputted answer must be understandable for the user and it should be simple. Proceedings ICWIT 2012 54 We make clear the procedures have been omitted, in the rest of the paper, because of pages limitation; we cannot describe the approach in detail so only the main steps are mentioned. 4.1 Mapping from Nodes in GP to Nodes in G The mapping process is necessary to find the correspondences of the nodes in GP (variables and constants in the conjuncts query); these nodes are used for finding the set of the approximate paths in . Because the user have lack knowledge of the graph data structure so he is able to use concepts not necessarily appearing in the graph and the process of mapping is important for discovering the nodes matches these concepts using WordNet. In order to enhance the matching we use a similarity metrics based on syntax (characters strings) (like: Levenshtein, NGram, JaroWinkler) and our meaning similarity (using the WordNet ontology) for discovering the senses (the meaning) commons between the concepts. Definition 1. Two concepts 1 , 2 are similar if _ 1, 2 (WordNet similarity), is predefined threshold, if _ 1, 2 0 then we test _ 1, 2 , the values of _ , _ and is defined in [0,1]. _ and _ (any syntax similarity) use the labels of nodes and edges. In the rest of the paper we use 1, 2) for the value returned by _ or _ . For finding the sets of node mapping the procedure _ _ returns for every node ( 1,2 i.e. the first or the last node in the query path ), the set contains the nodes in which are similar to using its label by the similari- ty based senses (or based syntax), in addition this procedure use a strategy for discov- ering another nodes in from the first and last edge in the query path . 4.2 Computing the Approximate Paths In this section we introduce the notion of approximation level between two paths and describe the method for calculating its values _ , this section is for the computa- tion of the approximate paths from , the finale answers (approximate graph patterns) are calculated in the next sub section. This calculation is started after the generation of the set of nodes mapping for every node .The procedure _ _ take as input a query path and outputting the set of tuples answer _ ,every tuple , , , _ containing two node : is first node in the approximate path , is the last node and _ is the value of the approximate level between and , the sets of tuples answer are used for constructing the ap- proximate graph patterns for . We consider the next points in the calculation process of _ : • The number of edges in similar to the edges in (similarity ≠ 1), each simi- lar edge in is a non-direct substitution for its corresponding edge in so we added the value of substitution to _ . Proceedings ICWIT 2012 55 • The number of additional edges in ( _ ), not appearing in , each addi- tional edge in is an insertion. • We also take into account the two values: similarity value between the first node in and the first in , similarity value between the last node in and the last in . • The order of edges in the query path for respecting the preference of the user. Our approach considers common and similar edges, therefore common edges are not associated to a value of ‘0’ but ‘1’, as well as the similarity values for similar edges. Before starting the process of finding the approximate local answer, the proce- dure _ _ generates the set of all paths from the two sets of nodes mapping and 2 . Definition 2. Let a path in , a query path in , is an approximate path for if the value of the approximation level between and is higher than _ ( predefined threshold of approximation), _ 0,1 . The procedure _ _ use the similarity obtained between two nodes and , , 2 .If is labeled with more than one term by the symbol ’|‘ so all terms are compared to the label of and only one value of similarity is returned i.e. the value. Example 3. Figure 2 shows the computation of the approximation level _ for the paths and ’ for the query path . For : the first node is labeled with the variable ‘?pub’ and has the set of nodes mapping = {publi- cation, pub1, pub2}, the last node is labeled with constant ‘ESWC2012’ and has the set of nodes mapping = {ESWC2012, ISWC2012}.The similar edges by discontinued line, additional edges by double line, first and last nodes by the dark circle; the values of similarity between edges and nodes are in italic, ). The common edges are represented by single continued line. In the path , number of similar or common edges is 2 (with two values of similarity: 0.95, 1), 1 ? , 2 0.90, 2 2012, 2012 1, the approximate level associated with is: ∑ _ 1 . 1 2 4 (7) . _ 1 0.90 1 ⁄4 0,97 (8) The tuple answer corresponding to is: 2, 2012, , 0,97 . In the path there is one additional edge the (the edge type) so _ 1 and, number of similar or common edges is 2 (with two values of similarity: 0.95, 1), 1 ? , 0.70, 2 2012, 2012 0.20, so the approximate level associated with is : . _ 1 0.70 0.20 ⁄4 0,64 (9) The tuple answer corresponding to is: , 2012, , 0,64 . _ for is greater than _ for so the path have a good approximation than . Proceedings ICWIT 2012 56 Fig. 2. Compputing the approoximation levell apx_lev for thee paths P and P P’ 4.3 Coomputing thee Approximate Graph Pattterns In this seection we desscribe how thee final answers (Approxim mate graph pattterns) are computedd from the appproximate paths discovered d in the previious step, Thee final an- swers forr the approxim mation is returrned in form of o tuples and every tuple reepresented by 1 , 2 , … . , , _ , _ _ , where to ꌔ are the nodes n correspponding to the nodess variable in (the nodes answers correesponding to the t variables iin the user query). _ is the approximate graph pattern n constructedd from the appproximate paths retuurned in _ and _ _ is the approximation level between and i.ee. the mean of the values _ of the approximate paths used foor the con- struction of _ . In [1] and [2], the final f answers are a set of no odes correspoonding to the vvariable in the queryy, in addition; as our approach based on graph patternns, a graph paattern with each nodees result is retturned for a beetter answers understanding u g. For computing c the final answ wer we must generate the set of tuplees answer _ for every pathh , exploring thee paths in eveery tuple and ccombining same pathhs for the generation of the graph pattern ns answer . Definitioon 3. Let a graph patterrn constructed d from a regullar conjunctivve query , Let a graph patternn . is an a approximatee graph patterrn for , if thhe value of approximmate level _ _ betwween and is higher than _ _ (prede- fined threeshold of apprroximation forr ), _ _ 0,1 . The procedure p _ is caalled with the set _ and its first ttuple. The proceduree _ explores all tuples t in any _ to geenerate all appproximate graph pattterns, added them in the final f set _ with its nodes variablles and its approximmation level. ForF the process of ranking, the value _ _ is used to rank the tupless in _ , the tuples are a outputted rankedr in a deecreasing ordeer. 5 Im mplementation and Exxperimenta ation Our apprroach is impleemented in Jaava and Jena API, we usee JAWS (Javaa API For WordNett Searching) for f the implem mentation of th he proposed meaning m simillarity. The RDF dataa set used is a sub set from the SwetoDblp lp ontology which w is large ssize ontol- ogy focused on biblioggraphy data ofo Computer Science S publiccations wheree the main data sourrce is DBLP, it T used subsset contains a collection i has 4 millionns of triples. The Proceedinngs ICWIT 20012 57 of books and its bookk chapters. Foor making the execution faster, f an offlline phase which coontains: RDF triples norm malization, (geetting triples that are closeely to the natural laanguage), buillding 2 indexees, is computeed in order to allow quick finding of the approoximate paths.. The threshollds , _ , _ _ are automatically a initialized and updaated accordingg the query structure, s this update allowws the reductiion of the found ansswers numberr. For exxperimentatioon purposes annd because ourr query languaage is inspiredd from the conjunctiive path queriies for helpingg the naïve (non-expert) useers, a query bbenchmark is createdd. The benchm mark containss a set of queeries, with different intendds that are executed over the RDF F subset. For every query, from the subsset; we compuuted, man- ually, thee set of the releevant solutionns (RS) for evaaluating Preciision and Recaall: RS (10) RS RS (11) In commparison withh SPARQL, our o work can be b used by a non-expert ussers and it allows sppecifying a quuery paths bettween variablles and constaants for a bettter under- standing of the user inttend. It is diffficult for the naïve n user to use u SPARQL eefficiently because its i complexes structure. Taable 2 includees some queriees, used for thhe evalua- tion wherreas Figure 4 shows the precision and recall r for somme queries, prroving the effectivenness of the appproach. 1 Precission 0,5 Recall 0 1 2 3 4 5 6 Fig. 3. Evaluuation results fo or some queriess answers in Precision Relevant answers answers system Recall RS Nb Nb Nb user query q (?Book__Chapter , title.contains, web)) User inttend: Find all Book_chapters B that 52 59 50 0.84 0.96 have tittle contains «w web » ‐ (?Book_Chapter, boo ok chapter included in the book, b Prolog and Databases) ‐ (?Book_Chapter, pagges number, ? pages) p 20 20 20 1.0 1.0 User inttend: Find all Book_Chapters B s in‐ cluded in the book: «PProlog and Data abases », assocciated with the pages numberr. Proceedinngs ICWIT 20012 58 ‐ (?Book, year of publication, 2000) ‐ (?Book, book isbn, ?isbn) ‐ (?Book, has publisher, ?publisher) 5 6 5 0.83 1.0 User intend: Find Books published in 2000, associated with its isbn and the publisher. Table 2. Some user queries used for the evaluation 6 Conclusion and Future Works In this paper a novel approach for query approximation based on meaning similarity from WordNet is proposed, using a proposed query language inspired from the con- juncts queries. Using this technique, the naive users are able to write simple queries that not necessarily match the data structure. Our approach can be used as an exten- sion to other approaches for a better understanding of the user query and obtaining results that satisfies the user’s needs. It has been shown that the answers are a set of graph patterns ranked following the approximation level decreasing order. The work, is not considering only RDF graph but it can be seen as a general approach which may be applied to any semi-structured data that is modeled as graph, Future work will consist in applying the proposed approach to specific domains such as geographic, medical, biologic and bibliography, using query interface and building new indexes for scaling a huge number of triples. References 1. A. Poulovassilis and P. T. Wood Combining Approximation and Relaxation in Semantic web Path Queries. In Proc. ISWC, 2010. 2. C. A. Hurtado, A. Poulovassilis, and P. T. Wood. Ranking approximate answers to seman- tic web queries. In Proc. ESWC, pages 263–277, 2009. 3. C. Kiefer, A. Bernstein, and M. Stocker. The fundamentals of iSPARQL: A virtual triple approach for similarity-based semantic web tasks. In Proc. ISWC, pages 295–309, 2007. 4. E. Thomas, J. Z. Pan, and D. H. Sleeman. ONTOSEARCH2: Searching ontologies seman- tically. In Proc. OWLED-2007, CEUR Workshop Proceedings 258. CEUR-WS.org, 2007. 5. Eyal Oren, Christophe Guéret, Stefan Schlobach. Anytime Query Answering in RDF through Evolutionary Algorithms, International Semantic Web Conference pp.98-113, 2008. 6. Fellbaum, C.: WordNet, an electronic lexical database. MIT Press, Cambridge (1998) 7. Fellah, A., Malki, M and Zahaf, A., « Alignement des ontologies : utilisation de WordNet et une nouvelle mesure structurelle CORIA 2008 - Conférence en Recherche d'Informa- tion et Applications, Trégastel, France, 2008. 8. G. Zenz, X. Zhou, E. Minack, W. Siberski, and W. Nejdl. From keywords to semantic queries -Incremental query construction on the Semantic Web. J. Web Sem., 7(3):166–176, 2009. 9. O. Corby, R. Dieng-Kuntz, and C. Faron-Zucker. Querying the Semantic Web with Corese search engine. In Proc. ECAI-2004, pp. 705–709. IOS Press, 2004. 10. Lopez, V., Fernndez, M., Motta, E., Stieler, N.: PowerAqua: Supporting Users in Querying and Exploring the Semantic Web Content. Semantic Web Journal. IOS Press (2011). 11. T. W. Finin, L. Ding, R. Pan, A. Joshi, P. Kolari, A. Java, and Y. Peng. Swoogle: Searching for knowledge on the Semantic Web. In Proc. AAAI-2005, pp. 1682–1683. Proceedings ICWIT 2012 59