Imprecise SPARQL: Towards a Unified
    Framework for Similarity-Based Semantic Web
                       Tasks

                                 Christoph Kiefer

                 Department of Informatics, University of Zurich,
                Binzmuehlestrasse 14, CH-8050 Zurich, Switzerland
                             kiefer@ifi.unizh.ch


      Abstract. This proposal explores a unified framework to solve Semantic
      Web tasks that often require similarity measures, such as RDF retrieval,
      ontology alignment, and semantic service matchmaking. Our aim is to
      see how far it is possible to integrate user-defined similarity functions
      (UDSF) into SPARQL to achieve good results for these tasks. We present
      some research questions, summarize the experimental work conducted so
      far, and present our research plan that focuses on the various challenges
      of similarity querying within the Semantic Web.


1    Motivation

Semantic Web tasks such as ontology alignment, semantic service matchmaking,
and similarity-based retrieval depend on some notion of similarity (at least if
they are not solely based on logic). Therefore, researchers still try to find sound
user-defined similarity functions (UDSF) to achieve good results for these tasks.
Finding good similarity functions is, however, data- and context-dependent, and
needs to be reconsidered every time new data is inspected. Nonetheless, good
UDSFs are crucial for the success of the above-mentioned Semantic Web tasks.
    Furthermore, in recent years, query languages for the Semantic Web such
as RDQL and SPARQL have gained increasing popularity. The current W3C
candidate recommendation of SPARQL, however, does not support UDSF to
analyze the data during query processing. The goal of this project is to overcome
this limitation and to develop a unified framework based on SPARQL to solve
similarity-dependent Semantic Web tasks. The proposed iSPARQL framework
should be easy to use and easily extendable to allow for user-defined, task-specific
similarity functions. The “i” stands for imprecise indicating that two or more
resources are compared by using similarity measures.
    We strive for a robust implementation of similarity querying for the Semantic
Web and its integration into SPARQL. The proposed iSPARQL approach should
have a high degree of flexibility in terms of customization to the actual Semantic
Web task.
2   Related Work
RDF Retrieval. Siberski et al. [19] propose SPARQL extensions allowing the
user to query the Semantic Web with preferences. New keywords (PREFERRING,
CASCADE) are added to the official SPARQL grammar in order to favor query
answers which match user-defined preference criteria. Finally, the answers which
are not dominated by any other answers (optimal according to the defined pref-
erence dimensions) are returned to the user.
Ontology Alignment. The task of ontology alignment (aka ontology map-
ping/matching) is a heavily researched field within the Semantic Web. Noy
and Musen [15] present the PROMPT framework – a suite of tools including
iPROMPT and ANCHORPROMPT – simplifying the comparing, aligning, and
merging of ontologies of different origins. Furthermore, Doan et al. [5] propose
the GLUE system that assists the user in finding mappings between ontologies
using techniques from machine learning. A different methodology is proposed
by Ehrig and Staab in [6]: based on QOM, ontologies can be aligned on differ-
ent layers focusing on different (modeling) aspects of ontologies. Euzenat and
Valtchev [7] propose an approach that is based on a specialized similarity mea-
sure to compare OWL-lite ontologies. Last, in a more recent paper, Tous and
Delgado [20] map nodes of ontologies to matrices which capture the relationships
of the mapped nodes among each other. Finally, a graph matching algorithm is
applied to find mappings between the ontologies under comparison.
Matchmaking/Discovery. Klusch et al. [14] propose OWLS-MX to perform
service matchmaking which adopts both, pure logic-based and Information Re-
trieval (IR) based techniques for the needs of hybrid semantic service match-
making. Furthermore, Hau et al. [10] propose a similarity measure to compare
Semantic Web services expressed in the OWL-S language. In addition, Jaeger
et al. [11] present an approach for matching service inputs, service outputs, a
service category, and user-defined service matching criteria. The four individual
matching scores are aggregated to result in an overall matchmaking score.
Query Optimization. Query optimization strategies have been developed to
reduce the complexity of Semantic Web queries to boost their runtime perfor-
mance. Ruckhaus et al. [16] propose to estimate the cost and cardinality of
individual query predicates based on selectivity estimations taken from [18].
Similarity Joins (Data Integration). To perform data integration, Cohen
[4] presents WHIRL and the notion of similarity joins by which data is joined
on similarity rather than on equality. In WHIRL, the TF-IDF weighting schema
from IR [1] is applied together with the cosine similarity measure to determine
the affinity of text. Similar approaches are proposed by Gravano et al. employing
string joins [8] and text joins [9] in order to correlate information from different
databases and web sources respectively.

3   General Problem Areas/Gaps
Numerous Semantic Web tasks rely on some notion of similarity, either to
compare ontologies (for alignment and/or integration), or to compare services
(for matchmaking and/or discovery), or to compare resources (for querying and
similarity-based retrieval) among others. All of the approaches presented in Sec-
tion 2 tackle one of these tasks individually (i.e., in their own specific way). None
of the approaches present a unified framework to solve them all. We made the
following observations:

 – To solve these tasks, Semantic Web researchers still try to find sound user-
   defined similarity functions (UDSF), which are crucial for the success of
   these tasks. However, good similarity functions are data- and context-de-
   pendent, and generally not easy to find.
 – SPARQL in combination with UDSFs could be used to solve individual tasks.
   However, traditional SPARQL does not support querying ontologies with
   UDSF. It is not clear what the optimal solution would look like: an extension
   of the official SPARQL grammar or the exploitation of “magic properties”
   (aka virtual triples or property functions) as supported in ARQ.1
 – The semantics and complexity of UDSF-extended SPARQL queries are un-
   clear. Hence, they should be elaborated and formally studied.
 – UDSF statements add an additional layer of complexity to SPARQL queries.
   Therefore, an approach for optimizing queries containing UDSFs should be
   provided. This is particularly important when executing web-scale queries.
   In other words: do UDSF-queries have the potential to scale to the web?


4     Research Plan

4.1   Choice of Datasets and Evaluation Strategy

So far we have experimented with the two matchmaking/retrieval test collections
OWLS-TC2 and the OWL MIT Process Handbook3 . For our preliminary opti-
mization experiments we used SwetoDblp4 , which focuses on bibliography infor-
mation of computer science publications. Furthermore, we worked with EvoOnt5
– a set of ontologies to model the domain of object-oriented software source code.
We will use these datasets for the evaluations of our proposed unified framework.


4.2   Current State of Our Research

RDF Retrieval. iRDQL [2] is our extension of traditional RDQL with similarity
joins to determine the similarity of Semantic Web resources.6 A limitation of
iRDQL is that it allows to utilize only one similarity measure per query and
1
  http://jena.sourceforge.net/ARQ/
2
  http://projects.semwebcentral.org/projects/owls-tc/
3
  http://www.ifi.unizh.ch/ddis/mitph.html
4
  http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
5
  http://www.ifi.unizh.ch/ddis/evoont.html
6
  All similarity measures are implemented in SimPack, our generic library of similarity
  measures for the use in ontologies (http://www.ifi.unizh.ch/ddis/simpack.html).
it does not perform any query optimization. A demonstration of our current
prototype implementation iSPARQL is available at http://www.ifi.unizh.
ch/ddis/isparql.html. We will use this prototype as a starting point (and
benchmark) for the new framework to be accomplished within this PhD thesis.
Matchmaking/Discovery. In [12], the applicability of our iSPARQL prototype
is evaluated for the task of Semantic Web service discovery within the OWL MIT
Process Handbook.
Query Optimization. Our first steps toward Semantic Web query optimiza-
tion are presented in [3]. The proposed OptARQ approach investigates SPARQL
query optimization by means of a rule-based query optimization engine. Opti-
mization techniques for UDSF-queries, however, are not covered by OptARQ.
Analyzing Software Repositories. To highlight the benefits and applicability
of the proposed unified framework to different, initially non-Semantic Web tasks,
we realized the Coogle [17] and EvoOnt [13] projects for the tasks of software
evolution analysis and visualization as well as design flaws detection.


4.3   Our Approach – Next Steps

The aim of this work is the design, specification, implementation, and evaluation
of a unified framework for similarity-based Semantic Web tasks. There are several
goals to achieve: the first goal consists of a detailed revision of our preliminary
work. This will answer the question if the virtual triple approach taken so far
is sufficient to solve the remaining challenges of such a unified framework. The
second goal is the formal elaboration of the iSPARQL grammar, its semantics
and complexity. As a third goal, we investigate query optimization techniques
to boost the performance of UDSF-queries. Finally, the whole iSPARQL model
and implementation will be evaluated for applicability to different application
tasks (see Section 2).
    To achieve the goals, the following steps are planned: a revision of the current
prototype with special attention to its usability, flexibility, customizability, and
scalability; the specification of the iSPARQL model, particularly the complexity
and semantics of UDSF-queries; the implementation of the unified framework;
the investigation of UDSF-query optimization techniques; and an evaluation of
the applicability to different similarity-based Semantic Web tasks in terms of
testing, usability, customization, and performance measurement.


5     Conclusions

In this proposal we outlined the need of a unified framework to solve similarity-
based Semantic Web tasks, such as ontology alignment, service matchmaking,
and RDF retrieval. Our approach extends traditional SPARQL with user-defined
similarity functions (UDSF). The semantics and complexity of SPARQL-based
similarity queries will be formally elaborated and query optimization techniques
proposed. This systematical assessment will answer the questions of what is
the range of tasks that can be solved with the iSPARQL system, what is the
performance to solve these tasks, and what is its potential to scale to the web.
It is important to realize that these tasks provide a kind of “stress test” for the
usefulness of our unified framework.

References
 1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison
    Wesley, 1999.
 2. A. Bernstein and C. Kiefer. Imprecise RDQL: Towards Generic Retrieval in On-
    tologies Using Similarity Joins. In SAC 2006, pages 1684–1689.
 3. A. Bernstein, C. Kiefer, and M. Stocker. OptARQ: A SPARQL Optimization
    Approach based on Triple Pattern Selectivity Estimation. Technical Report ifi-
    2007.03, Department of Informatics, University of Zurich, 2007.
 4. W. W. Cohen. Data Integration Using Similarity Joins and a Word-Based Infor-
    mation Representation Language. TOIS, 18(3):288–321, 2000.
 5. A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to
    match ontologies on the Semantic Web. VLDB Journal, 12(4):303–319, 2003.
 6. M. Ehrig and S. Staab. QOM - Quick Ontology Mapping. In GI Jahrestagung,
    pages 356–361, 2004.
 7. J. Euzénat, D. Loup, M. Touzani, and P. Valtchev. Ontology Alignment with OLA.
    In ISWC 2004, pages 333–337.
 8. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and
    D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB
    Journal, pages 491–500, 2001.
 9. L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text Joins in an RDBMS
    for Web Data Integration. In WWW 2003, pages 90–101.
10. J. Hau, W. Lee, and J. Darlington. A Semantic Similarity Measure for Semantic
    Web Services. In WWW 2005.
11. M. C. Jaeger, G. Rojec-Goldmann, G. Mühl, C. Liebetruth, and K. Geihs. Ranked
    Matching for Service Descriptions using OWL-S. In KiVS 2005, Informatik Aktuell,
    pages 91–102.
12. C. Kiefer, A. Bernstein, H. J. Lee, M. Klein, and M. Stocker. Semantic Process
    Retrieval with iSPARQL. In ESWC 2007.
13. C. Kiefer, A. Bernstein, and J. Tappolet. Mining Software Repositories with iS-
    PARQL and a Software Evolution Ontology. In MSR 2007.
14. M. Klusch, B. Fries, and K. Sycara. Automated Semantic Web Service Discovery
    with OWLS-MX. In AAMAS 2006, pages 915–922.
15. N. F. Noy and M. A. Musen. The PROMPT Suite: Interactive Tools for Ontology
    Merging and Mapping. IJHCS, 59(6):983–1024, 2003.
16. E. Ruckhaus, E. Ruiz, and M.-E. Vidal. Query Optimization in the Semantic Web.
    In ALPSWS 2006.
17. T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer. Detecting Similar Java Classes
    Using Tree Algorithms. In MSR 2006.
18. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price.
    Access path selection in a relational database management system. In SIGMOD
    1979, pages 23–34.
19. W. Siberski, J. Z. Pan, and U. Thaden. Querying the Semantic Web with Prefer-
    ences. In ISWC 2006.
20. R. Tous and J. Delgado. A Vector Space Model for Semantic Similarity Calculation
    and OWL Ontology Alignment. In DEXA 2006.