1st International Workshop on Adaptation, Personalization and REcommendation in the Social-semantic Web (APRESW 2010)

R3 - A Related Resource Recommender

Thomas Kurz, Tobias Bürger and Rolf Sint
Salzburg Research Forschungsgesellschaft
Jakob Haringer Str. 5/3, 5020 Salzburg, Austria
firstname.lastname@salzburgresearch.at

Abstract. Due to the ever-growing amount of content in the Web of Data, the retrieval of relevant information is challenging. Efficient resource recommendation methods that could ease the exploration of the Web of Data are currently lacking. To alleviate this situation, this paper proposes the R3 resource recommendation framework for the retrieval of data in the Linked Open Data (LOD) cloud. It analyses relevant search engines and interlinking frameworks and, based on that analysis, proposes the R3 framework, which is presented in both its conceptual and practical aspects. The framework enables the recommendation of (RDF) resources from the LOD cloud based on textual, structural, or semantic similarity.

1 Introduction

The goal of the Linking Open Data (LOD) community is to bootstrap the Semantic Web (the “Web of Data”) by publishing and interconnecting datasets using RDF [1]. The outcome of this movement is the so-called LOD cloud, which grew to 13.1 billion triples and 142 million RDF links in the last two years and is still growing [2].

As in the traditional, document-centric Web, search and retrieval of information is of utmost importance. Similarly, a big challenge for an end user or application operating on the Web of Data is to find relevant data that serves their specific needs. Although Linked Data browsers and search engines are available to explore content in the LOD cloud, means for ordinary users to issue complex queries, or to have content in the cloud recommended based on particular interests, are currently lacking.
If a user searches for the city of Berlin using a LOD search engine, she can retrieve resources with many properties such as name, description, latitude, longitude, or population density. If she then wants to retrieve related resources, such as a list of cities ranked by geographical distance and/or population density, or resources with a similar structure (like countries or provinces) ranked by the semantic similarity of their textual descriptions, she will fail with current search engines. Similarly, the recommendation of related resources could allow the user to issue a “Query by Example” by defining a kind of fake resource and using it as the query base, which would be a novel way of searching the Web of Data.

In order to alleviate this situation, this paper investigates the state of the art in LOD search engines and interlinking frameworks (Section 2) and, based on that, proposes the R3 resource recommendation framework, which is capable of recommending data from the LOD cloud based on the semantic, structural, or textual similarity of given resources. The framework allows querying for related things in the LOD cloud based on a given resource; its requirements, conceptual architecture, and implementation aspects are described in Section 3. Finally, details are given on how to further advance and implement the framework (Section 4).

7th Extended Semantic Web Conference (ESWC 2010)

2 Resource Discovery and Interlinking in the LOD Cloud

Several applications on the Web allow the user to search or browse the Web of Data. In addition, there are so-called interlinking frameworks that can be used to check the resources of two or more datasets pairwise for similarity.
Because of the analogies to our approach, these frameworks are also considered in the following discussion.

2.1 Browsers and Search Engines

Sindice¹, as described in [3], is a scalable index of the Semantic Web. It crawls the Web for RDF documents and Microformats and indexes the resulting resource URIs, Inverse Functional Properties (IFPs), and keywords. A human user can access these documents through a simple user interface based on the indexes mentioned above.

Sigma² is a semantic information mashup enabled by Sindice rather than a self-contained semantic search service. Nevertheless, it enriches many of its functionalities with additional features. It works as a Web of Data browser in which the user can start from any entity (found by a full-text search) and then browse to the resulting page. The resource index is built from sites that use RDF, RDFa, or Microformats.

The Open Link Search³ lists entities for which a user-defined text pattern occurs in any literal property value or label. It also supports entity URI lookup. The search can be refined by filtering on type, property value, etc. It is also possible to execute SPARQL queries via the SPARQL endpoint. Some demo queries are predefined and can easily be altered via text input fields.

Falcons⁴ is described in [5] as a service for searching and browsing entities on the Semantic Web. It is a keyword-based search engine for Semantic Web URIs and provides different query types for object, concept, and document search. Falcons also supports faceting over types by dynamically recommending ontologies. The recommendation is based on a combination of the TF-IDF technique and the popularity of ontologies.
¹ http://sindice.com/
² http://sig.ma/
³ http://lod.openlinksw.com/
⁴ http://iws.seu.edu.cn/services/falcons/objectsearch/index.jsp

Watson⁵ offers keyword-based querying to obtain a list of URIs of semantic documents in which the keywords appear as identifiers or in literals of classes, properties, and individuals. Search options make it possible to restrict the search space to particular types of entities (classes, properties, or individuals) and to particular elements within the entities (e.g. local name, label, comment).

SWSE⁶ is a search engine for the RDF Web. Like the search engines currently available for the HTML Web, it looks like an ordinary full-text search. However, the information retrieval capabilities of SWSE are much more powerful because of the inherent semantics of RDF and other Semantic Web languages.

Swoogle⁷ allows a user to search through ontologies, instance data, and terms of the Semantic Web. Furthermore, it supports browsing the Web of Data. This search engine also uses an archive functionality to identify and provide different versions of Semantic Web documents.

As described above, each of the considered semantic search services provides a certain set of functionalities. Some of them are shared by two or more services; others are exclusive to one engine. Although some of the services make it possible to search for occurrences of a given resource, none of them makes it possible to find related resources for a resource and its RDF triples, or to define on which triples the relatedness should be calculated. In addition, the search engines do not consider the semantic similarity of queries and content, which could increase the quality of the results.
However, there are applications in the area of the Semantic Web that meet some of these requirements in certain ways: the interlinking frameworks.

2.2 Interlinking Frameworks

Interlinking frameworks for Semantic Web data try to detect and link related resources in different datasets. In [8], several frameworks are compared with respect to their functionality, which led us to conclude that the Silk⁸ approach is closely related to our goals. Silk [7] is a framework for detecting explicit RDF links between data items within different data sources. Using the declarative Silk Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources and, based on arbitrary metrics and aggregation functions, which resources should be declared as related. Silk accesses the interlinking candidates via the SPARQL protocol.

The use of different metrics and aggregation functions for different types of properties can be adopted for our resource recommender. In addition, we can remodel Silk-LSL in some ways (e.g. alternative metrics) and use it as a query syntax. This language also makes it possible to define the appropriate data sources per query.

⁵ http://kmi-web05.open.ac.uk/WatsonWUI/
⁶ http://swse.deri.org/
⁷ http://swoogle.umbc.edu/
⁸ http://www4.wiwiss.fu-berlin.de/bizer/silk/spec/

3 R3 - A Conceptual Overview

Our intent is to build a recommender service that allows querying for related resources from various (predefined) data sources based on a given resource. But what is relatedness, which factors have an impact on it, and how can we implement such a recommender service? These questions are discussed in the following sections.

3.1 Requirements

For RDF resources, various factors define relatedness.
On the one hand, the RDF structure itself (predicates and non-literal objects) reveals something about how similar two resources are. On the other hand, the literal properties can be compared, according to their types, using different metrics. These can be simple ones, like the Euclidean metric for numbers, or more complex ones, like the semantic similarity of texts. A user should be able to specify the factors that are used to find relevant related resources, as well as their impact on the result. In addition, the whole recommendation process should complete in adequate time. We therefore specify the following requirements:

1. Recommend related resources from the LOD cloud based on a given RDF resource.
2. Consider semantic similarity of texts and structural similarity of resources.
3. Offer a comparison mechanism for literals with adjustable metrics.
4. Allow user-defined feature boosts; that is, a certain feature (e.g. property x or structure) has a higher relevance for relatedness than others.
5. Return related resources ordered by relevance.

3.2 Conceptual Architecture

The concept for fulfilling these requirements is illustrated in Figure 1. The data must be fetched from the LOD cloud, combined, and indexed; it should be queryable via a specific search syntax. This process is described more precisely in this section.

Data Consolidation
The service obtains recommendable resources from the Linked Data cloud. Since it should be possible to build a multi-source index, some kind of ontology alignment is required. The data preprocessed in this way is stored directly in the index. The individual data sources must be reindexed at given time intervals.

Resource Recommender Index
A core index can provide many metrics, such as Euclidean distance, date similarity, string equality, etc. Semantic similarity, which can be used to evaluate the semantic distance of texts and RDF structures, is more complex; therefore, we need a supplemental semantic index.
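
As a rough illustration of requirements 3 to 5 and of the simple core metrics named above, the following sketch scores candidate resources against a given resource by combining per-property similarities with user-defined boosts. The property names, the toy values, the mapping of a Euclidean distance d to a similarity via 1/(1+d), and the boost-weighted average are all assumptions made for illustration; the paper does not prescribe any of them.

```python
from typing import Callable, Dict, List, Tuple

Resource = Dict[str, object]                 # property name -> literal value
Metric = Callable[[object, object], float]   # returns a similarity in [0, 1]


def numeric_similarity(a: object, b: object) -> float:
    """Turn the Euclidean distance of two numbers into a similarity in (0, 1]."""
    return 1.0 / (1.0 + abs(float(a) - float(b)))


def string_equality(a: object, b: object) -> float:
    """Return 1.0 for identical strings and 0.0 otherwise."""
    return 1.0 if str(a) == str(b) else 0.0


def relatedness(base: Resource, candidate: Resource,
                features: Dict[str, Tuple[Metric, float]]) -> float:
    """Boost-weighted average of per-property similarities.

    A property missing on either side contributes a similarity of 0.
    """
    total_boost = sum(boost for _, boost in features.values())
    if total_boost == 0:
        return 0.0
    score = 0.0
    for prop, (metric, boost) in features.items():
        if prop in base and prop in candidate:
            score += boost * metric(base[prop], candidate[prop])
    return score / total_boost


def recommend(base: Resource, candidates: Dict[str, Resource],
              features: Dict[str, Tuple[Metric, float]]) -> List[Tuple[str, float]]:
    """Return (URI, score) pairs ordered by descending relatedness."""
    ranked = [(uri, relatedness(base, res, features))
              for uri, res in candidates.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data; neither the URIs nor the values come from a real dataset.
base = {"country": "X", "density": 10.0}
candidates = {
    "ex:a": {"country": "X", "density": 12.0},
    "ex:b": {"country": "Y", "density": 10.0},
    "ex:c": {"country": "Y", "density": 50.0},
}
# The "country" feature is boosted to count twice as much as "density".
features = {
    "country": (string_equality, 2.0),
    "density": (numeric_similarity, 1.0),
}

for uri, score in recommend(base, candidates, features):
    print(uri, round(score, 3))
```

Changing the boosts changes the ranking without touching the metrics, which is the separation that requirement 4 asks for. Semantic similarity of texts and structures would enter this scheme as further metrics backed by the supplemental semantic index.
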
Semantic textual indices (one for each defined property) as well as the semantic structure index (one for the whole dataset) are built from the core index.

Fig. 1: Design and workflow of R3

Resource Query
To get recommended resources based on a given one, the recommender provides a query language with which the user can specify which features should be included in the calculation and according to which metric. Furthermore, how strongly a specific feature impacts the result and how the individual values are combined is configurable per query. To restrict the set of base resources, the user can define the included datasets. The search result is a list of resources ranked by relevance.

3.3 Implementation

The datasets that make up our resource base are retrieved from the LOD cloud via SPARQL. To map resources from different sources, we use a simple mapping table. Complex ontology matching strategies like the one in [9] are also possible. Because of its high scalability, its fast query processing, and the possibility of using integrated functions as well as numerical and token-based comparison, we decided to use SOLR⁹ as our index base. Many metrics, such as Euclidean distance, date similarity, string equality, etc., are provided by or can be directly integrated into the SOLR index. As described, for more complex metrics we need supplemental semantic indices built from the SOLR index.

Text-based Semantic Index
A potential semantic index is a Semantic Vector Index. This approach is based on the Vector Space Model, in which every document is represented as a vector in an n-dimensional term space according to the terms that appear in it.
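
The plain Vector Space Model described here can be sketched in a few lines. The example builds raw term-frequency vectors over a shared vocabulary and compares them with cosine similarity. The tokenizer, the toy documents, and the use of unweighted term counts are assumptions for illustration only; this is not the implementation used by the Semantic Vectors package.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_vector(text: str, vocabulary: List[str]) -> List[float]:
    """Represent a document as term counts, one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocabulary]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy documents standing in for textual property values.
docs: Dict[str, str] = {
    "d1": "capital city of a country",
    "d2": "large city and capital",
    "d3": "river in a valley",
}

# The n dimensions of the term space are the distinct terms of the collection.
vocabulary = sorted({term for text in docs.values() for term in tokenize(text)})
vectors = {name: term_vector(text, vocabulary) for name, text in docs.items()}

print(cosine(vectors["d1"], vectors["d2"]))  # two shared terms
print(cosine(vectors["d1"], vectors["d3"]))  # one shared term
print(cosine(vectors["d2"], vectors["d3"]))  # no shared terms
```

Documents that share more terms end up closer together in the term space, which is the property a recommender can exploit to rank textually related resources.
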
The Semantic Vector Package¹⁰ is able to build such an index (which can be queried for semantically related documents) from the basic Lucene index.

⁹ http://lucene.apache.org/solr/

Structure-based Semantic Index
The semantic vector index can also be used to index the semantic similarity of RDF structures. In this case, it is not the words or text fragments that are integrated into the term model, but the URI, RDF predicates, and non-literal objects of a resource. Figure 2 shows the semantic similarity of a subset of DBpedia resources. To illustrate this semantic space, we built a structure distance matrix of these resources and scaled it to two dimensions using classical multidimensional scaling (MDS) as offered by the R statistics software¹¹. We highlighted resources of different types, which shows that related resources have a similar RDF structure.

Fig. 2: Evaluation of Structure Index

Query Language
As mentioned, the Silk Link Specification Language¹² can be used as inspiration for a query format that fulfills our query requirements and allows the user to specify the base resource (a set of RDF triples or a URI), the considered datasets (SPARQL endpoints used by the data consolidator), the relevant features and their impact, and the applied metrics (taken from a fixed set). Figure 3 shows a simple query example.

4 Further Work

In this paper we described the conceptual architecture of a resource recommendation framework for the Semantic Web. Our future work includes the implementation of this concept and a practical evaluation with real datasets.
In a further step, we plan to optimize the Semantic Vector package, which is used in one core component of the framework, to enhance its scalability and performance. The resulting recommender will be integrated into the KiWi¹³ system.

¹⁰ http://code.google.com/p/semanticvectors/
¹¹ http://www.r-project.org/
¹² http://www4.wiwiss.fu-berlin.de/bizer/silk/spec/#specification
¹³ http://kiwi-project.eu/

Fig. 3: Sample for a Recommender Query

References

1. C. Bizer et al. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS), Vol. 5, Issue 3, 2009.
2. Linking Open Data: W3C SWEO Community Project. http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData, 2010.
3. E. Oren et al. Sindice.com: a document-oriented lookup index for open linked data. Int. J. Metadata, Semantics and Ontologies, Vol. 3, No. 1, 2008.
4. DERI Galway: Sindice API for Query Services. http://sindice.com/developers/api, 2008-2009.
5. G. Cheng and Y. Qu. Searching linked objects with Falcons: Approach, implementation and evaluation. International Journal on Semantic Web and Information Systems 5(3):49-70, September 2009.
6. W.B. Frakes and R.A. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, New Jersey, 1992.
7. J. Volz et al. SILK - A Link discovery framework for the Web of Data. Linked Data on the Web (LDOW2009), Madrid, 2009.
8. F. Scharffe and J. Euzenat. Alignments for data interlinking. http://melinda.inrialpes.fr, 2009.
9. C. A. Curino et al. X-SOM: A Flexible Ontology Mapper. 18th International Conference on Database and Expert Systems Applications (DEXA 2007), 2007.