Linked Data Based Approach to Similarity Reasoning

Anna Formica

Michele Missikoff

Elaheh Pourabbas

Francesco Taglino

National Research Council, Istituto di Analisi dei Sistemi ed Informatica “A. Ruberti”, Viale Manzoni 30, I-00185 Rome, Italy

SemSim is a semantic similarity reasoning method conceived to be used as a service for the Semantic Web. SemSim is based on a Weighted Reference Ontology, which is used to semantically annotate a collection of digital resources (e.g., documents) to be searched. In this paper we present a new approach to SemSim implementation based on Linked Data, which significantly increases its usability in the Semantic Web.

Similarity Reasoning, Linked Data, Weighted Reference Ontology, Information Content, Digital Resources

1. INTRODUCTION

Innovation generally starts from a creative idea, triggered by a specific problem requiring a non-trivial solution or by an opportunity offered, e.g., by new technological solutions. But the innovative idea is just the starting point of a long undertaking, a seed that needs to be ‘watered’, ‘fertilized’, and cared for in order to grow into a concrete value proposition for the enterprise. A key ‘fertilizer’ for innovation is knowledge. Given a brilliant idea, we need to verify how promising it really is, first of all by checking whether a similar idea has been explored in the past: is our idea really innovative, or is it just an evolutionary step? Are there previous similar experiences? Were they successful or not? If not, what difficulties and obstacles were encountered? These are some of the initial questions we would like to answer. Nowadays there is an emerging movement, referred to as Open Innovation [1], that makes such questions easier to answer. But Open Innovation is not easy to practice, for both socio-economic and technological reasons. The work presented in this paper addresses two problems that fall in the latter group: (i) the difficulty of finding (over the Internet, but also within a single company and its knowledge resources) the knowledge that is relevant to the proposed innovative idea, and of assessing the extent of its relevance; (ii) the possibility of concretely extracting and using the identified knowledge resources, despite the different formats and applications used to generate them.

In this paper we address these two problems by proposing a representation of the notions underlying the SemSim semantic similarity method [2], according to the Linked Data approach [3]. SemSim is a method for retrieving knowledge resources and assessing the degree of similarity between a request and a knowledge resource; it is based on a weighted ontology and semantically annotated resources. Linked Data is an approach that aims at encoding knowledge resources in such a way that they can be easily accessed and reused in different contexts. Furthermore, Linked Data is the basis for including knowledge resources in the vast open knowledge network of the Semantic Web. With this approach we are able to publish SemSim as a service on the Semantic Web that can be freely invoked, as long as the weighted ontology and the resources’ annotations are made accessible in a format compliant with Linked Data.

Regarding related work, there are currently several proposals that follow, on the one hand, the Linked Data principles for defining vocabularies and describing data (e.g., documents, people, etc.) and, on the other hand, semantic similarity approaches. Among the Linked Data initiatives, the most popular is DBpedia¹, which aims at extracting structured content from Wikipedia and representing it in RDF. With regard to semantic similarity approaches, see [2] for a detailed discussion of related work.

Concerning a joint approach, the literature is still quite limited. It is worth mentioning the work in [6], which, however, adopts the Linked Data approach only for representing the resources, whereas we extend it to the semantic search engine itself.

The paper is organized as follows. In Section 2, we briefly recall the SemSim method. In Section 3, the knowledge space underlying our approach is presented, which is organized according to four levels. In Section 4, we propose an RDF representation of the SemSim notions in order to enable Linked Data mechanisms. Finally, the Conclusion follows.

2. THE SEMSIM METHOD: AN OVERVIEW

The Universe of Digital Resources (UDR) represents the knowledge space where SemSim operates; it consists of a collection of digital resources that are semantically annotated using a reference ontology. In our work we address a simplified notion of ontology, Ont, consisting of a set of concepts organized according to an ISA hierarchy. In particular, Ont is a taxonomy defined by the pair Ont = <C, H>, where C is a set of concepts and H is a set of ordered pairs of concepts of C such that if (c_i, c_j) ∈ H, then c_j ISA c_i, i.e., c_i is a more general concept than c_j.

Consider an ontology Ont = <C, H>. A request feature vector (request vector for short) rv is defined by a set of ontology concepts (the order of the concepts is irrelevant), i.e., rv = (c_1, ..., c_n), where c_i ∈ C. Analogously, given a digital resource dr_i ∈ UDR, an ontology feature vector ofv_i associated with dr_i is defined by a set of ontology concepts describing the resource as follows: ofv_i = (c_{i,1}, ..., c_{i,m}), where c_{i,j} ∈ C, j = 1, ..., m.

A Weighted Reference Ontology (WRO) is a pair WRO = <Ont, w>, where w is a function defined on C, such that given a concept c ∈ C, w(c) is a decimal number in the interval [0, 1].
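The definitions above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and method names are hypothetical, the root concept Thing and all weights are invented, and placing Salon directly above ArtGallery and PictureGallery is a reading of the example discussed later, not a copy of Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedReferenceOntology:
    """WRO = <Ont, w>, with Ont = <C, H> (hypothetical class name)."""

    concepts: frozenset   # C, the set of concepts
    isa_pairs: frozenset  # H, pairs (c_i, c_j) meaning "c_j ISA c_i"
    weights: dict         # w, mapping each concept to a weight in [0, 1]

    def __post_init__(self):
        # Every pair in H must relate two concepts of C.
        for general, specific in self.isa_pairs:
            if general not in self.concepts or specific not in self.concepts:
                raise ValueError(f"({general}, {specific}) uses a concept outside C")
        # w must be defined on all of C and stay within [0, 1].
        for concept in self.concepts:
            weight = self.weights.get(concept)
            if weight is None or not 0.0 <= weight <= 1.0:
                raise ValueError(f"w({concept}) must be defined and lie in [0, 1]")

    def feature_vector(self, *concepts):
        """Build an rv or an ofv; order is irrelevant, hence a frozenset."""
        unknown = set(concepts) - self.concepts
        if unknown:
            raise ValueError(f"concepts not in C: {sorted(unknown)}")
        return frozenset(concepts)


# Invented fragment: "Thing" and all weights are placeholders.
wro = WeightedReferenceOntology(
    concepts=frozenset({"Thing", "Salon", "ArtGallery", "PictureGallery"}),
    isa_pairs=frozenset({
        ("Thing", "Salon"),
        ("Salon", "ArtGallery"),
        ("Salon", "PictureGallery"),
    }),
    weights={"Thing": 1.0, "Salon": 0.2, "ArtGallery": 0.1, "PictureGallery": 0.1},
)
example_vector = wro.feature_vector("ArtGallery", "Salon")
```

Using a frozenset for feature vectors reflects the remark that the order of the concepts is irrelevant.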

In [2], we experimented with the method in the tourism domain; therefore, the examples below are drawn from it. We are soon going to adopt it in the BIVEE project, focusing on innovation. In this case the rv will be, e.g., a problem and the UDR will be a knowledge space from which we wish to extract the documents relevant to this problem. In our experiment the digital resources are vacation packages for visiting a European capital, which are offered by a tourism agency. Each package is annotated with one ontology feature vector defined by using the concepts of the WRO. The UDR contains 22 vacation packages, indicated as h1, ..., h22. A fragment of the WRO is shown in Figure 1. For instance, some of the 22 ofvs are recalled below:

ofv1 = (InternationalHotel, FrenchMeal, Cinema, Flight)
ofv2 = (Pension, VegetarianMeal, ArtGallery, ShoppingCenter)
ofv3 = (CountryResort, MediterraneanMeal, Bus)
…
ofv7 = (RegularAccommodation, RegularMeal, Salon, Flight)
…
ofv15 = (InternationalHotel, PictureGallery, Flight)
…

1 http://dbpedia.org

Suppose a tourist wants to visit a European capital and, in order to buy a vacation package, he/she expresses some preferences. For instance, he/she wants to travel by plane, sleep in an international hotel, have international food, and enjoy art galleries. According to SemSim, these preferences can be formulated by using the following request feature vector:

rv = (InternationalHotel, InternationalMeal, ArtGallery, Flight)

The SemSim method allows the user to choose, among the 22 vacation packages offered by the tourism agency, the one that best satisfies his/her needs. In particular, it evaluates the similarity between feature vectors, which is based on the notion of similarity between concepts (features), referred to as consim. Given a WRO, the consim notion relies on the information content approach defined by Lin [4], according to which the information content of a concept c is defined as -log w(c), where w is the weight associated with the concept c in the WRO. Therefore, as the weight of a concept increases its informativeness decreases; hence, the more abstract a concept, the lower its information content.
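As a minimal sketch of the information content idea, the snippet below computes -log w(c) and a Lin-style concept similarity. The text only states that consim relies on Lin's approach, so the exact formula used here (twice the information content of the most specific common ancestor, divided by the sum of the information contents of the two concepts) is an assumption, as are the tree-shaped taxonomy, the root Thing, and all weights.

```python
import math

# Hypothetical weights and single-parent taxonomy (child -> parent).
# Weights are assumed to lie in (0, 1]; a weight of 0 is not handled.
WEIGHTS = {"Thing": 1.0, "Salon": 0.2, "ArtGallery": 0.1, "PictureGallery": 0.1}
PARENT = {"Salon": "Thing", "ArtGallery": "Salon", "PictureGallery": "Salon"}


def information_content(concept):
    """Information content -log w(c): the higher the weight, the lower the value."""
    return -math.log(WEIGHTS[concept])


def ancestors_or_self(concept):
    """Path from a concept up to the root, most specific first."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def least_upper_bound(a, b):
    """Most specific concept that is an ancestor (or self) of both a and b."""
    above_a = set(ancestors_or_self(a))
    for candidate in ancestors_or_self(b):
        if candidate in above_a:
            return candidate
    return None


def consim(a, b):
    """Lin-style similarity between two concepts (assumed formula)."""
    if a == b:
        return 1.0
    shared = least_upper_bound(a, b)
    total = information_content(a) + information_content(b)
    if shared is None or total == 0:
        return 0.0
    return 2 * information_content(shared) / total
```

With these invented weights, ArtGallery and PictureGallery share the information content of Salon, so their similarity lies strictly between 0 and 1, while any concept compared with the root shares no information content.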

On the basis of consim, the SemSim method allows us to compute the semantic similarity between a request vector rv and an ofv, indicated as semsim(rv, ofv). Such a computation essentially focuses on the pairs of concepts, one from the rv and the other from the ofv, that exhibit high affinity, computed according to the so-called stable marriage problem [5]. Given a request vector rv, on the basis of the semsim(rv, ofv) values, a Ranked Solution Vector (RSV) is defined, which provides a ranked list of the ofvs most similar to the rv. In the RSV each ofv is associated with its semsim score, ordered from the highest to the lowest value, down to a threshold. For instance, in our experiment the threshold has been fixed to 0.5, and for the request vector rv above the resulting RSV is the following:

RSV(rv) = <(ofv15, 0.66), (ofv7, 0.60), (ofv1, 0.52)>

Therefore, according to our approach, ofv15, which refers to the h15 resource, is the vacation package most similar to the user preferences among the 22 available. In fact, ofv15 and rv both have the features InternationalHotel and Flight, which match exactly. Furthermore, ofv15 has the feature PictureGallery, which shares the information content of Salon with the feature ArtGallery of rv (see the WRO in Figure 1). The similarity between rv and ofv7 is lower because they have only one feature with an exact match (Flight), while the remaining features share some information content in the taxonomy, and the overall similarity does not exceed that with ofv15. Analogously, although ofv1 has two features matching exactly with rv (InternationalHotel and Flight), its remaining features contribute too little for its overall similarity to exceed that with ofv15.
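The ranking step can be illustrated with the sketch below. It is not the SemSim implementation: instead of the stable-marriage-based pairing cited in the text, it swaps in an exhaustive search over pairings, which is only practical for short vectors. It also assumes that the score is the best total pair similarity divided by the length of the longer vector, and that the threshold is inclusive. The function names and the toy similarity values are hypothetical, so the scores it yields are not the ones reported above.

```python
from itertools import permutations


def semsim(rv, ofv, consim):
    """Best one-to-one pairing score between two feature vectors.

    Assumption: sum of pair similarities for the best pairing,
    normalized by the length of the longer vector.
    consim is assumed to be symmetric.
    """
    shorter, longer = sorted((list(rv), list(ofv)), key=len)
    if not shorter:
        return 0.0
    best = max(
        sum(consim(a, b) for a, b in zip(shorter, candidate))
        for candidate in permutations(longer, len(shorter))
    )
    return best / len(longer)


def ranked_solution_vector(rv, annotations, consim, threshold=0.5):
    """List of (ofv name, score) pairs, highest score first, down to the threshold."""
    scored = [(name, semsim(rv, ofv, consim)) for name, ofv in annotations.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy similarity: exact match scores 1.0; the single partial value is invented.
PARTIAL = {frozenset({"ArtGallery", "PictureGallery"}): 0.7}


def toy_consim(a, b):
    return 1.0 if a == b else PARTIAL.get(frozenset({a, b}), 0.0)


rv = ("InternationalHotel", "InternationalMeal", "ArtGallery", "Flight")
annotations = {
    "ofv1": ("InternationalHotel", "FrenchMeal", "Cinema", "Flight"),
    "ofv2": ("Pension", "VegetarianMeal", "ArtGallery", "ShoppingCenter"),
    "ofv15": ("InternationalHotel", "PictureGallery", "Flight"),
}
rsv = ranked_solution_vector(rv, annotations, toy_consim)
```

Passing consim as a parameter keeps the ranking logic independent of how concept similarity is actually computed from the WRO.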

3. THE KNOWLEDGE SPACE ORGANIZATION

The Linked Data approach requires the adoption of standard vocabularies in representing the information structures to be exposed, shared, and connected to other pieces of data, information, and knowledge in the Semantic Web² [4]. Such vocabularies are described by metadata, classified, and interlinked. In this paper we focus on shared vocabularies and use them to represent the whole knowledge space, called the reference knowledge space, which includes the UDR of a given domain (like Tourism or Business Innovation) but also the meta-knowledge used by the SemSim service. The reference knowledge space, as shown in Figure 2, is composed of the following four levels:

The Vocabulary level defines the terminology used in the Linked Data implementation of SemSim. It is based on the well-known OWL³, RDF⁴, RDFS⁵, XML Schema⁶, and SKOS⁷ vocabularies. In particular, SKOS (Simple Knowledge Organization System) has been used in our work as a common data model for defining the Domain Concepts and the SemSim glossaries. The former organizes the knowledge of the given domain, while the latter models the data structures of the SemSim method introduced in the previous section.

The Knowledge schema level represents the schemas of the main components of the weighted taxonomy (e.g., broader, narrower) and of the SemSim method, namely ofv, rv, and RSV. This level essentially provides further details about the structure and constraints implementing SemSim.

The Knowledge fact level represents the extensions of the elements defined at the schema level. Essentially, they are organized as a weighted hierarchy of concepts and a set of ofvs.

The Domain resource level refers to the resources of the selected UDR on the basis of a specific application domain (e.g., Tourism or Business Innovation), each of which is annotated with one ofv.

2 http://linkeddata.org/
3 http://www.w3.org/2002/07/owl#
4 http://www.w3.org/1999/02/22-rdf-syntax-ns#
5 http://www.w3.org/2000/01/rdf-schema#
6 http://www.w3.org/2001/XMLSchema#
7 http://www.w3.org/2004/02/skos/core#

4. A LINKED DATA SOLUTION

In this section the Knowledge schema and Knowledge fact levels, presented in the previous section, are further detailed and modelled according to RDF. To this end we first provide in Figure 3 a graphical representation of both these levels. Subsequently, the diagrams are represented in RDF Turtle syntax⁸.

In Figure 3(a) the notion of ofv has been generalized in order to model both the annotation vector (AnnotV) and the request vector (RV) as specializations of a generic set of concepts referred to as OFV. In particular, the former is always associated with a resource, whereas the latter is used to express the user preferences. Furthermore, the RSV element (RSVElem) allows us to model the elements of the ranked solution vector (RSV). Each RSVElem has two properties, namely hasAnnotV and hasScore, whose ranges are AnnotV and float, respectively. The weighted taxonomy can be organized as a set of concepts related by the narrower/broader relationship. Each concept in the taxonomy has one property, namely hasWeight, whose range is float. In Figure 3(b) a fragment of the weighted taxonomy in the tourism domain and a possible instance of AnnotV (ofv15) are illustrated.

8 http://www.w3.org/TR/turtle/

In Table 1 we present an RDF representation of the semsim namespace. The SemSim notions are defined in terms of existing RDF-based vocabularies, which are adopted by referring to their namespaces. In particular, we use rdf, rdfs, owl, skos, and xsd as prefixes for the RDF, RDFS, OWL, SKOS, and XML Schema namespaces, respectively. In accordance with Figure 3(a), Table 1 presents the SemSim data structures and properties, where the names of data structures start with an upper-case letter (e.g., AnnotV) and the names of properties start with a lower-case letter (e.g., hasConcept).
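To give a concrete feel for such a vocabulary, the sketch below generates a Turtle document from the class and property names mentioned in the text. It is not a reproduction of Table 1: the semsim namespace URI is a placeholder, and the use of owl:Class, owl:ObjectProperty, and owl:DatatypeProperty, as well as the domain and range chosen for hasConcept, are assumptions.

```python
# Placeholder namespace URI; the actual semsim namespace is not given in the text.
SEMSIM_NS = "http://example.org/semsim#"

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "semsim": SEMSIM_NS,
}

# Class name -> superclass (None when no superclass is stated).
CLASSES = {"OFV": None, "AnnotV": "OFV", "RV": "OFV", "RSVElem": None}

# Property name -> (kind, domain, range). The hasConcept entry is an assumption.
PROPERTIES = {
    "hasConcept": ("owl:ObjectProperty", "semsim:OFV", "skos:Concept"),
    "hasAnnotV": ("owl:ObjectProperty", "semsim:RSVElem", "semsim:AnnotV"),
    "hasScore": ("owl:DatatypeProperty", "semsim:RSVElem", "xsd:float"),
    "hasWeight": ("owl:DatatypeProperty", "skos:Concept", "xsd:float"),
}


def vocabulary_turtle():
    """Serialize the sketched vocabulary as a Turtle string."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    for name, parent in CLASSES.items():
        statement = f"semsim:{name} a owl:Class"
        if parent is not None:
            statement += f" ;\n    rdfs:subClassOf semsim:{parent}"
        lines.append(statement + " .")
    lines.append("")
    for name, (kind, domain, range_) in PROPERTIES.items():
        lines.append(
            f"semsim:{name} a {kind} ;\n"
            f"    rdfs:domain {domain} ;\n"
            f"    rdfs:range {range_} ."
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(vocabulary_turtle())
```

Keeping the class and property specifications in plain dictionaries makes it easy to compare the generated Turtle with the diagram of Figure 3(a).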

To build the WRO, we refer to SKOS, a W3C recommendation designed for the representation of thesauri, classification schemes, taxonomies, or any other type of structured controlled vocabulary. The main reason for adopting SKOS is that it is widely used in the context of the Semantic Web; therefore, it allows us to reuse existing taxonomies and, in turn, to publish the SemSim vocabulary for wide use on the Web. Since SKOS does not support the notion of the weight of a concept, as required in our approach, the hasWeight property has been introduced in the semsim namespace.

In accordance with Figure 3(b), we report in Table 2 an example of a weighted taxonomy and an annotation vector, using the SemSim vocabulary defined above and the Turtle syntax.
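A fragment of this kind could be produced along the following lines. This is a sketch rather than the content of Table 2: both namespace URIs, the ex prefix, and the weights are placeholders, and since the text does not name the property that links an annotation vector to its resource, that link is left out.

```python
# Placeholder namespaces; only the skos and xsd URIs are the standard ones.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix semsim: <http://example.org/semsim#> .\n"
    "@prefix ex: <http://example.org/tourism#> .\n"
)


def concept_turtle(name, weight, broader=None):
    """One weighted SKOS concept, optionally linked to a broader concept."""
    parts = [f"ex:{name} a skos:Concept"]
    if broader is not None:
        parts.append(f"skos:broader ex:{broader}")
    parts.append(f'semsim:hasWeight "{weight}"^^xsd:float')
    return " ;\n    ".join(parts) + " ."


def annotation_turtle(name, concepts):
    """One AnnotV instance listing its concepts through hasConcept."""
    objects = " , ".join(f"ex:{concept}" for concept in concepts)
    return f"ex:{name} a semsim:AnnotV ;\n    semsim:hasConcept {objects} ."


def facts_turtle():
    """Assemble a small fact-level document; all weights are invented."""
    statements = [
        concept_turtle("Salon", 0.2),
        concept_turtle("ArtGallery", 0.1, broader="Salon"),
        concept_turtle("PictureGallery", 0.1, broader="Salon"),
        annotation_turtle("ofv15", ["InternationalHotel", "PictureGallery", "Flight"]),
    ]
    return PREFIXES + "\n" + "\n\n".join(statements) + "\n"


if __name__ == "__main__":
    print(facts_turtle())
```

Only skos:broader is emitted here; the inverse skos:narrower links could be added in the same way if a consumer needs to traverse the taxonomy downwards.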

5. CONCLUSION

In this paper, we have proposed an RDF representation of the notions underlying the SemSim semantic similarity method. SemSim works on a weighted ontology, on the basis of which resources of interest are annotated by using Ontology Feature Vector (OFV) structures. Leveraging the weighted ontology and the OFVs, the method is able to assess the semantic similarity between a given request (request vector, RV) and the available resources, returning a ranked list of best matches (ranked solution vector, RSV). The OFV, RV, and RSV structures have been modelled by reusing very popular Semantic Web vocabularies such as OWL, RDF(S), and XML Schema, while the reference taxonomy has been represented in SKOS. Such an RDF-based representation allows us to define the SemSim structures in accordance with the Linked Data approach. Accordingly, we are able to publish a Linked Data compliant SemSim service. In order to invoke such a web service, a weighted SKOS taxonomy and the annotations of the resources of interest also need to be available on the Web, adopting the semsim namespace specifications described in Section 4.

REFERENCES

1. Chesbrough, H. (2003). Open Innovation: The New Imperative for Creating and Profiting from Technology. Boston: Harvard Business School Press. ISBN: 1-57851-837-7.

2. Formica, A., Missikoff, M., Pourabbas, E., Taglino, F. (2013). Semantic search for matching user requests with profiled enterprises. Computers in Industry, 64: 191-202.

3. Heath, Bizer (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, February 2011, 136 pages (doi:10.2200/S00334ED1V01Y201102WBE001).

4. Lin, D. (1998). An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning. Madison, Wisconsin, USA, Morgan Kaufmann. Shavlik J. W. (ed.), 296-304.

5. Mairson (1992). The Stable Marriage Problem. The Brandeis Review, 12(1).

6. Sheng, Chen, Yu, Feng (2010). Linked data based semantic similarity and data mining. In Proc. of 11th IEEE International Conference on Information Reuse and Integration (IRI 2010). Las Vegas, USA, 104-108.