10th International Workshop on Uncertainty
Reasoning for the Semantic Web
(URSW 2014)
Proceedings
edited by Fernando Bobillo
Rommel Carvalho
Davide Ceolin
Paulo C. G. da Costa
Claudia d’Amato
Nicola Fanizzi
Kathryn B. Laskey
Kenneth J. Laskey
Thomas Lukasiewicz
Trevor Martin
Matthias Nickles
Michael Pool
Tom De Nies
Olaf Hartig
Paul Groth
Stephen Marsh
Riva del Garda, Italy, October 19, 2014
collocated with
the 13th International Semantic Web Conference
(ISWC 2014)
Foreword
This volume contains the papers presented at the 10th International Workshop
on Uncertainty Reasoning for the Semantic Web (URSW 2014), held as a part of the
13th International Semantic Web Conference (ISWC 2014) at Riva del Garda, Italy,
October 19, 2014. Four technical papers and one position paper were accepted at URSW 2014. Furthermore, there was a special session on Methods for Establishing Trust of (Open) Data (METHOD 2014), including one technical paper and two position papers.
All the papers were selected in a rigorous reviewing process, where each paper was
reviewed by three program committee members.
The International Semantic Web Conference is a major international forum for
presenting visionary research on all aspects of the Semantic Web. The International
Workshop on Uncertainty Reasoning for the Semantic Web provides an opportunity
for collaboration and cross-fertilization between the uncertainty reasoning commu-
nity and the Semantic Web community.
We wish to thank all authors who submitted papers and all workshop participants for the fruitful discussions. We would also like to thank the program committee members for contributing their time and expertise in carefully reviewing the submissions.
September 2014
Fernando Bobillo
Rommel Carvalho
Davide Ceolin
Paulo C. G. da Costa
Claudia d’Amato
Nicola Fanizzi
Kathryn B. Laskey
Kenneth J. Laskey
Thomas Lukasiewicz
Trevor Martin
Matthias Nickles
Michael Pool
Tom De Nies
Olaf Hartig
Paul Groth
Stephen Marsh
URSW 2014
Workshop Organization
Organizing Committee
Fernando Bobillo (University of Zaragoza, Spain)
Rommel Carvalho (Universidade de Brasilia, Brazil)
Paulo C. G. da Costa (George Mason University, USA)
Davide Ceolin (VU University Amsterdam, The Netherlands)
Claudia d’Amato (University of Bari, Italy)
Nicola Fanizzi (University of Bari, Italy)
Kathryn B. Laskey (George Mason University, USA)
Kenneth J. Laskey (MITRE Corporation, USA)
Thomas Lukasiewicz (University of Oxford, UK)
Trevor Martin (University of Bristol, UK)
Matthias Nickles (National University of Ireland, Ireland)
Michael Pool (Goldman Sachs, USA)
Program Committee
Fernando Bobillo (University of Zaragoza, Spain)
Rommel Carvalho (Universidade de Brasilia, Brazil)
Davide Ceolin (VU University Amsterdam, The Netherlands)
Paulo C. G. da Costa (George Mason University, USA)
Fabio Gagliardi Cozman (Universidade de São Paulo, Brazil)
Claudia d’Amato (University of Bari, Italy)
Nicola Fanizzi (University of Bari, Italy)
Marcelo Ladeira (Universidade de Brasília, Brazil)
Kathryn B. Laskey (George Mason University, USA)
Kenneth J. Laskey (MITRE Corporation, USA)
Thomas Lukasiewicz (University of Oxford, UK)
Trevor Martin (University of Bristol, UK)
Alessandra Mileo (DERI Galway, Ireland)
Matthias Nickles (National University of Ireland, Ireland)
Jeff Z. Pan (University of Aberdeen, UK)
Rafael Peñaloza (TU Dresden, Germany)
Michael Pool (Goldman Sachs, USA)
Livia Predoiu (University of Mannheim, Germany)
Guilin Qi (Southeast University, China)
David Robertson (University of Edinburgh, UK)
Thomas Scharrenbach (University of Zurich, Switzerland)
Giorgos Stoilos (National Technical University of Athens, Greece)
Umberto Straccia (ISTI-CNR, Italy)
Matthias Thimm (Universität Koblenz-Landau, Germany)
Peter Vojtáš (Charles University Prague, Czech Republic)
METHOD 2014
Organizing Committee
Davide Ceolin (VU University Amsterdam, The Netherlands)
Tom De Nies (Ghent University, Belgium)
Paul Groth (VU University Amsterdam, The Netherlands)
Olaf Hartig (University of Waterloo, Canada)
Stephen Marsh (University of Ontario Institute of Technology, Canada)
Program Committee
Edzard Höfig (Beuth University of Applied Sciences Berlin, Germany)
Rino Falcone (Institute of Cognitive Sciences and Technologies, Italy)
Jean-Francois Lalande (INSA Centre Val de Loire, France)
Zhendong Ma (Austrian Institute of Technology, Austria)
Uwe Nestmann (Technische Universität Berlin, Germany)
Florian Skopik (Austrian Institute of Technology, Austria)
Gabriele Lenzini (University of Luxembourg, Luxembourg)
Erik Mannens (Ghent University, Belgium)
Matthias Flügge (Fraunhofer FOKUS, Germany)
Trung Dong Huynh (University of Southampton, UK)
Paolo Missier (Newcastle University, UK)
Khalid Belhajjame (Paris-Dauphine University, France)
James Cheney (University of Edinburgh, UK)
Christian Bizer (Mannheim University Library, Germany)
Yolanda Gil (University of Southern California, USA)
Daniel Garijo (Universidad Politécnica de Madrid, Spain)
Tim Lebo (Rensselaer Polytechnic Institute, USA)
Simon Miles (King's College London, UK)
Andreas Schreiber (German Aerospace Center, Cologne, Germany)
Kieron O’Hara (University of Southampton, UK)
Tim Storer (University of Glasgow, Scotland)
Babak Esfandiari (Carleton University, Canada)
Christian Damsgaard Jensen (Technical University of Denmark, Denmark)
Khalil El-Khatib (University of Ontario Institute of Technology, Canada)
Kai Eckert (Mannheim University Library, Germany)
Table of Contents
URSW 2014 Technical Papers
– A Probabilistic OWL Reasoner for Intelligent Environments 1-12
David Ausín, Diego López-De-Ipiña and Federico Castanedo
– Learning to Propagate Knowledge in Web Ontologies 13-24
Pasquale Minervini, Claudia d’Amato, Nicola Fanizzi, Volker Tresp
– Automated Evaluation of Crowdsourced Annotations in the Cultural
Heritage Domain 25-36
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin, Wan
Fokkink
– Probabilistic Relational Reasoning in Semantic Robot Navigation 37-48
Walter Toro, Fabio Cozman, Kate Revoredo, Anna Helena Reali
Costa
URSW 2014 Position Papers
– Towards a Distributional Semantic Web Stack 49-52
André Freitas, Edward Curry, Siegfried Handschuh
Special Session: Methods for Establishing Trust of (Open) Data
Overview
– Overview of METHOD 2014: the 3rd International Workshop on
Methods for Establishing Trust of (Open) Data 53-54
Tom De Nies, Davide Ceolin, Paul Groth, Olaf Hartig, Stephen Marsh
Research Papers
– Hashing of RDF Graphs and a Solution to the Blank Node Problem 55-66
Edzard Hoefig, Ina Schieferdecker
Short Papers
– Rating, Recognizing and Rewarding Metadata Integration and Sharing
on the Semantic Web 67-72
Francisco Couto
– Towards the Definition of an Ontology for Trust in (Web) Data 73-78
Davide Ceolin, Archana Nottamkandath, Wan Fokkink, Valentina
Maccatrozzo
A Probabilistic OWL Reasoner for Intelligent
Environments
David Ausín¹, Diego López-de-Ipiña¹, and Federico Castanedo²
¹ Deusto Institute of Technology, DeustoTech, University of Deusto, Avda. de las Universidades 24, 48007 Bilbao, Spain. {david.ausin, dipina}@deusto.es
² Wise Athena, 71 Stevenson Street, San Francisco, USA. fcastanedo@wiseathena.com
This work is supported by the Spanish MICINN project FRASEWARE (TIN2013-47152-C3-3-R).
Abstract. OWL ontologies have gained great popularity as a context
modelling tool for intelligent environments due to their expressivity.
However, they present some disadvantages when it is necessary to deal with uncertainty, which is common in our daily life and affects the decisions that we take. To overcome this drawback, we have developed a novel framework to compute fact probabilities from the axioms in an OWL ontology. This proposal comprises the definition and description of our probabilistic ontology, which extends OWL 2 DL with a new layer to model uncertainty. With this work we aim to overcome OWL's limitations in reasoning with uncertainty and to provide a framework that can be used in intelligent environments.
Keywords: OWL, Bayesian networks, probability, probabilistic ontology
1 Introduction
In Ambient Intelligence applications, context can be defined as any data which
can be employed to describe the state of an entity (a user, a relevant object, the
location, etc.) [6]. How this information is modelled and reasoned over time is
a key component of an intelligent environment in order to assist users in their
daily activities or execute the corresponding actions. An intelligent environment
is any space in which daily activities are enhanced by computation [4].
One of the most popular techniques for context modelling is OWL ontologies
[18]. They have been employed in several Ambient Intelligence projects such as
SOUPA[3], CONON[20] or CoDAMoS [15], to name a few.
OWL is the most common way to encode description logics in real-world applications. However, when the domain information contains uncertainty, the use of OWL ontologies is less suitable [11]. The need to handle uncertainty has created a
growing interest in the development of solutions to deal with it.
As in other domains, uncertainty is also present in Ambient Intelligence [16] and affects the decision-making process. This task requires context information
in order to respond to the users’ needs. Data in Ambient Intelligence applications
are provided by several sensors and services in real time. Unfortunately, these
sensors can fail, run out of battery or be forgotten by the user, in the case of
wearable devices. On the other hand, the services can also be inaccessible due
to network connectivity problems or technical difficulties on the remote server.
Nonetheless, the unavailable information may be essential to answer the user's requests correctly.
For this reason, we present a novel approach to deal with uncertainty in
intelligent environments. This work proposes a method to model uncertainty that combines OWL ontologies with Bayesian networks. The rest of this article is organized as follows. The next section describes the problem that we address. Section 3 explains the semantics and syntax of our proposal. Section 4 reviews related work and compares it with our approach. Finally, section 5 summarizes this work and outlines future work.
2 Description of the Problem
In intelligent environments, incomplete context information may be produced by several causes:
– Sensors that have run out of batteries. Several sensors, such as wearable
devices, depend on batteries to work.
– Network problems. Sensors, actuators and computers involved in the envi-
ronment sensing and monitoring are connected to local networks that can
suffer network failures. In these cases, the context information may be lost,
although the sensors, actuators and computers are working properly.
– Remote services’ failures. Some systems rely on remote services to provide
a functionality or to gather context information.
– A system device stops working. Computers, sensors and actuators can suffer software and hardware failures that hamper their proper operation.
When one of these issues occurs, the OWL reasoner will infer conclusions that are insufficient to attend to the user's needs. Besides, taking these factors into account can improve several tasks carried out in intelligent environments, such as ontology-based activity recognition. For instance, WatchingTVActivity can be
defined as an activity performed by a Person who is watching the television in
a room:
WatchingTVActivity ≡ ∃isDoneBy.(Person ⊓ ∃isIn.(Room ⊓ ∃containsAppliance.(TV ⊓ ∃isSwitched.{true})))   (1)
If the user is watching the television and the system receives the values of all
the sensors, then it is able to conclude that the user’s current activity is of the
type WatchingTVActivity. In contrast, if the value of the presence sensor is not
available, then it is not possible to infer that the user is watching the television.
In addition, sometimes there is no rule of thumb to classify an individual
as a member of a class. For instance, we can classify the action that the system
has to perform regarding the current activity of the user. Thus, we can define
that the system should turn off the television, when the user is not watching it:
TurnOffTV ≡ ∃requiredBy.(Person ⊓ ∀isDoing.¬WatchingTVActivity ⊓ ∃hasAppliance.(TV ⊓ ∃isSwitched.{true}))   (2)
However, this concept definition does not accurately model reality. An action can fulfil every condition expressed in the TurnOffTV definition even though the television should not be turned off. This situation may occur, for example, when the user goes to the toilet or answers a phone call in another room.
In cases in which the domain information comes with quantitative uncertainty or vagueness, ontology languages are less suitable [11]. Uncertainty is often used as an umbrella term for different aspects of imperfect knowledge, such as vagueness or incompleteness, and uncertainty reasoning is defined as the collection of methods to model and reason with knowledge for which boolean truth values are unknown, unknowable or inapplicable [19]. Other authors [1, 11] consider that there are enough differences to distinguish between uncertain and vague knowledge: uncertain knowledge is comprised of statements that are either true or false, but whose truth we cannot establish due to our lack of knowledge, whereas vague knowledge is composed of statements that are true to a certain degree due to vague notions.
In our work, we are more interested in the uncertainty caused by the lack of information than in vague knowledge. For this reason, probabilistic approaches are more suitable for our problem.
3 Turambar Solution
Our proposal, called Turambar, combines a Bayesian network model with an
OWL 2 DL ontology in order to handle uncertainty. A Bayesian network [13]
is a graphical model that is defined as a directed acyclic graph. The nodes in
the model represent the random variables and the edges define the dependencies
between the random variables. Each variable is conditionally independent of its
non-descendants given the value of its parents.
Turambar is able to calculate the probability associated with class, object property and data property assertions. These probabilistic assertions have only
two constraints:
– The class expression employed in the class assertion should be a class.
– For positive and negative object property assertions, the object property
expression should be an object property.
However, these limitations can be worked around by declaring a class equivalent to a class expression, or an object property equivalent to an inverse object property.
Examples of probabilistic assertions that can be calculated with Turambar are:
isIn(John, Bedroom1) 0.7   (3)
WatchingTVActivity(Action1) 0.8   (4)
isSwitched(TV1, true) 1   (5)
The probabilistic object property assertion expressed in (3) states that John is in Bedroom1 with a probability of 0.7. The probabilistic class assertion (4) describes that Action1 is a member of the class WatchingTVActivity with a probability of 0.8. Finally, the probabilistic data property assertion (5) defines that the television, TV1, is on with a probability of 1.0. The probability associated with these assertions is calculated through Bayesian networks that describe how other property and class assertions influence each other. In Turambar, the probabilistic relationships are defined by an expert; in other words, the Bayesian networks must be built by hand, since learning the structure of a Bayesian network is beyond the scope of this work.
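To make this concrete, the following minimal Python sketch (with entirely hypothetical variables and conditional probabilities, not the authors' model) shows how a hand-built network can still return a probability for a WatchingTVActivity assertion when the presence sensor value is missing, by marginalizing over the unobserved variable:

```python
# Hypothetical Bayesian network fragment (invented variables and CPTs, not the
# authors' model): Presence -> WatchingTV <- TVOn.
P_PRESENCE = {True: 0.6, False: 0.4}   # prior over the missing presence sensor
P_WATCHING = {                          # P(WatchingTV=true | Presence, TVOn)
    (True, True): 0.9, (True, False): 0.05,
    (False, True): 0.1, (False, False): 0.01,
}

def prob_watching_tv(tv_on, presence=None):
    """P(WatchingTV = true | evidence), marginalizing Presence if unobserved."""
    if presence is not None:
        return P_WATCHING[(presence, tv_on)]
    return sum(P_PRESENCE[p] * P_WATCHING[(p, tv_on)] for p in (True, False))

print(prob_watching_tv(tv_on=True, presence=True))  # all sensors available: 0.9
print(prob_watching_tv(tv_on=True))                 # presence missing: 0.58
```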
3.1 Turambar Functionality Definition
The classes, object properties and data properties of the OWL 2 DL ontology
involved in the probabilistic knowledge are connected to the random variables
defined in the Bayesian network model. For example, the OWL class Watch-
ingTVActivity is connected to at least one random variable, in order to be able
to calculate probabilistic assertions about that class. The set of data properties, object properties and classes that are linked to random variables is called Vprob, and a generic member of Vprob is denoted vprobi, such that vprobi ∈ Vprob.
In Turambar, every random variable (RV) is associated with a member of Vprob, and every RV's domain is composed of a set of functions that determine the values that the random variable can take, such that Val(RV) = {f1, f2, ..., fn} and fi ∈ Val(RV). These functions require a property or class and an individual to calculate the probabilistic assertion, i.e. fi : (a1, ex) → result, where a1 is an OWL individual; ex is a class, data property or object property; and result is a class assertion, object property assertion, data property assertion or void (no assertion). If every function in the domain of a random variable returns void, the random variable is auxiliary. In contrast, if any fi in the domain of a random variable returns a probability associated with an assertion, then the random variable is called final.
For instance, the data property lieOnBedTime is linked to a random variable named SleepTime whose domain is composed of two functions: f1, which checks whether the user has been sleeping for less than 8 hours, and f2, which checks whether the user has been sleeping for more than 8 hours. Neither function is able to generate assertions, so the random variable SleepTime is auxiliary. By contrast, the WatchingTVActivity class is linked to a random variable called WatchingTV whose domain comprises the function f3, which checks whether an individual is a member of the class WatchingTVActivity (e.g. WatchingTVActivity(Activity1) 0.8), and the function f4, which checks whether an individual is a member of the complement of WatchingTVActivity.
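As a rough illustration of this distinction (a simplified data model with hypothetical names, not the Turambar implementation), a state function can be seen as a callable that either returns an assertion or void:

```python
from typing import Callable, List, Optional

# Simplified, hypothetical data model (not the actual Turambar API): a state
# function maps an OWL individual and a class/property name to an assertion
# string, or to None (void) when it only contributes evidence.
StateFn = Callable[[str, str], Optional[str]]

def is_member(individual: str, cls: str) -> Optional[str]:
    return f"{cls}({individual})"          # produces a class assertion

def slept_less_than_8_hours(individual: str, prop: str) -> Optional[str]:
    return None                            # void: evidence only, no assertion

class RandomVariable:
    def __init__(self, name: str, domain: List[StateFn]):
        self.name, self.domain = name, domain

    def is_auxiliary(self, individual: str, target: str) -> bool:
        # Auxiliary iff every function in the domain returns void; otherwise final.
        return all(f(individual, target) is None for f in self.domain)

sleep_time = RandomVariable("SleepTime", [slept_less_than_8_hours])
watching_tv = RandomVariable("WatchingTV", [is_member])
print(sleep_time.is_auxiliary("John", "lieOnBedTime"))               # True
print(watching_tv.is_auxiliary("Activity1", "WatchingTVActivity"))   # False
```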
It is also important to remark that a vprobi can be referenced from several random variables. For example, TurnOffTV depends on the user's impairments, so if the blind user is also deaf, it is more likely that the television needs to be turned off. Additionally, having an impairment also affects the probability of having another impairment: deaf people have a higher probability of also being mute. In this case, we can link the hasImpairment object property with two random variables in order to model it.
Apart from the conditional probability distribution, connected nodes may have an associated node context. The context defines how different random variables are related to each other and the conditions that must be fulfilled. This context establishes an unequivocal relationship in which every individual involved in that relationship should be gathered before calculating the probability of an assertion. If the relationship is not fulfilled, the probabilistic query cannot be answered. For example, to estimate the probability for TurnOffTV, the reasoner needs to know who the user is and in which room he is. In this case the relationship may be isIn(?user, ?room) ∧ requiredBy(?action, ?user) ∧ hasAppliance(?user, ?tv), where ?user, ?action, ?tv and ?room are variables. So, if we ask for the probability that Action1 is a member of TurnOffTV, i.e. Pr(TurnOffTV(Action1)), the first step is to check its context. If everything is right, the evaluation of this relationship should return that Action1 is required by exactly one user, who is in exactly one room and has exactly one television. Otherwise, the probability cannot be calculated.
Our proposal can be viewed as an SROIQ(D) extension that includes a probabilistic function Pr which maps role assertions and concept assertions to a value between 0 and 1. The sum of the probabilities obtained for a random variable is equal to 1. In contrast, the sum of probabilities for the set of assertions obtained for a vprobi may be different from 1. For instance, the object property hasImpairment is related to two random variables: one to calculate the probability of being deaf and another to calculate the probability of being mute. If both random variables have a domain with two functions, we can get four probabilistic assertions whose probabilities sum to 2 instead of 1, although the sum of the probabilities obtained within each random variable is 1:
– Random variable deaf's assertions: hasImpairment(John, Deaf) 0.8 and ¬hasImpairment(John, Deaf) 0.2.
– Random variable mute's assertions: hasImpairment(John, Mute) 0.7 and ¬hasImpairment(John, Mute) 0.3.
The probability of an assertion that exists in the OWL 2 DL ontology is always 1, even if the data property, object property or class involved is not a member of Vprob. For example, if an assertion states that John is a Person (Person(John)) and we ask for the probability of this assertion, then its probability is 1, i.e. Person(John) 1. However, if the data property, object property or class is not a member of Vprob and there is no assertion in the OWL 2 DL ontology that states it, then the probability for that assertion is unknown. We consider that the probabilistic ontology is satisfied if the OWL 2 DL ontology is satisfied and the Bayesian network model is not in contradiction with the OWL ontology knowledge.
3.2 Turambar Ontology Specification
In Turambar, a probabilistic ontology comprises an ordinary OWL 2 DL ontol-
ogy and the dependency description ontology that defines the Bayesian network
model.
The ordinary OWL ontology imports the Turambar annotations ontology,
which defines the following annotations:
– turambarOntology annotation defines the URI of the dependency description
ontology.
– turambarClass annotation links OWL classes in the ordinary ontology to
random variables in the dependency description ontology.
– turambarProperty annotation connects OWL data properties and object prop-
erties in the ordinary ontology to random variables in the dependency de-
scription ontology.
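A minimal sketch of how an ordinary ontology might use these annotations, written with rdflib; the namespace URIs (other than the example ontology URI used later in this paper) are hypothetical:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDF

# Hypothetical URIs; only the annotation names come from the list above.
EX  = Namespace("http://www.example.org/ont.owl#")
DEP = Namespace("http://www.example.org/ont-dependencies.owl#")
TUR = Namespace("http://www.example.org/turambar-annotations.owl#")

g = Graph()
ontology = URIRef("http://www.example.org/ont.owl")
g.add((ontology, RDF.type, OWL.Ontology))
# turambarOntology: URI of the dependency description ontology.
g.add((ontology, TUR.turambarOntology,
       URIRef("http://www.example.org/ont-dependencies.owl")))
# turambarClass / turambarProperty: links to random variables in that ontology.
g.add((EX.WatchingTVActivity, TUR.turambarClass, DEP.WatchingTV))
g.add((EX.lieOnBedTime, TUR.turambarProperty, DEP.SleepTime))

print(g.serialize(format="turtle"))
```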
We choose to separate the Bayesian network definition from the ordinary ontology in order to isolate the probabilistic knowledge definition from the OWL knowledge. We define isolation as the ability to expose an ontology through a unique URI that locates both the traditional ontology and the probabilistic one. So, given the URI of a probabilistic ontology, a compatible reasoner loads the ordinary ontology and the dependency description ontology referenced by it, whereas a traditional reasoner only loads the ordinary ontology. Thus, if the Turambar probabilistic ontology is loaded by a traditional reasoner, the latter does not have access to the knowledge encoded in the dependency description ontology. In this way, we also want to promote the reuse of probabilistic ontologies as simple OWL 2 DL ontologies by traditional systems and the interoperability between our proposal and such systems.
On the other hand, the dependency description ontology defines the proba-
bilistic model employed to estimate the probabilistic assertions. To model that
knowledge, it imports the Turambar ontology, which defines the vocabulary to
describe the probabilistic model. As the figure 1 shows, the main classes and
properties in the Turambar ontology are the following ones:
– Node class represents the nodes in Bayesian networks. Node instances are
defined as auxiliary random variables through the property isAuxiliar. The
hasProbabilityDistibution object property links Node instances with their
corresponding probability distributions and hasState object property asso-
ciates Node instances with their domains. Furthermore, hasChildren object
property and its inverse hasParent set the dependencies between Node in-
stances. Finally, hasContext object property defines the context for a node
and hasVariable object property, the value of the variable that the node
requires.
– MetaNode is a special type of Node that is employed with non functional
object properties and data properties. Its main functionality is to group sev-
eral nodes that share a context and are related to the same property. For
instance, in the case of the hasImpairment object property we can model a
MetaNode with two nodes: Deaf and Mute. Both nodes share the same con-
text but have different parents and states. The object property compriseNode
identifies the nodes that share a context.
– State class defines the values of random variables’ domain. In other words, it
describes the functions which generate probabilistic assertions. These func-
tions are expressed as a string through the data property stateCondition.
– ProbabilityDistribution class represents a probability distribution. Probabil-
ity distributions are given in form of conditional probability tables. Cells of
the conditional probability table are associated to the instances of Probabil-
ityDistribution through hasProbability object property.
– Probability class represents a cell of a conditional probability table, such as P(x1 | x2, x3) = value, where x1, x2 and x3 are State individuals and value is the probability value. The x1 State is assigned to an instance of the Probability class through the hasValue object property, and the x2 and x3 conditions through the hasCondition object property. Finally, the data property hasProbabilityValue sets the probability value for that cell.
– Context class establishes the relationships between the nodes of a Bayesian
network. Relationships between nodes are expressed as a SPARQL-DL query
through the data property relationship.
– Variable class represents the variables of the context. Their instances iden-
tify the SPARQL-DL variables defined in the context SPARQL-DL query.
The variableName data property establishes the name of the variable. For
example, if the context has been defined with the following SPARQL-DL
expression: select ?a ?b where { PropertyValue(p:livesIn, ?a, ?b)} , then
we should create two instances of Variable with the variableName property
value of a and b, respectively.
– Plugin class defines a library that provides functions that are referenced by State class instances and are not included in the Turambar core. The core functions are the following: (i) number and string comparison, (ii) number range and string range comparison, (iii) individual instance comparison, (iv) boolean comparison, (v) class membership checking and (vi) the void assertion, which defines the probability that no assertion involves an individual. Only (i), (iii) and (iv) are able to generate probabilistic assertions. Every function, except the void function, has an inverse function to check whether that value has been asserted as false.
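To make this vocabulary concrete, the following sketch encodes one node, one of its states and a single conditional probability table cell as instances of the classes above; all URIs, the state condition string and the probability value are hypothetical, and the property names are spelled exactly as listed in this section:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

DEP = Namespace("http://www.example.org/ont-dependencies.owl#")  # hypothetical
TUR = Namespace("http://www.example.org/turambar.owl#")          # hypothetical

g = Graph()
# A node for the WatchingTV random variable, one of its states, and its CPT.
g.add((DEP.WatchingTV, RDF.type, TUR.Node))
g.add((DEP.WatchingTV, TUR.hasState, DEP.WatchingTV_true))
g.add((DEP.WatchingTV_true, RDF.type, TUR.State))
g.add((DEP.WatchingTV_true, TUR.stateCondition,
       Literal("isMemberOf(WatchingTVActivity)")))  # illustrative condition string
g.add((DEP.WatchingTV, TUR.hasProbabilityDistibution, DEP.WatchingTV_cpt))
g.add((DEP.WatchingTV_cpt, RDF.type, TUR.ProbabilityDistribution))
# One cell of the table: P(WatchingTV = true | Presence = true) = 0.8
g.add((DEP.WatchingTV_cpt, TUR.hasProbability, DEP.cell_1))
g.add((DEP.cell_1, RDF.type, TUR.Probability))
g.add((DEP.cell_1, TUR.hasValue, DEP.WatchingTV_true))
g.add((DEP.cell_1, TUR.hasCondition, DEP.Presence_true))
g.add((DEP.cell_1, TUR.hasProbabilityValue, Literal(0.8, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```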
4 Related Work
We can classify probabilistic approaches to deal with uncertainty into two groups: probabilistic description logic approaches and probabilistic web ontology languages [11].
Fig. 1. Classes and properties defined by the Turambar ontology
In the first group, P-CLASSIC [9] extends the description logic CLASSIC with probability, while Pronto [8] is a probabilistic reasoner for P-SROIQ, a probabilistic extension of SROIQ. Pronto models probability intervals with its
custom OWL annotation pronto#certainty. Apart from the previously described
works, there are several other approaches that have been explained in different
surveys such as [14].
In contrast, probabilistic web ontology languages combine OWL with prob-
abilistic formalisms based on Bayesian networks. Since our proposal falls under
this group, we will review in depth the most important works in this category:
BayesOWL, OntoBayes and PR-OWL. The BayesOWL [7] framework extends OWL capabilities for modelling and reasoning with uncertainty. It applies a set of rules to transform the class hierarchy defined in an OWL ontology into a Bayesian network. In the generated network there are two types of nodes: concept nodes and L-Nodes. The former represent OWL classes and the latter are a special kind of node employed to model the relationships defined by the owl:intersectionOf, owl:unionOf, owl:complementOf, owl:equivalentClass and owl:disjointWith constructors. Concept nodes are connected by directed arcs that link superclasses with their subclasses, while L-Nodes and the concept nodes involved in a relationship are linked following the rules established for each constructor. The probabilities are defined with the classes PriorProb, for prior probabilities, and CondProb, for conditional probabilities. BayesOWL [22] acknowledges some limitations: (i) variables must be binary, (ii) probabilities should contain only one prior variable, (iii) probabilities should be complete and (iv) in case of inconsistency the result may not satisfy the given constraints. The BayesOWL approach is not valid for our purpose, because it only supports uncertainty about the class membership of an individual, and this may not be enough for context modelling. For example, sensor values may be represented as data and object property values, and knowing the probability that a sensor has a certain value may be very useful for answering the user's needs.
In contrast to BayesOWL, OntoBayes [21] focuses on properties. In OntoBayes, every random variable is a data or object property. Dependencies between them are described via the rdfs:dependsOn property. It supports the description of prior and conditional probabilities, and it contains a property to specify the full joint probability distribution. Another improvement of OntoBayes over BayesOWL is that it supports multi-valued random variables. However, it is not possible to model relationships between classes, a restriction intended to prevent errors when extracting the Bayesian network structure from ontologies. OntoBayes offers a solution to the limitation of BayesOWL regarding OWL properties, but its lack of OWL class support makes it unsuitable for our goal.
PR-OWL [5] is an OWL extension to describe complex Bayesian models. It is based on Multi-Entity Bayesian Networks (MEBN) logic. MEBN [10] defines the probabilistic knowledge as a collection of MEBN fragments, named MFrags. A set of MFrags constitutes an MTheory, and every PR-OWL ontology must contain at least one MTheory. To be considered an MTheory, an MFrag set must satisfy consistency constraints ensuring that a unique joint probability distribution exists over the MFrags' random variables. In PR-OWL, probabilistic concepts can coexist with non-probabilistic concepts, but only the former benefit from the advantages of the probabilistic ontology. Each MFrag is composed of a set of nodes classified into three groups: resident, input and context nodes. Resident nodes are random variables whose probability distribution is defined in the MFrag. Input nodes are random variables whose probability distribution is defined in a different MFrag from the one in which they are mentioned. In contrast, context nodes specify the constraints that must be satisfied by an entity to substitute an ordinary variable. Finally, node states are modelled with the object property named hasPossibleValues.
The latest version of PR-OWL [2], PR-OWL 2, addresses the limitations of PR-OWL 1 regarding its compatibility with OWL: the absence of a mapping to OWL properties and the lack of compatibility with existing OWL types. Although PR-OWL offers a good solution to deal with uncertainty, it does not provide some characteristics that we require for our systems, such as isolation.
Our proposal focuses on computing the probability of data property assertions, object property assertions and class assertions. This issue is only fully covered by PR-OWL, because BayesOWL only takes class membership into account and OntoBayes only object and data properties.
In addition, we intend to offer a way to keep the uncertainty information as isolated as possible from the traditional ontology. With this policy, we want to ease the reuse of our probabilistic ontologies by traditional systems that do not support uncertainty, and the interoperability between them and our proposal. Furthermore, we aim to prevent traditional reasoners from loading probabilistic knowledge that they do not need. Thus, if we load the Turambar probabilistic ontology located at http://www.example.org/ont.owl, traditional OWL reasoners load only the knowledge defined in the ordinary OWL ontology and do not have access to the probabilistic knowledge. In contrast, the Turambar reasoner is able to load both the ordinary OWL ontology and the dependency description ontology. The Turambar reasoner needs access to the ordinary OWL ontology to answer traditional OWL queries and to find the evidence for the Bayesian networks defined in the dependency description ontology. It is also
important to clarify that a class or property can have deterministic assertions
and probabilistic assertions without duplicating them due to the links between
Bayesian networks’ nodes and OWL classes and properties through turambar-
Class and turambarProperty, respectively. Thanks to this feature, a Turambar
ontology has a unique URI that allows it to be used as an ordinary OWL 2 DL
ontology without loading the probabilistic knowledge. This characteristic is not
offered by other approaches as far as we know.
Another difference from other approaches is that we have taken into account the extensibility of our approach: plug-ins can be used to extend the base functionality. We believe that it is necessary to offer a straightforward, transparent and standard mechanism to extend reasoner functionality in order to cover the needs of heterogeneous domains.
However, our approach has the shortcoming of assuming a simple attribute-
value representation in comparison to PR-OWL. That means that each prob-
abilistic query involves reasoning about the same fixed number of nodes, with
only the evidence values changing from query to query. To solve this drawback,
we can opt to employ situation specific Bayesian networks [12], as PR-OWL
does. However, the development of custom plug-ins can overcome this limita-
tion in some cases. Besides, thanks to this expressiveness restriction we are able
to know the size of the Bayesian network and give a better estimation of the
performance of the Turambar probabilistic ontology.
5 Conclusions and Future Work
In this work we have presented a proposal to deal with uncertainty in intelligent
environments. Its main features are: a) it isolates the probabilistic information
definition from traditional ontologies, b) it can be extended easily and c) it is
oriented to intelligent environments.
As ongoing work, we are developing an extension to SPARQL-DL [17] in
order to offer a simple mechanism to execute complex queries in a declara-
tive way that abstracts developers from the reasoner implementation employed
and its API. This extension proposes the addition of two new query atoms to
query probabilistic knowledge: ProbType for probabilistic class assertions and
ProbPropertyValue for probabilistic property assertions. We believe that this
extension can ease the development of applications that employ Turambar.
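For illustration only, a query using the proposed atoms might look as follows; the arity of the atoms and the variable binding the probability are assumptions on our part, written in the SPARQL-DL style of the example in Sect. 3.2:

```python
# Illustrative only: the exact syntax and arity of the proposed query atoms are
# not specified in this paper; the ?prob variable binding the probability is our
# assumption. The query style follows the SPARQL-DL example given in Sect. 3.2.
probabilistic_query = """
select ?activity ?prob where {
    ProbType(?activity, ex:WatchingTVActivity, ?prob)
}
"""
```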
As future work, we plan to create a graphical tool to ease the creation of probabilistic ontologies in order to promote their adoption. We also plan to extend the expressivity of our proposal and to evaluate new and better ways to define the probabilistic description ontology in order to improve its maintainability. In addition, we are studying a formalism that allows the definition of custom functions for state evaluation independently of the programming language employed.
References
1. Baader, F., Küsters, R., Wolter, F.: The description logic handbook. chap. Exten-
sions to description logics, pp. 219–261. Cambridge University Press, New York,
NY, USA (2003), http://dl.acm.org/citation.cfm?id=885746.885753
2. Carvalho, R.N.: Probabilistic Ontology: Representation and Modeling Methodol-
ogy. Ph.D. thesis, George Mason University (2011)
3. Chen, H., Perich, F., Finin, T., Joshi, A.: Soupa: standard ontology for ubiquitous
and pervasive applications. In: Mobile and Ubiquitous Systems: Networking and
Services, 2004. MOBIQUITOUS 2004. The First Annual International Conference
on. pp. 258–267 (Aug 2004)
4. Coen, M.H.: Design principles for intelligent environments. In: Proceedings of
the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Ap-
plications of Artificial Intelligence. pp. 547–554. AAAI ’98/IAAI ’98, Amer-
ican Association for Artificial Intelligence, Menlo Park, CA, USA (1998),
http://dl.acm.org/citation.cfm?id=295240.295733
5. Costa, P.C.G.: Bayesian semantics for the Semantic Web. Ph.D. thesis, George
Mason University (2005)
6. Dey, A.K.: Understanding and using context. Personal Ubiquitous Comput. 5(1),
4–7 (Jan 2001), http://dx.doi.org/10.1007/s007790170019
7. Ding, Z.: Bayesowl: A Probabilistic Framework for Uncertainty in Semantic Web.
Ph.D. thesis, Catonsville, MD, USA (2005)
8. Klinov, P.: Practical reasoning in probabilistic description logic. Ph.D. thesis, Uni-
versity of Manchester (2011)
9. Koller, D., Levy, A., Pfeffer, A.: P-CLASSIC: A tractable probabilistic description
logic. In: In Proceedings of AAAI-97. pp. 390–397 (1997)
10. Laskey, K.: MEBN: A language for first-order bayesian knowledge bases. Artificial
Intelligence 172(2-3), 140–178 (2008)
11. Lukasiewicz, T., Straccia, U.: Managing uncertainty and vagueness in description
logics for the semantic web. Web Semantics: Science, Services and Agents on the
World Wide Web 6(4), 291–308 (2008)
12. Mahoney, S.M., Laskey, K.B.: Constructing situation specific belief networks. In:
Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence.
pp. 370–378. UAI’98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
(1998), http://dl.acm.org/citation.cfm?id=2074094.2074138
13. Pearl, J.: Fusion, propagation, and structuring in be-
lief networks. Artificial Intelligence 29(3), 241 – 288 (1986),
http://www.sciencedirect.com/science/article/pii/000437028690072X
14. Predoiu, L., Stuckenschmidt, H.: Probabilistic models for the semantic web. The
Semantic Web for Knowledge and Data Management: Technologies and Practices
pp. 74–105 (2009)
15. Preuveneers, D., Van den Bergh, J., Wagelaar, D., Georges, A., Rigole, P., Clerckx,
T., Berbers, Y., Coninx, K., Jonckers, V., De Bosschere, K.: Towards an extensible
context ontology for ambient intelligence. In: Ambient Intelligence, Lecture Notes
in Computer Science, vol. 3295, pp. 148–159. Springer Berlin Heidelberg (2004)
16. Ramos, C., Augusto, J.C., Shapiro, D.: Ambient intelligence - the next step for artificial intelligence. Intelligent Systems, IEEE 23(2), 15–18 (March 2008)
17. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. In: OWLED. vol.
258 (2007)
18. Strang, T., Linnhoff-Popien, C.: A context modeling survey. In: In: Workshop on
Advanced Context Modelling, Reasoning and Management, UbiComp 2004 - The
Sixth International Conference on Ubiquitous Computing, Nottingham/England
(2004)
19. W3C: Uncertainty reasoning for the world wide web. Tech. rep., W3C (2008),
http://www.w3.org/2005/Incubator/urw3/XGR-urw3/
20. Wang, X., Zhang, D., Gu, T., Pung, H.: Ontology based context modeling and
reasoning using owl. In: Proceedings of the Second IEEE Annual Conference on
Pervasive Computing and Communications Workshops, 2004. pp. 18–22 (2004)
21. Yang, Y., Calmet, J.: Ontobayes: An ontology-driven uncertainty model. In: Com-
putational Intelligence for Modelling, Control and Automation, 2005 and Interna-
tional Conference on Intelligent Agents, Web Technologies and Internet Commerce,
International Conference on. vol. 1, pp. 457–463. IEEE (2005)
22. Peng, Y., Ding, Z.: BayesOWL: Reference manual (Feb 2013), http://www.csee.umbc.edu/~ypeng/BayesOWL/manual/index.html
Learning to Propagate Knowledge in Web Ontologies
Pasquale Minervini¹, Claudia d'Amato¹, Nicola Fanizzi¹, Volker Tresp²
¹ Department of Computer Science - University of Bari, Italy
{pasquale.minervini|claudia.damato|nicola.fanizzi}@uniba.it
² Siemens AG, Corporate Technology, Munich, Germany
volker.tresp@siemens.com
Abstract. The increasing availability of structured machine-processable knowledge in the WEB OF DATA calls for machine learning methods to support standard pattern matching and reasoning based services (such as query-answering
and inference). Statistical regularities can be efficiently exploited to overcome the
limitations of the inherently incomplete knowledge bases distributed across the
Web. This paper focuses on the problem of predicting missing class-memberships
and property values of individual resources in Web ontologies. We propose a
transductive inference method for inferring missing properties about individuals:
given a class-membership/property value prediction problem, we address the task
of identifying relations encoding similarities between individuals, and efficiently
propagating knowledge across their relations.
1 Introduction
Standard query answering and reasoning services for the Semantic Web [2] (SW) mainly rely on deductive inference. However, purely deductive inference suffers from several limitations [20]: reasoning tasks might be computationally complex, do not always address uncertainty, and rely only on axiomatic prior knowledge; these limitations are exacerbated by the inherent incompleteness and incoherence of the distributed knowledge bases (KBs) found in a Web-scale scenario modeled by formal ontologies. In such a con-
text many complex tasks (e.g. query answering, clustering, ranking, etc.) are ultimately
based on assessing the truth of ground facts. Deciding on the truth of specific facts (assertions) in SW knowledge bases requires taking into account the open-world form of reasoning adopted in this context: a failure in ascertaining the truth value of a given fact does not necessarily imply that such a fact is false, but rather that its truth value cannot be deductively inferred from the KB (e.g. because of incomplete or uncertain knowledge). Other issues are related to the distributed nature of data across the Web. Cases
of contradictory answers or flawed inferences may be caused by distributed pieces of
knowledge that may be mutually conflicting.
The prediction of the truth value of an assertion can be cast as a classification prob-
lem to be solved through statistical learning: individual resources in an ontology can be
regarded as statistical units, and their properties can be statistically inferred even when
they cannot be deduced from the KB. Several approaches have been proposed in the SW
literature (see [20] for a recent survey). A major issue with the methods proposed so far is that the induced statistical models (such as those produced by kernel methods, tensor factorization, etc.) are either difficult for experts to interpret and to integrate in logic-based SW infrastructures, or computationally impractical.
1.1 Related Work
A variety of methods have been proposed for predicting the truth value of assertions in
Web ontologies, including generative models [18,21], kernel methods [4,8,16], upgrad-
ing of propositional algorithms [15], matrix and tensor factorization methods [9,17,26].
An issue with existing methods is that they either rely on a possibly expensive search
process, or induce statistical models that are often not easy to interpret by human
experts. Kernel methods induce models (such as separating hyperplanes) in a high-
dimensional feature space implicitly defined by a kernel function. The underlying ker-
nel function itself usually relies on purely syntactic features of the neighborhood graphs
of two individual resources (such as their common subtrees [16]). In both cases, there is
not necessarily a direct translation in terms of domain knowledge. Latent variable and
matrix or tensor factorization methods such as [9, 17, 21, 26] try to explain the observa-
tions in terms of latent classes or attributes, which also may be non-trivial to describe
using the domain’s vocabulary. The approaches in [15, 18] try to overcome this limita-
tion by making use of more complex features and knowledge representation formalisms
jointly with the ontology’s terminology: these methods involve either a search process
in a possibly very large feature space or complex probabilistic inference tasks, which
might not be feasible in practice.
1.2 Contribution
We propose a transductive inference method for predicting the truth value of assertions,
which is based on the following intuition: examples (each represented by an individual in the ontology) that are similar in some aspects tend to be linked by specific relations. Yet it may not be straightforward to find such relations for a given learning task. Our
approach aims at identifying such relations, and permits the efficient propagation of
information along chains of related entities. It turns out to be especially useful with real-
world shallow ontologies [22] (i.e. those with a relatively simple fixed terminology and
populated by very large amounts of instance data such as citation or social networks),
in which the characteristics of related entities tend to be correlated. These features are
particularly relevant in the context of the Linked Open Data [10] (LOD). Unlike other
approaches, the proposed method can be used to elicit which relations link examples
with similar characteristics. The proposed method is efficient, since the complexity of
both inference and learning grows polynomially with the number of statistical units.
As in graph-based semi-supervised learning (SSL) methods [5], we rely on a sim-
ilarity graph among examples for propagating knowledge. However, SSL methods are
often designed for propositional representations; our method addresses the problem of
learning from real ontologies, where examples can be interlinked by heterogeneous re-
lations. In particular, this paper makes the following contributions:
– A method for efficiently propagating knowledge among similar examples: it leverages a similarity graph, which plays a critical role in the propagation process.
– An approach for efficiently learning the similarity matrix, by leveraging a set of semantically heterogeneous relations among examples in the ontology.
To the best of our knowledge, our approach is the first to explicitly identify relations
that semantically encode similarities among examples w.r.t. a given learning task.
The remainder of the paper is organized as follows. In the next section, we review
the basics of semantic knowledge representation and reasoning tasks, and we introduce
the concept of transductive learning in the context of semantic KBs. In Sect. 3, we
illustrate the proposed method based on an efficient propagation of information among
related entities, and address the problem of identifying the relations relevant to the
learning task. In Sect. 4, we provide empirical evidence for the effectiveness of the
proposed method. Finally, in Sect. 5, we summarize the proposed approach, outline its
limitations and discuss possible future research directions.
2 Transductive Learning with Web Ontologies
We consider ontological KBs using Description Logics (DLs) as a language to describe objects and their relations [1]. Basic elements are atomic concepts N_C = {C, D, ...}, interpreted as subsets of a domain of objects (e.g. Person or Article), and atomic roles N_R = {R, S, ...}, interpreted as binary relations on such a domain (e.g. friendOf or authorOf). Domain objects are represented by individuals N_I = {a, b, ...}: each represents a domain entity (such as a person in a social network, or an article in a citation network).
In RDF(S)/OWL (see the OWL 2 W3C Recommendation: http://www.w3.org/TR/owl-overview/), concepts, roles and individuals are referred to as classes, properties and resources, respectively, and are identified by their URIs. Complex concept
descriptions can be built using atomic concepts and roles by means of specific con-
structors offered by expressive DLs.
A Knowledge Base (KB) K = ⟨T, A⟩ contains a TBox T, made of terminological axioms, and an ABox A, made of ground axioms, named assertions, involving individuals. Ind(A) denotes the set of individuals occurring in A. As an inference procedure, Instance Checking consists in deciding whether K ⊨ Q(a) (where Q is a query concept and a is an individual) holds. Because of the Open-World Assumption (OWA), instance checking may provide three possible outcomes, i.e. i) K ⊨ Q(a), ii) K ⊨ ¬Q(a) and iii) K ⊭ Q(a) ∧ K ⊭ ¬Q(a). This means that failing to deductively infer the membership of an individual a in a concept Q does not imply that a is a member of its complement ¬Q.
Given the inherent incompleteness of deductive inference under open-world reason-
ing, in this work we focus on transductive inference [27]: this consists in reasoning from
observed (training) cases to a specific set of test cases, without necessarily generalizing
to unseen cases. This differs from inductive inference, where training cases are used to
infer a general model, which is then applied to test cases.
The main motivation behind the choice of transductive learning is described by the
main principle in [27]: “If you possess a restricted amount of information for solving
some problem, try to solve the problem directly and never solve a more general problem
as an intermediate step. It is possible that the available information is sufficient for a
direct solution but is insufficient for solving a more general intermediate problem”.
On the ground of the available information, the proposed approach aims at learning
a labeling function for a given target class that can be used for predicting whether in-
dividuals belong to a class C (positive class) or to its complement ¬C (negative class)
when this cannot be inferred deductively. The problem can be defined as follows:
Definition 2.1 (Transductive Class-Membership Learning). Given:
– a target class C in a KB K;
– a set of examples X ⊆ Ind(A) partitioned into:
  • a set of positive examples: X+ ≜ {a ∈ X | K ⊨ C(a)};
  • a set of negative examples: X− ≜ {a ∈ X | K ⊨ ¬C(a)};
  • a set of neutral (unlabeled) examples: X0 ≜ {a ∈ X | a ∉ X+ ∧ a ∉ X−};
– a space of labeling functions F with domain X and range {−1, +1}, i.e. F ≜ {f | f : X → {+1, −1}};
– a cost function cost(·) : F → ℝ, specifying the cost associated with each labeling function f ∈ F;
Find f* ∈ F minimizing cost(·) w.r.t. X:

$$f^* \leftarrow \arg\min_{f \in \mathcal{F}} \; cost(f).$$
The transductive learning task is cast as the problem of finding a labeling function
f ∗ for a target class C, defined over a finite set of labeled and unlabeled examples X
(represented by a subset of the individuals in the KB), and minimizing some arbitrary
cost criterion.
Example 2.1 (Transductive Class-Membership Learning). Consider an ontology mod-
eling an academic domain. The problem of learning whether a set of persons is affiliated
to a given research group or not, provided a set of positive and negative examples of
affiliates, can be cast as a transductive class-membership learning problem: examples
(a subset of the individuals in the ontology, each representing a person), represented by
the set X, can be either positive, negative or neutral depending on their membership to
a target class ResearchGroupAffiliate. Examples can be either unlabeled (if their
membership to the target class cannot be determined) or labeled (if positive or nega-
tive). The transductive learning problem reduces to finding the best labeling function f
(according to a given criterion, represented by the cost function) providing membership
values for examples in X.
In this work, we exploit the relations holding among examples to propagate knowl-
edge (in the form of label information) through chains of related examples. Inspired
by graph-based semi-supervised transductive learning, the criterion on which the cost
function will be defined follows the semi-supervised smoothness assumption [5]: if two
points in a high-density region are close, then so should be the corresponding outputs.
Transductive and semi-supervised learning are closely related: in both settings, test ex-
amples are available during the learning task (in the form of unlabeled examples). In
the proposed method, the learning task is reduced to finding a labeling function f which
varies smoothly across the similarity graph defined over examples.
3 Knowledge Propagation
In this section, we present our method for solving the learning problem in Def. 2.1 in
the context of Web ontologies. In Sect. 3.1 we show that a similarity graph between ex-
amples can be used to efficiently propagate label information among similar examples.
The effectiveness of this approach strongly depends on the choice of a similarity graph
(represented by its adjacency matrix W). In Sect. 3.2, we show how the matrix W can
be learned from examples, by leveraging their relationship within the ontology.
3.1 Transductive Inference as an Optimization Problem
We now propose a solution to the transductive learning problem in Def. 2.1. As discussed at the end of Sect. 2, we need a labeling function f* defined over the examples X which is both consistent with the training labels and varies smoothly among similar examples (according to the semi-supervised smoothness assumption). In the following, we assume that a similarity graph over the examples in X is already provided. Such a graph is represented by its adjacency matrix W, such that W_ij = W_ji ≥ 0, with W_ij > 0 if x_i, x_j ∈ X are similar and W_ij = 0 otherwise (for simplicity we assume that W_ii = 0). In Sect. 3.2 we propose a solution to the problem of learning W from examples.
Formally, each labeling function f can be represented by a finite-size vector, where f_i ∈ {−1, +1} is the label for the i-th element in the set of examples X. According to [30], labels can be enforced to vary smoothly among similar examples by considering a cost function of the following form:

$$E(f) \triangleq \frac{1}{2}\sum_{i=1}^{|X|}\sum_{j=1}^{|X|} W_{ij}\,(f_i - f_j)^2 \;+\; \epsilon \sum_{i=1}^{|X|} f_i^2, \qquad (1)$$
where the first term enforces the labeling function to vary smoothly among similar examples (i.e. those connected by an edge in the similarity graph), and the second term is an L2 regularizer (with weight ε > 0) over f. A labeling for the unlabeled examples X0 is obtained by minimizing the function E(·) in Eq. 1, constraining the value of f_i to 1 (resp. −1) if x_i ∈ X+ (resp. x_i ∈ X−).
Let L ≜ X+ ∪ X− and U ≜ X0 represent labeled and unlabeled examples respectively. Constraining f to discrete values, i.e. f_i ∈ {−1, +1}, ∀x_i ∈ X0, has two main drawbacks:
– The function f only provides a hard classification (i.e. f_U ∈ {−1, +1}^|U|), without any measure of confidence;
– E defines the energy function of a discrete Markov Random Field, and calculating the marginal distribution over the labels f_U is inherently difficult [13].
To overcome these problems, the authors of [30] propose a continuous relaxation of f_U, where labels for unlabeled examples are represented by real values, f_U ∈ ℝ^|U|, which also express a measure of the classification confidence. This allows for a simple, closed-form solution to the problem of minimizing E for a fixed f_L, where f_L represents the labels of the labeled examples.
Application to Class-Membership Learning We can solve the learning problem in Def. 2.1 by minimizing the cost function E(·) in Eq. 1, which can be rewritten as [30]:

$$E(f) = f^T (D - W) f + \epsilon\, f^T f = f^T (L + \epsilon I) f, \qquad (2)$$

where D is a diagonal matrix such that $D_{ii} = \sum_{j=1}^{|X|} W_{ij}$ and L ≜ D − W is the graph Laplacian of W. Reordering the matrix W, the graph Laplacian L and the vector f w.r.t. their membership to L and U, they can be rewritten as:

$$W = \begin{pmatrix} W_{LL} & W_{LU} \\ W_{UL} & W_{UU} \end{pmatrix}, \quad L = \begin{pmatrix} L_{LL} & L_{LU} \\ L_{UL} & L_{UU} \end{pmatrix}, \quad f = \begin{pmatrix} f_L \\ f_U \end{pmatrix}. \qquad (3)$$

The problem of finding the f_U minimizing the cost function E for a fixed value of f_L has a closed-form solution:

$$f_U^* = (L_{UU} + \epsilon I)^{-1} W_{UL}\, f_L. \qquad (4)$$
Complexity A solution for Eq. 4 can be computed efficiently, in nearly-linear time w.r.t. |X|. Indeed, computing f_U^* can be reduced to solving a linear system of the form Ax = b, with A = (L_UU + εI), b = W_UL f_L and x = f_U^*. A linear system Ax = b with A ∈ ℝ^{n×n} can be solved in nearly-linear time w.r.t. n if the coefficient matrix A is symmetric diagonally dominant (SDD), i.e. A is symmetric (A = A^T) and ∀i: A_ii ≥ ∑_{j≠i} |A_ij|, e.g. using the algorithm in [6], whose time-complexity is ≈ O(m log^{1/2} n), where m is the number of non-zero entries in A and n is the number of variables in the system of linear equations. In Eq. 4, the matrix (L_UU + εI) is SDD (since L_UU is a principal submatrix of L, which is SDD [25]). An efficient parallel solver for SDD linear systems is discussed in [19].
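A minimal numerical sketch of Eq. 4 (a toy dense example with hypothetical weights and labels; it is not the authors' implementation, which would rely on a sparse SDD solver as discussed above):

```python
import numpy as np

# Toy similarity graph over 5 examples (hypothetical weights and labels).
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
eps = 1e-2                                    # L2 regularization weight
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian L = D - W

labeled, unlabeled = np.array([0, 4]), np.array([1, 2, 3])
f_L = np.array([+1.0, -1.0])                  # labels of the labeled examples

# Eq. 4: f_U* = (L_UU + eps*I)^{-1} W_UL f_L
L_UU = L[np.ix_(unlabeled, unlabeled)]
W_UL = W[np.ix_(unlabeled, labeled)]
f_U = np.linalg.solve(L_UU + eps * np.eye(len(unlabeled)), W_UL @ f_L)
print(f_U)   # real-valued scores; the sign gives the predicted class-membership
```

With sparse matrices and an SDD solver, the same computation scales to the nearly-linear regime discussed above.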
3.2 Learning to Propagate Knowledge in Web Ontologies
As discussed in Sect. 3.1, the proposed approach to knowledge propagation relies on a
similarity graph, represented by its adjacency matrix W.
The underlying assumption of this work is that some relations among examples
in the KB might encode a similarity relation w.r.t. a specific target property or class.
Identifying such relations can help propagate information through similar examples.
In the literature, this effect goes under the name of Guilt-by-Association [14]: re-
lated examples influence each other, and some relations (e.g. friendship in a social net-
work) can encode some form of similarity w.r.t. a set of properties (such as political
views, hobbies, religious beliefs). However, depending on the learning task at hand, not
all relations are equally effective at encoding similarity relations. For example in a so-
cial network, friends may tend to share common interests, while quiet people may tend
to prefer talkative friends and vice-versa [14].
In this work, we represent each relation by means of an adjacency matrix W̃, such that W̃_ij = W̃_ji = 1 iff the relation rel(x_i, x_j) between x_i and x_j holds in the ontology, where rel might represent any generic relation between examples (e.g. friendship or co-authorship). For simplicity, we assume that W̃_ii = 0, ∀i.
² A matrix A is SDD iff A is symmetric (i.e. A = A^T) and ∀i : A_ii ≥ \sum_{j ≠ i} |A_ij|.
Given a set of adjacency matrices W ≜ {W̃_1, . . . , W̃_r} (one for each relation type), we can define W as a linear combination of the matrices in W:

W \triangleq \sum_{i=1}^{r} \mu_i \tilde{W}_i, \quad \text{with } \mu_i \geq 0, \; \forall i   (5)
where µ_i is a parameter representing the contribution of W̃_i to the adjacency matrix of
the similarity graph W. Non-negativity in µ ensures that W has non-negative weights,
and therefore the corresponding graph Laplacian L is positive semidefinite [25] (PSD),
leading to the unique, closed form solution in Eq. 4.
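Under the same assumptions as the previous sketch, the linear combination in Eq. 5 can be written as follows (the relation matrices and weights are placeholders supplied by the caller):

import scipy.sparse as sp

def combine_relations(adjacency_matrices, mu):
    # W = sum_i mu_i * W~_i, with mu_i >= 0 (Eq. 5)
    assert all(m >= 0 for m in mu)               # non-negative weights keep L PSD
    W = sp.csr_matrix(adjacency_matrices[0].shape)
    for m, A in zip(mu, adjacency_matrices):
        W = W + m * sp.csr_matrix(A)
    return W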
Probabilistic Interpretation as a Gaussian Random Field Let us consider the relaxation of the energy function in Eq. 2, such that labels f are allowed to range in R^|X| (f ∈ R^|X|). It corresponds to the following probability density function over f:

p(f) = (2\pi)^{-\frac{1}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} E(f) \right) = \mathcal{N}\left( 0, (L + \varepsilon I)^{-1} \right).   (6)
The probability density function in Eq. 6 defines a Gaussian Markov Random Field [13]
(GMRF) f ∼ N(0, Σ), where Σ = Ω^{-1} and Ω = (L + εI) are respectively its covari-
ance and inverse covariance (or precision) matrix, and |Σ| indicates the determinant of
the covariance matrix Σ.
The covariance matrix and its inverse fully determine the independence relations
among variables in a GMRF [13]: if Ω_ij ≠ 0, then there is an edge between f_i and f_j in
the minimal I-map GMRF of p. A zero element in the inverse covariance matrix implies
that two variables are conditionally independent given all the other variables.
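A small numerical example (ours, assuming NumPy) makes this reading concrete: on a three-node chain, the precision matrix Ω = L + εI has a zero entry for the non-adjacent pair, while the covariance Σ generally does not:

import numpy as np

eps = 0.1
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                 # nodes 0 and 2 are not adjacent
L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
Omega = L + eps * np.eye(3)                  # precision matrix of the GMRF
Sigma = np.linalg.inv(Omega)                 # covariance matrix

print(Omega[0, 2])  # 0.0: f_0 and f_2 are conditionally independent given f_1
print(Sigma[0, 2])  # non-zero: marginally, f_0 and f_2 are still correlated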
Parameter Learning The parametric form of W is fully specified by the parameters µ in Eq. 5, which may be unknown. We estimate the parameters by means of Leave-One-Out (LOO) Error minimization: given that propagation can be performed efficiently, we are able to directly minimize the LOO error, consisting of the summation of reconstruction errors obtained by considering each labeled example, in turn, as unlabeled, and predicting its label (as in [29]). This leads to a computationally efficient procedure for evaluating the matrix W, and yields more flexibility, since arbitrary loss functions are allowed. Let U_i ≜ U ∪ {x_i} and L_i ≜ L \ {x_i}: the labeling vector
f and matrices W and L, for any given xi ∈ L, can be rewritten as follows:
f = \begin{bmatrix} f_{L_i} \\ f_{U_i} \end{bmatrix}, \quad W = \begin{bmatrix} W_{L_i L_i} & W_{L_i U_i} \\ W_{U_i L_i} & W_{U_i U_i} \end{bmatrix}, \quad L = \begin{bmatrix} L_{L_i L_i} & L_{L_i U_i} \\ L_{U_i L_i} & L_{U_i U_i} \end{bmatrix},   (7)
where w.l.o.g. we assume that the left-out example xi ∈ L corresponds to the first
element in U_i (in the enumeration used for the block representation in Eq. 7). Let ℓ(x, x̂) be a generic, differentiable loss function (e.g. ℓ(x, x̂) = |x − x̂| for the absolute loss, or ℓ(x, x̂) = (x − x̂)²/2 for the quadratic loss). The LOO Error is defined as follows:
Q(\Theta) \triangleq \sum_{i=1}^{|L|} \ell(f_i, \hat{f}_i),   (8)
where e^T ≜ (1, 0, . . . , 0) ∈ R^{u+1} and \hat{f}_i ≜ e^T (L_{U_i U_i} + εI)^{-1} W_{U_i L_i} f_{L_i} represents the continuous label value assigned to x_i as if such a value was not known in advance. The vector e^T is needed to select the first value of f^*_{U_i} only, i.e. the inferred continuous label associated to the left-out example x_i ∈ L. This leads to the definition of the following criterion for learning the optimal set of parameters Θ ≜ {µ, ε}:
Definition 3.1 (Minimum LOO Error Parameters). Given a set of labeled (resp. un-
labeled) examples L (resp. U ) and a similarity matrix W defined by parameters Θ
(according to the parametric form of W in Eq. 5), the minimum LOO Error Parameters
Θ^*_{LOO} are defined as follows:

\Theta^*_{LOO} \triangleq \arg\min_{\Theta} \; Q(\Theta) + \lambda \|\Theta\|^2,   (9)
where the function Q is defined as in Eq. 8 and λ > 0 is a small positive scalar that weights a regularization term over Θ (useful to prevent some parameters from diverging).
The objective function in Def. 3.1 is differentiable and can be efficiently minimized
by using gradient-based function minimization approaches such as L-BFGS.
Let Z_i = (L_{U_i U_i} + εI). The gradient of Q w.r.t. a parameter θ ∈ Θ is given by:

\frac{\partial Q(\Theta)}{\partial \theta} = \sum_{i=1}^{|L|} \frac{\partial \ell(f_i, \hat{f}_i)}{\partial \hat{f}_i} \, e^T Z_i^{-1} \left( \frac{\partial W_{U_i L_i}}{\partial \theta} f_{L_i} - \frac{\partial Z_i}{\partial \theta} f^*_{U_i} \right).   (10)
Complexity of the Gradient Calculation Let z_i = \frac{\partial W_{U_i L_i}}{\partial \theta} f_{L_i} - \frac{\partial Z_i}{\partial \theta} f^*_{U_i}. Calculating Z_i^{-1} z_i can be reduced to solving a linear system of the form Ax = b, with A = Z_i = (L_{U_i U_i} + εI) and b = z_i. As discussed in Sect. 3.1, this calculation has nearly-linear complexity in the number of non-zero elements in A, since Z_i is SDD.
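A compact sketch of this learning procedure is given below (assuming the propagate helper from the earlier sketch and SciPy; adjacency_matrices, labeled_idx, unlabeled_idx and f_L are assumed inputs). For brevity it relies on L-BFGS-B with numerical gradients and box constraints, rather than the analytic gradient of Eq. 10 with Projected Gradient Descent:

import numpy as np
from scipy.optimize import minimize

def loo_error(theta, adjacency_matrices, labeled_idx, unlabeled_idx, f_L, lam=1e-8):
    mu, eps = theta[:-1], theta[-1]
    W = sum(m * A for m, A in zip(mu, adjacency_matrices))   # Eq. 5
    loss = 0.0
    for k, i in enumerate(labeled_idx):
        L_i = [x for x in labeled_idx if x != i]             # L \ {x_i}
        U_i = [i] + list(unlabeled_idx)                      # U with x_i added, x_i first
        f_hat = propagate(W, np.delete(f_L, k), L_i, U_i, eps=eps)[0]  # e^T picks x_i
        loss += 0.5 * (f_L[k] - f_hat) ** 2                  # quadratic loss, Eq. 8
    return loss + lam * float(np.dot(theta, theta))          # Eq. 9

r = len(adjacency_matrices)
theta0 = np.ones(r + 1)                                      # (mu_1, ..., mu_r, eps)
res = minimize(loo_error, theta0,
               args=(adjacency_matrices, labeled_idx, unlabeled_idx, np.asarray(f_L)),
               method="L-BFGS-B",
               bounds=[(0.0, None)] * r + [(1e-6, None)])    # mu >= 0, eps > 0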
4 Empirical Evaluation
The transductive inference method discussed in Sect. 3, which we will refer to as Adap-
tive Knowledge Propagation (AKP), was experimentally evaluated³ in comparison with
other approaches proposed in the literature on a variety of assertion prediction prob-
lems. In the following, we describe the setup of experiments and their outcomes.
4.1 Setup
In the empirical evaluations, we used an open source DL reasoner⁴. In the experiments, we considered the DBpedia 3.9 Ontology [3]. DBpedia [3] makes structured information extracted from Wikipedia available in the LOD cloud, providing unique identifiers for the described entities that can be dereferenced over the Web. DBpedia 3.9, released in September 2013, describes 4.0 million entities.
³ Sources and datasets are available at http://lacam.di.uniba.it/phd/pmm.html
⁴ Pellet v2.3.1 – http://clarkparsia.com/pellet/
Experimental Setting As discussed in Sect. 3.2, the parameters Θ = {µ, ε} in AKP are estimated by minimizing the Leave-One-Out error Q, as described in Eq. 9. We solved the problem by using Projected Gradient Descent, according to the gradient formulation in Eq. 10 (enforcing µ ≥ 0 and ε > 0), together with an intermediate line search to determine the step size. The regularization parameter λ in Eq. 9 was fixed to λ = 10^{-8}.
In this work, each of the adjacency matrices W = {W̃1 , . . . , W̃r } is associated to a
distinct atomic role in the ontology, linking at least two examples.
Before each experiment, all knowledge inherent to the target class was removed
from the ontology. Following the evaluation procedures in [16, 28], members of the tar-
get concepts were considered as positive examples, while an equal number of negative
examples was randomly sampled from unlabeled examples. Remaining instances (i.e.
neither positive nor negative) were considered as neutral examples.
Results are reported in terms of Area Under the Precision-Recall Curve (AUC-
PR), a measure to evaluate rankings also used in e.g. [17], and calculated using the
procedure described in [7]. In each experiment, we considered the problem of predicting
the membership to each of several classes; for each of such classes, we performed a
10-fold cross validation (CV), and report the average AUC-PR obtained using each of
the considered methods. Since the folds used to evaluate each of the methods do not
vary, we report statistical significance tests using a paired, non-parametric difference
test (Wilcoxon T test). We also report diagrams showing how using a limited quantity
of randomly sampled labeled training instances (i.e. 10%, 30%, 50%, . . ., a plausible
scenario for a number of real world settings with limited labeled training data), and
using the remaining examples for testing, affects the results in terms of AUC-PR.
Setup of the Compared Methods We compared our method with state-of-the-art ap-
proaches proposed for learning from ontological KBs. Specifically, we selected two
kernel methods: Soft-Margin SVM [23, pg. 223] (SM-SVM) and Kernel Logistic Re-
gression (KLR), jointly with the Intersection SubTree [16] (IST) kernel for ontological
KBs, and the SUNS [26] relational prediction model. The relational graph used by both
the RDF kernel and SUNS was materialized as follows: all hs, p, oi triples were re-
trieved by means of SPARQL-DL queries (where p was either an object or a data-type
property) together with all direct type and direct sub-class relations.
As in [16], the IST kernel parameters ranged over d ∈ {1, 2, 3, 4} and λ_ist ∈ {0.1, 0.3, . . . , 0.9}. In order to obtain a ranking among instances (provided by soft-
labels f in AKP), we applied the logistic function s to the decision boundary f instead
of the standard sign function, commonly used in the classification context (thus ob-
taining s(f (·)) : X → [0, 1]). In SM-SVM, C ∈ {0.0, 10−6 , 10−4 , . . . , 104 , 106 },
while in KLR the weight λk associated to the L2 regularizer was found considering
λk ∈ {10−4 , 10−3 , . . . , 104 }. In SUNS, parameters were selected by means of a 10-
fold CV within the training set by grid optimization, with t ∈ {2, 4, 6, . . . , 24} and
λs ∈ {0, 10−2 , 10−1 , . . . , 106 }.
4.2 Results
Similarly to [17], we evaluated the proposed approach on two prediction tasks, namely predicting party affiliations to either the Democratic or the Republican party for US
[Plot: AUC-PR results – DBpedia 3.9 Fragment. AUC-PR (0.2–1.0) vs. percentage of labeled examples used during training (10%, 30%, 50%, 70%, 90%) for AKP (A), SVM (IST), KLR (IST) and SUNS.]

Method          AUC-PR (mean ± var.)   (S)
AKP (A)         0.967 ± 0.003
SM-SVM (IST)    0.930 ± 0.011          O
KLR (IST)       0.888 ± 0.029          H
SUNS            0.832 ± 0.019          H
Fig. 1: DB PEDIA 3.9 Ontology – Left: AUC-PR results (mean, st.d.) estimated by 10-
fold CV, obtained varying the percentage of examples used for training – Right: AUC-
PR results estimated by 10-fold CV: H/O (resp. N/4) indicates that AKP’s mean is
significantly higher (resp. lower) in a paired Wilcoxon T test with p < 0.05 / p < 0.10
presidents and vice-presidents. The experiment illustrated in [17] uses a small RDF
fragment containing the president and vicePresident predicates only. In this ex-
periment, we used a real-life fragment of DB PEDIA 3.9 (obtained by means of a crawl-
ing process), containing a number of irrelevant and possibly noisy entities and relations.
Following the procedure in [11], the DB PEDIA 3.9 RDF graph was traversed starting
from resources representing US presidents and vice-presidents: all immediate neigh-
bors, i.e. those with a recursion depth of 1, were retrieved, together with their related
schema information (direct classes and their super-classes, together with their hierar-
chy). All extracted knowledge was used to create an ALCH ontology fragment, with
78795 axioms, 16606 individuals, 132 properties and 11 classes.
In this experiment, 82 individuals representing US presidents and vice-presidents
were interlinked by 25 relations represented by atomic roles. The proposed method,
denoted as AKP (A), makes use of such atomic roles to identify relations holding among
the examples in the ontology.
Experimental results are summarized in Fig. 1. We observe that AUC-PR values
obtained with AKP (A) are significantly higher than results obtained by other methods
considered in comparison (p < 0.05, except for three cases in which p < 0.10). Results
show how presidents and vice-presidents linked by simple relations such as president
and vicePresident tend to be affiliated to the same political party.
AKP (A) is able to identify which atomic roles are likely to link same party af-
filiates. As expected, it recognizes that relations represented by the president and
vicePresident atomic roles should be associated to higher weights, which means
that presidents and their vice-presidents tend to have similar political party affiliations.
AKP (A) also recognizes that presidents (or vice-presidents) linked by the successor
atomic role are unlikely to have similar political party affiliations.
5 Conclusions and Future Works
In this work, we proposed a semi-supervised transductive inference method for statis-
tical learning in the context of the Web of Data. Starting from the assumption that some relations among entities in a Web ontology can encode similarity information w.r.t. a given prediction task (pertaining to a particular property of examples, such as a class-membership relation), we proposed a method (named Adaptive Knowledge Prop-
agation, or AKP) for efficiently learning the best way to propagate knowledge among
related examples (each represented by an individual) in a Web ontology.
We empirically show that the proposed method is able to identify which relations
encode similarity w.r.t. a given property, and that their identification can provide an ef-
fective method for predicting unknown characteristics of individuals. We also show that
the proposed method can provide competitive results, in terms of AUC-PR, in compar-
ison with other state-of-the-art methods in the literature.
We only considered relations between statistical units (i.e. training examples) en-
coded by atomic roles. However, those do not always suffice: for example, in the re-
search group affiliation prediction task discussed in [16], individuals representing re-
searchers in the AIFB Portal ontology are not related by any atomic role. We are
currently investigating other approaches to identifying meaningful relations among in-
dividuals, for example by means of Conjunctive Query Answering [12]. Other research
directions involve the study of different objective functions and optimization methods.
Acknowledgments This work fulfills the objectives of the PON 02 00563 3489339 project “PUGLIA@SERVICE – Internet-based Service Engineering enabling Smart Territory structural development” funded by the Italian Ministry of University and Research (MIUR).
References
1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The
Description Logic Handbook. Cambridge University Press (2007)
2. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5),
34–43 (May 2001)
3. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.:
DBpedia - a crystallization point for the Web of Data. J. Web Sem. 7(3), 154–165 (2009)
4. Bloehdorn, S., Sure, Y.: Kernel methods for mining instance data in ontologies. In: Aberer,
K., et al. (eds.) Proceedings of ISWC’07. LNCS, vol. 4825, pp. 58–71. Springer (2007)
5. Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press (2006)
6. Cohen, M.B., Kyng, R., Miller, G.L., Pachocki, J.W., Peng, R., Rao, A., Xu, S.C.: Solving SDD linear systems in nearly m log^{1/2} n time. In: Shmoys [24], pp. 343–352
7. Davis, J., Goadrich, M.: The relationship between Precision-Recall and ROC curves. In:
Cohen, W., et al. (eds.) Proceedings of ICML’06. pp. 233–240. ACM (2006)
8. Fanizzi, N., d’Amato, C., Esposito, F.: Induction of robust classifiers for web ontologies
through kernel machines. J. Web Sem. 11, 1–13 (2012)
9. Franz, T., Schultz, A., Sizov, S., Staab, S.: TripleRank: Ranking Semantic Web Data by Ten-
sor Decomposition. In: Bernstein, A., et al. (eds.) International Semantic Web Conference.
LNCS, vol. 5823, pp. 213–228. Springer (2009)
10. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis
Lectures on the Semantic Web, Morgan & Claypool Publishers (2011)
11. Hellmann, S., Lehmann, J., Auer, S.: Learning of OWL Class Descriptions on Very Large
Knowledge Bases. Int. J. Semantic Web Inf. Syst. 5(2), 25–48 (2009)
12. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman
& Hall/CRC (2009)
13. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT
Press (2009)
14. Koutra, D., Ke, T.Y., Kang, U., Chau, D.H., Pao, H.K.K., Faloutsos, C.: Unifying Guilt-
by-Association Approaches: Theorems and Fast Algorithms. In: Gunopulos, D., et al. (eds.)
Proceedings of ECML/PKDD’11. LNCS, vol. 6912, pp. 245–260. Springer (2011)
15. Lin, H.T., Koul, N., Honavar, V.: Learning Relational Bayesian Classifiers from RDF Data.
In: Aroyo, L., et al. (eds.) International Semantic Web Conference (1). LNCS, vol. 7031, pp.
389–404. Springer (2011)
16. Lösch, U., Bloehdorn, S., Rettinger, A.: Graph kernels for RDF data. In: Simperl, E., et al.
(eds.) Proceedings of ESWC’12. LNCS, vol. 7295, pp. 134–148. Springer (2012)
17. Nickel, M., Tresp, V., Kriegel, H.P.: A Three-Way Model for Collective Learning on Multi-
Relational Data. In: Getoor, L., et al. (eds.) Proceedings of ICML’11. pp. 809–816. Omni-
press (2011)
18. Ochoa-Luna, J.E., Cozman, F.G.: An Algorithm for Learning with Probabilistic Description
Logics. In: Bobillo, F., et al. (eds.) Proceedings of the 5th International Workshop on Uncer-
tainty Reasoning for the Semantic Web, URSW09. CEUR Workshop Proceedings, vol. 654,
pp. 63–74. CEUR-WS.org (2009)
19. Peng, R., Spielman, D.A.: An efficient parallel solver for SDD linear systems. In: Shmoys
[24], pp. 333–342
20. Rettinger, A., Lösch, U., Tresp, V., d’Amato, C., Fanizzi, N.: Mining the Semantic Web:
Statistical learning for next generation knowledge bases. Data Min. Knowl. Discov. 24(3),
613–662 (2012)
21. Rettinger, A., Nickles, M., Tresp, V.: Statistical Relational Learning with Formal Ontologies.
In: Buntine, W.L., et al. (eds.) Proceedings of the European Conference on Machine Learning
and Knowledge Discovery in Databases, ECML/PKDD’09. LNCS, vol. 5782, pp. 286–301.
Springer (2009)
22. Shadbolt, N., Berners-Lee, T., Hall, W.: The Semantic Web Revisited. IEEE Intelligent Sys-
tems 21(3), 96–101 (2006)
23. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge Univer-
sity Press (2004)
24. Shmoys, D.B. (ed.): Symposium on Theory of Computing, STOC 2014, New York, NY,
USA, May 31 - June 03, 2014. ACM (2014)
25. Spielman, D.A.: Algorithms, Graph Theory, and Linear Equations in Laplacian Matrices. In:
Proceedings of ICM’10. pp. 2698–2722 (2010)
26. Tresp, V., Huang, Y., Bundschus, M., Rettinger, A.: Materializing and querying learned
knowledge. In: Proceedings of IRMLeS’09 (2009)
27. Vapnik, V.N.: Statistical learning theory. Wiley, 1 edn. (Sep 1998)
28. de Vries, G.K.D.: A Fast Approximation of the Weisfeiler-Lehman Graph Kernel for RDF
Data. In: Blockeel, H., et al. (eds.) ECML/PKDD (1). LNCS, vol. 8188, pp. 606–621.
Springer (2013)
29. Zhang, X., et al.: Hyperparameter Learning for Graph Based Semi-supervised Learning Al-
gorithms. In: Schölkopf, B., et al. (eds.) NIPS. pp. 1585–1592. MIT Press (2006)
30. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-Supervised Learning Using Gaussian Fields
and Harmonic Functions. In: Fawcett, T., et al. (eds.) Proceedings of ICML’03. pp. 912–919.
AAAI Press (2003)
Automated Evaluation of Crowdsourced
Annotations in the Cultural Heritage Domain
Archana Nottamkandath¹, Jasper Oosterman², Davide Ceolin¹, and Wan Fokkink¹
¹ VU University Amsterdam, Amsterdam, The Netherlands
{a.nottamkandath,d.ceolin,w.j.fokkink}@vu.nl
² Delft University of Technology, Delft, The Netherlands
j.e.g.oosterman@tudelft.nl
Abstract. Cultural heritage institutions are employing crowdsourcing
techniques to enrich their collection. However, assessing the quality of
crowdsourced annotations is a challenge for these institutions and manu-
ally evaluating all annotations is not feasible. We employ Support Vector
Machines and feature set selectors to understand which annotator and
annotation properties are relevant to the annotation quality. In addition
we propose a trust model to build an annotator reputation using subjec-
tive logic and assess the relevance of both annotator and annotation prop-
erties on the reputation. We applied our models to the Steve.museum
dataset and found that a subset of annotation properties can identify
useful annotations with a precision of 90%. However, our studied anno-
tator properties were less predictive.
1 Introduction
Cultural heritage institutions have large collections which can be viewed in ex-
hibitions and often are digitised and visible online. For these institutions the
metadata of these artefacts (paintings, prints, sculptures etc.) are of the utmost
importance. They notably cover the physical properties of the artefact (e.g. di-
mensions, material), provenance properties (e.g. creator, previous owners) and
the subject matter (what is depicted on the artefact). Typically, cultural her-
itage institutions employ professionals, mostly art historians, who mostly provide
high-quality annotations about art-historical properties of artefacts, but tend to
lack domain expertise for other aspects such as names of depicted items (of e.g.
flowers and birds). Given the large scale of the collections, their annotation capacity is also too limited to describe the subject matter in detail.
Due to these limitations institutions are looking into the knowledge and ca-
pacity of crowds. Projects such as Steve.museum [18], Your Paintings [4] and
Waisda? [8], are all examples of cultural heritage or media institutions opening
up their collection to the crowd for annotation. In these projects institutions en-
gage people from the web in different tasks with the purpose of integrating the
obtained data within their collections. However, employed professional annota-
tors are trained and follow strict guidelines on how to correctly and qualitatively
annotate artefacts, to maintain the high quality standards these institutions
have. Crowdsourced annotators are not trained in such a way and their quality
cannot be guaranteed in a straightforward manner.
Crowdsourced annotations thus need to be assessed, to evaluate whether
they meet the institution’s quality criteria. However, manually evaluating such
a large amount of annotations is likely as expensive as entering the information
manually. Thus there is a need to develop algorithms which can automatically or
semi-automatically predict the trustworthiness of crowd annotations. The goal of
this study is to understand which kinds of properties are important in deciding
this trustworthiness, so that in the future suitable annotators can be recruited,
or annotation tasks can be tuned in such a way to more likely obtain desired
information. The results from this study will thus have implications in the fields
of expert finding and task formulation in the domain of crowdsourcing cultural
heritage data. In this paper we answer the following research questions:
RQ1: Which annotation properties affect the trustworthiness of crowd-provided
annotations?
RQ2: Can an annotator’s profile information help in the estimation of annota-
tion and annotator trustworthiness?
In this paper we make use of the Steve.museum dataset [18] containing re-
viewed annotations on museum objects and information about the annotators
such as age, museum and annotation familiarity and income. We propose a trust
model for annotator reputation and make prediction models for both annotation
usefulness and annotator reputation. The contributions of this paper are: 1) A
trust model for reputation based on subjective logic, and 2) insights into the rel-
evance of annotation and annotator properties on the trustworthiness of cultural
heritage annotations.
The remainder of the paper is structured as follows. Section 2 compares our
work to existing methods. Section 3 describes our methodology and presents the
trust model and semantic model. The Steve.museum case study and semantic
representation of the data are described in Section 4. Experiments and evalua-
tions are reported in Section 5 and Section 6 provides conclusions of the paper.
2 Related Work
The problem of assessing the trustworthiness of annotations and annotators is
not new. There exist several ontologies for representing trust (e.g., those of Gol-
beck et al. [6] and of Alnemr et al. [1]). While these put emphasis on the social
aspects of trust, we are more interested in the trustworthiness of annotations and
annotators. Ceolin et al. [2] employed semantic similarity measures, clustering
algorithms and subjective logic for the semi-automatic evaluation of annotations
in the cultural heritage domain. A probabilistic model, based on a combination
of an annotators reputation and the semantic similarity with already labelled
annotations, is used to assess the usefulness of new annotations, achieving 80%
correctness. In this paper we take a different approach and employ machine learn-
ing algorithms to determine the usefulness of an annotation by using features of
both annotator and annotations.
Majority voting [9] is a commonly used method to assess the quality of anno-
tations. However, for domains with a broad vocabulary, such as the cultural her-
itage domain, this is not optimal. Adapted annotator agreement or disagreement
measures have also been studied [11,5], by considering, for example, annotator
history and agreement with aggregated label. In contrast, we employ subjective
logic to build a user reputation based on his/her positive and negative contri-
butions, and focus more on identifying features about the information and the
user that may help to predict his/her trustworthiness.
Task design is also important to achieve qualitative annotations. Test ques-
tions or other specialised constructions should be employed to filter out low-
quality and spam workers [14] and are necessary to approximate results from
experts [15].
Annotation properties have also been studied in the context of Wikipedia [19]
and Twitter [17]. Annotation quality has been shown to be related to properties
of the annotator. The impact of user information such as age, gender, education and demographics on crowdsourcing tasks has been explored in [13]. They
explored the relationship between worker characteristics and their work quality
and showed a strong link between them. In this paper we continue in this di-
rection and investigate the relationship between annotation quality and a more
extensive set of user properties including income and internet connection speed.
3 Methodology
In this section we describe the methodology employed in this paper. Our methodology focuses on methods to understand the importance of annotator and annotation properties and is outlined in Algorithm 1. Firstly, we identify the
features which are relevant for predicting the value, in our case the evaluation
of the annotation and the reputation of the annotator. Feature identification is
done through three different methods: process analysis, extended analysis and
using feature selection algorithms. Having identified the sets of features, we per-
form an independent correlation analysis of each of the identified features with
the value. We split the dataset into a test and a training set and use the feature
sets to predict the value. The result of the feature selection methods are then
compared.
In Section 3.1 we describe the trust modelling of annotator reputation and
in 3.2 we describe the semantic representation of our data model.
3.1 Trust Modelling
The annotation process involves an annotator who is either a user from the crowd
or an employee of a cultural heritage institution who provides information about
Algorithm 1: Algorithm to perform predictions based on relevant features
Input: A finite set of features F and values used for training,
       Input_set = {⟨F, value⟩}
Output: A finite set of relevant features and predicted values,
        Output_set = {⟨F_relevant, predicted_value⟩}
F_relevant ← Identify_relevant_features(Input_set)
for each F_relevant_i in F_relevant do
    Compute_correlation(F_relevant_i, value)
Train_set ← Build_train_set(F_relevant, value)
Test_set ← Build_test_set(F_relevant)
Output_set ← Employ_machine_learning(Train_set, Test_set)
return Output_set
digital artefacts. A digital artefact is an image of the actual physical artefact
which is published online by the cultural heritage institution. An annotation
is information describing some properties of the digital artefact such as what
is depicted, who is the artist, etc. A reviewer is a trusted entity, usually an
employee of a cultural heritage institution who evaluates the annotation and
decides if it is to be accepted or not, based on review policy of the institution.
Aggregating the annotations and their evaluations per annotator helps us
understand the reputation of the annotator in the system based on the total
number of useful and not useful annotations. We define reputation of an anno-
tator as a value representing the trustworthiness of a given annotator, based on
the evaluation that a cultural heritage institution made of the tags that he or
she contributed.
In order to properly model and represent the user expertise and reputation
based on the evidence at our disposal, we use a probabilistic logic named subjec-
tive logic [12]. It models the truth of propositions as Beta probability distribu-
tions that represent both the probability of the proposition to be true (i.e., for
instance, the probability of a user to be trustworthy) and the uncertainty about
this probability. In subjective logic such a probability distribution is represented
by means of the “opinion” (ω) construct. An opinion that a certain institution
holds with respect to a given annotator is represented as follows:
\omega^{institution}_{annotator}(belief, disbelief, uncertainty, apriori)

where

belief + disbelief + uncertainty = 1, \quad apriori \in [0, 1]

and

belief = \frac{p}{p + n + 2}, \quad disbelief = \frac{n}{p + n + 2}, \quad uncertainty = \frac{2}{p + n + 2}.
Here p is the amount of positive evidence (e.g., annotations evaluated as useful), n the amount of negative evidence (e.g., annotations evaluated as not useful), and apriori is the prior knowledge about the reputation, which is set to 1/2 by
default. The actual value that we use to represent an annotator’s reputation is
the expected value of the corresponding Beta distribution, that is computed as:
E = belief + apriori · uncertainty
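As a small illustration (a plain-Python sketch of the definitions above; the counts in the example are made up), the reputation of an annotator with p useful and n not useful annotations can be computed as:

def opinion(p, n, apriori=0.5):
    # Subjective-logic opinion (belief, disbelief, uncertainty, apriori)
    belief = p / (p + n + 2)
    disbelief = n / (p + n + 2)
    uncertainty = 2 / (p + n + 2)
    return belief, disbelief, uncertainty, apriori

def reputation(p, n, apriori=0.5):
    # Expected value E = belief + apriori * uncertainty
    belief, _, uncertainty, apriori = opinion(p, n, apriori)
    return belief + apriori * uncertainty

print(reputation(p=18, n=2))   # e.g. 18 useful and 2 not useful annotations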
Subjective logic offers a wide range of operators that allow one to reason upon
the evidence at our disposal and infer the reputation based on the different
features considered. But we use it merely for a representation purpose. In fact,
to apply such operators we would need to know a priori the kind of relations that
occur between the features that we identify and the reputation. These relations
will instead be discovered by means of a machine learning approach.
We use subjective logic to model both annotator and annotation reputations
by means of the expected value E. In the case of the annotators, we collect
evidence about them (i.e. reviews of the tags they contributed) and we estimate
their reputations by means of the subjective opinions described above. In the
case of annotation reputations, we use the expected value E to model them, but
their prediction is made by means of the machine learning methods.
3.2 Semantic Modelling
We adopt semantic web technologies for representing the annotations and the
related metadata. This is done for two reasons. First, they provide a uniform
layer that allow us interoperability and prevents us from relying on the specific
structure such as relational databases. Second, they provide a means to possibly
share metadata and computation results in such a manner that other institu-
tions could benefit from them, thus promoting the sharing of possibly precious
information (precious both because of their specificity and of their quality).
A (crowd) annotator performs an annotation task. The annotator’s features
(e.g., age, country, education) are as much as possible represented by means of
the standard FOAF ontology [3], while the annotation is represented by means
of the Open Annotation Model [16].
The annotation entered by the user is reviewed by an employee of the cultural
heritage institution. The annotation evaluation is yet again represented by means
of the Open Annotation Model, as an annotation of the first annotation. All the
features we adopt in our computation that are not representable by means of
standard vocabularies are represented by means of an ad-hoc construct (“ex:”
prefix). An illustration of the annotation (and related metadata) representation
is provided in Figure 1, where it is also indicated that we use annotator and an-
notation features as a basis for estimating the value of an annotation evaluation.
4 Cultural Heritage Annotations: Steve.museum
The Steve.museum [18] dataset was created by a group of art museums with the
aim to explore the role that user-contributed descriptions can play in improving
[Figure: RDF graph in which a User (a foaf:person, with properties such as foaf:age, foaf:gender, and ex:length of the tag) is the oac:annotator of an Annotation (an oac:annotation) that oac:annotates a Target via oac:hasTarget and has a Tag as oac:hasBody; a Reviewer is the oac:annotator of a Review, itself an oac:annotation whose oac:hasTarget is the first Annotation. The annotator and annotation features are used to estimate the Review value.]
Fig. 1: Representation of an annotation and of the related metadata.
on-line access to works of art. Annotations were gathered for 1,784 artworks
and the usefulness, either useful or not useful, of each annotation was evaluated
by professional museum staff. The annotations including their evaluations and
annotator information were published as a dataset to study³.
We performed two pre-processing steps on the data. First, for the correct calculation of the annotator reputation we need at least five annotations per annotator, and as such we removed data from annotators with fewer annotations. It
also occurred that multiple reviewers evaluated the same annotation. For those
annotations we took the majority vote of the evaluations. In case of a tie we
always chose useful, giving more weight to a potentially useful annotation.
The dataset contains both anonymous (730) and registered (488) annotators.
Table 1 lists the annotator properties and the percentage of registered annota-
tors who filled in each property. The distribution of the number of annotations
per annotator follows a power law. The majority of the annotations (87%) were
evaluated as useful. Considering other crowdsourcing initiatives this was a re-
markably good crowd. Table 2 provides a summary of the complete dataset.
Table 1: Annotator properties and the percentage of registered annotators who filled in the property.

Community   Experience   Education   Age         Gender      Household income
431 (88%)   483 (99%)    483 (99%)   480 (98%)   447 (92%)   344 (70%)

Works in a museum   Involvement level   Tagging experience   Internet connection   Internet usage
428 (88%)           411 (84%)           425 (87%)            406 (83%)             432 (89%)
³ http://verne.steve.museum/steve-data-release.zip
Table 2: Summary of the Steve.museum dataset.
Number of annotators / Registered 1,218 / 488 (40%)
Provided tags 45,733
Unique tags 13,949
Tags evaluated as useful 39,931 (87%)
Tags evaluated as not useful 5,802 (13%)
5 Evaluation
Annotations in the Steve.museum dataset have been assessed as either useful or
not useful. Each annotator has a reputation score using the model described in
Section 3.1. Using machine learning techniques, we aim to automatically predict
the evaluation of the annotations based on features of annotators and annota-
tions. Next to that we aim to predict the reputation of the annotator based on
the annotator features. The first subsection describes the setup and tooling of our
experiments. Section 5.2 contains the results of analysing the relation between annotation properties and the usefulness of annotations, and Section 5.3 the relation between annotator properties and both annotation evaluation and user reputation.
5.1 Experimental Setup
In order to perform fair training, we randomly selected 1000 useful and 1000
not useful annotations as training set. The remainder of the dataset was used as
test set. We used a Support Vector Machine (Sequential Minimal Optimisation⁴, default PolyKernel⁵) on selected features to predict annotation usefulness, since
that algorithm works for dichotomous variables, and is commonly used, fast
and resistant against over-fitting. For prediction of the reputation of a user
(an interval variable) we used a similar algorithm but adapted for regression.
For automated selection of relevant features we used correlation-based feature
subset selection [7]. This algorithm selects subsets of features that are highly
correlated with the prediction class but have a low inter-correlation.
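As an analogous, hedged sketch of this setup (ours, using scikit-learn rather than the WEKA SMO implementation and correlation-based feature subset selection actually used in the experiments; the feature matrices and labels below are synthetic placeholders):

import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X_train = rng.random((20, 3))                 # toy annotation/annotator features
y_train = rng.integers(0, 2, 20)              # toy labels: useful / not useful
X_test = rng.random((5, 3))

clf = SVC(kernel="poly", degree=1)            # roughly comparable to SMO + a PolyKernel
clf.fit(X_train, y_train)
print(clf.predict(X_test))                    # predicted annotation usefulness

rep_train = rng.random(20)                    # toy reputation scores in [0, 1]
reg = SVR(kernel="poly", degree=1)            # regression variant for reputation
reg.fit(X_train, rep_train)
print(reg.predict(X_test))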
To calculate an independent correlation between different types of variables
we used appropriate statistical tests; Biserial for interval, ordinal and nomi-
nal against dichotomous variables followed by Wilcoxon rank sum for ordinal
and Chi squared for nominal; Fisher’s exact test for two dichotomous variables;
Kendall τ for ordinal against interval variables; and Pearson for both two inter-
val variables and nominal against interval variables. Fisher’s exact test signals a
strong correlation above a score of 1.0.
⁴ We used the implementation inside the tool WEKA http://cs.waikato.ac.nz/ml/weka/.
⁵ There are specific kernels targeting RDF data, but these were, for simplicity reasons, not used.
5.2 Predicting Annotation Evaluation Using Annotation Features
Feature Selection. We manually analysed the annotations in the different evaluation categories of Steve.museum so as to understand the evaluation policies; the resulting feature set is denoted F_man. From our observations, we found that some of
the evaluations were strongly influenced by certain features of the annotation.
Annotations that did not describe something actually depicted, for example sen-
timental annotations such as “happy”, were evaluated as not useful. Adjectives
in general were not deemed useful. Also annotations in non-English languages
or misspelled words were evaluated as not useful. To detect these problems we
created the features is_adjective, is_english and in_wordnet, where the latter
signals a correctly spelled word. For detecting the language of a tag we used
the n-gram based language detection from [10]. For detecting the adjective and
spelling errors we used Wordnet,6 where words not in Wordnet are treated as
incorrectly spelled. For multi-words annotations we assessed whether either of
the words matched the criteria. We explored the possibilities to extract more
features which might be indicative of the evaluation of the annotation repre-
sented as F_all . We regarded the creation time (both day and hour) of the
annotation, how specific the annotation was (based on the depth a word occurs
at in the Wordnet tree), the length and number of words of the annotation, and
the frequency with which the annotation was created for the same object.
We applied the feature selection algorithm to the features from F_all on the
annotation data resulting in the feature set F_ml.
F_man = [is_adjective, is_english, in_wordnet]
F_all = F_man + [created_day, created_hour, Length, Specificity, nrWords, Frequency]
F_ml = [created_day, in_wordnet, Frequency]
Independent correlation of annotation features. We performed an inde-
pendent correlation analysis of the mentioned features with regard to the eval-
uation of the annotation. We observed a strong correlation (3.34, using Fisher’s
exact test) for in_wordnet, significant at <0.01. We observed a weak corre-
lation for Specificity (-0.11), Frequency (0.14), is_adjective(0.67, Fisher) and
is_English (0.94, Fisher, not statistically significant).
Predicting annotation evaluation. Table 3 lists the precision, recall and
F-measure of the three feature sets. We observe that the precision is high, rang-
ing from 0.90 to 0.978 in all the cases of classifying useful annotations. All three
methods for creating the feature sets result in a model that can predict use-
ful annotations very well. However, the recall is high only for the feature set
F_man, while the predictions using feature sets F_all and F_ml had a high
number of false positives.
None of the classifiers performed well in predicting the annotations which
were classified as not useful. There was a large number of false positives and the
⁶ We used the NLTK library (http://nltk.org/) to query the Wordnet tree.
precision was very low in all cases, ranging from 0.13 to 0.21. Thus from our analysis we can observe that although the machine learning classifiers using the three different feature sets were comparably successful in identifying useful annotations, none of them succeeded in identifying the not useful annotations.
Table 3: Comparison of results from SVM predictions using annotation features.
Feature set Class Precision Recall F-measure
F_man useful 0.90 0.90 0.90
not useful 0.21 0.20 0.20
F_all useful 0.91 0.75 0.83
not useful 0.18 0.42 0.25
F_ml useful 0.98 0.20 0.34
not useful 0.13 0.96 0.23
5.3 Predicting Annotation Evaluation And User Reputation Using
Annotator Features
Feature Selection. The set F_man is based on the annotator properties listed
in Table 1. Apart from the provided features for an annotator, we also compute
certain features related to the annotations they provided, which may be useful
for predicting the evaluation of an annotation. The computed features are the
total number of annotations entered by the user (#Annotations), the vocabu-
lary size and diversity of the annotator, and the number of matched annotations
in Wordnet (#matched_in_wordnet). The vocabulary size of an annotator is
the number of distinct annotations after stemming has been applied. The vocab-
ulary diversity is computed as the vocabulary size divided by the total number of
annotations provided by that annotator. The definition of vocabulary diversity
is reasonable in view of the fact that the number and length of annotations is relatively small in the Steve.museum dataset.
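For illustration, a minimal sketch of these two computed features (assuming NLTK's Porter stemmer; the tag list is made up):

from nltk.stem import PorterStemmer

def vocabulary_features(annotations):
    stemmer = PorterStemmer()
    stems = {stemmer.stem(tag.lower()) for tag in annotations}
    vocabulary_size = len(stems)                    # distinct stemmed annotations
    vocabulary_diversity = vocabulary_size / len(annotations)
    return vocabulary_size, vocabulary_diversity

print(vocabulary_features(["flower", "flowers", "bird", "birds", "painting"]))
# (3, 0.6): "flower"/"flowers" and "bird"/"birds" each share a stem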
Two sets are obtained by applying the feature selection algorithm twice: once to identify relevant features for the annotation evaluation, represented as F_ml_a, and once to identify relevant features for the annotator reputation, represented as F_ml_u. For the prediction of the annotation
evaluation, we merged the annotation data with the corresponding annotator
properties and performed a prediction of annotation evaluation. We applied the
feature selection algorithm to the features from F_all on the annotation data
(F_ml_a) and on the user data (F_ml_u) resulting in the following features.
F_man = [Features in Table 1]
F_all = [F_man, vocabulary_size, vocabulary_diversity, is_anonymous, #Annotations_in_wordnet]
F_ml_a = [vocabulary_size, vocabulary_diversity]
F_ml_u = [Language, Education, Community, #tags_wordnet, Tagging_experience]
Independent correlation analysis of annotator features. A statistical correlation analysis was performed to determine the relationship between the annotator features and the annotator reputation and annotation evaluation, as shown in Table 4. For the annotation evaluation, Experience, Education, Tag-
ging Experience, Internet connection and Internet usage had a weak correlation
that was statistically significant. However, Community had a higher correlation
compared to the other features. For the annotator reputation, the computed
features such as # Annotations, vocabulary size and #Annotations in Wordnet
were considered significant.
Table 4: Correlation of features with annotation evaluation and annotator reputation. In brackets the statistical test (see Section 5.1). * indicates significance at p < 0.01. Note: Fisher signals a high correlation for values > 1.

Annotator feature            Correlation score            Correlation score
                             (Annotation evaluation)      (Annotator reputation)
Community 0.22* (C+B) 0.22 (P)
Experience 0.02* (W+B) 0.02 (K)
Education 0.02* (W+B) 0.01 (K)
Age 0.01 (B) -0.16 (P)
Gender 1.11 (F) -0.004 (B)
Household income -0.14 (W+B) -0.14 (K)
Works in a museum 0.99 (F) -0.34 (B)
Involvement level 0.04* (W+B) -0.10 (K)
Tagging experience 1.22* (F) -0.08 (B)
Internet connection 0.02* (W) 0.06 (K)
Internet usage 0.02* (W) -0.16 (K)
# Annotations -0.06 (B) 0.27* (P)
Vocabulary size -0.06 (B) 0.27* (P)
Vocabulary diversity 0.05 (B) -0.03 (P)
# Annotations in Wordnet -0.08 (B) 0.31* (P)
Predicting annotation evaluation and annotator reputation. From Table
5 we can see that the features identified from the annotator profile and those
identified by the feature selection algorithm are useful in classifying useful an-
notations and have a high precision of 0.91. However, these methods also have
lower values of recall, indicating a high number of false negatives. Both methods
have a low precision and recall in classifying not useful annotations, and thus
are not successful in predicting not useful annotations.
We used a SVM for regression to estimate the reputation of the annotator
since it was hard to perform a classification for reputation. This is because the
reputation is highly right skewed with 90% of the annotators having a reputation
> 0.7. This makes it hard to classify data and distinguish the classes when the
distribution is highly skewed. Another point is that classification of reputation
is highly use case dependent. Upon performing regression on the reputation,
as shown in Table 6, we can observe that all the predictions have a very high
relative absolute error and low coefficients. Another observation is that relative
weights assigned to the #Annotations in Wordnet feature are relatively high,
showing consistency with our earlier analysis.
Table 5: Comparison of results from SVM predictions using annotator features.
Feature set Class Precision Recall F-measure
F_man useful 0.90 0.29 0.44
not useful 0.11 0.73 0.20
F_all useful 0.91 0.69 0.78
not useful 0.15 0.43 0.22
F_ml_a useful 0.91 0.55 0.68
not useful 0.13 0.53 0.21
Table 6: Comparison of results from predicting annotator reputation using SVM
regression and 10-fold cross validation.
Feature set corr Mean abs error Root mean sq error Rel abs error
F_man −0.02 0.10 0.15 97.8%
F_all 0.22 0.09 0.13 95.1%
F_ml_u 0.29 0.09 0.13 90.4%
6 Conclusion and Future Work
In this paper we described methods which can automatically evaluate annota-
tions. The experiment was performed on the Steve.museum dataset and investi-
gated the effect of annotation and annotator properties in predicting trustworthi-
ness of annotations and reputation of annotator. We also devised a model using
Support Vector Machines for predicting annotation evaluation and annotator
reputation. The presence of an annotation in Wordnet is shown to be indicative of
the perceived usefulness of that annotation. With a small set of features we were
able to predict 98% of the useful and 13% of the not useful annotations correctly.
The annotator reputation was computed using a model in subjective logic. Since
the reputation of annotators is highly skewed in this dataset (with more than
90% having a reputation > 0.7), we could not make successful estimations of
reputation from annotator profiles.
As part of future work, we would like to repeat the experiment on other cul-
tural heritage datasets. We would also like to build a reputation for an annotator
based on topics of expertise, to obtain more accurate correlations between the
semantics of the annotation and the topical reputation of the annotator. Our
analysis also indicated that there is relevance in aspects related to creation time
of an annotation. A more sophisticated model, such as whether an annotation
was created during work or during free-time might increase the predictive power.
Acknowledgements This publication was supported by Data2Semantics and
SEALINCMedia projects from the Dutch National program COMMIT.
References
1. Alnemr, R., Paschke, A., Meinel, C.: Enabling reputation interoperability through
semantic technologies. In: I-SEMANTICS. pp. 1–9. ACM (2010)
2. Ceolin, D., Nottamkandath, A., Fokkink, W.: Efficient semi-automated assessment
of annotation trustworthiness. Journal of Trust Management 1, 1–31 (2014)
3. Brickley, D., Miller, L.: FOAF. http://xmlns.com/foaf/spec/ (Jan 2014)
4. Ellis, A., Gluckman, D., Cooper, A., Greg, A.: Your paintings: A nation’s oil paint-
ings go online, tagged by the public. In: Museums and the Web 2012. Online (2012)
5. Georgescu, M., Zhu, X.: Aggregation of crowdsourced labels based on worker his-
tory. In: Proceedings of the 4th International Conference on Web Intelligence,
Mining and Semantics (WIMS14). pp. 37:1–37:11. WIMS ’14, ACM (2014)
6. Golbeck, J., Parsia, B., Hendler, J.A.: Trust networks on the semantic web. In:
CIA. pp. 238–249. Springer (2003)
7. Hall, M.A.: Correlation-based Feature Subset Selection for Machine Learning.
Ph.D. thesis, University of Waikato, Hamilton, New Zealand (1998)
8. Hildebrand, M., Brinkerink, M., Gligorov, R., van Steenbergen, M., Huijkman,
J., Oomen, J.: Waisda?: Video labeling game. In: Proceedings of the 21st ACM
International Conference on Multimedia. pp. 823–826. MM ’13, ACM (2013)
9. Hirth, M., Hossfeld, T., Tran-Gia, P.: Cost-optimal validation mechanisms and
cheat-detection for crowdsourcing platforms. In: Innovative Mobile and Internet
Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference
on. pp. 316–321 (June 2011)
10. Hornik, K., Mair, P., Rauch, J., Geiger, W., Buchta, C., Feinerer, I.: The textcat
package for n-gram based text categorization in R. Journal of Statistical Software
52(6), 1–17 (2013)
11. Inel, O., Aroyo, L., Welty, C., Sips, R.J.: Domain-independent quality measures for
crowd truth disagreement. Journal of Detection, Representation, and Exploitation
of Events in the Semantic Web pp. 2–13 (2013)
12. Jøsang, A.: A logic for uncertain probabilities. Intl. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 9(3), 279–311 (2001)
13. Kazai, G., Kamps, J., Milic-Frayling, N.: The face of quality in crowdsourcing
relevance labels: Demographics, personality and labeling accuracy. pp. 2583–2586.
CIKM ’12, ACM (2012)
14. Kittur, A., Chi, E.H., Suh, B.: Crowdsourcing user studies with mechanical turk.
pp. 453–456. CHI ’08, ACM (2008)
15. Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: A study
about inter-annotator agreement for multi-label image annotation. pp. 557–566.
MIR ’10, ACM (2010)
16. Sanderson, R., Ciccarese, P., de Sompel, H.V., Clark, T., Cole, T., Hunter, J.,
Fraistat, N.: Open annotation core data model. Tech. rep., W3C Community (May
9 2012)
17. Suh, B., Hong, L., Pirolli, P., Chi, E.H.: Want to be retweeted? large scale analytics
on factors impacting retweet in twitter network. In: Social Computing (SocialCom),
2010 IEEE Second International Conference on. pp. 177–184. IEEE (Aug 2010)
18. Trant, J.: Tagging, folksonomy and art museums: Early experiments and ongoing
research. J. Digit. Inf. 10(1) (2009)
19. Warncke-Wang, M., Cosley, D., Riedl, J.: Tell me more: An actionable quality
model for wikipedia. In: Proceedings of the 9th International Symposium on Open
Collaboration. pp. 8:1–8:10. WikiSym ’13, ACM (2013)
Probabilistic Relational Reasoning
in Semantic Robot Navigation
Walter Mayor Toro¹, Fabio G. Cozman¹, Kate Revoredo², and Anna Helena Reali Costa¹
¹ Escola Politécnica, Univ. de São Paulo
Av. Prof. Luciano Gualberto, trav. 3, 158, 05508-970 São Paulo, SP, Brazil
{walter.mayortoro,fgcozman,anna.reali}@usp.br
² Depto. de Informática Aplicada, Univ. Federal do Estado do Rio de Janeiro
Rio de Janeiro, RJ, Brazil
katerevoredo@uniriotec.br
Abstract. We examine the use of semantic web resources in robot nav-
igation; more specifically, in qualitative navigation where uncertain rea-
soning plays a significant role. We propose a framework for robot naviga-
tion that connects existing semantic web resources based on probabilis-
tic description logics, with probabilistic relational learning and planning.
We show the benefits of this framework in a real robot, presenting a case
study on how semantic web resources can be used to face sensor and
mapping uncertainty in a practical problem.
Keywords: Semantic robotics, KnowRob system, probabilistic descrip-
tion logics, Bayesian networks.
1 Introduction
Recent experience has shown that applications in robotics can benefit from semantic information carrying commonsense facts [1, 3, 12, 13]. One particular example of a semantic knowledge system for robotics is the KnowRob package (Knowledge Processing for Autonomous Personal Robots) [15, 16]. KnowRob
operates on ontology databases such as OMICS (indoor common-sense knowl-
edge database) [4], mixing description logics [2] and Bayesian networks [11].
However, it is not always easy to effectively bring these semantic web re-
sources into practical use, as it is necessary to combine semantic information
and low-level data, and to handle uncertain sensors and incomplete maps. In
this paper we propose a framework for qualitative robot navigation that uses the probabilistic description logic knowledge base in KnowRob to learn and reason at a relational level. We explore a scheme where higher level descriptions
are used to reason at an abstract level. This has important advantages. First,
it saves computation as it handles sparser representations. Second, it is a per-
fect match to the level at which knowledge is stored (that is, relations are used
throughout). Third, the use of information in a higher level of abstraction allows
Fig. 1. Overview of the Web-based Relational Robotic Architecture.
knowledge to be generalized and transferred to other agents or used in similar
tasks.
We also describe an implementation with a real robot that demonstrates how
semantic web resources can work within applications that demand substantial
uncertain reasoning. We present our knowledge representation strategy, our sen-
sor gathering blocks, and our ground and abstract reasoning modules. Overall,
our goal is to contribute with a case study on how semantic web resources can
be used in practice. The paper is organized as follows. In Section 2 we give
an overview of the different subsystems that our proposal combines. Section 3
presents the implementation of the system and our experiments. Section 4 con-
cludes the paper.
2 A Framework for Robot Navigation
We consider a robot with three important sources of information. First, sensors
that detect objects in the environment. Second, a map with semantic infor-
mation, that is updated during navigation. Third, and most important, a web
database with commonsense knowledge (for example, likely location of objects
in rooms). In our framework, information processing flows through three ma-
jor modules: Perception, Reasoning and Learning, and Actuation. We call the
whole architecture by Web-based Relational Robotic Architecture (WRRA), as
depicted in Figure 1. From the perspective of uncertain reasoning with seman-
tic web resources, the Perception module (Section 2.1) is the most significant
contribution of this paper. The other two modules are only briefly described as
relevant information can be found in previous publications.
2.1 Perception: semantic information and probabilistic reasoning
This module receives a description of the goal. In the case study of interest here,
the goal is to find a target room inside a house. The Perception module receives
sensory information (detected objects), and must abstract the data into compact
symbolic representations. Upon receiving data, the robot accesses its Semantic
Web resources to determine the most likely room that generated the data, as
described in this section.
Unlike most existing robotic systems, we pursue reasoning at a high level of
abstraction, employing concepts, roles and relations between them as expressed within KnowRob. It is due to this decision that we can effectively employ
semantic web resources. The situation of the robot is described in terms of a re-
lational representation that not only allows for abstraction of metric and sensory
details, but also enables knowledge to be generalized and reused in new tasks.
During navigation, the robot uses sensor data to build a relational representation
of the environment (the semantic map).
The output of the Perception module is a description of the robot’s situation,
which is specified by a conjunction of active predicates (with truth value true)
such as: seeDoor() that indicates that the robot sees one or more doors in the
room; seeNonVisitedDoor(d1 ), meaning that the robot sees door d1 that has
not yet been visited; inTargetRoom(), which indicates that the target room is
where the robot is; nonTargetRoom(p1 ), meaning that the robot is in p1 and it
is not the target room; inRoom(p1 ), which indicates that p1 is the most likely room
where the robot is; and others. The truth value of inRoom(p) is computed by the
Place Inference block, as we explain now.
The Perception module is heavily based on reasoning facilities available in the
KnowRob package. The knowledge base in KnowRob uses RDF triples to repre-
sent a large ontology, with relationships between objects such as Drawer, a sub-
class of StorageConstruct, or Refrigerator-Freezer, a subclass of FurniturePiece
[16]. Additionally, sentences in OWL indicate relationships between objects. Sen-
tences are attached to probabilities, and for inference they are grounded into
Bayesian networks using facilities in the ProbCog system [5].
Just as an example of an RDF triple in the knowledge base, consider the fact,
contained in the OMICS database, that a kitchen contains a refrigerator (XXX
denotes the string http://ias.cs.tum.edu/kb/knowrob.owl):
The Perception module queries KnowRob, which returns, for each observed
object, the probability that the location is each possible room, given the observed
object. Queries are sent to KnowRob through Prolog sentences via function calls
in the Python language; as an example, consider (a complete query is given in
Section 3):
for obj in DetectedObjects:
    q = "bayes_probability_given(knowrob:'OmicsLocations'," \
        "Room,knowrob:'" + obj + "',Pr)"
    query = prolog.query(q)   # prolog: a json_prolog.Prolog() client (see Section 3)
Such a query returns probabilities such as
Room = ’knowrob.owl#Kitchen’
Pr = 0.1031101853182014 ;
That is, given a perceived object oi , KnowRob uses inference with its prob-
abilistic description logic [8, 7] to return P (rj |oi ) for each room rj . The problem
now is to combine these pieces of information into a probability that the robot
is in room rj , given all detected objects o1 , . . . , on . We have:
\[
P(r_j \mid o_1,\ldots,o_n)
= \frac{P(o_1,\ldots,o_n \mid r_j)\,P(r_j)}{P(o_1,\ldots,o_n)}
= \frac{P(o_1 \mid r_j,o_2,\ldots,o_n)\,P(o_2 \mid r_j,o_3,\ldots,o_n)\cdots P(o_n \mid r_j)\,P(r_j)}{P(o_1,\ldots,o_n)}.
\]
We now assume that, given rj , an observation (of an object) is independent of
other observations (of other objects in the same room). Hence:
\[
\begin{aligned}
P(r_j \mid o_1,\ldots,o_n)
&= \frac{P(o_1 \mid r_j)\,P(o_2 \mid r_j)\cdots P(o_n \mid r_j)\,P(r_j)}{P(o_1,\ldots,o_n)} \\
&= \frac{\bigl(P(r_j \mid o_1)P(o_1)/P(r_j)\bigr)\cdots\bigl(P(r_j \mid o_n)P(o_n)/P(r_j)\bigr)\,P(r_j)}{P(o_1,\ldots,o_n)} \\
&= \left(\prod_{i=1}^{n} P(r_j \mid o_i)\right)\frac{\prod_{i=1}^{n} P(o_i)}{P(o_1,\ldots,o_n)\,P(r_j)^{\,n-1}}.
\end{aligned}
\]
We now introduce a substantive assumption, namely, that every room has identical
a priori probability P(r_j). So, P(r_j | o_1, ..., o_n) is proportional to \prod_{i=1}^{n} P(r_j | o_i).
Once the Perception module gets, for each room, each term of this product from
KnowRob, it compares the rooms with respect to this product, setting the truth
value of inRoom(p) to true for p = \arg\max_{r_j} \prod_{i=1}^{n} P(r_j | o_i), and to false
otherwise.
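As a concrete illustration of this comparison, the following is a minimal sketch (with hypothetical helper and variable names, not the actual WRRA code) of how the per-object probabilities returned by KnowRob can be combined into the most likely room:

def most_likely_room(per_object_probs):
    """per_object_probs: one dict per detected object, mapping each
    candidate room to P(room | object) as returned by KnowRob."""
    rooms = per_object_probs[0].keys()
    scores = {}
    for room in rooms:
        score = 1.0
        for probs in per_object_probs:
            score *= probs[room]        # product of P(r_j | o_i) over detected objects
        scores[room] = score
    return max(scores, key=scores.get)

# Example: the values reported in Section 3 for one object, plus made-up
# values for a second object; the living room wins the product comparison.
observations = [
    {"Kitchen": 0.103, "DiningRoom": 0.122, "LivingRoom": 0.366},
    {"Kitchen": 0.080, "DiningRoom": 0.150, "LivingRoom": 0.300},
]
print(most_likely_room(observations))   # -> LivingRoom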
During navigation, a semantic map of the environment is created. Each vis-
ited room and each observed object are represented as vertices of a graph that
describes the topological map (left side of Figure 2). Connectivity between rooms
is represented by graph edges, which are defined through doors connecting the
rooms. While this topological map is built, edges are created by connecting ver-
tices of the topological map to vertices of the conceptual map (right side of Figure
2). Unlike other approaches [1, 3], our map does not involve a metric representation
of the environment. Still, our semantic map inserts probabilistic information in
the representation. Every inference and reasoning in WRRA occurs at the level
of objects, rooms and relationships and properties thereof.
Fig. 2. The semantic map built by WRRA.
2.2 Reasoning and Learning, and Actuation
The WRRA uses reinforcement learning (RL) to refine behavior through in-
teractions with the environment [14]. Typical RL solutions learn from scratch;
we instead employ two levels of RL [6], where an abstract and a ground policy
are learned simultaneously. The stochastic abstract policy learned in a source
task is then used in new similar tasks. Our robot navigation problem is mod-
eled as a Relational Markov Decision Process (RMDP) [10], in which situa-
tions s ∈ S are represented as a conjunction of predicates describing prop-
erties of and relations among objects, such as: s1 = inRoom(livingroom) ∧
nonTargetRoom(livingroom) ∧ seeNoDoors() ∧ notAllDoorsVisited(). Other
formalisms are possible to represent decisions and transitions [9].
A conjunction is a ground conjunction if it contains only ground atoms (such
as s1 given in the example). In our discussion each variable in a conjunction is
implicitly assumed to be existentially quantified. An abstract situation σ (and
abstract behavior α) is a conjunction with no ground atom. A relational rep-
resentation enables us to aggregate situations and behaviors by using variables
instead of constants in the predicate terms. For example, ground situation s1 is
covered by abstract situation σ by replacing livingroom with variable X; in this
case, other situations could also be covered by σ, e.g., s2 = inRoom(kitchen) ∧
nonTargetRoom(kitchen) ∧ seeNoDoors() ∧ notAllDoorsVisited().
Denote by Sσ the set of ground situations s ∈ S covered by abstract situation
σ. We assume that each ground situation s is abstracted to only one abstract
situation σ. Similarly, we define Aα (s) as the set of all ground behaviors a ∈ A
covered by an abstract behavior α in ground situation s. We also define Sab and
Aab as the set of all abstract situations and the set of all abstract behaviors in
an RMDP, respectively. To simplify notation, here we adopt the convention that
if an atom does not appear in a ground sentence, its negation is assumed.
To solve an RMDP is to find an optimal policy π ∗ that maximizes a function
Rt of future rewards. In RL tasks the agent does not know the dynamics of
the process and a series of RL algorithms can be used to find a policy [14]. To
translate from ground to abstract level, we define two operations: abstraction
and grounding. Abstraction is the translation from the ground level (perceived
by the robot’s sensors) to the abstract level by replacing constants with variables,
φs : S → Sab . For a ground situation s, the corresponding abstract situation σ
is given by φs (s) = σ so that s ∈ Sσ . Grounding is the translation from the
abstract level to the ground level, a = grounding(α, s). Clearly only ground
states are sensed and visited by the robot, and only ground actions can be
actually applied. Learning and reasoning must proceed by processing, at time
t, the (ground) experience ⟨st, at, rt, st+1, at+1⟩, which is related to the tuple
⟨σt, αt, rt, σt+1, αt+1⟩.
We propose the following scheme to apply an abstract policy in a ground
problem. Consider a stochastic abstract policy defined as πab : Sab ×Aab → [0, 1].
After the abstract situation σ = φs (s) is derived from the observed ground sit-
uation s, a transferred abstract policy (learned from source tasks) yields prob-
abilities πab (σ, αk ) = P (αk |σ) for all αk ∈ Aab . We select an abstract behavior
αk ∈ Aab according to these probabilities. Then the process remains the same,
with a = grounding(αk , s).
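The following rough sketch illustrates this abstraction-grounding loop (the data structures and helper names are assumptions for illustration, not the WRRA implementation): φs replaces constants with variables, an abstract behavior is sampled from the stochastic abstract policy, and grounding substitutes the constants of the current situation back into the sampled behavior.

import random

def abstract(ground_situation):
    """phi_s: e.g. inRoom(livingroom) -> inRoom(X0); returns the abstract
    situation and the constant-to-variable bindings that were used."""
    sigma, bindings = [], {}
    for pred, args in ground_situation:          # a situation is a list of (predicate, args)
        new_args = tuple(bindings.setdefault(a, "X%d" % len(bindings)) for a in args)
        sigma.append((pred, new_args))
    return tuple(sigma), bindings

def select_abstract_behavior(pi_ab, sigma):
    """Sample alpha_k with probability pi_ab(sigma, alpha_k)."""
    behaviors, probs = zip(*pi_ab[sigma].items())
    return random.choices(behaviors, weights=probs, k=1)[0]

def grounding(alpha, bindings):
    """Replace the variables of alpha by the constants of the current situation."""
    pred, args = alpha
    constants = {var: const for const, var in bindings.items()}
    return pred, tuple(constants.get(a, a) for a in args)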
Thus, in our system, the robot initially receives an abstract policy and applies
it. As its knowledge about the new environment increases, due to its perception
and action in the environment, the robot creates and improves a semantic map,
which places restrictions on the actions defined by the policy initially received,
adapting it to the new environment and to the new task. For example, consider
that the robot identifies that it is in the living room, which is not the target room, and that the
living room has two doors, d1 and d2 . The abstract policy indicates that it can
randomly choose any one of the two doors and go through it, hoping to reach the
target room. Assume that, after going through the chosen door, say d1 , the robot
circulates through other rooms and represents what it discovers about the environment in
a semantic map. Upon returning to the living room without having reached the
target room, the reasoning process now indicates that it should choose another
door (d2 ).
Finally, the Actuation module is divided into a High Level Control (HLC) and a
Low Level Control (LLC). The HLC receives a behavior selected by the Reasoning
and Learning module. The behavior is divided into simple actions that can be
executed by specific hardware modules. Each simple action is sent to LLC, to
be actually executed. Low-level commands are issued by the Actuation module.
Fig. 3. Left: Simulated house. Right: Experimental setup with real robot.
3 Implementation and Discussion
We now describe our implementation and experiments. The robot we consider
is a wheeled base equipped with 5 sensors: 3 semantic cameras, 1 odometer and
1 laser range scanner. We run tests both in a simulated environment and with
the real robot. The simulated scenario (see Figure 3-Left) was designed with
the open source tool for 3D creation, BLENDER3 , with which we have created
a 3D CAD representation of the house and the objects it contains, including
the robot (an ATRV 4-wheeled base); the representation was integrated into the
MORSE simulator4 and the Robot Operating System (ROS)5 . The environment
is a house that has eight types of rooms: 1 hallway (with some potted plants),
1 kitchen (with 1 stove, 1 fridge, and a dishwasher), 1 living room (with 1 sofa
and 2 armchairs), 1 bathroom (with 1 toilet and 1 sink), 3 bedrooms (with 1 bed
and 1 bedside), and 1 dining room (with 1 table and 6 chairs). The real robot is
a Pioneer 2DX, with which we used QR codes to identify doors and objects, so as
to obtain functionality similar to that of a semantic camera.
The semantic camera is, in essence, a sensor that recognizes the objects
that the robot sees and the relative position between the robot and the viewed
objects. The robot was equipped with two semantic cameras that recognize gen-
eral objects and one semantic camera that recognizes only doors. The odometer
and the laser scanner are used in the Navigation module.
To better understand how the WRRA architecture works, how it integrates
information from the Semantic Web, and which technologies are employed, we
describe a simple case study executed in the simulated scenario,
where we have great flexibility in defining tasks and measuring behavior. WRRA
was implemented using the Python programming language and was integrated
with the ROS framework. Initially, the robot knows nothing about the house
3 http://www.blender.org/
4 http://www.openrobots.org/wiki/morse/
5 http://wiki.ros.org/
Fig. 4. Sequence of situations faced by the mobile robot in case study.
and has only a generic abstract policy that defines mappings from abstract sit-
uations into abstract behaviors, such as πabs (inRoom(X) ∧ nonTargetRoom(X) ∧
seeNoDoors() ∧ notAllDoorsVisited()) = findDoor(), indicating that if the
robot is in a certain room that is not the target room and it did not detect any
door in this room and the list of doors visited by it is empty, then the robot
must find a door. The robot is given the goal of reaching the home kitchen. Then,
the robot perceives situations, selects the appropriate behavior for each situation,
and performs it, until the target room is reached. Figure 4 describes a
sequence of five situations faced by the robot.
Situation 1: The Perception module collects information from the environ-
ment using the robot's semantic cameras. From the position where the robot is,
two objects are detected: obj1 = sofa and obj2 = armchair. Then the Place
Inference submodule performs a query to the integrated ROS library KnowRob-
OMICS, which estimates the most likely room where the robot is, taking into
account the objects detected by the robot:
import json_prolog                        # ROS JSON-PROLOG client library

prolog = json_prolog.Prolog()
for obj in objects:                       # objects recognized by the semantic cameras
    q = "bayes_probability_given(knowrob:'OmicsLocations'," \
        "Room,knowrob:'" + obj + "',Pr)"
    query = prolog.query(q)
    for solution in query.solutions():
        room = str(solution['Room'])[37::]          # strip the knowrob.owl IRI prefix
        places.place[room].append(solution['Pr'])   # accumulate P(Room|obj) per room
    query.finish()
As the queries are implemented using Python and the KnowRob-OMICS
is implemented using the PROLOG logic programming language, the WRRA
uses the ROS library JSON-PROLOG 6 to send the queries from Python code
to PROLOG. When the KnowRob-OMICS is queried, it returns the following
information for each recognized object:
Room = ’knowrob.owl#Kitchen’
Pr = 0.1031101853182014 ;
Room = ’knowrob.owl#DiningRoom’
Pr = 0.12185749173969258 ;
Room = ’knowrob.owl#BedRoom’
Pr = 0.12185749173969258 ;
Room = ’knowrob.owl#LivingRoom’
Pr = 0.3655724752190777 ;
Room = ’knowrob.owl#Hallway’
Pr = 0.14893693434851316 ;
Room = ’knowrob.owl#BathRoom’
Pr = 0.1386654216348226 ;
This information gives the probability that each room is the current location
of the robot, given only one recognized object. Figure 5 shows some probabil-
ities that a room type contains a certain object type. These probabilities come
from the ROS library KnowRob-OMICS, which uses Lidstone's law to redistribute
part of the probability mass assigned to observed tuples to unseen tuples such as
(bed, bathroom). Lidstone's law uses a parameter λ < 1; as λ → 1, more probability
mass is assigned to unseen tuples [4]. In our experiments, we set λ = 0.5.
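For reference, a standard formulation of Lidstone smoothing (the exact event counts used inside KnowRob-OMICS may be organized differently) estimates the probability of observing an object type o in a room type r as

\[
P(o \mid r) = \frac{c(o,r) + \lambda}{\sum_{o'} c(o',r) + \lambda\,|O|},
\]

where c(o, r) counts the (object, room) tuples of that type in the corpus and |O| is the number of object types; λ = 0 recovers the maximum-likelihood estimate, while λ = 1 corresponds to Laplace smoothing.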
Then the Place Inference submodule uses the probabilistic reasoning process
explained in Section 2.1 to infer the most likely place where the robot is by taking
into account all objects recognized by the robot. In situation 1 of the case study
reported here, the most likely place is inferred to be p1 = livingroom.
When objects and doors are detected and the place is inferred, the semantic
map is updated. In this instance, the map is updated with the objects obj1 and
obj2 and the place p1 by building the relations obj1 at p1 and obj2 at p1 . Next,
the Situation Inference submodule receives information about detected doors,
the inferred place and the updated semantic map. With this information, the
truth values of the predicates are calculated and the conjunction of predicates
with truth value true forms a situation description using the SMACH library 7 ,
6 http://wiki.ros.org/json_prolog
7 http://wiki.ros.org/smach
Fig. 5. Probabilities of room given observed object, by KnowRob-OMICS.
which is a ROS-independent Python library to build hierarchical state machines.
Finally, the inferred situation is the output of the Perception module, which in this
case is situation1 = inRoom(livingroom) ∧ nonTargetRoom(livingroom) ∧
seeNoDoors() ∧ notAllDoorsVisited().
Then situation1 is sent as input to the Reasoning and Learning module, and
it is transformed into its corresponding abstract situation. The abstract policy
is then used to define the output to the Actuation module: πabs (inRoom(X) ∧
nonTargetRoom(X)∧seeNoDoors()∧notAllDoorsVisited()) = findDoor() (see
subsection 2.2). In the current version the output of this module is one of four
behaviors: goToDoor(d) meaning that the robot should go to d; goToNextRoom(p)
meaning that the robot should go into the next place p; findDoor() meaning
that the robot should search for doors in the room; exploration() meaning that
the robot should search for objects in the room.
Finally, in the Actuation module, the HLC submodule uses the ActionLIB
ROS library to decompose each behavior into a sequence of simple actions, which
are in turn translated by the LLC submodule to respective velocities of trans-
lation and rotation by using the Move Base ROS library 8 , which allows the
robot to reach a certain target pose. The Move Base library in turn uses other
ROS libraries to avoid obstacles during navigation and to build a local path for
the execution of each simple action sent by the HLC. Usually the Move Base
library works with a static metric map of the environment, but in our case the
Move Base library was set to work without it, since the WRRA only reasons at the
relational level. The robot, controlled by the Actuation module, searches for a door
in the environment. The moment its semantic camera detects a door, a new
situation is defined.
Situation 2: When the robot perceives a door (in this case, it sees door
d1 ), the Perception module updates the semantic map with the door d1 and the
8 http://wiki.ros.org/move_base
relation d1 at p1 . So a new ground situation is determined by the Perception
module: situation2 = inRoom(livingroom) ∧ nonTargetRoom(livingroom) ∧
seeDoors() ∧ notAllDoorsVisited() ∧ seeNonVisitedDoor(d1 ). This situation
is turned into its corresponding abstract situation in the Reasoning and Learning
module, and the abstract policy gives: πabs (inRoom(X) ∧ nonTargetRoom(X) ∧
seeDoors() ∧ notAllDoorsVisited() ∧ seeNonVisitedDoor(Y)) = goToDoor(Y).
In this case, the grounding of behavior goToDoor(Y) gives goToDoor(d1 ). Then,
the Navigation module operates properly, using sensory information that gives
the relative position of the robot to d1 and to obstacles, in order to drive the
robot to the front of the door.
Situation 3: In this situation the robot knows it still is in the living room,
it still sees the door d1 that has not been visited, and now it can see the adja-
cent room p2 through door d1 . The map is updated by building the relation p2
connected to p1 through d1 . This situation is:
situation3 = inRoom(livingroom)∧nonTargetRoom(livingroom)∧seeDoors()∧
notAllDoorsVisited() ∧ seeNonVisitedDoor(d1 ) ∧ seeNextPlace(p2 ).
Given the abstraction of situation3 , the behavior indicated by the abstract pol-
icy is goToNextPlace(Z) and its grounding results in goToNextPlace(p2 ).
In this case, the Navigation module drives the robot through the door and the
robot reaches the adjacent room p2 .
Situation 4: As the robot has not observed any object in this new room, then
p2 = unknown. In this case, the map does not need to be updated and the only
predicate with truth-value TRUE is inUnknownRoom(). Then the Reasoning and
Learning module outputs the behavior exploration(), meaning that the robot
must explore the room, looking for objects.
Situation 5: Finally, the robot observes objects obj3 = oven and obj4 =
freezer that allow it to infer that it is in p2 = kitchen. Then the map is
updated by building the relations obj3 at p2 and obj4 at p2 . As the kitchen is the
target room, the episode ends and the task is fulfilled.
4 Conclusion
In this paper we have explored the use of semantic web resources to conduct prob-
abilistic relational learning and reasoning in robot navigation. We have presented
an architecture (WRRA) for robot navigation that employs the KnowRob sys-
tem and its knowledge base of probabilistic description logic sentences, together
with relational reinforcement learning. The resulting framework shows how to
use, in practice, semantic web resources that can deal with uncertainty.
We have implemented the WRRA, first in a simulated environment, then in
a real robot. The fact that the WRRA operates with abstract semantic infor-
mation, both in its knowledge base, and in its inputs and outputs, simplifies
the whole process and leads to effective qualitative navigation. Moreover, the
acquired abstract knowledge base can be transferred to other scenarios. Our
experiments indicate that indoor navigation can actually benefit from such a
framework.
Several avenues are open to future work. Additional predicates can be tested
to infer the robot location, and web resources can be mixed with human inter-
vention. Furthermore, we intend to include measures of uncertainty about the
situation of the robot, by associating probabilities with predicates. We also plan
to conduct more extensive tests with the real robot.
Acknowledgments. The authors are partially supported by the National Coun-
cil for Scientific and Technological Development (CNPq), Brazil.
References
1. Aydemir, A., Pronobis, A., Gobelbecker, M., Jensfelt, P.: Active visual object search
in unknown environments using uncertain semantics. Robotics, IEEE Transactions
on, 29(4), 986–1002 (2013)
2. Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., Patel-Schneider, P. F.:
Description Logic Handbook, Cambridge University Press (2002)
3. Galindo, C., Saffiotti, A.: Inferring robot goals from violations of semantic knowl-
edge. Robotics and Autonomous Systems, 61(10), 1131–1143 (2013)
4. Gupta, R., Kochenderfer, M.J.: Commonsense data acquisition for indoor
mobile robots. In: 19th National Conference on Artificial Intelligence (AAAI-04),
pp. 605–610. AAAI Press (2004)
5. Jain, D., Waldherr, S., and Beetz, M.: Bayesian Logic Networks. Technical report,
IAS Group, Fakultät für Informatik, Technische Universität München (2009)
6. Koga, M.L., Silva, V.F.d., Cozman, F.G., Costa, A.H.R.: Speeding-up reinforcement
learning through abstraction and transfer learning. In: Conf. Autonomous Agents
and Multiagent Systems, pp.119–126 (2013)
7. Lukasiewicz, T.: Expressive probabilistic description logics. Artificial Intelligence,
172(6-7), 852–883 (2008)
8. Lukasiewicz, T., Straccia, U.: Managing Uncertainty and Vagueness in Description
Logics for the Semantic Web. Journal of Web Semantics, 6, 291–308 (2008)
9. Mateus, P., Pacheco, A., Pinto, J., Sernadas, A.:Probabilistic Situation Calculus.
Annals of Mathematics and Artificial Intelligence, 32, 393–431 (2001)
10. Otterlo, M.V.: The Logic of Adaptive Behaviour. IOS Press, Amsterdam (2009)
11. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference, Morgan Kaufmann (1988)
12. Riazuelo, L., Tenorth, M., Di Marco, D., Salas, M., Mösenlechner, L., Kunze, L.,
Montiel, J. M. M.: RoboEarth Web-Enabled and Knowledge-Based Active Percep-
tion. In: IROS Workshop on AI-based Robotics (2013)
13. Saito, M., Chen, H., Okada, K., Inaba, M., Kunze, L.,Beetz, M.: Semantic object
search in large-scale indoor environments. In: IROS Workshop on active Semantic
Perception and Object Search in the Real World (2011)
14. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning,
MIT Press, Cambridge, MA (1998)
15. Tenorth, M., Klank, U., Pangercic, D., Beetz, M.: Web-enabled robots. Robotics
& Automation Magazine, IEEE, 18(2), 58–68 (2011)
16. Tenorth, M., Beetz, M.: KnowRob – A Knowledge Processing Infrastructure for
Cognition-enabled Robots. Part 1: The KnowRob System. International Journal of
Robotics Research (IJRR) 32(5), 566–590 (2013)
Towards a Distributional Semantic Web Stack
André Freitas1 , Edward Curry1 , Siegfried Handschuh1,2
1 Insight Centre for Data Analytics, National University of Ireland, Galway
2 School of Computer Science and Mathematics, University of Passau
Abstract. The capacity of distributional semantic models (DSMs) to
discover similarities over large scale heterogeneous and poorly structured
data makes them a promising universal and low-effort framework to
support semantic approximation and knowledge discovery. This position
paper explores the role of distributional semantics in the Semantic Web
vision, based on state-of-the-art distributional-relational models, catego-
rizing and generalizing existing approaches into a Distributional Seman-
tic Web stack.
1 Introduction
Distributional semantics is based on the idea that semantic information can be
extracted from lexical co-occurrence from large-scale data corpora. The simplic-
ity of its vector space representation, its ability to automatically derive meaning
from large-scale unstructured and heterogeneous data and its built-in seman-
tic approximation capabilities make distributional semantic models a promising
approach for bringing additional flexibility into existing knowledge representation
frameworks.
Distributional semantic approaches are being used to complement the seman-
tics of structured knowledge bases, generating hybrid distributional-relational
models. These hybrid models are built to support semantic approximation, and
can be applied to selective reasoning mechanisms, reasoning over incomplete
KBs, semantic search, schema-agnostic queries over structured knowledge bases
and knowledge discovery.
2 Distributional Semantic Models
Distributional semantic models (DSMs) are semantic models which are based on
the statistical analysis of co-occurrences of words in large corpora. Distributional
semantics allows the construction of a quantitative model of meaning, where the
degree of the semantic association between different words can be quantified
in relation to a reference corpus. With the availability of large Web corpora,
comprehensive distributional models can effectively be built.
DSMs are represented as a vector space model, where each dimension repre-
sents a context C, i.e., the linguistic or data context in which the target term T
occurs. A context can be defined using documents, co-occurrence window sizes
(number of neighboring words or data elements) or syntactic features. The dis-
tributional interpretation of a target term is defined by a weighted vector of the
contexts in which the term occurs, defining a geometric interpretation under a
distributional vector space. The weights associated with the vectors are defined
using an associated weighting scheme W, which re-calibrates the relevance
of more generic or discriminative contexts. A semantic relatedness measure S
between two words in the dataset can be calculated by using different similar-
ity/distance measures such as the cosine similarity or Euclidean distance. As
the dimensionality of the distributional space can grow large, dimensionality
reduction approaches d can be applied.
Different DSMs are built by varying the parameters of the tuple (T , C, W, d, S).
Examples of distributional models are Latent Semantic Analysis, Random In-
dexing, Dependency Vectors, Explicit Semantic Analysis, among others. Distri-
butional semantic models can be specialized to different application areas using
different corpora.
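To make the tuple (T, C, W, d, S) concrete, the following toy sketch (purely illustrative; the corpus, weighting and similarity choices are assumptions rather than a reference to any particular system) builds a small DSM with sentence-level contexts, a PPMI weighting scheme and cosine similarity, omitting dimensionality reduction:

import numpy as np
from itertools import combinations

corpus = [["kitchen", "contains", "refrigerator", "and", "stove"],
          ["living", "room", "contains", "sofa", "and", "armchair"]]

vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: each sentence acts as one context window (the contexts C)
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        counts[index[a], index[b]] += 1
        counts[index[b], index[a]] += 1

# Weighting scheme W: positive pointwise mutual information (PPMI)
total = counts.sum()
rows = counts.sum(axis=1, keepdims=True)
cols = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(counts * total / (rows * cols))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Semantic relatedness measure S: cosine similarity between term vectors
def relatedness(t1, t2):
    v1, v2 = ppmi[index[t1]], ppmi[index[t2]]
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / norm) if norm else 0.0

print(relatedness("refrigerator", "stove"))   # co-occurring terms score higher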
3 Distributional-Relational Models (DRMs)
Distributional-Relational Models (DRMs) are models in which the semantics of
a structured knowledge base (KB) is complemented by a distributional semantic
model.
A Distributional-Relational Model (DRM) is a tuple (DSM, KB, RC, F, H, OP),
where: DSM is the associated distributional semantic model; KB is the struc-
tured dataset, with elements E and tuples Ω; RC is the reference corpus, which
can be unstructured, structured or both. The reference corpus can be internal
(based on the co-occurrence of elements within the KB) or external (a separate
reference corpus); F is a map which translates the elements ei ∈ E into vectors
in the distributional vector space VS^DSM, using the natural language label and
the entity type of ei; H is a set of threshold values for S above which two terms
are considered to be equivalent; OP is a set of operations over these vectors in
VS^DSM and over E and Ω in the KB. The set of operations may include search,
query and graph navigation operations using the distance measure S.
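As a small illustration of how F, S and H interact (the function and parameter names below are assumptions for illustration, not an existing API), a query term can be matched against the labels of KB elements by embedding both and accepting the pairs whose relatedness exceeds a threshold taken from H:

from typing import Callable, List
import numpy as np

def semantic_match(query_label: str,
                   kb_labels: List[str],
                   embed: Callable[[str], np.ndarray],    # the map F: label -> vector
                   threshold: float = 0.5) -> List[str]:  # a threshold value from H
    q = embed(query_label)
    matches = []
    for label in kb_labels:
        v = embed(label)
        s = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))  # relatedness S (cosine)
        if s >= threshold:
            matches.append(label)
    return matches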
The DRM supports a double perspective of semantics, keeping the fine-
grained precise semantics of the structured KB but also complementing it with
the distributional model. Two main categories of DRMs and associated applica-
tions can be distinguished:
Semantic Matching & Commonsense Reasoning: In this category the RC
is unstructured and it is distinct from the KB. The large-scale unstructured RC
is used as a commonsense knowledge base. Freitas & Curry [1] define a DRM
(τ-Space) for supporting schema-agnostic queries over the structured KB:
terms used in the query are projected into the distributional vector space and
are semantically matched with terms in the KB via distributional semantics
using commonsense information embedded on large scale unstructured corpora
RC. In a different application scenario, Freitas et al. [3] use the τ-Space to
support selective reasoning over commonsense KBs. Distributional semantics is
used to select the facts which are semantically relevant under a specific reasoning
context, allowing the scoping of the reasoning context and also coping with
incomplete knowledge of commonsense KBs. Pereira da Silva & Freitas [2] used
the τ-Space to support approximate reasoning on logic programs.
Knowledge Discovery: In this category, the structured KB is used as a dis-
tributional reference corpora (where RC = KB). Implicit and explicit semantic
associations are used to derive new meaning and discover new knowledge. The
use of structured data as a distributional corpus is a pattern used for knowledge
discovery applications, where knowledge emerging from similarity patterns in the
data can be used to retrieve similar entities and expose implicit associations. In
this context, the ability to represent the KB entities’ attributes in a vector space
and the use of vector similarity measures as a way to retrieve and compare similar
entities can define universal mechanisms for knowledge discovery and semantic
approximation. Novacek et al. [5] describe an approach for using web data as a
bottom-up phenomenon, capturing meaning that is not associated with explicit
semantic descriptions, applying it to entity consolidation in the life sciences do-
main. Speer et al. [8] proposed AnalogySpace, a DRM over a commonsense KB
based on Latent Semantic Indexing, targeting the creation of the analogical closure
of a semantic network through dimensionality reduction. AnalogySpace was used to
reduce the sparseness of the KB, generalizing its knowledge, allowing users to
explore implicit associations. Cohen et al. [6] introduced PSI, a predication-
based semantic indexing for biomedical data. PSI was used for similarity-based
retrieval and detection of implicit associations.
4 The Distributional Semantic Web Stack
DRMs provide universal mechanisms which have fundamental features for se-
mantic systems: (i) built-in semantic approximation for terminological and in-
stance data; (ii) ability to use large-scale unstructured data as commonsense
knowledge; (iii) ability to detect emerging implicit associations in the KB; (iv)
simplicity of use supported by the vector space model abstraction; and (v) robust-
ness with regard to poorly structured, heterogeneous and incomplete data. These
features provide a framework for a robust and easy-to-deploy semantic approx-
imation component grounded on large-scale data. Considering the relevance of
these features in the deployment of semantic systems in general, this paper syn-
thesizes its vision by proposing a Distributional Semantic Web stack abstraction
(Figure 1), complementing the Semantic Web stack. At the bottom of the stack,
unstructured and structured data can be used as reference corpora together
with the target KB (RDF(S)). Different elements of the distributional model
are included as optional and composable elements of the architecture. The ap-
proximate search and query operations layer accesses the DSM layer, supporting
users with semantically flexible search and query operations. A graph navigation
layer defines graph navigation algorithms (e.g. spreading activation,
bi-directional search) using the semantic approximation and the distributional
information from the layers below.
Fig. 1: (A) Depiction of an example DRM (τ-Space). (B) Distributional Semantic Web stack.
Acknowledgment: This publication was supported in part by Science Foundation
Ireland (SFI) (Grant Number SFI/12/RC/2289) and by the Irish Research Council.
References
1. Freitas, A., Curry, E., Natural Language Queries over Heterogeneous Linked Data
Graphs: A Distributional-Compositional Semantics Approach. In Proc. of the 19th
Intl. Conf. on Intelligent User Interfaces (IUI). (2014).
2. Pereira da Silva, J.C., Freitas A., Towards An Approximative Ontology-Agnostic
Approach for Logic Programs, In Proc. of the 8th Intl. Symposium on Foundations
of Information and Knowledge Systems. (2014).
3. Freitas, A., Pereira Da Silva, J.C., Curry, E., Buitelaar, P., A Distributional Seman-
tics Approach for Selective Reasoning on Commonsense Graph Knowledge Bases.
In Proc. of the 19th Int .Conf. on Applications of Natural Language to Information
Systems (NLDB). (2014).
4. Speer, R., Havasi, C., Lieberman, H., AnalogySpace: Reducing the Dimensionality of
Common Sense Knowledge. In Proc. of the 23rd Intl. Conf. on Artificial Intelligence,
548-553. (2008).
5. Novacek, V., Handschuh, S., Decker, S.. Getting the Meaning Right: A Comple-
mentary Distributional Layer for the Web Semantics. In Proc. of the Intl. Semantic
Web Conference, 504-519. (2011).
6. Cohen, T., Schvaneveldt, R.W., Rindflesch, T.C.. Predication-based Semantic In-
dexing: Permutations as a Means to Encode Predications in Semantic Space. T.
AMIA Annu Symp Proc., 114-118. (2009).
7. Turney, P.D., Pantel P., From frequency to meaning: vector space models of seman-
tics. J. Artif. Int. Res., 37(1), 141–188. (2010).
8. Speer, R., Havasi, C., Lieberman, H., AnalogySpace: Reducing the Dimensionality of
Common Sense Knowledge. In Proc. of the 23rd Intl. Conf. on Artificial Intelligence,
548-553. (2008).
Overview of METHOD 2014: The 3rd
International Workshop on Methods for
Establishing Trust of (Open) Data
Tom De Nies1 , Davide Ceolin2 , Paul Groth2 , Olaf Hartig3 , and Stephen Marsh4
1 Ghent University – iMinds – Multimedia Lab, Belgium, tom.denies@ugent.be
2 VU University Amsterdam, The Netherlands, {d.ceolin, p.t.groth}@vu.nl
3 University of Waterloo, Canada, ohartig@uwaterloo.ca
4 University of Ontario Institute of Technology, Canada, stephen.marsh@uoit.ca
1 Introduction
The METHOD workshop aims to bring together researchers working on the
problem of trust and quality assessment of (open) data, and all components
that contribute to this goal. It provides a forum for researchers from both the
Semantic Web and the Trust Management community, with the goal of gaining
new insights towards solutions for this complex problem. Due to the relatively
low number of submissions, and to maximize the impact of the accepted papers,
METHOD 2014 merged with the 10th International Workshop on Uncertainty
Reasoning for the Semantic Web (URSW), as a special session. In this short
editorial paper, we provide an overview of the topics discussed during this session.
Trust assessment of content on the Web is a highly complex concept that
depends on objective as well as subjective criteria, including the content’s prove-
nance, data quality estimation, and also the consumer’s background, personality,
and context. However, the exact criteria and tolerances differ for each context
and for each assessor, requiring detailed knowledge about the data and its uses.
This also makes it very challenging to find generic solutions and assessments that
are applicable everywhere or transferable from one context to another. There-
fore, stakeholders in this field are continuously investigating new techniques to
handle and prepare data in such a way that it becomes easier for machines to
process it with the goal of trust and/or quality assessment. The METHOD work-
shop is a venue for presenting and discussing novel research ideas in this field,
as well as technical applications.
2014 is the third year for METHOD. The two previous editions were held in
conjunction with the IEEE Annual International Computer Software & Appli-
cations Conference (COMPSAC). This year, the workshop is held for the first
time at ISWC, since the topics of provenance, data quality, and trust are highly
relevant to the Semantic Web community. The diversity in the ways that these
topics may be approached is also visible in the subjects of the submis-
sions we received. Although only three papers were accepted, each submission
covers a distinctly different aspect of the general theme of the workshop.
2 Papers and Discussion Topics
This year’s submissions cover the following aspects of trust of (open) data: con-
tent rating, data quality rewarding, data attribution, and trust representation.
The first two aspects are covered by Couto, who proposes a number of guide-
lines for a system that rates as well as rewards good practices in data sharing,
by means of a virtual currency [1]. A large number of unsolved technical chal-
lenges are identified by this position paper, and some issues are left open, such
as how to uniquely identify and attribute data to its creators. This is exactly
what Höfig and Schieferdecker investigate, proposing a new hash function for
RDF graphs [3]. The solution proposed in this research paper may contribute
to a tamper-resistant way of attributing RDF data to its authors, allowing one
to make claims about the data’s trustworthiness. Of course, a representation
for these trustworthiness assessments is needed. In the final position paper of
our workshop, Ceolin et al. propose an ontology to represent trust of web data,
extending existing solutions [2].
Despite the diverse aspects discussed in the submissions, a number of common
themes and open questions can be identified, listed below.
– What incentive is there for data creators to make their data trustworthy?
– Which mechanisms are in place to attribute data to its creators and editors,
and do they suffice for our needs?
– How can we estimate ratings or assessments of data quality and trustwor-
thiness?
– How would it be possible to allow the reuse of such trust and quality esti-
mations?
– How do we represent the aforementioned aspects in an interoperable way?
These are the questions we hope to see addressed during the discussions at
METHOD 2014, and in future editions of the workshop.
References
[1] Couto, F.: Rating, recognizing and rewarding metadata integration and sharing on
the semantic web. In: Proceedings of the 10th International Workshop on Uncer-
tainty Reasoning for the Semantic Web (URSW), Riva del Garda, Italy. CEUR-
WS.org (2014)
[2] Ceolin, D., Nottamkandath, A., Fokkink, W., Maccatrozzo, V.: Towards the def-
inition of an ontology for trust in (web) data. In: Proceedings of the 10th Interna-
tional Workshop on Uncertainty Reasoning for the Semantic Web (URSW), Riva
del Garda, Italy. CEUR-WS.org (2014)
[3] Hoefig, E., Schieferdecker, I.: Hashing of RDF graphs and a solution to the blank
node problem. In: Proceedings of the 10th International Workshop on Uncertainty
Reasoning for the Semantic Web (URSW), Riva del Garda, Italy. CEUR-WS.org
(2014)
Hashing of RDF Graphs and a Solution to the
Blank Node Problem
Edzard Höfig1 and Ina Schieferdecker2
1 Beuth University of Applied Sciences, Luxemburger Str. 10, 13353 Berlin, Germany, edzard.hoefig@beuth-hochschule.de
2 Fraunhofer FOKUS, Kaiserin-Augusta-Allee 31, 10589 Berlin, Germany, ina.schieferdecker@fokus.fraunhofer.de
Abstract. The ability to calculate hash values is fundamental for using
cryptographic tools, such as digital signatures, with RDF data. Without
hashing it is difficult to implement tamper-resistant attribution or prove-
nance tracking, both important for establishing trust with open data. We
propose a novel hash function for RDF graphs, which does not require
altering the contents of the graph, does not need to record additional in-
formation, and does not depend on a concrete RDF syntax. We are also
presenting a solution to the deterministic blank node labeling problem.
1 Introduction
Two years ago we started to discuss requirements for Open Information Spaces
(OIS): distributed systems that facilitate the sharing of data, while supporting
certain trust-related properties, e.g. attribution, provenance, or non-repudiation
[1, Section II]. While working on the topic, we quickly found out that it is
necessary to not only record trust-relevant information, but also to make sure
that the information cannot easily be tampered with. In closed systems, where
access is strictly regulated and monitored, it is possible to record attribution
information like “Alice created this data set” in a trustworthy manner. In open
systems, where everyone is free to share and re-use any provided data sets, this
is harder and usually requires some sort of cryptographic processing.
Among the most commonly used methods in this context are hash
functions. They take a fixed version of a data set (the snapshot) and calculate
a smaller, characteristic string for that data (the hash value). Cryptographic
hash functions are constructed in a way that even very small changes to the
original snapshot result in completely different hash values. Furthermore, it is
a runtime expensive operation to construct a data set that yields a given hash
value. Thus, by publishing the hash value for a snapshot x, it is possible to verify
that some data set y is highly likely to be identical to x, simply because their
hash values match. For example, to record the information “Alice created this
data set”, Alice would create a hash value of the data set and digitally sign it
using common cryptographic techniques. The signature could then be published
as meta data along with the data set, allowing verification of the attribution
information by calculating the hash value of a local copy of the data set and
comparing it to the one in the signature.
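As a toy illustration of this workflow (the data and names are invented for the example, not taken from the paper), a publisher can announce the SHA-256 digest of a snapshot and a consumer can verify a local copy by recomputing it; in practice the digest would additionally be signed with the publisher's private key:

import hashlib

def digest(snapshot: bytes) -> str:
    return hashlib.sha256(snapshot).hexdigest()

published_digest = digest(b"<serialized snapshot created by Alice>")
local_copy = b"<serialized snapshot created by Alice>"
print(digest(local_copy) == published_digest)   # True: the copies are very likely identical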
1.1 Motivation
Attribution is one of the fundamental characteristics that OIS needs to support,
as more complex operations, like provenance tracking, build upon such an at-
tribution framework. To engineer an OIS, we needed to start somewhere, so we
decided to quickly implement an attribution system. Technology-wise, we use the
Resource Description Framework (RDF) [2] to hold our data sets. Thus, our first
task was the implementation of a hash function for RDF graphs. This turned
out to be a complex undertaking. The problem is that RDF does not have a
single, concrete syntax. Although quite rigidly defined in terms of semantics, the
specification explicitly states that “A concrete RDF syntax may offer many dif-
ferent ways to encode the same RDF graph or RDF dataset, for example through
the use of namespace prefixes, relative IRIs, blank node identifiers, and different
ordering of statements.” [2, Section 1.8]. Unfortunately, to work properly, hash
functions need a single, concrete syntax. Often, RDF data is transmitted not as
a document — which would be bound to a concrete syntax — but comes from
query interfaces, e.g., SPARQL endpoints [3]. What we really needed was a hash
function that can work on an in-memory RDF graph. The hash function itself
is not the issue, as there are several implementations available (we are relying
on SHA-256 [4, Section 6.2] to calculate the final hash value). The problem is
to deterministically construct a single character string that distinctly represents
the RDF graph, and the main issue here is the identity of blank nodes contained
in the graph.
1.2 Paper Structure
The current section introduces the reader to the general issue and explains our
motivation. In Section 2, we are studying both the underlying problems that
arise when trying to implement a hash function, as well as the related work in
the scientific community. Section 3 contains the description of an algorithm that
is able to create a hash value for in-memory RDF graphs with blank nodes and
we conclude with a critical discussion of our contribution in Section 4.
2 Problems and Related Work
Initially, we looked at the literature and found a number of articles, dating back
about ten years and discussing the issue in great detail. There were even stan-
dards that seemed directly applicable, for example XML Signature [5,6]. After
studying the literature, we found that none of the articles fully explains a general
solution to the problem. All of them need certain constraints, or make assump-
tions about the data, for example the use of a certain concrete syntax. For our
purposes the situation was inadequate, because of our following requirements:
1. No modification of the RDF data is needed for the algorithm to work
2. No additional data needs to be available, apart from the RDF graph
3. The algorithm works in-memory and not on a concrete syntax
During studies of the subject, we found that three issues were of paramount
importance if we wanted to solve the problem. We will explain these first, before
investigating the existing work.
Blank Node Identifiers: RDF graphs might contain blank nodes. Such nodes
do not have an Internationalized Resource Identifier (IRI) [7] assigned. Once
loaded by an RDF implementation, they are assigned local identifiers, which
are not transferable to other implementations. When trying to calculate a
hash value, this is a major issue as blank nodes cannot be deterministically
addressed. One can think of blank nodes as anonymous and having an iden-
tity is a necessary pre-condition for calculating a hash value. Solving the
blank node issue is an algorithmic challenge.
Order of Statements: Calculation of hash values effectively serialises an RDF
graph structure into a string. RDF does not imply a certain order of state-
ments, e.g. the order of predicates attached to a single subject. For serial-
isation purposes we need a deterministic order, or otherwise we might end
up with hash values differing between implementations. This issue can be
solved by adhering to a sort order when serialising the graph.
Encoding: RDF uses literals to store values in the graph. These literals need to
follow a common encoding, otherwise different hash values might be calcu-
lated. The same goes for namespace prefixes or relative IRIs (both features of
the XML syntax for RDF [8]) – they need to be encoded with fully qualified
names when stored in memory, or, at the latest, when serialised.
The most influential paper on the subject of RDF graph hashing was written
by Carroll in 2003 [9]. Building on earlier work [10], Carroll explains that the
runtime of any algorithm for generic hashing of RDF data is equivalent to the
graph isomorphism problem, which is not known to be NP-complete nor known
to be in P. Carroll then refrains from finding a generic solution to the problem
and details his algorithm, which runs in O(n log(n)), but re-writes RDF graphs
to a canonical format. The proposed algorithm works on the N-Triples format (a
concrete document syntax). As far as solving the blank node identity problem,
the article states: “Since the level of determinism is crucial to the workings of
the canonicalization algorithm, we start by defining a deterministic blank node
labelling algorithm. This suffers from the defect of not necessarily labeling all
the blank nodes.” [9, Section 4]. Carroll continued to work on the subject, for
example by publishing applications based on digitally signing graphs, together
with Bizer, Hayes, and Stickler [11], but did not seem to have designed a general
algorithm for hashing RDF graphs.
Sayers and Karp, colleagues of Carroll, published two technical reports at
Hewlett-Packard that explain RDF hashing and applications thereof [12, 13].
They identify four different ways to tackle the blank node problem [13, Section
3 ff.], namely:
Limit operations on the graph: The idea is to maintain blank node identi-
fiers across implementations, which is not possible in an open world scenario.
Limit the graph itself: Avoid the use of blank nodes. This is clearly not the
way for us, as we strive for a general solution to the problem.
Modify the graph: Work around the problem by adding information about
blank node identity within the graph. Not possible for us, as we don't want
to change the RDF graph.
Change the RDF specification: Generally assign globally unique identifiers
to blank nodes. This will most likely never happen.
Apart from changing the RDF specification, none of these methods actually
solves the problem and they can be seen as workarounds.
Other authors also investigated RDF graph hashing, e.g. Giereth [14] for the
purpose of encrypting fragments of a graph. Giereth works around the blank node
issue by modifying the graph. A more current approach by Kuhn and Dumontier
[15] proposes an encoding of hash values in URIs. The approach replaces blank
nodes with arbitrarily generated identifiers and thus needs to modify the RDF
data to work.
Although Carroll already provided pointers in the right direction, the final
idea for solving the issue can be attributed to Tummarello et al., who introduce
the concept of a "Minimum Self-Contained Graph" (MSG) [16]. An MSG is an
element of a partitioning of a graph such that each MSG contains at most one transitively
connected sub-graph of blank nodes. We apply this idea to construct a blank
node identity, as explained in the next section.
3 The Algorithm
Fig. 1. An example structure with blank nodes
Our approach to the blank node labeling problem relies on constructing the
identity of a blank node through its context. This is similar to the MSG principle
of Tummarello et al. [16, Section 2] and a logical continuation of the thoughts
of Carroll [9, Section 7.1]. For an example, see Fig. 1. The diagram shows three
nodes with IRIs (A, B, C) and four blank nodes (β1 ... β4 ). If we go on to define
the identity of a node as determined by its direct subjects, we can distinguish β2
and β4 , but not yet the other two: they are both blank nodes pointing to
another blank node and the C node. We can only discern every node by taking
more of the context into account: not only direct neighbours, but neighbouring
nodes one hop away. Consequentially, the identity of a blank node can only be
constructed when following all of the transitive blank nodes, reachable from the
original one. Using this scheme, we are able to establish an identity for blank
nodes. Having an identity for both blank nodes and IRI nodes, we can generate a
characteristic, implementation-independent string for both node types. As RDF
only allows these two types of nodes as subjects, we are now able to create a list
of so-called “subject strings”. To solve the issue of implementation-dependent
statement order, it is necessary to establish an overall ordering criterion over
this list. We are using a simple lexicographical ordering based on the unicode
value of each character in a string. As we are solely relying on subject nodes to
calculate the hash value, we also need to encode the predicates and objects of the
RDF graph into the subject strings. This way of encoding graph structure
into the overall data used to calculate the hash value is a major difference to
other algorithms striving for the same goal. Usually, only flat triple lists are
processed and the hash function calculates a value for each triple. To encode the
graph structure we are using special delimiter symbols. Without these symbols,
differing graph topology might lead to the same string representation and thus,
the same hash value. For the final cryptographic calculation of the hash value,
we are using SHA-256 on UTF-8 [17] encoded data. SHA-256 was chosen as
the number of characters in the overall data string seems to be moderate. It is
recommended to use SHA-256 for less than 2^64 bits of input [4, Section 1].
3.1 Preconditions and Remarks
To calculate the digest of a single RDF graph g, the graph has to reside in
memory first, as we are not concerned with any network or file-based represen-
tations of the graph. We require that the graph’s content is accessible as a set of
< S, P, O > triples 3 . It is beneficial to have fast access to all subject nodes of
the graph, and to all properties of a node, and we are using matching patterns to
express this type of access, e.g. <?ns, ?p, ?o> to denote any node ns that appears
in the role of a subject in the RDF triple data4 . Literals and IRI identifiers need
to be stored as unicode characters. There are no restrictions on the blank node
identifiers, as we do not use them for calculation of the digest. For sake of clarity,
we present our algorithm broken down into four separate sub-algorithms: a func-
tion that calculates the hash value for a given graph (Algorithm 1), a procedure
that calculates the string representation for a given subject node (Algorithm 2),
another one for the string representation of the properties of a given subject
node (Algorithm 3), and a last one for calculation of the string representation
of an object node (Algorithm 4). These sub-algorithms call each other and thus,
could be combined in a single operation. It should be noted that the algorithm
uses reentrance to establish the transitive relationship needed to assign identities
to the blank nodes.
Our algorithm uses a number of different delimiter symbols with strictly
defined, constant values, to which we assigned Greek letters. Table 1 gives an
overview of these symbols.
3 Triples with a Subject, Predicate, and Object
4 The question mark notation is inspired by the SPARQL query syntax [3]
Symbol Symbol Name Value (Unicode) Value Name
αs Subject start symbol { (U+007B) Left curly bracket
ωs Subject end symbol } (U+007D) Right curly bracket
αp Property start symbol ( (U+0028) Left parentheses
ωp Property end symbol ) (U+0029) Right parentheses
αo Object start symbol [ (U+005B) Left square bracket
ωo Object end symbol ] (U+005D) Right square bracket
β Blank node symbol * (U+002A) Asterisk
Table 1. Guide to delimiters and symbols used in the algorithms
Use of these symbols is unproblematic in regard to their appearance as part
of the RDF content. As we use two symbols to delimit a scope in the string, we
can clearly distinguish between use as delimiter and use as content. As our goal
is the creation of a string representation for each subject node, the algorithm
makes heavy use of string concatenation and we are using the ⊕ symbol to
denote this operation. In several places, strings are nested between start and
stop delimiters, like this: α ⊕ string ⊕ ω.
Some final remarks regarding the implementation before delving into the
specifics of the algorithm: we are using some variables (visitedNodes, g) as
parameters for functions and procedures. Of course, these would be better kept
as shared variables (e.g., attributes of an object). The variable result is
always local and needs to be empty at the start of each function. Furthermore,
there are some functions that depend on a concrete implementation and are
quite trivial; we skip an in-depth discussion of those, e.g., predicates(...).
3.2 Calculating the Hash Value
Algorithm 1 Calculating a hash value for an RDF graph g
1: function calculateHashValue(g)
2:     for all ns ∈ g that match <?ns, ?p, ?o> do      ▷ All subject nodes
3:         visitedNodes ← ∅
4:         subjectStrings ← subjectStrings ∪ encodeSubject(ns, visitedNodes, g)
5:     end for
6:     sort subjectStrings in unicode order
7:     for all s ∈ subjectStrings do
8:         result ← result ⊕ αs ⊕ s ⊕ ωs
9:     end for
10:    return hash(result)                             ▷ Using SHA-256 and UTF-8
11: end function
To calculate the hash value for a given RDF graph, Algorithm 1 is used. The
function takes a single parameter: the RDF graph g to use for calculation of
its hash value. At first the algorithm iterates over all of the subject nodes ns
that exist in g (lines 2–5). A subject node is any node that appears in the role
of a subject in an RDF triple contained in g. For each of the subject nodes we
create a data structure, called visitedNodes, which is used to record whether we have
already processed some blank node. This is necessary for termination of the construction
of the blank node identities. visitedNodes needs to be empty before calculating
the string representation for ns by calling the procedure encodeSubject(...) in
line 4, which is explained in the next section. After all subject nodes are encoded,
the resulting list needs to be sorted. Any sorting order could be used and as we
do not require specific semantics for this step, we are establishing an ordering
simply by comparing the unicode numbering of letters. The sorting operation
in line 6 is key to deterministically create a hash value, as the result would
otherwise build upon the (implementation-dependent) order of nodes in g. All of
the sorted subject-strings are then concatenated to form an overall result string,
while each single subject string is enclosed with the subject delimiter symbols
(line 8). Finally, the result string is subjected to a cryptographic hash function
and returned (line 10).
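As an illustration, the following is a minimal Python sketch of Algorithm 1 (a sketch under
simplifying assumptions, not the authors' implementation). It assumes that the graph g is
given as a set of (subject, predicate, object) string triples, that blank nodes are strings
starting with "_:", and that encode_subject corresponds to the sketch of Algorithm 2 given
in the next subsection.

import hashlib

ALPHA_S, OMEGA_S = "{", "}"                        # subject start/end symbols (Table 1)

def calculate_hash_value(g):
    subject_strings = set()                        # a set, mirroring the union in line 4
    for ns in {s for (s, p, o) in g}:              # all subject nodes (line 2)
        visited_nodes = set()                      # reset for each subject (line 3)
        subject_strings.add(encode_subject(ns, visited_nodes, g))
    result = ""
    for s in sorted(subject_strings):              # Unicode order (line 6)
        result += ALPHA_S + s + OMEGA_S            # enclose each subject string (line 8)
    return hashlib.sha256(result.encode("utf-8")).hexdigest()   # SHA-256 over UTF-8 (line 10)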
3.3 Encoding the Subject Nodes
In RDF, subject nodes can be of two types: they can either be blank nodes or
IRIs [2, Section 3.1]. The procedure encodeSubject(...), shown in Algorithm 2
and used by Algorithm 1, needs to take care of this. The procedure has three
arguments: ns – the subject node to encode as a string, visitedNodes – our data
structure for tracking already visited nodes, and g – the RDF graph.
Algorithm 2 Encoding a subject node ns
1: procedure encodeSubject(ns, visitedNodes, g)
2: if ns is a blank node then
3: if ns ∈ visitedNodes then
4: return ∅ . This path terminates
5: else
6: visitedNodes ← visitedNodes ∪ ns . Record that we visited this node
7: result ← β
8: end if
9: else
10: result ← IRI of ns . ns has to be an IRI
11: end if
12: result ← result ⊕ encodeProperties(ns, visitedNodes, g)
13: return result
14: end procedure
The discrimination of types comes first: lines 2–8 process blank nodes and
lines 9–11 take care of IRIs. For the blank nodes, we have to distinguish between
the case where we have already encountered the blank subject node (lines 3–4)
and the case where we have not (lines 5–7). If the subject node was already
encountered, the graph traversal ends and we return an empty string. If the
blank subject node is hitherto unknown, then result is set to the blank node
symbol β (see Table 1) and the node is recorded in visitedNodes as processed. If
the subject is not a blank node but an IRI, we simply set result to the IRI itself.
Encoding the subject node alone is not sufficient for our purposes, as we
need to establish an identity based on the context of the current subject node. In
line 12 this process is triggered by calling the encodeProperties(...) procedure
(see Algorithm 3) and concatenating the returned string with the existing result.
The result itself is returned in line 13.
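Under the same assumptions as the sketch above (blank nodes are strings starting with "_:",
IRIs are plain strings), Algorithm 2 might be rendered in Python as follows; encode_properties
is sketched in the next subsection.

BETA = "*"                                         # blank node symbol (Table 1)

def encode_subject(ns, visited_nodes, g):
    if ns.startswith("_:"):                        # ns is a blank node
        if ns in visited_nodes:
            return ""                              # this path terminates (line 4)
        visited_nodes.add(ns)                      # record that we visited this node (line 6)
        result = BETA
    else:
        result = ns                                # ns has to be an IRI (line 10)
    return result + encode_properties(ns, visited_nodes, g)   # line 12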
3.4 Encoding the Properties of a Subject Node
Algorithm 3 is responsible for encoding all of the properties of a given subject
node ns into a single string representation. We understand properties as the
predicates (p) and objects (o) that fulfil < ns, ?p, ?o >, where ns is a given
subject. Apart from ns, the procedure encodeProperties(...) needs visitedNodes
as a second, and g as a third argument.
Algorithm 3 Encode properties for a subject node
1: procedure encodeProperties(ns, visitedNodes, g)
2: p ← predicates(ns, g) . Retrieve all predicate IRIs that have ns as subject
3: sort p in unicode order
4: for all iri ∈ p do
5: result ← result ⊕ αp ⊕ iri
6: for all no ∈ g that match < ns, iri, ?no > do . All objects for ns and iri
7: objectStrings ← objectStrings ∪ encodeObject(no, visitedNodes, g)
8: end for
9: sort objectStrings in unicode order
10: for all o ∈ objectStrings do
11: result ← result ⊕ αo ⊕ o ⊕ ωo
12: end for
13: result ← result ⊕ ωp
14: end for
15: return result
16: end procedure
The algorithmic structure reflects the complexity of graph composition using
RDF predicates: a subject node can be associated with multiple predicates, and
the same predicate may occur more than once if it is associated with different
objects. We use a two-stage process to encode properties: first, all unique
predicate IRIs of the given subject node ns are retrieved and ordered (lines 2–3).
We are postulating a function predicates(...) that returns all predicate IRIs for
a given subject node ns by searching all triples for matches to < ns, ?px, ? >,
extracting the IRI of the identified predicate px, and eliminating duplicate
entries. In a second step, we encode each predicate IRI and the set of objects
associated with it (lines 4–14). The property encoding starts in line 5, where the
result is concatenated with the property start symbol αp and the predicate IRI.
Subsequently, we retrieve all object nodes (nodes that appear in triples with ns
as subject and iri as predicate), encode each object node using the procedure
encodeObject(...), and store their respective string representations in an
objectStrings list (lines 6–8). The encodeObject(...) procedure is detailed in
Algorithm 4. After collecting all the encoded object strings, the resulting list
has to be sorted (line 9). In lines 10–12 each object string is appended to result,
while taking care to enclose the string in delimiter symbols. Encoding of a single
property (one pass of the loop started in line 4) ends with appending the property
stop symbol ωp to result. Once the procedure has encoded all properties, it
returns the complete result string in line 15.
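A corresponding Python sketch of Algorithm 3, still under the same simplified assumptions;
predicates(...) is the trivial, implementation-dependent helper postulated above, and
encode_object is sketched in the next subsection.

ALPHA_P, OMEGA_P = "(", ")"                        # property start/end symbols (Table 1)
ALPHA_O, OMEGA_O = "[", "]"                        # object start/end symbols (Table 1)

def predicates(ns, g):
    return sorted({p for (s, p, o) in g if s == ns})   # unique predicate IRIs, Unicode order

def encode_properties(ns, visited_nodes, g):
    result = ""
    for iri in predicates(ns, g):                  # lines 2-4
        result += ALPHA_P + iri                    # line 5
        object_strings = {encode_object(no, visited_nodes, g)
                          for (s, p, no) in g if s == ns and p == iri}   # lines 6-8
        for o in sorted(object_strings):           # lines 9-10
            result += ALPHA_O + o + OMEGA_O        # line 11
        result += OMEGA_P                          # line 13
    return result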
3.5 Encoding the Object Nodes
Processing the object nodes themselves is trivial compared to the property
encoding. Objects in RDF triples can be one of three things: an IRI, a literal, or a
blank node [2, Section 3.1]. The encodeObject(...) procedure needs to return an appropriate
string representation for each of these three cases. It takes three arguments: an
object node no , visitedN odes, and the RDF graph g.
Algorithm 4 Encode an object node
1: procedure encodeObject(no, visitedNodes, g)
2: if no is a blank node then
3: return encodeSubject(no, visitedNodes, g) . Re-enter Algorithm 2
4: else if no is a literal then
5: return literal representation of no . Consider language and type
6: else
7: return IRI of no . no has to be an IRI
8: end if
9: end procedure
The three aforementioned cases are treated as follows. If no is a blank node,
we continue by re-entering Algorithm 2 (see line 3). The re-entrance allows
us to construct a path through neighbouring blank nodes. Together with the
potentially many object nodes associated with a single subject, this yields a
connected graph, similar to the MSGs of Tummarello et al. [16]. If no is a literal,
it is returned in a format according to [2, Section 3.3] in line 5, including any
language and type information. If no is neither a blank node nor a literal, it
has to be an IRI and we return it verbatim. After all objects, properties, and
subjects have been encoded, all sub-algorithms have returned and Algorithm 1
terminates.
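To complete the Python sketch, Algorithm 4 might look as follows; in this simplified
representation, literals are plain strings that are assumed to already carry any language
tag or datatype information, so the literal and IRI cases collapse.

def encode_object(no, visited_nodes, g):
    if no.startswith("_:"):                        # blank node: re-enter Algorithm 2 (line 3)
        return encode_subject(no, visited_nodes, g)
    return no                                      # literal or IRI, returned in its string form (lines 5 and 7)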
3.6 An Example
Consider the RDF graph shown in Figure 1 at the beginning of Section 3. When
applying the algorithm to the given graph, we end up with the following string
before calculating the SHA-256 hash (see Algorithm 1, line 10):
{∗(−[∗(−[A][C])][C])} {∗(−[∗(−[B][C])][C])} {∗(−[A][C])} {∗(−[B][C])}
Please note that this string is not valid RDF, as scheme and path information
are missing from the IRIs5 used for A, B, and C. Also, the symbol “−” is used to
indicate an arbitrary IRI employed for all predicates.
Thanks to the delimiters, it is quite easy to understand the string’s structure. The
four subject node strings for β1, β3, β2, and β4 (in this order) are each encapsulated
between curly brackets. A, B, and C only appear as object nodes and thus do
not trigger the creation of additional subject strings. Instead, they are encoded
as part of the blank node subject strings. Let us take a look at the first subject
string, for node β1: {∗(−[∗(−[A][C])][C])}. Apart from the curly brackets, the
string starts with the blank node symbol “∗”, followed by the properties of that
node, delimited by parentheses. The node has only a single predicate, used with
two different objects: [∗(−[A][C])] and [C]. If there were additional predicate
types, there would also be further parenthesis blocks. Objects are delimited by
square brackets, and due to the re-entrant nature of the algorithm, object strings
follow the same syntax as just discussed.
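As a small usage illustration of the Python sketches given above (with purely hypothetical
example IRIs, not those of Figure 1), the following toy graph contains one IRI subject and
one blank node; the resulting digest is the same regardless of the order in which the
triples happen to be stored:

g = {
    ("http://example.org/s", "http://example.org/p", "_:b1"),
    ("_:b1", "http://example.org/p", "http://example.org/o"),
}
print(calculate_hash_value(g))                     # identical SHA-256 hex digest on every run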
4 Conclusion and Future Work
The presented algorithm calculates a hash value for RDF graphs including blank
nodes. It is not necessary to alter the RDF data or to record additional informa-
tion. It is not dependent on any concrete syntax. It solves the blank node labeling
problem by encoding the complete context of a blank node, including the RDF
graph structure, into the data used to calculate the hash value. The algorithm
has a runtime complexity of O(n^n), which is consistent with current research on
algorithms for solving the graph isomorphism problem [9]. The worst-case scenario
is a fully meshed RDF graph of blank nodes, which is unlikely to occur in practice;
usually, we would expect the number of blank nodes in a graph to be far smaller,
and thus the real execution time to be far less severe. We concentrated on solving
the primary problem, not on runtime optimisations, and we are certain that there
is room for improvement in the given algorithm.
There are some obvious starting points for doing this. For example, there are
redundancies in the string representations for transitive blank node paths (the
string representations for β2 and β4 appear twice in the example given in Section
3.6). One could cache already computed subject-strings, pulling them from the
cache when needed. Also, the interplay between the SHA-256 digest computation
and the subject string calculations has not been researched in sufficient detail.
5 For example, A instead of http://a/
It might be possible to reduce processing and storage overhead by combining
these two operations, calling the digest operation on each subject string and
combining the resulting values in the order of the final sorted list. The benefit
of such optimizations will largely depend on the sensitivity of the digest
algorithm implementation to the length of the given input strings. This, in turn,
depends on the usage of blank nodes in the input RDF data: more blank nodes
in the input data and more references between blank nodes mean longer subject
strings. Consequently, to come to a more substantial assessment of the presented
algorithm, we will need to study its performance on a number of real (and larger)
data sets.
While we trust the general approach of solving the blank node labeling
problem through an assignment of identity based on the surrounding context
of the node, we did not prove that the algorithm works correctly. To gain
confidence that it works properly, we tested it: on the one hand with regard to its
ability to process all possible RDF constructs, using tests from W3C’s RDF test
cases recommendation [18]; on the other hand with regard to the correctness of
the blank node labeling approach, using manually constructed graphs. These
graphs range from simple, non-cyclic ones with a single blank node to all possible
permutations of a fully meshed graph of blank nodes, with variations in the
attachment of IRI nodes and predicate types.
Acknowledgments. We would like to thank Thomas Pilger and Abdul Saboor
for their collection of material documenting the current state of the art and for
tireless implementation work.
References
1. Höfig, E.: Supporting Trust via Open Information Spaces. Proc. IEEE 36th Annual
Computer Software and Applications Conference, pp. 87–88 (2012)
2. World Wide Web Consortium.: RDF 1.1 Concepts and Abstract Syntax, W3C Rec-
ommendation, http://www.w3.org/TR/rdf11-concepts/ (2014)
3. World Wide Web Consortium.: SPARQL 1.1 Query Language, W3C Recommenda-
tion, http://www.w3.org/TR/sparql11-query/ (2013)
4. National Institute of Standards and Technology.: Secure Hash Standard (SHS). Fed-
eral Information Processing Standards Publication 180–4 (2012)
5. World Wide Web Consortium.: XML Signature Syntax and Processing (Second
Edition), W3C Recommendation, http://www.w3.org/TR/xmldsig-core/ (2008)
6. Cloran, R., Irwin, B.: XML Digital Signature and RDF. Proc. Information Security
South Africa (2005)
7. Duerst, M., Suignard, M.: Internationalized Resource Identifiers (IRIs). Internet
Engineering Task Force – Request for Comments 3987 (2005)
8. World Wide Web Consortium.: RDF 1.1 XML Syntax, W3C Recommendation,
http://www.w3.org/TR/rdf-syntax-grammar/ (2014)
9. Carroll, J. J.: Signing RDF Graphs. Proc. of the Second International Semantic
Web Conference, LNCS 2870, pp. 369–384 (2003)
10. Carroll, J. J.: Matching RDF Graphs. Proc. of the First International Semantic
Web Conference, LNCS 2342, pp. 5–15 (2002)
11. Carroll, J. J., Bizer, C., Hayes, P., Stickler, P.: Named Graphs, Provenance and
Trust. Proc. International World Wide Web Conference, pp. 613–622 (2005)
12. Sayers, C., Karp, A. H.: Computing the Digest of an RDF Graph. Hewlett-Packard
Labs Technical Report HPL-2003-235 (2003)
13. Sayers, C., Karp, A. H.: RDF Graph Digest Techniques and Potential Applications.
Hewlett-Packard Labs Technical Report HPL-2004-95 (2004)
14. Giereth, M.: On Partial Encryption of RDF-Graphs. Proc. of the Fourth Interna-
tional Semantic Web Conference, LNCS 3729, pp. 308–322 (2005)
15. Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, Immutable, and Permanent
Digital Artifacts for Linked Data. Proc. Eleventh European Semantic Web Confer-
ence, LNCS 8465, pp. 395–410 (2014)
16. Tummarello, G., Morbidoni, C., Puliti, P., Piazza, F.: Signing Individual Fragments
of an RDF Graph. International World Wide Web Conference, pp. 1020–1021 (2005)
17. Yergeau, F.: UTF-8, a Transformation Format of ISO 10646. Internet Engineering
Task Force – Request for Comments 3629 (2003)
18. World Wide Web Consortium.: RDF Test Cases, W3C Recommendation,
http://www.w3.org/TR/2004/REC-rdf-testcases-20040210/ (2004)
Rating, recognizing and rewarding metadata integration
and sharing on the semantic web
Francisco M. Couto
LASIGE, Dept. de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal
fcouto@di.fc.ul.pt
Abstract. Research is increasingly becoming a data-intensive science; however,
proper data integration and sharing is more than storing the datasets in a public
repository: it requires the data to be organized, characterized and updated contin-
uously. This article assumes that by rewarding and recognizing metadata sharing
and integration on the semantic web using ontologies, we promote and intensify
the trust and quality of data sharing and integration. So, the proposed approach
aims at measuring the knowledge rating of a dataset according to the specificity
and distinctiveness of its mappings to ontology concepts.
The knowledge ratings will then be used as the basis of a novel reward and recog-
nition mechanism that will rely on a virtual currency, dubbed KnowledgeCoin
(KC). Its implementation could explore some of the solutions provided by cur-
rent cryptocurrencies, but KC will not be a cryptocurrency since it will not rely
on a cryptographic proof but on a central authority whose trust depends on the
knowledge rating measures proposed by this article. The idea is that every time
a scientific article is published, KCs are distributed according to the knowledge
rating of the datasets supporting the article.
Keywords: Data Integration, Data Sharing, Linked Data, Metadata, Ontologies
1 Introduction
Research is increasingly becoming a data-intensive science in several areas, where
prodigious amounts of data can be collected from disparate resources at any time [6].
However, the real value of data can only be leveraged through its trust and quality, which
ultimately results in the acquisition of knowledge through its analysis. Since multiple
types of data are involved, often from different sources and in heterogeneous formats,
data integration and sharing are key requirements for an efficient data analysis. The
need for data integration and sharing has a long-standing history, and besides the big
technological advances it still remains an open issue. For example, in 1985 the Com-
mittee on Models for Biomedical Research proposed a structured and integrated view
of biology to cope with the available data [8]. Nowadays, the BioMedBridges 1 initia-
tive aims at constructing the data and service bridges needed to connect the emerging
biomedical sciences research infrastructures (BMSRI), which are on the roadmap of the
European Strategy Forum on Research Infrastructures (ESFRI). One common theme to
1 www.biomedbridges.eu
all BMSRIs is the definition of the principles of data management and sharing [3]. The
Linked Data initiative 2 already proposed a well-defined set of recommendations for
exposing, sharing and integrating data, information and knowledge using semantic web
technologies. In this paradigm data integration and sharing is achieved in the form of
links connecting the data elements themselves and adding semantics to them. Following
and understanding the links between data elements in publicly available Linked Data
stores (the Linked Data Cloud) enables us to access the data and knowledge shared by
others. The Linked Data Cloud offers an effective solution to break down data silos;
however, the systematic usage of these technologies requires a strong commitment from
the research community.
Promoting the trust and quality of data through their proper integration and sharing
is essential to avoid the creation of silos that store raw data that cannot be reused by
others, or even by the owners themselves. For example, the current lack of incentive to
share and preserve data is sometimes so problematic that there are even cases of authors
who cannot recover the data associated with their own published works [5]. However,
the problem is how to obtain a proactive involvement of the research community in data
integration and sharing. In 2009, Tim Berners-Lee gave a TED talk3, where he said:
“you have no idea the number of excuses people come up with to hang onto their data
and not give it to you, even though you’ve paid for it as a taxpayer.” Public funding
agencies and journals may enforce data-sharing policies, but the adherence to them
is most of the time inconsistent and scarce [1]. Besides all the technological advances
that we may deliver to make data integration and sharing tasks easier, researchers need
to be motivated to do it correctly. For example, due to Galileo’s strong commitment
to the advance of science, he integrated the direct results of his observations of Jupiter
with careful and clear descriptions of how they were performed, which he shared in
Sidereus Nuncius [4]. These descriptions enabled other researchers not only to be aware
of Galileo’s findings but also to understand, analyze and replicate his methodology. This
is another situation that we could characterize with the famous phrase “That’s one small
step for a man, one giant leap for mankind.” Now let us imagine if we could extend
Galileo’s commitment to the whole research community, and the giant leap that it could
bring to the advance of science.
Thus the commitment of the research community to data integration and sharing
is currently a major concern, and this explains why BMSRIs have recently included in
their definition of the principles of data management and sharing the following chal-
lenge: “to encourage data sharing, systematic reward and recognition mechanisms are
necessary”. They suggest not only studying measurements of citation impact, but also
highlight the importance of investigating other mechanisms as well. Systematic reward
and recognition mechanisms should motivate the researchers in a way that they become
strongly committed to sharing data, so others can easily understand and reuse it. By
doing so, we encourage the research community to improve previous results by repli-
cating the experiments and testing new solutions. However, before developing a reward
and recognition mechanism we must formally define: i) what needs to be rewarded and
recognized; and ii) how to measure its value in a quantitative and objective way.
2 http://linkeddata.org/
3 http://www.ted.com/talks/tim_berners_lee_on_the_next_web
2 Metadata Quality
Proper data integration and sharing is more than storing the datasets in a public repos-
itory: it requires the data to be organized and characterized in a way that others can
find it and reuse it effectively. In an interview4 with Nature, Steven Wiley emphasized
that sharing data “is time-consuming to do properly, the reward systems aren’t there
and neither is the stick”. Not adding links to external resources hampers the efficient
retrieval and analysis of data, and therefore its expansion and update. Making a dataset
easier to find and access is also a way to improve its initial trust and quality, as more
studies analyze, expand and update it. Like the careful and clear descriptions provided
by Galileo, semantic characterizations in the form of metadata must also be present so
others can easily find the raw data and understand how it can be retrieved and explored.
Metadata is a machine-readable description of the contents of a resource made
through linking the resource to the concepts that describe it. However, to fully under-
stand such diverse and large collections of raw data being produced, their metadata need
to be integrated in a non-ambiguous and computational amenable way [9, 13]. Ontolo-
gies can be loosely defined as “a vocabulary of terms and some specification of their
meaning” [7, 14]. If an ontology is accepted as a reference by the community (e.g.,
the Gene Ontology), then its representation of its domain becomes a standard, and data
integration and sharing facilitated. The complex process of enriching a resource with
metadata by means of semantically defined properties pointing to other resources often
requires human input and domain expertise. Thus, the proposed approach assumes that
by rewarding and recognizing metadata sharing and integration on the semantic web us-
ing standard and controlled vocabularies, we are promoting and intensifying scientific
collaboration and progress.
Figure 1 illustrates the Semantic Web in action with two datasets annotated with
their respective metadata using a hypothetical Metal Ontology. A dataset including
Gold Market Stats contains an ontology mapping (e.g., an RDF triple) to the concept
Gold, and another dataset Silver Market Stats contains an ontology mapping to the
concept Silver. Given that Gold and Silver are both coinage metals, a semantic search
engine is capable of identifying both datasets as relevant when asked for market stats
of coinage metals.
Fig. 1. A hypothetical metal ontology and dataset mappings.
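As a purely illustrative Python sketch of this retrieval scenario (the ontology, dataset
names, relation names and the flat triple representation are hypothetical, not defined in
this paper), a naive subclass closure over the toy Metal Ontology is already enough to
answer the “coinage metal” query:

# Hypothetical metadata and ontology triples: dataset-to-concept mappings plus subclass links.
triples = {
    ("GoldMarketStats",   "mappedTo",   "Gold"),
    ("SilverMarketStats", "mappedTo",   "Silver"),
    ("Gold",              "subClassOf", "CoinageMetal"),
    ("Silver",            "subClassOf", "CoinageMetal"),
}

def datasets_about(concept, triples):
    relevant = {concept}                            # the queried concept and, below, its subclasses
    changed = True
    while changed:                                  # naive transitive subclass closure
        changed = False
        for (sub, rel, sup) in triples:
            if rel == "subClassOf" and sup in relevant and sub not in relevant:
                relevant.add(sub)
                changed = True
    return {d for (d, rel, c) in triples if rel == "mappedTo" and c in relevant}

print(datasets_about("CoinageMetal", triples))      # returns both market-stats datasets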
Now, we need to define the value of metadata in terms of knowledge it provides
about a given dataset. Semantic interoperability is a key requirement in the realization
4 http://www.nature.com/news/2011/110914/full/news.2011.536.html
of the semantic web and it is mainly achieved through mappings between resources.
For example, all dataset mappings to ontology concepts are to some extent important to
enhance the retrieval of that dataset, but the level of importance varies across mappings.
The proposed approach assumes that metadata can be considered as a set of links where
all the links are equal, but some links are more equal than others (an adaptation of
George Orwell’s quote). Thus, the proposed approach aims at measuring the knowledge rating
of any given dataset through its mappings to concepts specified in an ontology.
3 Knowledge rating
The proposed approach assumes that the metadata integration and sharing value of a
dataset, dubbed its knowledge rating, is proportional to the specificity and distinctive-
ness of its mappings to ontology concepts in relation to all the other datasets in the
Linked Data Cloud.
The specificity of a set of ontology concepts can be defined by the information con-
tent (IC) of each concept, which was introduced by [11]. For example, intuitively the
concept dog is more specific than the concept animal. This can be explained because the
concept animal can refer to many distinct ideas, and, as such, carries a small amount of
information content when compared to the concept dog, which has a more informative
definition. The distinctiveness of a set of ontology concepts can be defined by its con-
ceptual similarity [2,12] to all the other sets of ontology concepts, i.e., the distinctiveness
of a dataset is high if there are no other semantically similar datasets available. Concep-
tual similarity explores ontologies and the relationships they contain to compare their
concepts and, therefore, the entities they represent. Conceptual similarity enables us to
identify that arm and leg are more similar than arm and head, because an arm is a limb
and a leg is also a limb. Likewise, because an airplane contains wings, the two concepts
are more related to each other than wings is to boat.
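For reference, the standard formal reading of these two notions — Resnik's information
content [11] and the concept similarity derived from it — is sketched below in LaTeX; the
last formula, combining them into a knowledge rating, is only one illustrative possibility
and is not a definition given in this paper.

IC(c) = -\log p(c), \qquad sim(c_1, c_2) = IC\left(\mathrm{lcs}(c_1, c_2)\right)

KR(d) \propto \underbrace{\frac{1}{|C_d|} \sum_{c \in C_d} IC(c)}_{\text{specificity}}
       \times \underbrace{\left(1 - \max_{d' \neq d} sim(C_d, C_{d'})\right)}_{\text{distinctiveness}}

Here p(c) is the probability of encountering concept c (or one of its descendants) in a
reference corpus, lcs denotes the least common subsumer of two concepts, C_d is the set of
concepts mapped to dataset d, and sim(C_d, C_{d'}) stands for some groupwise extension of
the pairwise concept similarity.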
Most implementations of IC and conceptual similarity only span a single domain
specified by an ontology [10]. However, realistic datasets frequently use concepts from
distinct domains of knowledge, since reality is rarely unidisciplinary. So, the scientific
challenge is to propose innovative algorithms to calculate the IC and conceptual sim-
ilarity using multiple-domain ontologies to measure the specificity and distinctiveness
of a dataset. Similarity in a multiple ontology context will have to explore the links
between different ontologies. Such correspondences already exist for some ontologies
that provide cross-reference resources. When these resources are unavailable, ontology
matching techniques can be used to automatically create them.
4 Reward and recognition mechanism
The reward and recognition mechanism can rely on the implementation of a new virtual
currency, dubbed KnowledgeCoin (KC), that will be specifically designed to promote
and intensify the usage of semantic web technologies for scientific data integration and
sharing. The idea is that every time a scientific article is published, KCs are distributed
according to the knowledge rating of the datasets supporting that article. Note that KCs
should by no means be a new kind of money and the design of KC transactions will
focus on the exchange of scientific data and knowledge.
After developing the knowledge rating measures, they can be used to implement the
supply algorithm of a new virtual currency, KC. This will not only aim at validating the
usefulness of the proposed knowledge ratings but also deliver an efficient reward and
recognition mechanism to promote and intensify the usage of semantic web technolo-
gies for scientific data integration and sharing. Unlike conventional cryptocurrencies,
the KCs will rely on a trusted central authority and not on a cryptographic proof.
But even without being a cryptocurrency, the KC will take advantage of the technical
solutions provided by existing cryptocurrencies, such as bitcoin5.
The scientific challenge is to create a trusted central authority that issues new KCs
when new knowledge is created in the form of a scientific article, as long as it refer-
ences a supporting dataset properly integrated in the Linked Data Cloud. If there is no
reference to the dataset in the Linked Data Cloud, no KCs will be issued. This way,
researchers will be incentivized to publicly share the dataset, including the raw data
or at least a description of the raw data, in the Linked Data Cloud. If a dataset is shared
through the Linked Data Cloud then its level of integration will be measured by its
knowledge rating. This way, researchers will be encouraged to properly integrate their
data. The success of this mining process will rely on the trustworthiness of the knowl-
edge ratings, and therefore will further validate the developed measures.
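A purely illustrative Python sketch of the issuance rule just described; the object model
(article, authority) and the knowledge_rating(...) function are hypothetical placeholders,
since the paper deliberately leaves their design open.

def issue_knowledge_coins(article, authority):
    # Datasets that the article references in the Linked Data Cloud.
    datasets = article.linked_data_references
    if not datasets:
        return 0                                    # no reference, no KCs issued
    # Each dataset contributes KCs in proportion to its knowledge rating.
    coins = sum(authority.knowledge_rating(d) for d in datasets)
    authority.credit(article.authors, coins)        # distribute KCs to the authors
    return coins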
From recognition researchers may get reputation, and from reputation they may
get a reward. For example, researchers recognize the relevance of a researcher’s work
by citing it, and by having a high number of citations the researcher obtains a strong
reputation, which may in the end help him to be rewarded with a project grant. Thus,
KCs can be interpreted as a form of reputation that in the end can result in a reward.
However, we can also design and implement direct reward mechanisms through KC
transactions, as a way to establish a virtual marketplace of scientific data and knowledge
exchanges. The main scenario of a KC transaction is to represent the exchange of
datasets, identified by a URI, from the data provider to the data consumer, which may
include recognition statements.
5 Future Directions
The design of the approach is ongoing work and its direction depends on a more detailed
analysis of many social and technical challenges that its implementation poses. For
example, some of the issues that need to be further studied and discussed are: i) the
implementation of knowledge ratings, i.e., their validation, aggregation, performance, exceptions, and
extension to any mappings besides the ontological ones; ii) potential abuses, such as the
creation of spam mappings and other security threats; iii) central trusted authority for
the KC vs. the peer-to-peer mechanisms used by bitcoin; iv) use case scenarios for the
KC, e.g. exchange of datasets and their characterization based on KC transactions.
In a nutshell, this paper presents the guidelines for delivering sound knowledge
rating measures to serve as the basis of a systematic reward and recognition mechanism
5 http://bitcoin.org/
based on KCs for improving the trust and quality of data through proper data integration
and sharing on the semantic web. The proposed idea aims to be a first step towards an
effective solution for the extinction of data silos.
Acknowledgments
We thank the anonymous reviewers for their valuable comments and suggestions. This work was
funded by the Portuguese FCT through the LASIGE Strategic Project (PEst-OE/EEI/UI0408/2014)
and the SOMER project (PTDC/EIA-EIA/119119/2010).
References
1. Alsheikh-Ali, A.A., Qureshi, W., Al-Mallah, M.H., Ioannidis, J.P.: Public availability of pub-
lished research data in high-impact journals. PloS one 6(9), e24357 (2011)
2. Couto, F.M., Pinto, H.S.: The next generation of similarity measures that fully explore the
semantics in biomedical ontologies. Journal of bioinformatics and computational biology
11(05) (2013)
3. ELIXIR, EU-OPENSCREEN, BBMRI, EATRIS, ECRIN, INFRAFRONTIER, INSTRUCT,
ERINHA, EMBRC, Euro-BioImaging, LifeWatch, AnaEE, ISBE, MIRRI: Principles of data
management and sharing at European Research Infrastructures (DOI: 10.5281/zenodo.8304,
Feb 2014)
4. Galilei, G.: Sidereus Nuncius, or The Sidereal Messenger. University of Chicago Press
(1989)
5. Goodman, A., Pepe, A., Blocker, A.W., Borgman, C.L., Cranmer, K., Crosas, M., Di Stefano,
R., Gil, Y., Groth, P., Hedstrom, M., et al.: Ten simple rules for the care and feeding of
scientific data. PLoS computational biology 10(4), e1003542 (2014)
6. Hey, A.J., Tansley, S., Tolle, K.M.: The fourth paradigm: data-intensive scientific discovery
(2009)
7. Jasper, R., Uschold, M., et al.: A framework for understanding and classifying ontology
applications. In: Proceedings 12th Int. Workshop on Knowledge Acquisition, Modelling,
and Management KAW. vol. 99, pp. 16–21 (1999)
8. National Research Council (US). Committee on Models for Biomedical Research: Models
for biomedical research: a new perspective. National Academies (1985)
9. Noy, N.F.: Semantic integration: a survey of ontology-based approaches. ACM Sigmod
Record 33(4), 65–70 (2004)
10. Pedersen, T., Pakhomov, S.V., Patwardhan, S., Chute, C.G.: Measures of semantic similarity
and relatedness in the biomedical domain. Journal of biomedical informatics 40(3), 288–299
(2007)
11. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In:
Proceedings of the 14th international joint conference on Artificial intelligence-Volume 1.
pp. 448–453. Morgan Kaufmann Publishers Inc. (1995)
12. Rodrı́guez, M.A., Egenhofer, M.J.: Determining semantic similarity among entity classes
from different ontologies. Knowledge and Data Engineering, IEEE Transactions on 15(2),
442–456 (2003)
13. Uschold, M., Gruninger, M.: Ontologies and semantics for seamless connectivity. ACM SIG-
Mod Record 33(4), 58–64 (2004)
14. Uschold, M., Gruninger, M.: Ontologies: Principles, methods and applications. The knowl-
edge engineering review 11(02), 93–136 (1996)
Towards the Definition of an Ontology for Trust
in (Web) Data
Davide Ceolin, Archana Nottamkandath,
Wan Fokkink, and Valentina Maccatrozzo
VU University, Amsterdam, The Netherlands
{d.ceolin,a.nottamkandath,w.j.fokkink,v.maccatrozzo}@vu.nl
Abstract. This paper introduces an ontology for representing trust that
extends existing ones by integrating them with recent trust theories.
Then, we propose an extension of such an ontology, tailored for repre-
senting trust assessments of data, and we outline its specificity and its
relevance.
Keywords: Trust, Ontology, Web Data, Resource Definition Framework (RDF)
1 Introduction
In this paper we tackle the problem of modeling and representing trust asser-
tions, in particular about (Web) data. This is an important issue for a variety
of reasons. First, trust is an important aspect both of everyday life and of many
computational approaches, for similar reasons. In fact, trust is a “leap of faith” 1
that is necessary to be taken whenever we need to rely on third party agents or
information. We decide whether or not to take this leap of faith based on the
evaluation of the trustworthiness of the agent or information. In general, when
trusting, a risk is involved, i.e., the risk of relying on uncertain and possibly
unpredictable actions or information. We can soften such a risk, and one way
to achieve this result is to share trust and trustworthiness values, along with
their provenance, to allow their reuse and increase the probability to correctly
place trust thanks to the availability of this information. Therefore, an ontology
for trust assessments, in particular of Web data, can indicate the basic elements
that are necessary to define a trust value.
This paper aims at introducing an ontology for trust representation, starting
from existing ones and extending them to cover aspects indicated by recent trust
theories. In Section 2 we present related work, in Section 3 we provide a summary
of the trust theory of O’Hara that we use in the rest of the paper, in Section 4
we propose an extended ontology for representing trust and in Section 5 we
expose our vision about the issues of trusting (Web) data. Lastly, we conclude
in Section 6.
1 Stephen Marsh, “Trust: Really, Really Trust”, IFIP Trust Management Conference 2014 Tutorial
2 Related Work
Trust is a widely explored topic within a variety of computer science areas.
Here, we focus on those works directly touching upon the intersection of trust,
reputation and the Web. We refer the reader to the work of Sabater and Sierra
[19], Artz and Gil [2], and Golbeck [12] for comprehensive reviews about trust
in artificial intelligence, Semantic Web and Web respectively. Trust has also
been widely addressed in the agent systems community. Pinyol and Sabater-Mir
provide an up-to-date review of the literature in this area [18].
We extend the ontology proposed by Alnemr et al. [1]. We choose it because:
(1) it focuses on the computational part of trust, rather than on social and agent
aspects that are marginal to our scope, and (2) it already presents elements that
are useful to represent computational trust elements. Nevertheless, we propose
to extend it to cover at least the main elements of the trust theory of O’Hara,
that are missing from their original ontology, and we highlight how these ex-
tensions can be beneficial to model trust in (Web) data. Viljanen [21] envisions
the possibility to define an ontology for trust, but puts a particular emphasis
on trust between people or agents. Heath and Motta [14] propose an ontology
for representing expertise, thus allowing us to represent an important aspect of
trust, but again posing more focus on the agents rather than on the data. A
different point of view is taken by Sherchan et al. [20], who propose an ontology
for modeling trust in services.
Golbeck et al. [13], Casare et al. [5] and Huang et al. [15] propose ontologies
for modeling trust in agents. Although these could be combined with the ontology
we propose (e.g., to model the trust in the author of a piece of data), for the
moment their focus falls outside the scope of our work, which is trust in data.
Trust has also been modeled in previous work of ours [7,6,8,9] using generic
models (e.g., the Open Annotation Model [4] or the RDF Data Cube Vocabu-
lary [10]). Here we aim at providing a specific model for representing trust.
3 A Definition of Trust in Short
We recall here the main elements of “A Definition of Trust” by O’Hara [17], that
provide the elements of trust we use to extend the ontology of Alnemr et al.
Tw (Trustworthiness) agent Y is willing, able and moti-
vated to behave in such a way as to conform to behaviour R, to the benefit
of members of audience A, in context C, as claimed by agent Z.
Tr (Trust attitude) X believes, with con-
fidence Deg on the basis of warrant Warr, that Y’s intentions, capacities and
motivations conform to I(R[A],c), which X also believes is entailed by R(A),
a claim about how Y will pursue the interests of members of A, made about
Y by a suitably authorised Z.
X places trust in Y (Trust Action) X performs some action which intro-
duces a vulnerability for X, and which is inexplicable without the truth of
Trust attitude.
4 Extending a Trust Ontology
We are interested in enabling the sharing of the trust values regarding both
trust attitude and actions, along with their provenance. The ontology proposed
by Alnemr et al. [1] captures the basic computational aspects of these trust
values. However, we believe that it lacks some trust-specific elements that are
present in the theory of O’Hara, and thus we extend that ontology as shown
in Figure 1 (see footnote 2). Compared with the ontology of Alnemr et al., we provide some
important additions. We clearly identify the parts involved in the trust relation:
Trustor (source): every trust assessment is made by an agent (human or not),
that takes his decision based on his policy and on the evidence at his disposal;
Trustee (target): the agent or piece of information that is actually being
trusted or not trusted. This class replaces the generic “Entity” class, as it
emphasizes its role in the trust relation.
We also distinguish between the attitude and the act of trusting.
Trust Attitude Object: it represents the graded belief held by the trustor in
the trustworthiness of the trustee and is treated as a quality attribute
when deciding whether or not to place trust in the trustee. It replaces the repu-
tation object defined by Alnemr et al., which has a similar function
(quantifying the trust in something), but it implements the trust attitude
relation defined by O’Hara, which is more precise and complete (e.g., warranties
are not explicitly modeled by the reputation object);
Trust Action Object: the result of the action of placing trust. Placing trust
is an instantaneous action based on an “outright” belief. Therefore the trust
value is likely to be a Boolean value.
Role, Context and Warranty: in the original ontology, the criterion is a ge-
neric class that contextualizes the trust value. We specialize it, to be able to
model the role and the context indicated in the theory of O’Hara, as well as
the evidence on which the trust value is based, by means of the warranty.
The trustworthiness relation is not explicitly modeled, since it falls outside
our current focus. We discuss this further in the following section. The remaining
elements of the model shown in Figure 1 are part of the original trust ontology.
These include a criterion for the trust value, and an algorithm that allows com-
bining observations (warranties) into a trust value (an algorithm is used also to
determine the value of the trust action). The trust attitude value corresponds
to the Deg element of the theory of O’Hara. We model the action both when it
is performed and when it is not. Both trust values are modeled uniformly.
5 Modeling Trust in Data
In the previous section we provided an extended ontology that aims at capturing
the basic elements that are involved in the process of taking trust decisions. Here
we focus on the specificity of trusting (Web) data.
2 The ontology is available at http://trustingwebdata.org/ontology
[Figure 1 is a diagram and is omitted in this text rendering. It shows the classes Trustor
(Source), Trustee (Target), Trust Attitude Object, Trust Action Object, Role, Context,
Warranty, Criterion, Rating, Quality Attribute and Computation Algorithm, connected by
relations such as hasSource, hasTarget, hasRole, hasContext, hasWarranty, hasCriteria,
hasTrustAttitudeValues, hasTrustValues and calculatedBy.]
Fig. 1. Extended Trust Ontology. We highlight in red the added elements and in yellow
the updated ones.
Data are used as information carriers, so trusting data does not actually mean
placing trust in a sequence of symbols. It rather means placing trust in the in-
terpretation of such a sequence of symbols, on the basis of its trustworthiness.
For instance, consider a painting reproducing the city “Paris” and its annotation
“Paris”. To trust the annotation, we must have evidence that the painting ac-
tually corresponds to the city Paris. But, to do so, we must: (1) give the right
interpretation to the word “Paris” (e.g., there are 26 US cities and towns named
“Paris”), and (2) check whether one of the possible interpretations is correct.
Whether the picture represents another city or a town named Paris whose
existence we were unaware of, we would not place trust in the data, but for
completely different reasons in each case.
the above example is: exMuseum:ParisPainting ex:depicts dbpedia:Paris,
where we take for granted the correctness of the subject and of the property
and, if we accept the triple, we do so because we believe in the correctness of the
object in that context (represented by the subject), and role (represented by the
property). We make use of the semantics of RDF 1.1 [22], from which we recall
the elements of a simple interpretation I of an RDF graph:
1. A non-empty set IR of resources, called the domain or universe of I.
2. A set IP, called the set of properties of I.
3. A mapping IEXT : IP → P (IR × IR).
4. A mapping IS: IRIs → (IR ∪ IP). An IRI (Internationalized Resource Iden-
tifier [11]) is a generalization of a URI [3].
5. A partial mapping IL from literals into IR
Also, the following semantic conditions for ground graphs hold:
a. if E is a literal then I(E) = IL(E)
b. if E is an IRI then I(E) = IS(E)
c. if E is a ground triple s p o. then I(E) = true if I(p) ∈ IP and the pair
⟨I(s), I(o)⟩ ∈ IEXT(I(p)), otherwise I(E) = false.
d. if E is a ground RDF graph then I(E) = false if I(E’) = false for some triple
E’ ∈ E, otherwise I(E) = true.
Items 1, 2, 4 and 5 map the URIs, literals and the RDF triples to real-world
objects. We are particularly interested in Item 3, that maps the property of an
RDF triple to the corresponding real-world relation between subject and object.
Trusting a piece of data means to place trust in the information it carries, in a
given context. The trust context can be represented by means of the subject and
object of an RDF triple, so their semantic interpretation is assumed to be known
by the trustor. If the trustor trusts the triple, he believes that the interpretation
of the object o makes the interpretation of the triple s p o true:
TrustAttitude trustor (o|s, p) = Belief trustor (∃ I(o) : ⟨I(s), I(o)⟩ ∈ IEXT(I(p)))
Belief is an operator that maps logical propositions to values that quantify
their believed truth, e.g., by means of subjective opinions [16] quantified in the
Deg value of the theory of O’Hara and based on evidence expressed by Warranty.
By virtue of items c and d, we do not model explicitly the trustworthiness
relation defined by O’Hara: we consider an object o to be trustworthy by virtue
of the fact that it is part of an RDF triple that is asserted.
[Figure 2 is a diagram and is omitted in this text rendering. It instantiates the ontology
with the trustor ex:me, the trustee dbpedia:Paris, the context exMuseum:ParisPainting, the
role ex:depicts, and a trust attitude value of 0.8 calculated by a sample TrustAlgorithm.]
Fig. 2. Snapshot of the trust ontology, specialized for modeling data trustworthiness.
Figure 2 presents a snapshot of the trust ontology modeling the example
above and adding a trust attitude value computed with a sample trust algorithm.
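The content of Figure 2 can also be paraphrased as the following Python sketch, a
hypothetical, simplified rendering of the instance data rather than a serialization of the
actual ontology; the field names are illustrative only.

# Hypothetical rendering of the trust assessment shown in Figure 2.
trust_attitude = {
    "source":  "ex:me",                     # Trustor
    "target":  "dbpedia:Paris",             # Trustee (the object being trusted)
    "context": "exMuseum:ParisPainting",    # subject of the RDF triple
    "role":    "ex:depicts",                # property of the RDF triple
    "value":   0.8,                         # trust attitude value (Deg)
    "calculated_by": "TrustAlgorithm",      # sample computation algorithm
}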
6 Conclusion
In this paper we introduce an ontology for trust representation that extends
an existing model with recent trust theories. We specialize it in order to model
data-related trust aspects, and we motivate our design choices based on standard
RDF 1.1 semantics. This model is still at a very early stage, but it emerges from
previous research and from standard trust theories. In the future, it will be
extended, and evaluated in depth, also by means of concrete applications.
References
1. R. Alnemr, A. Paschke, and C. Meinel. Enabling reputation interoperability
through semantic technologies. In I-SEMANTICS, pages 1–9. ACM, 2010.
2. D. Artz and Y. Gil. A survey of trust in computer science and the semantic web.
Journal of Semantic Web, 2007.
3. T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Resource Identifier (URI):
Generic Syntax (RFC 3986). Technical report, IETF, 2005. http://www.ietf.org/rfc/rfc3986.txt.
4. S. Bradshaw, D. Brickley, L. J. G. Castro, T. Clark, T. Cole, P. Desenne,
A. Gerber, A. Isaac, J. Jett, T. Habing, B. Haslhofer, S. Hellmann, J. Hunter,
R. Leeds, A. Magliozzi, B. Morris, P. Morris, J. van Ossenbruggen, S. Soiland-
Reyes, J. Smith, and D. Whaley. Open Annotation Core Data Model. http:
//www.openannotation.org/spec/core, 2012. W3C Community Draft.
5. S. J. Casare and J. S. Sichman. Towards a functional ontology of reputation. In
AAMAS, pages 505–511. ACM, 2005.
6. D. Ceolin, A. Nottamkandath, and W. Fokkink. Automated Evaluation of An-
notators for Museum Collections using Subjective Logic. In IFIPTM 2012, pages
232–239. Springer, May 2012.
7. D. Ceolin, A. Nottamkandath, and W. Fokkink. Semi-automated Assessment of
Annotation Trustworthiness. In PST 2013. IEEE, 2013.
8. D. Ceolin, A. Nottamkandath, and W. Fokkink. Efficient semi-automated assess-
ment of annotations trustworthiness. Journal of Trust Management, 1(1):3, 2014.
9. D. Ceolin, W. R. van Hage, and W. Fokkink. A Trust Model to Estimate the
Quality of Annotations using the Web. In WebSci 2010. Web Science Trust, 2010.
10. R. Cyganiak, D. Reynolds, and J. Tennison. The rdf data cube vocabulary. Tech-
nical report, W3C, 2014.
11. M. Dürst and M. Suignard. Internationalized Resource Identifiers (IRIs) (RFC
3987). Technical report, IETF, 2005. http://www.ietf.org/rfc/rfc3987.txt.
12. J. Golbeck. Trust on the World Wide Web: A Survey. Foundations and Trends in
Web Science, 1(2):131–197, 2006.
13. J. Golbeck, B. Parsia, and J. A. Hendler. Trust networks on the semantic web. In
CIA, pages 238–249. Springer, 2003.
14. T. Heath and E. Motta. The Hoonoh Ontology for describing Trust Relationships
in Information Seeking. In PICKME 2008. CEUR-WS.org, 2008.
15. J. Huang and M. S. Fox. An ontology of trust: Formal semantics and transitivity.
In ICEC ’06, pages 259–270. ACM, 2006.
16. A. Jøsang. A logic for uncertain probabilities. Intl. Journal of Uncertainty, Fuzzi-
ness and Knowledge-Based Systems, 9(3):279–212, 2001.
17. K. O’Hara. A General Definition of Trust. Technical report, University of
Southampton, 2012.
18. I. Pinyol and J. Sabater-Mir. Computational trust and reputation models for open
multi-agent systems: A review. Artif. Intell. Rev., 40(1):1–25, June 2013.
19. J. Sabater and C. Sierra. Review on computational trust and reputation models.
Artificial Intelligence Review, 24:33–60, 2005.
20. W. Sherchan, S. Nepal, J. Hunklinger, and A. Bouguettaya. A trust ontology for
semantic services. In IEEE SCC, pages 313–320. IEEE Computer Society, 2010.
21. L. Viljanen. Towards an ontology of trust. In Trust, Privacy, and Security in
Digital Business, pages 175–184. Springer, 2005.
22. W3C. RDF 1.1 Semantics. http://www.w3.org/TR/2014/REC-rdf11-mt-20140225/, 2014.