<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Walking Linked Data: a graph traversal approach to explain clusters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Tiddi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>enrico.mottag@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute The Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link traversal is one of the biggest advantages of Linked Data, as it allows the serendipitous discovery of new knowledge thanks to the natural connections between data of di erent sources. Our general problem is to understand how such a property can bene t the Knowledge Discovery process: in particular, we aim at using Linked Data to explain the patterns of data that have been extracted from a typical data mining process such as clustering. The strategy we propose here is Linked Data traversal, in which we explore and build on-the- y an unknown Linked Data graph by simply deferencing entities' URIs until we nd, by following the links between entities, a valid explanation to our clusters. The experiments section gives an insight into the performance of such an approach, in terms of time and scalability, and show how the links easily gather knowledge from di erent data sources.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Graph Traversal</kwd>
        <kwd>URI Dereferencing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Almost ten years passed since Tim Berners-Lee presented the Linked Data
principles for the rst time1:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be referred to and looked up
(\dereferenced ") by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced,
leveraging standards such as RDF and SPARQL.
4. Include links to other related things (using their URIs) when publishing
data on the Web.</p>
      <p>Ever since, there has been much e ort from both the academia and the
industry to create a multi-domain, shared knowledge graph today de ned as \the
Web of Data" (sometimes referred to as the Linked Data Cloud, too). Following
those principles, datasets of multiple formats, sources and domains have been
published and connected, in order to aggregate fragmentary information into a
more complete one and facilitate automatic data reuse.</p>
    </sec>
    <sec id="sec-2">
      <title>1 http://www.w3.org/DesignIssues/LinkedData.html</title>
      <p>Interlinking data allows the Linked Data graph to be blindly navigated, as one
would usually do with the Web of documents: \blindly", because by looking up
URIs, new resources can be discovered on-the- y, possibly belonging to unknown
datasources, and therefore new knowledge can be serendipitously discovered. If
it is true that new elds have emerged in the Semantic Web area, that try to
leverage this link traversal feature as well as datasources interconnections, most
of their applications still rely on data known in advance. They lose, therefore, one
of the major bene ts of Linked Data: the serendipitous discovery of knowledge
that, in real world applications, is yet to be reached.</p>
      <p>Our research nds its place at the intersection between Knowledge Discovery
and Linked Data or, in other words, we consider that Linked Data can bene t a
eld of long tradition such as Knowledge Discovery. What we aim at exploiting
is the Linked Data shared knowledge, to derive explanations about Knowledge
Discovery patterns (more precisely, clusters). The main assumption is that items
are clustered together because of common characteristics, that can be explained
by (possibly cross-domain) background knowledge, that is usually provided by
experts that analyse and understand those patterns. Assuming those items are
represented as Linked Data, we can then exploit this interconnected knowledge to
derive explanations about their grouping, by looking for Linked Data information
that such items have in common. To this end, can the link traversal be bene cial
to derive those explanations, and how?</p>
      <p>
        Based on the previous work presented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we propose in this paper an A*
process to derive Linked Data-based explanations for groups of items behaving in
the same way. To produce those explanations, we apply a graph search process
relying on link traversal and resources dereferencing. Link traversal allows us
to navigate and span from datasource to datasource throughout Linked Data,
without knowing those in advance nor in their entirety, with the ultimate scope
of nding commonalities among the items of the cluster we want to explain. The
main contributions of this paper are a reformulation of the process in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as an
A* strategy based on Linked Data traversal, the extension of the existing process
to generate explanations out of datatype (and mostly numerical) properties and
a real world use-case in which we demonstrate that by following the links between
data we can gather new unrevealed knowledge from di erent datasources.
2
      </p>
      <sec id="sec-2-1">
        <title>Problem de nition</title>
        <p>The scenario we use to illustrate our problem involves the educational domain.
The map of Figure 1 shows a dataset D = fc1; : : : ; cj g of j world countries
grouped according the rate of female and male literacy over the last decade
(enrolment in secondary and tertiary school from the UNESCO Linked Data
statistics2). Countries where female are more educated than men are in blue (we
will de ne it as cluster B = fci; : : : cmg, where B D); countries where men are
more educated than women in yellow (cluster Y D); nally, countries where
the education rate is on average equal are in green (cluster G = D n B [ Y).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 http://uis.270a.info/.html</title>
      <p>Explaining a cluster. In our example, countries are grouped together if they
have a common characteristic, that is, based on the di erence between women's
literacy rate and the men's one. For each country ci 2 D, we state that:
if literacy (male, ci) { literacy (female, ci) &gt; 2%: then ci 2 Y
else if literacy (male, ci) { literacy (female, ci) &lt; 2%: then ci 2 B
else: ci 2 G
Our rst assumption is that countries do not happen to be together by pure
luck, but an underlying reason will make them appearing in the same group Ci.
Finding this underlying reason is de ned as explain(Ci). If one looks at the
map, this underlying reason will be clearly visible. In fact,
explain(Y) = \least developed countries "
explain(B) = \developed countries "
What one does to deduce so is using his own background knowledge (knowledge
about the countries' geopolitical, economical or social situations) to infer that
the countries belonging to Y correspond to societies living on older standards,
where women are less educated as their education is not considered useful.</p>
      <p>Here, the challenge is, can we exploit Linked Data as the source of such
background knowledge, and automatically reproduce the process of explaining a
cluster, e.g. explain(Y)?
Extracting an explanation from Linked Data. Our second assumption
is that Linked Data connect enough knowledge to derive the explanation for
the items in a cluster, e.g. that countries with less educated women are the
least developed countries. This, of course, assumes that such an information is
somehow described in some (accessible) Linked Data sources.</p>
      <p>The main idea is that the items share in the Linked Data graph the same path,
or walk, to a speci c and unique entity ei. This walk has length l, corresponding
to the distance in number of RDF properties between the observed items we
want to explain, and the given entity ei.</p>
      <p>In summary, given:
{ a RDF graph G = fV; Eg where V is the set of URI entities and E the set of</p>
      <p>RDF properties;
{ the set of items D, where D V;
{ the cluster we want to explain, C+, where C+ D;
{ the items that do not belong to C , where C = D n C+;
there exists
{ a set of items I = fc1; : : : ; ckg D sharing the same walk w!i of length l to
an entity ei, where w!i is a sequence of l RDF properties pi 2 E in the form
of w!i = fp1; : : : ; plg and ei is an entity in V.</p>
      <p>Given the items ci 2 I, some of them would belong to C+, and some others
will belong to C . The objective is then to nd the best walk w!i to an entity
ei maximising the number of ci 2 (I [ C+) and minimising the number of ci 2
(I [ C ). This can be de ned as an explanation expi for a cluster.</p>
      <p>Figure 2 shows a toy example that uses a RDF graph of countries. Here, D
is the set of 5 countries uis:Somalia, uis:Ethiopia, uis:India, uis:UK and
uis:US from the UNESCO dataset. What we know from the clusters is that
uis:Somalia, uis:Ethiopia, uis:India belong to Y (Y = C+), while uis:UK,
uis:US to B (B = C ). As one can see, the three uis:Somalia, uis:Ethiopia,
uis:India are connected to the DBpedia entity e1 = dbpedia:category:Least
developed Countries by a walk of length l = 3, i.e. w!1=fowl:sameAs,
dc:subject, skos:relatedMatchg, while uis:UK, uis:US do share the same w!1, but to a
di erent entity e2 = dbpedia:category:developed Countries. Because items
in Y share the same walk w!i to the entity ei, while items outside the cluster do
not, then this can considered an explanation to it, i.e. explain(Y).</p>
      <p>The process of explaining a cluster is therefore:
nding all the explanations expi, with w!i being the common walk and ei the
entity that is common to a set of initial items I, where jIj jC+j. In our
example,
exp1=</p>
      <p>explain(Y)
howl:sameAs, dc:subject, skos:relatedMatch.</p>
      <p>db:category:Least developed Countriesi.</p>
      <p>Here is the second issue: how to perform such a search for a common entity? In
other words, where do we nd db:category:Least developed Countries, and
how?
Traversing Linked Data. The interconnection of Linked Data can be easily
exploited for this purpose. Looking for a common entity can become a graph
search process, in which a graph is iteratively built by traversing entities and
following their links to other entities. In such a manner, there is no need to have
any a priori knowledge about data sources, nor taking care of data indexing or
crawling. Each entity can be dereferenced in order to nd connections to other
entities (therefore, datasets), allowing the discovery of new knowledge, until an
entity common to enough items of the cluster is found.</p>
      <p>The link traversal process relies on the fact that if data are connected (through
owl:sameAs, skos:exactMatch, rdfs:seeAlso or simply by vocabulary reuse), then
we can easily and naturally span datasources and gather new, unknown
knowledge. If we refer again to our example, the UNESCO data (de ned by the
uis namespace) are connected to their DBpedia correspondent via the walk
w!1=fowl:sameAsg of length l = 1. So, in only one traversal, we already
accessed knowledge within a new datasource. As DBpedia entities are also linked
to other datasets, we can expect to go across new datasets within few traversals.</p>
      <p>As the link traversal can be only be applied to URIs, our last challenge is:
how can we build explanations out of literals and numerical values?
Reasoning over datatype properties. So far we have considered as valid
explanation for a group of items I a walk w!i from them to one common entity
ei. If we look again at our graph example, we will notice that uis:Somalia,
uis:Ethiopia, and uis:India have the same walk w!2 = fowl:sameAs,
dbp:gdpPppPerCapitag, and the three numerical values they are walking to are similar
if compared to the ones of items in cluster B. Again, our human expert would
say:</p>
      <p>explain(Y) = \countries with a GPD per capita lower than 4k$"</p>
      <p>In the case of incomes, it is unlikely that two countries will have the same one,
so we cannot expect that the walk will take to a common value. It is necessary to
re ne the de nition of an explanation for a cluster, by including this similarity
between numerical values, as well as literals:
1. explain(Ci): hw!i:vii</p>
      <p>if the last property pl of the walk w!i is an object property
2. explain(Ci) =hw!i: [ j ].vii</p>
      <p>if the last property pl of the walk w!i is a datatype property
To conclude, we now focus on creating a process to generate those
explanations, that exploits the Linked Data traversal and interconnections between
datasources.
3
3.1</p>
      <sec id="sec-3-1">
        <title>Proposed Solution</title>
        <sec id="sec-3-1-1">
          <title>Dedalo, an A* process for Linked Data</title>
          <p>
            In [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] we presented Dedalo, an automatic approach to derive Linked Data
explanations out of clusters. As said, the current work presents an extension of
such a process.
          </p>
          <p>Dedalo is an A* process considering Linked Data as a graph in which nodes
are the RDF entities and edges are the properties connecting them. Many
algorithms have proven being more e cient than the A* in path nding, as they
pre-process the graph to perform better. Those approaches, however, cannot be
applied in our context, for two main reasons: (1) a retrieval of the entire Linked
Data graph is not conceivable considering the huge amount of data sources and
(2) most of the information would actually be not relevant for our explanation
(we might not care about movies, when looking for an explanation about
countries, unless those movies are connected to the countries for some reason).</p>
          <p>
            The A* is a best- rst search aiming at nding the least-cost path from a
given initial node (the source) to one other node (the goal) according to a given
heuristics [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. The graph traversal is held by following the path with the lowest
cost, while the new paths are collected and kept into a queue. The cost of a path
x is estimated using a heuristic measure f (x), which de nes the order the paths
in the queue. f (x) is the sum of :
{ g(x), the past path-cost function, which is the known distance from the
starting node to the current node;
{ h(x), the future path-cost function, which is an estimate of how likely the
path is to be a good one to reach the goal.
          </p>
          <p>
            This idea is then applied to Linked Data. Items in D = fc1; : : : ; cj g are
the graph sources, while the entity ei of each explanation expi = hw!i:eii is
the goal. In [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], we demonstrated how the entropy of a path is a valid cost
function f (x) for our purpose. Entropy [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] focuses on the frequency of a given
path (corresponding to g(x)) and the distribution of its values (corresponding
to h(x)). For a detailed discussion around other possible cost functions, please
refer to [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>The problem here is that we do not know what is the goal in advance, nor we
can know how good it is for our cluster. Moreover, our graph is build iteratively:
each time we dereference new entities, V increases in size. For this reason, the
goal of our traversal is any entity ei 2 V at a maximum distance j from the
sources, where j is the length of the graph at the jth given iteration. Iteration
is intended as how many times a new ( rst) path is the queue has been chosen.
When this happens, a new part of the graph G is revealed, and new goals ei are
added to V. Finally, for each of the discovered goals, we introduced a second
function f 2(expi), to assess the explanation expi = hw!i:eii for the given cluster.
3.2</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>The Linked Data traversal process</title>
          <p>The Linked Data traversal is composed of three di erent steps: (i) URI
dereferencing, (ii) Path collecting and (iii) Explanation building.</p>
          <p>URI dereferencing. Initially, the graph we have is a graph of length j = 0,
where V = D and E = ;. As explained, we chose to use the URI dereferencing
process to be consistent with the Linked Data principles. For each of the items,
we use the HTTP protocol to obtain all the RDF properties and values the entity
is related to, by collecting all the triples &lt;ei,pi,vi&gt;. For example, given the
entity uis:Ethiopia, we collect p0=owl:sameAs and v0=dbpedia:Ethiopia.
The discovered values vi are added to V, while the properties to E. As one can
see, some of the discovered values are part of new datasets, that we have found
following the natural links of the described resource. In case the entity has no
equivalent values, we select equivalent instances using the sameAs.org service3,
by processing the new triples &lt;ei,owl:sameAs,vi&gt; and adding its components
to the graph.</p>
          <p>Path collecting. Each new walk w!i is built starting by adding to the existing
rst walk of the pile, the new properties pi of each triple extracted from the
URI dereferencing. The new w!i are evaluated according to the entropy function
ent(w!i) and queued in the pile of possible walks to follow in the graph
accordingly. When the new rst walk in the queue will be chosen, a new j +1th iteration
will start.</p>
          <p>For instance, if the last rst walk in the pile was of length l = 1 such as
w!1 =fowl:sameAsg and from the dereferencing of the entity dbpedia:Ethiopia
we have collected the triples:
t1=&lt;dbpedia:Ethiopia,dc:subject,db:category:Countries in Africa&gt;
t2=&lt;dbpedia:Ethiopia,dbp:gdpPppPerCapita,"1200"&gt;
we will form two new walks of length 2, such as w!2=fowl:sameAs, dc:subject g
and w!3=fowl:sameAs, dbp:gdpPppPerCapitag. We then evaluate their costs with
ent(w!2) and ent(w!3) and add them to the queue of paths to follow.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 http://sameas.org/</title>
      <p>All the entities, the rst w!i in the queue walks to, are the ones further
expanded by deferencing within the following iteration. If we assume the walk with
the least cost is w!2, all the entities this one takes to (in our case
db:category:Countries in Africa, db:category:South Asian countries, and db:category:
Liberal democracies) are dereferenced. Subsequently, new walks are found and
build of out of this new traversal, e.g. w!4 = fowl:sameAs, dc:subject, skos:relatedMatchg,
and so on.</p>
      <p>Explanations building. Before starting a new iteration, we build and evaluate
the new explanations. Explanations are built by chaining the walks w!i to the
entities !e i that have been discovered at the current iteration. The length of
the new explanations, which corresponds to the length of the walk w!i rst in
the queue, gives an insight of how much the graph has been traversed, i.e. how
far we have gone from the sources. If we take w!4, we will build the following
explanations:
exp1= howl:sameAs,dc:subject,skos:relatedTerm.db:category:developed countries i
exp2= howl:sameAs,dc:subject,skos:relatedTerm.db:category:Least developed countriesi</p>
      <p>To evaluate how accurate a new explanation expi is for the cluster we are
explaining, we chose as f 2(expi) the F-Measure = 2 PP+RR , and adapted it by
de ning precision and recall as follows. Given an explanation expi = hw!i:eii:
P = sources(expi) \ C+ (1) R = sources(expi) \ C+ (2)</p>
      <p>sources(expi) jC+j
where sources(expi) is equivalent to jIj, the number of sources walking to ei
through the walk w!i, and C+ is the cluster we want to explain. For instance, the
explanation exp2 has three sources walking to it, and the three of them are part
of C+ (= Y), while none from outside the cluster is. So we consider it as the
most valuable explanation for the cluster.</p>
      <p>In the case the walk w!i's ending property pl is a datatype property and vi is
a numerical value, we create two alternate explanations:
exp1 = hw!i:
exp2 = hw!i:
:vii
:vii
and check, for each of the sources that have that same walk w!i, whether the value
vj they are walking to is greater or less than the value vi, and subsequently
estimate the F-measure of both exp1 and exp2. Let us consider the walk w!2
again. The entity uis:Ethiopia walks to the value v1=\1200", uis:Somalia to
the value v2=\600" and uis:India to v3= \3851". For each of the values vi,
we create the two alternate explanations and then evaluate them (see Table 1),
keeping only the one with the best score with respect to the cluster Y.
4</p>
      <sec id="sec-4-1">
        <title>Explaining the map { experiments</title>
        <p>Data preparation. The UNESCO Institute for Statistics publishes most of
its data under Linked Data principles. Following the cube model4, they
pro</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 http://www.w3.org/TR/vocab-data-cube/</title>
      <p>vide statistical observations about countries in a wide range of domains such as
economics, food, agriculture, nance and so forth. To select data and build the
dataset D of items to use as source of the graph, we used the provided SPARQL
endpoint. We selected, for each country, the percentage of females enrolled in the
secondary and tertiary education since the year 2000 and accordingly derived
the male one. We thus compared the two percentages: if the absolute di erence
of the two groups was less than 2%, the country was considered part of the G
cluster, comprehending countries where the education is on average equal. As
already presented, the map of Fig. 1 shows the results. All those data, as well
as the results and maps, are publicly available online5.
In our experiments we aim at evaluating how fast the process explain(C+)
performs.</p>
      <p>We are interested in knowing how much time it takes to reach the same
explanation expi that a human would naturally give, how much it ts the cluster
C+, how far it is from the sources, as well as how big is the graph at the moment
of the discovery. This is a preliminary step for a broader evaluation to be held on
a long term perspective, in which we aim at manually evaluating explanations
obtained automatically and the ones given by human experts.</p>
      <p>Table 2 shows the results we had for each cluster after 10 iterations. Time is
evaluated in terms of seconds taken to reach the explanation expi; the quality
of the explanation for the cluster C+ is evaluated in F-Measure. In 10 iterations,
our graph has 3.742.344 triples, 671 walks w!i have been built and are queueing
in the pile.</p>
      <p>As one can remark, the process found very good explanations (in F-measure
score) with very little cost. Dedalo's A* process is actually able to produce
explanations involving knowledge from di erent datasources (from the UNESCO
statistics to DBpedia), by following the natural links between data and by
cleverly detecting the correct walk to follow into the big Linked Data graph.</p>
      <p>To get the best explanation for Y, the process requires less than 200". The
explanation shows that the 87.8% of the countries in Y are ranked below the 126th</p>
    </sec>
    <sec id="sec-6">
      <title>5 http://linkedu.eu/dedalo/</title>
      <p>explain(Y): countries where males are more educated</p>
      <p>expi F(%) Time"
hskos:exactMatch, dbp:hdiRank. .\126"i 87.8 197"
hskos:exacdtbM:aCtacthe,gdocr:ys:uLbejeacstt. developed countries i 74.7 524"
hskos:exactMatch, dbp:gdpPppPerCapitaRank. .\89"i 68.3 269"
hskos:exactMatch,dbd:cC:sautbejegcotrysk:oCso:ubnrotardieers. in Africai 67.1 540"
hskos:exactMatch, dbp:populationEstimateRank.\76"i 61.9 201"
hskos:exactMatch, dbp:gdpPppRank. .\10"i 59.1 235"
explain(B): countries where females are more educated</p>
      <p>expi F(%) Time"
hskos:exactMatch, dbpedia:hdiRank. .\119"i 63.4 198"
hskos:exactMatch, dbp:gdpPppRank. .\56"i 62.3 236"
hskos:exactMatch, dbp:populationEstimateRank. .\128"i 56.9 203"
hskos:exactMatch, dbp:gdpPppPerCapitaRank. .\107"i 56.3 267"
hskos:exactMatch, dbp:gdpPppPerCapitaRank. .\100"i 54.5 267"
hskos:exactMatdcbh:,Cdact:esguobjreyc:t,Lastkoisn:bArmoeardierc.an Countriesi 49.3 542"
explain(G): countries where education is on average equal</p>
      <p>expi F(%) Time"
hskos:exactMatch, dbprop:gdpPppRank. .\64"i 62 234"
hskos:exactMatch, dbprop:gdpPppPerCapitaRank. .\29"i 61 268"
hskos:exactMatch, dbprop:areaRank. .\18"i 57 254"
hskos:exactMatch, dbprop:populationDensityRank. .\148"i 52 238"
hskos:exactMatch, dbprop:populationEstimateRank. .\25"i 49 201"
country in the Human Development Index 6 (HDI) ranking. Based on statistics
on life expectancy, education and income, the HDI ranks countries from the
most developed to the least one. The lower the country is in the rank, the less
developed it is. Similarly, the best explanation for B is that the 63.4% of its
countries are among the 119 most developed countries. It is important to recall that
such an explanation would have not been found without any reasoning upon
numerical values. Other good explanations involve an object property, which
con rms our assumption that items of the same cluster share walks to common
values. In fact, the second good explanation for Y is that the 74.7% of the cluster
is labeled in DBpedia as least developed countries, which means that they all
have a common walk w!i=fskos:exactMatch, dc:subject g to the common entity
db:Category:Least developed Countries.</p>
    </sec>
    <sec id="sec-7">
      <title>6 http://en.wikipedia.org/wiki/Human Development Index</title>
      <sec id="sec-7-1">
        <title>Related Work</title>
        <p>Works discovering new knowledge in Linked Data can be grouped into bottom-up
and top-down approaches.</p>
        <p>
          Bottom-up approaches are focused on coping with data diversity. Generally,
those approaches present data services allowing the exploration, navigation and
reasoning on billions of triples from di erent datasets: among them, we can cite
Factforge [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], including DBpedia, Freebase, Geonames, the CIA World Factbook,
MusicBrainz, WordNet and the New York Times; the LODatio framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
a platform to search instances over Linked Data, using a Google-like approach
based on RDF types and properties; but also indexes such as the OpenLinks
LOD cache7 or the Semantic Web index Sindice [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The main objective of those
works is to keep a large, up-to-date coverage of the Web of Data as well as a
fast and e cient response time of the service. As already mentioned in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], those
objectives have been partially met using technical expedients (e.g. distribution
techniques, index optimisation, data synchronisation), but they still require a
local data management that goes beyond the principles of the Web of Data.
        </p>
        <p>
          The second category comprehends top-down techniques traversing Linked
Data as graph and exploiting the connections between sources for an on-the- y
knowledge discovery. Some works such as the ones of [
          <xref ref-type="bibr" rid="ref11 ref4">4, 11</xref>
          ] focus more on the
navigation functionalities providing query languages, while recent approaches
to automatically traverse links between entities to gather data live and from
unknown sources can also be found in the Link Traversal Based Query Execution
eld (LTBQE), such as the ones of [
          <xref ref-type="bibr" rid="ref14 ref6 ref7">6, 7, 14</xref>
          ]. After obtaining the query results,
the URIs are looked up following the data links in order to improve the SPARQL
answer with information from unknown sources. Similarly, we use the entities
dereferencing to gather unknown data and produce meaningful explanation for
clusters.
        </p>
        <p>
          Finally, the idea of applying graph search algorithms to Linked Data has
been exploited in the literature for users recommendation. In the works of [
          <xref ref-type="bibr" rid="ref10 ref9">9,
10</xref>
          ] users are suggested items that are considered similar, when similar means
Linked Data items sharing the same path to a speci ed entity. Those work only
take into consideration a singular graph (such as DBpedia) and do not consider
the knowledge that might be connected in external datasources. Moreover, they
rely on SPARQL endpoints to retrieve information rather than URI lookup.
6
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>Conclusion and Future Work</title>
        <p>In this work we presented an extension to Dedalo, a process to explain Knowledge
Discovery clusters using Linked Data. To achieve this, we rede ned Dedalo as
an A* search in the Linked Data graph aiming at nding the best walk(s) of
RDF properties between a set of initial sources (the items in the cluster to be
explained) and a speci c value in the graph, that can be either a URI resource or
a numerical value. Those explanations are built using the links between data (and
datasources), simply exploiting URI dereferencing. Without having any a priori</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7 http://lod.openlinksw.com/</title>
      <p>knowledge about the datasources, we nd meaningful explanations gathering
knowledge from di erent datasets.</p>
      <p>The future direction we might want to take are currently focusing on the
noise and bias introduced by the owl:sameAs links. In fact, explanations might
be biased if information in the datasets is missing or not homogeneous. Other
future directions might concern the traversal of incoming links, as our process
currently only takes into account the outgoing ones.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiryakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ognyanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peikov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tashev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Velkov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Factforge: A fast track to the web of data</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catasta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenzhorn</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Sindice.com: a document-oriented lookup index for open linked data</article-title>
          .
          <source>IJMSO</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3752</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Delling</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultes</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Engineering route planning algorithms</article-title>
          .
          <source>In Algorithmics of large and complex networks</source>
          (pp.
          <fpage>117</fpage>
          -
          <lpage>139</lpage>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fionda</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pirro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The swget portal: Navigating and acting on the web of linked data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>26</volume>
          ,
          <fpage>29</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krayer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Get the google feeling: Supporting users in ndingrelevant sources of linked open data at web-scale. Semantic Web Challenge, Submission to the Billion Triple Track</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Langegger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>A database perspective on consuming linked data on the web</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2013</year>
          , June).
          <article-title>SQUIN: a traversal based query execution system for the web of linked data</article-title>
          .
          <source>In Proceedings of the 2013 international conference on Management of data</source>
          (pp.
          <fpage>1081</fpage>
          -
          <lpage>1084</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ladwig</surname>
            <given-names>G.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>SIHJoin: Querying remote and local Linked Data</article-title>
          .
          <source>In ESWC</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ostuni</surname>
            ,
            <given-names>V. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Noia</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirizzi</surname>
            <given-names>R.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Di</surname>
            <given-names>Sciascio E.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>A Linked Data Recommender System using a Neighborhood-based Graph Kernel</article-title>
          . EC-Web2014; to appear.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ostuni</surname>
            ,
            <given-names>V. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Noia</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Sciascio</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mirizzi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2013</year>
          ,
          <article-title>October)</article-title>
          .
          <article-title>Top-N recommendations from implicit feedback leveraging linked open data</article-title>
          .
          <source>In Proceedings of the 7th ACM conference on Recommender systems</source>
          (pp.
          <fpage>85</fpage>
          -
          <lpage>92</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>nSPARQL: A navigational language for RDF</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <fpage>255</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1948</year>
          ).
          <source>A Mathematical Theory of Communication. Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <issue>3</issue>
          ):
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tiddi</surname>
            , I., d'Aquin,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Dedalo: looking for Clusters Explanations in a Labyrinth of Linked Data, 11th Extended Semantic Web Conference</article-title>
          ,
          <string-name>
            <surname>ESWC</surname>
          </string-name>
          <year>2014</year>
          , Crete.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Link Traversal Querying for a diverse Web of Data. Semantic Web Journal</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>