<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Spectrometry of Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Bartolomeo</string-name>
          <email>giovanni.bartolomeo@uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Salsano</string-name>
          <email>stefano.salsano@uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico, 1, 00133 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <volume>21</volume>
      <issue>2011</issue>
      <abstract>
        <p>Entity mining is still a troublesome open problem. In past years many approaches allowed to automate the generation of equivalence links between references using schema matching or various heuristics based on the recognition of similar property values. In contrast, few of them considered the analysis of the network of equivalence links (“equivalence network”) as an indication of the likelihood and strength of the equivalence. Following this basic idea, in this paper we apply the well known Girvan and Newman algorithm to partition existing equivalence networks into clusters of co-references and gain an insight of their nature, size and composition. 1 See</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Entity</kwd>
        <kwd>Co-references</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Could a URI reference (URIRef) be thought as exactly “attached”
to its referent? Could it make sense to talk about entity
“identifiers” or would it be better to talk about more ambiguous
“references”, i.e., placeholders for any model that satisfies the
formal semantics of the Semantic Web (Hayes)1? Booth [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
observes that the aforementioned question, which in the past has
been often regarded as fundamental in the debate about identity
on the Web, is relatively unimportant. As long as an entity,
identified by whatsoever URIRef, is associated to at least one
description containing machine understandable information, this
information can be automatically processed and used by
applications.
      </p>
      <p>
        Yet the proliferation of references poses a practical problem in
Linked Data. From [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we learn that using multiple references for
the same entity (in short, “co-references”) is a fault-tolerant
approach, lowers the barrier to enter the Linked Data and helps in
maintaining traceability of different “views” of the same entity by
various data publishers. On the contrary, the opposite party [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
In the past the vision of a single, canonical “entity identifier” has
inspired at least two major European projects (the ReSIST
Network of Excellence2 and the OKKAM project3). In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for
instance, Jaffri describes a “Consistent Reference Service” (CRS)
that aggregates entity co-references into bundles. Each bundle
contains only one preferred reference (“canon”) and a number of
other co-references (“duplicates”). To obtain “consistent
references” at a scale, the CRS recommends using the canon
instead of duplicates; in many cases, however, the canon is a
random choice of the system and is no more “representative” than
its duplicates. Differently, Bouquet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes a solution based
on “OkkamIDs”, a new class of identifiers that “directly refer” to
entities. As opposite to different descriptions provided by single
data publishers, the notion of direct reference is realized by means
of a shared entity profile, i.e. an associated description, accessible
by dereferencing the OkkamID and containing information agreed
and consolidated by the Web community. Being shared by the
community of its users, the entity profile tends to become as much
complete and exhaustive as possible. According to Bouquet, this
feature would provide the answer to the argument, raised in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
that the user (i.e., the data publisher or consumer) tends “to
observe [only] a small portion of [a reference] use”; thus,
implicitly, her knowledge about the referenced entity remains
ambiguous.
      </p>
      <p>At the time of writing, however, neither the CRS nor OkkamIDs
seem to have provided the ultimate solution for entity
identification. Nevertheless, they contributed to raise an
interesting multidisciplinary discussion on this topic and to
provide good theoretical and practical inputs to several related
researches. In particular, the main lesson learned from these
experiences was that community (i.e. inter-domain) consensus is
fundamental. This realization seems to suggest that the task of
evaluating an equivalence link as correct or incorrect often
requires a global perspective. A “bird-eye view” of the entire set
of co-references and of their equivalence links could probably
provide more useful indications about the strength of a given
equivalence. As a matter of fact, the Linked Data cloud has begun
2 The ReSIST European Newtork of Excellence co-funded by the
European Commission (GA 026764) ran from January 2006 to
March 2009, http://www.resist-noe.org/ (accessed March, 21
2011).
3 The OKKAM project co-funded by the European Commission
(GA 215032) ran from January 2008 to June 2010,
http://www.okkam.org/ (accessed March, 21 2011).
to show widely referred nodes tending to attract a large portion of
incoming equivalence links, and becoming, for the Linked Data
community, more representative of an entity than others. At a
closer look, however, this observable fact presents at least three
particular features: it is distributed, because, for each entity, more
than one node might aspire to become an “authority”. It is also
dynamic, as it varies, over the time, according to the raising of
new pay-level domains4 publishing their entity descriptions as
Linked Data. Last but not least, it is aggregative: groups of
coreferences, coming from different domains, tend to aggregate into
clusters. But, despite its popularity, few works have focused on
the analysis of this phenomenon. Actually, previous works, which
we will describe in section 2, have provided similar investigations
on equivalence links; but none of them, to the best of our
knowledge, analyzed the tendency, shown by co-references, to
form clusters. Therefore, the main contribution of this paper will
consist in getting a first insight into this trend – ignoring, for the
time being, its dynamics. To this end, in section 3 we will
illustrate a methodology to detect clusters of nodes in equivalence
networks (i.e. networks formed by equivalence links) based on the
well known community detection algorithm developed by Girvan
and Newman. Next, in section 4, we will present the results of an
experiment made applying the algorithm to a dataset of about 1.7
million equivalence links that we collected by means of the
Sindice5 search engine API. Then we will discuss some interesting
observed patterns, and propose a synoptic diagram showing the
average composition of a cluster as a function of its cardinality
(section 5). Finally we will draw the conclusions and hint at future
developments and possible applications of our methodology.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORKS</title>
      <p>
        Entity matching has been addressed mainly by providing tools to
automate the generation of equivalence links, using schema
analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or heuristics that recognize similar property values
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These works are indeed related to our approach: by
publishing equivalence links on the Linked Data in form of
owl:sameAs statements, they served to progressively build the
precious ground relevant for the cluster analysis we are going to
present.
      </p>
      <p>
        However, contrary to the synthesis of new potential equivalence
links, we are interested in a methodology that allows isolating
groups of co-references perceived by the Liked Data community
as “consistent”, i.e. as mostly referring to given entity.
Analysis of existing similarity and equivalence links has been
performed in the past by Hu (2008) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Ding (2010) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In
particular, Hu considered not only explicit equivalences (i.e., RDF
statements containing the owl:sameAs predicate), but also other
predicates that may provide hints on the equivalence of two
resources, namely inverse functional properties, functional
properties and maximum cardinality. However, his experimental
4 Pay-level domain is the term used to identify a domain
subordinate to a generic top level domain or to a country code
top level domain. In the remainder of this paper, and for the
sake of simplicity, we will often use the term “domain” to refer
to a pay-level domain.
results on a large-scale dataset (76 million URIrefs) have shown
that the bulk (99.8%) of equivalence relations is given by explicit
statements. Interesting enough, Hu also considered the
“authoritativeness” of the context in which the equivalence
statements appeared, showing that only 6% of the over 7 million
equivalence statements he analyzed appeared in RDF documents
reachable by dereferencing the subject as well as the object of the
statement, 66% by dereferencing either the subject or the object
whereas 27% were in other documents. Therefore, an indexing
service (such as e.g., Sindice) is generally needed in order to
discover existing statements from these documents as well.
Ding was the first researcher that used the term “sameAs
networks” to denote those RDF graphs formed by only RDF
statements containing the owl:sameAs predicates. He
performed a statistical investigation explicitly focused on these
networks involving about 8 million equivalence links among
nodes arranged into 2 million weakly connected components. He
found that the in-degree (i.e. the number of incoming
owl:sameAs links per node) distribution exhibited the power
law pattern characteristic of scale-free networks. Another result in
Ding’s experiment was that the highly referenced nodes were from
relatively few domains such as dbpedia.org, freebase.com,
geonames.org.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. PROPOSED APPROACH</title>
      <p>
        We need to look at the development of node “clusters”, i.e. groups
of nodes within which equivalence links are much more dense
than between them. A similar phenomenon has been found in
many complex networks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], finally encouraging the two
physicians Newman and Girvan [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to develop an ad hoc
algorithm facilitating cluster detection. Before introducing their
algorithm, however, we need to formalize the concept of graph,
path, connected component, modularity and edge betweenness.
Let I, B, L be disjoint infinite sets of URIRefs, blank nodes and
literals.
      </p>
      <p>Definition 1. RDF Graph. A RDF triple is a tuple (s,p,o) ∈ I ∪ B
x I x I ∪ B ∪ L, with s ∈ I ∪ B, p ∈ I, o ∈ I ∪ B ∪ L, and s said
subject, p said predicate, o said object. An RDF graph is a set of
RDF triples. A subject or an object of a RDF triple is called a
node of the graph.</p>
      <p>Definition 2. RDF Subgraph. A subgraph of a RDF graph G is a
RDF graph whose RDF triples are a subset of those in G.
Definition 3. SameAs Network. A sameAs network is a RDF
graph whose RDF triples are all in the form (s,owl:sameAs,o).
Definition 4. Arcs and Edges. Let G be an RDF graph. A
predicate in a RDF triple in G is called an arc of G and is
represented as a direct link from the subject to the object. An edge
of G is any undirected link between two nodes.</p>
      <p>In the following of our discussion it will be not critical to consider
the direction of the links in a RDF graph G. Therefore we will
assume that for each arc in G connecting two nodes m and n there
is always an associated edge that connects m with n.</p>
      <p>Definition 5. Neighbours. Two nodes that are connected by an
edge are said to be neighbours.</p>
      <p>Definition 6. Undirected Path. An undirected path is a sequence
of edges that connects one node to another.</p>
      <p>Definition 7. Connected Component. A (weakly) connected RDF
graph is a RDF graph where there exists an undirected path
between any pair of nodes. A (weakly) connected component of a
RDF graph G is any RDF subgraph of G that is connected.
Definition 8. Partitions and Partition Set of a Graph. A partition
set of a graph G is any subgraph G’ obtained from G by removing
as many arcs as needed to fragment G into a set of exhaustive
disjoint connected components, said partitions, Si, so that 1&lt;i&lt;|G|,
G’= Si, Si ∩ Sj = φ i≠j.</p>
      <p>
        In practice, a partition set represents one of the possible
dissections of the original graph into sets of nodes called
partitions. Now, our problem is to find the “best” partition set, i.e.
the one which most closely depicts the tendency of the nodes in
the original graph to arrange themselves into clusters. To this end,
let us consider a RDF graph G of cardinality |G| and let G’ be one
of its partition sets. Let e={eij} be the symmetric matrix whose
element eij∈[
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ] represents the fraction of total edges, in G,
connecting nodes that in the partition set G’ would belong to
partition Si with nodes that in G’ would belong to Sj. The element
eii represents the fraction of edges connecting nodes within
partition Si. Denoting with Tr(x) the trace of the matrix x, ∑(x) the
sum of its elements and x·y the multiplication of matrices x and y,
we can enunciate the following definition and theorem:
Definition 9. (Girvan and Newman) Modularity. The quantity Q
= Tr(e) – ∑(e·e) ∈[
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ] is called modularity.
      </p>
      <p>
        Theorem 1. Given a partition set G’ of a graph G, the modularity
in Definition 9 quantifies the fraction of edges in G falling inside
the partitions of the partition set G’ minus the expected value that
the same quantity would have in a graph H having the same
partition set G’ but random connections between all its nodes.
Demonstration of this theorem is given in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Intuitively, the
modularity estimates “how much” the arrangement of links falling
in the considered partitions differs from a random pattern.
Therefore, to detect a meaningful “community structure”, i.e. a
tendency of nodes to aggregate into groups – instead of falling
randomly – Q should present relatively high values, say in the
range [0.3,0.8]; According to Newman and Girvan, in fact, values
greater than Q=0.8 have not yet been found in any natural
network.
      </p>
      <p>We now finally define the notion of cluster.</p>
      <p>Definition 10. Cluster. A cluster is any partition Si in the partition
set G’ of a graph G that shows the optimal (highest) modularity.
Clusters are therefore the partitions of the partition set(s)6 that
maximizes the modularity. Note that the modularity is a property
defined for the graph itself, with respect to the considered
partition set. It is not a property of each single partition in the
partition set.</p>
      <p>In order to detect clusters, Newman and Girvan proposed an
algorithm based on the removal of edge presenting the highest
“betweenness”, which is below defined.</p>
      <p>Definition 11. Edge Betweenness. The betweenness of edge e is
the number of shortest paths between pairs of nodes that run along
it.</p>
      <p>Note that edges with high betweenness are likely to connect
different clusters, because they are part of the maximum number
of shortest paths connecting nodes from different clusters.
The simplest algorithm proposed by Girvan and Newman
progressively computes the edge e with highest betweenness in a
graph G, removes it and computes the modularity of the resulting
graph G’ (a partition set of the original graph) until the highest
modularity value is found. In table 1 we propose a slightly
modified version of this algorithm which returns the maximum
modularity value in partition sets with less than nmax connected
components. This algorithm considers only partition sets that
exhibit modularity greater than a given threshold Qmin. To collect
more meaningful results, we decided to feed the algorithm with
only sameAs networks with a sufficient number of edges (safely
we considered only networks with cardinality greater than 15
nodes). Furthermore we experimentally set nmax=6 (likely a
reasonable value compared to the original semantics of
owl:sameAs) and Qmin=0.35.</p>
    </sec>
    <sec id="sec-4">
      <title>4. EXPERIMENTAL RESULTS</title>
      <p>We collected our dataset from Sindice using Sindice4J API (a
Java wrapper for Sindice Search and Cache API). We performed
all computations on a machine equipped with i686 Intel Xeon
CPU 3060 2.40GHz processor, 1,048,772kB RAM, and featured
Gentoo Linux Base System 1.12 OS, kernel 2.6.18-xen, Java v6.0
and Aduna Sesame v2.60 back end by Gentoo Linux MySQL
v5.1. Graph algorithms were implemented using JUNG API v2.0.
The hardware equipment we used was definitely cheaper than the
one described in previous experiments, thanks to the choice of
using Sindice Cache API and storing locally only equivalence
statements instead of caching full datasets.</p>
      <p>
        In Table 2 we provide some statistics of our experiments
compared with the ones presented by Hu [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Ding [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
total amount of RDF statements, calculated by using the methods
provided by Sindice Cache API, was 314,922,780 – certainly less
than the one handled by Ding, but comparable with the number of
statements considered by Hu. The number of different
owl:sameAs statements found in the dataset was 1,722,938,
representing 0.55% of the total amount of statements, a
percentage closer to the one reported by Ding than the one
provided by Hu.
6 There might be more than one partition set showing the same
modularity.
      </p>
      <p>Applying the algorithm in Table 1, we restricted the dataset to
only sameAs networks with cardinality greater than 15 nodes and
having a modularity Q&gt;0.35 when split in 6 clusters at most. In
our dataset 2,922 networks presented these features.
About 80% of the networks were fragmented in either five or six
clusters; the remaining ones were partitioned into three or four
clusters; none into two clusters. The maximum modularity ranged
from 0.35 to 0.67; its distribution, shown in Figure 1, presented
two peaks around 0.51 and 0.62 (respectively 24% and 19% of the
whole population).</p>
      <p>Proceeding with a manual inspection we found that the algorithm
isolated clusters more precisely referring to an entity from clusters
containing less specific references. For instance, the conceptual
entity “sergeant” was found in a sameAs network partitioned into
four clusters with modularity Q=0,44. The biggest cluster was
mainly formed by nodes from dbpedia.org, suggesting slightly
different meanings for the same concept: sergeant instructor,
detective sergeant, senior sergeant, etc. The second relevant
cluster, more consistently referring to the concept of police
sergeant, was formed by six co-references from five different
domains.</p>
      <p>Sometimes our detections revealed different location entities
wrongly stated to represent the same place. For instance, the
component referring to Abuja, since 1991 the new capital of
Nigeria, mistakenly appeared in the sameAs network containing
Lagos, the former capital of the country. The algorithm clearly
isolated Abuja from other two clusters referring to Lagos with a
maximum modularity Q=0.49.</p>
      <p>Similarly, the network containing the node
&lt;http://umbel.org/umbel/sc/Italy&gt; was partitioned
into three clusters with maximum modularity Q=0,49. One of the
clusters however contained only co-references to Mendocino, a
Victorian village near San Francisco, CA, renowned for its award
winning wines.</p>
    </sec>
    <sec id="sec-5">
      <title>5. DISCUSSION</title>
      <p>After the manual inspection we noted that edges with high
betweenness generally corresponded to mistakenly added
800
600
s
e
c
n
rre 400
u
c
c
O
200
owl:sameAs links. However, sometimes removed links were
from the collections maintained in okkam.org, the entity search
portal created by the OKKAM project. Nodes from this domain
were often acting as “hubs”, showing an elevated number of
outgoing arcs but no incoming links. They frequently appeared in
paths connecting two or more clusters. Thus, their outgoing arcs,
presenting an elevated betweenness, were often removed.
Almost all the networks presented two typical clusters: The first
one characteristically contained nodes from dbpedia.org and
fuberlin.de that were often linked to a central node from
freebase.com (or sometimes from mpii.de). The second typical
cluster included a smaller number (about 5-10) of heterogeneous
nodes coming from various domains (umbel.org, opencyc.org,
nyt.com, freebase.com, mpii.de, etc.)
To account for this phenomenon, we measured out the average
composition of each cluster and plotted it as a function of the
cluster’s cardinality. The result was the “cluster spectrometry”
depicted in Figure 2.</p>
      <p>This diagram shows three main different areas. Clusters with
cardinality 1 or 2 are prevalently made of nodes coming from
dbpedia.org and fu-berlin.de. In particular, clusters with
cardinality 1 account for isolated nodes, whereas those with
cardinality 2 quite often represent couple of nodes interlinking
resources in these two domains.</p>
      <p>The contribution from more domains becomes evident from
clusters with cardinality greater than 2, which begin to show a
more heterogeneous composition. From cardinality 3 to about
cardinality 12, dividing the cluster’s cardinality by the number of
involved domains we obtain a ratio close to 1, which means one
node per domain on average. We observe that these clusters likely
represent the “core” of Linked Data, as they contain
interconnected nodes from several different domains describing
the same entity. These clusters prove the existence of a set of well
aligned “synonyms”, which, together with their associated
descriptions, contribute to determine the generic meaning of an
entity as shared by the Linked Data community.</p>
      <p>Clusters with cardinality greater than 12 progressively drop this
characteristic composition and begin to be more homogeneous.
For instance, we can observe how contributions from
freebase.com, nyt.com and opencyc.org progressively decrease
with the number of nodes, in favor of clusters generally
dominated by nodes from dbpedia.org and fu-berlin.de. Moreover,
these clusters often consist of hyponyms (i.e. more specialized
terms) connected to a central node but not interlinked each others.
One trivial explanation for this is that, at the time of writing, only
DBpedia provides finer granule definitions while other data
providers often publish more generic definitions. However, these
semantic nuances have not been captured during the linkage and
have been flattened as “equivalences” with the corresponding
hyperonym (i.e. more generic tem).</p>
      <p>We are aware that our investigation covered only 1% of the over
30 billion RDF statements in Linked Data and possibly even
fewer samples having restricted the study to only networks
compliant with the criteria illustrated in section 3. However, we
believe that the discovered behavior could be also found in larger
portions of Linked Data and we plan to perform a more
exhaustive analysis extending our investigation to larger datasets.
Possible criticism is also related to the bias introduced by Sindice.
Many RDF statements that have been cached by Sindice are no
more in the Linked Data cloud, which is continuously refining
and evolving. Many faulty equivalences seem to have been
corrected on the dereferenceable online version of the source
documents. Clearly this bias can be reduced by repeating the
experiment and using a more recent version of Sindice caches.
Nevertheless, we noted that the presence of faulty equivalences
represented a good benchmark for our algorithm, which was able
to clearly detect clusters consistently referring to an entity and to
keep them separate from others.</p>
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSIONS AND FUTURE WORKS</title>
      <p>Entity consolidation has been subject of many researches in past
years, with many efforts focused on providing algorithms and
tools to automate the generation of equivalence links in Linked
100%
75%
n
o
iit
s
po 50%
m
o
C
25%
0%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25</p>
      <p>Cardinality
freebase.com
mpii.de
semanticweb.org
dbpedia.org
nyt.com
umbel.org
opencyc.org
geonames.org
cyc.com
fu-berlin.de
rkbexplorer.com</p>
      <p>Data. Now that Linked Data has passed its bootstrapping phase,
with several billions of equivalence links available, link analysis
could provide novel meaningful results to previously unanswered
questions.</p>
      <p>In this paper we used the already deployed equivalence links to
run a cluster detection algorithm based on edge betweenness and
Newman and Girvan modularity.</p>
      <p>Analyzing the results we found typical recurring clusters
consisting in a small number of heterogeneous nodes that we
believe to represent the bulk of consolidated entity references in
the Linked Data cloud.</p>
      <p>
        This finding inspired us to provide our own answer to the identity
debate we mentioned in the introduction of this paper: rather than
looking at an unique stable identifier for an entity, one should
accept the existence of a dynamic set of “synonyms”, arranged
into clusters and contributing to create the “meaning” of that
entity as understood by the Linked Data community. This does
not preclude the possibility to detect most representative nodes
(for instance, the ones with the highest in-degree and cluster
coefficient [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or authority rank [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]), but this is not strictly
necessary and does not seem to offer any particular advantage
except probably the “psychological” one of having solved the
quest for an unique entity identifier. Most likely, the very problem
behind entity identification is not “how many” identifiers, but
“which ones” should be used to refer to an entity. In this paper we
attempted to provide a first answer to this question from a novel
perspective, the link analysis, which in our opinion might reveal
several new understandings of Linked Data in the near future.
      </p>
      <p>
        There are several future directions of investigation: for instance,
one potential important finding that we highlighted was that
mistakenly added links showed the highest edge betweenness.
This suggests an interesting novel technique to detect misleading
equivalences (and thus different entities) at a scale. However, in
order to mechanize this procedure, an automatic verification step
– which we have not yet implemented – is necessary. At a first
glance, this verification necessarily requires a deeper analysis of
the involved nodes, and of their properties other than
owl:sameAs (for instance: literal properties, type, etc.).
However, under the assumption that similar entities should
present similar structures in their set of links, an interesting
alternative could be using link analysis algorithms that measure
the similarity between the link sets of the different nodes. This
technique, called “blockmodelling”, has been proposed in the past
for the analysis of social networks [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        As another direction of research, we plan to broaden our
methodology by introducing more sophisticated cluster detection
techniques. The Girvan-Newmann edge betweenness algorithm
we used requires that each node must appear in one and only one
cluster. This condition is probably too strong. Existing algorithms
based on a combination of edge betweenness and vertex
betweenness [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] remove this limitation and might produce more
meaningful results.
      </p>
      <p>
        Moreover we believe that other interesting features – e.g.,
relationships between clusters – might be unveiled by applying
more advanced detection techniques such as, for instance, the
“mesoscopic” analysis of connected components recently
proposed by Arenas et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This work is partially sponsored by the European Commission
under the framework of the ‘Convergence’ project (FP7-257123).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Booth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>URIs and the myth of resource identity</article-title>
          .
          <source>In Proceedings of Identity</source>
          , Reference, and the Web Workshop at the WWW Conference.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory &amp; Technology</source>
          . Morgan Claypool.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bouquet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoermer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niederee</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Entity Name System: The Backbone of an Open and Scalable Web of Data</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Semantic Computing</source>
          ,
          <fpage>554</fpage>
          -
          <lpage>561</lpage>
          , IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Halpin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mcguinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>When owl:sameas isn't the same: An analysis of identity in linked data</article-title>
          .
          <source>In Proceedings of the 9th International Semantic Web Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jaffri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glaser</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <source>Millard</source>
          .
          <year>2008</year>
          .
          <article-title>URI disambiguation in the context of linked data</article-title>
          .
          <source>In Proceedings of the 1st International Workshop on Linked Data on the Web.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Bouquet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palpanas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoermer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vignolo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>A Conceptual Model for a Web-scale Entity Name System</article-title>
          .
          <source>In Proceedings of 9th the Asian Semantic Web Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halpin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2008</year>
          . In defense of ambiguity.
          <source>International Journal of Semantic Web and Information Systems</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Shvaiko</surname>
            <given-names>E.</given-names>
          </string-name>
          <year>2007</year>
          . Ontology Matching. Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <year>2010</year>
          .
          <article-title>Results of the Ontology Alignment Evaluation Initiative 2010</article-title>
          , http://disi.unitn.it/~p2p/OM-2010
          <source>/oaei10_paper0.pdf (accessed March</source>
          , 21
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Bootstrapping object coreferencing on the semantic web</article-title>
          .
          <source>Journal of Computer Science Technology</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>663</fpage>
          -
          <lpage>675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shinavier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Shangguan
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>SameAs Networks and Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked Data</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          , Volume
          <volume>6496</volume>
          /
          <year>2010</year>
          ,
          <fpage>145</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Strogatz</surname>
            ,
            <given-names>S. H.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Exploring complex networks</article-title>
          .
          <source>Nature</source>
          ,
          <volume>410</volume>
          ,
          <fpage>268</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girvan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Finding and evaluating community structure in networks</article-title>
          .
          <source>In Physical Review E</source>
          , volume
          <volume>69</volume>
          , issue 2.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Watts</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strogatz</surname>
            ,
            <given-names>S. H.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Collective dynamics of 'small-world' networks</article-title>
          .
          <source>Nature</source>
          ,
          <volume>393</volume>
          ,
          <fpage>440</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Wasserman</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faust</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Social Network Analysis: Methods and Applications</article-title>
          . Cambridge University Press, Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Pinney</surname>
            <given-names>J. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westhead</surname>
            <given-names>D. R.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Betweenness-based decomposition methods for social and biological networks</article-title>
          .
          <source>Interdisciplinary Statistics and Bioinformatics</source>
          ,
          <volume>87</volume>
          -
          <fpage>90</fpage>
          , Leeds University Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Granell</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Mesoscopic analysis of networks: applications to exploratory analysis and data clustering</article-title>
          .
          <source>Chaos: An Interdisciplinary Journal of Nonlinear Science</source>
          ,
          <volume>21</volume>
          ,
          <fpage>016102</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>