<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Silk - A Link Discovery Framework for the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julius Volz</string-name>
          <email>volz@hrz.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bizer</string-name>
          <email>chris@bizer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Gaedke</string-name>
          <email>gaedke@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgi Kobilarov</string-name>
          <email>georgi.kobilarov@fu-berlin.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chemnitz University of Technology</institution>
          ,
          <addr-line>Straße der Nationen 62, D-09107 Chemnitz</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Freie Universität Berlin, Web-based Systems Group</institution>
          ,
          <addr-line>Garystr. 21, D-14195 Berlin</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Freie Universität Berlin, Web-based Systems Group</institution>
          ,
          <addr-line>Garystr. 21, D-14195 Berlin</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <volume>20</volume>
      <issue>2009</issue>
      <abstract>
        <p>The Web of Data is built upon two simple ideas: Employ the RDF data model to publish structured data on the Web and to set explicit RDF links between entities within different data sources. This paper presents the Silk - Link Discovery Framework, a tool for finding relationships between entities within different data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk features a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions entities must fulfill in order to be interlinked. Link conditions may be based on various similarity metrics and can take the graph around entities into account, which is addressed using a path-based selector language. Silk accesses data sources over the SPARQL protocol and can thus be used without having to replicate datasets locally.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data</kwd>
        <kwd>link discovery</kwd>
        <kwd>record linkage</kwd>
        <kwd>similarity</kwd>
        <kwd>RDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        While there are more and more tools available for publishing
Linked Data on the Web [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there is still a lack of tools that
support data publishers in setting RDF links to other data sources
on the Web. The Silk - Link Discovery Framework contributes to
filling this gap. Using the declarative Silk - Link Specification
Language (Silk-LSL), data publishers can specify which types of
RDF links should be discovered between data sources as well as
which conditions data items must fulfill in order to be interlinked.
These link conditions can apply different similarity metrics to
multiple properties of an entity or related entities which are
addressed using a path-based selector language. The resulting
similarity scores can be weighted and combined using various
similarity aggregation functions. Silk accesses data sources via the
SPARQL protocol and can thus be used to discover links between
local and remote data sources.
      </p>
      <p>The main features of the Silk framework are:</p>
      <list list-type="bullet">
        <list-item>
          <p>It supports the generation of owl:sameAs links as well as other types of RDF links.</p>
        </list-item>
        <list-item>
          <p>It provides a flexible, declarative language for specifying link conditions.</p>
        </list-item>
        <list-item>
          <p>It can be employed in distributed environments without having to replicate datasets locally.</p>
        </list-item>
        <list-item>
          <p>It can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.</p>
        </list-item>
        <list-item>
          <p>It implements various caching, indexing and entity preselection methods to increase performance and reduce network load.</p>
        </list-item>
      </list>
      <p>This paper is structured as follows: Section 2 gives an overview of
the Silk - Link Specification Language by means of a concrete usage
example. Section 3 reports the results of applying Silk to discover
links between several data sources within the LOD data cloud<sup>1</sup>.
We describe the implementation of the Silk framework in Section
4 and review related work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. LINK SPECIFICATION LANGUAGE</title>
      <p>The Silk - Link Specification Language (Silk-LSL) is used to
express heuristics for deciding whether a semantic relationship
exists between two entities. The language is also used to specify
the access parameters for the involved data sources, and to
configure the caching, indexing and preselection features of the
framework. Link conditions can use different aggregation
functions to combine similarity scores. These aggregation
functions as well as the implemented similarity metrics and value
transformation functions were chosen by abstracting from the link
heuristics that were used to establish links between different data
sources in the LOD cloud.</p>
      <p>Figure 1 contains a complete Silk-LSL example. In this particular
use case, we want to discover owl:sameAs links between the
URIs that are used by DBpedia<sup>2</sup> and by GeoNames<sup>3</sup> to identify
cities. In line 12 of the link specification, we thus configure the
&lt;LinkType&gt; to be owl:sameAs.</p>
      <fn-group>
        <fn id="fn1">
          <label>1</label>
          <p>http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData</p>
        </fn>
        <fn id="fn2">
          <label>2</label>
          <p>http://dbpedia.org/About</p>
        </fn>
        <fn id="fn3">
          <label>3</label>
          <p>http://www.geonames.org/ontology/</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Access</title>
      <p>To access the source and target data sources, we first configure
access parameters for the DBpedia and GeoNames SPARQL
endpoints using the &lt;DataSource&gt; directive. The only
mandatory data source parameter is the endpoint URI. In addition,
it is possible to define other access options, such as the graph
name, and to enable in-memory caching of SPARQL query
results. To restrict the query load on remote
SPARQL endpoints, a delay between subsequent queries can be set
using the &lt;Pause&gt; parameter, which specifies the
delay in milliseconds. For working against SPARQL
endpoints that restrict result sets to a certain size, Silk uses a
paging mechanism. The maximum result size is configured using
the &lt;PageSize&gt; parameter. The paging mechanism is
implemented via SPARQL LIMIT and OFFSET queries. Lines 2
to 7 of the example show how the access parameters for the
DBpedia data source are set to select only resources from the
named graph http://dbpedia.org, enable caching and limit
the page size to 10,000 results per query.</p>
      <p>The configured data sources are later referenced in the
&lt;SourceDataset&gt; and &lt;TargetDataset&gt; clauses of the
"cities" link specification. Since we only want to match cities, we
restrict the sets of examined resources to instances of the classes
dbpedia:City and dbpedia:PopulatedPlace and the
GeoNames feature class gn:P by supplying SPARQL conditions
within the &lt;RestrictTo&gt; directives in lines 14 and 17. These
statements may contain any valid SPARQL expressions that
would usually be found in the WHERE clause of a SPARQL query.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Link Conditions</title>
      <p>
        The &lt;LinkCondition&gt; section is the heart of a Silk link
specification and defines how similarity metrics are combined in
order to calculate a total similarity value for an entity pair.
For comparing property values or sets of entities, Silk provides a
number of built-in similarity metrics. Table 1 gives an overview of
these metrics. The implemented metrics include string, numeric,
date, URI, and set comparison methods as well as a taxonomic
matcher that calculates the semantic distance between two
concepts within a concept hierarchy using the distance metric
proposed by Zhong et al. in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each metric in Silk evaluates to a
similarity value between 0 and 1, with higher values indicating a
greater similarity.
      </p>
      <p>In the &lt;LinkCondition&gt; section of the example (lines 19 to
55), we compute similarity values for the labels, Wikipedia
links, population counts and geographic coordinates of cities
in the two datasets and calculate a weighted average of these values.
Most metrics are configured to be optional since the presence of
the RDF property values they refer to is not always
guaranteed. In cases where alternative properties refer to an
equivalent feature (such as dbpedia:populationEstimate
and dbpedia:populationTotal), we perform
comparisons for both properties and select the best evaluation by
using the &lt;MAX&gt; aggregation operator. Weighting of results is
used within the metrics comparing the geographical coordinates
(lines 46 and 50), with the longitude and latitude similarity
weights lowered to 0.7 each.</p>
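      <p>The combination of a &lt;MAX&gt; aggregation with a weighted average over partly optional metrics can be sketched as follows. This is an illustration under stated assumptions rather than Silk's implementation: the function names are hypothetical, a missing property value is represented as None, and the rule that a missing optional score is simply left out of the average is an assumption of this sketch.</p>
      <preformat><![CDATA[
```python
from typing import List, Optional, Tuple

Score = Optional[float]  # None means the compared property value was missing


def agg_max(scores: List[Score]) -> Score:
    """MAX aggregation: best available score, or None if all are missing."""
    present = [s for s in scores if s is not None]
    return max(present) if present else None


def agg_weighted_avg(items: List[Tuple[Score, float, bool]]) -> Score:
    """Weighted average over (score, weight, optional) triples.

    Assumption of this sketch: a missing optional score is skipped, while a
    missing required score makes the whole aggregation fail (returns None).
    """
    total, weight_sum = 0.0, 0.0
    for score, weight, optional in items:
        if score is None:
            if optional:
                continue
            return None
        total += score * weight
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Illustrative values: label similarity is required, the population score is
# the MAX over two alternative properties, and both coordinate metrics carry
# the reduced weight 0.7 (the longitude value is missing here).
example = agg_weighted_avg([
    (1.0, 1.0, False),                  # label
    (agg_max([None, 0.9]), 1.0, True),  # populationEstimate / populationTotal
    (0.5, 0.7, True),                   # latitude
    (None, 0.7, True),                  # longitude (missing, optional)
])
```
]]></preformat>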
      <p>After specifying the link condition, we finally specify within the
&lt;Thresholds&gt; clause that resource pairs with a similarity
score above 0.9 are to be interlinked, whereas pairs between 0.7
and 0.9 should be written to a separate output file and be reviewed
by an expert. The &lt;Limit&gt; clause is used to limit the number of
outgoing links from a particular entity within the source data set.
If several candidate links exist, only the highest evaluated one is
chosen and written to the output files as specified by the
&lt;Output&gt; directive. In this example, we permit only one
outgoing owl:sameAs link from each resource.</p>
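      <p>The effect of the &lt;Thresholds&gt; and &lt;Limit&gt; clauses can be sketched as follows. The function names, the tuple representation of a link, and the treatment of scores lying exactly on a threshold are assumptions made for this illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Tuple

Link = Tuple[str, str, float]  # (source URI, target URI, similarity score)


def classify(links: List[Link], accept: float = 0.9,
             verify: float = 0.7) -> Tuple[List[Link], List[Link]]:
    """Split links into accepted ones and ones left for expert review."""
    accepted, to_verify = [], []
    for link in links:
        score = link[2]
        if score > accept:
            accepted.append(link)
        elif score >= verify:  # boundary handling is an assumption
            to_verify.append(link)
    return accepted, to_verify


def apply_limit(links: List[Link], max_links: int = 1) -> List[Link]:
    """Keep only the highest-scored max_links links per source resource."""
    by_source: Dict[str, List[Link]] = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept: List[Link] = []
    for source_links in by_source.values():
        source_links.sort(key=lambda link: link[2], reverse=True)
        kept.extend(source_links[:max_links])
    return kept
```
]]></preformat>
      <p>With a limit of one, a source resource that has two accepted candidates keeps only the higher-scored one, which matches the single outgoing owl:sameAs link permitted in the example.</p>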
      <p>Discovered links are output either as simple RDF triples or in
reified form together with their creation date, confidence score
and the ID of the employed interlinking heuristic.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Silk Selector Language</title>
      <p>Especially for discovering semantic relationships other than entity
equality, a flexible way of selecting sets of resources or literals in
the RDF graph around a particular resource is needed. For
instance, DBpedia and LinkedMDB both contain movies and
directors. For generating links between movies in DBpedia and
their directors in LinkedMDB, we might want to navigate to the
director of a movie in DBpedia and compare her properties with
directors in LinkedMDB. In the case of linking musical artists
between DBpedia and MusicBrainz<sup>4</sup>, an open music database, we
might want to compare properties of the albums of the musicians.
Silk addresses this requirement by using a simple RDF path
selector language for providing parameter values to similarity
metrics and transformation functions. A Silk selector language
path starts with a variable referring to an RDF resource and may
then use one of several operators to navigate the graph
surrounding this resource. To simply access a particular property
of a resource, the forward operator ( / ) may be used. For example,
the path "?artist/rdfs:label" would select the set of label
values associated with an artist referred to by the ?artist
variable.</p>
      <p>Sometimes, however, we need to navigate backwards along a
property edge. For example, musical albums in DBpedia contain a
dbpedia:artist property pointing to the album's creator.
However, there exists no explicit reverse property like
dbpedia:albums for an artist resource. So if a path begins
with an artist and we need to select all of her albums, we may use
the backward operator ( \ ) to navigate property edges in reverse.
Since navigating backwards along the property
dbpedia:artist would select all of the artist's works, this
may not only select albums, but also songs and single releases.
This is addressed by a filter operator ([ ]), which allows selected
resources to be restricted to match a certain predicate. In this
example, we could use the RDF path
"?artist\dbpedia:artist[rdf:type dbpedia:Album]" to select only
albums amongst the works of a musical artist in DBpedia. The
filter operator also supports comparisons of numeric types as
predicates. For example, to select songs of an artist with a runtime
greater than 200 seconds, the path
"?artist\dbpedia:artist[dbpedia:runtime &gt; 200]" can be used.</p>
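      <p>One plausible way of turning such a path into a SPARQL query is sketched below. The text does not spell out how Silk performs this translation, so the parsing rules, the generated variable names (?v1, ?v2, ...) and the function path_to_sparql are assumptions of this sketch. Prefix declarations are omitted, and the start variable would be bound to a concrete resource URI in practice.</p>
      <preformat><![CDATA[
```python
import re
from typing import List

# One path step: "/prop" (forward), "\prop" (backward), or a filter of the
# form "[prop value]" or "[prop > number]". Prefixed names are assumed.
_STEP = re.compile(r"""
    (?P<op>[/\\])\s*(?P<prop>[\w:]+)
  | \[\s*(?P<fprop>[\w:]+)
      (?:\s*(?P<cmp>[<>]=?|=)\s*|\s+)
      (?P<val>[^\]\s]+)\s*\]
""", re.VERBOSE)


def path_to_sparql(path: str) -> str:
    """Translate a selector path into a SPARQL SELECT query (sketch)."""
    path = path.strip()
    start = re.match(r"\?\w+", path)
    if not start:
        raise ValueError("path must start with a variable")
    current = start.group(0)
    pos = start.end()
    patterns: List[str] = []
    counter = 0
    while pos < len(path):
        step = _STEP.match(path, pos)
        if not step:
            raise ValueError(f"cannot parse path at position {pos}")
        pos = step.end()
        if step.group("op"):
            counter += 1
            nxt = f"?v{counter}"
            if step.group("op") == "/":   # forward: current --prop--> nxt
                patterns.append(f"{current} {step.group('prop')} {nxt} .")
            else:                         # backward: nxt --prop--> current
                patterns.append(f"{nxt} {step.group('prop')} {current} .")
            current = nxt
        elif step.group("cmp"):           # numeric comparison filter
            counter += 1
            tmp = f"?v{counter}"
            patterns.append(f"{current} {step.group('fprop')} {tmp} .")
            patterns.append(
                f"FILTER({tmp} {step.group('cmp')} {step.group('val')})")
        else:                             # equality filter on a property
            patterns.append(
                f"{current} {step.group('fprop')} {step.group('val')} .")
    return f"SELECT {current} WHERE {{ " + " ".join(patterns) + " }"
```
]]></preformat>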
    </sec>
    <sec id="sec-6">
      <title>2.4 Pre-Matching</title>
      <p>To compare all pairs of entities of a source dataset S and a target
dataset T would result in an unsatisfactory runtime complexity of
O(|S|·|T|). Even after using SPARQL restrictions to select suitable
subsets of each dataset, the required time and network load to
perform all pair comparisons might prove to be impractical in
many cases. To avoid this problem, we need a way to quickly find
a limited set of target entities that are likely to match a given
source entity. Silk supports this by allowing rough index
prematching.</p>
      <p>When using prematching, all target resources are indexed by one
or more specified property values (most commonly, their labels)
before any detailed comparisons are performed. During the
subsequent resource comparison phase, the previously generated
index is used to look up potential matches for a given source
resource. This lookup uses the BM25<sup>5</sup> weighting scheme for the
ranking of search results and additionally supports spelling
corrections of individual words of a query. Only a fixed number of
target resources found in this lookup are considered as candidates
for a detailed comparison. An example of such a prematching
configuration that could be applied to our city linking example is
presented in Figure 2.<fn id="fn4"><label>4</label><p>http://musicbrainz.org</p></fn><fn id="fn5"><label>5</label><p>http://xapian.org/docs/bm25.html</p></fn></p>
      <fig id="fig2">
        <label>Figure 2</label>
        <caption>
          <p>Pre-Matching</p>
        </caption>
        <preformat>&lt;PreMatchingDefinition
    sourcePath="?a/rdfs:label"
    hitLimit="10"&gt;
  &lt;Index targetPath="?b/gn:name" /&gt;
  &lt;Index targetPath="?b/gn:alternateName" /&gt;
&lt;/PreMatchingDefinition&gt;</preformat>
      </fig>
      <p>This statement instructs Silk to index the cities in the target
dataset by both their gn:name and gn:alternateName
property values. When performing comparisons, the
rdfs:label of a source resource is used as a search term into
the generated indexes and only the first ten target hits found in
each index are considered as link candidates for detailed
comparisons. If we neglect a slight index insertion and search
time dependency on the target dataset size, we now achieve a
runtime complexity of O(|S| + |T|), making it feasible to interlink
even large datasets under practical time constraints. Note, however,
that this prematching may come at the cost of missing some links
during discovery, since it is not guaranteed that a prematching
lookup will always find all matching target resources.</p>
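      <p>The idea behind prematching can be illustrated with a toy inverted index. The sketch below ranks hits by the number of shared label tokens, which is a deliberate simplification and not the BM25 weighting or spelling correction used by Silk; the class PreMatchIndex and the function candidate_pairs are hypothetical names. It shows why the number of detailed comparisons is bounded by |S| times the hit limit instead of |S|·|T|.</p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case a label and split it on whitespace."""
    return text.lower().split()


class PreMatchIndex:
    """Toy inverted index from label tokens to target resources."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, resource: str, label: str) -> None:
        """Index one target resource under every token of its label."""
        for token in tokenize(label):
            self._postings[token].add(resource)

    def lookup(self, label: str, hit_limit: int = 10) -> List[str]:
        """Return up to hit_limit targets ranked by shared-token count."""
        hits: Counter = Counter()
        for token in set(tokenize(label)):
            for resource in self._postings.get(token, ()):
                hits[resource] += 1
        # Sort by descending hit count, then by URI for a stable order.
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [resource for resource, _ in ranked[:hit_limit]]


def candidate_pairs(sources: Iterable[Tuple[str, str]],
                    index: PreMatchIndex,
                    hit_limit: int = 10) -> Iterator[Tuple[str, str]]:
    """Yield at most hit_limit (source, target) pairs per source resource."""
    for uri, label in sources:
        for target in index.lookup(label, hit_limit):
            yield uri, target
```
]]></preformat>
      <p>Only the pairs produced by candidate_pairs would then be passed on to the detailed link condition, so a target whose label shares no token with the source label is never compared. This is the source of the missed links mentioned above.</p>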
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENTS</title>
      <p>During the implementation of Silk, we experimented with linking
DBpedia to several other public Linked Data sources. Movies in
DBpedia were linked both to their movie counterparts and to their
directors in LinkedMDB<sup>6</sup>. Between GeoNames and DBpedia, we
created links between cities, as shown in the Silk-LSL example
above. Finally, clinical drugs from DrugBank<sup>7</sup> were linked with
their counterparts in DBpedia. This section gives a short
overview of the employed similarity heuristics as well as the
numbers of discovered links.</p>
      <p>For interlinking movies between DBpedia and LinkedMDB, we
used Jaro string similarity to match movie titles and director
names, date similarity for comparing release dates and numeric
similarity for runtimes. We used the Thresholds directive
&lt;Thresholds accept="0.9" verify="0.7" /&gt; to
accept links with similarities above 0.9 and to mark links with
similarities between 0.7 and 0.9 for verification by an expert. The
number of movies in the datasets and the numbers of discovered links
are shown in Table 2.</p>
      <p>[Table values extracted without row or column labels: 8,367; 1,693; 374]</p>
      <p>For linking cities in DBpedia and GeoNames, we used Jaro
similarity between city names, URI equality for links to
Wikipedia articles as well as numeric similarity for the population
counts and geographic coordinates. The results for this use case
are shown in Table 4.</p>
      <p>Finally, for generating links between clinical drugs in DrugBank
and DBpedia, we compared drug labels via the JaroWinkler
similarity, PubChem<sup>8</sup> identifiers via string equality and used
numeric similarity for comparing the drugs' molecular weights.
Table 5 shows the results for this case.</p>
      <p>The metric compositions, weightings and thresholds in these
examples were chosen based on what seemed to produce
reasonably valid results in our tests. However, a detailed analysis
of the quality of the generated links has not yet been performed.
When using Silk in a practical scenario, it is advisable to evaluate
the accuracy and completeness of generated links more closely
and to adjust the linking specification accordingly.</p>
    </sec>
    <sec id="sec-8">
      <title>4. SILK IMPLEMENTATION</title>
      <p>Silk is written in Python and is run as a batch process on the
command line. The framework may be downloaded from Google
Code<sup>9</sup> under the terms of the BSD license. For calculating string
similarities, a library from Febrl<sup>10</sup>, the Freely Extensible
Biomedical Record Linkage toolkit, is used, while Silk's
prematching features are achieved with the search engine library
Xapian<sup>11</sup>. The Silk system architecture is illustrated in Figure 3.</p>
      <p>Before executing any comparisons, Silk retrieves the source and
target resource lists. The list of source resources is retrieved
directly through a resource lister which queries the respective
SPARQL endpoint and caches the list on disk for reuse in a later
run of Silk. Target resources are first indexed by means of a
resource indexer, making them searchable by specific properties
or RDF Path evaluations. During comparison processing, a list of
target resource candidates for each source resource is looked up in
this index, limiting detailed comparisons to index search hits. This
prematching of resources is optional, but recommended as it
drastically reduces run time and network load.</p>
      <p>During each detailed resource pair comparison, the
user-specified metric aggregation tree is evaluated. Function or metric
parameters passed as RDF Path values are transformed to
SPARQL queries by an RDF Path translator and sent to the
respective SPARQL endpoint for evaluation. Query results are
cached in memory during Silk runtime.</p>
      <p>If a metric aggregation for a pair of resources results in a value
above the specified linking thresholds, a candidate link is saved in
memory. After completing all comparisons for a link
specification, a link limit may be applied to limit the maximum
number of outgoing links from a single resource. Only the specified
number of highest-rated links is kept; lower-valued links are
discarded. The remaining links are written to the output file in the
format specified by the user (Turtle, CSV, reified format together
with meta-information such as confidence score and creation
date).</p>
      <fn-group>
        <fn id="fn8">
          <label>8</label>
          <p>http://pubchem.ncbi.nlm.nih.gov</p>
        </fn>
        <fn id="fn9">
          <label>9</label>
          <p>http://silk.googlecode.com</p>
        </fn>
        <fn id="fn10">
          <label>10</label>
          <p>http://sourceforge.net/projects/febrl</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-9">
      <title>5. RELATED WORK</title>
      <p>
        There is a large body of related work on record linkage [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
duplicate detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] within the database community as well as
on ontology matching [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the knowledge representation
community. Silk builds on this work by implementing similarity
metrics and aggregation functions that proved successful within
other scenarios. What distinguishes Silk from this work is its
focus on the Linked Data scenario where different types of
semantic links should be discovered between Web data sources
that often mix terms from different vocabularies and where no
consistent RDFS or OWL schemata spanning the data sources
exist.<fn id="fn11"><label>11</label><p>http://xapian.org</p></fn>
      </p>
      <p>
        Related work that also focuses on Linked Data includes Raimond
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] who propose a link discovery algorithm that takes into
account both the similarities of web resources and of their
neighbors. The algorithm is implemented within the GNAT tool
and has been evaluated for interlinking music-related data sets. In
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Hassanzadeh et al. describe a framework for the discovery of
semantic links over relational data which also introduces a
declarative language, LinQL, for specifying link conditions. A main
difference between LinQL and Silk-LSL is the underlying data
model and Silk’s ability to more flexibly combine metrics through
aggregation functions. A framework that deals with instance
coreferencing as part of the larger process of fusing Web data is
the KnoFuss Architecture proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In contrast to Silk,
KnoFuss assumes that instance data is represented according to
consistent OWL ontologies.
      </p>
    </sec>
    <sec id="sec-10">
      <title>6. CONCLUSIONS</title>
      <p>We presented the Silk framework, a flexible tool for discovering
links between entities within different Web data sources. We
introduced the Silk-LSL link specification language and
demonstrated its applicability within different link discovery
scenarios.</p>
      <p>The value of the Web of Data rises and falls with the amount and
the quality of links between data sources. We hope that Silk and
other similar tools will help to strengthen the linkage between data
sources and therefore contribute to the overall utility of the
network.</p>
      <p>The complete Silk-LSL language specification and further Silk
usage examples can be found on the Silk project website at
http://www4.wiwiss.fu-berlin.de/bizer/silk/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data - Design Issues</article-title>
          . http://www.w3.org/DesignIssues/LinkedData.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>How to publish Linked Data on the Web</article-title>
          . http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Conceptual Graph Matching for Semantic Search</article-title>
          .
          <source>The 2002 International Conference on Computational Science (ICCS2002)</source>
          , Amsterdam,
          April
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipeirotis</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verykios</surname>
            ,
            <given-names>V.S.</given-names>
          </string-name>
          :
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of Record Linkage and Current Research Directions</article-title>
          . Bureau of the Census
          ,
          <source>Technical Report</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : Ontology Matching. Springer, Heidelberg,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Raimond</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sandler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic Interlinking of Music Datasets on the Semantic Web</article-title>
          .
          <source>In: Linked Data on the Web Workshop (LDOW2008)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , et al.:
          <article-title>A Declarative Framework for Semantic Link Discovery over Relational Data</article-title>
          .
          <source>Poster at 18th World Wide Web Conference (WWW2009)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Integration of Semantically Annotated Data by the KnoFuss Architecture</article-title>
          .
          <source>In: 16th International Conference on Knowledge Engineering and Knowledge Management</source>
          ,
          <fpage>265</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>