<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Silk - Generating RDF Links while publishing or consuming Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anja Jentzsch</string-name>
          <email>mail@anjajentzsch.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Isele</string-name>
          <email>robertisele@googlemail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bizer</string-name>
          <email>chris@bizer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Freie Universität Berlin, Web-based Systems Group</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The central idea of the Web of Data is to interlink data items using RDF links. However, in practice most data sources are not sufficiently interlinked with related data sources. The Silk Link Discovery Framework addresses this problem by providing tools to generate links between data items based on user-provided link specifications. It can be used by data publishers to generate links between data sets as well as by Linked Data consumers to augment Web data with additional RDF links. In this poster we present the Silk Link Discovery Framework and report on two usage examples in which we employed Silk to generate links between two data sets about movies as well as to find duplicate persons in a stream of data items that is crawled from the Web.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Link Discovery</kwd>
        <kwd>Identity Resolution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Silk - Single Machine generates RDF links on a single machine.
The data sets to be interlinked can reside either on the same
machine or on remote machines that are accessed via the SPARQL protocol.
Silk - Single Machine provides multithreading and caching. In addition,
performance can be improved further using an optional blocking feature.</p>
      <p>Silk - Map Reduce generates RDF links between data sets using a
cluster of multiple machines. It is based on Hadoop and can,
for instance, be run on Amazon Elastic MapReduce. By distributing the link
generation across multiple machines, Silk - Map Reduce enables Silk to scale
out to very large data sets.</p>
      <p>Silk - Server can be used as an identity resolution component within
applications that consume Linked Data from the Web. Silk - Server provides an
HTTP API for matching instances from an incoming stream of RDF data
against a local set of known instances. It can be used, for instance, together
with a Linked Data crawler to populate a local duplicate-free cache with
data from the Web.</p>
      <p>The Silk Link Discovery Framework is implemented in Scala<sup>2</sup> and can be
downloaded from the project homepage<sup>3</sup> under the terms of the Apache Software
License.</p>
      <p>In the following, we give an overview of the Silk Link Discovery
Framework, report on two usage examples in which we employed the framework, and
present planned future work.</p>
    </sec>
    <sec id="sec-2">
      <title>The Silk Link Discovery Framework</title>
      <p>
        The Silk Link Discovery Framework consists of a console application used to
interlink two data sets as well as of the Silk Server, an HTTP server, which
receives an incoming RDF stream and creates links between data items. Both
applications provide a flexible configuration language, the Silk Link Specification
Language (Silk-LSL), to specify the conditions data items must fulfill in order to
be interlinked [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For this purpose, the user may apply similarity metrics, such
as string, date or URI comparison methods, to multiple property values of an
entity or related entities. The resulting similarity scores can be combined and
weighted using various similarity aggregation functions. A Silk link configuration
may contain several link specifications if links for different types of data items
should be generated.
      </p>
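      <p>To illustrate how similarity scores for several properties can be combined and weighted, the following is a minimal Python sketch. It is not Silk-LSL and not Silk's implementation: the function names, the choice of difflib for string comparison, the linear year metric and the weights are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Return a similarity score in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_similarity(a, b, max_distance=5):
    """Return 1.0 for equal years, falling linearly to 0.0 at max_distance."""
    return max(0.0, 1.0 - abs(a - b) / max_distance)


def weighted_average(scores_and_weights):
    """Aggregate (score, weight) pairs into a single score in [0, 1]."""
    total_weight = sum(weight for _, weight in scores_and_weights)
    return sum(score * weight for score, weight in scores_and_weights) / total_weight


def link_condition(source, target):
    """Hypothetical link condition: the title counts twice as much as the year."""
    return weighted_average([
        (string_similarity(source["title"], target["title"]), 2.0),
        (year_similarity(source["year"], target["year"]), 1.0),
    ])


# Toy data items (illustrative values only).
movie_a = {"title": "Blade Runner", "year": 1982}
movie_b = {"title": "blade runner", "year": 1982}
movie_c = {"title": "Alien", "year": 1979}

print(link_condition(movie_a, movie_b))  # identical title and year
print(link_condition(movie_a, movie_c))  # different title, year three apart
```
]]></preformat>
      <p>In this sketch the aggregated score stays between 0 and 1 because every individual metric does and the weights are normalized.</p>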
      <p>The central part of the Silk Link Discovery Framework is the Silk Linking
Engine, which generates the links between data items according to user-provided
link specifications. The Silk Linking Engine processes the incoming data items,
which usually originate from a SPARQL endpoint, in three consecutive phases:</p>
      <p>The optional Blocking phase partitions the incoming data items into
clusters. Since comparing every source resource to every single target resource results
in a number of n × m comparisons, which might be time-consuming, blocking can
be used to reduce the number of comparisons. Blocking partitions similar data
items into clusters, limiting the comparisons to items in the same cluster. For
example, given a set of books to be compared, in order to reduce the number of
comparisons, one could block the books by publisher.
<fn id="fn2"><label>2</label><p>http://scala-lang.org</p></fn>
<fn id="fn3"><label>3</label><p>http://www4.wiwiss.fu-berlin.de/bizer/silk/</p></fn></p>
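      <p>The effect of blocking on the number of comparisons can be sketched in a few lines of Python. The example below blocks two small book data sets by publisher. The helper names and the data are hypothetical, and the sketch assumes an exact-match blocking key, which is only one possible blocking strategy.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import product


def block_by(items, key):
    """Partition items into clusters that share the same blocking key."""
    clusters = defaultdict(list)
    for item in items:
        clusters[item[key]].append(item)
    return clusters


def candidate_pairs(source, target, key=None):
    """Yield the pairs to compare: all n * m pairs, or only pairs within a block."""
    if key is None:
        yield from product(source, target)
        return
    target_blocks = block_by(target, key)
    for item in source:
        for other in target_blocks.get(item[key], []):
            yield item, other


# Toy data: two book data sets (hypothetical values).
source = [
    {"title": "Book A", "publisher": "P1"},
    {"title": "Book B", "publisher": "P2"},
    {"title": "Book C", "publisher": "P2"},
]
target = [
    {"title": "Book A (2nd ed.)", "publisher": "P1"},
    {"title": "Book B", "publisher": "P2"},
    {"title": "Book D", "publisher": "P3"},
]

print(len(list(candidate_pairs(source, target))))               # 9
print(len(list(candidate_pairs(source, target, "publisher"))))  # 3
```
]]></preformat>
      <p>Pairs whose true match falls into a different block are never compared, which is why the choice of blocking key affects recall.</p>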
      <p>The Link Generation phase reads the incoming data items and computes a
similarity value for each pair. The incoming data items, which might be allocated
to a cluster by the preceding blocking phase, are written to an internal cache.
From the cache, pairs of data items are generated. If blocking is disabled, this
will generate the complete Cartesian product of the two data sets. If blocking is
enabled, only data items from the same cluster are compared. For each pair of
data items, the link condition is evaluated, which computes a similarity value
between 0 and 1. Each pair generates a preliminary link with a confidence value
according to the similarity of the source and target data item.</p>
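      <p>A minimal sketch of this phase, under stated assumptions: the ItemCache class, the cluster label "all", the example.org URIs and the same_name link condition are all invented for illustration and do not reflect Silk's internal data structures. Items added without a cluster share one default cluster, so the full Cartesian product is generated.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from itertools import product


class ItemCache:
    """Hypothetical in-memory cache holding data items grouped by cluster."""

    def __init__(self):
        self.clusters = defaultdict(list)

    def add(self, item, cluster="all"):
        # Without blocking, every item lands in the single default cluster.
        self.clusters[cluster].append(item)


def generate_links(source_cache, target_cache, link_condition):
    """Pair up items cluster by cluster and score each pair with the link condition."""
    links = []
    for cluster, sources in source_cache.clusters.items():
        targets = target_cache.clusters.get(cluster, [])
        for source, target in product(sources, targets):
            confidence = link_condition(source, target)  # expected to lie in [0, 1]
            links.append((source["uri"], target["uri"], confidence))
    return links


def same_name(source, target):
    """Stand-in link condition: 1.0 if the names match ignoring case, else 0.0."""
    return 1.0 if source["name"].lower() == target["name"].lower() else 0.0


sources, targets = ItemCache(), ItemCache()
sources.add({"uri": "http://example.org/a/1", "name": "Ridley Scott"})
sources.add({"uri": "http://example.org/a/2", "name": "James Cameron"})
targets.add({"uri": "http://example.org/b/1", "name": "ridley scott"})
targets.add({"uri": "http://example.org/b/2", "name": "Kathryn Bigelow"})

# Blocking is disabled here, so all 2 * 2 pairs become preliminary links.
for link in generate_links(sources, targets, same_name):
    print(link)
```
]]></preformat>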
      <p>The Filtering phase filters the incoming links in two stages: In the first stage,
all links with a lower confidence than the user-defined threshold are removed. In
the second stage, all links which originate from the same data item are grouped
together. The number of links per source item that are forwarded to the output
is specified by an optional link limit. If a link limit is defined, only the links with
the highest confidence are forwarded.</p>
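      <p>The two filtering stages can be sketched as follows. This is an illustrative Python sketch, not Silk's implementation: links are assumed to be (source, target, confidence) tuples, and the function name and parameter names are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def filter_links(links, threshold, link_limit=None):
    """Drop links below the threshold, then keep the best links per source item."""
    # Stage 1: remove links whose confidence is lower than the threshold.
    confident = [link for link in links if link[2] >= threshold]

    # Stage 2: group the remaining links by the data item they originate from.
    by_source = defaultdict(list)
    for link in confident:
        by_source[link[0]].append(link)

    result = []
    for source_links in by_source.values():
        ranked = sorted(source_links, key=lambda link: link[2], reverse=True)
        # Without a link limit, every remaining link is forwarded.
        result.extend(ranked if link_limit is None else ranked[:link_limit])
    return result


links = [
    ("s1", "t1", 0.95),
    ("s1", "t2", 0.80),
    ("s1", "t3", 0.40),
    ("s2", "t4", 0.70),
    ("s2", "t5", 0.65),
]

print(filter_links(links, threshold=0.7, link_limit=1))
# [('s1', 't1', 0.95), ('s2', 't4', 0.7)]
```
]]></preformat>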
    </sec>
    <sec id="sec-3">
      <title>Silk Server</title>
      <p>Silk Server is an identity resolution component that can be used within Linked
Data application architectures to add missing RDF links to data that is
consumed from the Web of Linked Data. It is designed to be used with an incoming
stream of RDF instances, produced for example by a Linked Data crawler such
as LDspider. Silk Server matches data describing incoming instances against a
local set of known instances and discovers missing links between them. Based
on this assessment, an application can store data about newly discovered
instances in its repository or fuse data that is already known about an entity with
additional data about the entity from the Web.</p>
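      <p>The matching loop described above can be sketched in Python. The sketch does not use or model the Silk Server HTTP API. The process_stream function, the same_name link condition and the example URIs are assumptions made for illustration, and the policy of adding unmatched instances to the known set is one possible application behaviour.</p>
      <preformat><![CDATA[
```python
def same_name(instance, other):
    """Stand-in link condition: 1.0 if the names match ignoring case, else 0.0."""
    return 1.0 if instance["name"].lower() == other["name"].lower() else 0.0


def process_stream(incoming, known, link_condition, threshold):
    """Match each incoming instance against the known ones and collect links."""
    links = []
    for instance in incoming:
        matches = [k for k in known if link_condition(instance, k) >= threshold]
        if matches:
            # The instance duplicates a known entity: emit links instead of storing it.
            links.extend((instance["uri"], match["uri"]) for match in matches)
        else:
            # Newly discovered entity: remember it for later instances.
            known.append(instance)
    return links


known = [{"uri": "http://example.org/people/alice", "name": "Alice Example"}]
incoming = [
    {"uri": "http://example.com/foaf#alice", "name": "alice example"},
    {"uri": "http://example.net/bob#me", "name": "Bob Sample"},
    {"uri": "http://example.org/staff/bob", "name": "Bob Sample"},
]

links = process_stream(incoming, known, same_name, threshold=1.0)
for source_uri, target_uri in links:
    print(source_uri, "owl:sameAs", target_uri)
```
]]></preformat>
      <p>With this toy stream, the first and third instances are linked to existing entries, while the second one is added to the known set, so the set stays free of duplicates.</p>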
    </sec>
    <sec id="sec-4">
      <title>Usage Examples</title>
      <p>Silk has been employed in several scenarios to generate links between data sets
on the Web of Data. In the following, we report the results of employing Silk -
Single Machine and Silk - Server within two usage scenarios.</p>
      <sec id="sec-4-1">
        <title>Interlinking DBpedia movies with LinkedMDB directors</title>
        <p>We employed Silk - Single Machine to interlink movies in DBpedia with the
corresponding director in LinkedMDB. For this purpose, Silk was fed with
50,000 movies from DBpedia and 2,500 directors from LinkedMDB. For each
movie, Silk was configured to set a dbpedia:director link from the movie to
its director.<sup>4</sup> A single PC with a Core 2 Duo CPU and 4 GB of RAM needed
55 minutes to match the complete Cartesian product, resulting in 125,000,000
comparisons. Silk successfully identified 5,900 links between a movie and its
director. In order to increase the performance, the link specification was
extended to employ blocking on the director names to reduce the number of
comparisons. As blocking may decrease the recall of the matching, we compared
the generated links. With blocking enabled, Silk was still able to identify 5,857 links,
a loss of 43 links or less than one percent, while reducing the runtime of the
matching considerably to only 7 minutes.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Identifying duplicate person descriptions in a data stream</title>
        <p>In the Web of Data we can usually find different URIs which effectively
identify the same entity, e.g. &lt;http://tomheath.com/id/me&gt; and
&lt;http://www.eswc2006.org/people/#tom-heath&gt; describe the same person. We employed a
Linked Data crawler to crawl the FOAF web and stream the traversed profiles to
the Silk Server in order to identify duplicate persons and generate owl:sameAs
links between them. For evaluation, we used the Semantic Web Dog Food data
set<sup>5</sup>, which already interlinks some of the contained persons with their
corresponding FOAF profile. Among the 56 persons for which the Semantic Web Dog
Food data set provides links to their FOAF profile, Silk Server was able to
reconstruct 51 links from the stream. In addition, it was able to identify the FOAF
profile of another 132 persons for which Semantic Web Dog Food did not yet provide
a link to their profile.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by Vulcan Inc. as part of its Project Halo
(www.projecthalo.com) and by the EU FP7 project LOD2 - Creating Knowledge
out of Interlinked Data (Grant No. 257943, http://lod2.eu/).
<fn id="fn4"><label>4</label><p>The link specification can be found at http://www4.wiwiss.fu-berlin.de/bizer/silk/linkspecs/dbpedia_linkedmdb_directors.xml</p></fn>
<fn id="fn5"><label>5</label><p>http://data.semanticweb.org/</p></fn></p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          .
          <article-title>A graph analysis of the linked data cloud</article-title>
          .
          <source>CoRR</source>
          , abs/0903.0194
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Discovering and maintaining links on the web of data</article-title>
          .
          In
          <source>International Semantic Web Conference</source>
          , pages
          <fpage>650</fpage>
          -
          <lpage>665</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>