=Paper=
{{Paper
|id=Vol-538/paper-13
|storemode=property
|title=Silk - A Link Discovery Framework for the Web of Data
|pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper13.pdf
|volume=Vol-538
|dblpUrl=https://dblp.org/rec/conf/www/VolzBGK09
}}
==Silk - A Link Discovery Framework for the Web of Data==
Silk – A Link Discovery Framework for the Web of Data
Julius Volz, Chemnitz University of Technology, Straße der Nationen 62, D-09107 Chemnitz, volz@hrz.tu-chemnitz.de

Christian Bizer, Freie Universität Berlin, Web-based Systems Group, Garystr. 21, D-14195 Berlin, chris@bizer.de

Martin Gaedke, Chemnitz University of Technology, Straße der Nationen 62, D-09107 Chemnitz, gaedke@cs.tu-chemnitz.de

Georgi Kobilarov, Freie Universität Berlin, Web-based Systems Group, Garystr. 21, D-14195 Berlin, georgi.kobilarov@fu-berlin.de
ABSTRACT

The Web of Data is built upon two simple ideas: employ the RDF data model to publish structured data on the Web, and set explicit RDF links between entities within different data sources. This paper presents the Silk – Link Discovery Framework, a tool for finding relationships between entities within different data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk features a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions entities must fulfill in order to be interlinked. Link conditions may be based on various similarity metrics and can take the graph around entities into account, which is addressed using a path-based selector language. Silk accesses data sources over the SPARQL protocol and can thus be used without having to replicate datasets locally.

Categories and Subject Descriptors

H.2.3 [Database Management]: Languages

General Terms

Measurement, Languages

Keywords

Linked data, link discovery, record linkage, similarity, RDF

1. INTRODUCTION

The Web of Data [1] has grown significantly over the last two years and has started to span data sources from a wide range of domains such as geographic information, people, companies, music, life-science data, books, and scientific publications.

While there are more and more tools available for publishing Linked Data on the Web [2], there is still a lack of tools that support data publishers in setting RDF links to other data sources on the Web. The Silk - Link Discovery Framework contributes to filling this gap. Using the declarative Silk - Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or of related entities, which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local and remote data sources.

The main features of the Silk framework are:

* it supports the generation of owl:sameAs links as well as other types of RDF links.
* it provides a flexible, declarative language for specifying link conditions.
* it can be employed in distributed environments without having to replicate datasets locally.
* it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.
* it implements various caching, indexing and entity pre-selection methods to increase performance and reduce network load.

This paper is structured as follows: Section 2 gives an overview of the Silk - Link Specification Language using a concrete usage example. Section 3 reports the results of applying Silk to discover links between several data sources within the LOD data cloud¹. We describe the implementation of the Silk framework in Section 4 and review related work in Section 5.

2. LINK SPECIFICATION LANGUAGE

The Silk - Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions, as well as the implemented similarity metrics and value transformation functions, were chosen by abstracting from the link heuristics that were used to establish links between different data sources in the LOD cloud.

Figure 1 contains a complete Silk-LSL example. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia² and by GeoNames³ to identify cities. In line 12 of the link specification, we thus configure the link type to be owl:sameAs.
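As a rough illustration of how weighted similarity scores might be combined and then classified, the following Python sketch implements a weighted average and a maximum over scores in the range 0 to 1. The function names, the sample scores, the default weight of 1.0 and the treatment of missing values are assumptions made for illustration and are not taken from the Silk implementation; the 0.9 and 0.7 thresholds and the 0.7 coordinate weights are the values used in the cities example.

```python
from typing import Optional, Sequence, Tuple

Score = Optional[float]  # None marks a metric without a value (assumption)


def weighted_avg(scored: Sequence[Tuple[Score, float]]) -> Optional[float]:
    """Weighted average of (score, weight) pairs, skipping missing scores."""
    present = [(s, w) for s, w in scored if s is not None]
    total_weight = sum(w for _, w in present)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in present) / total_weight


def best_of(scores: Sequence[Score]) -> Optional[float]:
    """Highest available score, e.g. across alternative properties."""
    present = [s for s in scores if s is not None]
    return max(present) if present else None


def classify(score: Optional[float], accept: float = 0.9, verify: float = 0.7) -> str:
    """Map an aggregated score to 'accept', 'verify' or 'reject'."""
    if score is None or score < verify:
        return "reject"
    return "accept" if score > accept else "verify"


# Hypothetical scores for one candidate pair of cities.
label = 0.95
population = best_of([None, 0.9])  # only one population property is present
latitude, longitude = 0.98, 0.99
total = weighted_avg([
    (label, 1.0),
    (population, 1.0),
    (latitude, 0.7),   # coordinate weights lowered to 0.7 each
    (longitude, 0.7),
])
print(round(total, 3), classify(total))
```

Skipping missing scores is only one possible reading of an optional metric; a stricter variant could reject a pair outright when a required value is absent.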
¹ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

² http://dbpedia.org/About

³ http://www.geonames.org/ontology/

Copyright is held by the author/owner(s).
LDOW 2009, April 20, 2009, Madrid, Spain.
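For orientation, an owl:sameAs link of the kind targeted in this use case can be written as a single RDF triple. The Python sketch below serializes one such triple in N-Triples syntax; the choice of N-Triples, the function name and the two resource URIs are illustrative assumptions, not output taken from the paper.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(source_uri: str, target_uri: str) -> str:
    """Return one N-Triples line stating that two URIs identify the same entity."""
    return f"<{source_uri}> <{OWL_SAME_AS}> <{target_uri}> ."


# Hypothetical pair of URIs for the same city in the two datasets.
print(same_as_triple("http://dbpedia.org/resource/Berlin",
                     "http://sws.geonames.org/2950159/"))
```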
[Figure 1 listing: a Silk-LSL link specification of 60 numbered lines. Line contents that survive:

03 http://dbpedia.org/sparql

04 http://dbpedia.org

05 1

06 10000

09 http://localhost:8890/sparql

12 owl:sameAs

14 { ?a rdf:type dbpedia:City } UNION { ?a rdf:type dbpedia:PopulatedPlace }

17 ?b gn:featureClass gn:P

Annotations, in order of appearance: Specify SPARQL endpoints; Specify link type; Specify source dataset; Specify target dataset; Aggregate results; Compare city names using Jaro similarity; Compare links to Wikipedia; Compare populations; Weight results; Compare geocoordinates; Use paths to address RDF nodes; Specify thresholds, link limits and output format.]
Figure 1. Example: Interlinking cities in DBpedia and GeoNames
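One way to picture how the restriction on line 14 and the page size on line 06 of Figure 1 could translate into paged SPARQL queries is sketched below in Python. The query template, the function name `paged_queries` and the fixed number of pages are assumptions made for illustration; prefix declarations are omitted, and the queries actually issued by Silk may differ.

```python
from typing import Iterator


def paged_queries(var: str, restriction: str, page_size: int, pages: int) -> Iterator[str]:
    """Yield SELECT queries that walk a result set with LIMIT and OFFSET."""
    for page in range(pages):
        yield (
            f"SELECT DISTINCT ?{var} WHERE {{ {restriction} }} "
            f"LIMIT {page_size} OFFSET {page * page_size}"
        )


# Restriction pattern and page size as they appear in Figure 1.
restriction = "{ ?a rdf:type dbpedia:City } UNION { ?a rdf:type dbpedia:PopulatedPlace }"
for query in paged_queries("a", restriction, page_size=10000, pages=3):
    print(query)
```

In practice a client would keep requesting pages until one returns fewer rows than the page size, and an ORDER BY clause is commonly added so that successive pages are stable.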
2.1 Data Access

For accessing the source and target data sources, we first configure access parameters to the DBpedia and GeoNames SPARQL endpoints using the data source directive. The only mandatory data source parameter is the endpoint URI. Besides this, it is possible to define other access options, such as the graph name, and to enable the caching of SPARQL query results in memory. In order to restrict the query load on remote SPARQL endpoints, it is possible to set a delay between subsequent queries using a delay parameter that specifies the delay time in milliseconds. For working against SPARQL endpoints that restrict result sets to a certain size, Silk uses a paging mechanism. The maximal result size is configured using the page size parameter. The paging mechanism is implemented via SPARQL LIMIT and OFFSET queries. Lines 2 to 7 within the example show how the access parameters for the DBpedia data source are set to select only resources from the named graph http://dbpedia.org, enable caching and limit the page size to 10,000 results per query.

The configured data sources are later referenced in the source dataset and target dataset clauses of the "cities" link specification. Since we only want to match cities, we restrict the sets of examined resources to instances of the classes dbpedia:City and dbpedia:PopulatedPlace and the GeoNames feature class gn:P by supplying SPARQL conditions within the restriction directives in lines 14 and 17. These statements may contain any valid SPARQL expressions that would usually be found in the WHERE clause of a SPARQL query.

2.2 Link Conditions

The link condition section is the heart of a Silk link specification and defines how similarity metrics are combined in order to calculate a total similarity value for an entity pair.

For comparing property values or sets of entities, Silk provides a number of built-in similarity metrics. Table 1 gives an overview of these metrics. The implemented metrics include string, numeric, data, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy using the distance metric proposed by Zhong et al. in [3]. Each metric in Silk evaluates to a similarity value between 0 and 1, with higher values indicating a greater similarity.

Table 1. Available similarity metrics in Silk

Metric | Description

maxSimilarityInSet | Returns the highest encountered similarity of comparing a single item to all items in a set

setSimilarity | Similarity between two sets of items

These similarity metrics may be combined using the following aggregation functions:

* AVG – weighted average
* MAX – choose the highest value
* MIN – choose the lowest value
* EUCLID – Euclidean distance metric
* PRODUCT – weighted product

To take into account the varying importance of different properties, the metrics grouped inside the AVG, EUCLID and PRODUCT operators may be weighted individually, with higher-weighted metrics having a greater influence on the aggregated result.

In the link condition section of the example (lines 19 to 55), we compute similarity values for the labels, Wikipedia links, population counts and geographic coordinates of cities between datasets and calculate a weighted average of these values. Most metrics are configured to be optional, since the presence of the respective RDF property values they refer to is not always guaranteed. In cases where alternative properties refer to an equivalent feature (such as dbpedia:populationEstimate and dbpedia:populationTotal), we choose to perform comparisons for both properties and select the best evaluation by using the MAX aggregation operator. Weighting of results is used within the metrics comparing the geographical coordinates (lines 46 and 50), with the longitude and latitude similarity weights lowered to 0.7 each.

After specifying the link condition, we finally specify within the thresholds clause that resource pairs with a similarity score above 0.9 are to be interlinked, whereas pairs between 0.7 and 0.9 should be written to a separate output file and be reviewed by an expert. The link limit clause is used to limit the number of outgoing links from a particular entity within the source data set. If several candidate links exist, only the highest evaluated one is chosen and written to the output files as specified by the