=Paper= {{Paper |id=Vol-538/paper-13 |storemode=property |title=Silk - A Link Discovery Framework for the Web of Data |pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper13.pdf |volume=Vol-538 |dblpUrl=https://dblp.org/rec/conf/www/VolzBGK09 }} ==Silk - A Link Discovery Framework for the Web of Data== https://ceur-ws.org/Vol-538/ldow2009_paper13.pdf
Silk – A Link Discovery Framework for the Web of Data

Julius Volz, Chemnitz University of Technology, Straße der Nationen 62, D-09107 Chemnitz, volz@hrz.tu-chemnitz.de
Christian Bizer, Freie Universität Berlin, Web-based Systems Group, Garystr. 21, D-14195 Berlin, chris@bizer.de
Martin Gaedke, Chemnitz University of Technology, Straße der Nationen 62, D-09107 Chemnitz, gaedke@cs.tu-chemnitz.de
Georgi Kobilarov, Freie Universität Berlin, Web-based Systems Group, Garystr. 21, D-14195 Berlin, georgi.kobilarov@fu-berlin.de



ABSTRACT
The Web of Data is built upon two simple ideas: employ the RDF data model to publish structured data on the Web, and set explicit RDF links between entities within different data sources. This paper presents the Silk – Link Discovery Framework, a tool for finding relationships between entities within different data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk features a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions entities must fulfill in order to be interlinked. Link conditions may be based on various similarity metrics and can take the graph around entities into account, which is addressed using a path-based selector language. Silk accesses data sources over the SPARQL protocol and can thus be used without having to replicate datasets locally.

Categories and Subject Descriptors
H.2.3 [Database Management]: Languages

General Terms
Measurement, Languages

Keywords
Linked data, link discovery, record linkage, similarity, RDF

1. INTRODUCTION
The Web of Data [1] has grown significantly over the last two years and has started to span data sources from a wide range of domains such as geographic information, people, companies, music, life-science data, books, and scientific publications.

While there are more and more tools available for publishing Linked Data on the Web [2], there is still a lack of tools that support data publishers in setting RDF links to other data sources on the Web. The Silk - Link Discovery Framework contributes to filling this gap. Using the declarative Silk - Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or related entities, which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local and remote data sources.

The main features of the Silk framework are:
• it supports the generation of owl:sameAs links as well as other types of RDF links.
• it provides a flexible, declarative language for specifying link conditions.
• it can be employed in distributed environments without having to replicate datasets locally.
• it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.
• it implements various caching, indexing and entity preselection methods to increase performance and reduce network load.

This paper is structured as follows: Section 2 gives an overview of the Silk - Link Specification Language by means of a concrete usage example. Section 3 reports the results of applying Silk to discover links between several data sources within the LOD data cloud¹. We describe the implementation of the Silk framework in Section 4 and review related work in Section 5.

2. LINK SPECIFICATION LANGUAGE
The Silk - Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions as well as the implemented similarity metrics and value transformation functions were chosen by abstracting from the link heuristics that were used to establish links between different data sources in the LOD cloud.

Figure 1 contains a complete Silk-LSL example. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia² and by GeoNames³ to identify cities. In line 12 of the link specification, we thus configure the link type to be owl:sameAs.

¹ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
² http://dbpedia.org/About
³ http://www.geonames.org/ontology/

Copyright is held by the author/owner(s).
LDOW 2009, April 20, 2009, Madrid, Spain.
[Figure 1: Silk-LSL listing, lines 01 to 60; the XML markup is not preserved. Recoverable values and annotations, by listing line:]
  Lines 02-07  DBpedia data source ("Specify SPARQL endpoints"): endpoint http://dbpedia.org/sparql, graph http://dbpedia.org, caching 1, page size 10000
  Lines 08-10  GeoNames data source: endpoint http://localhost:8890/sparql
  Line 12      Link type owl:sameAs ("Specify link type")
  Lines 13-15  Source dataset ("Specify source dataset"), restricted to { ?a rdf:type dbpedia:City } UNION { ?a rdf:type dbpedia:PopulatedPlace }
  Lines 16-18  Target dataset ("Specify target dataset"), restricted to ?b gn:featureClass gn:P
  Lines 19-55  Link condition ("Aggregate results", "Compare city names using Jaro similarity", "Compare links to Wikipedia", "Compare populations", "Weight results", "Compare geocoordinates", "Use paths to address RDF nodes")
  Lines 56-58  "Specify thresholds, link limits and output format"

                                        Figure 1. Example: Interlinking cities in DBpedia and GeoNames
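The annotations in Figure 1 outline how a link condition is evaluated: individual similarity scores are computed, weighted, aggregated, and finally compared against thresholds. The following Python sketch illustrates that flow under stated assumptions; it is not Silk's implementation. The names `Score`, `best_of`, `weighted_average` and `classify` are hypothetical, as is the policy of skipping optional metrics whose property value is missing and failing the condition when a mandatory one is missing. The 0.9 and 0.7 thresholds mirror the values used in the city-linking example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Score:
    """One metric result; all field names are illustrative assumptions."""
    value: Optional[float]   # similarity in [0, 1], or None if the property is missing
    weight: float = 1.0
    optional: bool = False


def best_of(values: Iterable[Optional[float]]) -> Optional[float]:
    """MAX aggregation: highest available similarity, or None if none is available."""
    known = [v for v in values if v is not None]
    return max(known) if known else None


def weighted_average(scores: Iterable[Score]) -> float:
    """AVG aggregation over the metrics that produced a value."""
    scores = list(scores)
    # Assumption: a missing mandatory metric makes the whole condition fail.
    if any(s.value is None and not s.optional for s in scores):
        return 0.0
    present = [s for s in scores if s.value is not None]
    total_weight = sum(s.weight for s in present)
    if total_weight == 0:
        return 0.0
    return sum(s.value * s.weight for s in present) / total_weight


def classify(similarity: float, accept: float = 0.9, verify: float = 0.7) -> str:
    """Map an aggregated similarity to accept / verify / reject."""
    if similarity > accept:
        return "accept"
    if similarity >= verify:
        return "verify"
    return "reject"


# Made-up scores for one candidate city pair.
pair = [
    Score(0.95),                                  # city names
    Score(None, optional=True),                   # Wikipedia link missing
    Score(best_of([None, 0.9]), optional=True),   # best of two population properties
    Score(0.98, weight=0.7),                      # longitude, down-weighted
    Score(0.99, weight=0.7),                      # latitude, down-weighted
]
print(classify(weighted_average(pair)))
```

Treating a missing optional value as "not counted" rather than as a score of 0 keeps incomplete records from being penalised; whether Silk does exactly this is not stated in the text.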
2.1 Data Access
For accessing the source and target data sources, we first configure access parameters to the DBpedia and GeoNames SPARQL endpoints using the data source directive. The only mandatory data source parameter is the endpoint URI. Besides this, it is possible to define other data source access options, such as the graph name, and to enable the caching of SPARQL query results in memory. In order to restrict the query load on remote SPARQL endpoints, it is possible to set a delay between subsequent queries using a dedicated parameter that specifies the delay time in milliseconds. For working against SPARQL endpoints that restrict result sets to a certain size, Silk uses a paging mechanism. The maximal result size is configured using the page size parameter. The paging mechanism is implemented via SPARQL LIMIT and OFFSET queries. Lines 2 to 7 within the example show how the access parameters for the DBpedia data source are set to select only resources from the named graph http://dbpedia.org, enable caching and limit the page size to 10,000 results per query.

The configured data sources are later referenced in the source dataset and target dataset clauses of the "cities" link specification. Since we only want to match cities, we restrict the sets of examined resources to instances of the classes dbpedia:City and dbpedia:PopulatedPlace and the GeoNames feature class gn:P by supplying SPARQL conditions within the restriction directives in lines 14 and 17. These statements may contain any valid SPARQL expressions that would usually be found in the WHERE clause of a SPARQL query.

2.2 Link Conditions
The link condition section is the heart of a Silk link specification and defines how similarity metrics are combined in order to calculate a total similarity value for an entity pair.

For comparing property values or sets of entities, Silk provides a number of built-in similarity metrics. Table 1 gives an overview of these metrics. The implemented metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy using the distance metric proposed by Zhong et al. in [3]. Each metric in Silk evaluates to a similarity value between 0 and 1, with higher values indicating a greater similarity.

          Table 1. Available similarity metrics in Silk
Metric                   Description
jaroSimilarity           String similarity based on Jaro distance metric
jaroWinklerSimilarity    String similarity based on Jaro-Winkler metric
qGramSimilarity          String similarity based on q-grams
stringEquality           Returns 1 when strings are equal, 0 otherwise
numSimilarity            Percentage-based numeric similarity
dateSimilarity           Similarity between two date values
uriEquality              Returns 1 if two URIs are equal, 0 otherwise
taxonomicSimilarity      Metric based on the taxonomic distance of two concepts
maxSimilarityInSet       Returns the highest encountered similarity of comparing a single item to all items in a set
setSimilarity            Similarity between two sets of items

These similarity metrics may be combined using the following aggregation functions:
• AVG – weighted average
• MAX – choose the highest value
• MIN – choose the lowest value
• EUCLID – Euclidean distance metric
• PRODUCT – weighted product

To take into account the varying importance of different properties, the metrics grouped inside the AVG, EUCLID and PRODUCT operators may be weighted individually, with higher-weighted metrics having a greater influence on the aggregated result.

In the link condition section of the example (lines 19 to 55), we compute similarity values for the labels, Wikipedia links, population counts and geographic coordinates of cities between datasets and calculate a weighted average of these values. Most metrics are configured to be optional since the presence of the respective RDF property values they refer to is not always guaranteed. In cases where alternative properties refer to an equivalent feature (such as dbpedia:populationEstimate and dbpedia:populationTotal), we choose to perform comparisons for both properties and select the best evaluation by using the MAX aggregation operator. Weighting of results is used within the metrics comparing the geographical coordinates (lines 46 and 50), with the longitude and latitude similarity weights lowered to 0.7 each.

After specifying the link condition, we finally specify within the thresholds clause that resource pairs with a similarity score above 0.9 are to be interlinked, whereas pairs between 0.7 and 0.9 should be written to a separate output file and be reviewed by an expert. The link limit clause is used to limit the number of outgoing links from a particular entity within the source data set. If several candidate links exist, only the highest evaluated one is chosen and written to the output files as specified by the output directive. In this example, we permit only one outgoing owl:sameAs link from each resource.

Discovered links are output either as simple RDF triples or in reified form together with their creation date, confidence score and the ID of the employed interlinking heuristic.

2.3 Silk Selector Language
Especially for discovering semantic relationships other than entity equality, a flexible way of selecting sets of resources or literals in the RDF graph around a particular resource is needed. For instance, DBpedia and LinkedMDB both contain movies and directors. For generating links between movies in DBpedia and their directors in LinkedMDB, we might want to navigate to the director of a movie in DBpedia and compare her properties with directors in LinkedMDB. In the case of linking musical artists between DBpedia and MusicBrainz⁴, an open music database, we [...]
Silk addresses this requirement by using a simple RDF path selector language for providing parameter values to similarity metrics and transformation functions. A Silk selector language path starts with a variable referring to an RDF resource and may then use one of several operators to navigate the graph surrounding this resource. To simply access a particular property of a resource, the forward operator ( / ) may be used. For example, the path "?artist/rdfs:label" would select the set of label values associated with an artist referred to by the ?artist variable.

Sometimes, however, we need to navigate backwards along a property edge. For example, musical albums in DBpedia contain a dbpedia:artist property pointing to the album's creator. However, there exists no explicit reverse property like dbpedia:albums for an artist resource. So if a path begins with an artist and we need to select all of her albums, we may use the backward operator ( \ ) to navigate property edges in reverse. Since navigating backwards along the property dbpedia:artist would select all of the artist's works, this may not only select albums, but also songs and single releases. This is addressed by a filter operator ([ ]), which allows selected resources to be restricted to match a certain predicate. In this example, we could use the RDF path "?artist\dbpedia:artist[rdf:type dbpedia:Album]" to select only albums amongst the works of a musical artist in DBpedia. The filter operator also supports comparisons of numeric types as predicates. For example, to select songs of an artist with a runtime greater than 200 seconds, the path "?artist\dbpedia:artist[dbpedia:runtime > 200]" can be used.

2.4 Pre-Matching
Comparing all pairs of entities of a source dataset S and a target dataset T would result in an unsatisfactory runtime complexity of O(|S|·|T|). Even after using SPARQL restrictions to select suitable subsets of each dataset, the required time and network load to perform all pair comparisons might prove to be impractical in many cases. To avoid this problem, we need a way to quickly find a limited set of target entities that are likely to match a given source entity. Silk supports this by allowing rough index prematching.

When using prematching, all target resources are indexed by one or more specified property values (most commonly, their labels) before any detailed comparisons are performed. During the subsequent resource comparison phase, the previously generated index is used to look up potential matches for a given source resource. This lookup uses the BM25⁵ weighting scheme for the ranking of search results and additionally supports spelling corrections of individual words of a query. Only a fixed number of target resources found in this lookup are considered as candidates for a detailed comparison. An example of such a prematching configuration that could be applied to our city linking example is presented in Figure 2:

[Figure 2: Silk-LSL prematching listing; the XML markup is not preserved.]
                            Figure 2. Pre-Matching

This statement instructs Silk to index the cities in the target dataset by both their gn:name and gn:alternateName property values. When performing comparisons, the rdfs:label of a source resource is used as a search term into the generated indexes and only the first ten target hits found in each index are considered as link candidates for detailed comparisons. If we neglect a slight index insertion and search time dependency on the target dataset size, we now achieve a runtime complexity of O(|S| + |T|), making it feasible to interlink even large datasets under practical time constraints. Note however that this prematching may come at the cost of missing some links during discovery, since it is not guaranteed that a prematching lookup will always find all matching target resources.

3. EXPERIMENTS
During the implementation of Silk, we experimented with linking DBpedia to several other public Linked Data sources. Movies in DBpedia were linked both to their movie counterparts and to their directors in LinkedMDB⁶. Between GeoNames and DBpedia, we created links between cities, as shown in the Silk-LSL example above. Finally, clinical drugs from DrugBank⁷ were linked with their counterparts in DBpedia. The following section gives a short overview of the employed similarity heuristics as well as the numbers of discovered links.

For interlinking movies between DBpedia and LinkedMDB, we used Jaro string similarity to match movie titles and director names, date similarity for comparing release dates and numeric similarity for runtimes. We used the Thresholds directive to define similarities above 0.9 as acceptable and similarities between 0.7 and 0.9 to be verified by an expert. The number of movies in the datasets and the numbers of discovered links are shown in Table 2.

    Table 2. Linking movies between DBpedia and LinkedMDB
Number of movies in DBpedia              34,685
Number of movies in LinkedMDB            38,064
Links above accept threshold             26,059
Links above verify threshold             1,858

Interlinking DBpedia movies to their directors in LinkedMDB is an example of creating links other than owl:sameAs links, for which we simply used a Jaro string similarity metric to compare a movie's director name to the label of a director in LinkedMDB. Dataset statistics and linking results for this example are given in Table 3.

⁴ http://musicbrainz.org
⁵ http://xapian.org/docs/bm25.html
⁶ http://www.linkedmdb.org/
⁷ http://www4.wiwiss.fu-berlin.de/drugbank/
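Jaro string similarity is used in several of these experiments. As an illustration of what such a metric computes, here is a minimal, self-contained Python sketch of the standard Jaro similarity. It is not the Febrl routine that Silk relies on, and the function name `jaro_similarity` is chosen here for illustration only.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


# Made-up labels, for illustration only.
print(round(jaro_similarity("Stanley Kubrick", "Stanley Kubrik"), 3))
```

A score computed this way would then be compared against the accept and verify thresholds described above.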
    Table 3. Linking DBpedia movies to directors in LinkedMDB
Number of movies in DBpedia                  34,685
Number of directors in LinkedMDB             8,367
Links above accept threshold                 1,693
Links above verify threshold                 374

For linking cities in DBpedia and GeoNames, we used Jaro similarity between city names, URI equality for links to Wikipedia articles as well as numeric similarity for the population counts and geographic coordinates. The results for this use case are shown in Table 4.

      Table 4. Linking cities between DBpedia and GeoNames
Number of cities in DBpedia                  40,197
Number of populated places in GeoNames       2,410,855
Links above accept threshold                 35,031
Links above verify threshold                 9,147

Finally, for generating links between clinical drugs in DrugBank and DBpedia, we compared drug labels via the JaroWinkler similarity, PubChem⁸ identifiers via string equality and used numeric similarity for comparing the drugs' molecular weights. Table 5 shows the results for this case.

     Table 5. Linking drugs between DBpedia and DrugBank
Number of drugs in DBpedia                   3,134
Number of drugs in DrugBank                  4,772
Links above accept threshold                 1,202
Links above verify threshold                 245

The metric compositions, weightings and thresholds in these examples were chosen based on what seemed to produce reasonably valid results in our tests. However, a detailed analysis of the quality of the generated links has not yet been performed. When using Silk in a practical scenario, it is advisable to evaluate the accuracy and completeness of generated links more closely while adjusting the linking specification accordingly.

4. SILK IMPLEMENTATION
Silk is written in Python and is run as a batch process on the command line. The framework may be downloaded from Google Code⁹ under the terms of the BSD license. For calculating string similarities, a library from Febrl¹⁰, the Freely Extensible Biomedical Record Linkage toolkit, is used, while Silk's prematching features are achieved with the search engine library Xapian¹¹. The Silk system architecture is illustrated in Figure 3:

                                    Figure 3. Silk System Architecture

Before executing any comparisons, Silk retrieves the source and target resource lists. The list of source resources is retrieved directly through a resource lister which queries the respective SPARQL endpoint and caches the list on disk for reuse in a later run of Silk. Target resources are first indexed by means of a resource indexer, making them searchable by specific properties or RDF Path evaluations. During comparison processing, a list of target resource candidates for each source resource is looked up in this index, limiting detailed comparisons to index search hits. This prematching of resources is optional, but recommended as it drastically reduces run time and network load.

During each detailed resource pair comparison, the user-specified metric aggregation tree is evaluated. Function or metric parameters passed as RDF Path values are transformed to SPARQL queries by an RDF Path translator and sent to the respective SPARQL endpoint for evaluation. Query results are cached in memory during Silk runtime.

If a metric aggregation for a pair of resources results in a value above the specified linking thresholds, a candidate link is saved in memory. After completing all comparisons for a link specification, a link limit may be applied to limit the maximum number of outgoing links from a single resource. Only a specified number of the highest-rated links is kept; lower-valued links are discarded. The remaining links are written to the output file in the format specified by the user (Turtle, CSV, reified format together with meta-information such as confidence score and creation date).

⁸ http://pubchem.ncbi.nlm.nih.gov
⁹ http://silk.googlecode.com
¹⁰ http://sourceforge.net/projects/febrl
¹¹ http://xapian.org

5. RELATED WORK
There is a large body of related work on record linkage [5] and duplicate detection [4] within the database community as well as on ontology matching [6] in the knowledge representation community. Silk builds on this work by implementing similarity metrics and aggregation functions that proved successful within other scenarios. What distinguishes Silk from this work is its focus on the Linked Data scenario where different types of semantic links should be discovered between Web data sources that often mix terms from different vocabularies and where no consistent RDFS or OWL schemata spanning the data sources exist.

Related work that also focuses on Linked Data includes Raimond et al. [7], who propose a link discovery algorithm that takes into account both the similarities of web resources and of their neighbors. The algorithm is implemented within the GNAT tool and has been evaluated for interlinking music-related data sets. In [8], Hassanzadeh et al. describe a framework for the discovery of semantic links over relational data which also introduces a declarative language, LinQL, for specifying link conditions. A main difference between LinQL and Silk-LSL is the underlying data model and Silk's ability to more flexibly combine metrics through aggregation functions. A framework that deals with instance coreferencing as part of the larger process of fusing Web data is the KnoFuss Architecture proposed in [9]. In contrast to Silk, KnoFuss assumes that instance data is represented according to consistent OWL ontologies.

6. CONCLUSIONS
We presented the Silk framework, a flexible tool for discovering links between entities within different Web data sources. We introduced the Silk-LSL link specification language and demonstrated its applicability within different link discovery scenarios.

The value of the Web of Data rises and falls with the amount and the quality of links between data sources. We hope that Silk and other similar tools will help to strengthen the linkage between data sources and therefore contribute to the overall utility of the network.

The complete Silk-LSL language specification and further Silk usage examples are found on the Silk project website at http://www4.wiwiss.fu-berlin.de/bizer/silk/.

7. REFERENCES
[1] Berners-Lee, T.: Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html
[2] Bizer, C., Cyganiak, R., Heath, T.: How to publish Linked Data on the Web. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[3] Zhong, J., et al.: Conceptual Graph Matching for Semantic Search. The 2002 International Conference on Computational Science (ICCS2002), Amsterdam, April 2002.
[4] Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007).
[5] Winkler, W.: Overview of Record Linkage and Current Research Directions. Bureau of the Census, Technical Report, 2006.
[6] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg, 2007.
[7] Raimond, Y., Sutton, C., Sandler, M.: Automatic Interlinking of Music Datasets on the Semantic Web. In: Linked Data on the Web Workshop (LDOW2008), 2008.
[8] Hassanzadeh, O., et al.: A Declarative Framework for Semantic Link Discovery over Relational Data. Poster at 18th World Wide Web Conference (WWW2009), 2009.
[9] Nikolov, A., et al.: Integration of Semantically Annotated Data by the KnoFuss Architecture. In: 16th International Conference on Knowledge Engineering and Knowledge Management, 265-274, 2008.