Silk – A Link Discovery Framework for the Web of Data

Julius Volz
Chemnitz University of Technology
Straße der Nationen 62
D-09107 Chemnitz
volz@hrz.tu-chemnitz.de

Christian Bizer
Freie Universität Berlin, Web-based Systems Group
Garystr. 21
D-14195 Berlin
chris@bizer.de

Martin Gaedke
Chemnitz University of Technology
Straße der Nationen 62
D-09107 Chemnitz
gaedke@cs.tu-chemnitz.de

Georgi Kobilarov
Freie Universität Berlin, Web-based Systems Group
Garystr. 21
D-14195 Berlin
georgi.kobilarov@fu-berlin.de

ABSTRACT
The Web of Data is built upon two simple ideas: Employ the RDF data model to publish structured data on the Web and to set explicit RDF links between entities within different data sources. This paper presents the Silk – Link Discovery Framework, a tool for finding relationships between entities within different data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk features a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions entities must fulfill in order to be interlinked. Link conditions may be based on various similarity metrics and can take the graph around entities into account, which is addressed using a path-based selector language. Silk accesses data sources over the SPARQL protocol and can thus be used without having to replicate datasets locally.

Categories and Subject Descriptors
H.2.3 [Database Management]: Languages

General Terms
Measurement, Languages

Keywords
Linked data, link discovery, record linkage, similarity, RDF

1. INTRODUCTION
The Web of Data [1] has grown significantly over the last two years and has started to span data sources from a wide range of domains such as geographic information, people, companies, music, life-science data, books, and scientific publications.

While there are more and more tools available for publishing Linked Data on the Web [2], there is still a lack of tools that support data publishers in setting RDF links to other data sources on the Web. The Silk - Link Discovery Framework contributes to filling this gap. Using the declarative Silk - Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or related entities which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local and remote data sources.

The main features of the Silk framework are:
- it supports the generation of owl:sameAs links as well as other types of RDF links.
- it provides a flexible, declarative language for specifying link conditions.
- it can be employed in distributed environments without having to replicate datasets locally.
- it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.
- it implements various caching, indexing and entity pre-selection methods to increase performance and reduce network load.

This paper is structured as follows: Section 2 gives an overview of the Silk - Link Specification Language along a concrete usage example. Section 3 reports the results of applying Silk to discover links between several data sources within the LOD data cloud¹. We describe the implementation of the Silk framework in Section 4 and review related work in Section 5.

2. LINK SPECIFICATION LANGUAGE
The Silk - Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions as well as the implemented similarity metrics and value transformation functions were chosen by abstracting from the link heuristics that were used to establish links between different data sources in the LOD cloud.

Figure 1 contains a complete Silk-LSL example. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia² and by GeoNames³ to identify cities. In line 12 of the link specification, we thus configure the link type to be owl:sameAs.

¹ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
² http://dbpedia.org/About
³ http://www.geonames.org/ontology/

Copyright is held by the author/owner(s).
LDOW 2009, April 20, 2009, Madrid, Spain.

[Figure 1 is a 60-line Silk-LSL listing (lines 01 to 60). Recoverable values and annotations, by listing line:
  lines 02-07  specify SPARQL endpoints: endpoint http://dbpedia.org/sparql (03), graph http://dbpedia.org (04), caching 1 (05), page size 10000 (06)
  line 09      second endpoint: http://localhost:8890/sparql
  line 12      specify link type: owl:sameAs
  line 14      specify source dataset: { ?a rdf:type dbpedia:City } UNION { ?a rdf:type dbpedia:PopulatedPlace }
  line 17      specify target dataset: ?b gn:featureClass gn:P
  lines 19-55  link condition: aggregate results (near line 23), compare city names using Jaro similarity (near line 25), compare links to Wikipedia (near line 32), compare populations (near line 39), weight results (near line 44), compare geocoordinates (near line 48), use paths to address RDF nodes (near line 54)
  from line 57 specify thresholds, link limits and output format]

Figure 1.
Example: Interlinking cities in DBpedia and GeoNames

2.1 Data Access
For accessing the source and target data sources, we first configure access parameters to the DBpedia and GeoNames SPARQL endpoints using a data source directive. The only mandatory data source parameter is the endpoint URI. Besides this, it is possible to define other data source access options, such as the graph name, and to enable the caching of SPARQL query results in memory. In order to restrict the query load on remote SPARQL endpoints, it is possible to set a delay between subsequent queries using a dedicated parameter that specifies the delay time in milliseconds. For working against SPARQL endpoints that restrict result sets to a certain size, Silk uses a paging mechanism. The maximal result size is configured using a page size parameter. The paging mechanism is implemented via SPARQL LIMIT and OFFSET queries. Lines 2 to 7 within the example show how the access parameters for the DBpedia data source are set to select only resources from the named graph http://dbpedia.org, enable caching and limit the page size to 10,000 results per query.

The configured data sources are later referenced in the source and target dataset clauses of the "cities" link specification. Since we only want to match cities, we restrict the sets of examined resources to instances of the classes dbpedia:City and dbpedia:PopulatedPlace and the GeoNames feature class gn:P by supplying SPARQL conditions within the corresponding directives in lines 14 and 17. These statements may contain any valid SPARQL expressions that would usually be found in the WHERE clause of a SPARQL query.

2.2 Link Conditions
The link condition section is the heart of a Silk link specification and defines how similarity metrics are combined in order to calculate a total similarity value for an entity pair.

For comparing property values or sets of entities, Silk provides a number of built-in similarity metrics. Table 1 gives an overview of these metrics. The implemented metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy using the distance metric proposed by Zhong et al. in [3]. Each metric in Silk evaluates to a similarity value between 0 and 1, with higher values indicating a greater similarity.

Table 1. Available similarity metrics in Silk

Metric                  Description
jaroSimilarity          String similarity based on Jaro distance metric
jaroWinklerSimilarity   String similarity based on Jaro-Winkler metric
qGramSimilarity         String similarity based on q-grams
stringEquality          Returns 1 when strings are equal, 0 otherwise
numSimilarity           Percentage-based numeric similarity
dateSimilarity          Similarity between two date values
uriEquality             Returns 1 if two URIs are equal, 0 otherwise
taxonomicSimilarity     Metric based on the taxonomic distance of two concepts
maxSimilarityInSet      Returns the highest encountered similarity of comparing a single item to all items in a set
setSimilarity           Similarity between two sets of items

These similarity metrics may be combined using the following aggregation functions:
- AVG: weighted average
- MAX: choose the highest value
- MIN: choose the lowest value
- EUCLID: Euclidean distance metric
- PRODUCT: weighted product

To take into account the varying importance of different properties, the metrics grouped inside the AVG, EUCLID and PRODUCT operators may be weighted individually, with higher-weighted metrics having a greater influence on the aggregated result.

In the link condition section of the example (lines 19 to 55), we compute similarity values for the labels, Wikipedia links, population counts and geographic coordinates of cities between datasets and calculate a weighted average of these values. Most metrics are configured to be optional since the presence of the respective RDF property values they refer to is not always guaranteed. In cases where alternating properties refer to an equivalent feature (such as dbpedia:populationEstimate and dbpedia:populationTotal), we choose to perform comparisons for both properties and select the best evaluation by using the MAX aggregation operator. Weighting of results is used within the metrics comparing the geographical coordinates (lines 46 and 50), with the longitude and latitude similarity weights lowered to 0.7 each.

After specifying the link condition, we finally specify within the Thresholds clause that resource pairs with a similarity score above 0.9 are to be interlinked, whereas pairs between 0.7 and 0.9 should be written to a separate output file and be reviewed by an expert. A link limit clause is used to limit the number of outgoing links from a particular entity within the source data set. If several candidate links exist, only the highest evaluated one is chosen and written to the output files as specified by the output directive. In this example, we permit only one outgoing owl:sameAs link from each resource.

Discovered links are output either as simple RDF triples or in reified form together with their creation date, confidence score and the ID of the employed interlinking heuristic.

2.3 Silk Selector Language
Especially for discovering semantic relationships other than entity equality, a flexible way for selecting sets of resources or literals in the RDF graph around a particular resource is needed. For instance, DBpedia and LinkedMDB both contain movies and directors. For generating links between movies in DBpedia and their directors in LinkedMDB, we might want to navigate to the director of a movie in DBpedia and compare her properties with directors in LinkedMDB. In the case of linking musical artists between DBpedia and MusicBrainz⁴, an open music database, we [...]

Silk addresses this requirement by using a simple RDF path selector language for providing parameter values to similarity metrics and transformation functions. A Silk selector language path starts with a variable referring to an RDF resource and may then use one of several operators to navigate the graph surrounding this resource. To simply access a particular property of a resource, the forward operator ( / ) may be used. For example, the path "?artist/rdfs:label" would select the set of label values associated with an artist referred to by the ?artist variable.

Sometimes, however, we need to navigate backwards along a property edge. For example, musical albums in DBpedia contain a dbpedia:artist property pointing to the album's creator. However, there exists no explicit reverse property like dbpedia:albums for an artist resource. So if a path begins with an artist and we need to select all of her albums, we may use the backward operator ( \ ) to navigate property edges in reverse. Since navigating backwards along the property dbpedia:artist would select all of the artist's works, this may not only select albums, but also songs and single releases. This is addressed by a filter operator ([ ]), which allows selected resources to be restricted to match a certain predicate. In this example, we could use the RDF path "?artist\dbpedia:artist[rdf:type dbpedia:Album]" to select only albums amongst the works of a musical artist in DBpedia. The filter operator also supports comparisons of numeric types as predicates. For example, to select songs of an artist with a runtime greater than 200 seconds, the path "?artist\dbpedia:artist[dbpedia:runtime > 200]" can be used.

2.4 Pre-Matching
Comparing all pairs of entities of a source dataset S and a target dataset T would result in an unsatisfactory runtime complexity of O(|S|·|T|). Even after using SPARQL restrictions to select suitable subsets of each dataset, the required time and network load to perform all pair comparisons might prove to be impractical in many cases. To avoid this problem, we need a way to quickly find a limited set of target entities that are likely to match a given source entity. Silk supports this by allowing rough index prematching.

When using prematching, all target resources are indexed by one or more specified property values (most commonly, their labels) before any detailed comparisons are performed. During the subsequent resource comparison phase, the previously generated index is used to look up potential matches for a given source resource. This lookup uses the BM25⁵ weighting scheme for the ranking of search results and additionally supports spelling corrections of individual words of a query. Only a fixed number of target resources found in this lookup are considered as candidates for a detailed comparison. An example of such a prematching configuration that could be applied to our city linking example is presented in Figure 2:

Figure 2. Pre-Matching

This statement instructs Silk to index the cities in the target dataset by both their gn:name and gn:alternateName property values. When performing comparisons, the rdfs:label of a source resource is used as a search term into the generated indexes and only the first ten target hits found in each index are considered as link candidates for detailed comparisons. If we neglect a slight index insertion and search time dependency on the target dataset size, we now achieve a runtime complexity of O(|S| + |T|), making it feasible to interlink even large datasets under practical time constraints. Note however that this prematching may come at the cost of missing some links during discovery, since it is not guaranteed that a prematching lookup will always find all matching target resources.

3. EXPERIMENTS
During the implementation of Silk, we experimented with linking DBpedia to several other public Linked Data sources. Movies in DBpedia were linked both to their movie counterparts and to their directors in LinkedMDB⁶. Between GeoNames and DBpedia, we created links between cities, as shown in the Silk-LSL example above. Finally, clinical drugs from DrugBank⁷ were linked with their counterparts in DBpedia. The following section gives a short overview of the employed similarity heuristics as well as the numbers of discovered links.

For interlinking movies between DBpedia and LinkedMDB, we used Jaro string similarity to match movie titles and director names, date similarity for comparing release dates and numeric similarity for runtimes. We used the Thresholds directive to define similarities above 0.9 as acceptable and similarities between 0.7 and 0.9 to be verified by an expert. The number of movies in the datasets and the numbers of discovered links are shown in Table 2.

Table 2. Linking movies between DBpedia and LinkedMDB

Number of movies in DBpedia         34,685
Number of movies in LinkedMDB       38,064
Links above accept threshold        26,059
Links above verify threshold         1,858

Interlinking DBpedia movies to their directors in LinkedMDB is an example of creating links other than owl:sameAs links, for which we simply used a Jaro string similarity metric to compare a movie's director name to the label of a director in LinkedMDB. Dataset statistics and linking results for this example are given in Table 3.

Table 3. Linking DBpedia movies to directors in LinkedMDB

Number of movies in DBpedia         34,685
Number of directors in LinkedMDB     8,367
Links above accept threshold         1,693
Links above verify threshold           374

For linking cities in DBpedia and GeoNames, we used Jaro similarity between city names, URI equality for links to Wikipedia articles as well as numeric similarity for the population counts and geographic coordinates. The results for this use case are shown in Table 4.

Table 4. Linking cities between DBpedia and GeoNames

Number of cities in DBpedia                    40,197
Number of populated places in GeoNames      2,410,855
Links above accept threshold                   35,031
Links above verify threshold                    9,147

Finally, for generating links between clinical drugs in DrugBank and DBpedia, we compared drug labels via the JaroWinkler similarity, PubChem⁸ identifiers via string equality and used numeric similarity for comparing the drugs' molecular weights. Table 5 shows the results for this case.

Table 5. Linking drugs between DBpedia and DrugBank

Number of drugs in DBpedia           3,134
Number of drugs in DrugBank          4,772
Links above accept threshold         1,202
Links above verify threshold           245

The metric compositions, weightings and thresholds in these examples were chosen based on what seemed to produce reasonably valid results in our tests. However, a detailed analysis of the quality of the generated links has not yet been performed. When using Silk in a practical scenario, it is advisable to evaluate the accuracy and completeness of generated links more closely while adjusting the linking specification accordingly.

⁴ http://musicbrainz.org
⁵ http://xapian.org/docs/bm25.html
⁶ http://www.linkedmdb.org/
⁷ http://www4.wiwiss.fu-berlin.de/drugbank/

4. SILK IMPLEMENTATION
Silk is written in Python and is run as a batch process on the command line. The framework may be downloaded from Google Code⁹ under the terms of the BSD license. For calculating string similarities, a library from Febrl¹⁰, the Freely Extensible Biomedical Record Linkage toolkit, is used, while Silk's prematching features are achieved with the search engine library Xapian¹¹. The Silk system architecture is illustrated in Figure 3:

Figure 3. Silk System Architecture

Before executing any comparisons, Silk retrieves the source and target resource lists. The list of source resources is retrieved directly through a resource lister which queries the respective SPARQL endpoint and caches the list on disk for reuse in a later run of Silk. Target resources are first indexed by means of a resource indexer, making them searchable by specific properties or RDF Path evaluations. During comparison processing, a list of target resource candidates for each source resource is looked up in this index, limiting detailed comparisons to index search hits. This prematching of resources is optional, but recommended as it drastically reduces run time and network load.

During each detailed resource pair comparison, the user-specified metric aggregation tree is evaluated. Function or metric parameters passed as RDF Path values are transformed to SPARQL queries by an RDF Path translator and sent to the respective SPARQL endpoint for evaluation. Query results are cached in memory during Silk runtime.

If a metric aggregation for a pair of resources results in a value above the specified linking thresholds, a candidate link is saved in memory. After completing all comparisons for a link specification, a link limit may be applied to limit the maximum number of outgoing links from a single resource. Only a specified count of highest-rated links are kept; lower-valued links are discarded. The remaining links are written to the output file in the format specified by the user (Turtle, CSV, reified format together with meta-information such as confidence score and creation date).

5. RELATED WORK
There is a large body of related work on record linkage [5] and duplicate detection [4] within the database community as well as on ontology matching [6] in the knowledge representation community. Silk builds on this work by implementing similarity metrics and aggregation functions that proved successful within other scenarios.
What distinguishes Silk from this work is its focus on the Linked Data scenario where different types of semantic links should be discovered between Web data sources that often mix terms from different vocabularies and where no consistent RDFS or OWL schemata spanning the data sources exist.

Related work that also focuses on Linked Data includes Raimond et al. [7] who propose a link discovery algorithm that takes into account both the similarities of web resources and of their neighbors. The algorithm is implemented within the GNAT tool and has been evaluated for interlinking music-related data sets. In [8], Hassanzadeh et al. describe a framework for the discovery of semantic links over relational data which also introduces a declarative language, LinQL, for specifying link conditions. A main difference between LinQL and Silk-LSL is the underlying data model and Silk's ability to more flexibly combine metrics through aggregation functions. A framework that deals with instance coreferencing as part of the larger process of fusing Web data is the KnoFuss Architecture proposed in [9]. In contrast to Silk, KnoFuss assumes that instance data is represented according to consistent OWL ontologies.

6. CONCLUSIONS
We presented the Silk framework, a flexible tool for discovering links between entities within different Web data sources. We introduced the Silk-LSL link specification language and demonstrated its applicability within different link discovery scenarios.

The value of the Web of Data rises and falls with the amount and the quality of links between data sources. We hope that Silk and other similar tools will help to strengthen the linkage between data sources and therefore contribute to the overall utility of the network.

The complete Silk-LSL language specification and further Silk usage examples are found on the Silk project website at http://www4.wiwiss.fu-berlin.de/bizer/silk/.

7. REFERENCES
[1] Berners-Lee, T.: Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html
[2] Bizer, C., Cyganiak, R., Heath, T.: How to publish Linked Data on the Web. http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[3] Zhong, J., et al.: Conceptual Graph Matching for Semantic Search. The 2002 International Conference on Computational Science (ICCS2002), Amsterdam, April 2002.
[4] Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007).
[5] Winkler, W.: Overview of Record Linkage and Current Research Directions. Bureau of the Census, Technical Report, 2006.
[6] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg, 2007.
[7] Raimond, Y., Sutton, C., Sandler, M.: Automatic Interlinking of Music Datasets on the Semantic Web. In: Linked Data on the Web Workshop (LDOW2008), 2008.
[8] Hassanzadeh, O., et al.: A Declarative Framework for Semantic Link Discovery over Relational Data. Poster at 18th World Wide Web Conference (WWW2009), 2009.
[9] Nikolov, A., et al.: Integration of Semantically Annotated Data by the KnoFuss Architecture. In: 16th International Conference on Knowledge Engineering and Knowledge Management, 265-274, 2008.

⁸ http://pubchem.ncbi.nlm.nih.gov
⁹ http://silk.googlecode.com
¹⁰ http://sourceforge.net/projects/febrl
¹¹ http://xapian.org
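
As a rough illustration of the scoring steps described in Sections 2.2 and 4 (weighted AVG and MAX aggregation, the accept and verify thresholds of 0.9 and 0.7, and the link limit), the following Python sketch may be helpful. It is not taken from the Silk code base: all function names are hypothetical, and the treatment of optional metrics (a missing value is simply skipped) is an assumption, since the paper does not state how Silk handles them internally.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Link = Tuple[str, str, float]  # (source URI, target URI, similarity score)


def weighted_avg(scored: Sequence[Tuple[Optional[float], float]]) -> float:
    """Weighted average over (score, weight) pairs.

    Assumption: a score of None stands for an optional metric whose
    property value was missing; such entries are skipped.
    """
    present = [(s, w) for s, w in scored if s is not None]
    if not present:
        return 0.0
    total_weight = sum(w for _, w in present)
    return sum(s * w for s, w in present) / total_weight


def best_of(scores: Sequence[Optional[float]]) -> Optional[float]:
    """MAX aggregation: highest available score, or None if all are missing."""
    present = [s for s in scores if s is not None]
    return max(present) if present else None


def classify(score: float, accept: float = 0.9, verify: float = 0.7) -> str:
    """Map a total similarity to 'accept', 'verify' (expert review) or 'discard'."""
    if score > accept:
        return "accept"
    if score >= verify:
        return "verify"
    return "discard"


def apply_link_limit(candidates: List[Link], limit: int = 1) -> List[Link]:
    """Keep only the `limit` highest-scored links per source resource."""
    by_source: Dict[str, List[Link]] = {}
    for link in candidates:
        by_source.setdefault(link[0], []).append(link)
    kept: List[Link] = []
    for links in by_source.values():
        links.sort(key=lambda l: l[2], reverse=True)
        kept.extend(links[:limit])
    return kept


if __name__ == "__main__":
    # Two alternative population properties: take the better evaluation.
    population = best_of([None, 0.8])
    # Label similarity with full weight, population with full weight,
    # and one optional metric whose value is missing.
    total = weighted_avg([(1.0, 1.0), (population, 1.0), (None, 0.7)])
    print(total, classify(total))
```

In this sketch a pair is only accepted when its score is strictly above the accept threshold, mirroring the wording "above 0.9" in Section 2.2; whether the boundary values themselves are included is not specified in the paper.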