<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Silk - A Link Discovery Framework for the Web of Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Julius</forename><surname>Volz</surname></persName>
							<email>volz@hrz.tu-chemnitz.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Chemnitz University of Technology</orgName>
								<address>
									<addrLine>Straße der Nationen 62</addrLine>
									<postCode>D-09107</postCode>
									<settlement>Chemnitz</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christian</forename><surname>Bizer</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Freie Universität Berlin Web-based Systems Group</orgName>
								<address>
									<addrLine>Garystr. 21</addrLine>
									<postCode>D-14195</postCode>
									<settlement>Berlin</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Gaedke</surname></persName>
							<email>gaedke@cs.tu-chemnitz.de</email>
							<affiliation key="aff2">
								<orgName type="institution">Chemnitz University of Technology</orgName>
								<address>
									<addrLine>Straße der Nationen 62</addrLine>
									<postCode>D-09107</postCode>
									<settlement>Chemnitz</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Georgi</forename><surname>Kobilarov</surname></persName>
							<email>georgi.kobilarov@fu-berlin.de</email>
							<affiliation key="aff3">
								<orgName type="institution">Freie Universität Berlin Web-based Systems Group</orgName>
								<address>
									<addrLine>Garystr. 21</addrLine>
									<postCode>D-14195</postCode>
									<settlement>Berlin</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Silk - A Link Discovery Framework for the Web of Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">25734794D187684B3EAF7F34FB8ED029</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T18:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.2.3 [Database Management]: Languages</term>
					<term>Measurement</term>
					<term>Languages</term>
					<term>Linked data</term>
					<term>link discovery</term>
					<term>record linkage</term>
					<term>similarity</term>
					<term>RDF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Web of Data is built upon two simple ideas: employ the RDF data model to publish structured data on the Web, and set explicit RDF links between entities within different data sources. This paper presents the Silk - Link Discovery Framework, a tool for finding relationships between entities within different data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk features a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions entities must fulfill in order to be interlinked. Link conditions may be based on various similarity metrics and can take the graph around entities into account, which is addressed using a path-based selector language. Silk accesses data sources over the SPARQL protocol and can thus be used without having to replicate datasets locally.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The Web of Data <ref type="bibr" target="#b1">[1]</ref> has grown significantly over the last two years and has started to span data sources from a wide range of domains such as geographic information, people, companies, music, life-science data, books, and scientific publications.</p><p>While more and more tools are available for publishing Linked Data on the Web <ref type="bibr" target="#b2">[2]</ref>, there is still a lack of tools that support data publishers in setting RDF links to other data sources on the Web. The Silk - Link Discovery Framework contributes to filling this gap. Using the declarative Silk - Link Specification Language (Silk-LSL), data publishers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions can apply different similarity metrics to multiple properties of an entity or of related entities, which are addressed using a path-based selector language. The resulting similarity scores can be weighted and combined using various similarity aggregation functions. Silk accesses data sources via the SPARQL protocol and can thus be used to discover links between local and remote data sources.</p><p>The main features of the Silk framework are:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head></head><p>it supports the generation of owl:sameAs links as well as other types of RDF links.</p><p> it provides a flexible, declarative language for specifying link conditions.</p><p> it can be employed in distributed environments without having to replicate datasets locally.</p><p> it can be used in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.</p><p> it implements various caching, indexing and entity preselection methods to increase performance and reduce network load.</p><p>This paper is structured as follows: Section 2 gives an overview of the Silk -Link Specification Language along a concrete usage example. Section 3 reports the results of applying Silk to discover links between several data sources within the LOD data cloud 1 .</p><p>We describe the implementation of the Silk framework in Section 4 and review related work in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">LINK SPECIFICATION LANGUAGE</head><p>The Silk - Link Specification Language (Silk-LSL) is used to express heuristics for deciding whether a semantic relationship exists between two entities. The language is also used to specify the access parameters for the involved data sources, and to configure the caching, indexing and preselection features of the framework. Link conditions can use different aggregation functions to combine similarity scores. These aggregation functions as well as the implemented similarity metrics and value transformation functions were chosen by abstracting from the link heuristics that were used to establish links between different data sources in the LOD cloud.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> contains a complete Silk-LSL example. In this particular use case, we want to discover owl:sameAs links between the URIs that are used by DBpedia and by GeoNames to identify cities. In line 12 of the link specification, we thus configure the &lt;LinkType&gt; to be owl:sameAs.</p><p>
    &lt;DoCache&gt;1&lt;/DoCache&gt;
    &lt;PageSize&gt;10000&lt;/PageSize&gt;
  &lt;/DataSource&gt;
  &lt;DataSource id="geonames"&gt;
    &lt;EndpointURI&gt;http://localhost:8890/sparql&lt;/EndpointURI&gt;
  &lt;/DataSource&gt;
  &lt;Interlink id="cities"&gt;
    &lt;LinkType&gt;owl:sameAs&lt;/LinkType&gt;
    &lt;SourceDataset dataSource="dbpedia" var="a"&gt;
      &lt;RestrictTo&gt;{ ?a rdf:type dbpedia:City } UNION { ?a rdf:type dbpedia:PopulatedPlace }&lt;/RestrictTo&gt;
    &lt;/SourceDataset&gt;
    &lt;TargetDataset dataSource="geonames" var="b"&gt;
      &lt;RestrictTo&gt;?b gn:featureClass gn:P&lt;/RestrictTo&gt;
    &lt;/TargetDataset&gt;
    &lt;LinkCondition&gt;
      &lt;AVG&gt;
        &lt;MAX&gt;
          &lt;Compare metric="jaroSimilarity" optional="1"&gt;
            &lt;Param name="str1" path="?a/rdfs:label[@lang 'en']" /&gt;
            &lt;Param name="str2" path="?b/gn:alternateName[@lang 'en']" /&gt;
          &lt;/Compare&gt;
          &lt;Compare metric="jaroSimilarity" optional="1"&gt;
            &lt;Param name="str1" path="?a/rdfs:label" /&gt;
            &lt;Param name="str2" path="?b/gn:name" /&gt;
          &lt;/Compare&gt;
        &lt;/MAX&gt;
        &lt;Compare metric="maxSimilarityInSets" optional="1" weight="3"&gt;
          &lt;Param name="set1" path="?a/foaf:page" /&gt;
          &lt;Param name="set2" path="?b/gn:wikipediaArticle" /&gt;
          &lt;Param name="submetric" value="stringEquality" /&gt;
        &lt;/Compare&gt;
        &lt;MAX&gt;
          &lt;Match metric="numSimilarity" optional="1"&gt;
            &lt;Param name="num1" path="?a/p:populationEstimate" /&gt;
            &lt;Param name="num2" path="?b/gn:population" /&gt;
          &lt;/Match&gt;
          &lt;Match metric="numSimilarity" optional="1"&gt;
            &lt;Param name="num1" path="?a/dbpedia:populationTotal" /&gt;
            &lt;Param name="num2" path="?b/gn:population" /&gt;
          &lt;/Match&gt;
        &lt;/MAX&gt;
        &lt;Compare metric="numSimilarity" optional="1" weight="0.7"&gt;
          &lt;Param name="num1" path="?a/wgs84_pos:lat" /&gt;
          &lt;Param name="num2" path="?b/wgs84_pos:lat" /&gt;
        &lt;/Compare&gt;
        &lt;Compare metric="numSimilarity" optional="1" weight="0.7"&gt;
          &lt;Param name="num1" path="?a/wgs84_pos:long" /&gt;
          &lt;Param name="num2" path="?b/wgs84_pos:long" /&gt;
        &lt;/Compare&gt;
      &lt;/AVG&gt;
    &lt;/LinkCondition&gt;
    &lt;Thresholds accept="0.9" verify="0.7" /&gt;
    &lt;Limit max="1" method="metric_value" /&gt;
    &lt;Output acceptedLinks="accepted_links.n3" verifyLinks="verify_links.n3" mode="truncate" /&gt;
  &lt;/Interlink&gt;
&lt;/Silk&gt;
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Access</head><p>For accessing the source and target data sources, we first configure access parameters for the DBpedia and GeoNames SPARQL endpoints using the &lt;DataSource&gt; directive. The only mandatory data source parameter is the endpoint URI. Besides this, it is possible to define other access options, such as the graph name, and to enable in-memory caching of SPARQL query results. To restrict the query load on remote SPARQL endpoints, a delay between subsequent queries can be set using the &lt;Pause&gt; parameter, which specifies the delay in milliseconds. For working against SPARQL endpoints that restrict result sets to a certain size, Silk uses a paging mechanism. The maximum result size is configured using the &lt;PageSize&gt; parameter. The paging mechanism is implemented via SPARQL LIMIT and OFFSET queries. Lines 2 to 7 within the example show how the access parameters for the DBpedia data source are set to select only resources from the named graph http://dbpedia.org, enable caching and limit the page size to 10,000 results per query.</p><p>The configured data sources are later referenced in the &lt;SourceDataset&gt; and &lt;TargetDataset&gt; clauses of the "cities" link specification. Since we only want to match cities, we restrict the sets of examined resources to instances of the classes dbpedia:City and dbpedia:PopulatedPlace and the GeoNames feature class gn:P by supplying SPARQL conditions within the &lt;RestrictTo&gt; directives in lines 14 and 17. These statements may contain any valid SPARQL expressions that would usually be found in the WHERE clause of a SPARQL query.</p></div>
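<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paging mechanism described above can be sketched as follows. This is a minimal Python illustration and not Silk's actual code: the function names build_paged_query and list_resources, and the run_query callback that stands in for a SPARQL endpoint, are assumptions introduced for this example. The restriction string plays the role of a &lt;RestrictTo&gt; condition, page_size that of &lt;PageSize&gt;, and pause_ms that of &lt;Pause&gt;.</p>
<p><![CDATA[
```python
import time
from typing import Callable, Iterator, List


def build_paged_query(restriction: str, var: str, page_size: int, offset: int) -> str:
    """Build the SELECT query for one page of resources (hypothetical helper)."""
    return (
        f"SELECT DISTINCT ?{var} WHERE {{ {restriction} }} "
        f"LIMIT {page_size} OFFSET {offset}"
    )


def list_resources(
    run_query: Callable[[str], List[str]],
    restriction: str,
    var: str,
    page_size: int = 10000,
    pause_ms: int = 0,
) -> Iterator[str]:
    """Yield every resource matching the restriction, one page at a time.

    run_query is an assumed callback that sends a query string to an
    endpoint and returns the bound resource URIs as a list.
    """
    offset = 0
    while True:
        page = run_query(build_paged_query(restriction, var, page_size, offset))
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means the result set is exhausted.
            break
        offset += page_size
        # Throttle subsequent queries, in the spirit of the <Pause> parameter.
        time.sleep(pause_ms / 1000.0)
```
]]></p>
<p>A caller would pass a function that performs the actual HTTP request to the endpoint. Note that SPARQL does not guarantee a stable result order across LIMIT and OFFSET pages unless the query also contains an ORDER BY clause, so a production version would likely add one.</p></div>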
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Link Conditions</head><p>The &lt;LinkCondition&gt; section is the heart of a Silk link specification and defines how similarity metrics are combined in order to calculate a total similarity value for an entity pair.</p><p>For comparing property values or sets of entities, Silk provides a number of built-in similarity metrics. Table <ref type="table" target="#tab_0">1</ref> gives an overview of these metrics. The implemented metrics include string, numeric, date, URI, and set comparison methods as well as a taxonomic matcher that calculates the semantic distance between two concepts within a concept hierarchy using the distance metric proposed by Zhong et al. in <ref type="bibr" target="#b3">[3]</ref>. Each metric in Silk evaluates to a similarity value between 0 and 1, with higher values indicating greater similarity. To take into account the varying importance of different properties, the metrics grouped inside the AVG, EUCLID and PRODUCT operators may be weighted individually, with higher-weighted metrics having a greater influence on the aggregated result.</p><p>In the &lt;LinkCondition&gt; section of the example (lines 19 to 55), we compute similarity values for the labels, Wikipedia links, population counts and geographic coordinates of cities between datasets and calculate a weighted average of these values. Most metrics are configured to be optional since the presence of the respective RDF property values they refer to is not always guaranteed. In cases where alternative properties refer to an equivalent feature (such as dbpedia:populationEstimate and dbpedia:populationTotal), we choose to perform comparisons for both properties and select the best evaluation by using the &lt;MAX&gt; aggregation operator. Weighting of results is used within the metrics comparing the geographical coordinates (lines 46 and 50), with the longitude and latitude similarity weights lowered to 0.7 each.</p><p>After specifying the link condition, we finally specify within the &lt;Thresholds&gt; clause that resource pairs with a similarity score above 0.9 are to be interlinked, whereas pairs between 0.7 and 0.9 should be written to a separate output file and be reviewed by an expert. The &lt;Limit&gt; clause is used to limit the number of outgoing links from a particular entity within the source dataset. If several candidate links exist, only the highest-rated one is chosen and written to the output files as specified by the &lt;Output&gt; directive. In this example, we permit only one outgoing owl:sameAs link from each resource.</p><p>Discovered links are output either as simple RDF triples or in reified form together with their creation date, confidence score and the ID of the employed interlinking heuristic.</p></div>
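<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the interplay of aggregation operators, weights and thresholds concrete, the following Python sketch evaluates a small aggregation tree in the style of the cities example. It is an illustration under stated assumptions rather than Silk's implementation: the names agg_max, agg_avg and classify are hypothetical, an optional metric without values is modelled as None and simply skipped, and the treatment of scores lying exactly on a threshold is a guess.</p>
<p><![CDATA[
```python
from typing import List, Optional, Tuple

# None models an optional metric whose property values were missing (assumption).
Score = Optional[float]


def agg_max(scores: List[Score]) -> Score:
    """MAX operator: best score among the metrics that could be evaluated."""
    present = [s for s in scores if s is not None]
    return max(present) if present else None


def agg_avg(weighted: List[Tuple[Score, float]]) -> Score:
    """AVG operator: weighted mean over the metrics that could be evaluated."""
    present = [(s, w) for s, w in weighted if s is not None]
    total_weight = sum(w for _, w in present)
    if total_weight == 0:
        return None
    return sum(s * w for s, w in present) / total_weight


def classify(score: Score, accept: float = 0.9, verify: float = 0.7) -> str:
    """Map a total similarity to 'accept', 'verify' or 'reject'."""
    if score is None or score < verify:
        return "reject"
    return "accept" if score >= accept else "verify"


# Made-up metric results for one candidate city pair.
label = agg_max([0.95, None])        # English label missing on one side
wikipedia = 1.0                      # weight 3 in the example specification
population = agg_max([None, 0.9])    # only one population property present
latitude, longitude = 0.99, 0.98     # weight 0.7 each

total = agg_avg([
    (label, 1.0),
    (wikipedia, 3.0),
    (population, 1.0),
    (latitude, 0.7),
    (longitude, 0.7),
])
decision = classify(total)
```
]]></p>
<p>With these made-up inputs the heavily weighted Wikipedia match dominates the average, which is the intended effect of giving that comparison a weight of 3.</p></div>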
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Silk Selector Language</head><p>Especially for discovering semantic relationships other than entity equality, a flexible way of selecting sets of resources or literals in the RDF graph around a particular resource is needed. For instance, DBpedia and LinkedMDB both contain movies and directors. For generating links between movies in DBpedia and their directors in LinkedMDB, we might want to navigate to the director of a movie in DBpedia and compare her properties with directors in LinkedMDB. In the case of linking musical artists between DBpedia and MusicBrainz<ref type="foot" target="#foot_0">4</ref>, an open music database, we might want to compare properties of the albums of the musicians.</p><p>Silk addresses this requirement by using a simple RDF path selector language for providing parameter values to similarity metrics and transformation functions. A Silk selector language path starts with a variable referring to an RDF resource and may then use one of several operators to navigate the graph surrounding this resource. To simply access a particular property of a resource, the forward operator ( / ) may be used. For example, the path "?artist/rdfs:label" would select the set of label values associated with an artist referred to by the ?artist variable.</p><p>Sometimes, however, we need to navigate backwards along a property edge. For example, musical albums in DBpedia contain a dbpedia:artist property pointing to the album's creator. However, there exists no explicit reverse property like dbpedia:albums for an artist resource. So if a path begins with an artist and we need to select all of her albums, we may use the backward operator ( \ ) to navigate property edges in reverse. Since navigating backwards along the property dbpedia:artist would select all of the artist's works, this may select not only albums, but also songs and single releases. This is addressed by a filter operator ([ ]), which allows selected resources to be restricted to match a certain predicate. In this example, we could use the RDF path "?artist\dbpedia:artist[rdf:type dbpedia:Album]" to select only albums amongst the works of a musical artist in DBpedia. The filter operator also supports comparisons of numeric types as predicates. For example, to select songs of an artist with a runtime greater than 200 seconds, the path "?artist\dbpedia:artist[dbpedia:runtime &gt; 200]" can be used.</p></div>
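<div xmlns="http://www.tei-c.org/ns/1.0"><p>The semantics of the three selector operators can be illustrated with a small Python sketch over an in-memory list of triples. This is only a model of what a path selects: Silk itself evaluates paths against SPARQL endpoints, and the function names forward, backward and keep_if, the ex: resources and the dbpedia:Song class are hypothetical names chosen for this example.</p>
<p><![CDATA[
```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def forward(graph: List[Triple], nodes: Set[str], prop: str) -> Set[str]:
    """Forward operator (/): follow prop from each node to its values."""
    return {o for s, p, o in graph if p == prop and s in nodes}


def backward(graph: List[Triple], nodes: Set[str], prop: str) -> Set[str]:
    """Backward operator: find the resources that point to a node via prop."""
    return {s for s, p, o in graph if p == prop and o in nodes}


def keep_if(graph: List[Triple], nodes: Set[str], prop: str, value: str) -> Set[str]:
    """Filter operator ([ ]): keep nodes that have the given property value."""
    facts = set(graph)
    return {n for n in nodes if (n, prop, value) in facts}


# Hypothetical toy graph; all ex: resources are made up for illustration.
GRAPH: List[Triple] = [
    ("ex:album1", "dbpedia:artist", "ex:artist1"),
    ("ex:album1", "rdf:type", "dbpedia:Album"),
    ("ex:song1", "dbpedia:artist", "ex:artist1"),
    ("ex:song1", "rdf:type", "dbpedia:Song"),
    ("ex:artist1", "rdfs:label", "Example Artist"),
]

artist = {"ex:artist1"}

# Counterpart of the path "?artist/rdfs:label"
labels = forward(GRAPH, artist, "rdfs:label")

# Counterpart of the path "?artist\dbpedia:artist[rdf:type dbpedia:Album]"
works = backward(GRAPH, artist, "dbpedia:artist")
albums = keep_if(GRAPH, works, "rdf:type", "dbpedia:Album")
```
]]></p>
<p>Each operator maps a set of nodes to a new set of nodes, so longer paths are evaluated by chaining the calls from left to right.</p></div>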
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Pre-Matching</head><p>Comparing all pairs of entities of a source dataset S and a target dataset T would result in an unsatisfactory runtime complexity of O(|S|•|T|). Even after using SPARQL restrictions to select suitable subsets of each dataset, the time and network load required to perform all pair comparisons might prove to be impractical in many cases. To avoid this problem, we need a way to quickly find a limited set of target entities that are likely to match a given source entity. Silk supports this by allowing rough index prematching.</p><p>When using prematching, all target resources are indexed by one or more specified property values (most commonly, their labels) before any detailed comparisons are performed. During the subsequent resource comparison phase, the previously generated index is used to look up potential matches for a given source resource. This lookup uses the BM25 <ref type="foot" target="#foot_1">5</ref> weighting scheme for the ranking of search results and additionally supports spelling corrections of individual words of a query. Only a fixed number of target resources found in this lookup are considered as candidates for a detailed comparison. An example of such a prematching configuration that could be applied to our city linking example is presented in Figure <ref type="figure">2</ref>:</p><p>
&lt;PreMatchingDefinition sourcePath="?a/rdfs:label" hitLimit="10"&gt;
  &lt;Index targetPath="?b/gn:name" /&gt;
  &lt;Index targetPath="?b/gn:alternateName" /&gt;
&lt;/PreMatchingDefinition&gt;
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2. Pre-Matching</head><p>This statement instructs Silk to index the cities in the target dataset by both their gn:name and gn:alternateName property values. When performing comparisons, the rdfs:label of a source resource is used as a search term into the generated indexes and only the first ten target hits found in each index are considered as link candidates for detailed comparisons. If we neglect a slight index insertion and search time dependency on the target dataset size, we now achieve a runtime complexity of O(|S| + |T|), making it feasible to interlink even large datasets under practical time constraints. Note however that this prematching may come at the cost of missing some links during discovery, since it is not guaranteed that a prematching lookup will always find all matching target resources.</p></div>
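<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prematching idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the Xapian-based implementation: it ranks candidates by the number of shared label tokens instead of BM25, omits spelling correction, and uses the hypothetical function names build_index and candidates. The hit_limit argument plays the role of the hitLimit attribute.</p>
<p><![CDATA[
```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(targets: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each lowercased label token to the target resources containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for resource, label in targets:
        for token in label.lower().split():
            index[token].add(resource)
    return index


def candidates(index: Dict[str, Set[str]], source_label: str, hit_limit: int = 10) -> List[str]:
    """Return at most hit_limit targets, ranked by number of shared tokens."""
    hits: Counter = Counter()
    for token in source_label.lower().split():
        for resource in index.get(token, ()):
            hits[resource] += 1
    # Highest hit count first; ties are broken by resource name for determinism.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [resource for resource, _ in ranked[:hit_limit]]


# Hypothetical target resources with their labels.
TARGETS = [("gn:1", "New York"), ("gn:2", "York"), ("gn:3", "Berlin")]
INDEX = build_index(TARGETS)
shortlist = candidates(INDEX, "New York", hit_limit=10)
```
]]></p>
<p>Only the resources on the shortlist would then go through the detailed link condition, so each source resource triggers a bounded number of comparisons regardless of the target dataset size. As in Silk, a target whose label shares no token with the source label is never considered, which is how prematching can miss links.</p></div>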
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS</head><p>During the implementation of Silk, we experimented with linking DBpedia to several other public Linked Data sources. Movies in DBpedia were linked both to their movie counterparts and to their directors in LinkedMDB<ref type="foot" target="#foot_2">6</ref>. Between GeoNames and DBpedia, we created links between cities, as shown in the Silk-LSL example above. Finally, clinical drugs from DrugBank<ref type="foot" target="#foot_3">7</ref> were linked with their counterparts in DBpedia. The following section gives a short overview of the employed similarity heuristics as well as the numbers of discovered links.</p><p>For interlinking movies between DBpedia and LinkedMDB, we used Jaro string similarity to match movie titles and director names, date similarity for comparing release dates and numeric similarity for runtimes. We used the Thresholds directive &lt;Thresholds accept="0.9" verify="0.7" /&gt; to accept links with a similarity above 0.9 and to mark those with a similarity between 0.7 and 0.9 for verification by an expert. The number of movies in the datasets and the numbers of discovered links are shown in Table <ref type="table" target="#tab_1">2</ref>.</p><p>Interlinking DBpedia movies to their directors in LinkedMDB is an example of creating links other than owl:sameAs links, for which we simply used a Jaro string similarity metric to compare a movie's director name to the label of a director in LinkedMDB. Dataset statistics and linking results for this example are given in Table <ref type="table" target="#tab_2">3</ref>.</p><p>For linking cities in DBpedia and GeoNames, we used Jaro similarity between city names, URI equality for links to Wikipedia articles as well as numeric similarity for the population counts and geographic coordinates. The results for this use case are shown in Table <ref type="table" target="#tab_3">4</ref>.</p><p>Finally, for generating links between clinical drugs in DrugBank and DBpedia, we compared drug labels via the Jaro-Winkler similarity, PubChem<ref type="foot" target="#foot_4">8</ref> identifiers via string equality and used numeric similarity for comparing the drugs' molecular weights. Table <ref type="table" target="#tab_4">5</ref> shows the results for this case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">SILK IMPLEMENTATION</head><p>Silk is written in Python and is run as a batch process on the command line. The framework may be downloaded from Google Code<ref type="foot" target="#foot_5">9</ref> under the terms of the BSD license. For calculating string similarities, a library from Febrl<ref type="foot" target="#foot_6">10</ref>, the Freely Extensible Biomedical Record Linkage toolkit, is used, while Silk's prematching features are implemented with the search engine library Xapian<ref type="foot" target="#foot_7">11</ref>. The Silk system architecture is illustrated in Figure <ref type="figure" target="#fig_1">3</ref>.</p><p>If a metric aggregation for a pair of resources results in a value above the specified linking thresholds, a candidate link is saved in memory. After completing all comparisons for a link specification, a link limit may be applied to limit the maximum number of outgoing links from a single resource. Only the specified number of highest-rated links are kept; lower-valued links are discarded. The remaining links are written to the output file in the format specified by the user (Turtle, CSV, or a reified format that includes meta-information such as confidence score and creation date).</p></div>
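<div xmlns="http://www.tei-c.org/ns/1.0"><p>The link limit step can be sketched as a simple top-k selection per source resource. The Python below is a hedged illustration, not the code shipped with Silk: the function name apply_link_limit and the tuple layout of a link are assumptions, and max_links corresponds to the max attribute of the &lt;Limit&gt; clause.</p>
<p><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Tuple

Link = Tuple[str, str, float]  # (source, target, similarity score)


def apply_link_limit(links: List[Link], max_links: int = 1) -> List[Link]:
    """Keep only the max_links highest-rated links per source resource."""
    by_source: Dict[str, List[Link]] = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)

    kept: List[Link] = []
    for source in sorted(by_source):
        # Sort this source's candidate links by score, best first.
        best = sorted(by_source[source], key=lambda link: link[2], reverse=True)
        kept.extend(best[:max_links])
    return kept


# Hypothetical candidate links held in memory after all comparisons.
CANDIDATE_LINKS: List[Link] = [
    ("dbpedia:a", "gn:x", 0.92),
    ("dbpedia:a", "gn:y", 0.95),
    ("dbpedia:b", "gn:z", 0.91),
]
limited = apply_link_limit(CANDIDATE_LINKS, max_links=1)
```
]]></p>
<p>With max_links set to 1, as in the cities example, each source resource retains only its single best-scoring target.</p></div>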
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">RELATED WORK</head><p>There is a large body of related work on record linkage <ref type="bibr" target="#b5">[5]</ref> and duplicate detection <ref type="bibr" target="#b4">[4]</ref> within the database community as well as on ontology matching <ref type="bibr" target="#b6">[6]</ref> in the knowledge representation community. Silk builds on this work by implementing similarity metrics and aggregation functions that proved successful within other scenarios. What distinguishes Silk from this work is its focus on the Linked Data scenario where different types of semantic links should be discovered between Web data sources that often mix terms from different vocabularies and where no consistent RDFS or OWL schemata spanning the data sources exist.</p><p>Related work that also focuses on Linked Data includes Raimond et al. <ref type="bibr">[7]</ref>, who propose a link discovery algorithm that takes into account both the similarities of web resources and of their neighbors. The algorithm is implemented within the GNAT tool and has been evaluated for interlinking music-related data sets. In <ref type="bibr" target="#b8">[8]</ref>, Hassanzadeh et al. describe a framework for the discovery of semantic links over relational data which also introduces a declarative language, LinQL, for specifying link conditions. The main differences between LinQL and Silk-LSL are the underlying data model and Silk's ability to combine metrics more flexibly through aggregation functions. A framework that deals with instance coreferencing as part of the larger process of fusing Web data is the KnoFuss Architecture proposed in <ref type="bibr" target="#b9">[9]</ref>. In contrast to Silk, KnoFuss assumes that instance data is represented according to consistent OWL ontologies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSIONS</head><p>We presented the Silk framework, a flexible tool for discovering links between entities within different Web data sources. We introduced the Silk-LSL link specification language and demonstrated its applicability within different link discovery scenarios.</p><p>The value of the Web of Data rises and falls with the amount and the quality of links between data sources. We hope that Silk and other similar tools will help to strengthen the linkage between data sources and therefore contribute to the overall utility of the network.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Example: Interlinking cities in DBpedia and GeoNames</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Silk System Architecture. Before executing any comparisons, Silk retrieves the source and target resource lists. The list of source resources is retrieved directly through a resource lister which queries the respective SPARQL endpoint and caches the list on disk for reuse in a later run of Silk. Target resources are first indexed by means of a resource indexer, making them searchable by specific properties or RDF Path evaluations. During comparison processing, a list of target resource candidates for each source resource is looked up in this index, limiting detailed comparisons to index search hits. This prematching of resources is optional, but recommended as it drastically reduces run time and network load. During each detailed resource pair comparison, the user-specified metric aggregation tree is evaluated. Function or metric parameters passed as RDF Path values are transformed to SPARQL queries by an RDF Path translator and sent to the respective SPARQL endpoint for evaluation. Query results are cached in memory during Silk runtime.</figDesc><graphic coords="5,309.12,130.02,233.22,161.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 . Available similarity metrics in Silk</head><label>1</label><figDesc></figDesc><table><row><cell>Metric</cell><cell>Description</cell></row><row><cell>jaroSimilarity</cell><cell>String similarity based on Jaro distance metric</cell></row><row><cell>jaroWinklerSimilarity</cell><cell>String similarity based on Jaro-Winkler metric</cell></row><row><cell>qGramSimilarity</cell><cell>String similarity based on q-grams</cell></row><row><cell>stringEquality</cell><cell>Returns 1 when strings are equal, 0 otherwise</cell></row><row><cell>numSimilarity</cell><cell>Percentual numeric similarity</cell></row><row><cell>dateSimilarity</cell><cell>Similarity between two date values</cell></row><row><cell>uriEquality</cell><cell>Returns 1 if two URIs are equal, 0 otherwise</cell></row><row><cell>taxonomicSimilarity</cell><cell>Metric based on the taxonomic distance of two concepts</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 . Linking movies between DBpedia and LinkedMDB</head><label>2</label><figDesc></figDesc><table><row><cell>Number of movies in DBpedia</cell><cell>34,685</cell></row><row><cell>Number of movies in LinkedMDB</cell><cell>38,064</cell></row><row><cell>Links above accept threshold</cell><cell>26,059</cell></row><row><cell>Links above verify threshold</cell><cell>1,858</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 . Linking DBpedia movies to directors in LinkedMDB</head><label>3</label><figDesc></figDesc><table><row><cell>Number of movies in DBpedia</cell><cell>34,685</cell></row><row><cell>Number of directors in LinkedMDB</cell><cell>8,367</cell></row><row><cell>Links above accept threshold</cell><cell>1,693</cell></row><row><cell>Links above verify threshold</cell><cell>374</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 . Linking cities between DBpedia and GeoNames</head><label>4</label><figDesc></figDesc><table><row><cell>Number of cities in DBpedia</cell><cell>40,197</cell></row><row><cell>Number of populated places in GeoNames</cell><cell>2,410,855</cell></row><row><cell>Links above accept threshold</cell><cell>35,031</cell></row><row><cell>Links above verify threshold</cell><cell>9,147</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 . Linking drugs between DBpedia and DrugBank</head><label>5</label><figDesc></figDesc><table><row><cell>Number of drugs in DBpedia</cell><cell>3,134</cell></row><row><cell>Number of drugs in DrugBank</cell><cell>4,772</cell></row><row><cell>Links above accept threshold</cell><cell>1,202</cell></row><row><cell>Links above verify threshold</cell><cell>245</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The metric compositions, weightings and thresholds in these examples were chosen based on what seemed to produce reasonably valid results in our tests. However, a detailed analysis of the quality of the generated links has not yet been performed. When using Silk in a practical scenario, it is advisable to evaluate the accuracy and completeness of generated links more closely while adjusting the linking specification accordingly.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://musicbrainz.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">http://xapian.org/docs/bm25.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">http://www.linkedmdb.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">http://www4.wiwiss.fu-berlin.de/drugbank/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">http://pubchem.ncbi.nlm.nih.gov</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">http://silk.googlecode.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6">http://sourceforge.net/projects/febrl</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_7">http://xapian.org</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The complete Silk-LSL language specification and further Silk usage examples are found on the Silk project website at http://www4.wiwiss.fu-berlin.de/bizer/silk/.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<ptr target="http://www.w3.org/DesignIssues/LinkedData.html" />
		<title level="m">Linked Data - Design Issues</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</author>
		<ptr target="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/" />
		<title level="m">How to publish Linked Data on the Web</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Conceptual Graph Matching for Semantic Search</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 2002 International Conference on Computational Science (ICCS2002)</title>
				<meeting><address><addrLine>Amsterdam</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002-04">April 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Duplicate record detection: A survey</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Ipeirotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Verykios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Overview of Record Linkage and Current Research Directions</title>
		<author>
			<persName><forename type="first">W</forename><surname>Winkler</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
		<respStmt>
			<orgName>Bureau of the Census</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Ontology Matching</title>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shvaiko</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Springer</publisher>
			<pubPlace>Heidelberg</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatic Interlinking of Music Datasets on the Semantic Web</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Raimond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sandler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linked Data on the Web Workshop (LDOW2008)</title>
				<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Declarative Framework for Semantic Link Discovery over Relational Data</title>
		<author>
			<persName><forename type="first">O</forename><surname>Hassanzadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Poster at 18th World Wide Web Conference (WWW2009)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Integration of Semantically Annotated Data by the KnoFuss Architecture</title>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">16th International Conference on Knowledge Engineering and Knowledge Management</title>
				<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="265" to="274" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
