<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SLINT: A Schema-Independent Linked Data Interlinking System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khai Nguyen</string-name>
          <email>nhkhai@fit.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichise@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bac Le</string-name>
          <email>lhbac@fit.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>Ho Chi Minh</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Linked data interlinking is the discovery of all instances that represent the same real-world object but are located in different data sources. Since different data publishers frequently use different schemas for storing resources, we aim at developing a schema-independent interlinking system. Our system automatically selects important predicates and useful predicate alignments, which are used as the key for blocking and instance matching. The key distinction of our system is the use of weighted co-occurrence and adaptive filtering in blocking and instance matching. Experimental results show that the system greatly improves precision and recall over some recent systems. The performance of the system and the efficiency of its main steps are also discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>schema-independent</kwd>
        <kwd>blocking</kwd>
        <kwd>interlinking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Years of effort in building linked data have brought a huge amount of data into the
LOD cloud. However, maximizing the efficiency of linked data in the development of
the semantic web still faces many difficulties. One of the current challenges is to
integrate the individual data sources into a common knowledge system.
Since different data sources may contain heterogeneous instances that co-refer
to the same real-world objects, the data integration process requires the detection of
such objects to ensure the integrity and consistency of the data. Detecting all
identities between data sources is the mission of data interlinking. Data interlinking
consists of two main steps, blocking and instance matching. While blocking aims
at pruning the number of comparisons, instance matching determines the
matching status of two interested instances.</p>
      <p>
Current interlinking methods can be categorized into two main groups:
schema-dependent [
        <xref ref-type="bibr" rid="ref10 ref2 ref7">2,7,10</xref>
        ] and schema-independent [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref9">1,3,4,9</xref>
]. The former requires
knowledge about the meaning of RDF predicates (e.g., the predicate #prefLabel declares
the label of an object) and the predicate alignments (e.g., predicate #prefLabel
matches predicate #name). In contrast, the latter does not need this
information and therefore does not rely on human knowledge about the schema.
Because a linked data instance is a set of RDF triples (subject, predicate,
object), the schema of a data source refers to the list of all used predicates,
which are closely related to vocabularies and ontologies. The schemas are usually
different for each data source, and even differ between domains within the
same data source. Clearly, schema-independent methods are more applicable, since they can
work on every kind of source or domain without any human instruction.
Besides, manual specifications of interlinking rules frequently ignore hidden
useful predicate alignments.
      </p>
<p>We present the SLINT system, which uses a new approach for schema-independent
linked data interlinking. SLINT automatically selects important RDF predicates
using coverage and discriminability. The selected predicates are combined to
construct the predicate alignments in conciliation of data types. We estimate the
confidence of predicate alignments to collect the most appropriate alignments
for blocking and interlinking. In this way, the collective information of instances
is frequently leveraged. Blocking is therefore more complete, compact, and
supportive for interlinking. Also, we apply adaptive filtering techniques for blocking
and instance matching. In our experiments, we compare SLINT with three systems
that participated in the OAEI 2011 instance matching campaign, and report high
improvements in both precision and recall. Experiments on the performance of
SLINT and the efficiency of the blocking step are also reported.</p>
<p>The paper is organized as follows: the next section gives an overview of previous
work. Section 3 describes the details of the SLINT system. Section 4 reports our
experimental evaluation. Section 5 closes the paper with conclusions and outlook.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
Data interlinking is an early studied area; however, this problem in linked data
has only recently received attention. Silk [
        <xref ref-type="bibr" rid="ref10">10</xref>
], a well-known framework, provides a
declarative interface for users to define the predicate alignments as well as the
similarity metrics for matching. Silk was used as a main component of LDIF [
        <xref ref-type="bibr" rid="ref8">8</xref>
],
a framework for integrating multiple linked data sources. Recently, Isele and Bizer
have improved Silk by applying automatic linkage rule generation using a
genetic algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. The work is notable for its modeling of an
appropriate fitness function and of specific transformations for genetic programming in the
context of interlinking. This work makes Silk schema-independent. With
a similar objective, RAVEN [
        <xref ref-type="bibr" rid="ref4">4</xref>
] minimizes human curation effort using active
learning, while Nikolov et al. also use a genetic algorithm, with an unsupervised
learning process as the research target [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. Also with a schema-independent goal,
Nguyen et al. suggest using a decision tree classifier for determining the matching
status of two instances [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Zhishi.Links [
        <xref ref-type="bibr" rid="ref7">7</xref>
] is one of the current state-of-the-art matchers. This system
adopts the idea of Silk's pre-matching step, using labels of objects, such as
skos:prefLabel or scheme:label, to group similar instances. Afterward, a more
complex semantic similarity is utilized for matching. This system ranked first at
the OAEI 2011 instance matching campaign, while the second best was SERIMI [
        <xref ref-type="bibr" rid="ref1">1</xref>
], a
schema-independent system. SERIMI selects RDF predicates and predicate alignments
using entropy and RDF object similarity, respectively. AgreementMaker [
        <xref ref-type="bibr" rid="ref2">2</xref>
] is an
ontology matching and instance matching system. It first generates a candidate
set by comparing the labels of instances. These candidates are then divided into
smaller subsets, in which every pair is matched to produce the final alignments.
      </p>
      <p>
Most previous interlinking systems do not deeply investigate the blocking
step, which generates potentially identical pairs of instances. Song and Heflin focus
on a blocking scheme for linked data interlinking in parallel independent work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
It is a very interesting idea: the authors propose unsupervised learning
for maximizing the usefulness of blocking keys, which are combinations of
RDF predicates. The authors conduct experiments on some large datasets, which
also proves the scalability of the approach.
      </p>
<p>In general, the schema-dependent approaches compare two instances by
specified properties. That is, they can detect almost all correct identity pairs, but the
precision may be low on highly ambiguous data sources. The reason is that some useful
information can be ignored, since a manual predicate alignment is frequently
not an optimal solution. In contrast, schema-independent approaches reconcile
precision and recall through elaborate analysis of the data. Although these
approaches need to collect predicate alignments, the matching is more effective
when collective information is frequently used. Comparing SLINT with previous
interlinking systems, the prominent differences are the predicate selection, the
predicate alignment, and the adaptive filtering for blocking and interlinking. In the next
section, we describe these elements as the details of SLINT.</p>
    </sec>
    <sec id="sec-3">
      <title>Schema-independent linked data interlinking system</title>
<p>This section describes the SLINT system. The overview of the interlinking
process for source data Ds and target data Dt is shown in Figure 1. In this figure,
the small circles and triangles respectively stand for instances and their RDF
predicates. The referred circles of each step are the output of that step. The
SLINT system consists of four steps. The interlinking process begins with
predicate selection, which collects the important predicates from all predicates of
each data source. In the second step, predicate alignment, selected predicates
are combined in accordance with their data types to construct the raw predicate
alignments. We estimate the confidence of every raw alignment to measure its
appropriateness. A raw alignment is called a key alignment if its confidence
satisfies a filtering condition. These key alignments provide much useful
information in the blocking and instance matching steps. Next, the blocking step is designed to
reduce the number of comparisons by producing identity candidates of instances.
The instance matching step afterward only needs to verify the retrieved candidates
to discover the identity pairs. The following subsections give the details of each step.</p>
      <sec id="sec-3-1">
        <title>3.1 Predicate selection</title>
<p>The mission of this step is to find the important predicates in the schema,
which consists of all predicates appearing in the interested data source. We use two
criteria for determining the importance level of a predicate p: coverage cov(p, D)
and discriminability dis(p, D). Eq. 1 and Eq. 2 define these criteria for a
predicate p of a data source D.</p>
<p>cov(p, D) = |{x | ∃&lt;s, p, o&gt; ∈ x, x ∈ D}| / |D|: (1)</p>
<p>dis(p, D) = |{o | ∃x ∈ D, &lt;s, p, o&gt; ∈ x}| / |{&lt;s, p, o&gt; | ∃x ∈ D, &lt;s, p, o&gt; ∈ x}|: (2)</p>
        <p>
In these equations, x represents an instance and is a set of RDF triples
&lt;s, p, o&gt; (subject, predicate, object). D is the interested data source and is a
set of instances. From each input source, we collect the predicates having high
coverage and discriminability scores. A predicate p is selected if it satisfies the
condition in Eq. 3, which inherits the idea of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
<p>(cov(p, D) ≥ α) ∧ (dis(p, D) ≥ β) ∧ (HMean(cov(p, D), dis(p, D)) ≥ γ): (3)</p>
<p>The thresholds α and β imply the minimum standard of an important predicate, whereas
γ, the condition on the harmonic mean of dis(p, D) and cov(p, D), is the main
requirement. Therefore, we set small values for α and β and a larger value for γ.</p>
        <p>
Song and Heflin focus on learning blocking keys by iteratively maximizing the
coverage and discriminability of the set of predicates [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. In our system, we use
the same discriminability function as theirs and a slightly different function for
coverage. In the numerator of Eq. 1, while they use the number of RDF subjects,
we use the number of instances, because we aim at finding the frequency of a
predicate over instances, not over RDF subjects.
        </p>
<p>Important predicates are expected to be used for declaring the common
properties and distinct information of objects. Since coverage and discriminability
respectively express the former and the latter, their combination is
appropriate for the objective of predicate selection. If a predicate has a high
coverage but a low discriminability, or vice versa, it is not important. An
example of this kind of predicate is rdf:type. This predicate is frequently used,
but it usually describes a limited range of RDF objects when observing the
instances in the same domain.</p>
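As a concrete illustration, the criteria of Eqs. 1-3 can be sketched in Python. The triple representation (instances as lists of subject-predicate-object tuples) and the threshold names alpha, beta, gamma are illustrative assumptions, not part of SLINT itself.

```python
from statistics import harmonic_mean

def coverage(predicate, instances):
    """Eq. 1: fraction of instances using the predicate at least once."""
    used = sum(1 for x in instances if any(p == predicate for _, p, _ in x))
    return used / len(instances)

def discriminability(predicate, instances):
    """Eq. 2: distinct objects over total triples of the predicate."""
    objects = [o for x in instances for _, p, o in x if p == predicate]
    return len(set(objects)) / len(objects) if objects else 0.0

def select_predicates(instances, alpha=0.25, beta=0.25, gamma=0.5):
    """Eq. 3: keep predicates passing both minima and the harmonic-mean bound."""
    predicates = {p for x in instances for _, p, _ in x}
    selected = []
    for p in predicates:
        cov, dis = coverage(p, instances), discriminability(p, instances)
        if cov >= alpha and dis >= beta and harmonic_mean([cov, dis]) >= gamma:
            selected.append(p)
    return selected
```

On a toy source where every instance has a distinct name but a shared type, the name predicate scores high on both criteria while the type predicate has low discriminability, matching the rdf:type discussion above.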
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Predicate alignment</title>
<p>In this step, we find the appropriate alignments of predicates between the source
data and the target data. An alignment of two predicates is considered
appropriate if the interested predicates describe similar properties of instances.
From the selected predicates of the source and target data, we connect every
type-matched pair and select the alignments whose confidence is higher than a threshold
δ. Selected predicate alignments are called key alignments. The confidence of an
alignment is the Dice coefficient between the representatives of the RDF objects
described by its forming predicates. Eq. 4 defines the confidence conf(ps, pt)
for the alignment between predicate ps in source data Ds and predicate pt in
target data Dt.</p>
<p>conf(ps, pt) = 2 · |R(Os) ∩ R(Ot)| / (|R(Os)| + |R(Ot)|); Ok = {o | ∃x ∈ Dk, &lt;s, pk, o&gt; ∈ x}: (4)</p>
<p>In the above equation, R is the function that returns the representative elements
of RDF objects. The return values of R depend on the type of the predicates. We
divide the predicates into five different types: string, URI, decimal, integer, and
date. This separation is based on the variety of data types in the real world and
covers most current types of linked data. For strings, we extract the word tokens
of the RDF objects. For URIs, we omit the domain part and proceed as
for strings, with the assumption that the slash '/' is the token separator. For decimals,
we take the values rounded to two decimal digits. For integers and dates, we do not
transform the RDF objects and use the original values. To determine the type of
a predicate, we use the major type of the RDF objects declared by this predicate.
For example, if 51% of the appearances of p describe decimal values, the
data type of p will be decimal. Currently, we detect the type of RDF objects
without considering differences in their metrics (e.g., the units of
time, distance, or area).</p>
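The per-type representative extraction just described might be sketched as follows; the function name and the plain-string inputs are illustrative assumptions, and real RDF objects would need type detection first.

```python
from urllib.parse import urlparse

def representatives(objects, dtype):
    """Illustrative representative extraction R for the five data types."""
    reps = set()
    for o in objects:
        if dtype == "string":
            reps.update(o.lower().split())      # word tokens
        elif dtype == "uri":
            path = urlparse(o).path             # drop the domain part
            reps.update(t for t in path.split("/") if t)  # '/' as separator
        elif dtype == "decimal":
            reps.add(round(float(o), 2))        # 2-decimal rounding
        else:                                   # integer and date: as-is
            reps.add(o)
    return reps
```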
<p>The confidence of an alignment represents the similarity of RDF objects
between two data sources. Predicates having the same meaning frequently
describe similar information. Therefore, alignments of matched predicates
usually have higher confidence than the others. It means that a predicate
satisfying the requirements of an important predicate is verified again, by considering
the confidence of all alignments in which it appears. For example, the predicate
rdfs:comment may qualify as important, but the confidences of its
alignments are usually low because the denominator of Eq. 4 is very high in this case.</p>
<p>A common limitation of almost all previous systems is the use of string
measurement for every type of RDF object. Clearly, this approach is not sufficient to
cover the meaning of RDF objects and thus does not estimate the similarity of
non-string values well. We discriminate data types not only in combining predicates,
but also in blocking and instance matching.</p>
<p>It is not easy for a person to detect all useful predicate alignments; this step
is therefore very meaningful, and together with predicate selection, it
tackles the schema-independent goal. The next steps make use of the selected key
alignments and their forming predicates.</p>
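The alignment step can be sketched as below, assuming predicate types and representative sets have already been computed; the dict-based inputs and the threshold name delta are illustrative assumptions.

```python
def dice_confidence(reps_source, reps_target):
    """Dice coefficient between two representative sets (Eq. 4)."""
    if not reps_source and not reps_target:
        return 0.0
    inter = len(reps_source & reps_target)
    return 2 * inter / (len(reps_source) + len(reps_target))

def key_alignments(preds_s, preds_t, reps_of, delta=0.25):
    """Pair type-matched predicates and keep alignments above delta.
    preds_* map predicate -> data type; reps_of maps predicate -> its
    representative set R(O)."""
    aligned = []
    for ps, type_s in preds_s.items():
        for pt, type_t in preds_t.items():
            if type_s != type_t:
                continue  # only type-matched pairs are combined
            conf = dice_confidence(reps_of[ps], reps_of[pt])
            if conf > delta:
                aligned.append((ps, pt, conf))
    return aligned
```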
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Blocking</title>
<p>As we introduced, blocking aims at retrieving the candidates for the instance
matching step by grouping similar instances into the same block. A candidate
is a pair of two instances, one belonging to the source data and one to the target
data. Blocking can be divided into three sub-phases. The first phase indexes
every instance in each data source by the values extracted from its RDF objects.
The second phase traverses the index table and builds a weighted co-occurrence
matrix. The final phase uses this matrix as input when it applies
a filtering technique to select candidates. Algorithm 1 is the pseudo-code of the
whole blocking process. In this algorithm, Prs and Prt represent the lists of
predicates that form the key alignments, where Prk belongs to Dk. H, M, C, and
Rp represent the inverted-index table, the weighted co-occurrence matrix, the candidate
set, and the representative extraction method, respectively.</p>
        <p>Algorithm 1: Generating candidates set</p>
        <sec id="sec-3-3-1">
          <title>Input: Ds, Dt, Prs, Prt, ε, ζ</title>
          <p>Output: Candidate set C
1  H ← ∅
2  M[|Ds|, |Dt|] ← {0}
3  C ← ∅
4  foreach &lt;D, P&gt; ∈ {&lt;Ds, Prs&gt;, &lt;Dt, Prt&gt;} do
5    foreach x ∈ D do
6      foreach pi ∈ P do
7        sumConf ← Σ pj ∈ {Prs, Prt} \ P : conf(pi, pj)
8        foreach r ∈ Rp(O), O = {o | &lt;s, pi, o&gt; ∈ x} do
9          if not H.ContainsKey(r.Label) then
10           H.AddKey(r.Label)
11         H.AddValue(r.Label, D, &lt;x, r.Value · sumConf&gt;)
12 foreach key ∈ H.AllKeys() do
13   foreach &lt;xs, vs&gt; ∈ H.GetValues(key, Ds) do
14     foreach &lt;xt, vt&gt; ∈ H.GetValues(key, Dt) do
15       M[xs, xt] ← M[xs, xt] + vs · vt</p>
<p>Lines 4-11 perform the inverted indexing, a well-known indexing technique.
By traversing each data source once, we extract the representatives of the RDF
objects and use them as the keys of the inverted-index table. An element r in the
representative set of RDF objects consists of two fields: the label r.Label and the
value r.Value. While r.Label is the return value of the representative extraction
method R as in the predicate alignment step, r.Value is computed in accordance
with the data type of predicate pi. If pi is string or URI, we set the value to the
TF-IDF score of the token. If pi is decimal, integer, or date, we assign the
value a fixed number, which is 1.</p>
<p>After constructing the inverted-index table, we compute the weighted co-occurrence
matrix M as in lines 12-15, by accumulating the value of each matrix element.</p>
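Lines 12-15 of Algorithm 1 amount to a sparse accumulation over the inverted index. A minimal Python sketch, with plain dicts standing in for the per-source postings of H and for the matrix M:

```python
from collections import defaultdict

def cooccurrence_matrix(index_s, index_t):
    """Accumulate weighted co-occurrence for instance pairs sharing a key.
    index_s / index_t map a representative label to a list of
    (instance, weight) pairs, where weight is the r.Value of the
    representative times the summed alignment confidence."""
    M = defaultdict(float)
    for key, source_postings in index_s.items():
        for xs, vs in source_postings:
            # only keys present in both sources contribute
            for xt, vt in index_t.get(key, []):
                M[(xs, xt)] += vs * vt
    return M
```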
<p>Lines 16-22 perform the adaptive filtering. An instance pair
&lt;xs, xt&gt; is considered a candidate if its weighted co-occurrence value
M[xs, xt], divided by the harmonic mean of the maximum weighted co-occurrences
of xs and xt, satisfies the threshold ζ. In addition, we use ε, a small
threshold, to avoid the surjection assumption. The identities frequently have high
weighted co-occurrences; however, these values vary across different pairs.
Choosing a fixed threshold for selecting candidates is not good in this situation
and is a tedious task. Therefore, we use the ratio of M[xs, xt] to the maxima,
which is a data-driven quantity and expresses the adaptive filtering idea.</p>
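The adaptive filtering can be sketched as follows; zeta and eps are illustrative names for the ratio threshold and the small absolute floor of Algorithm 1, and all accumulated weights are assumed positive.

```python
def select_candidates(M, zeta=0.5, eps=0.1):
    """Adaptive filtering: keep pairs whose weight, scaled by the harmonic
    mean of the per-instance maxima, passes zeta; eps is the small absolute
    floor that avoids the surjection assumption."""
    max_s, max_t = {}, {}
    for (xs, xt), v in M.items():
        max_s[xs] = max(max_s.get(xs, 0.0), v)
        max_t[xt] = max(max_t.get(xt, 0.0), v)
    candidates = []
    for (xs, xt), v in M.items():
        hmean = 2 * max_s[xs] * max_t[xt] / (max_s[xs] + max_t[xt])
        if v >= eps and v / hmean >= zeta:
            candidates.append((xs, xt))
    return candidates
```

Because the cut-off scales with each instance's own best score, strong pairs survive even when absolute co-occurrence values differ widely between pairs.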
          <p>
Blocking is very important because it reduces the number of comparisons in
instance matching. However, it seems not to have received sufficient attention, as
most previous systems use quite simple methods for blocking. In comparison
with the blocking step in previous interlinking systems, the key differences of our
method are the weighted co-occurrence matrix and the adaptive filtering. While
previous systems compare pairs of RDF objects, we aggregate the products
of the weights of their matched representatives. For candidate selection, Silk [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]
and Zhishi.Links [
            <xref ref-type="bibr" rid="ref7">7</xref>
] use a top-k strategy, which selects k candidates for each
instance. This approach is very good for controlling the number of candidates, but
determining the value of k is not easy. Song and Heflin use thresholding selection
[
            <xref ref-type="bibr" rid="ref9">9</xref>
], which is similar to that of SERIMI [
            <xref ref-type="bibr" rid="ref1">1</xref>
]. Our method also uses a thresholding
approach through the availability of ε. However, the key idea of our selection method
is the adaptive filtering, because the impact of ε is not high. Frequently, there
are many non-identity pairs between two data sources; ε is therefore usually
configured with a low value.
          </p>
<p>Next, we input the set of candidates C and the key alignments A into the
final step, instance matching.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Instance matching</title>
<p>The instance matching verifies the selected candidates to determine their identity
status. We compute the matching score for every candidate and choose the ones
with a high score as the identity pairs. For each element of A, we compute the
similarity of the RDF objects declared by the involved predicates of the interested
key alignment. The final score of two instances is the weighted average of
all these similarities, where the weights are the confidences of the key alignments. Eq. 5
is the computation of the matching score between instances xs ∈ Ds and xt ∈ Dt.</p>
        <p>score(xs, xt) = (1/W) · Σ &lt;ps,pt&gt; ∈ A : conf(ps, pt) · sim(R(Os), R(Ot)): (5)</p>
        <sec id="sec-3-4-1">
          <title>where W = Σ &lt;ps,pt&gt; ∈ A : conf(ps, pt); Ok = {o | ∃x ∈ Dk, &lt;s, pk, o&gt; ∈ x}</title>
          <p>In this equation, R stands for the representative extraction methods, which are
similar to those in the predicate alignment step.</p>
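A sketch of the weighted average of Eq. 5; the alignment list layout and the pluggable sim function are illustrative assumptions.

```python
def matching_score(alignments, sim):
    """Weighted average of per-alignment similarities (Eq. 5).
    `alignments` is a list of (confidence, reps_source, reps_target);
    `sim` compares two representative sets and returns a value in [0, 1].
    The confidences serve as weights, normalized by their sum W."""
    W = sum(conf for conf, _, _ in alignments)
    if W == 0:
        return 0.0
    return sum(conf * sim(rs, rt) for conf, rs, rt in alignments) / W
```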
<p>Categorizing five data types, we implement three different versions of the sim
function in accordance with the type of the predicates. For decimal and integer, we
take the variance of the values to remedy slight differences in data
representations. For date, the sim function yields 1 or 0 when the values are equal or not,
respectively. A date is usually important if it is a property of the object (e.g., a
birthday, a decease date, the release date of a movie). Therefore, exact matching is
an appropriate choice for comparing dates. For string and URI, we compute
the TF-IDF modified cosine similarity, as given in Eq. 6. TF-IDF is used because of
its advantage in disambiguating instances that share common tokens. TF-IDF
also minimizes the weight of stop-words, which are usually useless.</p>
          <p>sim(Qs, Qt) = Σ q ∈ Qs ∩ Qt : TFIDF(q, Qs) · TFIDF(q, Qt) / (√(Σ q ∈ Qs : TFIDF²(q, Qs)) · √(Σ q ∈ Qt : TFIDF²(q, Qt))): (6)</p>
          <p>As in the blocking step, we do not use a fixed single threshold for filtering the
candidates. Two instances are considered an identity pair if their score is at least
a fraction η of the maximum score of the candidates in which either of the instances
appears. The final identity set I is formalized in Eq. 7.</p>
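Eq. 6 is a standard TF-IDF weighted cosine; a sketch assuming the token weights have already been computed into dicts mapping token to TF-IDF score:

```python
import math

def tfidf_cosine(ws, wt):
    """TF-IDF weighted cosine similarity (Eq. 6) between two token sets,
    given as dicts of token -> TF-IDF weight."""
    shared = set(ws) & set(wt)
    num = sum(ws[q] * wt[q] for q in shared)
    den = (math.sqrt(sum(v * v for v in ws.values()))
           * math.sqrt(sum(v * v for v in wt.values())))
    return num / den if den else 0.0
```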
<p>I = {&lt;xs, xt&gt; | score(xs, xt) ≥ θ ∧ score(xs, xt) ≥ η · max {score(xm, xn) | &lt;xm, xn&gt; ∈ C, xm = xs ∨ xn = xt}}: (7)</p>
          <p>An identity pair is expected to be the highest correlated candidate of each
instance. However, this usually does not hold, because of the ambiguity of instances. A
thresholding method that relies on the highest score is better in this
situation. Since a true identity pair and its impostors have similar scores, η is
assigned a quite large value. On the other hand, θ is additionally configured
as the minimum requirement of an identity. As in Algorithm 1, θ ensures
there is no assumption about the surjectivity of the given data sources.</p>
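The selection of Eq. 7 can be sketched as below; theta and eta are illustrative names for the minimum-score and fraction-of-maximum thresholds, and source and target identifiers are assumed to be disjoint.

```python
def select_identities(scores, theta=0.25, eta=0.95):
    """Eq. 7: a candidate is an identity if its score passes the absolute
    floor theta and is at least eta times the best score of any candidate
    sharing either of its instances. `scores` maps (xs, xt) -> matching
    score; source and target ids must not collide."""
    best = {}
    for (xs, xt), s in scores.items():
        best[xs] = max(best.get(xs, 0.0), s)
        best[xt] = max(best.get(xt, 0.0), s)
    return [(xs, xt) for (xs, xt), s in scores.items()
            if s >= theta and s >= eta * max(best[xs], best[xt])]
```

Note that a single instance may keep several identities when their scores are within a factor eta of each other, which is exactly how the surjection assumption is avoided.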
          <p>
The key distinctions of our approach in comparison with previous ones are the
use of the weighted average and the adaptive filtering. Previous systems do not have
data-driven information like the confidence of key alignments. Silk [
            <xref ref-type="bibr" rid="ref10">10</xref>
] provides a
manual weighting method; however, a good weight usually depends much on
human knowledge about the data. For identity selection, Zhishi.Links [
            <xref ref-type="bibr" rid="ref7">7</xref>
] and
AgreementMaker [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] eventually select the best correlated candidates, while Silk
[
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] and SERIMI [
            <xref ref-type="bibr" rid="ref1">1</xref>
] use threshold-based selection. We compare the interlinking
results of our system with those of Zhishi.Links, SERIMI, and AgreementMaker
in our experiments, which are reported in the next section.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <sec id="sec-4-1">
        <title>4.1 Experiment setup</title>
        <p>
          We evaluate the e ciency of blocking step and the whole interlinking process
of SLINT. We also compare SLINT with Zhishi.Links [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], SERIMI [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and
AgreementMaker [
          <xref ref-type="bibr" rid="ref2">2</xref>
], which recently participated in the OAEI 2011 instance
matching campaign. Discussion of predicate selection and predicate alignment is
also included. For every test in our experiments, we use the same value for each
threshold. We set the three thresholds of Eq. 3 to 0.25, 0.25, and 0.5; the
confidence threshold of Eq. 4 to 0.25; the two thresholds of Algorithm 1 to 0.1 and 0.5;
and the two thresholds of Eq. 7 to 0.25 and 0.95, respectively. These fixed values
express the schema-independent capability of SLINT.
        </p>
<p>Like previous studies, we use two evaluation metrics for blocking: pair
completeness (PC) and reduction ratio (RR). For interlinking, we use recall (Rec),
precision (Prec), and the F1 score, the harmonic mean of recall and precision.
Eqs. 8-11 give the computations of these metrics.</p>
<p>PC = Number of correct candidates / Number of actual identity pairs: (8)</p>
        <p>RR = 1 − Number of candidates / Number of all pairs: (9)</p>
        <p>Rec = Number of correct identity pairs / Number of actual identity pairs: (10)</p>
        <p>Prec = Number of correct identity pairs / Number of discovered pairs: (11)</p>
        <p>The performance of an interlinking system is also very important. We report the
execution times of SLINT when running on a desktop machine equipped with a
2.66GHz quad-core CPU and 4GB of memory.</p>
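The four metrics of Eqs. 8-11 can be sketched as set operations over the candidate, discovered, and ground-truth pair sets; the function names are illustrative.

```python
def blocking_metrics(candidates, truth, total_pairs):
    """Pair completeness (Eq. 8) and reduction ratio (Eq. 9)."""
    pc = len(candidates & truth) / len(truth)
    rr = 1 - len(candidates) / total_pairs
    return pc, rr

def interlinking_metrics(discovered, truth):
    """Recall (Eq. 10), precision (Eq. 11), and their harmonic mean F1."""
    correct = len(discovered & truth)
    rec = correct / len(truth)
    prec = correct / len(discovered)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return rec, prec, f1
```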
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Datasets and discussion on predicate selection &amp; predicate alignment</title>
<p>We use 9 datasets in our experiments. The first 7 datasets are IM@OAEI2011 datasets,
the ones used in the instance matching campaign at the OAEI 2011³. Concretely, we
use the datasets of the interlinking New York Times track, which asks participants
to detect identity pairs from NYTimes to DBpedia, Freebase, and Geonames.
These datasets belong to three domains: locations, organizations, and people.
The IM@OAEI2011 datasets are quite small. Therefore, for evaluating the
computational performance of the system, we select two larger datasets. The first
one, a medium dataset, contains 13758 pairs in the film domain between DBpedia
and LinkedMDB⁴. The second one is a quite large dataset, which contains 86456
pairs in the locations domain between DBpedia and Geonames⁵. All datasets are
downloaded by dereferencing URIs and stored in external memory in advance.
We of course remove triples having owl:sameAs and rdfs:seeAlso
predicates. Table 1 gives an overview of these datasets. In this table, the IM@OAEI2011
datasets are D1 to D7, and the last two datasets are D8 and D9. We also
include the number of predicates and predicate alignments in this table. Denote
that s and t are the source and target data, respectively; Prd and Pfd are the
3 http://oaei.ontologymatching.org/2011/instance/
4 http://downloads.dbpedia.org/3.7/links/linkedmdb_links.nt.bz2
5 http://downloads.dbpedia.org/3.7/links/geonames_links.nt.bz2
numbers of predicates in data source d before and after selection by the predicate
selection step, respectively; A and K are the numbers of all predicate alignments
and of key alignments only, respectively.</p>
<p>In general, except for NYTimes, the number of available predicates in the
schema of each data source is very large, but the important predicates occupy a
very small percentage. In our observation, the predicates declaring the label or the
name of objects are always aligned with a very high confidence. The non-string
type predicates also frequently construct the key alignments. For example, in the
locations domain, the key alignments always contain the right combination of the
predicates declaring latitudes and longitudes. The predicate releaseDate of DBpedia
is successfully combined with the predicates date and initial release date of
LinkedMDB. An interesting key alignment in dataset D6 is the high-confidence
combination of core#prefLabel of NYTimes and user.mikeshwe.default
domain.videosurf card.videosurf link text of Freebase. The latter predicate may be difficult
to select manually, since the meaning of the predicate name does not imply the
label. Clearly, it is not easy for a human to detect every compatible predicate.
When doing this task manually, we may fail to leverage all useful information.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Blocking and interlinking result</title>
<p>This section reports the results of blocking and of the whole interlinking process.
Concretely, we report the pair completeness and reduction ratio of blocking, and the
precision, recall, F1 score, and runtime of the system. Table 2 shows these
metrics on each dataset. According to this table, although we cannot retain all
identity pairs, the lowest PC is still very high at 0.94. Besides, the high RRs
reveal that the numbers of retrieved candidates are very small compared with
the numbers of total pairs. From all the evidence of PC and RR, the aim of
blocking is successfully achieved.</p>
<p>For interlinking, the precision and recall are very competitive. The recall,
which is not much lower than the pair completeness, implies that the instance
matching performs well. The high precision implies that our system has an
efficient disambiguation capability on the tested datasets. It seems easy for SLINT to
interlink the people domain, whereas in the locations domain, SLINT achieves its best
result on the IM@OAEI2011 datasets involving Geonames.</p>
<p>The execution time of SLINT is very good overall. Because we use a
co-occurrence structure in blocking, the memory of the tested machine cannot accommodate
a very large dataset. In our context, interlinking dataset D9 has such an issue.
We temporarily implemented a parallel program for re-computing every element
of the co-occurrence matrix. Interlinking this dataset therefore takes
much time because of the repeated data traversing and the high computational cost.
However, the high speeds on the other datasets are really promising for scaling up
our system.</p>
<p>The advantage of blocking is very clear if we compare the time of interlinking
with and without this step. For example, the time for the instance matching step to
match the 33,926 candidates of dataset D8 is 12.2 seconds. At this rate, matching
all available pairs would take nearly 17 hours, whereas the total is
only 67.76 seconds if we implement the blocking step. Blocking on average
occupies 58% of the total runtime of the interlinking process on the nine tested datasets.
Although this number is over a half, the advantage of blocking is still very
considerable.</p>
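The extrapolation above can be checked with a back-of-envelope calculation; the full cross-product size of 170 million pairs is a hypothetical figure chosen to be consistent with the reported "nearly 17 hours", assuming a uniform per-pair matching time.

```python
# Numbers from the text: 33926 candidate pairs matched in 12.2 seconds
seconds = 12.2
candidates = 33926
per_pair = seconds / candidates          # ~0.36 ms per candidate pair

# Hypothetical full cross-product size consistent with "nearly 17 hours"
all_pairs = 170_000_000
hours = all_pairs * per_pair / 3600
print(round(hours, 1))  # 17.0
```

The three-orders-of-magnitude gap between the candidate set and the cross-product is what makes blocking worth more than half of the total runtime.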
      </sec>
      <sec id="sec-4-4">
        <title>4.4 Comparison with previous interlinking systems</title>
        <p>
          As mentioned, we compare our system with AgreementMaker [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], SERIMI [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
and Zhishi.Links [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Because these systems recently participated in the instance
matching campaign of OAEI 2011, we use the results on the IM@OAEI2011 datasets
for comparison. Table 3 shows the interlinking results of SLINT and the others. As
shown in this table, it is clear that our system outperforms the others
on both precision and recall. AgreementMaker has a precision competitive with
SLINT on dataset D3, but it is much lower in recall. The results of Zhishi.Links
on dataset D3 are very high, but the F1 score of SLINT is still 0.05 higher
overall.
        </p>
<p>The prominent differences between SLINT and these systems are that we use the
confidence of each alignment as the weight for blocking and instance matching, and
that we discriminate data types, using TF-IDF for the tokens of strings and URIs.
Generally, SLINT is verified as the most accurate among the compared systems.</p>
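As a minimal sketch of the TF-IDF weighting idea, assuming tokenized string values and standard TF-IDF (not necessarily the exact variant used in SLINT):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Standard TF-IDF over lists of tokens (e.g. tokenized labels or URIs)."""
    n = len(docs)
    # Document frequency of each token
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Illustrative labels; rarer tokens receive higher weights
labels = [["new", "york", "times"], ["new", "york"], ["times", "square"]]
w = tfidf_weights(labels)
```

Weighting tokens this way downplays ubiquitous tokens and emphasizes discriminative ones, which supports the disambiguation behavior discussed above.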
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
<p>In this paper, we present SLINT, an efficient schema-independent linked data
interlinking system. We select important predicates by their coverage and
discriminability. The predicate alignments are constructed and filtered to
obtain key alignments. We implement an adaptive filtering technique to produce
candidates and identities. Compared with the most recent systems, SLINT
considerably outperforms them in precision and recall. The performance of SLINT
is also very high, as it takes around 1 minute to detect more than 13,000
identity pairs.</p>
<p>Although SLINT has good results on the tested datasets, this is not sufficient to
evaluate the scalability of our system, which we consider the current limitation
because of the use of the weighted co-occurrence matrix. We will investigate
a solution for this issue in our next work. Besides, we are also interested
in the automatic configuration of every threshold used in SLINT and in improving
SLINT into a cross-domain interlinking system.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hidders</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwabe</surname>
          </string-name>
          . SERIMI:
          <article-title>Class-based disambiguation for effective instance matching over heterogeneous web data</article-title>
          .
          <source>In SIGMOD'12 15th Workshop on Web and Database</source>
          , pages
          <volume>19</volume>
          –
          <fpage>25</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. P.</given-names>
            <surname>Antonelli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Stroe</surname>
          </string-name>
          .
          <article-title>AgreementMaker: efficient matching for large real-world schemas and ontologies</article-title>
          .
          <source>VLDB Endow</source>
          .,
          <volume>2</volume>
          :
          <fpage>1586</fpage>
          –
          <fpage>1589</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning linkage rules using genetic programming</article-title>
          .
          <source>In ISWC'11 6th Workshop on Ontology Matching</source>
          , pages
          <volume>13</volume>
          –
          <fpage>24</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Höffner</surname>
          </string-name>
          .
          <article-title>RAVEN - Active learning of link specifications</article-title>
          .
          <source>In ISWC'11 6th Workshop on Ontology Matching</source>
          , pages
          <volume>25</volume>
          –
          <fpage>36</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>K.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ichise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>Learning approach for domain-independent linked data instance matching</article-title>
          .
          <source>In KDD'12 2nd Workshop on Mining Data Semantics</source>
          , pages
          <fpage>7:1</fpage>
          –
          <lpage>7:8</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of link discovery configuration</article-title>
          .
          <source>In ESWC'12</source>
          , pages
          <fpage>119</fpage>
          –
          <fpage>133</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Zhishi.links results for OAEI 2011</article-title>
          .
          <source>In ISWC'11 6th Workshop on Ontology Matching</source>
          , pages
          <volume>220</volume>
          –
          <fpage>227</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matteini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          .
          <article-title>LDIF - Linked data integration framework</article-title>
          .
          <source>In ISWC'11 2nd Workshop on Consuming Linked Data</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          .
          <article-title>Automatically generating data linkages using a domain-independent candidate selection approach</article-title>
          .
          <source>In ISWC'11</source>
          , pages
          <fpage>649</fpage>
          –
          <fpage>664</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Discovering and maintaining links on the web of data</article-title>
          .
          <source>In ISWC'09</source>
          , pages
          <fpage>650</fpage>
          –
          <fpage>665</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>