LDM: Link Discovery Method for new Resource Integration

Nathalie Pernelle and Fatiha Saïs

LRI (CNRS UMR 8623 & Paris-Sud 11 University), INRIA Saclay, 4 rue Jacques Monod, Parc Club Orsay Université, F-91893 Orsay Cedex, France
{Nathalie.Pernelle, Fatiha.Sais}@lri.fr

Abstract. In this paper we address the problem of resource discovery in the Linked Open Data cloud (LOD), where data described by different schemas is not always linked. We propose an approach that allows the discovery of new links between data. These links can help to match schemas that are conceptually relevant with respect to a given application domain. Furthermore, these links can be exploited during the querying process in order to combine data coming from different sources. In this approach we exploit the semantic knowledge declared in different schemas in order to model: (i) the influences between concept similarities, (ii) the influences between data similarities, and (iii) the influences between data and concept similarities. The similarity scores are computed by an iterative resolution of two non-linear equation systems that express the concept similarity computation and the data similarity computation. The proposed approach is illustrated on scientific publication data.

1 Introduction

The advent of the Web of documents (WWW) [1] has transformed the way we create and share knowledge by breaking down the barriers of publishing and accessing documents. Hypertext links allow users to navigate the graph of documents, and allow Web search engines to index the documents and answer user queries. However, hyperlinks do not express explicit links between the various entities described in the Web of documents. With the Linked Open Data cloud initiative [3], the number of data providers on the Web is growing continuously, leading to a global data space of billions of assertions where data and documents can be linked. However, until now the published data has remained very heterogeneous in the sense that it is incomplete, inconsistent, described according to different schemas and contains duplicates. In order to be able to automatically exploit this huge amount of heterogeneous data, substantial integration work must be performed.

In this paper we focus on the problem of resource discovery in the Linked Open Data cloud (LOD), where data described by different schemas is not always linked. We propose an approach, developed in the setting of the ANR (French National Research Agency) project GeOnto, that allows the discovery of new links between data. These links can help to match schemas that are conceptually relevant with respect to a given application domain.

Ontology alignment plays a key role for the semantic interoperability of this data. Many approaches have been proposed for automatically identifying mappings between elements (concepts and relations) described in heterogeneous ontologies [18, 14]. These approaches may exploit lexical and structural information, user inputs, prior matches or external resources. When concept and relation instances are available, it is also possible to exploit them to find more mappings between ontologies. In [7], the common instances of concepts are exploited to compute mappings between concepts. Since data is not described using the same URIs even when it describes the same entities, these common instances cannot be obtained straightforwardly. Conversely, discovering that two pieces of data refer to the same real-world entity is also a key issue for data integration.
We propose an approach which simultaneously addresses both problems: ontology alignment and data linking. The results of the data linking step are exploited to improve the results of the ontology alignment step and vice versa. These two steps are performed alternately until a fixpoint is reached. The two methods exploit the semantic knowledge declared in the different schemas (ontologies) in order to model: (i) the influences between concept similarities, (ii) the influences between data similarities, and (iii) the influences between data and concept similarities. The similarity scores are computed by an iterative resolution of two non-linear equation systems that express, respectively, the concept similarity computation and the data similarity computation.

Applying this approach allows one to infer equivalence mappings between concepts of different schemas as well as owl:same-as relations between instances that refer to the same entity. The obtained schema mappings allow discovering new resources and inferring whether they are relevant with respect to a given application domain.

The paper is organized as follows: in Section 2 we present related work in the data linking and ontology reconciliation fields. In Section 3, we present the ontology and data model and give a short presentation of the N2R method on which our work relies. Section 4 presents the proposed link discovery approach. Finally, we conclude and give some future work in Section 5.

2 Related Work

We denote by "web data" the network formed by the set of structured datasets described in RDF (Resource Description Framework) and linked by explicit links. Large amounts of structured data have been published, notably within the Linking Open Data cloud (LOD) project. Datasets are expressed in terms of one or several ontologies that establish the vocabulary describing the data. Web data requires linking together the various sources of published data. Given the large amount of published data, it is necessary to provide methods for automatic data linking. Several tools [17, 13, 10] have recently been proposed to partially solve this problem, each with its own characteristics. For instance, [10] have developed a generic framework for integrating linking methods in order to help users find the link discovery methods that are most suitable for their relational data. They introduced LinQL, an extension of SQL that integrates querying with string matching (e.g. the weighted Jaccard measure) and/or semantic matching (i.e. using synonyms/hyponyms) methods. This approach takes advantage of the DBMS query engine optimizations and can easily be used to test elementary similarity measures. Nevertheless, it is not designed to propagate similarity scores between entities, i.e. it is not global. Other works address the problem of link discovery in the context of Semantic Web services. In [11], the authors propose to match a user request with semantic web service descriptions by using a combination of similarity measures that can be learnt on a set of labeled examples.

Our proposal in this paper can also be compared to approaches studying the reference reconciliation problem, i.e., detecting whether different data descriptions refer to the same real-world entity (e.g. the same person, the same paper, the same protein). Different approaches have been proposed.
[5, 19, 2, 6] have developed supervised reference reconciliation methods which use supervised learning algorithms in order to learn parameters and help duplicate detection. Such supervised approaches cannot be used in contexts where the amount of data is large and the data schemas are different and incomplete. In [16] we have developed an automatic reference reconciliation method which is declarative and unsupervised. However, in this method we assumed that the data sets conform to the same schema, i.e. that the problem of ontology reconciliation is already solved. Some ontology reconciliation approaches [7, 12] have proposed to exploit a priori reconciled instances in the ontology reconciliation process. When we aim at online reference and ontology reconciliation in the context of Linked Open Data, we cannot use these traditional reconciliation approaches, in which solving the reference reconciliation problem assumes the prior resolution of the ontology reconciliation problem and vice versa. Furthermore, to the best of our knowledge, there is no approach which deals with the two problems of discovering links at the ontology level and at the data level simultaneously.

3 Preliminaries

In this section we present the ontology and data model that we consider in this work. We then present the Numerical method for Reference Reconciliation (N2R) [16] on which our link discovery approach relies.

3.1 Ontology and its Constraints

The considered OWL ontology consists of a set of concepts (unary relations) organized in a taxonomy and a set of typed properties (binary relations). These properties can also be organized in a taxonomy of properties. Two kinds of properties can be distinguished in OWL: the so-called relations (owl:ObjectProperty), the domain and the range of which are concepts, and the so-called attributes (owl:DatatypeProperty), the domain of which is a concept and the range of which is a set of basic values (e.g. Integer, Date, Literal). Figure 1 gives an extract O1 of the ontology that is used to describe the RDF data of the local publication data source (see source S1 in Figure 2), which we will use to illustrate our proposal.

Fig. 1. An extract O1 of the local ontology for publications

We allow the declaration of constraints expressed in OWL-DL or in SWRL in order to enrich the domain ontology with additional and useful knowledge. The constraints that we consider are of the following types:
– Constraints of disjunction between concepts: DISJOINT(C, D) declares that the two concepts C and D are disjoint. In the ontology O1 we declare that the concepts Article, Conference and Person are pairwise disjoint.
– Constraints of functionality of properties: PF(P) declares that the property P (relation or attribute) is functional. In O1, we declare that all the properties are functional except the relation authoredBy, which means that one article may have several authors.
– Constraints of inverse functionality of properties: PFI(P) declares that the property P (relation or attribute) is an inverse functional property. These constraints can be generalized to a set {P1, ..., Pn} of relations or attributes to state a combined inverse functionality constraint, denoted PFI(P1, ..., Pn). In O1, we declare that the combinations (title, year) and (confName, confYear) are inverse functional. For example, PFI(title, year) expresses that one title and one year cannot be associated with several articles (i.e. both are needed to identify an article).

Source S1:
Article(S1_a1); title(S1_a1, "Implementing the TEA algorithm on sensors"); Person(S1_p1); Person(S1_p2); year(S1_a1, "2004"); name(S1_p1, "Olga V. Gavrylyako"); name(S1_p2, "Shuang Liu"); pageFrom(S1_a1, "64"); pageTo(S1_a1, "69"); Conference(S1_c1); confName(S1_c1, "Proceedings of the 42nd Annual Southeast Regional Conference, 2004, Huntsville, Alabama, USA, April 2-3, 2004"); confYear(S1_c1, "2004"); city(S1_c1, "Alabama"); authoredBy(S1_a1, S1_p1); authoredBy(S1_a1, S1_p2); published(S1_a1, S1_c1);
Article(S1_a2); title(S1_a2, "Weighted Hyper-sphere SVM for Hypertext Classification"); Person(S1_p3); Person(S1_p4); year(S1_a2, "2008"); name(S1_p3, "Shuang Liu"); name(S1_p4, "Guoyou Shi"); pageFrom(S1_a2, "733"); pageTo(S1_a2, "740"); Conference(S1_c2); confName(S1_c2, "Advances in Neural Networks - ISNN 2008, 5th International Symposium on Neural Networks, ISNN 2008, Beijing, China, September 24-28, 2008, Proceedings, Part I"); confYear(S1_c2, "2008"); city(S1_c2, "Beijing"); authoredBy(S1_a2, S1_p3); authoredBy(S1_a2, S1_p4); published(S1_a2, S1_c2);

Source S2:
Article(S2_a1); title(S2_a1, "Implementing the TEA algorithm on sensors."); Person(S2_p1); Person(S2_p2); year(S2_a1, "2004"); name(S2_p1, "Olga V. Gavrylyako"); name(S2_p2, "Shuang Liu"); pageFrom(S2_a1, "64"); pageTo(S2_a1, "69"); Conference(S2_c1); confName(S2_c1, "42nd Annual Southeast Regional Conference, 2004"); confYear(S2_c1, "2004"); city(S2_c1, "Alabama"); authoredBy(S2_a1, S2_p1); authoredBy(S2_a1, S2_p2); published(S2_a1, S2_c1);

Fig. 2. An extract of RDF data

3.2 Data description and its constraints

A piece of data has a reference, which has the form of a URI (e.g. http://dblp.l3s.de/d2r/resource/authors/A._Joe_Turner), and a description, which is a set of RDF facts involving its reference. An RDF fact can be: (i) a concept-fact C(i), where C is a concept and i is a reference, (ii) a relation-fact R(i1, i2), where R is a relation and i1 and i2 are references, or (iii) an attribute-fact A(i, v), where A is an attribute, i a reference and v a basic value (e.g. integer, string, date). We consider the Unique Name Assumption (UNA), which may or may not be declared on a data source. Declaring the UNA on a data source means that two data descriptions having different references refer to distinct entities. The data description that we consider is composed of RDF facts coming from the data sources, enriched by applying the OWL entailment rules.
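These constraints are exploited downstream: in the filtering step of N2R (Section 3.3), the disjunction constraints and the UNA restrict which reference pairs enter the equation system. The following minimal Python sketch (illustrative data structures, ours rather than the authors' implementation) reproduces this filtering on part of the data of Figure 2: under the UNA, only cross-source pairs are formed, and pairs typed by disjoint concepts are discarded.

from itertools import product

def candidate_pairs(refs1, refs2, disjoint):
    """refs1/refs2 map each reference of S1/S2 to its concept; the UNA on
    each source is reflected by forming cross-source pairs only.
    disjoint: set of frozensets of pairwise-disjoint concept pairs."""
    pairs = []
    for (r1, c1), (r2, c2) in product(refs1.items(), refs2.items()):
        if frozenset((c1, c2)) not in disjoint:  # disjoint concepts: skip
            pairs.append((r1, r2))
    return pairs

refs_s1 = {"S1_a1": "Article", "S1_c1": "Conference",
           "S1_p1": "Person", "S1_p2": "Person"}
refs_s2 = {"S2_a1": "Article", "S2_c1": "Conference",
           "S2_p1": "Person", "S2_p2": "Person"}
disjoint = {frozenset(p) for p in
            [("Article", "Conference"), ("Article", "Person"),
             ("Conference", "Person")]}
print(candidate_pairs(refs_s1, refs_s2, disjoint))  # 6 pairs: x1..x6 of Sec. 3.3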
Figure 2 provides examples of data coming from two RDF data sources S1 and S2, which conform to the same ontology describing the scientific publication domain mentioned previously. In the N2R method, which we present in Section 3.3, we consider that the descriptions of data coming from different sources conform to the same OWL ontology (possibly after ontology reconciliation). In the link discovery method, which we present in Section 4, the assumption of prior ontology reconciliation is not fulfilled, i.e. the considered data sources do not conform to the same ontology.

3.3 N2R: a Numerical Method for Reference Reconciliation

N2R is a numerical method which allows inferring reconciliation decisions between references coming from different sources that conform to the same ontology, i.e. the problem of ontology reconciliation is already solved. N2R [16] has two main distinguishing characteristics. First, it is fully unsupervised: it does not require any training phase on manually labeled data to set up coefficients or parameters. Second, it is based on equations that model the influences between similarities. In the equations, each variable represents the (unknown) similarity between two references, while the similarities between attribute values are expressed by constants. These constants are obtained either by (i) exploiting a dictionary of synonyms (e.g. the WordNet thesaurus, or the dictionary of synonyms generated by the L2R method [15]); or (ii) using standard similarity measures on strings or on sets of strings [4]. Furthermore, ontology and data knowledge (disjunctions and UNA) is exploited by N2R in a filtering step to reduce the number of reference pairs that are considered in the equation system. The functions modeling the influence between similarities are a combination of maximum and average functions, in order to take into account the functionality and inverse functionality constraints declared in the OWL ontology in an appropriate way. N2R can also take as input a set of reference pairs that are reconciled (sim = 1) by another method (e.g. L2R [15] in the LN2R approach) or given by a user, like the owl:same-as links available in the Linked Open Data cloud.

The equations modeling the dependencies between similarities. For each pair of references, its similarity score is modeled by a variable x_i, and the way it depends on other similarity scores is modeled by an equation x_i = f_i(X), where i ∈ [1..n], n is the number of reference pairs for which we apply N2R, and X = (x_1, x_2, ..., x_n) is the set of the corresponding variables. Each equation x_i = f_i(X) is of the form:

f_i(X) = max(f_{i-df}(X), f_{i-ndf}(X))

The function f_{i-df}(X) is the maximum of the similarity scores of the value pairs and reference pairs of the attributes and relations on which the i-th reference pair functionally depends. The maximum function allows propagating the similarity scores of the values and references having a strong impact. The function f_{i-ndf}(X) is a weighted average of the similarity scores of the value pairs (and sets) and reference pairs (and sets) of the attributes and relations on which the i-th reference pair does not functionally depend. See [16] for the detailed definitions of f_{i-df}(X) and f_{i-ndf}(X).

Iterative algorithm for reference pair similarity computation. The equation system is solved by an iterative method inspired by the Jacobi method [8], which converges quickly on linear equation systems.
To compute the similarity scores, we have implemented an iterative resolution method. At each iteration, the method computes the variable values by using those computed at the previous iteration. Starting from an initial vector X^0 = (x_1^0, x_2^0, ..., x_n^0), the value of the vector X at the k-th iteration is obtained by the expression X^k = F(X^{k-1}). At each iteration k we compute the value of each x_i^k:

x_i^k = f_i(x_1^{k-1}, x_2^{k-1}, ..., x_n^{k-1})

until a fixpoint with a precision ε is reached. The fixpoint is reached when: ∀i, |x_i^k − x_i^{k-1}| ≤ ε.

In order to illustrate the iterative resolution of the equation system, we consider the extract of RDF data given in Figure 2 corresponding to the set of RDF facts in which the references S1_a1, S1_c1, S2_a1 and S2_c1 are involved. By considering the disjunctions between the concepts of O1 and the UNA in S1 and S2, we obtain an equation system of six variables:

x_1 = Sim_r(S1_a1, S2_a1); x_2 = Sim_r(S1_c1, S2_c1); x_3 = Sim_r(S1_p1, S2_p1); x_4 = Sim_r(S1_p2, S2_p2); x_5 = Sim_r(S1_p1, S2_p2); x_6 = Sim_r(S1_p2, S2_p1).

We give below the similarity scores of basic values obtained by using the Jaccard similarity measure. For clarity, we denote the value of an attribute A associated with a reference i as A.val(i). For example, the confYear value associated with the reference S2_c2 is denoted confYear.val(S2_c2) and equals "2008". The only similarity score of basic values needed in the equation system that belongs to ]0, 1[ is Sim_v(confName.val(S1_c1), confName.val(S2_c1)) = 0.43; all the other similarity scores of basic values needed in the computation are either equal to 1 or equal to 0. The weights used in the weighted averages of the equations are computed as a function of the number of common attributes and common relations of the reference pairs.

The similarity computation is illustrated by the equation system (see Table 1) obtained from the data descriptions shown in Figure 2, which conform to the ontology O1. The detailed equations expressing the similarity of the two articles and of the two conferences are as follows:

x_1 = max(1/2 (Sim_v(title.val(S1_a1), title.val(S2_a1)) + Sim_v(year.val(S1_a1), year.val(S2_a1))), 1/6 (x_2 + SJ({S1_p1, S1_p2}, {S2_p1, S2_p2}) + Sim_v(pageFrom.val(S1_a1), pageFrom.val(S2_a1)) + Sim_v(pageTo.val(S1_a1), pageTo.val(S2_a1))))

where SJ is the SoftJaccard similarity measure between sets of objects (see Section 4.2),

x_2 = max(x_1, max(1/2 (Sim_v(confName.val(S1_c1), confName.val(S2_c1)) + Sim_v(confYear.val(S1_c1), confYear.val(S2_c1))), 1/4 (Sim_v(city.val(S1_c1), city.val(S2_c1)))))

x_3 = 1/2 x_1 + 1/2 Sim_v(name.val(S1_p1), name.val(S2_p1))

The equation system and the iterations of the resulting similarity computation are given in Table 1, where XS1 denotes SJ({S1_p1, S1_p2}, {S2_p1, S2_p2}). We assume that the fixpoint precision ε equals 0.005.

Equation                                        | it. 0 | it. 1 | it. 2 | it. 3
x_1 = max(1/2 (1 + 1), 1/6 (x_2 + XS1 + 1 + 1)) | 0     | 1     | 1     | 1
x_2 = max(x_1, max(1/2 (0.43 + 1), 1/4 (1)))    | 0     | 0.71  | 1     | 1
x_3 = 1/2 (x_1 + 1)                             | 0     | 0.5   | 1     | 1
x_4 = 1/2 (x_1 + 1)                             | 0     | 0.5   | 1     | 1
x_5 = 1/2 (x_1 + 0)                             | 0     | 0     | 0.5   | 0.5
x_6 = 1/2 (x_1 + 0)                             | 0     | 0     | 0.5   | 0.5

Table 1. Example of iterative similarity computation

The solution of the equation system is X = (1, 1, 1, 1, 0.5, 0.5), which corresponds to the similarity scores of the six reference pairs. The fixpoint is reached after three iterations. If we fix the reconciliation threshold T_rec at 0.80, then we obtain four reconciliation decisions: two articles, two conferences and two pairs of persons.
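To make the iteration concrete, the following minimal Python sketch (ours, not part of the original implementation) runs this Jacobi-style resolution on the six equations of Table 1; the constant value similarities are those given above, and XS1 is fixed to its converged value 1 for simplicity.

EPSILON = 0.005  # fixpoint precision used in the example

def F(x):
    """One synchronous update: every x_i^k is computed from X^{k-1}."""
    x1, x2 = x[0], x[1]
    xs1 = 1.0  # SJ({S1_p1, S1_p2}, {S2_p1, S2_p2}), kept constant here
    return [
        max(0.5 * (1 + 1), (x2 + xs1 + 1 + 1) / 6),  # x1: the two articles
        max(x1, max(0.5 * (0.43 + 1), 0.25 * 1)),    # x2: the two conferences
        0.5 * (x1 + 1),                              # x3: (S1_p1, S2_p1)
        0.5 * (x1 + 1),                              # x4: (S1_p2, S2_p2)
        0.5 * (x1 + 0),                              # x5: (S1_p1, S2_p2)
        0.5 * (x1 + 0),                              # x6: (S1_p2, S2_p1)
    ]

x = [0.0] * 6
while True:
    new_x = F(x)
    done = all(abs(a - b) <= EPSILON for a, b in zip(new_x, x))
    x = new_x
    if done:
        break

print(x)  # [1, 1, 1, 1, 0.5, 0.5]: the fixpoint, reached after three iterations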
4 Link Discovery Method (LDM)

In this section we present our LDM approach, which aims to discover a LOD source that shares concepts and data with a data source described by a domain ontology. Our approach compares a local dataset, on which domain knowledge can be declared, with a LOD dataset by using a combined ontology reconciliation and reference reconciliation method. Since the data provided by the LOD source and by the domain application source is not described using the same ontology, we have adapted the N2R method so that it can compute data similarities when data does not belong to the same concepts but to similar concepts. Furthermore, we have defined how similarities between concepts of two ontologies can be computed when some of their references are common (i.e. same URI, or owl:same-as links that have been previously asserted) or similar.

The main steps of our link discovery approach are as follows:
1. application of an ontology mapping tool to obtain: (i) the set of equivalent/comparable properties and (ii) initial similarity scores for some concept pairs;
2. building of the two equation systems: the conceptual equation system, which expresses the similarity computation between pairs of concepts as a function of their labels, their structural similarity and their references; and the instance level equation system, which expresses the similarity computation between pairs of references as a function of their common description and of the similarity of the concepts they instantiate;
3. iterative resolution of the conceptual equation system until a fixpoint is reached;
4. iterative resolution of the instance level equation system until a fixpoint is reached.

Steps (3) and (4) are iterated until a global fixpoint is reached, i.e., neither the resolution of the conceptual equation system nor the resolution of the instance level equation system updates the similarity scores (a sketch of this alternation is given in Section 4.3).

In the following subsections, we first describe the elementary similarity measures that are used to compute similarities. Then, we present the two equation systems that have been defined to compute concept similarities and data similarities. Finally, we illustrate our LDM approach on data and ontologies of the publication domain.

4.1 Initialization

We first use an alignment tool which exploits lexical and structural information to find similarity scores between ontology elements (concepts and properties). Given a local ontology O1 and a LOD ontology O2, the alignment tool finds a set of mappings, each described by a tuple {e1, e2, co, rel}, where e1 is aligned to the element e2 with confidence co and type of correspondence rel (e.g. equivalence, subsumption, overlap, closeness, etc.). These scores are used to initialize the similarity score simInit of each pair of concepts and to find a set of properties (relations or attributes) that are very similar (rel = equivalence or subsumption, and co ≥ th, where th is a high threshold). These properties are then considered as equivalent.
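To fix ideas, the following minimal Python sketch (the Mapping structure, the kind field and the threshold value are illustrative assumptions, not a specific alignment tool's API) shows how such mappings can seed simInit and the set of equivalent properties.

from dataclasses import dataclass

@dataclass
class Mapping:
    e1: str      # element of O1 (concept or property name)
    e2: str      # element of O2
    co: float    # confidence score of the correspondence
    rel: str     # "equivalence", "subsumption", "overlap", ...
    kind: str    # "concept" or "property" (our assumption)

TH = 0.9  # high threshold th (illustrative value)

def initialize(mappings):
    sim_init = {}        # simInit: initial score for each concept pair
    equiv_props = set()  # property pairs considered as equivalent
    for m in mappings:
        if m.kind == "concept":
            sim_init[(m.e1, m.e2)] = m.co
        elif m.rel in ("equivalence", "subsumption") and m.co >= TH:
            equiv_props.add((m.e1, m.e2))
    return sim_init, equiv_props

# Example with scores of the kind used in Section 4.4:
sim_init, props = initialize([
    Mapping("Article", "InProceedings", 0.3, "overlap", "concept"),
    Mapping("title", "label", 0.95, "equivalence", "property"),
])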
4.2 Elementary similarity measures

In this section we present the elementary measures used to compute similarity scores between pairs of concepts of two ontologies. These measures take into account the lexical and the structural knowledge declared in the two ontologies. Most of them are based on the SoftJaccard similarity measure, which computes the similarity between sets of basic values or between sets of objects (e.g., references, concepts).

SoftJaccard: a similarity measure for sets of objects. In [16], we defined the SoftJaccard similarity measure, an adaptation of the Jaccard similarity measure in the sense that: (i) it applies not only to sets of basic values but also to sets of objects, and (ii) instead of requiring equality between elements it uses a similarity score with respect to a threshold θ. Let S1 and S2 be two sets of elements, which can be basic values or objects. To compute the similarity score between S1 and S2, we first compute the set CLOSE_T(S1, S2, θ), i.e. the set of elements of S1 having at least one element of S2 with a similarity score Sim_T ≥ θ:

CLOSE_T(S1, S2, θ) = {e_j | e_j ∈ S1 and ∃ e_k ∈ S2 s.t. Sim_T(e_j, e_k) ≥ θ}

where T indicates whether the sets S1 and S2 contain basic values (T = v) or objects (T = o). When T = v, the function Sim_v is a similarity measure between basic values, such as Jaccard or Jaro-Winkler [4]. When T = o, the function Sim_o is a similarity score that can be provided by a tool dedicated to object comparison, such as the N2R tool [16] for references or the TaxoMap tool [9] for concepts. Then:

SoftJaccard_T(S1, S2, θ) = |CLOSE_T(S1, S2, θ)| / |S1|, with |S1| ≥ |S2|
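A minimal Python sketch of SoftJaccard under these definitions (the sim parameter stands for the elementary measure Sim_T; the empty-set convention is our assumption):

def soft_jaccard(s1, s2, theta, sim):
    """SoftJaccard_T(S1, S2, theta); sim is Sim_T (a string measure when
    T = v, scores from a tool such as N2R or TaxoMap when T = o)."""
    if len(s1) < len(s2):  # orient the pair so that |s1| >= |s2|
        s1, s2 = s2, s1
    if not s1:
        return 0.0         # convention for two empty sets (our assumption)
    # CLOSE_T: elements of s1 close (>= theta) to some element of s2
    close = [e1 for e1 in s1 if any(sim(e1, e2) >= theta for e2 in s2)]
    return len(close) / len(s1)

# e.g. soft_jaccard({"Olga V. Gavrylyako", "Shuang Liu"}, {"Shuang Liu"},
#                   0.8, some_string_measure)  ->  0.5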
Similarity measures used to compare concepts. To compute the similarity scores between concepts we exploit their conceptual content, i.e. the sets of their ancestors and descendants, as well as the sets of properties they share with respect to a given equivalence relation. The similarity score between concepts is also a function of the similarity scores of their references, i.e. of their instance level content.

Similarity of concept labels. In OWL ontologies, sets of labels are usually associated with concepts. For concepts whose labels are not given, we consider their URIs instead. Let L1 be the set of labels of a concept c1 and L2 be the set of labels of a concept c2. The label similarity sim_label is computed by applying the SoftJaccard measure to the two sets of basic values L1 and L2:

sim_label(c1, c2) = SoftJaccard_v(L1, L2, θ1)

Similarity of concept ancestors. For two concepts, we also compute the similarity of their ancestor sets in the two ontologies. Let A1 be the set of ancestors of the concept c1 and A2 be the set of ancestors of c2. The ancestor similarity sim_anc is computed by applying the SoftJaccard measure to the two sets of concept ancestors (i.e. objects):

sim_anc(c1, c2) = SoftJaccard_o(A1, A2, θ2)

Similarity of concept descendants. The similarity score of two concepts also depends on the similarity scores of their descendants in the two ontologies. Let D1 be the set of descendants of the concept c1 and D2 be the set of descendants of c2. The descendant similarity sim_desc is computed by applying the SoftJaccard measure to the two sets of concept descendants:

sim_desc(c1, c2) = SoftJaccard_o(D1, D2, θ3)

Similarity of shared properties of concepts. The similarity of two concepts depends on the proportion of equivalent properties compared to the full number of properties defined for both concepts. Let R1^d (resp. R2^d) be the set of properties such that c1 (resp. c2) is subsumed by the domain of the property (or of an equivalent property), and let R1^r (resp. R2^r) be the set of properties such that c1 (resp. c2) is subsumed by the range of the property (or of an equivalent property). The property similarity sim_rel is defined as follows:

sim_rel(c1, c2) = |(R1^d ∩ R2^d) ∪ (R1^r ∩ R2^r)| / |R1^d ∪ R2^d ∪ R1^r ∪ R2^r|

Similarity of concept references. The similarity score of two concepts also depends on the sets of their references. Let I1 (resp. I2) be the set of instances of c1 (resp. c2). The similarity of c1 and c2 depends on the similarity scores obtained for the pairs of references of I1 × I2, and it is computed by applying the SoftJaccard measure to the sets I1 and I2 of references (i.e. objects):

sim_ref(c1, c2) = SoftJaccard_o(I1, I2, θ4)

4.3 Equations modeling the dependencies between similarities in the LDM approach

In the LDM approach, the similarity of each pair of references is expressed by a variable x_i in the instance level equation system. Its value depends on the common description of the pair of references w.r.t. the equivalent/comparable properties (cf. N2R). It also depends on the similarity scores of the concepts sc_i that are instantiated by the pair of references. An equation of the instance level equation system, x_i = g_i(X), where i ∈ [1..n], n is the number of reference pairs and X = (x_1, ..., x_n), is of the form:

g_i(X) = 1/2 (sc_i + f_i(X))

where sc_i is the similarity score computed by the resolution of the conceptual equation system presented below. The function f_i(X) is expressed as in the N2R method, and we consider that the knowledge about the (inverse) functionality of the shared properties declared in the local ontology also holds in the LOD ontology.

The similarity of each pair of concepts (c, c') is expressed by a variable xc_j in the conceptual equation system. Its value depends on the initial similarity score provided by the alignment tool, the similarity of the labels, the set of equivalent/comparable properties and the similarity of the references, represented respectively by the constants sim_{j-init}, sim_{j-label}, sim_{j-rel} and sim_{j-ref}. It also depends on the similarity of the ancestors and of the descendants, represented by the variables XSC_{j-anc} and XSC_{j-desc} computed using the SoftJaccard function. An equation xc_j = h_j(XC), where j ∈ [1..m], m is the number of concept pairs and XC = (xc_1, ..., xc_m), is of the form:

h_j(XC) = max(sim_{j-init}, 1/5 (XSC_{j-anc} + XSC_{j-desc} + sim_{j-rel} + sim_{j-label} + sim_{j-ref}))

The values of the constants sim_{j-init}, sim_{j-label}, sim_{j-rel} and sim_{j-ref} are computed using the similarity functions described in the previous subsection.
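The alternation of steps (3) and (4) can now be stated precisely. The following minimal Python sketch is ours: it assumes two callables that each run the Jacobi-style iterative resolution of Section 3.3, on ES1 (the h_j equations) and on ES2 (the g_i equations) respectively, and alternates them until the global fixpoint.

def converged(old, new, eps):
    """True when no score moved by more than eps."""
    return all(abs(new[k] - old[k]) <= eps for k in new)

def ldm(solve_conceptual, solve_instance, sc, x, eps=0.05):
    """sc: scores keyed by concept pair (the xc_j); x: scores keyed by
    reference pair (the x_i). solve_conceptual(sc, x) resolves ES1, using
    sim_ref values derived from x; solve_instance(sc, x) resolves ES2,
    using the concept scores sc."""
    while True:
        new_sc = solve_conceptual(sc, x)    # step (3): resolve ES1
        new_x = solve_instance(new_sc, x)   # step (4): resolve ES2
        if converged(sc, new_sc, eps) and converged(x, new_x, eps):
            return new_sc, new_x            # global fixpoint reached
        sc, x = new_sc, new_x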
The size m of the conceptual equation system is |C1 × C2|, where C1 (resp. C2) is the set of concepts of the ontology O1 (resp. O2). The size of the instance level equation system depends on the number k of comparable relations and on the sizes of their corresponding sets of domain and range instances. Let r_{i1} and r_{i2} be two comparable relations, let E_{i1} (resp. E_{i2}) be the set of domain instances of r_{i1} (resp. r_{i2}), and let E_{i3} (resp. E_{i4}) be the set of range instances of r_{i1} (resp. r_{i2}). It also depends on the number k' of comparable attributes and on the sizes of their corresponding sets of domain instances. Let a_{j1} and a_{j2} be two comparable attributes, and let E_{j1} (resp. E_{j2}) be the set of domain instances of a_{j1} (resp. a_{j2}). The number n of variables of the instance level equation system is:

n = | ∪_{i=1}^{k} ((E_{i1} × E_{i2}) ∪ (E_{i3} × E_{i4})) ∪ ∪_{j=1}^{k'} (E_{j1} × E_{j2}) |

The computational complexity of the LDM method is O((n^2 · it_ref) + (m^2 · it_c)), where it_ref is the number of iterations on the instance level equation system and it_c is the number of iterations on the conceptual equation system.

One of the most distinguishing characteristics of LDM is its ability to propagate similarities at different levels: (i) between pairs of concepts, (ii) between pairs of references and (iii) between sets of references and sets of concepts. By using two separate equation systems we avoid propagation between references when we compute the concept similarity scores, and likewise propagation between concepts when we compute the reference similarity scores. We thus decrease the size of the equation systems and allow a user to visualize and validate the intermediate equation system results.

4.4 Illustrative example

Figure 3 presents an extract of the DBLP ontology used to describe the DBLP data published in the LOD. The considered dataset only contains a collection of conference proceedings and the collection of their corresponding research papers in computer science. In order to illustrate our link discovery approach, we compare the local RDF data of the source S1 given in Figure 2 with the extract of the DBLP dataset of the LOD given in Figure 4. The initialization step provides the following initial similarity scores for the concept pairs:

sim_init(Article, InProceedings) = 0.3; sim_init(Article, Proceedings) = 0.1; sim_init(Article, Agent) = 0.1; sim_init(Person, InProceedings) = 0.0; sim_init(Person, Proceedings) = 0.0; sim_init(Person, Agent) = 0.3; sim_init(Conference, InProceedings) = 0.2; sim_init(Conference, Proceedings) = 0.2; sim_init(Conference, Agent) = 0.1

Fig. 3. An extract O2 of the LOD DBLP ontology

LOD source S2:
InProceedings(S2_a1); label(S2_a1, "Implementing the TEA algorithm on sensors"); Agent(S2_p1); Agent(S2_p2); issued(S2_a1, "2004"); name(S2_p1, "Olga V. Gavrylyako"); name(S2_p2, "Shuang Liu"); Proceedings(S2_c1); label(S2_c1, "42nd Annual Southeast Regional Conference, 2004"); creator(S2_a1, S2_p1); creator(S2_a1, S2_p2); partOf(S2_a1, S2_c1);
InProceedings(S2_a2); label(S2_a2, "New Chaos Produced from Synchronization of Chaotic Neural Networks"); Agent(S2_p3); issued(S2_a2, "2008"); name(S2_p3, "Zunshui Cheng"); Proceedings(S2_c2); label(S2_c2, "Advances in Neural Networks - ISNN 2008, 5th International Symposium on Neural Networks"); creator(S2_a2, S2_p3); partOf(S2_a2, S2_c2);

Fig. 4. An extract of the DBLP dataset on the LOD

Since there are no subsumption relations in O1 and O2, the conceptual equations do not take into account the similarity scores of the ancestors and of the descendants. For example, the equation expressing the similarity of the two concepts Article and InProceedings is: xc_1 = max(0.3, 1/3 (2/3 + 0 + sim_{1-ref})). In this example, the conceptual equation system consists of nine variables (xc_1, ..., xc_9). The instance level equation system consists of twenty-five equations representing all the reference pairs whose common description is not empty.
For example:

– The equation expressing the similarity of the two references S1_a1 (Article) and S2_a1 (InProceedings) is:
x_1 = 1/2 (sc_1 + max(1/2 (Sim_v(label.val(S2_a1), title.val(S1_a1)) + Sim_v(issued.val(S2_a1), year.val(S1_a1))), 1/4 (SJ({S1_p1, S1_p2}, {S2_p1, S2_p2}) + x_14)))
– The equation expressing the similarity of the two references S1_c1 (Conference) and S2_c1 (Proceedings) is:
x_14 = 1/2 (sc_14 + max(x_1, 1/2 (Sim_v(label.val(S2_c1), confName.val(S1_c1)))))
– The equation expressing the similarity of the two references S1_p1 (Person) and S2_p1 (Agent) is:
x_5 = 1/2 (sc_5 + 1/2 (x_1 + Sim_v(name.val(S1_p1), name.val(S2_p1))))
– The equation expressing the similarity of the two references S1_c1 (Conference) and S2_a1 (InProceedings) is:
x_18 = 1/2 (sc_18 + max(Sim_v(confName.val(S1_c1), label.val(S2_a1))))

Table 2 shows the iterative resolution of the conceptual equation system ES1 modeling the similarity of all the pairs of concepts of the ontologies O1 and O2. The column sim_init contains the initial similarity scores computed by an external concept alignment tool, such as TaxoMap [9]. Table 3 shows the results of the iterative resolution of the instance level equation system ES2 on the pairs of references coming from the local source S1 of Figure 2, which conforms to the local ontology O1, and from the LOD source S2, which conforms to the LOD DBLP ontology O2.

Variables of ES1                   | sim_init | Resolution 1 – it. 1      | Resolution 2 – it. 1
xc_1 = (Article, InProceedings)    | 0.3      | max(0.3, 1/3 (4/6)) = 0.3 | 0.33
xc_2 = (Article, Proceedings)      | 0.1      | max(0.1, 1/3 (1/6)) = 0.1 | 0.1
xc_3 = (Article, Agent)            | 0.1      | max(0.1, 1/3 (0)) = 0.1   | 0.1
xc_4 = (Person, InProceedings)     | 0.0      | max(0, 1/3 (0)) = 0.0     | 0.0
xc_5 = (Person, Proceedings)       | 0.0      | max(0, 1/3 (0)) = 0.0     | 0.0
xc_6 = (Person, Agent)             | 0.3      | max(0.3, 1/3 (1)) = 0.33  | 0.427
xc_7 = (Conference, InProceedings) | 0.2      | max(0.2, 1/3 (1/4)) = 0.2 | 0.2
xc_8 = (Conference, Proceedings)   | 0.2      | max(0.2, 1/3 (3/4)) = 0.25| 0.33
xc_9 = (Conference, Agent)         | 0.1      | max(0.1, 1/3 (0)) = 0.1   | 0.1

Table 2. The two resolutions of the conceptual equation system ES1

The Resolution 1 step of ES1 corresponds to the first iterative resolution of ES1, where sim_ref equals 0 for all the concept pairs; the fixpoint with ε = 0.05 is reached in two iterations. The Resolution 1 step of ES2 corresponds to the first iterative resolution of ES2, where sc_i equals sim_init for all the reference pairs (cf. Table 2); the fixpoint with ε = 0.05 is reached in three iterations. The Resolution 2 step of ES1 corresponds to the second iterative resolution of ES1, where sim_ref equals the similarity scores computed by ES2 at the last iteration of its Resolution 1; the fixpoint with ε = 0.05 is again reached in two iterations. The Resolution 2 step of ES2 corresponds to the second iterative resolution of ES2, where sc_i equals the similarity scores computed by ES1 at the last iteration of its Resolution 1; the fixpoint with ε = 0.05 is reached in two iterations. The global fixpoint is reached after three resolutions: at Resolution 3, the two systems ES1 and ES2 yield the same similarity scores as at the last iteration of their respective Resolution 2 steps (these scores are therefore not shown).

The results obtained by ES1 show that the method obtains the best similarity scores for the most plausibly equivalent concepts: (Article, InProceedings), (Person, Agent) and (Conference, Proceedings). Analogously, the results obtained by ES2 show that the best similarity scores are obtained for the most plausible owl:same-as reference pairs.
Variables of ES2      | Resolution 1 – it. 1                       | Res. 1 – it. 2 | Res. 1 – it. 3 | Res. 2 – it. 1
x_1 = (S1_a1, S2_a1)  | 1/2 (0.3 + max(1/2 (2), 1/4 (0))) = 0.66   | 0.66           | 0.66           | 0.665
x_2 = (S1_a2, S2_a1)  | 1/2 (0.3 + max(1/2 (0), 1/4 (0))) = 0.15   | 0.165          | 0.175          | 0.19
...                   | ...                                        | ...            | ...            | ...
x_5 = (S1_p1, S2_p1)  | 1/2 (0.33 + 1/2 (1)) = 0.415               | 0.58           | 0.58           | 0.62
x_6 = (S1_p2, S2_p1)  | 1/2 (0.33 + 1/2 (0)) = 0.165               | 0.33           | 0.33           | 0.379
...                   | ...                                        | ...            | ...            | ...
x_14 = (S1_c1, S2_c1) | 1/2 (0.25 + max(0, 1/2 (0.438))) = 0.219   | 0.455          | 0.455          | 0.477
...                   | ...                                        | ...            | ...            | ...
x_18 = (S1_c1, S2_a1) | 1/2 (0.2 + max(0)) = 0.1                   | 0.1            | 0.1            | 0.1
...                   | ...                                        | ...            | ...            | ...

Table 3. The resolution of the instance level equation system ES2

If we fix the reconciliation threshold at 0.45, we infer the reconciliation of the two papers (S1_a1, S2_a1), of the two persons (S1_p1, S2_p1) and of the two conferences (S1_c1, S2_c1). This example shows the applicability of the approach even when the considered ontologies are not syntactically close and have a very poor structure (no subsumption relations), which means that the ancestors and the descendants are not taken into account.

5 Conclusion and Future Work

In this paper we have presented a Link Discovery Method (LDM) which allows the discovery of new data sources published in the Linked Open Data cloud (LOD). Our approach is based on the idea of comparing a local dataset, on which domain knowledge can be declared, with a LOD dataset by using a combined ontology reconciliation and reference reconciliation method. By using our LDM method one may discover more owl:same-as links with datasets available on the LOD.

One of the most distinguishing characteristics of our link discovery approach resides in its ability to propagate similarities at different levels: (i) between pairs of concepts, (ii) between pairs of references and (iii) between sets of references and sets of concepts. By using two separate equation systems we avoid propagation between references when we compute the concept similarity scores, and likewise propagation between concepts when we compute the reference similarity scores.

As a very short term perspective, we plan to test our LDM approach on real datasets and to evaluate the quality of its results and its scalability. It will also be worthwhile to compare the LDM method with existing link discovery methods such as [10]. As future work, we plan to extend the approach so that, in addition to the equivalent properties, it can take other properties into account in order to consider richer data descriptions. Moreover, we also aim to extend the LDM method to compute similarities between the properties of the considered ontologies.

References

1. Berners-Lee, T., Cailliau, R., Groff, J.F., Pollermann, B.: World-Wide Web: The information universe. Electronic Networking: Research, Applications and Policy 1(2), 74–82 (1992)
2. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: KDD. pp. 39–48 (2003)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009)
4. Cohen, W.W., Ravikumar, P.D., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: IIWeb. pp. 73–78 (2003)
5. Cohn, D.A., Atlas, L.E., Ladner, R.E.: Improving generalization with active learning. Machine Learning 15(2), 201–221 (1994)
6. Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces.
In: SIGMOD Conference. pp. 85–96 (2005)
7. Euzenat, J., Loup, D., Touzani, M., Valtchev, P.: Ontology alignment with OLA. In: Sure, Y., Corcho, O., Euzenat, J., Hughes, T. (eds.) Proc. 3rd ISWC 2004 Workshop on Evaluation of Ontology-based Tools (EON), Hiroshima (JP). pp. 59–68 (2004)
8. Golub, G.H., Van Loan, C.F.: Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA (1996)
9. Hamdi, F., Safar, B., Niraula, N.B., Reynaud, C.: TaxoMap in the OAEI 2009 alignment contest. In: Proceedings of the 4th International Workshop on Ontology Matching (OM-2009), collocated with the 8th International Semantic Web Conference (ISWC-2009), Chantilly, USA, October 25, 2009 (2009)
10. Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: A framework for semantic link discovery over relational data. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009. pp. 1027–1036 (2009)
11. Kiefer, C., Bernstein, A.: The creation and evaluation of iSPARQL strategies for matchmaking. In: The Semantic Web: Research and Applications, 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008, Proceedings. pp. 463–477 (2008)
12. Li, J., Tang, J., Li, Y., Luo, Q.: RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering 21, 1218–1232 (2009)
13. Nikolov, A., Uren, V.S., Motta, E., de Roeck, A.N.: Handling instance coreferencing in the KnoFuss architecture. In: Proceedings of the 1st IRSW 2008 International Workshop on Identity and Reference on the Semantic Web, Tenerife, Spain, June 2, 2008 (2008)
14. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001)
15. Saïs, F., Pernelle, N., Rousset, M.C.: L2R: A logical method for reference reconciliation. In: AAAI. pp. 329–334 (2007)
16. Saïs, F., Pernelle, N., Rousset, M.C.: Combining a logical and a numerical method for data reconciliation. J. Data Semantics 12, 66–94 (2009)
17. Scharffe, F., Liu, Y., Zhou, C.: RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In: Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA, US) (2009)
18. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. J. Data Semantics IV, pp. 146–171 (2005)
19. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. 26(8), 607–633 (2001)