Improving Link Specifications using Context-Aware Information Andrea Cimmino Carlos R. Rivero David Ruiz University of Seville, Spain Rochester Institute of University of Seville, Spain cimmino@us.es Technology, USA druiz@us.es crr@cs.rit.edu ABSTRACT types of links between datasets, but the most common one is There is an increasing interest in publishing data using the owl:sameAs [7]. To generate the links, a link discovery task Linked Open Data philosophy. To link the RDF datasets, must be performed, which aims to find all pair of instances a link discovery task is performed to generate owl:sameAs that are describing the same concept [13, 15]. links. There are two ways to perform this task: by means Link discovery can be performed in two different ways: of a classifier or a link specification; we focus in the lat- by means of a classifier [22], which links instances with ter approach. Current link specification techniques only use owl:sameAs if it considers them as the same; or generat- the data properties of the instances that they are linking, ing a link specification, which is a set of restrictions over and they do not take the context information into account. the data properties of two instances [17, 11]. Each pair of In this paper, we present a proposal that aims to gener- data properties is associated to a similarity function and a ate context-aware link specifications to improve the regular threshold, if the similarity function returns a value higher link specifications, increasing the effectiveness of the results than the threshold, then the restriction is satisfied and the in several real-world scenarios where the context is crucial. two instances are linked using owl:sameAs. For example, a Our context-aware link specifications are independent from link specification defines that two instances of Person are the similarity functions, transformations or aggregations. We same if both have a data property describing their names, have evaluated our proposal using two real-world scenarios and their literals are exactly the same. in which we improve precision and recall with respect to Unfortunately, in some scenarios, this definition is not regular link specifications in 23% and 58%, respectively. suitable to generate owl:sameAs links. Under some circum- stances, taking only the literals of the data properties of two instances into account may lead to mix up instances Keywords that are very similar but actually different, e.g. if two peo- Linked Data, Link Discovery, Link Specification, Contex- ple are different but have the same name. In these cases, Aware Link Specification a link specification should include conditions over the con- text information, data properties of other instances that are 1. INTRODUCTION related with the main two to have more information and In recent years, we have witnessed an increasing interest in improve the effectiveness. the Linked Open Data [9]. As a matter of fact, the number In this paper, we focus on improving current link spec- of datasets in 2011 were 452 and in 2014 that number raised ifications making them context aware. We aim to extend to 2,289 [6]. Publicly-available datasets have to fulfill the the definition used by actual techniques [17, 11], and add Linked Data principles, which manly consist in using IRIs restrictions over the instances in the context of the pair that as names of things, using HTTP IRIs so that people can we are linking. To achieve this, we introduce the concept look up those names, when someone looks up a IRI provide of overlap factor. If we define different link specifications useful information using the standards (RDF or SPARQL), over the instances in both contexts, we can handle them as and finally, include links to other IRIs so that they can dis- sets potentially overlapped by means of an equity criteria cover more things [2]. Since the number of datasets has defined by the link specifications. The overlap factor is a increased in the recent years and these principles establishes function that measures the overlapping between contexts. that the datasets must be linked with others and published Thanks to this, we are able to define restrictions over this in RDF formats [1], a huge effort has been done to link value. In this paper, we restrict ourselves to two different these RDF datasets automatically [14]. There are different types of overlap factors, namely: exists or for all. The for- mer means that there is a pair of instances in the context considered the same by means of a link specification. The latter, means that all the instances in both contexts are the same. Figure 1 depicts a sample scenario where the context is crucial to obtain a good precision. Figure 1(a) shows a part of the data model of DBLP and the National Science Foun- dation (NSF), both were built using authors that have pub- Copyright is held by the author/owner(s). lished in the International Conference of Very Large Data WWW2016 Workshop: Linked Data on the Web (LDOW2016). CALSAR: forall LSAR and exists LSAP dblp:Author nsf:Researcher dblp:name LSAR: Jaro > 0.98 nsf:name nsf:leads nsf:email nsf:position nsf:Award dblp:writes nsf:name nsf:sponsor dblp:Article nsf:Paper nsf:startDate nsf:amountToDate dblp:title LSAP: Levenshtein > 0.90 nsf:title dblp:date nsf:date nsf:supports dblp:numberOfPages nsf:pages dblp:recordedAt nsf:conference (a) Data model and context-aware link specification nsf:leads dblpU:Wang0011:Wei nsfU:WeiWang0007 nsfU:AN-0423336 dblp:writes dblp:name : "Wei Wang" (LSAR) sameAs nsf:supports nsf:name : "Wei Wang" rdf:type : dblp:Author rdf:type: nsf:Award rdf:type : nsf:Researcher nsfU:AN-0423336/#13 (CALSAR) (LSAR) sameAs nsf:title : "Efficient Computation ..." dblpU:conf/vldb/JiangWL03 nsfU:WeiWang0012 rdf:type : nsf:Paper nsf:name : "Wei Wang" nsf:leads dblp:title : "Holistic Twig ..." rdf:type : dblp:Article rdf:type : nsf:Researcher nsfU:AN-1043034 nsf:supports rdf:type: nsf:Award dblpU:conf/vldb/YuanLLWYZ05 nsfU:YangWang0023 nsfU:AN-1043034/#2 nsf:leads nsf:title : "Necessary and sufficient ..." dblp:title : "Efficient Computation ..." nsf:name : "Yang Wang" rdf:type : dblp:Article rdf:type : nsf:Researcher rdf:type : nsf:Paper (b) Sample instances and links generated to relate them Figure 1: Scenario DBLP-NSF: Improving precision using context-aware link specification Bases (PVLDB) of 2013. DBLP contains these authors and author in DBLP, dblpU:Wang0011:Wei. We see that there their articles, the NSF researchers from its portal, with the are two researchers whose name is “Wei Wang” that lead same name of these authors, that leads awards which sup- two different NSF grants (nsf:AN-1043034, nsf:AN-0423336). ports papers. We wish to identify authors in DBLP that Using LSAR alone, we link the three instances (dblpU:Wang- have been awarded with NSF grants using their names and 0011:Wei and nsfU:WeiWang0012, and dblpU:Wang0011:Wei publications. We include two link specifications: LSAR , and nsfU:WeiWang0007). However, using LSAP we are able which links the dblp:Author and nsf:Researcher instances if to identify one paper in DBLP written by “Wei Wang” that their literals of dblp:name and nsf:name obtain a score over also appears in NSF. If we use both links at the same time 0.98 using Jaro; LSAP , which links dblp:Article and nsf:Paper as part of the context information of the authors, we discard instances if their literals dblp:title and nsf:title obtain a score one of the previous links (dblpU:Wang0011:Wei and nsfU:We- over 0.90 using Levenshtein. We rely on these link specifica- iWang0012), which is not correct since it is actually linking tions and add overlap factors to them, each of which states another researcher in NSF whose name is “Wei Wang”. As how many instances should be linked. The overlap factors a result, we improve precision using context information. for LSAR and LSAP are for all and exists, respectively. The Figure 2 depicts another sample scenario, in which DBLP resulting context-aware link specification is interpreted as acts both as source and target, where the context is cru- follows: a pair of dblp:Author and nsf:Researcher instances cial to obtain a good recall. Figure 2(a) shows a part of are the same by means of the context-aware link specifica- the data model, this scenario has several authors and their tion if all the instances covered by LSAR , pairs of dblp:Author aliases (different names that refer to the same person); both and nsf:Researcher instances, are the same, and if at least one datasets contains the same authors and their articles but pair covered by LSAP , dblp:Article and nsf:Paper instances, they have different aliases in each dataset. We include a are the same. link specification, LSAA , that links dblp:Article instances if Actual link specification techniques can generate LSAR the literals of their dblp:title obtain a score over 0.99 using or LSAP (notice that LSAP would not link instances of db- Jaccard. We rely on LSAA and we add to it a for all as over- lp:Author and nsf:Researcher), but they are not able to use lap factor. The resulting context-aware link specification is LSAP to link instances of dblp:Author and nsf:Researcher. interpreted as follows: two instances of dblp:Author are the Additionally, they are not able to generate and apply overlap same by means of the contex-aware link specification if all factors over them. their dblp:Article instances are the same by means of LSAA . Figure 1(b) depicts a set of sample instances, the link Figure 2(b) depicts a set of sample instances, the link specifications LSAR and LSAP and the context-aware link specification LSAA and the context-aware link specification specification CALSAR . We focus on “Wei Wang”, who is an CALSAA . We focus on link ”Hosagrahar V. Jagadish” and CALSAA: forall LSAA dblpU:Jagadish_0002:Hosagrahar.V. dblpU:Jagadish_0001:H.V. dblp:Author dblp:Author dblp:name : "Hosagrahar V. Jagadish" (CALSAA) sameAs dblp:name : "H. V. Jagadish" dblp:name dblp:name rdf:type : dblp:Author rdf:type : dblp:Author dblp:writes dblp:writes dblpU:journals/pvldb/LabrinidisJ12 dblpU:journals/pvldb/LabrinidisJ12 dblp:writes dblp:writes dblp:title : "Challenges and ..." dblp:title : "Challenges and ..." dblp:Article dblp:Article rdf:type : dblp:Article rdf:type : dblp:Article dblp:title LSAA: Jaccard > 0.99 dblp:title dblpU:conf/sigmod/QianCJ12 dblpU:conf/sigmod/QianCJ12 dblp:date dblp:date dblp:numberOfPages dblp:numberOfPages dblp:title : "Sample-driven ..." dblp:title : "Sample-driven .." dblp:recordedAt dblp:recordedAt rdf:type : dblp:Article rdf:type : dblp:Article (a) Data model and context-aware link specifica- (b) Sample instances and links generated to relate them tion Figure 2: Scenario DBLP-DBLP: Improving recall using context-aware link specification ”H. V. Jagadish” (dblpU:Jagadish 0002:Hosagrahar.V. and d- to exploit the information contained in instances related to blpU:Jagadish 0001:H.V.), which are different names of the the pair that is been analyzed by it. It takes two data mod- same person. A regular link specification would compare els as input and generates a probabilistic model. In the first these literals, instead we rely on their publications. Using place, the technique computes the probabilities of equiva- LSAR as part of the context information of the authors, we lences of instances, then, the probabilities for relationships are able to link all their dblp:Article instances and, hence, with other instances and, finally, it creates the equivalences link the instances dblpU:Jagadish 0002:Hosagrahar.V. and d- between the classes. Hassanzadeh et al. [8] proposed a semi- blpU:Jagadish 0001:H.V. through their publications. We im- supervised technique that receives two datasets as input. prove the recall because we do not rely on the names of au- This technique works as follows: having all the data proper- thors, which differs from both datasets, instead we use their ties of all the classes in each dataset, the technique iterates publications, which titles do not differ from both datasets. over one set and searches in the other set, according to a As result, we improve the recall using the context informa- string distance, the most similar data properties. The tech- tion. nique returns the ranked set of pairs according to the string We performed several experiments using the datasets in- distance. troduced in the Figure 1 and 2, in which we improved 23% Regarding the link specification approaches, Isele and Bi- in precision and 58% in recall, respectively. zer [11] proposed GenLink, which is a supervised genetic The rest of the article is organized as follows: in Section programming technique that generates link specifications as 2, we report on several related proposals and their features; trees. It starts with a population made of random link spec- Section 3 introduces our conceptual framework; Section 4 ifications and some recurrent link specification predefined presents our proposal to generate context-aware link specifi- by the authors. Then, using genetic operations (reproduc- cations; Section 5 reports the results obtained in our experi- tion, crossover and mutation), the population is evolved and ments; and, finally, Section 6 recaps on our main conclusions its quality evaluated by means of a fitness function, which and future work. uses training data provided by the user. The technique stops when a configured maximum number of iterations is reached, 2. RELATED WORK or a link specification obtains a value in the fitness function over a threshold given by the user. Based on GenLink, the Over the last years, several approaches have been devel- same authors proposed ActiveGenLink [12], which aims to oped to address the link discovery task. There are two ways reduce the number of labelled examples using active learn- to face link discovery. The first approach is building different ing. ActiveGenLink selects link candidates to be labelled kind of classifiers to establish if two instances are the same by the user from a pool of unlabelled instances through a [22]. The second approach is through the discovery of accu- query strategy. Then, once the user labels a given example, rate link specifications, which specify conditions that must it adds the example to the training data and evolves the pop- hold true for two entities to be interlinked [11, 12, 17, 18, ulation using GenLink. Another semi-supervised technique 19, 21]. The main difference between these two approaches is EAGLE [17], which is based in a genetic programming is that the former does not specify why two instances are the technique with active learning, and it aims to generate link same, i.e., it works like a black box that receives as input two specifications as trees. It starts detecting similar classes and instances and outputs whether or not they are the same; the data properties using RAVEN [16]. Then, EAGLE evolves latter generates a specification of why two instances should an initial random population of link specifications according be the same, describing which data properties have to fulfil to genetic operators. After that, the technique computes the the conditions. most informative links and asks the user to label them. This We focus only in the link specification approach, although, process is repeated until the stop condition is fulfilled, i.e., a we have also analyzed some classifiers since they exploit the maximal number of iterations is reached, or the fitness value context information [8, 10, 23]. Holub et al. [10] proposed of a link specification is over a given threshold. An unsuper- a technique that works with a fixed formula that takes the vised learning technique was proposed by Nikolov et al. [19], instances related directly to the pair that is been linked into which starts with a random population and keeps iterating account. PARIS [23] is an unsupervised technique developed over it, applying genetic operators, until a maximal num- 3.1 Foundations ber of iterations is reached, or the fitness of the population We are focusing on RDF datasets, which are triple stores does not improve for several iterations. Since this technique that contain literals and IRIs. Our proposal focuses on the does not work with labelled data, the fitness function uses analysis of different instances, each of which entails several two criteria defined by the authors to evaluate link specifica- concepts as follows: tions, namely: pseudo-F-measure and neighborhood growth. When a stop condition is reached, it returns the link speci- • IRI: it uniquely identifies a web location. Note that fication with the highest fitness value from the population. we use expressions like dblp:name to refer to IRIs, in EUCLID [18] is an unsupervised technique that, using differ- which dblp: is a prefix. Table 2 summarizes all of our ent similarity functions, evaluates the data properties of the prefixes. For example, some sample IRIs in Figure 1 instances and generates a space of similarity values. Then, are dblpU:Wang0011:Wei and nsfU:YangWang0023. depending on different heuristics, it iterates over that space updating the scores and pruning them until a solution is • BlankNode: are placeholders for IRIs whose actual value found or a stop condition is reached. The unsupervised tech- is unknown. They have only local scope and are purely nique proposed by Song and Heflin [21] focuses on metrics an artifact of the serialization. Blank nodes are disjoint to improve the candidate selection to be as more scalable from IRIs and Literals. as possible. Candidate selection is a process to pick pairs of instances, each of which has a high probability to be the • Instance: an instance is an IRI or a BlankNode that we same. The process is performed by selecting and compar- are interested in linking. ing only part of the data properties of each instance in the pair. It then extracts a set of data properties very useful in disambiguation, which identify why the pair of instances are Pref. IRI rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#type the same. owl: http://www.w3.org/2002/07/owl# As far as we know, none of the previous link specification dblp: http://example.org/voc/dblp# techniques is able to exploit context information. Holub et nsf: http://example.org/voc//nsf# al. [10] proposed a technique that takes into account only db- http://example.org/urls/dblp# the instances of the context one-hop related to the pair that lpU: is been linked. PARIS [23] takes as input all the datasets and nsfU: http://example.org/urls/nsf# generates a probabilistic model to classify input instances. The technique by Hassanzadeh et al. [8] returns a ranking of Table 2: IRI prefixes used in the paper most similar data properties using several string distances. None of the previous techniques is able to apply transforma- tions on data properties and only [8] is able to use different • Class: we can assign classes to Instances, each of which string similarity measures, however it only returns a ranked is an IRI that represents a real-world concept. When list of data property and not why two instances should be we assign a Class to an Instance, we are explicitly saying linked. Additionally, [10] only takes one-hop connected in- that the Instance belongs to the type of the Class. We stances into account, although, many real-world scenarios use rdf:type to represent this assignment. For exam- require to take more than one-hop related instances into ac- ple, in Figure 1, Instances related to Classdblp:Author count [20]. represent authors in DBLP. Table 1 summarizes all the techniques and their different features. Those that generate link specifications are classi- • DataProperty: Instances may comprise attributes that fied as LS; if they take into account the context, been LS or describe features of the Instances, which are plain lit- not, then we classify them as context-aware (C-A). Finally, if erals. To represent these features, we use data prop- the technique is independent of any specific function (aggre- erties, which are IRIs that identify these literals. In gations, transformations, string distance measures or string Figure 1, the names of the dblp:Author Instances are similarities), we classify it as function independent (FI). identified using dblp:name. Technique LS C-A FI • Literal: it denotes a value that a data property takes. [10] Holub et al. (2015) No Yes No For example, in Figure 1 “Yang Wang” for nsfU:Yang- [23] Suchanek et al. (2011) No Yes No Wang0023 or “Wei Wang” for nsfU:WeiWang0012 are [8] Hassanzadeh et al. (2013) No Yes No sample literals for the same data property. Depending [5, 12] Isele and Bizer (2011, 2012) Yes No Yes on the Instance, data properties have different literals. [17, 18] Ngonga and Lyko (2012,2013) Yes No Yes [19] Nikolov et al. (2012) Yes No Yes • ObjectProperty: Instances can be related to other In- [21] Song and Heflin (2011) Yes No No stances by means of object properties, which are IRIs; a set of Instances related conform a graph. Note that, Table 1: Comparison of current techniques in RDF, object properties are first-class citizens and they are not subordinated to Instances. Figure 1 shows a sample object property that relates dblp:Author and 3. PRELIMINARIES dblp:Article using dblp:writes. Notice that we can add In the following, we present the formalization of several multiple relations connecting the same Instance to mul- concepts that we use to describe our proposal. We define its tiple Instances; for example, in Figure 1, the object foundations, what a link specification is and what we mean property dblp:writes may relate one dblp:Author with by context-aware link specification. several dblp:Article Instances. LinkSpecification CALinkSpecification source: Set source: Set target: Set target: Set Condition * * CACondition SameAsCondition ConditionComposite CASameAsCondition ConditionComposite f: Function f: Aggregation oF: OverlapFactor f: Aggregation threshold: Double 2 Operand * 2 LinkSpecification ObjectLeafNode source: Set prop: ObjectProperty * DataLeafNode OperandComposite target: Set dataset: {SRC, TRG} prop: DataProperty f: Transformation dataset: {SRC, TRG} (a) Link Specification (b) Context-Aware Link Specification Figure 3: Models for link discovery 3.2 Link Specification Figure 3(b) specifies the structure of a context-aware link When performing link discovery, we have a source and specification using an UML-like notation. Each ObjectLeaf- a target datasets that we wish to relate using owl:sameAs Node represents the object properties that connect the sets links. To link the Instances of each dataset we generate a of classes in CALinkSpecification with the other sets of classes link specification. A link specification has been defined in in the LinkSpecifications. A CASameAsCondition specifies an multiple manners in the literature [3, 11, 12, 17]. We have OverlapFactor over a LinkSpecification. OverlapFactor takes based our work in the definition given by Isele and Bizer as values for all, if all the Instances are required to be con- [11]. A link specification is a set of restrictions that define sidered the same by means of the LinkSpecification, or exists, the equality between a source and a target sets of classes if just one pair of Instances is required. ConditionComposi- based on their data properties. For instance, a dblp:Author te combines different CASameAsConditions or other Condi- and a nsf:Researcher are the same if they have very similar tionComposites, the Aggregation functions are: AND or OR literals for dblp:name and nsf:name. Boolean conditions. Finally, CALinkSpecification represents Figure 3(a) depicts how we model link specifications using the two main sets of classes, source and target, that the an UML-like notation. Each DataLeafNode represents a spe- Instances we wish to link with owl:sameAs belongs. cific data property and the dataset it belongs. OperandCom- We present a sample context-aware link specification in posite specifies one Transformation function to be applied Figure 1(a) between dblp:Author and nsf:Researcher, that we over the literals; examples of these transformations are low- refer to as CALSAR . The Instances of both classes are consid- ercase, tokenize, concatenate or remove prefix. SameAsCon- ered the same using some of their data properties, dblp:name dition represents a threshold and a string distance measure, and nsf:name, but also taking the data properties of Instances or a string similarity, that defines when two Operands are belonging to the context into account. It uses two link spec- the same; some of the well-known string distance measures ifications to link the different kind of Instances, LSAR and are Levenshtein, Jaccard, Jaro and Jaro-Winkler. Condi- LSAP , and over them it defines two overlap factors, for all tionComposite combines different SameAsConditions or other and exists, respectively. ConditionComposites. Thanks to this, it is possible to define restrictions over data properties and combine the results, for example, using AND or OR Boolean conditions. LinkSpeci- 4. APPROACH fication contains the sets of source and target classes of the Our proposal aims to generate context-aware link specifi- Instances that we are relating with owl:sameAs links. cations by means of Algorithm 1. The input is an example Figure 1(a) shows two sample link specifications. One of composed by two Instances from different datasets represent- them between the Instances of dblp:Author and nsf:Resear- ing the same concept. The output of the algorithm is a cher (LSAR ). The Instances of both classes are the same context-aware link specification. if literals of data properties dblp:name and nsf:name are the The algorithm takes each input Instance and explores the same by means of a Jaro comparison and a threshold of 0.98. Instances related with them by means of their object prop- The second link specification (LSAP ) relates dblp:Article and erties, retrieving all the new Instances from both contexts nsf:Paper, which are the same if literals of data properties (lines 10-11 in Algorithm 1). Then, the algorithm generates dblp:title and nsf:title are the same by means of a Levenshtein a set of link specifications for the Instances in each context, comparison and a threshold of 0.90. line 12 in Algorithm 1. For example, receiving as input the Instances dblpU:Wang0011:Wei and nsfU:WeiWang0007 from 3.3 Context-Aware Link Specification Figure 1(b), the algorithm firstly retrieves all the Instances A context-aware link specification extends the given def- that are related with them, dblpU:conf/vldb/JiangWL03 and inition of link specification by defining when two Instances dblpU:conf/vldb/YuanLLWYZ05 for the first one, nsfU:AN- are the same, like before, but the restrictions are not defined 0423336 and nsfU:AN-0423336/#13 for the second one. It is only over their data properties, but also taking data proper- important to notice that not only the Instances one-hop away ties of other Instances that belongs to a different set of classes are retrieved but also those that are more distant. Then, into account, which are connected by object properties. the algorithm generates two link link specifications, LSAR Algorithm 1 generateCALinkSpecification 2), it firstly applies the current link specification to the In- 1: input stances, and then, it measures the ratio of Instances linked 2: i1 , i2 : Instance by a owl:sameAs generated with the current link specification 3: output (line 16 in Algorithm2). In our technique, if all the Instances 4: cals: CALinkSpecification in both contexts covered by the link specification are linked 5: variables by owl:sameAs, then a for all overlap factor is assigned to 6: C1 , C2 : P Instance the current link specification (lines 19-20 in Algorithm 2). 7: LS: P LinkSpecification If only one owl:sameAs is generated between the Instances in 8: SO: P CASameAsCondition the context, covered by the current link specification, then 9: an exists overlap factor is assigned to it (lines 21-22 in Al- 10: C1 ← expand (i1 ) gorithm 2); otherwise, the link specification is discarded. 11: C2 ← expand (i2 ) In Figure 1(b), the algorithm assigns for all to LSAR , since 12: LS ← generateLinkSpecifications (C1 , C2 ) there is only a pair of Instances covered by it and both ful- 13: SO ← assignOverlapFactor (LS, C1 , C2 , i1 , i2 ) fill the restrictions of LSAR . The algorithm assigns exists 14: cals ← createCALinkSpecification (SO, i1 , i2 ) to LSAP since only one pair of dblp:Article and nsf:Paper In- stances fulfill the conditions of LSAP . Additionally, we also store the sets of object properties that connect the class of the input Instance with the class of the Instances covered by and LSAP , relating the Instances in both contexts. Actual the link specification (lines 17-18 in Algorithm 2). In Fig- link specification techniques are able to generate LSAR and ure 1(b), the algorithm relates LSAR with an empty set of LSAP , so, in this paper, we do not focus on generating them. object properties because the Instances covered by it are the We assume that generateLinkSpecifications returns the best same that the input. Then the algorithm assigns to LSAP link specifications for the Instances in the context. two sets of object properties, {dblp:writes } and {nsf:leads, The algorithm assigns to each link specification an overlap nsf:supports }, that connect the main Instancesdblp:Author factor. In addition, we store the set of object properties and nsf:Researcher with the class of the Instances covered that connect the class of the input Instance with the class by LSAP , dblp:Article and nsf:Paper. Finally, for each link of the Instances covered by the link specification (line 13 in specification, its related overlap factor and the sets of ob- Algorithm 1 and Algorithm 2). Finally, it combines with ject properties, the algorithm creates a CASameAsCondition different aggregation functions the results obtained in the (line 24 in Algorithm 2). Every CASameAsCondition is added previous step, creating the context-aware link specification to a set, which is the output of the algorithm when there are (line 14 in Algorithm 1). no more link specifications to compute. Algorithm 2 assignOverlapFactor 1: input Algorithm 3 createCALinkSpecification 2: LS: P LinkSpecification 1: input 3: i1 , i2 : Instance 2: i1 , i2 : Instance 4: C1 , C2 : P Instance 3: SO: P CASameAsCondition 5: output 4: output 6: SO: P CASameAsCondition 5: cals: CALinkSpecification 7: variables 6: variables 8: ls: LinkSpecification 7: classsrc , classtrg : P Class 9: o1 , o2 : Double 8: aggrAND: ConditionComposite 10: oF: OverlapFactor 9: 11: opsrc , optrg : P ObjectLeafNode 10: aggrAND ← combineWithAndAggregations (SO) 12: 11: classsrc ← extractRDFClass (i1 ) 13: SO ← ∅ 12: classtrg ← extractRDFClass (i2 ) 14: oF ← {} 13: cals ← createCALS (aggrAND, classsrc , classtrg ) 15: for each ls in LS 16: (o1 ,o2 ) ← measureOverlap (ls, C1 , C2 ) 17: opsrc ← objectPropertiesPath (ls, C1 , i1 ) Algorithm 3 receives as input a set of CASameAsCondi- 18: optrg ← objectPropertiesPath (ls, C2 , i2 ) tion and the input Instances, the output of the algorithm 19: if o1 = 1.0 and o2 = 1.0 then is a CALinkSpecification. The algorithm starts combining 20: oF ← {for all } all the different CASameAsConditions of the input set with 21: if o1 > 0.0 and o2 > 0.0 then and aggregations (line 10 in Algorithm 3). Finally the algo- 22: oF ← {exists } rithm extracts the class of each input Instance (lines 11-12 23: if oF = {for all } or oF = {exists } then in Algorithm 3) and creates a CALinkSpecification (line 13 24: SO ∪ createCASameAsCond (oF, ls, opsrc , optrg ) in Algorithm 3). In Figure 1(b), the classes of the input In- stances are dblp:Author for Wang0011:Wei and nsf:Researcher Algorithm 2 receives as input a set of link specifications, for WeiWang0007, the final context-aware link specification two Instances to be linked, and two sets of Instance that with the aggregation functions is depicted in this figure. It are the contexts; the output of the algorithm is a set of links dblp:Author and nsf:Researcher Instances if they have CASameAsCondition. The algorithm starts iterating over the similar names (for all LSAR ) and some of their publications link specifications and, for each of them (line 15 in Algorithm have similar titles (exists LSAP ). LS: Jaro(dblp : name, nsf : name) ≥ threshold. CALS: for all LS and exists Jaro(dblp : title, nsf : title) ≥ threshold. 1 1 0.9 0.9 0.8 0.8 0.7 0.7 Effectiveness Effectiveness 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 Precision Precision 0.2 Recall 0.2 Recall 0.1 F-Measure 0.1 F-Measure 0.7 0.75 0.8 0.85 0.9 0.95 1 0.7 0.75 0.8 0.85 0.9 0.95 1 Threshold Threshold (a) Link specification (b) Context-Aware link specfication Figure 4: Effectiveness results when specifications are given by an expert in DBLP-NSF 5. EVALUATION cher instances that have the same name but are different We use two scenarios in which we study the effectiveness authors, therefore, taking some context information into ac- using link specifications and context-aware link specifica- count, like their publications, is crucial to perform a suitable tions. Both scenarios were built with real data using re- link discovery task. searchers that have published in PVLDB of 2013 extracted To build this scenario, firstly, we extracted from DBLP all from DBLP. Furthermore, there are real-world situations in the articles and authors that have been published in PVLDB which taking the context into account is crucial to perform of 2013. Then, we looked up their names in the NSF portal the optimal link discovery task. For each scenario, we did and we extracted all their related information. Finally, we two evaluations: in the first one, an expert defined a link created two RDF datasets, whose data models are depicted specification, to the best of his/her knowledge, and then, in Figures 1(b) and 2(b), respectively. The resulting DBLP the same expert defined a context-aware link specification. dataset comprises 764 instances of dblp:Author and 47,225 in- Since link specifications are very sensitive to their accep- stances of dblp:Article. The resulting NSF dataset comprises tance threshold, for each defined specification, we tuned the 235 instances of nsf:Researcher, 235 instances of nsf:Award, acceptance threshold value of their string similarity from 0.7 and 6,877 instances of nsf:Paper. Since NSF has information to 1.00 and analyzed for which values the best effectiveness about different disciplines, in this dataset we have several re- was achieved. In the second evaluation, we used GenLink searchers that have the same name but are different people. [11] to generate link specifications between the same classes For example, in Figure 1, instances WeiWang0012 and Wei- of the previous experiments, whose goal is to analyze the Wang0007 have the same literal for nsf:name, although they impact of adding context to a regular link specification gen- are describing different researchers. In the whole dataset, erated by a technique. only 74 instances of nsf:Researcher have different literals for We make our data, algorithms, and scripts, publicly avail- nsf:name. able [4]. Therefore, our results can be reproduced and tested Figure 4(a) depicts the effectiveness obtained with the link by third parties and researchers can extend our results to specification provided by the expert, a Jaro comparison over cope with future requirements. dblp:name and nsf:name, and using all possible acceptance We have implemented our technique in Java 1.8, and Jena thresholds values from 0.7 to 1.0. Figure 4(b) depicts the 3.0.0. Our experiments were run on a computer that was effectiveness obtained using the context-aware link specifi- equipped with a Intel Core i7 2.8 GHz CPU and 16 GB cation that extends the previous link specification, adding RAM, running on Mac OS 10.9.5 (64-bits). a link specification composed by Jaro over dblp:title and In section 5.1, we present our first scenario, DBLP-NSF, nsf:title, and using the best threshold for the dblp:Author we describe its characteristics, the relationships between the and nsf:Researcher link specification. The overlap factor for datasets and how we built them. Section 5.2 follows the the link specifications between the publications is exists and same structure, in which we present our second scenario for the link specification between persons is for all. DBLP-DBLP. The results in Figure 4(a) shows how the effectiveness of the link specification is better if the threshold acceptance is higher, although it never reaches the best precision or 5.1 NSF-DBLP scenario F-Measure of 1.0, it always obtains a recall of 1.00. Re- In this scenario, we have 188 owl:sameAs links between db- call never changes because every dblp:Author that should lp:Author and nsf:Researcher instances, which we consider our be linked with a nsf:Researcher has exactly the same name, gold standard. All of them relate authors and researchers hence, if the threshold is low, the string metric generates with the same name and publications in common. Between false positives but always recognizes pairs of instances with the datasets, we have 57 pair of dblp:Author and nsf:Resear- LS: Jaro(dblp : name, dblp : name) ≥ threshold. CALS: for all Jaro(dblp:title, dblp:title) ≥ threshold. 1 1 0.9 0.9 0.8 0.8 0.7 0.7 Effectiveness Effectiveness 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 Precision Precision 0.2 Recall 0.2 Recall 0.1 F-Measure 0.1 F-Measure 0.7 0.75 0.8 0.85 0.9 0.95 1 0.7 0.75 0.8 0.85 0.9 0.95 1 Threshold Thresholds (a) Link specification (b) Context-Aware link specfication Figure 5: Effectiveness results when specifications are given by an expert in DBLP-DBLP the same name (covering all the correct links). If the thresh- amples, Genlink generated LSN1 and LSN5 ; both have a Jac- old is high, the precision improves by pruning these false card distance ≤ 0.37 over the literals of the data properties positives. dblp:name and nsf:name. Using 10 examples, GenLink gen- In Figure 4(b), the context-aware link specification ob- erated LSN10 , which has a Jaccard distance ≤ 0.21 for the tains a precision that improves when the acceptance thresh- same data properties. On the other hand, the link specifi- old is higher, however, the recall decreases for values higher cations between dblp:Article and nsf:Paper using 1 example than 0.83 of acceptance threshold. The context-aware link was LST1 , it has a Levenshtein distance ≤ 29.48 over db- specification reaches 1.00 in precision and recall for thresh- lp:title and nsf:title, using 5 examples GenLink generated olds in the range of 0.80-0.83. Recall drops when the thresh- LST5 , which has a Jaccard distance ≤ 0.59 over the same old is higher because this time we are comparing the names data properties and, finally, using 10 examples GenLink gen- of the authors, and also the titles of their publications, which erated LST10 , it has a Levenshtein distance ≤ 7.05 over the are written slightly different, e.g., SmartSaver turning flash same data properties. drive into a disk energy saver for mobile computers and We analyzed the effectiveness of dblp:Author and nsf:Re- “SmartSaver: turning flash drive into a disk energy saver searcher link specification, and the context-aware link spec- for mobilecomputers”. As result, an exact string matching ification for the same classes. The results in Table 3 shows would not recognize them as the same. Due to this issue, re- how, when we took context into account, precision improved call drops for higher thresholds. On the contrary, precision by 0.18 (1 example), 0.21 (examples) and 0.24 (10 examples). improves when the threshold is higher, it mainly generates However, recall dropped by 0.05 in the context-aware link false negatives but the instances linked are always correct. specification made by 10 examples because the acceptance threshold was restrictive enough to not recognize titles writ- LS for DBLP-NSF ten slightly different, as we explained before. LS P R F LSN1 0.76 1.00 0.86 5.2 DBLP-DBLP scenario LSN5 0.76 1.00 0.86 LSN10 0.76 1.00 0.86 This scenario has 62 owl:sameAs links between the source and target datasets. Both contains dblp:Author instances CALS for DBLP-NSF with similar names and aliases, which are different enough LS and their overlap factors P R F to produce false positives using comparators with low ac- for all LSN1 and exists LST1 0.94 1.00 0.97 ceptance thresholds, and false negatives with high accep- for all LSN5 and exists LST5 0.97 1.00 0.99 tance thresholds. This scenario was built using the same for all LSN10 and exists LST10 1.00 0.95 0.98 authors and publications in the previous scenario. We took CALS Best improvement 0.24 - 0.13 the whole list filtered by authors with aliases (like “H. V. Jagadish” and “Hosagrahar Visvesvaraya Jagadish”), then, Table 3: GenLink results for dblp:Author and nsf:Researcher we split the instances in two datasets, each of which were link specifications and context-aware link specification obtained by using a different alias for the same person. The data model of the source and target datasets is depicted in Table 3 shows the results obtained using GenLink to gen- Figure 2(a). Both datasets contain 58 dblp:Author instances erate several link specifications between the classes of pre- and their publications, which are 5284 dblp:Article in total. vious experiments. We used different number of examples We conducted similar experiments in this scenario as previ- to generate the links specifications. On one hand, for the ously. instances of dblp:Author and nsf:Researcher with 1 and 5 ex- Figure 5(a) shows the results obtained for the link spec- ification given by the expert, which relates by means of a amples it generated LST10 that relates the different dblp:title Jaro distance the dblp:name of dblp:Author instances from by means of a Levenshtein distance ≤ 1.76. the source and target dataset, then, we obtain the precision, We analyzed the effectiveness for the source and target recall and F-Measure for each possible threshold acceptance dblp:Author instances using the link specification, and then, value. Figure 5(b) shows the results for a context-aware link the context-aware link specification. The results in Table specification that uses a link specification which, by means 4 show how, when we take context into account, precision of a Jaro distance, relates the dblp:title from the source and does not change but recall improves by 0.58 (1 example), target dblp:Article instances. The overlap factor for the link 0.54 (5 examples) and 0.58 (10 examples); which entails an specification is for all. improvement in the F-Measure of 0.46 (1 example), 0.45 (5 The results of Figure 5(a) shows that the best F-Measure examples) and 0.46 (10 examples). results are obtained for thresholds values between 0.72 and 0.77; however, it never obtains a F-Measure of 1.00. For higher thresholds, recall decreases while precision increases, 6. CONCLUSION AND FUTURE WORK this tendency is inverted for lower thresholds. Due to au- In the literature, there are several techniques that gen- thors’ aliases, recall behaves in the same way of Figure 4(b) erate link specifications to perform a link discovery task; with publication titles; if the threshold is higher, the link however, none of them is able to exploit context informa- specification does not recognize as the same some aliases, tion. In this paper, we present a proposal to extend the e.g., “H.V. Jagadish” and “Hosagrahar V. Jagadish”. On the definition of link specification by means of the concept of contrary, when the threshold is higher, precision improves overlap factor, which let us exploit context information and because the linked instances have similar names. define context-aware link specifications. Additionally, we Figure 5(b) shows that the context-aware link specifica- have identified two real-world scenarios where the context tions always obtain a precision of 1.00, recall and F-Measure is crucial and where, the current techniques, are not able to increases when the threshold is higher, achieving 1.00. This obtain the best effectiveness without taking the context into situation is the same as Figure 4(a), the titles of the publica- account. tions in each dataset have exactly the same literal, therefore, Our experimental results prove how context-aware link when the threshold is higher, the recall improves. Precision specifications obtain a better effectiveness in comparison is always 1.00 because the CALS of this example only links with regular link specifications in our scenarios. We ob- two instances of dblp:Author if all their publications are ex- tained an improvement of 23% in precision and 58% in recall, actly the same, due to the for all restriction. If just one respectively. publication is not linked, then their authors are also not In future work, we plan to develop a technique to nav- linked; therefore, if a link is actually generated, it is always igate through context information of instances by not us- correct. ing all of their object properties, and selecting only those more suitable to build effective context-aware link specifi- LS for DBLP-DBLP cations. Additionally, we plan to add more metrics to cal- LS P R F culate the overlap factor extending our current for all and LSN1 1.00 0.26 0.45 exists restrictions. Finally, this paper is focused on gener- LSN5 1.00 0.30 0.46 ating owl:sameAs links, but an interesting extension of our LSN10 1.00 0.26 0.45 work is the generation of other kind of links in an automatic CALS for DBLP-DBLP way, depending on the results of the overlap factor. LS and their overlap factors P R F for all LST1 1.00 0.84 0.91 for all LST5 1.00 0.84 0.91 Acknowledgements for all LST10 1.00 0.84 0.91 Supported by the Spanish R&D&I program under grant CALS Best improvement - 0.58 0.46 TIN2013-40848-R. Table 4: GenLink results for source and target dblp:Author 7. REFERENCES link specification and context-aware link specification [1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data: Table 4 shows the link specifications generated by Gen- Principles and state of the art. In WWW, pages 1–40, Link for the same classes of the previous experiments. On 2008. one hand, for the source and target dblp:Author instances, [2] C. Bizer, T. Heath, and T. Berners-Lee. Linked with 1 example, Genlink generated LSN1 that relates source Data-the story so far. Int. J. Semantic Web Inf. Syst. and target dblp:name data properties by means of a Jaccard 5(3), pages 205–227, 2009. distance ≤ 0.15, with 5 examples generated LSN5 that re- [3] M. G. Carvalho, A. H. Laender, M. A. Gonçalves, and lates the same data properties by means of a Levenshtein A. S. da Silva. Replica identification using genetic distance ≤ 1.48 and, finally, with 10 examples it generated programming. In SAC, pages 1801–1806, 2008. LSN10 , which relates the same data properties by means of [4] A. Cimmino, C. R. Rivero, and D.Ruiz. Research a Levenshtein distance ≤ 1.15. On the other hand, for the prototype, repositories and experimental results. URL source and target dblp:Article instances, with 1 example Gen- http://www.tdg-seville.info/acimmino/Cals, 2016. Link generated LST1 that relates source and target dblp:title [5] M. G. de Carvalho, A. H. F. Laender, M. A. by means of a Levenshtein distance ≤ 1.76, with 5 examples Gonçalves, and A. S. da Silva. A genetic programming it generated LST5 that relates the same data properties by approach to record deduplication. IEEE Trans. means of a Levenshtein distance ≤ 1.46, finally, with 10 ex- Knowl. Data Eng., 24(3):399–412, 2012. [6] I. Ermilov, M. Martin, J. Lehmann, and S. Auer. Linked Open Data Statistics: Collection and Exploitation. In KESW, pages 242–249. 2013. [7] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs isn’t the same: An analysis of identity in Linked Data. In ISWC, pages 305–320. 2010. [8] O. Hassanzadeh, K. Q. Pu, S. H. Yeganeh, R. J. Miller, L. Popa, M. A. Hernández, and H. Ho. Discovering linkage points over web data. PVLDB, 6(6):444–456, 2013. [9] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. [10] M. Holub, O. Proksa, and M. Bieliková. Detecting identical entities in the Semantic Web Data. In SOFSEM, pages 519–530. 2015. [11] R. Isele and C. Bizer. Learning expressive linkage rules using genetic programming. PVLDB, 5(11):1638–1649, 2012. [12] R. Isele and C. Bizer. Active learning of expressive linkage rules using genetic programming. J. Web Sem., 23:2–15, 2013. [13] R. Isele, A. Jentzsch, and C. Bizer. Efficient multidimensional blocking for link discovery without losing recall. In ACM SIGMOD workshops, 2011. [14] M. Nentwig, M. Hartung, A.-C. N. Ngomo, and E. Rahm. A survey of current Link Discovery frameworks. Web Sem. J., pages 1–18, 2015. [15] A.-C. N. Ngomo and S. Auer. LIMES: A time-efficient approach for large-scale Link Discovery on the Web of Data. In IJCAI, pages 2312–2317, 2011. [16] A.-C. N. Ngomo, J. Lehmann, S. Auer, and K. Höffner. RAVEN - Active learning of link specifications. In ISWC workshops, pages 25–37, 2011. [17] A.-C. N. Ngomo and K. Lyko. EAGLE: Efficient active learning of link specifications using genetic programming. In ESWC, pages 149–163. 2012. [18] A.-C. N. Ngomo and K. Lyko. Unsupervised learning of link specifications: deterministic vs. non-deterministic. In ISWC workshops, pages 25–36, 2013. [19] A. Nikolov, M. d’Aquin, and E. Motta. Unsupervised learning of Link Discovery configuration. In ESWC, pages 119–133. 2012. [20] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. Exchanging data amongst linked data applications. Knowl. Inf. Syst., 37(3):693–729, 2013. [21] D. Song and J. Heflin. Automatically generating data linkages using a domain-independent candidate selection approach. In ISWC, pages 649–664. 2011. [22] T. Soru and A.-C. N. Ngomo. A comparison of supervised learning classifiers for Link Discovery. In SEM, pages 41–44, 2014. [23] F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: Probabilistic alignment of relations, instances, and schema. PVLDB, 5(3):157–168, 2011.