Interlinking: Performance Assessment of User Evaluation vs. Supervised Learning Approaches

Mofeed Hassan, Jens Lehmann, Axel-Cyrille Ngonga Ngomo
Department of Computer Science, AKSW Research Group, University of Leipzig
mounir@informatik.uni-leipzig.de, lehmann@informatik.uni-leipzig.de, ngonga@informatik.uni-leipzig.de

Copyright is held by the author/owner(s). WWW2015 Workshop: Linked Data on the Web (LDOW2015).

ABSTRACT
Interlinking knowledge bases is widely recognized as an important but challenging problem. A significant amount of research has been undertaken to provide solutions to this problem with varying degrees of automation and user involvement. In this paper, we present a two-staged experiment for the creation of gold standards that act as benchmarks for several interlinking algorithms. In the first stage the gold standards are generated through a manual validation process, highlighting the role of users. Using the gold standards obtained from this stage, we assess the performance of human evaluators in addition to supervised interlinking algorithms. We evaluate our approach on several data interlinking tasks with respect to precision, recall and F-measure. Additionally, we perform a qualitative analysis of the types of errors made by humans and machines.

Categories and Subject Descriptors
H.4 [Link Discovery]

General Terms
Data Integration, Enterprise Linked Data

Keywords
Interlinking, Link validation, Gold standard, Manual validation, Performance evaluation

1. INTRODUCTION
Over the last years, the number of knowledge bases published on the Web of Data has grown considerably. According to statistics gathered at the beginning of 2015, the number of published knowledge bases surpassed 3800, providing over 88 billion triples (http://stats.lod2.eu). In spite of the large number of knowledge bases, the links among them are relatively few, with more than 500 million in 2011 (http://lod-cloud.net/state/), and they have very different qualities [8]. Creating high-quality links across the knowledge bases on the Web of Data thus remains a task of central importance to empower manifold applications on the Web of Data, including federated query processing and cross-ontology question answering. Many algorithms have been proposed and implemented in different interlinking tools to address this task [17, 20, 12, 13]. While these approaches vary w.r.t. several aspects, one of the most important is the degree of user involvement. In [20, 16] interlinking tools are categorized based on the degree of automation, which regulates the amount of user involvement at different levels [17]. In general, the most costly aspect of user involvement in interlinking is the validation of links, also dubbed manual link validation. This is the process where a user, i.e. a validator or evaluator, specifies whether a link generated by an interlinking tool is correct or incorrect. In frameworks which implement active batch learning to determine links (for example LIMES [9] and SILK [3]), the results of the link validation process are reused to learn presumably better link specifications and thus to generate high-quality links.

While several benchmarks have been made available to measure the performance of existing link discovery systems for the Web of Data, several questions pertaining to this task have remained unanswered so far, such as:

1. Costs of an annotation: The first question pertains to the cost of link discovery. Determining how much it actually costs (w.r.t. time) to validate a link between two knowledge bases enables users to quantify how long it will take them to generate clean links from their knowledge base to other knowledge bases.

2. When should a tool be used: Human annotators are able to detect links between knowledge bases at a small scale. On the other hand, machines need a significant number of examples and clear patterns in the underlying data to be able to detect high-quality links between knowledge bases. Hence, determining the knowledge base sizes at which machines should be used for link discovery is of utmost practical importance.

3. Performance of machine learning on small tasks: While it is well established that machine-learning tools perform well on knowledge bases that contain hundreds of resources or more, many of the knowledge bases on the Web of Data are small and pertain to a dedicated domain. Providing guidelines towards when to use machine-learning algorithms to link these knowledge bases to other knowledge bases can improve the effectiveness of linking on the Web of Data.

Consequently, we propose an experiment to investigate the effect of user intervention in dataset interlinking on small knowledge bases. We study the effort needed for manual validation using a quantitative approach. Furthermore, we compare the performance of human validators and supervised interlinking approaches to find a break-even point beyond which machine-learning techniques should be used. Note that we intentionally limit ourselves to small numbers of resources in our experiments as (1) experiments on large numbers of resources have already established that machines perform well and (2) such experiments would be intractable for human users due to the long time and great effort involved.

The core contributions of the paper are:

• An evaluation of the performance of human evaluators on the interlinking task.
• A comparison of human and machine performance on the interlinking task for small knowledge bases.
• A methodology for designing and executing such experiments.
• A gold standard for three small interlinking tasks.
The rest of our paper is organized as follows. In Section 2, a short overview of interlinking approaches and related work is provided. Section 3 describes the experimental approach. The experiment setup and preparation are described in Section 4. In Section 5, the results of our experiment are shown, followed by a discussion of the results in Section 6. Finally, Section 7 presents the conclusion and future work.

2. BACKGROUND
2.1 Interlinking Tools
Due to the strong increase in the number of datasets published on the Web and the rising number of interlinks required among them, many interlinking tools have been proposed based on different algorithms. Several surveys provide comparative studies of these tools and show the major differences among them [20, 16]. Interlinking tools differ in many aspects; two of these aspects are (i) domain dependency and (ii) automation.

With respect to domain dependency, an interlinking tool is classified as domain-dependent when it interlinks two datasets in a specific domain. From the automation perspective, interlinking tools are categorized into (a) automated tools and (b) semi-automated tools, based on the degree of the user's contribution to the interlinking process. In semi-automated tools, user intervention is important for the linking process in different ways, such as setting the link specifications, aligning ontologies, providing positive and negative examples for tools based on supervised learning algorithms, and validating the final generated links.

One of the interlinking tools is RKB-CRS [7]. It is a domain-dependent tool that focuses on the universities and publications domains. RKB-CRS is a semi-automated tool whose process depends on providing URIs via a Java program developed by the user. This is done separately for each dataset to be interlinked. The tool applies string matching to find URI equivalences that can be represented as an owl:sameAs relationship.

Another domain-dependent tool is LD-Mapper [15]. It focuses on datasets in the music domain. It provides an approach that depends on string similarity and also considers the neighbour similarity of the resources.

Knofuss [14] is an automatic and domain-independent tool that focuses on merging two datasets, each described by an ontology. An alignment ontology is also given by the user in case of ontology heterogeneity. The tool has two contexts: (i) an application context which is provided by the datasets' ontologies and (ii) an object context model that points out which properties are needed for the matching process. Matching is performed through string matching and adaptive learning techniques. Knofuss operates on local copies of the datasets.

An example of a semi-automated tool that works on local copies of the datasets is RDF-AI [18]. It consists of five linking phases: (i) preprocessing, (ii) matching, (iii) fusion, (iv) interlinking and (v) post-processing, and each phase is described by an XML file. The input includes the alignment method and the dataset structure. The utilized matching techniques are string matching and word relation matching. RDF-AI provides a merged dataset or an entity correspondence list as output.

SILK [3] is a semi-automated and domain-independent tool. Unlike the aforementioned tools, it works on the datasets through SPARQL endpoints. The user specifies the linking process parameters using a declarative language dubbed the Silk Link Specification Language (Silk-SLS). Using Silk-SLS allows the user to focus on a specific type of resources. SILK supports the use of different matching techniques such as string matching, date similarities and numerical similarities. Operators like MAX, AVG and MIN combine more than one similarity metric. Links are generated if the similarity of two resources exceeds a previously specified threshold.

LIMES [9] is an interlinking tool belonging to the same category as SILK, being semi-automated and domain-independent. It works as a framework for multiple interlinking algorithms, both unsupervised and supervised learning algorithms. For the unsupervised algorithms the user provides link specifications. The link specifications define the classes, properties and metrics used for interlinking. In addition, different supervised algorithms are implemented by applying genetic learning combined with active learning approaches. The target of these algorithms is to find the best classification of candidate links using the minimum amount of training data. Minimizing the training data is achieved by finding the most informative examples to be reviewed (labelled) by an oracle. Examples of these algorithms are EAGLE, RAVEN, COALA and EUCLID [11, 12, 13].
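To make the active-learning idea above more concrete, the following minimal sketch illustrates one common selection heuristic: label the candidate links whose similarity scores lie closest to the current acceptance threshold (i.e. the most uncertain ones) and use the oracle's answers to re-estimate the threshold. This is only an illustration of the general principle under our own hypothetical names; it is not the genetic-programming procedure implemented by EAGLE or COALA.

```python
# Illustrative active-learning loop for tuning a link-acceptance threshold.
# Simplified sketch of the general idea (query the most uncertain candidates,
# let an oracle label them, update the model); not the EAGLE/COALA algorithms.

def most_uncertain(candidates, threshold, k):
    """Pick the k candidate links whose scores are closest to the threshold."""
    return sorted(candidates, key=lambda c: abs(c["score"] - threshold))[:k]

def learn_threshold(candidates, oracle, rounds=5, k=4, threshold=0.5):
    labelled = []
    for _ in range(rounds):
        batch = most_uncertain([c for c in candidates if c not in labelled],
                               threshold, k)
        for c in batch:
            c["correct"] = oracle(c)          # human validator decision
            labelled.append(c)
        pos = [c["score"] for c in labelled if c["correct"]]
        neg = [c["score"] for c in labelled if not c["correct"]]
        if pos and neg:
            # place the threshold between the hardest positive and negative
            threshold = (min(pos) + max(neg)) / 2.0
    return threshold, labelled

if __name__ == "__main__":
    # toy candidate links with precomputed similarity scores
    cands = [{"source": f"s{i}", "target": f"t{i}", "score": s}
             for i, s in enumerate([0.95, 0.91, 0.72, 0.55, 0.48, 0.33, 0.12])]
    oracle = lambda c: c["score"] > 0.5        # stand-in for a human validator
    t, lab = learn_threshold(cands, oracle)
    print(f"learned threshold: {t:.2f}, labelled examples: {len(lab)}")
```

The point of the sketch is only that fewer oracle decisions are needed when the queried examples are the most informative ones, which is the property the supervised LIMES algorithms exploit.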
RAVEN and EAGLE [10, 11] are two interlinking algorithms that combine active learning and genetic programming with supervised learning. As the authors state, these algorithms implement time-efficient matching techniques to reduce the number of comparisons between instance pairs. COALA [12] is combined with EAGLE to make the intra- and inter-correlation between learning examples available to the learning algorithm.

LIMES was used in our work for several reasons. One reason is its simplicity: it uses a simple configuration file to perform interlinking with any of the contained algorithms, and it supports working on SPARQL endpoints or dump files. The implemented algorithms are another strength of LIMES, as it supplied our work with different interlinking algorithms in the same pool. Three algorithms, EAGLE, COALA and EUCLID, are used in our work and dubbed Genetic Active Learning (GAL), Genetic Active Learning with Correlation (GCAL) and Genetic Batch Learning (GBL), respectively.
2.2 Manual Link Validation
Validating the generated links yields two beneficial outputs. First, it provides positive and negative examples for supervised learning algorithms. Second, it creates gold standards to be used for assessing tools and reviewers on other similar linking tasks. LATC (http://latc-project.eu/) is one of the efforts in generating reviewed link samples to achieve these two benefits.

In [6] the authors state that there is a need for more work on generating benchmarks for interlinking algorithms. A summary of different benchmarking approaches is provided, exposing their strengths and weaknesses. One of these approaches is the Ontology Alignment Evaluation Initiative (OAEI). It provides two tracks for evaluating ontology matching algorithms and instance matching algorithms, using common benchmarks in the evaluations. Other approaches that rendered benchmarks are Yatskevich et al. [21], Alexe et al. [1], and SWING [6]. According to [4], the criticism of many generated benchmarks rests on three basic points:

• Using real data
• Benchmark generation flexibility
• Scalability and correctness

Recently, the role of crowdsourcing in link validation and gold standard generation has increased. Crowdsourcing is a new trend for user involvement in different data publishing and linking phases. In [19] an analytical study about interlinking and user intervention is presented. It analyses which phases of the interlinking process are amenable to crowdsourcing. A general architecture was proposed to integrate interlinking frameworks with crowdsourcing (Amazon Mechanical Turk, MTurk) to enhance the interlinking process, including link validation.

In [5] a case study was introduced to find out the problems that users face in ontology matching. This study is one of the few observational studies about user interactions with one of the phases of the linking process. It focuses on the cognitive process performed by the users to find mappings.

To the best of our knowledge there is no such observational study about the problems users face while validating dataset interlinks, and no quantifying experiment that measures the effort spent by users in validating links and generating gold standards.

3. EXPERIMENTAL APPROACH
Based on the motivations explained above, we designed a two-stage experiment. The first stage consists of two steps. The first step is performing interlinking between different datasets using an unsupervised learning algorithm. These datasets represent different domains. In the second step, the resulting links are manually validated by human validators. The validators perform this step first individually and then in a group, where unsure decisions about links are reviewed by all validators. The resulting links are then considered to be a gold standard for their interlinking tasks. Later, we discuss problems in manual link validation by single evaluators and groups.

In the second stage, different supervised algorithms are applied to the same interlinking tasks. Using the gold standard generated in the first stage of the experiment as a benchmark, the performance of the used approaches can be compared to each other and to humans.

In our evaluation we use real data for generating benchmarks. They are generated from actual data forming three interlinking tasks in different domains. A size limitation was enforced to ease the validation process. We use this benchmark to evaluate different interlinking algorithms in terms of precision, recall and F-measure as assessment measurements [2].

An additional qualitative analysis is performed to detect the common problems faced by humans during the link validation process. This analysis focuses on the problems reported by the validators which affect their judgement quality and the difficulty of the process.
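For completeness, the short sketch below shows how precision, recall and F-measure are computed when a set of generated links is compared against a gold standard. This is the standard definition, given here only as an illustration; the link values in the example are hypothetical.

```python
# Precision, recall and F-measure of a set of generated links against a
# gold standard. Links are modelled as (source URI, target URI) pairs.

def precision_recall_f1(generated, gold):
    generated, gold = set(generated), set(gold)
    true_positives = len(generated & gold)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

if __name__ == "__main__":
    gold = {("dbpedia:Aachen", "lgd:node1"), ("dbpedia:Aarhus", "lgd:node2")}
    generated = {("dbpedia:Aachen", "lgd:node1"), ("dbpedia:Aalborg", "lgd:node3")}
    p, r, f = precision_recall_f1(generated, gold)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```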
4. EXPERIMENTAL SETUP
In our experiment, we applied linking to six different datasets using LIMES [9]. These datasets form three different linking tasks. Each task corresponds to a specific domain that differs in nature from the other tasks and varies in familiarity to the reviewers. This affects the correctness of the reviewers' decisions and their effort in different ways. The tasks are:

• Task 1 represents the geographic domain. In this domain, the basic informative properties are quite specific, as many locations are described by specific geometry measures.
In this task, links between the DBpedia and LinkedGeoData datasets needed to be set. Both datasets contain geographic information, for example latitude and longitude, for locations such as cities, states, and countries. We restricted our linking to cities whose labels start with the letter 'A'. This confinement of the labels was made to obtain a reasonable number of links for the evaluators in the first stage of our experiment and to simplify calculations in the second stage. It also provides a random sample of interlinks with the ability to tune the retrieved number of online instances.
The label, latitude, and longitude properties are selected for applying similarity metrics. The similarity metrics used are trigrams and Euclidean distance in a compound function. The compound function combines atomic metrics such as trigrams, Euclidean and Levenshtein using metric operators such as MIN or MAX (an illustrative sketch of such a compound function is given after the task descriptions below). Table 1 shows the basic information of this linking task, where 'a' represents the 'rdf:type' property.

|                       | DBpedia                                        | LinkedGeoData                     |
| Restrictions          | a dbpedia-owl:City; rdfs:label starts with 'A' | a lgdo:City                       |
| Similarity properties | rdfs:label, wgs84:lat, wgs84:long              | rdfs:label, wgs84:lat, wgs84:long |
| Similarity metrics    | trigrams, euclidean                            |                                   |
Table 1: Link specification for task 1

• Task 2 represents the movies domain. This domain is very interesting as it contains some tricky information, such as the name of a movie. Within a movie series, it can be confusing for the validator to give a decision, as the names of the movies are close to each other and the movies have the same actors and even the same director. This requires additional information such as the movie's date, which is not always available.
In the second task, we performed linking on the DBpedia and LinkedMDB datasets, which contain information concerning movies. Both have large amounts of information on movies, such as their names, directors, release dates etc. The triples are restricted to movies with release dates from the year 1990 onwards. This provides a reasonable number of links.
The similarity function applied for linking is a compound function using the trigrams metric. It uses properties such as label, director and release date. Table 2 shows the basic information of this linking task, where 'a' represents the 'rdf:type' property.

|                       | DBpedia                      | LinkedMDB                                             |
| Restrictions          | a dbpedia-owl:Film           | a linkedmdb:film; initial release date > "1990-01-01" |
| Similarity properties | label, director, releaseDate | label, director, initial release date                 |
| Similarity metrics    | trigrams                     |                                                       |
Table 2: Link specification for task 2

• Task 3 represents the drugs domain. Reviewers have to check chemical and medical information about drugs.
The third task generated links between the DBpedia and DrugBank datasets. We selected drugs with names starting with the letter 'A'. Further, a compound similarity function involving the Levenshtein similarity metric is used. This function utilizes the label property. Table 3 shows the basic information of this linking task, where 'a' represents the 'rdf:type' property.

|                       | DBpedia                                        | DrugBank                     |
| Restrictions          | a dbpedia-owl:Drug; rdfs:label starts with 'A' | a drug:drugs                 |
| Similarity properties | rdfs:label, rdfs:label                         | rdfs:label, drug:genericName |
| Similarity metrics    | levenshtein                                    |                              |
Table 3: Link specification for task 3
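To illustrate the kind of compound similarity function referred to in the link specifications above, the following sketch combines a character-trigram similarity, a normalized Levenshtein similarity and a coordinate-based Euclidean proximity using MAX/MIN operators. It is a simplified stand-alone illustration under our own naming, not the implementation used inside LIMES.

```python
# Illustrative compound similarity in the spirit of the link specifications
# above: string similarity on labels plus geographic proximity, combined
# with MAX/MIN operators. Not the LIMES implementation.
import math

def trigram_similarity(a, b):
    """Dice coefficient over character trigrams."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} if len(s) > 2 else {s}
    ga, gb = grams(a.lower()), grams(b.lower())
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def levenshtein_similarity(a, b):
    """1 - normalized edit distance, computed with the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b), 1)

def euclidean_proximity(p1, p2):
    """Map the Euclidean distance between (lat, long) pairs into (0, 1]."""
    return 1 / (1 + math.dist(p1, p2))

def compound_similarity(src, tgt):
    label_sim = max(trigram_similarity(src["label"], tgt["label"]),
                    levenshtein_similarity(src["label"], tgt["label"]))
    geo_sim = euclidean_proximity((src["lat"], src["long"]), (tgt["lat"], tgt["long"]))
    return min(label_sim, geo_sim)   # both label and position must agree

if __name__ == "__main__":
    city_a = {"label": "Aachen", "lat": 50.78, "long": 6.08}
    city_b = {"label": "Aachen", "lat": 50.77, "long": 6.09}
    print(f"similarity: {compound_similarity(city_a, city_b):.3f}")
```

A candidate pair would then be accepted as a link whenever such a compound score exceeds the acceptance threshold of the specification.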
The aim of the second stage is to investigate whether using machine learning approaches for linking can outperform humans. Our experiment achieves this aim by using three different supervised learning algorithms, EAGLE, COALA and EUCLID [11, 12, 13]. The three algorithms are all implemented in the LIMES framework [9]. The interlinking approaches are given different percentages of positive and negative examples for each single task. The examples are provided in increasing percentages of 10%, 33% and 50% of the total examples resulting from the first stage of each task. As these examples play the role of the oracle in the supervised learning approaches, the increasing percentages should enhance the algorithms' performance and make it converge towards a score either above or close to single-human performance. The three approaches operate on the same task specifications as in the first stage and on the same datasets.

Link evaluation is done using an evaluation tool with a graphical user interface dubbed Evalink (see Figure 1). The reviewer specifies the endpoints where the triples of the link sources and targets are available. The tool enables the evaluators to load the links to be reviewed and retrieves their property information from the specified endpoints. The reviewer can check the correlated property values and give a decision: 'Correct', 'Incorrect' or 'Unsure'. The time taken for each decision is stored in milliseconds. The source code is available at https://github.com/AKSW/Evalink.

Figure 1: Evalink tool to evaluate links. The user selects the task to be evaluated and specifies the proper endpoints to access the triples. The URIs of the task are loaded sequentially and their retrieved information is displayed. By selecting a property in the source dataset, the corresponding property is highlighted on the target dataset's side. By pressing the corresponding button, the decision for the link is specified as either "Correct", "Incorrect", or "Unsure".

5. RESULTS
Based on the previously described specifications, the experiment was carried out in two stages. The first stage aims to generate a gold standard for each task. A set of five independent reviewers evaluated the generated links, and each gold standard decision was based on a minimum of four out of five reviewers agreeing on a decision for each link. In order to express the total effort needed by a reviewer to provide a gold standard, we considered the time needed to decide whether a link is correct or incorrect as a measure. This time is measured in milliseconds. In the experiment, the average times for each task to be evaluated by the users are as follows: 18773764 milliseconds for task 1; 16628607 milliseconds for task 2; and 18777477 milliseconds for task 3, as shown in Table 4. Overall, approximately 15 hours of evaluation effort went into our experiment per participant.

|              | Task 1   | Task 2   | Task 3   |
| Average time | 18773764 | 16628607 | 18777477 |
Table 4: Average times of the evaluation processes for each task (in milliseconds)
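The gold standard decision rule described above (a link is accepted or rejected only when at least four of the five validators agree, and otherwise reviewed jointly) can be summarized by the following small sketch; the decision labels and example data are hypothetical.

```python
# Aggregating per-validator decisions into a gold standard using the
# "at least four out of five agree" rule described above.
from collections import Counter

def gold_decision(votes, min_agreement=4):
    """votes: list of 'correct' / 'incorrect' / 'unsure' decisions."""
    counts = Counter(v for v in votes if v != "unsure")
    if counts and counts.most_common(1)[0][1] >= min_agreement:
        return counts.most_common(1)[0][0]
    return "needs group review"   # unresolved links are reviewed by all validators

if __name__ == "__main__":
    print(gold_decision(["correct"] * 4 + ["incorrect"]))          # -> correct
    print(gold_decision(["correct", "correct", "incorrect",
                         "incorrect", "unsure"]))                  # -> needs group review
```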
A more detailed way to express the effort provided by a user is the average time a user needs to evaluate a single link within a single task. Table 5 shows the measured average times for each task. It is evident that there are significant differences between users and that, overall, the evaluation of a large number of links is a time-consuming process.

|        | Task 1 | Task 2 | Task 3 |
| User 1 | 36803  | 22974  | 10223  |
| User 2 | 21465  | 18821  | 20358  |
| User 3 | 12299  | 39363  | 9802   |
| User 4 | 10922  | 11329  | 34553  |
| User 5 | 38853  | 43811  | 44664  |
Table 5: Average times for evaluating a single link within a task by each user (in milliseconds)

An assessment of the link evaluations performed by the users was carried out. Out of 535 links, 502 links received certain (though possibly differing) decisions from every single user. 32 links did not have enough information to reach a decision and were marked as unsure links. Gold standards are created by giving each link a final decision using inter-rater agreement. The assessment was done by comparing each user's evaluation for each task to the gold standard. Only a small number of decisions made by users were incorrect compared to the gold standard. Details of the assessment are given in Table 6 and Table 7.

|        | Correct | Incorrect | Total |
| User 1 | 496     | 6         | 502   |
| User 2 | 492     | 10        | 502   |
| User 3 | 481     | 21        | 502   |
| User 4 | 487     | 15        | 502   |
| User 5 | 481     | 21        | 502   |
Table 6: Assessment of the users' evaluations over the total evaluation

| Task     | Task 1                       | Task 2                       | Task 3                       |
| Measures | Precision  Recall  F-Measure | Precision  Recall  F-Measure | Precision  Recall  F-Measure |
| User 1   | 0.81       0.98    0.89      | 0.98       1       0.99      | 0.97       0.99    0.98      |
| User 2   | 0.83       1       0.91      | 0.93       0.94    0.93      | 0.96       0.98    0.97      |
| User 3   | 0.74       0.9     0.81      | 0.97       0.98    0.98      | 0.94       0.96    0.95      |
| User 4   | 0.81       0.98    0.88      | 0.95       0.97    0.96      | 0.93       0.95    0.94      |
| User 5   | 0.82       0.99    0.9       | 0.91       0.93    0.92      | 0.91       0.93    0.92      |
Table 7: Precision, Recall and F-Measure results achieved by every user in each task

The second stage of our experiment performed linking between the specified datasets using different supervised learning algorithms and assessed their performance against the generated gold standards in terms of precision, recall and F-measure. LIMES [9] is an interlinking tool that also serves as a framework with different implemented interlinking algorithms and learning approaches. EAGLE, COALA and EUCLID are used to produce sets of interlinks that are compared to the gold standard as described above. The resulting comparisons are shown in Table 8 (precision), Table 9 (recall) and Table 10 (F-measure).

| Task        | Task 1            | Task 2             | Task 3             |
| Training    | 10%   33%   50%   | 10%   33%   50%    | 10%    33%   50%   |
| GAL         | 0.12  0.32  0.8   | 0.63  0.33  0.32   | 0.078  0.47  0.79  |
| GCAL        | 0.81  0.76  0.8   | 0.69  0.27  0.056  | 0.88   0.54  0.88  |
| GBL         | 0.04  0.77  0.4   | 0.8   0.007 0.047  | 0.13   0.29  0.29  |
Table 8: Precision results of the supervised learning approaches (GAL, GCAL and GBL); for each, performance is measured for different percentages of training data (10%, 33% and 50%).

| Task        | Task 1            | Task 2             | Task 3             |
| Training    | 10%   33%   50%   | 10%   33%   50%    | 10%    33%   50%   |
| GAL         | 0.398 0.35  0.35  | 0.28  0.71  0.92   | 0.17   0.5   0.53  |
| GCAL        | 0.35  0.29  0.37  | 0.19  0.82  0.85   | 0.23   0.53  0.69  |
| GBL         | 0.24  0.27  0.18  | 0.07  0.43  0.79   | 0.43   0.74  0.98  |
Table 9: Recall results of the supervised learning approaches (GAL, GCAL and GBL); for each, performance is measured for different percentages of training data (10%, 33% and 50%).

| Task        | Task 1            | Task 2             | Task 3             |
| Training    | 10%   33%   50%   | 10%   33%   50%    | 10%    33%   50%   |
| GAL         | 0.4   0.3   0.5   | 0.4   0.4   0.5    | 0.1    0.5   0.6   |
| GCAL        | 0.49  0.43  0.5   | 0.31  0.4   0.1    | 0.37   0.53  0.78  |
| GBL         | 0.1   0.4   0.2   | 0.1   0     0.1    | 0.2    0.4   0.5   |
Table 10: F-Measure results of the supervised learning approaches (GAL, GCAL and GBL); for each, performance is measured for different percentages of training data (10%, 33% and 50%).

The cost of achieving a specific F-measure, w.r.t. the percentage of training data, is calculated in terms of time. Using the average times needed to validate a link in each task, the times for the different percentages are calculated. Figures 2, 3 and 4 plot the F-measure against the afforded cost in minutes for the three tasks. The figures show the overall supremacy of GCAL over the other approaches and even over the human performance. GBL shows the worst behaviour among the supervised learning approaches. Task 3 was the least costly one, which is explained by the high F-measure values achieved by all algorithms.

Figure 2: F-Measure results relative to the learning cost of each approach in terms of time (Task 1). [Plot of F-Measure vs. cost in minutes for GAL, GCAL, GBL and the human baseline.]
Figure 3: F-Measure results relative to the learning cost of each approach in terms of time (Task 2). [Plot of F-Measure vs. cost in minutes for GAL, GCAL, GBL and the human baseline.]
Figure 4: F-Measure results relative to the learning cost of each approach in terms of time (Task 3). [Plot of F-Measure vs. cost in minutes for GAL, GCAL, GBL and the human baseline.]
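As a rough illustration of how the cost axis of Figures 2-4 can be derived, the sketch below converts a training-data percentage into labelling time using the per-link validation times of Table 5. The per-link times for task 1 are taken from Table 5; the number of links per task is a hypothetical placeholder (the paper only reports 535 links in total), and the calculation is our own simplified reading of the procedure rather than the exact one used for the figures.

```python
# Rough illustration of the cost axis in Figures 2-4: the time (in minutes)
# needed to label the training portion of a task's links, using the average
# per-link validation times from Table 5 (task 1 column).

TASK1_PER_LINK_MS = [36803, 21465, 12299, 10922, 38853]   # Table 5, task 1

def labelling_cost_minutes(n_links, training_fraction, per_link_ms):
    avg_ms = sum(per_link_ms) / len(per_link_ms)
    return n_links * training_fraction * avg_ms / 60000.0

if __name__ == "__main__":
    n_links = 200                      # hypothetical size of the task-1 link set
    for frac in (0.10, 0.33, 0.50):
        cost = labelling_cost_minutes(n_links, frac, TASK1_PER_LINK_MS)
        print(f"{int(frac * 100)}% training data -> about {cost:.1f} minutes")
```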
6. DISCUSSION
Our discussion of the experiment is divided into two parts. The first part concerns the users' evaluation results and observations. The second part analyzes the performance of the learning algorithms.

The user evaluation aims to generate links as a gold standard to be used as a benchmark for link evaluators and interlinking algorithms. Many observations were recorded while the users performed the evaluation; they reveal major factors that influence the evaluation process. These factors include: (i) entity description availability, which comprises endpoint availability and the amount of available information, (ii) domain familiarity and (iii) information ambiguity.

Endpoint availability was occasionally problematic. As the evaluation process requires querying the endpoints for the triple information, endpoints being down and not working consumed additional time. This imposed the need to cache the appropriate data, which creates an overhead. The overhead was reasonable for these small datasets, but it will increase for large datasets. Still, having an active endpoint remains necessary because the information is continuously updated.

Once the information is available, the second point, concerning its amount, comes into focus. Although the number of links and their related information were relatively small, the manual evaluation was very tedious and exhausting for the users. Supporting the evaluation with the Evalink tool removed unnecessary effort such as loading the link information and aligning it. It also concentrated the whole evaluation effort on the time of making a decision by the user. The manual setup of Evalink generated additional configuration effort, which should be addressed by extending the tool with intelligent property mapping.

The help given by Evalink also had an effect on domain familiarity. With a suitable evaluation tool that maps the related properties between two datasets, domain familiarity did not affect the evaluation. Finding the right properties and comparing their values diminished the difficulties that might arise from evaluating an unfamiliar domain. Information concerning a resource was in some cases ambiguous, thus not allowing a decision to be made, and in other cases too much information was available, which confused the users. As an example from the drugs domain, a URI had plenty of information such as secondaryAccessionNumber and predictedWaterSolubility, which are not crucial for the decision-making process. Both cases caused significant time delays for a subset of the judgements that were made. Filtering the information to avoid unnecessary properties and providing the crucial ones would add great value to the evaluation process in terms of time and decision correctness. With missing information, it is important to create a measure of confidence for the decisions. Building strategies for information integration with other related datasets to cover the absent information can help in this case too. Measuring the time needed to generate the gold standards for each task (Table 4 and Table 6), we find that there were no significant differences among the average times of the tasks. This shows how link validation is improved by the availability and clarity of important properties characterizing the linking task for the validators. It indicates that with trivial domain knowledge and appropriate properties to compare, the users perform the evaluation with an F-measure above 0.8 (Table 7). In these tables we can see that almost all users achieve reasonably high F-measure values in all tasks. The achieved F-measure scores range from 0.81 to 0.99. These values are used in the comparison between user performance and machine (algorithm) performance.
The results of the second stage are presented in Figures 2, 3 and 4. We can see that, in most cases, machine learning algorithms outperform the human in terms of F-measure when considering the cost of providing the training set. GAL, in tasks 1 and 3, performs better than a human for up to 50% of the gold standard as training data. In task 2, on the other hand, although it achieved better results than an average human, its F-measure for lower costs stays almost stable around 0.4, so increasing the labelling effort for the training data provided no significant improvement. Even in those cases where it improved with more training data, its ultimate performance fell short of human performance in the long run. GCAL and GBL both recorded increasing results for task 1 and task 3 with more training data, while performing worst in task 2. GAL and GBL learn by using a portion of the data space. If this portion is a good representative of the data distribution, the performance increases. GCAL considers the correlation between training data examples: it classifies the training data based on the inter- and intra-correlation, which is calculated from similarities between the training examples. We conclude from the results of the three tasks that the links of task 1 and task 3, which formed the training data, are good representatives of the geographic and drugs datasets, while the links of task 2 are randomly distributed and apparently not good representatives of the movies task. We can further infer that, for small datasets, machine learning algorithms outperform humans when well-representative training data is available. If that is not the case, humans perform better in the long run.

7. CONCLUSION
In our experiment, we emphasized the factors affecting the evaluators in their link evaluations. These factors include: (i) endpoint availability, (ii) the amount of available information, (iii) domain familiarity and (iv) information ambiguity. We quantitatively determined the human effort required for interlinking, in terms of time, for different datasets. The experiment showed how much training data is sufficient to act as a representative of the interlinked datasets. It also revealed experimentally that, for small datasets, how well the training data represents the dataset determines whether the machine learning approaches reach, or fall below, the accuracy of human evaluators.
8. REFERENCES
[1] B. Alexe, W. C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230-244, 2008.
[2] S. Araujo, J. Hidders, D. Schwabe, and A. P. de Vries. SERIMI - resource description similarity, RDF instance matching and interlinking. CoRR, abs/1107.1104, 2011.
[3] C. Bizer, J. Volz, G. Kobilarov, and M. Gaedke. Silk - a link discovery framework for the web of data. In 18th International World Wide Web Conference, April 2009.
[4] J. Euzenat, M.-E. Rosoiu, and C. T. dos Santos. Ontology matching benchmarks: Generation, stability, and discriminability. J. Web Sem., 21:30-48, 2013.
[5] S. Falconer and M.-A. Storey. A cognitive support framework for ontology mapping. pages 114-127, 2008.
[6] A. Ferrara, S. Montanelli, J. Noessner, and H. Stuckenschmidt. Benchmarking matching applications on the semantic web. In ESWC (2), pages 108-122, 2011.
[7] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the semantic web. http://eprints.ecs.soton.ac.uk/15614/, 2008.
[8] M. Nentwig, T. Soru, A.-C. N. Ngomo, and E. Rahm. LinkLion: A link repository for the web of data.
[9] A.-C. Ngonga Ngomo and S. Auer. LIMES - a time-efficient approach for large-scale link discovery on the web of data. In Proceedings of IJCAI, 2011.
[10] A.-C. Ngonga Ngomo, J. Lehmann, S. Auer, and K. Höffner. RAVEN - active learning of link specifications. Technical report, 2012.
[11] A.-C. Ngonga Ngomo and K. Lyko. EAGLE: Efficient active learning of link specifications using genetic programming. In Proceedings of ESWC, 2012.
[12] A.-C. Ngonga Ngomo, K. Lyko, and V. Christen. COALA - correlation-aware active learning of link specifications. In Proceedings of ESWC, 2013.
[13] A. Nikolov, M. D'Aquin, and E. Motta. Unsupervised learning of data linking configuration. In Proceedings of ESWC, 2012.
[14] A. Nikolov, V. S. Uren, E. Motta, and A. N. D. Roeck. Handling instance coreferencing in the KnoFuss architecture. In P. Bouquet, H. Halpin, H. Stoermer, and G. Tummarello, editors, IRSW, volume 422 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[15] Y. Raimond, C. Sutton, and M. Sandler. Automatic interlinking of music datasets on the semantic web. 2008.
[16] F. Scharffe and J. Euzenat. MeLinDa: an interlinking framework for the web of data, 2011.
[17] F. Scharffe, Z. Fan, A. Ferrara, H. Khrouf, and A. Nikolov. Methods for automated dataset interlinking. Technical Report 4.1, Datalift, 2011.
[18] F. Scharffe, Y. Liu, and C. Zhou. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US), 2009.
[19] E. Simperl, S. Wölger, S. Thaler, B. Norton, and T. Bürger. Combining human and computation intelligence: the case of data interlinking tools. IJMSO, 7(2):77-92, 2012.
[20] S. Wölger, K. Siorpaes, T. Bürger, E. Simperl, S. Thaler, and C. Hofer. A survey on data interlinking methods. Technical report, Semantic Technology Institute (STI), February 2011.
[21] M. Yatskevich, F. Giunchiglia, and P. Avesani. A large scale dataset for the evaluation of matching systems. Technical report, DISI, University of Trento, 2006.