How well does your Instance Matching system perform? Experimental evaluation with LANCE

Tzanina Saveta (1), Evangelia Daskalaki (1), Giorgos Flouris (1), Irini Fundulaki (1), and Axel-Cyrille Ngonga Ngomo (2)

(1) Institute of Computer Science-FORTH, Greece*   (2) IFI/AKSW, University of Leipzig, Germany

* The presented work was funded by the H2020 project HOBBIT (#688227).

Abstract. Identifying duplicate instances in the Data Web is most commonly performed (semi-)automatically using instance matching frameworks. However, current instance matching benchmarks fail to provide end users and developers with the necessary insights pertaining to how current frameworks behave when dealing with real data. In this paper, we present the results of the evaluation of instance matching systems using Lance, a domain-independent, schema-agnostic instance matching benchmark generator for Linked Data. Lance is the first benchmark generator for Linked Data to support semantics-aware test cases that take into account complex OWL constructs in addition to the standard test cases related to structure and value transformations. We provide a comparative analysis with benchmarks produced using the Lance framework for different domains to assess and identify the capabilities of state-of-the-art instance matching systems.

1 Introduction

Instance matching (IM) refers to the problem of identifying instances that describe the same real-world object. With the increasing adoption of Semantic Web technologies and the publication of large interrelated RDF datasets and ontologies that form the Linked Data (LD) Cloud, a number of IM techniques adapted to this setting have been proposed [1,2,3].

Clearly, the large variety of IM techniques requires their comparative evaluation to determine which technique is best suited for a given application. Assessing the performance of these systems generally requires well-defined and widely accepted benchmarks that allow determining the weak and strong points of the methods or systems, and that motivate the development of better systems to overcome the identified weak points. Hence, properly designed benchmarks help push the limits of existing systems [4,5,6,7,8], advancing both research and technology.

Recently, Lance [8], a state-of-the-art benchmark generator for benchmarking instance matching systems in the LD context, was introduced. Lance is a flexible, generic, domain-independent and schema-agnostic benchmark generator for IM systems. Lance supports a large variety of value-based, structure-based and semantics-aware transformations with varying degrees of difficulty. The results of these transformations are recorded in the form of a weighted gold standard that allows a more fine-grained analysis of the performance of instance matching tools. Details on the different transformation types, our weighted gold standard and metrics, as well as the evaluation of our system can be found in [8].

In the current paper, our focus lies on evaluating state-of-the-art instance matching systems with benchmarks produced using the Lance framework. The purpose of this evaluation is to provide further insights on the weak and strong points of different IM systems, complementary to the ones already established in [8].
In particular, we evaluate the effect of using different datasets as input to the benchmark generator module of Lance, and show that the performance of IM systems is affected not only by the benchmark creation process itself, but also by the characteristics of the input dataset that was used to generate the benchmark. For our tests, we used SPIMBENCH [7] and UOBM [9] datasets.

2 LANCE Approach

Here, we summarize the basic features of Lance; the interested reader can find more details in [8].

Transformation-based test cases. Lance supports a set of test cases based on transformations that distinguish different types of matching entities. Similarly to existing IM benchmarks, Lance supports value-based (typos, date/number formats, etc.) and structure-based (deletion of classes/properties, aggregations, splits, etc.) test cases. Lance is the first benchmark generator to support semantics-aware test cases, which go beyond the standard RDFS constructs and test the ability of IM systems to use the semantics of RDFS/OWL axioms to identify matches; these include tests involving instance (in)equality, class and property equivalence and disjointness, property constraints, as well as complex class definitions. Lance also supports simple combination (SC) test cases (implemented by applying the aforementioned transformations on different triples pertaining to the same instance), as well as complex combination (CC) test cases (implemented by combining individual transformations on the same triple).

Similarity score and fine-grained evaluation metrics. Lance provides an enriched, weighted gold standard and related evaluation metrics, which allow a more fine-grained analysis of the performance of systems on tests of varying difficulty. The gold standard indicates the matches between source and target instances. In particular, each match in the gold standard is enriched with annotations specific to the test case that generated it, i.e., the type of test case it represents, the property on which a transformation was applied, and a similarity score (or weight) of the pair of reported matched instances that essentially quantifies the difficulty of finding that particular match. This detailed information allows Lance to provide more detailed views and novel evaluation metrics to assess the completeness, soundness, and overall matching quality of an IM system on top of the standard precision/recall metrics. Thus, Lance provides fine-grained information to support debugging and extending IM systems.
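To make the contents of the weighted gold standard concrete, the following is a minimal Java sketch of what a single entry could carry; the class and field names are illustrative assumptions and do not reflect Lance's actual data structures.

  // Illustrative sketch of a weighted gold standard entry; the class and
  // field names are hypothetical and do not reflect Lance's actual code.
  public record GoldStandardEntry(
          String sourceInstance,      // URI of the source instance
          String targetInstance,      // URI of the matching target instance
          String testCaseType,        // e.g. "VALUE", "STRUCTURE", "SEMANTICS", "SC", "CC"
          String transformedProperty, // property on which the transformation was applied
          double similarityScore) {   // weight quantifying how hard the match is to find

      public static void main(String[] args) {
          // A value-based test case (typo) applied to a title property of instance 42.
          GoldStandardEntry entry = new GoldStandardEntry(
                  "http://example.org/source/instance42",
                  "http://example.org/target/instance42",
                  "VALUE",
                  "http://example.org/ontology#title",
                  0.85);
          System.out.println(entry);
      }
  }

Annotating every match in this way is what allows results to be broken down per test case type and per difficulty band, rather than reported as a single flat precision/recall pair.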
High level of customization. Lance provides the ability to build benchmarks with different characteristics on top of any input dataset, thereby allowing the implementation of diverse test cases for different domains, dataset sizes and morphologies. This makes Lance highly customizable and domain-independent.

Implementation of LANCE. Lance (the code is available at https://github.com/jsaveta/Lance) is a highly configurable instance matching benchmark generator for Linked Data that consists of two components: (i) an RDF repository that stores the source datasets and (ii) a test case generator (see Figure 1).

Fig. 1. Lance System Architecture

The test case generator takes as input a source dataset and produces a target dataset that implements various test cases according to the specified configuration parameters, to be used for testing instance matching tools. It consists of the Initialization, the Resource Generator and the Resource Transformation modules.
– The Initialization module reads the test case generation parameters and retrieves, by means of SPARQL queries, the schema information (e.g., schema classes and properties) from the RDF repository that will be used for producing the target dataset.
– The Resource Generator uses this input to retrieve instances of those schema constructs from the RDF repository and passes them (along with the configuration parameters) to the Resource Transformation module.
– The Resource Transformation module returns, for a source instance u_i, the transformed instance u_i' and stores it in the target dataset; this module is also responsible for producing the corresponding entry in the gold standard.
Once Lance has performed all the requested transformations, the Weight Computation module calculates the similarity scores of the produced matches. The configuration parameters specify the part of the schema and data to consider when producing the different test cases, as well as the percentage and type of transformations to apply. More specifically, parameters for value-based test cases specify the kind and severity of transformations to be applied; for structure-based and semantics-aware test cases, the parameters specify the type of transformations to be considered. The idea behind the configuration parameters is to allow one to tune the benchmark generator into producing benchmarks of varying degrees of difficulty which test different aspects of an instance matching tool (see the sketch below). Lance is implemented in Java and in the current version we use OWLIM Version 2.7.3 as our RDF repository.
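To illustrate how such configuration parameters might be expressed, the sketch below assembles a hypothetical configuration for a benchmark run; the parameter names and values are assumptions made for this example and do not correspond to Lance's actual configuration interface.

  import java.util.Properties;

  // Hypothetical configuration for a Lance-style benchmark run; the property
  // keys and values below are illustrative, not Lance's real parameter names.
  public class BenchmarkConfigExample {
      public static void main(String[] args) {
          Properties config = new Properties();

          // Which source data to read and how much of it to transform.
          config.setProperty("source.repository", "http://localhost:8080/repositories/spimbench");
          config.setProperty("transformed.instances.percentage", "100"); // transform the whole source

          // Value-based test cases: kind and severity of transformations.
          config.setProperty("value.transformations", "typos,date-format,number-format");
          config.setProperty("value.severity", "medium");

          // Structure-based and semantics-aware test cases: types of transformations.
          config.setProperty("structure.transformations", "delete-property,split,aggregate");
          config.setProperty("semantics.transformations", "class-equivalence,property-disjointness");

          config.forEach((k, v) -> System.out.println(k + " = " + v));
      }
  }

Varying such parameters is what lets the generator produce benchmarks of different difficulty over the same input dataset.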
3 Experimental Results

Settings. Our evaluation focused on demonstrating the capability of our benchmark generator in assessing and identifying the strengths and weaknesses of instance matching systems. For this purpose, we evaluated LogMap Version 2.4 [10] using the MoRe [11] reasoner, OtO [12], and LIMES [2] running the EAGLE [13] algorithm. We chose these tools because they are prototypical working instances of existing IM systems. Attempts to evaluate systems such as RiMOM-IM [14], COMA++ [15] and CODI [16] with Lance were not successful due to issues on the systems' side: we were not able to work with RiMOM-IM due to incomplete information regarding the use of the system; COMA++ supports instance-based ontology matching but does not aim at instance matching per se; and CODI is no longer supported by its development team. LogMap considers both schema- and instance-level matching; OtO, on the other hand, needs to be configured manually to implement instance matching tasks. The same holds for EAGLE, which can learn link specifications and focuses on instance matching tasks only. In order to identify strong and weak points of state-of-the-art IM systems, we tested the tools at hand with difficult tasks in which we transform the entirety of the source dataset to produce the target dataset. All experiments were conducted on an Intel(R) Core(TM) 2 Duo CPU E8400 @ 3.00GHz with 8GB of main memory running Windows 7 (64-bit).

Datasets. We used as source datasets those produced by the data generators of LDBC's SPIMBENCH [7] (LDBC Semantic Publishing Benchmark: http://ldbcouncil.org/developer/spb) and UOBM [9]. SPIMBENCH datasets are described using a rich ontology with many different OWL constructs, in contrast with UOBM, which employs a simpler ontology with many object and some datatype properties. For each generator (SPIMBENCH, UOBM) we produced two datasets, one with 10K triples and one with 50K triples. For SPIMBENCH, those triples correspond approximately to 500 and 2.5K instances, respectively, and for UOBM to 2K and 10K.

Results. Figures 2 and 3 report the results for the different types of test cases and for the aforementioned datasets. In all cases, we measured recall, precision and f-measure, along with the similarity score and its standard deviation.

Regarding the SPIMBENCH dataset, LogMap responds well to the value-based test cases, achieving high precision and recall (close to 0.75), but its performance degrades when the instances are involved in semantics-aware test cases, yielding low precision and recall (below 0.4). Despite these results, we claim that LogMap performs sufficiently well when faced with semantics-aware transformations, since it is called to perform a matching task over highly heterogeneous datasets. OtO gives very good precision results for the value-based test cases, but in some cases it is not able to find any match (recall is below 0.1).

Fig. 2. Precision, recall and f-measure for LogMap, EAGLE and OtO on the 10K datasets.
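For reference, the sketch below shows one straightforward way in which the reported figures (precision, recall, f-measure, and the mean and standard deviation of the similarity scores of the matches that were found) could be computed against a weighted gold standard; the data structures and values are illustrative and are not taken from Lance's evaluation code.

  import java.util.List;
  import java.util.Set;
  import java.util.stream.Collectors;

  // Illustrative evaluation over a weighted gold standard; not Lance's actual code.
  public class EvaluationSketch {

      record Match(String source, String target) {}
      record GoldEntry(String source, String target, double similarityScore) {}

      public static void main(String[] args) {
          // Toy gold standard and system output, for illustration only.
          List<GoldEntry> gold = List.of(
                  new GoldEntry("src:1", "tgt:1", 0.9),
                  new GoldEntry("src:2", "tgt:2", 0.4),
                  new GoldEntry("src:3", "tgt:3", 0.7));
          Set<Match> system = Set.of(new Match("src:1", "tgt:1"), new Match("src:2", "tgt:9"));

          Set<Match> goldPairs = gold.stream()
                  .map(g -> new Match(g.source(), g.target()))
                  .collect(Collectors.toSet());

          long truePositives = system.stream().filter(goldPairs::contains).count();
          double precision = system.isEmpty() ? 0 : (double) truePositives / system.size();
          double recall = gold.isEmpty() ? 0 : (double) truePositives / gold.size();
          double fMeasure = (precision + recall == 0) ? 0
                  : 2 * precision * recall / (precision + recall);

          // Mean and standard deviation of the similarity scores of the matches found.
          List<Double> foundScores = gold.stream()
                  .filter(g -> system.contains(new Match(g.source(), g.target())))
                  .map(GoldEntry::similarityScore)
                  .collect(Collectors.toList());
          double mean = foundScores.stream().mapToDouble(Double::doubleValue).average().orElse(0);
          double stdDev = Math.sqrt(foundScores.stream()
                  .mapToDouble(s -> (s - mean) * (s - mean)).sum() / Math.max(1, foundScores.size()));

          System.out.printf("P=%.2f R=%.2f F=%.2f meanSim=%.2f stdSim=%.2f%n",
                  precision, recall, fMeasure, mean, stdDev);
      }
  }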