How well does your Instance Matching system perform? Experimental evaluation with LANCE

Tzanina Saveta (1), Evangelia Daskalaki (1), Giorgos Flouris (1), Irini Fundulaki (1), and Axel-Cyrille Ngonga Ngomo (2)

(1) Institute of Computer Science-FORTH, Greece*   (2) IFI/AKSW, University of Leipzig, Germany

* The presented work was funded by the H2020 project HOBBIT (#688227).

Abstract. Identifying duplicate instances in the Data Web is most commonly performed (semi-)automatically using instance matching frameworks. However, current instance matching benchmarks fail to provide end users and developers with the necessary insights pertaining to how current frameworks behave when dealing with real data. In this paper, we present the results of the evaluation of instance matching systems using Lance, a domain-independent, schema-agnostic instance matching benchmark generator for Linked Data. Lance is the first benchmark generator for Linked Data to support semantics-aware test cases that take into account complex OWL constructs in addition to the standard test cases related to structure and value transformations. We provide a comparative analysis with benchmarks produced using the Lance framework for different domains to assess and identify the capabilities of state-of-the-art instance matching systems.

1 Introduction

Instance matching (IM) refers to the problem of identifying instances that describe the same real-world object. With the increasing adoption of Semantic Web technologies and the publication of large interrelated RDF datasets and ontologies that form the Linked Data (LD) Cloud, a number of IM techniques adapted to this setting have been proposed [1,2,3].

Clearly, the large variety of IM techniques requires their comparative evaluation to determine which technique is best suited for a given application. Assessing the performance of these systems generally requires well-defined and widely accepted benchmarks that allow determining the weak and strong points of the methods or systems, and that motivate the development of better systems to overcome the identified weak points. Hence, properly designed benchmarks help push the limits of existing systems [4,5,6,7,8], advancing both research and technology.

Recently, Lance [8], a state-of-the-art benchmark generator for benchmarking instance matching systems in the LD context, was introduced. Lance is a flexible, generic, domain-independent and schema-agnostic benchmark generator for IM systems. Lance supports a large variety of value-based, structure-based and semantics-aware transformations with varying degrees of difficulty. The results of these transformations are recorded in the form of a weighted gold standard that allows a more fine-grained analysis of the performance of instance matching tools. Details on the different transformation types, our weighted gold standard and metrics, as well as the evaluation of our system can be found in [8].

In the current paper, our focus lies on evaluating state-of-the-art instance matching systems with benchmarks produced using the Lance framework. The purpose of this evaluation is to provide further insights on the weak and strong points of different IM systems, complementary to the ones already established in [8].
In particular, we evaluate the effect of using different datasets as input to the benchmark generator module of Lance, and show that the performance of IM systems is affected not only by the benchmark creation process itself, but also by the characteristics of the input dataset that was used to generate the benchmark. For our tests, we used SPIMBENCH [7] and UOBM [9] datasets.

2 LANCE Approach

Here, we summarize the basic features of Lance; the interested reader can find more details in [8].

Transformation-based test cases. Lance supports a set of test cases based on transformations that distinguish different types of matching entities. Similarly to existing IM benchmarks, Lance supports value-based (typos, date/number formats, etc.) and structure-based (deletion of classes/properties, aggregations, splits, etc.) test cases. Lance is the first benchmark generator to support semantics-aware test cases, which go beyond the standard RDFS constructs and test the ability of IM systems to use the semantics of RDFS/OWL axioms to identify matches; these include tests involving instance (in)equality, class and property equivalence and disjointness, property constraints, as well as complex class definitions. Lance also supports simple combination (SC) test cases (implemented by applying the aforementioned transformations on different triples pertaining to the same instance), as well as complex combination (CC) test cases (implemented by combining individual transformations on the same triple).

Similarity score and fine-grained evaluation metrics. Lance provides an enriched, weighted gold standard and related evaluation metrics, which allow a more fine-grained analysis of the performance of systems on tests of varying difficulty. The gold standard indicates the matches between source and target instances. In particular, each match in the gold standard is enriched with annotations specific to the test case that generated it, i.e., the type of test case it represents, the property on which a transformation was applied, and a similarity score (or weight) of the pair of reported matched instances that essentially quantifies the difficulty of finding that particular match. This detailed information allows Lance to provide more detailed views and novel evaluation metrics to assess the completeness, soundness, and overall matching quality of an IM system on top of the standard precision/recall metrics. Thus, Lance provides fine-grained information to support debugging and extending IM systems.
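To make the contents of the weighted gold standard concrete, the following is a minimal Java sketch of what a single entry could carry; the class and field names are illustrative assumptions and do not reflect Lance's actual data structures.

  // Illustrative sketch of a weighted gold standard entry; the class and
  // field names are hypothetical and do not reflect Lance's actual code.
  public record GoldStandardEntry(
          String sourceInstance,      // URI of the source instance
          String targetInstance,      // URI of the matching target instance
          String testCaseType,        // e.g. "VALUE", "STRUCTURE", "SEMANTICS", "SC", "CC"
          String transformedProperty, // property on which the transformation was applied
          double similarityScore) {   // weight quantifying how hard the match is to find

      public static void main(String[] args) {
          // A value-based test case (typo) applied to a title property of instance 42.
          GoldStandardEntry entry = new GoldStandardEntry(
                  "http://example.org/source/instance42",
                  "http://example.org/target/instance42",
                  "VALUE",
                  "http://example.org/ontology#title",
                  0.85);
          System.out.println(entry);
      }
  }

Annotating every match in this way is what allows results to be broken down per test case type and per difficulty band, rather than reported as a single flat precision/recall pair.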
High level of customization. Lance provides the ability to build benchmarks with different characteristics on top of any input dataset, thereby allowing the implementation of diverse test cases for different domains, dataset sizes and morphologies. This makes Lance highly customizable and domain-independent.

Implementation of LANCE. Lance (the code is available at https://github.com/jsaveta/Lance) is a highly configurable instance matching benchmark generator for Linked Data that consists of two components: (i) an RDF repository that stores the source datasets and (ii) a test case generator (see Figure 1).

Fig. 1. Lance System Architecture

The test case generator takes as input a source dataset and produces a target dataset that implements various test cases according to the specified configuration parameters, to be used for testing instance matching tools. It consists of the Initialization, the Resource Generator and the Resource Transformation modules.
– The Initialization module reads the test case generation parameters and retrieves, by means of SPARQL queries, the schema information (e.g., schema classes and properties) from the RDF repository that will be used for producing the target dataset.
– The Resource Generator uses this input to retrieve instances of those schema constructs from the RDF repository and passes them (along with the configuration parameters) to the Resource Transformation module.
– The Resource Transformation module returns, for a source instance u_i, the transformed instance u_i' and stores it in the target dataset; this module is also responsible for producing the corresponding entry in the gold standard.
Once Lance has performed all the requested transformations, the Weight Computation module calculates the similarity scores of the produced matches. The configuration parameters specify the part of the schema and data to consider when producing the different test cases, as well as the percentage and type of transformations to apply. More specifically, parameters for value-based test cases specify the kind and severity of transformations to be applied; for structure-based and semantics-aware test cases, the parameters specify the type of transformations to be considered. The idea behind the configuration parameters is to allow one to tune the benchmark generator into producing benchmarks of varying degrees of difficulty which test different aspects of an instance matching tool (see the sketch below). Lance is implemented in Java and in the current version we use OWLIM Version 2.7.3 as our RDF repository.
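To illustrate how such configuration parameters might be expressed, the sketch below assembles a hypothetical configuration for a benchmark run; the parameter names and values are assumptions made for this example and do not correspond to Lance's actual configuration interface.

  import java.util.Properties;

  // Hypothetical configuration for a Lance-style benchmark run; the property
  // keys and values below are illustrative, not Lance's real parameter names.
  public class BenchmarkConfigExample {
      public static void main(String[] args) {
          Properties config = new Properties();

          // Which source data to read and how much of it to transform.
          config.setProperty("source.repository", "http://localhost:8080/repositories/spimbench");
          config.setProperty("transformed.instances.percentage", "100"); // transform the whole source

          // Value-based test cases: kind and severity of transformations.
          config.setProperty("value.transformations", "typos,date-format,number-format");
          config.setProperty("value.severity", "medium");

          // Structure-based and semantics-aware test cases: types of transformations.
          config.setProperty("structure.transformations", "delete-property,split,aggregate");
          config.setProperty("semantics.transformations", "class-equivalence,property-disjointness");

          config.forEach((k, v) -> System.out.println(k + " = " + v));
      }
  }

Varying such parameters is what lets the generator produce benchmarks of different difficulty over the same input dataset.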
3 Experimental Results

Settings. Our evaluation focused on demonstrating the capability of our benchmark generator in assessing and identifying the strengths and weaknesses of instance matching systems. For this purpose, we evaluated LogMap Version 2.4 [10] using the MoRe [11] reasoner, OtO [12], and LIMES [2] running the EAGLE [13] algorithm. We chose these tools because they are prototypical working instances of existing IM systems. Attempts to evaluate systems such as RiMOM-IM [14], COMA++ [15] and CODI [16] with Lance were not successful due to issues on the systems' side: we were not able to work with RiMOM-IM due to incomplete information regarding the use of the system; COMA++ supports instance-based ontology matching but does not aim at instance matching per se; and CODI is no longer supported by its development team. LogMap considers both schema- and instance-level matching; OtO, on the other hand, needs to be configured manually to implement instance matching tasks. The same holds for EAGLE, which can learn link specifications and focuses on instance matching tasks only. In order to identify strong and weak points of state-of-the-art IM systems, we tested the tools at hand with difficult tasks in which we transform the entirety of the source dataset to produce the target dataset. All experiments were conducted on an Intel(R) Core(TM) 2 Duo CPU E8400 @ 3.00GHz with 8GB of main memory running Windows 7 (64-bit).

Datasets. We used as source datasets those produced by the data generators of LDBC's SPIMBENCH [7] (LDBC Semantic Publishing Benchmark: http://ldbcouncil.org/developer/spb) and UOBM [9]. SPIMBENCH datasets are described using a rich ontology with many different OWL constructs, in contrast with UOBM, which employs a simpler ontology with many object and some datatype properties. For each generator (SPIMBENCH, UOBM) we produced two datasets, one with 10K triples and one with 50K triples. For SPIMBENCH, those triples correspond approximately to 500 and 2.5K instances, respectively, and for UOBM to 2K and 10K.

Results. Figures 2 and 3 report the results for the different types of test cases and for the aforementioned datasets. In all cases, we measured recall, precision and f-measure, along with the similarity score and its standard deviation.

Regarding the SPIMBENCH dataset, LogMap responds well to the value-based test cases, achieving high precision and recall (close to 0.75), but its performance degrades when the instances are involved in semantics-aware test cases, yielding low precision and recall (below 0.4). Despite these results, we claim that LogMap performs sufficiently well when faced with semantics-aware transformations, since it is called to perform a matching task over highly heterogeneous datasets. OtO gives very good precision results for the value-based test cases, but in some cases it is not able to find any match (recall is below 0.1).

Fig. 2. Precision, recall and f-measure for LogMap, EAGLE and OtO on the 10K datasets.
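For reference, the sketch below shows one straightforward way in which the reported figures (precision, recall, f-measure, and the mean and standard deviation of the similarity scores of the matches that were found) could be computed against a weighted gold standard; the data structures and values are illustrative and are not taken from Lance's evaluation code.

  import java.util.List;
  import java.util.Set;
  import java.util.stream.Collectors;

  // Illustrative evaluation over a weighted gold standard; not Lance's actual code.
  public class EvaluationSketch {

      record Match(String source, String target) {}
      record GoldEntry(String source, String target, double similarityScore) {}

      public static void main(String[] args) {
          // Toy gold standard and system output, for illustration only.
          List<GoldEntry> gold = List.of(
                  new GoldEntry("src:1", "tgt:1", 0.9),
                  new GoldEntry("src:2", "tgt:2", 0.4),
                  new GoldEntry("src:3", "tgt:3", 0.7));
          Set<Match> system = Set.of(new Match("src:1", "tgt:1"), new Match("src:2", "tgt:9"));

          Set<Match> goldPairs = gold.stream()
                  .map(g -> new Match(g.source(), g.target()))
                  .collect(Collectors.toSet());

          long truePositives = system.stream().filter(goldPairs::contains).count();
          double precision = system.isEmpty() ? 0 : (double) truePositives / system.size();
          double recall = gold.isEmpty() ? 0 : (double) truePositives / gold.size();
          double fMeasure = (precision + recall == 0) ? 0
                  : 2 * precision * recall / (precision + recall);

          // Mean and standard deviation of the similarity scores of the matches found.
          List<Double> foundScores = gold.stream()
                  .filter(g -> system.contains(new Match(g.source(), g.target())))
                  .map(GoldEntry::similarityScore)
                  .collect(Collectors.toList());
          double mean = foundScores.stream().mapToDouble(Double::doubleValue).average().orElse(0);
          double stdDev = Math.sqrt(foundScores.stream()
                  .mapToDouble(s -> (s - mean) * (s - mean)).sum() / Math.max(1, foundScores.size()));

          System.out.printf("P=%.2f R=%.2f F=%.2f meanSim=%.2f stdSim=%.2f%n",
                  precision, recall, fMeasure, mean, stdDev);
      }
  }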