<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating corrupted data sources for the evaluation of matching systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alasdair J G Gray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabina Jedrzejczyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmad Alsadeeqi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Heriot-Watt University</institution>
          ,
          <addr-line>Edinburgh, Scotland</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the most difficult aspects of developing matching systems, whether for matching ontologies or for other types of mismatched data, is evaluation. The accuracy of a matcher is usually evaluated by measuring the results produced by the system against reference sets, but gold-standard reference sets are expensive and difficult to create. In this paper we introduce crptr, which generates multiple variations of different sorts of dataset, where the degree of variation is controlled, so that they can be used to evaluate matchers in different contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>Matching</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Data Corruption</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the central problems of data matching is the issue of evaluation: when
a system returns a set of matches, how are we to determine whether they are
correct or not? How exactly do we define what a correct match is, and how do
we determine whether the proposed matches fall into that category? If we have
a range of different options, how do we determine which is the 'best' match?</p>
      <p>In this paper we describe the use of the crptr system to create evaluation
datasets for matching. crptr was developed to simulate data quality issues in
test datasets used for record linkage evaluation. It can create multiple similar
datasets, with the amount of variation controlled by input settings, and
provides a clear mapping back to the original dataset. This creates training and
evaluation sets for matchers to run against. We have extended the crptr system
to deal with structure, in a context where we want to corrupt data sources in
order to evaluate the semantic rewriting of queries to unknown data sources.</p>
      <p>In Section 2 we describe the crptr system and its original application domain.
Section 3 then details how we extended crptr to address corruption of other data
sets and of queries. We discuss issues around evaluation in Section 4 and touch
on related work in Section 5 before concluding the paper in Section 6.
Copyright 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>The crptr system</title>
      <p>
        Synthetically generated data is a common approach for evaluating and testing
data analysis and mining approaches [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, synthetically
generated data fails to capture the messiness of real-world data, i.e., it omits data
quality issues [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To overcome this we developed crptr: a data corruption
application that injects data errors and variations based on user requirements. crptr
allows the user to control data quality in the generated dataset by simulating and
injecting data corruptions into any dataset using known (non-random) methods
that mimic real-world data quality issues (errors and variations). Applying these
corruptions to a synthetic dataset enables control of its data quality, which
makes the synthetic data more realistic and usable for evaluations. crptr
contains many corruption methods that mimic commonly found data quality issues,
e.g., typing errors, alternative spellings, and missing or swapped attributes, which
can be used to simulate different corruption scenarios based on the experiment
or project requirements.
      </p>
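      <p>As an illustration, corruption methods of this kind can be sketched in a few lines of Python. This is a minimal sketch in our own words; the function names and record format are invented for illustration and are not the actual crptr API.</p>
      <preformat>
```python
import random

rng = random.Random(7)  # fixed seed so corruption runs are reproducible

def swap_adjacent_chars(value):
    """Typing error: transpose two neighbouring characters."""
    if len(value) > 1:
        i = rng.randrange(len(value) - 1)
        value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
    return value

def alternative_spelling(value, variants):
    """Replace a value with a known spelling variant, if one is listed."""
    options = variants.get(value.lower())
    return rng.choice(options) if options else value

def drop_attribute(record, field):
    """Missing attribute: blank out one field of a record."""
    corrupted = dict(record)
    corrupted[field] = ""
    return corrupted

record = {"name": "Katherine", "city": "Edinburgh"}
variants = {"katherine": ["Catherine", "Kathryn"]}
print(swap_adjacent_chars(record["name"]))
print(alternative_spelling(record["name"], variants))
print(drop_attribute(record, "city"))
```
      </preformat>
      <p>Each method is deterministic given the seed, which is what makes the corrupted output usable as a reproducible evaluation set.</p>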
      <p>crptr works by using a corruption profile that controls which methods are
used and to what extent. The idea is that the profile attempts to capture the data
quality characteristics of the dataset being modelled. The corruption profile
consists of many factors that define the way the data should be corrupted, such as
the total number of records to be corrupted and the corruption methods to be
applied to the data. By controlling the factors of the corruption profile, the user
can configure crptr to mimic the data quality characteristics
that fit the purpose of the research.</p>
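      <p>A corruption profile of this sort might be sketched as follows. The field names, structure and toy corruption methods below are our own invention, not crptr's actual configuration format; the sketch also records which corruption was applied to each record, giving the mapping back to the originals.</p>
      <preformat>
```python
import random

rng = random.Random(0)

# Illustrative profile: corrupt half the records, choosing methods by weight.
profile = {
    "record_corruption_rate": 0.5,
    "method_weights": {"typo": 0.7, "drop_field": 0.3},
}

def make_typo(record):
    rec = dict(record)
    rec["name"] = rec["name"][::-1]  # stand-in for a real typo method
    return rec

def drop_field(record):
    rec = dict(record)
    rec["city"] = ""
    return rec

methods = {"typo": make_typo, "drop_field": drop_field}

def corrupt_dataset(records, profile):
    """Return (corrupted records, mapping back to the originals)."""
    names = list(profile["method_weights"])
    weights = [profile["method_weights"][n] for n in names]
    out, mapping = [], {}
    for i, rec in enumerate(records):
        if profile["record_corruption_rate"] > rng.random():
            chosen = rng.choices(names, weights=weights)[0]
            out.append(methods[chosen](rec))
            mapping[i] = chosen  # which corruption was applied to record i
        else:
            out.append(dict(rec))
    return out, mapping

data = [{"name": "Anna", "city": "Leith"}, {"name": "Omar", "city": "Perth"}]
corrupted, mapping = corrupt_dataset(data, profile)
print(corrupted)
print(mapping)
```
      </preformat>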
    </sec>
    <sec id="sec-3">
      <title>Application of crptr to Query Rewriting</title>
      <p>
        The CHAIn system (Combining Heterogeneous Agencies' Information) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has
been designed to support users in successfully querying data from a wide range of
different data sources, even when these data sources are not known in advance
(e.g., data sources of new collaborators). It is primarily aimed at supporting
decision makers during crisis response, but is applicable in many domains. Any
queries pre-written to extract required information are likely to fail (i.e., not
return any data) on these unknown or updated data sources because the queries
were not written according to the structure and terminology of the target data
source. However, the data sources may well have relevant information that is
closely related to the query. CHAIn extracts data from the target source that
approximately matches the query (i.e., exceeds a given threshold) and uses this
data to rewrite the query so that it succeeds on the data source. It returns
(potentially) multiple rewritten queries, the mappings used to generate them, the
data they retrieved, and a similarity score in [0, 1], ranked in order of similarity.
      </p>
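      <p>The shape of this output can be illustrated with a small sketch; the queries, mappings and scores below are invented for illustration and do not come from CHAIn itself.</p>
      <preformat>
```python
# Each candidate rewrite carries the rewritten query, the mapping used to
# produce it, the data it retrieved, and a similarity score in [0, 1].
rewrites = [
    {"query": "SELECT road FROM closures", "mapping": {"street": "road"},
     "data": ["A9"], "score": 0.82},
    {"query": "SELECT highway FROM closures", "mapping": {"street": "highway"},
     "data": [], "score": 0.41},
    {"query": "SELECT lane FROM closures", "mapping": {"street": "lane"},
     "data": ["B701"], "score": 0.67},
]

# The rewrites are returned ranked in order of similarity.
ranked = sorted(rewrites, key=lambda r: r["score"], reverse=True)
for r in ranked:
    print(r["score"], r["query"])
```
      </preformat>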
      <p>Evaluation in this context therefore means determining whether the scores
returned are a reasonable reflection of the distance between the original query
and the rewritten query, and hence whether the ranked list is a reasonable
ordering of the likely relevance of the responses to what the decision maker actually
wants to know. In this context, the matching is done between the schema of the
query and the schema of the target data source¹. In order to mimic the process
of rewriting a query designed for one data source so that it succeeds on a different
data source, we create a query based on the data in a particular data source (i.e.,
so that it would be able to successfully query that data source) and then
introduce corruption reflecting naturally-occurring differences. We can either keep
the query fixed and corrupt the data source in multiple ways, or keep the data
source fixed and corrupt the query. In practice, we focused on corrupting data
sources and then generating corrupted queries from these corrupted data sources:
firstly, because it created a more generic process that was able to corrupt both
data sources and queries; secondly, because it allows us to more easily focus on
the part of the query that is relevant in this context, which is the terminology
referring to the target data source.</p>
      <p>We therefore needed to extend the functionality of crptr in two ways: (i) we
needed to consider the domain in which this matching occurs to determine
how terms should be corrupted; (ii) because there is a structural element to
schemas, we needed to consider how this could be corrupted and extend the system
to perform this.</p>
      <p>
        In terms of the first requirement, some of the corruption methods in crptr
(e.g., those focusing on spelling errors) are not relevant, whilst others, such as
abbreviations, need to be adapted, as some kinds of abbreviations (e.g., of first
names) are unlikely to occur in our data sources. We need to determine what
kinds of mismatches are likely to occur in our domain, and determine what
sources we can automatically extract them from. CHAIn is designed to be
domain independent, and when addressing the problem of matching different (but
similar) data sources in the general case, we need a domain-independent lexical
resource to suggest the kinds of synonyms, hyponyms, hypernyms and meronyms
that different creators of data sources in a similar domain may naturally use. We
therefore turned to WordNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a generic and widely used lexical resource, to
allow us to do term corruption. WordNet does provide some information about
abbreviations and acronyms which we are able to use in our matching, although
additional resources that provide more relevant corruptions in this area would
improve performance (but are hard to find).
      </p>
      <p>In terms of the second requirement, we needed to make sure any potential
structural change in the schema of a CSV file was considered. This schema is
structurally simple, consisting of columns which are named and ordered, and thus
structural changes are restricted to reorganisation (addition, deletion and
reordering) of the columns. For SPARQL queries in general there are, of course,
many more structural elements (e.g., the potential presence of SPARQL features
such as aggregate functions), and a complete list of potential structural
mismatches would be more complicated. As we are only concerned with the terms
in the query which correspond with those of the expected data source, we can
ignore all of the additional SPARQL structure, stripping out the relevant terms
and reinserting the new terms after matching.
¹ Matching at the data level is required when queries are partially instantiated.</p>
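      <p>For a CSV file, the whole space of structural corruptions (reordering, deletion and addition of columns) can be covered by a short sketch. This is illustrative only, not crptr's implementation; the header, rows and added column name are invented.</p>
      <preformat>
```python
import random

rng = random.Random(3)

def corrupt_csv_structure(header, rows):
    """Apply the three possible structural corruptions of a CSV schema:
    reorder the columns, delete one, and add a new (empty) one."""
    order = list(range(len(header)))
    rng.shuffle(order)                                   # reorder columns
    if len(order) > 1:
        order.pop()                                      # delete a column
    new_header = [header[i] for i in order] + ["notes"]  # add a column
    new_rows = [[row[i] for i in order] + [""] for row in rows]
    return new_header, new_rows

header = ["name", "road", "town"]
rows = [["closure-1", "A9", "Perth"], ["closure-2", "M8", "Glasgow"]]
print(corrupt_csv_structure(header, rows))
```
      </preformat>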
    </sec>
    <sec id="sec-4">
      <title>Evaluation of crptr for different data formats</title>
      <p>The quality of the crptr output depends on whether the corrupted data sources
it produces are a reasonable facsimile of the different but related data sources that
would naturally be found. If this is the case then we can infer that the
performance of a matching system when matching different data sources created by
crptr is a good indication of the matcher's performance in real-world settings,
and that crptr is therefore a useful matching evaluation tool.</p>
      <p>This depends on two things: (i) are the terms in the lookup table a good
approximation of terms that could be used interchangeably or in a similar way,
i.e., does it model genuine semantic and syntactic mismatches?; (ii) are the
structural mismatches introduced through the corruption process a good
approximation of how similar data sources may differ? The first is highly domain
dependent. We use WordNet, which is a very widely used lexical resource. It is also
likely to be of benefit to use domain-specific ontologies and lexicographies for each
particular domain; however, these are hard to find and often of questionable quality,
so this kind of domain-specific corruption may be hard to perform. Matching in
such domains is also more efficient for the same reasons. The second aspect is
domain independent but format specific. For each format the system is extended
to, an analysis of what structural mismatches are possible is necessary in order
to demonstrate that the corruptions produced are plausible and thorough.</p>
    </sec>
    <sec id="sec-5">
      <title>Related work</title>
      <p>To the best of our knowledge, a system that generates reference sets (records,
queries, RDF data sources, ontologies, etc.) in order to evaluate matching across
these domains is unique.</p>
      <p>
        Since reference ontologies are expensive to generate and often not available [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], automatically generated test sets have been used to evaluate ontology
matching since the Benchmark Test Set was developed for the Ontology Alignment
Evaluation Initiative in 2004 and updated in 2016 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several other generators
were inspired by this, including Swing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These tend to focus on OWL ontologies
and are less broadly applicable than crptr. The range of methods they use is
in some cases more sophisticated than our techniques, and in domains for which
they are relevant, crptr could be improved by incorporating such approaches.
      </p>
      <p>
        Aside from ontology matching, there is existing work on generating
synthetic datasets with structural variations for relational and RDF data for use
in benchmarking. The Linked Data Benchmark Council [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has supported the
development of configurable and scalable synthetic RDF datasets with similar
irregularities to real data, including structural irregularities, specifically in the
domains of social networks and semantic publishing. Existing work on generating
structural variations in RDF data (e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) is intended to test the functionality
and scalability of searches and the maintenance of RDF datasets. STBenchmark
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] generates test cases for schema mapping systems, taking an original dataset
and applying structural and term variations. This is used to create benchmark
data for manual mapping systems rather than for automated matching or
querying. Our work could be extended with similar strategies to experiment
with greater structural variations.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper we have discussed using the crptr system to generate multiple
similar datasets for evaluating matchers within different domains. We briefly
described how crptr was developed with a focus on records, and was then extended
to deal with queries based on CSV files; it could be further extended to deal with
other kinds of data sources. We discussed what evaluation of these corruption
systems means in different contexts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velegrakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>STBenchmark: towards a benchmark for mapping systems</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>230</fpage>
          –
          <lpage>244</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Angles</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boncz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larriba-Pey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundulaki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erling</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubauer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez-Bazan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotsev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toma</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The Linked Data Benchmark Council: a graph and RDF industry benchmarking effort</article-title>
          .
          <source>ACM SIGMOD Record</source>
          <volume>43</volume>
          (
          <issue>1</issue>
          ),
          <fpage>27</fpage>
          –
          <lpage>31</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roşoiu</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trojahn</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Ontology matching benchmarks: Generation, stability, and discriminability</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>21</volume>
          ,
          <fpage>30</fpage>
          –
          <lpage>48</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1016/j.websem.2013.05.002, http://www.sciencedirect.com/science/article/pii/S1570826813000188, special issue on Evaluation of Semantic Technologies
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montanelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noessner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuckenschmidt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Benchmarking matching applications on the semantic web</article-title>
          . In: Antoniou,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Simperl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Plexousakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>De Leenheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . (eds.)
          <source>The Semantic Web: Research and Applications</source>
          . pp.
          <fpage>108</fpage>
          –
          <lpage>122</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hernandez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stolfo</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          :
          <article-title>Real-world data is dirty: Data cleansing and the merge/purge problem</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          –
          <lpage>37</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ivanova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietriga</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambrix</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Alignment cubes: Towards interactive visual exploration and evaluation of multiple ontology alignments</article-title>
          . In: d'Amato,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Lecue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Cudre-Mauroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Heflin</surname>
          </string-name>
          , J. (eds.)
          <source>The Semantic Web – ISWC 2017</source>
          . pp.
          <fpage>400</fpage>
          –
          <lpage>417</lpage>
          . Springer International Publishing, Cham
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>McNeill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkaniatsou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bundy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Dynamic data sharing from large data sources</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2014)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          :
          <article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Commun. ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <fpage>39</fpage>
          –
          <lpage>41</lpage>
          (Nov
          <year>1995</year>
          ). https://doi.org/10.1145/219717.219748
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vatsalan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Geco: an online personal data generator and corruptor</article-title>
          .
          <source>In: Proceedings of the 22nd ACM international conference on Information &amp; Knowledge Management</source>
          . pp.
          <fpage>2473</fpage>
          –
          <lpage>2476</lpage>
          . ACM
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>