<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Utilizing Regular Expressions for Instance-Based Schema Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Zapilko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthaus Zloch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann Schaible</string-name>
          <email>johann.schaible@gesis.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Unter Sachsenhausen 6-8, 50667 Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Statistical data consists mostly of numerical values and entries of codelists, such as country codes or acronyms for gender. Such values are typically described according to specific patterns. In this paper we present a novel approach for instance-based schema matching, where regular expressions are utilized for matching patterns of instance values.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In various domains, e.g. the social sciences, the matching of statistical data is
a typical task. Schema elements of statistical data, e.g. rows or columns of a
spreadsheet, are usually named by simple and short labels, sometimes even with
abbreviated terms. However, the structure and semantics of their instances (e.g.
numerical values, entries of codelists) differ in various aspects from text-heavy
data. Instances are often described by a specific syntactical pattern, e.g. dates
consist of numerical values divided by periods or slashes, and a geographical
area is denoted by a three-letter code.</p>
      <p>
        For instance-based schema matching, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] states that different domains reveal
new challenges, such as treating new types of information resources (e.g. spatial or
temporal information) or domain-specific constraints. According to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], domain-specific values, significant occurrences and patterns of values
are especially relevant characteristics to be considered at instance level, as are
integrity constraints for schema elements and their instance values. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the matching process is
enhanced by applying constraint-based matching. Moreover, regular expressions
and catchwords are considered for instance-based schema matching in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
focus on statistical data, where the potential of patterns and regular expressions
for schema matching can be fully exploited.
      </p>
      <p>
        <bold>Schema Matching using Regular Expressions</bold>
      </p>
      <p>
        By utilizing pattern classes, our approach considers two schema elements
a match if their instances can be expressed via at least one regular
expression of the same pattern class. We define multiple pattern classes, each of which
corresponds to a specific data element, e.g. dates, age groups or geographical codes,
and contains various patterns for describing this data element. For the data
element "date", different patterns might be e.g. <monospace>[0-9]{4}</monospace>, <monospace>[0-9]{2}-[0-9]{4}</monospace> or
<monospace>[0-9]{2}.[0-9]{2}.[0-9]{4}</monospace>. Each pattern is expressed as a regular
expression and is assigned a weighting, which states how accurately the pattern
captures typical instances of the data element. Inside a pattern class the regular
expressions are sorted by their weightings in descending order.
      </p>
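      <p>A pattern class as described above can be sketched in a few lines of Python, independently of the authors' Java implementation. The regular expressions are the date patterns named in the text. The weightings, the names <monospace>DATE_CLASS</monospace>, <monospace>sort_by_weight</monospace> and <monospace>first_match_weight</monospace>, and the decision to require a match of the whole value are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical pattern class for the data element "date".
# The regular expressions are taken from the text; the weightings are invented.
DATE_CLASS = [
    (r"[0-9]{4}", 0.5),
    (r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9),
    (r"[0-9]{2}-[0-9]{4}", 0.7),
]


def sort_by_weight(pattern_class):
    """Return the (regex, weight) pairs in descending order of weight."""
    return sorted(pattern_class, key=lambda pair: pair[1], reverse=True)


def first_match_weight(value, pattern_class):
    """Return the weight of the highest-weighted regex matching the whole value.

    Returns 0.0 if no regular expression of the class matches.
    Whole-value matching (re.fullmatch) is an assumption of this sketch.
    """
    for regex, weight in sort_by_weight(pattern_class):
        if re.fullmatch(regex, value):
            return weight
    return 0.0
```
]]></preformat>
      <p>In this sketch, <monospace>first_match_weight("2012", DATE_CLASS)</monospace> yields the weighting of the four-digit pattern, while a country code such as "DEU" yields 0.0. Because the dot in the third pattern is left unescaped as in the text, a value such as "24/12/2012" is accepted by that pattern as well.</p>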
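      <p>We assume two datasets <inline-formula><tex-math>M</tex-math></inline-formula> and <inline-formula><tex-math>N</tex-math></inline-formula> with their schema elements <inline-formula><tex-math>S_M \in M</tex-math></inline-formula>
and <inline-formula><tex-math>S_N \in N</tex-math></inline-formula>. The pattern classes <inline-formula><tex-math>C_x</tex-math></inline-formula> with <inline-formula><tex-math>C_x = \{(\mathit{regex}, \omega) \mid \mathit{regex} \text{ matches } x,\ 0 &lt; \omega &lt; 1\}</tex-math></inline-formula>
contain multiple regular expressions <inline-formula><tex-math>\mathit{regex}</tex-math></inline-formula> describing the statistical
data element <inline-formula><tex-math>x</tex-math></inline-formula> of the class. Each is accompanied by a weighting <inline-formula><tex-math>\omega</tex-math></inline-formula>.</p>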
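      <p>For each pattern class <inline-formula><tex-math>C_x</tex-math></inline-formula>, we compute an average weighting for every schema
element <inline-formula><tex-math>S_M</tex-math></inline-formula> and <inline-formula><tex-math>S_N</tex-math></inline-formula>. This average weighting indicates how often instances of the
schema element can be expressed by a pattern of the class. As soon as an
instance can be expressed by a <inline-formula><tex-math>(\mathit{regex}, \omega) \in C_x</tex-math></inline-formula>, the value of <inline-formula><tex-math>\omega</tex-math></inline-formula> is added to the sum
of all weightings whose regular expressions previously matched another instance
of the same schema element, resulting in the final sum <inline-formula><tex-math>\sum\nolimits' \omega</tex-math></inline-formula>. The average is then
obtained by normalizing this sum by the total number of instances in
this particular schema element. For each <inline-formula><tex-math>S_M</tex-math></inline-formula>, this is
<disp-formula><tex-math>avg(S_M) = \frac{\sum\nolimits' \omega}{|\text{Instances in } S_M|}</tex-math></disp-formula>
For <inline-formula><tex-math>S_N</tex-math></inline-formula> the average is calculated analogously. If this average weighting is not 0, the
schema element is collected together with its average weighting in a set. We define these
sets as <inline-formula><tex-math>M_x</tex-math></inline-formula> and <inline-formula><tex-math>N_x</tex-math></inline-formula> with <inline-formula><tex-math>M_x = \{(S_M, avg(S_M))\}</tex-math></inline-formula> and <inline-formula><tex-math>N_x = \{(S_N, avg(S_N))\}</tex-math></inline-formula>.</p>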
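      <p>The averaging step can be illustrated with the following Python sketch, which is not the authors' Java implementation. It assumes that every instance contributes the weighting of the highest-weighted regular expression that matches the whole value, and that a schema element is given as a list of instance strings. The names <monospace>average_weighting</monospace> and <monospace>collect_elements</monospace> and the example weightings are hypothetical.</p>
      <preformat><![CDATA[
```python
import re


def average_weighting(instances, pattern_class):
    """Sum the weights of matched instances and divide by the instance count.

    pattern_class is a list of (regex, weight) pairs. Each instance adds the
    weight of the first matching regex in descending weight order (an
    assumption of this sketch); unmatched instances add nothing.
    """
    if not instances:
        return 0.0
    ranked = sorted(pattern_class, key=lambda pair: pair[1], reverse=True)
    total = 0.0
    for value in instances:
        for regex, weight in ranked:
            if re.fullmatch(regex, value):
                total += weight
                break
    return total / len(instances)


def collect_elements(dataset, pattern_class):
    """Map each schema element with a non-zero average to that average.

    dataset maps a schema element name to its list of instance values.
    """
    collected = {}
    for name, instances in dataset.items():
        avg = average_weighting(instances, pattern_class)
        if avg != 0:
            collected[name] = avg
    return collected
```
]]></preformat>
      <p>For a hypothetical class containing only the pattern <monospace>[0-9]{4}</monospace> with weighting 0.5, the instances "2010", "2011", "n/a" and "2012" give a sum of 1.5 over four instances, i.e. an average weighting of 0.375.</p>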
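      <p>The Cartesian product of <inline-formula><tex-math>M_x</tex-math></inline-formula> and <inline-formula><tex-math>N_x</tex-math></inline-formula> is computed and added to <inline-formula><tex-math>\mathit{Matches}_x</tex-math></inline-formula>,
in which a triple <inline-formula><tex-math>(S_M, S_N, avg(S_M) \cdot avg(S_N))</tex-math></inline-formula> defines a match between an <inline-formula><tex-math>S_M</tex-math></inline-formula>
and an <inline-formula><tex-math>S_N</tex-math></inline-formula> with the probability <inline-formula><tex-math>avg(S_M) \cdot avg(S_N)</tex-math></inline-formula>. Finally, the result set
<inline-formula><tex-math>\mathit{Matches}_x</tex-math></inline-formula> contains all matches between the two datasets <inline-formula><tex-math>M</tex-math></inline-formula> and <inline-formula><tex-math>N</tex-math></inline-formula>.</p>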
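      <p>The final step can be sketched in Python as follows, again independently of the authors' Java implementation. The sketch assumes that the two sets are given as dictionaries from schema element name to average weighting and that the match probability is the product of the two averages. The name <monospace>cartesian_matches</monospace> and the descending sort of the result are additions made for illustration.</p>
      <preformat><![CDATA[
```python
from itertools import product


def cartesian_matches(m_x, n_x):
    """Pair every element of m_x with every element of n_x.

    m_x and n_x map schema element names to their average weightings.
    Each result triple is (element of M, element of N, product of averages).
    Sorting by score is a convenience of this sketch, not part of the text.
    """
    triples = [
        (s_m, s_n, avg_m * avg_n)
        for (s_m, avg_m), (s_n, avg_n) in product(m_x.items(), n_x.items())
    ]
    return sorted(triples, key=lambda triple: triple[2], reverse=True)
```
]]></preformat>
      <p>With hypothetical averages of 0.5 for an element "year" in one dataset and 0.75 for an element "jahr" in the other, the resulting triple carries the score 0.375. An empty set on either side produces no matches for that pattern class.</p>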
      <p>Our approach has been implemented in Java using the JENA API. The source
code and an executable jar file are available at https://github.com/mazlo/smurf.
In first experiments with real-world statistical data we obtained better results
for matching schema elements than other existing matching systems. A detailed
evaluation with generic test datasets is currently work in progress. We aim to
extend our approach to extract patterns from instance values and to generate
weightings automatically. Feature extraction from instance values can enhance
our approach in computing weightings and in assigning regular expressions to
adequate pattern classes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Engmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maßmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Instance Matching with COMA++</article-title>
          .
          <source>BTW Workshops</source>
          ,
          <year>2007</year>
          ,
          <fpage>28</fpage>
          -
          <lpage>37</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Why Your Data Won't Mix</article-title>
          .
          <source>Queue</source>
          , ACM,
          <year>2005</year>
          ,
          <volume>3</volume>
          ,
          <fpage>50</fpage>
          -
          <lpage>58</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Ontology Matching: State of the Art and Future Challenges</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <year>2011</year>
          ,
          <fpage>99</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zaiss</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schlueter</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Conrad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Instance-Based Ontology Matching Using Different Kinds of Formalisms</article-title>
          .
          <source>Proceedings of the International Conference on Semantic Web Engineering</source>
          , Oslo, Norway, July,
          <year>2009</year>
          ,
          <fpage>29</fpage>
          -
          <lpage>31</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>