<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Rule Learning Approaches to Instance-based Ontology Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frederik Janssen</string-name>
          <email>jan@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faraz Fallahi</string-name>
          <email>fallahi@ontoprise.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Noessner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>paulheimg@ke.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KR &amp; KM Research Group University of Mannheim</institution>
          ,
          <addr-line>B6 26, 68159 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge Engineering Group, TU Darmstadt</institution>
          ,
          <addr-line>Hochschulstrasse 10, 64289 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ontoprise GmbH</institution>
          ,
          <addr-line>An der Raumfabrik 29, 76227 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ontology matching approaches have mostly worked on the schema level so far. With the advent of Linked Open Data and the availability of a massive amount of instance information, instance-based approaches become possible. This position paper discusses approaches and challenges for using those instances as input for machine learning algorithms, with a focus on rule learning algorithms, as a means for ontology matching.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Today, data integration results produced by various developed automated
algorithms and systems are still error-prone. Therefore, the eld of data integration
is a major challenge of modern information technology and of signi cance to
various research areas. Ontologies can play a key role in resolving semantic
heterogeneity by providing a formal description of a domain.</p>
      <p>Not only since the advent of Linked Open Data, ontology mappings have
become an essential ingredient to ensure the interoperability of data sets using
di erent ontologies. While Linked Open Data is much concerned about the
linking of instances, mappings on the schema level, e.g. for federating queries across
di erent data sets, are currently under-represented. At the same time, the vast
amount of instance data in Linked Open Data allows for developing powerful
instance-based approaches to ontology matching to complement the currently
predominant schema-level approaches.</p>
      <p>Utilizing matching algorithms with lexical distance measure and pattern
recognition techniques for automatically nding mappings between ontologies do
not always yield the expected results. Where lexical distance measure algorithms
fail, machine learning algorithms which are based on symbolic representations
can serve as a remedy, at least to some extent.</p>
      <p>
        In the last years, a major research focus has been set on schema based
solutions. Hence, a variety of state-of-the-art algorithms for alignment have been
developed, which work mostly on traversing schemas and their structure, and
only a few approaches explicitly use data about instances [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In the scope of
the MappingAssistant project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], instance data has been utilized to repair and
re ne existing ontology alignments.
      </p>
      <p>In this paper, we discuss two possible approaches for employing machine
learning for instance-based ontology matching. The basic idea of both of them is
to use rule learning to solve this problem. The main advantage of rule learning
is its symbolic character, i.e., rules can be interpreted and compared to each
other. In a rst case study, we use association rule learners on the instance level
to discover mappings. A second case study shows how separate-and-conquer rule
learners can be employed to further re ne mappings. Both case studies aim at
demonstrating the ideas and exploring possible challenges.
2</p>
      <p>Case Study 1 { Creating Mappings by Rule Learning
In this case study, we focus on cases where a set of instances is contained
in di erent data sets { either identi ed by identical URIs or connected via
owl:sameAs statements. While this is rather frequent in Linked Open Data,
the OAEI datasets typically do not contain shared instances. Therefore, we have
created our own dataset, using an excerpt from DBpedia. As DBpedia uses its
own ontology as well as YAGO for classi cation, it provides an instance set using
di erent ontologies, which can be used as a benchmark for instance-based
ontology matching. To construct our dataset, we used an existing, publicly available
partial 1:1 mapping4 between the DBpedia and the YAGO ontology as a starting
point. The dataset contains 169 simple mappings between DBpedia and YAGO.
For that dataset, we retrieved all instances that contain at least one of the types
in the mapping. That data set has a total of 231,635 instances.</p>
      <p>
        We use association rule mining to discover ontologies, as suggested by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Association rules are used, e.g., to discover sets of items that are likely to be
bought together, in order to build recommender algorithms. Likewise, we use
them to nd out which classes (and other ontology constructs) frequently
cooccur, and derive mappings from those co-occurences.
      </p>
      <p>The portions of the two ontologies under consideration are essentially di
erent. For example, the YAGO ontology relies mainly on a rich classi cation and
contains very special classes such as yago:IndustrialRockMusicGroups, while
the DBpedia relies on relations to describe such classes, e.g., as a DBpedia-owl:
Band with a value of DBpedia:IndustrialRock for DBpedia-owl:genre5. Thus,
the mapping between the two ontologies can only be expressed by a complex one:
yago:IndustrialRockMusicGroups</p>
      <p>DBpedia-owl:Band u 9DBpedia-owl:genre: fDBpedia:IndustrialRockg
4 http://www.netestate.de/De/Loesungen/DBpedia-YAGO-Ontology-Matching
5 See http://DBpedia.org/resource/Nine_Inch_Nails for this example.
0
&gt;=95
&gt;=90 &gt;=85
Minimum confidence threshold</p>
      <p>
        In a rst experiment, we tried to infer the simple 1:1 mappings. Using the
feature generation toolkit FeGeLOD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we added a boolean feature for each
rdf:type property for all the instances in our data set, ending up with a total of
98,414 features. On this data set, we used an association rule mining algorithm
to nd rules expressing frequent co-occurring rdf:type statements, i.e., two
classes in di erent ontologies that share many instances. For example, from two
symmetrically occurring rules, we conclude a mapping as follows:
)
      </p>
      <p>DBpedia-owl:ProtectedArea</p>
      <p>
        yago:Park
DBpedia-owl:ProtectedArea
yago:Park
DBpedia-owl:ProtectedArea
yago:Park
As discussed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an F-Measure of up to 25% can be achieved with this naive
approach. While this is not as much as state-of-the-art approaches, the mappings
found are often hard to nd with conventional approaches, as the above example
shows.
      </p>
      <p>In a second experiment, we tried to infer complex mappings between two
classes. To that end, we have created a second dataset which comprises not only
the type information as in the rst case, but also boolean properties indicating
whether there is an incoming or outgoing relation of a certain type. The
resulting data set has 117,029 features. From that dataset, we were able to discover
complex mappings, such as</p>
      <p>1DBpedia-owl:name v yago:Person
Since there is, to the best of our knowledge, no gold standard for complex
mappings, we have evaluated our approach by counting the number of rules found,
and manually determining the rules that are correct. The results are depicted in
Fig. 1 for di erent thresholds. The experiments show that it is possible to employ
association rule learning for discovering ontology mappings. Typical challenges
include scalability to larger datasets, learning more expressive rules, and
evaluation of complex mappings.</p>
      <p>dataset from ontology O1
@data
high,low,high,medium,low
high,low,low,high,medium
low,low,high,high,low
low,low,low,low,medium
medium,high,high,low,low
medium,high,low,high,medium
low,high,high,medium,high
...</p>
      <p>dataset from ontology O2
@relation cars
@attribute acceleration {low,medium,high}
@attribute cargoCapacity {low,high}
@attribute passengerSpace {low,high}
@attribute convenience {low,medium,high}
@attribute numberOfExtras {low,medium,high}
@attribute mpg {low,medium,high}
@data
high,low,high,medium,medium,low
high,low,low,low,high,high
low,low,high,high,high,low
low,low,low,high,medium,low
low,high,high,high,low,medium
medium,high,high,high,high,medium
low,high,high,medium,low,high
...
In the second case study, our goal is to use classi cation rule learning to nd
non-trivial alignments between ontologies, which cannot be found through
conventional schema-based matching techniques. We focus on nding additional
property alignments, assuming that some alignments between properties de ned
in two ontologies O1 and O2 are given as an initial mapping, as indicated by the
arrows in Fig. 2. Note that in that example, the property numberOfExtras that
is present in O2 is missing in O1.</p>
      <p>The goal is to detect additional property alignments between O1 and O2. In
the example, the alignment still left to detect is milesPerGallon mpg.</p>
      <p>For each unmapped property, we construct a dataset as input for a rule
learner. The datasets are created by including all the instances available from
the A-boxes of the two ontologies, and using the unmapped property as a
class attribute. In Fig. 2, three classi cation problems would be created:
using milesPerGallon as the target class in the data set for O1, and using
numberOfExtras and mpg as the target class in the data set for O2.</p>
      <p>In our example, we learn rule sets for each non-aligned property q1 and
q2, using a CN2-like rule learner. While for the illustration of the approach
the actual learning algorithm is not that important (given that the output are
rules), certainly the performance of the approach is strongly related to the used
algorithm (cf. the discussion in Section 4). In the end, the rule learner outputs
three rule sets, one for each non-aligned property, e.g., one ruleset R1 for the
property milesPerGalllon of O1, and two rulesets (R2 and R3) for the two
properties that reside unmapped in O2 (numberOfExtras and mpg). The key
idea of the approach is to compare the rules found on the instances of the two
ontologies: if two rules comprise attribute-value tests for properties that are
previously mapped, the consequence is that the target properties (the prediction
of the rules) then can be also mapped, because their characteristics are similar
to a certain extent.</p>
      <p>To illustrate such a mapping, in the following examples for similar and
di erent rules are shown. The best case is that the rules are identical: r1;1:
milesPerGallon=medium convenienceRating=high ^ acceleration=high.
r2;1: mpg=medium convenience=high ^ acceleration=high. where ri;j stands for
the j-th rule of rule set i. Rule r1;1 tests whether the convenienceRating is high
and the acceleration is high. Due to the prede ned mappings, it can be
concluded that both rules use the same properties with the same values and hence
are similar (with con dence 1:0). Note that all numerical values of all properties
are discretized in a preprocessing step for simpli cation issues. Therefore, rules
that are exactly similar may not be so rare. Nevertheless, rules that are found
on real world data may not be identical. An example for such di erent rules is
given below.
r1;3: milesPerGallon=high acceleration=medium ^ cargoCapacityRating=low.
r3;3: numberOfExtras=high convenience=high ^ passengerSpace=high.</p>
      <p>In that case, the rules are completely dissimilar. Thus, a similarity measure
simr(r; r0) for a pair of rules r and r0 is needed. This similarity of individual
rules is used then to compute the similarity of the whole rule sets simR(R; R0).
A naive choice to de ne the similarity of two rule sets could look as follows:
simR(R; R0) =</p>
      <p>Psimr(r1;i;r2;j)
(jD1j+jD2j)=2
(tp(r1;i)+tp(r2;j))=2
where jDij is the total number of examples in dataset Di, tp(r) yields the number
of true positive examples covered by rule r (i.e., the examples that are correctly
covered by the rule r), and is a threshold that decides whether the rules are
equal or not. The similarity of single rules can naively be de ned by
simr(r; r0) =
(1 if r matches r0 exactly
0 otherwise
(1)
(2)
Clearly, in real world situations these de nitions are too restrictive. Nevertheless,
despite their simplicity, rst experiments show that they are su cient. These
preliminary experiments with two ontologies containing a few instances show
that the approach works. The two rule sets R1 (rules for milesPerGalllon)
and R3 (rules for mpg) were identical. In contrast, the rule sets learned for
numberOfExtras and milesPerGallon are completely di erent, since these are
di erent properties. Thus, the overall similarity value for numberOfExtras and
milesPerGallon is signi cantly lower than for mpg and milesPerGallon.
Therefore, in the end, the mapping milesPerGallon mpg was found while
numberOfExtras is left unmapped.</p>
      <p>
        The case study shows that rule learning can be used to derive additional
mappings from those already found. Open challenges include the de nition of
metrics for rule comparison, the use of suitable rule learning heuristics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and
the decomposition of the ontology matching problem into a binary classi cation
problem.
      </p>
      <p>Janssen et al.</p>
      <p>Conclusion and Challenges
In this paper, we have discussed approaches of learning and re ning non-trivial
ontology alignments from instances through utilization of inductive rule
learning algorithms. We have shown two case studies which examine how nding
and re ning ontology mappings can be reformulated as problems of association
rule mining and separate-and-conquer rule learning, respectively. While the case
studies show the feasibility of employing rule learning algorithms for ontology
matching, there are quite a few open research issues and challenges.</p>
      <p>The rst case study has shown an approach which is especially suitable for
discovering complex mappings. Suitable benchmark datasets for complex
mappings do not exist at the moment. Generating those benchmark sets from Linked
Open Data is a suitable option, which at the same time produces new scalability
challenges. When dealing with instance-based matching, using a reasoner in the
loop is also an interesting opportunity for re ning mappings. Necessary
preprocessing steps, e.g., for dealing with numerical data properties, are also a subject
of future research.</p>
      <p>The second case study has used the similarity of rule sets as a means of
assessing validity of ontology mappings. Finding suitable similarity measures of
rule sets is a challenge for future work. It may also be a valid direction to use other
learning models (e.g. subgroup discovery, or even decision trees) and suitable
heuristics, which may allow di erent and maybe better means for comparison.</p>
      <p>Although those challenges are non-trivial and require further deep
investigations, we are con dent that rule learning is suitable means to instance-based
ontology matching. Thus, we are con dent that the approaches sketched in this
paper will lead to the development of high-performance matching algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure</surname>
          </string-name>
          .
          <article-title>Ontology Mapping - An Integrated Approach</article-title>
          . In C. Bussler,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          , and R. Studer, editors,
          <source>Proceedings of the 1st European Semantic Web Symposium</source>
          , volume
          <volume>3053</volume>
          , pages
          <fpage>76</fpage>
          {
          <fpage>91</fpage>
          ,
          <string-name>
            <surname>Heraklion</surname>
          </string-name>
          , Greece,
          <year>2004</year>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sztyler</surname>
          </string-name>
          , J. No
          <issue>ner</issue>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          . CODI:
          <article-title>Combinatorial Optimization for Data Integration: Results for OAEI 2011</article-title>
          .
          <source>In Proceedings of the 6th International Workshop on Ontology Matching</source>
          , Bonn, Germany,
          <year>October 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>F.</given-names>
            <surname>Janssen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          <article-title>rnkranz. On the Quest for Optimal Rule Learning Heuristics</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>78</volume>
          (
          <issue>3</issue>
          ):
          <volume>343</volume>
          {
          <fpage>379</fpage>
          ,
          <string-name>
            <surname>March</surname>
          </string-name>
          <year>2010</year>
          . DOI 10.1007/s10994-009-5162-2.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Noessner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fallahi</surname>
          </string-name>
          , , E. Kiss, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          .
          <article-title>Interactive data integration with mappingassistant</article-title>
          .
          <source>In Demo Proceedings of the 10th International Semantic Web Conference (ISWC)</source>
          , Bonn, Germany,
          <year>October 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          <article-title>rnkranz. Unsupervised Feature Generation from Linked Open Data</article-title>
          .
          <source>In 2nd International Conference on Web Intelligence</source>
          , Mining, and
          <string-name>
            <surname>Semantics</surname>
          </string-name>
          ,
          <year>2012</year>
          . to appear.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Vo</surname>
          </string-name>
          <article-title>lker and M. Niepert. Statistical Schema Induction</article-title>
          .
          <source>In Proceedings of the 8th Extended Semantic Web Conference: Linked Open Data Track</source>
          , pages
          <volume>124</volume>
          {
          <fpage>138</fpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>