<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Applying Matching Tools to Large-Scale Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko.paulheim@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SAP Research</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Many existing ontology matching tools do not scale well. In this paper, we present the Malasco system, which serves as a framework for reusing existing, non-scalable matching systems on large-scale ontologies. The results achieved with different combinations of partitioning and matching tools are discussed, and optimization techniques are examined. It is shown that the loss of result quality when matching with partitioned data can be reduced to less than 5% compared to matching with unpartitioned data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The need for matching large-scale ontologies arises in different fields. In electronic
business, several large ontologies representing business standards are in use [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Another example is the field of medical research, where large databases and
ontologies exist in which taxonomies, definitions, and experimental results are
stored.
      </p>
      <p>
        To the best of our knowledge, there are only three works which explicitly
address the scalability issue: The schema matching system COMA++ [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the
ontology matching tool Falcon-AO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and the MOM approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; however,
there seems to be no publicly available implementation of the latter. All of
them address the scalability problem by first partitioning the input ontologies
into smaller sub-ontologies and then performing the actual matching task on
the partitions. This approach seems promising, although one must take care to
implement the partitioning step in a way that large ontologies can be processed,
in order not to replace one bottleneck with another.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>Scalability of Existing Matching Tools</title>
      <p>To examine the scalability of existing ontology matching tools, we used two pairs
of large ontologies: the e-business standards eClass (375K triples) and UNSPSC
(83K triples), and the medical ontologies GO (465K triples) and NCI thesaurus
(543K triples). From the large variety of matching tools, we chose tools that
are publicly available and widely known, two of which focus explicitly on the
matching of large-scale ontologies. We conducted tests with the above-mentioned
COMA++ and Falcon-AO, as well as FOAM, INRIA, PROMPT, and CROSI
(using a simple string comparator), on a standard desktop PC.
      </p>
      <p>The business pair could only be processed by COMA++ and Falcon-AO. The
larger medical pair could not be processed by any of the tools examined. Most of
the tools suffered from a lack of memory. These experiments show that matching
large ontologies is a severe problem with many of the tools that are currently
available.</p>
    </sec>
    <sec id="sec-2">
      <title>The Malasco System</title>
      <p>The system introduced in this paper is called Malasco (Matching large scale
ontologies). It allows matching large-scale ontologies by first partitioning the
input ontologies. The actual matching is then carried out on the smaller partitions.</p>
      <sec id="sec-2-1">
        <title>Design</title>
        <p>This approach has also been implemented in COMA++ and Falcon-AO. Unlike
those systems, our implementation follows a more modular design, which allows
the use of different existing systems both for partitioning and for matching the
partitions. This approach has several advantages:
- Existing matching and partitioning tools can easily be reused. This lowers the
effort of setting up a matching solution and offers the possibility to benefit
from future developments without having to modify the system.
- Different matching tools provide results of varying quality, depending on the
nature of the input ontologies. Therefore, building a system that can work
with different matching tools is a promising approach for creating a versatile
tool.
- From an academic point of view, the approach allows experiments on
different combinations of partitioning and matching tools.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Partitioning approaches</title>
        <p>
          As a simple partitioning approach, we implemented a naive baseline algorithm
which iterates over the RDF sentences [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and creates chunks of N triples. While
that approach is rather naive (as it does not create clusters of concepts that
are semantically related), two more sophisticated algorithms are used in the
prototype: the islands algorithm developed by Stuckenschmidt and Klein [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], implemented in the tool PATO, and the E-connections algorithm proposed by
Grau et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], implemented in the tool SWOOP.
        </p>
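<p>The baseline partitioner is simple enough to sketch directly. The following Python fragment is an illustrative sketch only (the function name and the representation of triples as tuples are our assumptions, not Malasco's actual API); it slices a triple list into chunks of at most N statements, with no semantic clustering:</p>

```python
def partition_triples(triples, max_size=5000):
    """Naive baseline partitioner: split a list of RDF triples into chunks
    of at most max_size statements. No attempt is made to keep semantically
    related concepts together (hypothetical sketch, not Malasco's code)."""
    return [triples[i:i + max_size] for i in range(0, len(triples), max_size)]
```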
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>The Malasco system has been evaluated in two respects: the ability to process
large-scale ontologies, and the quality of the matching results achieved.</p>
      <sec id="sec-3-1">
        <title>Scalability</title>
        <p>To demonstrate that our system is capable of matching large-scale ontologies,
we used the test ontologies and test environment described in section 2. As an
example, we used the baseline algorithm with a maximum partition size of 5,000
statements and the INRIA matching system. Our system could process both
pairs; the largest amount of time (more than 100 times longer than the rest of
the process) was consumed by the pairwise matching of partitions.</p>
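<p>The pairwise matching step dominates runtime because every partition of one ontology is matched against every partition of the other. A hedged sketch of that quadratic loop (hypothetical names; any pluggable matcher returning mapping elements would fit) is:</p>

```python
def match_partitioned(parts_a, parts_b, match):
    """Match every partition pair; `match` is any pluggable matcher that
    returns a list of (entity_a, entity_b, confidence) mapping elements.
    The double loop is quadratic in the number of partitions, which is
    why this step dominates the overall runtime."""
    alignment = []
    for pa in parts_a:
        for pb in parts_b:
            alignment.extend(match(pa, pb))
    return alignment
```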
      </sec>
      <sec id="sec-3-2">
        <title>Result Quality</title>
        <p>While it is obvious that element-based matching algorithms can be run on
partitions of the input ontologies with unchanging results (given a covering
partitioning), most matching systems are structure-based and will thus produce different
(and probably worse) results on partitioned data. To evaluate how big the loss of
quality is when working on partitioned data, we ran the example matching tools
both on unpartitioned and on partitioned ontologies and compared the results.
Since the matching tools examined can only work on smaller-scale ontologies,
such a comparison is only possible on smaller-scale data sets.</p>
        <p>
          Six pairs of ontologies of a size between 600 and 2,000 statements were used
for evaluation. For partitioning, we used two variants of the baseline algorithm
(with 250 and 500 statements as a maximum), two variants of the islands
algorithm (with 50 and 100 classes per island as a maximum), the E-connections
algorithm, and the unpartitioned ontologies for comparison. For pairwise matching
of the partitions, we used INRIA [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and FOAM [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] in their respective standard
configurations (both of which partly rely on structure-based algorithms).
        </p>
        <p>To evaluate the results, we calculated recall, precision, and F-measure. While
the recall value achieved on partitioned data is as high as (and in some cases
even slightly higher than) the result on unpartitioned data, the precision value is
less than 50% of that achieved on unpartitioned data, caused by a very high
number of false positives.</p>
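<p>For reference, the three evaluation measures can be computed from the found and the reference alignment as follows (an illustrative sketch with hypothetical names, treating alignments as sets of correspondence pairs):</p>

```python
def evaluate(found, reference):
    """Precision, recall, and F-measure of a found alignment against a
    reference alignment, both given as sets of correspondence pairs."""
    found, reference = set(found), set(reference)
    tp = len(found & reference)                        # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f_measure
```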
      </sec>
      <sec id="sec-3-3">
        <title>Optimization I: Using overlapping partitions</title>
        <p>To achieve better results, in particular better precision values, we tested two
optimization approaches. The first one is motivated by the insight that
structure-based matching approaches use information on neighboring elements. For
partitioned ontologies, that information is missing for elements on the border of a partition.
Hence, for the first optimization approach, we added the direct neighbors for
each concept contained in a partition, thus creating overlapping partitions. The
matching is then performed on the overlapping partitions. Mapping elements
found between the neighboring elements are discarded, because the matching
algorithm only has partial information about those elements.</p>
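<p>The overlap construction described above can be sketched as follows (assuming a hypothetical adjacency-map representation of the ontology, not the system's actual data structure):</p>

```python
def add_overlap(partition, neighbors):
    """Extend a partition with the direct neighbors of its concepts, yielding
    an overlapping partition. `neighbors` maps each concept to the concepts
    directly connected to it (hypothetical representation). Mapping elements
    found on the returned `border` concepts are later discarded, since the
    matcher only has partial information about them."""
    extended = set(partition)
    for concept in partition:
        extended.update(neighbors.get(concept, ()))
    border = extended - set(partition)
    return extended, border
```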
        <p>
          When using overlapping partitions, a significant improvement of the precision
value can be observed (the loss of precision can be limited to less than 20%),
almost without any negative effect on the recall value. On the other hand, since
the overlapping partitions are larger, the matching phase runs up to four times
as long as for non-overlapping partitions.
        </p>
        <p>(For the E-connections algorithm, various problems can be observed [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]; two ontologies could not be partitioned at all. Therefore, that algorithm is
considered not suitable and is not regarded any further in the following results.)</p>
      </sec>
      <sec id="sec-3-4">
        <title>Optimization II: Thresholding</title>
        <p>
          The second approach to improve our system's results' quality is the use of a
lower threshold. As most matching systems provide a confidence value with
each mapping element, a lower threshold can be employed to discard all elements
with a confidence value below that threshold in order to improve the results [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
This approach is motivated by the observation that the average confidence
value is significantly lower for false positives than for true positives.
        </p>
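<p>Applying such a lower threshold is a simple linear filter over the mapping elements; a minimal sketch (assuming mapping elements are (entity, entity, confidence) triples, which is our assumption rather than a fixed format) is:</p>

```python
def apply_threshold(mappings, threshold):
    """Discard all mapping elements whose confidence lies below the lower
    threshold. Linear in the number of results (illustrative sketch)."""
    return [(a, b, c) for a, b, c in mappings if c >= threshold]
```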
        <p>To determine an optimal lower threshold, we calculated precision, recall,
and F-measure for threshold values between 0 and 1 and determined the
average optimal (w.r.t. F-measure) threshold value for each partitioning algorithm,
including the unpartitioned case for comparison.</p>
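<p>The threshold sweep can be sketched as follows (an illustrative reconstruction with hypothetical names, not the actual evaluation code; the reference alignment is treated as a set of correspondence pairs):</p>

```python
def best_threshold(mappings, reference, steps=101):
    """Sweep lower thresholds in [0, 1] and return the one maximizing
    F-measure. `mappings` holds (entity_a, entity_b, confidence) triples;
    `reference` is the gold-standard set of pairs (hypothetical sketch)."""
    ref = set(reference)
    best_t, best_f = 0.0, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        kept = {(a, b) for a, b, c in mappings if c >= t}
        tp = len(kept & ref)
        p = tp / len(kept) if kept else 0.0
        r = tp / len(ref) if ref else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```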
        <p>Thresholding the results leads to significant improvements in precision and
F-measure. Fig. 1 shows the results using the matching system FOAM. (The
results with INRIA were in most cases comparable to those achieved with
FOAM and are therefore not shown separately.) The improvement is stronger
than that of using overlapping partitions: more than 95% of the F-measure
achieved on unpartitioned data can be reached (even up to 99% for INRIA).
Applying a filter which is optimal for a given partitioning technique leads to
almost the same results; thus, the choice of an actual partitioning algorithm is
of marginal effect.</p>
        <p>Using the thresholding optimization is less costly than using overlapping
partitions: the matching system does not have to work on larger partitions, and the
runtime complexity of applying the threshold is only linear in the number of
results. Combining both overlapping partitions and thresholding leads only to
minimal improvements (less than 5%) compared to thresholding alone. Since
using overlapping partitions is rather costly, thresholding alone is the more
appropriate approach in most usage scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have presented the Malasco framework which allows using
existing matching tools for matching large-scale ontologies. Its modular
architecture allows for using arbitrary partitioning and matching tools, including
domain-specific tools for particular matching tasks.</p>
      <p>In our evaluation, we have shown that our system is actually capable of
matching large ontologies, that the choice of a particular partitioning algorithm
is only of minor importance, and that the quality deviation compared to the
results which would be achieved on the unpartitioned ontologies (given a matcher
that could process them) can be reduced to less than 5%.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>The author would like to thank the members of the e-business integration group
at Hochschule Darmstadt and the members of the knowledge engineering group
at TU Darmstadt for their valuable and helpful remarks.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Rebstock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fengel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Ontologies-based Business Integration</article-title>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahm</surname>
          </string-name>
          , E.:
          <article-title>Matching large schemas: Approaches and evaluation</article-title>
          .
          <source>Information Systems</source>
          <volume>32</volume>
          (
          <issue>6</issue>
          ) (
          <year>2007</year>
          )
          <fpage>857</fpage>
          -
          <lpage>885</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Partition-based block matching of large class hierarchies</article-title>
          . [
          <volume>11</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Matching large scale ontology effectively</article-title>
          . [
          <volume>11</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Ontology summarization based on RDF sentence graph</article-title>
          . In Williamson,
          <string-name>
            <given-names>C.L.</given-names>
            ,
            <surname>Zurko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.E.</given-names>
            ,
            <surname>Patel-Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.F.</given-names>
            ,
            <surname>Shenoy</surname>
          </string-name>
          , P.J., eds.
          <source>: Proceedings of the 16th International Conference on World Wide Web, WWW</source>
          <year>2007</year>
          , Banff, Alberta, Canada, May 8-
          <issue>12</issue>
          ,
          <year>2007</year>
          , ACM (
          <year>2007</year>
          )
          <fpage>707</fpage>
          -
          <lpage>716</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Stuckenschmidt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Structure-based partitioning of large concept hierarchies</article-title>
          . [
          <volume>12</volume>
          ]
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sirin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modularizing OWL ontologies</article-title>
          . In Sleeman, D.,
          <string-name>
            <surname>Alani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brewster</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
          </string-name>
          , N., eds.
          <source>: Proceedings of the KCAP-2005 Workshop on Ontology Management</source>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An API for ontology alignment</article-title>
          . [
          <volume>12</volume>
          ]
          <fpage>698</fpage>
          -
          <lpage>712</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ehrig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ontology Alignment - Bridging the Semantic Gap</article-title>
          .
          <source>Semantic Web and Beyond. Computing for Human Experience</source>
          . Springer, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shvaiko</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>Ontology Matching</source>
          . Springer, Berlin, Heidelberg, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mizoguchi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giunchiglia</surname>
          </string-name>
          , F., eds.:
          <source>The Semantic Web - ASWC</source>
          <year>2006</year>
          ,
          <article-title>First Asian Semantic Web Conference</article-title>
          . Number 4183 in LNCS, Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>McIlraith</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
          </string-name>
          , D., van
          <string-name>
            <surname>Harmelen</surname>
          </string-name>
          , F., eds.:
          <source>The Semantic Web - ISWC 2004: Proceedings of the Third International Semantic Web Conference. Number 3298 in LNCS</source>
          , Springer (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>