<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rule Induction for Concept Hierarchy Alignment</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ICHISE Ryutaro, TAKEDA Hideaki. and HONIDEN Shinichi Intelligent Systems Research Division, National Institute of Informatics</institution>
          ,
          <addr-line>2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To manage information like ontology, we usually use categorization with concept hierarchy. Such concept hierarchies are managed individual for each system due to the many differences in concept hierarchies. Consequently, it is difficult to reuse information in computer-based systems. Here, we propose a new concept alignment method for concept hierarchies as a solution to this problem, and construct a system to evaluate the performance of our method. The results of this experiment reveal that the proposed method can be used to induce appropriate align rules for concept hierarchies and classify information into appropriate categories within another concept hierarchy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The rapid advances in computer technology have allowed
us to archive much more information than ever before. To
maintain such archives, hierarchical categorization is often
used, by which information is categorized based on a
concept hierarchy. Ontologies and class libraries are examples
of such systems. The hierarchical categorization is usually
maintained by a single organization or individual for
consistency. In practice, a concept hierarchy is not suitable for
multiple purposes, requiring that concept hierarchies be
constructed for individual application domains. In other words,
concept hierarchies are maintained in each domain separately.
This means that it is not convenient to reuse information
because the concept categorization differs between concept
hierarchies. To solve this problem, we propose a new method
that allows a concept in one concept hierarchy to be aligned
with another concept in another concept hierarchy. In this
paper, we describe a machine-learning method for aligning
multiple concept hierarchies.</p>
      <p>This paper is organized as follows: A concept hierarchy
model is defined, followed by the proposal for the new
machine learning method used to align concepts. In the
experiment section, the performance of our system, HICAL
(HIerarchical Concept ALignment system), which is based on the
proposed method, is compared in a variety of settings,
followed by a discussion of the results. We then discuss related
work in regard to HICAL and present the conclusions of this
study.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Concept Hierarchy Model</title>
      <p>In this section, we describe a model of the nature of
concept hierarchies. Many information management systems for
use with conceptual information like ontologies and class
libraries are managed via a system of hierarchical
categorization. A concept hierarchy contains only one conceptual
categorization criterion. We can consider the diagram on the left
side of Figure 1 to be a representation of a single concept
hierarchy.</p>
      <p>C1</p>
      <p>C2
I 1
I 2</p>
      <p>I 3</p>
      <p>In the figure there are two different concept hierarchies (C1
and C2) and three different information instances (I1, I2 and
I3). Some instances are shared between the two concept
hierarchies and some are not. It is important to keep in mind
that these concept hierarchies are not the same. The next step
is to consider an appropriate way to transfer an instance from
C1 into C2. In the example shown in Figure 1, C2 does not
contain I2. If I2 can be placed in the concept hierarchy of
C2, the user can then use I2 with concept hierarchy C2. In
the next section, we propose a method of defining a rule for
conversion from the concept hierarchy of C1 to that of C2, so
that an instance that has already been categorized in C1 can
subsequently be categorized in C2. The most important point
of this approach is that the concept hierarchy of C2 does not
need to be adjusted to fit the concept hierarchy of C1. Thus, a
user can apply our method while continuing to use whichever
concept hierarchy they are accustomed to.
Output:
begin
/* make pair for candidate */
/* using child node or parent node */
statistic” method. When a similar pair is not generated, the
algorithm outputs the alignment rule between the two concept
hierarchies. We can then apply this rule to deciding whether
a particular instance in C1 fits the concept hierarchy in C2.</p>
      <sec id="sec-2-1">
        <title>Input: Table 1: Classification of Instances by Two Categories</title>
        <p>In our proposed method of constructing an alignment rule, we
must first find categories that are similar to each other
(“similar categories”); the instance will then be aligned based on the
similarity between categories. To find similar categories, our
algorithm starts by comparing the most general categories of
the two concept hierarchies. For each pair of categories, we
can determine similarity based on the instances categorized in
the two categories. For each category, we can decide whether
a particular instance belongs to that category. If the
categorization methods of two categories are similar, then the
system can generate an aligning rule for them. For example, if
one category contains 100 instances and a category in another
concept hierarchy contains the same 100 instances, then the
algorithm can generate an aligning rule for these categories,
because they can be considered to have the same
categorization criteria. It should be noted that, because concept
hierarchies are structured as trees, we can easily categorize
according to a nodal structure, such that lower (more specific)
categories are included in higher (more general) categories.</p>
        <p>To find similar categories, we used a statistical method for
determining the degree of similarity between two
categorization criteria. The “ statistic” method [Fleiss, 1973] is an
established method of evaluating similarity between two
criteria. We explain this method briefly. Let us suppose that
there are two categorization criteria, S1 and S2. As
mentioned earlier, we can decide whether a particular
information instance belongs to a particular category or not.
Consequently, instances are divided into four classes shown in
Table 1. Symbols N11; N12; N21; N22 denote numbers of
instances for each class. For example, N11 denotes the number
of instances which belong to both the category S1 and the
category S2. We may logically suppose that if category S1 and
S2 have the same criterion of categorization, then N12 and
N21 become close to zero and if the two categories have a
different criterion of categorization, then N11 and N22 become
close to zero. The “ statistic” method utilizes this principle
to determine the similarity of categorization criteria.</p>
        <p>The relationship between the two categorization criteria is
examined from “top” to “bottom”. The alignment algorithm
is shown in Figure 2. First, the most general categories in the
two concept hierarchies are compared using the “ statistic”.
If the comparison confirms that the two categories are
similar, then the algorithm outputs an alignment rule for them. At
the same time, the algorithm pairs one of these two similar
categories with a “child” category in the other similar
category. This new pair is then evaluated recursively using the “
fi;</p>
        <p>I;</p>
        <sec id="sec-2-1-1">
          <title>4.1 Data and Settings</title>
          <p>In order to evaluate this algorithm, we conducted three
experiments using the Yahoo! Japan [Yahoo! Japan, 2000] and
LYCOS Japan [LYCOS Japan, 2000] directories as concept
hierarchies, and the links (URLs) in each directory as
information instances. The Yahoo! directory contains
approximately 41,000 categories and 224,000 URLs. LYCOS
contains approximately 5,700 categories and 48,000 URLs.
Approximately 25,000 URLs are common to both Yahoo! and
LYCOS. Generally speaking, as a concept hierarchy, Yahoo!
contains more knowledge than LYCOS, however half of the
URLs in LYCOS are not contained in Yahoo!. This
demonstrates that even a concept hierarchy that contains an
enormous amount of information does not cover all information.</p>
          <p>In this study, we used the three category pairs (and
subcategories) as our two concept hierarchies. The location in
Yahoo! and LYCOS are as follows:</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Yahoo! : Arts / Humanities / Literature</title>
        <p>LYCOS : Arts / Literature</p>
        <p>We conducted 10-fold cross validation for shared
instances. Shared instances were divided into 10 data sets; 9
of these sets were used for training and the remaining set was
used for testing. Ten experiments were conducted for each
data set. The parameter of significance level for the “
statistic” was set at 5%.
4.2</p>
        <sec id="sec-2-2-1">
          <title>Results</title>
          <p>Criterion 1</p>
          <p>Criterion 2</p>
          <p>Criterion 1</p>
          <p>Criterion 2
0.8
0.6
ccycua
r
A0.4
0.2
0.8
0
1
0.8
0.6
The results of the experiments are shown in Figure 3, 4 and
5. The vertical axis is the average accuracy of the test data.
“Exact rules” represent values of accuracy for a system that
only uses the alignment rules for the category to which the
instance belongs. “Parent rules” indicates that if the system
does not generate an alignment rule for a category to which
the instance belongs, it will use the rule generated for the
parent category instead. “Criterion 1” indicates that the
instance is categorized in the same category as the test data and
“Criterion 2” indicates that the instance is categorized in the
same category or parent category as the test data. “Criterion
1” is very strict criterion because the target concept
hierarchy should have enough intermediate categories in
comparison with the source concept hierarchy, while “Criterion 2”
is more general and more realistic because it does not matter
which concept hierarchies are rich in categorization. Two
directions of alignment are presented; from Yahoo! to Lycos,
and from Lycos to Yahoo!.</p>
          <p>B X however concept hierarchy does not divide into
cateS A:/X/S”, because “B:/X” contains instances that belong in
A S concept hierarchy contains category within concept X,
involve transferring instances from relatively complex
categories to relatively general categories. Such a trend is likely
to result in relatively accurate rules. For example, suppose
gories. In such a case, it would be much easier to learn a
rule for “A:/X/S -&gt; B:/X” than to learn a rule for “B:/X -&gt;
and instances that do not. The results shown in figures reflect
this. Nevertheless, in this situation our method works
properly. When regarding parent categories as correct answers
(“Criterion 2”), both alignment directions exhibited almost
the same results, i.e., “B:/X -&gt; A:/X” is learned instead of
“B:/X -&gt; A:/X/S”.</p>
          <p>The limitation of this method is that it does not work in
cases in which the more general and more specific categories
within a concept hierarchy are independent. For example,
let’s consider two concept hierarchies; one that classifies a
food-related instance under foodstuff as the more general
category and then by country of origin as the more specific
category, and another that classifies a food-related instance by
country of origin as the more general category and by
foodstuff as the more specific category. Both types of concept
hierarchy can be found on the concept hierarchy. In such a
case, our current system HICAL would not work, because
comparison between the two categories proceeds from
genMore than 80% of the instances used in our experiments, with
the exception of the company domain, were categorized
correctly by HICAL. In the company domain, more than 60%
of the instances were categorize appropriately. The data in
Figure 3 5 imply that Yahoo-to-Lycos alignment was more
accurate than the inverse operation. As can be seen from
the total number of categories, the categorical hierarchy in
Yahoo! is more complex than that of LYCOS. Therefore,
for Yahoo-to-Lycos alignment, the learned rules are likely to
eral to specific (top to bottom). To solve this problem, we
are planning to combine our current system with a bottom-up
method.
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>One of the systems related to HICAL is the ontology
merging/alignment system. In the merging process for ontology, a
process such as our system is necessary due the requirements
of concept hierarchy management. Chimaera [McGuinness et
al., 2000] and PROMPT [Noy and Musen, 2000] are
examples of such systems, assisting in the combination of different
ontologies. However, such systems require human
interaction for merging or alignment. In addition to this
requirement, they are based on similarity between words, which
introduces instability. Word similarity is often biased by the
dictionaries used. In contrast, our system does not use word
similarity, instead using syntactics alone. Hence, our system
has the ability to find identical concepts regardless of the
category name or word. For example in the experiment
conducted in this study, in the literature domain, HICAL found
the relationship between the “Genji-monogatari” (a famous
Japanese story written by “Murasakishikibu”) category in
LYCOS and the “Murasakishikibu” category in Yahoo!1. In
LYCOS, classical literature is classified by title (concept
category), whereas in Yahoo!, poetry masters are categorized by
author. As the dictionaries commonly used do not contain
such information, word-based systems would not be capable
of finding title/author relationships.</p>
      <p>The bookmark-sharing systems of Siteseer [Rucker, 1997]
and Blink [Blink, 2000] are also similar to HICAL. The main
difference is the use of hierarchies for categorization. The
Siteseer and Blink system only considers the number of URLs
(instances) in a given category, whereas our method uses
hierarchical structures. One of the merits of our approach is
that if there is no exact category into which a given URL
(instance) fits, then the URL (instance) is mapped into the parent
category. kMedia [Takeda et al.2000] is another
bookmarksharing system that uses hierarchical structures explicitly but
is dependent on similarity of words in pages.
BookmarkAgent [Mori and Yamada, 1999] uses another approach,
utilizing bookmarks based on keywords. As mentioned above,
HICAL only uses syntactical information, not words as are
used by a bookmark agent. HICAL is therefore capable of
correctly categorizing different words under the same
concept.
6</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we propose a new method for aligning concept
hierarchies as a new approach to utilizing information in
multiple concept hierarchies, based on statistical methods. To test
our ideas, we conducted experiments using the Yahoo! and
LYCOS categories. Our experimental results show that the
alignment rules learned by HICAL yield reliable alignments,
allowing information in one concept hierarchy to be aligned
to an appropriate position in another concept hierarchy. The
1similar to the relationship between “Sherlock Holmes” and
“Conan Doyle”
advantage of using our method is that it allows users to use
their own concept hierarchy for categorizing all information,
and may serve as a powerful tool for aligning concept
hierarchies.</p>
      <p>With these encouraging results, several research
possibilities present themselves for future development of alignment
strategies. Our alignment method is based on a top-down
approach. We should combine a bottom-up approach to increase
accuracy. In addition, there may exist other possibilities for
alignment. Extending the proposal to applying to more than
three concept hierarchies needs to be investigated. In such a
case, despite confliction between several concept hierarchies,
more hints can be obtained from other concept hierarchies.</p>
      <sec id="sec-4-1">
        <title>Japan.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Blink</source>
          , 2000] Blink. http://www.blink.com/,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Fleiss</source>
          , 1973] Joseph L. Fleiss.
          <article-title>Statistical Methods for Rates and Proportions</article-title>
          . John Wiley &amp; Sons,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[LYCOS Japan</source>
          ,
          <year>2000</year>
          ] LYCOS http://www.lycos.co.jp/,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>[McGuinness</surname>
          </string-name>
          et al.,
          <year>2000</year>
          ]
          <string-name>
            <surname>Deborah L. McGuinness</surname>
            , Richard Fikes, James Rice and
            <given-names>Stive</given-names>
          </string-name>
          <string-name>
            <surname>Wilder</surname>
          </string-name>
          .
          <source>An Environment for Merging and Testing Large Ontologies In Proceedings of the seventh International Conference on Principles of Knowledge Representation and Reasoning(KR2000)</source>
          , Morgan Kaufman Publishers,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Mori and Yamada</source>
          , 1999]
          <string-name>
            <given-names>Mikihiko</given-names>
            <surname>Mori</surname>
          </string-name>
          and
          <string-name>
            <given-names>Seiji</given-names>
            <surname>Yamada</surname>
          </string-name>
          .
          <source>Bookmark-Agent: Information Sharing of URLs In Poster Proceedings of the 8th International World Wide Web Conference(WWW-10)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Noy and Musen</source>
          , 2000] Natalya
          <string-name>
            <given-names>F.</given-names>
            <surname>Noy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark A.</given-names>
            <surname>Musen</surname>
          </string-name>
          . PROMPT:
          <article-title>Algorithm and Tool for Automated Ontology Merging and Alignment</article-title>
          .
          <source>In Proceedings of the 17th National Conference on Artificial Intelligence</source>
          , pages
          <fpage>450</fpage>
          -
          <lpage>455</lpage>
          , Austin, Texas,
          <string-name>
            <surname>July-August</surname>
          </string-name>
          <year>2000</year>
          .
          <article-title>American Association for Artificial Intelligence</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Rucker</source>
          , 1997]
          <string-name>
            <given-names>James</given-names>
            <surname>Rucker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcos J.</given-names>
            <surname>Polanco</surname>
          </string-name>
          . Siteseer:
          <article-title>Personalized Navigation for the Web Communications of the ACM</article-title>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>73</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Takeda et al.2000]
          <article-title>Hideaki Takeda, Takeshi Matsuzuka</article-title>
          and
          <string-name>
            <given-names>Yuichiro</given-names>
            <surname>Taniguchi</surname>
          </string-name>
          .
          <article-title>Discovery of Shared Topics Networks among People - A Simple Approach to Find Community Knowledge from WWW</article-title>
          <source>Bookmarks In Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence(PRICAI-2000)</source>
          , pages
          <fpage>668</fpage>
          -
          <lpage>678</lpage>
          , Melbourne, Australia,
          <source>August 28 - September 1</source>
          , Springer,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Yahoo! Japan, 2000] Yahoo! http://www.yahoo.co.jp/,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>