Recognizing Emergent Nodes in
            Aligning Multiple Document Taxonomies

                                       Tim Musgrove

                                      TextDigger, Inc.
                               305 Vineyard Town Center #375
                                Morgan Hill, CA 95037 USA
                                 tmusgrove@textdigger.com


       Abstract. A document taxonomy alignment method, relying on document
       glosses and utilizing a soft ontology expansion, enables us to devise some all-
       new hierarchical leaf nodes for the purpose of better aligning a plurality of
       document taxonomies.


    1. Introduction

In our past work of mapping different document taxonomies, we frequently were left
with some “isolated nodes”, i.e. categories of documents seeming to have no correlate
in the other taxonomies. An example, in the Archery category on Yahoo, was the
sub-category “Kyudo” (traditional Japanese archery). There was no equivalent to this
category on DMOZ or About.com, the two other taxonomies we were hoping to correlate.
However, a soft ontology expansion we had devised to assist in the mapping
meanwhile produced numerous candidate ontology nodes, such as “coaching/training”
or “competitions/tournaments,” and in this particular case, “traditional archery.”
While not a node in any of the three reference taxonomies, “traditional archery”
nonetheless applied to a great number of documents in all three, and especially in
Yahoo’s “Kyudo” category. Having used DMOZ as our “master taxonomy”, we not
only added “traditional archery” to it, but devised a method of automatically adding
every other similar example, with the result of adding these new nodes: Traditional
Archery, Coaching & Training, Equipment & Gear, Stories & Discussion.
   The first of these, “traditional archery”, included (as a child node) all the Kyudo
documents, plus numerous documents from the other two indices, all of which pertain
to traditional forms of archery. Since there are other traditional forms of archery
(such as medieval European forms) besides Kyudo, it made sense that Kyudo be
subsumed in the new node. We found it rather straightforward to automate this
addition of nodes according to the following heuristic:
     1. Find an expanded concept that is instantiated disproportionately in the
          document glosses of an unmapped node.
     2. Test if that concept is instantiated also in numerous documents not classified at
          a leaf node in a plurality of taxonomies.
     3. If such a concept is found, then create a new node with that concept and place
          the relevant documents under it.
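
The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the density ratio `min_ratio`, and the thresholds `min_docs` and `min_taxonomies` are hypothetical, since the heuristic specifies only "disproportionately" and "numerous" without numeric values.

```python
import re


def tokens(text):
    """Lower-cased alphabetic tokens of a gloss."""
    return set(re.findall(r"[a-z]+", text.lower()))


def instantiates(gloss, concept):
    """True if every word of the concept occurs in the gloss."""
    return set(concept) <= tokens(gloss)


def find_emergent_concepts(concepts, unmapped_glosses, non_leaf_glosses,
                           min_ratio=2.0, min_docs=3, min_taxonomies=2):
    """Return (concept, matching glosses per taxonomy) for each emergent concept.

    concepts          -- tuples of words, e.g. ("traditional", "archery")
    unmapped_glosses  -- glosses of the documents in the isolated node
    non_leaf_glosses  -- dict: taxonomy name -> glosses of non-leaf documents
    The three thresholds are illustrative assumptions, not published values.
    """
    pooled = [g for glosses in non_leaf_glosses.values() for g in glosses]
    emergent = []
    for concept in concepts:
        # Step 1: is the concept disproportionately dense in the unmapped node?
        inside = sum(instantiates(g, concept) for g in unmapped_glosses)
        outside = sum(instantiates(g, concept) for g in pooled)
        density_in = inside / max(len(unmapped_glosses), 1)
        density_out = outside / max(len(pooled), 1)
        if inside == 0 or density_in < min_ratio * density_out:
            continue
        # Step 2: does it also occur in non-leaf documents of several taxonomies?
        hits = {name: [g for g in glosses if instantiates(g, concept)]
                for name, glosses in non_leaf_glosses.items()}
        taxonomies_hit = sum(1 for found in hits.values() if found)
        if outside < min_docs or taxonomies_hit < min_taxonomies:
            continue
        # Step 3: propose a new node named after the concept.
        emergent.append((concept, hits))
    return emergent
```

Comparing densities rather than raw counts keeps a small isolated node such as “Kyudo” from being outweighed by the much larger pool of non-leaf documents.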
   In order to explain how this was accomplished, we will outline (1) our general
approach to taxonomy node alignment by semantic resemblance; (2) our conception
of a soft ontology expansion; and (3) the way in which the results of the soft ontology
expansion can be leveraged to create new nodes as described above.


    2. Taxonomy alignment by semantic resemblance

    One approach to taxonomy alignment is the intensional method, which examines
the semantics of the names of the nodes, and the titles of documents, as well as the
glosses applied to those documents by the taxonomy editors. We applied such a
method to human-crafted document taxonomies bearing short glosses. These glosses
are meant to summarize what the documents are about and what differentiates each
one from others in the same topic; hence they are valuable to our task.
    We take the content words of the document titles and glosses, as well as bi-grams
containing a topic word in any derived form (e.g., in the archery category we would
take “field archery” and “archer’s union”, in addition to single words such as
“arrows” and “bows”). We then check to see which of these may be closely related
by semantic resemblance. For measuring semantic resemblance, we test for
“semantic proximity” in WordNet, which we define as having a maximum distance of
2 in the WordNet hierarchy, with the additional limitations:
    1. Only synonyms, hyponyms, hypernyms, and sister-terms are to be considered.
    2. Sister-terms are considered proximate only if they share multiple content words
in their glosses and/or example sentences in WordNet.
    3. Hypernyms are included only if they are at least 4 levels down in the WordNet
hierarchy from the root.
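
The proximity test and its three limitations can be sketched as below. This is only an illustrative sketch: the tables `PARENT`, `SYNONYM_SETS` and `GLOSS` are a hand-made toy hierarchy standing in for WordNet (hypothetical data, not actual WordNet content), the function names are invented, and counting the root as depth 0 is an assumed reading of "4 levels down".

```python
import re

# Toy stand-in for WordNet (hypothetical data): word -> its hypernym.
PARENT = {
    "entity": None, "object": "entity", "artifact": "object",
    "instrumentality": "artifact", "equipment": "instrumentality",
    "gear": "equipment", "weapon": "instrumentality", "bow": "weapon",
    "crossbow": "bow", "longbow": "bow",
}
SYNONYM_SETS = [{"tournament", "tourney"}]
GLOSS = {
    "crossbow": "a bow fixed on a wooden stock that shoots arrows",
    "longbow": "a powerful wooden bow that shoots arrows",
    "equipment": "an instrumentality needed for an undertaking",
    "weapon": "an instrument used in fighting or hunting",
}
STOP_WORDS = {"a", "an", "the", "on", "in", "or", "for", "that", "used"}

MAX_DISTANCE = 2          # maximum number of hypernym links
MIN_HYPERNYM_DEPTH = 4    # root counts as depth 0 (assumed reading)
MIN_SHARED_WORDS = 2      # "multiple" shared gloss words


def depth(word):
    """Number of hypernym links between the word and the root."""
    d = 0
    while PARENT[word] is not None:
        word = PARENT[word]
        d += 1
    return d


def content_words(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def is_hypernym_within(word, candidate, max_steps):
    """True if candidate is reached by climbing at most max_steps links."""
    steps = 0
    while steps < max_steps and PARENT.get(word) is not None:
        word = PARENT[word]
        steps += 1
        if word == candidate:
            return True
    return False


def proximate(a, b):
    # Synonyms (and identical words) are always proximate.
    if a == b or any(a in s and b in s for s in SYNONYM_SETS):
        return True
    # Hypernym/hyponym pairs: the hypernym must be deep enough (limitation 3).
    for lower, upper in ((a, b), (b, a)):
        if is_hypernym_within(lower, upper, MAX_DISTANCE):
            return depth(upper) >= MIN_HYPERNYM_DEPTH
    # Sister terms: must share multiple gloss content words (limitation 2).
    parent_a, parent_b = PARENT.get(a), PARENT.get(b)
    if parent_a is not None and parent_a == parent_b:
        shared = content_words(GLOSS.get(a, "")) & content_words(GLOSS.get(b, ""))
        return len(shared) >= MIN_SHARED_WORDS
    return False
```

In this toy data, “gear” is proximate to “equipment” (a hypernym at depth 4) but not to “instrumentality” (depth 3, hence too general), and the sister terms “equipment” and “weapon” are rejected because their glosses share no content words.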
    Note that this is similar to (Leacock 1998) in that it considers the depth of the
taxonomy as counting toward semantic nearness, though our implementation is
heuristic rather than statistical. (Since our application is to Web documents, we found
it necessary to ignore certain words that are excessively frequent across all categories,
and hence not useful, such as “photos”, “contact details”, “site map”, etc.). Table 1
shows an outline of one of our case studies.

Table 1. Comparison of Archery in DMOZ, Yahoo and About.com
    DMOZ                        Yahoo                      About.com
    Chats & Forums              Bow Hunting             Shop for Archery & Bowhunting Gear
    Clubs & Associations        Clubs & Organizations   Archery & Bowhunting Gear Manufacturers
    Equipment Manufacturers     Competitions            Archery & Bowhunting Organizations
    For Kids and Teens          Gear & Instruction
    Guides & Directories        Kyudo
    News & Media                Magazines
    Personal Pages              National Teams
    Tournaments & Events        Web Directories

   The result of our method is, for example, that “clubs” and “organizations” are
treated as equivalent terms. This happens by means of a simple percentage match
scoring of the content words in node names. For example, the pair of “Equipment
Manufacturers” and “Archery and Bowhunting Gear Manufacturers” receives a score
of 0.80, owing to the following facts: First, “Archery” is omitted because it is the
same as the overarching topic of “Archery” and hence implicit in all node names.
Second, the stop word “and” is discarded. Third, “gear” is matched to “equipment” as
a hypernym. That leaves five words total, with only one of them (“bowhunting”)
lacking a match: hence the score of 4/5 = 0.80. By trial and error we decided 0.66
was sufficient for alignment.
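
The scoring of this example can be reproduced with a short sketch. The names `match_score`, `related` and `RELATED` are hypothetical, and `related` is a lookup table standing in for the WordNet proximity test; only the 0.66 threshold and the worked example come from the text.

```python
import re

STOP_WORDS = {"and", "for", "of", "the"}
ALIGN_THRESHOLD = 0.66  # value chosen by trial and error in the text

# Stand-in for the WordNet proximity test (assumed pairs, for illustration).
RELATED = {frozenset(("gear", "equipment")),
           frozenset(("associations", "organizations"))}


def related(a, b):
    return frozenset((a, b)) in RELATED


def content_words(name, topic_words):
    """Words of a node name, minus stop words and the implicit topic words."""
    words = re.findall(r"[a-z]+", name.lower())
    return [w for w in words if w not in STOP_WORDS and w not in topic_words]


def match_score(name_a, name_b, topic_words, related):
    """Fraction of content words (from both names) that have a match."""
    a = content_words(name_a, topic_words)
    b = content_words(name_b, topic_words)

    def has_match(word, others):
        return any(word == other or related(word, other) for other in others)

    matched = sum(has_match(w, b) for w in a) + sum(has_match(w, a) for w in b)
    total = len(a) + len(b)
    return matched / total if total else 0.0


score = match_score("Equipment Manufacturers",
                    "Archery and Bowhunting Gear Manufacturers",
                    {"archery"}, related)
print(round(score, 2), score >= ALIGN_THRESHOLD)  # 0.8 True
```

Counting words from both names in the denominator is what yields five words in total here (two from the first name, three from the second), with “bowhunting” the only unmatched one.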
   The virtue of this simple node name resemblance test is that it lets us align, for
example, “Clubs and Organizations” with “Clubs and Associations” in two different
taxonomies. However, it leaves us with the different problem of the numerous
documents not assigned a leaf node. In other words, in all three indices, many
documents were simply classified in “Archery” without being assigned to a sub-
category. In some cases, this seems correct, in that the documents in question were
very general archery documents (or websites) not belonging to any particular sub-
class. But in many other cases, it seemed that a node in a different taxonomy was a
natural place for such documents. For example, a website of personal anecdotes
combined with feedback from others was classified in one taxonomy simply as an
“archery” document, but it would have found a perfect home in “Chats and Forums”.
This defeats taxonomy alignment, in that it is implied that none of the documents in
the one taxonomy would belong in “Chats & Forums” of the other – and yet many of
them did.
   This type of predicament was later resolved, in some cases, by the results of a soft
ontology expansion of all three taxonomies. In other words, after having enriched the
ontological characterization of each specific leaf node, we could often align it with an
appropriate subset of the documents lumped together in a more general topic of a
different taxonomy.


    3. Soft ontology expansion of document taxonomy leaf nodes
       using WordNet

For this exercise, we went back to our extracted words and bi-grams (e.g. “calendar”
and “field archery”, etc.), examined their WordNet glosses and example sentences
and compared them with collocations and phrases in the document glosses, and found
the following to hold true: if two words were frequently paired (collocated after
skipping non-content words) in the taxonomy document glosses and also were found
in each other’s WordNet glosses, they were, without exception (in our case studies),
genuinely related and of ontological import in the category. Our operational
definition of “frequent” was: having at least one occurrence in all three taxonomies
and having multiple occurrences (2 or more) in at least two of three taxonomies.
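
The operational definition above can be written down as a small sketch. The function names, the stop-word list, and the toy glosses used to exercise it are assumptions for illustration; `wordnet_gloss` is a plain dictionary standing in for a WordNet lookup.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with"}


def content_tokens(text):
    """Alphabetic tokens in order, with non-content (stop) words skipped."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def count_collocations(glosses, a, b):
    """Count adjacent occurrences of a and b (either order) after skipping stop words."""
    count = 0
    for gloss in glosses:
        toks = content_tokens(gloss)
        for left, right in zip(toks, toks[1:]):
            if {left, right} == {a, b}:
                count += 1
    return count


def is_frequent(counts):
    """At least one occurrence in every taxonomy, and 2 or more in at least two."""
    return all(c >= 1 for c in counts) and sum(c >= 2 for c in counts) >= 2


def in_each_others_gloss(a, b, wordnet_gloss):
    """True if each word appears in the other's (WordNet) gloss."""
    return (b in content_tokens(wordnet_gloss.get(a, ""))
            and a in content_tokens(wordnet_gloss.get(b, "")))


def is_soft_concept_pair(a, b, glosses_by_taxonomy, wordnet_gloss):
    counts = [count_collocations(glosses, a, b)
              for glosses in glosses_by_taxonomy.values()]
    return is_frequent(counts) and in_each_others_gloss(a, b, wordnet_gloss)
```

With three taxonomies, counts of (2, 2, 1) satisfy `is_frequent`, whereas (1, 1, 1) and (2, 2, 0) do not.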
   This technique has similarities to (Beneventano 2003) and (Martin 2004), in that it
employs WordNet to develop one’s taxonomy and/or ontology. The difference is that
we are driving the process by reference to the glosses already created by editors of the
various taxonomies. Our procedure derived the following soft concepts in Archery:
         [calendar,schedule] having a relation to [event]
         [tournament,competition] having relations to both [results] and [standings]
         [outdoor] having a relation to [ranges]
         [bow] having relations to [crossbow], [compound bow], and [long bow]
   We call these “concepts” rather than merely “word occurrences” in view of the
following: each is based on a small web of similar words, (e.g.
“calendar”|“schedule”); each has an additional word relation (“events,” etc.); all are
contextualized to the local topic of Archery. The totality of all such extracted
concepts we call a “soft ontology,” in that it delineates raw materials of the local
ontology, but obviously falls short of a formal representation of the relations between
the concepts, such as those discussed in (Guarino 1998).
   Next, we checked the non-leaf-node documents’ glosses for the presence of
these concepts. If they matched, then we moved them to the newly created node. For
example, several documents with glosses containing “discussion” and “stories” found
their way into “Stories & Discussion.” In the end, 37 of 189 documents were thus
“migrated downward” to a leaf node, with the result that, on inspection, it seemed the
alignment between taxonomies was more complete and intensionally unequivocal.
This illustrates that taxonomy alignment cannot be divorced from issues of
taxonomical scope and adequacy. If one taxonomy lacks the scope or granularity of
another, then the only way to achieve proper alignment is to sort through some of the
items in the less granular taxonomy so as to “multiply align” it to other nodes.
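
The downward migration of non-leaf documents can be sketched as follows. This is a sketch under assumptions: the function name `migrate_downward` and the input shapes are hypothetical, matching by word prefix (so that “coach” matches “coaching”) is an assumed reading of how gloss words trigger a node, and a document is allowed to land in every node whose trigger it matches.

```python
import re


def migrate_downward(non_leaf_docs, node_triggers):
    """Assign non-leaf documents to nodes whose trigger stems occur in their glosses.

    non_leaf_docs -- dict: document title -> editorial gloss
    node_triggers -- dict: node name -> list of trigger stems
    Returns (placement, remaining): node -> titles, and titles left in place.
    """
    placement = {}
    remaining = []
    for title, gloss in non_leaf_docs.items():
        toks = re.findall(r"[a-z]+", gloss.lower())
        moved = False
        for node, stems in node_triggers.items():
            # Prefix match (assumption): "coach" also matches "coaching".
            if any(tok.startswith(stem) for tok in toks for stem in stems):
                placement.setdefault(node, []).append(title)
                moved = True
        if not moved:
            remaining.append(title)  # genuinely general documents stay put
    return placement, remaining
```

Documents matching no trigger stay at the general topic, which preserves the cases where a general classification was in fact correct.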


    4. Emergent Nodes

Finally, we reached the result that certain of our soft ontology concepts embrace
otherwise isolated nodes of one taxonomy, together with non-leaf-node documents of
another. A clear example was the topic mentioned earlier, “Kyudo.” Our soft
ontology expansion had derived “traditional archery” as a bi-gram. This was very
dense in the Kyudo category (occurring in all but one of its items), while being found
also in 16 non-leaf-node documents in DMOZ, including:
Donadoni Archery: Supplier of traditional archery equipment in Italy…
The Archery Centre: Specialists in field, traditional, and re-enactment archery…
Perris Archery: Recurve, compound and traditional archery equipment…
   Our procedure was to use the concept string as a new node name (inserting “and”
between words that had been found separately rather than directly collocated), and
including as a child node the originally isolated node. So our master taxonomy now
included “Archery/Traditional Archery/Kyudo” with several DMOZ documents
placed in the new node “traditional archery.” This satisfied us as being a far better
alignment than we would have had without the new node. Kyudo documents now had a
closer parent than just being a direct child of “Archery.” And the new interstitial
node of “traditional archery” functions to explain where “Kyudo” belongs. We think
the same of “Stories and Discussion” introduced as a parent of “Chats and Forums”,
and of “Coaching and Training” as a parent for “Instruction” documents that Yahoo
had mixed with “Gear”. Table 2 shows the overall alignment results.
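
The naming and insertion procedure of this section can be sketched as below. The flat path-to-documents dictionary, the function names, and the placeholder documents (“Kyudo document A”, “General archery links”) are assumptions made for illustration; the three other titles are the DMOZ examples quoted above.

```python
def node_name(words, collocated):
    """Build a node name from a concept's words.

    Words found directly collocated are joined by a space; words found
    separately are joined by "and", as described in the text.
    """
    joiner = " " if collocated else " and "
    return joiner.join(word.capitalize() for word in words)


def insert_emergent_node(taxonomy, topic, words, collocated,
                         isolated_node, isolated_docs, matching_docs):
    """Add topic/<new node>/<isolated node> to a path -> documents mapping."""
    new_path = f"{topic}/{node_name(words, collocated)}"
    # Move the matching non-leaf documents out of the general topic.
    taxonomy[topic] = [d for d in taxonomy.get(topic, []) if d not in matching_docs]
    taxonomy[new_path] = list(matching_docs)
    # The formerly isolated node becomes a child of the new node.
    taxonomy[f"{new_path}/{isolated_node}"] = list(isolated_docs)
    return new_path


# Master taxonomy as a flat mapping (hypothetical representation).
master = {"Archery": ["Donadoni Archery", "The Archery Centre", "Perris Archery",
                      "General archery links"]}
path = insert_emergent_node(
    master, "Archery", ("traditional", "archery"), True,
    "Kyudo", ["Kyudo document A"],
    ["Donadoni Archery", "The Archery Centre", "Perris Archery"])
print(path)                                         # Archery/Traditional Archery
print(node_name(("stories", "discussion"), False))  # Stories and Discussion
```

Because the name is built mechanically from the concept words, it inherits whatever word form was extracted, so a derived form such as a gerund passes through unchanged.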

   Table 2. Results of alignment – New Nodes (child nodes are prefixed with "-")
    New node                DMOZ                                        Yahoo                                       About.com
    Stories & Discussion                                                Glosses with "stories", "discussion"        Glosses with "stories", "discussion"
      - Chats & Forums      Chats & Forums
    Equipment and Gear      Equipment Manufacturers                     Glosses with "equipment" and "gear"         Archery & Bowhunting Gear Manufacturers; Shop for Archery & Bowhunting Gear
    Bow Hunting             Glosses with "bow hunting"                  Bow Hunting                                 Glosses with "bow hunting"
    Coaching & Training     Glosses with "instruct", "coach", "train"   Glosses with "instruct", "coach", "train"   Glosses with "instruct", "coach", "train"
    Traditional Archery
      - Kyudo               Glosses with "traditional"                  Kyudo                                       Glosses with "traditional"

      Regarding accuracy, the introduction of new nodes carried just one misclassified
   document, which had already been misclassified in one of the third-party indices. In
   general, the accuracy of this method should be as good as the accuracy of the
   classification of the participant taxonomies. However, we are concerned about the
   naming of the newly created nodes. In the Archery case above, all the names read
   nicely, but when we did Soccer, one node received the name “Instructing” when we
   would prefer to see “Instruction.” Further work could be done on node naming.


         5. Conclusions

   Editorially created document glosses are a boon to taxonomy alignment, in that they
   constitute a rich resource to guide semantic resemblance analysis, and have the added
   bonus, when soft ontology expansion is applied via WordNet, of enabling us to create
   new interstitial nodes for a more complete and unequivocal alignment of taxonomies.


   References

   Beneventano, D., Bergamaschi, S., Guerra, F., Vincini, M. 2003. Building an integrated
     Ontology within SEWASIE system. Proceedings of the First International Workshop on
     Semantic Web and Databases (SWDB).

   Guarino, N. 1998. Formal ontologies and information systems. Proceedings of the
     International Conference on Formal Ontology in Information Systems (FOIS’98), Trento,
     Italy.

   Leacock, C., Chodorow, M., Miller, G. A. 1998. Using corpus statistics and
     WordNet relations for sense identification. Computational Linguistics, 24(1): 147-165.

   Martin, T., Azvine, B., Assadian, B. 2004. Acquisition of Soft Taxonomies. IPMU-04,
     1021-1032.