Corpus-Driven Contextualized Categorization

Tony Veale and Yanfen Hao
School of Computer Science and Informatics, University College Dublin, Ireland, email: {tony.veale, yanfen.hao}@ucd.ie

Abstract. Ontologies strive to offer interconnected, hierarchical systems of categories to guide our actions in a complex world. But the boundaries of these categories are highly context-dependent, and what constitutes a prototypical category member in one context may be atypical or unrepresentative in another. In this paper we outline a dynamic, trainable, bottom-up view of category structure based on context-sensitive corpus analysis. By learning from corpora how people actually use categories in different contexts, we can train our ontologies to creatively adapt themselves to these contexts.

1 INTRODUCTION

An ontology is a system of inter-connected categories that collectively provide a structured representation of a given domain. As such, an ontology serves as the conceptual bedrock against which domain meanings are constructed, manipulated and interpreted. However, this fundamental role of the ontology should not blind us to the fact that much of what an ontology attempts to model, via its category structure, is not static but dynamic, making the use of these categories highly sensitive to context. Consider that many categories in a language-oriented ontology, like Genius, Fool, Hero, Villain, Expert, Hunter, and so on, possess subjective membership criteria that change from user to user, and from context to context. Are politicians fools, villains or schemers? Are firemen heroes or workmen? Are scientists experts or geniuses?

Since top-down definitions of membership criteria will always seem brittle or inadequate in some contexts, it seems best to allow contexts to define their own criteria, bottom-up. In other words, we need to establish a contextualized category structure, in the spirit of contextual ontologies [10], which not only preserves the common view of concepts, but also retains the local perspective of individual domains. For language-oriented ontologies, like WordNet [6] (a flawed, lightweight ontology to be sure, but an ontology nonetheless), HowNet [1] and, to some extent, Cyc [5], the context of usage can conveniently be captured via a large corpus of representative texts. A corpus-based approach to determining category membership allows us to structure the middle and lower layers of an ontology according to how words and concepts are actually used in a particular domain. In short, a corpus-based approach supports an extremely flexible, non-classical view of category structure, one that views category membership as a graded rather than binary notion [4], and one in which concepts can fluidly move (via metaphor) from one category to another [2]. In the current work, we use the ability to support metaphoric reasoning as the yardstick against which ontological flexibility should be measured.

Of course, this fluidity does not sit well with conventional perspectives on ontological structure, as represented by the ontologies of [1,5,6]. In this paper we look at one conventional ontology, the HowNet system of [1], which is a large-scale bilingual lexical ontology for words and their meanings in both Chinese and English. In many respects, HowNet is similar to the WordNet lexical ontology for English [6], though in contrast to WordNet, HowNet provides an explicit, if sparse, propositional semantics for each of the word-concepts it defines. Complementing this frame-like semantics, in which concepts are defined in terms of actions, case-roles and fillers, is a taxonomic backbone that seems rather impoverished when compared to that of WordNet. HowNet is essentially an ontology of "Being" rather than an ontology of "Doing", which is to say that it defines concepts according to conventional kinds, like human, animal, tool and so on, rather than according to how specific concepts actually behave in context. However, we describe in section 2 how HowNet's propositional semantics can be used to automatically derive an ontology of "Doing" to replace HowNet's rather shallow taxonomy of conventional categories [8]. Once in place, we demonstrate how this new system of derived categories can be made contextually sensitive by defining their membership criteria in statistical, corpus-based terms, to create a fluid system of membership akin to the Slipnets of Hofstadter [3]. Once sensitized in this way, the ontology can be moved with ease from one context to another simply by replacing the underlying corpus.

2 ONTOLOGIES OF "BEING" AND "DOING"

HowNet and WordNet each reflect a different view of semantic organization. WordNet [6] is differential in nature: rather than attempting to express the meaning of a word explicitly, WordNet instead differentiates words with different meanings by placing them in different synonym sets, or synsets, and further differentiates these synsets from one another by assigning them to different positions of a taxonomy. In contrast, HowNet is constructive in nature. It does not provide a human-oriented textual gloss for each lexical concept, but instead composes sememes from a less discriminating taxonomy to provide a semantic representation for each word sense. For example, HowNet defines the lexical concept surgeon|医生 as follows:

(1) surgeon|医生  {human|人: HostOf={Occupation|职位}, domain={medical|医}}, {doctor|医治: agent={~}}

which can be glossed thus: "a surgeon is a human, with an occupation in the medical domain, who acts as an agent of a doctoring activity" (the {~} here serves to indicate the placement of the concept within its associated propositional structure).
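A frame-like definition of this kind lends itself naturally to a machine-readable representation. The following is a minimal sketch, not HowNet's actual file format: the class names, field names and `doing_categories` helper are our own illustrative assumptions about how such an entry might be modelled.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of a HowNet-style definition: a head
# sememe (the "Being" type) plus verb/case-role propositions (the
# "Doing" behaviour). All names here are illustrative assumptions.
@dataclass
class Proposition:
    verb: str   # e.g. "doctor|医治"
    role: str   # case-role filled by the defined concept, e.g. "agent"

@dataclass
class Definition:
    concept: str                  # e.g. "surgeon|医生"
    head: str                     # taxonomic hypernym, e.g. "human|人"
    features: dict = field(default_factory=dict)
    propositions: list = field(default_factory=list)

surgeon = Definition(
    concept="surgeon|医生",
    head="human|人",
    features={"HostOf": "Occupation|职位", "domain": "medical|医"},
    propositions=[Proposition(verb="doctor|医治", role="agent")],
)

# A "Doing" category is the yoking of a verb to a case-role.
def doing_categories(d: Definition) -> list:
    return [f"{p.verb.split('|')[0]}-{p.role}" for p in d.propositions]

print(doing_categories(surgeon))  # ['doctor-agent']
```

Under this toy model, an entire ontology of "Doing" falls out of iterating `doing_categories` over every definition.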
We see a similar structure employed by HowNet for the lexical concept repairman|修理工:

(2) repairman|修理工  {human|人: HostOf={Occupation|职位}}, {repair|修理: agent={~}}

Note that the impoverished nature of HowNet's taxonomy means that over 3000 different concepts are forced to share the immediate hypernym human|人. However, human|人 merely states, very generally, what a repairman is, rather than what a repairman does. Fortunately, HowNet also organizes its verb entries taxonomically, and so we find the verbs doctor|医治 and repair|修理 organized under the hypernym resume|恢复 (the logic being, one supposes, that "doctoring" and "repairing" both involve a resumption of an earlier, better state). This similarity of verbs, combined with an identicality of case-roles (both surgeon and repairman are agents of their respective activities), allows us to abstract out a new taxonomy, based on the behaviour rather than the general type of these entities.

Figure 1. A new 3-level abstraction hierarchy derived from verb/role combinations.

Figure 1 illustrates the creation of such a taxonomy, whose categories represent a yoking of verbs to specific case-roles, such as repair-agent and amend-agent, and whose category members are those HowNet concepts defined using these verbs and roles. The category-hopping nature of metaphor is now rather easily construed as a combination of generalization and re-specialization operations, in which one moves from one category to another by first passing through a common super-category like resume-agent. Thus, a surgeon can be seen as a repairman or a watchmaker, while a reviser of texts (an editor) can sometimes be seen as a surgeon. These metaphors make sense not because each is a human, but because each restores a better state.

Figure 2. Newly derived HowNet categories may contain a diverse range of concepts.

Of course, this Aristotelian view of metaphor as an abstract "carrying-over" (the etymological origin of the word "metaphor") can only be valid if concepts are ontologized by what they do, rather than by what they are (as is typically the case in both WordNet and HowNet, and even Cyc [5]). Otherwise, metaphor could never operate between semantically distant concepts, which it plainly does. For instance, figure 2 illustrates the derived taxonomy for HowNet concepts that are defined as agents of the verbs "kill", "damage" and "attack", each a specialization of the abstract verb MakeBad in HowNet. We see in this taxonomy the potential for famines to be metaphorically viewed as butchers and assassins, and for viruses to be seen as deadly intruders, or even man-eaters.

3 DERIVING FLUID CATEGORY STRUCTURES

An ontology of "doing" begs a number of obvious questions about the nature of categorization. For instance, is every concept that kills an equally representative member of the category kill-agent? Is movement always allowed between any two categories that share a common abstraction like MakeBad-agent, or is movement limited to certain members only, and in certain directions? When a concept moves from its conventional category to another, how is its degree of membership in this new category to be assessed? In this section we address this key issue of obtaining fluid category structure.

There are two major approaches to the automatic acquisition of taxonomies. One approach is based on the distributional hypothesis of Harris [11], which holds that word terms are similar if they occur in similar linguistic contexts. For instance, Hindle [12] clusters nouns according to their contextual attributes, such as the co-occurrence of nouns with verbs as subjects or objects. Cimiano, Hotho and Staab [13] also extract context information (e.g., verb/subject dependencies and verb/object dependencies) about a given term from a corpus and apply Formal Concept Analysis to generate a lattice that is finally transformed into a partial order closer to a concept hierarchy. The other major approach investigates ontological relations, such as the is-a and part-of relations, via the corpus; Hearst [14] is representative of this field. However, these approaches still result in binary and static taxonomies, because they all apply a threshold to the category or concept architecture to determine whether or not a word-concept belongs to it. In our approach, we also follow Harris's [11] distributional hypothesis in investigating contextual attributes, particularly the behaviour of nouns. The difference is that we apply Lakoff's [4] category theory to assign graded membership to the nouns within a category, rather than simply grouping them into classes according to their contextual attributes or ontological relations.
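The generalization and re-specialization view of metaphor can be sketched as a short walk over the derived hierarchy. In this hedged Python sketch, the toy `SUPER` mapping is our own invented fragment (not HowNet data), and two categories are linked when their verbs share a taxonomic ancestor:

```python
# Toy fragment of a derived "Doing" taxonomy: each verb-role category
# maps to its verb's hypernym-based super-category. Illustrative only.
SUPER = {
    "repair-agent": "resume-agent",
    "doctor-agent": "resume-agent",
    "amend-agent": "resume-agent",
    "kill-agent": "MakeBad-agent",
    "damage-agent": "MakeBad-agent",
}

def metaphor_path(source: str, target: str):
    """Link two categories by generalizing to a shared super-category
    and re-specializing; return None if no common abstraction exists."""
    if source != target and SUPER.get(source) == SUPER.get(target):
        return [source, SUPER[source], target]
    return None

print(metaphor_path("doctor-agent", "repair-agent"))
# ['doctor-agent', 'resume-agent', 'repair-agent']
print(metaphor_path("doctor-agent", "kill-agent"))  # None
```

A deeper hierarchy would generalize repeatedly until a common ancestor is found; one level suffices to illustrate the surgeon-as-repairman hop.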
Following Lakoff [4], every category will possess a prototype, a member that is highly representative of the category as a whole. Such prototypes are often lexicalized in simple terms; for instance, "killer" will be highly representative of kill-agent, while the Chinese translation "杀手" is a composition of "killing" (杀) and "expert" (手). However, many categories, like damage-agent, have no obvious lexicalized prototype, so we need a more generic means of identifying the prototypical member of a category. Lakoff [4] suggests that the prototype will occupy a central position in the category's structure, with other members organized in a radial fashion, at a distance from the centre that is inversely proportional to their similarity to the prototype. If we assume that the prototype will be that member that is most evocative of a category, we should first measure the evocation strength of each concept for a given category. This can be done by determining the frequency of occurrence of each concept within the category, and this, in turn, can be estimated by looking to a large corpus to see how each concept is actually employed by language users. Once the most evocative example is found for each category, membership scores can be assigned based on the strength of evocation. The corpus we use must be large, and while reasonably authoritative, it must use words both literally and figuratively. For reasons outlined in section 5, we use here as our corpus the complete text of the open-source encyclopaedia Wikipedia [9].

Thus, to estimate the membership level of the word-concept butcher|屠夫 in the category kill-agent, we first determine the corpus-frequency of the phrase "butcher who kills/killed". In general, to estimate the membership of a concept C in the category V-agent, we use the query form "C who|which|that V"; for categories of the form V-instrument, we use the query "V with C", and so on. Of course, some verbs are more vague than others, and can have much higher corpus frequencies. We therefore need to normalize raw corpus-frequencies to obtain a truer picture of evocation power.
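Query templates of the form "C who|which|that V" and "V with C" can be generated mechanically. A minimal sketch follows; the two-template inventory and the crude verb-inflection rule are our own simplifying assumptions, standing in for proper morphological handling:

```python
# Build corpus query patterns for a concept C and a verb-role category.
# Illustrative sketch: two role templates, naive English inflection.
def query_patterns(concept: str, verb: str, role: str) -> list:
    forms = [verb, verb + "s", verb + "ed"]  # crude inflection assumption
    if role == "agent":
        return [f"{concept} who|which|that {f}" for f in forms]
    if role == "instrument":
        return [f"{f} with {concept}" for f in forms]
    raise ValueError(f"no template for role: {role}")

print(query_patterns("butcher", "kill", "agent"))
# ['butcher who|which|that kill', 'butcher who|which|that kills',
#  'butcher who|which|that killed']
```

Each generated pattern would then be counted against the corpus to yield the raw frequencies that the equations below normalize.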
If f_raw(V-role:C) denotes the corpus frequency of a concept C when considered as a member of the category V-role, where V is a verb like "kill" and role is one of agent, instrument, etc., then the adjusted frequency, a measure of true evocation, is estimated by:

    f_adj(V-role:C) = ln(f_raw(V-role:C)) × ln( Σ_x f_raw(V-role:x) )^(-1)    (1)

Now, the prototype will be that member of a category with the strongest evocation:

    Prototype(V-role) = argmax_C ( f_adj(V-role:C) )    (2)

The degree of membership of C in the category V-role is then measured relative to the prototype:

    Membership(V-role:C) = f_adj(V-role:C) × f_adj(V-role:Prototype(V-role))^(-1)    (3)

This ensures that the prototypical member has a membership score of 1, while all other members of a category will have a score in the range 0...1.
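Equations (1)-(3) translate directly into code. In this sketch the kill-agent frequency table is invented for illustration; only the three functions mirror the formulas:

```python
import math

# Equations (1)-(3): adjusted evocation, prototype selection, and
# prototype-relative membership.
def adjusted(freqs: dict) -> dict:
    """Eq. (1): f_adj(C) = ln(f_raw(C)) / ln(sum of all raw frequencies)."""
    total = sum(freqs.values())
    return {c: math.log(f) / math.log(total) for c, f in freqs.items()}

def prototype(freqs: dict) -> str:
    """Eq. (2): the member with the strongest adjusted evocation."""
    adj = adjusted(freqs)
    return max(adj, key=adj.get)

def membership(freqs: dict, concept: str) -> float:
    """Eq. (3): membership of a concept relative to the prototype."""
    adj = adjusted(freqs)
    return adj[concept] / adj[prototype(freqs)]

kill_agent = {"killer": 120, "assassin": 40, "butcher": 15}  # invented counts
print(prototype(kill_agent))                       # killer
print(round(membership(kill_agent, "killer"), 2))  # 1.0
print(membership(kill_agent, "butcher") < 1.0)     # True
```

Note that the logarithmic normalization damps the advantage of vague, high-frequency verbs, exactly as motivated in the text.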
A concept can metaphorically be moved from a category in which it is conventionally a member to any other category in which it is considered to have a non-zero membership score.

4 CLUSTERS AND GROUP-TERMS

For ontological purposes, a category is essentially a cluster of concepts that allows one to conveniently infer similarity, that is, the possession of common properties and shared behaviour, from the simple act of co-categorization. That these clusters often have a heterogeneous roster of members (e.g., as illustrated in Figure 2) is testament both to the prevalence of metaphor and to the necessity of viewing ontological categories as categories of "doing" rather than of "being". Of course, the converse is also true: we can infer the contextual behaviour of a concept from how that concept is explicitly clustered with others. And one common way of signalling the appropriate cluster for a concept is through an evocative group word, like "army", "mob", "tribe" or "coven". For instance, when one uses the phrase "an army of robots", one is conveying a soldier-like perspective on the concept Robot, signalling that in this context, Robot should be viewed more as an attacking agent than as a utensil.

Group terms like "army", "family" and "swarm" are highly suggestive of particular behaviours. For instance, the corpus techniques of section 3 reveal that, in the context of Wikipedia, a "swarm" has two dominant behaviours, biting and attacking, while an "army" has three: defeating, fighting and attacking. To use the phrase "swarm of X" or "army of X" is to suggest that X also exhibits these behaviours, and furthermore, that X is similar in behaviour to other concepts that comfortably fit these templates. This intuition is easily contextualized, since the relative frequency of these phrases in a context's corpus will reveal the extent to which different concepts belong to different group-based categories.

As a corpus, Wikipedia is biased toward popular culture and genres such as science fiction. This lack of neutrality makes the Wikipedia corpus an excellent example of a context, more so than traditional language corpora. Consider the population of the category Army-member as derived from Wikipedia:

mercenary(238), clone(132), soldier(122), volunteer(72), monster(70), robot(63), minion(60), warrior(60), frog(58), knight(50), slave(48), demon(46), clansman(46), monkey(46), crusader(44), gladiator(38), ant(37), lawyer(32), contributor(28), mutant(27), ...

Note the prominent presence of the genre elements "clone", "robot" and "minion", as well as examples like "lawyer", for which "army" has a metaphoric meaning. This grouping suggests that lawyers may be seen, alternately, as mercenaries, warriors and even clones, while the extent to which these comparisons are apt in a particular context is a function of how many different groups can contextually claim both as members. For instance, "lawyer" and "warrior" are used with seven different group terms in the Wikipedia corpus (society, family, cadre, team, army, class and squad), while "lawyer" and "mercenary" share just three groupings (team, army and squad). Interestingly, the most common group term for "lawyer" in Wikipedia is "huddle" (the phrase "huddle of lawyers" occurs 64 times, twice as often as "army of lawyers"), which suggests that, in this context, lawyers are more likely to be categorized as players than as warriors, mercenaries, clones or robots.
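Counting how many group terms two nouns share is easy to operationalize. In the sketch below, the group-term inventories are invented to mirror the lawyer/warrior/mercenary counts reported in the text, not actual Wikipedia extractions:

```python
# Group terms observed with each noun (sets invented for illustration,
# sized to match the shared-grouping counts discussed in the text).
GROUPS = {
    "lawyer":    {"society", "family", "cadre", "team", "army",
                  "class", "squad", "huddle"},
    "warrior":   {"society", "family", "cadre", "team", "army",
                  "class", "squad"},
    "mercenary": {"team", "army", "squad"},
}

def shared_groupings(a: str, b: str) -> set:
    """Group terms that can contextually claim both nouns as members;
    the size of this set is a rough proxy for the aptness of comparison."""
    return GROUPS[a] & GROUPS[b]

print(len(shared_groupings("lawyer", "warrior")))    # 7
print(len(shared_groupings("lawyer", "mercenary")))  # 3
```

In a deployed system these sets would be populated by counting "G of X" phrases in the context's corpus.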
5 PRELIMINARY EMPIRICAL EVALUATION

The choice of corpus is clearly key to the quality of the category-membership statistics that can be derived using the methods of sections 3 and 4. This corpus must be large, it must be representative of language use in general, and it should offer a means of search that is robust in the face of noise. At first blush, then, the world-wide-web seems an ideal candidate: in size it is unmatched, and various APIs are available to access powerful search engines like Google. Unfortunately, such APIs rarely provide enough control over the query or the archive to ensure that noise can be eliminated, since these engines typically perform their own stemming and stop-word elimination, putting truly strict matching beyond our reach. This means that common noun-noun collocations, like "fossil record" and "share issue", are easily confused with infrequent or nonsensical noun-verb collocations like "fossils that record" and "shares that issue".

To ensure strict matching with controlled morphology, we require a local text corpus that we can index, search directly, and even subject to part-of-speech tagging. For this reason we choose the collected text of the open-source encyclopaedia Wikipedia [9], which is available to download in XML form. Wikipedia has several obvious benefits as a text corpus: each document is explicitly tagged with a subject-label, since each article defines a specific headword; documents exist in a rich web of interconnections; and documents strive to be authoritative on their subjects. Consider the range of subjects that are found in Wikipedia for the verb "to infect" (with frequencies shown in parentheses):

virus(46), worm(12), retrovirus(7), strain(6), disease(6), bureaucrat(6), poison(4), ally(4), fungus(4), dust(3), smut(2), bacterium(2), physiologist(2), blood(2), plague(2), war(2), substance(2), germ(1), application(1), species(1)

Now consider the range of verbs that can be used with the subject "virus":

infect(46), attack(11), kill(7), jump(6), eat(4), drive(3), produce(3), destroy(3), spread(3), transform(3), escape(2), steal(1), prove(1), carry(1), freeze(1), arrive(1), control(1)

We see from this snapshot that Wikipedia contains enough diversity to capture the dominant application of each verb, and the dominant behaviour of each subject noun. Furthermore, Wikipedia contains enough diversity to reveal creative uses of these nouns and verbs; this snapshot reveals, for instance, that "smut" can "infect" (2 uses) and that a "virus" can "eat", "escape" and even "steal".

One can ask how well these corpus-derived category structures compare with the hand-crafted category structures of HowNet, since one can reasonably expect human-assigned category memberships to be a gold standard for this task.
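Strict matching of noun-verb collocations over a local corpus, with no stemming or stop-word elimination, can be approximated with exact-pattern regular expressions. This is a deliberately simple sketch; the actual pipeline described above relies on indexing and part-of-speech tagging rather than raw regexes:

```python
import re

# Count strict "C who|which|that V" matches in a local text, so that a
# noun-noun collocation like "fossil record" is never mistaken for the
# noun-verb collocation "fossils that record".
def strict_count(text: str, concept: str, verb_forms: list) -> int:
    alts = "|".join(map(re.escape, verb_forms))
    pattern = rf"\b{re.escape(concept)}s? (?:who|which|that) (?:{alts})\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

corpus = ("The butcher who kills cattle... butchers who killed... "
          "The fossil record shows... fossils are not things that record.")
print(strict_count(corpus, "butcher", ["kill", "kills", "killed"]))  # 2
print(strict_count(corpus, "fossil", ["record", "records"]))         # 0
```

Because the concept and verb forms are matched literally, the noisy "fossil record" readings that defeat web search APIs simply never fire.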
We find that in 69% of cases, the HowNet-assigned category for a given word-concept is also the dominant corpus-derived category, and that in 76% of cases, a word-concept has a statistical membership in the HowNet-assigned category that is greater than the median membership score for that category.

In fact, these results suggest that HowNet is far from being a gold standard for category membership. In many cases, the HowNet category is either poorly named or dangerously misleading. For instance, the primary sense of the verb "doctor" in English is not "heal" but "fiddle" (as in "to doctor one's résumé"). Likewise, HowNet assigns the name "resume" to the super-category of "repair" and "doctor", when the verb "restore" is more appropriate in English. In many other cases, the HowNet-assigned category is only one of several that seem intuitively appropriate. For instance, the word "knight" is assigned the dominant corpus category protect-agent (based on 12 occurrences of the pattern "knight who protects"), while HowNet assigns it to the category defend-agent (which is the second-most popular corpus assignment, based on 10 occurrences of "knight who defends"). Viewed from this perspective, the corpus-based and hand-crafted approaches to category assignment are complementary, not conflicting, and each can serve to validate and enrich the other.

6 CONCLUSION

The results of our experiments with Wikipedia are promisingly suggestive about the possibility of contextualizing ontological category structures via corpus-derived statistics. For example, the Wikipedia corpus reveals that the most common verb for the subject noun "vampire" is "hunt" (the phrase "vampires who hunt" occurs 4 times), indicating that in this pop-culture/fantasy-oriented context, a vampire is to be seen predominantly as a member of the category hunt-agent, or hunter. While one is unlikely to find such a categorization in an ontology like WordNet, or even Cyc, it is the most appropriate categorization in this context. Nonetheless, these results are hardly conclusive, for although large, Wikipedia is simply not large enough to provide the diversity of evidence needed to reliably derive heterogeneous category memberships. If a resource like Wikipedia lacks the necessary scale, surely this speaks to the futility of defining a context via a corpus?

We believe the answer to this dilemma lies not in ever-larger corpora (which may be too large to preserve the distinctive biases of a given context), but in the combination of different perspectives offered by the same corpus. We have described two different perspectives in this paper: the perspective of behaviour (captured via verb collocations) described in section 3, and the perspective of clustering (captured via group-word collocations) described in section 4. For instance, we know that Robot is the most representative member of the category army-agent in Wikipedia (with 63 examples), while army is itself a highly representative member of the category attack-agent. This suggests that Robot should also be a strong member of the category attack-agent. While Wikipedia records no uses of the collocation "robot who|which|that attacks", this joint perspective is sufficient evidence to support going to the web for this collocation. That is, the intuition that Robot is an attack-agent is consistent with the corpus, and thus the context, so the precise membership score can be determined using the larger context of the web.

Bootstrapping techniques like this should allow us to grow more heterogeneous category structures while respecting the ontological biases of the specific context. Once the deficiencies of relatively small corpora are addressed via such techniques, we expect to be better poised to fully explore the ramifications and opportunities of corpus-trained contextual ontologies.

ACKNOWLEDGEMENTS

We would like to thank Enterprise Ireland for supporting this research through a grant from the Commercialization Fund.

REFERENCES

[1] Dong, Z. and Dong, Q., HowNet and the Computation of Meaning, World Scientific, Singapore, 2006.
[2] Glucksberg, S. and Keysar, B., How Metaphors Work, in Metaphor and Thought (2nd edition), A. Ortony (Ed.), Cambridge University Press, 1993.
[3] Hofstadter, D., Fluid Concepts and Creative Analogies, Basic Books, 1995.
[4] Lakoff, G., Women, Fire and Dangerous Things, Chicago University Press, 1987.
[5] Lenat, D. and Guha, R. V., Building Large Knowledge-Based Systems, Addison Wesley, 1990.
[6] Miller, G. A., WordNet: A Lexical Database for English, Communications of the ACM, Vol. 38, No. 11, 1995.
[7] Searle, J., Metaphor, in Metaphor and Thought (2nd edition), A. Ortony (Ed.), Cambridge University Press, 1993.
[8] Veale, T., Analogy Generation in HowNet, in Proceedings of IJCAI'2005, the International Joint Conference on Artificial Intelligence, 2005.
[9] Wikipedia open-source encyclopaedia: www.wikipedia.org.
[10] van Harmelen, F., Serafini, L. and Stuckenschmidt, H., C-OWL: Contextualizing Ontologies, in Proceedings of the 2nd International Semantic Web Conference, 2003.
[11] Harris, Z., Mathematical Structures of Language, Wiley, 1968.
[12] Hindle, D., Noun Classification from Predicate-Argument Structures, in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 268-275, 1990.
[13] Cimiano, P., Hotho, A. and Staab, S., Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis, Journal of AI Research, Volume 24, pp. 305-339, 2005.
[14] Hearst, M., Automatic Acquisition of Hyponyms from Large Text Corpora, in Proceedings of the 14th International Conference on Computational Linguistics (COLING), pp. 539-545, 1992.