Corpus-Driven Contextualized Categorization

Tony Veale and Yanfen Hao
School of Computer Science and Informatics, University College Dublin, Ireland, email: {tony.veale, yanfen.hao}@ucd.ie

Abstract. Ontologies strive to offer interconnected, hierarchical systems of categories to guide our actions in a complex world. But the boundaries of these categories are highly context-dependent, and what constitutes a prototypical category member in one context may be atypical or unrepresentative in another. In this paper we outline a dynamic, trainable, bottom-up view of category structure based on context-sensitive corpus analysis. By learning from corpora how people actually use categories in different contexts, we can train our ontologies to creatively adapt themselves to these contexts.

1 INTRODUCTION

An ontology is a system of inter-connected categories that collectively provide a structured representation of a given domain. As such, an ontology serves as the conceptual bedrock against which domain meanings are constructed, manipulated and interpreted. However, this fundamental role of the ontology should not blind us to the fact that much of what an ontology attempts to model, via its category structure, is not static but dynamic, making the use of these categories highly sensitive to context. Consider that many categories in a language-oriented ontology, like Genius, Fool, Hero, Villain, Expert, Hunter, and so on, possess subjective membership criteria that change from user to user, and from context to context. Are politicians fools, villains or schemers? Are firemen heroes or workmen? Are scientists experts or geniuses?

Since top-down definitions of membership criteria will always seem brittle or inadequate in some contexts, it seems best to allow contexts to define their own criteria, bottom-up. In other words, we need to establish a contextualized category structure, in the spirit of contextual ontologies [10], which not only preserves the common view of concepts, but also retains the local perspective of individual domains. For language-oriented ontologies, like WordNet [6] (a flawed, lightweight ontology to be sure, but an ontology nonetheless), HowNet [1] and, to some extent, Cyc [5], the context of usage can conveniently be captured via a large corpus of representative texts. A corpus-based approach to determining category membership allows us to structure the middle and lower layers of an ontology according to how words and concepts are actually used in a particular domain. In short, a corpus-based approach supports an extremely flexible, non-classical view of category structure, one that views category membership as a graded rather than binary notion [4], and one in which concepts can fluidly move (via metaphor) from one category to another [2]. In the current work, we use the ability to support metaphoric reasoning as the yardstick against which ontological flexibility should be measured.

Of course, this fluidity does not sit well with conventional perspectives on ontological structure, as represented by the ontologies of [1,5,6]. In this paper we look at one conventional ontology, the HowNet system of [1], which is a large-scale bilingual lexical ontology for words and their meanings in both Chinese and English. In many respects, HowNet is similar to the WordNet lexical ontology for English [6], though in contrast to WordNet, HowNet provides an explicit, if sparse, propositional semantics for each of the word-concepts it defines. Complementing this frame-like semantics, in which concepts are defined in terms of actions, case-roles and fillers, is a taxonomic backbone that seems rather impoverished when compared to that of WordNet. HowNet is essentially an ontology of "Being" rather than an ontology of "Doing", which is to say that it defines concepts according to conventional kinds, like human, animal, tool and so on, rather than according to how specific concepts actually behave in context. However, we describe in section 2 how HowNet's propositional semantics can be used to automatically derive an ontology of "Doing" to replace HowNet's rather shallow taxonomy of conventional categories [8]. Once in place, we demonstrate how this new system of derived categories can be made contextually sensitive by defining their membership criteria in statistical, corpus-based terms, to create a fluid system of membership akin to the Slipnets of Hofstadter [3]. Once sensitized in this way, the ontology can be moved with ease from one context to another simply by replacing the underlying corpus.

2 ONTOLOGIES OF "BEING" AND "DOING"

HowNet and WordNet each reflect a different view of semantic organization. WordNet [6] is differential in nature: rather than attempting to express the meaning of a word explicitly, WordNet instead differentiates words with different meanings by placing them in different synonym sets, or synsets, and further differentiates these synsets from one another by assigning them to different positions of a taxonomy. In contrast, HowNet is constructive in nature. It does not provide a human-oriented textual gloss for each lexical concept, but instead composes sememes from a less discriminating taxonomy to provide a semantic representation for each word sense. For example, HowNet defines the lexical concept surgeon|医生 as follows:

(1) surgeon|医生  {human|人: HostOf={Occupation|职位}, domain={medical|医}}, {doctor|医治: agent={~}}

which can be glossed thus: "a surgeon is a human, with an occupation in the medical domain, who acts as an agent of a doctoring activity" (the {~} here serves to indicate the placement of the concept within its associated propositional structure).
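A frame-like definition of this kind lends itself naturally to a machine-readable representation. The following is a minimal sketch, not HowNet's actual file format: the class names, field names and `doing_categories` helper are our own illustrative assumptions about how such an entry might be modelled.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of a HowNet-style definition: a head
# sememe (the "Being" type) plus verb/case-role propositions (the
# "Doing" behaviour). All names here are illustrative assumptions.
@dataclass
class Proposition:
    verb: str   # e.g. "doctor|医治"
    role: str   # case-role filled by the defined concept, e.g. "agent"

@dataclass
class Definition:
    concept: str                  # e.g. "surgeon|医生"
    head: str                     # taxonomic hypernym, e.g. "human|人"
    features: dict = field(default_factory=dict)
    propositions: list = field(default_factory=list)

surgeon = Definition(
    concept="surgeon|医生",
    head="human|人",
    features={"HostOf": "Occupation|职位", "domain": "medical|医"},
    propositions=[Proposition(verb="doctor|医治", role="agent")],
)

# A "Doing" category is the yoking of a verb to a case-role.
def doing_categories(d: Definition) -> list:
    return [f"{p.verb.split('|')[0]}-{p.role}" for p in d.propositions]

print(doing_categories(surgeon))  # ['doctor-agent']
```

Under this toy model, an entire ontology of "Doing" falls out of iterating `doing_categories` over every definition.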
We see a similar structure employed by HowNet for the lexical concept repairman|修理工:

(2) repairman|修理工  {human|人: HostOf={Occupation|职位}}, {repair|修理: agent={~}}

Note that the impoverished nature of HowNet's taxonomy means that over 3000 different concepts are forced to share the immediate hypernym human|人. However, human|人 merely states, very generally, what a repairman is, rather than what a repairman does. Fortunately, HowNet also organizes its verb entries taxonomically, and so we find the verbs doctor|医治 and repair|修理 organized under the hypernym resume|恢复 (the logic being, one supposes, that "doctoring" and "repairing" both involve a resumption of an earlier, better state). This similarity of verbs, combined with an identicality of case-roles (both surgeon and repairman are agents of their respective activities), allows us to abstract out a new taxonomy, based on the behaviour rather than the general type of these entities.

Figure 1. A new 3-level abstraction hierarchy derived from verb/role combinations.

Figure 1 illustrates the creation of such a taxonomy, whose categories represent a yoking of verbs to specific case-roles, such as repair-agent and amend-agent, and whose category members are those HowNet concepts defined using these verbs and roles. The category-hopping nature of metaphor is now rather easily construed as a combination of generalization and re-specialization operations, in which one moves from one category to another by first passing through a common super-category like resume-agent. Thus, a surgeon can be seen as a repairman or a watchmaker, while a reviser of texts (an editor) can sometimes be seen as a surgeon. These metaphors make sense not because each is a human, but because each restores a better state.

Figure 2. Newly derived HowNet categories may contain a diverse range of concepts.

Of course, this Aristotelian view of metaphor as an abstract "carrying-over" (the etymological origin of the word "metaphor") can only be valid if concepts are ontologized by what they do, rather than by what they are (as is typically the case in both WordNet and HowNet, and even Cyc [5]). Otherwise, metaphor could never operate between semantically distant concepts, which it plainly does. For instance, figure 2 illustrates the derived taxonomy for HowNet concepts that are defined as agents of the verbs "kill", "damage" and "attack", each a specialization of the abstract verb MakeBad in HowNet. We see in this taxonomy the potential for famines to be metaphorically viewed as butchers and assassins, and for viruses to be seen as deadly intruders, or even man-eaters.

3 DERIVING FLUID CATEGORY STRUCTURES

An ontology of "doing" begs a number of obvious questions about the nature of categorization. For instance, is every concept that kills an equally representative member of the category kill-agent? Is movement always allowed between any two categories that share a common abstraction like MakeBad-agent, or is movement limited to certain members only, and in certain directions? When a concept moves from its conventional category to another, how is its degree of membership in this new category to be assessed? In this section we address this key issue of obtaining fluid category structure.

There are two major approaches to the automatic acquisition of taxonomies. One approach is based on the distributional hypothesis of Harris [11], which holds that word terms are similar if they occur in similar linguistic contexts. For instance, Hindle [12] clusters nouns according to their contextual attributes, such as the co-occurrence of nouns with verbs as subjects or objects. Cimiano, Hotho and Staab [13] also extract context information (e.g., verb/subject dependencies and verb/object dependencies) about a given term from a corpus and apply Formal Concept Analysis to generate a lattice that is finally transformed into a partial order closer to a concept hierarchy. The other major approach investigates ontological relations, such as the is-a and part-of relations, via the corpus; Hearst [14] is representative of this field. However, these approaches still result in binary and static taxonomies, because they all apply a threshold to the category or concept architecture to determine whether or not a word-concept belongs to it. In our approach, we also follow Harris's [11] distributional hypothesis in investigating contextual attributes, particularly the behaviour of nouns. The difference is that we apply Lakoff's [4] category theory to assign graded membership to the nouns within a category, rather than simply grouping them into classes according to their contextual attributes or ontological relations.
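The generalization and re-specialization view of metaphor can be sketched as a short walk over the derived hierarchy. In this hedged Python sketch, the toy `SUPER` mapping is our own invented fragment (not HowNet data), and two categories are linked when their verbs share a taxonomic ancestor:

```python
# Toy fragment of a derived "Doing" taxonomy: each verb-role category
# maps to its verb's hypernym-based super-category. Illustrative only.
SUPER = {
    "repair-agent": "resume-agent",
    "doctor-agent": "resume-agent",
    "amend-agent": "resume-agent",
    "kill-agent": "MakeBad-agent",
    "damage-agent": "MakeBad-agent",
}

def metaphor_path(source: str, target: str):
    """Link two categories by generalizing to a shared super-category
    and re-specializing; return None if no common abstraction exists."""
    if source != target and SUPER.get(source) == SUPER.get(target):
        return [source, SUPER[source], target]
    return None

print(metaphor_path("doctor-agent", "repair-agent"))
# ['doctor-agent', 'resume-agent', 'repair-agent']
print(metaphor_path("doctor-agent", "kill-agent"))  # None
```

A deeper hierarchy would generalize repeatedly until a common ancestor is found; one level suffices to illustrate the surgeon-as-repairman hop.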
Following Lakoff [4], every category will possess a prototype, a member that is highly representative of the category as a whole. Such prototypes are often lexicalized in simple terms; for instance, "killer" will be highly representative of kill-agent, while the Chinese translation "杀手" is a composition of "killing" (杀) and "expert" (手). However, many categories, like damage-agent, have no obvious lexicalized prototype, so we need a more generic means of identifying the prototypical member of a category. Lakoff [4] suggests that the prototype will occupy a central position in the category's structure, with other members organized in a radial fashion, at a distance from the centre that is inversely proportional to their similarity to the prototype. If we assume that the prototype will be that member that is most evocative of a category, we should first measure the evocation strength of each concept for a given category. This can be done by determining the frequency of occurrence of each concept within the category, and this, in turn, can be estimated by looking to a large corpus to see how each concept is actually employed by language users. Once the most evocative example is found for each category, membership scores can be assigned based on the strength of evocation. The corpus we use must be large, and while reasonably authoritative, it must use words both literally and figuratively. For reasons outlined in section 5, we use here as our corpus the complete text of the open-source encyclopaedia Wikipedia [9].

Thus, to estimate the membership level of the word-concept butcher|屠夫 in the category kill-agent, we first determine the corpus-frequency of the phrase "butcher who kills/killed". In general, to estimate the membership of a concept C in the category V-agent, we use the query form "C who|which|that V"; for categories of the form V-instrument, we use the query "V with C", and so on. Of course, some verbs are more vague than others, and can have much higher corpus frequencies. We therefore need to normalize raw corpus-frequencies to obtain a truer picture of evocation power.
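Query templates of the form "C who|which|that V" and "V with C" can be generated mechanically. A minimal sketch follows; the two-template inventory and the crude verb-inflection rule are our own simplifying assumptions, standing in for proper morphological handling:

```python
# Build corpus query patterns for a concept C and a verb-role category.
# Illustrative sketch: two role templates, naive English inflection.
def query_patterns(concept: str, verb: str, role: str) -> list:
    forms = [verb, verb + "s", verb + "ed"]  # crude inflection assumption
    if role == "agent":
        return [f"{concept} who|which|that {f}" for f in forms]
    if role == "instrument":
        return [f"{f} with {concept}" for f in forms]
    raise ValueError(f"no template for role: {role}")

print(query_patterns("butcher", "kill", "agent"))
# ['butcher who|which|that kill', 'butcher who|which|that kills',
#  'butcher who|which|that killed']
```

Each generated pattern would then be counted against the corpus to yield the raw frequencies that the equations below normalize.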
If f_raw(V-role:C) denotes the corpus frequency of a concept C when considered as a member of the category V-role, where V is a verb like "kill" and role is one of agent, instrument, etc., then the adjusted frequency, a measure of true evocation, is estimated by:

    f_adj(V-role:C) = ln(f_raw(V-role:C)) × ln( Σ_x f_raw(V-role:x) )^(-1)    (1)

Now, the prototype will be that member of a category with the strongest evocation:

    Prototype(V-role) = argmax_C ( f_adj(V-role:C) )    (2)

The degree of membership of C in the category V-role is then measured relative to the prototype:

    Membership(V-role:C) = f_adj(V-role:C) × f_adj(V-role:Prototype(V-role))^(-1)    (3)

This ensures that the prototypical member has a membership score of 1, while all other members of a category will have a score in the range 0...1.
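Equations (1)-(3) translate directly into code. In this sketch the kill-agent frequency table is invented for illustration; only the three functions mirror the formulas:

```python
import math

# Equations (1)-(3): adjusted evocation, prototype selection, and
# prototype-relative membership.
def adjusted(freqs: dict) -> dict:
    """Eq. (1): f_adj(C) = ln(f_raw(C)) / ln(sum of all raw frequencies)."""
    total = sum(freqs.values())
    return {c: math.log(f) / math.log(total) for c, f in freqs.items()}

def prototype(freqs: dict) -> str:
    """Eq. (2): the member with the strongest adjusted evocation."""
    adj = adjusted(freqs)
    return max(adj, key=adj.get)

def membership(freqs: dict, concept: str) -> float:
    """Eq. (3): membership of a concept relative to the prototype."""
    adj = adjusted(freqs)
    return adj[concept] / adj[prototype(freqs)]

kill_agent = {"killer": 120, "assassin": 40, "butcher": 15}  # invented counts
print(prototype(kill_agent))                       # killer
print(round(membership(kill_agent, "killer"), 2))  # 1.0
print(membership(kill_agent, "butcher") < 1.0)     # True
```

Note that the logarithmic normalization damps the advantage of vague, high-frequency verbs, exactly as motivated in the text.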
A concept can metaphorically be moved from a category in which it is conventionally a member to any other category in which it is considered to have a non-zero membership score.

4 CLUSTERS AND GROUP-TERMS

For ontological purposes, a category is essentially a cluster of concepts that allows one to conveniently infer similarity, that is, the possession of common properties and shared behaviour, from the simple act of co-categorization. That these clusters often have a heterogeneous roster of members (e.g., as illustrated in Figure 2) is testament both to the prevalence of metaphor and to the necessity of viewing ontological categories as categories of "doing" rather than of "being". Of course, the converse is also true: we can infer the contextual behaviour of a concept from how that concept is explicitly clustered with others. And one common way of signalling the appropriate cluster for a concept is through an evocative group word, like "army", "mob", "tribe" or "coven". For instance, when one uses the phrase "an army of robots", one is conveying a soldier-like perspective on the concept Robot, signalling that in this context, Robot should be viewed more as an attacking agent than as a utensil.

Group terms like "army", "family" and "swarm" are highly suggestive of particular behaviours. For instance, the corpus techniques of section 3 reveal that, in the context of Wikipedia, a "swarm" has two dominant behaviours, biting and attacking, while an "army" has three: defeating, fighting and attacking. To use the phrase "swarm of X" or "army of X" is to suggest that X also exhibits these behaviours, and furthermore, that X is similar in behaviour to other concepts that comfortably fit these templates. This intuition is easily contextualized, since the relative frequency of these phrases in a context's corpus will reveal the extent to which different concepts belong to different group-based categories.

As a corpus, Wikipedia is biased toward popular culture and genres such as science fiction. This lack of neutrality makes the Wikipedia corpus an excellent example of a context, more so than traditional language corpora. Consider the population of the category Army-member as derived from Wikipedia:

mercenary(238), clone(132), soldier(122), volunteer(72), monster(70), robot(63), minion(60), warrior(60), frog(58), knight(50), slave(48), demon(46), clansman(46), monkey(46), crusader(44), gladiator(38), ant(37), lawyer(32), contributor(28), mutant(27), ...

Note the prominent presence of the genre elements "clone", "robot" and "minion", as well as examples like "lawyer", for which "army" has a metaphoric meaning. This grouping suggests that lawyers may be seen, alternately, as mercenaries, warriors and even clones, while the extent to which these comparisons are apt in a particular context is a function of how many different groups can contextually claim both as members. For instance, "lawyer" and "warrior" are used with seven different group terms in the Wikipedia corpus (society, family, cadre, team, army, class and squad), while "lawyer" and "mercenary" share just three groupings (team, army and squad). Interestingly, the most common group term for "lawyer" in Wikipedia is "huddle" (the phrase "huddle of lawyers" occurs 64 times, twice as often as "army of lawyers"), which suggests that, in this context, lawyers are more likely to be categorized as players than as warriors, mercenaries, clones or robots.
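Counting how many group terms two nouns share is easy to operationalize. In the sketch below, the group-term inventories are invented to mirror the lawyer/warrior/mercenary counts reported in the text, not actual Wikipedia extractions:

```python
# Group terms observed with each noun (sets invented for illustration,
# sized to match the shared-grouping counts discussed in the text).
GROUPS = {
    "lawyer":    {"society", "family", "cadre", "team", "army",
                  "class", "squad", "huddle"},
    "warrior":   {"society", "family", "cadre", "team", "army",
                  "class", "squad"},
    "mercenary": {"team", "army", "squad"},
}

def shared_groupings(a: str, b: str) -> set:
    """Group terms that can contextually claim both nouns as members;
    the size of this set is a rough proxy for the aptness of comparison."""
    return GROUPS[a] & GROUPS[b]

print(len(shared_groupings("lawyer", "warrior")))    # 7
print(len(shared_groupings("lawyer", "mercenary")))  # 3
```

In a deployed system these sets would be populated by counting "G of X" phrases in the context's corpus.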
5 PRELIMINARY EMPIRICAL EVALUATION

The choice of corpus is clearly key to the quality of the category-membership statistics that can be derived using the methods of sections 3 and 4. This corpus must be large, it must be representative of language use in general, and it should offer a means of search that is robust in the face of noise. At first blush, then, the world-wide-web seems an ideal candidate: in size it is unmatched, and various APIs are available to access powerful search engines like Google. Unfortunately, such APIs rarely provide enough control over the query or the archive to ensure that noise can be eliminated, since these engines typically perform their own stemming and stop-word elimination, putting truly strict matching beyond our reach. This means that common noun-noun collocations, like "fossil record" and "share issue", are easily confused with infrequent or nonsensical noun-verb collocations like "fossils that record" and "shares that issue".

To ensure strict matching with controlled morphology, we require a local text corpus that we can index, search directly, and even subject to part-of-speech tagging. For this reason we choose the collected text of the open-source encyclopaedia Wikipedia [9], which is available to download in XML form. Wikipedia has several obvious benefits as a text corpus: each document is explicitly tagged with a subject-label, since each article defines a specific headword; documents exist in a rich web of interconnections; and documents strive to be authoritative on their subjects. Consider the range of subjects that are found in Wikipedia for the verb "to infect" (with frequencies shown in parentheses):

virus(46), worm(12), retrovirus(7), strain(6), disease(6), bureaucrat(6), poison(4), ally(4), fungus(4), dust(3), smut(2), bacterium(2), physiologist(2), blood(2), plague(2), war(2), substance(2), germ(1), application(1), species(1)

Now consider the range of verbs that can be used with the subject "virus":

infect(46), attack(11), kill(7), jump(6), eat(4), drive(3), produce(3), destroy(3), spread(3), transform(3), escape(2), steal(1), prove(1), carry(1), freeze(1), arrive(1), control(1)

We see from this snapshot that Wikipedia contains enough diversity to capture the dominant application of each verb, and the dominant behaviour of each subject noun. Furthermore, Wikipedia contains enough diversity to reveal creative uses of these nouns and verbs; this snapshot reveals, for instance, that "smut" can "infect" (2 uses) and that a "virus" can "eat", "escape" and even "steal".

One can ask how well these corpus-derived category structures compare with the hand-crafted category structures of HowNet, since one can reasonably expect human-assigned category memberships to be a gold standard for this task.
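Strict matching of noun-verb collocations over a local corpus, with no stemming or stop-word elimination, can be approximated with exact-pattern regular expressions. This is a deliberately simple sketch; the actual pipeline described above relies on indexing and part-of-speech tagging rather than raw regexes:

```python
import re

# Count strict "C who|which|that V" matches in a local text, so that a
# noun-noun collocation like "fossil record" is never mistaken for the
# noun-verb collocation "fossils that record".
def strict_count(text: str, concept: str, verb_forms: list) -> int:
    alts = "|".join(map(re.escape, verb_forms))
    pattern = rf"\b{re.escape(concept)}s? (?:who|which|that) (?:{alts})\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

corpus = ("The butcher who kills cattle... butchers who killed... "
          "The fossil record shows... fossils are not things that record.")
print(strict_count(corpus, "butcher", ["kill", "kills", "killed"]))  # 2
print(strict_count(corpus, "fossil", ["record", "records"]))         # 0
```

Because the concept and verb forms are matched literally, the noisy "fossil record" readings that defeat web search APIs simply never fire.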
We find that in 69% of cases, the HowNet-assigned category for a given word-concept is also the dominant corpus-derived category, and that in 76% of cases, a word-concept has a statistical membership in the HowNet-assigned category that is greater than the median membership score for that category.

In fact, these results suggest that HowNet is far from being a gold standard for category membership. In many cases, the HowNet category is either poorly named or dangerously misleading. For instance, the primary sense of the verb "doctor" in English is not "heal" but "fiddle" (as in "to doctor one's résumé"). Likewise, HowNet assigns the name "resume" to the super-category of "repair" and "doctor", when the verb "restore" is more appropriate in English. In many other cases, the HowNet-assigned category is only one of several that seem intuitively appropriate. For instance, the word "knight" is assigned the dominant corpus category protect-agent (based on 12 occurrences of the pattern "knight who protects"), while HowNet assigns it to the category defend-agent (which is the second-most popular corpus assignment, based on 10 occurrences of "knight who defends"). Viewed from this perspective, the corpus-based and hand-crafted approaches to category assignment are complementary, not conflicting, and each can serve to validate and enrich the other.

6 CONCLUSION

The results of our experiments with Wikipedia are promisingly suggestive about the possibility of contextualizing ontological category structures via corpus-derived statistics. For example, the Wikipedia corpus reveals that the most common verb for the subject noun "vampire" is "hunt" (the phrase "vampires who hunt" occurs 4 times), indicating that in this pop-culture/fantasy-oriented context, a vampire is to be seen predominantly as a member of the category hunt-agent, or hunter. While one is unlikely to find such a categorization in an ontology like WordNet, or even Cyc, it is the most appropriate categorization in this context. Nonetheless, these results are hardly conclusive, for although large, Wikipedia is simply not large enough to provide the diversity of evidence needed to reliably derive heterogeneous category memberships. If a resource like Wikipedia lacks the necessary scale, surely this speaks to the futility of defining a context via a corpus?

We believe the answer to this dilemma lies not in ever-larger corpora (which may be too large to preserve the distinctive biases of a given context), but in the combination of different perspectives offered by the same corpus. We have described two different perspectives in this paper: the perspective of behaviour (captured via verb collocations) described in section 3, and the perspective of clustering (captured via group-word collocations) described in section 4. For instance, we know that Robot is the most representative member of the category army-agent in Wikipedia (with 63 examples), while army is itself a highly representative member of the category attack-agent. This suggests that Robot should also be a strong member of the category attack-agent. While Wikipedia records no uses of the collocation "robot who|which|that attacks", this joint perspective is sufficient evidence to support going to the web for this collocation. That is, the intuition that Robot is an attack-agent is consistent with the corpus, and thus the context, so the precise membership score can be determined using the larger context of the web.

Bootstrapping techniques like this should allow us to grow more heterogeneous category structures while respecting the ontological biases of the specific context. Once the deficiencies of relatively small corpora are addressed via such techniques, we expect to be better poised to fully explore the ramifications and opportunities of corpus-trained contextual ontologies.

ACKNOWLEDGEMENTS

We would like to thank Enterprise Ireland for supporting this research through a grant from the Commercialization Fund.

REFERENCES

[1] Dong, Z. and Dong, Q., HowNet and the Computation of Meaning, World Scientific, Singapore, 2006.
[2] Glucksberg, S. and Keysar, B., How Metaphors Work, in Metaphor and Thought (2nd edition), A. Ortony (Ed.), Cambridge University Press, 1993.
[3] Hofstadter, D., Fluid Concepts and Creative Analogies, Basic Books, 1995.
[4] Lakoff, G., Women, Fire and Dangerous Things, Chicago University Press, 1987.
[5] Lenat, D. and Guha, R. V., Building Large Knowledge-Based Systems, Addison Wesley, 1990.
[6] Miller, G. A., WordNet: A Lexical Database for English, Communications of the ACM, Vol. 38, No. 11, 1995.
[7] Searle, J., Metaphor, in Metaphor and Thought (2nd edition), A. Ortony (Ed.), Cambridge University Press, 1993.
[8] Veale, T., Analogy Generation in HowNet, in Proceedings of IJCAI'2005, the International Joint Conference on Artificial Intelligence, 2005.
[9] Wikipedia open-source encyclopaedia: www.wikipedia.org.
[10] van Harmelen, F., Serafini, L. and Stuckenschmidt, H., C-OWL: Contextualizing Ontologies, in Proceedings of the 2nd International Semantic Web Conference, 2003.
[11] Harris, Z., Mathematical Structures of Language, Wiley, 1968.
[12] Hindle, D., Noun Classification from Predicate-Argument Structures, in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 268-275, 1990.
[13] Cimiano, P., Hotho, A. and Staab, S., Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis, Journal of AI Research, Volume 24, pp. 305-339, 2005.
[14] Hearst, M., Automatic Acquisition of Hyponyms from Large Text Corpora, in Proceedings of the 14th International Conference on Computational Linguistics (COLING), pp. 539-545, 1992.