100 The Semantic Structure of Roget’s Thesaurus Cross-References L. John Old Edinburgh Napier University Scotland, United Kingdom Abstract This study analyzed a database version of Roget’s Thesaurus (Roget’s International Thesaurus, 3rd Edition, 1962) for connectivity patterns among cross-references in order to identify the implicit conceptual structure. Semantic patterns implicit in the data, at both the local and global levels of the Thesaurus structure, are identified. 1. Introduction This research follows conceptually from the work of W.A. Sedelow, Jr. and S. Yeates Sedelow (1979, 1986, 1990-1993), Priss (1996) and Old (2003), on Roget’s International Thesaurus (RIT: Berrey 1962). Patterns among local views of RIT, such as for example, semantic neighbourhood lattices (Priss, 2005); patterns emerging from global views of RIT such as word-overlap (with implied semantic overlap) between Categories (Old, 2002); and conceptual and semantic hubs and authorities (semantic switching centres) among senses and words (Steyvers & Tenenbaum, 2005) have previously been identified and readily represented. Roget’s Thesaurus cross-references, however, which form a kind of shadow, or skeletal network structure of the implicit structure of the Thesaurus as a whole, have not been studied in the same way. 2. The Explicit Structure of Roget’s Thesaurus The explicit structure of Roget’s Thesaurus is a hierarchy, or tree, implemented in the book in three main parts. Following the front matter is the top level of the hierarchy represented by what Roget called the tabular Synopsis of Categories. The Synopsis lists the structure down to the level of the 1,000 or so Categories (also called headwords, or lemmas, by some researchers). Most of the categories are arranged in opposed pairs, where the meanings of the pairs are antonymous. For example, 27 Equality versus 28 Inequality, and 648 Goodness versus 649 Badness. The Synopsis is followed by the body, or Sense Index of the book, which continues the hierarchy down to the lowest levels. The Sense Index lists the Categories representing the notions found at the bottom level of the Synopsis. Each Category contains the actual entries—instances of words, ordered by part-of-speech and grouped by sense, or synset (Miller et al., 1993). Synsets are grouped into broader notions, as paragraphs. The entries are commonly referred to as synonyms, though frequently there are other semantic relations at work. For example, the part-whole relation of meronymy, as illustrated by “parts of a ship” or “historical eras”. At the back of the book is the Word Index, listing the words in alphabetic order, along with references to their senses in the Sense Index, ordered by part-of-speech. Cross-references, as they appear in the text of Roget’s Thesaurus, are similar to entries. That is, they exist in synsets, separated by commas, as do regular entries. They differ in that they are sense index numbers, not members of the set of Words, and represent a relationship between their own synset and semantically related, but remote, synsets. Cross-references are an explicit shadow of the implicit structure of Roget’s Thesaurus. They are analogous to the links between synsets implied by words shared between synsets. So “Cross- reference,” like synonymy, is a relation. But Cross-reference differs in several ways. A cross- reference is directed—it has an origin and it has a destination. In other words it is not symmetric (although cross-references can be reciprocated between categories). A further idiosyncrasy is that a cross-reference points not to a single sense, but to a set of senses—always at either the Paragraph level or Category level. The example in Figure 1 shows a cross-reference from Category 1: Existence, Paragraph 2: Reality, (found in the third synset of the paragraph); this references Category 515: Truth, Paragraph 5: Genuineness. 1 EXISTENCE NOUNS 1. existence, subsistence, being; entity, essence; presence, occurrence; life 406. 2. reality, actuality, factuality; truth 515; authenticity 515.5; sober reality, grim reality, no joke, not a dream; thing-in-itself, ultimate reality. … Figure 1. An example of a cross-reference in situ [1:2:3 – 515:5]. The referencing (or source), and destination, or target, synsets here share a word in common. The source synset includes {authenticity 515.5}. The word authenticity is an “anchor” word, in the same way that an Internet hyperlink contains both the link (an HTTP reference to a remote location) and anchor text, which usually describes the target of the link. The first synset of the referenced, destination, or target Paragraph, 515:5, {genuineness, authenticity…realness, reality…} also contains authenticity. This mechanism occurs frequently among regular (called hereafter, normal) Roget’s Thesaurus cross-references. However, cross-references do not necessarily indicate shared strings (words in common) between the source and destination locations. The main purpose of a cross-reference is to indicate shared meaning, not shared words. 3. Types of Cross-References Instead of a normal cross-reference, a synset may contain several cross-references pointing to a sequence, or range of senses. For example, a cross-reference found in Category 299: Arrival (see Figure 2), points to three Paragraphs: Source Category 299: Arrival Paragraph 4: Welcome Synset 1: {welcome, greeting} Destination Category 923: Hospitality, welcome Paragraph 2: Welcome Category 923: Hospitality, welcome Paragraph 3: Greetings Category 923: Hospitality, welcome Paragraph 4: Greeting Figure 2. A range cross-reference: 291:4:1 – 923:2-4 This type of cross-reference (belonging to a set of two or more sequential cross-references) we call a range cross-reference. A cross-reference to a whole Category, a Category-only cross-reference appears in the text without a Paragraph index. An example is seen in Figure 1 as ”… ; life 406.” which ‘points’ to Category 406: Life. Usually there is an anchor word in the source location that is the actual name of the destination Category, as the previous examples referencing Category 515 Truth and Category 406 Life. However, about twenty-percent of cross-references do not anchor on the name of the destination category. Examples are: Anchor Source Destination explanation 550 543:3:1 Meaning 550 Interpretation sameness 14 30:2:2 Equality 14 Identity When a set of cross-references from one location refers to multiple locations in a remote Category there is clearly a very strong semantic relationship between the two Categories. In the following example the cross-references serve as links between similar concepts listed under the equivalent parts-of-speech in each Category. This type of cross-reference (belonging to a set of two or more concurrent cross-references) can be seen as concurrent cross-references. Note that, in this case, no character strings, or words, are shared between the source and destination. Source Paragraph Destination Paragraph POS 763:6:3 Submission Submit 764:2 Obedience Obey Verbs 763:12:2 Submission Submissive 764:3 Obedience Obedient Adjectives 763:17:2 Submission Submissively 764:6 Obedience Obediently Adverbs . A final cross-reference type is an internal reference, where the source and destination Categories are the same, but the Paragraphs are different. This is termed here a self-reference. This type occurred frequently in the original and older editions of Roget’s Thesaurus (total 1,946 for the 1911 Edition), but is now quite rare. The goal was to link a concept in one part-of-speech section to a more-specific set of words relating to the same concept. An example is found within Category 123: Oldness, in synset 123:4:3 containing the entry “archaeology.” This references synset 123:22, which lists branches of archaeology such as “paleoanthropology, paleohydrography, paleolatry, paleolithy, paleometeorology,” and “Egyptology.” Antonymous notions are classified, via the Synopsis, in adjacent Categories, so the Editors may have considered such references to be redundant. There are, however, twelve cross-references between adjacent Categories linking such complementary concepts as bequest and inheritance. 4. Descriptive Statistics There are approximately 3,772 cross-references in RIT. Of these, 3,171 point to Paragraphs and 601 point to whole Categories. Of the Paragraph-referencing cross-references there are 637 of the range cross-reference type; 1,313 of the concurrent cross-reference type; and 1,164 of the normal cross-reference type. 102 categories are not involved in any cross-references. Of the categories involved in cross- references, there are three types: • 114 categories contain cross-references, but are not ever referenced • 136 are referenced by cross-references but contain no cross-references • 691 categories both reference, and are referenced by, cross-references Ten percent (335) of the cross-references are of the first type, while ninety percent (3,437) are of the third type. Of course the second type has no cross-references, but they do involve ten percent of cross-references destinations. 5. The Implicit Structure of Roget’s Thesaurus This section describes and illustrates the results of analysis, and patterns, found among the RIT cross-references that imply structures other than those described in Section 2, above. This section begins with a brief discussion of the implicit structures discovered in previous research through the analysis of patterns of words and senses in Roget’s Thesaurus. 5.1. The Non-Cross-reference Implicit Structure A Small-world model can be utilized to account for much of the implicit structure of Roget’s Thesaurus. The model derives (Travers & Milgram, 1969) from the observation that people find, when first introduced, that they know people in common. There are many other variations on this theme, such as “went to the same school,” “come from the same town,” and so on, but Stanley Milgram set out to quantify how separated, or not, people really are from each other in terms of connections through other people. His experiment, where he had people pass letters to friends and acquaintances, recording the paths taken by the letters, confirmed our common assumption: that it really is a small world. A mathematical model developed from Milgram’s experiment has been found to be applicable to diverse natural phenomena (Watts 1999; Watts & Strogatz, 1998). The essence of the model is that in some large networks, such as social networks, the connectivity is such that no point, or node, in the network is ever far from another. Small-worlds may be characterized by particular measures. Word association data has about (on average) 3.0 degrees of separation. Old (2000) shows that Roget’s Thesaurus satisfies the criteria of being a small-world network, and Young (1993) shows that the neural network of the brain also fits the criteria. Other researchers (Steyvers & Tenenbaum, 2005; Motter, de Moura, Lai, & Dasgupta, 2002) find that Roget’s Thesaurus (1911 edition) has about 3 degrees of separation. WordNet has a higher degree, but this may be due to the fact that it has been organized into a classification structure that separates verbs from nouns from adjectives, and separates more general words from more specific words. A small-world network is not a homogeneous network – it is “lumpy,” with sparse areas and highly connected clusters. Kleinberg (1999) shows that the World Wide Web is also a small-world. Because URLs are directed (links go in only one direction) Kleinberg classifies the highly connected nodes (Web sites) into those that link to many Web pages and those that are linked to by many Web pages. The Google search engine also uses this principle. The small-world model suggests the probability that the underlying meanings of words form a vast interconnected semantic network. The words developed to express these meanings, if they formed a complete coverage (and Roget’s entries do, to the extent that the list is kept current), would also form such a network. Roget Categories arose by Roget forming clusters of like meaning words, and categorizing them by general notion. But if the actual organization of words is a small-world, how then do the Categories remain separated as words are added? Roget’s son, and the second editor of Roget’s Thesaurus, John knew this was a problem (see Section 7, Fan-in and Fan-out; Semantic Hubs and Authorities, below). 5.2. Cross-Reference Patterns As described in the discussion of the RIT cross-references in Section 3, there are five types of cross-references, termed here , normal-, category-, range-, concurrent-and self-referencing cross- references. These references, or links, are directed—they go in only one direction—from the source (referencing) to the destination (referenced) location. There is also an implied relation back from the referenced location to the source. This is equivalent to the concept of Internet back-links (also called reverse links or backward links); and in citation analysis called a “citation” -- as opposed to a “reference” (Small, 1978, p. 339). A reference corresponds to a regular RIT cross-reference or Internet hyperlink (a URL on a Web page). It is easy, looking at a particular published document, to see what other papers that document references, but impossible to see what citations it has (who has referenced it). Likewise, by looking at a Web page alone one cannot know what pages link to it. Search engines do, however, provide back-links on request; these show which Web pages link to a particular page (provided they are indexed by the search engine). The Google (Brin & Page, 1998) search engine uses the number, or count, of back-links that a Web page has as part of its measure of importance of the page on the Internet (Google, 2003, [PageRank Explained]). Using a database of thesaurus cross- references it is possible to identify the equivalent “cross-reference back-links.” A thesaurus cross-reference, such as 1.2.3 -> 515.5, from Category 1 to Category 515, could imply that there is a reciprocal relationship from Category 515 back to Category 1. However, as stated earlier, cross-references are directed arcs or links. While the count or number of links to a Category or concept may be significant in studying the importance of it, the semantics of the source and destination are not equivalent, so cross-reference back-links are not a semantic relation. In support of this view, the thesaurus editors supply return-, or reciprocal-cross-references in about only one third of the cases. For all three situations, cross-reference, hyperlink/back-link, or citation/reference, the arc between locations has different implications depending on which direction the arc is followed. Also contributing to the asymmetry for cross-references may be the fact that they are specific-to- general; the source is always a specified synset, and the destination is always at least a Paragraph (several synsets), and often a whole Category. That is because cross-references are meant to lead the thesaurus user to a broader or more specific notion, not just to the same or similar sense of the word adjacent to the cross-reference source—such information could be achieved simply by looking in the Word Index, at the back of the thesaurus. Figure 3 shows an implicit structure formed by reciprocated cross-references chains (Length 5), many of which are between Categories of different classes. This illustrates the type of coherence that exists across the thesaurus, but which is not available through any explicit structure or organization of the thesaurus. Figure 3 Graph combining all chains of reciprocated cross-references of at least five nodes length 6. Implications of Cross-references among Upper Levels of the Hierarchy A cross-reference is not only between a synset and a remote Paragraph, or Category. It is also between the hypernyms, or upper level nodes of the hierarchy above that synset, and the hypernyms above the referenced Paragraph or Category. Most cross-references do not cross Class boundaries. That is, they usually reference Categories within the same Class. Those that do cross boundaries reflect strong relationships between the Classes. Class-crossing, reciprocated, whole-Category links are represented by the links in the graph in Figure 4, labelled by the number of links that occur. The relationship is strongest between the Intellect and the Affections Classes. An example of the nature of the relationships between Classes is shown in Figure 5. The example suggests that, despite the fact that their semantics and words overlap (as evidenced by the strong, reciprocated, whole-Category link between the two Categories), a qualitative division exists between these almost-equivalent concepts found categorized under the Affections and Intellect Classes. A second example, more elaborated, is given in Figure 6 to support this observation. Whole-Category links between the Affections Class and Intellect Class, such as (for a further example) 921: Unsociability )—( 611: Uncommunicativeness, suggest that whether a concept has social-emotional connotations, or is purely intellectual (at least, to the observer) affects the semantics of practically identical concepts, and consequently, the way in which the concepts are classified. Matter 2 1 Abstract 1 Space Intellect Relations 2 5 3 Volition Affection 3 s Figure 4. Whole-Category reciprocated links crossing Class boundaries Class level: 7: Affections C6: Intellect Category 867: Discontent  539: Disappointment Level: Figure 5. A reciprocated, whole-Category link at the Class level In this way Hope can be seen as the emotional equivalent of Expectation, an emotionally neutral, intellectual notion; and Care the intellectual equivalent of Caution. Likewise science (an intellectual pursuit) does “541: Prediction,” but when non-scientists make claims about the future they are said to {foretell, augur, divine, prophesy, forecast…}, and it is called “1032: Occultism.” Similar analysis can be made of the second level, or Roman-level Classes, of the hierarchy; and the third-level, or Letter Classes. Examples of strong, reciprocated, whole-Category links that cross only Letter Class boundaries (both derive from the same top level Class and Roman-level Classes) are given in Figure 7. Class level: 7: Affections 6: Intellect Roman Class Level: I. Personal Affections II. States of Mind Letter Class Level: D. Contemplative Emotions D. Anticipation Category Level: 886: Hope  537: Expectation Figure 6. Reciprocated, whole-Category link shown at all Class levels At this lower level of the hierarchy Categories are related only by the fact that they share the very broad notion of their Roman-level Classes—they represent dimensions of the Roman Class notion. For example, Categories 16: Difference and 21: Dissimilarity share only Roman Class II: Relation. They are discriminated by their Letter Classes, A: Absolute Relation and B: Partial Relation. The relationship unearthed by reciprocated, whole-Category links shows that distant Categories can bear a close, possibly redundant, semantic relationship. This is not a criticism of Roget’s hierarchy (although the hierarchy may warrant criticism) as semantics is multi-faceted and multi- dimensional and it should be expected that not all facets of meaning shared between two notions could be represented by a single relation, or even a single structure. The words classified under a Category in one Class (or facet) will be different from the words classified under a Category in a different Class (or facet), even though the notions which the Categories represent may seem the same. Category 537: Expectation contains 147 entries and Category 886: Hope contains 154 entries—but they have only 10 words in common. Category Name Category Name 16 Difference 21 Dissimilarity 38 Increase 40 Addition 179 Region 183 Location 195 Littleness 202 Shortness 468 Unintelligence 476 Ignorance 495 Misjudgment 517 Error 502 Unbelief 513 Uncertainty 555 Information 560 Teaching 739 Government 745 Direction, Management 819 Borrowing 838 Debt 920 Sociability 925 Friendship Figure 7. Reciprocated, whole-Category links that cross Letter Class boundaries only. 6.1. Implications The idea of “set implication” suggests that subsets imply their supersets. In general, a word in RIT that has a set of senses that is a subset of senses of a second word, implies the second word. In Figure 8, the words on the left have fewer senses than the words on the right, and those senses are a subset of the senses where the words on the right are found, in RIT. So the words on the left imply the words on the right. SubSet SuperSet stereoscopic 3-D deserted abandoned circa about deem allow stipend allowance gory bloody take the edge off blunt turn red blush tranquil calm Figure 8. Set implication between RIT words. The words on the left are rarer (have fewer senses) and are more specific, while the words on the right are more polysemous (have more senses), and are more general. For native English speakers the words and phrases on the left tend to be less familiar. Consequently those on the right tend to be explanatory. Implications can form chains: poodle and terrier both imply dog; dog and cat in turn imply animal; animal implies living thing, and so on. In this way synsets, being subsets, can be seen to imply Paragraphs, which in turn imply Categories; and so on up the hierarchy. A cross-reference carries with it these implications. Implication associated with cross-reference is illustrated schematically in Figure 9. Category Category Paragraph Paragraph Cross- Synset reference Figure 9. Implications (dotted lines) in cross-reference (solid line) The source synset of a cross-reference is almost always a smaller set of concepts (96%) than the destination of the cross-reference. It does not, however, always contain a word that is contained in the destination set. Of the whole-Category links, those that contain an identical string in both the source and destination provide semantic evidence of implication in cross-references—the source Category concept implies the destination Category concept (there is an inference from the source to the destination). Figure 10 illustrates this (the source Category concepts are on the left). Source Source Name Destination Destination Name 30 Equality 14 Identity 34 Greatness 194 Size 38 Increase 196 Growth 82 Conformity 643 Convention 119 Past 123 Oldness 140 Permanence 112 Perpetuity 143 Continuance 110 Durability 168 Reproduction 22 Imitation 179 Region 183 Location 197 Contraction 39 Decrease 489 Measurement 29 Degree 513 Uncertainty 502 Unbelief 529 Inattention 532 Neglect 539 Disappointment 867 Discontent Figure 10. Implication in cross-references As mentioned earlier, many Categories participating as sources and destinations of cross- references share the anchor term. Figure 11 shows the cross-references anchored on the entry paint. Anchor Source Destination paint Art-572:19:12 Color-361:7:2 paint Covering-227:13:5 Color-361:7:2 paint Ornamentation-899:8:12 Color-361:14:1 paint Representation-570:7:2 Art-572:20:2 Figure 11. Cross-references anchored on the term “paint” Examples of cross-references that do not share the anchor term are Category 137: Regular Recurrence, anchored on holy days referencing a Paragraph in Category1038: Religious Rites, that contains a list of holy days (but not the actual term “holy days”); Category 123: Oldness, anchored on ancient manuscripts referencing a Paragraph in Category 600: Writing, that contains a list of important ancient manuscripts; and Category 161: Violence, anchored on windstorm and referencing a Paragraph in Category 402: Wind, that contains such terms as sand spout, dust-devil, cyclone and hurricane. Although there is no shared anchor term and no inference between the Categories, there is still clearly method to these cross-references. It is noteworthy that the average polysemy of cross-reference anchors is 6.33 senses, while the average polysemy of Thesaurus entries, in general, is about 2.3 senses. This is probably a consequence of the fact that normal cross-references indicate further senses of a polysemous word, and exclude words with only one sense (about 40% of all words). 7. Fan-in and Fan-out; Semantic Hubs and Authorities The examples of cross-reference types given earlier have all been semantically strong cross- references. The majority of cross-references, however, are of the weaker types—pointing from a synset, to one or two paragraphs, unreciprocated by a link of the same type. Source Categories have as many as 27 outgoing links, of all or any types. Destination Categories have as many as 33 links directed to them. Using electrical circuit terminology these counts of cross-reference are referred to as fan-in and fan-out1 (fan, because connections with many links look like a fan when drawn on a circuit board diagram). Categories with high fan-in or fan-out are analogous to the hubs and authorities (Kleinberg, 1999) identified in studies of the distribution and density of hyperlinks to and from Internet Web pages. Those Categories with a high fan-in are like semantic authorities, referred to by other Categories; those Categories with a high fan-out are like the hubs, referring to assorted Categories across the thesaurus for semantic-authority. Like Web pages, not all Categories have links, and some are never referenced; and Categories may participate in both sets. Figure 12 and Figure 13 show the top twenty Categories by fan-in and fan-out, or cross-reference count. The top couple of Categories by cross-reference count are intellectual in nature, but on the whole the Categories represent negative emotional notions such as sadness, falseness, deception, 1 Also known by the terms: in-degree and out-degree. uncertainty, displeasure (existing in both sets), and other notions with negative connotations such as disease and weakness. The most common themes for authority Categories in RIT are (considering all Categories, and totaling cross-references at the Roman Class level): • 6:I: Intellectual faculties and processes, 349 links; • 8:I: Personal affections, 348 links; • 6:III: Communication of ideas, 281 links; • 7:I: Volition in general, 231 links; • 2:IV: Motion, 213 links. 6:I: Intellectual faculties and processes includes Categories 513: Uncertainty, 472: Insanity, Mania, and 469: Foolishness among its top authorities; and 6:III: Communication of ideas includes 614: Falseness and 616: Deception. Almost the same Roman Classes appear for the hub Categories, except that Volition in general disappears. Fan In Fan In Cat# Label Count Cat# Label Count 466 Intelligence, Wisdom 33 864 Displeasure 19 474 Knowledge 30 159 Weakness 18 870 Sadness 26 469 Foolishness 18 512 Certainty 26 855 Excitement 17 542 Foreboding 24 532 Neglect 17 614 Falseness 21 227 Covering 16 646 Motivation, Inducement 21 112 Perpetuity 16 616 Deception 21 336 Darkness 15 513 Uncertainty 21 907 Vanity 15 472 Insanity, Mania 21 967 Disapprobation 15 Figure 12. The top 20 (of 821) destination Categories by fan-in. Authority-like nodes. Fan Out Fan Out Cat# Label Count Cat# Label Count 572 Art 27 418 Sex 19 562 Learning 25 537 Expectation 18 1002 Lawsuit 23 697 Protection 18 973 Improbity 22 876 Amusement 18 614 Falseness 21 270 Transference 18 684 Disease 20 635 Choice 18 514 Gamble 19 870 Sadness 17 Psychology, 688 Psychotherapy 19 616 Deception 17 541 Prediction 19 864 Displeasure 17 680 Uncleanness 19 540 Foresight 16 Figure 13. The top 20 (of 782) source Categories by fan-out. Hub-like nodes. Almost all of the semantically strong cross-references and most (75%) of the unreciprocated cross- references between cross-reference hubs and authorities occur within the Roman level Classes. In other words, a strongly connected hub and authority pair will usually occur within a single Roman- level Class. This suggests strong coherence within Roman Classes. Figure 14 lists examples of the links running from hubs to authorities that cross Roman Class boundaries. This illustrates the type of coherence that exists between the Roman-level Classes. RC1 Cat1 Category Name1 ParaName1 < Shared Word > ParaName2 Category Name2 Cat2 RC2 2.IV 270 Transference carrier letter carrier postman Messenger 559 6.III 2.IV 308 Ejection get rid of throw away discard Disuse 666 7.I 2.IV 323 Agitation agitated excited excited Excitement 855 8.I 6.I 495 Misjudgment misjudge misconstrue misinterpret Misinterpretation 551 6.III 7.I 629 Avoidance dodge shrink pull back Reaction 283 2.IV 7.I 655 Way passage-way inlet place for entering Ingress, Entrance 301 2.IV 8.I 860 Impatience impatient impetuous impulsive Impulsiveness 628 7.I 8.I 881 Dullness triteness cliché platitude Maxim 516 6.I Figure 14. Links running from hubs to authorities crossing Roman-level Class boundaries. The semantic connections between remote clusters are clearly reasonable and could bring into question the reasonableness of the locations chosen for the Categories and Classes that participate in the clusters, in the classification system. However the majority of semantically strong links exist within the Roman-level Classes; this sample is representative of only about 15% of the total cross- references between the core hubs and authorities; the other 85% are internal to their Roman-level Classes. This sample probably illustrates John Lewis Roget’s (Peter Roget’s son) assertion that: Many words, originally employed to express simple conceptions, are found to be capable, with perhaps a very slight modification of meaning, of being applied in many varied associations. Connecting links, thus formed, induce an approach between the categories; and a danger arises that the outlines of the classification may, by their means, become confused and eventually merged (Roget, J. L, 1879, p. ix). Furthermore, the relations in this sample often represent 1) implications, 2) cause and effect, or 3) general-to-specific instances; rather than equivalence. For example, impatience is an internal state, while impulsiveness is observable, and it could be said that the first leads to the second. Also, these are further examples of the multi-facetedness of semantics discussed earlier—that similar notions are not identical notions. The context (such as emotional or intellectual context) often demands a different vocabulary, and justifies the apparent redundancy of some Categories in different parts of the classification hierarchy. The alternative to cross-references is for all related senses (Synsets of words) to be repeated, separately under their relevant Categories. But that also has drawbacks --either the categories become so interconnected that they are indistinguishable, or they become so big that their core ideas cannot be discriminated. This reflection of the small-world phenomenon (in modern terminology) became more of a concern for Roget Jr., as he added more and more words. The only solution he foresaw was to use cross-references (this was in contradiction to his father’s advice, which had been to repeat related Synsets under every category). So the cross-references now also participate in the small-world network and “may … be looked upon as indicating in some degree the natural points of connection between the categories” (Roget, J. L, 1879,, p. xi). They solve the essential problem, that “as would be in any classification of language, a large proportion of expressions … lie on the ill-defined border between one category and another” (ibid, p. xi). 8. Conclusion Cross-references form an elaborate network of links throughout the thesaurus. Latent semantic information can be extracted from the cross-references by 1) classifying them then selecting relationships among the different types of cross-references; 2) by examining the density of cross- references at specific levels of the hierarchy; and 3) by studying the semantics shared by disparate locations in the thesaurus linked by cross-references. The links range from semantically strong, reciprocated, whole-Category cross-references located in different Classes, to weak self- referencing links that reference locations within their own source Categories. Citations, hyperlinks and cross-references, unlike other forms of RIT connectivity, are all directed links. Cross-reference link densities are similar to those found among the hyperlinks of Web pages, suggesting the same hub-like and authority-like connectivity of a small-world model (although it has not been tested here mathematically). There is strong coherence within the top, Roman Class, and Letter Class levels—the majority of the cross-references, by source and destination, fall within the bounds of the same class. There is also a significant minority of cross-references which cross boundaries. Strogatz (2001) points out that a small-world network falls somewhere between networks of random connections (with isolated fragments, or components) and regular networks (up to fully connected). The latter may be highly clustered, but with long paths required to cross the clusters. These are analogous to Classes and Categories. By adding random links [“the slightest bit of rewiring” (Strogatz, 2001, p.273)] to models of networks of this type, they soon transform into a small-world network. The added paths act like short-circuits cross-linking clusters and parts of clusters, facilitating short paths across and between them. These random links are analogous to the cross-references that cross Class boundaries. References Berrey, L. (Ed.). (1962). Roget’s international thesaurus (3rd ed.). New York: Crowell. Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual (Web) Search Engine. Computer Networks and ISDN Systems. 30(1-7), 107-117. Google (2003). Our search: Google technology. Available at http://www.google.com/technology/index.html Kleinberg, J. M. (1999). Hubs, authorities, and communities. ACM Computing Surveys, 31(4es): 5. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K., & Tengi, R. (1993). Five papers on WordNet. Technical Report. Princeton, N.J: Princeton University. Old, L. John, (2003). The Semantic Structure of Roget's, A Whole-Language Thesaurus. (Doctorial dissertation, Indiana University, 2003). Dissertation Abstracts International. Old, L. John, (2002). Information Cartography Applied to the Semantics of Roget's Thesaurus. Proceedings, 13h Midwest Artificial Intelligence and Cognitive Science Conference (MAICS'02), Chicago, Illinois. Priss, U. (1996). Relational Concept Analysis: Semantic structures in dictionaries and lexical databases. (Doctoral Dissertation, Technical University of Darmstadt, 1998). Aachen, Germany: Shaker Verlag. Priss, U. (2005). Linguistic Applications of Formal Concept Analysis. In: Ganter; Stumme; Wille (eds.), Formal Concept Analysis, Foundations and Applications. Springer Verlag. LNAI 3626, p. 149-160. Roget, J. L. (1879). Thesaurus Of English Words And Phrases Classified And Arranged So As To Facilitate The Expression Of Ideas And Assist In Literary Composition by Peter Mark Roget, M.D., F.R.S. New York, NY: United States Book Company. Sedelow, S.Y. (1991). Exploring the terra incognita of whole-language thesauri. In R. Gamble & W. Ball (Eds.), Proceedings of the Third Midwest AI and Cognitive Science Conference (pp. 108-111). Carbondale, IL: Southern Illinois University. Sedelow, S.Y. (1993). Formally modeling and extending whole-language-scale semantic space. Behavior Research Methods, Instruments and Computers, 25(2), 310-314. Sedelow, S. Y. & Sedelow, W. A., Jr. (1986a). The lexicon in the background. Computers and Translation, 1(2), 73-81. Sedelow, S. Y., & Sedelow W. A., Jr. (1986b). Thesaural knowledge representation. In Proceedings of the 2nd International Conference of the University of Waterloo Centre for the New Oxford English Dictionary: Advances in Lexicology (pp. 29-43). Waterloo, ON: University of Waterloo. Sedelow, S.Y., & Sedelow, W. A., Jr. (1992). Recent model-based and model-related studies of a large-scale lexical resource. In Proceedings of COLING-92, 1, 1223-1227. Sedelow, S.Y., & Sedelow, W. A., Jr. (1994a). Graph theory, set theory, & order theory in semantic space: Analysis for use in knowledge representation. In J. Liebowitz (Ed.), Proceedings of the Second World Congress on Expert Systems. New York: Cognizant Communications Corporation. (CD ROM - The World Congress on Expert Systems ‘94. Cambridge, MA: Macmillan New Media). Sedelow, W. A., Jr. (1990). Computer-based planning technology: an overview of inner structure analysis. In L. J. Old (Ed.), Getting at disciplinary interdependence, (pp. 7-23). Little Rock, AR: Arkansas University Press. Sedelow, W. A., Jr. (1993). The formal analysis of concepts. Behavioral Research Methods, Instruments and Computers 25(2), 314-317. Sedelow, W.A., Jr., & Sedelow, S.Y. (1979). Graph theory, logic, and formal languages in relation to language research,. In W. A. Sedelow Jr. & S. Y. Sedelow (Eds.), Computers in Language Research: Formal Methods (pp. 7-17). The Hague: Mouton. Small, H. (1978). Cited documents as concept symbols. Social Studies of Science, 8, 327-340. Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1). Strogatz, H. (2001). Exploring complex networks. Nature, 410, 268-276. Travers, J, & Milgram, S. (1969). An experimental study of the small world problem. Sociometry, 32(4), 425-443. Watts, D.J. (1999). Small worlds: The dynamics of networks between order and randomness. Princeton, NJ: Princeton University Press. Watts, D.J., & Strogatz, S.H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393, 440-442.