=Paper=
{{Paper
|id=Vol-31/paper-5
|storemode=property
|title=Enriching very large ontologies using the WWW
|pdfUrl=https://ceur-ws.org/Vol-31/EAgirre_14.pdf
|volume=Vol-31
|dblpUrl=https://dblp.org/rec/conf/ecai/AgirreAHM00
}}
==Enriching very large ontologies using the WWW==
<pdf width="1500px">https://ceur-ws.org/Vol-31/EAgirre_14.pdf</pdf>
<pre>
Enriching very large ontologies using the WWW

Eneko Agirre (1), Olatz Ansa (1), Eduard Hovy (2) and David Martínez (1)

(1) IxA NLP group. University of the Basque Country. 649 pk. 20.080
    Donostia. Spain. Email: eneko@si.ehu.es, jipanoso@si.ehu.es,
    jibmaird@si.ehu.es
(2) USC Information Sciences Institute, 4676 Admiralty Way, Marina del
    Rey, CA 90292-6695, USA. Email: hovy@isi.edu.

Abstract. This paper explores the possibility of exploiting text on the
world wide web in order to enrich the concepts in existing ontologies.
First, a method to retrieve documents from the WWW related to a concept is
described. These document collections are used 1) to construct topic
signatures (lists of topically related words) for each concept in WordNet,
and 2) to build hierarchical clusters of the concepts (the word senses)
that lexicalize a given word. The overall goal is to overcome two
shortcomings of WordNet: the lack of topical links among concepts, and the
proliferation of senses. Topic signatures are validated on a word sense
disambiguation task with good results, which are improved when the
hierarchical clusters are used.

1  INTRODUCTION

Knowledge acquisition is a long-standing problem in both Artificial
Intelligence and Computational Linguistics. Semantic and world knowledge
acquisition pose a problem with no simple answer. Huge efforts and
investments have been made to build repositories with such knowledge
(which we shall call ontologies for simplicity) but with unclear results,
e.g. CYC [1], EDR [2], WordNet [3]. WordNet, for instance, has been
criticized for its lack of relations between topically related concepts,
and the proliferation of word senses.

As an alternative to entirely hand-made repositories, automatic or
semi-automatic means have been proposed for the last 30 years. On the one
hand, shallow techniques are used to enrich existing ontologies [4] or to
induce hierarchies [5], usually analyzing large corpora of texts. On the
other hand, deep natural language processing is called for to acquire
knowledge from more specialized texts (dictionaries, encyclopedias or
domain specific texts) [6][7]. These research lines are complementary;
deep understanding would provide specific relations among concepts,
whereas shallow techniques could provide generic knowledge about the
concepts.

This paper explores the possibility of exploiting text on the world wide
web in order to enrich WordNet. The first step consists of linking each
concept in WordNet to relevant document collections on the web, which are
further processed to overcome some of WordNet’s shortcomings.

On the one hand, concepts are linked to topically related words. Topically
related words form the topic signature for each concept in the hierarchy.
As in [8][9] we define a topic signature as a family of related terms
{t, <(w_1,s_1) ... (w_i,s_i) ...>}, where t is the topic (i.e. the target
concept) and each w_i is a word associated with the topic, with strength
s_i. Topic signatures resemble relevancy signatures [10], but are not
sentence-based, do not require parsing to construct, and are not suitable
for use in information extraction. Topic signatures were originally
developed for use in text summarization.

On the other hand, given a word, the concepts that lexicalize it (its word
senses) are hierarchically clustered [11], thus tackling sense
proliferation in WordNet.

Evaluation of automatically acquired semantic and world knowledge
information is not an easy task. In this case we chose to perform
task-oriented evaluation, via word sense disambiguation. That is, we used
the topic signatures and hierarchical clusters to tag a given occurrence
of a word with the intended concept. The benchmark corpus for evaluation
is SemCor [12]. Our aim is not to compete with other word sense
disambiguation algorithms, but to test whether the acquired knowledge is
valid.

This paper describes preliminary experiments. Several aspects could be
improved and optimized but we chose to pursue the entire process first, in
order to decide whether this approach is feasible and interesting. The
resulting topical signatures and hierarchical clusters and their use on
word sense disambiguation provide exciting perspectives.

The structure of the paper follows the same spirit: we first explain our
method and experiments, and later review some alternatives, shortcomings
and improvements. Section two reviews the ontology used and the benchmark
corpus for word sense disambiguation. Next the method to build the topic
signatures is presented, and a separate section shows the results on a
word sense disambiguation task. The clustering method is presented
alongside the associated word sense disambiguation results. Related work
is discussed in the following section, and finally some conclusions are
drawn and further work is outlined.

2  BRIEF INTRODUCTION TO WORDNET AND SEMCOR

WordNet is an online lexicon based on psycholinguistic theories [3]. It
comprises nouns, verbs, adjectives and adverbs, organized in terms of
their meanings around lexical-semantic relations, which include among
others, synonymy and antonymy, hypernymy and hyponymy (similar to is-a
links), meronymy and holonymy (similar to part-of links). Lexicalized
concepts, represented as sets of synonyms called synsets, are the basic
elements of WordNet. The version used in this work, WordNet 1.6, contains
121,962 words and 99,642 concepts.

The noun boy, for instance, has 4 word senses, i.e. lexicalized concepts.
The set of synonyms for each sense and the gloss is shown below:

   1: male child, boy, child - a youthful male person
   2: boy - a friendly informal reference to a grown man
   3: son, boy - a male human offspring
   4: boy - offensive term for Black man

WordNet is one of the most commonly used semantic resources in natural
language processing, and some of its shortcomings are broadly
acknowledged:

1. It lacks explicit links among semantic variant concepts with different
   parts of speech; for instance paint–to paint or song–to sing are not
   related.

2. Topically related concepts are not explicitly related: there is no link
   between pairs like bat–baseball, fork–dinner, farm–chicken, etc.

3. The proliferation of word sense distinctions in WordNet is difficult to
   justify and use in practical terms, since many of the distinctions are
   unclear. Line for instance has 32 word senses. This makes it very
   difficult to perform automatic word sense disambiguation.

This paper shows how to build lists of words that are topically related to
a topic (a concept). These lists can be used to overcome the shortcomings
just mentioned. In particular we show how to address the third issue,
using the lists of words to cluster word senses according to the topic.

SemCor [12] is a corpus in which word sense tags (which correspond to
WordNet concepts) have been manually included for all open-class words in
a 360,000-word subset of the Brown Corpus. We use SemCor to evaluate the
topic signatures in a word sense disambiguation task. In order to choose a
few nouns to perform our experiments, we focused on a random set of 20
nouns which occur at least 100 times in SemCor. The set comprises commonly
used nouns like boy, child, action, accident, church, etc. These nouns are
highly polysemous, with 6.3 senses on average.

3  BUILDING TOPIC SIGNATURES FOR THE CONCEPTS IN WORDNET

In this work we want to collect for each concept in WordNet the words that
appear most distinctively in texts related to it. That is, we aim at
constructing lists of closely related words for each concept. For example,
WordNet provides two possible word senses or concepts for the noun waiter:

   1: waiter, server - a person whose occupation is to serve at table
      (as in a restaurant)
   2: waiter - a person who waits or awaits

For each of these concepts we would expect to obtain two lists with words
like the following:

   1: restaurant, menu, waitress, dinner, lunch, counter, etc.
   2: hospital, station, airport, boyfriend, girlfriend, cigarette, etc.

The strategy to build such lists is the following (cf. Figure 1). We first
exploit the information in WordNet to build queries, which are used to
search the Internet for texts related to the given word sense. We organize
the texts in collections, one collection per word sense. For each
collection we extract the words and their frequencies, and compare them
with the data in the other collections. The words that have a distinctive
frequency for one of the collections are collected in a list, which
constitutes the topic signature for each word sense.

[Figure 1. Overall design. Target word -> look-up in WordNet -> sense 1..N
 + information -> build queries -> query 1..N -> query the WWW -> document
 collection 1..N -> build signatures -> topic signature 1..N]

The steps are further explained below.

3.1  Building the queries

The original goal is to retrieve from the web all documents related to an
ontology concept. If we assume that such documents have to contain the
words that lexicalize the concept, the task can be reduced to classifying
all documents where a given word occurs into a number of collections of
documents, one collection per word sense. If a document cannot be
classified, it would be assigned to an additional collection.

The goal as phrased above is unattainable, because of the huge number of
documents involved. Most words get millions of hits: boy would involve
retrieving 2,325,355 documents, church 6,243,775, etc. Perhaps in the
future a more ambitious approach could be tried, but at present we cannot
aim at classifying those enormous collections. Instead, we construct
queries, one per concept, which are fed to a search engine. Each query
will retrieve the documents related to that concept.

The queries are constructed using the information in the ontology. In the
case of WordNet each concept can include the following data: words that
lexicalize the concept (synonyms), a gloss and examples, hypernyms,
hyponyms, meronyms, holonyms and attributes. Altogether a wealth of
related words is available, which we shall call cuewords. If a document
contains a high number of such cuewords around the target word, we can
conclude that the target word corresponds to the target concept. The
cuewords are used to build a query which is fed into a search engine,
retrieving the collection of related documents.

As we try to constrain the retrieved documents to the ‘purest’ documents,
we build the queries for each word sense trying to discard documents that
could belong to more than one sense. For instance, the query for word x in
word sense i (j and k being the other word senses of x) is constructed as
follows:

   (x AND (cueword_{1,i} OR cueword_{2,i} ...)
      AND NOT (cueword_{1,j} OR cueword_{2,j} ... OR
               cueword_{1,k} OR cueword_{2,k} ...))

where cueword_{l,m} stands for the cueword l of word sense m. This boolean
query searches for documents that contain the target word together with
one of the cuewords of the target concept, but do not contain any of the
cuewords of the remaining concepts. If a cueword appears in the
information relative to more than one sense, it is discarded.

Deciding which of the cuewords to use, and when, is not an easy task. For
instance, nouns in the definition are preferable to the other parts of
speech, monosemous cuewords are more valuable than polysemous ones,
synonyms provide stronger evidence than meronyms, other concepts in the
hierarchy can also be used, etc. After some preliminary tests, we decided
to experiment with all available information: synonyms, hypernyms,
hyponyms, coordinate sisters, meronyms, holonyms and nouns in the
definition. Table 1 shows part of the information available for sense 1 of
boy.

Table 1. Information for sense 1 of boy.

synonyms      male child, child
gloss         a youthful male person
hypernyms     male, male person
hyponyms      altar boy, ball boy, bat boy, cub, lad, laddie, sonny,
              sonny boy, boy scout, farm boy, plowboy, ...
coordinate    chap, fellow, lad, gent, fella, blighter, cuss, foster
sisters       brother, male child, boy, child, man, adult male, ...

The query for sense 1 of boy would include the above cuewords plus the
negation for the cuewords of the other senses. An excerpt of the query:

   (boy AND ('altar boy' OR 'ball boy' OR ... OR 'male person')
        AND NOT ('man' ... OR 'broth of a boy' OR       # sense 2
                 'son' OR ... OR 'mama’s boy' OR        # sense 3
                 'nigger' OR ... OR 'black'))           # sense 4

3.2  Search the internet

Once the queries are constructed we can use a number of different search
engines. We started to use just the first 100 documents from a list of
search engines. This could bias the documents, and some could be retrieved
repeatedly. Therefore, unlike [13], we decided to use only one search
engine, the most comprehensive search engine at the time, AltaVista [14].
AltaVista allows complex queries which were not possible in some of the
other web search engines.

The number of documents retrieved for the 20 words amounts to the tens of
thousands, taking more than one gigabyte of disk space once compressed,
and 9 days of constant internet access. For instance, it took three and a
half hours to retrieve the 1,217 documents for the four senses of boy,
which took 100 megabytes once compressed.

3.3  Build topic signatures

The document collections retrieved in step 3.2 are used to build the topic
signatures. The documents are processed in order to extract the words in
the text. We did not perform any normalization; the words are collected as
they stand. The words are counted and a vector is formed with all words
and their frequencies in the document collection. We thus obtain one
vector for each collection, that is, one vector for each word sense of the
target word.

In order to measure which words appear distinctively in one collection
with respect to the others, a signature function was selected based on
previous experiments [15][13]. We needed a function that would give high
values for terms that appear more frequently than expected in a given
collection. The signature function that we used is χ², which we will
define next.

The vector vf_i contains all the words and their frequencies in the
document collection i, and is constituted by pairs (word_j, freq_{i,j}),
that is, one word j and the frequency of the word j in the document
collection i. We want to construct another vector vx_i with pairs
(word_j, w_{i,j}) where w_{i,j} is the χ² value for the word j in the
document collection i (cf. Equation 1).

   w_{i,j} = \begin{cases}
               \frac{freq_{i,j} - m_{i,j}}{m_{i,j}} & \text{if } freq_{i,j} > m_{i,j} \\
               0 & \text{otherwise}
             \end{cases}                                              (1)

Equation 2 defines m_{i,j}, the expected mean of word j in document
collection i.

   m_{i,j} = \frac{\sum_i freq_{i,j} \cdot \sum_j freq_{i,j}}
                  {\sum_{i,j} freq_{i,j}}                             (2)

When computing the χ² values, the frequencies in the target document
collection are compared with the rest of the document collection, which we
call the contrast set. In this case the contrast set is formed by the
other word senses. Excerpts from the signatures for boy are shown in
Table 2.

Table 2. Top words in signatures for three senses of boy.

Boy1                    Boy2                    Boy3
(child 9854)            (gay 7474)              (human 5023)
(Child 5979)            (reference 5154)        (son 4898)
(person 4671)           (tpd-results 3930)      (Human 3055)
(anything.com 3702)     (sec 3917)              (Soup 1852)
(Opportunities 1808)    (gay 2906)              (interactive 1842)
(Insurance 1796)        (Xena 1604)             (hyperinstrument 1841)
(children 1458)         (male 1370)             (Son 1564)
(Girl 1236)             (ADD 1304)              (clips 1007)
(Person 1093)           (storing 1297)          (father 918)
(Careguide 918)         (photos 1203)           (man-child 689)
(Spend 839)             (merr 1077)             (measure 681)
(Wash 821)              (accept 1071)           (focus 555)
(enriching 774)         (PNorsen 1056)          (research 532)
(prizes 708)            (software 1021)         (show 461)
(Scouts 683)            (adult 983)             (Teller 456)
(Guides 631)            (penny 943)             (Yo-Yo 455)
(Helps 614)             (PAGE 849)              (modalities 450)
(Christmas 525)         (Sex 835)               (performers 450)
(male 523)              (Internet 725)          (senses 450)
(address 504)           (studs 692)             (magicians 448)
(paid 472)              (porno 675)             (percussion 439)
(age 470)               (naked 616)             (mother 437)
(mother 468)            (erotic 611)            (entertainment 391)
...up to 6.4 Mbytes     ...up to 4.4 Mbytes     ...up to 4.7 Mbytes

4  APPLY SIGNATURES FOR WORD SENSE DISAMBIGUATION

The goal of this experiment is to evaluate the automatically constructed
topic signatures, not to compete against other word sense disambiguation
algorithms. If topic signatures yield good results in word sense
disambiguation, it would mean that topic signatures have correct
information, and that they are useful for word sense disambiguation. Given
the following sentence from SemCor, a word sense disambiguation algorithm
should decide that the intended meaning for waiter is that of a restaurant
employee:

   "There was a brief interruption while one of O’Banion’s men jerked out
   both his guns and threatened to shoot a waiter who was pestering him
   for a tip."

Word sense disambiguation is a very active research area (cf. [16] for a
good review of the state of the art). Present word sense disambiguation
systems use a variety of information sources [17] which play an important
role, such as collocations, selectional restrictions, topic and domain
information, co-occurrence relations, etc. Topic signatures constitute one
source of evidence, but do not replace the others. Therefore, we do not
expect impressive results.

The word sense disambiguation algorithm is straightforward. Given an
occurrence of the target word in the text we collect the words in its
context, and for each word sense we retrieve the χ² values for the context
words in the corresponding topic signature.
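
As an illustration only, the weighting of Equations 1 and 2 and the
per-sense lookup of context words might be sketched in Python as below.
This is a minimal sketch, not the implementation used in the experiments:
the function names chi2_signatures and score_senses are hypothetical, the
toy collections are invented, and each collection is assumed to be already
tokenized into a list of words. Summing the retrieved weights per sense is
the scoring applied for disambiguation.

```python
from collections import Counter


def chi2_signatures(collections):
    """Map each sense to {word: weight} following Equations 1 and 2.

    `collections` maps a sense label to the list of tokens found in its
    document collection (no normalization, as in the paper).
    """
    freq = {sense: Counter(tokens) for sense, tokens in collections.items()}
    # Row totals: sum over words j of freq[i][j], one per collection i.
    row_total = {sense: sum(counts.values()) for sense, counts in freq.items()}
    # Column totals: sum over collections i of freq[i][j], one per word j.
    col_total = Counter()
    for counts in freq.values():
        col_total.update(counts)
    grand_total = sum(row_total.values())

    signatures = {}
    for sense, counts in freq.items():
        weights = {}
        for word, f in counts.items():
            m = col_total[word] * row_total[sense] / grand_total  # Equation 2
            if f > m:                                             # Equation 1
                weights[word] = (f - m) / m
        signatures[sense] = weights
    return signatures


def score_senses(signatures, context_words):
    """Sum, per sense, the signature weights of the words in the context."""
    return {
        sense: sum(weights.get(word, 0.0) for word in context_words)
        for sense, weights in signatures.items()
    }


# Toy data (invented for illustration, not taken from the paper's collections).
toy_collections = {
    "waiter1": ["restaurant", "menu", "menu", "table"],
    "waiter2": ["airport", "station", "table", "table"],
}
toy_signatures = chi2_signatures(toy_collections)
toy_scores = score_senses(toy_signatures, ["menu", "restaurant", "tip"])
best_sense = max(toy_scores, key=toy_scores.get)
```

In this toy setting the contrast set of each sense is simply the other
sense's collection, so a word shared by both collections (table) only
receives a weight where it is more frequent than expected.
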
For each word sense we add these χ² values, and then select the word sense
with the highest value. Different context sizes have been tested in the
literature, and large windows have proved to be useful for topical word
sense disambiguation [18]. We chose a window of 100 words around the
target.

In order to compare our results, we computed a number of baselines. The
first is choosing the sense at random (Ran). We also constructed lists of
related words using WordNet, in order to compare their performance with
that of the signatures: the list of synonyms (Syn), these plus the content
words in the definitions (S+def), and these plus the hyponyms, hypernyms
and meronyms (S+all). The algorithm to use these lists is the same as for
the topic signatures.

Table 3 shows the results for the selected nouns. The number of senses
attested in SemCor (3) (#s) and the number of occurrences of the word in
SemCor (#occ) are also presented. The results are given as precision, that
is, the number of successful tags divided by the total number of
occurrences. A precision of one would mean that all occurrences of the
word are correctly tagged.

(3) Some word senses never occur in SemCor. We did not take those senses
    into account.

Table 3. Word sense disambiguation results.

Word          #s  #occ   Ran    Syn    S+def  S+all  Sign
Accident       2    12   0.50   0.00   0.56   0.71   0.50
Action         8   130   0.12   0.00   0.05   0.29   0.02
Age            3   104   0.33   0.01   0.04   0.03   0.60
Amount         4   103   0.25   0.22   0.27   0.30   0.50
Band           7    21   0.14   0.11   0.13   0.28   0.25
Boy            4   169   0.25   0.45   0.37   0.59   0.66
Cell           3   116   0.33   0.00   0.37   0.36   0.59
Child          2   206   0.50   0.37   0.47   0.43   0.29
Church         3   128   0.33   0.28   0.50   0.46   0.45
Difference     5   112   0.20   0.02   0.28   0.35   0.17
Door           4   138   0.25   0.05   0.24   0.26   0.04
Experience     3   125   0.33   0.22   0.42   0.35   0.42
Fact           4   124   0.25   0.02   0.48   0.58   0.82
Family         6   135   0.17   0.12   0.18   0.15   0.36
Girl           5   152   0.20   0.34   0.21   0.33   0.25
History        5   104   0.20   0.06   0.16   0.17   0.18
Hour           2   110   0.50   0.21   0.63   0.38   0.40
Information    3   146   0.33   0.00   0.12   0.64   0.66
Plant          2    99   0.50   0.30   0.42   0.45   0.82
World          8   210   0.12   0.09   0.18   0.19   0.34
Overall       83  2444   0.28   0.16   0.30   0.36   0.41

The results show that the precision of the signature-based word sense
disambiguation (Sign column) is well above the precision for random
selection (a few exceptions are in bold), and that, overall, it
outperforms the other WordNet-based lists of words (the winner for each
word is in bold). This proves that topic signatures managed to learn topic
information that was not originally present in WordNet. This information
is overall correct, but in some cases introduces noise and the performance
degrades even below the random baseline (e.g. action, hour).

5  CLUSTERING WORD SENSES

In principle we could try to cluster all the concepts in WordNet,
comparing their topic signatures, but instead we experimented with
clustering just the concepts that belong to a given word (its word
senses). As we mentioned in Section 2, WordNet makes very fine
distinctions between word senses, and suffers from excessive word sense
proliferation.

For many practical applications we can ignore some of the sense
distinctions. For instance, all of the senses for boy are persons. Two of
the senses refer to young boys while two of them refer to grown males.
‘Boy as a young person’ would tend to appear in a certain kind of
documents, while ‘boy as a grown man’ in others, and ‘boy as a colored
person’ in yet other documents.

In this work, as in [15][13], we tried to compare the overlap between the
signatures by simply counting shared words, but this did not yield
interesting results. Instead we used binary hierarchical clustering
directly on the retrieved documents [11]. We experimented with various
distance metrics and clustering methods but the results did not vary
substantially: slink [19], clink [20], median, and Ward’s method [21].
Some of the resulting hierarchies were analyzed by hand and they were
coherent according to our own intuitions. For instance Figure 2 shows that
the young and offspring senses of boy (nodes 1 and 3) are the closest
(similarity of 0.65), while the informal (node 2) and colored (node 4)
senses are further apart. The contexts for the colored sense are the least
similar to the others (0.46).

[Figure 2: Hierarchy for the word senses of boy]

5.1  Evaluation of word sense clusters on a word sense disambiguation task

Hand evaluation of the hierarchies is a difficult task, and very hard to
define [11]. As before, we preferred to evaluate them on a word sense
disambiguation task. We devised two methods to apply the hierarchies and
topic signatures to word sense disambiguation:

1. Use the original topic signatures. In each branch of the hierarchy we
   combine all the signatures for the word senses in the branch, and
   choose the highest ranking branch. For instance, when disambiguating
   boy, we first choose between boy4 and the rest: boy1, boy2, boy3 (cf.
   Figure 2). Given an occurrence, the evidence for boy1, boy2, boy3 is
   combined, and compared to the evidence for boy4. The winning branch is
   chosen. If boy4 is discarded, then the combined evidence for boy1, boy3
   is compared to that of boy2. If boy2 gets more evidence, that is the
   chosen sense.

2. Build new topic signatures for the existing clusters. The document
   collections for all the word senses in the branch are merged and new χ²
   values are computed for each cluster in the hierarchy. For instance, at
   the first level we would have a topic
    signature for boy4 and another for the merged collections of               In general, documents retrieved from the web introduce a
    boy1, boy2 and boy3. At the second level we would have a topic         certain amount of noise into signatures. The results are still useful
    signature for boy2 and another for boy1, boy3.                         to identify the word sense of the target words, as our results show,
                                                                           but a hand evaluation of them is rather worrying. We concluded
    The word sense disambiguation algorithm can be applied at
                                                                           that the cause of the poor quality does not come from the procedure
different levels of granularity, similar to decision trees. At the first
                                                                           to build the signatures, but rather from the quality of the documents
level it chooses to differentiate between boy4 and the rest, at the
                                                                           retrieved (cf. Section 6.3).
second level among boy4, boy2 and boy1-3, and at the third level it
disambiguates the finest-grained senses.
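
The first method (combining the original signatures along each branch) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the evidence for a sense is assumed to be the sum of the signature weights of the context words, branch evidence is assumed to be the sum over the senses in the branch, and the signature words and weights below are invented toy values. The names `leaves`, `sense_evidence` and `disambiguate` are hypothetical.

```python
from typing import Dict, List, Union

# A hierarchy is either a sense label (leaf) or a list of sub-hierarchies.
Tree = Union[str, List["Tree"]]


def leaves(tree: Tree) -> List[str]:
    """Return all word senses below a node of the hierarchy."""
    if isinstance(tree, str):
        return [tree]
    return [sense for child in tree for sense in leaves(child)]


def sense_evidence(signature: Dict[str, float], context: List[str]) -> float:
    """Assumed scoring: sum the signature weights of the context words."""
    return sum(signature.get(word, 0.0) for word in context)


def disambiguate(tree, signatures, context, max_depth=None):
    """Descend the hierarchy, keeping the branch with the most evidence.

    max_depth=1 stops at the coarsest clusters; None descends to a
    single fine-grained sense.
    """
    node, depth = tree, 0
    while not isinstance(node, str) and (max_depth is None or depth < max_depth):
        # Assumed combination: add up the evidence of every sense in a branch.
        node = max(
            node,
            key=lambda branch: sum(
                sense_evidence(signatures[s], context) for s in leaves(branch)
            ),
        )
        depth += 1
    return leaves(node)


# Hierarchy for boy as described in the text: boy4 vs. the rest,
# then boy2 vs. {boy1, boy3}.
tree = ["boy4", ["boy2", ["boy1", "boy3"]]]

# Invented toy signatures, NOT the signatures built in this paper.
signatures = {
    "boy1": {"child": 2.0, "school": 1.0},
    "boy2": {"friend": 3.0, "man": 1.0},
    "boy3": {"son": 2.5, "family": 1.0},
    "boy4": {"servant": 3.0},
}
context = ["friend", "man", "child"]

print(disambiguate(tree, signatures, context))               # finest level
print(disambiguate(tree, signatures, context, max_depth=1))  # coarsest level
```

Summing the evidence favours branches that contain more senses; whether the branch evidence should be normalised is left open here, since the text only states that the evidence is combined.
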
    Instead of evaluating the set of all nouns, we focused on three        6.2     Concept Clustering
nouns: boy, cell and church. The results are shown in Table 4. The
second column shows the number of senses. The signature results            Traditional clustering techniques [11] are difficult to apply to
for the original sense distinctions (cf. Table 3) are shown in the         concepts in ontologies. The reason is that the usual clustering
third column. The results for the signature and hierarchy                  methods are based on statistical word co-occurrence data, and not
combination are shown according to the sense-distinctions: the fine        on concept co-occurrence data (which is not available at present).
column shows the results using the hierarchy for the finest sense          The method presented in this paper uses the fact that concepts are
distinctions, the medium column corresponds to the medium sized            linked to document collections. Usual document clustering
clusters, and the coarse level corresponds to the coarsest clusters,       techniques are applied to document collections, effectively
i.e., all senses clustered in two groups. For each level, three results    clustering the associated concepts. This clustering method tackles
are given: the random baseline, the results using the original topic       the word sense proliferation in WordNet.
signatures and the hierarchy, and the results with the new topic              The evaluation and validation of the word sense clusters is
signatures computed over the clusters (best results in bold).              difficult [11]. We chose to evaluate the performance of the clusters
                                                                           in a word sense disambiguation task, showing that the clusters are
        Table 4: Results using hierarchies and word sense clusters
                                                                           useful to improve the results; enhanced precision for the fine-level
                        Signature & Hierarchy                              sense distinctions, and over 90% precision for the coarse level.
Word    Senses  Sign.   Fine             Medium           Coarse
                Orig    Rand Orig New    Rand Orig New    Rand Orig New    6.3     Searching the internet for concepts
Boy     4       0.66    0.25 0.68 0.38   0.33 0.83 0.67   0.50 0.99 0.99
Cell    3       0.59    0.33 0.62 0.52   -    -    -      0.50 0.52 0.96   The core component of the method explored in this paper is the
Church  3       0.45    0.33 0.48 0.54   -    -    -      0.50 0.77 0.90   technique to link documents in the web to concepts in an ontology.
                                                                           Recently, some methods have been explored to automatically
   The results show that the information contained in the hierarchy        retrieve examples for concepts from large corpora and the internet.
helps improve the precision obtained without hierarchies, even at          Leacock et al. [22] use a strategy based on the monosemous
the fine level. For coarser sense distinctions it exceeds 0.90             relatives of WordNet concepts to retrieve examples from a 30
precision. Regarding the way to apply the hierarchy, the results are       million word corpus. As their goal is to find 100 examples for each
not conclusive. Further experiments would be needed to show                word sense of a given word, they prefer close relatives such as
whether it is useful or not to compute new topic signatures for each       synonyms or hyponym collocations that contain the target
cluster.                                                                   hyponym. If enough examples are not found, they also use other
                                                                           hyponyms, sisters and hypernyms. The examples were used to train
                                                                           a supervised word sense disambiguation algorithm with very good
6     DISCUSSION AND COMPARISON WITH                                       results, but no provision was made to enrich WordNet with them.
      RELATED WORK                                                         The main shortcoming of this strategy is that, by limiting the search to
                                                                           monosemous relatives, only 65% of the concepts under study could
The work presented here involves different areas of research. We           get training examples.
will focus on the method to build topic signatures, the method to             Mihalcea and Moldovan [23] present similar work, which tries
cluster the concepts and how the document collection for each              to improve the previous method. When a monosemous synonym
word sense is constructed.                                                 for a given concept is not found, additional information from the
                                                                           definition of the concept is used, in the form of defining phrases
                                                                           constructed after parsing and processing the definition. The whole
6.1     Building topic signatures                                          internet is used as a corpus, using a search engine to retrieve the
Topic signatures were an extension of relevancy signatures [10]            examples. Four procedures are defined to query the search engine
developed for text summarization [15]. To identify topics in               in order: use monosemous synonyms, use the defining phrases, use
documents, [15]       constructed topic signatures from 16,137             synonyms with the AND operator and words from the defining
documents classified into 32 topics of interest. His topic signature       phrase with the NEAR operator, and lastly, use synonyms and
construction method is similar to ours, except that he used tf.idf for     words from the defining phrases with the AND operator. The
term weighting. In subsequent work, Hovy and Junk [13] explored            procedures are sorted by preference, and one procedure is only
several alternative weighting schemes in a topic identification task,      applied if the previous one fails to retrieve any examples. 20 words
finding that χ2 provided better results than tf.idf or tf, and that        totaling 120 senses were chosen, and an average of 670 examples
specific combinations of χ2 and Latent Semantic Analysis                   could be retrieved for each word sense. The top 10 examples for
provided even better results on clean training data. Lin and Hovy          each word sense were hand-checked and 91% were found correct.
[9] use a likelihood ratio from maximum likelihood estimates that             Both these methods focus on obtaining training examples. In
achieves even better performance on clean data. However, their             contrast, our method aims at getting documents related to the
experiments with text extracted from the web proved somewhat               concept. This allows us to be less constraining; the more
disappointing, like the ones reported here.                                documents the better, because that allows us to find more
distinctively co-occurring terms. That is why we chose to use all        disambiguation methods could profit from these richer ontologies,
close relatives for a given concept, in contrast to [22] which only      and improve word sense disambiguation performance.
focuses on monosemous relatives, and [23], which uses synonyms
and a different strategy to process the gloss. Another difference is
that our method forbids the cuewords of the rest of the senses.          ACKNOWLEDGEMENTS
   We have found that searching the web is the weakest point of          We would like to thank the referees for their fruitful comments.
our method. The quality and performance of the topic signatures          Part of the work was done while Eneko Agirre was visiting ISI,
and clusters depends on the quality and number of the retrieved          funded by the Basque Government.
documents, and our query strategy is not entirely satisfactory. On
the one hand some kind of balance is needed. For some querying
strategies some word senses do not get any document, and with            REFERENCES
other strategies too many and less relevant documents are
                                                                          [1] Lenat, D.B. 1995. CYC: A Large-Scale Investment in Knowledge
retrieved. On the other hand the web is not a balanced corpus (e.g.
                                                                              Infrastructure, in Communications of the ACM, vol. 38, no. 11.
the sexual content in the topic signatures for boy). Besides, many        [2] Yokoi, T. 1995. The EDR Electronic Dictionary. Communications of
documents are short indexes or cover pages, with little text on               the ACM, vol. 38, no. 11.
them. In this sense, the query construction has to be improved and        [3] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database.
some filtering techniques should be devised.                                  Cambridge: MIT Press.
   Another important consideration about searching the internet is        [4] Hearst, M. and H. Schütze. 1993. Customizing a Lexicon to Better
that technical features have to be taken into account. For                    Suit a Computational Task. Proc. of the Workshop on Extracting
instance, our system had some timeout parameters, meaning that                Lexical Knowledge.
                                                                          [5] Caraballo, S.A. 1999. Automatic construction of a hypernym-labeled
the retrieval delay of the documents (caused by the time of day, workload,
                                                                              noun hierarchy from text. Proc. of the Conference of the Association
server location, etc.) could affect the results.                              for Computational Linguistics.
                                                                          [6] Wilks, Y., B.M. Slator, and L. Guthrie. 1996. Electric Words:
                                                                              Dictionaries, Computers, and Meanings. Cambridge: MIT Press.
7    CONCLUSIONS AND FURTHER                                              [7] Harabagiu, S.M., G.A. Miller, and D.I. Moldovan. 1999. WordNet 2 -
     RESEARCH                                                                 A Morphologically and Semantically Enhanced Resource. Proc. of
                                                                              the SIGLEX Workshop.
We have introduced an automatic method to enrich very large               [8] Hovy, E.H. and C.-Y. Lin. 1999. Automated Text Summarization in
ontologies, e.g. WordNet, that uses the huge amount of documents              SUMMARIST. In M. Maybury and I. Mani (eds), Advances in
in the world wide web. The core of our method is a technique to               Automatic Text Summarization. Cambridge: MIT Press.
link document collections from the web to concepts, which allows us       [9] Lin, C.-Y. and E.H. Hovy. 2000. The Automated Acquisition of
to alleviate some of the main problems acknowledged in WordNet:               Topic Signatures for Text Summarization. Proc. of the COLING
                                                                              Conference. Strasbourg, France. August, 2000.
lack of relations between topically related concepts, and the
                                                                         [10] Riloff, E. 1996. An Empirical Study of Automated Dictionary
proliferation of word senses. We show in practice that the                    Construction for Information Extraction in Three Domains. Artificial
document collections can be used 1) to create topic signatures (lists         Intelligence, vol. 85.
of words that are topically related to the concept) for each             [11] Rasmussen, E. 1992. Clustering Algorithms. In W.B. Frakes and R.
WordNet concept, and, 2) given a word, to cluster the concepts that           Baeza-Yates (eds.), Information Retrieval: Data Structures and
lexicalize it (its word senses), thus tackling sense proliferation. In        Algorithms. London: Prentice Hall. 419–442
order to validate the topic signatures and word sense clusters, we       [12] Miller, G., C. Leacock, R. Tengi, and T. Bunker. 1993. A Semantic
demonstrate that they contain information which is useful in a                Concordance. Proc. of ARPA Workshop on Human Language
                                                                              Technology.
word sense disambiguation task.
                                                                         [13] Hovy, E.H. and M. Junk. 1998. Using Topic Signatures to Enrich the
   This work combines several techniques, and we chose to pursue              SENSUS Ontology. In prep.
the whole method from start to end. This strategy left much room         [14] AltaVista. 2000. www.altavista.com.
for improvement in all steps. Both signature construction and            [15] Lin, C.Y. 1997. Robust Automated Topic Identification. PhD thesis.
clustering seem to be satisfactory, as other work has also shown. In          University of Southern California.
particular, nice clean signatures are obtained when constructing         [16] N. Ide and J. Veronis. 1998. Introduction to the Special Issue on
topic signatures from topically organized documents. On the                   Word Sense Disambiguation: The State of the Art. Computational
contrary, topic signatures extracted from the web seem to be                  Linguistics, 24(1), 1–40.
                                                                         [17] McRoy, S. 1992. Using Multiple Knowledge Sources for Word Sense
dirtier.
                                                                              Discrimination, in Computational Linguistics, vol. 18, no. 1.
   We think that, in this work, the main obstacle to getting clean       [18] Gale, W.A., K.W. Church and D. Yarowsky. 1993. A method for
signatures comes from the method to link concepts and relevant                Disambiguating Word Senses in a Large Corpus. Computers and the
documents from the web. The causes are basically two. First, the              Humanities, 26, 415–39.
difficulty to retrieve documents relevant to one and only one            [19] Sibson, R. 1973. SLINK: an Optimally efficient algorithm for the
concept. The query construction has to be improved and carefully              Single-Link Cluster Method. Computer Journal, 16, 30–34.
fine-tuned to overcome this problem. Second, the wild and noisy          [20] Defays, D. 1977. An Efficient Algorithm for a Complete Link
nature of the texts in the web, with its high bias towards some               Method. Computer Journal, 20, 364–66.
                                                                         [21] Ward, J.H., Jr. and M.E. Hook. 1963. Application of an Hierarchical
topics, high number of not really textual documents (e.g., indexes),
                                                                              Grouping Procedure to a Problem of Grouping Profiles. Educational
etc. Some filtering techniques have to be applied in order to get             and Psychological Measurement, 23, 69–81.
documents with less bias and more content.                               [22] Leacock, C., M. Chodorow and G.A. Miller. 1998. Using Corpus
   Cleaner topic signatures open the avenue for interesting                   Statistics and WordNet Relations for Sense Identification, in
ontology enhancements, as they provide concepts with rich topical             Computational Linguistics, vol. 24, no. 2.
information. For instance, similarity between topic signatures could     [23] Mihalcea, R. and D.I. Moldovan. 1999. An Automatic Method for
be used to find out topically related concepts, the clustering                Generating Sense Tagged Corpora. Proc. of the Conference of the
strategy could be extended to all concepts rather than just the               American Association of Artificial Intelligence.
concepts that lexicalize the same word, etc. Besides, word sense

</pre>