=Paper= {{Paper |id=Vol-1517/JOWO-15_WoMO_paper_4 |storemode=property |title=Ontology Population using Corpus Statistics |pdfUrl=https://ceur-ws.org/Vol-1517/JOWO-15_WoMO_paper_4.pdf |volume=Vol-1517 |dblpUrl=https://dblp.org/rec/conf/ijcai/NazarR15 }}
                                Ontology Population Using Corpus Statistics

                                                   Rogelio Nazar, Irene Renau
                                            Instituto de Literatura y Ciencias del Lenguaje
                                             Pontificia Universidad Católica de Valparaíso
                                                  {rogelio.nazar,irene.renau}@ucv.cl




                            Abstract

     This paper presents a combination of algorithms for automatic ontology building based mainly on lexical co-occurrence statistics. We populate an ontology with hypernymy links, and thus we refer more specifically to a taxonomy of lexical units (nouns organized by hypernymy relations) rather than an ontology of formally defined concepts. A set of combined statistical procedures produces fragments of taxonomies from corpora that are later integrated into a unified taxonomy by a central algorithm. Our results show that with an ensemble of different components it is possible to achieve an accuracy only slightly worse than human performance. Finally, as our methods are based on quantitative linguistics, the algorithm we propose is not language specific. The language used for the experiments is, however, Spanish.

                      1    Introduction

The study of the vocabulary in its real context of use is currently a central part of linguistics (Kilgarriff 2007; Hanks 2013). Among other tasks in this discipline, it is of utmost importance to extract and organize vocabulary units from corpora. There is an intrinsic theoretical interest in such an attempt, such as the study of the laws that govern how words can be combined and classified. But there is also a practical motivation: the ability to transform unstructured data into structured databases, i.e., to go from plain text to lexical databases, which can later be organized as an ontology that specifies the terminology of a domain and the conceptual relations between terms.

   This paper presents a preliminary description and assessment of results of a methodology based on co-occurrence statistics to transform text into a knowledge structure, which can later be developed into a taxonomy or an ontology. The objective is to populate with lexical units the CPA Ontology (http://www.pdev.org.uk/#onto), handcrafted by Pustejovsky et al. (2004) and substantially modified later by Hanks (in process). Given a top-node ontology of around 200 lexical units (nouns) denoting the most general concepts of the language, the proposed method consists of populating this shallow ontology by means of corpus statistics. Hence, the objective is to link a noun such as "bicycle" with its hypernym, "Vehicle", and this one with "Artifact", and so on.

   For a more precise definition of the terms, taxonomies and ontologies are different kinds of knowledge structures. Whereas an ontology is "a system of categories accounting for a certain vision of the world" (Maedche 1995, 11), a taxonomy can be considered a hierarchical relational structure of words. For instance, "Vehicle" can be a formally defined concept in an ontology, and "vehicle", "car" or "bicycle" words related to this concept, the first being a hypernym of the others. A hypernymy relation is a basic semantic relation between a word, the hyponym, and the word used as a descriptor to define it, the hypernym (Lyons 1977). Hypernymy provides the hierarchical structure for the conceptual organization of a domain. In the following pages, we will use the term 'ontology' to refer to the most general nodes of the structure, and 'taxonomy' to refer to the connection between concepts of the ontology and words, thus establishing a difference between the ontological and the linguistic points of view.

   In several respects, this paper represents an innovative way of addressing the problem. Our method is based on a combination of five different algorithms which produce raw results from corpora in the form of fragments of taxonomies, which are later compared and integrated into a single structure by a central algorithm in charge of the decision-making process. The result is a tree of hyponym/hypernym relations between nouns, i.e. words rather than concepts and their formal definitions, characteristic of the linguistic view. Another novelty of the approach is that it is quantitative, and thus it does not involve language- or domain-specific knowledge coded directly into the system. No external resources are needed apart from the analyzed corpora, a part-of-speech tagger and the CPA Ontology itself, which does not change because it uses English as a metalanguage. Up to now, however, experiments have been carried out only in English and, more extensively, in Spanish. We are starting with French and expect to continue replicating the experiment in other languages, offering the results on the accompanying website (http://www.verbario.com).

   In the following sections we present a general overview of the related work and then describe our proposal. We offer an evaluation of the results and, finally, we draft some conclusions and plans for future work.
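The hypernymy chains described above (e.g. "bicycle" → "Vehicle" → "Artifact") can be pictured as a simple parent-pointer structure. The following sketch is purely illustrative and not part of the paper's implementation; the entries in the dictionary are hypothetical, with the upper names standing in for CPA semantic types.

```python
# Illustrative sketch (not the paper's implementation): a taxonomy as a
# set of hypernymy links, where each lexical unit points to its hypernym.
# The chain "bicycle" -> "Vehicle" -> "Artifact" mirrors the example in
# the text; the top nodes stand in for CPA semantic types.
hypernym = {
    "bicycle": "Vehicle",
    "car": "Vehicle",
    "Vehicle": "Artifact",
    "Artifact": "Physical Object",
}

def hypernym_chain(word):
    """Follow hypernymy links from a word up to a top node."""
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

print(hypernym_chain("bicycle"))
# ['bicycle', 'Vehicle', 'Artifact', 'Physical Object']
```

Populating the ontology then amounts to adding such parent links for new nouns, which is what the algorithms described below attempt to do automatically.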
       2    Related Work on Taxonomy Building

Interest in the development of taxonomies is of course not new, as the publications on the subject span four decades. Space limitations only allow us to offer a very brief account of the research in this field, which we organize as follows: first, some efforts to produce taxonomies by hand; then, the literature on automatic taxonomy building from machine readable dictionaries; finally, taxonomy extraction from corpora, on the one hand by rule-based systems and, on the other, based on quantitative analysis.

2.1   Handmade Taxonomies

There is a large body of work in handcrafted taxonomy creation. We will only focus on some of the most representative modern efforts, not the 3rd-century Porphyrian tree or Roget's Thesaurus. We also exclude from this account all the specialized ontologies, restricting ourselves to some of the most well-known projects devoted to general vocabulary. Among the most cited projects are FrameNet, Cyc, WordNet and the CPA Ontology. With the exception of WordNet, which also includes a Spanish version, the rest are only available in English[1].

[1] There are, however, ongoing efforts to produce a Spanish version of FrameNet as well, cf. http://sfn.uab.es/

   FrameNet is aimed at the implementation of Charles Fillmore's (1976) frame semantics as a lexical database organized in conceptual structures, available at http://framenet.icsi.berkeley.edu/. The Cyc ontology, in turn, was born in 1984 not in the context of a linguistic theory but in the field of Artificial Intelligence (Lenat 1995). It is defined as an ontology of everyday common sense knowledge and is available at http://www.opencyc.org/. Another large taxonomy is WordNet (Miller 1995; Vossen 1998), originally created by psychologists but since widely used in many natural language processing tasks. WordNet, available at https://wordnet.princeton.edu/, is based on 'synsets', defined as sets of words that have the same sense or refer to the same concept. It can be considered a taxonomy because it includes hypernymy links. Finally, the other project that has come to our attention is the CPA Ontology, created in the context of a lexicography project (Hanks in progress). It is at the moment a shallow ontology including only the upper nodes, i.e. the most general concepts denoted by words called "semantic types" in CPA terminology: "Event", "Emotion", "Physical Object" or "Human", etc. This includes no more than 200 words hierarchically organized in hypernymy links. This top-node structure is currently being populated with lexical items by Hanks and his team. It is handmade work but built from corpus analysis, which means that categories are not assumed a priori.

   An examination of these taxonomies reveals different limitations. FrameNet departs from our main interest because strictly speaking it cannot be considered a taxonomy. In the case of the Cyc ontology, the formalisms used to express the relations are too complex to be manipulated and used as a basis for this Spanish taxonomy. In the case of WordNet, its general architecture based on synsets can often be problematic because at times the words in a synset are too different from a semantic point of view. Consider, for instance, the Spanish synset containing the words animal, bestia, criatura and fauna, which is equivalent to the English synset containing 'animal', 'animate being', 'beast', 'brute', 'creature' and 'fauna'. Here, the Spanish word pez and its English equivalent 'fish' are correctly placed as hyponyms of 'animal', but not as hyponyms of 'beast'. Furthermore, WordNet is a top-down approach, while our interest is in the corpus-driven approach.

   Overall, we decided in favor of the CPA Ontology for our taxonomy population project because its architecture, based on lexical units rather than synsets, is simple enough to be manipulated as needed.

2.2   Taxonomy Induction from Machine Readable Dictionaries

The field of automatic semantic relation extraction and, in particular, hypernymy extraction, began to develop soon after the publication of the first machine readable dictionaries in the seventies and eighties. This new resource favored the development of different methodologies to transform dictionaries made for human users into a lexical database with information stored and organized for computers (Calzolari, Pecchia, and Zampolli 1973; Calzolari 1977; Amsler 1981; Chodorow, Byrd, and Heidorn 1985; Alshawi 1989; Fox et al. 1988; Nakamura and Nagao 1988; Wilks et al. 1989; Guthrie et al. 1990; Boguraev 1991; Araujo and Pérez-Agüera 2006).

   The first researchers shared the idea of taking a machine readable dictionary, studying the regularities and patterns in the definitions, and subsequently writing a system of rules that would allow the extraction of hypernymy and other semantic relations between vocabulary units. Depending on the dictionary, one of these rules could be that the first noun of the definition would be the hypernym of the defined noun. However, this is not always the case, and thus one needs to develop more rules to cope with the exceptions.

2.3   Taxonomy Induction from Corpora

The Pattern-based Approaches With the advent of corpus linguistics in the nineties, researchers interested in semantic relation extraction moved on to corpus analysis but kept the same philosophy as in the previous attempts with dictionaries, that is, elaborating rule-based systems that would search for lexico-syntactic patterns in corpora expressing the desired information. Typically, if one finds in running text a sequence such as "X is a type of Y" or "X and other (types of) Y", etc., then one would assume that any pair of nouns occupying the positions X and Y would hold a hypernymy relation (Hearst 1992; Rydin 2002; Cimiano and Völker 2005; Snow, Jurafsky, and Ng 2006; Pantel and Pennacchiotti 2006; Potrich and Pianta 2008; Auger and Barriere 2008; Aussenac-Gilles and Jacques 2008, among others).

   Of course, the problem with this approach is that the collected patterns do not always express the desired relations and, in addition, many times the desired relations appear expressed in patterns that the researchers were not able to anticipate.
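The lexico-syntactic patterns just described can be sketched, in their simplest form, as surface regular expressions. The toy matcher below is purely illustrative (real systems of this family match over POS-tagged text, and the present paper deliberately avoids this technique); the two patterns mirror the "X is a type of Y" and "X and other Y" examples from the text.

```python
import re

# Toy illustration of a Hearst-style pattern matcher. The regexes are
# simplified assumptions for plain lowercase text; pattern-based systems
# in the literature operate over POS-tagged or parsed corpora.
PATTERNS = [
    # "X is a type of Y"  ->  (hyponym X, hypernym Y)
    re.compile(r"(\w+) is a type of (\w+)"),
    # "X and other Y"     ->  (hyponym X, hypernym Y)
    re.compile(r"(\w+) and other (\w+)"),
]

def extract_hypernymy(text):
    """Collect (hyponym, hypernym) candidates matched by any pattern."""
    pairs = []
    for pattern in PATTERNS:
        for hyponym, hypernym in pattern.findall(text.lower()):
            pairs.append((hyponym, hypernym))
    return pairs

print(extract_hypernymy("A bicycle is a type of vehicle. Beer and other drinks."))
# [('bicycle', 'vehicle'), ('beer', 'drinks')]
```

The brittleness discussed above is visible even in this sketch: any relation expressed outside the hand-listed patterns is silently missed, which is one reason the present paper pursues statistical methods instead.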
The Quantitative Approaches A different view on the subject is the extraction of thesauri from corpora based on distributional similarity. There are two main lines of research, one that specializes in finding semantic similarity between groups of words and the other in establishing hypernymy links between pairs of words.

   In the first case, the semantic similarity between groups of words is calculated on the basis of distributional similarity, as it is considered that semantically similar words will tend to occur in similar contexts. To be semantically similar, in this case, means to be synonyms or near-synonyms or, more interestingly, words that pertain to the same semantic class, i.e., cohyponyms (Grefenstette 1994; Landauer and Dumais 1997; Schütze and Pedersen 1997; Lin 1998; Ciaramita 2002; Biemann, Bordag, and Quasthoff 2003; Alfonseca and Manandhar 2002; Pekar, Krkoska, and Staab 2004; Bullinaria 2008). This line of research builds on the general notion of distributional semantics initiated by Harris (1954) and developed later by many others (Sahlgren 2008; Baroni and Lenci 2010; Nazar 2010).

   The second trend goes a step further than the previous notion of distributional thesauri as just clusters of similar words, and emphasizes the importance of establishing a hierarchic organization of the vocabulary, a difficult task that imposes its own challenges. As in the previous case, the data are obtained from corpora defined as document collections, Wikipedia or the Web, but the method most often used is directed co-occurrence graphs (Woon and Madnick 2009; Wang, Barnaghi, and Bargiela 2009; Navigli, Velardi, and Faralli 2011; Nazar, Vivaldi, and Wanner 2012; Fountain and Lapata 2012; Medelyan et al. 2013; Velardi, Faralli, and Navigli 2013).

2.4   Why a New Approach

After so many publications on the subject, there continue to be attempts at taxonomy extraction, because despite the variety of ideas already proposed there is still plenty of room for improvement. The large body of bibliography appears to indicate that the field has come to a point at which the integration of different ideas is needed, i.e., an algorithm able to integrate different fragments of taxonomies.

    3    Methodology: an Integration of Algorithms

The novelty of our approach lies in the modular design. Modular algorithms produce small fragments of taxonomies which are later contrasted and integrated into a larger taxonomy by a central module.

   Algorithm 1 computes distributional similarity between words. Algorithm 2 calculates asymmetric relations in word co-occurrence in corpora. Algorithm 3 analyzes definitions from various dictionaries of a language and detects cases of significant definiens-definiendum co-occurrence. Algorithm 4 is a variant of 1 because it computes distributional similarity as the number of identical ngrams (as sequences of words) that a group of words may share. Algorithm 5 is an inference engine, which tries to reason upon the results of the other algorithms and extract new hypernymy assertions. Algorithm 6, finally, is the "assembly algorithm", which is in charge of integrating the taxonomy fragments produced by all the components into a modified version of the CPA Ontology.

   Experimental evaluation shows that with this method it is possible to obtain a robust homeostatic or self-regulated taxonomy, because it is based on corpus statistics and can update itself automatically. Evidently, this list of methods is not exhaustive and, as this is work in progress, we foresee the integration of other methodologies as well. Up to now we have avoided matching Hearst patterns because they are costly to develop, they are language specific and, depending on the implementation, can also be error prone, as in the case of the Text2onto software, with figures of 17.38% precision and 29.95% recall for hypernymy extraction (Cimiano and Völker 2005). We have also avoided the use of explicit semantic or grammatical knowledge, preferring a design that is self-contained and not dependent on external resources like Hearst patterns or WordNet, because this facilitates replication in other languages.

   As a textual corpus for our experiments we used a collection of Spanish press articles and Wikipedia pages accumulated in a single text file of ca. a billion tokens. In the case of algorithm 3, as it is fed with a lexicographic corpus, we used online dictionaries via a web search engine.

3.1   Algorithm 1: Clustering of Nouns Based on Distributional Similarity

The first component is based on a clustering technique that produces sets of semantically related nouns on the basis of distributional similarity. It bears some resemblance to the quantitative approach of Grefenstette (1994), although aimed at cohyponyms rather than synonyms, and without grammar-specific information.

   Consider, for instance, the semantic class of drinks, with elements such as "coffee", "tea", "beer", "brandy", and so on. In the case of these nouns, there is a great probability that they will co-occur with other words such as the verb "to drink" or nouns such as "glass", "cup" or "bottle". These and other shared words are the ones we use as indicators of the nouns' semantic relatedness, without POS-tag distinction.

   We can represent the overlap of shared vocabulary between lexical units as a Venn diagram (figure 1). In the intersection we can observe words that are shared by the units cerveza (beer), café (coffee) and té (tea), e.g. servir (to serve), beber (to drink), tomar (to drink), querer (to want), etc. Of course, we also have words that are shared only by two of the units, e.g., café and té share caliente (hot), which does not co-occur with cerveza. By the same token, cerveza and café share the unit amargo/a (bitter), which does not co-occur with té.

   In concrete terms, this component analyzes the syntagmatic context in which a word appears and extracts the vocabulary (excluding function words). It then obtains pairs of words that, following the previous example of drinks, could be brandy francés, bebiendo brandy, tomar brandy, brandy barato, etc. (French brandy, drinking brandy, drink brandy, cheap brandy, etc.).
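The extraction and comparison steps of this component can be sketched as follows. This is an illustrative toy under simplifying assumptions: sentences are pre-tokenized, and the small STOPWORDS set stands in for the function words the algorithm excludes; the sample sentences are hypothetical.

```python
from collections import defaultdict

# Sketch of Algorithm 1's extraction step (simplified assumptions:
# pre-tokenized sentences; STOPWORDS stands in for excluded function words).
STOPWORDS = {"el", "la", "un", "una", "de", "a", "y", "es"}

def context_sets(sentences, targets):
    """Map each target noun to the set of words it co-occurs with."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = [w for w in sentence if w not in STOPWORDS]
        for target in targets:
            if target in words:
                contexts[target].update(w for w in words if w != target)
    return contexts

def jaccard(a, b):
    """Degree of overlap between two co-occurrence sets."""
    return len(a & b) / len(a | b)

# Hypothetical toy corpus for illustration.
sents = [["beber", "cerveza", "fría"],
         ["beber", "café", "caliente"],
         ["café", "amargo"]]
ctx = context_sets(sents, ["cerveza", "café"])
print(jaccard(ctx["cerveza"], ctx["café"]))
# 0.25  (one shared context word, "beber", out of four in total)
```

In the real system each pairwise similarity of this kind feeds the iterative clustering procedure described below.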
From these elements, a data structure is created in which each term is associated with the lexical units it co-occurs with.

Figure 1: A Venn diagram to represent the intersection and difference between the co-occurrence sets.

   Terms are then represented as co-occurrence vectors, and thus the algorithm conducts a pairwise comparison of the terms, applying a similarity measure which calculates the degree of overlap between vectors. In this case, this is calculated with the Jaccard coefficient, as suggested by Grefenstette (1994), defined as follows, where A and B are the two vectors to be compared:

                          J(A, B) = |A ∩ B| / |A ∪ B|                          (1)

   As is usual in any clustering procedure, for this comparison we need a table of distances, from which we obtain the pair of units showing the greatest similarity. Hereafter, the members of this pair merge and create the first cluster, which occupies the place of both words and contains the sum of their attributes. The process is iterative; thus, another table of distances is created, each time with one less element. This process stops when the units to cluster do not reach a similarity threshold, defined as a minimum proportion of attributes in common that a set of units must have in order to be assigned the same cluster.

 Class             Members of the cluster
 Vehicles          carro, automóvil, coche, autobús, tranvía, carroza,
                   carruaje, camión, jeep, camioneta
 Types of cheese   brie, parmesano, camembert, mozzarella, gorgonzola,
                   roquefort, gruyer
 Drinks            chocolate, licor, chicha, cerveza, aguardiente
 Hats              pavero, tricornio, bicornio, guarapón, canotier, calañés
 Animals           venado, ciervo, tigre, elefante, perro, gato, puerco,
                   cerdo, carnero, conejo, ratón, rata

Table 1: Examples of clusters of Spanish nouns made by algorithm 1.

   In order to obtain an estimation of the quality of the results produced by this single module, we manually evaluated an arbitrary selection of 145 nouns which can be classified as drinks, hats, vehicles, animals and types of cheese. Table 1 shows some examples of the clusters created from these nouns and their member elements. In this experiment, our algorithm was able to produce correct clusters in 96% of the cases, though it was only capable of classifying half of the input words (51%). Better precision was met at the expense of a considerable loss of recall. More details on this experiment will appear in Nazar & Renau (In press).

3.2   Algorithm 2: Taxonomy Extraction Based on Asymmetric Word Co-occurrence

Instead of producing clusters of semantically related words, as the first algorithm does, the second one creates hyponym-hypernym pairs based on their co-occurrence patterning. Here, we define co-occurrence as a tendency of two lexical units to appear together in the same sentences, taking into account neither the distance nor their order in the sentence. The main idea behind this study is that co-occurrence is asymmetric in the case of hyponym-hypernym pairs. For instance, the word motocicleta shows a tendency to appear in the same sentences as the word vehículo, but the relation is not reciprocated. The asymmetric nature of this association allows us to automatically represent hierarchical relations without resorting to external knowledge bases. The computation of these relations is produced with the help of directed graphs that express the co-occurrence relations.

Figure 2: In algorithm 2, example of co-occurrence graphs depicting hypernymy relations. Hypernym nodes are the ones that have the largest number of incoming arrows.

   The graphs shown in figure 2 illustrate this method. The arrows in the graph represent asymmetric co-occurrence relations, and the node with the most incoming arrows is selected as the hypernym of the input term. Here, the input term cortisol tends to co-occur with glucocorticoide and hormona. In turn, glucocorticoide also tends to co-occur with hormona, but this last unit does not reciprocate the relation with either of them. The output of the graph is read as saying that hormona is the hypernym of cortisol because it is the node with the largest number of incoming arrows. As shown by the other graph, the same pattern is exhibited by noun phrases, very common in multiword terms.

   In order to test the performance of this module alone, we manually evaluated the results of an experiment with 200 Spanish nouns pertaining to the semantic classes of mammals, insects, drinks, hats, vehicles and, again, varieties of cheese. This preliminary evaluation shows that we can expect approximately a 60% chance of obtaining a correct hypernym for a given noun using this algorithm in isolation. More details on this experiment can be found in Nazar & Renau (2012).
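The hypernym-selection step of Algorithm 2 can be sketched as follows. The edge list here is hypothetical and hand-written for illustration, reproducing the cortisol example from the text; in the actual system the edges come from asymmetric sentence co-occurrence statistics over the corpus.

```python
from collections import Counter

# Sketch of Algorithm 2's hypernym selection (hypothetical edge list;
# in the paper the edges come from asymmetric co-occurrence statistics).
# An edge (a, b) means "a tends to co-occur with b" and points a -> b.
edges = [
    ("cortisol", "glucocorticoide"),
    ("cortisol", "hormona"),
    ("glucocorticoide", "hormona"),
]

def select_hypernym(edges):
    """Return the node with the largest number of incoming arrows."""
    indegree = Counter(target for _, target in edges)
    return indegree.most_common(1)[0][0]

print(select_hypernym(edges))
# hormona
```

Because hormona receives arrows from both cortisol and glucocorticoide while reciprocating neither, it has the highest in-degree and is proposed as the hypernym, matching the reading of figure 2.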
3.3   Algorithm 3: Extraction of Hypernymy Relations from Definiens-Definiendum Co-occurrence in General Dictionaries

As already mentioned in subsection 2.2, electronic dictionaries have been used in the past to extract hypernymy and other semantic relations, but in general the approach has been focused on a single dictionary, which is parsed with a rule-based system to extract the relations from the definitions. Our approach here is different because we use a set of dictionaries and infer the hypernymy relations from the frequency of co-occurrence between lexical items in the headword and in the definitions. The algorithm uses the frequency to select hypernyms from the text of the definitions, assuming that there will be some consensus among the dictionaries when selecting a given hypernym, and thus this should be the most frequent word (excluding function words). In this way we save the effort of building a set of rules for each dictionary and make it possible to replicate the experiment in other languages.

   In order to obtain an evaluation of the performance of this single module, we manually examined the results for a random sample of 150 nouns and concluded that we can expect approximately a 70% chance of obtaining a correct hypernym for a given input word. More details on this experiment can be found in Renau & Nazar (2012).

3.4   Algorithm 4: Ngrams with "Asterisks"

With algorithm 4 we explored the possibility of creating clusters of words that have a tendency to occur in exactly the same positions in short sequences of words. This is why we describe this module as "ngrams with asterisks".

   What we do here is study large samples of ngrams, defined as sequences of three to five words, and then replace one of the words inside the ngram with an asterisk. The goal is then to record which words most frequently occur in the position of such an asterisk. Normally, when these words show a tendency to appear in the same positions in a large number of different ngrams, one can conclude that they are paradigmatically related. As a result, this algorithm produces clusters of words, where it can be seen that the members of each class share not only the same grammatical category but also an evident semantic relatedness. Table 3 shows some examples of the results in English. Results were virtually the same in Spanish. In this paper we are only interested in nouns, but as the table shows, the same procedure can be applied to the study of other grammatical categories.

 POS-tag        Members of the cluster
 Adverbs        entirely, exclusively, mainly, primarily, principally
 Adjectives     cost-effective, efficient, elaborate, professional, subtle
 Proper nouns   Australia, Dublin, England, France, India, Janeiro,
                Middlesbrough, Newcastle, Sunderland, Yorkshire
 Nouns          chess, cricket, football, golf, rugby, soccer, tennis

Table 3: Examples of clusters in English, result of Algorithm 4.

   In order to evaluate this component, we again carried out a manual examination of the results. This was done on a random sample of 30 clusters, containing 1191 words in total. We found that in 96% of the cases the clusters were consistent. This internal consistency is computed as the mean consistency of the individual clusters, with numbers for the individual clusters ranging from 80 to 100% consistency. More details on this experiment will appear in Nazar & Renau (Submitted).

3.5   Algorithm 5: Analogical Inference

We use algorithm 5 as an inference engine to analyze the results produced by the previous modules. It is the only component that is not corpus-driven, in the sense that it only analyzes the morphological and lexical features of the terms.

   This component is described as an analogical inference engine because it learns to associate features of the lexical items with the category that is assigned to them by the other algorithms. In this way, if a term cannot be found in the an-
will show some kind of paradigmatic relation and therefore       alyzed corpora, it may still be classified by this module.
will have some features in common, such as the grammati-            The features that are learned are both lexical and morpho-
cal category and, in most cases, also a semantic relatedness.    logical. The lexical level is only useful in the case of multi-
Consider, for instance, the case of the ngram ‘at * airport’,    word expressions, but of course these are also very frequent,
taken from the BNC corpus (table 2). Only a limited number       especially in the case of technical or specialized domains.
of words can occur in the position of the asterisk, and these    For instance, this algorithm first learns that the other ones
are semantically related.                                        are placing terms such as sı́ndrome de Carpenter (Carpenter
                                                                 syndrome) and sı́ndrome de Meretoja (Meretoja syndrome),
 at * airport [645]                                              among others, as hyponyms of enfermedad (disease). Thus,
 Heathrow 56, Manchester 23, Gatwick 19, London 15,              it learns to associate features of the terms (in the case, the
 Frankfurt 15, Teesside 10, Edinburgh 8, an 7, Glasgow 7,        sequence sı́ndrome de) with a semantic class, and subse-
 Stansted 7, Dublin 6, Birmingham 5, Coventry 5, Aberdeen 5,     quently classify new terms such as sı́ndrome de Maffucci
 ...                                                             (Maffucci syndrome) also as a disease. The same is done at
                                                                 the morphology level: the module learns to detect morpho-
Table 2: Words appearing in the position of the asterisk in      logical similarities between the members of the same seman-
the sequence ‘at * airport’ in the BNC corpus.                   tic class. Continuing with the same example, it can learn that
                                                                 very frequently the terms denoting diseases have the suffixes
  To share a single ngram is of course not and indication        -osis or -itis, such as hepatitis or endometriosis. Given, thus,
of semantic similarity. But if there are words that show a       a new term such as pancreatitis, the module will recognize it
as a disease. More details on this experiment were published in Nazar et al. (2012).

3.6   Integration into a Single Taxonomy

After the implementation of and experimentation with each procedure, we developed a new central or "assembly" algorithm, with the purpose of integrating the results into a single taxonomy. The task of this module is to reinforce the certainty of the results on the basis of the combined output of each module. The result is a sort of "consensus" taxonomy which, according to our preliminary experiments, is larger and more reliable than the ones produced by each module in isolation.

The integration procedure is, however, not straightforward, because each algorithm is of a different nature, and thus the combination of the results cannot be solved with a simple voting scheme. Algorithms 1 and 4 produce groups of semantically similar words, while 2, 3 and 5 produce hypernymy pairs. Moreover, the desired result is to populate the already existing CPA Ontology, which, as already explained, refers to very general or abstract concepts.

Two basic operations are conducted: on the one hand, integrating the results of the modules that produce clusters of semantically similar words and, on the other, linking these clusters with a correct hypernym. This is done in sequential steps. For each noun in each cluster produced by modules 1 and 4, there may or may not be a hypernym candidate provided by modules 2, 3 and 5. The result is that, for each cluster, there will be a most frequent hypernym candidate, which is thus selected as the semantic class of all the members of the cluster. As a result, a pairing of a hypernym with a group of hyponyms is obtained, e.g. sedán, coche, limusina, etc. are classified as hyponyms of automóvil. A chain of ascending hypernymy links is built until one of the semantic types of the CPA Ontology is found, and then each word is integrated in a hypernymy chain up to the top node Entity:

Entity → Physical Object → Inanimate → Artifact → Machine → Vehicle → Automobile → sedan

3.7   User Interface

A first prototype is now being developed as a web demo, available at http://www.verbario.com. The taxonomy of nouns is only one module of Verbario.com, a website that is part of a wider project devoted to lexical analysis. The "Taxonomy" section offers two ways of obtaining results: the user can either introduce a noun and get its hypernymy chain or, vice versa, obtain all the hyponyms of a given target noun. At the moment, more than 30,000 Spanish nouns have already been introduced and, while the system is running, more words are being added at a fast pace with a reasonably low error rate, as shown in the next section. We plan to offer regular backup files of the taxonomy in OWL format at this website.

4   Evaluation of the Results

Samples of the overall results were evaluated by a group of 8 human judges, all of them advanced graduate students in linguistics. Each one received the same instructions and a random sample of 52 nouns to evaluate. The criterion for considering a link between a word and a node as correct was that it should correspond to the hypernymy relation type. Among the words in the samples there is, for instance, lechuga ('lettuce'). In this case, the taxonomy offers two hypernymy chains:

Entity → Physical Object → Plant [Planta] → [Arbusto] → lechuga

Entity → Physical Object → Plant [Planta] → lechuga

The first chain states that a lettuce is a type of bush ([Arbusto]), which, in turn, is a type of plant, and so on. But a lettuce is not really a bush, thus this chain is incorrect (it cannot be analyzed as "lettuce IS A bush"). The second asserts that a lettuce is a type of plant ([Planta]), which in turn is a type of physical object, and so on. This chain is considered correct (it can be analyzed as "lettuce IS A plant").

For each noun, the human judges indicated how many hypernymy chains were offered by the taxonomy (two in the case of lechuga) and how many of them were correct (one out of two in this case). We then calculated overall precision as the ratio of correct chains over total chains. We consider recall very difficult to calculate because of the lack of corpus-based lexicographic material in Spanish. As a consequence, we only evaluated precision: of a total of 763 hypernymy chains examined, 586 were found to be correct, which yields a precision of 76.80%. The standard deviation in the group of judges is 14.34. If we exclude the two judges with the most extreme positive and negative scores, the mean precision rate is 77.18%, with a standard deviation of 12.82.

Regarding the control of inter-coder agreement, we included in all samples given to the judges a common group of 11 nouns, which makes 88 judgments that should ideally be identical. However, raters agreed only on 72 cases, which is still more than moderate agreement (81.8%, or 63.2% if measured with a Kappa coefficient to correct for chance-related agreement). We can interpret the agreement percentage as the ceiling of the precision one can expect from this type of algorithm.

With respect to error analysis, we found that errors mostly occurred as a consequence of the polysemy of the words, a circumstance already noticed by Amsler (1981) for this kind of output. This is the case, for instance, of the word adicción (addiction), tagged as a hyponym of dependencia (dependence), which is correct in principle. But dependencia is only registered as a type of construction, according to one of the senses that the word has in Spanish.

Another frequent cause of error is confusion between hypernymy and other semantic relations. Vitíligo, for instance, is correctly classified as a disease in one case but incorrectly as a hyponym of piel (skin) in another, obviously because it is a skin-related disease and thus both words tend to co-occur in the same contexts. The same happens in the case of synonyms, which are often placed incorrectly in a hypernymy relation. For instance, the word cuchí is a synonym and not a hyponym of cerdo (pig). There are also cases of meronymy, such as the word océano (ocean), which is
wrongly connected to agua (water): an ocean is made of water, but it is not a type of water.

5   Conclusions and Future Work

This paper has presented a set of combined algorithms for building a taxonomy of Spanish nouns based on procedures from quantitative linguistics. The method, based mainly on the study of co-occurrence patterns, could in principle be replicated for different languages. The precision we obtain at the moment can be improved, but at the same time it is only slightly lower than the inter-coder agreement percentage. In semantic analysis, total agreement is unrealistic given the fact that even dictionaries do not always agree.

For future work, we are focusing on the following aspects. On the one hand, we will try to improve precision by addressing the problem of polysemy and the confusion of synonyms and meronyms with hypernyms. We have already experimented with sense-induction algorithms which, for each noun found in a corpus, produce a list of its different senses (Nazar 2010). This algorithm can now be used to map each sense to a hypernym. Pending work also includes a detailed large-scale evaluation. In this respect, we must distinguish the precision obtained on frequent nouns (e.g. manzana 'apple', casa 'house') from that on very infrequent nouns (e.g. acetábulo, an anatomical part of a bone). Finally, we will evaluate separately how the system operates with specialized terms and with general language.

6   Acknowledgments

This paper has been made possible thanks to funding from Projects Fondecyt 11140704, led by Irene Renau, and Fondecyt 11140686, led by Rogelio Nazar (http://www.conicyt.cl/fondecyt). We would like to express our gratitude to the students for their participation and to the reviewers for their extended and detailed comments, which have been very helpful to improve this paper. Unfortunately, lack of time and space prevented us from introducing all of the changes that were suggested.

References

Alfonseca, E., and Manandhar, S. 2002. Extending a lexical ontology by a combination of distributional semantics signatures. In Proc. of EKAW'02, 1–7.
Alshawi, H. 1989. Analysing the dictionary definitions. In Boguraev, B., and Briscoe, T., eds., Computational Lexicography for Natural Language Processing. White Plains, NY, USA: Longman Publishing Group. 153–169.
Amsler, R. 1981. A taxonomy for English nouns and verbs. In Proc. of the 19th Annual Meeting on ACL (Morristown, NJ, USA), 133–138.
Araujo, L., and Pérez-Agüera, J. R. 2006. Enriching thesauri with hierarchical relationships by pattern matching in dictionaries. In FinTAL, 268–279.
Auger, A., and Barriere, C. 2008. Pattern-based approaches to semantic relation extraction: A state-of-the-art. Terminology 14(1):1–19.
Aussenac-Gilles, N., and Jacques, M.-P. 2008. Designing and evaluating patterns for relation acquisition from texts with Caméléon. Terminology 14(1):45–73.
Baroni, M., and Lenci, A. 2010. Distributional memory: A general framework for corpus-based semantics. Comput. Linguist. 36(4):673–721.
Biemann, C.; Bordag, S.; and Quasthoff, U. 2003. Lernen von paradigmatischen Relationen auf iterierten Kollokationen. In Beiträge zum GermaNet-Workshop: Anwendungen des deutschen Wortnetzes in Theorie und Praxis.
Boguraev, B. 1991. Building a lexicon: The contribution of computers. International Journal of Lexicography 4(3):227–260.
Bullinaria, J. 2008. Semantic categorization using simple word co-occurrence statistics. In Baroni, M.; Evert, S.; and Lenci, A., eds., Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 1–8.
Calzolari, N.; Pecchia, L.; and Zampolli, A. 1973. Working on the Italian machine dictionary: a semantic approach. In Proc. of the 5th Conference on Computational Linguistics (Morristown, NJ, USA), 49–52.
Calzolari, N. 1977. An empirical approach to circularity in dictionary definitions. Cahiers de Lexicologie 31(2):118–128.
Chodorow, M.; Byrd, R.; and Heidorn, G. 1985. Extracting semantic hierarchies from a large on-line dictionary. In Proc. of the 23rd Annual Meeting on ACL (Chicago, Illinois, USA), 299–304.
Ciaramita, M. 2002. Boosting automatic lexical acquisition with morphological information. In Proc. of the ACL-02 Workshop on Unsupervised Lexical Acquisition, ACL, 17–25.
Cimiano, P., and Völker, J. 2005. Text2onto. In Natural Language Processing and Information Systems. Springer. 227–238.
Fillmore, C. J. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech 280(1):20–32.
Fountain, T., and Lapata, M. 2012. Taxonomy induction using hierarchical random graphs. In Proc. of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, 466–476. Stroudsburg, PA, USA: Association for Computational Linguistics.
Fox, E. A.; Nutter, J. T.; Ahlswede, T.; Evens, M.; and Markowitz, J. 1988. Building a large thesaurus for information retrieval. In Proc. of the Second Conference on Applied Natural Language Processing, ANLC '88, 101–108. Stroudsburg, PA, USA: Association for Computational Linguistics.
Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer, Dordrecht, The Netherlands.
Guthrie, L.; Slator, B.; Wilks, Y.; and Bruce, R. 1990. Is there content in empty heads? In Proc. of the 13th International Conference on Computational Linguistics, COLING'90 (Helsinki, Finland), 138–143.
Hanks, P. 2013. Lexical Analysis: Norms and Exploitations. MIT Press, Cambridge, MA.
Hanks, P. in progress. Pattern Dictionary of English Verbs. http://www.pdev.org.uk/ (last access: 26/04/2015).
Harris, Z. 1954. Distributional structure. Word 10(23):146–162.
Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th International Conference on Computational Linguistics (Nantes, France), 539–545.
Kilgarriff, A. 2007. Googleology is bad science. Comput. Linguist. 33(1):147–151.
Landauer, T. K., and Dumais, S. T. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 211–240.
Lenat, D. B. 1995. Cyc: A large-scale investment in knowledge infrastructure. Commun. ACM 38(11):33–38.
Lin, D. 1998. Automatic retrieval and clustering of similar words. In Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, ACL '98, 768–774. Stroudsburg, PA, USA: Association for Computational Linguistics.
Lyons, J. 1977. Semantics, volume 2. Cambridge University Press.
Maedche, A. 1995. Ontology Learning for the Semantic Web. Dordrecht, The Netherlands: Kluwer.
Medelyan, O.; Manion, S.; Broekstra, J.; Divoli, A.; Huang, A.; and Witten, I. H. 2013. Constructing a focused taxonomy from a document collection. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proc., 367–381.
Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.
Nakamura, J., and Nagao, M. 1988. Extraction of semantic information from an ordinary English dictionary and its evaluation. In Proc. of the 12th International Conference on Computational Linguistics, COLING-88 (Budapest, Hungary), 459–464.
Navigli, R.; Velardi, P.; and Faralli, S. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, 1872–1877. AAAI Press.
Nazar, R., and Renau, I. 2012. A co-occurrence taxonomy from a general language corpus. In Proc. of EURALEX, 367–375.
Nazar, R., and Renau, I. In press. Agrupación semántica de sustantivos basada en similitud distribucional. Implicaciones lexicográficas. In Actas del V Congreso Internacional de Lexicografía Hispánica (25-27 de junio de 2012).
Nazar, R., and Renau, I. Submitted. Extraños-misteriosos-insondables-inescrutables son los caminos del señor: extracción de relaciones paradigmáticas mediante análisis estadístico de textos.
Nazar, R.; Vivaldi, J.; and Wanner, L. 2012. Co-occurrence graphs applied to taxonomy extraction in scientific and technical corpora. Procesamiento del Lenguaje Natural (49):67–74.
Nazar, R. 2010. A Quantitative Approach to Concept Analysis. Ph.D. Dissertation, Universitat Pompeu Fabra.
Pantel, P., and Pennacchiotti, M. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (Sydney, Australia), 113–120.
Pekar, V.; Krkoska, M.; and Staab, S. 2004. Feature weighting for co-occurrence-based classification of words. In Proc. of the 20th International Conference on Computational Linguistics, COLING '04. Stroudsburg, PA, USA: Association for Computational Linguistics.
Potrich, A., and Pianta, E. 2008. L-ISA: Learning domain-specific isa-relations from the web. In LREC. European Language Resources Association.
Pustejovsky, J.; Hanks, P.; and Rumshisky, A. 2004. Automated induction of sense in context. In Proc. of the 20th International Conference on Computational Linguistics, COLING '04. Stroudsburg, PA, USA: Association for Computational Linguistics.
Renau, I., and Nazar, R. 2012. Hypernym extraction by definiens-definiendum co-occurrence in multiple dictionaries. Procesamiento del Lenguaje Natural (49):83–90.
Rydin, S. 2002. Building a hyponymy lexicon with hierarchical structure. In Proc. of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, 26–33. Stroudsburg, PA, USA: Association for Computational Linguistics.
Sahlgren, M. 2008. The distributional hypothesis. Rivista di Linguistica 20(1):33–53.
Schütze, H., and Pedersen, J. 1997. A co-occurrence-based thesaurus and two applications to information retrieval. Information Processing and Management 33(3):307–318.
Snow, R.; Jurafsky, D.; and Ng, A. 2006. Semantic taxonomy induction from heterogeneous evidence. In Proc. of the 21st International Conference on Computational Linguistics (Sydney, Australia), 801–808.
Velardi, P.; Faralli, S.; and Navigli, R. 2013. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics 39(3):665–707.
Vossen, P. 1998. EuroWordNet: A multilingual database with lexical semantic networks. Computers and the Humanities 32(2-3).
Wang, W.; Barnaghi, P.; and Bargiela, A. 2009. Probabilistic topic models for learning terminological ontologies. IEEE Transactions on Knowledge and Data Engineering 99(RapidPosts).
Wilks, Y.; Fass, D.; Guo, C.; McDonald, J.; Plate, T.; and Slator, B. 1989. A tractable machine dictionary as a resource for computational semantics. In Boguraev, B., and Briscoe, T., eds., Computational Lexicography for Natural Language Processing. Essex, UK: Longman. 193–228.
Woon, W. L., and Madnick, S. 2009. Asymmetric information distances for automated taxonomy construction. Knowl. Inf. Syst. 21(1):91–111.
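As an illustration of the dictionary-consensus heuristic of Section 3.3, the following sketch picks the most frequent content word across several definitions of the same headword. The toy definitions and the stopword list are invented for the example; the real system uses a full function-word list and a set of actual dictionaries.

```python
from collections import Counter

# Invented mini stopword list; the actual system would use a complete
# function-word list for the target language.
STOPWORDS = {"a", "an", "the", "of", "with", "by", "and", "consisting"}

def consensus_hypernym(definitions):
    """Return the most frequent content word across several dictionary
    definitions of one headword, assuming the dictionaries tend to
    agree on the hypernym (Section 3.3)."""
    counts = Counter(
        word
        for definition in definitions
        for word in definition.lower().split()
        if word not in STOPWORDS
    )
    word, _freq = counts.most_common(1)[0]
    return word

# Three invented definitions of "bicycle" from different dictionaries:
defs = [
    "a vehicle with two wheels",
    "a two-wheeled vehicle propelled by pedals",
    "vehicle consisting of a frame and two wheels",
]
print(consensus_hypernym(defs))  # vehicle
```

No rules are written per dictionary: only the frequency of the shared word ("vehicle" appears in all three definitions) selects the hypernym, which is what makes the method portable across languages.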
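The "ngrams with asterisks" procedure of Section 3.4 can be sketched as follows: slide an ngram window over the corpus, abstract one position to an asterisk, and record which words fill that slot. The toy corpus is invented, and for simplicity only the middle position of a trigram is abstracted here, whereas the paper replaces any one of the words.

```python
from collections import Counter, defaultdict

def asterisk_slots(tokens, n=3):
    """For every ngram of length n, replace the middle word with '*'
    and count which words fill that position (Section 3.4)."""
    slots = defaultdict(Counter)
    mid = n // 2
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i:i + n]
        slot = tuple(ngram[:mid] + ["*"] + ngram[mid + 1:])
        slots[slot][ngram[mid]] += 1
    return slots

# Invented toy corpus echoing the 'at * airport' example from the BNC:
corpus = "at heathrow airport and at gatwick airport and at manchester airport".split()
slots = asterisk_slots(corpus)
print(slots[("at", "*", "airport")])
```

Words that fill many of the same slots across a large ngram sample are then grouped into a cluster, since sharing a single slot is not enough evidence of semantic similarity.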
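The analogical inference of Section 3.5 can be sketched with two feature types: a multiword prefix (the lexical level, e.g. "síndrome de") and a word-final character sequence (a crude stand-in for the morphological level, e.g. -itis). The training pairs are invented, and the fixed-length suffix is a simplification of whatever morphological features the actual module learns.

```python
def train(pairs, suffix_len=4):
    """Associate multiword prefixes and word-final character sequences
    with the semantic classes assigned by the other modules (Section 3.5)."""
    prefix_class, suffix_class = {}, {}
    for term, sem_class in pairs:
        words = term.split()
        if len(words) > 1:                        # lexical feature: 'síndrome de'
            prefix_class[" ".join(words[:-1])] = sem_class
        else:                                     # morphological feature: '-itis'
            suffix_class[term[-suffix_len:]] = sem_class
    return prefix_class, suffix_class

def classify(term, prefix_class, suffix_class, suffix_len=4):
    """Classify an unseen term by analogy with the learned features."""
    words = term.split()
    if len(words) > 1:
        return prefix_class.get(" ".join(words[:-1]))
    return suffix_class.get(term[-suffix_len:])

# Invented training data, mirroring the example in the paper:
pairs = [("síndrome de Carpenter", "enfermedad"),
         ("síndrome de Meretoja", "enfermedad"),
         ("hepatitis", "enfermedad")]
p, s = train(pairs)
print(classify("síndrome de Maffucci", p, s))  # enfermedad
print(classify("pancreatitis", p, s))          # enfermedad
```

Neither test term appears in the training data: both are classified purely by analogy with previously labeled terms, which is how the module covers words absent from the analyzed corpora.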
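The first integration step of Section 3.6, assigning to a whole cluster the hypernym most frequently proposed for its members, amounts to a majority vote. The cluster and candidate table below are invented (following the sedán/automóvil example); building the ascending chain up to the CPA top node Entity is omitted.

```python
from collections import Counter

def cluster_hypernym(cluster, candidates):
    """Select the most frequent hypernym candidate proposed by modules
    2, 3 and 5 for the members of one cluster (Section 3.6). Members
    without a candidate simply do not vote."""
    votes = Counter(candidates[w] for w in cluster if w in candidates)
    return votes.most_common(1)[0][0] if votes else None

# Invented cluster from modules 1/4 and hypernym candidates from 2/3/5:
cluster = ["sedán", "coche", "limusina", "cabriolé"]
candidates = {"sedán": "automóvil", "coche": "automóvil", "limusina": "vehículo"}
print(cluster_hypernym(cluster, candidates))  # automóvil
```

The winning candidate then becomes the semantic class of every cluster member, including those (like cabriolé here) for which no module proposed a hypernym at all.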