=Paper=
{{Paper
|id=Vol-1517/JOWO-15_WoMO_paper_4
|storemode=property
|title=Ontology Population using Corpus Statistics
|pdfUrl=https://ceur-ws.org/Vol-1517/JOWO-15_WoMO_paper_4.pdf
|volume=Vol-1517
|dblpUrl=https://dblp.org/rec/conf/ijcai/NazarR15
}}
==Ontology Population using Corpus Statistics==
Ontology Population Using Corpus Statistics

Rogelio Nazar, Irene Renau
Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso
{rogelio.nazar,irene.renau}@ucv.cl

Abstract

This paper presents a combination of algorithms for automatic ontology building based mainly on lexical co-occurrence statistics. We populate an ontology with hypernymy links, thus we refer more specifically to a taxonomy of lexical units (nouns organized by hypernymy relations) rather than an ontology of formally defined concepts. A set of combined statistical procedures produces fragments of taxonomies from corpora that are later integrated into a unified taxonomy by a central algorithm. Our results show that with an ensemble of different components it is possible to achieve an accuracy only slightly worse than human performance. Finally, as our methods are based on quantitative linguistics, the algorithm we propose is not language specific. The language used for the experiments is, however, Spanish.

For a more precise definition of the terms, taxonomies and ontologies are different kinds of knowledge structures. Whereas an ontology is "a system of categories accounting for a certain vision of the world" (Maedche 1995, 11), a taxonomy can be considered a hierarchical relational structure of words. For instance, "Vehicle" can be a formally defined concept in an ontology, and "vehicle", "car" or "bicycle" words related to this concept, the first being a hypernym of the others. A hypernymy relation is a basic semantic relation between a word, the hyponym, and the word used as a descriptor to define it, the hypernym (Lyons 1977). Hypernymy provides the hierarchical structure for the conceptual organization of a domain.
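Such a structure of hypernymy links can be pictured with a small sketch (ours, not the authors' code): a taxonomy of lexical units reduces to a noun-to-hypernym mapping, from which the full hypernymy chain of any word is recovered by walking up to the top node.

```python
# Illustrative sketch: a taxonomy of lexical units stored as a simple
# noun -> hypernym mapping; the entries below are the example words
# from the text, not data from the actual CPA Ontology.

HYPERNYM = {
    "bicycle": "Vehicle",
    "Vehicle": "Artifact",
    "Artifact": "Physical Object",
    "Physical Object": "Entity",
}

def hypernymy_chain(word: str) -> list[str]:
    """Follow hypernymy links from a word up to the top node."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

print(" -> ".join(hypernymy_chain("bicycle")))
# bicycle -> Vehicle -> Artifact -> Physical Object -> Entity
```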
In the following pages, we will use the term "ontology" to refer to the most general nodes of the structure, and "taxonomy" to refer to the connection between concepts of the ontology and words, thus establishing a difference between the ontological and the linguistic points of view.

1 Introduction

The study of the vocabulary in its real context of use is currently a central part of linguistics (Kilgarriff 2007; Hanks 2013). Among other tasks in this discipline, it is of utmost importance to extract and organize vocabulary units from corpora. There is an intrinsic theoretical interest in such an attempt, like the study of the laws that govern how words can be combined and classified. But there is also a practical motivation: the ability to transform unstructured data into structured databases, i.e., to go from plain text to lexical databases, which can later be organized as an ontology that specifies the terminology of a domain and the conceptual relations between terms.

This paper presents a preliminary description and assessment of results of a methodology based on co-occurrence statistics to transform text into a knowledge structure, which can later be developed into a taxonomy or an ontology.

In different ways, the paper represents an innovative way of addressing the problem. Our method is based on a combination of five different algorithms which produce raw results from corpora in the form of fragments of taxonomies, which are later compared and integrated into a single structure by a central algorithm in charge of the decision-making process. The result is a tree of hyponym/hypernym relations between nouns, i.e. words rather than concepts and their formal definitions, characteristic of the linguistic view. Another novelty of the approach is that it is quantitative: it does not involve language- or domain-specific knowledge coded directly into the system.
No external resources are needed apart from the analyzed corpora, a part-of-speech tagger and the CPA Ontology itself, which does not change because it uses English as a metalanguage. Up to now, however, experiments have been carried out only in English and, more extensively, in Spanish. We are starting with French and expect to continue replicating the experiment in other languages, offering the results on the accompanying website (http://www.verbario.com).

The objective is to populate with lexical units the CPA Ontology (http://www.pdev.org.uk/#onto), handcrafted by Pustejovsky et al. (2004) and substantially modified later by Hanks (in process). Given a top-node ontology of around 200 lexical units (nouns) denoting the most general concepts of the language, the proposed method consists of populating this shallow ontology by means of corpus statistics. Hence, the objective is to link a noun such as "bicycle" with its hypernym, "Vehicle", and this one with "Artifact", and so on.

In the following sections we present a general overview of the related work and then describe our proposal. We offer an evaluation of the results and, finally, we draft some conclusions and plans for future work.

2 Related Work on Taxonomy Building

The interest in the development of taxonomies is of course not new, as the publications on the subject span four decades. Space limitations only allow us to offer a very brief account of the research in this field, which we organize as follows: first, some efforts to produce taxonomies by hand; then, the literature on automatic taxonomy building from machine-readable dictionaries; finally, taxonomy extraction from corpora, on the one hand by rule-based systems and, on the other, based on quantitative analysis.

WordNet's general architecture based on synsets can often be problematic because at times the words in a synset are too different from a semantic point of view. Consider, for instance, the Spanish synset containing the words animal, bestia, criatura and fauna, which is equivalent to the English synset containing "animal", "animate being", "beast", "brute", "creature" and "fauna".
Here, the Spanish word pez and its English equivalent "fish" are correctly placed as hyponyms of "animal", but not as hyponyms of "beast". Furthermore, WordNet is a top-down approach, while our interest is in the corpus-driven approach. Overall, we decided in favor of the CPA Ontology for our taxonomy population project because its architecture, based on lexical units rather than synsets, is simple enough to be manipulated as needed.

2.1 Handmade Taxonomies

There is a large body of work in handcrafted taxonomy creation. We will only focus on some of the most representative modern efforts, not the 3rd-century Porphyrian tree or Roget's Thesaurus. We also exclude from this account all the specialized ontologies, restricting ourselves to some of the most well-known projects devoted to general vocabulary. Among the most cited projects are FrameNet, Cyc, WordNet and the CPA Ontology. With the exception of WordNet, which also includes a Spanish version, the rest are only available in English¹.

FrameNet is aimed at the implementation of Charles Fillmore's (1976) frame semantics as a lexical database organized in conceptual structures, available at http://framenet.icsi.berkeley.edu/.

2.2 Taxonomy Induction from Machine-Readable Dictionaries

The field of automatic semantic relation extraction and, in particular, hypernymy extraction, began to develop soon after the publication of the first machine-readable dictionaries in the seventies and eighties. This new resource favored the development of different methodologies to transform dictionaries made for human users into a lexical database with information stored and organized for computers (Calzolari, Pecchia, and Zampolli 1973; Calzolari 1977; Amsler 1981;
Chodorow, Byrd, and Heidorn 1985; Alshawi 1989; Fox et al. 1988; Nakamura and Nagao 1988; Wilks et al. 1989; Guthrie et al. 1990; Boguraev 1991; Araujo and Pérez-Agüera 2006).

The first researchers shared the idea of taking a machine-readable dictionary, studying the regularities and patterns in the definitions, and subsequently writing a system of rules to extract hypernymy and other semantic relations between vocabulary units. Depending on the dictionary, one of these rules could be that the first noun of the definition is the hypernym of the defined noun. However, this is not always the case, and thus one needs to develop more rules to cope with the exceptions.

The Cyc ontology, in turn, was born in 1984, not in the context of a linguistic theory but in the field of Artificial Intelligence (Lenat 1995). It is defined as an ontology of everyday common-sense knowledge and is available at http://www.opencyc.org/.

Another large taxonomy is WordNet (Miller 1995; Vossen 1998), originally created by psychologists but then widely used in many natural language processing tasks. WordNet, available at https://wordnet.princeton.edu/, is based on "synsets", defined as sets of words that have the same sense or refer to the same concept. It can be considered a taxonomy because it includes hypernymy links. Finally, the other project that has come to our attention is the CPA Ontology, created in the context of a lexicography project (Hanks in progress). It is at the moment a shallow ontology including only the upper nodes, i.e. the most general concepts denoted by words called "semantic types" in CPA terminology: "Event", "Emotion", "Physical Object" or "Human", etc. It includes no more than 200 words hierarchically organized in hypernymy links.

2.3 Taxonomy Induction from Corpora

The Pattern-based Approaches. With the advent of corpus linguistics in the nineties, researchers interested in semantic relation extraction moved on to corpus analysis but kept the same philosophy as in the previous attempts with dictionaries, that is, elaborating rule-based systems that would search for lexico-syntactic patterns in corpora expressing the desired information.
This top-node structure is currently being populated with lexical items by Hanks and his team. It is handmade work, but built from corpus analysis, which means that categories are not assumed a priori.

An examination of these taxonomies reveals different limitations. FrameNet departs from our main interest because, strictly speaking, it cannot be considered a taxonomy. In the case of the Cyc ontology, the formalisms used to express the relations are too complex to be manipulated and used as a basis for this Spanish taxonomy. In the case of WordNet, its synset-based architecture raises the problems already noted above.

Typically, if one finds in running text a sequence such as "X is a type of Y" or "X and other (types of) Y", etc., then one would assume that any pair of nouns occupying the positions X and Y holds a hypernymy relation (Hearst 1992; Rydin 2002; Cimiano and Völker 2005; Snow, Jurafsky, and Ng 2006; Pantel and Pennacchiotti 2006; Potrich and Pianta 2008; Auger and Barriere 2008; Aussenac-Gilles and Jacques 2008, among others).

Of course, the problem with this approach is that the collected patterns do not always express the desired relations and, in addition, many times the desired relations appear expressed in patterns that the researchers were not able to anticipate.

¹ There are, however, ongoing efforts to produce a Spanish version of FrameNet as well, cf. http://sfn.uab.es/

Algorithm 5, presented in the next section, is an inference engine which tries to reason upon the results of the other algorithms and extract new hypernymy assertions. Algorithm 6, finally, is the "assembly algorithm", which is in charge of integrating the taxonomy fragments produced by all the components into a modified version of the CPA Ontology.

The Quantitative Approaches. A different view on the subject is the extraction of thesauri from corpora based on
distributional similarity. There are two main lines of research: one specializes in finding semantic similarity between groups of words, and the other in establishing hypernymy links between pairs of words.

Experimental evaluation shows that with this method it is possible to obtain a robust, homeostatic or self-regulated taxonomy, because it is based on corpus statistics and can update itself automatically. Evidently, this list of methods is not exhaustive and, as this is work in progress, we foresee the integration of other methodologies as well. Up to now we have avoided matching Hearst patterns because they are costly to develop, they are language specific and, depending on the implementation, can also be error prone, as in the case of the Text2onto software, with precision figures of 17.38% and recall of 29.95% for hypernymy extraction (Cimiano and Völker 2005). We have also avoided the use of explicit semantic or grammatical knowledge, preferring a design that is self-contained and not dependent on external resources like Hearst patterns or WordNet, because this facilitates replication in other languages.

In the first case, the semantic similarity between groups of words is calculated on the basis of distributional similarity, as it is considered that semantically similar words will tend to occur in similar contexts. To be semantically similar, in this case, means to be synonyms or near-synonyms or, more interestingly, words that pertain to the same semantic class, i.e., cohyponyms (Grefenstette 1994; Landauer and Dumais 1997; Schütze and Pedersen 1997; Lin 1998; Ciaramita 2002; Biemann, Bordag, and Quasthoff 2003; Alfonseca and Manandhar 2002; Pekar, Krkoska, and Staab 2004; Bullinaria 2008). This line of research is tributary to the general notion of distributional semantics initiated by Harris
(1954) and developed later by many others (Sahlgren 2008; Baroni and Lenci 2010; Nazar 2010).

The second trend goes a step further than the previous notion of distributional thesauri as just clusters of similar words, and emphasizes the importance of establishing a hierarchical organization of the vocabulary, a difficult task that imposes its own challenges. As in the previous case, the data are obtained from corpora defined as document collections, Wikipedia or the Web, but the method used is most often directed co-occurrence graphs (Woon and Madnick 2009; Wang, Barnaghi, and Bargiela 2009; Navigli, Velardi, and Faralli 2011; Nazar, Vivaldi, and Wanner 2012; Fountain and Lapata 2012; Medelyan et al. 2013; Velardi, Faralli, and Navigli 2013).

As a textual corpus for our experiments we used a collection of Spanish press articles and Wikipedia pages accumulated in a single text file of ca. a billion tokens. In the case of algorithm 3, as it is fed with a lexicographic corpus, we used online dictionaries via a web search engine.

3.1 Algorithm 1: Clustering of Nouns Based on Distributional Similarity

The first component is based on a clustering technique that produces sets of semantically related nouns on the basis of distributional similarity. It bears some resemblance to the quantitative approach of Grefenstette (1994), although aimed at cohyponyms rather than synonyms, and without grammar-specific information.

Consider, for instance, the semantic class of drinks, with elements such as "coffee", "tea", "beer", "brandy", and so on. In the case of these nouns, there is a great probability that they will co-occur with other words such as the verb "to drink" or nouns such as "glass", "cup" or "bottle".

2.4 Why a New Approach

After so many publications on the subject, there continue to be attempts at taxonomy extraction because, despite the variety of ideas already proposed, there is still plenty of room for improvement.
The large body of bibliography appears to indicate that the field has come to a point at which the integration of different ideas is needed, i.e., an algorithm able to integrate different fragments of taxonomies.

3 Methodology: an Integration of Algorithms

The novelty of our approach lies in its modular design. Modular algorithms produce small fragments of taxonomies which are later contrasted and integrated into a larger taxonomy by a central module.

Algorithm 1 computes distributional similarity between words. Algorithm 2 calculates asymmetric relations in word co-occurrence in corpora. Algorithm 3 analyzes definitions from various dictionaries of a language and detects cases of significant definiens-definiendum co-occurrence. Algorithm 4 is a variant of 1, because it computes distributional similarity as the number of identical ngrams (as sequences of words) that a group of words may share. Algorithm 5 is an inference engine, and algorithm 6 the assembly module, both described above.

These and other shared words are the ones we used as indicators of the nouns' semantic relatedness, without POS-tag distinction. We can represent the overlap of shared vocabulary between lexical units as a Venn diagram (figure 1). In the intersection we can observe words that are shared by the units cerveza (beer), café (coffee) and té (tea), e.g. servir (to serve), beber (to drink), tomar (to drink), querer (to want), etc. Of course, we also have words that are shared only by two of the units: e.g., café and té share caliente (hot), which does not co-occur with cerveza. By the same token, cerveza and café share the unit amargo/a (bitter), which does not co-occur with té.

In concrete terms, this component analyzes the syntagmatic context in which a word appears and extracts the vocabulary (excluding function words). It then obtains pairs of words that, following the previous examples of drinks, could be brandy francés, bebiendo brandy, tomar brandy, brandy barato, etc. (French brandy, drinking brandy, drink brandy, cheap brandy, etc.).
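The comparison step that follows this extraction can be sketched in a few lines (our illustration, not the published implementation): each noun is mapped to the set of words it co-occurs with, and pairs of nouns are scored with the Jaccard coefficient. The toy co-occurrence sets below are made up from the drink examples in the text.

```python
# Minimal sketch of the pairwise comparison in algorithm 1, assuming a
# simple set-of-context-words representation per noun.

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient: |A intersection B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy co-occurrence sets (illustrative, not real corpus counts).
cooc = {
    "cerveza": {"servir", "beber", "tomar", "querer", "amargo"},
    "cafe":    {"servir", "beber", "tomar", "querer", "amargo", "caliente"},
    "te":      {"servir", "beber", "tomar", "querer", "caliente"},
    "jeep":    {"conducir", "motor"},
}

# cafe and te share most of their attributes; jeep shares none with cafe.
assert jaccard(cooc["cafe"], cooc["te"]) > jaccard(cooc["cafe"], cooc["jeep"])
```

In a clustering loop, the pair with the highest score would be merged first, exactly as the table-of-distances procedure described below.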
From these elements, a data structure is created in which each term is associated with the lexical units it co-occurs with. Terms are then represented as co-occurrence vectors, and the algorithm conducts a pairwise comparison of the terms, applying a similarity measure which calculates the degree of overlap between vectors. In this case, this is calculated with the Jaccard coefficient, as suggested by Grefenstette (1994), defined as follows, where A and B are the two vectors to be compared:

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

As is usual in any clustering procedure, for this comparison we need a table of distances, from which we obtain the pair of units showing the greatest similarity.

Figure 1: A Venn diagram to represent the intersection and difference between the co-occurrence sets.

Figure 2: In algorithm 2, example of co-occurrence graphs depicting hypernymy relations. Hypernym nodes are the ones that have the largest number of incoming arrows.

Table 1 shows some examples of the clusters created from these nouns and their member elements. In this experiment, our algorithm was able to produce correct clusters in 96% of the cases, though it was only capable of classifying half of the input words (51%). Better precision was met at the expense of a considerable loss of recall. More details on this experiment will appear in Nazar & Renau (In press).

3.2 Algorithm 2: Taxonomy Extraction Based on Asymmetric Word Co-occurrence

Instead of producing clusters of semantically related words, as the first algorithm does, the second one creates hyponym-hypernym pairs based on their co-occurrence patterning. Here, we define co-occurrence as the tendency of two lexical units to appear together in the same sentences, taking into account neither the distance between them nor their order in the sentence. The main idea behind this study is that co-occurrence is asymmetric in the case of hyponym-hypernym pairs.
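The asymmetry idea can be sketched as follows (our illustration under simplifying assumptions, not the authors' code): count how often each word brings another along in the same sentence, draw a directed edge when the association is clearly one-directional, and propose the node with most incoming edges as the hypernym. The toy sentences reuse the cortisol example discussed below.

```python
# Sketch of the asymmetric co-occurrence heuristic behind algorithm 2.
from collections import Counter
from itertools import permutations

# Each sentence is reduced to the set of relevant nouns it contains.
sentences = [
    {"cortisol", "hormona"},
    {"cortisol", "glucocorticoide", "hormona"},
    {"glucocorticoide", "hormona"},
    {"hormona"},
    {"hormona", "receptor"},
]

def freq(word: str) -> int:
    return sum(1 for s in sentences if word in s)

def cooc(x: str, y: str) -> int:
    return sum(1 for s in sentences if x in s and y in s)

def strength(x: str, y: str) -> float:
    """How strongly x 'calls for' y; asymmetric by construction."""
    return cooc(x, y) / freq(x)

vocab = {"cortisol", "glucocorticoide", "hormona"}
indegree = Counter()
for x, y in permutations(vocab, 2):
    # Edge x -> y: x almost always co-occurs with y, but not vice versa.
    if strength(x, y) >= 0.9 and strength(y, x) < 0.9:
        indegree[y] += 1

hypernym = max(indegree, key=indegree.get)
print(hypernym)  # hormona
```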
Hereafter, the members of this pair merge and create the first cluster, which occupies the place of both words and contains the sum of their attributes. The process is iterative: another table of distances is created, each time with one element fewer. The process stops when the units to cluster do not reach a similarity threshold, defined as a minimum proportion of attributes in common that a set of units must have in order to be assigned the same cluster.

For instance, the word motocicleta shows a tendency to appear in the same sentences as the word vehículo, but the relation is not reciprocated. The asymmetric nature of such an association allows us to automatically represent hierarchical relations without resorting to external knowledge bases. The computation of these relations is produced with the help of directed graphs that express the co-occurrence relations.

Class: Members of the cluster
Vehicles: carro, automóvil, coche, autobús, tranvía, carroza, carruaje, camión, jeep, camioneta
Types of cheese: brie, parmesano, camembert, mozzarella, gorgonzola, roquefort, gruyer
Drinks: chocolate, licor, chicha, cerveza, aguardiente
Hats: pavero, tricornio, bicornio, guarapón, canotier, calañés
Animals: venado, ciervo, tigre, elefante, perro, gato, puerco, cerdo, carnero, conejo, ratón, rata

In order to obtain an estimation of the quality of the results produced by this single module, we manually evaluated an arbitrary selection of 145 nouns which can be classified as drinks, hats, vehicles, animals and types of cheese.

The graphs shown in figure 2 illustrate this method. The arrows in the graph represent asymmetric co-occurrence relations, and the node with the most incoming arrows is selected as the hypernym of the input term. Here, the input term cortisol tends to co-occur with glucocorticoide and hormona. In turn, glucocorticoide also tends to co-occur with hormona, but this last unit does not reciprocate the relation with either of the two. The output of the graph is read as saying that hormona is the hypernym of cortisol because it is the node with the largest number of incoming arrows.
As shown by the other graph, the same pattern is exhibited by noun phrases, very common in multiword terms.

Table 1: Examples of clusters of Spanish nouns made by algorithm 1.

In order to test the performance of this module alone, we manually evaluated the results of an experiment with 200 Spanish nouns pertaining to the semantic classes of mammals, insects, drinks, hats, vehicles and, again, varieties of cheese. This preliminary evaluation shows that we can expect approximately a 60% chance of obtaining a correct hypernym for a given noun using this algorithm in isolation. More details on this experiment can be found in Nazar & Renau (2012).

If there are words that show a tendency to appear in the same positions in a large number of different ngrams, then one can conclude that these words are paradigmatically related. As a result, this algorithm produces clusters of words, where it can be seen that the members of each class share not only the same grammatical category but also an evident semantic relatedness. Table 3 shows some examples of the results in English. Results were virtually the same in Spanish. In this paper we are only interested in nouns but, as the table shows, the same procedure can be applied to the study of other grammatical categories.

3.3 Algorithm 3: Extraction of Hypernymy Relations from Definiens-Definiendum Co-occurrence in General Dictionaries

As already mentioned in subsection 2.2, electronic dictionaries have been used in the past to extract hypernymy and other semantic relations, but in general the approach has been focused on a single dictionary, which is parsed with a rule-based system to extract the relations from the definitions.
POS-tag: Members of the cluster
Adverbs: entirely, exclusively, mainly, primarily, principally
Adjectives: cost-effective, efficient, elaborate, professional, subtle
Proper nouns: Australia, Dublin, England, France, India, Janeiro, Middlesbrough, Newcastle, Sunderland, Yorkshire
Nouns: chess, cricket, football, golf, rugby, soccer, tennis

Table 3: Examples of clusters in English, result of algorithm 4.

Our approach here is different because we use a set of dictionaries and infer the hypernymy relations from the frequency of co-occurrence between lexical items in the headword and in the definitions. The algorithm uses the frequency to select hypernyms from the text of the definitions, assuming that there will be some consensus among the dictionaries when selecting a given hypernym, and thus this should be the most frequent word (excluding function words). In this way we save the effort of building a set of rules for each dictionary and make it possible to replicate the experiment in other languages.

In order to obtain an evaluation of the performance of this single module, we manually examined the results for a random sample of 150 nouns and concluded that we can expect approximately a 70% chance of obtaining a correct hypernym for a given input word. More details on this experiment can be found in Renau & Nazar (2012).

In order to evaluate this component (the ngram clustering of algorithm 4, presented below), we again carried out a manual examination of the results. This was done on a random sample of 30 clusters, containing 1191 words in total. We found that in 96% of the cases the clusters were consistent. This internal consistency is computed as the mean consistency of each individual cluster, with numbers for the individual clusters ranging from 80 to 100%. More details on this experiment will appear in Nazar & Renau (Submitted).
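The dictionary-consensus heuristic of algorithm 3 can be sketched as follows (our illustration: the toy definitions and stopword list are invented, not taken from the dictionaries the authors used).

```python
# Sketch of algorithm 3: propose as hypernym the most frequent content
# word across the definitions of a headword in several dictionaries.
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "with", "that", "is", "used", "for", "or"}

definitions = [  # toy entries for "bicycle" from three hypothetical dictionaries
    "a vehicle with two wheels",
    "a two-wheeled vehicle that is pedalled",
    "a pedal-driven vehicle",
]

def consensus_hypernym(defs: list[str]) -> str:
    counts = Counter(
        word
        for text in defs
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return counts.most_common(1)[0][0]

print(consensus_hypernym(definitions))  # vehicle
```

The most frequent content word across the set of definitions, here "vehicle", plays the role of the consensus hypernym, with no per-dictionary rules needed.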
3.4 Algorithm 4: Ngrams with "Asterisks"

With algorithm 4 we explored the possibility of creating clusters of words that have a tendency to occur in exactly the same positions in short sequences of words. This is why we describe this module as "ngrams with asterisks".

What we do here is study large samples of ngrams, defined as sequences of three to five words, and then replace one of the words inside the ngram with an asterisk. The goal is then to record which words most frequently occur in the position of such an asterisk. Normally, these words will show some kind of paradigmatic relation and will therefore have some features in common, such as the grammatical category and, in most cases, also a semantic relatedness. Consider, for instance, the case of the ngram "at * airport", taken from the BNC corpus (table 2). Only a limited number of words can occur in the position of the asterisk, and these are semantically related.

3.5 Algorithm 5: Analogical Inference

We use algorithm 5 as an inference engine to analyze the results produced by the previous modules. It is the only component that is not corpus-driven, in the sense that it only analyzes the morphological and lexical features of the terms.

This component is described as an analogical inference engine because it learns to associate features of the lexical items with the category that is assigned to them by the other algorithms. In this way, if a term cannot be found in the analyzed corpora, it may still be classified by this module.

The features that are learned are both lexical and morphological. The lexical level is only useful in the case of multiword expressions, but these are of course very frequent, especially in technical or specialized domains. For instance, this algorithm first learns that the other
algorithms are placing terms such as síndrome de Carpenter (Carpenter syndrome) and síndrome de Meretoja (Meretoja syndrome), among others, as hyponyms of enfermedad (disease). Thus, it learns to associate features of the terms (in this case, the sequence síndrome de) with a semantic class, and subsequently classifies new terms such as síndrome de Maffucci (Maffucci syndrome) also as a disease. The same is done at the morphological level: the module learns to detect morphological similarities between the members of the same semantic class. Continuing with the same example, it can learn that terms denoting diseases very frequently have the suffixes -osis or -itis, as in hepatitis or endometriosis. Given, thus, a new term such as pancreatitis, the module will recognize it as a disease. More details on this experiment were published in Nazar et al. (2012).

at * airport [645]
Heathrow 56, Manchester 23, Gatwick 19, London 15, Frankfurt 15, Teesside 10, Edinburgh 8, an 7, Glasgow 7, Stansted 7, Dublin 6, Birmingham 5, Coventry 5, Aberdeen 5, ...

Table 2: Words appearing in the position of the asterisk in the sequence "at * airport" in the BNC corpus.

Sharing a single ngram is of course not an indication of semantic similarity; as noted above, the paradigmatic relation emerges only when words recur in the same positions across a large number of different ngrams.

3.6 Integration into a Single Taxonomy

After the implementation of and experimentation with each procedure, we developed a new central or "assembly" algorithm, with the purpose of integrating the results into a single taxonomy. The task of this module is to reinforce the certainty of the results on the basis of the combined output of each module.

Each judge received a random sample of 52 nouns to evaluate. The criterion for considering a link between a word and a node as correct was that it should correspond to the hypernymy relation type. Among the words from the samples we have, for instance, lechuga ("lettuce"). In this case, the taxonomy offers two hypernymy chains:

Entity → Physical Object → Plant [Planta] → [Arbusto] → lechuga

Entity →
Physical Object → Plant [Planta] → lechuga

The first chain states that a lettuce is a type of bush ([Arbusto]), which, in turn, is a type of plant, and so on. But a lettuce is not really a bush, so this chain is incorrect (it cannot be analyzed as "lettuce IS A bush"). The second asserts that a lettuce is a type of plant ([Planta]), which in turn is a type of physical object, and so on. This last chain is considered correct (it can be analyzed as "lettuce IS A plant").

For each noun, our human judges indicated how many hypernyms were offered by the taxonomy (two in the case of lechuga) and how many of them were correct: one out of two in this case. We thus calculated overall precision as the ratio of correct chains over total chains. We consider recall very difficult to calculate because of the lack of corpus-based lexicographic material in Spanish. As a consequence, only precision was evaluated, as reported below.

The integration procedure is, however, not straightforward, because each algorithm is of a different nature, and thus the combination of the results cannot be solved with a simple voting scheme. Algorithms 1 and 4 result in groups of semantically similar words, while 2, 3 and 5 result in hypernymy pairs. Moreover, the desired result is to populate the already existing CPA Ontology, which, as already explained, refers to very general or abstract concepts.

Two basic operations are conducted: on the one hand, integrating the results of the modules that produce clusters of semantically similar words and, on the other, linking these clusters with a correct hypernym. This is done in sequential steps.
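The core of this integration can be sketched in a few lines (our illustration, under the assumption that modules 2, 3 and 5 each contribute zero or more hypernym candidates per noun): for every cluster coming from modules 1 and 4, the most frequent candidate among its members becomes the semantic class of the whole cluster.

```python
# Sketch of the assembly step: pick the most frequent hypernym candidate
# for a cluster. Candidate lists below are invented for illustration.
from collections import Counter

cluster = ["sedán", "coche", "limusina"]        # from modules 1 / 4
candidates = {                                   # from modules 2 / 3 / 5
    "sedán":    ["automóvil", "automóvil"],
    "coche":    ["automóvil", "vehículo"],
    "limusina": ["automóvil"],
}

def cluster_hypernym(cluster: list[str], candidates: dict) -> str:
    votes = Counter(h for noun in cluster for h in candidates.get(noun, []))
    return votes.most_common(1)[0][0]

print(cluster_hypernym(cluster, candidates))  # automóvil
```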
For each noun in each cluster produced by modules 1 and 4, there may be a hypernym candidate provided by modules 2, 3 and 5. The result is that, for each cluster, there will be a most frequent hypernym candidate, which is thus selected as the semantic class of all the members of the cluster. As a result, a pairing of a hypernym with a group of hyponyms is obtained: e.g. sedán, coche, limusina, etc. are classified as hyponyms of automóvil. A chain of ascending hypernymy links is then built until one of the semantic types of the CPA Ontology is found, and each word is integrated in a hypernymy chain up to the top node Entity:

Entity → Physical Object → Inanimate → Artifact → Machine → Vehicle → Automobile → sedan

4 Evaluation of the Results

We only evaluated precision, and obtained that, of a total of 763 hypernymy chains examined, 586 were found to be correct, which makes a precision of 76.80%. The standard deviation in the group of judges is 14.34. If we exclude the two judges with the most extreme positive and negative scores, the mean precision rate is 77.18%, with a standard deviation of 12.82.

Regarding the control of inter-coder agreement, we included in all the samples given to the judges a common group of 11 nouns, which makes 88 judgments that should ideally be identical. However, raters agreed only on 72 cases, which is still more than moderate agreement (81.8%, or 63.2% if measured with a Kappa coefficient to correct for chance agreement). We can interpret the agreement percentage as the ceiling of the precision one can expect from this type of algorithm.

With respect to error analysis, we found that errors mostly occurred as a consequence of the polysemy of the words, a circumstance already noticed by Amsler (1981) for this kind of output.

3.7 User Interface

A first prototype is now being developed as a web demo, available at http://www.verbario.com. The taxonomy of nouns is only one module of Verbario.com, a website that is part of a wider project devoted to lexical analysis.
The “Tax- for this kind of output. It is the case, for instance, of the onomy” part shows two ways of obtaining results: the user word adicción (addiction), tagged as hyponym of dependen- can either introduce a noun and get the hypernymy chain or cia (dependence), which is correct in principle. But then de- viceversa, he/she can obtain all the hyponyms of a given tar- pendencia is only registered as a type of construction, ac- get noun. At the moment, more than 30,000 Spanish nouns cording to one of the senses that the word has in Spanish. have already been introduced and, while the system is run- Another frequent cause of error is confusion between hy- ning, more words are being added at a fast pace with a rea- pernymy and other semantic relation. Vitı́ligo, for instance, sonably low error rate, as shown in the next section. We plan is correctly classified as a disease in one case but incorrectly to offer regular back up files of the taxonomy in OWL for- as a hyponym of piel (skin) in another, obviously because mat at this website. it is a skin-related disease and then both words tend to co- occur in the same contexts. The same happens in the case 4 Evaluation of the Results of synonyms, which are often placed incorrectly in a hyper- Samples of the overall results were evaluated by a group of nymy relation. For instance, the word cuchı́ is a synonym 8 human judges, all of them advanced graduate students in and not a hyponym of cerdo (pig). There are also cases linguistics. Each one received the same instructions and a of meronymy, such as the word océano (ocean), which is wrongly connected to agua (water): an ocean is made of wa- Aussenac-Gilles, N., and Jacques, M.-P. 2008. Designing ter, but it is not a type of water. and evaluating patterns for relation acquisition from texts with Caméléon. Terminology 14(1):45–73. 5 Conclusions and Future Work Baroni, M., and Lenci, A. 2010. Distributional memory: A gen- eral framework for corpus-based semantics. Comput. 
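The agreement figures reported above can be checked with a short Cohen's kappa computation. The expected chance agreement p_e is not given in the paper; the value of roughly 0.505 used below is inferred from the reported 81.8% raw and 63.2% chance-corrected agreement, and would correspond to a near-even split of correct/incorrect judgments:

```python
def cohen_kappa(p_o, p_e):
    """Chance-corrected agreement: how far the observed agreement p_o
    exceeds the agreement p_e expected by chance alone."""
    return (p_o - p_e) / (1 - p_e)

p_o = 72 / 88   # raw agreement: 72 identical judgments out of 88
p_e = 0.505     # implied chance agreement (an inference, see above)

print(round(p_o * 100, 1))  # 81.8
print(round(cohen_kappa(p_o, p_e), 2))
```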
5 Conclusions and Future Work

This paper has presented a set of combined algorithms for building a taxonomy of Spanish nouns based on procedures from quantitative linguistics. The method, based mainly on the study of co-occurrence patterns, could in principle be replicated with different languages. The precision we obtain at the moment can be improved, but at the same time it is only slightly lower than the inter-coder agreement percentage. In semantic analysis, total agreement is unrealistic, given that even dictionaries do not always agree.

For future work, we are focusing on the following aspects. On the one hand, we will try to improve precision by addressing the problem of polysemy and the confusion of synonyms and meronyms with hypernyms. We have already experimented with sense-induction algorithms which, for each noun found in a corpus, produce a list of its different senses (Nazar 2010). This algorithm can now be used to map each sense to a hypernym. Pending work also includes a detailed large-scale evaluation. In this respect, we must distinguish the precision of frequent nouns (e.g. manzana - apple, casa - house, etc.) from the precision of very infrequent nouns (e.g. acetábulo, an anatomical part of a bone). Finally, we will evaluate separately how the system operates with specialized terms and with general language.

6 Acknowledgments

This paper has been made possible thanks to funding from Projects Fondecyt 11140704, led by Irene Renau, and Fondecyt 11140686, led by Rogelio Nazar (http://www.conicyt.cl/fondecyt). We would like to express our gratitude to the students for their participation and to the reviewers for their extended and detailed comments, which have been very helpful to improve this paper. Unfortunately, lack of time and space prevented us from introducing all of the changes that were suggested.

References

Alfonseca, E., and Manandhar, S. 2002. Extending a lexical ontology by a combination of distributional semantics signatures. In Proc. of EKAW'02, 1–7.

Alshawi, H. 1989. Analysing the dictionary definitions. In Boguraev, B., and Briscoe, T., eds., Computational Lexicography for Natural Language Processing. White Plains, NY, USA: Longman Publishing Group. 153–169.

Amsler, R. 1981. A taxonomy for English nouns and verbs. In Proc. of the 19th Annual Meeting on ACL (Morristown, NJ, USA), 133–138.

Araujo, L., and Pérez-Agüera, J. R. 2006. Enriching thesauri with hierarchical relationships by pattern matching in dictionaries. In FinTAL, 268–279.

Auger, A., and Barriere, C. 2008. Pattern-based approaches to semantic relation extraction - A state-of-the-art. Terminology 14(1):1–19.

Aussenac-Gilles, N., and Jacques, M.-P. 2008. Designing and evaluating patterns for relation acquisition from texts with Caméléon. Terminology 14(1):45–73.

Baroni, M., and Lenci, A. 2010. Distributional memory: A general framework for corpus-based semantics. Comput. Linguist. 36(4):673–721.

Biemann, C.; Bordag, S.; and Quasthoff, U. 2003. Lernen von paradigmatischen Relationen auf iterierten Kollokationen. In Beiträge zum GermaNet-Workshop: Anwendungen des deutschen Wortnetzes in Theorie und Praxis.

Boguraev, B. 1991. Building a lexicon: The contribution of computers. International Journal of Lexicography 4(3):227–260.

Bullinaria, J. 2008. Semantic categorization using simple word co-occurrence statistics. In Baroni, M.; Evert, S.; and Lenci, A., eds., Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 1–8.

Calzolari, N.; Pecchia, L.; and Zampolli, A. 1973. Working on the Italian machine dictionary: a semantic approach. In Proc. of the 5th Conference on Computational Linguistics (Morristown, NJ, USA), 49–52.

Calzolari, N. 1977. An empirical approach to circularity in dictionary definitions. Cahiers de Lexicologie 31(2):118–128.

Chodorow, M.; Byrd, R.; and Heidorn, G. 1985. Extracting semantic hierarchies from a large on-line dictionary. In Proc. of the 23rd Annual Meeting on ACL (Chicago, Illinois, USA), 299–304.

Ciaramita, M. 2002. Boosting automatic lexical acquisition with morphological information. In Proc. of the ACL-02 Workshop on Unsupervised Lexical Acquisition, 17–25.

Cimiano, P., and Völker, J. 2005. Text2Onto. In Natural Language Processing and Information Systems. Springer. 227–238.

Fillmore, C. J. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech 280(1):20–32.

Fountain, T., and Lapata, M. 2012. Taxonomy induction using hierarchical random graphs. In Proc. of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, 466–476. Stroudsburg, PA, USA: Association for Computational Linguistics.

Fox, E. A.; Nutter, J. T.; Ahlswede, T.; Evens, M.; and Markowitz, J. 1988. Building a large thesaurus for information retrieval. In Proc. of the Second Conference on Applied Natural Language Processing, ANLC '88, 101–108. Stroudsburg, PA, USA: Association for Computational Linguistics.

Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer, Dordrecht, The Netherlands.

Guthrie, L.; Slator, B.; Wilks, Y.; and Bruce, R. 1990. Is there content in empty heads? In Proc. of the 13th International Conference on Computational Linguistics, COLING'90 (Helsinki, Finland), 138–143.

Hanks, P. 2013. Lexical Analysis: Norms and Exploitations. MIT Press, Cambridge, MA.

Hanks, P. In progress. Pattern Dictionary of English Verbs. http://www.pdev.org.uk/ (last access: 26/04/2015).

Harris, Z. 1954. Distributional structure. Word 10(23):146–162.

Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th International Conference on Computational Linguistics (Nantes, France), 539–545.

Kilgarriff, A. 2007. Googleology is bad science. Comput. Linguist. 33(1):147–151.

Landauer, T. K., and Dumais, S. T. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 211–240.

Lenat, D. B. 1995. Cyc: A large-scale investment in knowledge infrastructure. Commun. ACM 38(11):33–38.

Lin, D. 1998. Automatic retrieval and clustering of similar words. In Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, ACL '98, 768–774. Stroudsburg, PA, USA: Association for Computational Linguistics.

Lyons, J. 1977. Semantics, volume 2. Cambridge University Press.

Maedche, A. 1995. Ontology Learning for the Semantic Web. Dordrecht, The Netherlands: Kluwer.

Medelyan, O.; Manion, S.; Broekstra, J.; Divoli, A.; Huang, A.; and Witten, I. H. 2013. Constructing a focused taxonomy from a document collection. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proc., 367–381.

Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

Nakamura, J., and Nagao, M. 1988. Extraction of semantic information from an ordinary English dictionary and its evaluation. In Proc. of the 12th International Conference on Computational Linguistics, COLING-88 (Budapest, Hungary), 459–464.

Navigli, R.; Velardi, P.; and Faralli, S. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, 1872–1877. AAAI Press.

Nazar, R., and Renau, I. 2012. A co-occurrence taxonomy from a general language corpus. In Proc. of EURALEX, 367–375.

Nazar, R., and Renau, I. In press. Agrupación semántica de sustantivos basada en similitud distribucional. Implicaciones lexicográficas. In Actas del V Congreso Internacional de Lexicografía Hispánica (25-27 de junio de 2012).

Nazar, R., and Renau, I. Submitted. Extraños-misteriosos-insondables-inescrutables son los caminos del señor: extracción de relaciones paradigmáticas mediante análisis estadístico de textos.

Nazar, R.; Vivaldi, J.; and Wanner, L. 2012. Co-occurrence graphs applied to taxonomy extraction in scientific and technical corpora. Procesamiento del Lenguaje Natural (49):67–74.

Nazar, R. 2010. A Quantitative Approach to Concept Analysis. Ph.D. Dissertation, Universitat Pompeu Fabra.

Pantel, P., and Pennacchiotti, M. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (Sydney, Australia), 113–120.

Pekar, V.; Krkoska, M.; and Staab, S. 2004. Feature weighting for co-occurrence-based classification of words. In Proc. of the 20th International Conference on Computational Linguistics, COLING '04. Stroudsburg, PA, USA: Association for Computational Linguistics.

Potrich, A., and Pianta, E. 2008. L-ISA: Learning domain specific isa-relations from the web. In LREC. European Language Resources Association.

Pustejovsky, J.; Hanks, P.; and Rumshisky, A. 2004. Automated induction of sense in context. In Proc. of the 20th International Conference on Computational Linguistics, COLING '04. Stroudsburg, PA, USA: Association for Computational Linguistics.

Renau, I., and Nazar, R. 2012. Hypernym extraction by definiens-definiendum co-occurrence in multiple dictionaries. Procesamiento del Lenguaje Natural (49):83–90.

Rydin, S. 2002. Building a hyponymy lexicon with hierarchical structure. In Proc. of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, 26–33. Stroudsburg, PA, USA: Association for Computational Linguistics.

Sahlgren, M. 2008. The distributional hypothesis. Rivista di Linguistica 20(1):33–53.

Schütze, H., and Pedersen, J. 1997. A co-occurrence-based thesaurus and two applications to information retrieval. Information Processing and Management 33(3):307–318.

Snow, R.; Jurafsky, D.; and Ng, A. 2006. Semantic taxonomy induction from heterogeneous evidence. In Proc. of the 21st International Conference on Computational Linguistics (Sydney, Australia), 801–808.

Velardi, P.; Faralli, S.; and Navigli, R. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics 39(3):665–707.

Vossen, P. 1998. EuroWordNet: A multilingual database with lexical semantic networks. Computers and the Humanities 32(2-3).

Wang, W.; Barnaghi, P.; and Bargiela, A. 2009. Probabilistic topic models for learning terminological ontologies. IEEE Transactions on Knowledge and Data Engineering 99(RapidPosts).

Wilks, Y.; Fass, D.; Guo, C.; McDonald, J.; Plate, T.; and Slator, B. 1989. A tractable machine dictionary as a resource for computational semantics. In Boguraev, B., and Briscoe, T., eds., Computational Lexicography for Natural Language Processing. Essex, UK: Longman. 193–228.

Woon, W. L., and Madnick, S. 2009. Asymmetric information distances for automated taxonomy construction. Knowl. Inf. Syst. 21(1):91–111.