AHyDA: Automatic Hypernym Detection with feature Augmentation

Ludovica Pannitto (University of Pisa) ellepannitto@gmail.com
Lavinia Salicchi (University of Pisa) lavinia.salicchi@libero.it
Alessandro Lenci (University of Pisa) alessandro.lenci@unipi.it

Abstract

English. Several unsupervised methods for hypernym detection have been investigated in distributional semantics. Here we present a new approach based on a smoothed version of the distributional inclusion hypothesis. The new method improves hypernym detection, as shown by tests on the BLESS dataset.

Italiano. Building on the unsupervised methods available in the literature, we address the task of hypernym detection in distributional space. We introduce a new directional measure, based on an expansion of the distributional inclusion hypothesis, which improves hypernym detection, as we show by testing it on the BLESS dataset.

1 Introduction and related works

Within the Distributional Semantics framework, semantic similarity between words is usually expressed in terms of proximity in a semantic space, whose dimensions represent, at some level of abstraction, the contexts in which the words occur.

Our intuitions about the meaning of words license inferences of the kind shown in example (1), and we expect Distributional Semantic Models (DSMs) to support such inferences:

(1) a. Wilbrand invented TNT → Wilbrand uncovered TNT
    b. A horse ran → An animal moved

The type of relation between semantically similar lexemes may differ significantly, but DSMs only account for a generic notion of semantic relatedness. Furthermore, not all lexical relations are symmetrical (see example (2)), while most of the similarity measures defined in distributional semantics, like the cosine, are.

(2) a. I saw a dog → I saw an animal
    b. I saw an animal ↛ I saw a dog

Hypernymy is an asymmetric relation. Automatic hypernym identification is a well-known task in the literature, which has mostly been addressed with semi-supervised, pattern-based approaches (Hearst, 1992; Pantel and Pennacchiotti, 2006). Various unsupervised models have also been proposed (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Lenci and Benotto, 2012; Santus et al., 2014), based on the notion of Distributional Generality (Weeds et al., 2004) and on the Distributional Inclusion Hypothesis (DIH) (Geffet and Dagan, 2005) that was derived from it.

1.1 The pitfalls of the DIH

The DIH aims at providing a distributional correlate of the extensional definition of hyponymy in terms of set inclusion: x is a hyponym of y iff the extension of x (i.e. the set of entities denoted by x) is a subset of the extension of y. The DIH turns this into the assumption that a significant number of the most salient contexts of x should also appear among the salient contexts of y. While this is consistent with the logical inferences licensed by hyponymy (cf. (2)), it does not take into account the actual usage of hypernyms with respect to hyponyms. Consider for instance the following examples:

(3) a. A horse gallops →? An animal gallops
    b. A dog barks →? An animal barks

These inferences are truth-conditionally valid: whenever the antecedent is true, the consequent is also true. However, they are not equally "pragmatically" sound.
Indeed, the fact that one uses a sentence like A dog barks does not entail that in the same situation one would also use the sentence An animal barks. The latter sentence would be pragmatically appropriate only when one knows that something is barking without knowing which animal is producing the sound. This condition, however, hardly ever applies, since barking is a very typical feature of dogs: knowing that something is barking typically entails knowing that it is a dog, because we know that barking is something dogs do. The same argument applies to the case of horse and galloping.

The problem of the DIH is that the assumption it rests on, namely that the most typical contexts of the hyponym are also typical contexts of the hypernym, is not borne out in actual language use because of pragmatic constraints. The most typical contexts of a hyponym are not necessarily typical contexts of its hypernym. This is also shown by a simple inspection of corpus data, as reported in Table 1. Although animal (161,107 occurrences) is more frequent than dog (128,765) and horse (90,437), its co-occurrence counts with bark and gallop are much lower than those of the hyponyms: bark and gallop are not typical contexts of animal.

          horse    dog    animal
gallop      216      –         7
bark          –    869        16

Table 1: Co-occurrence frequency distribution extracted from the ukWaC corpus

If the inferences in (3) are pragmatically odd, the following ones are instead fully acceptable:

(4) a. A horse gallops → An animal moves
    b. A dog barks → An animal calls

Salient features of the hypernym are indeed supposed to be semantically more general than the salient features of the hyponym. Santus et al. (2014) tried to capture this fact by abandoning the DIH and introducing an entropy-based measure to estimate the informativeness of the hypernym and hyponym contexts, under the assumption that the former have a higher entropy because they are more general (e.g. move vs. gallop).

In this paper, we address the same issue by amending the DIH, to make it more consistent with the actual distributional properties of hyponyms and hypernyms. We introduce AHyDA (Automatic Hypernym Detection with feature Augmentation), a smoothed version of the DIH: given a context feature f that is salient for a lexical item x, we expect co-hyponyms of x to have some feature g that is similar to f, and a hypernym of x to have a number of these clusters of features. To remain in the domain of animal sounds, we expect a dog to bark, a duck to quack, and an animal either to produce one of those sounds or to co-occur with a more general sound-emission verb.

2 AHyDA: Smoothing the DIH

All the measures implementing the DIH are based on computing the (weighted) intersection of the features of the hyponym and the hypernym, which is then typically normalized by the features of the hyponym. AHyDA essentially proposes a new way to compute the intersection of the hyponym and hypernym contexts. Given a lexical item x, we call F_x the set of its distributional features. Note that features need not be pure lexical items: in general, we define a feature f as a pair (f_w, f_r), where f_w is typically a lexical item and f_r is any additional contextual information, in the present case a pattern occurring between x and f_w, as explained in section 3.1. The core novelty of AHyDA is to use a smoothed version of F_x, called F'_x.

The idea is illustrated in Figure 1, which provides a simplified graphical example of the intersection operation. Consider a case where the target horse has some feature with gallop as its lexical item, for example f = (gallop, sbj), meaning that horse is a possible subject of gallop. Given what we have said in Section 1.1, we do not expect animal to share this horse-specific property. So, instead of looking for this particular feature among the features of animal, we generate a new set N_horse(gallop) of features g = (g_w, f_r) such that g_w is a neighbor of gallop and is a feature (with the same syntactic relation sbj) of some neighbor of horse.
Suppose that run, move, and cycle are neighbors of gallop. Since run and move are also features of some neighbor of horse (e.g., lion), we would have N_horse(gallop) = {gallop, run, move}. Conversely, since cycle is not a feature of any close neighbor of horse, it is not included in the expanded feature set.

Figure 1: An example of smoothed intersection. Black arrows indicate semantic similarity with gallop; the items with the blue background are the ones included in N_horse(gallop).

Mathematically, we define the expanded feature set F'_x as follows:

$$F'_x = \{(f, N_x(f)) \;\; \forall f \in F_x\} \quad (1)$$

$$N_x(f) = \{g \mid g = (g_w, f_r)\} \quad (2)$$

where the following conditions hold for g:

$$d(f_w, g_w) < k \;\wedge\; \exists y .\, d(x, y) < h \wedge g \in F_y \quad (3)$$

where d(x, y) is any distance measure in the semantic space, and k and h are empirically set threshold values. N_x(f) is generated by looking for features g whose lexical item is similar to f_w; we then check whether this new feature is shared by some neighbor of the target x, and if so we include g in N_x(f). This allows us to redefine the intersection operation between F'_x and F_y as:

$$F'_x \;\hat{\cap}\; F_y = \{f \mid f \in F_x \wedge N_x(f) \cap F_y \neq \emptyset\} \quad (4)$$

When expanding a feature f into N_x(f), we expect to find in N_x(f) features that express the same "property" in different ways. We expect these features to be shared by hypernyms more than by co-hyponyms, because hypernyms are supposed to collect features from all their hyponyms, while co-hyponyms lack those of other co-hyponyms (e.g. lions run but do not gallop). AHyDA is thus defined as follows:

$$\mathrm{AHyDA}(x, y) = \frac{\sum_{f \in F_x} |F'_x \;\hat{\cap}\; F_y|}{|F_x|} \quad (5)$$

Importantly, AHyDA only considers the average cardinality of the intersections, without looking at the feature weights. Moreover, the formula is asymmetric (like the others implementing the DIH), and is therefore suitable to capture the asymmetric nature of hypernymy.
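To make the procedure concrete, the Python sketch below implements one natural reading of Equations (1)–(5): for every feature of x, an expansion set N_x(f) is built from neighbors of the feature's lexical item that are themselves features of close neighbors of x, and AHyDA is the fraction of features of x whose expansion overlaps F_y. The toy feature sets, the sim function, and all names are ours, purely for illustration; k and h are treated as lower bounds on cosine similarity, as in the parameter setting of Section 3.3, and real features would come from TypeDM.

```python
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]  # (f_w, f_r), e.g. ("gallop", "sbj")

# Toy feature sets F_x (purely illustrative; in the experiments the features
# are TypeDM co-occurrences with Local Mutual Information weights).
F: Dict[str, Set[Feature]] = {
    "horse":   {("gallop", "sbj"), ("ride", "obj"), ("neigh", "sbj")},
    "lion":    {("run", "sbj"), ("move", "sbj"), ("hunt", "sbj")},
    "animal":  {("move", "sbj"), ("run", "sbj"), ("eat", "sbj")},
    "bicycle": {("cycle", "sbj"), ("ride", "obj")},
}

# Toy similarity scores standing in for the cosine in the dense space.
_TOY_SIM = {
    frozenset({"horse", "lion"}): 0.95,
    frozenset({"horse", "animal"}): 0.88,
    frozenset({"gallop", "run"}): 0.93,
    frozenset({"gallop", "move"}): 0.91,
    frozenset({"gallop", "cycle"}): 0.85,
}

def sim(a: str, b: str) -> float:
    """Illustrative similarity; a real model would use cosine between vectors."""
    return 1.0 if a == b else _TOY_SIM.get(frozenset({a, b}), 0.0)

def expand(x: str, f: Feature, k: float, h: float) -> Set[Feature]:
    """N_x(f): features (g_w, f_r) whose lexical item g_w is similar to f_w
    (sim > k) and which belong to some close neighbor y of x (sim > h).
    Following the worked example (gallop is in N_horse(gallop)), the original
    feature is kept in its own expansion."""
    f_w, f_r = f
    expanded = {f}
    for y, feats in F.items():
        if y == x or sim(x, y) <= h:
            continue
        for g_w, g_r in feats:
            if g_r == f_r and sim(f_w, g_w) > k:
                expanded.add((g_w, g_r))
    return expanded

def smoothed_intersection(x: str, y: str, k: float, h: float) -> Set[Feature]:
    """F'_x ∩̂ F_y (Equation 4): features of x whose expansion overlaps F_y."""
    return {f for f in F[x] if expand(x, f, k, h) & F[y]}

def ahyda(x: str, y: str, k: float = 0.8, h: float = 0.9) -> float:
    """Smoothed inclusion of x in y, normalized by |F_x| (our reading of Eq. 5)."""
    return len(smoothed_intersection(x, y, k, h)) / len(F[x]) if F[x] else 0.0

# The score is asymmetric: here horse's features expand into animal's feature
# set, but not vice versa.
print(ahyda("horse", "animal"), ahyda("animal", "horse"))
```

With these toy values, gallop expands into {gallop, run, move} through the neighbor lion, while cycle is excluded because bicycle is not a close neighbor of horse, mirroring the example above.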
3 Experiments and Evaluation

3.1 Distributional Space

Each lexical item u is represented with distributional features extracted from the TypeDM tensor (Baroni and Lenci, 2010). In TypeDM, distributional co-occurrences are represented as a weighted tuple structure, a set of pairs ((u, l, v), σ), such that u and v are lexical items, l is a syntagmatic co-occurrence link between u and v, and σ is the Local Mutual Information (Evert, 2005) computed on link type frequency. Hence, each lexical item u is represented in terms of features of the kind (l, v).

In addition to this sparse space, we also produced a dense space of 300 dimensions by reducing the matrix with Singular Value Decomposition (SVD). This additional space was used to retrieve neighbors during the smoothing operation, as it allowed us to perform faster and more accurate cosine calculations. The sparse space was instead employed to retrieve features and their weights.

3.2 Data set

Evaluation was carried out on a subset of the BLESS dataset (Baroni and Lenci, 2011), consisting of tuples expressing a relation between nouns. BLESS includes 200 English concrete nouns as target concepts, equally divided between living and non-living entities. For each concept noun, BLESS includes several relatum words, linked to the concept by one of the following 5 relations: COORD (i.e. co-hyponyms), HYPER (i.e. hypernyms), MERO (i.e. meronyms), ATTRI (i.e. attributes), and EVENT (i.e. verbs denoting events related to the target). BLESS also includes the relations RANDOM-N, RANDOM-J, and RANDOM-V, which relate the targets to control tuples with random noun, adjective, and verb relata, respectively.

By restricting the dataset to noun-noun tuples, we obtained a subset containing the relations COORD, HYPER, MERO, and RANDOM-N. We preprocessed the dataset in order to exclude lexical items that are not included in TypeDM. As reported in Table 2, the distribution (minimum, mean, and maximum number) of the relata of the BLESS concepts is not even across relations, and we therefore took this into account when evaluating our results.

relation    min    avg    max
coord         6   17.1     35
hyper         2    6.7     15
mero          2   14.7     53
ran-n        16   32.9     67

Table 2: Distribution (minimum, mean and maximum) of the relata of all BLESS concepts

3.3 Evaluation

We compared AHyDA with a number of directional similarity measures tested on BLESS, with the goal of evaluating their ability to discriminate hypernyms from other semantic relations, in particular co-hyponyms. Given a lexical item x, F_x is the set of its distributional features and w_x(f) is the weight of the feature f for the term x.

WeedsPrec quantifies the weighted inclusion of the features of a term x within the features of a term y (Weeds and Weir, 2003; Weeds et al., 2004; Kotlerman et al., 2010):

$$\mathrm{WeedsPrec}(x, y) = \frac{\sum_{f \in F_x \cap F_y} w_x(f)}{\sum_{f \in F_x} w_x(f)} \quad (6)$$

ClarkeDE is a variation of WeedsPrec, proposed by Clarke (2009):

$$\mathrm{ClarkeDE}(x, y) = \frac{\sum_{f \in F_x \cap F_y} \min(w_x(f), w_y(f))}{\sum_{f \in F_x} w_x(f)} \quad (7)$$

invCL is a measure introduced by Lenci and Benotto (2012) to take into account not only the inclusion of x in y but also the non-inclusion of y in x. It is defined as a function of ClarkeDE (CD):

$$\mathrm{invCL}(x, y) = \sqrt{\mathrm{CD}(x, y)\,(1 - \mathrm{CD}(y, x))} \quad (8)$$

We used the cosine as a baseline, since it is a symmetric similarity measure and is commonly used to evaluate semantic similarity/relatedness in DSMs. In the definition of N_x(f), the target and feature neighbors are identified with the cosine, setting the k and h parameters to 0.8 and 0.9 respectively.

To avoid biases due to the uneven distribution of relata among concepts, for each target x we computed the minimum and maximum number of items holding a relation with x, performed maximum/minimum random samplings in each of which every relation is represented by the minimum number of relata, and then averaged the results. For example, consider a situation where x has 3 hypernyms, 6 co-hyponyms, 6 meronyms, and 12 random nouns: the minimum number of relata for x is 3 and the maximum is 12, so we would draw 4 random samples of 3 relata per relation and average the results to obtain a single measurement for each relation.

We adopted the same evaluation methods described in Lenci and Benotto (2012): plotting the distribution of scores per relation across the BLESS concepts, and calculating Average Precision (AP).
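Our reading of this protocol can be sketched as follows; the data layout, the function names, and the choice of computing AP per relation over the mixed ranked pool of relata (as in Lenci and Benotto, 2012) are our assumptions, not the original evaluation code.

```python
import random
from typing import Callable, Dict, List

def average_precision(ranked_labels: List[bool]) -> float:
    """Standard AP: mean of the precision values at the ranks of positives."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def balanced_ap(concept: str,
                relata: Dict[str, List[str]],        # relation -> relatum words
                score: Callable[[str, str], float],  # e.g. AHyDA(concept, relatum)
                seed: int = 0) -> Dict[str, float]:
    """For one BLESS concept, draw max/min balanced subsamples of the relata,
    rank each subsample by the score, compute AP per relation, and average."""
    rng = random.Random(seed)
    sizes = [len(words) for words in relata.values()]
    n_min, n_max = min(sizes), max(sizes)
    n_samples = n_max // n_min          # e.g. 12 // 3 = 4 in the example above
    ap_sums = {rel: 0.0 for rel in relata}
    for _ in range(n_samples):
        # every relation contributes exactly n_min relata to this sample
        sample = {rel: rng.sample(words, n_min) for rel, words in relata.items()}
        pool = [(word, rel) for rel, words in sample.items() for word in words]
        ranked = sorted(pool, key=lambda pair: score(concept, pair[0]), reverse=True)
        for target_rel in relata:
            labels = [rel == target_rel for _, rel in ranked]
            ap_sums[target_rel] += average_precision(labels)
    return {rel: total / n_samples for rel, total in ap_sums.items()}
```

Per-relation mean AP over all BLESS concepts, as reported in Tables 3 and 4, would then be the average of these per-concept values.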
3.4 Results

Table 3 summarizes the Average Precision obtained by AHyDA, the other DIH-based measures, and the cosine. Although AHyDA's improvement in hypernym detection is not large, co-hyponyms get lower AP values, showing that smoothing the intersection allows a better discrimination between the two classes. It is worth remarking that the values for the other measures are generally higher than those reported by Lenci and Benotto (2012), because of the evaluation on balanced random samples of relations that we adopted. Table 4 reports the AP values obtained with the standard measures, without the feature augmentation procedure. Although the values for hypernyms do not change much, the main differences concern the COORD values, which are generally higher without feature augmentation.

measure      coord   hyper   mero   ran-n
Cosine        0.77    0.31   0.21    0.14
WeedsPrec     0.29    0.50   0.32    0.16
ClarkeDE      0.31    0.52   0.24    0.14
invCL         0.28    0.52   0.32    0.17
AHyDA         0.20    0.49   0.33    0.23

Table 3: Mean AP values for each semantic relation achieved by AHyDA and the other similarity scores

measure      coord   hyper   mero   ran-n
Cosine        0.77    0.32   0.21    0.14
WeedsPrec     0.34    0.51   0.28    0.15
ClarkeDE      0.36    0.51   0.27    0.16
invCL         0.31    0.51   0.29    0.16

Table 4: Mean AP values for each semantic relation achieved by the cited similarity scores, without employing feature augmentation

As mentioned in section 3.1, the results for all the measures are obtained using the sparse space; the reduced space was employed to compute the Cosine baseline. As regards the AP values for hypernyms, we must note that not all hypernyms in BLESS share the same status: some of them are what we would consider logical entailments (e.g. eagle → bird), others express taxonomic relations (e.g. alligator → chordate), and some are not true logical entailments (e.g. hawk →? predator).

Figure 2 shows the average scores produced by the new measure. Here hypernyms are neatly set apart from co-hyponyms, whereas the distance from meronyms and from the random control group is less significant.

Figure 2: Distribution of relata similarity scores obtained with AHyDA (values are concept-by-concept z-normalized scores)

Figure 3 shows the average scores produced by AHyDA when applied to the reversed hypernym pairs. It is interesting to notice that in this case AHyDA produces basically the same results as for random pairs. This suggests that AHyDA correctly predicts that hyponyms entail hypernyms, but not vice versa, thereby capturing the asymmetric nature of hypernymy.

Figure 3: Distribution of relata similarity scores obtained with AHyDA (values are concept-by-concept z-normalized scores), when tested on the inverse inclusion (i.e. hypernym does not entail hyponym)

4 Conclusion

The Distributional Inclusion Hypothesis has proven to be a viable approach to hypernym detection, but its original formulation rests on an assumption that does not take into consideration the actual usage of hypernyms in texts. In this paper we have shown that, by adding some further pragmatically inspired constraints, a better discrimination can be achieved between co-hyponyms and hypernyms. Our ongoing work focuses on refining the way in which the smoothing is performed, and on testing its performance on other datasets of semantic relations.
References

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10. Association for Computational Linguistics.

Daoud Clarke. 2009. Context-theoretic semantics for natural language: an overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 112–119. Association for Computational Linguistics.

Stefan Evert. 2005. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 107–114. Association for Computational Linguistics.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 539–545. Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 75–79. Association for Computational Linguistics.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 113–120. Association for Computational Linguistics.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of EACL 2014, pages 38–42.

Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 81–88. Association for Computational Linguistics.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, page 1015. Association for Computational Linguistics.