AHyDA: Automatic Hypernym Detection with feature Augmentation

Ludovica Pannitto (University of Pisa) ellepannitto@gmail.com
Lavinia Salicchi (University of Pisa) lavinia.salicchi@libero.it
Alessandro Lenci (University of Pisa) alessandro.lenci@unipi.it

Abstract

English. Several unsupervised methods for hypernym detection have been investigated in distributional semantics. Here we present a new approach based on a smoothed version of the distributional inclusion hypothesis. The new method improves hypernym detection, as shown by tests on the BLESS dataset.

Italiano. Building on the unsupervised methods available in the literature, we address the task of hypernym detection in distributional space. We introduce a new directional measure, based on an expansion of the distributional inclusion hypothesis, which improves hypernym detection, as we show by testing it on the BLESS dataset.

1 Introduction and related works

Within the Distributional Semantics framework, semantic similarity between words is usually expressed in terms of proximity in a semantic space, whose dimensions represent, at some level of abstraction, the contexts in which the words occur.

Our intuitions about the meaning of words license inferences of the kind shown in example (1), and we expect Distributional Semantic Models (DSMs) to support such inferences:

(1) a. Wilbrand invented TNT → Wilbrand uncovered TNT
    b. A horse ran → An animal moved

The type of relation between semantically similar lexemes may differ significantly, but DSMs only account for a generic notion of semantic relatedness. Furthermore, not all lexical relations are symmetrical (see example (2)), while most of the similarity measures defined in distributional semantics, like the cosine, are.

(2) a. I saw a dog → I saw an animal
    b. I saw an animal ↛ I saw a dog

Hypernymy is an asymmetric relation. Automatic hypernym identification is a well-known task in the literature, which has mostly been addressed with semi-supervised, pattern-based approaches (Hearst, 1992; Pantel and Pennacchiotti, 2006). Various unsupervised models have also been proposed (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Lenci and Benotto, 2012; Santus et al., 2014), based on the notion of Distributional Generality (Weeds et al., 2004) and on the Distributional Inclusion Hypothesis (DIH) (Geffet and Dagan, 2005) that was derived from it.

1.1 The pitfalls of the DIH

The DIH aims at providing a distributional correlate of the extensional definition of hyponymy in terms of set inclusion: x is a hyponym of y iff the extension of x (i.e. the set of entities denoted by x) is a subset of the extension of y. The DIH turns this into the assumption that a significant number of the most salient contexts of x should also appear among the salient contexts of y. While this is consistent with the logical inferences licensed by hyponymy (cf. (2)), it does not take into account the actual usage of hypernyms with respect to hyponyms. Consider for instance the following examples:

(3) a. A horse gallops →? An animal gallops
    b. A dog barks →? An animal barks

These inferences are truth-conditionally valid: whenever the antecedent is true, the consequent is also true. However, they are not equally "pragmatically" sound.
Indeed, the fact that one uses a sentence like A dog barks does not entail that in the same situation one would also use the sentence An animal barks. The latter sentence would be pragmatically appropriate only when one knows that something is barking without knowing which animal is producing the sound. This condition, however, hardly ever applies, since barking is a very typical feature of dogs: knowing that something is barking typically entails knowing that it is a dog, because we know that barking is something dogs do. The same argument applies to the case of horse and galloping.

The problem of the DIH is that the assumption it rests on, namely that the most typical contexts of the hyponym are also typical contexts of the hypernym, is not borne out in actual language use because of pragmatic constraints. The most typical contexts of a hyponym are not necessarily typical contexts of its hypernym. This is also shown by a simple inspection of corpus data, as reported in Table 1. Although animal (161,107 occurrences) is more frequent than dog (128,765) and horse (90,437), its co-occurrence counts with bark and gallop are much lower than those of the hyponyms: bark and gallop are not typical contexts of animal.

          horse    dog    animal
gallop      216      –         7
bark          –    869        16

Table 1: Co-occurrence frequency distribution extracted from the ukWaC corpus

If the inferences in (3) are pragmatically odd, the following ones are instead fully acceptable:

(4) a. A horse gallops → An animal moves
    b. A dog barks → An animal calls

Salient features of the hypernym are indeed supposed to be semantically more general than the salient features of the hyponym. Santus et al. (2014) tried to capture this fact by abandoning the DIH and introducing an entropy-based measure to estimate the informativeness of the hypernym and hyponym contexts, under the assumption that the former have a higher entropy because they are more general (e.g. move vs. gallop).

In this paper, we address the same issue by amending the DIH, to make it more consistent with the actual distributional properties of hyponyms and hypernyms. We introduce AHyDA (Automatic Hypernym Detection with feature Augmentation), a smoothed version of the DIH: given a context feature f that is salient for a lexical item x, we expect co-hyponyms of x to have some feature g that is similar to f, and a hypernym of x to have a number of these clusters of features. To remain in the domain of animal sounds, we expect a dog to bark, a duck to quack, and an animal either to produce one of those sounds or to co-occur with a more general sound-emission verb.

2 AHyDA: Smoothing the DIH

All the measures implementing the DIH are based on computing the (weighted) intersection of the features of the hyponym and the hypernym, which is then typically normalized by the features of the hyponym. AHyDA essentially proposes a new way to compute the intersection of the hyponym and hypernym contexts. Given a lexical item x, we call F_x the set of its distributional features. Note that features need not be pure lexical items: in general, we define a feature f as a pair (f_w, f_r), where f_w is typically a lexical item and f_r is any additional contextual information, in the present case a pattern occurring between x and f_w, as explained in section 3.1. The core novelty of AHyDA is to use a smoothed version of F_x, called F'_x.

The idea is illustrated in Figure 1, which provides a simplified graphical example of the intersection operation. Consider a case where the target horse has some feature with gallop as its lexical item, for example f = (gallop, sbj), meaning that horse is a possible subject of gallop. Given what we have said in Section 1.1, we do not expect animal to share this horse-specific property. So, instead of looking for this particular feature among the features of animal, we generate a new set N_horse(gallop) of features g = (g_w, f_r) such that g_w is a neighbor of gallop and is a feature (with the same syntactic relation sbj) of some neighbor of horse.
Suppose that run, move, and cycle are neighbors of gallop. Since run and move are also features of some neighbor of horse (e.g., lion), we would have N_horse(gallop) = {gallop, run, move}. Conversely, since cycle is not a feature of any close neighbor of horse, it is not included in the expanded feature set.

Figure 1: An example of smoothed intersection. Black arrows indicate semantic similarity with gallop; the items with the blue background are the ones included in N_horse(gallop).

Mathematically, we define the expanded feature set F'_x as follows:

$$F'_x = \{(f, N_x(f)) \;\; \forall f \in F_x\} \quad (1)$$

$$N_x(f) = \{g \mid g = (g_w, f_r)\} \quad (2)$$

where the following conditions hold for g:

$$d(f_w, g_w) < k \;\wedge\; \exists y .\, d(x, y) < h \wedge g \in F_y \quad (3)$$

where d(x, y) is any distance measure in the semantic space, and k and h are empirically set threshold values. N_x(f) is generated by looking for features g whose lexical item is similar to f_w; we then check whether this new feature is shared by some neighbor of the target x, and if so we include g in N_x(f). This allows us to redefine the intersection operation between F'_x and F_y as:

$$F'_x \;\hat{\cap}\; F_y = \{f \mid f \in F_x \wedge N_x(f) \cap F_y \neq \emptyset\} \quad (4)$$

When expanding a feature f into N_x(f), we expect to find in N_x(f) features that express the same "property" in different ways. We expect these features to be shared by hypernyms more than by co-hyponyms, because hypernyms are supposed to collect features from all their hyponyms, while co-hyponyms lack those of other co-hyponyms (e.g. lions run but do not gallop). AHyDA is thus defined as follows:

$$\mathrm{AHyDA}(x, y) = \frac{\sum_{f \in F_x} |F'_x \;\hat{\cap}\; F_y|}{|F_x|} \quad (5)$$

Importantly, AHyDA only considers the average cardinality of the intersections, without looking at the feature weights. Moreover, the formula is asymmetric (like the others implementing the DIH), and is therefore suitable to capture the asymmetric nature of hypernymy.
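To make the procedure concrete, the Python sketch below implements one natural reading of Equations (1)–(5): for every feature of x, an expansion set N_x(f) is built from neighbors of the feature's lexical item that are themselves features of close neighbors of x, and AHyDA is the fraction of features of x whose expansion overlaps F_y. The toy feature sets, the sim function, and all names are ours, purely for illustration; k and h are treated as lower bounds on cosine similarity, as in the parameter setting of Section 3.3, and real features would come from TypeDM.

```python
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]  # (f_w, f_r), e.g. ("gallop", "sbj")

# Toy feature sets F_x (purely illustrative; in the experiments the features
# are TypeDM co-occurrences with Local Mutual Information weights).
F: Dict[str, Set[Feature]] = {
    "horse":   {("gallop", "sbj"), ("ride", "obj"), ("neigh", "sbj")},
    "lion":    {("run", "sbj"), ("move", "sbj"), ("hunt", "sbj")},
    "animal":  {("move", "sbj"), ("run", "sbj"), ("eat", "sbj")},
    "bicycle": {("cycle", "sbj"), ("ride", "obj")},
}

# Toy similarity scores standing in for the cosine in the dense space.
_TOY_SIM = {
    frozenset({"horse", "lion"}): 0.95,
    frozenset({"horse", "animal"}): 0.88,
    frozenset({"gallop", "run"}): 0.93,
    frozenset({"gallop", "move"}): 0.91,
    frozenset({"gallop", "cycle"}): 0.85,
}

def sim(a: str, b: str) -> float:
    """Illustrative similarity; a real model would use cosine between vectors."""
    return 1.0 if a == b else _TOY_SIM.get(frozenset({a, b}), 0.0)

def expand(x: str, f: Feature, k: float, h: float) -> Set[Feature]:
    """N_x(f): features (g_w, f_r) whose lexical item g_w is similar to f_w
    (sim > k) and which belong to some close neighbor y of x (sim > h).
    Following the worked example (gallop is in N_horse(gallop)), the original
    feature is kept in its own expansion."""
    f_w, f_r = f
    expanded = {f}
    for y, feats in F.items():
        if y == x or sim(x, y) <= h:
            continue
        for g_w, g_r in feats:
            if g_r == f_r and sim(f_w, g_w) > k:
                expanded.add((g_w, g_r))
    return expanded

def smoothed_intersection(x: str, y: str, k: float, h: float) -> Set[Feature]:
    """F'_x ∩̂ F_y (Equation 4): features of x whose expansion overlaps F_y."""
    return {f for f in F[x] if expand(x, f, k, h) & F[y]}

def ahyda(x: str, y: str, k: float = 0.8, h: float = 0.9) -> float:
    """Smoothed inclusion of x in y, normalized by |F_x| (our reading of Eq. 5)."""
    return len(smoothed_intersection(x, y, k, h)) / len(F[x]) if F[x] else 0.0

# The score is asymmetric: here horse's features expand into animal's feature
# set, but not vice versa.
print(ahyda("horse", "animal"), ahyda("animal", "horse"))
```

With these toy values, gallop expands into {gallop, run, move} through the neighbor lion, while cycle is excluded because bicycle is not a close neighbor of horse, mirroring the example above.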
3 Experiments and Evaluation

3.1 Distributional Space

Each lexical item u is represented with distributional features extracted from the TypeDM tensor (Baroni and Lenci, 2010). In TypeDM, distributional co-occurrences are represented as a weighted tuple structure, a set of pairs ((u, l, v), σ), such that u and v are lexical items, l is a syntagmatic co-occurrence link between u and v, and σ is the Local Mutual Information (Evert, 2005) computed on link type frequency. Hence, each lexical item u is represented in terms of features of the kind (l, v).

In addition to this sparse space, we also produced a dense space of 300 dimensions by reducing the matrix with Singular Value Decomposition (SVD). This additional space was used to retrieve neighbors during the smoothing operation, as it allowed us to perform faster and more accurate cosine calculations. The sparse space was instead employed to retrieve features and their weights.

3.2 Data set

Evaluation was carried out on a subset of the BLESS dataset (Baroni and Lenci, 2011), consisting of tuples expressing a relation between nouns. BLESS includes 200 English concrete nouns as target concepts, equally divided between living and non-living entities. For each concept noun, BLESS includes several relatum words, linked to the concept by one of the following 5 relations: COORD (i.e. co-hyponyms), HYPER (i.e. hypernyms), MERO (i.e. meronyms), ATTRI (i.e. attributes), and EVENT (i.e. verbs denoting events related to the target). BLESS also includes the relations RANDOM-N, RANDOM-J, and RANDOM-V, which relate the targets to control tuples with random noun, adjective, and verb relata, respectively.

By restricting the dataset to noun-noun tuples, we obtained a subset containing the relations COORD, HYPER, MERO, and RANDOM-N. We preprocessed the dataset in order to exclude lexical items that are not included in TypeDM. As reported in Table 2, the distribution (minimum, mean, and maximum number) of the relata of the BLESS concepts is not even across relations, and we therefore took this into account when evaluating our results.

relation    min    avg    max
coord         6   17.1     35
hyper         2    6.7     15
mero          2   14.7     53
ran-n        16   32.9     67

Table 2: Distribution (minimum, mean and maximum) of the relata of all BLESS concepts

3.3 Evaluation

We compared AHyDA with a number of directional similarity measures tested on BLESS, with the goal of evaluating their ability to discriminate hypernyms from other semantic relations, in particular co-hyponyms. Given a lexical item x, F_x is the set of its distributional features and w_x(f) is the weight of the feature f for the term x.

WeedsPrec quantifies the weighted inclusion of the features of a term x within the features of a term y (Weeds and Weir, 2003; Weeds et al., 2004; Kotlerman et al., 2010):

$$\mathrm{WeedsPrec}(x, y) = \frac{\sum_{f \in F_x \cap F_y} w_x(f)}{\sum_{f \in F_x} w_x(f)} \quad (6)$$

ClarkeDE is a variation of WeedsPrec, proposed by Clarke (2009):

$$\mathrm{ClarkeDE}(x, y) = \frac{\sum_{f \in F_x \cap F_y} \min(w_x(f), w_y(f))}{\sum_{f \in F_x} w_x(f)} \quad (7)$$

invCL is a measure introduced by Lenci and Benotto (2012) to take into account not only the inclusion of x in y but also the non-inclusion of y in x. It is defined as a function of ClarkeDE (CD):

$$\mathrm{invCL}(x, y) = \sqrt{\mathrm{CD}(x, y)\,(1 - \mathrm{CD}(y, x))} \quad (8)$$

We used the cosine as a baseline, since it is a symmetric similarity measure and is commonly used to evaluate semantic similarity/relatedness in DSMs. In the definition of N_x(f), the target and feature neighbors are identified with the cosine, setting the k and h parameters to 0.8 and 0.9 respectively.

To avoid biases due to the uneven distribution of relata among concepts, for each target x we computed the minimum and maximum number of items holding a relation with x, performed maximum/minimum random samplings in each of which every relation is represented by the minimum number of relata, and then averaged the results. For example, consider a situation where x has 3 hypernyms, 6 co-hyponyms, 6 meronyms, and 12 random nouns: the minimum number of relata for x is 3 and the maximum is 12, so we would draw 4 random samples of 3 relata per relation and average the results to obtain a single measurement for each relation.

We adopted the same evaluation methods described in Lenci and Benotto (2012): plotting the distribution of scores per relation across the BLESS concepts, and calculating Average Precision (AP).
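Our reading of this protocol can be sketched as follows; the data layout, the function names, and the choice of computing AP per relation over the mixed ranked pool of relata (as in Lenci and Benotto, 2012) are our assumptions, not the original evaluation code.

```python
import random
from typing import Callable, Dict, List

def average_precision(ranked_labels: List[bool]) -> float:
    """Standard AP: mean of the precision values at the ranks of positives."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def balanced_ap(concept: str,
                relata: Dict[str, List[str]],        # relation -> relatum words
                score: Callable[[str, str], float],  # e.g. AHyDA(concept, relatum)
                seed: int = 0) -> Dict[str, float]:
    """For one BLESS concept, draw max/min balanced subsamples of the relata,
    rank each subsample by the score, compute AP per relation, and average."""
    rng = random.Random(seed)
    sizes = [len(words) for words in relata.values()]
    n_min, n_max = min(sizes), max(sizes)
    n_samples = n_max // n_min          # e.g. 12 // 3 = 4 in the example above
    ap_sums = {rel: 0.0 for rel in relata}
    for _ in range(n_samples):
        # every relation contributes exactly n_min relata to this sample
        sample = {rel: rng.sample(words, n_min) for rel, words in relata.items()}
        pool = [(word, rel) for rel, words in sample.items() for word in words]
        ranked = sorted(pool, key=lambda pair: score(concept, pair[0]), reverse=True)
        for target_rel in relata:
            labels = [rel == target_rel for _, rel in ranked]
            ap_sums[target_rel] += average_precision(labels)
    return {rel: total / n_samples for rel, total in ap_sums.items()}
```

Per-relation mean AP over all BLESS concepts, as reported in Tables 3 and 4, would then be the average of these per-concept values.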
3.4 Results

Table 3 summarizes the Average Precision obtained by AHyDA, the other DIH-based measures, and the cosine. Although AHyDA's improvement in hypernym detection is not large, co-hyponyms get lower AP values, showing that smoothing the intersection allows a better discrimination between the two classes. It is worth remarking that the values for the other measures are generally higher than those reported by Lenci and Benotto (2012), because of the evaluation on balanced random samples of relations that we adopted. Table 4 reports the AP values obtained with the standard measures, without the feature augmentation procedure. Although the values for hypernyms do not change much, the main differences concern the COORD values, which are generally higher without feature augmentation.

measure      coord   hyper   mero   ran-n
Cosine        0.77    0.31   0.21    0.14
WeedsPrec     0.29    0.50   0.32    0.16
ClarkeDE      0.31    0.52   0.24    0.14
invCL         0.28    0.52   0.32    0.17
AHyDA         0.20    0.49   0.33    0.23

Table 3: Mean AP values for each semantic relation achieved by AHyDA and the other similarity scores

measure      coord   hyper   mero   ran-n
Cosine        0.77    0.32   0.21    0.14
WeedsPrec     0.34    0.51   0.28    0.15
ClarkeDE      0.36    0.51   0.27    0.16
invCL         0.31    0.51   0.29    0.16

Table 4: Mean AP values for each semantic relation achieved by the cited similarity scores, without employing feature augmentation

As mentioned in section 3.1, the results for all the measures are obtained using the sparse space; the reduced space was employed to compute the Cosine baseline. As regards the AP values for hypernyms, we must note that not all hypernyms in BLESS share the same status: some of them are what we would consider logical entailments (e.g. eagle → bird), others express taxonomic relations (e.g. alligator → chordate), and some are not true logical entailments (e.g. hawk →? predator).

Figure 2 shows the average scores produced by the new measure. Here hypernyms are neatly set apart from co-hyponyms, whereas the distance from meronyms and from the random control group is less significant.

Figure 2: Distribution of relata similarity scores obtained with AHyDA (values are concept-by-concept z-normalized scores)

Figure 3 shows the average scores produced by AHyDA when applied to the reversed hypernym pairs. It is interesting to notice that in this case AHyDA produces basically the same results as for random pairs. This suggests that AHyDA correctly predicts that hyponyms entail hypernyms, but not vice versa, thereby capturing the asymmetric nature of hypernymy.

Figure 3: Distribution of relata similarity scores obtained with AHyDA (values are concept-by-concept z-normalized scores), when tested on the inverse inclusion (i.e. hypernym does not entail hyponym)

4 Conclusion

The Distributional Inclusion Hypothesis has proven to be a viable approach to hypernym detection, but its original formulation rests on an assumption that does not take into consideration the actual usage of hypernyms in texts. In this paper we have shown that, by adding some further pragmatically inspired constraints, a better discrimination can be achieved between co-hyponyms and hypernyms. Our ongoing work focuses on refining the way in which the smoothing is performed, and on testing its performance on other datasets of semantic relations.
References

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10. Association for Computational Linguistics.

Daoud Clarke. 2009. Context-theoretic semantics for natural language: an overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 112–119. Association for Computational Linguistics.

Stefan Evert. 2005. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 107–114. Association for Computational Linguistics.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 539–545. Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 75–79. Association for Computational Linguistics.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 113–120. Association for Computational Linguistics.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of EACL 2014, pages 38–42.

Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 81–88. Association for Computational Linguistics.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, page 1015. Association for Computational Linguistics.