=Paper=
{{Paper
|id=Vol-2481/paper30
|storemode=property
|title=WebIsAGraph: A Very Large Hypernymy Graph from a Web Corpus
|pdfUrl=https://ceur-ws.org/Vol-2481/paper30.pdf
|volume=Vol-2481
|authors=Stefano Faralli,Irene Finocchi,Simone Paolo Ponzetto,Paola Velardi
|dblpUrl=https://dblp.org/rec/conf/clic-it/0001FPV19
}}
==WebIsAGraph: A Very Large Hypernymy Graph from a Web Corpus==
Stefano Faralli (1), Irene Finocchi (2), Simone Paolo Ponzetto (3), Paola Velardi (2)

(1) University of Rome Unitelma Sapienza, Italy. stefano.faralli@unitelmasapienza.it
(2) University of Rome Sapienza, Italy. irene.finocchi@uniroma1.it, velardi@di.uniroma1.it
(3) University of Mannheim, Germany. simone@informatik.uni-mannheim.de
Abstract

In this paper, we present WebIsAGraph, a very large hypernymy graph compiled from a dataset of is-a relationships extracted from the CommonCrawl. We provide the resource together with a Neo4j plugin to enable efficient searching and querying over such a large graph. We use WebIsAGraph to study the problem of detecting polysemous terms in a noisy terminological knowledge graph, thus quantifying the degree of polysemy of terms found in is-a extractions from Web text.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Acquiring concept hierarchies, i.e., taxonomies, from text is a long-standing problem in Natural Language Processing (NLP). Much previous work leveraged lexico-syntactic patterns, which can be either manually defined (Hearst, 1992) or automatically learned (Shwartz et al., 2016). Pattern-based methods were shown by Roller et al. (2018) to outperform distributional methods, and can be complemented with state-of-the-art meaning representations such as hyperbolic embeddings (Nickel and Kiela, 2017) to infer missing is-a relations and filter wrong extractions (Le et al., 2019). Complementary to these efforts, researchers looked at ways to scale hypernymy detection to very large, i.e., Web-scale, corpora (Wu et al., 2012). Recently, Seitner et al. (2016) applied Hearst patterns to the CommonCrawl (http://commoncrawl.org) to produce the WebIsADb. Using Web corpora makes it possible to produce hundreds of millions of is-a triples: the extractions, however, include many false positives and cycles (Ristoski et al., 2017).

Methods for hypernym detection, e.g., pattern-based approaches, have a limitation in that they do not necessarily produce proper taxonomies (Camacho-Collados, 2017): automatically detected is-a relationships, on the other hand, can be used as input to taxonomy induction algorithms (Velardi et al., 2013; Faralli et al., 2017; Faralli et al., 2018, inter alia). These algorithms rely on the topology of the input graph and, therefore, cannot be applied 'as-is' to Web-scale resources like WebIsADb, since this resource merely consists of a set of triples. Moreover, WebIsADb does not contain fully semantified triples, i.e., subjects and objects of the is-a relationships consist of potentially ambiguous terminological nodes. This is because, due to their large size, source input corpora like the CommonCrawl cannot be semantified upfront. Linking to the semantic vocabulary of a reference resource like DBpedia (Hertling and Paulheim, 2017) also barely mitigates this problem, since Wikipedia-centric knowledge bases do not have, and cannot be expected to have, complete coverage over Web data (Lin et al., 2012).

In this paper, we present an initial solution to these problems by building the first very large hypernymy graph, dubbed WebIsAGraph, built from is-a relationships extracted from a Web-scale corpus. This is a relevant task: although WordNet (and other thesauri) already provides a catalog of ambiguous terms, many nodes of WebIsAGraph are not covered in available lexicographic resources, because they are proper names, technical terms, or polysemantic words. Our graph, which we make freely available to the research community to foster further work on Web-scale knowledge acquisition, is built from the WebIsADb on top of state-of-the-art graph mining tools (Neo4j: https://neo4j.com/): thanks to an accompanying plugin, it can be easily searched, queried, and explored.
WebIsAGraph may represent an opportunity for researchers to investigate approaches to a variety of tasks on large, automatically acquired term tuples. As an example, we use our resource to investigate the problem of identifying ambiguous terminological nodes. To automatically detect whether a lexicographic node is ambiguous or not, we use information from both the graph (topological features) and textual labels (word embeddings) as features to train a model using supervised learning. Our results provide a first estimate of the degree of polysemy that can be found among is-a relationships from the Web.
2 Creating WebIsAGraph

We created a directed hypernymy graph from the WebIsADb (Seitner et al., 2016). WebIsADb is a Web-scale collection of noisy hypernymy relations harvested with 58 extraction patterns and consisting of 607,621,170 tuples. Since the aim of WebIsADb was to study the behaviour (on a large scale) of Hearst-like extraction patterns, rather than to collect relations with high precision, in order to reduce noise (false positives) we pre-selected the top-20 most precise extraction patterns of Seitner et al. (2016) from the original 58, identifying 385,459,302 tuples.

After removing matches with a frequency lower than 3 and isolated nodes, i.e., nodes with degree equal to 0, we obtained a directed graph consisting of 33,030,457 nodes and 65,681,899 directed edges (see Table 1). The generation of such a large graph required several weeks of computation on a quad-core machine with 32 GB of RAM, using a state-of-the-art graph-db system, Neo4j. Note that the inherently sequential nature of the task of indexing tuples, nodes, and edges does not benefit from the use of parallel computation. Next, we developed efficient tools for graph querying, which are released to the community and described at https://sites.google.com/unitelmasapienza.it/webisagraph/, where we also include examples of queries.

| Statistic | WebIsAGraph |
|---|---|
| nodes | 33,030,457 |
| edges | 65,681,899 |
| weakly connected components | 3,099,898 |
| nodes of largest component | 26,099,001 |
| avg. node degree | 3.97 |

Table 1: Structural statistics of WebIsAGraph.
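For illustration, the following is a minimal sketch of how the graph could be queried from Python through the official Neo4j driver. It is only a sketch: the node label `Term`, the relationship type `IS_A`, and the property `label` are hypothetical placeholders, since the actual schema and the accompanying plugin procedures are documented on the website above.

```python
from neo4j import GraphDatabase

# Hypothetical schema: (:Term {label}) -[:IS_A]-> (:Term {label}).
# Adjust label/type names to the schema documented on the WebIsAGraph site.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def hypernyms(term, limit=20):
    """Return up to `limit` hypernym labels of `term`."""
    query = (
        "MATCH (t:Term {label: $label})-[:IS_A]->(h:Term) "
        "RETURN h.label AS hypernym LIMIT $limit"
    )
    with driver.session() as session:
        return [record["hypernym"]
                for record in session.run(query, label=term, limit=limit)]

print(hypernyms("floppy disk"))
```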
3 Measuring the polysemy of WebIsAGraph

Let pSI(n) be the function that predicts whether a terminological node n corresponds to a monosemous or a polysemous concept. We leverage a companion sense inventory as a ground truth, and we train different classifiers with a combination of topological and textual features, described hereafter.

Topological features. Our conjecture is that, in a taxonomy-like terminological graph (even a noisy one), there is a correlation between the mutual connectivity of a node's neighborhoods and its polysemy. For example, consider the polysemous word machine, which, according to WordNet, has at least six heterogeneous meanings, ranging from 'any mechanical or electrical device' to 'a group that controls the activities of a political party', and the monosemous word floppy disk. We expect to observe a different degree of mutual connectivity across the corresponding incoming and outgoing nodes. In particular, for monosemous words, we expect a higher mutual connectivity. With reference to Figure 1, the two hypernyms of "floppy disk", "memory" and "data storage", also have "RAM" as a common hyponym. In contrast, nodes in the direct neighborhood of "machine" (leftmost graph in Figure 1) do not have mutual connections.

Our aim is thus to identify topological features that may help quantify the previously described connectivity properties. To cope with scalability, we consider topological features built on top of 1-hop/2-hop sub-graphs of a node n. Hence, we identify two induced sub-graphs G−+(n) and G+−(n), induced on V−+(n) = In(n) ∪ ⋃_{v∈In(n)} Out(v) and V+−(n) = Out(n) ∪ ⋃_{v∈Out(n)} In(v) respectively, where In(x) and Out(x) are the sets of incoming and outgoing nodes of x (including x itself). Next, we remove the node n from these sub-graphs and compute the following features (a code sketch follows the worked example below):

• ccG−+(n) and ccG+−(n): the resulting number of weakly connected components;
• vG−+(n) and vG+−(n): the resulting number of nodes;
• eG−+(n) and eG+−(n): the resulting number of edges.
[Figure 1: An example excerpt of the neighborhood induced sub-graphs for "machine" and "floppy disk": (a) G−+(n) in gray and (b) G+−(n) in dark gray. Dashed edges connect each n with its hypernyms and hyponyms.]

With reference to the example in Figure 1, the light gray sub-graph (a) is G−+(n) and the dark gray sub-graph (b) is G+−(n). Furthermore, for n = "machine": ccG−+(n) = 2, ccG+−(n) = 2, vG−+(n) = 5, vG+−(n) = 5, eG−+(n) = 3, and eG+−(n) = 3, while for n = "floppy disk": ccG−+(n) = 1, ccG+−(n) = 1, vG−+(n) = 4, vG+−(n) = 2, eG−+(n) = 3, and eG+−(n) = 1.
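As an illustration of the six features above, here is a minimal sketch assuming the hypernymy graph fits in memory as a networkx DiGraph; the paper's own pipeline computes these values through its Neo4j query tools instead. A function like this can be sanity-checked against the "machine" / "floppy disk" values reported above.

```python
import networkx as nx

def topological_features(G: nx.DiGraph, n):
    """Compute cc, v, e for the induced sub-graphs G-+(n) and G+-(n)."""
    def In(x):            # incoming nodes of x, including x itself
        return set(G.predecessors(x)) | {x}
    def Out(x):           # outgoing nodes of x, including x itself
        return set(G.successors(x)) | {x}

    # V-+(n): since v is in Out(v), the union over Out(v) already covers In(n)
    V_mp = set().union(*(Out(v) for v in In(n)))
    # V+-(n): the symmetric construction on the outgoing side
    V_pm = set().union(*(In(v) for v in Out(n)))

    feats = {}
    for tag, nodes in (("G-+", V_mp), ("G+-", V_pm)):
        sub = G.subgraph(nodes - {n})             # remove n itself
        feats["cc" + tag] = nx.number_weakly_connected_components(sub)
        feats["v" + tag] = sub.number_of_nodes()
        feats["e" + tag] = sub.number_of_edges()
    return feats
```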
Textual features. Similarly to the topological features, our hypothesis is that the textual features of the neighborhood nodes should exhibit a lower average similarity when n is polysemous. We extract textual features on top of pre-trained word embeddings, widely adopted in many NLP-related tasks (Camacho-Collados and Pilehvar, 2018). Formally, given a node n:

• W(n) is the word embedding vector of n, computed as follows:

$$\vec{W}(n) = \frac{\sum_{t \in tokens(n)} \vec{we}(t)}{|tokens(n)|} \qquad (1)$$

where tokens(n) is the function that retrieves the set of tokens composing the word n (e.g., if n = hot dog, then tokens(n) = {hot, dog}), and \vec{we}(t) is a pre-trained word embedding vector;

• Δin(n) and Δout(n): the cosine similarity between W(n) and the average word embedding vector of the incoming and outgoing nodes of n, respectively:

$$\Delta_{in}(n) = CosSim\left(\vec{W}(n),\; \frac{\sum_{m \in In(n)} \vec{W}(m)}{|In(n)|}\right) \qquad (2)$$

$$\Delta_{out}(n) = CosSim\left(\vec{W}(n),\; \frac{\sum_{m \in Out(n)} \vec{W}(m)}{|Out(n)|}\right) \qquad (3)$$

• Gini(n): the sparsity index (David, 1968) of \vec{W}(n).
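A minimal numpy sketch of these textual features, assuming `we` is a dict mapping tokens to pre-trained GloVe vectors; the Gini computation uses one common formulation of the index, since the paper does not spell out its exact variant:

```python
import numpy as np

def W(term, we, dim=300):
    """Eq. (1): average of the token embeddings of a node label."""
    vecs = [we[t] for t in term.split() if t in we]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cos_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def delta(term, neighbors, we):
    """Eqs. (2)-(3): cosine similarity between W(n) and the average
    embedding of its incoming (or outgoing) neighbor labels."""
    if not neighbors:
        return 0.0
    avg = np.mean([W(m, we) for m in neighbors], axis=0)
    return cos_sim(W(term, we), avg)

def gini(v):
    """Gini sparsity index of a vector (one common formulation)."""
    x = np.sort(np.abs(v))
    k = np.arange(1, x.size + 1)
    s = x.sum()
    return float(((2 * k - x.size - 1) * x).sum()) / (x.size * s) if s else 0.0
```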
3.1 Evaluation

Computing features. Topological features are efficiently extracted using the query tool mentioned in Section 2. To compute textual features (see Section 3), we use the GloVe pre-trained word embeddings (Pennington et al., 2014) of dimension 300, trained on the CommonCrawl (https://nlp.stanford.edu/projects/glove/).

By combining these two types of features (topological and textual), we obtained three different input vector representations, consisting of 6 (only topological features), 303 (only textual features), and 309 (textual and topological features) dimensions, respectively.

Finally, we created three "ground truth" sets of nodes in the graph for which pSI(n) is known. We selected a balanced number of monosemous and polysemous nouns, using the following sense inventories: i) WordNet (14,659 examples); ii) DBpedia (17,041 examples); iii) WordNet and DBpedia (31,701 examples).
Algorithms. We compared four algorithms:

• Random (Rnd): a random baseline which randomly classifies the ambiguity of a node;

• Neural Network (NN): a neural network with a Softmax activation function in the output layer and dropout (Srivastava et al., 2014);

• two ensemble-based learning algorithms, namely AdaBoost (ABC) (Zhu et al., 2009) and Gradient Boosting (GBC) (Friedman, 2001): both have been shown to have high predictive accuracy (Kotsiantis et al., 2006) and are good competitors of neural methods, especially on very large datasets.

Parameter selection. Based on an Area Under the ROC Curve (AUC) analysis (Kim et al., 2017), NN parameters have been empirically set as follows: i) when testing only with topological features (6 dimensions), we use 2 hidden layers with 4 and 2 neurons respectively, and dropout rates of 0.2 and 0.15; ii) when using only textual (303 dimensions) or combined textual and topological features (309 dimensions), we use 4 hidden layers with 128, 64, 32 and 8 neurons respectively, and dropout rates of 0.3, 0.25, 0.2 and 0.15. A sketch of the smaller configuration is given below.
Results. We show in Table 2 the resulting precision, recall and F1 of the four systems across the ground-truth datasets and for the combinations of features (see Section 3). The metrics are averaged over five classification experiments, each with a random split (85% train, 10% validation, 5% test) of the ground truth sets. As shown in Table 2, NN outperforms the ensemble methods, obtaining an F1 score around 0.70. The comparison of performances across the three combinations of features reveals that topological features are not enough to build a model for polysemy classification, but they can slightly boost the already compelling overall performance of the word-embedding-based features.

| Ground truth | Algo. | topo. P | topo. R | topo. F1 | text. P | text. R | text. F1 | all P | all R | all F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| WordNet | Rnd | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 |
| WordNet | NN | 0.61±.02 | 0.61±.02 | 0.60±.02 | 0.72±.03 | 0.72±.03 | 0.72±.02 | 0.73±.04 | 0.73±.04 | 0.73±.04 |
| WordNet | ABC | 0.62±.03 | 0.62±.03 | 0.62±.03 | 0.67±.02 | 0.67±.02 | 0.67±.01 | 0.70±.03 | 0.70±.03 | 0.70±.03 |
| WordNet | GBC | 0.62±.02 | 0.62±.02 | 0.62±.02 | 0.69±.01 | 0.68±.01 | 0.68±.01 | 0.72±.03 | 0.71±.03 | 0.71±.03 |
| DBpedia | Rnd | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 |
| DBpedia | NN | 0.60±.01 | 0.60±.01 | 0.59±.01 | 0.73±.03 | 0.73±.03 | 0.73±.03 | 0.74±.03 | 0.74±.03 | 0.74±.03 |
| DBpedia | ABC | 0.60±.02 | 0.60±.02 | 0.60±.02 | 0.69±.02 | 0.69±.02 | 0.69±.02 | 0.71±.04 | 0.71±.04 | 0.71±.04 |
| DBpedia | GBC | 0.61±.02 | 0.61±.02 | 0.61±.02 | 0.70±.03 | 0.70±.03 | 0.70±.03 | 0.73±.02 | 0.73±.02 | 0.73±.02 |
| WordNet∪DBpedia | Rnd | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 |
| WordNet∪DBpedia | NN | 0.54±.03 | 0.53±.05 | 0.50±.12 | 0.70±.02 | 0.70±.02 | 0.70±.02 | 0.71±.02 | 0.70±.01 | 0.70±.01 |
| WordNet∪DBpedia | ABC | 0.55±.02 | 0.55±.02 | 0.55±.02 | 0.64±.01 | 0.64±.01 | 0.64±.01 | 0.65±.02 | 0.65±.02 | 0.65±.02 |
| WordNet∪DBpedia | GBC | 0.56±.02 | 0.56±.01 | 0.55±.01 | 0.66±.02 | 0.66±.02 | 0.65±.02 | 0.67±.02 | 0.67±.02 | 0.67±.02 |

Table 2: Performance (P/R/F1 ± std) of different algorithms to detect node ambiguity, per feature set and ground truth.

In Table 3 we show the Pearson coefficient ρ and the distance correlation dCor (two indexes that estimate the degree of dependence between two distributions), with the aim of analyzing how each feature correlates with the polysemy observed in the three ground-truth dictionaries. We observed that the features with the highest correlation with polysemy are eG+−, ccG−+ and vG−+ (see Section 3). Additionally, we report the resulting weights of Permutation Importance (PI) applied to the NN system, with the aim of measuring how much the performance decreases when a feature is perturbed by shuffling its values across training examples (Breiman, 2001). We observed that the features which most influenced the performance are Δin(n) (WordNet and WordNet∪DBpedia) and ccG−+ (DBpedia). Furthermore, we found that although topological features affect the performance only by 1% on average, a number of topologically related features, such as ccG−+, vG−+ and eG+−, are indeed shown to be related to polysemy. In our future work, we plan to create an ad-hoc ground-truth sense dictionary, since WordNet in particular includes extremely fine-grained senses that do not help validate our conjecture about the reduced mutual connectivity and contextual similarity of a node's neighborhood in the case of monosemy.

| Feature | WordNet dCor | ρ | PI weight ± std | DBpedia dCor | ρ | PI weight ± std | WordNet∪DBpedia dCor | ρ | PI weight ± std |
|---|---|---|---|---|---|---|---|---|---|
| ccG−+ | 0.593 | 0.185 | 0.0039±0.0001 | 0.614 | 0.228 | 0.0628±0.0051 | 0.513 | 0.027 | 0.0038±0.0008 |
| vG−+ | 0.602 | 0.203 | 0.0022±0.0003 | 0.597 | 0.194 | 0.0045±0.0010 | 0.513 | 0.025 | 0.0025±0.0003 |
| eG−+ | 0.597 | 0.194 | 0.0100±0.0016 | 0.600 | 0.200 | 0.0048±0.0008 | 0.514 | 0.027 | 0.0024±0.0001 |
| ccG+− | 0.606 | 0.212 | 0.0131±0.0013 | 0.579 | 0.159 | 0.0092±0.0016 | 0.492 | -0.014 | 0.0049±0.0003 |
| vG+− | 0.623 | 0.247 | 0.0383±0.0035 | 0.580 | 0.159 | 0.0029±0.0009 | 0.495 | -0.010 | 0.0013±0.0008 |
| eG+− | 0.619 | 0.237 | 0.0074±0.0010 | 0.583 | 0.167 | 0.0034±0.0013 | 0.497 | -0.006 | 0.0054±0.0004 |
| Δin | 0.379 | -0.242 | 0.0699±0.0036 | 0.399 | -0.202 | 0.0231±0.0023 | 0.433 | -0.134 | 0.0470±0.0027 |
| Δout | 0.400 | -0.199 | 0.0101±0.0004 | 0.415 | -0.170 | 0.0037±0.0015 | 0.431 | -0.138 | 0.0120±0.0007 |
| Gini | 0.443 | -0.114 | 0.0042±0.0004 | 0.460 | -0.080 | 0.0035±0.0009 | 0.494 | -0.013 | 0.0059±0.0006 |
| W (300 dims), PI avg | | | 0.0029±0.0004 | | | 0.0030±0.0005 | | | 0.0028±0.0003 |
| W (300 dims), PI min | | | 0.0016±0.0003 | | | 0.0005±0.0005 | | | 0.0014±0.0003 |
| W (300 dims), PI max | | | 0.0077±0.0009 | | | 0.0180±0.0013 | | | 0.0123±0.0011 |

Table 3: Distance correlation dCor and Pearson coefficient ρ between polysemy and features, and Permutation Importance (PI) weights (NN estimator). The first six features are topological, the remaining ones textual.

4 Conclusion

The main contribution of this work is a new resource obtained by converting a large dataset of is-a (hypernymy) relations automatically extracted from the Web (the WebIsADb) into a graph structure. This graph, along with its accompanying search tools, enables descriptive and predictive analytics of emerging properties of terminological nodes. We used here our new resource to investigate whether a node's polysemy can be predicted from its topological features (i.e., connectivity patterns) and textual features (meaning representations from word embeddings). The results of this preliminary study have shown that textual features are good predictors of polysemy, while topological features appear to be weaker predictors, even if they have a significant correlation with the polysemy of the related node.
References

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

José Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res., 63:743–788.

Jose Camacho-Collados. 2017. Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations.

H. A. David. 1968. Gini's mean difference rediscovered. Biometrika, 55(3):573–575.

Stefano Faralli, Alexander Panchenko, Chris Biemann, and Simone Paolo Ponzetto. 2017. The ContrastMedium algorithm: Taxonomy induction from noisy knowledge graphs with just a few links. In Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 590–600. Association for Computational Linguistics.

Stefano Faralli, Irene Finocchi, Simone Paolo Ponzetto, and Paola Velardi. 2018. Efficient pruning of large knowledge graphs. In Proc. of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, pages 4055–4063.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of COLING, pages 539–545.

Sven Hertling and Heiko Paulheim. 2017. WebIsALOD: Providing hypernymy relations extracted from the web as linked open data. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, Proc., Part II, pages 111–119.

Chulwoo Kim, Sung-Hyuk Cha, Yoo An, and Ned Wilson. 2017. On ROC curve analysis of artificial neural network classifiers. In Florida Artificial Intelligence Research Society Conference.

S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. 2006. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 26(3):159–190.

Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. 2019. Inferring concept hierarchies from text corpora via hyperbolic embeddings.

Thomas Lin, Mausam, and Oren Etzioni. 2012. Entity linking at web scale. In Proc. of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 84–88. Association for Computational Linguistics.

Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6341–6350. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, and Heiko Paulheim. 2017. Large-scale taxonomy induction using entity and word embeddings. In Proc. of the International Conference on Web Intelligence, WI '17, pages 81–87. ACM.

Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. In Proc. of the 56th ACL (Volume 2: Short Papers), pages 358–363. Association for Computational Linguistics.

Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim, and Simone Paolo Ponzetto. 2016. A large database of hypernymy relations extracted from the web. In Proc. of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proc. of the 54th ACL (Volume 1: Long Papers), pages 2389–2398. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3).

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proc. of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 481–492. ACM.

Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. 2009. Multi-class AdaBoost. Statistics and Its Interface, 2(3):349–360.