=Paper=
{{Paper
|id=Vol-2481/paper30
|storemode=property
|title=WebIsAGraph: A Very Large Hypernymy Graph from a Web Corpus
|pdfUrl=https://ceur-ws.org/Vol-2481/paper30.pdf
|volume=Vol-2481
|authors=Stefano Faralli,Irene Finocchi,Simone Paolo Ponzetto,Paola Velardi
|dblpUrl=https://dblp.org/rec/conf/clic-it/0001FPV19
}}
==WebIsAGraph: A Very Large Hypernymy Graph from a Web Corpus==
Stefano Faralli (1), Irene Finocchi (2), Simone Paolo Ponzetto (3), Paola Velardi (2)

(1) University of Rome Unitelma Sapienza, Italy. stefano.faralli@unitelmasapienza.it
(2) University of Rome Sapienza, Italy. irene.finocchi@uniroma1.it, velardi@di.uniroma1.it
(3) University of Mannheim, Germany. simone@informatik.uni-mannheim.de
Abstract

In this paper, we present WebIsAGraph, a very large hypernymy graph compiled from a dataset of is-a relationships extracted from the CommonCrawl. We provide the resource together with a Neo4j plugin to enable efficient searching and querying over such a large graph. We use WebIsAGraph to study the problem of detecting polysemous terms in a noisy terminological knowledge graph, thus quantifying the degree of polysemy of terms found in is-a extractions from Web text.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Acquiring concept hierarchies, i.e., taxonomies, from text is a long-standing problem in Natural Language Processing (NLP). Much previous work leveraged lexico-syntactic patterns, which can be either manually defined (Hearst, 1992) or automatically learned (Shwartz et al., 2016). Pattern-based methods were shown by Roller et al. (2018) to outperform distributional methods, and can be complemented with state-of-the-art meaning representations such as hyperbolic embeddings (Nickel and Kiela, 2017) to infer missing is-a relations and filter wrong extractions (Le et al., 2019). Complementary to these efforts, researchers looked at ways to scale hypernymy detection to very large, i.e., Web-scale, corpora (Wu et al., 2012). Recently, Seitner et al. (2016) applied Hearst patterns to the CommonCrawl (http://commoncrawl.org) to produce the WebIsADb. Using Web corpora makes it possible to produce hundreds of millions of is-a triples: the extractions, however, include many false positives and cycles (Ristoski et al., 2017).

Methods for hypernym detection, e.g., pattern-based approaches, have a limitation in that they do not necessarily produce proper taxonomies (Camacho-Collados, 2017): automatically detected is-a relationships, on the other hand, can be used as input to taxonomy induction algorithms (Velardi et al., 2013; Faralli et al., 2017; Faralli et al., 2018, inter alia). These algorithms rely on the topology of the input graph and, therefore, cannot be applied 'as-is' to Web-scale resources like WebIsADb, since this resource merely consists of a set of triples. Moreover, WebIsADb does not contain fully semantified triples, i.e., subjects and objects of the is-a relationships consist of potentially ambiguous terminological nodes. This is because, due to their large size, source input corpora like the CommonCrawl cannot be semantified upfront. Linking to the semantic vocabulary of a reference resource like DBpedia (Hertling and Paulheim, 2017) also barely mitigates this problem, since Wikipedia-centric knowledge bases do not have, and cannot be expected to have, complete coverage over Web data (Lin et al., 2012).

In this paper, we present an initial solution to these problems by building the first very large hypernymy graph, dubbed WebIsAGraph, built from is-a relationships extracted from a Web-scale corpus. This is a relevant task: although WordNet (and other thesauri) already provides a catalog of ambiguous terms, many nodes of WebIsAGraph are not covered in available lexicographic resources, because they are proper names, technical terms, or polysemantic words. Our graph, which we make freely available to the research community to foster further work on Web-scale knowledge acquisition, is built from the WebIsADb on top of state-of-the-art graph mining tools (Neo4j: https://neo4j.com/): thanks to an accompanying plugin, it can be easily searched, queried, and explored.
WebIsAGraph may represent an opportunity for researchers to investigate approaches to a variety of tasks on large, automatically acquired term tuples. As an example, we use our resource to investigate the problem of identifying ambiguous terminological nodes. To automatically detect whether a lexicographic node is ambiguous or not, we use information from both the graph (topological features) and textual labels (word embeddings) as features to train a model using supervised learning. Our results provide a first estimate of the degree of polysemy that can be found among is-a relationships from the Web.
2 Creating WebIsAGraph

We created a directed hypernymy graph from the WebIsADb (Seitner et al., 2016). WebIsADb is a Web-scale collection of noisy hypernymy relations harvested with 58 extraction patterns and consisting of 607,621,170 tuples. Since the aim of WebIsADb was to study the behaviour (on a large scale) of Hearst-like extraction patterns, rather than to collect relations with high precision, in order to reduce noise (false positives) we pre-selected the top-20 most precise extraction patterns of Seitner et al. (2016) from the original 58, identifying 385,459,302 tuples.

After removing matches with a frequency lower than 3 and isolated nodes, i.e., nodes with degree equal to 0, we obtained a directed graph consisting of 33,030,457 nodes and 65,681,899 directed edges (see Table 1). The generation of such a large graph required several weeks of computation on a quad-core machine with 32 GB of RAM, using a state-of-the-art graph-db system, Neo4j. Note that the inherently sequential nature of the task of indexing tuples, nodes, and edges does not benefit from the use of parallel computation. Next, we developed efficient tools for graph querying, which are released to the community and described at https://sites.google.com/unitelmasapienza.it/webisagraph/, where we also include examples of queries.

| Statistic | WebIsAGraph |
|---|---|
| nodes | 33,030,457 |
| edges | 65,681,899 |
| weakly connected components | 3,099,898 |
| nodes of largest component | 26,099,001 |
| avg. node degree | 3.97 |

Table 1: Structural statistics of WebIsAGraph.
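For illustration, the following is a minimal sketch of how the graph could be queried from Python through the official Neo4j driver. It is only a sketch: the node label `Term`, the relationship type `IS_A`, and the property `label` are hypothetical placeholders, since the actual schema and the accompanying plugin procedures are documented on the website above.

```python
from neo4j import GraphDatabase

# Hypothetical schema: (:Term {label}) -[:IS_A]-> (:Term {label}).
# Adjust label/type names to the schema documented on the WebIsAGraph site.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def hypernyms(term, limit=20):
    """Return up to `limit` hypernym labels of `term`."""
    query = (
        "MATCH (t:Term {label: $label})-[:IS_A]->(h:Term) "
        "RETURN h.label AS hypernym LIMIT $limit"
    )
    with driver.session() as session:
        return [record["hypernym"]
                for record in session.run(query, label=term, limit=limit)]

print(hypernyms("floppy disk"))
```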
3 Measuring the polysemy of WebIsAGraph

Let pSI(n) be the function that predicts whether a terminological node n corresponds to a monosemous or a polysemous concept. We leverage a companion sense inventory as a ground truth, and we train different classifiers with a combination of topological and textual features, described hereafter.

Topological features. Our conjecture is that, in a taxonomy-like terminological graph (even a noisy one), there is a correlation between the mutual connectivity of a node's neighborhoods and its polysemy. For example, consider the polysemous word machine, which, according to WordNet, has at least six heterogeneous meanings, ranging from 'any mechanical or electrical device' to 'a group that controls the activities of a political party', and the monosemous word floppy disk. We expect to observe a different degree of mutual connectivity across the corresponding incoming and outgoing nodes. In particular, for monosemous words, we expect a higher mutual connectivity. With reference to Figure 1, the two hypernyms of "floppy disk", "memory" and "data storage", also have "RAM" as a common hyponym. In contrast, nodes in the direct neighborhood of "machine" (leftmost graph in Figure 1) do not have mutual connections.

Our aim is thus to identify topological features that may help quantify the previously described connectivity properties. To cope with scalability, we consider topological features built on top of 1-hop/2-hop sub-graphs of a node n. Hence, we identify two induced sub-graphs G−+(n) and G+−(n), induced on V−+(n) = In(n) ∪ ⋃_{v∈In(n)} Out(v) and V+−(n) = Out(n) ∪ ⋃_{v∈Out(n)} In(v) respectively, where In(x) and Out(x) are the sets of incoming and outgoing nodes of x (including x itself). Next, we remove the node n from these sub-graphs and compute the following features (a code sketch follows the worked example below):

• ccG−+(n) and ccG+−(n): the resulting number of weakly connected components;
• vG−+(n) and vG+−(n): the resulting number of nodes;
• eG−+(n) and eG+−(n): the resulting number of edges.
[Figure 1: An example excerpt of the neighborhood induced sub-graphs for "machine" and "floppy disk": (a) G−+(n) in gray and (b) G+−(n) in dark gray. Dashed edges connect each n with its hypernyms and hyponyms.]

With reference to the example in Figure 1, the light gray sub-graph (a) is G−+(n) and the dark gray sub-graph (b) is G+−(n). Furthermore, for n = "machine": ccG−+(n) = 2, ccG+−(n) = 2, vG−+(n) = 5, vG+−(n) = 5, eG−+(n) = 3, and eG+−(n) = 3, while for n = "floppy disk": ccG−+(n) = 1, ccG+−(n) = 1, vG−+(n) = 4, vG+−(n) = 2, eG−+(n) = 3, and eG+−(n) = 1.
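As an illustration of the six features above, here is a minimal sketch assuming the hypernymy graph fits in memory as a networkx DiGraph; the paper's own pipeline computes these values through its Neo4j query tools instead. A function like this can be sanity-checked against the "machine" / "floppy disk" values reported above.

```python
import networkx as nx

def topological_features(G: nx.DiGraph, n):
    """Compute cc, v, e for the induced sub-graphs G-+(n) and G+-(n)."""
    def In(x):            # incoming nodes of x, including x itself
        return set(G.predecessors(x)) | {x}
    def Out(x):           # outgoing nodes of x, including x itself
        return set(G.successors(x)) | {x}

    # V-+(n): since v is in Out(v), the union over Out(v) already covers In(n)
    V_mp = set().union(*(Out(v) for v in In(n)))
    # V+-(n): the symmetric construction on the outgoing side
    V_pm = set().union(*(In(v) for v in Out(n)))

    feats = {}
    for tag, nodes in (("G-+", V_mp), ("G+-", V_pm)):
        sub = G.subgraph(nodes - {n})             # remove n itself
        feats["cc" + tag] = nx.number_weakly_connected_components(sub)
        feats["v" + tag] = sub.number_of_nodes()
        feats["e" + tag] = sub.number_of_edges()
    return feats
```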
Textual features. Similarly to the topological features, our hypothesis is that the textual features of the neighborhood nodes should exhibit a lower average similarity when n is polysemous. We extract textual features on top of pre-trained word embeddings, widely adopted in many NLP-related tasks (Camacho-Collados and Pilehvar, 2018). Formally, given a node n:

• W(n) is the word embedding vector of n, computed as follows:

$$\vec{W}(n) = \frac{\sum_{t \in tokens(n)} \vec{we}(t)}{|tokens(n)|} \qquad (1)$$

where tokens(n) is the function that retrieves the set of tokens composing the word n (e.g., if n = hot dog, then tokens(n) = {hot, dog}), and \vec{we}(t) is a pre-trained word embedding vector;

• Δin(n) and Δout(n): the cosine similarity between W(n) and the average word embedding vector of the incoming and outgoing nodes of n, respectively:

$$\Delta_{in}(n) = CosSim\left(\vec{W}(n),\; \frac{\sum_{m \in In(n)} \vec{W}(m)}{|In(n)|}\right) \qquad (2)$$

$$\Delta_{out}(n) = CosSim\left(\vec{W}(n),\; \frac{\sum_{m \in Out(n)} \vec{W}(m)}{|Out(n)|}\right) \qquad (3)$$

• Gini(n): the sparsity index (David, 1968) of \vec{W}(n).
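A minimal numpy sketch of these textual features, assuming `we` is a dict mapping tokens to pre-trained GloVe vectors; the Gini computation uses one common formulation of the index, since the paper does not spell out its exact variant:

```python
import numpy as np

def W(term, we, dim=300):
    """Eq. (1): average of the token embeddings of a node label."""
    vecs = [we[t] for t in term.split() if t in we]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cos_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def delta(term, neighbors, we):
    """Eqs. (2)-(3): cosine similarity between W(n) and the average
    embedding of its incoming (or outgoing) neighbor labels."""
    if not neighbors:
        return 0.0
    avg = np.mean([W(m, we) for m in neighbors], axis=0)
    return cos_sim(W(term, we), avg)

def gini(v):
    """Gini sparsity index of a vector (one common formulation)."""
    x = np.sort(np.abs(v))
    k = np.arange(1, x.size + 1)
    s = x.sum()
    return float(((2 * k - x.size - 1) * x).sum()) / (x.size * s) if s else 0.0
```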
3.1 Evaluation

Computing features. Topological features are efficiently extracted using the query tool mentioned in Section 2. To compute textual features (see Section 3), we use the GloVe pre-trained word embeddings (Pennington et al., 2014) of dimension 300, trained on the CommonCrawl (https://nlp.stanford.edu/projects/glove/).

By combining these two types of features (topological and textual), we obtained three different input vector representations, consisting of 6 (only topological features), 303 (only textual features), and 309 (textual and topological features) dimensions, respectively.

Finally, we created three "ground truth" sets of nodes in the graph for which pSI(n) is known. We selected a balanced number of monosemous and polysemous nouns, using the following sense inventories: i) WordNet (14,659 examples); ii) DBpedia (17,041 examples); iii) WordNet and DBpedia (31,701 examples).
Algorithms. We compared four algorithms:

• Random (Rnd): a random baseline which randomly classifies the ambiguity of a node;

• Neural Network (NN): a neural network with a Softmax activation function in the output layer and dropout (Srivastava et al., 2014);

• two ensemble-based learning algorithms, namely AdaBoost (ABC) (Zhu et al., 2009) and Gradient Boosting (GBC) (Friedman, 2001): both have been shown to have high predictive accuracy (Kotsiantis et al., 2006) and are good competitors of neural methods, especially on very large datasets.

Parameter selection. Based on an Area Under the ROC Curve (AUC) analysis (Kim et al., 2017), NN parameters have been empirically set as follows: i) when testing only with topological features (6 dimensions), we use 2 hidden layers with 4 and 2 neurons respectively, and dropout rates of 0.2 and 0.15; ii) when using only textual (303 dimensions) or combined textual and topological features (309 dimensions), we use 4 hidden layers with 128, 64, 32 and 8 neurons respectively, and dropout rates of 0.3, 0.25, 0.2 and 0.15. A sketch of the smaller configuration is given below.
Results. We show in Table 2 the resulting precision, recall and F1 of the four systems across the ground-truth datasets and for the combinations of features (see Section 3). The metrics are averaged over five classification experiments, each with a random split (85% train, 10% validation, 5% test) of the ground truth sets. As shown in Table 2, NN outperforms the ensemble methods, obtaining an F1 score around 0.70. The comparison of performances across the three combinations of features reveals that topological features are not enough to build a model for polysemy classification, but they can slightly boost the already compelling overall performance of the word-embedding-based features.

| Ground truth | Algo. | topo. P | topo. R | topo. F1 | text. P | text. R | text. F1 | all P | all R | all F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| WordNet | Rnd | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 | 0.47±.05 |
| WordNet | NN | 0.61±.02 | 0.61±.02 | 0.60±.02 | 0.72±.03 | 0.72±.03 | 0.72±.02 | 0.73±.04 | 0.73±.04 | 0.73±.04 |
| WordNet | ABC | 0.62±.03 | 0.62±.03 | 0.62±.03 | 0.67±.02 | 0.67±.02 | 0.67±.01 | 0.70±.03 | 0.70±.03 | 0.70±.03 |
| WordNet | GBC | 0.62±.02 | 0.62±.02 | 0.62±.02 | 0.69±.01 | 0.68±.01 | 0.68±.01 | 0.72±.03 | 0.71±.03 | 0.71±.03 |
| DBpedia | Rnd | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 | 0.51±.03 |
| DBpedia | NN | 0.60±.01 | 0.60±.01 | 0.59±.01 | 0.73±.03 | 0.73±.03 | 0.73±.03 | 0.74±.03 | 0.74±.03 | 0.74±.03 |
| DBpedia | ABC | 0.60±.02 | 0.60±.02 | 0.60±.02 | 0.69±.02 | 0.69±.02 | 0.69±.02 | 0.71±.04 | 0.71±.04 | 0.71±.04 |
| DBpedia | GBC | 0.61±.02 | 0.61±.02 | 0.61±.02 | 0.70±.03 | 0.70±.03 | 0.70±.03 | 0.73±.02 | 0.73±.02 | 0.73±.02 |
| WordNet∪DBpedia | Rnd | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 | 0.50±.01 |
| WordNet∪DBpedia | NN | 0.54±.03 | 0.53±.05 | 0.50±.12 | 0.70±.02 | 0.70±.02 | 0.70±.02 | 0.71±.02 | 0.70±.01 | 0.70±.01 |
| WordNet∪DBpedia | ABC | 0.55±.02 | 0.55±.02 | 0.55±.02 | 0.64±.01 | 0.64±.01 | 0.64±.01 | 0.65±.02 | 0.65±.02 | 0.65±.02 |
| WordNet∪DBpedia | GBC | 0.56±.02 | 0.56±.01 | 0.55±.01 | 0.66±.02 | 0.66±.02 | 0.65±.02 | 0.67±.02 | 0.67±.02 | 0.67±.02 |

Table 2: Performance (P/R/F1 ± std) of different algorithms to detect node ambiguity, per feature set and ground truth.

In Table 3 we show the Pearson coefficient ρ and the distance correlation dCor (two indexes that estimate the degree of dependence between two distributions), with the aim of analyzing how each feature correlates with the polysemy observed in the three ground-truth dictionaries. We observed that the features with the highest correlation with polysemy are eG+−, ccG−+ and vG−+ (see Section 3). Additionally, we report the resulting weights of Permutation Importance (PI) applied to the NN system, with the aim of measuring how much the performance decreases when a feature is perturbed by shuffling its values across training examples (Breiman, 2001). We observed that the features which most influenced the performance are Δin(n) (WordNet and WordNet∪DBpedia) and ccG−+ (DBpedia). Furthermore, we found that although topological features affect the performance only by 1% on average, a number of topologically related features, such as ccG−+, vG−+ and eG+−, are indeed shown to be related to polysemy. In our future work, we plan to create an ad-hoc ground-truth sense dictionary, since WordNet in particular includes extremely fine-grained senses that do not help validate our conjecture about the reduced mutual connectivity and contextual similarity of a node's neighborhood in the case of monosemy.

| Feature | WordNet dCor | ρ | PI weight ± std | DBpedia dCor | ρ | PI weight ± std | WordNet∪DBpedia dCor | ρ | PI weight ± std |
|---|---|---|---|---|---|---|---|---|---|
| ccG−+ | 0.593 | 0.185 | 0.0039±0.0001 | 0.614 | 0.228 | 0.0628±0.0051 | 0.513 | 0.027 | 0.0038±0.0008 |
| vG−+ | 0.602 | 0.203 | 0.0022±0.0003 | 0.597 | 0.194 | 0.0045±0.0010 | 0.513 | 0.025 | 0.0025±0.0003 |
| eG−+ | 0.597 | 0.194 | 0.0100±0.0016 | 0.600 | 0.200 | 0.0048±0.0008 | 0.514 | 0.027 | 0.0024±0.0001 |
| ccG+− | 0.606 | 0.212 | 0.0131±0.0013 | 0.579 | 0.159 | 0.0092±0.0016 | 0.492 | -0.014 | 0.0049±0.0003 |
| vG+− | 0.623 | 0.247 | 0.0383±0.0035 | 0.580 | 0.159 | 0.0029±0.0009 | 0.495 | -0.010 | 0.0013±0.0008 |
| eG+− | 0.619 | 0.237 | 0.0074±0.0010 | 0.583 | 0.167 | 0.0034±0.0013 | 0.497 | -0.006 | 0.0054±0.0004 |
| Δin | 0.379 | -0.242 | 0.0699±0.0036 | 0.399 | -0.202 | 0.0231±0.0023 | 0.433 | -0.134 | 0.0470±0.0027 |
| Δout | 0.400 | -0.199 | 0.0101±0.0004 | 0.415 | -0.170 | 0.0037±0.0015 | 0.431 | -0.138 | 0.0120±0.0007 |
| Gini | 0.443 | -0.114 | 0.0042±0.0004 | 0.460 | -0.080 | 0.0035±0.0009 | 0.494 | -0.013 | 0.0059±0.0006 |
| W (300 dims), PI avg | | | 0.0029±0.0004 | | | 0.0030±0.0005 | | | 0.0028±0.0003 |
| W (300 dims), PI min | | | 0.0016±0.0003 | | | 0.0005±0.0005 | | | 0.0014±0.0003 |
| W (300 dims), PI max | | | 0.0077±0.0009 | | | 0.0180±0.0013 | | | 0.0123±0.0011 |

Table 3: Distance correlation dCor and Pearson coefficient ρ between polysemy and features, and Permutation Importance (PI) weights (NN estimator). The first six features are topological, the remaining ones textual.

4 Conclusion

The main contribution of this work is a new resource obtained by converting a large dataset of is-a (hypernymy) relations automatically extracted from the Web (the WebIsADb) into a graph structure. This graph, along with its accompanying search tools, enables descriptive and predictive analytics of emerging properties of terminological nodes. We used here our new resource to investigate whether a node's polysemy can be predicted from its topological features (i.e., connectivity patterns) and textual features (meaning representations from word embeddings). The results of this preliminary study have shown that textual features are good predictors of polysemy, while topological features appear to be weaker predictors, even if they have a significant correlation with the polysemy of the related node.
References

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

José Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res., 63:743–788.

Jose Camacho-Collados. 2017. Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations.

H. A. David. 1968. Gini's mean difference rediscovered. Biometrika, 55(3):573–575.

Stefano Faralli, Alexander Panchenko, Chris Biemann, and Simone Paolo Ponzetto. 2017. The ContrastMedium algorithm: Taxonomy induction from noisy knowledge graphs with just a few links. In Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 590–600. Association for Computational Linguistics.

Stefano Faralli, Irene Finocchi, Simone Paolo Ponzetto, and Paola Velardi. 2018. Efficient pruning of large knowledge graphs. In Proc. of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, pages 4055–4063.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of COLING, pages 539–545.

Sven Hertling and Heiko Paulheim. 2017. WebIsALOD: Providing hypernymy relations extracted from the web as linked open data. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, Proc., Part II, pages 111–119.

Chulwoo Kim, Sung-Hyuk Cha, Yoo An, and Ned Wilson. 2017. On ROC curve analysis of artificial neural network classifiers. In Florida Artificial Intelligence Research Society Conference.

S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. 2006. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 26(3):159–190.

Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. 2019. Inferring concept hierarchies from text corpora via hyperbolic embeddings.

Thomas Lin, Mausam, and Oren Etzioni. 2012. Entity linking at web scale. In Proc. of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 84–88. Association for Computational Linguistics.

Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6341–6350. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, and Heiko Paulheim. 2017. Large-scale taxonomy induction using entity and word embeddings. In Proc. of the International Conference on Web Intelligence, WI '17, pages 81–87. ACM.

Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst patterns revisited: Automatic hypernym detection from large text corpora. In Proc. of the 56th ACL (Volume 2: Short Papers), pages 358–363. Association for Computational Linguistics.

Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim, and Simone Paolo Ponzetto. 2016. A large database of hypernymy relations extracted from the web. In Proc. of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proc. of the 54th ACL (Volume 1: Long Papers), pages 2389–2398. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.

Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3).

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proc. of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 481–492. ACM.

Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. 2009. Multi-class AdaBoost. Statistics and Its Interface, 2(3):349–360.