Language-independent Taxonomy Derivation from Wikipedia via Multi-task Adversarial Learning

Shulin Cao, Zijun Yao, Lei Hou and Juanzi Li* (Tsinghua University)
Wikidata'23: Wikidata workshop at ISWC 2023. *Corresponding author. caosl19@mails.tsinghua.edu.cn (S. Cao); yaozj20@mails.tsinghua.edu.cn (Z. Yao); houlei@tsinghua.edu.cn (L. Hou); lijuanzi@tsinghua.edu.cn (J. Li); https://github.com/ShulinCao (S. Cao)

Abstract
Many recent efforts explore the task of Taxonomy Derivation from the Wikipedia Category Network (TDWCN), which induces rich hypernymy relations between instances and classes from Wikipedia to integrate hierarchical information into knowledge graphs. However, current methods rely heavily on language-dependent information, including heuristic rules, human annotations and inter-language links, which limits their applicability. In this paper, we propose a language-independent model for TDWCN. Specifically, we design an adversarial learning approach to distill hypernymy relations from the noisy raw Wikipedia category network, avoiding any language dependencies. Besides, we incorporate multi-task learning to exploit the correlation among instanceOf, subClassOf, and the relations of instances. In addition, we contribute an English evaluation dataset, ENT5k, with about 6,000 categories. Experimental results on 4 different languages demonstrate that our model can be applied generally to any language and achieves better or comparable performance compared with previous language-dependent models.

Keywords
Taxonomy, Knowledge Base, Wikipedia Category Network, Adversarial Learning

1. Introduction

Taxonomies hierarchically organize hypernymy (consisting of instanceOf and subClassOf) among instances and classes; they are core pieces of large-scale knowledge graphs [1] and have been proven beneficial for various NLP tasks such as question answering [2], document understanding [3] and information extraction [4]. Taxonomy derivation is a crucial task to integrate hierarchical and abstract information into knowledge graphs. Current approaches for taxonomy derivation follow three lines: (1) manual construction [5, 6]; (2) extracting individual hypernymy relations from unstructured text and then organizing the collection into a complete taxonomy [7, 8]; (3) recognizing hypernymy from the Wikipedia Category Network (WCN). Because the WCN is large-scale, domain-independent, dynamically maintained and of high coverage, Taxonomy Derivation from WCN (TDWCN) has attracted considerable research [9, 10, 11, 12] and is the focus of this paper.

As shown in Figure 1, the WCN is a directed graph linking Wikipedia articles (e.g., Mickey Mouse) with inter-connected categories of different granularities (e.g., Disney comics characters, Disney characters, Disney comics). The links from articles to categories are articleOf, and those between
[Figure 1: TDWCN filters out the non-hypernymy from the WCN and outputs the taxonomy. The figure contrasts the raw WCN with the derived taxonomy: articles/instances (e.g., Donald Duck, Mickey Mouse) and categories/classes (e.g., Disney comics characters, Disney characters, Fictional characters, The Walt Disney Company) are connected by articleOf/instanceOf and subCategoryOf/subClassOf links; red lines mark non-hypernymy.]

categories are subCategoryOf. By treating each article and category as a candidate instance and class, and each articleOf and subCategoryOf link as a candidate instanceOf and subClassOf respectively, we can obtain a large-scale taxonomy with millions of instances and classes without extra human effort. However, not all articleOf and subCategoryOf links are correct hypernymy relations. If we do not filter out the non-hypernymy in the WCN, wrong facts may be inferred (e.g., "(Mickey Mouse, instanceOf, The Walt Disney Company)"). Therefore, TDWCN needs to recognize whether each articleOf and subCategoryOf in the WCN is a correct hypernymy relation, which can be framed as a hypernymy classification task consisting of InstanceOf Classification and SubClassOf Classification.

Most previous methods for TDWCN [9, 13, 10, 11] rely on heuristic rules mainly designed for English (e.g., syntactic and lexical patterns), and can hardly be applied to non-English languages. Recently, supervised methods have been proposed, which rely on labeled corpora from expensive human annotation [14, 15] or on sparse inter-language links in Wikipedia [12]. Both rule-based and supervised methods rely on language-dependent information, including heuristic rules, human annotations and inter-language links, which is time-consuming and labor-intensive to obtain. For example, if we design heuristic rules, annotate a corpus or use inter-language links to construct a dataset for one language, we cannot directly apply them to another language, since syntax, lexicon and patterns differ considerably across languages. These language dependencies limit the applicability of such methods.

To address the above issues, we propose a language-independent method that performs TDWCN through multi-task adversarial learning. Specifically, we pretrain a coarse classifier over the raw WCN, based on which we split the training data into a reliable set and an unreliable set. Then, we use adversarial learning to iteratively distill the two training sets and refine the classifier (i.e., the discriminator) through a min-max game between the discriminator and a sampler. Our model can purify the large-scale raw WCN and generalizes to any language without the limitation of heuristic rules, human annotations or inter-language links. In addition, considering that (1) InstanceOf Classification and SubClassOf Classification can mutually enhance each other because instances and classes are highly correlated, and (2) the rich semantics provided by relational facts among instances through Knowledge Embedding may benefit Hypernymy Classification, we propose a multi-task learning framework to learn Knowledge Embedding, InstanceOf Classification and SubClassOf Classification simultaneously.
These three sub-tasks fully integrate the connections from the instance-instance, instance-class and class-class views of the information flow, further improving the performance of TDWCN.

2. Related Work

Taxonomy Derivation. Taxonomies organize the classes and instances of the real world in a hierarchical structure and directly affect the computational ability of knowledge graphs. Therefore, deriving large-scale, high-coverage and high-quality taxonomies is essential. Current methods for taxonomy derivation fall into three categories. The first focuses on manual construction [5, 6], which is time-intensive and domain-dependent. The second, taxonomy derivation from text, usually includes two steps: hypernymy relation extraction from text and taxonomy induction; representative works include [7, 16, 17, 18, 19, 20, 21]. Our paper focuses on the third category, taxonomy derivation from the WCN. Most previous works utilize heuristic hand-crafted rules, such as the syntactic structure of category labels, the topology and lexico-syntactic patterns [9], the lemmas from the first sentences of articles (WiBi [22]), linking with external resources (MENTA [13]), inter-language links and link surface forms (MultiWiBi [10]), and so on [23, 24, 1, 11]. Recent supervised methods rely on human annotations [14, 15] or inter-language links (MultiTax [12]), where the former are expensive and the latter are sparse. MultiTax, given an English taxonomy as a source taxonomy, leverages inter-language links to construct a dataset for the target language and then trains classifiers.

Different from MENTA, MultiWiBi and MultiTax, our method avoids language-dependent information, including heuristic rules and inter-language links. Different from WiBi and MultiWiBi, which also consider the correlation between instances and classes, we use deep representation learning to vectorize instances and classes, which serves as a basis that connects InstanceOf and SubClassOf classification. Note that Wikipedia has links from each article to the corresponding Wikidata item, and Wikidata has taxonomic relations among its items. However, these relations cover only the articles of Wikipedia and ignore categories. We believe the rich taxonomic relations among articles and categories in Wikipedia are crucial for large-scale and high-coverage taxonomies and can complement the existing taxonomy in Wikidata.

Adversarial Training. Prior works on adversarial training in computer vision add imperceptible adversarial perturbations to input images, relying on the fact that such small perturbations cannot change an image's true label. [25] add noise in the form of small perturbations to the input data, and the generated adversarial examples lead models to make wrong predictions. [26] then analyze adversarial examples and propose adversarial training for image classification tasks. These works inspire subsequent work on NLP tasks, such as text generation [27] and knowledge graph embedding [28]. Different from previous works, we exploit the ability of adversarial training to distinguish nuances in the input data and to refine a pretrained coarse classifier. We split the unlabeled training data into a reliable set and an unreliable set, and use adversarial training to iteratively distill the two training sets through a min-max game between a discriminator and a sampler.

3. Notations and Definitions
Definition 1. WCN is a directed graph defined as $\mathcal{WCN} = \langle \mathcal{A}, \mathcal{C}, \mathcal{R}^{\mathcal{A}}, \mathcal{R}^{\mathcal{C}} \rangle$: (1) Each $a_j \in \mathcal{A}$ is an article in Wikipedia. (2) Each $c_j \in \mathcal{C}$ is a category grouping articles and other categories on similar topics, which can be represented as a word sequence $\{w_1, \ldots, w_{|c_j|}\}$. (3) $\mathcal{R}^{\mathcal{A}} = \{r^a_j \mid r^a_j = (a_k, c_l), a_k \in \mathcal{A}, c_l \in \mathcal{C}\}$, where $r^a_j$ is an articleOf link between article $a_k$ and category $c_l$. (4) $\mathcal{R}^{\mathcal{C}} = \{r^c_j \mid r^c_j = (c_k, c_l), c_k, c_l \in \mathcal{C}\}$, where $r^c_j$ is a subCategoryOf link between two categories $c_k, c_l$.

Definition 2. Taxonomy is a directed acyclic graph defined as $\mathcal{T} = \langle \mathcal{I}, \tilde{\mathcal{C}}, \mathcal{R}^{\mathcal{I}}, \mathcal{R}^{\tilde{\mathcal{C}}} \rangle$: (1) Each $i_j \in \mathcal{I}$ is an instance, and each $\tilde{c}_j \in \tilde{\mathcal{C}}$ is a class. (2) $\mathcal{R}^{\mathcal{I}} = \{r^i_j \mid r^i_j = (i_k, \tilde{c}_l), i_k \in \mathcal{I}, \tilde{c}_l \in \tilde{\mathcal{C}}\}$, where $r^i_j$ is an instanceOf relation between instance $i_k$ and class $\tilde{c}_l$. $\mathcal{R}^{\tilde{\mathcal{C}}} = \{r^{\tilde{c}}_j \mid r^{\tilde{c}}_j = (\tilde{c}_k, \tilde{c}_l), \tilde{c}_k, \tilde{c}_l \in \tilde{\mathcal{C}}\}$, where $r^{\tilde{c}}_j$ is a subClassOf relation between two classes $\tilde{c}_k, \tilde{c}_l$.

As shown in Figure 1, articles and categories in the WCN can be viewed as candidate instances and classes respectively, and articleOf and subCategoryOf links are candidates of instanceOf and subClassOf. Namely, $\mathcal{I} \subseteq \mathcal{A}$, $\tilde{\mathcal{C}} \subseteq \mathcal{C}$, $\mathcal{R}^{\mathcal{I}} \subseteq \mathcal{R}^{\mathcal{A}}$ and $\mathcal{R}^{\tilde{\mathcal{C}}} \subseteq \mathcal{R}^{\mathcal{C}}$. We want to recognize whether each articleOf is a correct instanceOf and whether each subCategoryOf is a correct subClassOf. Therefore, the main task, Hypernymy Classification, is formalized as follows.

Definition 3. Hypernymy Classification is to learn two functions $\mathcal{IC}$ and $\mathcal{SC}$ for instanceOf classification and subClassOf classification respectively: (1) $\mathcal{IC}(r^a_j) \mapsto \{+1, -1\}$, $r^a_j \in \mathcal{R}^{\mathcal{A}}$, where $+1$ denotes that the articleOf link $r^a_j$ is a correct instanceOf and $-1$ that it is not. (2) $\mathcal{SC}(r^c_j) \mapsto \{+1, -1\}$, $r^c_j \in \mathcal{R}^{\mathcal{C}}$, where $+1$ denotes that the subCategoryOf link $r^c_j$ is a correct subClassOf and $-1$ that it is not.

4. Methodology

We conduct taxonomy derivation in three steps: (1) Network cleanup, a pre-processing step that filters out meta-categories related to Wikipedia management; (2) Hypernymy classification, the core step that learns both InstanceOf and SubClassOf classification; (3) Taxonomy induction, a post-processing step that induces a globally-optimized taxonomy. For network cleanup, we follow [9] and use several light-weight rules. For taxonomy induction, we follow [12] and use greedy selection strategies. These two steps are not our focus and are not unfolded in this paper due to space limits.

For Hypernymy Classification, as shown in Figure 2, we learn three sub-tasks, Knowledge Embedding, InstanceOf Classification and SubClassOf Classification, simultaneously in a multi-task learning framework to fully incorporate the connections among instances and classes.

[Figure 2: The overall framework. Coarse Pretraining learns a classifier from the WCN via negative sampling and splits the training data into a reliable set and an unreliable set; Adversarial Refining plays a min-max game between a sampler and a discriminator over the two sets; Knowledge Embedding, InstanceOf Classification and SubClassOf Classification share instance and class embeddings.]

For Knowledge Embedding, we introduce an external knowledge graph with rich semantic relations among instances. InstanceOf and SubClassOf classification follow the same learning process and model architecture. Specifically, we pretrain a coarse classifier based on the raw WCN and a negative sampling strategy. According to the output of the classifier, we split the training data into a reliable set and an unreliable set. Then, we use adversarial learning to iteratively distill the two training sets and refine the classifier (i.e., the discriminator) through a min-max game between the discriminator and the sampler.
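To make the setup concrete, the following is a minimal Python sketch (our own illustration; all names are hypothetical, not the authors' code) of how the WCN of Definition 1 and the two classification functions of Definition 3 can be represented, with the three processing steps shown as placeholder calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class WCN:
    """Wikipedia Category Network (Definition 1), hypothetical container."""
    articles: List[str]                        # A: article titles
    categories: List[str]                      # C: category titles (word sequences)
    article_of: List[Tuple[str, str]]          # R^A: (article, category) candidate instanceOf
    sub_category_of: List[Tuple[str, str]]     # R^C: (category, category) candidate subClassOf

# Definition 3: two binary classifiers over candidate edges, returning +1 or -1.
InstanceOfClassifier = Callable[[Tuple[str, str]], int]
SubClassOfClassifier = Callable[[Tuple[str, str]], int]

def network_cleanup(wcn: WCN) -> WCN:
    # Placeholder for the rule-based removal of meta-categories [9] (not our focus).
    return wcn

def taxonomy_induction(instance_of, sub_class_of):
    # Placeholder for greedy taxonomy induction [12]; here we simply return the kept edges.
    return {"instanceOf": instance_of, "subClassOf": sub_class_of}

def derive_taxonomy(wcn: WCN, ic: InstanceOfClassifier, sc: SubClassOfClassifier):
    """Three-step pipeline sketch: cleanup -> hypernymy classification -> induction."""
    wcn = network_cleanup(wcn)
    instance_of = [r for r in wcn.article_of if ic(r) == +1]        # keep correct instanceOf
    sub_class_of = [r for r in wcn.sub_category_of if sc(r) == +1]  # keep correct subClassOf
    return taxonomy_induction(instance_of, sub_class_of)
```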
In Sections Coarse Pretraining and Adversarial Refining, we take SubClassOf Classification as a representative to introduce the details. In Section Multi-task Learning, we introduce the overall learning objective of the multi-task learning framework.

4.1. Coarse Pretraining

Coarse pretraining aims to learn a coarse classifier that predicts whether each relation in the WCN is a hypernymy.

4.1.1. Category Encoding

Firstly, we capture the semantics of each category from its word sequence. Specifically, given the word sequence $\{w_1, \ldots, w_{|c_j|}\}$ of category $c_j \in \mathcal{C}$, we represent all words with their word embeddings $\{\mathbf{w}_1, \ldots, \mathbf{w}_{|c_j|}\}$, and then feed the embeddings into a neural encoder to obtain the category representation $\mathbf{c}_j$. Without loss of generality, we select a convolutional neural network (CNN) [29] as the neural encoder.

4.1.2. Hypernymy Encoding

Next, we encode each category pair in the WCN to obtain the representations of hypernymy candidates. Given a relation $(c_j, c_k) \in \mathcal{R}^{\mathcal{C}}$, we take the difference of the category representations as the relation embedding, i.e., $\mathbf{r}_{c_j,c_k} = \mathbf{c}_j - \mathbf{c}_k$.¹

4.1.3. Hypernymy Scoring

Finally, we learn a scoring function to predict whether a subCategoryOf relation is a correct subClassOf. Given $(c_j, c_k) \in \mathcal{R}^{\mathcal{C}}$, we measure its plausibility as a subClassOf relation by

$$S(c_j, c_k) = \sigma\left(\mathbf{r}_{c_j,c_k} \cdot \mathbf{r}_c\right), \quad (1)$$

where $\sigma(\cdot)$ is the sigmoid function and $\mathbf{r}_c$ is a randomly initialized vector to be learned.

We observe that most subCategoryOf links are correct subClassOf relations, and most subClassOf relations are already contained in subCategoryOf. Due to the lack of supervised labels, we assume the equivalence between the subClassOf set and the subCategoryOf set to coarsely train our classifier, and then refine it at a finer granularity by multi-task adversarial learning, as explained in the following. Specifically, let $\mathcal{U}^{(+)}$ and $\mathcal{U}^{(-)}$ be the positive and negative sample sets of the subClassOf classifier. We have $\mathcal{U}^{(+)} = \mathcal{R}^{\mathcal{C}}$ and $\mathcal{U}^{(-)} = \{(c_j, c_k) \mid c_j, c_k \in \mathcal{C}, (c_j, c_k) \notin \mathcal{R}^{\mathcal{C}}\}$. As there is a huge number of negative samples and most of them can be easily recognized, we design an efficient negative sampling strategy to sample the most informative ones from $\mathcal{U}^{(-)}$. Specifically, for each category pair $(c_j, c_k) \in \mathcal{U}^{(+)}$, we choose (1) one reverse hypernymy pair, for predicting the directionality of hypernymy; (2) one co-hypernymy pair, for distinguishing hypernymy from semantic relatedness; (3) one randomly corrupted pair, for distinguishing hypernymy from other relations. The loss function of coarse pretraining is

$$\mathcal{L}_C = - \sum_{(c_j,c_k) \in \mathcal{U}^{(+)}} \log S(c_j, c_k) - \sum_{(c_j,c_k) \in \mathcal{U}^{(-)}} \log\left(1 - S(c_j, c_k)\right). \quad (2)$$

¹ For InstanceOf Classification, given a relation $(a_j, c_k) \in \mathcal{R}^{\mathcal{A}}$, the relation embedding is $\mathbf{r}_{a_j,c_k} = MLP(\mathbf{a}_j) - \mathbf{c}_k$, where $\mathbf{a}_j$ is the instance embedding initialized by Knowledge Embedding (introduced in Section Multi-task Learning) and $MLP(\cdot)$ is a multilayer perceptron that projects the instance embedding into the space of category embeddings.
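As a concrete illustration of Sections 4.1.1–4.1.3, the following is a minimal PyTorch-style sketch (our own simplification under stated assumptions, not the authors' released code) of the CNN category encoder, the difference-based relation embedding and the sigmoid scorer trained with the coarse loss of Eq. (2). The reverse, co-hypernymy and randomly corrupted negative pairs are assumed to be constructed outside this module.

```python
import torch
import torch.nn as nn

class CategoryEncoder(nn.Module):
    """CNN encoder over the word embeddings of a category label (Sec. 4.1.1)."""
    def __init__(self, vocab_size: int, emb_dim: int = 50, window: int = 3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=window, padding=window // 2)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) -> category vectors c_j: (batch, emb_dim)
        x = self.word_emb(word_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))
        return x.max(dim=2).values                      # max-pool over token positions

class SubClassOfScorer(nn.Module):
    """Difference relation embedding and sigmoid scorer, Eqs. (1)-(2)."""
    def __init__(self, encoder: CategoryEncoder, emb_dim: int = 50):
        super().__init__()
        self.encoder = encoder
        self.r_c = nn.Parameter(torch.randn(emb_dim))   # learned vector r_c in Eq. (1)

    def score(self, hypo_ids: torch.Tensor, hyper_ids: torch.Tensor) -> torch.Tensor:
        r = self.encoder(hypo_ids) - self.encoder(hyper_ids)   # r_{c_j,c_k} = c_j - c_k
        return torch.sigmoid(r @ self.r_c)                     # S(c_j, c_k)

    def coarse_loss(self, pos_pairs, neg_pairs) -> torch.Tensor:
        # Eq. (2): cross-entropy over U(+) (raw subCategoryOf) and the sampled negatives U(-)
        s_pos = self.score(*pos_pairs)
        s_neg = self.score(*neg_pairs)
        return -(torch.log(s_pos + 1e-9).sum() + torch.log(1 - s_neg + 1e-9).sum())
```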
4.2. Adversarial Refining

The $\mathcal{U}^{(+)}$ and $\mathcal{U}^{(-)}$ constructed above are coarse-grained because a non-negligible number of samples are placed in the wrong set. Inspired by [30, 31], we apply adversarial training to iteratively distill $\mathcal{U}^{(+)}$ and $\mathcal{U}^{(-)}$ and refine the classifier. According to the predicted score in Eq. (1), we choose the samples in $\mathcal{U}^{(+)}$ whose scores are higher than a threshold $\tau$ to construct a reliable set $\tilde{\mathcal{U}}^{(+)}$, and place the remaining samples in $\mathcal{U}^{(+)}$, together with $\mathcal{U}^{(-)}$, into an unreliable set $\tilde{\mathcal{U}}^{(-)}$.

As shown in Figure 2, we design a discriminator and a sampler to play an adversarial min-max game. Given a sample $(c_j, c_k)$, the discriminator learns a score function $D(c_j, c_k)$ to judge whether it comes from $\tilde{\mathcal{U}}^{(+)}$ or $\tilde{\mathcal{U}}^{(-)}$, while the sampler learns a probability $P_u(c_j, c_k)$ for each sample in $\tilde{\mathcal{U}}^{(-)}$, representing its chance of being a false negative. According to $P_u$, we select the most confusing negative samples from $\tilde{\mathcal{U}}^{(-)}$ to cheat the discriminator. During training, the sampler provides large amounts of latent noisy samples to enhance the discriminator, and the discriminator in turn pushes the sampler to select more informative samples. We also dynamically move the most informative and reliable samples from the unreliable set to the reliable set. Through this adversarial refining process, the classification capability of the discriminator is enhanced. Formally, the objective of the min-max game is

$$\min_{P_u} \max_{D} \Big( \mathbb{E}_{(c_j,c_k) \sim \tilde{\mathcal{U}}^{(+)}}\left[\log D(c_j, c_k)\right] + \mathbb{E}_{(c_j,c_k) \sim P_u}\left[\log\left(1 - D(c_j, c_k)\right)\right] \Big). \quad (3)$$

Discriminator. The discriminator is transferred from the coarsely trained hypernymy classifier in Eq. (1):

$$D(c_j, c_k) = \sigma\left(\mathbf{r}_{c_j,c_k} \cdot \mathbf{r}_c\right), \quad (4)$$

which is further refined with the adversarial loss.

Sampler. The sampler aims to select samples from $\tilde{\mathcal{U}}^{(-)}$ to cheat the discriminator according to $P_u$, which is calculated as

$$q(c_j, c_k) = \mathbf{m} \cdot \mathbf{r}_{c_j,c_k} + d; \qquad P_u(c_j, c_k) = \frac{\exp\left(q(c_j, c_k)\right)}{\sum_{(\tilde{c}_j, \tilde{c}_k) \in \tilde{\mathcal{U}}^{(-)}} \exp\left(q(\tilde{c}_j, \tilde{c}_k)\right)}, \quad (5)$$

where $\mathbf{m}$ and $d$ are parameters. By unfolding the min-max objective in Eq. (3), the adversarial loss for the discriminator is

$$\mathcal{L}_D = - \frac{1}{|\tilde{\mathcal{U}}^{(+)}|} \sum_{(c_j,c_k) \in \tilde{\mathcal{U}}^{(+)}} \log D(c_j, c_k) - \sum_{(c_j,c_k) \in \tilde{\mathcal{U}}^{(-)}} P_u(c_j, c_k) \log\left(1 - D(c_j, c_k)\right), \quad (6)$$

and the adversarial loss for the sampler is

$$\mathcal{L}_S = - \sum_{(c_j,c_k) \in \tilde{\mathcal{U}}^{(-)}} P_u(c_j, c_k) \log D(c_j, c_k). \quad (7)$$

As we treat instanceOf and subClassOf separately and adopt adversarial training for both of them, the holistic adversarial training losses for instanceOf and subClassOf are

$$\mathcal{L}^I_A = \mathcal{L}^I_D + \lambda^I \mathcal{L}^I_S; \qquad \mathcal{L}^C_A = \mathcal{L}^C_D + \lambda^C \mathcal{L}^C_S. \quad (8)$$

$\mathcal{L}^I_D$ and $\mathcal{L}^C_D$ are the discriminator losses for instanceOf and subClassOf respectively. Similarly, $\mathcal{L}^I_S$ and $\mathcal{L}^C_S$ denote the sampler losses, and $\lambda^I$ and $\lambda^C$ are weighting factors.
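The following is a minimal sketch (again our own simplification in PyTorch, with hypothetical names) of one adversarial refining step: the reliable/unreliable split by the threshold $\tau$, the sampler distribution $P_u$ of Eq. (5), and the losses of Eqs. (6)–(7).

```python
import torch
import torch.nn as nn

def split_by_threshold(scores: torch.Tensor, tau: float):
    """Split the coarse positives: score > tau -> reliable, otherwise unreliable."""
    reliable = scores > tau
    return reliable, ~reliable

class Sampler(nn.Module):
    """q(c_j, c_k) = m . r_{c_j,c_k} + d, softmax-normalized over the unreliable set (Eq. 5)."""
    def __init__(self, emb_dim: int = 50):
        super().__init__()
        self.m = nn.Parameter(torch.randn(emb_dim))
        self.d = nn.Parameter(torch.zeros(1))

    def forward(self, rel_emb: torch.Tensor) -> torch.Tensor:
        # rel_emb: (num_unreliable, emb_dim) -> P_u over the unreliable samples
        q = rel_emb @ self.m + self.d
        return torch.softmax(q, dim=0)

def discriminator_loss(d_reliable, d_unreliable, p_u):
    """Eq. (6): push reliable scores toward 1 and P_u-weighted unreliable scores toward 0."""
    eps = 1e-9
    return -(torch.log(d_reliable + eps).mean()
             + (p_u * torch.log(1 - d_unreliable + eps)).sum())

def sampler_loss(d_unreliable, p_u):
    """Eq. (7): the sampler concentrates P_u on samples the discriminator already trusts."""
    eps = 1e-9
    return -(p_u * torch.log(d_unreliable + eps)).sum()
```

In practice the two losses would be optimized alternately, detaching the other module's outputs in each step, mirroring the alternating strategy described in Section Model Training.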
4.3. Multi-task Learning

Besides distilling the training data and refining the classifiers through adversarial learning, we further incorporate multi-task learning to enhance the hypernymy classifiers. The main idea is that (1) InstanceOf Classification and SubClassOf Classification can mutually enhance each other because instances and classes are highly correlated, and (2) relational facts about instances provide rich semantics that benefit Hypernymy Classification. Specifically, we learn three sub-tasks, Knowledge Embedding, InstanceOf Classification and SubClassOf Classification, simultaneously to integrate the instance-instance, instance-class and class-class information flow.

For Knowledge Embedding, we introduce a knowledge graph $\mathcal{G}$, which expresses data as a directed graph $\mathcal{G} = \{\mathcal{I}, \mathcal{P}, \mathcal{T}\}$, where $\mathcal{I}$, $\mathcal{P}$ and $\mathcal{T}$ denote the sets of instances, predicates and triples respectively. A score function $K(h, p, t)$ is learned to measure the plausibility of $(h, p, t)$ being a legal triple, where $h, t \in \mathcal{I}$ and $p \in \mathcal{P}$. In this paper, we utilize TransE [32] as a representative, whose scoring function is $K(h, p, t) = -\|\mathbf{h} + \mathbf{p} - \mathbf{t}\|$, where $\mathbf{h}, \mathbf{p}, \mathbf{t}$ are the embeddings of instances and predicates. We utilize a hinge loss function $\mathcal{L}_K$, calculated as

$$\mathcal{L}_K = \sum_{(h,p,t) \in \mathcal{G}} \sum_{(\tilde{h},\tilde{p},\tilde{t}) \notin \mathcal{G}} \max\left(0, \gamma - K(h, p, t) + K(\tilde{h}, \tilde{p}, \tilde{t})\right), \quad (9)$$

where $\gamma$ is a hyper-parameter denoting the margin. Finally, the overall loss of multi-task learning is formalized as

$$\mathcal{L} = \mathcal{L}_K + \alpha_1 \mathcal{L}^I_A + \alpha_2 \mathcal{L}^C_A, \quad (10)$$

where $\alpha_1$ and $\alpha_2$ are two weighting factors. Specifically, instance embeddings are shared by Knowledge Embedding and InstanceOf Classification, and class embeddings are shared by InstanceOf Classification and SubClassOf Classification. By jointly optimizing the shared parameters, we fully integrate the connections among instances and classes and enhance the hypernymy classifiers.

4.4. Model Training

First, we optimize the loss function $\mathcal{L}_C$ in Eq. (2). Then, we use the coarsely trained model and the hyper-parameter $\tau$ to construct $\tilde{\mathcal{U}}^{(+)}$ and $\tilde{\mathcal{U}}^{(-)}$ for adversarial training. In practice, we share the parameters of the classifier (Eq. (2)) and the discriminator (Eq. (6)) to warm up the adversarial training process. Then, we optimize the multi-task learning loss in Eq. (10). $\mathcal{L}^I_D$ and $\mathcal{L}^I_S$ are optimized alternately, with $\lambda^I$ folded into the learning rate of $\mathcal{L}^I_S$ to avoid extra tuning; $\mathcal{L}^C_D$ and $\mathcal{L}^C_S$ follow the same optimization strategy. Instead of directly updating $\mathcal{L}$, we optimize $\mathcal{L}_K$, $\mathcal{L}^I_A$ and $\mathcal{L}^C_A$ alternately.
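For illustration, the following is a small sketch (a simplification under our own assumptions, not the authors' implementation) of the Knowledge Embedding sub-task with the TransE score and the hinge loss of Eq. (9); the comment at the end indicates how it would combine with the adversarial losses into the overall objective of Eq. (10).

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Knowledge Embedding sub-task: K(h, p, t) = -||h + p - t|| (Sec. 4.3)."""
    def __init__(self, num_instances: int, num_predicates: int, dim: int = 50):
        super().__init__()
        self.inst = nn.Embedding(num_instances, dim)   # instance embeddings, shared with InstanceOf classification
        self.pred = nn.Embedding(num_predicates, dim)

    def score(self, h: torch.Tensor, p: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return -torch.norm(self.inst(h) + self.pred(p) - self.inst(t), dim=-1)

    def hinge_loss(self, pos, neg, gamma: float = 1.0) -> torch.Tensor:
        # Eq. (9): corrupted triples should score at least gamma below true triples
        return torch.clamp(gamma - self.score(*pos) + self.score(*neg), min=0).sum()

# Overall multi-task objective of Eq. (10), optimized alternately in practice:
# loss = transe.hinge_loss(pos, neg) + alpha1 * instanceof_adv_loss + alpha2 * subclassof_adv_loss
```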
Table 1: The statistics of the Wikipedia dumps and evaluation datasets.

Wikipedia dump
Language | Articles | Categories | ArticleOf | SubCategoryOf
English | 5,139,414 | 1,380,351 | 25,841,897 | 3,416,766
French | 2,033,360 | 372,208 | 6,661,384 | 814,539
Italian | 1,406,807 | 361,728 | 2,394,169 | 683,477
Spanish | 1,483,920 | 365,611 | 4,566,147 | 815,055

Evaluation datasets
Dataset | Articles | Categories | ArticleOf | SubCategoryOf
ENT5k (English) | 5,989 | 5,983 | 27,696 | 19,857
French | 200 | 187 | 862 | 430
Italian | 200 | 184 | 1,225 | 382
Spanish | 200 | 200 | 706 | 438

Table 2: P* of the original WCN.

Language | articleOf | subCategoryOf
English | 97.0% | 75.2%
French | 72.0% | 78.8%
Italian | 74.5% | 76.2%
Spanish | 81.4% | 80.9%

5. Experiments

5.1. Datasets

As far as we know, previous datasets for TDWCN are all small. For example, as shown in Table 1, the datasets in MultiTax contain only about 200 articles and 200 categories. For better evaluation, we create a large-scale English evaluation dataset, ENT5k. Specifically, we use a 2018 snapshot of Wikipedia, select 7,000 articles and 7,000 categories from its WCN, and then annotate whether each articleOf and subCategoryOf link of the sampled WCN is a hypernymy or not. Each articleOf or subCategoryOf link is allocated to 5 highly-educated crowd-workers², and only the links consented to by more than 4 crowd-workers are kept to assure quality. Instead of selecting categories randomly, we consider both abstract pairs such as "(Learning, Education)" and specific ones such as "(American Male Painters, American Painters)", and select categories covering diverse areas such as people, society and geography. Finally, ENT5k contains 5,989 articles, 5,983 categories, 27,696 articleOf and 19,857 subCategoryOf links. In the annotated data, incorrect relations account for 3.0% of articleOf and 24.8% of subCategoryOf.

² Inter-annotator agreement (Cohen's Kappa) is 0.72.

5.2. Baselines

As far as we know, our model is the first weakly-supervised method for TDWCN. We compare it with the following rule-based and supervised methods:

Heads [11], a rule-based method designed only for English.

MENTA [13], a rule-based method that links WordNet and the Wikipedias of different languages into a single taxonomy using heuristic rules.

MultiWiBi [10], a rule-based method that induces taxonomies for English and then transfers them to other languages using heuristic rules and inter-language links.

MultiTax [12], a supervised method that, given an English taxonomy as a source, first constructs a supervised dataset for the target language using inter-language links and then trains binary classifiers. MultiTax is not designed for inducing an English taxonomy; instead, it takes the existing English taxonomy as input.

5.3. Model Settings and Evaluation Metrics

We use pretrained 50-dimensional GloVe [33] for English and 300-dimensional fastText [34] for the other languages. For the knowledge graph in $\mathcal{L}_K$ (Eq. (9)), we employ Wikidata [35], which is closely related to the WCN. The optimizer is selected through a grid search over {Adam, Adagrad, SGD}. The learning rate is selected over {0.1, 0.01, 0.001}. The thresholds $\tau$ for subClassOf and instanceOf are selected over {0.1, 0.2, ..., 0.9}. The margin for Knowledge Embedding is selected over {0.5, 1.0, 2.0, 3.0, 4.0, 5.0}. Finally, the optimizers for $\mathcal{L}^C_D$, $\mathcal{L}^C_S$, $\mathcal{L}^I_D$, $\mathcal{L}^I_S$ and $\mathcal{L}_K$ are Adam, Adam, Adagrad, Adagrad and SGD respectively, and their learning rates are 0.001, 0.001, 0.01, 0.01 and 0.1 respectively. The thresholds $\tau$ for subClassOf and instanceOf are both 0.9. The margin $\gamma$ for Knowledge Embedding is 1.0. The hidden size and sliding window size of the CNN are 50 and 3 respectively.

The results of MultiWiBi for non-English languages, MENTA and MultiTax are evaluated by [12]. Ideally, the comparison on English with MultiWiBi and Heads would use evaluation datasets built from the matching older Wikipedia snapshots. However, the 2012 and 2015 snapshots of Wikipedia are no longer available (e.g., https://dumps.wikimedia.org/enwiki/ does not keep old versions of Wikipedia). As a compromise, Heads and MultiWiBi for English are evaluated on ENT5k using their published taxonomies. For French, Italian and Spanish, we use the small datasets with only 200 articles from [12]; for English, we use ENT5k. The results of MENTA and MultiTax are not reported for English because (1) MultiTax is not designed for English, and (2) for MENTA, the code is not public and we cannot reproduce it because many details are missing from the paper.

For a fair comparison with the baselines, we follow the evaluation metrics used in MultiWiBi [10]: (1) Macro-precision (P*), the average ratio of correct hypernyms to the total number of hypernyms returned, computed per node in the taxonomies; (2) Recall (R*), the ratio of nodes for which at least one correct hypernym is returned; (3) Coverage (C), the ratio of nodes with at least one hypernym returned, irrespective of its correctness. Note that (1) P* and R* are different from the conventional precision and recall; (2) an F1 combination of P* and R* is meaningless; (3) the R* and C of the original WCN are 100%. The P* of the raw WCN according to the annotated evaluation datasets is shown in Table 2.
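As a reading aid, the following is a small sketch of how P*, R* and C can be computed from per-node predictions; it reflects our own interpretation of the metric definitions above (in particular, P* is averaged over nodes that return at least one hypernym), with hypothetical data structures.

```python
from typing import Dict, List, Set, Tuple

def taxonomy_metrics(predicted: Dict[str, List[str]],
                     gold: Dict[str, Set[str]],
                     all_nodes: List[str]) -> Tuple[float, float, float]:
    """Return (P*, R*, C) for a derived taxonomy.

    P*: per-node precision of returned hypernyms, averaged over nodes with predictions.
    R*: fraction of nodes with at least one correct hypernym returned.
    C:  fraction of nodes with at least one hypernym returned at all.
    """
    per_node_precision, hit, covered = [], 0, 0
    for node in all_nodes:
        hypernyms = predicted.get(node, [])
        if not hypernyms:
            continue
        covered += 1
        correct = sum(1 for h in hypernyms if h in gold.get(node, set()))
        per_node_precision.append(correct / len(hypernyms))
        if correct > 0:
            hit += 1
    n = len(all_nodes)
    p_star = sum(per_node_precision) / max(covered, 1)
    return p_star, hit / n, covered / n
```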
Table 3: Precision (P*), recall (R*) and coverage (C) scores. The top 2 results are in bold. The best among ours and the rule-based methods are also underlined.

Language | Method | instanceOf P* | R* | C | subClassOf P* | R* | C
English | Heads | 21.9 | 52.0 | 69.8 | 44.0 | 60.8 | 53.7
English | MultiWiBi | 84.1 | 79.4 | 92.6 | 39.4 | 63.9 | 71.7
English | Ours | 97.6 | 97.6 | 100 | 85.5 | 85.5 | 100
French | MENTA | 81.4 | 48.8 | 59.8 | 82.6 | 55.0 | 65.7
French | MultiWiBi | 84.5 | 80.9 | 94.1 | 80.7 | 80.7 | 100
French | Ours | 84.6 | 84.0 | 100 | 94.1 | 94.1 | 100
French | MultiTax | 88.0 | 91.7 | 100 | 93.9 | 95.1 | 100
Italian | MENTA | 79.7 | 53.2 | 66.7 | 77.1 | 25.4 | 32.8
Italian | MultiWiBi | 80.1 | 79.4 | 96.3 | 89.7 | 89.0 | 99.2
Italian | Ours | 83.2 | 83.2 | 100 | 84.9 | 84.9 | 100
Italian | MultiTax | 92.6 | 97.2 | 100 | 89.2 | 90.9 | 100
Spanish | MENTA | 81.0 | 42.9 | 52.7 | 80.5 | 54.2 | 66.4
Spanish | MultiWiBi | 87.0 | 82.0 | 93.7 | 84.8 | 84.4 | 100
Spanish | Ours | 85.7 | 85.7 | 100 | 94.9 | 94.9 | 100
Spanish | MultiTax | 93.4 | 96.3 | 100 | 92.9 | 95.1 | 100

5.4. Overall Performance

Table 3 shows the overall performance. From the table, we observe that: (1) For English, our model distinguishes the non-hypernymy from the original WCN and improves P* by a large margin. It also surpasses Heads and MultiWiBi significantly, indicating that our neural model can outperform rule-based methods. (2) For the non-English languages, our method significantly outperforms the rule-based models. Even compared with the supervised model, our weakly-supervised model provides comparable performance. Especially for subClassOf, it even outperforms MultiTax slightly in French and Spanish, indicating that weakly-supervised methods are promising and worth exploring in the future. (3) Our model's performance on instanceOf is worse than on subClassOf. A possible reason is that infrequent instances cannot learn good representations due to data sparsity. As described in Section Coarse Pretraining, categories are represented by a textual encoder, but most instances (e.g., "Donald Duck") are named entities whose semantics go beyond what the word sequence can describe. Instead, instance embeddings are randomly initialized and further learned from the knowledge graph $\mathcal{G}$ by Knowledge Embedding. However, as previous studies show [36, 37], the frequency of instances follows a power-law distribution and most instances are infrequent, so they cannot learn good representations, which further harms InstanceOf Classification. A reasonable solution is to utilize instance descriptions, which we leave for future work.

As shown in Table 2, the original WCNs for different languages vary a lot. For French, only 72.0% of articleOf links are correct, while for English 97.0% are correct. For English, improving InstanceOf Classification is not easy but necessary, because more than 25 million articleOf links exist in the WCN and about 750k of them are invalid, which would harm SubClassOf Classification through error propagation. Note that we propose the language-independent method to avoid the excessive manual rules and corpus labeling of language-dependent methods. Our focus is on reducing cost and improving generalization, rather than claiming that our experimental results are necessarily better than theirs. Accordingly, in the experiments, compared to language-dependent methods (which are based on manual rules or corpus annotations), our model achieves comparable results and can be applied generally to different languages without rules or annotations, showing the benefits of our method.

5.5. Ablation Study

In this section, we conduct an ablation study to further investigate the proposed adversarial training strategy and multi-task learning framework. Without loss of generality, we investigate the subClassOf results on ENT5k.
We refer to the coarsely trained classifier as BASE, the classifier with adversarial training as BASE+ADV, and the classifier with both adversarial training and multi-task learning as BASE+ADV+MT. Further, we denote the BASE+ADV+MT model without Knowledge Embedding as BASE+ADV+MT−. We use a variant of precision (P̃), recall (R̃), F1 score (F̃1) and area under the curve (ÃUC) for evaluation. Specifically, since we expect our model to find as many true negatives as possible, we employ P̃ = TN / (TN + FN) and R̃ = TN / (TN + FP), where TN, FN and FP denote true negative, false negative and false positive results respectively. ÃUC is the area under the P̃-R̃ curve. The overall results are shown in Table 4.

The reasons different metrics are used in the overall performance evaluation and in the ablation study are as follows: (1) the overall performance evaluation measures the quality of the taxonomy, while the ablation study investigates the effectiveness of hypernymy classification; (2) the quality of our taxonomy is not determined by hypernymy classification alone, because taxonomy derivation comprises three steps: network cleanup, hypernymy classification and taxonomy induction. Therefore, in the ablation study we use precision, recall and F1 measures for the classification model, while in the overall performance evaluation we follow previous work and use P*, R* and C.

Effect of Adversarial Training. When we apply adversarial training, F̃1 improves by 3.6 percentage points and ÃUC by 3.9 percentage points, which indicates that adversarial training improves classification performance and generalization ability. We further conduct an in-depth study of the sampler. The examples in Table 5 show that, given a hyponym and multiple candidate hypernyms, the sampler can reasonably estimate the probability distribution and distinguish informative candidates from noisy ones. The informative data from the sampler further boosts the performance of the discriminator, which explains the improvement of BASE+ADV over BASE.

Table 4: Results of the ablation study. From BASE to BASE+ADV, from BASE+ADV to BASE+ADV+MT− and from BASE+ADV+MT− to BASE+ADV+MT, both the F̃1 and ÃUC improvements were found to be statistically significant using a two-sample t-test with p < 0.01.

Method | P̃ | R̃ | F̃1 | ÃUC
BASE | 25.6 | 78.6 | 38.6 | 28.7
BASE+ADV | 30.7 | 67.3 | 42.2 (+3.6) | 32.6 (+3.9)
BASE+ADV+MT− | 29.5 | 80.7 | 43.2 (+1.0) | 34.8 (+2.2)
BASE+ADV+MT | 40.1 | 53.8 | 45.9 (+2.7) | 39.3 (+4.5)

Table 5: The scores of hypernymy candidates for one hyponym, as assigned by the sampler. Take "Royal families" as an example: according to the sampler, the most informative candidate hypernym is "Political families" and the noisiest one is "Oligarchs", which is reasonable.

Royal families: Political families 0.27 | Noble families 0.22 | Monarchy 0.19 | Royalty 0.16 | Oligarchs 0.16
Feminists: People by political orientation 0.24 | People associated with identity politics 0.22 | People associated with feminism 0.21 | Feminist movement 0.20 | Feminism 0.13

Table 6: Comparison of the ranking of candidate hypernyms between BASE+ADV and BASE+ADV+MT. GT means ground truth (✓ correct, × incorrect).

Open-source movement — BASE+ADV: Collaboration (×), Criticism of intellectual property (×), Sharing (×); BASE+ADV+MT: Social movements (✓), Criticism of intellectual property (×), Sharing (×), Collaboration (×)
Treasure hunters — BASE+ADV: Treasure troves (×), People by occupation (✓), Treasure (×); BASE+ADV+MT: People by occupation (✓), Treasure troves (×), Treasure (×)

Effect of Multi-task Learning.
When we compare BASE+ADV+MT− with BASE+ADV, F̃1 and ÃUC improve by 1.0 and 2.2 percentage points respectively, demonstrating that InstanceOf and SubClassOf classification mutually enhance each other. When introducing Knowledge Embedding, we achieve the best F̃1 and ÃUC among all models: comparing BASE+ADV+MT with BASE+ADV+MT−, F̃1 and ÃUC increase by 2.7 and 4.5 percentage points respectively. This shows that the relational facts among instances benefit Hypernymy Classification. All results demonstrate the effectiveness of our multi-task learning framework. To further show this, we conduct a case study comparing BASE+ADV+MT with BASE+ADV; from Table 6 we can see that BASE+ADV+MT produces more reasonable results.

6. Conclusion

In this paper, we propose a language-independent model for TDWCN, which (1) designs an adversarial learning approach to distill hypernymy relations from the noisy raw Wikipedia category network without the limitation of language dependencies, and (2) incorporates multi-task learning to integrate the information flow among instances and classes. In addition, we contribute a large-scale evaluation dataset with 27k articleOf and 19k subCategoryOf links for TDWCN. Experimental results on 4 different languages demonstrate that our model can be applied generally to different languages and achieves better or comparable performance compared with previous language-dependent approaches. Future work includes investigating instance embeddings, deriving taxonomies for more languages and extracting domain-specific taxonomies based on our approach.

References
[1] J. Hoffart, F. M. Suchanek, K. Berberich, G. Weikum, Yago2: A spatially and temporally enhanced knowledge base from wikipedia, Artif. Intell. (2013).
[2] A. B. Abacha, P. Zweigenbaum, Means: A medical question-answering system combining nlp techniques and semantic web technologies, Information Processing & Management 51 (2015).
[3] B. W. Liu, W. Guo, D. Niu, C. Wang, S. Xu, J. Lin, K. Lai, Y. Xu, A user-centered concept mining system for query and document understanding at tencent, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019).
[4] X. Han, P. Yu, Z. Liu, M. Sun, P. Li, Hierarchical relation extraction with coarse-to-fine grained attention, in: EMNLP, 2018.
[5] C. Elkan, R. Greiner, Building large knowledge-based systems: Representation and inference in the cyc project: D. B. Lenat and R. V. Guha, 1993.
[6] G. A. Miller, WordNet: An electronic lexical database, MIT Press, 1998.
[7] M. A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: ACL, 1992.
[8] M. Le, S. Roller, L. Papaxanthos, D. Kiela, M. Nickel, Inferring concept hierarchies from text corpora via hyperbolic embeddings, in: ACL, 2019.
[9] S. P. Ponzetto, M. Strube, Wikitaxonomy: A large scale knowledge resource, in: ECAI, 2008.
[10] T. Flati, D. Vannella, T. Pasini, R. Navigli, Multiwibi: The multilingual wikipedia bitaxonomy project, Artif. Intell. 241 (2016).
[11] A. Gupta, F. Piccinno, M. Kozhevnikov, M. Pasca, D. Pighin, Revisiting taxonomy induction over wikipedia, in: COLING, 2016.
[12] A. Gupta, R. Lebret, H. Harkous, K. Aberer, 280 birds with one stone: Inducing multilingual taxonomies from wikipedia using character-level classification, in: AAAI, 2018.
[13] G. de Melo, G. Weikum, Menta: Inducing multilingual taxonomies from wikipedia, in: CIKM, 2010.
[14] Z. Wang, J. Li, S. Li, M. Li, J. Tang, K. Zhang, K. Zhang, Cross-lingual knowledge validation based taxonomy derivation from heterogeneous online wikis, in: AAAI, 2014.
[15] C. X. Chu, S. Razniewski, G. Weikum, Tifi: Taxonomy induction for fictional domains, in: WWW, 2019.
[16] S. Roller, K. Erk, G. Boleda, Inclusive yet selective: Supervised distributional hypernymy detection, in: COLING, 2014.
[17] M. Bansal, D. Burkett, G. de Melo, D. Klein, Structured learning for taxonomy induction with belief propagation, in: ACL, 2014.
[18] A. Gupta, R. Lebret, H. Harkous, K. Aberer, Taxonomy induction using hypernym subsequences, in: CIKM, 2017.
[19] Y. Mao, X. Ren, J. Shen, X. Gu, J. Han, End-to-end reinforcement learning for automatic taxonomy induction, in: ACL, 2018.
[20] R. Aly, S. Acharya, A. Ossa, A. Köhn, C. Biemann, A. Panchenko, Every child should have parents: a taxonomy refinement algorithm based on hyperbolic term embeddings, in: ACL, 2019.
[21] G. Bordea, S. Faralli, F. Mougin, P. Buitelaar, G. Diallo, Evaluation dataset and methodology for extracting application-specific taxonomies from the wikipedia knowledge graph, in: LREC, 2020.
[22] T. Flati, D. Vannella, T. Pasini, R. Navigli, Two is bigger (and better) than one: the wikipedia bitaxonomy project, in: ACL, 2014.
[23] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, Dbpedia: A nucleus for a web of open data, in: ISWC, 2007.
[24] V. Nastase, M. Strube, B. Börschinger, C. Zirn, A. Elghafari, Wikinet: A very large scale multi-lingual concept network, in: LREC, 2010.
[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus, Intriguing properties of neural networks, CoRR abs/1312.6199 (2013).
[26] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, CoRR abs/1412.6572 (2014).
[27] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh, Attacking visual language grounding with adversarial examples: A case study on neural image captioning, in: ACL, 2018.
[28] Z. Sun, Z.-H. Deng, J.-Y. Nie, J. Tang, Rotate: Knowledge graph embedding by relational rotation in complex space, in: ICLR, 2019.
[29] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012.
[30] P. Qin, W. Xu, W. Y. Wang, Dsgan: Generative adversarial training for distant supervision relation extraction, in: ACL, 2018.
[31] X. Wang, X. Han, Z. Liu, M. Sun, P. Li, Adversarial training for weakly supervised event detection, in: NAACL-HLT, 2019.
[32] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: NIPS, 2013.
[33] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: EMNLP, 2014.
[34] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: EACL, 2016.
[35] D. Vrandecic, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57 (2014).
[36] Z. Wang, K. P. Lai, P. Li, L. Bing, W. H. Lam, Tackling long-tailed relations and uncommon entities in knowledge graph completion, ArXiv abs/1909.11359 (2019).
[37] E. Cao, D. Wang, J. Huang, W. Hu, Open knowledge enrichment for long-tail entities, Proceedings of The Web Conference 2020 (2020).