Domain-specific Taxonomy Enrichment based on Meta-Embeddings

Mikhail Tikhomirov1[0000−0001−7209−9335] and Natalia V. Loukachevitch1
Lomonosov Moscow State University, Moscow, Russia
tikhomirov.mm@gmail.com, louk nat@mail.ru

Abstract. In this paper we study the use of meta-embedding approaches, which combine several source embeddings, for taxonomy class prediction for new terms. We test the proposed approach in the information-security domain in the task of enriching the Ontology on Natural Sciences and Technologies (OENT). We show that autoencoder-based meta-embeddings with triplet loss achieve the best results in the task. The highest results are obtained on a combination of in-domain and out-of-domain embeddings.

Keywords: Taxonomy · Hypernym prediction · Meta-embeddings

1 Introduction

Ontologies and knowledge graphs in most domains have a taxonomy as a backbone. Relations in taxonomies usually comprise class-subclass relations between concepts, or instance-class relations connecting a specific entity representation and a concept [3,13]: relations of both types can be called IS-A relations or hypernym relations [24]. Development of an ontology in a new domain usually begins with constructing its taxonomy, which determines the ontology scope. To make it easier to build a taxonomy, various approaches have been proposed for extracting hypernym relations for new terms from texts, including specific patterns, word co-occurrences, distributional characteristics of words, and others [27]. Currently, an important component of extracting hypernym relations from texts is vector representations (embeddings) of words, which can provide additional evidence of semantic similarity between words [12,18,28], which is important for identification of the hypernym concept for a new term.
Word vectors can be calculated using various text collections and various methods, which means that different vector representations capture context in different ways, resulting in a wide variety of vector representations for the same words. From here, we can suppose that some combinations of vectors, so-called meta-embeddings [8], can improve the vector representation of words, which allows achieving better prediction of semantic similarity between words or their hypernym concepts.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

It was already shown that meta-embeddings improved performance in word analogy and similarity tasks [41,8,7]. In recent work [38], it was shown that combinations of general word embeddings calculated on large Internet text collections have a substantial impact on the performance in taxonomy enrichment of general lexical-semantic resources such as WordNet [24] and RuWordNet [20]. In this paper we show that, in the task of extracting taxonomic relations in a specific domain, meta-embeddings combining general (out-of-domain) and in-domain embeddings significantly improve the prediction of a hypernym concept from the given taxonomy for a new term. We experiment on an information-security text collection, which is used to enrich the Ontology on Natural Sciences and Technologies [11,36] in the information-security domain.

2 Related Work

Traditional methods for hypernym detection include pattern-based methods, searching for specific hypernym patterns in sentences [15,30,31], methods based on similarity of word vector representations [12,18], and also combined approaches integrating various context and similarity features of words [35,4,34]. In 2016, the taxonomy enrichment task was organized as a shared task at the SemEval workshop (task 14) [16].
In this task, the participants had to attach new words to correct hypernyms in WordNet [24] using their definitions. In 2020, a new open evaluation on taxonomy enrichment of the Russian wordnet RuWordNet [20], RUSSE'2020, was organized [27]. The task was to find correct hypernyms from an older RuWordNet version for words described in a newer RuWordNet version. In the RUSSE-2020 evaluation of predicting RuWordNet hypernym synsets for new words [27], the participants used various word embeddings (static – fastText [5], word2vec [21] – and contextualized – BERT [10]), the available RuWordNet taxonomy structure, hypernym and co-hyponym patterns, definitions of words from Wiktionary, and global search engine results [2,9,37,27].

Recent methods for hypernym extraction exploit graph-based representations of the taxonomy structure. Liu et al. [19] use node2vec embeddings of graph structures [14] for taxonomy induction. Aly et al. [1] use hyperbolic Poincaré embeddings [26] for automatic generation of taxonomies. Graph convolutional networks (GCNs) [17] are applied to the link prediction task on large knowledge bases. In [29], the authors study graph-based representation methods on the diachronic wordnets dataset, which contains several English and Russian WordNet versions and gold-standard links from words of newer versions to concepts of older versions.

Most current approaches to taxonomy enrichment are based on vector representations of new words and existing concepts in a taxonomy [28]. To improve vector representations, combinations of several source vector representations, such as vector concatenation or averaging, can be used [8]. In [41] it was shown that using singular value decomposition (SVD) over the concatenation of several source vectors can improve the results in several tasks, with the ability to control the final vector size.
Autoencoders [7], called Autoencoded Meta-Embeddings (AEME), became a further development of the idea of creating meta-embeddings. In [7], the authors proposed several algorithms (CAEME, AAEME, etc.) for combining various word vectors into one vector by encoding the initial vectors into some meta-embedding space and then decoding them back. The CAEME approach tries to reproduce the source vectors from the concatenation of the encoded representations of these vectors. In the AAEME approach each vector is mapped to a fixed-size vector and all encoded representations are averaged rather than concatenated, which restricts the vector dimension.

In [25] the authors investigated the performance of the autoencoders depending on the loss function (MSE loss, KL-divergence loss, cosine distance loss, and also their combinations). They found that there is no evident winner across tasks and that different loss functions should be chosen for different applications. In [38] the best results for enriching taxonomies of general lexical-semantic resources such as WordNet [23] and RuWordNet [20] were achieved using AAEME encoders with triplet loss, combining fastText, GloVe and word2vec embeddings in a single meta-representation. The meta-representation was further used for training a supervised model, which also included features from Wiktionary.

In the current paper we study the performance of meta-embedding approaches in domain-specific taxonomy enrichment: we experiment with assigning new terms from the information-security domain to the Ontology on Natural Sciences and Technologies (OENT).

3 OENT Ontology

We study the task of domain-specific taxonomy enrichment using the Ontology on Natural Sciences and Technologies (OENT) [11,36]. The OENT ontology [11] is presented as a semantic net of concepts and relations between them; each concept is connected with the set of words and phrases that can express this concept in documents (text entries).
All text entries of the same concept can be called a synset, similar to WordNet synsets [23]. For example, the "Mathematical analysis" concept has Russian text entries (a synset) such as matematicheskii analis, matanalis, matan. Synsets in OENT can include different parts of speech: nouns, adjectives, verbs, or adverbs. The OENT ontology is used for automatic document analysis in information-analytical systems, which includes providing conceptual search, query expansion using the ontology, knowledge-based document categorization, etc. [36].

OENT comprises large volumes of concepts and terminology from several scientific disciplines and technological domains, presented as a connected semantic network of concepts with corresponding text entries and relations between concepts [11]. The ontology was started by extracting terms from specialized text collections (web-sites, school and university textbooks) in mathematics, physics, geology, biology, and chemistry. Currently, this initial terminology is collected in the OENT subset called OENT-lite. In further projects, the available conceptual structures were elaborated to more specific levels, and the terminologies of technological domains such as the oil-and-gas industry, power energy, education policy and techniques, computer technologies, and information security were added to OENT. The full version of OENT consists of 106K concepts and 308K single- and multi-word terms, while OENT-lite consists of 37K concepts and 133K terms. In the current study we use the OENT-lite ontology and the terminology of the information-security domain to study the enrichment of the ontology with domain-specific terminology via extracting hypernym concept relations from domain-specific text collections.

4 Taxonomy Enrichment Task

The task of taxonomy enrichment consists in finding an appropriate concept from a given taxonomy for a new word, which can be considered as a hypernym or class for this word.
Similar to the RUSSE-2020 evaluation, the task is to find a direct (the closest) hypernym concept from the taxonomy [28]. To make the extraction less restricted but still quite precise, second-order hypernym concepts (hypernyms of hypernyms) are also considered correct answers. In this, we try to simulate the work of knowledge engineers, who should find the most specific concept in a taxonomy to attach a new domain term.

Taxonomy enrichment is treated as a ranking task where the correct answers should be at the top of a candidate list [28]. In contrast to the classification setting, ranking is more appropriate in conditions where the share of correct answers is much smaller than the overall number of candidates.

As a subject area for taxonomy enrichment, the information-security domain was chosen, represented by a corpus of 500 thousand texts. For this corpus, a frequency list was built so that each word occurs at least 50 times. The OENTCyber dataset for evaluating hypernym detection was constructed as follows:

1. All one-word text entries of concepts from the OENT ontology were selected that appear in the full version but are absent from the OENT-lite version;
2. From this list, only words for which the hypernyms are present in OENT-lite were taken.

As a result, a dataset of 4372 words was obtained, and the task was to predict hypernyms (OENT concepts) for a given set of words using OENT-lite and the available corpus of articles from the information-security domain. This dataset contains specific names such as "chrome", "amazon", "cisco", etc. and specific terms such as "css3", "dbscan", "dll", etc.

5 Method of Hypernym Concept Prediction

In our approach, we use word embeddings to generate a list of the taxonomy entries (words or phrases from the taxonomy) most similar to the target word according to cosine similarity. For each target word, the top 20 taxonomy entries are considered.
The number of elements for consideration was chosen experimentally [38]. For each entry in the similarity list, all corresponding concepts and their direct and second-order hypernyms are extracted from the taxonomy. They are considered candidate concepts to be hypernyms of the target word.

For candidate hypernym concepts, several features are calculated, and logistic regression is used to predict the probability of a candidate being a hypernym of the target word. The calculated features are as follows:

– the minimum, average, and maximum similarities of the target word to all words of the concept synset;
– the features based on hyponyms of a candidate concept synset:
  • we extract all hyponyms (lower classes) of the candidate concept;
  • for each word/phrase in each hyponym synset we compute its similarity to the target word;
  • we compute the minimum, average, and maximum similarity for each hyponym synset;
  • we form three vectors: a vector of minimum similarities, a vector of average similarities, and a vector of maximum similarities of the hyponym synsets;
  • for each of these vectors we compute the minimum, average, and maximum, and use the resulting 9 numbers as features;
– the minimum, average, and maximum level of the concept in the merged candidate list:
  • the level is 0 if the concept was added based on similarity to the target word;
  • the level is 1 for the immediate hypernyms of a word in the similarity list;
  • the level is 2 for the hypernyms of the hypernyms of words in the similarity list;
– the number of occurrences (n) of the concept in the merged candidate list and the quantity log2(2 + n), which serves for smoothing.

In total, 17 features were calculated. Training data were generated randomly and automatically from OENT-lite (thus, the training data do not overlap with the test data). We use two models for calculating word embeddings: fastText [6] and word2vec [22].
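As a rough sketch (not the authors' code), the candidate-generation step and the level/count features described above could look as follows; the data structures `entry_vecs`, `entry_to_concepts` and `hypernyms` are our own assumptions about how the taxonomy might be represented:

```python
import math
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def candidate_concepts(target_vec, entry_vecs, entry_to_concepts, hypernyms, top_k=20):
    """Rank taxonomy entries by cosine similarity to the target word, then
    collect their concepts together with first- and second-order hypernyms.
    Each candidate keeps the levels at which it was added: 0 for the similar
    entry's own concept, 1 for its hypernym, 2 for a hypernym of a hypernym."""
    ranked = sorted(entry_vecs,
                    key=lambda e: cosine(target_vec, entry_vecs[e]),
                    reverse=True)[:top_k]
    candidates = {}
    for entry in ranked:
        for concept in entry_to_concepts.get(entry, []):
            candidates.setdefault(concept, []).append(0)
            for h1 in hypernyms.get(concept, []):
                candidates.setdefault(h1, []).append(1)
                for h2 in hypernyms.get(h1, []):
                    candidates.setdefault(h2, []).append(2)
    return candidates

def level_count_features(levels):
    # Level- and count-based features of one candidate concept,
    # including the smoothed count log2(2 + n).
    n = len(levels)
    return {"min_level": min(levels), "avg_level": sum(levels) / n,
            "max_level": max(levels), "n": n, "log_n": math.log2(2 + n)}
```

The similarity-based synset features and the logistic regression on top of the full 17-feature vector would be built in the same spirit.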
In order to obtain vectors for words absent in the embedding models, the following procedures were carried out:

– For fastText, embeddings for out-of-vocabulary words are obtained in a natural way, by computing vectors with the model itself;
– For word2vec, embeddings are calculated by averaging the vectors of maximum prefixes of the constituent words of a multi-word expression. There is a limitation on the minimum length of a prefix word, which is 4 characters;
– For meta-embeddings, if there is no vector for a word in some model, the corresponding source vector is initialized with zeros.

6 Meta-Embeddings in Taxonomy Enrichment Task

In our work we compare simple meta-embeddings, such as concatenation of source embeddings and SVD over the concatenation, with two variants of autoencoders generating meta-embeddings: Concatenated Autoencoded Meta-Embeddings (CAEME) and Averaged Autoencoded Meta-Embeddings (AAEME), which have shown good results in previous works [7,38].

Suppose we have two source embeddings s1(w) and s2(w), their encoders E1 and E2, and their decoders D1 and D2. The meta-embedding m(w) in CAEME is constructed as the L2-normalised concatenation of the two encoded source embeddings E1(s1(w)) and E2(s2(w)):

m(w) = (E1(s1(w)) ⊕ E2(s2(w))) / ||E1(s1(w)) ⊕ E2(s2(w))||2, (1)

where ⊕ is the concatenation operation. In CAEME, the dimensionality of the meta-embedding space is the sum of the dimensions of the source embeddings. The AAEME encoder can be seen as a special case of the CAEME encoder, where the meta-embedding is computed by averaging the two encoded sources in (1) instead of concatenating them. Averaging makes it possible to avoid increasing the dimensionality of the meta-embedding. The AAEME encoder computes the meta-embedding of a word w from its two source embeddings s1(w) and s2(w) as the L2-normalised sum of the two encoded versions of the source embeddings:

m(w) = (E1(s1(w)) + E2(s2(w))) / ||E1(s1(w)) + E2(s2(w))||2. (2)

The CAEME and AAEME decoders reconstruct the source embeddings from the same meta-embedding m(w), thereby implicitly using both common and complementary information in the source embeddings. The overall objective of autoencoder training is given below; the function f can be any distance or similarity measure, such as MSE, KL-divergence, or cosine distance. The coefficients λ1 and λ2 can be used to give different emphasis to the reconstruction of the two sources:

Loss(E1, E2, D1, D2) = Σ_w (λ1 f(s1(w), ŝ1(w)) + λ2 f(s2(w), ŝ2(w))), (3)

where ŝi(w) are the decoded embeddings corresponding to si(w). Jointly learning E1, E2, D1, D2 minimises the total reconstruction error given by Equation 3. To obtain meta-embedding representations after training, only the encoders are applied: they convert the input source embeddings into a meta-representation. These meta-embedding vectors are then used as vector representations of words.

The standard loss function we used for the AEME approaches was cosine distance loss. We tried variations and combinations of MSE loss, KL-divergence loss and cosine distance loss, and the last one works best in our case.

We can impose additional restrictions on AEME models during training. One such restriction is the use of triplet loss. The triplet loss function is a loss function for machine learning algorithms in which some basic example (anchor) is compared with positive and negative examples. The goal is to make the distance from the anchor to the positive example smaller than the distance to the negative example. There is often a margin parameter that controls by how much the distance to the negative example should exceed the distance to the positive one. One of the first formulations of an approach equivalent to the triplet loss was introduced in [33] for the metric learning problem.
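As an illustration, the AAEME encoder, the reconstruction objective, and a triplet margin loss of the kind discussed here can be sketched in a few lines of numpy. This is our simplified sketch, not the paper's implementation: the random linear matrices stand in for the trained encoder and decoder networks, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = d2 = d_meta = 300  # source and meta-embedding dimensions

# Plain linear maps as stand-ins for the trained encoder/decoder networks.
E1 = rng.normal(size=(d_meta, d1)) * 0.01
E2 = rng.normal(size=(d_meta, d2)) * 0.01
D1 = rng.normal(size=(d1, d_meta)) * 0.01
D2 = rng.normal(size=(d2, d_meta)) * 0.01

def aaeme(s1, s2):
    """L2-normalised sum of the two encoded source embeddings."""
    z = E1 @ s1 + E2 @ s2
    return z / np.linalg.norm(z)

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def reconstruction_loss(s1, s2, lam1=1.0, lam2=1.0):
    """Weighted reconstruction objective for a single word,
    with cosine distance as the measure f."""
    m = aaeme(s1, s2)
    return lam1 * cosine_distance(s1, D1 @ m) + lam2 * cosine_distance(s2, D2 @ m)

def triplet_margin_loss(m_anchor, m_pos, m_neg, margin=0.1):
    """The anchor should be closer to the positive than to the negative
    meta-embedding, by at least the given margin."""
    d_pos = np.linalg.norm(m_anchor - m_pos)
    d_neg = np.linalg.norm(m_anchor - m_neg)
    return max(d_pos - d_neg + margin, 0.0)
```

In training, the encoders and decoders would be optimised jointly over the vocabulary; at inference time only `aaeme` is applied.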
Similar loss functions have also been used in problems of image similarity [39], face recognition [32], text classification [40] and other tasks. We require a word to be closer to the words that are semantically related to it according to the taxonomy than to a randomly chosen word, with some margin:

L(wa, wp, wn) = max(||m(wa) − m(wp)|| − ||m(wa) − m(wn)|| + margin, 0), (4)

where ||.|| is a distance function, wa is the target word, and wp and wn are positive and negative words, respectively. The algorithm for calculating the triplet loss is as follows:

1. for each word present in the taxonomy, we compile a list of semantically related words, which includes synonyms, hyponyms and hypernyms;
2. at each epoch, we randomly select K positive words from this related-words set and form a set of K negative words by selecting them randomly from the vocabulary;
3. if a word is not present in the taxonomy, then we cannot form a list of related words for it; in this case, we generate positive vectors for it by adding random noise to its vector;
4. next, we combine the triplet loss with the original loss as α ∗ loss + (1 − α) ∗ triplet loss.

We use the following parameters for the triplet loss: K = 5, margin = 0.1, α = 0.005. These parameters were selected via grid search with the AAEME algorithm.

7 Experiments

The quality of the approach was evaluated using two general (external) source vector representations: fastText (Common Crawl Russian version from https://fastText.cc/docs/en/crawl-vectors.html) and word2vec (Araneum model for Russian from http://vectors.nlpl.eu/repository/). Also, different meta-embedding approaches were investigated: concatenation, SVD over the concatenation, CAEME, and AAEME (with and without triplet loss).
In addition to the two ”external” vector models word2vec and fastText, two vector models (word2vec and fastText, respectively) were trained on the informa- tion security text corpus (hereinafter ”internal”). The training parameters were as follows: window = 3, vector size = 300, epochs = 10, method = skip-gram. The performance of models trained on the domain corpus and also their combi- nation with more ”powerful” models was investigated, since 500 thousand texts is significantly less compared to text collections on which external word2vec and fastText models were trained. For evaluation of hypernym prediction, a traditional measure for ranking tasks Mean Average Precision measure is used. This measure achieves the max- imal value equal 1 when all correct answers are located in the beginning of a ranking list: PN M AP = N1 i=1 APi ; (5) 1 Pn APi = M i preci × I[yi = 1]. Another traditional metric for such tasks is Mean Reciprocal Rank (MRR), depending on positions of the first correct answers. This measure is equal to maximal value 1, when all first correct answers are located on 1st positions in ranking lists for all target words. N 1 X 1 M RR = , (6) N i=1 ranki where rank is the position of the first relevant item in the ranked list. Where N and M are the number of predicted and ground truth values, respectively, preci is the fraction of ground truth values in the predictions from 1 to i, yi is the label of the i-th answer in the ranked list of predictions, and I is the indicator function. In order to evaluate the quality of the approach in such a setting, the de- scribed methods of constructing meta-embeddings were used, and each vector model was evaluated separately. In case of using the AEME approaches for all vector models, it was necessary to determine the individual contributions of each vector model when calculating the loss function. 
The following weights were obtained experimentally: a weight of 1.0 for the internal models trained on the corpus, a weight of 5.0 for the external word2vec model, and a weight of 2.0 for the external fastText model.

The results can be seen in Table 1 (external models), Table 2 (internal models), and Table 3 (combination of external and internal models). From Tables 1 and 2, we can see that powerful external models calculated on large text collections still predict hypernym concepts much better than internal, domain-specific models. In both cases all variants of meta-embeddings predict hypernyms better than the source vectors, and the best results are achieved by encoders with triplet loss. The combination of all models (Table 3) achieves the best results in hypernym concept prediction; the best prediction results are much higher than predictions based on any of the source models. The results achieved by encoders are much better than the results of the simple approaches (concatenation and SVD).

method          MAP    MRR
fastText        0.362  0.407
word2vec        0.375  0.421
concat          0.397  0.446
SVD             0.400  0.447
CAEME           0.391  0.439
CAEME triplet   0.398  0.448
AAEME           0.404  0.453
AAEME triplet   0.412  0.464

Table 1. OENT-lite enrichment: external models

method          MAP    MRR
fastText        0.277  0.317
word2vec        0.277  0.316
concat          0.287  0.327
SVD             0.283  0.324
CAEME           0.286  0.325
CAEME triplet   0.298  0.339
AAEME           0.280  0.319
AAEME triplet   0.295  0.335

Table 2. OENT-lite enrichment: internal models

7.1 Analysis of Results

We analysed hypernym predictions for new words for which correct answers were not found among the top-10 predictions and found the following cases:

– Predicted hypernyms correspond to senses missing from the taxonomy. For example, the word "halo" is described in OENT-lite only in the sense of a Russian helicopter, but this word can also mean a computer game. Predicted hypernyms include the concept "computer program" at the first position of the candidate list and the "game" concept at the eighth position.
– A predicted hypernym concept is quite valid and conveys another aspect of the target word. For example, for the verb "to cache", the correct answer in OENT is the concept "data storing", but the first predicted hypernym concept, "computer technology", also seems correct;
– Predicted hypernyms may be too general in OENT but more specific in the predictions. For example, for the word "amazon" the correct answers are the concepts "American company", "company", "foreign company". The predicted concepts include "American software company" and "American tech company". These predicted concepts seem to be more correct;
– In many cases predictions are semantically very close to the correct answers but not correct. For example, for the word "CSS3" the correct answers are the concepts "document markup language" and "formal language". The predicted hypernym concepts are "programming language", "scripting language", "object-oriented language", "computer technologies";
– There are also numerous examples where too general hypernyms are predicted; in some cases predicted concepts are very far from reasonable answers and are difficult to explain.

method          MAP    MRR
concat          0.386  0.434
SVD             0.387  0.433
CAEME           0.385  0.434
CAEME triplet   0.408  0.456
AAEME           0.414  0.463
AAEME triplet   0.427  0.479

Table 3. OENT-lite enrichment: external + internal models

To show that higher confidence of the model correlates with better hypernym concept prediction, we plotted the dependence of correct answers on the weight of the first prediction. Figure 1 shows the proportion of a correct hypernym concept among the first 1, 3, 5, and 10 answers depending on the prediction weight. It can be clearly seen that a higher weight of a predicted hypernym leads to a higher proportion of correct answers. This means that model predictions with high predicted weights and no correct answers in the top can be considered as a source for improving hypernym class descriptions in the ontology.
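The MAP and MRR measures used in the evaluation above can be sketched as follows; this is a minimal implementation of the standard metrics, assuming binary relevance labels over each ranked candidate list, and the function names are ours:

```python
def average_precision(labels, num_gold):
    """AP for one ranked candidate list; labels[i] is 1 if the (i+1)-th
    prediction is a ground-truth hypernym, and num_gold is the total
    number of ground-truth hypernyms for the word."""
    hits, total = 0, 0.0
    for i, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            total += hits / i  # precision at this cut-off
    return total / num_gold if num_gold else 0.0

def mean_average_precision(label_lists, gold_counts):
    # Mean of the per-word average precisions.
    return sum(average_precision(labels, m)
               for labels, m in zip(label_lists, gold_counts)) / len(label_lists)

def mean_reciprocal_rank(label_lists):
    # Mean inverse rank of the first correct answer per word.
    total = 0.0
    for labels in label_lists:
        for i, y in enumerate(labels, start=1):
            if y == 1:
                total += 1.0 / i
                break
    return total / len(label_lists)
```

For a single word with ranked labels [1, 0, 1] and two gold hypernyms, AP is (1/1 + 2/3)/2 ≈ 0.833, and MRR counts only the first hit.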
8 Conclusion

In this paper we considered the problem of adapting the OENT ontology to the specific domain of information security: for new words from an information-security text collection, a hypernym concept from the OENT ontology has to be predicted. We investigated methods for combining different word embeddings into a single meta-embedding. The meta-embedding methods included concatenation of initial embeddings, SVD over the concatenation, and two variants of autoencoders aimed at learning better word embeddings from the initial vectors.

Fig. 1. Ratio of correct predictions among the top1, top3, top5, and top10 answers depending on the first prediction weight (x-axis: LogReg scores)

We showed that the use of meta-embeddings improves the performance of the system on the considered datasets. SVD always improves the results compared to concatenation. Autoencoder-based meta-embeddings achieve the best results in all cases. It can also be seen that adding the triplet loss improves the results significantly. It has also been shown that the use of vector models trained on a specific domain, in combination with the meta-embedding approach, can improve the quality of hypernym concept prediction. It can also be seen that the quality of the approach on the specific domain is worse than on the general domain [38]. We plan to make OENT-lite and the related hypernym dataset publicly available.

Acknowledgements. The participation of M. Tikhomirov in the reported study was funded by RFBR, project number 19-37-90119. The work of Natalia Loukachevitch in the current study (preparation of data for the experiments) is supported by the Russian Science Foundation (project 20-11-20166).

References

1. Aly, R., Acharya, S., Ossa, A., Köhn, A., Biemann, C., Panchenko, A.: Every child should have parents: A taxonomy refinement algorithm based on hyperbolic term embeddings.
In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 4811–4817. Association for Computational Linguistics, Florence, Italy (2019)
2. Arefyev, N., Fedoseev, M., Kabanov, A., Zizov, V.: Word2vec not dead: predicting hypernyms of co-hyponyms is better than reading definitions. In: Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue” (2020)
3. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001)
4. Bernier-Colborne, G., Barriere, C.: CRIM at SemEval-2018 task 9: A hybrid approach to hypernym discovery. In: Proceedings of the 12th International Workshop on Semantic Evaluation. pp. 725–731 (2018)
5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
6. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
7. Bollegala, D., Bao, C.: Learning word meta-embeddings by autoencoding. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1650–1661 (2018)
8. Coates, J., Bollegala, D.: Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. arXiv preprint arXiv:1804.05262 (2018)
9. Dale, D.: A simple solution for the taxonomy enrichment task: Discovering hypernyms using nearest neighbor search. In: Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue” (2020)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding.
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (Jun 2019)
11. Dobrov, B.V., Loukachevitch, N.V.: Development of linguistic ontology on natural sciences and technology. In: LREC. pp. 1077–1082. Citeseer (2006)
12. Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1199–1209 (2014)
13. Gómez-Pérez, A., Corcho, O.: Ontology languages for the semantic web. IEEE Intelligent Systems 17(1), 54–60 (2002)
14. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 855–864 (2016)
15. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Coling 1992, Volume 2: The 15th International Conference on Computational Linguistics (1992)
16. Jurgens, D., Pilehvar, M.T.: SemEval-2016 task 14: Semantic taxonomy enrichment. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). pp. 1092–1102. Association for Computational Linguistics (Jun 2016)
17. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
18. Levy, O., Remus, S., Biemann, C., Dagan, I.: Do supervised distributional methods really learn lexical inference relations? In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 970–976 (2015)
19. Liu, N., Huang, X., Li, J., Hu, X.: On interpretation of network embedding via taxonomy induction.
In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1812–1820 (2018)
20. Loukachevitch, N.V., Lashevich, G., Gerasimova, A.A., Ivanov, V.V., Dobrov, B.V.: Creating Russian wordnet by conversion. In: Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue”. pp. 405–415 (2016)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)
22. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
23. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
24. Miller, G.A.: WordNet: An electronic lexical database. MIT Press (1998)
25. Neill, J.O., Bollegala, D.: Meta-embedding as auxiliary task regularization. arXiv preprint arXiv:1809.05886 (2018)
26. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039 (2017)
27. Nikishina, I., Logacheva, V., Panchenko, A., Loukachevitch, N.: RUSSE'2020: Findings of the first taxonomy enrichment task for the Russian language. In: Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue” (2020)
28. Nikishina, I., Panchenko, A., Logacheva, V., Loukachevitch, N.: Studying taxonomy enrichment on diachronic wordnet versions. In: Proceedings of the 28th International Conference on Computational Linguistics. Association for Computational Linguistics, Barcelona, Spain (December 2020)
29.
Nikishina, I., Panchenko, A., Logacheva, V., Loukachevitch, N.: Evaluation of taxonomy enrichment on diachronic wordnet versions. In: Proceedings of the 11th Global WordNet Conference GWC-2021 (2021)
30. Roller, S., Kiela, D., Nickel, M.: Hearst patterns revisited: Automatic hypernym detection from large text corpora. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 358–363 (2018)
31. Sabirova, K., Lukanin, A.: Automatic extraction of hypernyms and hyponyms from Russian texts. In: AIST (Supplement). pp. 35–40 (2014)
32. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 815–823 (2015)
33. Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons. Advances in Neural Information Processing Systems 16, 41–48 (2004)
34. Shwartz, V., Dagan, I.: Path-based vs. distributional information in recognizing lexical semantic relations. COLING 2016, p. 24 (2016)
35. Snow, R., Jurafsky, D., Ng, A.Y.: Semantic taxonomy induction from heterogenous evidence. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. pp. 801–808 (2006)
36. Tikhomirov, M., Loukachevitch, N., Dobrov, B.: Methods for assessing theme adherence in student thesis. In: International Conference on Text, Speech, and Dialogue. pp. 69–81. Springer (2019)
37. Tikhomirov, M., Loukachevitch, N., Ekaterina, P.: Combined approach to hypernym detection for thesaurus enrichment. In: Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue” (2020)
38. Tikhomirov, M., Loukachevitch, N.: Meta-embeddings in taxonomy enrichment task. pp. 681–692 (2021)
39.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning fine-grained image similarity with deep ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1386–1393 (2014)
40. Wei, J., Huang, C., Vosoughi, S., Cheng, Y., Xu, S.: Few-shot text classification with triplet networks, data augmentation, and curriculum learning. arXiv preprint arXiv:2103.07552 (2021)
41. Yin, W., Schütze, H.: Learning word meta-embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1351–1360 (2016)