=Paper=
{{Paper
|id=Vol-2989/long_paper35
|storemode=property
|title=Type- and Token-based Word Embeddings in the Digital Humanities
|pdfUrl=https://ceur-ws.org/Vol-2989/long_paper35.pdf
|volume=Vol-2989
|authors=Anton Ehrmanntraut,Thora Hagen,Leonard Konle,Fotis Jannidis
|dblpUrl=https://dblp.org/rec/conf/chr/EhrmanntrautHKJ21
}}
==Type- and Token-based Word Embeddings in the Digital Humanities==
Type- and Token-based Word Embeddings in the Digital Humanities

Anton Ehrmanntraut, Thora Hagen, Leonard Konle and Fotis Jannidis
Julius-Maximilians-Universität Würzburg

Abstract: In the general perception of the NLP community, the new dynamic, context-sensitive, token-based embeddings from language models like BERT have replaced the older static, type-based embeddings like word2vec or fastText, due to their better performance. We can show that this is not the case for one area of application for word embeddings: the abstract representation of the meaning of words in a corpus. This application is especially important for the Computational Humanities, for example in order to show the development of words or ideas. The main contributions of our paper are: 1) We offer a systematic comparison between dynamic and static embeddings with respect to word similarity. 2) We test the best method to convert token embeddings to type embeddings. 3) We contribute new evaluation datasets for word similarity in German. The main goal of our contribution is to make an evidence-based argument that research on static embeddings, which largely stopped after 2019, should be continued, not only because it needs less computing power and smaller corpora, but also because for this specific set of applications their performance is on par with that of dynamic embeddings.

Keywords: Word Embeddings, BERT, fastText

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands. anton.ehrmanntraut@uni-wuerzburg.de (A. Ehrmanntraut); thora.hagen@uni-wuerzburg.de (T. Hagen); leonard.konle@uni-wuerzburg.de (L. Konle); fotis.jannidis@uni-wuerzburg.de (F. Jannidis). ORCID: 0000-0001-6677-586X (A. Ehrmanntraut); 0000-0002-3731-6397 (T. Hagen); 0000-0001-5833-0414 (L. Konle); 0000-0001-6944-6113 (F. Jannidis). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073. CRediT Roles: Anton Ehrmanntraut: Investigation, Software, Writing – original draft; Thora Hagen: Data Curation, Writing – original draft; Leonard Konle: Data Curation, Writing – original draft; Fotis Jannidis: Conceptualization, Supervision, Writing – review & editing.

1. Introduction

Since 2013, word embeddings and modern language models have revolutionized the representation of words, sentences and texts in natural language processing. This started with the now famous word2vec [26] algorithms in 2013 and developed in a fast progression, with the models showing an astonishing increase in performance over recent years. While the first generation of word embeddings was type-based, mapping each word type to a single vector and thereby blending all contexts it was used in, the next generation, with models like BERT [10], is context-sensitive, mapping the same token to different vectors depending on the context in which the token appears. These token-based word embeddings are now the state of the art in natural language processing. But these models have increased dramatically in size and need much more computing power, which makes it difficult or even impossible to train them from scratch in a typical digital humanities setup. These pragmatic reasons were the motivation for the research we will describe: we wanted to be able to describe the loss of information and performance that results from using static instead of dynamic embeddings.
But our surprising preliminary results were confirmed by a paper which was published as preprint during our work [23]: Static word embeddings can be on par with dynamic embeddings or even surpass them in some constellations. The usage of word embeddings in DH can be categorized in two groups: Using embeddings as an improved form of word representation in tasks like sentiment analysis [43], word sense disambiguation [29, 38], authorship attribution [19, 33] etc. And using word embeddings as abstractions of semantic systems by describing word meaning as a set of relation of a focus term to its semantic neighbours. Since Kulkarni et al. [21] proposed the comparison of embeddings trained on texts from different slices in time to measure semantic change, their method known as diachronic, temporal or dynamic word embeddings [22] has been adopted by the Digital Humanities com- munity [31, 32, 16]. However, word embeddings as abstractions of semantic systems do not only work in the historical dimension, but are universally applicable. Even a comparison across several languages is possible [34]. Figure 1: Publications using word embeddings as either representation or abstraction in Digital Scholarship in the Humanities, DH Quarterly, the Digital Humanities Conference Abstracts and its German branch Digital Humanities im deutschsprachigen Raum. While token-based embeddings succeed in representation tasks, this is less clear for ab- straction, since these capabilities are not included in common benchmark tests [40] and some comparisons are marred by models being trained on corpora of different sizes and the usage of different dimensions for the embeddings. At the same time, unlike in computer science, ab- straction is not a rare application in DH research (see Figure 1). This poses the question under which circumstances it is worthwhile for a digital humanist, who is interested in using word embeddings as an abstraction, to work with these latest token-based approaches, if there is performance loss if one is using a token-based model and how large the loss is. To answer these questions we will create type-based embeddings from a pre-trained BERT (GBERT [8]) and compare them to static type-based embeddings, which we created. Therefore we will report on two sets of experiments: First, we will try to find a strong way to create performant type-based embedding out of token-based embeddings. Secondly, we will compare this derived embedding against static, traditional type-based embeddings. As far as possible we will train these em- beddings on the same or similar corpora and compare embeddings with the same dimensions to level the playing field and avoid the distortions which have limited the usefulness of some 17 other comparisons. Because word embeddings as abstractions of semantic systems have a close link to questions of word similarity and word relatedness we will limit our evaluation to these aspects. In contrast to most of the existing research comparing different word embeddings, we will use German corpora for the training of the models and a German BERT model (GBERT) pre-trained on similar corpora. This makes it necessary to create our own evaluation datasets, some of them mirror English datasets, sometimes by translation, to allow an easy comparison of results across languages, some are new reflecting our interests in specific aspects of word similarity and word relatedness. 2. Related work 2.1. 
Creating types Since the initial presentation of BERT, there were efforts to convert BERT’s contextualized token-based embedding (that is, the output of the respective Transformer layers from a pre- trained BERT model under certain input sequences) into conventional static, decontextualized type-based embeddings. We follow the terminology used in the survey by Rogers et al. [30] and exclusively use the term distillation to refer to this precedure1 . Almost all approaches aggregate the token embeddings across multiple contexts into a single type embedding in some way [10, 6, 39, 23]. Bommasani et al. [6] present a natural and generalized description of the distillation process, which covers all previously cited approaches. In Section 4, we present and extend their description, and examine a wider range of possible parameters. In contrast to the experiments of Bommasani et al., we also included an evaluation on the basis of relations resp. analogies. Also, different to the experiments of Vulić et al. [39] and Lenci et al. [23], we experiment with aggregations beyond the component-wise arithmetic mean. Similar to previous works, we only compute the embedding for a small selection of types relevant for our evaluation datasets. Thus, we already remark at this point that the generated type-based embedding is not “full” – in the sense that we assign only vectors to types that are present in our evaulation set, not to all types in the vocabulary, as is the case in static type-based embeddings. This has consequences on the type of tasks we can use to probe this small embedding, hence we were required to reformulate some tasks to account for this limit. The techniques independently proposed by Wang et al. [42], and Gutman and Jaggi [13], resp., appear to be the only processes that are not a special case of the generalized description by Bommasani et al. In both works, contextualized embeddings from BERT are used to complement the training of a word2vec-style static embedding. Wang et al. use BERT to replace the center word embeddings in a skip-gram architecture, while Gutman and Jaggi use BERT to replace the context embeddings in a CBOW archictecture. However, these techniques come with large computational effort, since training the static embedding requires at least one full BERT embedding of the entire training corpus. Therefore, we omit a detailed analysis of this technique, since this degree of required computational resources seems out of reach for a DH setup. 1 Unfortunately, this terminology might be misleading. In particular, it should not be confused with the compression technique called “knowledge distillation” utilized in, e.g., DistilBERT. 18 2.2. Evaluation of type-based word embeddings Word embeddings are evaluated either intrinsic with their ability to solve the training objective or extrinsic by measuring their performance on other NLP problems [4]. Since the develop- ment of BERT, problems using word embeddings as representations of words or sequences rather that abstractions (see Section 1) to perform supervised training on curated datasets are dominant in evaluation. Those benchmarks like GLUE [41, 42] are not in scope of this paper, because we are solely interested in abstraction. From the abstraction viewpoint word embeddings should represent various linguistic relationships between the words [41]. These relationships are distributed over several datasets to test vector spaces for desired properties. 
These datasets typically contain tests on: Word Similarity, Relatedness, Analogies, Synonyms, Thematic Categorization, Concept Categorization and Outlier Detection [4, 41]. For word relatedness, pairs of words are given a score based on the perceived degree of their connection, which are then compared to the corresponding distances in the vector space. For the word similarity task specifically, the concept of relatedness is dependent on the degree of synonymy [36]. Some of the most prevalent word relatedness/similarity datasets for the English language, among others, are WordSim-353 [2], MEN [7] and SimLex-999 [18]. While WordSim- 353 and MEN focus on relatedness, SimLex-999 has been specifically designed to represent similarity, meaning that pairs rated with high association in MEN or WordSim-353 could have a low similarity score in SimLex-999, as “association and similarity are neither mutually exclusive nor independent” [18]. Additionally, SimLex-999 includes verbs and adjectives apart from nouns, which the other two datasets do not. Another constraint of the MEN dataset is that there are no abstract concepts present in the word pairs. Both relatedness and similarity can also be evaluated via the word choice task, where each test instance consists of one focus word and multiple related or similar words in varying (rel- ative) degrees. Concerning word choice datasets, the synonym questions in the TOEFL exam are the most prominent. One test instance consists of a cue word and four additional words, where exactly one is a true synonym of the cue word. The distractors are usually related words or words that could generally replace the cue word in a given context, but would change the meaning of a sentence [35]. As these questions have been constructed by linguists, the data reliably depicts word similarity. However the design of the dataset does not allow for distin- guishing medium or low similarity for example because of the binary classification approach as opposed to the WS task [18]. Lastly, the word analogy task can be used to probe specific relations in the vector space. Given two related terms and a cue term, a target term has to be predicted analogous to the relation of the given word pair. For word analogy datasets, the Google Analogies test set [26] is the most notable. It includes five semantic and nine syntactic relations, where the semantic relations mostly cover world knowledge (e.g. countries and capitals), while the morphological relations include the plural of nouns and verbs or comparative and superlative among others. The usual implementation to test embeddings on this task uses linear vector arithmetic. For example, given analogy “man is to woman as king (the cue term) is to queen (the target term)”, we test if queen is the closest type in the vocabulary to king − man + woman. This implementation builds upon the supposition that the underlying embedding exhibits linear regularities among these analogies – in above example, that would be woman − man ≈ queen − king, i.e. there is a prototypical “womanness”-offset in the embedding. [27, 25] Since it was computationally infeasible for us to distill embeddings from BERT that have a comparable vocabulary size to those of static embeddings, we found that this setup becomes 19 unreliable: Due to the smaller vocabulary, we heavily restrict the search space in these analogy tests, making the prompts appear easier to solve. Following the example, consider vector v = king − man + woman. 
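This linear arithmetic is typically implemented as a nearest-neighbour search over the vocabulary. The following is a minimal sketch of that standard procedure (our own illustration, not the authors' code; `emb` is assumed to be a mapping from word types to vectors of any type-based embedding):

```python
import numpy as np

def analogy(emb: dict[str, np.ndarray], a: str, b: str, cue: str, topn: int = 1):
    """3CosAdd analogy search: 'a is to b as cue is to ?'.
    Ranks all vocabulary types by cosine similarity to v = b - a + cue,
    excluding the three prompt words themselves (the usual convention)."""
    v = emb[b] - emb[a] + emb[cue]
    v = v / np.linalg.norm(v)
    scores = {}
    for word, w in emb.items():
        if word in (a, b, cue):
            continue
        scores[word] = float(v @ (w / np.linalg.norm(w)))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(emb, "man", "woman", "king") should return ["queen"]
# if the embedding exhibits the expected linear regularity.
```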
Due to the small vocabulary, there are only few distractors in the neighborhood of v. Consequently, the vector queen most probably is the closest type from the vocabulary and the prompt is answered correctly, but this is not a cause of a structurization of the embedding space. 3. Resources 3.1. General Corpora We train the type-based embeddings on the German OSCAR Corpus (Open Super-large Crawled Aggregated coRpus [28]), as this is the largest chunk of the training data for the current best German BERT model discussed below. The deduplicated variant of the German corpus contains 21B words (145 GB), filtered out of CommonCrawl.2 3.2. BERT Model As outlined in the introduction, we want to ensure that the different language models are trained on the same or similar corpora. While training of type-based models on our chosen corpora were feasible for us, it was impossible to pre-train a BERT model from scratch. There- fore, we choose a pre-trained German model GBERTBase provided by Deepset in collaboration with DBMDZ [8]. Like the original BERTBase , the German model consists of 12 layers, 768 dimensions, and a maximum sequence length of 512 tokens. Also, we choose this model since, to date, this trained model appears to be the currently best available BERT model (with above hyperparameters) [8]. The model was trained on a combination of four corpora, whereas OSCAR dominates the training set by approximately 88 % of the total data. When we dis- till type-based embeddings from BERT, we always are going to use the GBERTBase model, and will only use the OSCAR corpus to retrieve contextualized inputs. Likewise, when we train static type-based models, we are only going to use the OSCAR corpus (neglecting the remaining 22 % of BERT’s pre-training data unaccounted for). 3.3. Evaluation Data As the most popular evaluation datasets are in English, we constructed a comprehensive, German test suite consisting of multiple datasets which cover different aspects based on already existing evaluation data.3 The tasks covered are: word relatedness (WR), word similarity (WS), word choice (WC) and relation classification (RC). In addition, the data probes semantic knowledge such as synonyms and morphological knowledge, namely inflections and derivations. See Table 1 for an overview of all test datasets. Word Relatedness/Similarity For WR, we used the re-evaluated translation of WordSim- 353 (Schm280) as presented in [20], where we only corrected nouns which were written in lower case, as well as a DeepL translation of [7] (MEN), which we then reviewed and adjusted manually as needed. To assess WS, we opted for the translation of [18] (SimLex999) by [24]. Both 2 https://commoncrawl.org 3 Datasets and Code: https://github.com/cophi-wue/Word-Embeddings-in-the-Digital-Humanities 20 WR and WS are judged via the Spearman’s rank correlation coefficient between the human annotated scores and the cosine distance of all word pairs. Relation Classification As outlined above, we incorporate the concept of linearly struc- tured word analogies into the test suite not in the usual way, but using relation classification. Instead of predicting a target word through a given cue word, RC tries to predict the relation type of a word pair by comparing their offset with representative offsets for each relation type. Specifically, we are given a collection of relations R1 , R2 , . . . , Rk where each Ri is a set of word pairs. 
We now interpret the relation Ri as a set of offsets: for each word pair (a, b) ∈ Ri , we consider the vector vb − va . For the evaluation, we use a median-based 1-nearest-neighbor classification: We “train” the classifier by choosing the median of Ri as decision object. More precisely, we define vector ri = median({vb − va | (a, b) ∈ Ri }) as decision object of Ri . We then test this classifier on all pairs from all relations, thus check for each pair (a, b) from Ri wheter ri is in fact the closest decision object to vb −va (with respect to ℓ1 -norm). We evaluate these predictions by the “macro” F1 score, i.e. the unweighted average of the F1 scores under each relation type, respectively. While this setup allows for different aggregations resp. distance functions other than median resp. ℓ1 -norm, we surprisingly found this choice to be more successful than other candidates (such as those based on cosine distance) among all examined embeddings. For the RC data we made use of two German knowledge bases: the knowledge graph Ger- maNet [15, 17] (Ver. 14.0) and the German Wiktionary.4 GermaNet incorporates different kinds of semantic relations, including lexical relations such as synonymy, and conceptual re- lations such as hypernymy and different kinds of compound relations. For the RC evaluation, we only selected the conceptual relations and the pertainym relation for our GermaNet dataset, since only these can be considered a directed one-one relation. Wiktionary on the other hand contains tenses, the comparison of adjectives, and derivational relations among other morpho- logical relations. Again, we selected a set of inflectional resp. derivational directed one-one relations for the Wiktionary dataset for the RC evaluation, cf. Table 7 in the appendix. Even though there is a German version of the Google Analogies dataset available [20], we chose not to include it, as its semantic and morphological relations are covered entirely by GermaNet and Wiktionary, respectively. Additionally, both datasets contain more instances than the Google Analogies testset does. Word Choice Lastly, we included the WC task. Here, we used a translated version of the TOEFL synonym questions [20] as well as one automatically constructed dataset from the German Duden of synonyms. Our Duden dataset includes one synonym of the cue word as the target word, plus, as distractors, four synonyms of the target word that are not synonyms of the cue word. Evaluation is based on whether the target word is closer to the cue word than the distractors with respect to cosine distance. We report accuracy among all prompts. Initially, we also wanted to explore world knowledge (i.e. named entities) captured by the embeddings, such as city-body of water or author-work relationships. However most named entities consist of multi-word expressions which are difficult to model via type or token based embeddings. We therefore removed all instances where a concept consisted of more than one token in the datasets described above. 21 Table 1 Datasets and corresponding tasks in the embedding testsuite dataset GermaNet Wiktionary SimLex999 Schm280 MEN Duden TOEFL task RC RC WS WR WR WC WC instances 3980 23174 996 279 2964 1184 401 (42 rel.) (33 rel.) 
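To make the relation-classification setup described above concrete, the following sketch (our own simplified rendering, not the authors' released code) builds the median offset per relation as decision object and predicts the relation of a word pair by the ℓ1-closest decision object:

```python
import numpy as np

def rc_decision_objects(relations: dict[str, list[tuple[str, str]]],
                        emb: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """For each relation R_i, take the component-wise median of all offsets v_b - v_a."""
    return {
        name: np.median(np.stack([emb[b] - emb[a] for a, b in pairs]), axis=0)
        for name, pairs in relations.items()
    }

def rc_predict(pair: tuple[str, str],
               objects: dict[str, np.ndarray],
               emb: dict[str, np.ndarray]) -> str:
    """Predict the relation whose decision object is l1-closest to the pair's offset."""
    a, b = pair
    offset = emb[b] - emb[a]
    return min(objects, key=lambda name: np.abs(offset - objects[name]).sum())

# The macro F1 score over all relations can then be computed, e.g. with
# sklearn.metrics.f1_score(y_true, y_pred, average="macro").
```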
Footnote 4: https://dumps.wikimedia.org/dewiktionary/20210701/

Table 2: Examined Vectorization, Subword Pooling, and Context Combination functions.

Vectorization
- inputemb — use BERT's pretrained input embedding as subword vector
- L1, …, L12 — output of layer N
- all — concatenation of the vectors L1, …, L12
- L1-4 / L9-12 — concatenation of the vectors L1, …, L4 / L9, …, L12
- sum — component-wise summation of the layer vectors

Subword Pooling
- first — take first subword vector as token vector
- mean — component-wise arithmetic mean
- median — component-wise median
- meannorm — mean-norm of all subword vectors as token vector (see Eq. 1)
- l0.5medoid, l1medoid, l2medoid — medoid of the subword vectors w.r.t. ℓ0.5-distance, ℓ1-distance, ℓ2-distance
- nopooling — skip the subword pooling stage and consider each subword vector as an individual token vector

Context Combination
- mean — component-wise arithmetic mean as type vector
- median — component-wise median as type vector
- meannorm — mean-norm of all token vectors as type vector (see text)
- l0.5medoid, l1medoid, l2medoid — medoid of the token vectors w.r.t. ℓ0.5-distance, ℓ1-distance, ℓ2-distance

4. Creating Type Vectors

Comparing embeddings distilled from BERT's token-based embeddings with traditional static type-based embeddings requires us to examine different possibilities of how to perform this distillation. Therefore, we compare these possibilities by evaluating the resulting embeddings on the discussed evaluation datasets. Given these results, we decide on a single distillation procedure and compute a BERT embedding to compare its performance against static embeddings in Section 5.

In order to systematically evaluate the different methods to compute an embedding from BERT's token-based representations, we follow and extend the two-stage setup of Bommasani et al. [6]. The first stage is the subword pooling, where the k subword vectors w_{c,1}, …, w_{c,k} (derived from BERT's output) for word w in context c are aggregated into a single contextualized token vector with aggregation function f; that is, w_c = f({w_{c,1}, …, w_{c,k}}). Then, in the second stage, the context combination, multiple contextualized token vectors w_{c_1}, …, w_{c_n} are aggregated by function g into a single static embedding w = g({w_{c_1}, …, w_{c_n}}). We extend this description by prepending a zeroth vectorization stage, in which we make explicit how to transform the outputs of BERT's layers into a single subword vector. While Bommasani et al. tacitly concatenate all outputs to form a subword vector, we also allow selectively picking specific layer output(s) as subword vector, or a summation of the layers. This setup also allows us to choose pooling functions f resp. context combination functions g which are not defined component-wise. Hence, we examine a wider range of choices for vectorizations, f and g than previously considered, e.g., in [39, 6]. One such novel aggregation function is mean-norm, which refers to the aggregation

meannorm(v_1, …, v_n) = norm((1/n) · Σ_{i=1}^{n} norm(v_i)),    (1)

that takes a set of vectors as input, normalizes each to unit length, calculates the mean of these normalized vectors, and normalizes this mean again. We motivate this aggregation function from the fact that meannorm(v_1, …, v_n) is the unique vector on the unit hypersphere that maximizes the sum of cosine similarities with respect to each of v_1, …, v_n. Thus, mean-norm could also be understood as a "cosine centroid".
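Read as code, Eq. (1) is just two normalizations around an arithmetic mean. A minimal numpy sketch (our own illustration), together with the medoid used as an alternative aggregation:

```python
import numpy as np

def norm(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def meannorm(vectors: list[np.ndarray]) -> np.ndarray:
    """Eq. (1): normalize each vector, average the results, normalize the mean again.
    The result lies on the unit hypersphere and maximizes the summed cosine
    similarity to the inputs (the 'cosine centroid')."""
    return norm(np.mean([norm(v) for v in vectors], axis=0))

def medoid(vectors: list[np.ndarray], p: float = 2.0) -> np.ndarray:
    """Medoid: the input vector minimizing the summed l_p distance to all others
    (p = 0.5, 1, 2 correspond to l0.5medoid, l1medoid, l2medoid in Table 2)."""
    dists = [sum(np.sum(np.abs(v - w) ** p) ** (1 / p) for w in vectors) for v in vectors]
    return vectors[int(np.argmin(dists))]
```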
In particular, mean-norm, medoids, and aggregations based on fractional distances were included in our experiments searching for suitable distillations. Table 2 shows the functions we examined. In total, we examine 17 possible vectorizations, 8 subword pooling functions and 6 context combination functions. 4.1. Method Intending to find best-performing distillations based on the above outlined general setup, we evaluate different choices for the “free parameters” of the distillation process as shown in Table 2. For each word w to be embedded, we retrieve n = 100 sentences from the OSCAR corpus as context for w, where w occurs in a sentence of maximum sequence length of 510. If w has < n but at least one occurrence, we sampled all these occurrences. Types w that did not occur in the OSCAR corpus were removed from the evaluation dataset. For each occurrence in sentence s, we construct the input sequence by adding the [CLS] and [SEP] token. The respective outputs of BERT on all layers form the input for the vectorization stage. This method of generating input sequences by sampling sentences largely agrees with the methods proposed by [6, 39, 23], and only differs in the sampling of sentences. Due to the large number of possible distillations, it was computationally infeasible to con- struct embeddings under all distillations. Therefore, in a first experiment, we examined the quality of all considered distillations on a smaller evaluation dataset, which consists of three subsets of the MEN, the Wiktionary, the GermaNet, and the TOEFL dataset. Then, after restrict- ing the set of potential distillations to promising ones, we perform the same experiment but with the full evaluation dataset on all five datasets. In both cases, the evaluation is performed as outlined in Section 3.3. Also, since the scores in the respective tasks are reported in different metrics, we opt for standardizing the respective scores in each task when comparing some embeddings’ perfor- mances. Therefore, for one specific task, we consider the standardized score as the number of standard deviations away from the mean over all model’s scores in that task. We then can report the mean standardized score of some model taken over all considered tasks. 23 Table 3 Comparison of the embeddings computed by the medoid-based distillation methods on the full evaluation dataset. Reported numbers are the mean standardized score over all seven tasks. The score of the embedding considered in sec. 5 is the maximum and is highlighted bold. 
vectorization inputemb L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L11 L12 L1-4 L9-12 sum all poolings aggregations nopooling mean -1.372 -1.089 -0.482 0.477 0.852 0.414 0.277 0.002 -0.182 -0.343 -0.137 -0.100 -0.618 0.475 0.096 1.134 1.122 meannorm -2.068 -1.392 -0.782 0.413 0.800 0.394 0.308 0.014 -0.123 -0.304 -0.077 -0.015 -0.605 0.278 0.136 1.122 1.117 median -1.631 -1.084 -0.679 0.334 0.795 0.428 0.228 -0.060 -0.206 -0.399 -0.064 -0.135 -0.582 0.494 0.171 1.157 1.091 mean mean -1.372 -1.089 -0.482 0.477 0.852 0.414 0.277 0.002 -0.182 -0.343 -0.137 -0.100 -0.618 0.475 0.096 1.134 1.122 meannorm -1.712 -1.253 -0.685 0.432 0.838 0.432 0.338 0.034 -0.102 -0.285 -0.101 -0.025 -0.607 0.361 0.126 1.141 1.129 median -1.301 -1.018 -0.516 0.460 0.854 0.408 0.259 -0.025 -0.210 -0.354 -0.075 -0.044 -0.514 0.504 0.133 1.132 1.132 meannorm mean -2.020 -1.373 -0.783 0.370 0.788 0.384 0.282 0.013 -0.158 -0.293 -0.062 -0.053 -0.625 0.259 0.121 1.109 1.093 meannorm -2.071 -1.391 -0.782 0.409 0.805 0.394 0.308 0.011 -0.126 -0.293 -0.067 -0.021 -0.604 0.278 0.131 1.124 1.119 median -1.931 -1.272 -0.764 0.348 0.779 0.350 0.270 0.001 -0.206 -0.351 -0.060 -0.057 -0.482 0.307 0.139 1.080 1.054 median mean -1.490 -1.114 -0.550 0.480 0.870 0.523 0.261 -0.018 -0.180 -0.256 -0.155 -0.104 -0.655 0.485 0.153 1.201 1.086 meannorm -1.718 -1.270 -0.703 0.437 0.891 0.554 0.314 0.046 -0.128 -0.235 -0.104 -0.024 -0.684 0.390 0.170 1.170 1.060 median -1.469 -1.082 -0.580 0.452 0.934 0.497 0.269 -0.054 -0.187 -0.341 -0.118 -0.069 -0.534 0.550 0.175 1.253 1.108 4.2. Results The first run of the experiment with all examined distillations on the smaller datasets gave strong indication that centroid-based distillations lead to significantly better-performing em- beddings than those distillations that consist of a medoid-based poolings resp. aggregations. In fact, among the 13 distillations with the highest mean standardized score, all twelve distil- lations with centroid-based poolings and aggregations are present (Nopooling, mean, median, mean-norm). Numerical values are presented in Table 8 in the appendix. (Top 13 distillations are underlined.) With this experiment, we contributed insight on the performances on differ- ent aggregation functions not previously considered in the literature. The results suggest the interpretation that centroids – which represent a vector cluster by some synthetic aggregate – generally lead to better results than medoids – which represent a cluster by some member of that cluster. Also, in our distillation setup, fractional norms do not appear to give an advan- tage, as opposed to research that indicate that fractional distance metrics could lead to better clustering results in high-dimensional space, e.g. [1]. Hence in total, we negatively answer potential hypotheses that certain overlooked aggregation functions could lead to immediate improvements of resulting type-based embedding distilled from BERT. Therefore, we continue our evaluation of these twelve centroid-based parameter choices in the next experiment on the full dataset. We observe that, under the restriction on centroid-based poolings and aggregations, the choice of vectorization (i.e. layer) has a much higher influence on the embedding’s performance than the actual choice of functions f and g. This supports the general hypothesis that different layers capture different aspects of linguistic knowledge [30]. 
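To make the distillation procedure concrete, the sketch below distills a single type vector with the sum vectorization and median pooling / median context combination, one of the strongest settings in Table 3. This is our own simplified illustration, not the authors' released code; it assumes the GBERT checkpoint is available via the transformers library under the name deepset/gbert-base and uses a naive subword matching to locate the target word.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "deepset/gbert-base"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def distill_type_vector(word: str, sentences: list[str]) -> np.ndarray:
    """Distill one static type vector: sum over layers (vectorization),
    median over subword positions (pooling), median over contexts (combination).
    Assumes `word` occurs in at least one of the given context sentences."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    token_vectors = []
    for sent in sentences:  # e.g. up to 100 sentences sampled from OSCAR
        enc = tokenizer(sent, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = torch.stack(model(**enc).hidden_states)  # (13, 1, seq_len, 768)
        summed = hidden.sum(dim=0)[0].numpy()  # sum vectorization over all layer outputs
        ids = enc["input_ids"][0].tolist()
        for start in range(len(ids) - len(word_ids) + 1):  # locate the word's subwords
            if ids[start:start + len(word_ids)] == word_ids:
                subword_vecs = summed[start:start + len(word_ids)]
                token_vectors.append(np.median(subword_vecs, axis=0))  # median pooling
    return np.median(np.stack(token_vectors), axis=0)  # median context combination
```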
Additionally, these findings demonstrate that the default suggestions f = mean and g = mean in the literature [6, 39, 23] generally are a reasonable choice to perform a distillation. A visualization of the embeddings’ scores on each of the full seven datasets is presented in Figure 5 in the appendix. Again, we ranked the 17 × 4 × 3 (# vectorizations × # restricted poolings × # restricted vectorizations) analyzed embeddings with respect to their mean standardized score over all five tasks; cf. Table 3. This leads to our observation that those embeddings based on a sum-vectorization outperform any of the other embeddings; hence we suspect that a verti- cal summation resp. averaging over all layers can provide a robust vector representation for BERT’s tokens, capturing the summative linguistic knowledge of all layers in a reasonable fash- 24 ion. Also, we suspect that the smaller dimensionality of the sum-vectorization might give the embeddings an advantage in comparison to the vectorization that concatenates all layers: the former has 768 dimensions, while the latter concatenates 12 Transformer outputs plus BERT’s input embedding, leading to 768 × 13 = 9968 dimensions. To fix a single embedding for further comparison with static embeddings, we choose the embedding with the highest mean standardized score, which is the embedding based on the distillation with vectorization sum, the pooling f = median, and the aggregation g = median. The respective mean standardized score is highlighted bold in Table 3. Thus, when we now speak of BERT’s distilled embedding, then we explicitly mean this distillation (sum vectoriza- tion, median pooling and aggregation). Note that this embedding consists of 768 dimensions. Nevertheless, we explicitly remark that the small differences in performances do not admit the claim that the chosen distillation is a universal method that would always perform best in any scenario. Also, we want to highlight that the previously untreated median as aggregation function appears to cause some improvement, especially in the pooling stage. Due to its ro- bustness against outliers, we recommend to always examine this aggregation in distillations of any form that convert from token-based to type-based embeddings. 5. Comparing type- and token-based embeddings 5.1. Methods Before training the type-based embeddings models, we preprocessed the German OSCAR Cor- pus to ensure that models are trained on the same version of the data. Preprocessing has been done using the word tokenizer and sentence tokenizer of NLTK.5 Additionally, all punctuation has been removed. We trained all models using the respective default parameters and used the skip-gram model for all embeddings. We only adapted the number of dimensions and window size to create additional embedding models for comparison. While the recommended window size is 5, a window size of 2 has proven to be more effective in capturing semantic similarity in embeddings [23]. To make up for the larger dimensionality of BERT’s distilled embedding (768), we also trained static models with 768 dimensions besides the more commonly used 300 dimensions for type-based models (under the assumption that more dimensions imply a higher quality vector space). Additionally, we concatenated the embeddings in various combinations, as these “stacked” vectors often lead to better results, as presented in [3]. 
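Such stacking amounts to concatenating, per word type, the vectors produced by the individual models. A minimal sketch of this (our own illustration; the model names in the usage example are hypothetical placeholders):

```python
import numpy as np

def stack_embeddings(word: str, *models: dict[str, np.ndarray],
                     unit_length: bool = False) -> np.ndarray:
    """Concatenate one word's vectors from several models into a 'stacked' vector.
    With unit_length=True each part is normalized first, so that no model dominates
    the concatenation purely through a larger vector norm (an optional design
    choice for this sketch, not necessarily the original setup)."""
    parts = []
    for model in models:
        v = model[word]
        parts.append(v / np.linalg.norm(v) if unit_length else v)
    return np.concatenate(parts)

# e.g. a 2304-dimensional vector from three 768-dimensional models:
# v = stack_embeddings("Haus", emb_w2v, emb_ft, emb_bert)
```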
We experimented with all three possible tuples consisting of BERT, word2vec, and fastText (using the 768-dimension versions only) as well as one embedding where we concatenated all three. We lastly included one fastText model with 2 × 768 dimensions to enable a direct comparison to the stacked embeddings. As explained above, we evaluate how well the models represent different term relations with four tasks: word similarity, word relatedness, word choice, and relation classification.

Footnote 5: Natural Language Toolkit (https://www.nltk.org/)

5.2. Results

Table 4: Performances of the analyzed embeddings in the seven discussed tasks. Top value per task is highlighted bold.

| model | GermaNet (RC, macro F1) | Wiktionary (RC, macro F1) | SimLex999 (WS, Spearman ρ) | Schm280 (WR, Spearman ρ) | MEN (WR, Spearman ρ) | Duden (WC, accuracy) | TOEFL (WC, accuracy) |
| Word2Vec Dim300 | .657 | .739 | .429 | .748 | .719 | .605 | .791 |
| Word2Vec Dim768 | .793 | .803 | .463 | .751 | .724 | .625 | .803 |
| FastText Dim300WS5 | .643 | .795 | .439 | .757 | .734 | .617 | .796 |
| FastText Dim300WS2 | .668 | .811 | .455 | .768 | .734 | .619 | .798 |
| FastText Dim768WS2 | .774 | .863 | .491 | .771 | .736 | .634 | .813 |
| BERT (sum-median-median) | .710 | .906 | .476 | .754 | .650 | .543 | .589 |

Figure 2: Performances of the analyzed embeddings in the seven discussed tasks. Graphical representation of the results in Table 4.

The first general observation looking at Figure 2 is that BERT's distilled embedding (again, sum vectorization, median pooling and aggregation) does not perform significantly better, contrary to our expectations. In fact, the type-based embeddings seem to capture term relatedness and similarity even better than the token-based embeddings distilled from BERT in most tasks: in WS, WR, and WC, FastText Dim768WS2 produces the best results (see Table 4), while in RC, BERT achieves the best results on the morphological relations only.

The WR and WS tasks (MEN, Schm280, SimLex999) paint a similar picture. Both hyperparameters, window size and number of dimensions, lead to a slight improvement when reduced and increased, respectively. Most notably, the similarity task benefits the most from altering these parameters (about 0.05 absolute correlation improvement with both parameters adjusted for fastText, see Table 4). While BERT is on par with or even slightly outperformed by the 300-dimensional type-based embeddings in the relatedness task, it performs better in the similarity task. The higher-dimensional vectors, however, can compare to BERT's performance on the SimLex999 dataset. Overall, every model seems to struggle with the more narrowly defined WS task when compared to the WR task.

The WC task (Duden, TOEFL) also shows a clear trend: all type-based embeddings exceed BERT's performance noticeably, by a 0.06 accuracy difference at minimum (Duden, Word2Vec Dim300) and 0.23 at maximum (TOEFL, FastText Dim768WS2). Altering the parameters of the type-based embeddings, similarly to the WR task, results in marginally better performing vectors.

BERT's embeddings perform considerably better in the RC task when compared to the 300-dimensional embeddings.
However, a substantial gain from the dimensionality increase can 26 also be observed with GermaNet as opposed to the other datasets, leading to both FfastText Dim768WS2 and Word2Vec Dim768 surpassing BERT’s performance by 0.06 and 0.08, cor- respondingly. While the same trend appears on the Wiktionary dataset, the classification of morphological relations by BERT’s embeddings still remains uncontested with an accuracy of 0.91. From a human perspective, the morphological relations are rather trivial (some examples are presented in Table 7 in the appendix); even from a computational point of view, lem- matizing or stemming the tails of these triples could in theory reliably predict the individual heads. This implies that generally, BERT can reproduce these kinds of simpler relations the best, while traditional models capture complex semantic associations more accurately. We separately explored the individual performances of all relations in GermaNet and Wiktionary and discovered that the higher F1 score of BERT mainly stems from the derivations, indi- cating that the word piece tokenization of BERT might facilitate its remarkable performance. Controlling for the dataset and relation size in a linear regression did not reveal a correlation between the amount of overlap and F1 however. From these experiments we can conclude that for word similarity and term relatedness use cases, employing regular fastText embeddings, optionally increasing the number of dimensions, is sufficient. Using embeddings with the same number of dimensions as BERT results in the static embeddings taking the lead in the WS and RC task for semantic relations, specifically. More so, there appears to be no clear trend on whether BERT’s distilled embedding is gen- erally better (or worse) than others models. In certain tasks, it performs particularly well (e.g., Wiktionary), and in others particularly bad (e.g., TOEFL). To give some statistical estimate on the difference in performance between BERT and other models, we employ Bayesian hierarchi- cal correlated t-tests proposed by Benavoli et al. [5] and Corani et al. [9], designed to compare performances of two classifiers on multiple test sets.6 This hierarchical model is learned on our observed scores, and after learning, can be queried to make inference on the performance differ- ence (in score points, e.g., absolute accuracy difference) between BERT and an other language model on a future unseen dataset. See the cited references for a thorough presentation of the hierarchical model and the inference method. (Note that the Bayesian hierarchical correlated t-test is based on repeated cross-validation runs on the same dataset. Hence, to adapt our setup to the t-test, we need to modify our task procedures to obtain cross-validation results. Section A.2 in the appendix gives details on how we implemented this.) Table 5 gives the results on this inference. Most prominently, it estimates that on a fu- ture unseen dataset, FastText Dim768WS2 most certainly will outperform BERT’s distilled embedding by at least 0.03 absolute score points (P = 89.1 %). Even on the relatively weak Word2Vec Dim300, the hierarchical model predicts roughly equal probabilities for either BERT being better vs. Word2Vec Dim300 being better (by at least 0.03 absolute score points, 47.9 % vs. 51.8 %). Nevertheless, this quantitative analysis also has limits due to the stochastic model presumed by the Bayesian hierarchical correlated t-test. 
The model assumes that the performance dif- ferences among the datasets (δ1 , δ2 , . . . , δnext ) are i.i.d. and follow the same high-level Student- t-distribution t(µ0 , σ0 , ν); thus, the model assumes that the considered datasets are in some way homogeneous. Though all our datasets are meant to examine word similarity, the distinct differences in performance of the embedding types we observe (see fig. 2), indicate that these datasets represent different aspects of word similarity, which certain language models capture 6 We want to thank the anonymous reviewer who brought the potential of the Bayesian hierarchical correlated t-test to our attention. 27 Table 5 Results of the Bayesian hierarchical correlated t-tests on the performance differences on a future unseen dataset. The hypotheses “BERT better” resp. “BERT worse” refer to a performance gain/loss of at least 0.03 score points. The hypothesis “practically equivalent” signifies that the performance difference between the two compared embeddings is no more than 0.03 score points. BERT (sum-median-median) vs. … BERT better practically equivalent BERT worse FastText Dim768WS2 10.05 % 0.88 % 89.08 % FastText Dim300WS2 25.20 % 0.83 % 73.98 % FastText Dim300WS5 34.65 % 0.62 % 64.72 % Word2Vec Dim768 34.58 % 0.33 % 65.10 % Word2Vec Dim300 47.95 % 0.25 % 51.80 % better than others. Hence in our use case, we see the limits of the assumptions made by the stochastic model, and in this light, the results of the Bayesian hierarchical correlated t-tests need to be interpreted cautiously. 5.3. Discussion The most important result from our experiments is that a widespread assumption in NLP and Computational Humanities is not true: a context-sensitive embedding like BERT is not automatically better for all purposes. Static embeddings like fastText are at least on par if not better if word embeddings are used as abstractions of semantic systems. But our results are subject to some important limitations. For example, we can think of several ways which could increase BERT’s capabilities to represent word similarity, which we haven’t explored: • Modify the training objective for the pre-training phase, for example by adding a task which influences how the model represents word similarity. • Fine-tune the model on a task to improve the representation of word similarity, for example predict the nearest neighbour based on existing similarity word lists. • Replace wordpiece tokenization back to full word tokenization, which has been reported to improve performances in some contexts. [11] On the other hand we didn’t spend much time to find the best parameters for the static embeddings and we just used a well established static embedding like fastText and didn’t test more recent proposals for static embeddings like [14] which reported improved results. So there is a lot of room for improvements in both directions. In order to understand how the performance differences we observed between static and dynamic embeddings relate to the performance gains which have been observed by stacking embeddings from different sources [3], we combine word2vec, fastText and BERT embeddings in different constellations and add a fastText model with the same dimensions to compensate for effects based on the different dimensionality of the embeddings (see Figure 3). For four evaluation sets – GermaNet, Men, Duden, TOEFL – the differences between BERT and fastText are larger than the difference between fastText and a stacked alternative. 
The performance gain of using stacked embeddings is in most cases rather small. Adding BERT to the stacked embeddings either doesn’t help at all – TOEFL – or only a little bit – GermaNet, Schm280, Men, Duden. The only exception is the Wiktionary dataset which is already the only use case, where 28 Table 6: Performances of stacked embeddings in the seven discussed tasks. Top value per task is highlighted bold. GermaNet (RC) Wiktionary (RC) SimLex999 (WS) Schm280 (WR) MEN (WR) Duden (WC) TOEFL (WC) dimensions macro F1 macro F1 Spearman ρ Spearman ρ Spearman ρ accuracy accuracy BERT (sum-median-median) 768 .710 .906 .476 .754 .650 .543 .589 FastText Dim768WS2 768 .774 .863 .491 .771 .736 .634 .813 FastText Dim1536WS2 1536 .833 .898 .500 .762 .737 .651 .810 Word2Vec Dim768 +FastText Dim768WS2 1536 .826 .859 .479 .765 .733 .632 .820 Word2Vec Dim768 +BERT (sum-median-median) 1536 .833 .914 .489 .788 .728 .624 .788 FastText Dim768WS2 +BERT (sum-median-median) 1536 .791 .930 .508 .802 .734 .637 .766 Word2Vec Dim768 +FastText Dim768WS2 +BERT (sum-median-median) 2304 .837 .916 .496 .791 .737 .642 .803 0.9 0.8 BERT (sum-median-median) FastText Dim768WS2 FastText Dim1536WS2 Word2Vec Dim768 +FastText Dim768WS2 Word2Vec Dim768 score 0.7 +BERT (sum-median-median) FastText Dim768WS2 +BERT (sum-median-median) Word2Vec Dim768 +FastText Dim768WS2 +BERT (sum-median-median) 0.6 0.5 GermaNet Wiktionary SimLex999 Schm280 MEN Duden TOEFL (RC, macro F1) (RC, macro F1) (WS, Spearman ) (WS, Spearman ) (WS, Spearman ) (WC, accuracy) (WC, accuracy) Figure 3: Performances of stacked embeddings in the seven discussed tasks. Graphical representation of the results in Table 6. BERT is better than fastText. As discussed above the Wiktionary dataset consists mainly of inflections, for example singular vs. plural, or derivations, for example masculine form of a noun (‘Autor’) vs. female form (‘Autorin’). More examples are listed in Table 7. Maybe more sophisticated approaches combining the different embeddings like [13] will show better results, but obviously they all need a token-based model next to the static models. Exploring the behaviour of the different embeddings we also came across a noticeable dif- ference between the BERT-based embeddings and the static embeddings (see Figure 4). We calculated the distances between 956 synonym pairs, using synonyms as defined by GermaNet in one setting and defined by Duden in the other. To make the results comparable we stan- dardized each of them by drawing 956 random word pairs and based our calculation of the mean distance and the standard deviation on them. Then we expressed the cosine distance of the synonyms in standard deviations away from the mean distance. The results show for both datasets a much larger spread for the static embeddings indicating that the BERT vectors occupy a smaller space, an effect which is not related to the dimensionality of its vectors. 
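The standardization used for this comparison can be sketched as follows (our own rendering; `synonym_pairs` and `random_pairs` are assumed to be lists of word pairs, and `emb` a mapping from types to vectors of the embedding under examination):

```python
import numpy as np
from scipy.spatial.distance import cosine

def standardized_synonym_distances(synonym_pairs, random_pairs, emb):
    """Express each synonym pair's cosine distance in standard deviations away
    from the mean cosine distance of randomly drawn word pairs, so that the
    spreads of different embeddings become comparable."""
    random_d = np.array([cosine(emb[a], emb[b]) for a, b in random_pairs])
    mu, sigma = random_d.mean(), random_d.std()
    return np.array([(cosine(emb[a], emb[b]) - mu) / sigma for a, b in synonym_pairs])
```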
29 GermaNet synonyms 0.15 Word2Vec Dim300 Word2Vec Dim768 FastText Dim300WS5 density 0.10 FastText Dim300WS2 FastText Dim768WS2 0.05 BERT (sum-median-median) 0.00 15.0 12.5 10.0 7.5 5.0 2.5 0.0 2.5 model-standardized cosine distance Duden synonyms 0.25 0.20 Word2Vec Dim300 Word2Vec Dim768 0.15 FastText Dim300WS5 density FastText Dim300WS2 0.10 FastText Dim768WS2 BERT (sum-median-median) 0.05 0.00 15.0 12.5 10.0 7.5 5.0 2.5 0.0 2.5 model-standardized cosine distance Figure 4: Kernel density estimation (Gaussian, h = 0.5) of the cosine distance of synonym pairs in the respective embeddings, each standardized for the respective embedding. This seems to be in accordance with results from Ethayarajh [12], who reported that the contextualized token embedding of BERT is anisotropic: randomly sampled words seem to have, on average, a very high cosine similarity. In fact, Timkey and van Schijndel [37] report in a pre-print that in BERT’s contextualized embedding space, a few dimensions dominate similarity between word vectors (“rogue dimension”). As future work, we want to examine the effect of post-processing transformations on the embedding spaces, proposed by Timkey and van Schijndel, which are designed to counteract the undesirable effect of these rogue dimensions. In our first exploratory experiments we observe that all our examined embeddings – both the distilled ones from BERT, but also the static ones – appear to benefit from post-processing the type vectors. Yet even then, the post-processing still does not give BERT an advantage over static embeddings. To summarize, our main take away is not a recommendation for a specific static word embedding, rather we think it is worthwhile to continue research on static word embeddings – at least for researchers working in the field of Computational Literary Studies –, because their representational power as abstractions of semantic systems is on par to that of dynamic embeddings, the needed computing power is much less and the minimal size of the corpora needed to train them is also smaller. What we need in the field of Computational Literary Studies is a more robust understanding how the quality of embeddings is related to the size and structure of datasets, methods to improve the performance of static embeddings trained on 30 even smaller datasets, maybe by combining them with knowledge bases, and more evaluation datasets for languages beyond English. References [1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. “On the Surprising Behavior of Distance Metrics in High Dimensional Space”. In: Database Theory, ICDT 2001. Ed. by J. Van den Bussche and V. Vianu. Lecture Notes in Computer Science. 2001, pp. 420–434. doi: 10.1007/3-540-44503-x\_27. [2] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa. “A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches”. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Boulder, Colorado, 2009, pp. 19–27. url: https://aclanthology.org/N09-1003. [3] A. Akbik, D. Blythe, and R. Vollgraf. “Contextual String Embeddings for Sequence Labeling”. In: Proceedings of the 27th International Conference on Computational Lin- guistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018, pp. 1638–1649. url: https://aclanthology.org/C18-1139. [4] A. Bakarov. “A survey of word embeddings evaluation methods”. In: arXiv preprint arXiv:1801.09536 (2018). 
url: http://arxiv.org/abs/1801.09536.
[5] A. Benavoli, G. Corani, J. Demšar, and M. Zaffalon. “Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis”. In: Journal of Machine Learning Research 18.77 (2017), pp. 1–36. url: http://jmlr.org/papers/v18/16-305.html.
[6] R. Bommasani, K. Davis, and C. Cardie. “Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 4758–4781. doi: 10.18653/v1/2020.acl-main.431.
[7] E. Bruni, N.-K. Tran, and M. Baroni. “Multimodal distributional semantics”. In: Journal of Artificial Intelligence Research 49 (2014), pp. 1–47. doi: 10.1613/jair.4135.
[8] B. Chan, S. Schweter, and T. Möller. “German’s Next Language Model”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online), 2020, pp. 6788–6796. doi: 10.18653/v1/2020.coling-main.598.
[9] G. Corani, A. Benavoli, J. Demšar, F. Mangili, and M. Zaffalon. “Statistical comparison of classifiers through Bayesian hierarchical modelling”. In: Machine Learning 106.11 (2017), pp. 1817–1837. doi: 10.1007/s10994-017-5641-9.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.
[11] H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii. “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, 2020, pp. 6903–6915. doi: 10.18653/v1/2020.coling-main.609.
[12] K. Ethayarajh. “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China, 2019, pp. 55–65. doi: 10.18653/v1/D19-1006.
[13] P. Gupta and M. Jaggi. “Obtaining Better Static Word Embeddings Using Contextual Embedding Models”. In: arXiv preprint arXiv:2106.04302 (2021). url: http://arxiv.org/abs/2106.04302.
[14] P. Gupta, M. Pagliardini, and M. Jaggi. “Better Word Embeddings by Disentangling Contextual n-Gram Information”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota, 2019, pp. 933–939. doi: 10.18653/v1/N19-1098.
[15] B. Hamp and H. Feldweg. “GermaNet - a Lexical-Semantic Net for German”. In: Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. 1997. url: https://www.aclweb.org/anthology/W97-0802.
[16] S. Hengchen, R. Ros, and J. Marjanen. “A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950”. In: Book of Abstracts of DH2019. Utrecht, 2019. url: https://dev.clariah.nl/files/dh2019/boa/0791.html.
[17] V. Henrich and E. Hinrichs. “GernEdiT - The GermaNet Editing Tool”. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta: European Language Resources Association (ELRA), 2010, pp. 2228–2235. url: http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf.
[18] F. Hill, R. Reichart, and A. Korhonen. “SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation”. In: Computational Linguistics 41.4 (2015), pp. 665–695. doi: 10.1162/COLI_a_00237.
[19] M. Kocher and J. Savoy. “Distributed language representation for authorship attribution”. In: Digital Scholarship in the Humanities 33.2 (2017), pp. 425–441. doi: 10.1093/llc/fqx046.
[20] M. Köper, C. Scheible, and S. Schulte im Walde. “Multilingual Reliability and “Semantic” Structure of Continuous Word Spaces”. In: Proceedings of the 11th International Conference on Computational Semantics. London, UK, 2015, pp. 40–45. url: https://aclanthology.org/W15-0105.
[21] V. Kulkarni, R. Al-Rfou, B. Perozzi, and S. Skiena. “Statistically Significant Detection of Linguistic Change”. In: Proceedings of the 24th International World Wide Web Conference (WWW ’15). Florence, Italy, 2015, pp. 625–635. doi: 10.1145/2736277.2741627.
[22] A. Kutuzov, L. Øvrelid, T. Szymanski, and E. Velldal. “Diachronic word embeddings and semantic shifts: a survey”. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018, pp. 1384–1397. url: https://aclanthology.org/C18-1117.
[23] A. Lenci, M. Sahlgren, P. Jeuniaux, A. C. Gyllensten, and M. Miliani. “A comprehensive comparative evaluation and analysis of Distributional Semantic Models”. In: arXiv preprint arXiv:2105.09825 (2021). url: http://arxiv.org/abs/2105.09825.
[24] I. Leviant and R. Reichart. “Separated by an un-common language: Towards judgment language informed vector space modeling”. In: arXiv preprint arXiv:1508.00106 (2015). url: http://arxiv.org/abs/1508.00106.
[25] O. Levy and Y. Goldberg. “Linguistic Regularities in Sparse and Explicit Word Representations”. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Ann Arbor, Michigan: Association for Computational Linguistics, 2014, pp. 171–180. doi: 10.3115/v1/W14-1618.
[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013). url: http://arxiv.org/abs/1301.3781.
[27] T. Mikolov, W.-t. Yih, and G. Zweig. “Linguistic Regularities in Continuous Space Word Representations”. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics, 2013, pp. 746–751. url: https://www.aclweb.org/anthology/N13-1090.
[28] P. J. Ortiz Suárez, L. Romary, and B. Sagot. “A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 1703–1714. url: https://www.aclweb.org/anthology/2020.acl-main.156.
[29] S. Rahmani, S. M. Fakhrahmad, and M. H. Sadreddini. “Co-occurrence graph-based context adaptation: a new unsupervised approach to word sense disambiguation”. In: Digital Scholarship in the Humanities (2020). doi: 10.1093/llc/fqz048.
[30] A. Rogers, O. Kovaleva, and A. Rumshisky. “A Primer in BERTology: What We Know About How BERT Works”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 842–866. doi: 10.1162/tacl_a_00349.
[31] R. Ros. “Conceptual Vocabularies and Changing Meanings of “Foreign” in Dutch Foreign News (1815–1914)”. In: Book of Abstracts of DH2019. Utrecht, 2019. url: https://dev.clariah.nl/files/dh2019/boa/0651.html.
[32] R. Ros and J. van Eijnatten. “Disentangling a Trinity: A Digital Approach to Modernity, Civilization and Europe in Dutch Newspapers (1840-1990)”. In: Book of Abstracts of DH2019. Utrecht, 2019. url: https://dev.clariah.nl/files/dh2019/boa/0572.html.
[33] D. Salami and S. Momtazi. “Recurrent convolutional neural networks for poet identification”. In: Digital Scholarship in the Humanities (2020). doi: 10.1093/llc/fqz096.
[34] Y. Song, T. Kimura, B. Batjargal, and A. Maeda. “Linking the Same Ukiyo-e Prints in Different Languages by Exploiting Word Semantic Relationships across Languages”. In: Book of Abstracts of DH2017. Alliance of Digital Humanities Organizations. Montréal, Canada, 2017. url: https://dh2017.adho.org/abstracts/369/369.pdf.
[35] Y. Susanti, T. Tokunaga, H. Nishikawa, and H. Obari. “Automatic distractor generation for multiple-choice English vocabulary questions”. In: Research and Practice in Technology Enhanced Learning 13.2 (2018). doi: 10.1186/s41039-018-0082-z.
[36] M. A. H. Taieb, T. Zesch, and M. B. Aouicha. “A survey of semantic relatedness evaluation datasets and procedures”. In: Artificial Intelligence Review 53.6 (2020), pp. 4407–4448.
[37] W. Timkey and M. van Schijndel. “All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality”. In: arXiv preprint arXiv:2109.04404 (2021). url: http://arxiv.org/abs/2109.04404.
[38] T. Uslu, A. Mehler, C. Schulz, and D. Baumartz. “BigSense: a Word Sense Disambiguator for Big Data”. In: Book of Abstracts of DH2019. Utrecht, 2019. url: https://dev.clariah.nl/files/dh2019/boa/0199.html.
[39] I. Vulić, E. M. Ponti, R. Litschko, G. Glavaš, and A. Korhonen. “Probing Pretrained Language Models for Lexical Semantics”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 7222–7240. doi: 10.18653/v1/2020.emnlp-main.586.
[40] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 353–355. doi: 10.18653/v1/W18-5446.
[41] B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo. “Evaluating word embedding models: methods and experimental results”. In: APSIPA Transactions on Signal and Information Processing 8 (2019). doi: 10.1017/atsip.2019.12.
[42] Y. Wang, L. Cui, and Y. Zhang. “How Can BERT Help Lexical Semantics Tasks?” In: arXiv preprint arXiv:1911.02929 (2020). url: http://arxiv.org/abs/1911.02929.
[43] S. Ziehe and C. Sporleder. “Multimodale Sentimentanalyse politischer Tweets”. In: Book of Abstracts of DHd2019. Frankfurt, 2019, pp. 331–332. doi: 10.5281/zenodo.2596095.

A. Appendix

A.1. Supplementary tables and figures

Table 7: Individual relations of the GermaNet and Wiktionary dataset.

relation  # instances  examples
germanet_lexrel_has_pertainym 1290 (erfolgreich, Erfolg), (problemlos, Problem), (unterschiedlich, Unterschied)
germanet_lexrel_is_part_of 193 (Hotelzimmer, Hotel), (Handgelenk, Hand), (Autoradio, Auto)
germanet_lexrel_has_material 176 (Teppichboden, Teppich), (Lederjacke, Leder), (Acrylglas, Acryl)
germanet_lexrel_has_part 165 (Treppenhaus, Treppe), (Zifferblatt, Ziffer), (Hafenstadt, Hafen)
germanet_lexrel_has_user 150 (Tierarzt, Tier), (Tierheim, Tier), (Krankenhaus, Kranke)
germanet_lexrel_has_location 129 (Gartenhaus, Garten), (Fußbodenheizung, Fußboden), (Dachboden, Dach)
germanet_lexrel_has_active_usage 119 (Rettungswagen, Rettung), (Kreuzfahrtschiff, Kreuzfahrt), (Schlafanzug, Schlaf)
germanet_lexrel_has_specialization 107 (Projektleiter, Projekt), (Reiseleiter, Reise), (Jugendleiter, Jugendgruppe)
germanet_lexrel_has_usage 107 (Autobahn, Auto), (Gotteshaus, Gottesdienst), (Kurhaus, Kur)
germanet_lexrel_has_purpose_of_usage 100 (Couchtisch, Couch), (Handtuch, Hand), (Kotflügel, Kot)
germanet_lexrel_has_topic 99 (Reisebüro, Reise), (Berufsschule, Beruf), (Liebesroman, Liebe)
germanet_lexrel_is_container_for 84 (Briefkasten, Brief), (Mülleimer, Müll), (Mülltonne, Müll)
germanet_lexrel_has_raw_product 83 (Zitronensaft, Zitrone), (Kokosöl, Kokosnuss), (Kokosmilch, Kokosnuss)
germanet_lexrel_has_manner_of_functioning 78 (Seilbahn, Seil), (Atombombe, Atom), (Handbremse, Hand)
germanet_lexrel_has_appearance 76 (Wendeltreppe, Wendel), (Ringelblume, Ringel), (Plüschtier, Tier)
germanet_lexrel_has_attribute 76 (Ferienwohnung, Ferien), (Ferienhaus, Ferien), (Handbuch, Hand)
germanet_lexrel_has_ingredient 76 (Currywurst, Curry), (Hefeteig, Hefe), (Vollkornbrot, Vollkorn)
germanet_lexrel_has_function 73 (Chefarzt, Chef), (Chefkoch, Chef), (Liegestuhl, Liege)
germanet_lexrel_has_time 73 (Wintergarten, Winter), (Winterreifen, Winter), (Nachttisch, Nacht)
germanet_lexrel_has_product 62 (Honigbiene, Honig), (Textilfabrik, Textil), (Obstbaum, Obst)
germanet_lexrel_has_origin 61 (Regenwasser, Regen), (Leserbrief, Leser), (Menschensohn, Mensch)
germanet_lexrel_has_content 57 (Telefonbuch, Telefonnummer), (Branchenbuch, Branche), (Terminkalender, Termin)
germanet_lexrel_has_owner 56 (Bundesstraße, Bundesrepublik), (Vereinsheim, Verein), (Feuerwehrhaus, Feuerwehr)
germanet_lexrel_has_habitat 53 (Tiergarten, Tier), (Zimmerpflanze, Zimmer), (Alpenrose, Alpen)
germanet_lexrel_has_prototypical_place_of_usage 39 (Gartenmöbel, Garten), (Raumschiff, Weltraum), (Geländewagen, Gelände)
germanet_lexrel_has_occasion 34 (Ehering, Ehe), (Schultüte, Schulbeginn), (Neujahrskonzert, Neujahr)
germanet_lexrel_has_member 32 (Ingenieurbüro, Ingenieur), (Abgeordnetenhaus, Abgeordnete), (Hotelkette, Hotel)
germanet_lexrel_is_location_of 30 (Firmengelände, Firma), (Vereinsgelände, Verein), (Museumsinsel, Museum)
germanet_lexrel_has_other_property 30 (Blutzucker, Blut), (Blutbad, Blut), (Hosenanzug, Hose)
germanet_lexrel_has_component 29 (Schmutzwasser, Schmutz), (Seifenwasser, Seife), (Duftöl, Duftstoff)
germanet_lexrel_has_goods 29 (Autohaus, Auto), (Biergarten, Bier), (Warenhaus, Ware)
germanet_lexrel_is_storage_for 28 (Bootshaus, Boot), (Gemäldegalerie, Gemälde), (Gewandhaus, Gewand)
germanet_lexrel_has_prototypical_holder 28 (Handschuh, Hand), (Ohrstecker, Ohr), (Ohrring, Ohr)
germanet_lexrel_is_member_of 28 (Ehefrau, Ehe), (Ehemann, Ehe), (Hansestadt, Hanse)
germanet_lexrel_has_eponym 24 (Marienkirche, Maria), (Disneyland, Disney), (Marienkäfer, Maria)
germanet_lexrel_has_relation 21 (Kreisstadt, Kreisverwaltung), (Gruppenleiter, Gruppe), (Torschützenkönig, Torschütze)
germanet_lexrel_has_production_method 19 (Baggersee, Bagger), (Recyclingpapier, Recycling), (Sägemehl, Säge)
germanet_lexrel_is_product_of 19 (Spinnennetz, Spinne), (Wespennest, Wespe), (Storchennest, Storch)
germanet_lexrel_has_consistency_of 13 (Puderzucker, Puder), (Honigmelone, Honig), (Krepppapier, Krepp)
germanet_lexrel_is_comparable_to 12 (Mammutbaum, Mammut), (Hitzkopf, Hitze), (Inselberg, Insel)
germanet_lexrel_has_no_property 11 (Zaunkönig, Zaun), (Cocktailtomate, Cocktail), (Lackaffe, Lack)
germanet_lexrel_is_prototypical_holder_for 11 (Glockenturm, Glocke), (Gardinenstange, Gardine), (Fahrradbügel, Fahrrad)
wiktionary_derivations_subst_in 2000 (Autor, Autorin), (Sänger, Sängerin), (Schauspieler, Schauspielerin)
wiktionary_derivations_subst_ung 1059 (nutzen, Nutzung), (unterstützen, Unterstützung), (meinen, Meinung)
wiktionary_derivations_adj_isch 566 (Telefon, telefonisch), (England, englisch), (Medizin, medizinisch)
wiktionary_derivations_subst_er 507 (nutzen, Nutzer), (besuchen, Besucher), (lesen, Leser)
wiktionary_derivations_subst_keit 333 (tätig, Tätigkeit), (wirklich, Wirklichkeit), (verfügbar, Verfügbarkeit)
wiktionary_derivations_adj_lich 235 (Natur, natürlich), (Ende, endlich), (Person, persönlich)
wiktionary_derivations_adj_ig 202 (Stand, ständig), (Zustand, zuständig), (Kraft, kräftig)
wiktionary_derivations_adj_los 192 (Kosten, kostenlos), (Problem, problemlos), (Mühe, mühelos)
wiktionary_derivations_subst_chen 186 (Brot, Brötchen), (Paar, Pärchen), (Mann, Männchen)
wiktionary_derivations_subst_heit 111 (sicher, Sicherheit), (mehr, Mehrheit), (gelegen, Gelegenheit)
wiktionary_derivations_adj_bar 87 (verfügen, verfügbar), (erkennen, erkennbar), (denken, denkbar)
wiktionary_derivations_subst_e 82 (helfen, Hilfe), (erst, Erste), (anzeigen, Anzeige)
wiktionary_derivations_subst_ei 67 (Bäcker, Bäckerei), (Gärtner, Gärtnerei), (Brenner, Brennerei)
wiktionary_derivations_subst_schaft 51 (Partner, Partnerschaft), (Mitglied, Mitgliedschaft), (Meister, Meisterschaft)
wiktionary_derivations_adj_haft 45 (Beispiel, beispielhaft), (Masse, massenhaft), (Zweifel, zweifelhaft)
wiktionary_derivations_subst_lein 40 (Buch, Büchlein), (Bauch, Bäuchlein), (Licht, Lichtlein)
wiktionary_derivations_subst_ler 38 (Wissenschaft, Wissenschaftler), (Sport, Sportler)
wiktionary_derivations_subst_tum 28 (Christ, Christentum), (Jude, Judentum), (Brauch, Brauchtum)
wiktionary_derivations_subst_ling 28 (früh, Frühling), (neu, Neuling), (flüchten, Flüchtling)
wiktionary_derivations_adj_en 19 (Perle, perlen), (Metall, metallen), (Bronze, bronzen)
wiktionary_derivations_subst_nis 19 (erleben, Erlebnis), (verhalten, Verhältnis), (erlauben, Erlaubnis)
wiktionary_derivations_adj_fach 18 (drei, dreifach), (zwei, zweifach), (vier, vierfach)
wiktionary_derivations_adj_sam 17 (wirken, wirksam), (Rat, ratsam), (gemein, gemeinsam)
wiktionary_infl_verb_partizip_perfekt 2115 (machen, gemacht), (finden, gefunden), (sein, gewesen)
wiktionary_infl_verb_sg_2p_präsens 2014 (können, kannst), (haben, hast), (sein, bist)
wiktionary_infl_verb_sg_1p_präsens 2003 (einen, eine), (haben, habe), (sein, bin)
wiktionary_infl_adj_komparativ 2002 (viel, mehr), (wenig, weniger), (weit, weiter)
wiktionary_infl_subst_nom_pl 2001 (Jahr, Jahre), (Bild, Bilder), (Kind, Kinder)
wiktionary_infl_subst_gen_sg 2001 (Foto, Fotos), (Jahr, Jahres), (Video, Videos)
wiktionary_infl_subst_dat_pl 1998 (Jahr, Jahren), (Kind, Kindern), (Spiel, Spielen)
wiktionary_infl_adj_superlativ 1997 (viel, meisten), (wichtig, wichtigsten), (groß, größten)
wiktionary_infl_verb_sg_1p_prät_indikativ 643 (sein, war), (werden, wurde), (haben, hatte)
wiktionary_infl_verb_sg_1p_prät_konjunktiv 470 (werden, würde), (sein, wäre), (können, könnte)

Table 8: Comparison of the embeddings computed by the different distillation methods on a subset of the MEN task (WR), TOEFL task (WC), Wiktionary task, and Germanet task (RC). Reported numbers are the mean standardized score over all four tasks. Row maxima are highlighted bold. The top 13 maxima are underlined.

vectorization/pooling  aggregation  inputemb L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 L11 L12 L1-4 L9-12 sum all
first l0.5medoid -3.453 -1.850 -0.831 -0.090 0.102 -0.524 -0.583 -0.839 -0.743 -0.966 -0.595 -0.403 -0.261 -0.119 -0.206 -0.018 0.474
first l1medoid -3.354 -1.926 -0.838 -0.052 0.177 -0.447 -0.626 -0.896 -0.673 -0.974 -0.563 -0.494 -0.256 -0.184 -0.184 0.019 0.437
first l2medoid -3.422 -1.812 -0.909 -0.099 0.223 -0.342 -0.597 -0.851 -0.746 -0.879 -0.649 -0.392 -0.169 -0.224 -0.202 -0.045 0.403
first mean -2.698 -1.005 -0.831 0.085 0.221 -0.198 -0.293 -0.670 -0.445 -0.784 -0.523 -0.247 -0.450 -0.021 -0.255 -0.146 0.072
first meannorm -2.719 -0.998 -0.821 0.015 0.166 -0.211 -0.316 -0.709 -0.480 -0.815 -0.521 -0.271 -0.382 -0.022 -0.285 -0.210 0.031
first median -2.459 -0.969 -0.817 0.143 0.212 -0.216 -0.291 -0.621 -0.448 -0.752 -0.492 -0.291 -0.478 0.016 -0.257 -0.089 0.116
last l0.5medoid -0.979 -0.492 -0.411 0.009 0.188 -0.002 -0.247 -0.399 -0.585 -0.394 -0.263 -0.285 -0.401 0.046 -0.021 0.206 0.369
last l1medoid -1.007 -0.608 -0.398 0.063 0.249 0.060 -0.238 -0.413 -0.471 -0.340 -0.216 -0.299 -0.404 0.106 -0.009 0.171 0.327
last l2medoid -0.922 -0.638 -0.396 0.043 0.327 0.016 -0.177 -0.453 -0.444 -0.429 -0.194 -0.259 -0.266 0.179 -0.126 0.079 0.389
last mean -0.689 -0.544 -0.436 0.162 0.232 0.153 -0.042 -0.043 -0.158 -0.101 -0.103 -0.162 -0.381 -0.002 -0.057 0.229 0.344
last meannorm -0.858 -0.589 -0.405 0.188 0.302 0.185 -0.005 -0.012 -0.123 -0.061 -0.008 -0.077 -0.333 0.002 -0.009 0.251 0.392
last median -0.674 -0.570 -0.420 0.160 0.271 0.146 -0.019 -0.040 -0.178 -0.076 -0.057 -0.152 -0.363 0.031 -0.049 0.279 0.392
l0.5medoid l0.5medoid -2.517 -1.359 -0.331 0.223 0.142 -0.190 -0.321 -0.309 -0.488 -0.590 -0.382 -0.304 -0.251 0.159 -0.162 0.263 0.629
l0.5medoid l1medoid -2.540 -1.397 -0.345 0.193 0.144 -0.040 -0.261 -0.410 -0.493 -0.663 -0.392 -0.374 -0.260 0.147 -0.133 0.344 0.473
l0.5medoid l2medoid -2.505 -1.310 -0.381 0.234 0.215 -0.104 -0.254 -0.336 -0.536 -0.726 -0.424 -0.272 -0.182 0.023 -0.198 0.160 0.426
l0.5medoid mean -2.052 -0.825 -0.240 0.288 0.367 0.130 0.007 -0.316 -0.214 -0.388 -0.283 0.014 -0.313 0.087 -0.174 0.147 0.189
l0.5medoid meannorm -2.071 -0.875 -0.304 0.320 0.323 0.120 -0.021 -0.407 -0.237 -0.372 -0.247 0.025 -0.240 0.090 -0.162 0.157 0.223
l0.5medoid median -1.926 -0.832 -0.316 0.258 0.287 0.080 -0.075 -0.340 -0.334 -0.361 -0.225 -0.054 -0.373 0.099 -0.215 0.168 0.222
l1medoid l0.5medoid -2.566 -1.344 -0.416 0.165 0.115 -0.135 -0.346 -0.271 -0.481 -0.604 -0.388 -0.288 -0.265 0.123 -0.105 0.267 0.606
l1medoid l1medoid -2.546 -1.461 -0.528 0.165 0.134 -0.080 -0.300 -0.393 -0.497 -0.668 -0.425 -0.358 -0.232 0.125 -0.110 0.324 0.456
l1medoid l2medoid -2.539 -1.301 -0.453 0.189 0.215 -0.076 -0.283 -0.331 -0.507 -0.712 -0.424 -0.264 -0.198 0.051 -0.207 0.131 0.438
l1medoid mean -2.146 -0.938 -0.240 0.212 0.354 0.077 -0.006 -0.278 -0.229 -0.399 -0.333 0.004 -0.391 0.080 -0.188 0.133 0.206
l1medoid meannorm -2.117 -0.969 -0.294 0.238 0.323 0.060 -0.015 -0.337 -0.202 -0.403 -0.323 -0.008 -0.267 0.030 -0.139 0.130 0.250
l1medoid median -1.983 -0.949 -0.289 0.198 0.256 0.093 -0.012 -0.358 -0.256 -0.390 -0.206 -0.113 -0.415 0.086 -0.201 0.084 0.218
l2medoid l0.5medoid -2.512 -1.438 -0.355 0.115 0.233 -0.240 -0.329 -0.339 -0.489 -0.670 -0.362 -0.215 -0.184 0.219 -0.153 0.341 0.653
l2medoid l1medoid -2.545 -1.502 -0.430 0.175 0.229 -0.106 -0.305 -0.419 -0.515 -0.742 -0.416 -0.320 -0.240 0.160 -0.094 0.345 0.549
l2medoid l2medoid -2.555 -1.459 -0.411 0.170 0.328 -0.162 -0.238 -0.380 -0.521 -0.729 -0.444 -0.217 -0.218 0.096 -0.174 0.251 0.564
l2medoid mean -2.240 -0.973 -0.299 0.286 0.302 0.061 -0.045 -0.265 -0.277 -0.455 -0.250 -0.065 -0.338 0.104 -0.200 0.164 0.226
l2medoid meannorm -2.293 -0.980 -0.332 0.288 0.284 0.087 -0.097 -0.328 -0.268 -0.457 -0.236 -0.021 -0.240 0.090 -0.139 0.174 0.249
l2medoid median -2.021 -1.082 -0.361 0.203 0.237 0.054 -0.118 -0.300 -0.286 -0.413 -0.190 -0.119 -0.389 0.067 -0.177 0.124 0.279
nopooling l0.5medoid -1.936 -1.397 -0.909 -0.024 0.019 -0.260 -0.621 -0.901 -0.594 -0.674 -0.315 -0.037 -0.251 -0.259 -0.093 0.022 0.284
nopooling l1medoid -1.804 -1.379 -1.004 -0.011 -0.050 -0.113 -0.579 -0.978 -0.587 -0.595 -0.281 -0.155 -0.298 -0.227 -0.210 0.050 0.295
nopooling l2medoid -1.587 -1.437 -0.920 -0.035 0.093 -0.260 -0.514 -0.844 -0.696 -0.687 -0.342 -0.156 -0.203 -0.212 -0.144 0.042 0.171
nopooling mean 0.567 0.682 0.752 1.185 1.077 0.900 0.636 0.483 0.313 0.309 0.477 0.661 0.448 1.122 0.789 0.927 1.095
nopooling meannorm 0.061 0.398 0.526 1.115 1.055 0.869 0.607 0.469 0.305 0.323 0.459 0.701 0.486 0.964 0.751 0.898 1.053
nopooling median 0.519 0.621 0.717 1.157 1.128 0.897 0.666 0.422 0.344 0.247 0.466 0.670 0.511 1.206 0.795 1.018 1.122
mean l0.5medoid -0.404 0.291 0.292 0.761 0.805 0.338 0.070 -0.206 -0.376 -0.272 0.014 0.281 0.127 0.973 0.325 0.658 0.797
mean l1medoid -0.483 0.214 0.297 0.820 0.823 0.391 -0.003 -0.260 -0.328 -0.286 0.059 0.253 0.158 0.907 0.250 0.675 0.740
mean l2medoid -0.361 0.309 0.353 0.737 0.870 0.331 0.119 -0.347 -0.296 -0.166 -0.026 0.215 0.202 0.964 0.233 0.557 0.769
mean mean 0.567 0.682 0.752 1.185 1.077 0.900 0.636 0.483 0.313 0.309 0.477 0.661 0.448 1.122 0.789 0.927 1.095
mean meannorm 0.188 0.436 0.575 1.131 1.067 0.851 0.633 0.468 0.318 0.321 0.459 0.662 0.502 0.970 0.766 0.887 1.042
mean median 0.590 0.683 0.704 1.179 1.099 0.834 0.613 0.501 0.309 0.300 0.404 0.691 0.539 1.101 0.795 0.959 1.100
meannorm l0.5medoid -0.632 0.045 0.184 0.646 0.857 0.351 0.272 -0.249 -0.252 -0.184 -0.040 0.208 0.201 0.817 0.305 0.701 0.751
meannorm l1medoid -0.685 0.002 0.224 0.693 0.926 0.399 0.126 -0.092 -0.232 -0.175 -0.034 0.273 0.198 0.821 0.348 0.697 0.730
meannorm l2medoid -0.621 0.032 0.321 0.673 0.904 0.560 0.077 -0.140 -0.216 -0.180 0.000 0.444 0.227 0.768 0.487 0.591 0.655
meannorm mean 0.122 0.403 0.519 1.107 1.049 0.827 0.603 0.441 0.317 0.321 0.455 0.635 0.466 0.950 0.654 0.864 1.018
meannorm meannorm 0.066 0.397 0.527 1.113 1.054 0.868 0.603 0.475 0.306 0.332 0.458 0.685 0.516 0.966 0.750 0.884 1.053
meannorm median 0.171 0.411 0.493 1.080 1.012 0.811 0.644 0.410 0.344 0.263 0.409 0.676 0.511 0.939 0.697 0.872 1.021
median l0.5medoid -0.501 0.368 0.305 0.833 0.745 0.457 0.139 -0.129 -0.193 -0.286 -0.021 0.342 0.229 0.940 0.358 0.936 0.998
median l1medoid -0.442 0.241 0.306 0.927 0.775 0.558 0.103 -0.189 -0.147 -0.261 0.065 0.310 0.193 0.997 0.377 0.884 0.886
median l2medoid -0.344 0.366 0.354 0.900 0.830 0.579 0.274 -0.270 -0.153 -0.207 0.094 0.326 0.203 1.085 0.393 0.675 0.916
median mean 0.511 0.716 0.763 1.121 1.139 1.007 0.711 0.592 0.262 0.316 0.561 0.681 0.455 1.251 0.747 1.074 1.111
median meannorm 0.184 0.507 0.636 1.101 1.124 0.951 0.702 0.535 0.298 0.306 0.557 0.649 0.496 1.129 0.759 1.074 1.110
median median 0.549 0.694 0.739 1.147 1.106 0.975 0.708 0.536 0.368 0.287 0.529 0.729 0.529 1.246 0.778 1.082 1.169

Figure 5: Comparison of the embeddings computed by the medoid-based distillation methods on the full evaluation dataset, visualized for each dataset separately. (One panel per dataset: Germanet and Wiktionary report macro F1, SimLex999, Schm280 and MEN report Spearman correlation, Duden and TOEFL report accuracy; the x-axis is vectorization/layer with the values inputemb, L1–L12, L1-4, L9-12, sum and all, the secondary axis is the standardized score, and each line corresponds to one combination of f ∈ {nopooling, mean, meannorm, median} and g ∈ {mean, meannorm, median}.)
A.2. Bayesian model comparison

To compare two embeddings on our datasets, we have employed the Bayesian hierarchical correlated t-test as described by Corani et al. [9]. This test was originally designed to compare two classifiers on multiple datasets, given their respective cross-validation results. As presented in Sec. 3.3, our tasks do not perform such cross-validation. Therefore, to adapt our tasks to the test, we modify them as follows to obtain cross-validation results:

• As the Relation Classification task (Germanet, Wiktionary) is implemented as a median-based 1-nearest-neighbor classifier, it can be naturally extended to separate train and test sets. Given a train set of (labeled) word pairs and a test set of word pairs, we construct the decision objects (i.e., the medians) for the relations only on the training examples. Then we test the 1-nearest-neighbor classifier only on the test examples. On each Relation Classification dataset, we perform 10 runs of 10-fold stratified cross-validation to obtain 100 F1 scores.

• For the Word Relatedness and Word Similarity tasks (SimLex999, Schm280, MEN), there is no natural way to implement a cross-validation, since these tasks measure correlation and are not “trained”. Therefore, to mimic the 10-fold cross-validation, on each dataset we randomly sample 100 subsets that each contain 10 % of the respective dataset, and calculate the Spearman ρ on each of these subsets to obtain 100 correlation coefficients.

• Similarly for the Word Choice tasks (Duden, TOEFL): we randomly sample 100 subsets containing 10 % of the respective dataset, and calculate the accuracy on each subset.

Fix a pair of models we want to compare. For the i-th dataset (of a total of q datasets), we calculate a vector x_i = (x_{i1}, x_{i2}, ..., x_{i100}) of score differences on each cross-validation fold, using the same folds for both models on each dataset. On these vectors x_1, ..., x_q, we can now perform the Bayesian hierarchical correlated t-test using the Python package baycomp (https://github.com/janezd/baycomp), which implements the hierarchical stochastic model and performs the “hypothesis test” that estimates the posterior distribution of the score difference between the two models on a future unseen dataset, as proposed by Corani et al. [9].
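The listing below is a minimal sketch of this protocol for the correlation-based tasks, assuming baycomp's two_on_multiple() entry point; the helper functions, the toy data, and the rope width of 0.01 are illustrative assumptions and not the evaluation code released with this paper.

# Minimal sketch (not the paper's released code): pseudo cross-validation by
# repeated 10 % subsampling, followed by the hierarchical Bayesian correlated
# t-test of Corani et al. [9] via baycomp (https://github.com/janezd/baycomp).
import numpy as np
from scipy.stats import spearmanr
import baycomp

rng = np.random.default_rng(0)

def pseudo_cv_folds(n_items, n_folds=100, frac=0.1):
    """Random 10 % subsets that stand in for cross-validation folds;
    the same folds are reused for both embedding models."""
    k = max(2, int(frac * n_items))
    return [rng.choice(n_items, size=k, replace=False) for _ in range(n_folds)]

def fold_spearman(gold, pred, folds):
    """Spearman rho of one model's similarity scores on every fold."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    return np.array([spearmanr(gold[idx], pred[idx]).correlation for idx in folds])

# Toy stand-in for q = 3 word-similarity datasets: human gold ratings plus
# noisy similarity predictions from two hypothetical embedding models A and B.
datasets = []
for size in (999, 280, 3000):  # roughly SimLex999-, Schm280-, MEN-sized
    gold = rng.uniform(0, 10, size)
    pred_a = gold + rng.normal(0, 2.0, size)
    pred_b = gold + rng.normal(0, 2.5, size)
    datasets.append((gold, pred_a, pred_b))

rows_a, rows_b = [], []
for gold, pred_a, pred_b in datasets:
    folds = pseudo_cv_folds(len(gold))  # identical folds for both models
    rows_a.append(fold_spearman(gold, pred_a, folds))
    rows_b.append(fold_spearman(gold, pred_b, folds))

scores_a, scores_b = np.vstack(rows_a), np.vstack(rows_b)  # shape (q, 100)

# Hierarchical model over all datasets; the rope (region of practical
# equivalence) of 0.01 is a modelling choice, not a value fixed by the paper.
p_a, p_rope, p_b = baycomp.two_on_multiple(scores_a, scores_b, rope=0.01)
print(f"P(A better) = {p_a:.3f}, P(rope) = {p_rope:.3f}, P(B better) = {p_b:.3f}")

For the Relation Classification and Word Choice tasks the same call applies, with per-fold F1 scores or accuracies in place of the Spearman correlations.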