<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computational Humanities Research Conference, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Type- and Token-based Word Embeddings in the Digital Humanities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anton Ehrmanntraut</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thora Hagen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonard Konle</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fotis Jannidis</string-name>
        </contrib>
        <aff>Julius-Maximilians-Universität Würzburg</aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In the general perception of the NLP community, the new dynamic, context-sensitive, token-based embeddings from language models like BERT have replaced the older static, type-based embeddings like word2vec or fastText, due to their better performance. We can show that this is not the case for one area of applications for word embeddings: the abstract representation of the meaning of words in a corpus. This application is especially important for the Computational Humanities, for example in order to show the development of words or ideas. The main contributions of our paper are: 1) We offer a systematic comparison between dynamic and static embeddings with respect to word similarity. 2) We test the best method to convert token embeddings to type embeddings. 3) We contribute new evaluation datasets for word similarity in German. The main goal of our contribution is to make an evidence-based argument that research on static embeddings, which essentially stopped after 2019, should be continued not only because they need less computing power and smaller corpora, but also because, for this specific set of applications, their performance is on par with that of dynamic embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>Word Embeddings</kwd>
        <kwd>BERT</kwd>
        <kwd>fastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We wanted to be able to describe the loss of information and performance that results from using
static instead of dynamic embeddings. But our surprising preliminary results were confirmed
by a paper which was published as preprint during our work [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: Static word embeddings can
be on par with dynamic embeddings or even surpass them in some settings. The usage of
word embeddings in DH can be categorized into two groups: using embeddings as an improved
form of word representation in tasks like sentiment analysis [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ], word sense disambiguation
[
        <xref ref-type="bibr" rid="ref29 ref38">29, 38</xref>
        ], authorship attribution [
        <xref ref-type="bibr" rid="ref19 ref33">19, 33</xref>
        ], etc., and using word embeddings as abstractions of
semantic systems, describing word meaning as a set of relations of a focus term to its semantic
neighbours. Since Kulkarni et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] proposed the comparison of embeddings trained on texts
from different slices in time to measure semantic change, their method, known as diachronic,
temporal or dynamic word embeddings [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] has been adopted by the Digital Humanities
community [
        <xref ref-type="bibr" rid="ref16 ref31 ref32">31, 32, 16</xref>
        ]. However, word embeddings as abstractions of semantic systems do not
only work in the historical dimension, but are universally applicable. Even a comparison across
several languages is possible [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
      <p>
        While token-based embeddings succeed in representation tasks, this is less clear for
abstraction, since these capabilities are not included in common benchmark tests [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] and some
comparisons are marred by models being trained on corpora of different sizes and by the use
of different dimensions for the embeddings. At the same time, unlike in computer science,
abstraction is not a rare application in DH research (see Figure 1). This raises the question under
which circumstances it is worthwhile for a digital humanist who is interested in using word
embeddings as an abstraction to work with these latest token-based approaches, whether there is
a performance loss when using a token-based model, and how large that loss is. To answer these
questions we will create type-based embeddings from a pre-trained BERT (GBERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) and
compare them to static type-based embeddings which we created ourselves. We will report on
two sets of experiments: First, we will try to find the best way to create a performant type-based
embedding out of token-based embeddings. Second, we will compare this derived embedding
against static, traditional type-based embeddings. As far as possible, we will train these
embeddings on the same or similar corpora and compare embeddings with the same dimensions
to level the playing field and avoid the distortions which have limited the usefulness of some
other comparisons. Because word embeddings as abstractions of semantic systems are closely
linked to questions of word similarity and word relatedness, we will limit our evaluation to these
aspects. In contrast to most of the existing research comparing different word embeddings, we
will use German corpora for the training of the models and a German BERT model (GBERT)
pre-trained on similar corpora. This makes it necessary to create our own evaluation datasets:
some of them mirror English datasets, sometimes by translation, to allow an easy comparison
of results across languages; others are new, reflecting our interest in specific aspects of word
similarity and word relatedness.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Creating types</title>
        <p>
          Since the initial presentation of BERT, there have been efforts to convert BERT’s contextualized
token-based embeddings (that is, the outputs of the respective Transformer layers of a
pretrained BERT model under certain input sequences) into conventional static, decontextualized
type-based embeddings. We follow the terminology used in the survey by Rogers et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
and exclusively use the term distillation to refer to this procedure.1
        </p>
        <p>
          Almost all approaches aggregate the token embeddings across multiple contexts into a single
type embedding in some way [
          <xref ref-type="bibr" rid="ref10 ref23 ref39 ref6">10, 6, 39, 23</xref>
          ]. Bommasani et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] present a natural and
generalized description of the distillation process, which covers all previously cited approaches.
In Section 4, we present and extend their description, and examine a wider range of possible
parameters. In contrast to the experiments of Bommasani et al., we also included an evaluation
on the basis of relations resp. analogies. Also, different from the experiments of Vulić et al. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]
and Lenci et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], we experiment with aggregations beyond the component-wise arithmetic
mean.
        </p>
        <p>Similar to previous works, we only compute the embedding for a small selection of types
relevant for our evaluation datasets. Thus, we remark already at this point that the generated
type-based embedding is not “full” – in the sense that we assign vectors only to types that
are present in our evaluation set, not to all types in the vocabulary, as is the case in static
type-based embeddings. This has consequences for the type of tasks we can use to probe this
small embedding; hence we were required to reformulate some tasks to account for this limitation.</p>
        <p>
          The techniques independently proposed by Wang et al. [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], and Gutman and Jaggi [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
resp., appear to be the only processes that are not a special case of the generalized description
by Bommasani et al. In both works, contextualized embeddings from BERT are used to
complement the training of a word2vec-style static embedding. Wang et al. use BERT to
replace the center word embeddings in a skip-gram architecture, while Gutman and Jaggi use
BERT to replace the context embeddings in a CBOW architecture. However, these techniques
come with a large computational effort, since training the static embedding requires at least one
full BERT embedding of the entire training corpus. Therefore, we omit a detailed analysis of
these techniques, since this degree of required computational resources seems out of reach for a
DH setup.
        </p>
        <p>1Unfortunately, this terminology might be misleading. In particular, it should not be confused with the
compression technique called “knowledge distillation” utilized in, e.g., DistilBERT.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation of type-based word embeddings</title>
        <p>
          Word embeddings are evaluated either intrinsically, by their ability to solve the training objective,
or extrinsically, by measuring their performance on other NLP problems [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Since the
development of BERT, problems using word embeddings as representations of words or sequences
rather than abstractions (see Section 1) to perform supervised training on curated datasets
have dominated evaluation. Benchmarks like GLUE [
          <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
          ] are out of the scope of this
paper, because we are solely interested in abstraction. From the abstraction viewpoint, word
embeddings should represent various linguistic relationships between words [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]. These
relationships are distributed over several datasets to test vector spaces for desired properties.
These datasets typically contain tests on: Word Similarity, Relatedness, Analogies, Synonyms,
Thematic Categorization, Concept Categorization and Outlier Detection [
          <xref ref-type="bibr" rid="ref4 ref41">4, 41</xref>
          ].
        </p>
        <p>
          For word relatedness, pairs of words are given a score based on the perceived degree of their
connection, which are then compared to the corresponding distances in the vector space. For
the word similarity task specifically, the concept of relatedness is dependent on the degree of
synonymy [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. Some of the most prevalent word relatedness/similarity datasets for the English
language, among others, are WordSim-353 [
          <xref ref-type="bibr" rid="ref2 ref44">2</xref>
          ], MEN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and SimLex-999 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. While
WordSim-353 and MEN focus on relatedness, SimLex-999 has been specifically designed to represent
similarity, meaning that pairs rated with high association in MEN or WordSim-353 could
have a low similarity score in SimLex-999, as “association and similarity are neither mutually
exclusive nor independent” [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Additionally, SimLex-999 includes verbs and adjectives apart
from nouns, which the other two datasets do not. Another constraint of the MEN dataset is
that there are no abstract concepts present in the word pairs.
        </p>
        <p>
          Both relatedness and similarity can also be evaluated via the word choice task, where each
test instance consists of one focus word and multiple related or similar words in varying
(relative) degrees. Concerning word choice datasets, the synonym questions in the TOEFL exam
are the most prominent. One test instance consists of a cue word and four additional words,
where exactly one is a true synonym of the cue word. The distractors are usually related words
or words that could generally replace the cue word in a given context, but would change the
meaning of a sentence [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. As these questions have been constructed by linguists, the data
reliably depicts word similarity. However, the design of the dataset does not allow for
distinguishing medium from low similarity, for example, because of its binary classification approach,
as opposed to the WS task [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          Lastly, the word analogy task can be used to probe specific relations in the vector space.
Given two related terms and a cue term, a target term has to be predicted analogous to the
relation of the given word pair. For word analogy datasets, the Google Analogies test set [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]
is the most notable. It includes five semantic and nine syntactic relations, where the semantic
relations mostly cover world knowledge (e.g. countries and capitals), while the morphological
relations include the plural of nouns and verbs or comparative and superlative among others.
        </p>
        <p>
          The usual implementation to test embeddings on this task uses linear vector arithmetic.
For example, given the analogy “man is to woman as king (the cue term) is to queen (the target
term)”, we test if queen is the closest type in the vocabulary to king − man + woman. This
implementation builds upon the supposition that the underlying embedding exhibits linear
regularities among these analogies – in the above example, that would be woman − man ≈ queen
− king, i.e. there is a prototypical “womanness” offset in the embedding. [
          <xref ref-type="bibr" rid="ref25 ref27">27, 25</xref>
          ]
        </p>
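The vector-arithmetic test described above can be sketched as follows. The four-word toy embedding is purely hypothetical and only serves to illustrate the nearest-neighbour search (which, as is standard, excludes the three input terms):

```python
import numpy as np

def analogy(a, b, c, emb):
    """Return the type whose vector is closest (by cosine similarity)
    to emb[c] - emb[a] + emb[b], excluding the three input terms."""
    v = emb[c] - emb[a] + emb[b]
    candidates = {w: u for w, u in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: candidates[w] @ v /
               (np.linalg.norm(candidates[w]) * np.linalg.norm(v)))

# hypothetical toy vectors constructed so that queen - king = woman - man
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([2.0, 0.0, 0.0]),
}
print(analogy("man", "woman", "king", emb))  # queen
```

With a realistically large vocabulary, the candidate set contains many distractors near v, which is exactly what the next paragraph argues is lost when the vocabulary is small.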
        <p>Since it was computationally infeasible for us to distill embeddings from BERT that have a
comparable vocabulary size to those of static embeddings, we found that this setup becomes
unreliable: due to the smaller vocabulary, we heavily restrict the search space in these analogy
tests, making the prompts easier to solve. Following the example, consider the vector v =
king − man + woman. Due to the small vocabulary, there are only few distractors in the
neighborhood of v. Consequently, the vector queen most probably is the closest type from the
vocabulary and the prompt is answered correctly, but this is not a consequence of any structure
in the embedding space.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Resources</title>
      <sec id="sec-3-1">
        <title>3.1. General Corpora</title>
        <p>
          We train the type-based embeddings on the German OSCAR Corpus (Open Super-large
Crawled Aggregated coRpus [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]), as this is the largest chunk of the training data for the
current best German BERT model discussed below. The deduplicated variant of the German
corpus contains 21B words (145 GB), filtered out of CommonCrawl.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. BERT Model</title>
        <p>
          As outlined in the introduction, we want to ensure that the different language models are
trained on the same or similar corpora. While training type-based models on our chosen
corpora was feasible for us, it was impossible to pre-train a BERT model from scratch.
Therefore, we choose a pre-trained German model GBERTBase provided by Deepset in collaboration
with DBMDZ [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Like the original BERTBase, the German model consists of 12 layers, 768
dimensions, and a maximum sequence length of 512 tokens. Also, we choose this model since,
to date, it appears to be the best available BERT model (with the
above hyperparameters) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The model was trained on a combination of four corpora, with
OSCAR dominating the training set at approximately 88 % of the total data. When we
distill type-based embeddings from BERT, we are always going to use the GBERTBase model,
and will only use the OSCAR corpus to retrieve contextualized inputs. Likewise, when we
train static type-based models, we are only going to use the OSCAR corpus (leaving the
remaining 12 % of BERT’s pre-training data unaccounted for).
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Data</title>
        <p>As the most popular evaluation datasets are in English, we constructed a comprehensive
German test suite consisting of multiple datasets which cover different aspects based on already
existing evaluation data.3 The tasks covered are: word relatedness (WR), word similarity
(WS), word choice (WC) and relation classification (RC). In addition, the data probes semantic
knowledge such as synonyms and morphological knowledge, namely inflections and derivations.
See Table 1 for an overview of all test datasets.</p>
        <p>
          Word Relatedness/Similarity For WR, we used the re-evaluated translation of
WordSim353 (Schm280) as presented in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], where we only corrected nouns which were written in lower
case, as well as a DeepL translation of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (MEN), which we then reviewed and adjusted manually
as needed. To assess WS, we opted for the translation of [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] (SimLex999) by [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <p>2https://commoncrawl.org 3Datasets and Code: https://github.com/cophi-wue/Word-Embeddings-in-the-Digital-Humanities</p>
          <p>Both WR and WS are judged via Spearman’s rank correlation coefficient between the human-annotated
scores and the cosine distances of all word pairs.</p>
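This evaluation can be sketched as follows; the word pairs, human scores, and toy vectors are hypothetical, and the rank transform ignores ties for brevity:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of rank-transformed
    scores (no tie correction, for brevity)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical human judgments for word pairs and a toy embedding
pairs = [("Auto", "Wagen", 9.0), ("Auto", "Baum", 1.5), ("Baum", "Wald", 7.0)]
emb = {"Auto": np.array([1.0, 0.1]), "Wagen": np.array([0.9, 0.2]),
       "Baum": np.array([0.1, 1.0]), "Wald": np.array([0.3, 0.9])}
human = [s for _, _, s in pairs]
model = [cosine_sim(emb[a], emb[b]) for a, b, _ in pairs]
rho = spearman(human, model)
```

A rho of 1.0 means the embedding ranks the pairs exactly as the annotators did; the absolute cosine values do not matter, only their order.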
          <p>Relation Classification As outlined above, we incorporate the concept of linearly
structured word analogies into the test suite not in the usual way, but via relation classification.
Instead of predicting a target word from a given cue word, RC tries to predict the relation
type of a word pair by comparing their offset with representative offsets for each relation type.
Specifically, we are given a collection of relations R1, R2, . . . , Rk where each Ri is a set of word
pairs. We now interpret the relation Ri as a set of offsets: for each word pair (a, b) ∈ Ri, we
consider the vector vb − va. For the evaluation, we use a median-based 1-nearest-neighbor
classification: We “train” the classifier by choosing the median of Ri as decision object. More
precisely, we define the vector ri = median({vb − va | (a, b) ∈ Ri}) as the decision object of Ri. We
then test this classifier on all pairs from all relations, thus checking for each pair (a, b) from Ri
whether ri is in fact the closest decision object to vb − va (with respect to the ℓ1-norm). We evaluate
these predictions by the “macro” F1 score, i.e. the unweighted average of the F1 scores under
each relation type, respectively.</p>
          <p>While this setup allows for different aggregations resp. distance functions other than the median
resp. ℓ1-norm, we surprisingly found this choice to be more successful than other candidates
(such as those based on cosine distance) among all examined embeddings.</p>
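A minimal sketch of this classifier (median decision objects, ℓ1 nearest neighbour, macro F1), on a hypothetical toy embedding in which the two relation offsets are clearly separated:

```python
import numpy as np

def train_rc(relations, emb):
    """One decision object per relation: the component-wise median
    of the offsets v_b - v_a over all pairs (a, b) in the relation."""
    return {name: np.median([emb[b] - emb[a] for a, b in pairs], axis=0)
            for name, pairs in relations.items()}

def classify(a, b, decision, emb):
    """Predict the relation whose decision object is l1-closest to the offset."""
    offset = emb[b] - emb[a]
    return min(decision, key=lambda r: np.abs(offset - decision[r]).sum())

def macro_f1(gold, pred, labels):
    """Unweighted average of per-relation F1 scores."""
    f1 = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1) / len(f1)

# toy embedding: "plural" shifts along the first axis, "superlative" along the second
emb = {"Haus": np.array([0.0, 0.0]), "Häuser": np.array([1.0, 0.1]),
       "Baum": np.array([2.0, 0.0]), "Bäume": np.array([3.1, 0.0]),
       "gut":  np.array([0.0, 2.0]), "beste": np.array([0.0, 3.0]),
       "hoch": np.array([1.0, 1.0]), "höchste": np.array([0.9, 2.1])}
relations = {"plural": [("Haus", "Häuser"), ("Baum", "Bäume")],
             "superlative": [("gut", "beste"), ("hoch", "höchste")]}
decision = train_rc(relations, emb)
gold, pred = [], []
for name, pairs in relations.items():
    for a, b in pairs:
        gold.append(name)
        pred.append(classify(a, b, decision, emb))
score = macro_f1(gold, pred, list(relations))
```

Note that, as in the text, the classifier is tested on the same pairs it was "trained" on; the median merely summarizes each relation's offset cluster.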
          <p>
            For the RC data we made use of two German knowledge bases: the knowledge graph
GermaNet [
            <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
            ] (Ver. 14.0) and the German Wiktionary.4 GermaNet incorporates different
kinds of semantic relations, including lexical relations such as synonymy, and conceptual
relations such as hypernymy and different kinds of compound relations. For the RC evaluation,
we only selected the conceptual relations and the pertainym relation for our GermaNet dataset,
since only these can be considered directed one-one relations. Wiktionary, on the other hand,
contains tenses, the comparison of adjectives, and derivational relations among other
morphological relations. Again, we selected a set of inflectional resp. derivational directed one-one
relations for the Wiktionary dataset for the RC evaluation, cf. Table 7 in the appendix. Even
though there is a German version of the Google Analogies dataset available [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], we chose not
to include it, as its semantic and morphological relations are covered entirely by GermaNet
and Wiktionary, respectively. Additionally, both datasets contain more instances than the
Google Analogies testset does.
          </p>
          <p>
            Word Choice Lastly, we included the WC task. Here, we used a translated version of
the TOEFL synonym questions [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] as well as one automatically constructed dataset from the
German Duden of synonyms. Our Duden dataset includes one synonym of the cue word as the
target word, plus, as distractors, four synonyms of the target word that are not synonyms of
the cue word. Evaluation is based on whether the target word is closer to the cue word than
the distractors with respect to cosine distance. We report accuracy among all prompts.
          </p>
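The word choice evaluation can be sketched as follows; the single prompt and its toy vectors are hypothetical:

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve(cue, candidates, emb):
    # the answer is the candidate closest to the cue in cosine distance
    return max(candidates, key=lambda w: cosine_sim(emb[cue], emb[w]))

def accuracy(prompts, emb):
    # prompts: (cue word, target word, list of distractors)
    hits = sum(solve(cue, [target] + distractors, emb) == target
               for cue, target, distractors in prompts)
    return hits / len(prompts)

# hypothetical toy vectors: the target is nearly parallel to the cue
emb = {"schnell": np.array([1.0, 0.0]), "rasch": np.array([0.95, 0.1]),
       "spät": np.array([0.0, 1.0]), "laut": np.array([0.3, 0.9])}
acc = accuracy([("schnell", "rasch", ["spät", "laut"])], emb)
```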
          <p>Initially, we also wanted to explore world knowledge (i.e. named entities) captured by the
embeddings, such as city–body of water or author–work relationships. However, most named
entities consist of multi-word expressions, which are difficult to model via type- or token-based
embeddings. We therefore removed all instances where a concept consisted of more than one
token from the datasets described above.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Creating Type Vectors</title>
      <p>Comparing embeddings distilled from BERT’s token-based embeddings with traditional static
type-based embeddings requires us to examine different possibilities for how to perform this
distillation. Therefore, we compare these possibilities by evaluating the resulting embeddings
on the discussed evaluation datasets. Given these results, we decide on a single distillation
procedure and compute a BERT embedding to compare its performance against static embeddings
in Section 5.</p>
      <p>
        In order to systematically evaluate the different methods to compute an embedding from
BERT’s token-based representations, we follow and extend the two-stage setup of Bommasani
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The first stage is the subword pooling, where the k subword vectors wc1, . . . , wck
(derived from BERT’s output) for word w in context c are aggregated into a single contextualized
token vector with aggregation function f; that is, wc = f({wc1, . . . , wck}) is the vector of w in c. Then, in
the second stage, the context combination, multiple contextualized token vectors wc1 , . . . , wcn
are aggregated by function g into a single static embedding w = g({wc1 , . . . , wcn }).
      </p>
      <p>We extend this description by prepending a zeroth vectorization stage, in which we make
explicit how to transform the outputs of BERT’s layers into a single subword vector.</p>
      <sec id="sec-4-1">
        <p>4https://dumps.wikimedia.org/dewiktionary/20210701/</p>
        <p>While Bommasani et al. tacitly concatenate all outputs to form the subword vector, we also allow
selectively picking specific layer output(s) as the subword vector, or a summation over the layers.</p>
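The three stages can be sketched on synthetic arrays; here the vectorization sums over layers, while the pooling f and combination g are passed in as parameters (the shapes, not the values, are the point of this sketch):

```python
import numpy as np

def distill(contexts, f=np.mean, g=np.mean):
    """contexts: one array of shape (layers, subwords, dim) per sampled
    occurrence of the word. Vectorization: sum over layers; subword
    pooling with f; context combination with g."""
    token_vectors = []
    for layer_outputs in contexts:
        subword_vectors = layer_outputs.sum(axis=0)       # vectorization: (subwords, dim)
        token_vectors.append(f(subword_vectors, axis=0))  # pooling: (dim,)
    return g(np.stack(token_vectors), axis=0)             # combination: (dim,)

# two hypothetical occurrences: 13 "layers", 3 resp. 2 subwords, 4 dimensions
rng = np.random.default_rng(0)
contexts = [rng.normal(size=(13, 3, 4)), rng.normal(size=(13, 2, 4))]
w = distill(contexts, f=np.median, g=np.median)  # one possible distillation
```

Any component-wise choice of f and g (mean, median, etc.) fits this interface; functions that are not component-wise, such as mean-norm below, fit as well since they map a set of vectors to one vector.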
        <p>
          This setup also allows us to choose pooling functions f resp. context combination functions
g which are not defined component-wise. Hence, we examine a wider range of choices for
vectorizations, f and g than previously considered, e.g., in [
          <xref ref-type="bibr" rid="ref39 ref6">39, 6</xref>
          ]. One such novel
aggregation function is mean-norm, which refers to the aggregation

mean-norm(v1, . . . , vn) = norm((1/n) ∑i=1..n norm(vi))  (1)

that takes a set of vectors as input, normalizes each to unit length, calculates the mean of these
normalized vectors, and normalizes this mean again. We motivate this aggregation function
from the fact that mean-norm(v1, . . . , vn) is the unique vector on the unit hypersphere that
maximizes the sum of cosine similarities with respect to each v1, . . . , vn. Thus, mean-norm
could also be understood as a “cosine centroid”. In particular, mean-norm, medoids, and
aggregations based on fractional distances were included in our experiments searching for
suitable distillations. Table 2 shows the functions we examined. In total, we examine 17
possible vectorizations, 8 subword pooling functions and 6 context combination functions.</p>
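A direct transcription of Equation (1); the example vectors are arbitrary:

```python
import numpy as np

def mean_norm(vectors):
    """Normalize each vector to unit length, take the component-wise
    mean, and normalize that mean again (the "cosine centroid")."""
    unit = np.stack([v / np.linalg.norm(v) for v in vectors])
    m = unit.mean(axis=0)
    return m / np.linalg.norm(m)

# two arbitrary vectors of different magnitude: only their directions matter
v = mean_norm([np.array([2.0, 0.0]), np.array([0.0, 5.0])])
# v points along the diagonal, i.e. (1/sqrt(2), 1/sqrt(2))
```

Because each input is normalized first, mean-norm is invariant to the magnitudes of the inputs, unlike the plain arithmetic mean.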
        <sec id="sec-4-1-1">
          <title>4.1. Method</title>
          <p>Intending to find the best-performing distillations based on the general setup outlined above, we
evaluate different choices for the “free parameters” of the distillation process as shown in
Table 2.</p>
          <p>
            For each word w to be embedded, we retrieve n = 100 sentences from the OSCAR corpus
as context for w, where w occurs in a sentence of at most 510 tokens. If w has
fewer than n but at least one occurrence, we sampled all of these occurrences. Types w that did not
occur in the OSCAR corpus were removed from the evaluation dataset. For each occurrence
in sentence s, we construct the input sequence by adding the [CLS] and [SEP] token. The
respective outputs of BERT on all layers form the input for the vectorization stage. This
method of generating input sequences by sampling sentences largely agrees with the methods
proposed by [
            <xref ref-type="bibr" rid="ref23 ref39 ref6">6, 39, 23</xref>
            ], and only differs in the sampling of sentences.
          </p>
          <p>Due to the large number of possible distillations, it was computationally infeasible to
construct embeddings under all distillations. Therefore, in a first experiment, we examined the
quality of all considered distillations on a smaller evaluation dataset, which consists of
subsets of the MEN, the Wiktionary, the GermaNet, and the TOEFL datasets. Then, after
restricting the set of potential distillations to promising ones, we performed the same experiment
with the full evaluation dataset on all five datasets. In both cases, the evaluation is performed
as outlined in Section 3.3.</p>
          <p>Also, since the scores in the respective tasks are reported in different metrics, we opt for
standardizing the respective scores in each task when comparing embeddings’
performances. Therefore, for one specific task, we consider the standardized score as the number of
standard deviations away from the mean over all models’ scores in that task. We then can
report the mean standardized score of a model taken over all considered tasks.</p>
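In other words, scores are z-scored per task across models and then averaged per model; a minimal sketch with made-up numbers:

```python
import numpy as np

def mean_standardized_score(scores):
    """scores: array of shape (models, tasks) in task-specific metrics.
    Each task column is z-scored across models, then averaged per model."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return z.mean(axis=1)

# two hypothetical models evaluated on three tasks with different metrics
scores = np.array([[0.60, 0.40, 55.0],    # model A
                   [0.50, 0.45, 45.0]])   # model B
mss = mean_standardized_score(scores)
```

With only two models each z-score is ±1, so here model A scores (1 − 1 + 1)/3 = 1/3 and model B the negative of that; with more models the scores spread over a continuous range.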
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Results</title>
          <p>
            The first run of the experiment with all examined distillations on the smaller datasets gave
a strong indication that centroid-based distillations lead to significantly better-performing
embeddings than those distillations that consist of medoid-based poolings resp. aggregations.
In fact, among the 13 distillations with the highest mean standardized score, all twelve
distillations with centroid-based poolings and aggregations are present (no pooling, mean, median,
mean-norm). Numerical values are presented in Table 8 in the appendix. (The top 13 distillations
are underlined.) With this experiment, we contribute insight into the performance of
different aggregation functions not previously considered in the literature. The results suggest the
interpretation that centroids – which represent a vector cluster by some synthetic aggregate –
generally lead to better results than medoids – which represent a cluster by some member of
that cluster. Also, in our distillation setup, fractional norms do not appear to give an
advantage, as opposed to research indicating that fractional distance metrics could lead to better
clustering results in high-dimensional space, e.g. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Hence, in total, we answer in the negative the
hypothesis that certain overlooked aggregation functions could lead to immediate
improvements of the resulting type-based embeddings distilled from BERT.
          </p>
          <p>
            Therefore, we continue our evaluation of these twelve centroid-based parameter choices in the
next experiment on the full dataset. We observe that, under the restriction on centroid-based
poolings and aggregations, the choice of vectorization (i.e. layer) has a much higher influence
on the embedding’s performance than the actual choice of functions f and g. This supports the
general hypothesis that different layers capture different aspects of linguistic knowledge [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ].
Additionally, these findings demonstrate that the default suggestions f = mean and g = mean
in the literature [
            <xref ref-type="bibr" rid="ref23 ref39 ref6">6, 39, 23</xref>
            ] generally are a reasonable choice to perform a distillation. A
visualization of the embeddings’ scores on each of the full seven datasets is presented in Figure
5 in the appendix.
          </p>
          <p>Again, we ranked the 17 × 4 × 3 (# vectorizations × # restricted poolings × # restricted
context combinations) analyzed embeddings with respect to their mean standardized score over all
five tasks; cf. Table 3. This leads to our observation that those embeddings based on a
sum-vectorization outperform any of the other embeddings; hence we suspect that a
vertical summation resp. averaging over all layers can provide a robust vector representation for
BERT’s tokens, capturing the summative linguistic knowledge of all layers in a reasonable
fashion. Also, we suspect that the smaller dimensionality of the sum-vectorization might give these
embeddings an advantage in comparison to the vectorization that concatenates all layers: the
former has 768 dimensions, while the latter concatenates 12 Transformer outputs plus BERT’s
input embedding, leading to 768 × 13 = 9984 dimensions.</p>
          <p>To fix a single embedding for further comparison with static embeddings, we choose the
embedding with the highest mean standardized score, which is the embedding based on the
distillation with sum vectorization, the pooling f = median, and the aggregation g = median.
The respective mean standardized score is highlighted in bold in Table 3. Thus, when we now
speak of BERT’s distilled embedding, we explicitly mean this distillation (sum
vectorization, median pooling and aggregation). Note that this embedding consists of 768 dimensions.
Nevertheless, we explicitly remark that the small differences in performance do not admit
the claim that the chosen distillation is a universal method that would always perform best in
any scenario. Also, we want to highlight that the previously untreated median as aggregation
function appears to cause some improvement, especially in the pooling stage. Due to its
robustness against outliers, we recommend always examining this aggregation in distillations of
any form that convert from token-based to type-based embeddings.</p>
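          <p>
          As a minimal sketch of this distillation, assuming the token vectors have already been extracted from BERT and using a hypothetical input format:

```python
import numpy as np

def distill_type_vector(occurrences):
    """Distill a static type vector from BERT token vectors.

    `occurrences`: list with one entry per occurrence of the word type;
    each entry is an array of shape (13, n_wordpieces, 768) holding the
    hidden states of the word's wordpieces at the input embedding plus
    the 12 Transformer layers (hypothetical input format).
    """
    pooled = []
    for hidden in occurrences:
        vec = hidden.sum(axis=0)      # sum-vectorization over the 13 layers
        vec = np.median(vec, axis=0)  # pooling f = median over wordpieces
        pooled.append(vec)
    # aggregation g = median over all occurrences of the type
    return np.median(np.stack(pooled), axis=0)
```
          </p>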
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Comparing type- and token-based embeddings</title>
      <sec id="sec-5-1">
        <title>5.1. Methods</title>
        <p>
          Before training the type-based embeddings models, we preprocessed the German OSCAR
Corpus to ensure that models are trained on the same version of the data. Preprocessing has been
done using the word tokenizer and sentence tokenizer of NLTK.5 Additionally, all punctuation
has been removed. We trained all models using the respective default parameters and used the
skip-gram model for all embeddings. We only adapted the number of dimensions and window
size to create additional embedding models for comparison. While the recommended window
size is 5, a window size of 2 has proven to be more effective in capturing semantic similarity
in embeddings [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. To make up for the larger dimensionality of BERT’s distilled embedding
(768), we also trained static models with 768 dimensions besides the more commonly used 300
dimensions for type-based models (under the assumption that more dimensions imply a higher
quality vector space). Additionally, we concatenated the embeddings in various combinations,
as these “stacked” vectors often lead to better results, as presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We experimented
with all three possible tuples consisting of BERT, word2vec, and fastText (using the 768
dimension versions only) as well as one embedding where we concatenated all three. We lastly
included one fastText model with 2 × 768 dimensions to enable a direct comparison to the
stacked embeddings.
        </p>
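        <p>
          The stacking itself is plain vector concatenation per word; a minimal sketch (the per-model vectors are assumed to be looked up beforehand):

```python
import numpy as np

def stack_embeddings(*vectors):
    """Concatenate per-word vectors from several models into one
    "stacked" vector, e.g. word2vec (768) + fastText (768) -> 1536."""
    return np.concatenate([np.asarray(v) for v in vectors])
```
        </p>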
        <p>As explained above, we evaluate how well the models represent different term relations with
four tasks: word similarity, word relatedness, word choice, and relation classification.</p>
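        <p>
          For illustration, a word choice task reduces to picking the candidate closest to the target word by cosine similarity (a sketch; `emb` is a hypothetical word-to-vector mapping):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_choice(emb, target, candidates):
    """Pick the candidate whose vector is most similar to the target."""
    return max(candidates, key=lambda c: cosine(emb[target], emb[c]))
```
        </p>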
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>The first general observation looking at Figure 2 is that BERT’s distilled embedding (again,
sum vectorization, median pooling and aggregation) does not perform significantly better,
contrary to our expectations. In fact, the type-based embeddings seem to capture term
relatedness and similarity even better than the token-based embeddings distilled from BERT
in most tasks: in WS, WR, and WC, FastText Dim768WS2 produces the best results (see
Table 4), while in RC, BERT achieves the best results on the morphological relations only.
[Figure 2: evaluation scores of all models on the seven datasets]
5: Natural Language Toolkit (https://www.nltk.org/)</p>
        <p>The WR and WS tasks (MEN, Schm280, SimLex999) paint a similar picture. Both
hyperparameters, window size and number of dimensions, lead to a slight improvement when reduced
and increased, respectively. Most notably, the similarity task benefits the most (about 0.05
absolute correlation improvement with both parameters adjusted for fastText; see Table 4)
from altering these parameters. While BERT is on par with or even slightly outperformed by
the 300-dimensional type-based embeddings in the relatedness task, it performs better in the
similarity task. The higher-dimensional vectors, however, can match BERT’s performance
on the SimLex999 dataset. Overall, every model seems to struggle with the more narrowly
defined WS task when compared to the WR task.</p>
        <p>The WC task (Duden, TOEFL) also shows a clear trend: all type-based embeddings exceed
BERT’s performance noticeably, by an accuracy difference of at least 0.06 (Duden, Word2Vec
Dim300) and at most 0.23 (TOEFL, FastText Dim768WS2). Altering the parameters of the
type-based embeddings, similarly to the WR task, results in marginally better performing
vectors.</p>
        <p>BERT’s embeddings perform considerably better in the RC task when compared to the
300-dimensional embeddings. However, a substantial gain from the dimensionality increase can
also be observed with GermaNet as opposed to the other datasets, leading to both FastText
Dim768WS2 and Word2Vec Dim768 surpassing BERT’s performance by 0.06 and 0.08,
respectively. While the same trend appears on the Wiktionary dataset, the classification of
morphological relations by BERT’s embeddings still remains uncontested with an accuracy of
0.91. From a human perspective, the morphological relations are rather trivial (some examples
are presented in Table 7 in the appendix); even from a computational point of view,
lemmatizing or stemming the tails of these triples could in theory reliably predict the individual
heads. This implies that, generally, BERT can reproduce these kinds of simpler relations
best, while traditional models capture complex semantic associations more accurately. We
separately explored the individual performances of all relations in GermaNet and Wiktionary
and discovered that the higher F1 score of BERT mainly stems from the derivations,
indicating that BERT’s word piece tokenization might facilitate its remarkable performance there.
Controlling for the dataset and relation size in a linear regression did not, however, reveal a
correlation between the amount of overlap and F1.</p>
        <p>From these experiments we can conclude that for word similarity and term relatedness use
cases, employing regular fastText embeddings, optionally with an increased number of dimensions,
is sufficient. Using embeddings with the same number of dimensions as BERT results in the
static embeddings taking the lead in the WS task and, specifically for semantic relations, in the RC task.</p>
        <p>
          Moreover, there appears to be no clear trend on whether BERT’s distilled embedding is
generally better (or worse) than other models. In certain tasks, it performs particularly well (e.g.,
Wiktionary), and in others particularly badly (e.g., TOEFL). To give some statistical estimate of
the difference in performance between BERT and the other models, we employ the Bayesian
hierarchical correlated t-tests proposed by Benavoli et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and Corani et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], designed to compare the
performances of two classifiers on multiple test sets.6 This hierarchical model is learned on our
observed scores and, after learning, can be queried to make inferences about the performance
difference (in score points, e.g., absolute accuracy difference) between BERT and another language
model on a future unseen dataset. See the cited references for a thorough presentation of the
hierarchical model and the inference method. (Note that the Bayesian hierarchical correlated
t-test is based on repeated cross-validation runs on the same dataset. Hence, to adapt our
setup to the t-test, we need to modify our task procedures to obtain cross-validation results.
Section A.2 in the appendix gives details on how we implemented this.)
        </p>
        <p>Table 5 gives the results of this inference. Most prominently, it estimates that on a
future unseen dataset, FastText Dim768WS2 will most likely outperform BERT’s distilled
embedding by at least 0.03 absolute score points (P = 89.1 %). Even for the relatively weak
Word2Vec Dim300, the hierarchical model predicts roughly equal probabilities for either BERT
being better vs. Word2Vec Dim300 being better (by at least 0.03 absolute score points, 47.9 %
vs. 51.8 %).</p>
        <p>Nevertheless, this quantitative analysis also has limits due to the stochastic model presumed
by the Bayesian hierarchical correlated t-test. The model assumes that the performance
differences among the datasets (δ1, δ2, . . . , δnext) are i.i.d. and follow the same high-level
Student t-distribution t(µ0, σ0, ν); thus, the model assumes that the considered datasets are in some
way homogeneous. Though all our datasets are meant to examine word similarity, the distinct
differences in performance of the embedding types we observe (see fig. 2) indicate that these
datasets represent different aspects of word similarity, which certain language models capture
better than others. Hence, in our use case, we see the limits of the assumptions made by the
stochastic model, and in this light, the results of the Bayesian hierarchical correlated t-tests
need to be interpreted cautiously.
6: We want to thank the anonymous reviewer who brought the potential of the Bayesian hierarchical correlated
t-test to our attention.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Discussion</title>
        <p>
          The most important result from our experiments is that a widespread assumption in NLP
and the Computational Humanities is not true: a context-sensitive embedding like BERT is not
automatically better for all purposes. Static embeddings like fastText are at least on par, if not
better, when word embeddings are used as abstractions of semantic systems. But our results are
subject to some important limitations. For example, we can think of several ways that could
increase BERT’s capability to represent word similarity, which we haven’t explored:
• Modify the training objective for the pre-training phase, for example by adding a task
which influences how the model represents word similarity.
• Fine-tune the model on a task to improve the representation of word similarity, for
example predicting the nearest neighbour based on existing word similarity lists.
• Replace wordpiece tokenization with full word tokenization, which has been reported
to improve performance in some contexts [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
On the other hand, we didn’t spend much time finding the best parameters for the static
embeddings: we just used a well-established static embedding like fastText and didn’t test
more recent proposals for static embeddings like [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which reported improved results. So there
is a lot of room for improvement in both directions.
        </p>
        <p>
          In order to understand how the performance differences we observed between static and
dynamic embeddings relate to the performance gains that have been observed by stacking
embeddings from different sources [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we combine word2vec, fastText and BERT embeddings
in different constellations and add a fastText model with the same dimensions to compensate
for effects based on the different dimensionality of the embeddings (see Figure 3). For four
evaluation sets – GermaNet, MEN, Duden, TOEFL – the differences between BERT and fastText
are larger than the difference between fastText and a stacked alternative. The performance
gain of using stacked embeddings is in most cases rather small. Adding BERT to the stacked
embeddings either doesn’t help at all (TOEFL) or only a little (GermaNet, Schm280, MEN,
Duden). The only exception is the Wiktionary dataset, which is already the only use case where
BERT is better than fastText.
        </p>
        <p>
          [Figure 3: scores of BERT (sum-median-median), FastText Dim768WS2, FastText Dim1536WS2, Word2Vec Dim768, and their stacked combinations on GermaNet and Wiktionary (RC, macro F1), SimLex999, Schm280 and MEN (WS, Spearman ρ), and Duden and TOEFL (WC, accuracy)]
        </p>
        <p>
          As discussed above, the Wiktionary dataset consists mainly
of inflections, for example singular vs. plural, or derivations, for example masculine form of a
noun (‘Autor’) vs. female form (‘Autorin’). More examples are listed in Table 7. Maybe more
sophisticated approaches combining the diferent embeddings like [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] will show better results,
but obviously they all need a token-based model next to the static models.
        </p>
        <p>Exploring the behaviour of the different embeddings, we also came across a noticeable
difference between the BERT-based embeddings and the static embeddings (see Figure 4). We
calculated the distances between 956 synonym pairs, using synonyms as defined by GermaNet
in one setting and as defined by Duden in the other. To make the results comparable, we
standardized each of them by drawing 956 random word pairs and basing our calculation of the
mean distance and the standard deviation on them. Then we expressed the cosine distance of
the synonyms in standard deviations away from the mean distance. The results show, for both
datasets, a much larger spread for the static embeddings, indicating that the BERT vectors
occupy a smaller space, an effect which is not related to the dimensionality of its vectors.
[Figure 4: density plot of model-standardized cosine distances of synonym pairs, for GermaNet and Duden]</p>
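        <p>
          The standardization described above can be sketched as follows, assuming `emb` is a hypothetical word-to-vector mapping:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def standardized_distances(emb, synonym_pairs, seed=0):
    """Express synonym cosine distances in standard deviations away from
    the mean distance of equally many random word pairs."""
    rng = np.random.default_rng(seed)
    words = list(emb)
    # draw as many random word pairs as there are synonym pairs
    idx = rng.integers(0, len(words), size=(len(synonym_pairs), 2))
    random_d = [cosine_distance(emb[words[i]], emb[words[j]]) for i, j in idx]
    mu, sigma = float(np.mean(random_d)), float(np.std(random_d))
    return [(cosine_distance(emb[a], emb[b]) - mu) / sigma
            for a, b in synonym_pairs]
```
        </p>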
        <p>
          This seems to be in accordance with results from Ethayarajh [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], who reported that the
contextualized token embedding of BERT is anisotropic: randomly sampled words seem to
have, on average, a very high cosine similarity. In fact, Timkey and van Schijndel [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] report
in a pre-print that in BERT’s contextualized embedding space, a few dimensions dominate
the similarity between word vectors (“rogue dimensions”). As future work, we want to examine the
effect of the post-processing transformations on the embedding spaces proposed by Timkey and
van Schijndel, which are designed to counteract the undesirable effect of these rogue dimensions.
In our first exploratory experiments, we observe that all our examined embeddings – both the
distilled ones from BERT and the static ones – appear to benefit from post-processing
the type vectors. Yet even then, the post-processing does not give BERT an advantage
over static embeddings.
        </p>
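        <p>
          To illustrate the general idea of such a correction (a sketch of per-dimension standardization of the type-vector matrix, not necessarily the exact transformation of Timkey and van Schijndel):

```python
import numpy as np

def standardize_dimensions(matrix, eps=1e-12):
    """Give every dimension zero mean and unit variance across the
    vocabulary, dampening dominant "rogue" dimensions.
    `matrix` has shape (n_words, dim)."""
    return (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + eps)
```
        </p>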
        <p>To summarize, our main takeaway is not a recommendation for a specific static word
embedding; rather, we think it is worthwhile to continue research on static word embeddings
– at least for researchers working in the field of Computational Literary Studies – because
their representational power as abstractions of semantic systems is on par with that of dynamic
embeddings, the computing power needed is much smaller, and the minimal size of the corpora
needed to train them is also smaller. What we need in the field of Computational Literary
Studies is a more robust understanding of how the quality of embeddings is related to the size and
structure of datasets, methods to improve the performance of static embeddings trained on
even smaller datasets, maybe by combining them with knowledge bases, and more evaluation
datasets for languages beyond English.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix</title>
      <sec id="sec-6-1">
        <title>A.1. Supplementary tables and figures</title>
        <p>[Figure 5: standardized scores of all distillations, by vectorization/layer (input embedding, L1-L12, L1-4, L9-12, sum, all) and by pooling f and aggregation g (nopooling, mean, meannorm, median, and the lp-medoids), on the GermaNet and Wiktionary tasks]</p>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Adapting the task procedures for cross-validation</title>
        <p>
To compare two embeddings on our datasets, we have employed the Bayesian hierarchical
correlated t-test as described by Corani et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This test was originally designed to compare
two classifiers on multiple datasets, given their respective cross-validation results.
        </p>
        <p>As presented in Sec. 3.3, our tasks do not perform such cross-validation. Therefore, to adapt
to the test, we modify our tasks as follows to obtain cross-validation results:
• As the Relation Classification task (GermaNet, Wiktionary) is implemented as a
median-based 1-nearest-neighbor classifier, it can be naturally extended to separate train and test
sets. Given a train set of (labeled) word pairs and a test set of word pairs, we construct
the decision objects (i.e., medians) for the relations only on the training examples. Then
we test the 1-nearest-neighbor classifier only on the test examples.</p>
        <p>On each Relation Classification dataset, we perform 10 runs of 10-fold stratified cross
validation to obtain 100 F1 scores.
• For the Word Relatedness and Word Similarity tasks (SimLex999, Schm280, MEN), there
is no natural way to implement a cross-validation, since these tasks measure correlation
and are not “trained”.</p>
        <p>Therefore, to mimic the 10-fold cross-validation, on each dataset we randomly sample
100 subsets that each contain 10 % of the respective dataset, and calculate the Spearman
ρ on each of these subsets to obtain 100 correlation coefficients.
• We proceed similarly for the Word Choice tasks (Duden, TOEFL): we randomly sample 100 subsets
containing 10 % of the respective dataset, and calculate the accuracies on each subset.</p>
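        <p>
          The subset-sampling procedure for the correlation tasks can be sketched as follows (using SciPy's Spearman implementation; variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def pseudo_cv_correlations(gold, predicted, n_runs=100, frac=0.10, seed=0):
    """Mimic 10-fold cross-validation for a correlation task: Spearman
    rho on `n_runs` random subsets containing `frac` of the dataset."""
    rng = np.random.default_rng(seed)
    gold, predicted = np.asarray(gold), np.asarray(predicted)
    k = max(2, int(frac * len(gold)))
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(gold), size=k, replace=False)
        rho, _ = spearmanr(gold[idx], predicted[idx])
        scores.append(rho)
    return np.array(scores)
```
        </p>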
        <p>Fix a pair of models we want to compare. For the i-th dataset (of a total of q datasets), we
calculate a vector xi = (xi1, xi2, . . . , xi100) of differences in score on each cross-validation fold,
using the same folds for each dataset. On these vectors x1, . . . , xq, we can now perform the
Bayesian hierarchical correlated t-test using the Python package baycomp, which implements the
hierarchical stochastic model and performs the “hypothesis test” that estimates the posterior
distribution of the difference in score between the two models on a future unseen dataset, as
proposed by Corani et al.</p>
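        <p>
          The final step, sketched below, arranges the per-fold scores of two models into the difference vectors xi; the baycomp call in the comment is indicative only (argument values follow the rope of 0.03 score points used above):

```python
import numpy as np

def fold_differences(scores_a, scores_b):
    """Per-fold score differences over q datasets; `scores_a` and
    `scores_b` have shape (q, 100), computed on identical folds/subsets.
    Row i is the vector x_i = (x_i1, ..., x_i100)."""
    return np.asarray(scores_a) - np.asarray(scores_b)

# The two score matrices are then passed to baycomp, e.g.:
#   import baycomp
#   p_a, p_rope, p_b = baycomp.two_on_multiple(scores_a, scores_b,
#                                              rope=0.03, runs=10)
```
        </p>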
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          . “
          <article-title>On the Surprising Behavior of Distance Metrics in High Dimensional Space”</article-title>
          . In: Database Theory, ICDT
          <year>2001</year>
          . Ed. by J. Van den Bussche and V. Vianu.
          <source>Lecture Notes in Computer Science</source>
          .
          <year>2001</year>
          , pp.
          <fpage>420</fpage>
          -
          <lpage>434</lpage>
          . doi: 10.1007/3-540-44503-x_27.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravalova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paşca</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          . “
          <article-title>A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches”</article-title>
          .
          <source>In: Proceedings of Human Language Technologies</source>
          :
          <article-title>The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          . Boulder, Colorado,
          <year>2009</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          . url: https://aclanthology.org/N09-1003.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          . “
          <article-title>Contextual String Embeddings for Sequence Labeling”</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe</source>
          , New Mexico, USA: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          . url: https://aclanthology.org/C18-1139.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakarov</surname>
          </string-name>
          . “
          <article-title>A survey of word embeddings evaluation methods”</article-title>
          . In: arXiv preprint arXiv:1801.09536 (
          <year>2018</year>
          ). url: http://arxiv.org/abs/1801.09536.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Benavoli</surname>
          </string-name>
          , G. Corani,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zafalon</surname>
          </string-name>
          . “
          <article-title>Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis”</article-title>
          .
          <source>In: Journal of Machine Learning Research 18.77</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          . url: http://jmlr.org/papers/v18/16-305.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          . “
          <article-title>Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings”</article-title>
          . In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . Online: Association for Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>4758</fpage>
          -
          <lpage>4781</lpage>
          . doi: 10.18653/v1/2020.acl-main.431.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-K.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          . “
          <article-title>Multimodal distributional semantics”</article-title>
          .
          <source>In: Journal of artificial intelligence research 49</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          . doi: 10.1007/s10462-019-09796-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          . “
          <article-title>German's Next Language Model”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6788</fpage>
          -
          <lpage>6796</lpage>
          . doi: 10.18653/v1/2020.coling-main.598.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Corani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benavoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mangili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zafalon</surname>
          </string-name>
          . “
          <article-title>Statistical comparison of classifiers through Bayesian hierarchical modelling”</article-title>
          .
          <source>In: Machine Learning 106.11</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>1817</fpage>
          -
          <lpage>1837</lpage>
          . doi: 10.1007/s10994-017-5641-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). NAACL-HLT
          <year>2019</year>
          . Minneapolis, Minnesota: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi: 10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>El Boukkouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavergne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Noji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          . “
          <article-title>CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics</source>
          . Barcelona, Spain (Online): International Committee on Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>6903</fpage>
          -
          <lpage>6915</lpage>
          . doi: 10.18653/v1/2020.coling-main.609.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          . “
          <article-title>How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>65</lpage>
          . doi: 10.18653/v1/D19-1006.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          . “
          <article-title>Obtaining Better Static Word Embeddings Using Contextual Embedding Models”</article-title>
          .
          <source>In: arXiv preprint arXiv:2106.04302</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2106.04302.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pagliardini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          . “
          <article-title>Better Word Embeddings by Disentangling Contextual n-Gram Information”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>933</fpage>
          -
          <lpage>939</lpage>
          . doi: 10.18653/v1/N19-1098.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamp</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Feldweg</surname>
          </string-name>
          . “
          <article-title>GermaNet - a Lexical-Semantic Net for German”</article-title>
          .
          <source>In: Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications</source>
          .
          <year>1997</year>
          . url: https://www.aclweb.org/anthology/W97-0802.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Marjanen</surname>
          </string-name>
          . “
          <article-title>A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0791.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Henrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hinrichs</surname>
          </string-name>
          . “
          <article-title>GernEdiT - The GermaNet Editing Tool”</article-title>
          .
          <source>In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          . Valletta, Malta: European Language Resources Association (ELRA)
          ,
          <year>2010</year>
          , pp.
          <fpage>2228</fpage>
          -
          <lpage>2235</lpage>
          . url: http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation”</article-title>
          .
          <source>In: Computational Linguistics 41.4</source>
          (
          <year>2015</year>
          ), pp.
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          . doi: 10.1162/COLI_a_00237.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kocher</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          . “
          <article-title>Distributed language representation for authorship attribution”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 33.2</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>425</fpage>
          -
          <lpage>441</lpage>
          . doi: 10.1093/llc/fqx046.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Köper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scheible</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulte im Walde</surname>
          </string-name>
          . “
          <article-title>Multilingual Reliability and “Semantic” Structure of Continuous Word Spaces”</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Computational Semantics</source>
          . London, UK,
          <year>2015</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>45</lpage>
          . url: https://aclanthology.org/W15-0105.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          . “
          <article-title>Statistically Significant Detection of Linguistic Change”</article-title>
          .
          <source>In: Proceedings of the 24th International World Wide Web Conference. WWW '15</source>
          . Florence, Italy,
          <year>2015</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>635</lpage>
          . doi: 10.1145/2736277.2741627.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kutuzov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Szymanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          . “
          <article-title>Diachronic word embeddings and semantic shifts: a survey”</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe</source>
          , New Mexico, USA: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1384</fpage>
          -
          <lpage>1397</lpage>
          . url: https://aclanthology.org/C18-1117.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jeuniaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Gyllensten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Miliani</surname>
          </string-name>
          . “
          <article-title>A comprehensive comparative evaluation and analysis of Distributional Semantic Models”</article-title>
          .
          <source>In: arXiv preprint arXiv:2105.09825</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2105.09825.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Leviant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          . “
          <article-title>Separated by an un-common language: Towards judgment language informed vector space modeling”</article-title>
          .
          <source>In: arXiv preprint arXiv:1508.00106</source>
          (
          <year>2015</year>
          ). url: http://arxiv.org/abs/1508.00106.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          . “
          <article-title>Linguistic Regularities in Sparse and Explicit Word Representations”</article-title>
          .
          <source>In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning</source>
          . Ann Arbor, Michigan: Association for Computational Linguistics,
          <year>2014</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>180</lpage>
          . doi: 10.3115/v1/W14-1618.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          . “
          <article-title>Efficient estimation of word representations in vector space”</article-title>
          .
          <source>In: arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ). url: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zweig</surname>
          </string-name>
          . “
          <article-title>Linguistic Regularities in Continuous Space Word Representations”</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. NAACL-HLT 2013</source>
          . Atlanta, Georgia: Association for Computational Linguistics,
          <year>2013</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          . url: https://www.aclweb.org/anthology/N13-1090.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Ortiz Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          .
          <article-title>“A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages”</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1703</fpage>
          -
          <lpage>1714</lpage>
          . url: https://www.aclweb.org/anthology/2020.acl-main.156.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Fakhrahmad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Sadreddini</surname>
          </string-name>
          . “
          <article-title>Co-occurrence graph-based context adaptation: a new unsupervised approach to word sense disambiguation”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities</source>
          (
          <year>2020</year>
          ). doi: 10.1093/llc/fqz048.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kovaleva</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          .
          <article-title>“A Primer in BERTology: What We Know About How BERT Works”</article-title>
          .
          <source>In: Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ), pp.
          <fpage>842</fpage>
          -
          <lpage>866</lpage>
          . doi: 10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          . “
          <article-title>Conceptual Vocabularies and Changing Meanings of “Foreign” in Dutch Foreign News (1815-1914)”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0651.html.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>van Eijnatten</surname>
          </string-name>
          . “
          <article-title>Disentangling a Trinity: A Digital Approach to Modernity, Civilization and Europe in Dutch Newspapers (1840-1990)”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0572.html.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Salami</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Momtazi</surname>
          </string-name>
          . “
          <article-title>Recurrent convolutional neural networks for poet identification”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities</source>
          (
          <year>2020</year>
          ). doi: 10.1093/llc/fqz096.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Batjargal</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeda</surname>
          </string-name>
          . “
          <article-title>Linking the Same Ukiyo-e Prints in Different Languages by Exploiting Word Semantic Relationships across Languages”</article-title>
          .
          <source>In: Book of Abstracts of DH2017</source>
          . Alliance of Digital Humanities Organizations. Montréal, Canada,
          <year>2017</year>
          . url: https://dh2017.adho.org/abstracts/369/369.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Susanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tokunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nishikawa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Obari</surname>
          </string-name>
          . “
          <article-title>Automatic distractor generation for multiple-choice English vocabulary questions”</article-title>
          .
          <source>In: Research and Practice in Technology Enhanced Learning 13.2</source>
          (
          <year>2018</year>
          ). doi: 10.1186/s41039-018-0082-z.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M. A. H.</given-names>
            <surname>Taieb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Aouicha</surname>
          </string-name>
          . “
          <article-title>A survey of semantic relatedness evaluation datasets and procedures”</article-title>
          .
          <source>In: Artificial Intelligence Review 53.6</source>
          (
          <year>2020</year>
          ), pp.
          <fpage>4407</fpage>
          -
          <lpage>4448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>W.</given-names>
            <surname>Timkey</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Schijndel</surname>
          </string-name>
          . “
          <article-title>All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality”</article-title>
          .
          <source>In: arXiv preprint arXiv:2109.04404</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2109.04404.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.</given-names>
            <surname>Uslu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schulz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Baumartz</surname>
          </string-name>
          . “
          <article-title>BigSense: a Word Sense Disambiguator for Big Data”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0199.html.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Litschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>Probing Pretrained Language Models for Lexical Semantics”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . Online: Association for Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>7222</fpage>
          -
          <lpage>7240</lpage>
          . doi: 10.18653/v1/2020.emnlp-main.586.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          . “
          <article-title>GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”</article-title>
          .
          <source>In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . Brussels, Belgium: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . doi: 10.18653/v1/W18-5446.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-C. J.</given-names>
            <surname>Kuo</surname>
          </string-name>
          . “
          <article-title>Evaluating word embedding models: methods and experimental results”</article-title>
          .
          <source>In: APSIPA transactions on signal and information processing 8</source>
          (
          <year>2019</year>
          . doi: 10.1017/atsip.2019.12.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>How Can BERT Help Lexical Semantics Tasks?”</article-title>
          .
          <source>In: arXiv preprint arXiv:1911.02929</source>
          (
          <year>2020</year>
          ). url: http://arxiv.org/abs/1911.02929.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ziehe</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sporleder</surname>
          </string-name>
          . “
          <article-title>Multimodale Sentimentanalyse politischer Tweets”</article-title>
          .
          <source>In: Book of Abstracts of DHd2019. Frankfurt</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>332</lpage>
          . doi: 10.5281/zenodo.2596095.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>