<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computational Humanities Research Conference, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Type- and Token-based Word Embeddings in the Digital Humanities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anton Ehrmanntraut</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thora Hagen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonard Konle</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fotis Jannidis</string-name>
        </contrib>
        <aff>Julius-Maximilians-Universität Würzburg</aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>In the general perception of the NLP community, the new dynamic, context-sensitive, token-based embeddings from language models like BERT have replaced the older static, type-based embeddings like word2vec or fastText, due to their better performance. We can show that this is not the case for one area of applications for word embeddings: the abstract representation of the meaning of words in a corpus. This application is especially important for the Computational Humanities, for example in order to show the development of words or ideas. The main contributions of our paper are: 1) We offer a systematic comparison between dynamic and static embeddings with respect to word similarity. 2) We test the best method to convert token embeddings to type embeddings. 3) We contribute new evaluation datasets for word similarity in German. The main goal of our contribution is to make an evidence-based argument that research on static embeddings, which essentially stopped after 2019, should be continued not only because they need less computing power and smaller corpora, but also because, for this specific set of applications, their performance is on par with that of dynamic embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>Word Embeddings</kwd>
        <kwd>BERT</kwd>
        <kwd>fastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We wanted to be able to describe the loss of information and performance that results from using
static instead of dynamic embeddings. But our surprising preliminary results were confirmed
by a paper which was published as preprint during our work [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: Static word embeddings can
be on par with dynamic embeddings or even surpass them in some settings. The usage of
word embeddings in DH can be categorized into two groups: using embeddings as an improved
form of word representation in tasks like sentiment analysis [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ], word sense disambiguation
[
        <xref ref-type="bibr" rid="ref29 ref38">29, 38</xref>
        ], authorship attribution [
        <xref ref-type="bibr" rid="ref19 ref33">19, 33</xref>
        ], etc., and using word embeddings as abstractions of
semantic systems, describing word meaning as a set of relations of a focus term to its semantic
neighbours. Since Kulkarni et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] proposed the comparison of embeddings trained on texts
from different slices in time to measure semantic change, their method, known as diachronic,
temporal or dynamic word embeddings [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] has been adopted by the Digital Humanities
community [
        <xref ref-type="bibr" rid="ref16 ref31 ref32">31, 32, 16</xref>
        ]. However, word embeddings as abstractions of semantic systems do not
only work in the historical dimension, but are universally applicable. Even a comparison across
several languages is possible [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
      <p>
        While token-based embeddings succeed in representation tasks, this is less clear for
abstraction, since these capabilities are not included in common benchmark tests [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] and some
comparisons are marred by models being trained on corpora of different sizes and by the use
of different dimensions for the embeddings. At the same time, unlike in computer science,
abstraction is not a rare application in DH research (see Figure 1). This raises the question under
which circumstances it is worthwhile for a digital humanist who is interested in using word
embeddings as an abstraction to work with these latest token-based approaches, whether there is
a performance loss when using a token-based model, and how large that loss is. To answer these
questions we will create type-based embeddings from a pre-trained BERT (GBERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) and
compare them to static type-based embeddings which we created ourselves. We will report on
two sets of experiments: First, we will try to find the best way to create a performant type-based
embedding out of token-based embeddings. Second, we will compare this derived embedding
against static, traditional type-based embeddings. As far as possible, we will train these
embeddings on the same or similar corpora and compare embeddings with the same dimensions
to level the playing field and avoid the distortions which have limited the usefulness of some
other comparisons. Because word embeddings as abstractions of semantic systems are closely
linked to questions of word similarity and word relatedness, we will limit our evaluation to these
aspects. In contrast to most of the existing research comparing different word embeddings, we
will use German corpora for the training of the models and a German BERT model (GBERT)
pre-trained on similar corpora. This makes it necessary to create our own evaluation datasets:
some of them mirror English datasets, sometimes by translation, to allow an easy comparison
of results across languages; others are new, reflecting our interest in specific aspects of word
similarity and word relatedness.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Creating types</title>
        <p>
          Since the initial presentation of BERT, there have been efforts to convert BERT’s contextualized
token-based embeddings (that is, the outputs of the respective Transformer layers of a
pretrained BERT model under certain input sequences) into conventional static, decontextualized
type-based embeddings. We follow the terminology used in the survey by Rogers et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
and exclusively use the term distillation to refer to this procedure.1
        </p>
        <p>
          Almost all approaches aggregate the token embeddings across multiple contexts into a single
type embedding in some way [
          <xref ref-type="bibr" rid="ref10 ref23 ref39 ref6">10, 6, 39, 23</xref>
          ]. Bommasani et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] present a natural and
generalized description of the distillation process, which covers all previously cited approaches.
In Section 4, we present and extend their description, and examine a wider range of possible
parameters. In contrast to the experiments of Bommasani et al., we also included an evaluation
on the basis of relations resp. analogies. Also, different from the experiments of Vulić et al. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]
and Lenci et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], we experiment with aggregations beyond the component-wise arithmetic
mean.
        </p>
        <p>Similar to previous works, we only compute the embedding for a small selection of types
relevant for our evaluation datasets. Thus, we remark already at this point that the generated
type-based embedding is not “full” – in the sense that we assign vectors only to types that
are present in our evaluation set, not to all types in the vocabulary, as is the case in static
type-based embeddings. This has consequences for the type of tasks we can use to probe this
small embedding; hence we were required to reformulate some tasks to account for this limitation.</p>
        <p>
          The techniques independently proposed by Wang et al. [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], and Gutman and Jaggi [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
resp., appear to be the only processes that are not a special case of the generalized description
by Bommasani et al. In both works, contextualized embeddings from BERT are used to
complement the training of a word2vec-style static embedding. Wang et al. use BERT to
replace the center word embeddings in a skip-gram architecture, while Gutman and Jaggi use
BERT to replace the context embeddings in a CBOW architecture. However, these techniques
come with a large computational effort, since training the static embedding requires at least one
full BERT embedding of the entire training corpus. Therefore, we omit a detailed analysis of
these techniques, since this degree of required computational resources seems out of reach for a
DH setup.
        </p>
        <p>1Unfortunately, this terminology might be misleading. In particular, it should not be confused with the
compression technique called “knowledge distillation” utilized in, e.g., DistilBERT.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation of type-based word embeddings</title>
        <p>
          Word embeddings are evaluated either intrinsically, by their ability to solve the training objective,
or extrinsically, by measuring their performance on other NLP problems [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Since the
development of BERT, problems using word embeddings as representations of words or sequences
rather than abstractions (see Section 1) to perform supervised training on curated datasets
have dominated evaluation. Benchmarks like GLUE [
          <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
          ] are out of the scope of this
paper, because we are solely interested in abstraction. From the abstraction viewpoint, word
embeddings should represent various linguistic relationships between words [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]. These
relationships are distributed over several datasets to test vector spaces for desired properties.
These datasets typically contain tests on: Word Similarity, Relatedness, Analogies, Synonyms,
Thematic Categorization, Concept Categorization and Outlier Detection [
          <xref ref-type="bibr" rid="ref4 ref41">4, 41</xref>
          ].
        </p>
        <p>
          For word relatedness, pairs of words are given a score based on the perceived degree of their
connection, which are then compared to the corresponding distances in the vector space. For
the word similarity task specifically, the concept of relatedness is dependent on the degree of
synonymy [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. Some of the most prevalent word relatedness/similarity datasets for the English
language, among others, are WordSim-353 [
          <xref ref-type="bibr" rid="ref2 ref44">2</xref>
          ], MEN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and SimLex-999 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. While
WordSim-353 and MEN focus on relatedness, SimLex-999 has been specifically designed to represent
similarity, meaning that pairs rated with high association in MEN or WordSim-353 could
have a low similarity score in SimLex-999, as “association and similarity are neither mutually
exclusive nor independent” [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Additionally, SimLex-999 includes verbs and adjectives apart
from nouns, which the other two datasets do not. Another constraint of the MEN dataset is
that there are no abstract concepts present in the word pairs.
        </p>
        <p>
          Both relatedness and similarity can also be evaluated via the word choice task, where each
test instance consists of one focus word and multiple related or similar words in varying
(relative) degrees. Concerning word choice datasets, the synonym questions in the TOEFL exam
are the most prominent. One test instance consists of a cue word and four additional words,
where exactly one is a true synonym of the cue word. The distractors are usually related words
or words that could generally replace the cue word in a given context, but would change the
meaning of a sentence [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. As these questions have been constructed by linguists, the data
reliably depicts word similarity. However, the design of the dataset does not allow for
distinguishing medium from low similarity, for example, because of its binary classification approach,
as opposed to the WS task [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>
          Lastly, the word analogy task can be used to probe specific relations in the vector space.
Given two related terms and a cue term, a target term has to be predicted analogous to the
relation of the given word pair. For word analogy datasets, the Google Analogies test set [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]
is the most notable. It includes five semantic and nine syntactic relations, where the semantic
relations mostly cover world knowledge (e.g. countries and capitals), while the morphological
relations include the plural of nouns and verbs or comparative and superlative among others.
        </p>
        <p>
          The usual implementation to test embeddings on this task uses linear vector arithmetic.
For example, given the analogy “man is to woman as king (the cue term) is to queen (the target
term)”, we test if queen is the closest type in the vocabulary to king − man + woman. This
implementation builds upon the supposition that the underlying embedding exhibits linear
regularities among these analogies – in the above example, that would be woman − man ≈ queen
− king, i.e. there is a prototypical “womanness” offset in the embedding. [
          <xref ref-type="bibr" rid="ref25 ref27">27, 25</xref>
          ]
        </p>
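The vector-arithmetic test described above can be sketched as follows. The four-word toy embedding is purely hypothetical and only serves to illustrate the nearest-neighbour search (which, as is standard, excludes the three input terms):

```python
import numpy as np

def analogy(a, b, c, emb):
    """Return the type whose vector is closest (by cosine similarity)
    to emb[c] - emb[a] + emb[b], excluding the three input terms."""
    v = emb[c] - emb[a] + emb[b]
    candidates = {w: u for w, u in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: candidates[w] @ v /
               (np.linalg.norm(candidates[w]) * np.linalg.norm(v)))

# hypothetical toy vectors constructed so that queen - king = woman - man
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([2.0, 0.0, 0.0]),
}
print(analogy("man", "woman", "king", emb))  # queen
```

With a realistically large vocabulary, the candidate set contains many distractors near v, which is exactly what the next paragraph argues is lost when the vocabulary is small.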
        <p>Since it was computationally infeasible for us to distill embeddings from BERT that have a
comparable vocabulary size to those of static embeddings, we found that this setup becomes
unreliable: due to the smaller vocabulary, we heavily restrict the search space in these analogy
tests, making the prompts easier to solve. Following the example, consider the vector v =
king − man + woman. Due to the small vocabulary, there are only few distractors in the
neighborhood of v. Consequently, the vector queen most probably is the closest type from the
vocabulary and the prompt is answered correctly, but this is not a consequence of any structure
in the embedding space.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Resources</title>
      <sec id="sec-3-1">
        <title>3.1. General Corpora</title>
        <p>
          We train the type-based embeddings on the German OSCAR Corpus (Open Super-large
Crawled Aggregated coRpus [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]), as this is the largest chunk of the training data for the
current best German BERT model discussed below. The deduplicated variant of the German
corpus contains 21B words (145 GB), filtered out of CommonCrawl.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. BERT Model</title>
        <p>
          As outlined in the introduction, we want to ensure that the different language models are
trained on the same or similar corpora. While training type-based models on our chosen
corpora was feasible for us, it was impossible to pre-train a BERT model from scratch.
Therefore, we choose a pre-trained German model GBERTBase provided by Deepset in collaboration
with DBMDZ [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Like the original BERTBase, the German model consists of 12 layers, 768
dimensions, and a maximum sequence length of 512 tokens. Also, we choose this model since,
to date, it appears to be the best available BERT model (with the
above hyperparameters) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The model was trained on a combination of four corpora, with
OSCAR dominating the training set at approximately 88 % of the total data. When we
distill type-based embeddings from BERT, we are always going to use the GBERTBase model,
and will only use the OSCAR corpus to retrieve contextualized inputs. Likewise, when we
train static type-based models, we are only going to use the OSCAR corpus (leaving the
remaining 12 % of BERT’s pre-training data unaccounted for).
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Data</title>
        <p>As the most popular evaluation datasets are in English, we constructed a comprehensive
German test suite consisting of multiple datasets which cover different aspects based on already
existing evaluation data.3 The tasks covered are: word relatedness (WR), word similarity
(WS), word choice (WC) and relation classification (RC). In addition, the data probes semantic
knowledge such as synonyms and morphological knowledge, namely inflections and derivations.
See Table 1 for an overview of all test datasets.</p>
        <p>
          Word Relatedness/Similarity For WR, we used the re-evaluated translation of
WordSim353 (Schm280) as presented in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], where we only corrected nouns which were written in lower
case, as well as a DeepL translation of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (MEN), which we then reviewed and adjusted manually
as needed. To assess WS, we opted for the translation of [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] (SimLex999) by [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <p>2https://commoncrawl.org 3Datasets and Code: https://github.com/cophi-wue/Word-Embeddings-in-the-Digital-Humanities</p>
          <p>Both WR and WS are judged via Spearman’s rank correlation coefficient between the human-annotated
scores and the cosine distances of all word pairs.</p>
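This evaluation can be sketched as follows; the word pairs, human scores, and toy vectors are hypothetical, and the rank transform ignores ties for brevity:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of rank-transformed
    scores (no tie correction, for brevity)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical human judgments for word pairs and a toy embedding
pairs = [("Auto", "Wagen", 9.0), ("Auto", "Baum", 1.5), ("Baum", "Wald", 7.0)]
emb = {"Auto": np.array([1.0, 0.1]), "Wagen": np.array([0.9, 0.2]),
       "Baum": np.array([0.1, 1.0]), "Wald": np.array([0.3, 0.9])}
human = [s for _, _, s in pairs]
model = [cosine_sim(emb[a], emb[b]) for a, b, _ in pairs]
rho = spearman(human, model)
```

A rho of 1.0 means the embedding ranks the pairs exactly as the annotators did; the absolute cosine values do not matter, only their order.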
          <p>Relation Classification As outlined above, we incorporate the concept of linearly
structured word analogies into the test suite not in the usual way, but via relation classification.
Instead of predicting a target word from a given cue word, RC tries to predict the relation
type of a word pair by comparing their offset with representative offsets for each relation type.
Specifically, we are given a collection of relations R1, R2, . . . , Rk where each Ri is a set of word
pairs. We now interpret the relation Ri as a set of offsets: for each word pair (a, b) ∈ Ri, we
consider the vector vb − va. For the evaluation, we use a median-based 1-nearest-neighbor
classification: We “train” the classifier by choosing the median of Ri as decision object. More
precisely, we define the vector ri = median({vb − va | (a, b) ∈ Ri}) as the decision object of Ri. We
then test this classifier on all pairs from all relations, thus checking for each pair (a, b) from Ri
whether ri is in fact the closest decision object to vb − va (with respect to the ℓ1-norm). We evaluate
these predictions by the “macro” F1 score, i.e. the unweighted average of the F1 scores under
each relation type, respectively.</p>
          <p>While this setup allows for different aggregations resp. distance functions other than the median
resp. ℓ1-norm, we surprisingly found this choice to be more successful than other candidates
(such as those based on cosine distance) among all examined embeddings.</p>
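A minimal sketch of this classifier (median decision objects, ℓ1 nearest neighbour, macro F1), on a hypothetical toy embedding in which the two relation offsets are clearly separated:

```python
import numpy as np

def train_rc(relations, emb):
    """One decision object per relation: the component-wise median
    of the offsets v_b - v_a over all pairs (a, b) in the relation."""
    return {name: np.median([emb[b] - emb[a] for a, b in pairs], axis=0)
            for name, pairs in relations.items()}

def classify(a, b, decision, emb):
    """Predict the relation whose decision object is l1-closest to the offset."""
    offset = emb[b] - emb[a]
    return min(decision, key=lambda r: np.abs(offset - decision[r]).sum())

def macro_f1(gold, pred, labels):
    """Unweighted average of per-relation F1 scores."""
    f1 = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1) / len(f1)

# toy embedding: "plural" shifts along the first axis, "superlative" along the second
emb = {"Haus": np.array([0.0, 0.0]), "Häuser": np.array([1.0, 0.1]),
       "Baum": np.array([2.0, 0.0]), "Bäume": np.array([3.1, 0.0]),
       "gut":  np.array([0.0, 2.0]), "beste": np.array([0.0, 3.0]),
       "hoch": np.array([1.0, 1.0]), "höchste": np.array([0.9, 2.1])}
relations = {"plural": [("Haus", "Häuser"), ("Baum", "Bäume")],
             "superlative": [("gut", "beste"), ("hoch", "höchste")]}
decision = train_rc(relations, emb)
gold, pred = [], []
for name, pairs in relations.items():
    for a, b in pairs:
        gold.append(name)
        pred.append(classify(a, b, decision, emb))
score = macro_f1(gold, pred, list(relations))
```

Note that, as in the text, the classifier is tested on the same pairs it was "trained" on; the median merely summarizes each relation's offset cluster.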
          <p>
            For the RC data we made use of two German knowledge bases: the knowledge graph
GermaNet [
            <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
            ] (Ver. 14.0) and the German Wiktionary.4 GermaNet incorporates different
kinds of semantic relations, including lexical relations such as synonymy, and conceptual
relations such as hypernymy and different kinds of compound relations. For the RC evaluation,
we only selected the conceptual relations and the pertainym relation for our GermaNet dataset,
since only these can be considered directed one-one relations. Wiktionary, on the other hand,
contains tenses, the comparison of adjectives, and derivational relations among other
morphological relations. Again, we selected a set of inflectional resp. derivational directed one-one
relations for the Wiktionary dataset for the RC evaluation, cf. Table 7 in the appendix. Even
though there is a German version of the Google Analogies dataset available [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], we chose not
to include it, as its semantic and morphological relations are covered entirely by GermaNet
and Wiktionary, respectively. Additionally, both datasets contain more instances than the
Google Analogies testset does.
          </p>
          <p>
            Word Choice Lastly, we included the WC task. Here, we used a translated version of
the TOEFL synonym questions [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] as well as one automatically constructed dataset from the
German Duden of synonyms. Our Duden dataset includes one synonym of the cue word as the
target word, plus, as distractors, four synonyms of the target word that are not synonyms of
the cue word. Evaluation is based on whether the target word is closer to the cue word than
the distractors with respect to cosine distance. We report accuracy among all prompts.
          </p>
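The word choice evaluation can be sketched as follows; the single prompt and its toy vectors are hypothetical:

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve(cue, candidates, emb):
    # the answer is the candidate closest to the cue in cosine distance
    return max(candidates, key=lambda w: cosine_sim(emb[cue], emb[w]))

def accuracy(prompts, emb):
    # prompts: (cue word, target word, list of distractors)
    hits = sum(solve(cue, [target] + distractors, emb) == target
               for cue, target, distractors in prompts)
    return hits / len(prompts)

# hypothetical toy vectors: the target is nearly parallel to the cue
emb = {"schnell": np.array([1.0, 0.0]), "rasch": np.array([0.95, 0.1]),
       "spät": np.array([0.0, 1.0]), "laut": np.array([0.3, 0.9])}
acc = accuracy([("schnell", "rasch", ["spät", "laut"])], emb)
```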
          <p>Initially, we also wanted to explore world knowledge (i.e. named entities) captured by the
embeddings, such as city–body of water or author–work relationships. However, most named
entities consist of multi-word expressions, which are difficult to model via type- or token-based
embeddings. We therefore removed all instances where a concept consisted of more than one
token from the datasets described above.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Creating Type Vectors</title>
      <p>Comparing embeddings distilled from BERT’s token-based embeddings with traditional static
type-based embeddings requires us to examine different possibilities for how to perform this
distillation. Therefore, we compare these possibilities by evaluating the resulting embeddings
on the discussed evaluation datasets. Given these results, we decide on a single distillation
procedure and compute a BERT embedding to compare its performance against static embeddings
in Section 5.</p>
      <p>
        In order to systematically evaluate the different methods to compute an embedding from
BERT’s token-based representations, we follow and extend the two-stage setup of Bommasani
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The first stage is the subword pooling, where the k subword vectors wc1, . . . , wck
(derived from BERT’s output) for word w in context c are aggregated into a single contextualized
token vector with aggregation function f; that is, wc = f({wc1, . . . , wck}) is the vector of w in c. Then, in
the second stage, the context combination, multiple contextualized token vectors wc1 , . . . , wcn
are aggregated by function g into a single static embedding w = g({wc1 , . . . , wcn }).
      </p>
      <p>We extend this description by prepending a zeroth vectorization stage, in which we make
explicit how to transform the outputs of BERT’s layers into a single subword vector.</p>
      <sec id="sec-4-1">
        <p>4https://dumps.wikimedia.org/dewiktionary/20210701/</p>
        <p>While Bommasani et al. tacitly concatenate all outputs to form the subword vector, we also allow
selectively picking specific layer output(s) as the subword vector, or a summation over the layers.</p>
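The three stages can be sketched on synthetic arrays; here the vectorization sums over layers, while the pooling f and combination g are passed in as parameters (the shapes, not the values, are the point of this sketch):

```python
import numpy as np

def distill(contexts, f=np.mean, g=np.mean):
    """contexts: one array of shape (layers, subwords, dim) per sampled
    occurrence of the word. Vectorization: sum over layers; subword
    pooling with f; context combination with g."""
    token_vectors = []
    for layer_outputs in contexts:
        subword_vectors = layer_outputs.sum(axis=0)       # vectorization: (subwords, dim)
        token_vectors.append(f(subword_vectors, axis=0))  # pooling: (dim,)
    return g(np.stack(token_vectors), axis=0)             # combination: (dim,)

# two hypothetical occurrences: 13 "layers", 3 resp. 2 subwords, 4 dimensions
rng = np.random.default_rng(0)
contexts = [rng.normal(size=(13, 3, 4)), rng.normal(size=(13, 2, 4))]
w = distill(contexts, f=np.median, g=np.median)  # one possible distillation
```

Any component-wise choice of f and g (mean, median, etc.) fits this interface; functions that are not component-wise, such as mean-norm below, fit as well since they map a set of vectors to one vector.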
        <p>
          This setup also allows us to choose pooling functions f resp. context combination functions
g which are not defined component-wise. Hence, we examine a wider range of choices for
vectorizations, f and g than previously considered, e.g., in [
          <xref ref-type="bibr" rid="ref39 ref6">39, 6</xref>
          ]. One such novel
aggregation function is mean-norm, which refers to the aggregation

mean-norm(v1, . . . , vn) = norm((1/n) ∑i=1..n norm(vi))  (1)

that takes a set of vectors as input, normalizes each to unit length, calculates the mean of these
normalized vectors, and normalizes this mean again. We motivate this aggregation function
from the fact that mean-norm(v1, . . . , vn) is the unique vector on the unit hypersphere that
maximizes the sum of cosine similarities with respect to each v1, . . . , vn. Thus, mean-norm
could also be understood as a “cosine centroid”. In particular, mean-norm, medoids, and
aggregations based on fractional distances were included in our experiments searching for
suitable distillations. Table 2 shows the functions we examined. In total, we examine 17
possible vectorizations, 8 subword pooling functions and 6 context combination functions.</p>
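A direct transcription of Equation (1); the example vectors are arbitrary:

```python
import numpy as np

def mean_norm(vectors):
    """Normalize each vector to unit length, take the component-wise
    mean, and normalize that mean again (the "cosine centroid")."""
    unit = np.stack([v / np.linalg.norm(v) for v in vectors])
    m = unit.mean(axis=0)
    return m / np.linalg.norm(m)

# two arbitrary vectors of different magnitude: only their directions matter
v = mean_norm([np.array([2.0, 0.0]), np.array([0.0, 5.0])])
# v points along the diagonal, i.e. (1/sqrt(2), 1/sqrt(2))
```

Because each input is normalized first, mean-norm is invariant to the magnitudes of the inputs, unlike the plain arithmetic mean.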
        <sec id="sec-4-1-1">
          <title>4.1. Method</title>
          <p>Intending to find the best-performing distillations based on the general setup outlined above, we
evaluate different choices for the “free parameters” of the distillation process as shown in
Table 2.</p>
          <p>
            For each word w to be embedded, we retrieve n = 100 sentences from the OSCAR corpus
as context for w, where w occurs in a sentence of at most 510 tokens. If w has
fewer than n but at least one occurrence, we sampled all of these occurrences. Types w that did not
occur in the OSCAR corpus were removed from the evaluation dataset. For each occurrence
in sentence s, we construct the input sequence by adding the [CLS] and [SEP] token. The
respective outputs of BERT on all layers form the input for the vectorization stage. This
method of generating input sequences by sampling sentences largely agrees with the methods
proposed by [
            <xref ref-type="bibr" rid="ref23 ref39 ref6">6, 39, 23</xref>
            ], and only differs in the sampling of sentences.
          </p>
          <p>Due to the large number of possible distillations, it was computationally infeasible to
construct embeddings under all distillations. Therefore, in a first experiment, we examined the
quality of all considered distillations on a smaller evaluation dataset, which consists of
subsets of the MEN, the Wiktionary, the GermaNet, and the TOEFL datasets. Then, after
restricting the set of potential distillations to promising ones, we performed the same experiment
with the full evaluation dataset on all five datasets. In both cases, the evaluation is performed
as outlined in Section 3.3.</p>
          <p>Also, since the scores in the respective tasks are reported in different metrics, we opt for
standardizing the respective scores in each task when comparing embeddings’
performances. Therefore, for one specific task, we consider the standardized score as the number of
standard deviations away from the mean over all models’ scores in that task. We then can
report the mean standardized score of a model taken over all considered tasks.</p>
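In other words, scores are z-scored per task across models and then averaged per model; a minimal sketch with made-up numbers:

```python
import numpy as np

def mean_standardized_score(scores):
    """scores: array of shape (models, tasks) in task-specific metrics.
    Each task column is z-scored across models, then averaged per model."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return z.mean(axis=1)

# two hypothetical models evaluated on three tasks with different metrics
scores = np.array([[0.60, 0.40, 55.0],    # model A
                   [0.50, 0.45, 45.0]])   # model B
mss = mean_standardized_score(scores)
```

With only two models each z-score is ±1, so here model A scores (1 − 1 + 1)/3 = 1/3 and model B the negative of that; with more models the scores spread over a continuous range.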
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Results</title>
          <p>
            The first run of the experiment with all examined distillations on the smaller datasets gave
a strong indication that centroid-based distillations lead to significantly better-performing
embeddings than those distillations that consist of medoid-based poolings resp. aggregations.
In fact, among the 13 distillations with the highest mean standardized score, all twelve
distillations with centroid-based poolings and aggregations are present (no pooling, mean, median,
mean-norm). Numerical values are presented in Table 8 in the appendix. (The top 13 distillations
are underlined.) With this experiment, we contribute insight into the performance of
different aggregation functions not previously considered in the literature. The results suggest the
interpretation that centroids – which represent a vector cluster by some synthetic aggregate –
generally lead to better results than medoids – which represent a cluster by some member of
that cluster. Also, in our distillation setup, fractional norms do not appear to give an
advantage, as opposed to research indicating that fractional distance metrics could lead to better
clustering results in high-dimensional space, e.g. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Hence, in total, we answer in the negative the
hypothesis that certain overlooked aggregation functions could lead to immediate
improvements of the resulting type-based embeddings distilled from BERT.
          </p>
          <p>
            Therefore, we continue our evaluation of these twelve centroid-based parameter choices in the
next experiment on the full dataset. We observe that, under the restriction on centroid-based
poolings and aggregations, the choice of vectorization (i.e. layer) has a much higher influence
on the embedding’s performance than the actual choice of functions f and g. This supports the
general hypothesis that different layers capture different aspects of linguistic knowledge [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ].
Additionally, these findings demonstrate that the default suggestions f = mean and g = mean
in the literature [
            <xref ref-type="bibr" rid="ref23 ref39 ref6">6, 39, 23</xref>
            ] generally are a reasonable choice to perform a distillation. A
visualization of the embeddings’ scores on each of the full seven datasets is presented in Figure
5 in the appendix.
          </p>
          <p>Again, we ranked the 17 × 4 × 3 (# vectorizations × # restricted poolings × # restricted
context combinations) analyzed embeddings with respect to their mean standardized score over all
five tasks; cf. Table 3. This leads to our observation that those embeddings based on a
sum-vectorization outperform any of the other embeddings; hence we suspect that a
vertical summation resp. averaging over all layers can provide a robust vector representation for
BERT’s tokens, capturing the summative linguistic knowledge of all layers in a reasonable
fashion. Also, we suspect that the smaller dimensionality of the sum-vectorization might give these
embeddings an advantage in comparison to the vectorization that concatenates all layers: the
former has 768 dimensions, while the latter concatenates 12 Transformer outputs plus BERT’s
input embedding, leading to 768 × 13 = 9984 dimensions.</p>
          <p>To fix a single embedding for further comparison with static embeddings, we choose the
embedding with the highest mean standardized score, which is the embedding based on the
distillation with sum vectorization, the pooling f = median, and the aggregation g = median.
The respective mean standardized score is highlighted in bold in Table 3. Thus, when we now
speak of BERT’s distilled embedding, we explicitly mean this distillation (sum
vectorization, median pooling and aggregation). Note that this embedding consists of 768 dimensions.
Nevertheless, we explicitly remark that the small differences in performance do not admit
the claim that the chosen distillation is a universal method that would always perform best in
any scenario. Also, we want to highlight that the previously untreated median as aggregation
function appears to cause some improvement, especially in the pooling stage. Due to its
robustness against outliers, we recommend always examining this aggregation in distillations of
any form that convert from token-based to type-based embeddings.</p>
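          <p>
          As a minimal sketch of this distillation, assuming the token vectors have already been extracted from BERT and using a hypothetical input format:

```python
import numpy as np

def distill_type_vector(occurrences):
    """Distill a static type vector from BERT token vectors.

    `occurrences`: list with one entry per occurrence of the word type;
    each entry is an array of shape (13, n_wordpieces, 768) holding the
    hidden states of the word's wordpieces at the input embedding plus
    the 12 Transformer layers (hypothetical input format).
    """
    pooled = []
    for hidden in occurrences:
        vec = hidden.sum(axis=0)      # sum-vectorization over the 13 layers
        vec = np.median(vec, axis=0)  # pooling f = median over wordpieces
        pooled.append(vec)
    # aggregation g = median over all occurrences of the type
    return np.median(np.stack(pooled), axis=0)
```
          </p>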
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Comparing type- and token-based embeddings</title>
      <sec id="sec-5-1">
        <title>5.1. Methods</title>
        <p>
          Before training the type-based embeddings models, we preprocessed the German OSCAR
Corpus to ensure that models are trained on the same version of the data. Preprocessing has been
done using the word tokenizer and sentence tokenizer of NLTK.5 Additionally, all punctuation
has been removed. We trained all models using the respective default parameters and used the
skip-gram model for all embeddings. We only adapted the number of dimensions and window
size to create additional embedding models for comparison. While the recommended window
size is 5, a window size of 2 has proven to be more effective in capturing semantic similarity
in embeddings [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. To make up for the larger dimensionality of BERT’s distilled embedding
(768), we also trained static models with 768 dimensions besides the more commonly used 300
dimensions for type-based models (under the assumption that more dimensions imply a higher
quality vector space). Additionally, we concatenated the embeddings in various combinations,
as these “stacked” vectors often lead to better results, as presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We experimented
with all three possible tuples consisting of BERT, word2vec, and fastText (using the 768
dimension versions only) as well as one embedding where we concatenated all three. We lastly
included one fastText model with 2 × 768 dimensions to enable a direct comparison to the
stacked embeddings.
        </p>
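        <p>
          The stacking itself is plain vector concatenation per word; a minimal sketch (the per-model vectors are assumed to be looked up beforehand):

```python
import numpy as np

def stack_embeddings(*vectors):
    """Concatenate per-word vectors from several models into one
    "stacked" vector, e.g. word2vec (768) + fastText (768) -> 1536."""
    return np.concatenate([np.asarray(v) for v in vectors])
```
        </p>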
        <p>As explained above, we evaluate how well the models represent different term relations with
four tasks: word similarity, word relatedness, word choice, and relation classification.</p>
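        <p>
          For illustration, a word choice task reduces to picking the candidate closest to the target word by cosine similarity (a sketch; `emb` is a hypothetical word-to-vector mapping):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_choice(emb, target, candidates):
    """Pick the candidate whose vector is most similar to the target."""
    return max(candidates, key=lambda c: cosine(emb[target], emb[c]))
```
        </p>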
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>The first general observation looking at Figure 2 is that BERT’s distilled embedding (again,
sum vectorization, median pooling and aggregation) does not perform significantly better,
contrary to our expectations. In fact, the type-based embeddings seem to capture term
relatedness and similarity even better than the token-based embeddings distilled from BERT
in most tasks: in WS, WR, and WC, FastText Dim768WS2 produces the best results (see
Table 4), while in RC, BERT achieves the best results on the morphological relations only.
[Figure 2: evaluation scores of all models on the seven datasets]
5: Natural Language Toolkit (https://www.nltk.org/)</p>
        <p>The WR and WS tasks (MEN, Schm280, SimLex999) paint a similar picture. Both
hyperparameters, window size and number of dimensions, lead to a slight improvement when reduced
and increased, respectively. Most notably, the similarity task benefits the most (about 0.05
absolute correlation improvement with both parameters adjusted for fastText; see Table 4)
from altering these parameters. While BERT is on par with or even slightly outperformed by
the 300-dimensional type-based embeddings in the relatedness task, it performs better in the
similarity task. The higher-dimensional vectors, however, can match BERT’s performance
on the SimLex999 dataset. Overall, every model seems to struggle with the more narrowly
defined WS task when compared to the WR task.</p>
        <p>The WC task (Duden, TOEFL) also shows a clear trend: all type-based embeddings exceed
BERT’s performance noticeably, by an accuracy difference of at least 0.06 (Duden, Word2Vec
Dim300) and at most 0.23 (TOEFL, FastText Dim768WS2). Altering the parameters of the
type-based embeddings, similarly to the WR task, results in marginally better performing
vectors.</p>
        <p>BERT’s embeddings perform considerably better in the RC task when compared to the
300-dimensional embeddings. However, a substantial gain from the dimensionality increase can
also be observed with GermaNet as opposed to the other datasets, leading to both FastText
Dim768WS2 and Word2Vec Dim768 surpassing BERT’s performance by 0.06 and 0.08,
respectively. While the same trend appears on the Wiktionary dataset, the classification of
morphological relations by BERT’s embeddings still remains uncontested with an accuracy of
0.91. From a human perspective, the morphological relations are rather trivial (some examples
are presented in Table 7 in the appendix); even from a computational point of view,
lemmatizing or stemming the tails of these triples could in theory reliably predict the individual
heads. This implies that, generally, BERT can reproduce these kinds of simpler relations
best, while traditional models capture complex semantic associations more accurately. We
separately explored the individual performances of all relations in GermaNet and Wiktionary
and discovered that the higher F1 score of BERT mainly stems from the derivations,
indicating that BERT’s word piece tokenization might facilitate its remarkable performance there.
Controlling for the dataset and relation size in a linear regression did not, however, reveal a
correlation between the amount of overlap and F1.</p>
        <p>From these experiments we can conclude that for word similarity and term relatedness use
cases, employing regular fastText embeddings, optionally with an increased number of dimensions,
is sufficient. Using embeddings with the same number of dimensions as BERT results in the
static embeddings taking the lead in the WS task and, specifically for semantic relations, in the RC task.</p>
        <p>
          Moreover, there appears to be no clear trend on whether BERT’s distilled embedding is
generally better (or worse) than other models. In certain tasks, it performs particularly well (e.g.,
Wiktionary), and in others particularly badly (e.g., TOEFL). To give some statistical estimate of
the difference in performance between BERT and the other models, we employ the Bayesian
hierarchical correlated t-tests proposed by Benavoli et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and Corani et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], designed to compare the
performances of two classifiers on multiple test sets.6 This hierarchical model is learned on our
observed scores and, after learning, can be queried to make inferences about the performance
difference (in score points, e.g., absolute accuracy difference) between BERT and another language
model on a future unseen dataset. See the cited references for a thorough presentation of the
hierarchical model and the inference method. (Note that the Bayesian hierarchical correlated
t-test is based on repeated cross-validation runs on the same dataset. Hence, to adapt our
setup to the t-test, we need to modify our task procedures to obtain cross-validation results.
Section A.2 in the appendix gives details on how we implemented this.)
        </p>
        <p>Table 5 gives the results of this inference. Most prominently, it estimates that on a
future unseen dataset, FastText Dim768WS2 will most likely outperform BERT’s distilled
embedding by at least 0.03 absolute score points (P = 89.1 %). Even for the relatively weak
Word2Vec Dim300, the hierarchical model predicts roughly equal probabilities for either BERT
being better vs. Word2Vec Dim300 being better (by at least 0.03 absolute score points, 47.9 %
vs. 51.8 %).</p>
        <p>Nevertheless, this quantitative analysis also has limits due to the stochastic model presumed
by the Bayesian hierarchical correlated t-test. The model assumes that the performance
differences among the datasets (δ1, δ2, . . . , δnext) are i.i.d. and follow the same high-level
Student t-distribution t(µ0, σ0, ν); thus, the model assumes that the considered datasets are in some
way homogeneous. Though all our datasets are meant to examine word similarity, the distinct
differences in performance of the embedding types we observe (see fig. 2) indicate that these
datasets represent different aspects of word similarity, which certain language models capture
better than others. Hence, in our use case, we see the limits of the assumptions made by the
stochastic model, and in this light, the results of the Bayesian hierarchical correlated t-tests
need to be interpreted cautiously.
6: We want to thank the anonymous reviewer who brought the potential of the Bayesian hierarchical correlated
t-test to our attention.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Discussion</title>
        <p>
          The most important result from our experiments is that a widespread assumption in NLP
and the Computational Humanities is not true: a context-sensitive embedding like BERT is not
automatically better for all purposes. Static embeddings like fastText are at least on par, if not
better, when word embeddings are used as abstractions of semantic systems. But our results are
subject to some important limitations. For example, we can think of several ways that could
increase BERT’s capability to represent word similarity, which we haven’t explored:
• Modify the training objective for the pre-training phase, for example by adding a task
which influences how the model represents word similarity.
• Fine-tune the model on a task to improve the representation of word similarity, for
example predicting the nearest neighbour based on existing word similarity lists.
• Replace wordpiece tokenization with full word tokenization, which has been reported
to improve performance in some contexts [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
On the other hand, we didn’t spend much time finding the best parameters for the static
embeddings: we just used a well-established static embedding like fastText and didn’t test
more recent proposals for static embeddings like [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which reported improved results. So there
is a lot of room for improvement in both directions.
        </p>
        <p>
          In order to understand how the performance differences we observed between static and
dynamic embeddings relate to the performance gains that have been observed by stacking
embeddings from different sources [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we combine word2vec, fastText and BERT embeddings
in different constellations and add a fastText model with the same dimensions to compensate
for effects based on the different dimensionality of the embeddings (see Figure 3). For four
evaluation sets – GermaNet, MEN, Duden, TOEFL – the differences between BERT and fastText
are larger than the difference between fastText and a stacked alternative. The performance
gain of using stacked embeddings is in most cases rather small. Adding BERT to the stacked
embeddings either doesn’t help at all (TOEFL) or only a little (GermaNet, Schm280, MEN,
Duden). The only exception is the Wiktionary dataset, which is already the only use case where
BERT is better than fastText.
        </p>
        <p>
          [Figure 3: scores of BERT (sum-median-median), FastText Dim768WS2, FastText Dim1536WS2, Word2Vec Dim768, and their stacked combinations on GermaNet and Wiktionary (RC, macro F1), SimLex999, Schm280 and MEN (WS, Spearman ρ), and Duden and TOEFL (WC, accuracy)]
        </p>
        <p>
          As discussed above, the Wiktionary dataset consists mainly
of inflections, for example singular vs. plural, or derivations, for example masculine form of a
noun (‘Autor’) vs. female form (‘Autorin’). More examples are listed in Table 7. Maybe more
sophisticated approaches combining the diferent embeddings like [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] will show better results,
but obviously they all need a token-based model next to the static models.
        </p>
        <p>Exploring the behaviour of the different embeddings, we also came across a noticeable
difference between the BERT-based embeddings and the static embeddings (see Figure 4). We
calculated the distances between 956 synonym pairs, using synonyms as defined by GermaNet
in one setting and as defined by Duden in the other. To make the results comparable, we
standardized each of them by drawing 956 random word pairs and basing our calculation of the
mean distance and the standard deviation on them. Then we expressed the cosine distance of
the synonyms in standard deviations away from the mean distance. The results show, for both
datasets, a much larger spread for the static embeddings, indicating that the BERT vectors
occupy a smaller space, an effect which is not related to the dimensionality of its vectors.
[Figure 4: density plot of model-standardized cosine distances of synonym pairs, for GermaNet and Duden]</p>
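        <p>
          The standardization described above can be sketched as follows, assuming `emb` is a hypothetical word-to-vector mapping:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def standardized_distances(emb, synonym_pairs, seed=0):
    """Express synonym cosine distances in standard deviations away from
    the mean distance of equally many random word pairs."""
    rng = np.random.default_rng(seed)
    words = list(emb)
    # draw as many random word pairs as there are synonym pairs
    idx = rng.integers(0, len(words), size=(len(synonym_pairs), 2))
    random_d = [cosine_distance(emb[words[i]], emb[words[j]]) for i, j in idx]
    mu, sigma = float(np.mean(random_d)), float(np.std(random_d))
    return [(cosine_distance(emb[a], emb[b]) - mu) / sigma
            for a, b in synonym_pairs]
```
        </p>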
        <p>
          This seems to be in accordance with results from Ethayarajh [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], who reported that the
contextualized token embedding of BERT is anisotropic: randomly sampled words seem to
have, on average, a very high cosine similarity. In fact, Timkey and van Schijndel [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] report
in a pre-print that in BERT’s contextualized embedding space, a few dimensions dominate
the similarity between word vectors (“rogue dimensions”). As future work, we want to examine the
effect of the post-processing transformations on the embedding spaces proposed by Timkey and
van Schijndel, which are designed to counteract the undesirable effect of these rogue dimensions.
In our first exploratory experiments, we observe that all our examined embeddings – both the
distilled ones from BERT and the static ones – appear to benefit from post-processing
the type vectors. Yet even then, the post-processing does not give BERT an advantage
over static embeddings.
        </p>
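        <p>
          To illustrate the general idea of such a correction (a sketch of per-dimension standardization of the type-vector matrix, not necessarily the exact transformation of Timkey and van Schijndel):

```python
import numpy as np

def standardize_dimensions(matrix, eps=1e-12):
    """Give every dimension zero mean and unit variance across the
    vocabulary, dampening dominant "rogue" dimensions.
    `matrix` has shape (n_words, dim)."""
    return (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + eps)
```
        </p>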
        <p>To summarize, our main takeaway is not a recommendation for a specific static word
embedding; rather, we think it is worthwhile to continue research on static word embeddings
– at least for researchers working in the field of Computational Literary Studies – because
their representational power as abstractions of semantic systems is on par with that of dynamic
embeddings, the computing power needed is much smaller, and the minimal size of the corpora
needed to train them is also smaller. What we need in the field of Computational Literary
Studies is a more robust understanding of how the quality of embeddings is related to the size and
structure of datasets, methods to improve the performance of static embeddings trained on
even smaller datasets, maybe by combining them with knowledge bases, and more evaluation
datasets for languages beyond English.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix</title>
      <sec id="sec-6-1">
        <title>A.1. Supplementary tables and figures</title>
        <p>[Figure 5: standardized scores of all distillations, by vectorization/layer (input embedding, L1-L12, L1-4, L9-12, sum, all) and by pooling f and aggregation g (nopooling, mean, meannorm, median, and the lp-medoids), on the GermaNet and Wiktionary tasks]</p>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Adapting the task procedures for cross-validation</title>
        <p>
To compare two embeddings on our datasets, we have employed the Bayesian hierarchical
correlated t-test as described by Corani et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This test was originally designed to compare
two classifiers on multiple datasets, given their respective cross-validation results.
        </p>
        <p>As presented in Sec. 3.3, our tasks do not perform such cross-validation. Therefore, to adapt
to the test, we modify our tasks as follows to obtain cross-validation results:
• As the Relation Classification task (GermaNet, Wiktionary) is implemented as a
median-based 1-nearest-neighbor classifier, it can be naturally extended to separate train and test
sets. Given a train set of (labeled) word pairs and a test set of word pairs, we construct
the decision objects (i.e., medians) for the relations only on the training examples. Then
we test the 1-nearest-neighbor classifier only on the test examples.</p>
        <p>On each Relation Classification dataset, we perform 10 runs of 10-fold stratified cross
validation to obtain 100 F1 scores.
• For the Word Relatedness and Word Similarity tasks (SimLex999, Schm280, MEN), there
is no natural way to implement a cross-validation, since these tasks measure correlation
and are not “trained”.</p>
        <p>Therefore, to mimic the 10-fold cross-validation, on each dataset we randomly sample
100 subsets that each contain 10 % of the respective dataset, and calculate the Spearman
ρ on each of these subsets to obtain 100 correlation coefficients.
• We proceed similarly for the Word Choice tasks (Duden, TOEFL): we randomly sample 100 subsets
containing 10 % of the respective dataset, and calculate the accuracies on each subset.</p>
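        <p>
          The subset-sampling procedure for the correlation tasks can be sketched as follows (using SciPy's Spearman implementation; variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def pseudo_cv_correlations(gold, predicted, n_runs=100, frac=0.10, seed=0):
    """Mimic 10-fold cross-validation for a correlation task: Spearman
    rho on `n_runs` random subsets containing `frac` of the dataset."""
    rng = np.random.default_rng(seed)
    gold, predicted = np.asarray(gold), np.asarray(predicted)
    k = max(2, int(frac * len(gold)))
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(gold), size=k, replace=False)
        rho, _ = spearmanr(gold[idx], predicted[idx])
        scores.append(rho)
    return np.array(scores)
```
        </p>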
        <p>Fix a pair of models we want to compare. For the i-th dataset (of a total of q datasets), we
calculate a vector xi = (xi1, xi2, . . . , xi100) of differences in score on each cross-validation fold,
using the same folds for each dataset. On these vectors x1, . . . , xq, we can now perform the
Bayesian hierarchical correlated t-test using the Python package baycomp, which implements the
hierarchical stochastic model and performs the “hypothesis test” that estimates the posterior
distribution of the difference in score between the two models on a future unseen dataset, as
proposed by Corani et al.</p>
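        <p>
          The final step, sketched below, arranges the per-fold scores of two models into the difference vectors xi; the baycomp call in the comment is indicative only (argument values follow the rope of 0.03 score points used above):

```python
import numpy as np

def fold_differences(scores_a, scores_b):
    """Per-fold score differences over q datasets; `scores_a` and
    `scores_b` have shape (q, 100), computed on identical folds/subsets.
    Row i is the vector x_i = (x_i1, ..., x_i100)."""
    return np.asarray(scores_a) - np.asarray(scores_b)

# The two score matrices are then passed to baycomp, e.g.:
#   import baycomp
#   p_a, p_rope, p_b = baycomp.two_on_multiple(scores_a, scores_b,
#                                              rope=0.03, runs=10)
```
        </p>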
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          . “
          <article-title>On the Surprising Behavior of Distance Metrics in High Dimensional Space”</article-title>
          . In: Database Theory, ICDT
          <year>2001</year>
          . Ed. by J. Van den Bussche and V. Vianu.
          <source>Lecture Notes in Computer Science</source>
          .
          <year>2001</year>
          , pp.
          <fpage>420</fpage>
          -
          <lpage>434</lpage>
          . doi: 10.1007/3-540-44503-x_27.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravalova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paşca</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          . “
          <article-title>A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches”</article-title>
          .
          <source>In: Proceedings of Human Language Technologies</source>
          :
          <article-title>The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          . Boulder, Colorado,
          <year>2009</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          . url: https://aclanthology.org/N09-1003.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          . “
          <article-title>Contextual String Embeddings for Sequence Labeling”</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe</source>
          , New Mexico, USA: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          . url: https://aclanthology.org/C18-1139.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakarov</surname>
          </string-name>
          . “
          <article-title>A survey of word embeddings evaluation methods”</article-title>
          . In: arXiv preprint arXiv:1801.09536 (
          <year>2018</year>
          ). url: http://arxiv.org/abs/1801.09536.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Benavoli</surname>
          </string-name>
          , G. Corani,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zafalon</surname>
          </string-name>
          . “
          <article-title>Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis”</article-title>
          .
          <source>In: Journal of Machine Learning Research 18.77</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          . url: http://jmlr.org/papers/v18/16-305.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          . “
          <article-title>Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings”</article-title>
          . In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . Online: Association for Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>4758</fpage>
          -
          <lpage>4781</lpage>
          . doi: 10.18653/v1/2020.acl-main.431.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-K.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          . “
          <article-title>Multimodal distributional semantics”</article-title>
          .
          <source>In: Journal of artificial intelligence research 49</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          . doi: 10.1007/s10462-019-09796-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          . “
          <article-title>German's Next Language Model”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6788</fpage>
          -
          <lpage>6796</lpage>
          . doi: 10.18653/v1/2020.coling-main.598.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Corani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benavoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mangili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zafalon</surname>
          </string-name>
          . “
          <article-title>Statistical comparison of classifiers through Bayesian hierarchical modelling”</article-title>
          .
          <source>In: Machine Learning 106.11</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>1817</fpage>
          -
          <lpage>1837</lpage>
          . doi: 10.1007/s10994-017-5641-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). NAACL-HLT
          <year>2019</year>
          . Minneapolis, Minnesota: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi: 10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>El Boukkouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavergne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Noji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          . “
          <article-title>CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Computational Linguistics</source>
          . Barcelona, Spain (Online): International Committee on Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>6903</fpage>
          -
          <lpage>6915</lpage>
          . doi: 10.18653/v1/2020.coling-main.609.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          . “
          <article-title>How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>65</lpage>
          . doi: 10.18653/v1/D19-1006.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          . “
          <article-title>Obtaining Better Static Word Embeddings Using Contextual Embedding Models”</article-title>
          .
          <source>In: arXiv preprint arXiv:2106.04302</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2106.04302.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pagliardini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          . “
          <article-title>Better Word Embeddings by Disentangling Contextual n-Gram Information”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>933</fpage>
          -
          <lpage>939</lpage>
          . doi: 10.18653/v1/N19-1098.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamp</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Feldweg</surname>
          </string-name>
          . “
          <article-title>GermaNet - a Lexical-Semantic Net for German”</article-title>
          .
          <source>In: Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications</source>
          .
          <year>1997</year>
          . url: https://www.aclweb.org/anthology/W97-0802.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Marjanen</surname>
          </string-name>
          . “
          <article-title>A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0791.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Henrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hinrichs</surname>
          </string-name>
          . “
          <article-title>GernEdiT - The GermaNet Editing Tool”</article-title>
          .
          <source>In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          . Valletta, Malta: European Language Resources Association (ELRA)
          ,
          <year>2010</year>
          , pp.
          <fpage>2228</fpage>
          -
          <lpage>2235</lpage>
          . url: http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation”</article-title>
          .
          <source>In: Computational Linguistics 41.4</source>
          (
          <year>2015</year>
          ), pp.
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          . doi: 10.1162/COLI_a_00237.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kocher</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          . “
          <article-title>Distributed language representation for authorship attribution”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 33.2</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>425</fpage>
          -
          <lpage>441</lpage>
          . doi: 10.1093/llc/fqx046.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Köper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scheible</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulte im Walde</surname>
          </string-name>
          . “
          <article-title>Multilingual Reliability and “Semantic” Structure of Continuous Word Spaces”</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Computational Semantics</source>
          . London, UK,
          <year>2015</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>45</lpage>
          . url: https://aclanthology.org/W15-0105.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          . “
          <article-title>Statistically Significant Detection of Linguistic Change”</article-title>
          .
          <source>In: Proceedings of the 24th International World Wide Web Conference. WWW '15</source>
          . Florence, Italy,
          <year>2015</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>635</lpage>
          . doi: 10.1145/2736277.2741627.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kutuzov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Szymanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          . “
          <article-title>Diachronic word embeddings and semantic shifts: a survey”</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe</source>
          , New Mexico, USA: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1384</fpage>
          -
          <lpage>1397</lpage>
          . url: https://aclanthology.org/C18-1117.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jeuniaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Gyllensten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Miliani</surname>
          </string-name>
          . “
          <article-title>A comprehensive comparative evaluation and analysis of Distributional Semantic Models”</article-title>
          .
          <source>In: arXiv preprint arXiv:2105.09825</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2105.09825.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Leviant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          . “
          <article-title>Separated by an un-common language: Towards judgment language informed vector space modeling”</article-title>
          .
          <source>In: arXiv preprint arXiv:1508.00106</source>
          (
          <year>2015</year>
          ). url: http://arxiv.org/abs/1508.00106.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          . “
          <article-title>Linguistic Regularities in Sparse and Explicit Word Representations”</article-title>
          .
          <source>In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning</source>
          . Ann Arbor, Michigan: Association for Computational Linguistics,
          <year>2014</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>180</lpage>
          . doi: 10.3115/v1/W14-1618.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          . “
          <article-title>Efficient estimation of word representations in vector space”</article-title>
          .
          <source>In: arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ). url: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zweig</surname>
          </string-name>
          . “
          <article-title>Linguistic Regularities in Continuous Space Word Representations”</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. NAACL-HLT 2013</source>
          . Atlanta, Georgia: Association for Computational Linguistics,
          <year>2013</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          . url: https://www.aclweb.org/anthology/N13-1090.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Ortiz Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          .
          <article-title>“A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages”</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1703</fpage>
          -
          <lpage>1714</lpage>
          . url: https://www.aclweb.org/anthology/2020.acl-main.156.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Fakhrahmad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Sadreddini</surname>
          </string-name>
          . “
          <article-title>Co-occurrence graph-based context adaptation: a new unsupervised approach to word sense disambiguation”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities</source>
          (
          <year>2020</year>
          ). doi: 10.1093/llc/fqz048.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kovaleva</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          .
          <article-title>“A Primer in BERTology: What We Know About How BERT Works”</article-title>
          .
          <source>In: Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ), pp.
          <fpage>842</fpage>
          -
          <lpage>866</lpage>
          . doi: 10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          . “
          <article-title>Conceptual Vocabularies and Changing Meanings of “Foreign” in Dutch Foreign News (1815-1914)”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0651.html.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>van Eijnatten</surname>
          </string-name>
          . “
          <article-title>Disentangling a Trinity: A Digital Approach to Modernity, Civilization and Europe in Dutch Newspapers (1840-1990)”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0572.html.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Salami</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Momtazi</surname>
          </string-name>
          . “
          <article-title>Recurrent convolutional neural networks for poet identification”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities</source>
          (
          <year>2020</year>
          ). doi: 10.1093/llc/fqz096.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Batjargal</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeda</surname>
          </string-name>
          . “
          <article-title>Linking the Same Ukiyo-e Prints in Different Languages by Exploiting Word Semantic Relationships across Languages”</article-title>
          .
          <source>In: Book of Abstracts of DH2017</source>
          . Alliance of Digital Humanities Organizations. Montréal, Canada,
          <year>2017</year>
          . url: https://dh2017.adho.org/abstracts/369/369.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Susanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tokunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nishikawa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Obari</surname>
          </string-name>
          . “
          <article-title>Automatic distractor generation for multiple-choice English vocabulary questions”</article-title>
          .
          <source>In: Research and Practice in Technology Enhanced Learning 13.2</source>
          (
          <year>2018</year>
          ). doi: 10.1186/s41039-018-0082-z.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M. A. H.</given-names>
            <surname>Taieb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Aouicha</surname>
          </string-name>
          . “
          <article-title>A survey of semantic relatedness evaluation datasets and procedures”</article-title>
          .
          <source>In: Artificial Intelligence Review 53.6</source>
          (
          <year>2020</year>
          ), pp.
          <fpage>4407</fpage>
          -
          <lpage>4448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>W.</given-names>
            <surname>Timkey</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Schijndel</surname>
          </string-name>
          . “
          <article-title>All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality”</article-title>
          .
          <source>In: arXiv preprint arXiv:2109.04404</source>
          (
          <year>2021</year>
          ). url: http://arxiv.org/abs/2109.04404.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.</given-names>
            <surname>Uslu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schulz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Baumartz</surname>
          </string-name>
          . “
          <article-title>BigSense: a Word Sense Disambiguator for Big Data”</article-title>
          .
          <source>In: Book of Abstracts of DH2019. Utrecht</source>
          ,
          <year>2019</year>
          . url: https://dev.clariah.nl/files/dh2019/boa/0199.html.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Litschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          . “
          <article-title>Probing Pretrained Language Models for Lexical Semantics”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . Online: Association for Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>7222</fpage>
          -
          <lpage>7240</lpage>
          . doi: 10.18653/v1/2020.emnlp-main.586.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          . “
          <article-title>GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”</article-title>
          .
          <source>In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . Brussels, Belgium: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . doi: 10.18653/v1/W18-5446.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-C. J.</given-names>
            <surname>Kuo</surname>
          </string-name>
          . “
          <article-title>Evaluating word embedding models: methods and experimental results”</article-title>
          .
          <source>In: APSIPA transactions on signal and information processing 8</source>
          (
          <year>2019</year>
          . doi: 10.1017/atsip.2019.12.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>How Can BERT Help Lexical Semantics Tasks?”</article-title>
          .
          <source>In: arXiv preprint arXiv:1911.02929</source>
          (
          <year>2020</year>
          ). url: http://arxiv.org/abs/1911.02929.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ziehe</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sporleder</surname>
          </string-name>
          . “
          <article-title>Multimodale Sentimentanalyse politischer Tweets”</article-title>
          .
          <source>In: Book of Abstracts of DHd2019. Frankfurt</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>332</lpage>
          . doi: 10.5281/zenodo.2596095.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>