CL-IMS @ DIACR-Ita: Volente o Nolente: BERT does not Outperform SGNS on Semantic Change Detection

Severin Laicher, Gioia Baldissin, Enrique Castañeda, Dominik Schlechtweg, Sabine Schulte im Walde
Institute for Natural Language Processing, University of Stuttgart
{laichesn,baldisga,medinaeo,schlecdk,schulte}@ims.uni-stuttgart.de*

* Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit the Average Pairwise Distance of token-based BERT embeddings between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not translate to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

1 Introduction

Lexical Semantic Change (LSC) Detection has drawn increasing attention in the past years (Kutuzov et al., 2018; Tahmasebi et al., 2018). Recently, SemEval-2020 Task 1 provided a multilingual evaluation framework to compare the variety of proposed model architectures (Schlechtweg et al., 2020). The DIACR-Ita shared task extends parts of this framework to Italian by providing an Italian data set for SemEval's binary subtask (Basile et al., 2020a; Basile et al., 2020b). We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit the Average Pairwise Distance of token-based BERT embeddings (Devlin et al., 2019) between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not transfer to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

2 Related Work

Most existing approaches to LSC detection are type-based (Schlechtweg et al., 2019; Shoemark et al., 2019): word occurrences are not considered individually (token-based); instead, a single vector representation is created that summarizes every occurrence of a word (including ambiguous words). The results of SemEval-2020 Task 1 (Martinc et al., 2020; Schlechtweg et al., 2020) showed that type-based approaches (Pražák et al., 2020b; Asgari et al., 2020) achieved better results than token-based approaches (Beck, 2020; Kutuzov and Giulianelli, 2020a). This is somewhat surprising, since in recent years contextualized token-based approaches have achieved significant improvements over static type-based approaches in several NLP tasks (Ethayarajh, 2019). Schlechtweg et al. (2020) suggest a range of possible reasons for this: (i) Contextualized embeddings are new and lack established usage conventions. (ii) They are pre-trained and may thus carry additional, and possibly irrelevant, information. (iii) The context of word uses in the SemEval data set was too narrow (one sentence). (iv) The SemEval corpora were lemmatized, while token-based models usually take the raw sentence as input. In the DIACR-Ita challenge, (iii) and (iv) do not apply, because raw corpora with sufficient context are made available to participants. We tried to tackle (i) by extensively tuning parameters and system modules on the English SemEval data set. (ii) could be tackled by fine-tuning BERT on the target corpora; however, our experiments on the English SemEval data set show that exceptionally high performance can be reached even without fine-tuning.
3 Experimental setup

The DIACR-Ita task definition is taken from SemEval-2020 Task 1 Subtask 1 (binary change detection): given a list of target words and a diachronic corpus pair C1 and C2, the task is to identify the target words which have changed their meaning between the respective time periods t1 and t2 (Basile et al., 2020a; Schlechtweg et al., 2020).[1] C1 and C2 were extracted from Italian newspapers and books. Target words which have changed their meaning are labeled with the value '1'; the remaining target words are labeled with '0'. Gold data for the 18 target words was semi-automatically generated from Italian online dictionaries. According to the gold data, 6 of the 18 target words are subject to semantic change between t1 and t2. The gold data was only made public after the evaluation phase. During the evaluation phase, each team was allowed to submit up to 4 predictions for the full list of target words, which were scored using classification accuracy between the predicted labels and the gold data. The final competition ranking compares only the highest score achieved by each team.

[1] The time periods t1 and t2 were not disclosed to participants.

4 System Overview

Our model uses BERT to create token vectors and the Average Pairwise Distance to compare the token vectors from the two time periods. The following sections present our model, how we trained it, and how we chose our submissions.

4.1 BERT

In 2018, Google released a model pre-trained on Wikipedia and books of different genres (Devlin et al., 2019): BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to find representations for text by analysing its left and right contexts (Devlin et al., 2019). Peters et al. (2018) show that contextual word representations derived from pre-trained bidirectional language models like BERT and ELMo yield significant improvements to the state of the art for a wide range of NLP tasks. BERT can be used to analyse the semantics of individual words by creating contextualized word representations, i.e., vectors that are sensitive to the context in which the words appear (Ethayarajh, 2019). BERT can either create one vector for an input sentence (sentence embedding) or one vector for each input token (token embedding).[2]

Pre-trained BERT models are available for many languages. In this task, we used the bert-base-italian-xxl-cased model[3] to create token embeddings for Italian.

The basic BERT version is transformer-based and processes text in 12 layers. In each layer, a contextualized token vector representation can be created for each word in an input sentence. It has been claimed that each layer captures different aspects of the input: Jawahar et al. (2019) suggest that the lower layers capture surface features, the middle layers syntactic features, and the higher layers semantic features of the text. Each layer can serve as the representation of the corresponding token by itself, or in combination with other layers.

[2] The code of our system is available at https://github.com/Garrafao/TokenChange.
[3] https://huggingface.co/dbmdz/bert-base-italian-xxl-cased

4.2 Average Pairwise Distance

Given two sets of token vectors from two time periods t1 and t2, the idea of Average Pairwise Distance (APD) is to randomly pick a number of vectors from both sets and measure their pairwise distances (Sagi et al., 2009; Schlechtweg et al., 2018; Giulianelli et al., 2020; Beck, 2020; Kutuzov and Giulianelli, 2020b). The LSC score of a word is the average distance over all comparisons:

\[ \mathrm{APD}(V, W) = \frac{1}{n_V \cdot n_W} \sum_{v \in V,\, w \in W} d(v, w) \]

where V and W are the two sets of vectors, n_V and n_W denote the numbers of vectors to be compared, and d(v, w) is a distance measure; we use cosine distance (Salton and McGill, 1983).
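To illustrate the two components above, the following Python sketch shows how contextualized token vectors for a target word occurrence can be extracted from the Italian BERT model with a chosen layer combination, and how APD with cosine distance can be computed over two sets of such vectors. This is a minimal sketch under stated assumptions, not the released TokenChange implementation: the function names (token_vector, apd), the character-span interface, and the convention of treating the first transformer layer (rather than the embedding layer) as the "first" layer are ours; the model name is the one given in footnote [3], and the sketch assumes the transformers, torch and scipy packages.

import random
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def token_vector(sentence, target_span, layers=(1, 12)):
    """Return one vector for the word at character span `target_span`,
    averaged over the chosen hidden layers and over the word's
    WordPiece sub-tokens (layer 0 is the embedding layer)."""
    enc = tokenizer(sentence, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).hidden_states  # embedding layer + 12 layers
    start, end = target_span
    # Keep the sub-tokens whose character offsets overlap the target word.
    keep = [i for i, (s, e) in enumerate(offsets)
            if e > s and s < end and e > start]
    assert keep, "target span not found in tokenization"
    layer_avg = torch.stack([hidden[l][0] for l in layers]).mean(dim=0)
    return layer_avg[keep].mean(dim=0)

def apd(vectors1, vectors2, n_pairs=None):
    """Average Pairwise Distance between two sets of token vectors,
    using cosine distance; optionally subsample the pairs."""
    pairs = [(v, w) for v in vectors1 for w in vectors2]
    if n_pairs is not None:
        pairs = random.sample(pairs, min(n_pairs, len(pairs)))
    return sum(cosine(v.numpy(), w.numpy()) for v, w in pairs) / len(pairs)

# Toy example with one use of "palmare" per time period:
s1 = "Con evidenza palmare, la impossibilità di difendere una causa perduta."
s2 = "Potremmo citare il palmare Apple Newton, troppo in anticipo sui tempi."
v1 = token_vector(s1, (13, 20))  # character span of "palmare" in s1
v2 = token_vector(s2, (19, 26))  # character span of "palmare" in s2
print(apd([v1], [v2]))

For the "Last Four" combination one would pass layers=(9, 10, 11, 12) instead; in the setup described in Section 4.4, up to 200 uses per target word are embedded for each corpus before APD is computed over the two resulting vector sets.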
4.3 Tuning

The choice of BERT layers and the measure used to compare the resulting vectors (e.g., APD, COS or clustering) strongly influences performance (Kutuzov and Giulianelli, 2020a). Hence, we tuned these parameters and modules on the English SemEval data (Schlechtweg et al., 2020). For the 40 English target words we had access to the sentences that were used for the human annotation (in contrast to task participants, who only had access to the lemmatized larger corpora containing more target word uses than just the annotated ones).

We tested several change measures with respect to their ability to find the actually changing words. In our tuning, the APD measure produced the binary and graded LSC scores that best matched the gold LSC scores. We also tested token vectors from different layers in order to check which fits our task best. The best layer combinations were the average of the last four layers and the average of the first and last layer of BERT. The highest F1-score for the binary subtask was .75, with a Spearman correlation of .65 for the graded subtask. These results outperform all official submissions to the shared task, the best of which were all type-based.

4.4 Threshold Selection

We created four predicted change rankings for the target words with BERT+APD. Based on experience and consideration of the shared tasks (Schlechtweg et al., 2020), we assumed that at most half of all target words actually changed; therefore, we always labeled at most 9 of the 18 words with '1'. First, we extracted for each target word a maximum of 200 sentences that contain the word in any token form; we limited the number of uses to 200 for reasons of computational efficiency. Then, for each occurrence, we extracted and averaged the token vectors of (i) the last four layers of BERT and (ii) the first and last layer. For our first submission ('Last Four, 7') we labeled with '1' those 7 words that achieved the highest APD scores in layer combination (i). For our second submission ('First + Last, 7') we labeled with '1' those 7 words that achieved the highest APD scores in layer combination (ii). In (i) and (ii) the same 9 words had the highest APD scores; therefore, in our third submission ('Average, 9') exactly these 9 words were labeled with '1'. For our last submission ('Lemma, Average, 6') we extracted only sentences in which the target words occur in their lemma form. Again we created token vectors for the two layer combinations of BERT mentioned above. In both layer combinations the same 6 words had the highest APD scores; therefore, in this submission exactly these 6 words were labeled with '1' (similarly to submission 1).
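The labeling step itself can be summarized in a few lines. The sketch below is illustrative only: the function name and the APD scores are invented, not system output; it simply assigns '1' to the k target words with the highest APD scores and '0' to the rest.

def binary_labels(apd_scores, k):
    """apd_scores: dict mapping target word -> APD score between the two
    time periods. Returns a dict mapping target word -> 0/1, with '1'
    assigned to the k words with the highest APD scores."""
    ranked = sorted(apd_scores, key=apd_scores.get, reverse=True)
    changed = set(ranked[:k])
    return {word: int(word in changed) for word in apd_scores}

# Illustration with invented scores for three of the 18 target words:
scores = {"cappuccio": 0.41, "campionato": 0.35, "palmare": 0.18}
print(binary_labels(scores, k=2))
# {'cappuccio': 1, 'campionato': 1, 'palmare': 0}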
5 Results

Table 1 shows the accuracy scores for the different submissions. The best result was achieved by combining the first and last layer of BERT ('First + Last, 7', with .72), just as on the SemEval data. The second-best result was obtained by using only the sentences in which the target word occurred in its lemma form ('Lemma, Average, 6', with .67). Only these two submissions outperformed the task baselines and the majority class baseline. The two lowest results were achieved by combining the last four layers of BERT ('Last Four, 7', with .61) and by averaging the two layer combinations ('Average, 9', with .61). The accuracy of our best submission (.72) was ranked at position 5 in the shared task, where the best result was achieved by two different submissions and reached an accuracy of .94. Both of these submissions were based on type-based embeddings (Pražák et al., 2020a; Kaiser et al., 2020), clearly outperforming our system.

Submission                 Thresh.  Acc.
First + Last               7        .72
Lemma, Average             6        .67
Majority Class Baseline    -        .66
Average                    9        .61
Last Four                  7        .61
Collocations Baseline      -        .61
Frequency Baseline         -        .61

Table 1: Overview of accuracy scores for the four submissions and the official task baselines. We also report a majority class baseline, i.e., a classifier predicting '0' for all target words.

6 Analysis

As mentioned above, the best performance of our system, achieved with 'First + Last, 7', has an accuracy of .72. It erroneously predicts a meaning change for cappuccio, unico and campionato, while for palmare and rampante it does not detect the change given by the gold standard.

We compared both corpora in order to find out whether the target words are correctly labeled by the gold standard, as well as to identify possible reasons for the wrong predictions of our model. According to our analysis, the data matches the gold standard. Cappuccio is polysemous across both time periods t0 and t1 ("hood", "cap"). However, 31% of its uses in t1 are uppercased, namely proper nouns (in contrast to 4% in t0), which might imply a different sense compared to the above-mentioned ones:

(1) BENEVENTO Il desiderio di il potere , il potere di il desiderio : ruota intorno a questo inquietante ( e attualissimo ) spunto il Festival di Benevento diretto da Ruggero Cappuccio .
'BENEVENTO The desire for power, the power of desire: the Festival di Benevento directed by Ruggero Cappuccio revolves around this unsettling (and very topical) cue.'

This skewed distribution of proper names across the two corpora is a possible reason for the wrong prediction of our model.

Across all target words, we noticed that the context provided by the previous and the following sentences (as given as input to our model) is often not related topic-wise; in some instances the sentences appear to be headlines, since they refer to different topics:

(2) M ROMA Sono quindici gli articoli in cui è suddiviso il provvedimento « antiracket » [...]. Roberta Serra ha vinto ieri lo slalom gigante di il campionati italiani femminili .
'M ROMA The «antiracket» measure is divided into fifteen articles [...]. Roberta Serra won the giant slalom of the Italian women's championships yesterday.'

(3) ... le uniche azioni pericolose fiorentine sono arrivate quando il pallone e statu giocato su i lati di il Campo . costruzione di centrali idroelettriche , di miniere , canali e strade ...
'...the only dangerous Florentine actions arrived when the ball was played on the sides of the field. Construction of hydroelectric power plants, mines, canals and roads...'

This "headlines effect" occurs across the whole corpus. It can be traced back to the extraction process of the original corpus and may be a main source of error for our model. Although not representative, the following example shows that in some cases no centered window of any size would avoid considering unrelated context.

(4) REPARTO CONFEZIONI UOMO GIACCA cameriere bianca , in tessuto L' unica cosa certa è che il governo ha ricevuto una dura lezione da i professori .
'MEN'S TAILORING DEPARTMENT white textile waiter JACKET The only certain thing is that the government has received a hard lesson from the professors.'

Unico is another example of a word that was erroneously predicted as changing. Due to its abstract meaning ("only", "single", "unique"), it exhibits heterogeneous contexts across both time periods. Additionally, it can belong to different word classes (noun and adjective in (5) and (6), respectively).

(5) Rischiamo di rimanere gli unici a non aver dato mano a la ristrutturazione di le Forze Armate .
'We risk remaining the only ones not having helped in the reorganization of the Armed Forces.'

(6) ... è chiaro che l' unica cosa da fare sarebbe l' unificazione di le due aziende comunali ...
'...it is clear that the only thing to do would be the unification of the two municipal companies...'
With regard to the undetected changes, the term palmare (polysemous within and across word classes) acquires a novel sense in t1. While it mostly has the meaning "evident" in the 22 sentences of t0 (see (7)), it additionally denotes "palmtop" in t1 (see (8)).

(7) ... con evidenza palmare , la impossibilità di difendere una causa perduta ...
'with undeniable evidence, the impossibility of defending a lost cause'

(8) Per i palestinesi occorre una sistemazione provvisoria in attesa che gli europei si accordino per accoglier li . Potremmo citare in il lungo elenco il palmare Apple Newton troppo in anticipo su i tempi
'A temporary arrangement is needed for the Palestinians while waiting for the Europeans to agree on hosting them. We could mention in the long list the palmtop Apple Newton, far too ahead of its time'

Note that also in (8) the topic of the previous sentence and the target sentence is unrelated.

Rampante is a further case of undetected change. The phrase cavallino rampante, which metonymically denotes "Ferrari", dominates the usage of the word in t0 (70%) and still covers a relevant share of the uses in t1 (19%). We hypothesize that this leads to a large number of homogeneous usage pairs masking the change of rampante from "rampant", "unbridled" to "extremely ambitious".

7 Conclusion

Our system, combining BERT and APD, was ranked 5th in the DIACR-Ita shared task. It did not perform as well as expected and scored much lower than the best type-based embeddings, but our best submission still outperformed all baselines. The high tuning results achieved on the SemEval data could not be transferred to the Italian data. One reason may be that a different BERT model was applied, trained on text in a different language; we did not tune the Italian BERT model, so the decrease in performance may be due to the change of the underlying BERT model. Furthermore, given that our model also considers the previous and the following sentences as input, the presence of semantically unrelated context could have played a significant role in mislabeling the target words.

Acknowledgments

Dominik Schlechtweg was supported by the Konrad Adenauer Foundation and the CRETA center funded by the German Ministry for Education and Research (BMBF) during the conduct of this study. We thank the task organizers and reviewers for their efforts.
References

Ehsaneddin Asgari, Christoph Ringlstetter, and Hinrich Schütze. 2020. EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA 2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christin Beck. 2020. DiaSense at SemEval-2020 Task 1: Modeling sense change via pre-trained BERT embeddings. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online, July. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July. Association for Computational Linguistics.

Jens Kaiser, Dominik Schlechtweg, and Sabine Schulte im Walde. 2020. OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still rocks Semantic Change Detection. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Andrey Kutuzov and Mario Giulianelli. 2020a. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov and Mario Giulianelli. 2020b. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: A survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova. 2020. Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings not Always Better Than Static for Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium, October–November. Association for Computational Linguistics.

Ondřej Pražák, Pavel Přibáň, and Stephen Taylor. 2020a. UWB @ DIACR-Ita: Lexical Semantic Change Detection with CCA and Orthogonal Transformation. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Ondřej Pražák, Pavel Přibáň, Stephen Taylor, and Jakub Sido. 2020b. UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104–111, Athens, Greece, March. Association for Computational Linguistics.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York.

Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, New Orleans, Louisiana, USA.

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2019. Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 66–76, Hong Kong, China. Association for Computational Linguistics.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change. arXiv:1811.06278.