CL-IMS @ DIACR-Ita: Volente o Nolente: BERT does not Outperform SGNS on Semantic Change Detection

Severin Laicher, Gioia Baldissin, Enrique Castañeda, Dominik Schlechtweg, Sabine Schulte im Walde
Institute for Natural Language Processing, University of Stuttgart
{laichesn,baldisga,medinaeo,schlecdk,schulte}@ims.uni-stuttgart.de*

* Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit the Average Pairwise Distance of token-based BERT embeddings between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not translate to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

1 Introduction

Lexical Semantic Change (LSC) Detection has drawn increasing attention in the past years (Kutuzov et al., 2018; Tahmasebi et al., 2018). Recently, SemEval-2020 Task 1 provided a multilingual evaluation framework to compare the variety of proposed model architectures (Schlechtweg et al., 2020). The DIACR-Ita shared task extends parts of this framework to Italian by providing an Italian data set for SemEval's binary subtask (Basile et al., 2020a; Basile et al., 2020b). We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit the Average Pairwise Distance of token-based BERT embeddings (Devlin et al., 2019) between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not transfer to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

2 Related Work

Most existing approaches to LSC detection are type-based (Schlechtweg et al., 2019; Shoemark et al., 2019): word occurrences are not considered individually (token-based); instead, a single vector representation is created that summarizes every occurrence of a word (including ambiguous words). The results of SemEval-2020 Task 1 (Martinc et al., 2020; Schlechtweg et al., 2020) showed that type-based approaches (Pražák et al., 2020b; Asgari et al., 2020) achieved better results than token-based approaches (Beck, 2020; Kutuzov and Giulianelli, 2020a). This is somewhat surprising, since in recent years contextualized token-based approaches have achieved significant improvements over static type-based approaches in several NLP tasks (Ethayarajh, 2019). Schlechtweg et al. (2020) suggest a range of possible reasons for this: (i) Contextualized embeddings are new and lack established usage conventions. (ii) They are pre-trained and may thus carry additional, and possibly irrelevant, information. (iii) The context of word uses in the SemEval data set was too narrow (one sentence). (iv) The SemEval corpora were lemmatized, while token-based models usually take the raw sentence as input. In the DIACR-Ita challenge, (iii) and (iv) do not apply, because raw corpora with sufficient context are made available to participants. We tried to tackle (i) by extensively tuning parameters and system modules on the English SemEval data set. (ii) could be tackled by fine-tuning BERT on the target corpora; however, our experiments on the English SemEval data set show that exceptionally high performance can be reached even without fine-tuning.
3 Experimental setup

The DIACR-Ita task definition is taken from SemEval-2020 Task 1 Subtask 1 (binary change detection): given a list of target words and a diachronic corpus pair C1 and C2, the task is to identify the target words which have changed their meaning between the respective time periods t1 and t2 (Basile et al., 2020a; Schlechtweg et al., 2020).[1] C1 and C2 were extracted from Italian newspapers and books. Target words which have changed their meaning are labeled with the value '1'; the remaining target words are labeled with '0'. Gold data for the 18 target words was semi-automatically generated from Italian online dictionaries. According to the gold data, 6 of the 18 target words are subject to semantic change between t1 and t2. The gold data was only made public after the evaluation phase. During the evaluation phase, each team was allowed to submit up to 4 predictions for the full list of target words, which were scored using classification accuracy between the predicted labels and the gold data. The final competition ranking compares only the highest score achieved by each team.

[1] The time periods t1 and t2 were not disclosed to participants.

4 System Overview

Our model uses BERT to create token vectors and the Average Pairwise Distance to compare the token vectors from the two time periods. The following sections present our model, how we trained it, and how we chose our submissions.

4.1 BERT

In 2018, Google released a model pre-trained on Wikipedia and books of different genres (Devlin et al., 2019): BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to find representations for text by analysing its left and right contexts (Devlin et al., 2019). Peters et al. (2018) show that contextual word representations derived from pre-trained bidirectional language models like BERT and ELMo yield significant improvements to the state of the art for a wide range of NLP tasks. BERT can be used to analyse the semantics of individual words by creating contextualized word representations, i.e., vectors that are sensitive to the context in which the words appear (Ethayarajh, 2019). BERT can either create one vector for an input sentence (sentence embedding) or one vector for each input token (token embedding).[2]

Pre-trained BERT models are available for many languages. In this task, we used the bert-base-italian-xxl-cased model[3] to create token embeddings for Italian.

The basic BERT version is transformer-based and processes text in 12 layers. In each layer, a contextualized token vector representation can be created for each word in an input sentence. It has been claimed that each layer captures different aspects of the input: Jawahar et al. (2019) suggest that the lower layers capture surface features, the middle layers syntactic features, and the higher layers semantic features of the text. Each layer can serve as the representation of the corresponding token by itself, or in combination with other layers.

[2] The code of our system is available at https://github.com/Garrafao/TokenChange.
[3] https://huggingface.co/dbmdz/bert-base-italian-xxl-cased

4.2 Average Pairwise Distance

Given two sets of token vectors from two time periods t1 and t2, the idea of Average Pairwise Distance (APD) is to randomly pick a number of vectors from both sets and measure their pairwise distances (Sagi et al., 2009; Schlechtweg et al., 2018; Giulianelli et al., 2020; Beck, 2020; Kutuzov and Giulianelli, 2020b). The LSC score of a word is the average distance over all comparisons:

\[ \mathrm{APD}(V, W) = \frac{1}{n_V \cdot n_W} \sum_{v \in V,\, w \in W} d(v, w) \]

where V and W are the two sets of vectors, n_V and n_W denote the numbers of vectors to be compared, and d(v, w) is a distance measure; we use cosine distance (Salton and McGill, 1983).
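To illustrate the two components above, the following Python sketch shows how contextualized token vectors for a target word occurrence can be extracted from the Italian BERT model with a chosen layer combination, and how APD with cosine distance can be computed over two sets of such vectors. This is a minimal sketch under stated assumptions, not the released TokenChange implementation: the function names (token_vector, apd), the character-span interface, and the convention of treating the first transformer layer (rather than the embedding layer) as the "first" layer are ours; the model name is the one given in footnote [3], and the sketch assumes the transformers, torch and scipy packages.

import random
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def token_vector(sentence, target_span, layers=(1, 12)):
    """Return one vector for the word at character span `target_span`,
    averaged over the chosen hidden layers and over the word's
    WordPiece sub-tokens (layer 0 is the embedding layer)."""
    enc = tokenizer(sentence, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).hidden_states  # embedding layer + 12 layers
    start, end = target_span
    # Keep the sub-tokens whose character offsets overlap the target word.
    keep = [i for i, (s, e) in enumerate(offsets)
            if e > s and s < end and e > start]
    assert keep, "target span not found in tokenization"
    layer_avg = torch.stack([hidden[l][0] for l in layers]).mean(dim=0)
    return layer_avg[keep].mean(dim=0)

def apd(vectors1, vectors2, n_pairs=None):
    """Average Pairwise Distance between two sets of token vectors,
    using cosine distance; optionally subsample the pairs."""
    pairs = [(v, w) for v in vectors1 for w in vectors2]
    if n_pairs is not None:
        pairs = random.sample(pairs, min(n_pairs, len(pairs)))
    return sum(cosine(v.numpy(), w.numpy()) for v, w in pairs) / len(pairs)

# Toy example with one use of "palmare" per time period:
s1 = "Con evidenza palmare, la impossibilità di difendere una causa perduta."
s2 = "Potremmo citare il palmare Apple Newton, troppo in anticipo sui tempi."
v1 = token_vector(s1, (13, 20))  # character span of "palmare" in s1
v2 = token_vector(s2, (19, 26))  # character span of "palmare" in s2
print(apd([v1], [v2]))

For the "Last Four" combination one would pass layers=(9, 10, 11, 12) instead; in the setup described in Section 4.4, up to 200 uses per target word are embedded for each corpus before APD is computed over the two resulting vector sets.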
4.3 Tuning

The choice of BERT layers and the measure used to compare the resulting vectors (e.g., APD, COS or clustering) strongly influences performance (Kutuzov and Giulianelli, 2020a). Hence, we tuned these parameters and modules on the English SemEval data (Schlechtweg et al., 2020). For the 40 English target words we had access to the sentences that were used for the human annotation (in contrast to task participants, who only had access to the lemmatized larger corpora containing more target word uses than just the annotated ones).

We tested several change measures with respect to their ability to find the actually changing words. In our tuning, the APD measure produced the binary and graded LSC scores that best matched the gold LSC scores. We also tested token vectors from different layers in order to check which fits our task best. The best layer combinations were the average of the last four layers and the average of the first and last layer of BERT. The highest F1-score for the binary subtask was .75, with a Spearman correlation of .65 for the graded subtask. These results outperform all official submissions to the shared task, the best of which were all type-based.

4.4 Threshold Selection

We created four predicted change rankings for the target words with BERT+APD. Based on experience and consideration of the shared tasks (Schlechtweg et al., 2020), we assumed that at most half of all target words actually changed; therefore, we always labeled at most 9 of the 18 words with '1'. First, we extracted for each target word a maximum of 200 sentences that contain the word in any token form; we limited the number of uses to 200 for reasons of computational efficiency. Then, for each occurrence, we extracted and averaged the token vectors of (i) the last four layers of BERT and (ii) the first and last layer. For our first submission ('Last Four, 7') we labeled with '1' those 7 words that achieved the highest APD scores in layer combination (i). For our second submission ('First + Last, 7') we labeled with '1' those 7 words that achieved the highest APD scores in layer combination (ii). In (i) and (ii) the same 9 words had the highest APD scores; therefore, in our third submission ('Average, 9') exactly these 9 words were labeled with '1'. For our last submission ('Lemma, Average, 6') we extracted only sentences in which the target words occur in their lemma form. Again we created token vectors for the two layer combinations of BERT mentioned above. In both layer combinations the same 6 words had the highest APD scores; therefore, in this submission exactly these 6 words were labeled with '1' (similarly to submission 1).
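The labeling step itself can be summarized in a few lines. The sketch below is illustrative only: the function name and the APD scores are invented, not system output; it simply assigns '1' to the k target words with the highest APD scores and '0' to the rest.

def binary_labels(apd_scores, k):
    """apd_scores: dict mapping target word -> APD score between the two
    time periods. Returns a dict mapping target word -> 0/1, with '1'
    assigned to the k words with the highest APD scores."""
    ranked = sorted(apd_scores, key=apd_scores.get, reverse=True)
    changed = set(ranked[:k])
    return {word: int(word in changed) for word in apd_scores}

# Illustration with invented scores for three of the 18 target words:
scores = {"cappuccio": 0.41, "campionato": 0.35, "palmare": 0.18}
print(binary_labels(scores, k=2))
# {'cappuccio': 1, 'campionato': 1, 'palmare': 0}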
5 Results

Table 1 shows the accuracy scores for the different submissions. The best result was achieved by combining the first and last layer of BERT ('First + Last, 7', with .72), just as on the SemEval data. The second-best result was obtained by using only the sentences in which the target word occurred in its lemma form ('Lemma, Average, 6', with .67). Only these two submissions outperformed the task baselines and the majority class baseline. The two lowest results were achieved by combining the last four layers of BERT ('Last Four, 7', with .61) and by averaging the two layer combinations ('Average, 9', with .61). The accuracy of our best submission (.72) was ranked at position 5 in the shared task, where the best result was achieved by two different submissions and reached an accuracy of .94. Both of these submissions were based on type-based embeddings (Pražák et al., 2020a; Kaiser et al., 2020), clearly outperforming our system.

Submission                 Thresh.  Acc.
First + Last               7        .72
Lemma, Average             6        .67
Majority Class Baseline    -        .66
Average                    9        .61
Last Four                  7        .61
Collocations Baseline      -        .61
Frequency Baseline         -        .61

Table 1: Overview of accuracy scores for the four submissions and the official task baselines. We also report a majority class baseline, i.e., a classifier predicting '0' for all target words.

6 Analysis

As mentioned above, the best performance of our system, achieved with 'First + Last, 7', has an accuracy of .72. It erroneously predicts a meaning change for cappuccio, unico and campionato, while for palmare and rampante it does not detect the change given by the gold standard.

We compared both corpora in order to find out whether the target words are correctly labeled by the gold standard, as well as to identify possible reasons for the wrong predictions of our model. According to our analysis, the data matches the gold standard. Cappuccio is polysemous across both time periods t0 and t1 ("hood", "cap"). However, 31% of its uses in t1 are uppercased, namely proper nouns (in contrast to 4% in t0), which might imply a different sense compared to the above-mentioned ones:

(1) BENEVENTO Il desiderio di il potere , il potere di il desiderio : ruota intorno a questo inquietante ( e attualissimo ) spunto il Festival di Benevento diretto da Ruggero Cappuccio .
'BENEVENTO The desire for power, the power of desire: the Festival di Benevento directed by Ruggero Cappuccio revolves around this unsettling (and very topical) cue.'

This skewed distribution of proper names across the two corpora is a possible reason for the wrong prediction of our model.

Across all target words, we noticed that the context provided by the previous and the following sentences (as given as input to our model) is often not related topic-wise; in some instances the sentences appear to be headlines, since they refer to different topics:

(2) M ROMA Sono quindici gli articoli in cui è suddiviso il provvedimento « antiracket » [...]. Roberta Serra ha vinto ieri lo slalom gigante di il campionati italiani femminili .
'M ROMA The «antiracket» measure is divided into fifteen articles [...]. Roberta Serra won the giant slalom of the Italian women's championships yesterday.'

(3) ... le uniche azioni pericolose fiorentine sono arrivate quando il pallone e statu giocato su i lati di il Campo . costruzione di centrali idroelettriche , di miniere , canali e strade ...
'...the only dangerous Florentine actions arrived when the ball was played on the sides of the field. Construction of hydroelectric power plants, mines, canals and roads...'

This "headlines effect" occurs across the whole corpus. It can be traced back to the extraction process of the original corpus and may be a main source of error for our model. Although not representative, the following example shows that in some cases no centered window of any size would avoid considering unrelated context.

(4) REPARTO CONFEZIONI UOMO GIACCA cameriere bianca , in tessuto L' unica cosa certa è che il governo ha ricevuto una dura lezione da i professori .
'MEN'S TAILORING DEPARTMENT white textile waiter JACKET The only certain thing is that the government has received a hard lesson from the professors.'

Unico is another example of a word that was erroneously predicted as changing. Due to its abstract meaning ("only", "single", "unique"), it exhibits heterogeneous contexts across both time periods. Additionally, it can belong to different word classes (noun and adjective in (5) and (6), respectively).

(5) Rischiamo di rimanere gli unici a non aver dato mano a la ristrutturazione di le Forze Armate .
'We risk remaining the only ones not having helped in the reorganization of the Armed Forces.'

(6) ... è chiaro che l' unica cosa da fare sarebbe l' unificazione di le due aziende comunali ...
'...it is clear that the only thing to do would be the unification of the two municipal companies...'
With regard to the undetected changes, the term palmare (polysemous within and across word classes) acquires a novel sense in t1. While it mostly has the meaning "evident" in the 22 sentences of t0 (see (7)), it additionally denotes "palmtop" in t1 (see (8)).

(7) ... con evidenza palmare , la impossibilità di difendere una causa perduta ...
'with undeniable evidence, the impossibility of defending a lost cause'

(8) Per i palestinesi occorre una sistemazione provvisoria in attesa che gli europei si accordino per accoglier li . Potremmo citare in il lungo elenco il palmare Apple Newton troppo in anticipo su i tempi
'A temporary arrangement is needed for the Palestinians while waiting for the Europeans to agree on hosting them. We could mention in the long list the palmtop Apple Newton, far too ahead of its time'

Note that also in (8) the topic of the previous sentence and the target sentence is unrelated.

Rampante is a further case of undetected change. The phrase cavallino rampante, which metonymically denotes "Ferrari", dominates the usage of the word in t0 (70%) and still covers a relevant share of the uses in t1 (19%). We hypothesize that this leads to a large number of homogeneous usage pairs masking the change of rampante from "rampant", "unbridled" to "extremely ambitious".

7 Conclusion

Our system, combining BERT and APD, was ranked 5th in the DIACR-Ita shared task. It did not perform as well as expected and scored much lower than the best type-based embeddings, but our best submission still outperformed all baselines. The high tuning results achieved on the SemEval data could not be transferred to the Italian data. One reason may be that a different BERT model was applied, trained on text in a different language; we did not tune the Italian BERT model, so the decrease in performance may be due to the change of the underlying BERT model. Furthermore, given that our model also considers the previous and the following sentences as input, the presence of semantically unrelated context could have played a significant role in mislabeling the target words.

Acknowledgments

Dominik Schlechtweg was supported by the Konrad Adenauer Foundation and the CRETA center funded by the German Ministry for Education and Research (BMBF) during the conduct of this study. We thank the task organizers and reviewers for their efforts.
References

Ehsaneddin Asgari, Christoph Ringlstetter, and Hinrich Schütze. 2020. EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA 2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christin Beck. 2020. DiaSense at SemEval-2020 Task 1: Modeling sense change via pre-trained BERT embeddings. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online, July. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July. Association for Computational Linguistics.

Jens Kaiser, Dominik Schlechtweg, and Sabine Schulte im Walde. 2020. OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still rocks Semantic Change Detection. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Andrey Kutuzov and Mario Giulianelli. 2020a. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov and Mario Giulianelli. 2020b. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: A survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova. 2020. Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings not Always Better Than Static for Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium, October–November. Association for Computational Linguistics.

Ondřej Pražák, Pavel Přibáň, and Stephen Taylor. 2020a. UWB @ DIACR-Ita: Lexical Semantic Change Detection with CCA and Orthogonal Transformation. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Ondřej Pražák, Pavel Přibáň, Stephen Taylor, and Jakub Sido. 2020b. UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104–111, Athens, Greece, March. Association for Computational Linguistics.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York.

Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, New Orleans, Louisiana, USA.

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2019. Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 66–76, Hong Kong, China. Association for Computational Linguistics.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change. arXiv:1811.06278.