OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still Rocks Semantic Change Detection

Jens Kaiser, Dominik Schlechtweg, Sabine Schulte im Walde
Institute for Natural Language Processing, University of Stuttgart
{jens.kaiser,schlecdk,schulte}@ims.uni-stuttgart.de

Abstract

We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit one of the earliest and most influential semantic change detection models based on Skip-Gram with Negative Sampling, Orthogonal Procrustes alignment and Cosine Distance and obtain the winning submission of the shared task with near to perfect accuracy (.94). Our results once more indicate that, within the present task setup in lexical semantic change detection, the traditional type-based approaches yield excellent performance.

1 Introduction

Lexical Semantic Change (LSC) Detection has drawn increasing attention in recent years (Kutuzov et al., 2018; Tahmasebi et al., 2018). Recently, SemEval-2020 Task 1 provided a multilingual evaluation framework to compare the variety of proposed model architectures (Schlechtweg et al., 2020). The DIACR-Ita shared task extends parts of this framework to Italian by providing an Italian data set for SemEval's binary subtask (Basile et al., 2020a; Basile et al., 2020b).

We present the results of our participation in the DIACR-Ita shared task exploiting one of the earliest and most established semantic change detection models based on Skip-Gram with Negative Sampling, Orthogonal Procrustes alignment and Cosine Distance (Hamilton et al., 2016a). Based on our previous research (Schlechtweg et al., 2019; Kaiser et al., 2020) we optimize the dimensionality parameter, assuming that high dimensionalities reduce alignment error. With our setting we win the shared task with near to perfect accuracy (.94). Our results once more demonstrate that, within the present task setup in lexical semantic change detection, the traditional type-based approaches yield excellent performance.

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

2 Related Work

As evident in Schlechtweg et al. (2020), the field of LSCD is currently dominated by Vector Space Models (VSMs), which can be divided into type-based (Turney and Pantel, 2010) and token-based (Schütze, 1998) models. Prominent type-based models include low-dimensional embeddings such as Global Vectors (Pennington et al., 2014, GloVe), the Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models, as well as a slight modification of the latter, the Skip-gram with Negative Sampling model (Mikolov et al., 2013a; Mikolov et al., 2013b, SGNS). However, as these models come with the deficiency that they aggregate all senses of a word into a single representation, token-based embeddings have been proposed (Peters et al., 2018; Devlin et al., 2019). According to Hu et al. (2019), these models can ideally capture complex characteristics of word use, and how they vary across linguistic contexts. The results of SemEval-2020 Task 1 (Schlechtweg et al., 2020), however, show that, contrary to this, the token-based embedding models (Beck, 2020; Kutuzov and Giulianelli, 2020) are heavily outperformed by the type-based ones (Pražák et al., 2020; Asgari et al., 2020). The SGNS model was not only widely used, but also performed best among the participants in the task. Its fast implementation and combination possibilities with different alignment types further solidify SGNS as the standard in LSCD. A common and surprisingly robust (Schlechtweg et al., 2019; Kaiser et al., 2020) practice is to align the time-specific SGNS embeddings with Orthogonal Procrustes (OP) and measure change with Cosine Distance (CD) (Kulkarni et al., 2015; Hamilton et al., 2016b).
This has been shown in several small but independent experiments (Hamilton et al., 2016b; Schlechtweg et al., 2019; Kaiser et al., 2020; Shoemark et al., 2019), and SGNS+OP+CD has produced two of the three top-performing submissions in Subtask 2 of SemEval-2020 Task 1, including the winning submission (Pömsl and Lyapin, 2020; Arefyev and Zhikov, 2020).

3 System overview

Most VSMs in LSC detection combine three subsystems: (i) creating semantic word representations, (ii) aligning them across corpora, and (iii) measuring differences between the aligned representations (Schlechtweg et al., 2019). Alignment is needed as columns from different vector spaces may not correspond to the same coordinate axes, due to the stochastic nature of many low-dimensional word representations (Hamilton et al., 2016b). Following the above-described success, we use SGNS to create word representations in combination with Orthogonal Procrustes (OP) for vector space alignment and Cosine Distance (CD) (Salton and McGill, 1983) to measure differences between word vectors. From the resulting graded change predictions we infer binary change values by comparing the target word distribution to the full distribution of change predictions between the target corpora. For our experiments we use the code provided by Schlechtweg et al. (2019).¹

3.1 Semantic Representation

SGNS is a shallow neural network trained on pairs of word co-occurrences extracted from a corpus with a symmetric window. It represents each word w and each context c as a d-dimensional vector to solve

\arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w),

where \sigma(x) = \frac{1}{1+e^{-x}}, D is the set of all observed word-context pairs and D' is the set of randomly generated negative samples (Mikolov et al., 2013a; Mikolov et al., 2013b; Goldberg and Levy, 2014). The optimized parameters \theta are v_{w_i} and v_{c_i} for i \in 1, ..., d. D' is obtained by drawing k contexts from the empirical unigram distribution P(c) = \frac{\#(c)}{|D|} for each observation of (w,c), cf. Levy et al. (2015). After training, each word w is represented by its word vector v_w.

Previous research on the influence of parameter settings on SGNS+OP+CD lays the foundation for our parameter choices (Schlechtweg et al., 2019; Kaiser et al., 2020). Although this subsystem combination is extremely stable regardless of parameter settings, subtle improvements can be achieved by modifying the window size and dimensionality. A common hurdle in LSC detection is the small corpus size; increasing the standard setting for the window size from 5 to 10 leads to the creation of more word-context pairs used for training the model. In addition, we also experiment with dimensionalities of 300 and 500. Higher dimensionalities alleviate the introduction of noise during the alignment process (Kaiser et al., 2020). We keep the rest of the parameter settings at their default values (learning rate α=0.025, number of negative samples k=5 and sub-sampling t=0.001).

3.2 Alignment

SGNS is trained on each corpus separately, resulting in matrices A and B. To align them we follow Hamilton et al. (2016b) and calculate an orthogonally-constrained matrix W^*:

W^* = \arg\min_{W \in O(d)} \|BW - A\|_F,

where the i-th rows of matrices A and B correspond to the same word. Using W^* we get the aligned matrices A^{OP} = A and B^{OP} = BW^*. Prior to this alignment step we length-normalize and mean-center both matrices (Artetxe et al., 2017; Schlechtweg et al., 2019).

¹ https://github.com/Garrafao/LSCDetection
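As an illustration of this alignment step (a minimal NumPy sketch under our notation, not the LSCDetection code we actually use; the function name and preprocessing order are our own), the closed-form OP solution is obtained from the SVD of B^T A:

```python
import numpy as np

def align_op(A, B):
    """Align B to A with Orthogonal Procrustes after length-normalizing
    rows and mean-centering columns (cf. Artetxe et al., 2017)."""
    def preprocess(X):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # length-normalize rows
        return X - X.mean(axis=0, keepdims=True)          # mean-center columns
    A, B = preprocess(A), preprocess(B)
    # W* = argmin_{W in O(d)} ||BW - A||_F = U V^T, where U S V^T = SVD(B^T A)
    U, _, Vt = np.linalg.svd(B.T @ A)
    W = U @ Vt
    return A, B @ W  # A^OP, B^OP

# Sanity check: if B is a rotated copy of A, alignment should undo the rotation.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))  # random orthogonal matrix
A_op, B_op = align_op(A, A @ Q)
```

Because both preprocessing steps commute with an orthogonal rotation, the recovered W equals Q^T in this synthetic check and the two aligned matrices coincide.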
3.3 Threshold

The DIACR-Ita shared task requires a binary label for each of the target words. However, CD produces graded values between 0.0 and 2.0 when measuring differences in word vectors between the two time periods. We tackle this problem by defining a threshold parameter, similar to many approaches applied in SemEval-2020 Task 1 (Schlechtweg et al., 2020). All words with a CD greater than or equal to the threshold are labeled '1', indicating change. Words with a CD less than the threshold are assigned '0', indicating no change.

A simplified approach is to set the threshold such that the number of words is equal in both groups. This has many disadvantages: mainly, it relies on the assumption that the two groups are of equal size. This is rarely given in real-world applications, especially if the focus is on one word at a time. Thus a more sophisticated approach is needed. In SemEval-2020's Subtask 1 many participants faced the same problem and developed various methods to solve it. Similar to the simplified approach, Zhou and Li (2020) only look at target words and, after fitting the histogram of CDs to a gamma distribution, set the threshold at the 75% density quantile. This approach resulted in good performance but is not always applicable due to its dependence on underlying properties of the test set. Amar and Liebeskind (2020) avoid the dependence on target words by randomly selecting 200 words and setting the threshold such that 90% of the 200 words have a lower distance than the threshold. A more careful selection of words is taken by Martinc et al. (2020): they look at the CDs of semantically stable stop words, accumulate them in different bins and set the threshold to the upper limit of the bin containing fewer than #stopwords/#bins words. Pražák et al. (2020) propose several methods; one of them is setting the threshold at the mean of the distances of all words in the corpus vocabulary. Our method for determining a threshold is very similar to Pražák et al. (2020), but instead of taking the mean, we use the mean plus one standard deviation (µ+σ) of all words in the corpus vocabulary.

entry             dim   threshold      ACC    AP
#2                300   (µ+σ) .76      .944   .915
#4                500   (µ+σ) .78      .889   .915
#1                300   (50:50) .57    .833   .915
#3                500   (50:50) .64    .833   .915
major. baseline   -     -              .667   .333
freq. baseline    -     unk.           .611   .418
colloc. baseline  -     unk.           .500   unk.

Table 1: Accuracy (ACC) and Average Precision (AP) for various parameter settings, thresholds and baselines; freq. baseline: absolute frequency difference between the words in C1 and C2 and an unknown threshold; colloc. baseline: Bag of Words + CD and an unknown threshold; major. baseline: every word labeled with '0'.

4 Experimental setup

The DIACR-Ita task definition is taken from SemEval-2020 Task 1 Subtask 1 (binary change detection): Given a list of target words and a diachronic corpus pair C1 and C2, the task is to identify the respective target words which have changed their meaning between the time periods t1 and t2 (Basile et al., 2020a; Schlechtweg et al., 2020).² C1 and C2 have been extracted from Italian newspapers and books. Target words which have changed their meaning are labeled with the value '1', the remaining target words are labeled with '0'. Gold data for the 18 target words is semi-automatically generated from Italian online dictionaries. According to the gold data, 6 of the 18 target words are subject to semantic change between t1 and t2. This gold data was only made public after the evaluation phase. During the evaluation phase each team was allowed to submit 4 predictions for the full list of target words, which were scored using classification accuracy between the predicted labels and the gold data. The final competition ranking compares only the highest of the 4 scores achieved by each team.

² The time periods t1 and t2 were not disclosed to participants.

5 Results

We created target word rankings using SGNS+OP+CD with a dimensionality of 300 and 500 as described above. From these rankings our predictions are calculated using two different thresholding methods: (i) splitting the targets into two equally-sized groups (50:50) and (ii) using the mean plus one standard deviation (µ+σ) as threshold (see Section 3.3). The accuracy scores achieved in this way are listed in Table 1, alongside the official baselines freq. and colloc. and an additional major. baseline. Submission #2 is our highest scoring submission and won the DIACR-Ita task together with one other undisclosed submission. For both of our rankings the 50:50 threshold yielded lower accuracy than the µ+σ threshold. This is due to the imbalance of changed to unchanged target words in the test set. Using µ+σ as threshold resulted in an optimal split for the ranking created with d=300. For d=500 this threshold was slightly too high with a value of 0.78. The target word palmare, which according to the gold data has undergone semantic change (label '1'), has a CD of 0.76 and was thus incorrectly labeled by our system. Figure 1 shows the histogram of CD values for all words of the corpus vocabulary in gray. The green and red colored bars correspond to target words.

[Figure 1: (a) d=300, (b) d=500. Background shows histogram (in gray) of CDs for all words in the corpus vocabulary. The colored bars show the CDs of target words: green indicates that the target word was correctly labeled, red indicates incorrect labeling. The vertical line marks the threshold value (mean + standard deviation).]
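For concreteness, the µ+σ rule of Section 3.3 can be sketched as follows (an illustrative sketch with hypothetical names, not our actual submission code), assuming two row-aligned matrices of already-aligned time-specific vectors:

```python
import numpy as np

def binary_change(A_op, B_op, vocab, targets):
    """Label each target word 1 (changed) if its cosine distance between
    the two aligned spaces is >= mean + std of the CDs of all vocab words."""
    norms = np.linalg.norm(A_op, axis=1) * np.linalg.norm(B_op, axis=1)
    cds = 1.0 - np.sum(A_op * B_op, axis=1) / norms  # cosine distance per word
    threshold = cds.mean() + cds.std()               # mu + sigma
    index = {w: i for i, w in enumerate(vocab)}
    return {w: int(cds[index[w]] >= threshold) for w in targets}

# Toy example: ten words, one of which ("w0") flips direction (CD = 2.0).
vocab = ["w%d" % i for i in range(10)]
A = np.tile(np.array([1.0, 0.0]), (10, 1))
B = A.copy()
B[0] = [-1.0, 0.0]
labels = binary_change(A, B, vocab, ["w0", "w5"])
```

In the toy example the threshold lands at 0.8 (mean 0.2 plus standard deviation 0.6), so only the flipped word is labeled as changed.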
If the target word was correctly labeled the bar is green; incorrectly labeled target words have red bars. From this visualisation we can see that there is a pronounced gap between the CDs of target words which have changed and those which have not. Our proposed threshold method of µ+σ tends to slightly overshoot this gap. This has led to the lower accuracy of submission #4, despite the ranking allowing for a higher accuracy. In order to measure the quality of the rankings independently of the threshold, we also report AP (Shwartz et al., 2017) in Table 1, confirming the potentially equal performance.

The method of using the mean plus one standard deviation of the CDs of all words in the corpus vocabulary resulted in good accuracy, but leaves room for improvement. It tends to slightly overshoot the gap between unchanged and changed words. Only using the mean shifts the tendency towards undershooting the gap. The optimal threshold seems to lie somewhere in between. However, this needs to be confirmed on other, larger data sets. Furthermore, not all binary classification tasks are suitable for the approach of first creating a ranked list of graded change predictions and then choosing a threshold. The data set of SemEval-2020 Task 1 comprises two tasks, a binary and a ranked task for the same target words. It is not possible to achieve an accuracy of 1 on the binary task even if all the ranks are predicted correctly for the graded task, i.e., binary change is not just high graded change (Schlechtweg et al., 2020).

The one target word which our model labels incorrectly, across a variety of parameter settings, is piovra. According to the gold data this word has not undergone semantic change between t1 and t2, while our system labels it as changed. A possible explanation for the error may be differences in frequency: in C1 piovra appears 35 times and in C2 it appears 643 times. SGNS often struggles to create reliable embeddings for low-frequency words (Kaiser et al., 2020). Alternatively, the error could be caused by discrepancies between gold labels and corpora. Basile et al. (2020a) state that the gold data is initially based on Italian online dictionaries such as 'Sabatini Coletti'. In a manual annotation process the gold data is further refined by providing human judges with up to 100 occurrences of each target word, for which they have to identify the used meaning according to the meanings listed in the dictionaries. A target word is labeled as changed if a meaning is observed in C2 which has not been observed in C1. Although not very likely, it is possible that this annotation method fails to detect novel senses in C2. Sabatini Coletti reports that, in addition to the sense "squid", piovra acquired a new sense "a secret criminal organisation deeply rooted in society" in 1983. This might explain why we detect piovra as a word which has undergone semantic change, given that C1 comprises texts from 1948 to 1970 and C2 comprises texts from 1990 to 2014 (Basile et al., 2020a).

The DIACR-Ita task data set is a very valuable contribution to the research field of LSC detection and extends the variety of available data sets to the Italian language. Nonetheless, two points are important when interpreting results on this data set: (i) It contains a small number of target words in combination with binary classification. This makes the data set vulnerable to randomness. (ii) Regarding the nature of the gold labels, in addition to possibly not being directly related to the corpus, it is unclear if they reflect semantic change as sense gain and sense loss as in SemEval's Subtask 1. The online dictionaries which form the basis for the gold data only state sense gains. Thus, it might be possible for a word to completely lose a sense but still be labeled as unchanged.

6 Conclusion

We participated in the DIACR-Ita shared task using well-established type-based methods for diachronic semantic representations in combination with a carefully calculated threshold. We were able to reach the first place with a nearly perfect accuracy of .94, confirming once more the reliability of the type-based embeddings created by SGNS, OP as an alignment method and CD to measure differences between word vectors. The presented approach is very suitable for similar tasks as no fine-tuning of parameters is needed. Yet, the system relies on the assumption that graded change is indicative of binary classes.

Acknowledgments

Dominik Schlechtweg was supported by the Konrad Adenauer Foundation and the CRETA center funded by the German Ministry for Education and Research (BMBF) during the conduct of this study. We thank the task organizers and reviewers for their efforts.

References

Efrat Amar and Chaya Liebeskind. 2020. JCT at SemEval-2020 Task 1: Combined Semantic Vector Spaces Models for Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Nikolay Arefyev and Vasily Zhikov. 2020. BOS at SemEval-2020 Task 1: Word Sense Induction via Lexical Substitution for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 451–462. Association for Computational Linguistics.

Ehsaneddin Asgari, Christoph Ringlstetter, and Hinrich Schütze. 2020. EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christin Beck. 2020. DiaSense at SemEval-2020 Task 1: Modeling sense change via pre-trained BERT embeddings. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Yoav Goldberg and Omer Levy. 2014. Word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas. Association for Computational Linguistics.
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.

Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3899–3908, Florence, Italy. Association for Computational Linguistics.

Jens Kaiser, Dominik Schlechtweg, Sean Papay, and Sabine Schulte im Walde. 2020. IMS at SemEval-2020 Task 1: How low can you go? Dimensionality in Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, WWW, pages 625–635, Florence, Italy.

Andrey Kutuzov and Mario Giulianelli. 2020. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova. 2020. Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings not Always Better Than Static for Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, Lake Tahoe, Nevada, USA.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237, New Orleans, LA, USA.

Martin Pömsl and Roman Lyapin. 2020. CIRCE at SemEval-2020 Task 1: Ensembling Context-Free and Context-Dependent Word Representations. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Ondřej Pražák, Pavel Přibáň, Stephen Taylor, and Jakub Sido. 2020. UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York.

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123, March.

Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2019. Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 66–76, Hong Kong, China. Association for Computational Linguistics.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pages 65–75.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change. CoRR, abs/1811.06278.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188, January.

Jinan Zhou and Jiaxin Li. 2020. TemporalTeller at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection with Temporal Referencing. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.