University of Padova @ DIACR-Ita

Benyou Wang, Emanuele Di Buccio and Massimo Melucci
Department of Information Engineering, University of Padova, Padova, Italy
{wang,dibuccio,melo}@dei.unipd.it

Abstract

The semantic change detection task in a relatively low-resource language like Italian is challenging. By using contextualized word embeddings, we formalize the task as a distance metric for two flexible-size sets of vectors. Various distance metrics are used: the average Euclidean distance, the average Canberra distance, the Hausdorff distance, and the Jensen–Shannon divergence between cluster distributions obtained from K-means clustering and a Gaussian Mixture Model. The final prediction is given by an ensemble of the top-ranked words under each distance metric. The proposed method achieved better performance than the frequency and collocation based baselines.

1 Introduction

Lexical Semantic Change detection aims at identifying words that change meaning over time; this problem is of great interest for NLP, lexicography, and linguistics. A semantic change detection task in English, German, Latin, and Swedish was proposed by Schlechtweg et al. (2020). Recently, Basile et al. (2020a) organized a lexical semantic change detection task in Italian, called DIACR-Ita, at EVALITA 2020 (Basile et al., 2020b). This technical report describes the methodology designed and developed by the University of Padova for the participation in DIACR-Ita.

Some previous approaches to semantic change modelling were based on static word embeddings, where word vectors were trained on each time-stamped corpus and then aligned, e.g., by orthogonal projections (Hamilton et al., 2016), vector initialization (Kim et al., 2014), or temporal referencing (Dubossarsky et al., 2019). This work instead relies on contextualized word embeddings as the basic word representation component (Hu et al., 2019), since they have been shown to be effective in many NLP tasks, including document classification and question answering. Methods relying on contextualized word embeddings performed worse than those based on static word embeddings in semantic change detection tasks in many languages (Kutuzov and Giulianelli, 2020; Pömsl and Lyapin, 2020; Schlechtweg et al., 2020; Vani et al., 2020; Giulianelli et al., 2020; Giulianelli, 2019). However, in our opinion the use of contextualized word embeddings for this task is worth investigating, because (1) they have high expressive power, as demonstrated in many downstream tasks, e.g., document classification and question answering, and (2) they can handle fine-grained representations of individual contexts at the level of tokens.

With contextualized word embeddings, each word in a specific sentence is represented as a vector that depends on the neighboring words forming the context of the word; a word appearing many times in a corpus is therefore represented as a set of vectors, since one vector corresponds to each occurrence. In this paper, semantic change detection is addressed by computing the distance between two flexible-size sets of vectors derived from two time-stamped corpora. We investigated several distance metrics: the average Euclidean distance, the average Canberra distance, and the Hausdorff distance. Our methodology also applies a clustering algorithm (e.g., K-means clustering and the Gaussian Mixture Model) to the joint set and calculates the Jensen–Shannon divergence between the cluster distributions in the two sub-corpora. We aggregate the top-ranked words under each distance metric as the final prediction. The proposed method achieved better performance than the frequency and collocation based baselines, and finally ranked 8th among the 9 participating teams.
2 Problem definition

Unlike static word embeddings such as Word2vec (Mikolov et al., 2013) (see Wang et al. (2019) for an overview of word vectors), contextualized word embeddings like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) generate a word representation that depends on the context of the word, so that a word no longer has a unique mapping to a fixed word vector.

Let us denote a corpus with m sentences as C. In this paper, C is related to a time span t because of the task characteristics; however, the corpus could be tailored to any specific aspect, e.g., a specific domain such as news or books. For a word w_i appearing in C, its contextualized representation in the k-th sentence (if a word appears in a sentence more than once, we take the average of its vectors) is denoted by e_{i,k}^{(C)}. The representation of the word in the corpus is then the set

    Φ_i^{C} = { e_{i,1}^{(C)}, e_{i,2}^{(C)}, ..., e_{i,k}^{(C)}, ..., e_{i,m}^{(C)} }    (1)

To examine whether a word w_i exhibits a semantic change between two corpora C1 (in t1) and C2 (in t2), we check the difference between the two sets Φ_i^{C1} and Φ_i^{C2}. Let l_i be a human-annotated label indicating the semantic change degree; l_i usually ranges from 0 to 1, where 1 denotes a full semantic change. Let D be the dimension of the word vectors. We define the distance metric as a function

    f : {ℝ^D}^m × {ℝ^D}^n → ℝ    (2)

that yields a semantic change degree from the representations of a word in the two corpora, Φ_i^{C1} and Φ_i^{C2}. When labels are binary, one may simply apply a threshold to the values of f(·,·) to predict the binary label. Let δ be a function that generates a binary output, e.g., based on a hand-crafted threshold. We can then predict whether w_i exhibits a semantic change between C1 and C2 as

    l̄_i = δ(f(Φ_i^{C1}, Φ_i^{C2}))    (3)

where l̄_i is the predicted binary label. In conclusion, in our work the semantic change detection task is formalized as

    argmax_{f,δ} Σ_{w_i} 𝟙[ δ(f(Φ_i^{C1}, Φ_i^{C2})) = l_i ]    (4)

Since this is a closed task, we may not have enough annotated samples to train f by gradient descent. Therefore, a well-selected f is crucial.

3 Methodology

3.1 Contextualized Word Embedding

Using contextualized word embeddings like ELMo and BERT has been shown to improve performance in various downstream tasks thanks to their expressive power. In this paper we use multilingual BERT (https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip). The uncased model is adopted, since we assume that semantic change detection is insensitive to word case. The model follows the base setting, with 12 layers, 12 attention heads, and a hidden state dimension of 768. Only the last-layer output of BERT is used as the word representation.
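To make the representation step concrete, the following sketch shows how the occurrence set of Eq. (1) could be collected with the HuggingFace transformers library. The checkpoint id corresponds to the public uncased multilingual BERT base model; the function name and the single-piece token matching are illustrative assumptions, not the exact pipeline behind the submission.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def occurrence_vectors(sentences, target_word):
    """Collect one 768-d vector per sentence in which `target_word` occurs."""
    vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # last-layer output only
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        # Keep positions whose token matches the (lower-cased) target word;
        # multiple occurrences within one sentence are averaged, as in Section 2.
        hits = [hidden[i] for i, tok in enumerate(tokens) if tok == target_word.lower()]
        if hits:
            vectors.append(torch.stack(hits).mean(dim=0))
    return torch.stack(vectors) if vectors else torch.empty(0, model.config.hidden_size)
```

Running this once per target word and per corpus yields the two occurrence sets Φ_i^{C1} and Φ_i^{C2} that the distance metrics below compare.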
3.2 Measuring the Semantic Change Degree

3.2.1 Distance-based Methods

In this section, we introduce various methods to calculate the semantic change degree.

Average Geometric Distance. The Average Geometric Distance (AGD), also used by Kutuzov and Giulianelli (2020) and Giulianelli (2019), is defined as

    AGD(Φ_i^{C1}, Φ_i^{C2}) = (1/mn) Σ_{x∈Φ_i^{C1}, y∈Φ_i^{C2}} d(x, y)

The distance function d(·,·) can be the Euclidean distance, d(x, y) = ‖x − y‖_2; the Canberra distance (Lance and Williams, 1966), a normalized version of the Manhattan distance, d(x, y) = Σ_{i=1}^{D} |x_i − y_i| / (|x_i| + |y_i|); or any other distance function. In this paper, we also use the negative cosine similarity as a normalized distance metric.

Hausdorff distance. The Hausdorff distance (Rockafellar and Wets, 2009), HD for short, is generally used to measure the distance between two non-empty sets:

    HD(Φ_i^{C1}, Φ_i^{C2}) = max( sup_{x∈Φ_i^{C1}} inf_{y∈Φ_i^{C2}} ‖x − y‖_2 , sup_{x∈Φ_i^{C2}} inf_{y∈Φ_i^{C1}} ‖x − y‖_2 )    (5)
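As a concrete reference, the set-level distances above can be sketched in a few lines of NumPy/SciPy. Here `emb1` and `emb2` stand for the occurrence matrices Φ_i^{C1} and Φ_i^{C2} (one row per occurrence), and the function names are our own. Note that SciPy's 'cosine' metric is 1 minus the cosine similarity, which ranks words identically to the negative cosine similarity used in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def agd(emb1, emb2, metric="euclidean"):
    """Average Geometric Distance: mean over all pairwise distances between
    the two occurrence sets; `metric` may be 'euclidean', 'canberra', or 'cosine'."""
    return cdist(emb1, emb2, metric=metric).mean()

def hausdorff(emb1, emb2):
    """Symmetric Hausdorff distance of Eq. (5), built from SciPy's directed variant."""
    d12, _, _ = directed_hausdorff(emb1, emb2)
    d21, _, _ = directed_hausdorff(emb2, emb1)
    return max(d12, d21)
```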
3.2.2 Clustering-based Methods

By clustering the union of Φ_i^{C1} and Φ_i^{C2} into K clusters/categories, we obtain the category distributions p and q for Φ_i^{C1} and Φ_i^{C2}, respectively. We adopted two commonly used clustering methods: K-means clustering and the Gaussian Mixture Model. As the distance between the two distributions, we adopted the Jensen–Shannon Divergence (JSD), a symmetrized and smoothed version of the Kullback–Leibler divergence:

    JSD = (1/2) KL(p, q) + (1/2) KL(q, p)

where KL(p, q) = Σ_{i=1}^{K} p_i log(p_i / q_i).
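A minimal sketch of this clustering-based score with scikit-learn follows; the number of clusters K and the smoothing constant are illustrative choices, since, as discussed in Section 5.2, the optimal K is not known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_jsd(emb1, emb2, k=5, use_gmm=False, eps=1e-10):
    """Cluster the union of the two occurrence sets, then compute the
    symmetrized KL divergence between the per-corpus cluster histograms."""
    joint = np.vstack([emb1, emb2])
    if use_gmm:
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(joint)
    else:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(joint)
    # Normalized cluster distributions p, q; eps avoids log(0) and division by zero.
    p = np.bincount(labels[:len(emb1)], minlength=k) / len(emb1) + eps
    q = np.bincount(labels[len(emb1):], minlength=k) / len(emb2) + eps

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, q) + 0.5 * kl(q, p)
```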
In this section, models like BERT achieved much better results we will discuss some limitations of currently-used compared to static word embedding with a two- contextualized embedding based methods for se- stage training paradigm, where the two stages are mantic change detection. pre-training in language model (e.g., mask lan- There are typically two kinds of methods to use guage model) and fine-tuning in downstream tasks contextualized embeddings for semantic change (e.g., classifications). However, in the semantic detection: embedding-based distance metrics and change detection task, fine-tuning in downstream clustering-based distance metrics (Schlechtweg tasks is currently impossible because the anno- et al., 2020; Vani et al., 2020; Giulianelli et al., tated labels are insufficient to this aim; to some 2020; Giulianelli, 2019). The former are directly extent, the lack of fine-tuning stage may harm the calculated on the raw contextualized word embed- performance of the pre-trained language models. dings while the latter are based on the clustering results of contextualized word embeddings. 5.2 Clustering-based Distance Metrics 5.1 Embedding-based Distance Metrics After clustering, we used the Jensen–Shannon di- Can distance metrics distinguish semantic shift vergence (JSD) which is affected by the issues patterns? Many typical patterns of semantic mentioned in Section 5.1 like other distance met- shifts have been investigated (Grossmann and rics. Plus, the clustering algorithm may introduce Rainer, 2013; Basile et al., 2020a): 1) pejora- some errors of semantic change detection. First, tion or amelioration (when word meanings be- typical clustering algorithms may not necessarily come more negative or more positive); 2) broad- converge to an identical clustering result when the ening or narrowing (when it evolves as a general- seed centroids are changed. Moreover, the number ized/extended object or a restricted or specialized of clusters is crucial since the optimal number of one); 3) adding/deleting a sense; 4) totally shifted. clusters cannot easily be decided before clustering. Figure 1: Examples (i.e., ‘rampante’ and ‘palmare’) of predicted ”semantically-shifted” words. Red and blue points denote dimensionally-reduced vectors of two time-stamped corpora respectively. Figure 2: Examples (i.e., ‘cappuccio’ and ‘campanello’) of predicted ”semantically-unshifted” words. Red and blue points denote dimensionally-reduced vectors of two time-stamped corpora respectively. 6 Conclusions References This paper formalizes semantic change detection Pierpaolo Basile, Annalina Caputo, Tommaso as a distance metric between two variable-sized Caselli, Pierluigi Cassotti, and Rossella Var- sets of vectors. The final prediction is based on an vara. 2020a. DIACR-Ita @ EVALITA2020: ensemble of different distance metrics. The pro- Overview of the EVALITA 2020 Diachronic posed method outperformed weak frequency and Lexical Semantics (DIACR-Ita) Task. In collocation baselines, but it performed less well EVALITA 2020, Valerio Basile, Danilo Croce, than SOTA baselines. As a future work, this task Maria Di Maro, and Lucia C. Passaro (Eds.). may be largely improved via a supervised task CEUR.org, Online. in a unified multi-lingual framework; thus, any Valerio Basile, Danilo Croce, Maria Di Maro, human-annotated labels in other languages could and Lucia C. Passaro. 2020b. 
4 Experiments

4.1 Dataset and Evaluation Methodology

DIACR-Ita is the first task on lexical semantic change for Italian. DIACR-Ita aims at automatically detecting whether a word changes semantically over time. The task is to detect whether a set of words, called target words, change their meaning across two periods, t1 and t2, where t1 precedes t2. Participants are provided with two corpora C1 and C2 (corresponding to t1 and t2, respectively) and a set of target words. For instance, the meaning of the word 'imbarcata' has changed from t1 to t2: originally the word referred to an 'acrobatic manoeuvre of aeroplanes', but it is nowadays used to refer to the state of being deeply in love (Basile et al., 2020a), although the latter meaning is much less used than the former. The task is formulated as a closed task, namely, models must be trained solely on the provided data. The occurrences of the target words are reported in Table 1.

word            | # corpus C1 | # corpus C2
----------------|-------------|------------
egemonizzare    | 11          | 37
lucciola        | 64          | 226
campanello      | 109         | 628
trasferibile    | 7           | 60
brama           | 17          | 93
polisportiva    | 74          | 134
palmare         | 19          | 88
processare      | 39          | 594
pilotato        | 34          | 285
cappuccio       | 60          | 198
pacchetto       | 274         | 5690
ape             | 123         | 252
unico           | 4524        | 29620
discriminatorio | 110         | 262
rampante        | 26          | 462
campionato      | 3918        | 11871
tac             | 88          | 438
piovra          | 30          | 621

Table 1: Number of sentences in which each target word occurs in the two time-stamped corpora C1 and C2.

Labels in this task are binary and the task is considered a binary classification problem. The evaluation is based on accuracy:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

where T and F refer to 'True' and 'False', and P and N refer to 'Positive' and 'Negative'; for example, TP is the number of truly-predicted positive samples.

The task organizers provided two baselines. Frequencies: the absolute value of the difference between the word's frequencies in the two corpora is computed. Collocations: for each word, the cosine similarity between two Bag-of-Collocations (BoC) vector representations derived from C1 and C2 is computed. In both baseline models, a threshold is used to predict whether the word has changed its meaning.

4.2 Experimental Results

Experimental results are reported in Table 2 and show that the proposed method achieved better performance than the frequency and collocation based baselines.

method                                    | accuracy
------------------------------------------|---------
Frequencies                               | 0.50
Collocations                              | 0.61
Aggregated results (submitted)            | 0.67
Average negative cosine similarity        | 0.67
Average distance with Euclidean distance  | 0.61
Average distance with Canberra distance   | 0.61
Hausdorff distance                        | 0.50
JS divergence with K-means clustering     | 0.61
JS divergence with Gaussian Mixture Model | 0.61

Table 2: Results of the proposed methods.

4.3 Post-hoc Analysis

In this section, we provide a two-dimensional visualization of the word representations to intuitively understand how the contextualized word vectors work. For each word, we collect all of its contextualized word vectors (with a dimension of 768). To visualize a word in a 2D plane, we use a typical dimensionality reduction algorithm, t-SNE (Maaten and Hinton, 2008), to reduce the word vectors from 768 to 2 dimensions. Red and blue points denote the low-dimensional representations of the vectors from the two time-stamped corpora C1 (blue) and C2 (red).

For example, 'rampante' and 'palmare' are predicted positive samples, while 'cappuccio' and 'campanello' are predicted negative samples. As shown in Figure 1, the predicted semantically-shifted words exhibit a clear difference between the red and blue points of the two time-stamped corpora. For the predicted semantically-unshifted words (see Figure 2), the red and blue points are much harder to tell apart.
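The visualizations in Figures 1 and 2 can be reproduced along the following lines; the t-SNE settings and plotting details are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_word(emb1, emb2, word):
    """Project the 768-d occurrence vectors of one word to 2-D with t-SNE and
    colour them by corpus: C1 in blue, C2 in red."""
    joint = np.vstack([emb1, emb2])
    # Perplexity must stay below the number of points; lowered for small sets.
    points = TSNE(n_components=2, perplexity=min(30.0, len(joint) - 1),
                  random_state=0).fit_transform(joint)
    plt.scatter(points[:len(emb1), 0], points[:len(emb1), 1], c="blue", s=8, label="C1")
    plt.scatter(points[len(emb1):, 0], points[len(emb1):, 1], c="red", s=8, label="C2")
    plt.title(word)
    plt.legend()
    plt.show()
```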
5 Limitations

In Schlechtweg et al. (2020), semantic representations are mainly divided into two categories: average embeddings ('type embeddings') and contextualized embeddings ('token embeddings'). Schlechtweg et al. (2020) showed that the performance of token-based models is much lower than that of type-based embedding models. In this section, we discuss some limitations of the currently-used contextualized-embedding-based methods for semantic change detection.

There are typically two kinds of methods that use contextualized embeddings for semantic change detection: embedding-based distance metrics and clustering-based distance metrics (Schlechtweg et al., 2020; Vani et al., 2020; Giulianelli et al., 2020; Giulianelli, 2019). The former are calculated directly on the raw contextualized word embeddings, while the latter are based on the clustering results of the contextualized word embeddings.

5.1 Embedding-based Distance Metrics

Can distance metrics distinguish semantic shift patterns? Many typical patterns of semantic shift have been investigated (Grossmann and Rainer, 2013; Basile et al., 2020a): 1) pejoration or amelioration (when a word meaning becomes more negative or more positive); 2) broadening or narrowing (when it evolves into a generalized/extended sense or a restricted/specialized one); 3) adding/deleting a sense; 4) total shift. The patterns of semantic change are multifaceted, and we question whether a single distance metric can precisely distinguish all of the above typical semantic shift patterns.

Normalization. Most of the distance metrics are not normalized, except for the negative cosine similarity. The absolute values of unnormalized distance metrics may differ greatly among individual words; they are sometimes unexpectedly affected by the number of samples, which means the values of the metrics may not be comparable across words.

Outliers. Some distance metrics (e.g., the Hausdorff distance) are sensitive to outliers. For example, since the calculation of the Hausdorff distance is based on the infimum and supremum, a single outlier point may largely affect the final Hausdorff distance. As seen in Table 3, frequently-appearing words such as 'campionato' and 'unico' have the highest Hausdorff distance between C1 and C2; this is probably biased by the fact that the two words appear frequently (see Table 1) and are therefore likely to have more unexpected outliers.

Model fine-tuning. Contextualized word embeddings based on pre-trained language models like BERT achieve much better results than static word embeddings thanks to a two-stage training paradigm, where the two stages are pre-training with a language-model objective (e.g., masked language modeling) and fine-tuning on downstream tasks (e.g., classification). However, in the semantic change detection task, fine-tuning on the downstream task is currently impossible because the annotated labels are insufficient for this aim; to some extent, the lack of a fine-tuning stage may harm the performance of the pre-trained language models.

5.2 Clustering-based Distance Metrics

After clustering, we used the Jensen–Shannon divergence (JSD), which is affected by the issues mentioned in Section 5.1 like the other distance metrics. In addition, the clustering algorithm itself may introduce errors in semantic change detection. First, typical clustering algorithms do not necessarily converge to an identical clustering result when the seed centroids are changed. Moreover, the number of clusters is crucial, and the optimal number of clusters cannot easily be decided before clustering.

Figure 1: Examples (i.e., 'rampante' and 'palmare') of predicted "semantically-shifted" words. Red and blue points denote the dimensionally-reduced vectors of the two time-stamped corpora, respectively.

Figure 2: Examples (i.e., 'cappuccio' and 'campanello') of predicted "semantically-unshifted" words. Red and blue points denote the dimensionally-reduced vectors of the two time-stamped corpora, respectively.

6 Conclusions

This paper formalizes semantic change detection as a distance metric between two variable-sized sets of vectors. The final prediction is based on an ensemble of different distance metrics. The proposed method outperformed the weak frequency and collocation baselines, but it performed less well than state-of-the-art approaches. As future work, this task may be largely improved via supervised training in a unified multilingual framework, so that human-annotated labels from other languages could be exploited, since the number of annotated semantically-shifted words in any single language is currently limited.

Acknowledgments

This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.

A Appendix

Table 3 reports the predictions based on the various distance metrics.

word            | AGD-cosine | AGD-euclidean | AGD-canberra | Hausdorff | JSD-GMM | JSD-Kmeans
----------------|------------|---------------|--------------|-----------|---------|-----------
matematica      | 0.996      | 1.02          | 86.6         | 10.0      | 0.004   | 0.025
dettagliato     | 0.895      | 6.09          | 290.9        | 7.5       | 0.693   | 0.693
sanità          | 0.990      | 1.86          | 130.8        | 10.9      | 0.025   | 0.052
senatore        | 0.997      | 0.79          | 79.1         | 7.7       | 0.009   | 0.002
istruzione      | 0.854      | 6.14          | 333.7        | 14.4      | 0.275   | 0.279
egemonizzare    | 0.988      | 1.62          | 136.6        | 5.6       | 0.003   | 0.033
lucciola        | 0.970      | 2.58          | 187.3        | 8.4       | 0.414   | 0.154
campanello      | 0.990      | 1.13          | 131.7        | 10.8      | 0.003   | 0.003
trasferibile    | 0.873      | 4.25          | 300.7        | 7.2       | 0.059   | 0.073
brama           | 0.830      | 5.80          | 346.2        | 8.3       | 0.420   | 0.406
polisportiva    | 0.921      | 4.42          | 285.7        | 7.5       | 0.293   | 0.291
palmare         | 0.955      | 2.55          | 220.5        | 8.0       | 0.130   | 0.154
processare      | 0.986      | 1.76          | 159.9        | 6.9       | 0.105   | 0.067
pilotato        | 0.970      | 2.27          | 198.9        | 12.1      | 0.108   | 0.128
cappuccio       | 0.973      | 1.78          | 183.6        | 12.2      | 0.015   | 0.016
pacchetto       | 0.984      | 1.67          | 149.6        | 10.5      | 0.011   | 0.009
ape             | 0.953      | 2.09          | 216.7        | 15.3      | 0.033   | 0.031
unico           | 0.985      | 1.89          | 149.9        | 16.2      | 0.035   | 0.032
discriminatorio | 0.987      | 1.56          | 150.5        | 10.2      | 0.007   | 0.007
rampante        | 0.888      | 4.78          | 302.7        | 6.5       | 0.293   | 0.299
campionato      | 0.978      | 2.51          | 183.1        | 16.0      | 0.074   | 0.071
tac             | 0.815      | 5.25          | 366.2        | 9.9       | 0.301   | 0.391
piovra          | 0.976      | 2.27          | 189.6        | 9.7       | 0.033   | 0.033

Table 3: Calculated scores of the various distance metrics (top-ranked scores were highlighted in the original submission).

References

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA 2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In EVALITA 2020, Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro (Eds.). CEUR.org, Online.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro (Eds.). CEUR.org, Online.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, and Dominik Schlechtweg. 2019. Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. arXiv preprint arXiv:1906.01688.

Mario Giulianelli. 2019. Lexical semantic change analysis with contextualised word representations. Unpublished master's thesis, University of Amsterdam, Amsterdam.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing Lexical Semantic Change with Contextualised Word Representations. arXiv preprint arXiv:2004.14118.

Maria Grossmann and Franz Rainer. 2013. La formazione delle parole in italiano. Walter de Gruyter.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In ACL. 1489–1501.

Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic Sense Modeling with Deep Contextualized Word Embeddings: An Ecological View. In ACL. 3899–3908.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. ACL 2014, 61.

Andrey Kutuzov and Mario Giulianelli. 2020. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. arXiv preprint arXiv:2005.00050.

Godfrey N. Lance and William T. Williams. 1966. Computer Programs for Hierarchical Polythetic Classification ("Similarity Analyses"). Comput. J. 9(1), 60–64.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. JMLR 9(Nov), 2579–2605.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL. 2227–2237.

Martin Pömsl and Roman Lyapin. 2020. CIRCE at SemEval-2020 Task 1: Ensembling Context-Free and Context-Dependent Word Representations. arXiv preprint arXiv:2005.06602.

R. Tyrrell Rockafellar and Roger J.-B. Wets. 2009. Variational Analysis. Vol. 317. Springer Science & Business Media.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. arXiv preprint arXiv:2007.11464.

K. Vani, Sandra Mitrovic, Alessandro Antonucci, and Fabio Rinaldi. 2020. SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces. arXiv preprint arXiv:2010.00857.

Benyou Wang, Emanuele Di Buccio, and Massimo Melucci. 2019. Representing Words in Vector Space and Beyond. In Quantum-Like Models for Information Retrieval and Decision-Making. Springer, 83–113.