=Paper=
{{Paper
|id=Vol-2765/147
|storemode=property
|title=UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a Compass for Semantic Change Detection in the Italian Language (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper147.pdf
|volume=Vol-2765
|authors=Federico Belotti,Federico Bianchi,Matteo Palmonari
|dblpUrl=https://dblp.org/rec/conf/evalita/BelottiBP20
}}
==UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a Compass for Semantic Change Detection in the Italian Language (short paper)==
Federico Belotti (University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy) f.belotti8@campus.unimib.it
Federico Bianchi (Bocconi University, Via Sarfatti 25, 20136 Milan, Italy) f.bianchi@unibocconi.it
Matteo Palmonari (University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy) matteo.palmonari@unimib.it

Abstract

In this paper, we present our results for the EVALITA 2020 challenge DIACR-Ita on semantic change detection for the Italian language. Our approach is based on measuring the semantic distance across time-specific word vectors generated with Compass-aligned Distributional Embeddings (CADE). We first generate temporal embeddings with CADE, a strategy to align word embeddings that are specific to each time period; the quality of this alignment is the main asset of our proposal. We then measure the semantic shift of each word by combining two different semantic shift measures. Finally, we classify a word's meaning as changed or not changed by defining a threshold over the semantic distance across time.

1 Introduction

Semantic change detection is the task of detecting whether a word has shifted in meaning between different periods of time (Tahmasebi et al., 2018; Kutuzov et al., 2018). The DIACR-Ita challenge (Basile et al., 2020a), held at EVALITA (Basile et al., 2020b), is meant to evaluate approaches to semantic change detection for the Italian language.

The task is described as follows: for training, two corpora t1 and t2, consisting of text coming from different periods, are given; for testing, a set of unlabeled target words is given, and for each of them a binary score has to be predicted: 1 identifies lexical change between t1 and t2, while 0 does not.

In this paper, we present our approach to semantic change detection, which is based on two components: 1) an alignment procedure to generate distributional vector spaces that are comparable for t1 and t2, and 2) the use of distance metrics to compute the degree of semantic change for a given word. Our alignment procedure is based on Compass-Aligned Distributional Embeddings (CADE), proposed by Bianchi et al. (2020); note that the approach was introduced as Temporal Word Embeddings with a Compass by Di Carlo et al. (2019), but the name was changed to emphasize that the embeddings can be used to align more general corpora, not just diachronic ones. Given the aligned embeddings, we use two measures to compute the degree of change based on the similarities of the vectors in the embedded space. Our results show that our methodology for aligning spaces can be useful in detecting lexical semantic change.

"Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

2 Description of the System: Semantic Change Detection with Compass-Aligned Embeddings

Our approach is based on measuring the semantic distance across time of time-specific word vectors generated with CADE, using two measures for detecting semantic shifts, i.e., the semantic distance between word vectors across time. This distance can be interpreted as a function of a word's self-similarity across time, where the similarity is measured by a linear combination of cosine and second-order similarity (Hamilton et al., 2016a). Finally, a threshold over this self-similarity is used to classify a word as changed or not changed.

This methodology was also applied in the semantic shift detection challenge presented at SemEval 2020 (Schlechtweg et al., 2020), in which we participated after the end of the challenge. That challenge allowed us to explore and understand how the alignment and our self-similarity measure behaved.
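The overview above reduces to comparing a word's two aligned vectors and thresholding their similarity. A minimal sketch in plain numpy (toy vectors; the function names are ours, and the 0.7 threshold anticipates the value chosen later in Section 3.2):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_changed(vec_t1, vec_t2, threshold=0.7):
    """Return 1 (changed) when the cross-time self-similarity
    of a word's aligned vectors falls below the threshold."""
    return 1 if cosine(vec_t1, vec_t2) < threshold else 0

# Toy example: a nearly identical pair vs. an orthogonal pair.
stable = is_changed(np.array([1.0, 0.2, 0.0]), np.array([0.9, 0.3, 0.1]))
shifted = is_changed(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

This comparison is only meaningful because the two spaces are aligned: without the compass, cosine similarity between vectors from independently trained word2vec runs carries no signal.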
In the classification task of the SemEval 2020 challenge (the one similar to this task), we eventually achieved accuracies of 0.703, 0.771, 0.725, and 0.742 for the English, German, Latin, and Swedish languages respectively; these results were obtained with an extensive parameter search against the gold standard available in the post-evaluation.[1] In DIACR-Ita, the threshold and a few other hyperparameters had to be set heuristically, to account for the limited number of possible submissions. In the next subsections we provide more details about the alignment methodology and the similarity function; details about how we set the hyperparameters are provided in Section 3.

2.1 Aligning Embeddings

Word2vec (Mikolov et al., 2013) is a useful methodology to generate word vectors, allowing us to study word similarity through vector similarity. However, due to the stochasticity of the training procedure, running word2vec on different corpora creates word vectors that are not comparable. Thus, an alignment procedure that puts the temporal word vectors in the same space is needed.

There are different approaches to generating these aligned embeddings (see, for example, the work by Hamilton et al. (2016b) and Yao et al. (2018)). In this paper, we generate aligned embeddings with Compass-Aligned Distributional Embeddings (CADE) (Bianchi et al., 2020); see Figure 1 for a schematic description of the model. CADE is a strategy to align word embeddings that are specific to each time period, and it extends the word2vec Continuous Bag of Words (CBOW) model proposed by Mikolov et al. (2013). CADE can be used to generate aligned temporal word embeddings (i.e., time-specific vectors of words, like "amazon1974") from the different slices.

Given as input a set of slices of text, where each slice corresponds to text coming from a specific period of time, the alignment procedure is as follows. First, the text from all the slices is concatenated and CBOW is run on this corpus to obtain a "compass" model, i.e., a model defining the embedding space. The CBOW model uses two matrices to generate the embeddings (U and C in Figure 1), one for the context words and one for the target words. The target word matrix of the compass is then used to initialize the target matrices of each new CBOW model fitted on each of the slices. During training, these new target matrices are frozen, i.e., they are not updated during the training on the slice. This ensures that, at the end of the training process, the various temporal embeddings are all aligned in the same embedding space, making them comparable without losing their individual temporal distinctions. We use the publicly available online implementation of CADE.[2]

[Figure 1: A high-level overview of the Compass Aligned Distributional Embeddings model: 1) the compass (matrices U and C) is trained on the concatenation of the slices D1, ..., Dn; 2) the target matrix of each slice-specific CBOW model is initialized with the same compass matrix U and frozen; 3) each slice Di is then trained with its own context matrix Ci.]

2.2 Computing Semantic Change

Once the embeddings are aligned, we need measures to evaluate the degree of semantic change. We compute the semantic shift of each word, i.e., the semantic distance between word vectors across time, using the combination of two different measures: Local Neighbors (ln), introduced by Hamilton et al. (2016a), and cosine similarity (cos), merging them with a weighted linear combination into a new measure called Move.

Local Neighbors. ln is based on the similarity between a word and its neighbor words in the two different time periods. Essentially, we compute the degree of semantic change of a word w in two slices by first collecting the nearest neighbors (NNs) of wt and wt+1 in the two respective slices; then, given the embeddings at time t, the similarities between the vector of wt and the vectors of all the neighbors are computed.[3] The same process is run for time t+1 with wt+1, eventually giving us two vectors of similarity scores. These two vectors are then compared using cosine similarity. The higher the value of this measure, the less the vector has changed with respect to its neighbors, and thus the less the word should have shifted in meaning.

Cosine Similarity. The second measure we use is simply the cosine similarity of the vectors of a word in two different time periods. As before, the higher the value, the less the vector has changed, and thus the less the word should have shifted in meaning.

The Move Measure. We merge these measures together using a weighted linear combination, that is:

    s(wt, wt+1) = (1 − λ) · ln(wt, wt+1) + λ · cos-sim(wt, wt+1)

with λ ∈ [0, 1]. In particular, λ expresses the relative strength of the two measures: a high λ shifts Move towards the cosine similarity, while a low one shifts it towards the ln measure. As introduced before, we classify whether the meaning has changed by defining a threshold over s (more details are presented in the next section).

3 Experimental Evaluation

The dataset provided by the challenge's organizers (Basile et al., 2020a) is a collection of documents extracted from newspapers written in the Italian language and labeled with temporal information. Participants must train their models only on the data provided, so a pre-processed corpus is given: tab-separated, with one token per line, where each token is accompanied by its part-of-speech (POS) tag and lemma, and sentences are separated by empty lines. The corpus is split into two slices, each belonging to a specific period of time, t1 and t2, with t1 < t2.

3.1 Dataset

For the training data we used the flat version with only the lemmas, obtained with the organizers' script (Basile et al., 2020a); in addition, we applied a pre-processing step in which we removed punctuation and non-alphanumeric symbols, and we kept only those sentences with at least two tokens.

3.2 Models Considered

We use the embeddings aligned with CADE and the move measure. The parameters of the move measure we need to consider are: the number of nearest neighbors (NNs) collected by ln, the λ of the weighted combination, and the threshold over the similarity. The threshold used to decide whether a word is stable or not is set to 0.7, with the decision given by:

    0 if s(wt, wt+1) ≥ 0.7
    1 otherwise

Essentially, the less the two vectors of the word (for cos) and of its neighbors (for ln) have changed, the more stable the word has been between the two time periods. As a heuristic, we chose λ ∈ {0.3, 0.5, 0.7} to evaluate the relationship between the two measures used to build move, and we set the number of nearest neighbors considered by ln to 22; this is the general setup that produced the results submitted to the challenge.

We trained CADE for 10 epochs to learn 100-dimensional vectors, with the window size set to 5, 10 negative examples for every positive one, and the initial learning rate set to 0.025 and decreased linearly during training.

As further models, in the post-evaluation we also considered one that uses only the cos similarity measure (CADE (cos)) and one that uses only the ln metric (CADE (ln)), again with 0.7 as threshold and with the number of NNs for ln set to 22. As baselines, the organizers propose baseline-freq, the absolute value of the difference between a word's frequencies in the two periods, and baseline-colloc, where the Bag-of-Collocations of the word in the two different periods is built and then cosine similarity is applied.

Footnotes:
[1] Check the belerico entry in the challenge leaderboard at https://competitions.codalab.org/competitions/20948#results
[2] http://github.com/vinid/cade
[3] When a neighbor is missing in one time slice, we replace it with the average vector of the space.
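The measures of Section 2.2 can be sketched in plain numpy over toy dictionary-shaped spaces. This is a simplified illustration, not the authors' implementation: we score the concatenation of the two neighbor lists, and all helper names are ours (the submitted setup uses k = 22 neighbors and λ ∈ {0.3, 0.5, 0.7}):

```python
import numpy as np

def cos(u, v):
    """Plain cosine similarity."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbors(space, word, k):
    """Top-k neighbors of `word` by cosine similarity within one slice."""
    target = space[word]
    scored = sorted(((cos(target, v), w) for w, v in space.items() if w != word),
                    reverse=True)
    return [w for _, w in scored[:k]]

def vector_or_mean(space, word):
    # Footnote 3: a neighbor missing from a slice is replaced
    # by the average vector of that space.
    return space[word] if word in space else np.mean(list(space.values()), axis=0)

def ln(space_t1, space_t2, word, k=22):
    """Local Neighbors (second-order) similarity of `word` across two slices."""
    nns = nearest_neighbors(space_t1, word, k) + nearest_neighbors(space_t2, word, k)
    sims_t1 = np.array([cos(space_t1[word], vector_or_mean(space_t1, n)) for n in nns])
    sims_t2 = np.array([cos(space_t2[word], vector_or_mean(space_t2, n)) for n in nns])
    # The two similarity profiles are themselves compared with cosine.
    return cos(sims_t1, sims_t2)

def move(space_t1, space_t2, word, lam=0.3, k=22):
    """s(wt, wt+1) = (1 - lam) * ln + lam * cos, as in Section 2.2."""
    return (1 - lam) * ln(space_t1, space_t2, word, k) \
        + lam * cos(space_t1[word], space_t2[word])
```

A word is then labeled 1 (changed) when move(...) falls below the 0.7 threshold of the decision rule above.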
A threshold is used on both metrics to define semantic change (Basile et al., 2020a). We also report the results of the other participants.

Table 1: Accuracy scores for the binary classification compared with the other participants in the challenge. † identifies our submitted results.

Model | λ | Acc.
team1 | / | 0.944
team2 | / | 0.944
team3 | / | 0.889
CADE (move)† | 0.3 | 0.833
team4 | / | 0.833
team5 | / | 0.833
team6 | / | 0.778
team7 | / | 0.722
team8 | / | 0.667
team9 | / | 0.611
baseline-colloc | / | 0.611
baseline-freq | / | 0.500
CADE (move)† | 0.5 | 0.722
CADE (move)† | 0.7 | 0.722
CADE (cos) | / | 0.722
CADE (ln) | / | 0.889

3.3 Results

The evaluation metric used in this challenge is accuracy, that is, the fraction of correct predictions over the target data. Table 1 shows the results. Our model was the third most accurate. However, in the post-evaluation we discovered that using only the ln metric and ignoring cos (equivalent to setting λ = 0 in our move measure) improves the performance, leading to the second best accuracy score in the leaderboard.

4 Discussion

Our results show that CADE (Bianchi et al., 2020) is an effective method to generate aligned embeddings for the Italian language. This result, together with those obtained on the SemEval 2020 data, suggests that CADE can support models of semantic shift detection in several languages. Indeed, we show that, in combination with some simple semantic change measures, it is possible to provide a good model for semantic change detection that can subsequently be extended with more features. Appendix A contains some more detailed examples of the words that CADE (ln) and CADE (move), with λ set to 0.3, could not classify correctly. We also show the neighborhoods of some of those words to give more context on why we get those errors. A more precise use of pre-processing techniques, combined with other metrics to compute semantic change, might help in reducing these errors.

References

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Federico Bianchi, Valerio Di Carlo, Paolo Nicoli, and Matteo Palmonari. 2020. Compass-aligned distributional embeddings for studying semantic differences across corpora. arXiv preprint arXiv:2004.06519.

Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. 2019. Training temporal word embeddings with a compass. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6326–6334.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas, November. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany, August. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. arXiv preprint arXiv:2007.11464.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to lexical semantic change. arXiv preprint arXiv:1811.06278.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 673–681.

A CADE Misclassifications

We report in Tables 2 and 3 CADE's misclassifications with the two best metrics, namely CADE (move) with λ = 0.3 and CADE (ln). We also show in Tables 4 and 5 some examples of neighborhoods for the target words.

Table 2: Wrong predictions made by CADE (move) with λ = 0.3.

Word | Pred | True
trasferibile | changed | not changed
pacchetto | changed | not changed
piovra | changed | not changed

Table 3: Wrong predictions made by CADE (ln).

Word | Pred | True
pacchetto | changed | not changed
rampante | not changed | changed

Table 4 shows the top 10 nearest neighbors of the target word "pacchetto". We think CADE classifies its meaning as changed because at time t1 the meaning is more focused on the economic domain, as one can see from neighbors like "azionario", "obbligazione" or "contante" ("stock" as referred to the market, "bond" and "cash" respectively), while at time t2 it shifts to a more political sense, as shown by words such as "decreto" or "emendamento" ("decree" and "amendment" respectively).

Table 4: First 10 nearest neighbors by cosine similarity of the word "pacchetto" in t1 and t2.

t1 | t2
azionario | maxiemendamento
obbligazione | finanziaria
azionista | decretone
azionano | decreto
edison | ddl
casseforte | emendamento
contante | liberalizzazioni
siap | decretere
shell | maxidecreto
prestire | ecobonus

The same seems to happen for the target word "piovra", as one can see from Table 5: at time t1 CADE gathers senses both from considering it as the animal, for example from the word "tentacolo" ("tentacle"), and as someone tied to crime in general, given words such as "profittatore" or "ruberia" ("profiteer" and "robbery" respectively); while at time t2 it captures a shift towards the Italian crime TV series "La piovra", as emerges from words such as "fiction", "camorra" or "retequattro", the latter being an Italian television channel.

Table 5: First 10 nearest neighbors by cosine similarity of the word "piovra" in t1 and t2.

t1 | t2
tentacolo | fiction
ingordigia | sceneggiato
profittatore | tentacolo
somaro | camorrere
feudatario | retequattro
insaziabile | raidue
impere | puntato
ruberia | camorra
zanne | gomorra
putrido | miniserie
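Neighborhood tables like Tables 4 and 5 follow from a plain cosine ranking within each aligned slice. A toy sketch (the vectors and the helper name are illustrative assumptions, not the challenge data):

```python
import numpy as np

def top_neighbors(space, word, k=10):
    """Rank the other words of one aligned slice by cosine
    similarity to `word`, as done for Tables 4 and 5."""
    target = space[word]
    def cos(v):
        return float(np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v)))
    ranked = sorted(((cos(vec), w) for w, vec in space.items() if w != word),
                    reverse=True)
    return [w for _, w in ranked[:k]]

# Toy t1 slice: "pacchetto" sits near the economic terms.
slice_t1 = {
    "pacchetto": np.array([1.0, 0.1]),
    "azionario": np.array([0.9, 0.2]),
    "obbligazione": np.array([0.8, 0.1]),
    "decreto": np.array([0.1, 1.0]),
}
neighbors = top_neighbors(slice_t1, "pacchetto", k=2)
```

Running the same ranking on the two slices side by side yields the paired t1/t2 columns shown above.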