A Comparative Study of Approaches for the Diachronic Analysis of the Italian Language

Pierluigi Cassotti, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro
Department of Computer Science, University of Bari Aldo Moro
Via E. Orabona, 4 - 70126 Bari (ITALY)
{firstname.surname}@uniba.it

© Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In recent years, there has been a significant increase in interest in lexical semantic change detection. Many approaches, datasets, and evaluation strategies exist for detecting semantic drift. Most of these approaches rely on diachronic word embeddings. Some of them are created by post-processing static word embeddings, while others produce dynamic word embeddings in which vectors share the same geometric space across all time slices. The large majority of the methods use English as the target language for the diachronic analysis, while other languages remain under-explored. In this work, we compare state-of-the-art approaches in computational historical linguistics to evaluate the pros and cons of each model, and we present the results of an in-depth analysis conducted on an Italian diachronic corpus. Specifically, several approaches based on both static and dynamic embeddings are implemented and evaluated using the Kronos-it dataset. We train all word embeddings on the Italian Google Ngram corpus. The main result of the evaluation is that all approaches fail to significantly reduce the number of false-positive change points, which confirms that lexical semantic change detection is still a challenging task.

Keywords: Computational Historical Linguistics · Diachronic word embeddings · Lexical Semantic Change.

1 Background and Motivations

Diachronic Linguistics concerns the investigation of language change over time. Language change involves all levels of linguistic analysis: phonology, morphology, syntax and semantics [6, 5]. In this work, we focus on lexical semantic change. Two recent surveys [11, 19] describe and compare several lexical semantic change models developed in recent years. Several datasets and tasks are employed in the evaluation of those models. In [13], the authors use two corpora of scientific papers and a corpus of senate speeches, all written in English. They compare Static Bernoulli Embeddings [14], Procrustes [10] and Dynamic Bernoulli Embeddings [13] using the held-out likelihood as evaluation metric. In [21], the authors evaluate, by using a rank-based approach, word2vec embeddings and a variant of Procrustes alignment to detect words that have undergone a semantic shift. Solving temporal word analogies, i.e. detecting word analogies across time slices, is another common task used to evaluate models of lexical semantic change. In [7], the authors exploit the datasets created by [22] and [18] to compare Temporal Word Embeddings with Compass [7], LinearTrans-Word2vec [18], Procrustes [10], Dynamic Word Embeddings [22] and Geo-Word2vec [1]. However, few standard resources for evaluating lexical semantic change detection models are available. Currently, this gap is tackled by several initiatives. In [17], the authors introduce a framework (DURel) for the annotation of lexical semantic change and, at the same time, make the annotated data available (http://www.ims.uni-stuttgart.de/data/durel/).
DURel is also employed in the annotation process of SemEval 2020 Task 1 [16], which involves four languages: English, Swedish, German and Latin, while the Italian language remains under-explored. SemEval 2020 Task 1 provides corpora in four languages and a gold standard of lexical semantic changes for the evaluation of unsupervised systems. However, the SemEval 2020 Task 1 corpora can only be used to evaluate lexical semantic change across two time periods; therefore, they cannot be used to perform a more fine-grained analysis of the results. In this work, we describe a systematic evaluation of models for lexical semantic change detection, with the Italian Google Ngram as the corpus for training word embeddings and Kronos-it [4] as the gold standard for the evaluation. Kronos-it is a dataset for the evaluation of semantic change point detection algorithms for the Italian language, automatically built by using a web scraping strategy. In particular, it exploits the information present in the online dictionary "Sabatini Coletti" (https://dizionari.corriere.it/dizionario_italiano/) to create a pool of words that have undergone a semantic change. In the dictionary, some lemmas are tagged with the year of the first attestation of a sense. In some cases, a lemma is associated with multiple years attesting the introduction of new senses for that word. Kronos-it uses this information to identify the set of semantically changing words. Previous works on the Italian Google Ngram corpus and Kronos-it are described in [2, 4], but they are limited to the Temporal Random Indexing model [3] and to simple baselines based on word frequencies and collocations, ignoring recent approaches based on word embeddings.

The paper is structured as follows: Section 2 describes the approaches under analysis, while Section 3 reports details about the evaluation pipeline used in our work. Results of the evaluation are reported and discussed in Section 4.

2 Models

Traditional approaches produce word vectors that are not comparable across time, due to the stochastic nature of low-dimensional reduction techniques or sampling techniques. To overcome this issue, a widely adopted approach is to align the spaces produced for each time step, based on the assumption that only few words change their meaning. Words that turn out to be misaligned after the alignment are assumed to have changed their semantics. In this work, we investigate two approaches for producing word embeddings that are comparable across time.

The first approach is based on the alignment of already computed word embeddings (bins). Word vectors are computed before the alignment: once we get the bin (the embedding matrix for a specific time slice), the different spaces obtained for each time slice are aligned. An example of this kind of approach is Procrustes [10], which aligns word embeddings with a rotation matrix. The assumption is that each word space has axes similar to the axes of the other word spaces, and that two word spaces differ only by a rotation of the axes:

R = \arg\min_{Q^{\top} Q = I} \| Q W^{t} - W^{t+1} \|_F

where W^t and W^{t+1} are the word spaces for time slices t and t+1, respectively, and Q is an orthogonal matrix that minimizes the Frobenius norm of the difference between Q W^t and W^{t+1}.
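The orthogonal constraint makes this a standard orthogonal Procrustes problem with a closed-form SVD solution. As a minimal sketch (not the histwords implementation used in our pipeline), two embedding matrices over a shared, identically ordered vocabulary could be aligned as follows; the function and variable names are illustrative.

```python
# Minimal sketch of orthogonal Procrustes alignment between two time slices.
# Illustration only, not the original histwords code; rows are assumed to be
# word vectors over a shared, identically ordered vocabulary (the equation in
# the text uses column vectors, hence left multiplication there).
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_procrustes(W_t: np.ndarray, W_t1: np.ndarray) -> np.ndarray:
    """Return W_t rotated onto the space of W_t1."""
    # orthogonal_procrustes finds R minimizing ||W_t @ R - W_t1||_F with R^T R = I
    R, _ = orthogonal_procrustes(W_t, W_t1)
    return W_t @ R

# Example usage with random matrices standing in for two embedding bins.
rng = np.random.default_rng(0)
W_t, W_t1 = rng.normal(size=(1000, 300)), rng.normal(size=(1000, 300))
W_t_aligned = align_procrustes(W_t, W_t1)
```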
The second approach directly produces aligned word embeddings for each time slice, as it jointly learns word embeddings and aligns them.

Dynamic Word Embeddings (DWE) [22] fall into this second type of approach and are based on the factorization of the positive point-wise mutual information (PPMI) matrix. In a single optimization function, DWE produces embeddings and tries to align them according to the following equation:

\min_{U(t)} \frac{1}{2}\|Y(t) - U(t)U(t)^{\top}\|_F^2 + \frac{\lambda}{2}\|U(t)\|_F^2 + \frac{\tau}{2}\left(\|U(t-1) - U(t)\|_F^2 + \|U(t) - U(t+1)\|_F^2\right)

where the terms are, respectively, the factorization of the PPMI matrix Y(t), a regularization term, and the alignment constraint that keeps the word embeddings similar to the previous and the next word embeddings.

The objective function of static Bernoulli embeddings is closely related to that of the CBOW (Continuous Bag of Words) model [12], except that static Bernoulli embeddings regularize the embeddings by placing priors on both the embedding and context vectors. Dynamic Bernoulli Embeddings (DBE) [13] extend static Bernoulli embeddings by including the time dimension. Context vectors are shared across all time slices, while embedding vectors are only shared within a time slice. Moreover, DBE uses a Gaussian random walk to obtain smoothly changing estimates of each term embedding. The random walk penalizes the shifting of consecutive vectors.

Finally, we investigate Temporal Random Indexing (TRI) [3], which is able to produce aligned word embeddings in a single step. Unlike the previous approaches, TRI is a count-based method. TRI is based on Random Indexing [15], where the word vector (word embedding) sv_j^{T_k} for the word w_j at time T_k is the sum of the random vectors r_i assigned to the co-occurring words, taking into account only documents d_l ∈ T_k. Co-occurring words are defined as the set of the m words that precede and follow the word w_j. Random vectors are initialized randomly and shared across all time slices, so that word spaces are comparable.

3 Methodology

Figure 1 shows the pipeline used for the evaluation. It consists of five modules: corpus pre-processing, computation of bins, bins alignment, construction of time-series, and change point detection. The framework is written in Python; we adopt Procrustes (https://github.com/williamleif/histwords), DBE (https://github.com/mariru/dynamic_bernoulli_embeddings), DWE (https://github.com/yifan0sun/DynamicWord2Vec) and TRI (https://github.com/pippokill/tri) using their original implementations.

Fig. 1: The evaluation pipeline.

3.1 Corpus pre-processing

The corpus pre-processing module receives as input a corpus in which each document is annotated with a time label. The first operation is the splitting of the corpus into temporal slices. During the splitting, the dictionary is computed by keeping track of each new token encountered and of its number of occurrences. The final dictionary is built by taking the tokens present in every time slice and selecting the first n tokens sorted by number of occurrences. In our evaluation, we set n = 50,000.

3.2 Bins building

The second module takes as input the tokenized documents of each time slice and generates, for each approach, the preliminary information needed by the next steps. It has an execution mode for each approach, namely Word2Vec, PPMI, Static Bernoulli and Temporal Random Indexing. The Word2Vec mode trains a Word2Vec model on each sub-corpus using Gensim (https://radimrehurek.com/gensim/), an open-source library for unsupervised topic modelling and natural language processing. The PPMI mode constructs a PPMI matrix for each time slice, which is then used to create the Dynamic Word Embeddings. The Bernoulli mode builds static Bernoulli embeddings for each time slice, which are later used to construct the Dynamic Bernoulli Embeddings. The Temporal Random Indexing mode saves the occurrences of words and contexts that are later used to create the word embeddings.
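As an illustrative sketch of the PPMI step (not the original DWE preprocessing code), a PPMI matrix for a single time slice could be derived from a word-context co-occurrence count matrix as follows; the dense-matrix simplification and the function name are assumptions made for brevity.

```python
# Illustrative sketch: turning a co-occurrence count matrix into a PPMI matrix.
# Not the original DWE preprocessing; a dense matrix is used for clarity, while
# a real 50,000-word vocabulary would require sparse matrices.
import numpy as np

def ppmi(counts: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """counts[i, j] = number of times context word j co-occurs with word i."""
    total = counts.sum()
    p_wc = counts / total                              # joint probability P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                        # keep only positive PMI values

# Example: a toy 4-word vocabulary.
toy_counts = np.array([[0, 2, 1, 0],
                       [2, 0, 0, 1],
                       [1, 0, 0, 3],
                       [0, 1, 3, 0]], dtype=float)
print(ppmi(toy_counts))
```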
3.3 Alignment

The aim of the alignment module is the alignment of the bins produced as output by the previous module. It is composed of several sub-modules: the Procrustes Aligner, the Bernoulli Aligner, the Dynamic Word Embeddings construction and the TRI sub-module. The Bernoulli Aligner constructs Dynamic Bernoulli Embeddings starting from the static Bernoulli output. The Procrustes Aligner takes each Word2Vec model and applies Procrustes to each time slice. The Dynamic Word Embeddings sub-module takes the PPMI matrices previously created and builds the Dynamic Word Embeddings model. The TRI sub-module produces word vectors for each time slice by relying on the co-occurrence information built in the previous step.

3.4 Time-series and change point detection

We compute time-series by exploiting the word embeddings created for each time slice. A time-series is built for each word; this results in a matrix W of size V × T, where V is the dictionary size and T is the number of time slices. We explore two approaches for the computation of the time-series, namely point-wise and cumulative. In the point-wise approach, the element (i, j) of W is the cosine similarity

W_{i,j} = \cos\left(v_{w_i}^{j-1}, v_{w_i}^{j}\right)

where w_i is the i-th word in the dictionary and j is the j-th time slice, while in the cumulative approach the element (i, j) of W is

W_{i,j} = \cos\left(\frac{\sum_{k=1}^{j} v_{w_i}^{k-1}}{j}, v_{w_i}^{j}\right)

In order to detect change points, we use the algorithm proposed in [20]. According to this model, we define the mean shift of a generic time-series W_i pivoted at time period j as:

K(W_i) = \frac{1}{l-j}\sum_{k=j+1}^{l} W_{i,k} - \frac{1}{j}\sum_{k=1}^{j} W_{i,k} \quad (1)

where l is the length of the time-series. To understand whether a mean shift is statistically significant at time j, we use a bootstrapping [8] approach under the null hypothesis, which states that there is no change in the mean. We draw B bootstrap samples by permuting W_i. For each bootstrap sample P, K(P) is calculated to provide the corresponding bootstrap statistic and the statistical significance (p-value) of observing the mean shift at time j compared to the null distribution. Finally, we estimate the change point by considering the time point j with the minimum p-value. Change points, together with the year, the p-value and the word, are stored in a file used for the evaluation.
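A minimal sketch of the mean-shift and bootstrap procedure is given below. It is an illustrative re-implementation, not the code of [20] or of our pipeline; the two-sided p-value and the use of plain permutations of the series as bootstrap samples are simplifying assumptions.

```python
# Illustrative sketch of mean-shift change point detection with bootstrapping.
# Assumptions (not from the paper): a two-sided p-value and B random
# permutations of the series as bootstrap samples.
import numpy as np

def mean_shift(series: np.ndarray, j: int) -> float:
    """K(W_i) pivoted at j: mean of the points after j minus mean of the points up to j."""
    return series[j:].mean() - series[:j].mean()

def change_point(series: np.ndarray, B: int = 1000, seed: int = 0):
    """Return (best_j, p_value): the pivot with the lowest bootstrap p-value."""
    rng = np.random.default_rng(seed)
    best_j, best_p = None, 1.0
    for j in range(1, len(series)):
        observed = mean_shift(series, j)
        permuted = np.array([mean_shift(rng.permutation(series), j) for _ in range(B)])
        p_value = (np.abs(permuted) >= abs(observed)).mean()  # null: no change in the mean
        if p_value < best_p:
            best_j, best_p = j, p_value
    return best_j, best_p

# Example on a toy cosine-similarity time-series with a drop after index 4.
toy = np.array([0.95, 0.94, 0.96, 0.93, 0.95, 0.70, 0.68, 0.72, 0.69, 0.71])
print(change_point(toy))
```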
4 Evaluation

4.1 Data

For training, we use the Google Ngram corpus, a dataset of n-grams extracted from 305,763 Google Books volumes. Google Ngram covers the period from 1500 to 2012. Since OCR errors occur more often in older historical documents, we extract a sub-corpus covering the period 1900-2010. We split the Google Ngram corpus into ten slices of ten years each, from 1900 to 2010. We chose a time span of ten years to reduce the computational complexity, since semantic changes are not frequent and generally require a large time span to be observed. Since the full text is not available in the Google Ngram corpus, we use the method described in [9] to extract co-occurrences between words. As gold standard, we use Kronos-it [4], a dataset for the Italian lexical semantic change detection task. Kronos-it provides for each lemma a set of years indicating the semantic change of that lemma. Kronos-it is extracted from the Sabatini Coletti, an Italian dictionary that reports for some word meanings the year of their first appearance. The Kronos-it dataset contains 13,818 lemmas and 13,932 change points; lemmas reported in Kronos-it have, on average, one change point.

4.2 Hyper-parameters

For hyper-parameters shared by two or more models we use the same values; in particular, we use the same values for the context window and the dimension of the embeddings. Table 1 reports the training strategies and hyper-parameter values. We adopt the default values used by the authors of the models. In particular, in DWE we specify the number of iterations over the data, the alignment weight τ and the regularization weights λ and γ. In TRI, we set the down-sampling factor and the number of seeds. In DBE, we set the number of negative samples, the minibatch size and the number of epochs. In Procrustes, we set the minimum number of occurrences a token must have to appear in the dictionary (min-count), the number of negative samples, the down-sampling parameter (sample) and the number of iterations over the data.

Table 1: Models hyper-parameters.
DWE:        dimension 300, window 4, iters 5, λ 10, γ 100, τ 50
TRI:        dimension 300, window 4, down-sampling 0.001, seeds 10
DBE:        dimension 300, window 4, negatives 2, minibatch 1000, n epochs 4
Procrustes: dimension 300, window 4, min-count 1, negatives 20, sample 1e-5, iter 4

4.3 Metrics

We compute the performance of each approach by using Precision, Recall and F-measure. In the evaluation, a true positive is a change point for a word reported in the gold standard that falls within the ten-year range predicted by the system for that word. Change points provided by the systems are compared to the change points reported in the gold standard. The false negatives (FN) are the number of change points in the gold standard minus the true positives. The false positives (FP) are the number of change points provided by the system minus the true positives.
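To make these definitions concrete, a minimal scoring sketch could look like the following; the data structures (gold change-point years per lemma and predicted decade start years per lemma) are illustrative assumptions, not the actual evaluation code.

```python
# Minimal sketch of the Precision/Recall/F-measure computation described above.
# Assumed (illustrative) data structures: gold maps each lemma to its change
# point years, predictions maps each lemma to the start years of the predicted
# ten-year ranges.
def score(gold: dict[str, list[int]], predictions: dict[str, list[int]]) -> tuple[float, float, float]:
    total_gold = sum(len(years) for years in gold.values())
    total_pred = sum(len(starts) for starts in predictions.values())
    # A true positive is a gold change point falling in a predicted decade for that word.
    tp = sum(
        1
        for word, years in gold.items()
        for year in years
        if any(start <= year < start + 10 for start in predictions.get(word, []))
    )
    fp = total_pred - tp
    fn = total_gold - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 'palmare' changed in 1998, the system predicts the 1990-1999 decade.
print(score({"palmare": [1998]}, {"palmare": [1990]}))
```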
4.4 Results

Table 2 reports Precision (P), Recall (R) and F-measure (F) for each system.

Table 2: Results of the evaluation.
Model                  Precision  Recall  F-Measure  Change points detected
DWE cumulative         .0016      .0840   .0031      13207
DWE point-wise         .0020      .0880   .0039      11115
TRI cumulative         .0017      .0680   .0033      10233
TRI point-wise         .0016      .0680   .0032      10315
DBE cumulative         .0000      .0000   .0000        255
DBE point-wise         .0019      .0200   .0035       2815
Procrustes cumulative  .0016      .0640   .0033       9652
Procrustes point-wise  .0019      .0200   .0036       2757

We can observe that, in general, we obtain a low F-measure. This is due to the large number of change points detected by each system (false positives). The best approach is DWE point-wise; however, its results are close to those obtained by Procrustes point-wise and TRI cumulative. A remarkable aspect is the worse performance of DBE with respect to TRI and DWE: the entries of the DBE time-series are very close to 1, which points to an overly strong alignment. This may be due to the choice of hyper-parameters used to train DBE. As mentioned above, we use the default hyper-parameters, and the datasets used by the authors differ from Google Ngram, mainly because of the large amount of data in Google Ngram; this could have affected the results obtained by DBE. The results of the evaluation show that the task of semantic change detection is very challenging; in particular, the large number of detected change points (false positives) drastically affects the performance. Sometimes change points are detected before or after the change point reported in the gold standard, which supports the hypothesis that the change of semantics of a word is a continuous process involving long periods before reaching stabilization. More studies are necessary to understand which components affect the performance, such as an in-depth and explicit analysis of the time-series. Moreover, it is important to underline that the year reported in the dictionary may be incorrect.

In Figure 2, we show some examples of time-series. For the word 'atomica', DWE cumulative is the only approach that fits the change point in the gold standard, indicating as change point the decade 1950-1959, just after 1945, the year of Hiroshima and Nagasaki. We do not detect change points in the time-series produced by Procrustes point-wise and DBE point-wise, while we find a change point in the TRI cumulative time-series in the 1950-1959 decade. For the word 'palmare', in the DBE point-wise and Procrustes cumulative time-series two change points are detected that are too early compared to the change point in the gold standard (1998). Procrustes provides the right range, 1950-1959, for the word 'Oscar', the years in which an Italian film director, Vittorio De Sica, won the Oscar for the first time. TRI cumulative and DBE point-wise do not detect change points, while in the DWE point-wise time-series a change point is found in the decade 1960-1969.

Fig. 2: Examples of semantic shifts detected: cosine-similarity time-series (1920-2000) for the words 'atomica', 'palmare' and 'Oscar' produced by TRI cumulative, DBE point-wise, Procrustes point-wise and DWE point-wise. Red points mark change points in the gold standard; change points detected in the time-series are also shown.

5 Conclusions

In this paper, we present a systematic evaluation of Dynamic Word Embeddings, Dynamic Bernoulli Embeddings, Procrustes and Temporal Random Indexing for lexical semantic change detection in the Italian language. The results show that detecting lexical semantic change is a complex task: a large number of change points is detected by the systems, which affects the performance. A qualitative analysis of the word time-series highlights that some change points are detected just before or after the correct period; this behaviour requires further linguistic analysis to understand the reasons behind it. This work can be extended in two directions: 1) including some recent models of lexical semantic change that involve contextual embeddings, together with a hyper-parameter search optimized on the Italian Google Ngram dataset; 2) investigating other diachronic Italian corpora as training data. Moreover, we plan to investigate further methods for detecting changes in time-series.

Acknowledgments

This research has been partially funded by ADISU Puglia under the post-graduate programme "Emotional city: a location-aware sentiment analysis platform for mining citizen opinions and monitoring the perception of quality of life".

References

1. Bamman, D., Dyer, C., Smith, N.A.: Distributed representations of geographically situated language. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 828-834 (2014)
2. Basile, P., Caputo, A., Luisi, R., Semeraro, G.: Diachronic analysis of the Italian language exploiting Google Ngram. In: Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016). p. 56. CEUR.org (2016)
3. Basile, P., Caputo, A., Semeraro, G.: Analysing word meaning over time by exploiting temporal random indexing. In: First Italian Conference on Computational Linguistics (CLiC-it 2014). CEUR.org (2014)
4. Basile, P., Semeraro, G., Caputo, A.: Kronos-it: a Dataset for the Italian Semantic Change Detection Task. In: Proceedings of the 6th Italian Conference on Computational Linguistics (CLiC-it 2019). CEUR.org (2019)
5. Blank, A.: Why do new meanings occur? A cognitive typology of the motivations for lexical semantic change. Historical Semantics and Cognition (1999)
6. Bybee, J.L.: Diachronic linguistics. In: The Oxford Handbook of Cognitive Linguistics. Oxford University Press (2010), https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199738632.001.0001/oxfordhb-9780199738632-e-36
7. Di Carlo, V., Bianchi, F., Palmonari, M.: Training temporal word embeddings with a compass. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 6326-6334 (2019)
8. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. CRC Press (1994)
9. Ginter, F., Kanerva, J.: Fast Training of word2vec Representations Using N-gram Corpora (2014), https://www2.lingfil.uu.se/SLTC2014/abstracts/sltc2014_submission_27.pdf
10. Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1489-1501 (2016)
11. Kutuzov, A., Øvrelid, L., Szymanski, T., Velldal, E.: Diachronic word embeddings and semantic shifts: a survey. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1384-1397 (2018)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013)
13. Rudolph, M., Blei, D.: Dynamic embeddings for language evolution. In: Proceedings of the 2018 World Wide Web Conference. pp. 1003-1011 (2018)
14. Rudolph, M., Ruiz, F., Mandt, S., Blei, D.: Exponential family embeddings. In: Advances in Neural Information Processing Systems. pp. 478-486 (2016)
15. Sahlgren, M.: An introduction to random indexing. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (2005)
16. Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H., Tahmasebi, N.: SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In: Proceedings of the 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics (2020)
17. Schlechtweg, D., im Walde, S.S., Eckmann, S.: Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). pp. 169-174 (2018)
18. Szymanski, T.: Temporal word analogies: Identifying lexical replacement with diachronic word embeddings. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 448-453 (2017)
19. Tahmasebi, N., Borin, L., Jatowt, A.: Survey of computational approaches to lexical semantic change. arXiv preprint arXiv:1811.06278 (2018)
20. Taylor, W.A.: Change-point analysis: a powerful new tool for detecting changes
21. Tsakalidis, A., Bazzi, M., Cucuringu, M., Basile, P., McGillivray, B.: Mining the UK Web Archive for Semantic Change Detection. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019). pp. 1212-1221 (2019)
22. Yao, Z., Sun, Y., Ding, W., Rao, N., Xiong, H.: Dynamic word embeddings for evolving semantic discovery. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. pp. 673-681 (2018)