
CLiC-it

1613-0073

Is Change the Only Constant? An Inquiry Into Diachronic Semantic Shifts in Italian and Spanish

Matteo Melis

matteo.melis@studenti.unitn.it

Anastasiia Salova

anastasiia.salova@studenti.unitn.it

Roberto Zamparelli

roberto.zamparelli@unitn.it

Centre for Mind/Brain Sciences, University of Trento, Rovereto, Italy

Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings.

2023

An increasingly prevalent approach to studying the gradual change of word meanings over time involves using distributional semantics, which is based on neighboring words. This study combines methods from Hamilton et al. (2016) [1] and Uban et al. (2019) [2] to analyze deceptive cognate pairs in historical and contemporary Italian and Spanish corpora. By employing fastText word embeddings and various similarity measures, it aims to investigate the change of word meanings and test two laws of regularity proposed by Hamilton et al. (2016) [1], along with a new hypothesized regularity in language change regarding analogy. The findings show a coherent evolution of deceptive cognates across the two languages. However, no meaningful correlation is found regarding the two aforementioned laws. Nevertheless, the results of the hypothesized regularity offer valuable insight into how the context of word usage shifts along with the word.

Diachronic semantics, semantic shifts, distributional semantics, similarity measures, deceptive cognates


1. Introduction

1.1. Background

In recent years, there has been a growing interest in studying the shift of word meanings over time, with word embeddings emerging as a valuable tool for this purpose. Hamilton et al. (2016) [1] conducted research focusing on diachronic word embeddings to uncover specific statistical laws associated with semantic change. They examined the law of conformity, which suggests that words tend to change inversely to their frequency. Additionally, they explored the law of innovation, which proposes that words with greater polysemy tend to undergo semantic changes more frequently, regardless of how often they are used. The findings confirmed the hypothesized statistical laws. The study primarily focused on English, aligning word embeddings from different time periods and measuring semantic similarity using cosine similarity.

Dubossarsky et al. (2017) [3] contested the validity of the reported laws of semantic change based on word representation models. Replicating previous studies, they found that the law of conformity and the law of innovation did not withstand a more rigorous standard. The negative correlation between word frequency and meaning change was weaker than previously claimed, and the positive correlation between polysemy and meaning change was largely dependent on word frequency without independent contribution.

Similarly to Hamilton et al. (2016) [1], Uban et al. (2019) [2] investigated semantic divergence across languages by examining deceptive cognate sets, which are words with a common origin in different languages. They focused on analyzing modern embeddings to quantify semantic shifts originating from shared etymology, identify false friends (deceptive cognates) in the cognate sets, and measure their score of falseness, namely the dissimilarity between the cognates. The study primarily concentrated on six Romance languages. The authors introduced methodologies such as aligning word embeddings across languages, measuring semantic similarity and divergence between cognate sets, and quantifying the magnitude of semantic changes. Their findings contradict those of Hamilton et al. (2016) [1], who found a negative correlation between frequency and meaning shift. However, they align with Hamilton et al.'s findings regarding the law of innovation.

1.2. Objectives

The primary focus of this study is to investigate the presence of statistical laws governing semantic shifts within the Romance language group, specifically Italian and Spanish. The research questions revolve around exploring the laws of conformity and innovation. It is hypothesized that more frequent words are less likely to undergo semantic shifts, while more polysemous words are more prone to such changes. Additionally, the study introduces a new follow-up analysis on analogy, suggesting that over time periods the meaning of a word which is semantically related to a target (in terms of context-based nearest neighbors) tends to shift in the Euclidean space coherently with the target word.

The study uses distributional semantics as a methodology to explore language change. A crucial part of this research involves analyzing deceptive cognate pairs, which have a similar or the same form in different languages but diverged in meaning over time, unlike true cognates that retain the same meaning. For instance, Figure 1 illustrates how largo (broad) in Italian and largo (long) in Spanish have diverged in meaning through a semantic shift, despite both words originating from the shared Latin etymon largo (abundant). We believe this allows for a robust comparison of semantic changes, especially in related languages, providing illustrative examples and easily interpretable results. Our primary focus is on systematic semantic change that originates from the shared etymon and continues, while also controlling for the random appearance of lexical units in a language. Moreover, this approach would enable cross-language analysis in prospective studies.

Our study aims to expand the current understanding of language change by incorporating cognate comparisons across languages and examining individual changes within specific time periods. To enhance the robustness of our analyses, we introduce various similarity measures.

2. Corpora

2.1. Corpora Selection Criteria

The study uses two different time periods of language usage in its corpora: the 19th and 20th centuries (until 1969) for historical data, and the 21st century for modern data.

To address the size difference between the two datasets, we reduced the modern data to match the historical data's size. This was achieved by counting the number of required tokens and removing the tokens exceeding this number. This allowed for two different training sets for the modern data, enabling comparisons and allowing us to draw conclusions about the minimum amount of data needed for these analyses.

2.1.1. Italian

Four corpora were collected online for this study: Histcorp [4], ChroniclItaly v3.0 [5], Unità corpus [6], and PAISÀ corpus [7]. The first three corpora were merged to form the historical dataset, covering the years 1805-1969, with a total of 545,068,401 tokens. The PAISÀ corpus represented the modern data, containing 1,089,014,748 tokens, while the reduced modern version consisted of 545,106,781 tokens.

2.1.2. Spanish

Similarly, four corpora were collected online for Spanish: Conha19 [8], Impact-es (BVC section) [9], Corpus of Political Speeches [10], and The Large Spanish Corpus [11]. The historical data consists of a merged collection of the first three corpora, covering the period from 1830 to 1969 and containing 204,904,549 tokens. The modern data representation utilizes 'The Large Spanish Corpus' (Wikipedia section), containing 975,251,278 tokens from 2019. Additionally, a reduced version of The Large Spanish Corpus was created, containing 206,900,109 tokens.

2.2. Pre-processing Techniques

The pre-processing for both languages followed the same steps. After collecting the text files for each corpus, we used the NLTK library [12] for tokenization and stop-word removal. The files were cleaned by removing URLs, numbers, non-letters, multiple empty spaces, and set to lowercase. For Spanish, diacritic marks were replaced using unicodedata. The spaCy library [13], with its reported accuracy of 0.96 for Spanish and 0.97 for Italian, was employed for lemmatization, and the files were merged into a representative single file for each historical period and language.
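The cleaning steps described for pre-processing (removing URLs, numbers, non-letters and repeated spaces, lowercasing, and replacing diacritics for Spanish) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the function names `clean_text` and `strip_diacritics`, the URL pattern, and the choice to drop combining marks after NFD decomposition are all assumptions, and the NLTK tokenization, stop-word removal, and spaCy lemmatization steps are omitted.

```python
import re
import unicodedata

# Hypothetical URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Runs of characters that are not letters: punctuation, digits, underscores, whitespace.
NON_LETTER_RE = re.compile(r"[\W\d_]+")


def strip_diacritics(text):
    """Replace accented letters with their base letters (assumed approach)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category Mn) left after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_text(text, remove_diacritics=False):
    """Remove URLs, numbers and non-letters, collapse spaces, and lowercase."""
    # Compose characters first so accented letters count as single letters.
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    # Each run of non-letter characters becomes one space.
    text = NON_LETTER_RE.sub(" ", text)
    text = text.lower()
    if remove_diacritics:
        text = strip_diacritics(text)
    return text.strip()
```

Under these assumptions, Spanish files would be cleaned with `clean_text(line, remove_diacritics=True)` and Italian files with the default, which keeps accented letters such as `é`.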

2.3. Cognate Dataset

We used an existing resource: an automatically generated multilingual lexicon of false friends [14]. Following the logic that cognate pairs are considered false friends if a word in the second language is closer in meaning to the original word in the shared semantic space than its cognate in that language, a falseness score is provided.

For instance, given the cognate pair (imbarazzata, embarazada), where imbarazzata (embarrassed) is a word in Italian and embarazada (pregnant) is a word in Spanish, if there is a word x in Spanish, other than embarazada, such that the distance (imbarazzata, x) is less than the distance (imbarazzata, w) for any other Spanish word w, then the pair is considered a deceptive cognate pair. Since the Spanish word avergonzada (embarrassed) exists, the pair (imbarazzata, embarazada) constitutes a set of false friends, and the arithmetic difference between the distance (imbarazzata, embarazada) and the distance (imbarazzata, avergonzada) is the score of falseness, which ranges from 0 to 1. It is lower for false friends that are closer in meaning and higher for more distant false friends.

Given this, we decided to extract the 156 deceptive cognate pairs with a falseness value higher than 0.25. This step was taken to ensure the accuracy of the dataset and account for the limitations of its unsupervised data collection method.

3. Methodology

Methodologically, the study can be divided into the following steps¹:

¹ All the code can be found at https://github.com/matteomls/diachronic-semantic-shift.

We trained six fastText models [15] in an unsupervised regime using the six corpora that we obtained and prepared. For each model, we employed the skip-gram algorithm, set the vector dimension to 100, and trained for 5 epochs. These parameters are considered default, and as indicated by Mikolov et al. (2013) [16], the algorithm has been found to work well with small datasets. This resulted in three models for each language, trained on historical data, modern data, and modern reduced data, respectively. This produced a total of 6 different vector spaces.

3.2. Embeddings Overview with RSA

In order to obtain a comprehensive overview of the vector spaces and as the initial step of our analysis, we computed Representational Similarity Analysis (RSA) between dissimilarity matrices of 156 deceptive cognate words from the dataset by Uban and Dinu (2020) [14]. These matrices were created by extracting vectors for specific cognates from the common vector spaces obtained in the previous step. The aim was to assess general similarity patterns within the word embeddings. Based on the results thus obtained, we chose to exclusively use the model trained on the full modern data and discard the one trained on the reduced modern data to ensure higher-quality word embeddings in later steps. Detailed results of this analysis will be discussed later.

3.3. K-Nearest Neighbors Retrieval Using a Similarity Measure

To obtain more qualitative data, the fastText library [15] was used to retrieve embeddings closest to the target cognate in Euclidean space. The retrieval process utilized the K-Nearest Neighbors (K-NN) function, where the cosine similarity measure was employed to compare two vectors. The number of nearest neighbors to retrieve (k) was predetermined and set to 5, 10, 20, and 50 for comparative analysis purposes.

3.4. Semantic Shift Calculation within Each Language

1. We applied Procrustes alignment [17] to the two vector spaces (historical to modern for each language) to ensure that similar vectors represented the same concepts across different embedding spaces. This alignment was necessary as the embeddings were trained on different corpora in different languages.
2. We calculated the cosine similarity for the cognates in different time periods.
3. We counted the occurrences of each cognate word from both the historical and modern corpora in Italian and Spanish.
4. We normalized the occurrences of cognate words by dividing each count by the sum of all counts. The normalized values therefore total 1 and effectively replace the actual frequency values.
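The alignment, similarity, and normalization steps listed above can be illustrated with a short NumPy sketch. It assumes the orthogonal variant of Procrustes alignment solved via SVD, which is a common choice for diachronic embeddings but is not confirmed by the text as the authors' exact procedure. The function names and the convention that both matrices hold the same words in the same row order are hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target` with an orthogonal matrix.

    Both arrays have shape (n_words, dim) and list the same words in the
    same row order (an assumption of this sketch).
    """
    # Orthogonal Procrustes: R = U V^T, where U S V^T is the SVD of source^T target.
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def normalize_counts(counts):
    """Divide each raw count by the sum of all counts so the values total 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

With historical vectors aligned to the modern space, the semantic shift of a cognate would then be read off as `cosine_similarity(aligned_historical[i], modern[i])`, where lower values indicate a larger shift.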

Using the NumPy library [18], we computed the correlation coefficient and linear regression coefficients of the frequency and semantic shift across time. In this analysis, we incorporated polysemy covariance, considering the correlation between polysemy and frequency.

3.6. Word Polysemy and Semantic Divergence Analysis

After conducting the frequency and semantic divergence analysis, we proceeded to measure the polysemy of words. To accomplish this, we utilized the WordNet library [19], specifically leveraging the functionality provided by the "nltk.corpus.wordnet" module. Polysemy was quantified as the number of synsets associated with a word in WordNet, following the methodology described by Uban et al. (2019) [2].

Subsequently, we investigated the correlation between the cosine similarity over time, which indicates the degree of semantic shifting, and the number of meanings a word can have according to WordNet. In this analysis, we took into account the co-variance with frequency, similarly to our previous approach.

3.7. Word Analogy and Semantic Divergence Analysis

In addition to the previous analyses, we further examined how the cosine similarity changes over time for the K-Nearest Neighbors (K-NN) that exhibit overlap between the two different time periods. For each cognate word, we employed a K-NN approach with varying values of K (5, 10, 20, 50). We examined the overlapping nearest neighbors (NN) in both the historical and modern lists of NN. For each overlapping NN, we calculated the cosine similarity and measured the difference in the shift, determining whether the NN moved closer to or further from the target cognate word.

By calculating the ratio of positive (closer) or negative (further) shifts, we could assess the coherence (the consistency of neighbors' movement relative to the target cognate) of the shift in the K-NN of that specific target cognate word. To identify significant coherent shifts, we set a threshold (>0.75). This threshold was chosen to be substantially higher than chance, ensuring a rigorous approach. If this threshold is crossed, it implies a major coherent shift in the K-NN of the target cognate word.

In carrying out this analysis for all the cognates in the list, we removed those that had 0 or 1 overlapping NN, since they do not provide informative results.

[...] outcome. Furthermore, when comparing the reduced historical Spanish embedding space with the modern embedding space, a difference of 0.0956 is observed (b). Therefore, while the results for Italian remain consistent between the full and reduced spaces, reducing the Spanish modern space to match the historical space produces different outcomes compared to using the full modern space. Given the choice between data quality and balance, we have opted for better data quality by discarding the models trained with reduced datasets.

4.2. Calculation of Semantic Shifts

4.2.1. Within-Language Comparison: K-NN with Jaccard Distance

For the selected K Nearest Neighbors (KNN) values of 5, 10, 20, and 50, the results are presented in the tables in Appendices B and C (Tables 3 to 10). These tables display the average number of overlapping nearest neighbors in the cognate list, the ratio of overlapping nearest neighbors considering the extracted KNN, and the Jaccard distance. Please refer to the Appendix for a detailed representation of these values.

4.2.2. Inter-Language Comparison: K-NN with Jaccard Distance

The values in Appendix D (Tables 11 and 12) represent dissimilarity scores, specifically semantic shifts, calculated using the Jaccard distance (1 - Jaccard index). The Pearson correlation score of 0.999 indicates a strong correlation between the shifts for Italian and Spanish as the particular K value increases. Overall, the scores show compatible semantic shifts. However, in this analysis, we can only infer the magnitude of the shifts and not the patterns, which will be explored in later analyses.
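To make the K-NN and Jaccard-distance comparison concrete, the sketch below retrieves the k nearest neighbors of a word by cosine similarity and compares two neighbor lists with the Jaccard distance (1 - Jaccard index). It is only an illustration: the brute-force NumPy search stands in for the fastText nearest-neighbor retrieval used in the study, and the function names and the in-memory `vocab`/`vectors` representation are assumptions.

```python
import numpy as np


def nearest_neighbors(word, vocab, vectors, k):
    """Return the k words closest to `word` by cosine similarity.

    `vocab` is a list of words and `vectors` an array of shape
    (len(vocab), dim) with one row per word (hypothetical layout).
    """
    # Normalize rows so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:k]


def jaccard_distance(neighbors_a, neighbors_b):
    """Jaccard distance between two neighbor lists, treated as sets."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)
```

For a given cognate, `jaccard_distance` would be applied to its historical and modern neighbor lists for each K; a value near 1 means the two periods share almost no neighbors.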

4.3. Law of Conformity

4.4. Law of Innovation

Conversely, in our study the results for the law of innovation (more polysemy = greater shift), depicted in Figure 2 (lower), differ from those reported by Hamilton et al. (2016) [1] and Uban et al. (2019) [2]. While we observed a moderate positive trend, similar to that of the law of conformity, with correlation scores of 0.401 for Italian and 0.417 for Spanish, the partial correlation, which accounts for the frequency confound, reveals weaker values of 0.249 for Italian and 0.188 for Spanish. These findings suggest that the data does not provide strong support for the existence of the law of innovation in Romance languages. However, due to the weak partial correlations observed, it is challenging to draw definitive conclusions.

4.5. Law of Analogy

One trend that emerges from our study is that semantically related words (as indicated by contextual nearest neighbors) tend to shift coherently closer to or farther from the target word. Table 1 and Table 2 provide supporting evidence for this observation: as the number of nearest neighbors (K-NNs) increases, the ratio of coherent shifts tends to decrease. This aligns with the intuition that with more K-NNs, the distances between the neighbors and their target cognate increase, leading to less consistent shifts. To provide a visual representation, Figure 3 displays an example visualization for a single cognate pair.

Table 1
Analogy analysis for Italian

K-NN    N° of Cognates
5       53
10      83
20      104
50      121
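The coherence measure behind the analogy analysis can be sketched as below. This is one possible reading of the procedure, not the authors' code: it assumes that the ratio is the share of overlapping neighbors moving in the majority direction, that ties count as moving further, and that the inputs are hypothetical dictionaries mapping each neighbor to its cosine similarity with the target cognate in one period.

```python
def coherence_ratio(hist_sims, modern_sims):
    """Share of overlapping neighbors that move in the majority direction.

    `hist_sims` and `modern_sims` map neighbor words to their cosine
    similarity with the target cognate (hypothetical input format).
    Returns None when fewer than two neighbors overlap, mirroring the
    exclusion of cognates with 0 or 1 overlapping NN.
    """
    shared = set(hist_sims) & set(modern_sims)
    if len(shared) < 2:
        return None
    # A neighbor moves closer when its similarity to the target increases.
    closer = sum(1 for w in shared if modern_sims[w] > hist_sims[w])
    further = len(shared) - closer  # ties are counted as "further" here
    return max(closer, further) / len(shared)


def is_coherent(ratio, threshold=0.75):
    """Flag a major coherent shift when the ratio exceeds the threshold."""
    return ratio is not None and ratio > threshold
```

Because the comparison is strict, a cognate whose neighbors split exactly 3 to 1 (ratio 0.75) would not be flagged under this sketch.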

5. Discussion

The hypothesized regularity regarding analogy, a follow-up analysis in this study, has provided intriguing insights into semantic shifts. However, it is important to note that further research into this topic is necessary to validate and expand upon these initial findings.

On the other hand, the analyses conducted in this study do not yield definitive results supporting the statistical laws of semantic shifts. Firstly, the RSA evaluation of the embedding spaces revealed that the scarcity of data significantly impacted the quality of the embeddings. Furthermore, while the law of conformity agrees with previous literature in a general trend, such as Hamilton et al. (2016) [1], our study identified a contrasting trend for the law of innovation. This discrepancy in findings may be attributed to the limitation of our study, namely the scarcity of data resulting from the use of relatively short time periods.

An additional factor is the relatively short temporal distance between the historic (as recent as 1969) and the modern corpora. Increasing this span is likely to lead to greater shifts, but also to greater data sparsity.

Last but not least, the alignment technique employed for matching the embedding spaces could have contributed to the divergent outcomes in the analysis of the law of conformity and the law of innovation.

It is noteworthy that our results for both the law of conformity and the law of innovation conform to the findings of Dubossarsky et al. (2017) [3]. Their study revealed that the suggested positive correlation between meaning change and polysemy was primarily influenced by word frequency, and that the correlation between word frequency and meaning change is indeed weaker. Here, after conducting partial correlation analysis, a weak correlation was observed. Furthermore, we noticed a high compatibility between frequency and polysemy, indicating an inherent dependence, despite our efforts to disentangle them using partial correlation.

Utilizing the fastText model, known for its improved performance on non-English languages, and preprocessing freely available data, the results still highlight poor quality embeddings. This underscores the need for ongoing research and development of word embedding models, alongside the creation of larger, well-curated diachronic corpora. Improving data quality and quantity can enhance the accuracy and reliability of future studies in the field.

It is important to note that due to the limitations of the embeddings used in this study, the shifts observed in the inter-language Jaccard distance analysis are relatively small and close to each other. This leads to an extremely high correlation coefficient between the languages being analyzed, which should be interpreted with caution.

In addition to the aforementioned directions, other potential areas of research include expanding further in time and broader in the scope of languages. For instance, this could involve going beyond the Romance or even the Indo-European language family to conduct a more comprehensive investigation into language change.

Acknowledgements

We would like to express our gratitude to Dr. Raffaella Bernardi for her support and feedback throughout this project, which has been helpful in shaping our research. We also appreciate her encouragement regarding the conference submission.

We also extend our gratitude to Dr. Lorella Viola for her generous assistance in providing a portion of the corpus used in our analysis.

A. RSA Correlation of Italian and Spanish

[Table residue: "8 cognates not found" and "30 cognates not found"; the remaining values (N° of overlap; 0.88015264, 0.8567517, 0.8563638, ..., 0.3236405, 0.3200544, 0.18371347) cannot be assigned to rows or columns.]

References

[1] W. L. Hamilton, Leskovec, Jurafsky, Diachronic word embeddings reveal statistical laws of semantic change, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1489-1501. URL: https://aclanthology.org/P16-1141. doi:10.18653/v1/P16-1141.

[2] A. S. Uban, A. M. Ciobanu, L. P. Dinu, Studying [...], 2019, pp. 161-166. doi:10.18653/v1/W19-4720.

[3] Dubossarsky, Weinshall, E. Grossman, Outta control: Laws of semantic change and inher[...] Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017. URL: https://aclanthology.org/D17-1118. doi:10.18653/v1/D17-1118.

[4] Pettersson, Megyesi, The histcorp collection of historical corpora and resources, in: Digital Humanities in the Nordic Countries Conference, 2018. URL: https://api.semanticscholar.org/CorpusID:19243754.

[5] Viola, A. M. Fiscarelli, Chroniclitaly 3.0. a deep-learning, contextually enriched digital heritage col[...] in the usa 1898-1936, in: Proceedings of the Conference, 2021. doi:10.5281/zenodo.4596345.

[6] P. Basile, A. Caputo, T. Caselli, P. Cassotti, R. Var[...], in: CLiC-it 2020 Italian Conference on Computational Linguistics 2020, volume 2769, CEUR Workshop Proceedings (CEUR-WS.org), 2020.

[7] Lyding, Stemle, Borghetti, Brunello, Castagnoli, Dell'Orletta, Dittmann, Lenci, Pirrelli, PAISÀ corpus of italian web text, 2013. URL: http://hdl.handle.net/20.500.12124/3, eurac Research CLARIN Centre.

[8] Henny-Krahmer, Corpus de novelas hispanoamericanas del siglo xix (conha19) version 1.0.1, in: Proceedings of the Conference, 2021. doi:10.5281/zenodo.4781947.

[9] Sánchez-Martínez, Martínez-Sempere, X. Ivars-[...] Evaluation 47 (2013) 1327-1342.

[10] Álvarez-Mellado, A corpus of Spanish political speeches from 1937 to 2019, in: Proceedings of [...] Association, Marseille, France, 2020, pp. 928-932. URL: https://aclanthology.org/2020.lrec-1.116.

[11] Cañete, Compilation of large spanish unannotated corpora, Zenodo, 2019.

[12] Bird, Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.

[13] Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017.

[14] A. S. Uban, L. P. Dinu, Automatically building a [...] Language Resources and Evaluation, 2020. URL: https://api.semanticscholar.org/CorpusID:218973843.

[15] Bojanowski, Grave, Joulin, T. Mikolov, En[...] Transactions of the Association for Computational Linguistics 5 (2017) 135-146.

[16] Mikolov, Chen, G. S. Corrado, Dean, Efficient estimation of word representations in vector space, in: International Conference on Learning Representations, 2013. URL: https://api.semanticscholar.org/CorpusID:5959482.

[17] Gower, Generalized procrustes analysis, Psychometrika 40 (1975) 33-51. URL: https://EconPapers.repec.org/RePEc:spr:psycho:v:40:y:1975:i:1:p:33-51.

[18] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gom[...], [...]lor, S. Berg, N. J. Smith, Kern, Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. Fernández del Río, M. Wiebe, P. Peterson, P. Gérard-[...], H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357-362. doi:10.1038/s41586-020-2649-2.

[19] Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998. URL: https://mitpress.mit.edu/9780262561167/.