Kronos-it: a Dataset for the Italian Semantic Change Detection Task

Pierpaolo Basile                     Giovanni Semeraro                    Annalina Caputo
University of Bari A. Moro           University of Bari A. Moro           ADAPT Centre
Dept. Computer Science               Dept. Computer Science               Dublin City University
E. Orabona 4, Italy                  E. Orabona 4, Italy                  Dublin, Ireland
pierpaolo.basile@uniba.it            giovanni.semeraro@uniba.it           annalina.caputo@dcu.ie

Abstract

This paper introduces Kronos-it, a dataset for the evaluation of semantic change point detection algorithms for the Italian language. The dataset is automatically built by using a web scraping strategy. We provide a detailed description of the dataset and its generation, and benchmark four state-of-the-art approaches to semantic change point detection by exploiting the Italian Google n-grams corpus.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Background and Motivation

Computational approaches to the problem of language change have been gaining momentum over the last decade. The availability of long-term and large-scale digital corpora, and the effectiveness of methods for representing words over time, are the prerequisites behind this interest. However, only a few attempts have focused on evaluation, due to two main issues. First, the amount of data involved limits the possibility of performing a manual evaluation; secondly, to date no open dataset for diachronic semantic change has been made available. This last issue has roots in the difficulty of building a gold standard for detecting the semantic change of terms in a specific corpus or language. The result is a fragmented set of data and evaluation protocols, since each work in this area has used different evaluation datasets or metrics. This phenomenon can be gauged from (Tahmasebi et al., 2019), where it is possible to count at least twenty different datasets used for evaluation.

In this paper, we describe how to build a dataset for the evaluation of semantic change point detection algorithms. In particular, we adopt a web scraping strategy for extracting information from an online Italian dictionary. The goal of the extraction is to build a list of lemmas with a set of change points for each lemma. The change points are extracted by analysing information about the year in which a lemma with a specific meaning is observed for the first time. Relying on this information, we build a dataset for the Italian language that can be used to evaluate algorithms for semantic change point detection. We provide a case study in which four different approaches are analysed using a single corpus.

The rest of the article is organised as follows: Section 2 describes how our dataset is built, while Section 3 provides details about the approaches under analysis and the evaluation. Finally, Section 4 closes the paper and outlines possible future work.

2 Dataset Construction

The main goal of the dataset is to provide for each lemma a set of years which indicate a semantic change for that lemma. Some dictionaries provide historical information about meanings, for example the year in which each meaning is observed for the first time. The main problem is that generally these dictionaries are not digitally available, or they are in a format that is not machine readable.

Regarding the Italian language, the dictionary "Sabatini Coletti" is available online at https://dizionari.corriere.it/dizionario_italiano/. It provides, for some lemmas, the year in which each meaning was observed for the first time. For example, taking into account the entry for the word "imbarcata", we capture its original meaning "Group of people who gather to find each other, to leave together", and two other meanings: 1) "Acrobatic manoeuvre of an air-plane", introduced in 1929; and 2) "fall in love", introduced in 1972.

We set up a web scraping algorithm able to extract this information from the dictionary. In particular, the extraction process is composed of several steps (a sketch of the extraction follows the list):

1. Downloading the list of all lemmas occurring in the online dictionary with the corresponding URL. We obtain a list of 34,504 lemmas;

2. For each lemma, extracting the section of the web page containing the definition with the list of all possible meanings. We obtain a final list of 34,446 definitions;

3. For each definition, extracting the year in which that meaning was introduced. For a given lemma, we are not able to assign the correct year to each of its meanings; we can only extract a year associated with the lemma. This happens because the dictionary does not follow a clear template for assigning the year to each meaning. Although associating the year of change with the definition of the meaning is not needed for the purpose of our evaluation, it could help to understand the reason behind the semantic change. We plan to fix this limitation in a future release of the dataset. In the rest of the paper we call change point (CP) each pair (lemma, year);

4. Removing those change points that are expressed in the form "III sec." (third century), because they refer to a broad period of time rather than to a specific year.
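As an illustration of steps 2-4, the following Python sketch extracts the years attached to a single dictionary entry. It is a minimal sketch, not our actual scraper: the function name, the CSS selector and the year pattern are assumptions that must be adapted to the real page structure.

import re

import requests
from bs4 import BeautifulSoup

# Years plausible for the dictionary's range (oldest CP 1758, newest 2003).
YEAR_RE = re.compile(r"\b(?:1[5-9]\d{2}|20[0-1]\d)\b")

def extract_change_points(lemma, entry_url):
    """Return the (lemma, year) change points found in one entry.

    Sketch only: the CSS class "definition" is a placeholder and must
    be replaced with the real selector of the dictionary pages.
    """
    soup = BeautifulSoup(requests.get(entry_url).text, "html.parser")
    block = soup.find("div", class_="definition")
    if block is None:
        return []
    text = block.get_text(" ", strip=True)
    # Step 3: collect the years attached to the entry. Dates expressed
    # as centuries ("III sec.") never match the four-digit year
    # pattern, which implements the filtering of step 4.
    years = sorted({int(y) for y in YEAR_RE.findall(text)})
    return [(lemma, year) for year in years]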
The final dataset, available at https://github.com/pippokill/kronos-it, contains 13,818 lemmas and 13,932 change points. The average number of change points per lemma is 1.0083, with a standard deviation of 0.0924. The maximum number of change points per lemma is 3, and the number of lemmas with more than one change point is 113. The oldest reported change point is 1758, while the most recent one is 2003; this suggests that the dictionary is outdated and does not contain more recent meanings.

The dataset is provided in textual format: each row reports the lemma followed by a list of years, each one representing a change point. For example:

enzima 1892
monopolistico 1972
tamponare 1886 1950
elettroforesi 1931
fuoricorso 1934
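Given this row format, the gold standard can be loaded with a few lines of Python (the function name is ours):

def load_kronos(path):
    """Parse Kronos-it rows of the form 'lemma year1 year2 ...' into a
    dict mapping each lemma to its list of change-point years."""
    gold = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            lemma, *years = line.split()
            gold[lemma] = [int(y) for y in years]
    return gold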
The low number of change points per lemma reflects the fact that, generally, the first meaning carries no information about the year in which it first appeared, or its time period is expressed in the form of a century. This means that all the other meanings are additional meanings introduced after the main one. However, there are some more recent words for which the first year associated with the entry corresponds to the year in which the word is observed for the first time. Unfortunately, it is not easy to automatically discern the two cases.

Finally, we report the distribution of change points over time in Figure 1. The years with a peak are 1942, 1905 and 1869, with 404, 352 and 322 change points respectively.

Figure 1: The distribution of change points over time.

3 Evaluation

For the evaluation we adopt our dataset as gold standard and the Italian Google n-grams (Michel et al., 2011) as corpus, available at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html. Google n-grams provides n-grams extracted from the Google Books project. The corpus is composed of several compressed files. Each file contains tab-separated data; each line has the following format: ngram TAB year TAB match_count TAB volume_count NEWLINE. For example:

parlare di pace e di 2005 4 4
parlare di pace e di 2006 3 3
parlare di pace e di 2007 7 7
parlare di pace e di 2008 2 2
parlare di pace e di 2009 4 4

The first line tells us that in 2005 the 5-gram "parlare di pace e di" occurred 4 times overall, in 4 distinct books.

In particular, we use the 5-grams corpus and limit the analysis to words that occur in at least twenty 5-grams. Moreover, we lowercase words and filter out all words that do not match the following regular expression: [a-zéèàìòù]+. We limit our analysis to the period [1900-2012].

In order to build the context words by using 5-grams, we adopt the technique described in (Ginter and Kanerva, 2014). Given a 5-gram (w_1, w_2, w_3, w_4, w_5), it is possible to build eight pairs: (w_1, w_2) (w_1, w_3) ... (w_1, w_5) and (w_5, w_1) (w_5, w_2) ... (w_5, w_4). Then, for each pair (w_i, w_j), a sliding window method also visits (w_j, w_i), obtaining 16 training examples from each 5-gram.
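Both preprocessing steps can be sketched as follows. This is an illustration under our own assumptions (gzip-compressed input files, hypothetical function names), not the original implementation.

import gzip
import re

TOKEN_RE = re.compile(r"^[a-zéèàìòù]+$")  # the word filter described above

def read_5grams(path):
    """Stream (tokens, year, match_count) from a Google 5-grams file
    (rows: ngram TAB year TAB match_count TAB volume_count), applying
    the lowercasing, character and period [1900-2012] filters."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, year, matches, _volumes = line.rstrip("\n").split("\t")
            tokens = [t.lower() for t in ngram.split()]
            if (len(tokens) == 5
                    and all(TOKEN_RE.match(t) for t in tokens)
                    and 1900 <= int(year) <= 2012):
                yield tokens, int(year), int(matches)

def training_pairs(w):
    """Build the 16 training examples of Ginter and Kanerva (2014):
    the first and the last word paired with the other four words,
    and every pair also visited in the opposite direction."""
    pairs = [(w[0], w[j]) for j in range(1, 5)]
    pairs += [(w[4], w[j]) for j in range(4)]
    pairs += [(b, a) for (a, b) in pairs]  # the window also visits (w_j, w_i)
    return pairs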
We investigate four systems for representing words over time and then apply a strategy for extracting change points from each technique. Finally, we evaluate the accuracy of each approach by using our dataset as gold standard.

3.1 Representing words over time

We adopt four techniques for representing words over time. The first strategy is based only on word co-occurrences; the other three exploit Distributional Semantic Models (DSM). In particular, the techniques are:

Collocation. This approach is very simple and is used as a baseline. The idea is to extract, for each word and each time period, the set of relevant collocations. A collocation is a sequence of words that co-occur more often than would be expected by chance. We extract the collocations by analysing the word pairs extracted from 5-grams and score each word pair using the Dice score:

    dice(w_a, w_b) = 2 * f_ab / (f_a + f_b)    (1)

where f_ab is the number of times that the words w_a and w_b occur together, and f_a and f_b are respectively the number of times that w_a and w_b occur in the corpus. Since the Dice score is independent of the corpus size, it is possible to build, for each word and each time period, a list of collocations by considering only the collocations occurring in a specific period of time. In order to consider only a restricted number of collocations, we take into account only the collocations with a Dice value above 0.0001. A sketch of this computation is given at the end of this subsection. For each word and each time period we obtain a list of collocations with the associated Dice score. For example, a portion of the list of collocations for the word pace (peace) in the period 1980-1984 is reported as follows:

pace guerra 0.007223173
pace giustizia 0.0068931305
pace trattati 0.0067062946
pace trattative 0.006033537

Temporal Random Indexing (TRI). TRI (Jurgens and Stevens, 2009) is able to build a word space for each time period, where each space is comparable to the others. In each space a word is represented by a dense vector, and it is possible to compute the cosine similarity between word vectors across time periods. In order to build comparable word spaces, TRI relies on the incremental property of Random Indexing (Sahlgren, 2005). More details are provided in (Basile et al., 2014) and (Basile et al., 2016).

Temporal Word Analogies (TWA). This approach builds diachronic word embeddings starting from independent embedding spaces for each time period. The output of this process is a common vector space where word embeddings are used for computing temporal word analogies: word w_1 at time t_i is like word w_2 at time t_j. We build the independent embedding spaces by using the C implementation of word2vec with default parameters (Mikolov et al., 2013). More details about this approach are reported in (Szymanski, 2017).

Procrustes (HIST). This approach aligns the learned low-dimensional embeddings by preserving cosine similarities across time periods. More details are available in (Hamilton et al., 2016). We apply the alignment to the same embeddings created for TWA.

All approaches are built using the same vocabulary and the same context words generated from the 5-grams, as previously explained.
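As anticipated above, the Dice computation of equation (1) over a stream of word pairs can be sketched as follows. Note one simplification made for brevity: the frequencies f_a and f_b are estimated from the pair stream itself, whereas in our pipeline they are corpus frequencies.

from collections import Counter

def collocations(pairs, min_dice=0.0001):
    """Score co-occurring word pairs with the Dice coefficient of
    equation (1), keeping only scores above the 0.0001 threshold."""
    pair_freq = Counter(pairs)
    word_freq = Counter()
    for (a, b), f in pair_freq.items():
        # Simplification: f_a and f_b are derived from the pairs;
        # in the paper they are corpus frequencies.
        word_freq[a] += f
        word_freq[b] += f
    scores = {}
    for (a, b), f in pair_freq.items():
        dice = 2 * f / (word_freq[a] + word_freq[b])
        if dice >= min_dice:
            scores[(a, b)] = dice
    return scores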
3.2 Building the time series

In order to track how the semantics of a word changes over time, we need to build a time series: a sequence of values, one for each time period, that indicates the semantic shift of that word in the specific period. In our evaluation, we split the interval [1900-2012] into time periods of five years each.

The time series are computed in different ways according to the strategy used for representing the words. In particular, the values of the time series Γ(w_i) associated with the word w_i are computed as follows:

• Collocation: given two lists of collocations related to two different periods, we compute the cosine similarity between the two lists by considering each list as a Bag-of-Collocations (BoC). In this case, each point k of the series Γ(w_i) is the cosine similarity between the BoC at time T_{k-1} and the BoC at time T_k;

• TRI: we use two strategies (point-wise and cumulative), as proposed in (Basile et al., 2016). The point-wise approach captures how the word vector changes between two time periods, while the cumulative approach captures how the word vector changes with respect to all the previous periods. In the point-wise approach, each point k of Γ(w_i) is the cosine similarity between the word vector at time T_{k-1} and the word vector at time T_k, while in the cumulative approach the point k is computed as the cosine similarity between the average of the word vectors over all the previous time periods T_0, T_1, ..., T_{k-1} and the word vector at time T_k;

• TWA: we exploit the word analogies across time and the common vector space for capturing how a word embedding changes across two time periods, as reported in (Szymanski, 2017);

• HIST: time series are built by using the pair-wise similarity, as explained in (Hamilton et al., 2016).

We obtain seven time series, as reported in Tables 1 and 2. In particular: BoC is built on temporal collocations; TRI_point and TRI_cum are based on TRI using respectively the point-wise and the cumulative approach; TWA_int and TWA_uni are built using TWA on the words that are common (intersection) to all the periods (TWA_int) and on the union of words (TWA_uni). The same procedure is used for HIST, obtaining the two time series HIST_int and HIST_uni.

For finding significant change points in a time series, we adopt the strategy proposed in (Kulkarni et al., 2015) based on the Mean Shift Model (Taylor, 2000); a sketch of both the series construction and the mean shift statistic follows.
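The sketch below, under our own naming, builds the point-wise and cumulative series for a word, given one vector per five-year period, and computes the mean shift statistic. The bootstrap significance test that Kulkarni et al. (2015) apply on top of the mean shift is omitted.

import numpy as np

def _cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pointwise_series(vectors):
    """Point k of Γ(w): cosine similarity between the word vector at
    time T_{k-1} and the word vector at time T_k."""
    return [_cos(vectors[k - 1], vectors[k])
            for k in range(1, len(vectors))]

def cumulative_series(vectors):
    """Point k of Γ(w): cosine similarity between the average of the
    word vectors at T_0 ... T_{k-1} and the word vector at T_k."""
    return [_cos(np.mean(vectors[:k], axis=0), vectors[k])
            for k in range(1, len(vectors))]

def mean_shift(series):
    """Mean shift at each index j (Taylor, 2000): mean of the series
    after j minus the mean up to j. Kulkarni et al. (2015) flag a
    change point where this shift is statistically significant under
    a bootstrap test, not reproduced here."""
    s = np.asarray(series, dtype=float)
    return [float(s[j + 1:].mean() - s[:j + 1].mean())
            for j in range(len(s) - 1)]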
3.3 Metrics

We compute the performance of each approach using Precision, Recall and F-measure. However, assessing the correctness of the change points generated by each system is not an easy task. A change point is defined as a pair (lemma, year). In order to adopt a soft match, when we compare the change points provided by a system with the change points reported in the gold standard, we take into account the absolute value of the difference between the year predicted by the system and the year provided in the gold standard.

As a first evaluation (exact match), we require the difference between the detected year and the gold standard to be less than or equal to five, which is the time period span of our corpus. As a second evaluation (soft match), we require only that the predicted year is greater than or equal to the change point in the gold standard. This is a common methodology adopted in previous work. A sketch of both criteria closes this subsection.

For a fairer evaluation, we perform the following steps:

• We remove from the gold standard all the change points that are outside the period under analysis ([1900-2012]);

• We remove from the gold standard all the words that are not represented in the model under evaluation. This operation is necessary because (1) the previous filtering step can exclude some words; (2) there are words that do not appear in the original corpus.

Since the gold standard contains lemmas and not words, we lemmatize each system output using Morph-it! (Zanchetta and Baroni, 2005).
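The two matching criteria can be made concrete as in the following sketch, which consumes the gold dictionary produced by load_kronos above. This is one plausible reading of the protocol, since the paper does not spell out how multiple predictions per lemma are tallied.

def evaluate(predicted, gold, mode="exact", span=5):
    """Precision/Recall/F over (lemma, year) change points.

    Exact match: |predicted year - gold year| <= span (the five-year
    period width). Soft match: predicted year >= gold year.
    """
    def hit(year, gold_years):
        if mode == "exact":
            return any(abs(year - g) <= span for g in gold_years)
        return any(year >= g for g in gold_years)

    tp = sum(1 for lemma, year in predicted
             if hit(year, gold.get(lemma, [])))
    p = tp / len(predicted) if predicted else 0.0
    total_gold = sum(len(ys) for ys in gold.values())
    r = tp / total_gold if total_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f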
3.4 Results

Results in terms of Precision (P), Recall (R) and F-measure (F) are reported in Table 1. We can observe that we generally obtain a low F-measure; this is due to the large number of false positive change points detected by each system.

             exact match              soft match
Γ            P      R      F          P      R       F
BoC          .0034  .0084  .0049      .0274  .0670   .0389
TRI_point    .0056  .0394  .0098      .0248  .1750   .0434
TRI_cum      .0058  .0387  .0101      .0251  .1672   .0436
TWA_int      .0034  .0009  .0015      .0165  .0046   .0072
TWA_uni      .0052  .0060  .0056      .0373  .0435   .0402
HIST_int     .0024  .0048  .0032      .0111  .02211  .0148
HIST_uni     .0022  .0066  .0033      .0118  .0356   .0177

Table 1: Results of the evaluation.

             exact match              soft match
Γ            P      R      F          P      R       F
BoC          .0361  .1243  .0560      .2881  .9930   .4466
TRI_point    .0581  .2244  .0923      .2581  .9973   .4100
TRI_cum      .0610  .2308  .0959      .2617  .9979   .4146
TWA_int      .0402  .2000  .0670      .1960  .9750   .3264
TWA_uni      .0526  .1367  .0759      .3794  .9866   .5480
HIST_int     .0344  .2147  .0593      .1569  .9791   .2704
HIST_uni     .0314  .1842  .0536      .1675  .9836   .2863

Table 2: Results of the evaluation obtained by considering only common lemmas between the gold standard and the system output.

The best approach in both evaluations is TRI_cum. Considering the exact match evaluation, the difference in performance is remarkable, since TRI generally has a high recall. In the soft match evaluation, TWA_uni obtains the best precision, while the simple BoC method is able to achieve good results compared with more complex approaches such as TWA_int and HIST.

The results of the evaluation prove that the task of semantic change detection is very challenging; in particular, the large number of false positives drastically affects the performance. Further analyses are necessary to understand which component affects the performance. In this preliminary evaluation, we adopt a single approach for detecting the semantic shift; an extended benchmark is necessary for evaluating several approaches for detecting semantic change points.

The systems are built on a vocabulary that is larger than both the original dictionary and the gold standard. For that reason, we provide an additional evaluation in which we perform an ideal analysis by evaluating only the lemmas that are common to the gold standard and the system output. The goal of this analysis is to measure the ability to correctly identify change points for those lemmas that are represented in both the gold standard and the system. The results of this further evaluation are provided in Table 2. For the exact match evaluation, TRI_cum obtains the best F-measure, as in the first evaluation, while TWA_uni achieves a very good performance in the soft match evaluation.

The plot in Figure 2 reports how the F-measure increases according to the time span that we adopt in the soft match. In particular, the X-axis reports the maximum absolute difference between the year in the gold standard and the year predicted by the system. We can observe that under 20 years TRI provides better performance than TWA, and after 60 years all the approaches reach a stable F-measure value.

Figure 2: The plot shows how the F-measure increases according to the time span used in the soft match.

4 Conclusion and Future Work

In this paper, we provide details about the construction of a dataset for the evaluation of semantic change point detection algorithms. In particular, our dataset focuses on the Italian language and is built by adopting a web scraping strategy. We provide a usage example of our dataset by evaluating several approaches for the representation of words over time. The results prove that the task of detecting semantic shift is challenging due to the large number of detected false positives. As future work, we plan to investigate further methods for building time series and detecting semantic shifts in order to improve the overall performance. Moreover, we plan to fix some issues of our extraction process in order to improve the quality of the dataset itself.

Acknowledgements

This work was supported by the ADAPT Centre for Digital Content Technology, funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant SFI 13/RC/2106) and co-funded under the European Regional Development Fund, and by the European Union's Horizon 2020 (EU2020) research and innovation programme under the Marie Skłodowska-Curie grant agreement No. EU2020 713567. The computational work has been executed on the IT resources made available by two projects, ReCaS and PRISMA, funded by MIUR under the programme "PON R&C 2007-2013".

References

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. Analysing word meaning over time by exploiting temporal random indexing. In First Italian Conference on Computational Linguistics CLiC-it.

Pierpaolo Basile, Annalina Caputo, Roberta Luisi, and Giovanni Semeraro. 2016. Diachronic analysis of the Italian language exploiting Google Ngram. CLiC-it, page 56.

Filip Ginter and Jenna Kanerva. 2014. Fast training of word2vec representations using n-gram corpora.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

David Jurgens and Keith Stevens. 2009. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in Emerging Text Types, pages 9-16. Association for Computational Linguistics.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625-635. International World Wide Web Conferences Steering Committee.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176-182.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Magnus Sahlgren. 2005. An introduction to random indexing.

Terrence Szymanski. 2017. Temporal word analogies: Identifying lexical replacement with diachronic word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 448-453, Vancouver, Canada, July. Association for Computational Linguistics.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2019. Survey of computational approaches to lexical semantic change. arXiv:1811.06278v2.

Wayne A. Taylor. 2000. Change-point analysis: a powerful new tool for detecting changes.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. In Proceedings of Corpus Linguistics.