Diachronic Analysis of the Italian Language exploiting Google Ngram

Pierpaolo Basile¹, Annalina Caputo¹, Roberta Luisi², Giovanni Semeraro¹
Department of Computer Science, University of Bari Aldo Moro, Via E. Orabona 4, 70125 Bari (Italy)
¹ {firstname.surname}@uniba.it  ² roby.luisi@gmail.com

Abstract

English. In this paper, we propose several methods for the diachronic analysis of the Italian language. We build several models by exploiting Temporal Random Indexing and the Google Ngram dataset for the Italian language. Each proposed method is evaluated on its ability to automatically identify meaning shifts over time. To this end, we introduce a new dataset built by looking at the etymological information reported in some dictionaries.

Italiano. In this work we propose several methods for the diachronic analysis of the Italian language. We built different models using the Temporal Random Indexing technique and Google Ngram for Italian. Each proposed method has been evaluated on its ability to automatically identify meaning shifts over time. To this end we introduce a new dataset built from the etymological information found in some dictionaries.

1 Motivation and Background

Languages can be studied from two different and complementary viewpoints: the diachronic perspective considers the evolution of a language over time, while the synchronic perspective describes the language rules at a specific point in time, without taking its history into account (De Saussure, 1983). In this work, we focus on the diachronic approach, since language appears to be unquestionably immersed in the temporal dimension. Language is subject to a constant evolution driven by the need to reflect the continuous changes of the world. The evolution of word meanings has been studied for several centuries, but this kind of investigation has been limited by the small amount of data on which the analysis could be performed. Moreover, in order to reveal structural changes in word meanings, the analysis has to cover long periods of time.

Nowadays, the large amount of digital content opens new perspectives for the diachronic analysis of language, but it also requires efficient computational approaches. In this scenario, Distributional Semantic Models (DSMs) represent a promising solution. DSMs are able to represent words as points in a geometric space, generally called a WordSpace (Schütze, 1993; Sahlgren, 2006), simply by analysing how words are used in a corpus. However, a WordSpace represents a snapshot of a specific corpus and does not take temporal information into account.

Since its first release, the Google Ngram dataset (Michel et al., 2011) has inspired a lot of work on the analysis of cultural trends and linguistic variations. Moving away from mere frequentist approaches, DSMs have proved to be quite effective in measuring meaning shifts through the analysis of variations in word co-occurrences. One of the earliest attempts can be dated back to Gulordava and Baroni (2011), where a co-occurrence matrix is used to model the semantics of terms. In that model, similarly to ours, the cosine similarity between the vectors representing a term in two different periods is exploited as a predictor of meaning shift: low values suggest a change in the words that co-occur with the target.
The co-occurrence matrix is computed with local mutual information scores, and the context elements are fixed across the different time periods, hence the spaces are directly comparable. However, this kind of direct comparison does not hold when the vector representation is manipulated, as in reduction methods (SVD) or learning approaches (word2vec). In these cases, each space has its own coordinate axes, and some kind of alignment between spaces is required. To this end, Hamilton et al. (2016) use orthogonal Procrustes, while Kulkarni et al. (2015a) learn a transformation matrix.

In this paper, we propose an evolution of our previous work (Basile et al., 2014; Basile et al., 2015) for analysing word meanings over time. This model, differently from those of Hamilton et al. (2016) and Kulkarni et al. (2015a), creates a different WordSpace for each time period in terms of the same common random vectors; the resulting word vectors are therefore directly comparable with one another. In particular, we propose an efficient method for building a DSM that takes temporal information into account and relies on a very large corpus: the Google Ngram dataset for the Italian language. Moreover, for the first time, we provide a dataset for the evaluation of word meaning change point detection specifically set up for the Italian language.

The paper is structured as follows: Section 2 provides details about our methodology, while Section 3 describes the dataset that we have developed and the results of a preliminary evaluation. Section 4 reports final remarks and future work.

2 Methodology

Our method has its roots in a previous model based on Temporal Random Indexing (TRI) (Basile et al., 2014; Basile et al., 2015). In particular, we evolve the TRI approach in two directions: 1) we improve the system in order to manage very large datasets, such as Google Ngram; 2) we introduce a new approach based on Reflective Random Indexing (RRI) (Cohen et al., 2010), with the aim of identifying indirect inferences that can lead to the discovery of implicit connections between word meanings.

The idea behind TRI is to build a different WordSpace for each time period that we want to analyse. The peculiarity of TRI is that word vectors over different time periods are directly comparable because they are built using the same random vectors. In particular, TRI works as follows (a code sketch is given after this list):

1. Given a corpus C of documents and a vocabulary V of terms¹ extracted from C, the method assigns a random vector $r_i$ to each term $t_i \in V$. A random vector is a sparse vector whose values lie in {−1, 0, 1}, with few non-zero elements randomly distributed along its dimensions. The set of random vectors assigned to all terms in V is near-orthogonal;

2. The corpus C is split into different time periods $T_k$ using temporal information, for example the year of publication;

3. For each period $T_k$, a WordSpace $WS_k$ is built. All the terms of V occurring in $T_k$ are represented by a semantic vector. The semantic vector $sv_i^k$ for the i-th term in $T_k$ is built as the sum of the random vectors of all the terms co-occurring with $t_i$ in $T_k$. When computing the sum, we weigh each random vector with a formula based on inverse document frequency: the weight is computed as $w(r_i) = \log\left(\frac{C_k}{\#t_i^k}\right)$, where $C_k$ is the total number of occurrences in $T_k$ and $\#t_i^k$ is the number of occurrences of the term $t_i$ in $T_k$. The idea is to give less weight to the most frequent words.

In this way, the semantic vectors across all time periods are comparable, since they are the sum of the same random vectors.

RRI can be implemented by repeating steps 2 and 3 several times, where at each iteration the random vectors are replaced by the semantic vectors built in the previous step. The idea is to model implicit connections between terms that never co-occur together, but that frequently occur with other shared terms.

¹ The terms that we want to analyse. Usually, the most frequent terms are extracted.
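What follows is a minimal sketch of steps 1–3 in Python; it is not our actual implementation. The helper names (`random_vector`, `build_space`), the co-occurrence window size, and the toy vocabulary and periods are illustrative assumptions; the dimension of 1,000 and the two non-zero elements match the setup reported in Section 3.

```python
import numpy as np
from collections import defaultdict
from math import log

DIM = 1000  # vector dimension used in our experiments

def random_vector(rng, dim=DIM, non_zero=2):
    """Step 1: sparse ternary vector with `non_zero` entries drawn from {-1, +1}."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=non_zero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=non_zero)
    return v

def build_space(docs_k, rvecs, window=2):
    """Step 3: WordSpace for one period T_k; each term's semantic vector is the
    sum of the idf-weighted random vectors of the terms co-occurring with it."""
    C_k = sum(len(doc) for doc in docs_k)          # total occurrences in T_k
    counts = defaultdict(int)                      # #t^k for every term
    for doc in docs_k:
        for t in doc:
            counts[t] += 1
    space = defaultdict(lambda: np.zeros(DIM))
    for doc in docs_k:
        for i, t in enumerate(doc):
            if t not in rvecs:                     # represent only terms of V
                continue
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                c = doc[j]
                if j != i and c in rvecs:
                    space[t] += log(C_k / counts[c]) * rvecs[c]
    return space

# Shared random vectors make the spaces of different periods comparable.
rng = np.random.default_rng(42)
vocabulary = ["la", "rete", "televisiva", "internet"]   # toy vocabulary
rvecs = {t: random_vector(rng) for t in vocabulary}
periods = {"1990-1999": [["la", "rete", "televisiva"]],
           "2000-2009": [["la", "rete", "internet"]]}
spaces = {k: build_space(docs, rvecs) for k, docs in periods.items()}
# For RRI, call build_space again with rvecs replaced by the previous spaces.
```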
The next two sub-sections provide details about the Google Ngram dataset and the method used to automatically detect word meaning shifts.

2.1 Google Ngram

Google Ngram is a very large dataset containing all the n-grams (up to five tokens) extracted from Google Books. It is built by analysing over five million books spanning the years from 1500 to 2012, although the developers estimate that the most reliable period is from 1800 to 2012. The dataset covers several languages, including Italian. For each language, several compressed files are released. Each line of a file reports the following information: ngram, year, match count, volume count. For example, the line "analysis is often described as 1991 104 5" means that the 5-gram "analysis is often described as" occurs 104 times in 5 books in the year 1991.

We modify TRI to build the WordSpaces directly from the Google Ngram dataset. In particular, we need a pre-processing step in which we split the n-grams into several files according to the time periods we want to analyse. For example, if we fix the size of a time period to ten years, from 1850 to 2012, we build one file for each period: T1 = 1850-1859, T2 = 1860-1869, . . . , T16 = 2000-2009, T17 = 2010-2012. Each file contains only the n-grams that occur in the specific time period. We remove the information about the year and the book count, since they are not useful in the subsequent steps. Considering the previous example, the line "analysis is often described as 104" will be stored in the file 1990-1999. After this pre-processing step, we can easily run TRI and RRI, where RRI can be repeated multiple times. A sketch of this pre-processing step is given below.
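The splitting step can be sketched as follows, assuming the lines have already been decompressed and that the fields are tab-separated (an assumption about the raw file format); the `bucket` and `split_by_period` helpers are illustrative, not part of the released tool.

```python
def bucket(year, start=1850, end=2012, size=10):
    """Map a year to its period label, e.g. 1991 -> '1990-1999'."""
    if year < start or year > end:
        return None
    lo = start + ((year - start) // size) * size
    return f"{lo}-{min(lo + size - 1, end)}"       # last bucket is 2010-2012

def split_by_period(lines, writers):
    """Route each '<ngram> <year> <match_count> <volume_count>' record to the
    file of its period, keeping only the n-gram and its match count."""
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").rsplit("\t", 3)
        period = bucket(int(year))
        if period is not None:                     # drop years outside the range
            writers[period].write(f"{ngram}\t{match_count}\n")

# Usage sketch: writers = {p: open(p, "w") for p in ("1850-1859", ..., "2010-2012")}
```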
2.2 Change point detection

To track word meaning change over time, for each term $t_i$ we build a time series $\Gamma(t_i)$. A time series is a sequence of values, one for each time period, each indicating the semantic shift of the term in that period. We adopt several strategies for building time series. The first strategy is based on term log-frequency: each value in the series is defined as $\Gamma_k(t_i) = \log\left(\frac{\#t_i^k}{C_k}\right)$.

In order to exploit the ability of our methods to compute vector similarity across time periods, we define two further strategies for building the time series:

point-wise: $\Gamma_k(t_i)$ is defined as the cosine similarity between $sv_i^k$ and $sv_i^{k-1}$. In this way, we want to capture vector changes between two consecutive time periods;

cumulative: we build a cumulative vector $sv_i^{C_{k-1}} = \sum_{j=0}^{k-1} sv_i^j$ and compute its cosine similarity with the vector $sv_i^k$. The idea is that the semantics at point k − 1 depends on the semantics of all the previous time periods.

Given a time series, we need a method for finding significant change points in it. We adopt the strategy proposed in (Kulkarni et al., 2015b), based on the mean shift model (Taylor, 2000). According to this model, we define the mean shift of a general time series Γ of length l, pivoted at time period j, as:

$$K(\Gamma) = \frac{1}{l-j}\sum_{k=j+1}^{l}\Gamma_k - \frac{1}{j}\sum_{k=1}^{j}\Gamma_k \qquad (1)$$

In order to establish whether a mean shift is statistically significant at time j, we adopt a bootstrapping approach (Efron and Tibshirani, 1994) under the null hypothesis that there is no change in the mean. In particular, statistical significance is computed by first constructing B bootstrap samples by permuting $\Gamma(t_i)$. Then, for each bootstrap sample P, K(P) is calculated to provide its corresponding bootstrap statistic, and the statistical significance (p-value) of the mean shift observed at time j is obtained by comparison with this null distribution. Finally, we estimate the change point as the time point j with the minimum p-value. Since multiple words can have the same p-value, we sort them according to their frequency. The output of this process is a ranking of words that have potentially changed their meaning. A sketch of the detection procedure is given below.
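A minimal sketch of the point-wise series and the mean shift test follows. It assumes a two-sided test on the absolute shift and omits the frequency-based tie-breaking; both choices, like the helper names, are our illustrative assumptions rather than the exact procedure of Kulkarni et al. (2015b).

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pointwise_series(svs):
    """Gamma_k = cosine(sv^k, sv^{k-1}) for consecutive periods."""
    return np.array([cosine(svs[k], svs[k - 1]) for k in range(1, len(svs))])

def mean_shift(series, j):
    """K(Gamma) pivoted at j (Eq. 1): mean after the pivot minus mean up to it."""
    return series[j:].mean() - series[:j].mean()

def change_point(series, n_boot=1000, seed=42):
    """Pivot with the smallest bootstrap p-value under the no-change null."""
    rng = np.random.default_rng(seed)
    best = (None, 1.0)
    for j in range(1, len(series)):
        observed = mean_shift(series, j)
        null = np.array([mean_shift(rng.permutation(series), j)
                         for _ in range(n_boot)])
        p = float(np.mean(np.abs(null) >= abs(observed)))
        if p < best[1]:
            best = (j, p)
    return best   # e.g. (4, 0.002): a shift detected at the fifth period
```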
3 Evaluation

The goal of the evaluation is twofold: 1) to build a standard benchmark for meaning shift detection in the Italian language; 2) to evaluate the performance of the proposed methods and compare them with a baseline model based on word frequency.

Since a list of meaning shifts for the Italian language is not available, we build a new dataset using a pooling strategy. In particular, we retrieve the list of meaning shifts, as explained in Section 2.2, using the cumulative strategy for each of the following methods: word frequency, TRI, TRRI with one iteration, and TRRI with two iterations. Taking the first 50 words returned by each system, we manually check for each word whether a meaning shift occurs by exploiting two dictionaries: the "Sabatini Coletti", available on-line², and the "Dizionario Etimologico Zanichelli", available on CD-ROM. Finally, we obtain a gold standard that consists of 40 words and their corresponding change points.

All the methods, with the exception of word frequency, are built using co-occurrence information extracted from the 5-grams of the Italian Google Ngram dataset. The vector dimension is set to 1,000 for all the approaches based on Random Indexing, using two non-zero elements in the random vectors.

We adopt accuracy as the evaluation metric. Given a list of n change points returned by the system, we compute the ratio between the number of change points correctly identified in the gold standard³ and n. In order to identify the correct change points, we consider not only the word⁴, but also the year of the change point: the year predicted by the system must be equal to or greater than one of the years reported in the gold standard (a sketch of this computation is given at the end of this section). We compute the accuracy using different values of n (10, 100, ALL). The results of the evaluation are reported in Table 1. In particular, we evaluate seven systems: logfreq is the baseline based on word frequency; TRI is the Temporal Random Indexing method; TRRI1 is the Temporal Reflective Random Indexing with one iteration, while TRRI2 adopts two iterations. For the methods based on Random Indexing, we investigate both the point-wise and the cumulative strategy to compute the change points.

Table 1: Results of the evaluation.

Method        acc@10   acc@100   ALL
TRI_point     0.0247   0.1111    0.3086
TRI_cum       0.0123   0.0247    0.2963
TRRI1_point   0.0000   0.0247    0.2716
logfreq       0.0247   0.1111    0.2346
TRRI2_point   0.0000   0.0370    0.1728
TRRI1_cum     0.0000   0.0000    0.1605
TRRI2_cum     0.0000   0.0000    0.1235

The analysis of the results shows that TRI generally provides better results than TRRI. Moreover, the point-wise strategy always outperforms the cumulative one. The baseline has the same accuracy as TRI for both acc@10 and acc@100, while it performs worse than TRI and TRRI1 when the accuracy is computed over the whole list of terms (ALL). These results suggest that, while there are not many differences between the two methods on shorter result lists, TRI is actually able to detect more meaning shifts over a larger set of terms. TRRI2 always provides the worst results; we speculate that two iterations introduce too much noise into the model. A closer scrutiny of the list of words provided by TRRI2 highlights the presence of many foreign words: a simplistic conclusion might be that this approach is able to identify foreign terms introduced into the Italian language. However, we think that the output of this method deserves further investigation by means of an ad-hoc evaluation.

Since the evaluation is based on the predicted year, which only has to be equal to or greater than one of the years reported in the gold standard, we conduct a further analysis to measure how far the prediction is from the exact value. In particular, we compute the mean and the standard deviation of the differences between the predicted and the exact year. The results of this analysis are reported in Table 2. We observe that both TRRI1_cum and TRRI2_cum produce the best results despite their low accuracy, while TRI_cum reports the best trade-off between accuracy and precision in detecting the correct year. It is important to underline that the size of the time interval influences this kind of analysis: if the algorithm predicts 1900, the change point could have happened anywhere in the interval 1900-1909⁵. As future work, we plan to design a more accurate analysis by exploring a time interval set to one year.

Table 2: Mean and standard deviation, in years, of the differences between the predicted and the exact year.

Method        Mean     Std. Deviation
TRI_point     38.04    34.90
TRI_cum       26.45    19.60
TRRI1_point   65.86    49.96
logfreq       24.15    16.19
TRRI2_point   54.50    52.70
TRRI1_cum     16.61    14.62
TRRI2_cum     19.40    19.85

² http://dizionari.corriere.it/dizionario_italiano/
³ The gold standard adopted in this evaluation is available here: https://dl.dropboxusercontent.com/u/16026979/data/TRI_CLIC_2016_change_word.
⁴ The word matching is performed taking into account also the inflected forms.
⁵ In our experiment, the size of the time interval is set to ten years.
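As referenced above, the accuracy computation can be sketched as follows. The data shapes are illustrative assumptions, and the matching of inflected forms (footnote 4) is omitted for brevity.

```python
def accuracy_at_n(ranked, gold, n=None):
    """`ranked`: (word, predicted_year) pairs sorted by ascending p-value.
    `gold`: word -> list of attested change years. A prediction counts as
    correct when the word is in the gold standard and the predicted year is
    equal to or greater than one of its attested years."""
    top = ranked if n is None else ranked[:n]
    hits = sum(1 for word, year in top
               if word in gold and any(year >= g for g in gold[word]))
    return hits / len(top)

# Illustrative call: accuracy_at_n(ranked, {"rete": [1995]}, n=10)
```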
4 Conclusions

In this work we proposed several methods based on Random Indexing for the diachronic analysis of the Italian language. We built a dataset for the evaluation of meaning shift by exploiting etymological information taken from two Italian dictionaries. We compared our approaches against a baseline based on word frequency, obtaining promising results. In particular, the TRI method showed a better capability of retrieving meaning shifts over a longer list of terms. As future work, we plan to extend the dataset with further words and to investigate other methods based on word embeddings.

Acknowledgement

This work is partially supported by the project "Multilingual Entity Liking" funded by the Apulia Region under the program FutureInResearch.

References

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. Analysing word meaning over time by exploiting temporal random indexing. In Roberto Basili, Alessandro Lenci, and Bernardo Magnini, editors, First Italian Conference on Computational Linguistics CLiC-it 2014. Pisa University Press.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2015. Temporal random indexing: A system for analysing word meaning over time. Italian Journal of Computational Linguistics, 1(1):55–68.

Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Ferdinand De Saussure. 1983. Course in General Linguistics. Open Court, La Salle, Illinois.

Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. Chapman and Hall/CRC.

Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, UK. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. CoRR, abs/1605.09096.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015a. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, pages 625–635, New York, NY, USA. ACM.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015b. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635. ACM.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University.

Hinrich Schütze. 1993. Word space. Advances in Neural Information Processing Systems, 5:895–902.

Wayne A. Taylor. 2000. Change-point analysis: A powerful new tool for detecting changes. Taylor Enterprises, Inc.