Diachronic Analysis of the Italian Language exploiting Google Ngram

Pierpaolo Basile¹, Annalina Caputo¹, Roberta Luisi², Giovanni Semeraro¹
Department of Computer Science, University of Bari Aldo Moro, Via E. Orabona 4, 70125 Bari (Italy)
¹ {firstname.surname}@uniba.it  ² roby.luisi@gmail.com

Abstract

English. In this paper, we propose several methods for the diachronic analysis of the Italian language. We build several models by exploiting Temporal Random Indexing and the Google Ngram dataset for the Italian language. Each proposed method is evaluated on its ability to automatically identify meaning shifts over time. To this end, we introduce a new dataset built by looking at the etymological information reported in some dictionaries.

Italiano. In this work we propose several methods for the diachronic analysis of the Italian language. We built different models using the Temporal Random Indexing technique and Google Ngram for Italian. Each proposed method has been evaluated on its ability to automatically identify meaning shifts over time. To this end we introduce a new dataset built from the etymological information found in some dictionaries.

1 Motivation and Background

Languages can be studied from two different and complementary viewpoints: the diachronic perspective considers the evolution of a language over time, while the synchronic perspective describes the language rules at a specific point in time, without taking its history into account (De Saussure, 1983). In this work, we focus on the diachronic approach, since language appears to be unquestionably immersed in the temporal dimension. Language is subject to a constant evolution driven by the need to reflect the continuous changes of the world. The evolution of word meanings has been studied for several centuries, but this kind of investigation has been limited by the small amount of data on which the analysis could be performed. Moreover, in order to reveal structural changes in word meanings, the analysis has to cover long periods of time.

Nowadays, the large amount of digital content opens new perspectives for the diachronic analysis of language, but it also requires efficient computational approaches. In this scenario, Distributional Semantic Models (DSMs) represent a promising solution. DSMs are able to represent words as points in a geometric space, generally called a WordSpace (Schütze, 1993; Sahlgren, 2006), simply by analysing how words are used in a corpus. However, a WordSpace represents a snapshot of a specific corpus and does not take temporal information into account.

Since its first release, the Google Ngram dataset (Michel et al., 2011) has inspired a lot of work on the analysis of cultural trends and linguistic variations. Moving away from mere frequentist approaches, DSMs have proved to be quite effective in measuring meaning shifts through the analysis of variations in word co-occurrences. One of the earliest attempts can be dated back to Gulordava and Baroni (2011), where a co-occurrence matrix is used to model the semantics of terms. In that model, similarly to ours, the cosine similarity between the vectors representing a term in two different periods is exploited as a predictor of meaning shift: low values suggest a change in the words that co-occur with the target.
The co-occurrence matrix is computed with local mutual information scores, and the context elements are fixed across the different time periods, hence the spaces are directly comparable. However, this kind of direct comparison does not hold when the vector representation is manipulated, as in reduction methods (SVD) or learning approaches (word2vec). In these cases, each space has its own coordinate axes, and some kind of alignment between spaces is required. To this end, Hamilton et al. (2016) use orthogonal Procrustes, while Kulkarni et al. (2015a) learn a transformation matrix.

In this paper, we propose an evolution of our previous work (Basile et al., 2014; Basile et al., 2015) for analysing word meanings over time. This model, differently from those of Hamilton et al. (2016) and Kulkarni et al. (2015a), creates a different WordSpace for each time period in terms of the same common random vectors; the resulting word vectors are therefore directly comparable with one another. In particular, we propose an efficient method for building a DSM that takes temporal information into account and relies on a very large corpus: the Google Ngram dataset for the Italian language. Moreover, for the first time, we provide a dataset for the evaluation of word meaning change point detection specifically set up for the Italian language.

The paper is structured as follows: Section 2 provides details about our methodology, while Section 3 describes the dataset that we have developed and the results of a preliminary evaluation. Section 4 reports final remarks and future work.

2 Methodology

Our method has its roots in a previous model based on Temporal Random Indexing (TRI) (Basile et al., 2014; Basile et al., 2015). In particular, we evolve the TRI approach in two directions: 1) we improve the system in order to manage very large datasets, such as Google Ngram; 2) we introduce a new approach based on Reflective Random Indexing (RRI) (Cohen et al., 2010), with the aim of identifying indirect inferences that can lead to the discovery of implicit connections between word meanings.

The idea behind TRI is to build a different WordSpace for each time period that we want to analyse. The peculiarity of TRI is that word vectors over different time periods are directly comparable because they are built using the same random vectors. In particular, TRI works as follows (a code sketch is given after this list):

1. Given a corpus C of documents and a vocabulary V of terms¹ extracted from C, the method assigns a random vector $r_i$ to each term $t_i \in V$. A random vector is a sparse vector whose values lie in {−1, 0, 1}, with few non-zero elements randomly distributed along its dimensions. The set of random vectors assigned to all terms in V is near-orthogonal;

2. The corpus C is split into different time periods $T_k$ using temporal information, for example the year of publication;

3. For each period $T_k$, a WordSpace $WS_k$ is built. All the terms of V occurring in $T_k$ are represented by a semantic vector. The semantic vector $sv_i^k$ for the i-th term in $T_k$ is built as the sum of the random vectors of all the terms co-occurring with $t_i$ in $T_k$. When computing the sum, we weigh each random vector with a formula based on inverse document frequency: the weight is computed as $w(r_i) = \log\left(\frac{C_k}{\#t_i^k}\right)$, where $C_k$ is the total number of occurrences in $T_k$ and $\#t_i^k$ is the number of occurrences of the term $t_i$ in $T_k$. The idea is to give less weight to the most frequent words.

In this way, the semantic vectors across all time periods are comparable, since they are the sum of the same random vectors.

RRI can be implemented by repeating steps 2 and 3 several times, where at each iteration the random vectors are replaced by the semantic vectors built in the previous step. The idea is to model implicit connections between terms that never co-occur together, but that frequently occur with other shared terms.

¹ The terms that we want to analyse. Usually, the most frequent terms are extracted.
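What follows is a minimal sketch of steps 1–3 in Python; it is not our actual implementation. The helper names (`random_vector`, `build_space`), the co-occurrence window size, and the toy vocabulary and periods are illustrative assumptions; the dimension of 1,000 and the two non-zero elements match the setup reported in Section 3.

```python
import numpy as np
from collections import defaultdict
from math import log

DIM = 1000  # vector dimension used in our experiments

def random_vector(rng, dim=DIM, non_zero=2):
    """Step 1: sparse ternary vector with `non_zero` entries drawn from {-1, +1}."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=non_zero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=non_zero)
    return v

def build_space(docs_k, rvecs, window=2):
    """Step 3: WordSpace for one period T_k; each term's semantic vector is the
    sum of the idf-weighted random vectors of the terms co-occurring with it."""
    C_k = sum(len(doc) for doc in docs_k)          # total occurrences in T_k
    counts = defaultdict(int)                      # #t^k for every term
    for doc in docs_k:
        for t in doc:
            counts[t] += 1
    space = defaultdict(lambda: np.zeros(DIM))
    for doc in docs_k:
        for i, t in enumerate(doc):
            if t not in rvecs:                     # represent only terms of V
                continue
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                c = doc[j]
                if j != i and c in rvecs:
                    space[t] += log(C_k / counts[c]) * rvecs[c]
    return space

# Shared random vectors make the spaces of different periods comparable.
rng = np.random.default_rng(42)
vocabulary = ["la", "rete", "televisiva", "internet"]   # toy vocabulary
rvecs = {t: random_vector(rng) for t in vocabulary}
periods = {"1990-1999": [["la", "rete", "televisiva"]],
           "2000-2009": [["la", "rete", "internet"]]}
spaces = {k: build_space(docs, rvecs) for k, docs in periods.items()}
# For RRI, call build_space again with rvecs replaced by the previous spaces.
```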
The next two sub-sections provide details about the Google Ngram dataset and the method used to automatically detect word meaning shifts.

2.1 Google Ngram

Google Ngram is a very large dataset containing all the n-grams (up to five tokens) extracted from Google Books. It is built by analysing over five million books spanning the years from 1500 to 2012, although the developers estimate that the most reliable period is from 1800 to 2012. The dataset covers several languages, including Italian. For each language, several compressed files are released. Each line of a file reports the following information: ngram, year, match count, volume count. For example, the line "analysis is often described as 1991 104 5" means that the 5-gram "analysis is often described as" occurs 104 times in 5 books in the year 1991.

We modify TRI to build the WordSpaces directly from the Google Ngram dataset. In particular, we need a pre-processing step in which we split the n-grams into several files according to the time periods we want to analyse. For example, if we fix the size of a time period to ten years, from 1850 to 2012, we build one file for each period: T1 = 1850-1859, T2 = 1860-1869, . . . , T16 = 2000-2009, T17 = 2010-2012. Each file contains only the n-grams that occur in the specific time period. We remove the information about the year and the book count, since they are not useful in the subsequent steps. Considering the previous example, the line "analysis is often described as 104" will be stored in the file 1990-1999. After this pre-processing step, we can easily run TRI and RRI, where RRI can be repeated multiple times. A sketch of this pre-processing step is given below.
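The splitting step can be sketched as follows, assuming the lines have already been decompressed and that the fields are tab-separated (an assumption about the raw file format); the `bucket` and `split_by_period` helpers are illustrative, not part of the released tool.

```python
def bucket(year, start=1850, end=2012, size=10):
    """Map a year to its period label, e.g. 1991 -> '1990-1999'."""
    if year < start or year > end:
        return None
    lo = start + ((year - start) // size) * size
    return f"{lo}-{min(lo + size - 1, end)}"       # last bucket is 2010-2012

def split_by_period(lines, writers):
    """Route each '<ngram> <year> <match_count> <volume_count>' record to the
    file of its period, keeping only the n-gram and its match count."""
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").rsplit("\t", 3)
        period = bucket(int(year))
        if period is not None:                     # drop years outside the range
            writers[period].write(f"{ngram}\t{match_count}\n")

# Usage sketch: writers = {p: open(p, "w") for p in ("1850-1859", ..., "2010-2012")}
```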
2.2 Change point detection

To track word meaning change over time, for each term $t_i$ we build a time series $\Gamma(t_i)$. A time series is a sequence of values, one for each time period, each indicating the semantic shift of the term in that period. We adopt several strategies for building time series. The first strategy is based on term log-frequency: each value in the series is defined as $\Gamma_k(t_i) = \log\left(\frac{\#t_i^k}{C_k}\right)$.

In order to exploit the ability of our methods to compute vector similarity across time periods, we define two further strategies for building the time series:

point-wise: $\Gamma_k(t_i)$ is defined as the cosine similarity between $sv_i^k$ and $sv_i^{k-1}$. In this way, we want to capture vector changes between two consecutive time periods;

cumulative: we build a cumulative vector $sv_i^{C_{k-1}} = \sum_{j=0}^{k-1} sv_i^j$ and compute its cosine similarity with the vector $sv_i^k$. The idea is that the semantics at point k − 1 depends on the semantics of all the previous time periods.

Given a time series, we need a method for finding significant change points in it. We adopt the strategy proposed in (Kulkarni et al., 2015b), based on the mean shift model (Taylor, 2000). According to this model, we define the mean shift of a general time series Γ of length l, pivoted at time period j, as:

$$K(\Gamma) = \frac{1}{l-j}\sum_{k=j+1}^{l}\Gamma_k - \frac{1}{j}\sum_{k=1}^{j}\Gamma_k \qquad (1)$$

In order to establish whether a mean shift is statistically significant at time j, we adopt a bootstrapping approach (Efron and Tibshirani, 1994) under the null hypothesis that there is no change in the mean. In particular, statistical significance is computed by first constructing B bootstrap samples by permuting $\Gamma(t_i)$. Then, for each bootstrap sample P, K(P) is calculated to provide its corresponding bootstrap statistic, and the statistical significance (p-value) of the mean shift observed at time j is obtained by comparison with this null distribution. Finally, we estimate the change point as the time point j with the minimum p-value. Since multiple words can have the same p-value, we sort them according to their frequency. The output of this process is a ranking of words that have potentially changed their meaning. A sketch of the detection procedure is given below.
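A minimal sketch of the point-wise series and the mean shift test follows. It assumes a two-sided test on the absolute shift and omits the frequency-based tie-breaking; both choices, like the helper names, are our illustrative assumptions rather than the exact procedure of Kulkarni et al. (2015b).

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pointwise_series(svs):
    """Gamma_k = cosine(sv^k, sv^{k-1}) for consecutive periods."""
    return np.array([cosine(svs[k], svs[k - 1]) for k in range(1, len(svs))])

def mean_shift(series, j):
    """K(Gamma) pivoted at j (Eq. 1): mean after the pivot minus mean up to it."""
    return series[j:].mean() - series[:j].mean()

def change_point(series, n_boot=1000, seed=42):
    """Pivot with the smallest bootstrap p-value under the no-change null."""
    rng = np.random.default_rng(seed)
    best = (None, 1.0)
    for j in range(1, len(series)):
        observed = mean_shift(series, j)
        null = np.array([mean_shift(rng.permutation(series), j)
                         for _ in range(n_boot)])
        p = float(np.mean(np.abs(null) >= abs(observed)))
        if p < best[1]:
            best = (j, p)
    return best   # e.g. (4, 0.002): a shift detected at the fifth period
```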
3 Evaluation

The goal of the evaluation is twofold: 1) to build a standard benchmark for meaning shift detection in the Italian language; 2) to evaluate the performance of the proposed methods and compare them with a baseline model based on word frequency.

Since a list of meaning shifts for the Italian language is not available, we build a new dataset using a pooling strategy. In particular, we retrieve the list of meaning shifts, as explained in Section 2.2, using the cumulative strategy for each of the following methods: word frequency, TRI, TRRI with one iteration, and TRRI with two iterations. Taking the first 50 words returned by each system, we manually check for each word whether a meaning shift occurs by exploiting two dictionaries: the "Sabatini Coletti", available on-line², and the "Dizionario Etimologico Zanichelli", available on CD-ROM. Finally, we obtain a gold standard that consists of 40 words and their corresponding change points.

All the methods, with the exception of word frequency, are built using co-occurrence information extracted from the 5-grams of the Italian Google Ngram dataset. The vector dimension is set to 1,000 for all the approaches based on Random Indexing, using two non-zero elements in the random vectors.

We adopt accuracy as the evaluation metric. Given a list of n change points returned by the system, we compute the ratio between the number of change points correctly identified in the gold standard³ and n. In order to identify the correct change points, we consider not only the word⁴, but also the year of the change point: the year predicted by the system must be equal to or greater than one of the years reported in the gold standard (a sketch of this computation is given at the end of this section). We compute the accuracy using different values of n (10, 100, ALL). The results of the evaluation are reported in Table 1. In particular, we evaluate seven systems: logfreq is the baseline based on word frequency; TRI is the Temporal Random Indexing method; TRRI1 is the Temporal Reflective Random Indexing with one iteration, while TRRI2 adopts two iterations. For the methods based on Random Indexing, we investigate both the point-wise and the cumulative strategy to compute the change points.

Table 1: Results of the evaluation.

Method        acc@10   acc@100   ALL
TRI_point     0.0247   0.1111    0.3086
TRI_cum       0.0123   0.0247    0.2963
TRRI1_point   0.0000   0.0247    0.2716
logfreq       0.0247   0.1111    0.2346
TRRI2_point   0.0000   0.0370    0.1728
TRRI1_cum     0.0000   0.0000    0.1605
TRRI2_cum     0.0000   0.0000    0.1235

The analysis of the results shows that TRI generally provides better results than TRRI. Moreover, the point-wise strategy always outperforms the cumulative one. The baseline has the same accuracy as TRI for both acc@10 and acc@100, while it performs worse than TRI and TRRI1 when the accuracy is computed over the whole list of terms (ALL). These results suggest that, while there are not many differences between the two methods on shorter result lists, TRI is actually able to detect more meaning shifts over a larger set of terms. TRRI2 always provides the worst results; we speculate that two iterations introduce too much noise into the model. A closer scrutiny of the list of words provided by TRRI2 highlights the presence of many foreign words: a simplistic conclusion might be that this approach is able to identify foreign terms introduced into the Italian language. However, we think that the output of this method deserves further investigation by means of an ad-hoc evaluation.

Since the evaluation is based on the predicted year, which only has to be equal to or greater than one of the years reported in the gold standard, we conduct a further analysis to measure how far the prediction is from the exact value. In particular, we compute the mean and the standard deviation of the differences between the predicted and the exact year. The results of this analysis are reported in Table 2. We observe that both TRRI1_cum and TRRI2_cum produce the best results despite their low accuracy, while TRI_cum reports the best trade-off between accuracy and precision in detecting the correct year. It is important to underline that the size of the time interval influences this kind of analysis: if the algorithm predicts 1900, the change point could have happened anywhere in the interval 1900-1909⁵. As future work, we plan to design a more accurate analysis by exploring a time interval set to one year.

Table 2: Mean and standard deviation, in years, of the differences between the predicted and the exact year.

Method        Mean     Std. Deviation
TRI_point     38.04    34.90
TRI_cum       26.45    19.60
TRRI1_point   65.86    49.96
logfreq       24.15    16.19
TRRI2_point   54.50    52.70
TRRI1_cum     16.61    14.62
TRRI2_cum     19.40    19.85

² http://dizionari.corriere.it/dizionario_italiano/
³ The gold standard adopted in this evaluation is available here: https://dl.dropboxusercontent.com/u/16026979/data/TRI_CLIC_2016_change_word.
⁴ The word matching is performed taking into account also the inflected forms.
⁵ In our experiment, the size of the time interval is set to ten years.
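As referenced above, the accuracy computation can be sketched as follows. The data shapes are illustrative assumptions, and the matching of inflected forms (footnote 4) is omitted for brevity.

```python
def accuracy_at_n(ranked, gold, n=None):
    """`ranked`: (word, predicted_year) pairs sorted by ascending p-value.
    `gold`: word -> list of attested change years. A prediction counts as
    correct when the word is in the gold standard and the predicted year is
    equal to or greater than one of its attested years."""
    top = ranked if n is None else ranked[:n]
    hits = sum(1 for word, year in top
               if word in gold and any(year >= g for g in gold[word]))
    return hits / len(top)

# Illustrative call: accuracy_at_n(ranked, {"rete": [1995]}, n=10)
```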
4 Conclusions

In this work we proposed several methods based on Random Indexing for the diachronic analysis of the Italian language. We built a dataset for the evaluation of meaning shift by exploiting etymological information taken from two Italian dictionaries. We compared our approaches against a baseline based on word frequency, obtaining promising results. In particular, the TRI method showed a better capability of retrieving meaning shifts over a longer list of terms. As future work, we plan to extend the dataset with further words and to investigate other methods based on word embeddings.

Acknowledgement

This work is partially supported by the project "Multilingual Entity Liking" funded by the Apulia Region under the program FutureInResearch.

References

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. Analysing word meaning over time by exploiting temporal random indexing. In Roberto Basili, Alessandro Lenci, and Bernardo Magnini, editors, First Italian Conference on Computational Linguistics CLiC-it 2014. Pisa University Press.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2015. Temporal random indexing: A system for analysing word meaning over time. Italian Journal of Computational Linguistics, 1(1):55–68.

Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Ferdinand De Saussure. 1983. Course in General Linguistics. Open Court, La Salle, Illinois.

Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. Chapman and Hall/CRC.

Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, UK. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. CoRR, abs/1605.09096.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015a. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, pages 625–635, New York, NY, USA. ACM.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015b. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635. ACM.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University.

Hinrich Schütze. 1993. Word space. Advances in Neural Information Processing Systems, 5:895–902.

Wayne A. Taylor. 2000. Change-point analysis: A powerful new tool for detecting changes. Taylor Enterprises, Inc.