Kronos-it: a Dataset for the Italian Semantic Change Detection Task

Pierpaolo Basile                     Giovanni Semeraro                    Annalina Caputo
University of Bari A. Moro           University of Bari A. Moro           ADAPT Centre
Dept. Computer Science               Dept. Computer Science               Dublin City University
E. Orabona 4, Italy                  E. Orabona 4, Italy                  Dublin, Ireland
pierpaolo.basile@uniba.it            giovanni.semeraro@uniba.it           annalina.caputo@dcu.ie

Abstract

This paper introduces Kronos-it, a dataset for the evaluation of semantic change point detection algorithms for the Italian language. The dataset is automatically built by using a web scraping strategy. We provide a detailed description of the dataset and its generation, and benchmark four state-of-the-art approaches to semantic change point detection by exploiting the Italian Google n-grams corpus.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Background and Motivation

Computational approaches to the problem of language change have been gaining momentum over the last decade. The availability of long-term and large-scale digital corpora, and the effectiveness of methods for representing words over time, are the prerequisites behind this interest. However, only a few attempts have focused on evaluation, due to two main issues. First, the amount of data involved limits the possibility of performing a manual evaluation; secondly, to date no open dataset for diachronic semantic change has been made available. This last issue has roots in the difficulty of building a gold standard for detecting the semantic change of terms in a specific corpus or language. The result is a fragmented set of data and evaluation protocols, since each work in this area has used different evaluation datasets or metrics. This phenomenon can be gauged from (Tahmasebi et al., 2019), where it is possible to count at least twenty different datasets used for evaluation.

In this paper, we describe how to build a dataset for the evaluation of semantic change point detection algorithms. In particular, we adopt a web scraping strategy for extracting information from an online Italian dictionary. The goal of the extraction is to build a list of lemmas with a set of change points for each lemma. The change points are extracted by analysing information about the year in which a lemma with a specific meaning is observed for the first time. Relying on this information, we build a dataset for the Italian language that can be used to evaluate algorithms for semantic change point detection. We provide a case study in which four different approaches are analysed using a single corpus.

The rest of the article is organised as follows: Section 2 describes how our dataset is built, while Section 3 provides details about the approaches under analysis and the evaluation. Finally, Section 4 closes the paper and outlines possible future work.

2 Dataset Construction

The main goal of the dataset is to provide for each lemma a set of years which indicate a semantic change for that lemma. Some dictionaries provide historical information about meanings, for example the year in which each meaning is observed for the first time. The main problem is that generally these dictionaries are not digitally available, or they are in a format that is not machine readable.

Regarding the Italian language, the dictionary "Sabatini Coletti" is available online at https://dizionari.corriere.it/dizionario_italiano/. It provides, for some lemmas, the year in which each meaning was observed for the first time. For example, taking into account the entry for the word "imbarcata", we capture its original meaning "Group of people who gather to find each other, to leave together", and two other meanings: 1) "Acrobatic manoeuvre of an air-plane", introduced in 1929; and 2) "fall in love", introduced in 1972.

We set up a web scraping algorithm able to extract this information from the dictionary. In particular, the extraction process is composed of several steps (a sketch of the extraction follows the list):

1. Downloading the list of all lemmas occurring in the online dictionary with the corresponding URL. We obtain a list of 34,504 lemmas;

2. For each lemma, extracting the section of the web page containing the definition with the list of all possible meanings. We obtain a final list of 34,446 definitions;

3. For each definition, extracting the year in which that meaning was introduced. For a given lemma, we are not able to assign the correct year to each of its meanings; we can only extract a year associated with the lemma. This happens because the dictionary does not follow a clear template for assigning the year to each meaning. Although associating the year of change with the definition of the meaning is not needed for the purpose of our evaluation, it could help to understand the reason behind the semantic change. We plan to fix this limitation in a future release of the dataset. In the rest of the paper we call change point (CP) each pair (lemma, year);

4. Removing those change points that are expressed in the form "III sec." (third century), because they refer to a broad period of time rather than to a specific year.
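As an illustration of steps 2-4, the following Python sketch extracts the years attached to a single dictionary entry. It is a minimal sketch, not our actual scraper: the function name, the CSS selector and the year pattern are assumptions that must be adapted to the real page structure.

import re

import requests
from bs4 import BeautifulSoup

# Years plausible for the dictionary's range (oldest CP 1758, newest 2003).
YEAR_RE = re.compile(r"\b(?:1[5-9]\d{2}|20[0-1]\d)\b")

def extract_change_points(lemma, entry_url):
    """Return the (lemma, year) change points found in one entry.

    Sketch only: the CSS class "definition" is a placeholder and must
    be replaced with the real selector of the dictionary pages.
    """
    soup = BeautifulSoup(requests.get(entry_url).text, "html.parser")
    block = soup.find("div", class_="definition")
    if block is None:
        return []
    text = block.get_text(" ", strip=True)
    # Step 3: collect the years attached to the entry. Dates expressed
    # as centuries ("III sec.") never match the four-digit year
    # pattern, which implements the filtering of step 4.
    years = sorted({int(y) for y in YEAR_RE.findall(text)})
    return [(lemma, year) for year in years]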
The final dataset, available at https://github.com/pippokill/kronos-it, contains 13,818 lemmas and 13,932 change points. The average number of change points per lemma is 1.0083, with a standard deviation of 0.0924. The maximum number of change points per lemma is 3, and the number of lemmas with more than one change point is 113. The oldest reported change point is 1758, while the most recent one is 2003; this suggests that the dictionary is outdated and does not contain more recent meanings.

The dataset is provided in textual format: each row reports the lemma followed by a list of years, each one representing a change point. For example:

enzima 1892
monopolistico 1972
tamponare 1886 1950
elettroforesi 1931
fuoricorso 1934
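Given this row format, the gold standard can be loaded with a few lines of Python (the function name is ours):

def load_kronos(path):
    """Parse Kronos-it rows of the form 'lemma year1 year2 ...' into a
    dict mapping each lemma to its list of change-point years."""
    gold = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            lemma, *years = line.split()
            gold[lemma] = [int(y) for y in years]
    return gold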
The low number of change points per lemma reflects the fact that, generally, the first meaning carries no information about the year in which it first appeared, or its time period is expressed in the form of a century. This means that all the other meanings are additional meanings introduced after the main one. However, there are some more recent words for which the first year associated with the entry corresponds to the year in which the word is observed for the first time. Unfortunately, it is not easy to automatically discern the two cases.

Finally, we report the distribution of change points over time in Figure 1. The years with a peak are 1942, 1905 and 1869, with 404, 352 and 322 change points respectively.

Figure 1: The distribution of change points over time.

3 Evaluation

For the evaluation we adopt our dataset as gold standard and the Italian Google n-grams (Michel et al., 2011) as corpus, available at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html. Google n-grams provides n-grams extracted from the Google Books project. The corpus is composed of several compressed files. Each file contains tab-separated data; each line has the following format: ngram TAB year TAB match_count TAB volume_count NEWLINE. For example:

parlare di pace e di 2005 4 4
parlare di pace e di 2006 3 3
parlare di pace e di 2007 7 7
parlare di pace e di 2008 2 2
parlare di pace e di 2009 4 4

The first line tells us that in 2005 the 5-gram "parlare di pace e di" occurred 4 times overall, in 4 distinct books.

In particular, we use the 5-grams corpus and limit the analysis to words that occur in at least twenty 5-grams. Moreover, we lowercase words and filter out all words that do not match the following regular expression: [a-zéèàìòù]+. We limit our analysis to the period [1900-2012].

In order to build the context words by using 5-grams, we adopt the technique described in (Ginter and Kanerva, 2014). Given a 5-gram (w_1, w_2, w_3, w_4, w_5), it is possible to build eight pairs: (w_1, w_2) (w_1, w_3) ... (w_1, w_5) and (w_5, w_1) (w_5, w_2) ... (w_5, w_4). Then, for each pair (w_i, w_j), a sliding window method also visits (w_j, w_i), obtaining 16 training examples from each 5-gram.
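Both preprocessing steps can be sketched as follows. This is an illustration under our own assumptions (gzip-compressed input files, hypothetical function names), not the original implementation.

import gzip
import re

TOKEN_RE = re.compile(r"^[a-zéèàìòù]+$")  # the word filter described above

def read_5grams(path):
    """Stream (tokens, year, match_count) from a Google 5-grams file
    (rows: ngram TAB year TAB match_count TAB volume_count), applying
    the lowercasing, character and period [1900-2012] filters."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, year, matches, _volumes = line.rstrip("\n").split("\t")
            tokens = [t.lower() for t in ngram.split()]
            if (len(tokens) == 5
                    and all(TOKEN_RE.match(t) for t in tokens)
                    and 1900 <= int(year) <= 2012):
                yield tokens, int(year), int(matches)

def training_pairs(w):
    """Build the 16 training examples of Ginter and Kanerva (2014):
    the first and the last word paired with the other four words,
    and every pair also visited in the opposite direction."""
    pairs = [(w[0], w[j]) for j in range(1, 5)]
    pairs += [(w[4], w[j]) for j in range(4)]
    pairs += [(b, a) for (a, b) in pairs]  # the window also visits (w_j, w_i)
    return pairs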
We investigate four systems for representing words over time and then apply a strategy for extracting change points from each technique. Finally, we evaluate the accuracy of each approach by using our dataset as gold standard.

3.1 Representing words over time

We adopt four techniques for representing words over time. The first strategy is based only on word co-occurrences; the other three exploit Distributional Semantic Models (DSM). In particular, the techniques are:

Collocation. This approach is very simple and is used as a baseline. The idea is to extract, for each word and each time period, the set of relevant collocations. A collocation is a sequence of words that co-occur more often than would be expected by chance. We extract the collocations by analysing the word pairs extracted from 5-grams and score each word pair using the Dice score:

    dice(w_a, w_b) = 2 * f_ab / (f_a + f_b)    (1)

where f_ab is the number of times that the words w_a and w_b occur together, and f_a and f_b are respectively the number of times that w_a and w_b occur in the corpus. Since the Dice score is independent of the corpus size, it is possible to build, for each word and each time period, a list of collocations by considering only the collocations occurring in a specific period of time. In order to consider only a restricted number of collocations, we take into account only the collocations with a Dice value above 0.0001. A sketch of this computation is given at the end of this subsection. For each word and each time period we obtain a list of collocations with the associated Dice score. For example, a portion of the list of collocations for the word pace (peace) in the period 1980-1984 is reported as follows:

pace guerra 0.007223173
pace giustizia 0.0068931305
pace trattati 0.0067062946
pace trattative 0.006033537

Temporal Random Indexing (TRI). TRI (Jurgens and Stevens, 2009) is able to build a word space for each time period, where each space is comparable to the others. In each space a word is represented by a dense vector, and it is possible to compute the cosine similarity between word vectors across time periods. In order to build comparable word spaces, TRI relies on the incremental property of Random Indexing (Sahlgren, 2005). More details are provided in (Basile et al., 2014) and (Basile et al., 2016).

Temporal Word Analogies (TWA). This approach builds diachronic word embeddings starting from independent embedding spaces for each time period. The output of this process is a common vector space where word embeddings are used for computing temporal word analogies: word w_1 at time t_i is like word w_2 at time t_j. We build the independent embedding spaces by using the C implementation of word2vec with default parameters (Mikolov et al., 2013). More details about this approach are reported in (Szymanski, 2017).

Procrustes (HIST). This approach aligns the learned low-dimensional embeddings by preserving cosine similarities across time periods. More details are available in (Hamilton et al., 2016). We apply the alignment to the same embeddings created for TWA.

All approaches are built using the same vocabulary and the same context words generated from the 5-grams, as previously explained.
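As anticipated above, the Dice computation of equation (1) over a stream of word pairs can be sketched as follows. Note one simplification made for brevity: the frequencies f_a and f_b are estimated from the pair stream itself, whereas in our pipeline they are corpus frequencies.

from collections import Counter

def collocations(pairs, min_dice=0.0001):
    """Score co-occurring word pairs with the Dice coefficient of
    equation (1), keeping only scores above the 0.0001 threshold."""
    pair_freq = Counter(pairs)
    word_freq = Counter()
    for (a, b), f in pair_freq.items():
        # Simplification: f_a and f_b are derived from the pairs;
        # in the paper they are corpus frequencies.
        word_freq[a] += f
        word_freq[b] += f
    scores = {}
    for (a, b), f in pair_freq.items():
        dice = 2 * f / (word_freq[a] + word_freq[b])
        if dice >= min_dice:
            scores[(a, b)] = dice
    return scores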
3.2 Building the time series

In order to track how the semantics of a word changes over time, we need to build a time series: a sequence of values, one for each time period, that indicates the semantic shift of that word in the specific period. In our evaluation, we split the interval [1900-2012] into time periods of five years each.

The time series are computed in different ways according to the strategy used for representing the words. In particular, the values of the time series Γ(w_i) associated with the word w_i are computed as follows:

• Collocation: given two lists of collocations related to two different periods, we compute the cosine similarity between the two lists by considering each list as a Bag-of-Collocations (BoC). In this case, each point k of the series Γ(w_i) is the cosine similarity between the BoC at time T_{k-1} and the BoC at time T_k;

• TRI: we use two strategies (point-wise and cumulative), as proposed in (Basile et al., 2016). The point-wise approach captures how the word vector changes between two time periods, while the cumulative approach captures how the word vector changes with respect to all the previous periods. In the point-wise approach, each point k of Γ(w_i) is the cosine similarity between the word vector at time T_{k-1} and the word vector at time T_k, while in the cumulative approach the point k is computed as the cosine similarity between the average of the word vectors over all the previous time periods T_0, T_1, ..., T_{k-1} and the word vector at time T_k;

• TWA: we exploit the word analogies across time and the common vector space for capturing how a word embedding changes across two time periods, as reported in (Szymanski, 2017);

• HIST: time series are built by using the pair-wise similarity, as explained in (Hamilton et al., 2016).

We obtain seven time series, as reported in Tables 1 and 2. In particular: BoC is built on temporal collocations; TRI_point and TRI_cum are based on TRI using respectively the point-wise and the cumulative approach; TWA_int and TWA_uni are built using TWA on the words that are common (intersection) to all the periods (TWA_int) and on the union of words (TWA_uni). The same procedure is used for HIST, obtaining the two time series HIST_int and HIST_uni.

For finding significant change points in a time series, we adopt the strategy proposed in (Kulkarni et al., 2015) based on the Mean Shift Model (Taylor, 2000); a sketch of both the series construction and the mean shift statistic follows.
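The sketch below, under our own naming, builds the point-wise and cumulative series for a word, given one vector per five-year period, and computes the mean shift statistic. The bootstrap significance test that Kulkarni et al. (2015) apply on top of the mean shift is omitted.

import numpy as np

def _cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pointwise_series(vectors):
    """Point k of Γ(w): cosine similarity between the word vector at
    time T_{k-1} and the word vector at time T_k."""
    return [_cos(vectors[k - 1], vectors[k])
            for k in range(1, len(vectors))]

def cumulative_series(vectors):
    """Point k of Γ(w): cosine similarity between the average of the
    word vectors at T_0 ... T_{k-1} and the word vector at T_k."""
    return [_cos(np.mean(vectors[:k], axis=0), vectors[k])
            for k in range(1, len(vectors))]

def mean_shift(series):
    """Mean shift at each index j (Taylor, 2000): mean of the series
    after j minus the mean up to j. Kulkarni et al. (2015) flag a
    change point where this shift is statistically significant under
    a bootstrap test, not reproduced here."""
    s = np.asarray(series, dtype=float)
    return [float(s[j + 1:].mean() - s[:j + 1].mean())
            for j in range(len(s) - 1)]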
3.3 Metrics

We compute the performance of each approach using Precision, Recall and F-measure. However, assessing the correctness of the change points generated by each system is not an easy task. A change point is defined as a pair (lemma, year). In order to adopt a soft match, when we compare the change points provided by a system with the change points reported in the gold standard, we take into account the absolute value of the difference between the year predicted by the system and the year provided in the gold standard.

As a first evaluation (exact match), we require the difference between the detected year and the gold standard to be less than or equal to five, which is the time period span of our corpus. As a second evaluation (soft match), we require only that the predicted year is greater than or equal to the change point in the gold standard. This is a common methodology adopted in previous work. A sketch of both criteria closes this subsection.

For a fairer evaluation, we perform the following steps:

• We remove from the gold standard all the change points that are outside the period under analysis ([1900-2012]);

• We remove from the gold standard all the words that are not represented in the model under evaluation. This operation is necessary because (1) the previous filtering step can exclude some words; (2) there are words that do not appear in the original corpus.

Since the gold standard contains lemmas and not words, we lemmatize each system output using Morph-it! (Zanchetta and Baroni, 2005).
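The two matching criteria can be made concrete as in the following sketch, which consumes the gold dictionary produced by load_kronos above. This is one plausible reading of the protocol, since the paper does not spell out how multiple predictions per lemma are tallied.

def evaluate(predicted, gold, mode="exact", span=5):
    """Precision/Recall/F over (lemma, year) change points.

    Exact match: |predicted year - gold year| <= span (the five-year
    period width). Soft match: predicted year >= gold year.
    """
    def hit(year, gold_years):
        if mode == "exact":
            return any(abs(year - g) <= span for g in gold_years)
        return any(year >= g for g in gold_years)

    tp = sum(1 for lemma, year in predicted
             if hit(year, gold.get(lemma, [])))
    p = tp / len(predicted) if predicted else 0.0
    total_gold = sum(len(ys) for ys in gold.values())
    r = tp / total_gold if total_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f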
3.4 Results

Results in terms of Precision (P), Recall (R) and F-measure (F) are reported in Table 1. We can observe that we generally obtain a low F-measure; this is due to the large number of false positive change points detected by each system.

             exact match              soft match
Γ            P      R      F          P      R       F
BoC          .0034  .0084  .0049      .0274  .0670   .0389
TRI_point    .0056  .0394  .0098      .0248  .1750   .0434
TRI_cum      .0058  .0387  .0101      .0251  .1672   .0436
TWA_int      .0034  .0009  .0015      .0165  .0046   .0072
TWA_uni      .0052  .0060  .0056      .0373  .0435   .0402
HIST_int     .0024  .0048  .0032      .0111  .02211  .0148
HIST_uni     .0022  .0066  .0033      .0118  .0356   .0177

Table 1: Results of the evaluation.

             exact match              soft match
Γ            P      R      F          P      R       F
BoC          .0361  .1243  .0560      .2881  .9930   .4466
TRI_point    .0581  .2244  .0923      .2581  .9973   .4100
TRI_cum      .0610  .2308  .0959      .2617  .9979   .4146
TWA_int      .0402  .2000  .0670      .1960  .9750   .3264
TWA_uni      .0526  .1367  .0759      .3794  .9866   .5480
HIST_int     .0344  .2147  .0593      .1569  .9791   .2704
HIST_uni     .0314  .1842  .0536      .1675  .9836   .2863

Table 2: Results of the evaluation obtained by considering only common lemmas between the gold standard and the system output.

The best approach in both evaluations is TRI_cum. Considering the exact match evaluation, the difference in performance is remarkable, since TRI generally has a high recall. In the soft match evaluation, TWA_uni obtains the best precision, while the simple BoC method is able to achieve good results compared with more complex approaches such as TWA_int and HIST.

The results of the evaluation prove that the task of semantic change detection is very challenging; in particular, the large number of false positives drastically affects the performance. Further analyses are necessary to understand which component affects the performance. In this preliminary evaluation, we adopt a single approach for detecting the semantic shift; an extended benchmark is necessary for evaluating several approaches for detecting semantic change points.

The systems are built on a vocabulary that is larger than both the original dictionary and the gold standard. For that reason, we provide an additional evaluation in which we perform an ideal analysis by evaluating only the lemmas that are common to the gold standard and the system output. The goal of this analysis is to measure the ability to correctly identify change points for those lemmas that are represented in both the gold standard and the system. The results of this further evaluation are provided in Table 2. For the exact match evaluation, TRI_cum obtains the best F-measure, as in the first evaluation, while TWA_uni achieves a very good performance in the soft match evaluation.

The plot in Figure 2 reports how the F-measure increases according to the time span that we adopt in the soft match. In particular, the X-axis reports the maximum absolute difference between the year in the gold standard and the year predicted by the system. We can observe that under 20 years TRI provides better performance than TWA, and after 60 years all the approaches reach a stable F-measure value.

Figure 2: The plot shows how the F-measure increases according to the time span used in the soft match.

4 Conclusion and Future Work

In this paper, we provide details about the construction of a dataset for the evaluation of semantic change point detection algorithms. In particular, our dataset focuses on the Italian language and is built by adopting a web scraping strategy. We provide a usage example of our dataset by evaluating several approaches for the representation of words over time. The results prove that the task of detecting semantic shift is challenging due to the large number of detected false positives. As future work, we plan to investigate further methods for building time series and detecting semantic shifts in order to improve the overall performance. Moreover, we plan to fix some issues of our extraction process in order to improve the quality of the dataset itself.

Acknowledgements

This work was supported by the ADAPT Centre for Digital Content Technology, funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant SFI 13/RC/2106) and co-funded under the European Regional Development Fund, and by the European Union's Horizon 2020 (EU2020) research and innovation programme under the Marie Skłodowska-Curie grant agreement No. EU2020 713567. The computational work has been executed on the IT resources made available by two projects, ReCaS and PRISMA, funded by MIUR under the programme "PON R&C 2007-2013".

References

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. Analysing word meaning over time by exploiting temporal random indexing. In First Italian Conference on Computational Linguistics CLiC-it.

Pierpaolo Basile, Annalina Caputo, Roberta Luisi, and Giovanni Semeraro. 2016. Diachronic analysis of the Italian language exploiting Google Ngram. CLiC-it, page 56.

Filip Ginter and Jenna Kanerva. 2014. Fast training of word2vec representations using n-gram corpora.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

David Jurgens and Keith Stevens. 2009. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in Emerging Text Types, pages 9-16. Association for Computational Linguistics.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625-635. International World Wide Web Conferences Steering Committee.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176-182.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Magnus Sahlgren. 2005. An introduction to random indexing.

Terrence Szymanski. 2017. Temporal word analogies: Identifying lexical replacement with diachronic word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 448-453, Vancouver, Canada, July. Association for Computational Linguistics.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2019. Survey of computational approaches to lexical semantic change. arXiv:1811.06278v2.

Wayne A. Taylor. 2000. Change-point analysis: a powerful new tool for detecting changes.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. In Proceedings of Corpus Linguistics.