=Paper= {{Paper |id=Vol-2765/147 |storemode=property |title=UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a Compass for Semantic Change Detection in the Italian Language (short paper) |pdfUrl=https://ceur-ws.org/Vol-2765/paper147.pdf |volume=Vol-2765 |authors=Federico Belotti,Federico Bianchi,Matteo Palmonari |dblpUrl=https://dblp.org/rec/conf/evalita/BelottiBP20 }}
    UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a
     Compass for Semantic Change Detection in the Italian Language
        Federico Belotti                        Federico Bianchi                 Matteo Palmonari
 University of Milano-Bicocca                  Bocconi University            University of Milano-Bicocca
    Viale Sarca 336, 20126                   Via Sarfatti 25, 20136            Viale Sarca 336, 20126
          Milan, Italy                            Milan, Italy                        Milan, Italy
 f.belotti8@campus.unimib.it                f.bianchi@unibocconi.it          matteo.palmonari@unimib.it


                       Abstract

In this paper, we present our results related to the EVALITA 2020 challenge, DIACR-Ita, on semantic change detection for the Italian language. Our approach is based on measuring the semantic distance across time-specific word vectors generated with Compass-Aligned Distributional Embeddings (CADE). We first generate temporal embeddings with CADE, a strategy to align word embeddings that are specific to each time period; the quality of this alignment is the main asset of our proposal. We then measure the semantic shift of each word by combining two different semantic shift measures. Finally, we classify a word's meaning as changed or not changed by defining a threshold over the semantic distance across time.

1 Introduction

Semantic change detection is the task of detecting whether a word has shifted in meaning between different periods of time (Tahmasebi et al., 2018; Kutuzov et al., 2018). The DIACR-Ita challenge (Basile et al., 2020a), held at EVALITA (Basile et al., 2020b), is meant to evaluate approaches for semantic change detection for the Italian language.

The task is described as follows: for training, two corpora t1 and t2, consisting of text coming from different periods, are given; for testing, a set of unlabeled target words is given, and for each of them a binary score has to be predicted: 1 identifies lexical change between t1 and t2, while 0 does not.

In this paper, we present our approach to semantic change detection, which is based on two components: 1) an alignment procedure to generate distributional vector spaces that are comparable for t1 and t2, and 2) the use of distance metrics to compute the degree of semantic change for a given word. Our alignment procedure is based on Compass-Aligned Distributional Embeddings (CADE), proposed by Bianchi et al. (2020) (note that the approach was introduced as Temporal Word Embeddings with a Compass by Di Carlo et al. (2019), but the name was changed to emphasize that the embeddings can be used to align more general corpora, not just diachronic ones). Given the aligned embeddings, we use two measures to compute the degree of change based on the similarities of the vectors in the embedded space. Our results show that our methodology for aligning spaces can be useful in detecting lexical semantic change.

      "Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

2 Description of the System: Semantic Change Detection with Compass-Aligned Embeddings

Our approach is based on measuring the semantic distance across time of time-specific word vectors generated with CADE and on the use of two measures for detecting semantic shifts, i.e., the semantic distance between word vectors across time. This distance can be interpreted as a function of a word's self-similarity across time, where the similarity is measured by a linear combination of cosine and second-order similarity (Hamilton et al., 2016a). Finally, a threshold over this self-similarity is used to classify a word as changed or not changed.

This methodology was also applied in the semantic shift detection challenge presented at SemEval2020 (Schlechtweg et al., 2020), in which we participated after the end of the challenge. That challenge allowed us to explore and understand how the alignment and our self-similarity measure behaved. In the classification task of the SemEval2020 challenge (the one similar to this task),
we eventually achieved accuracy scores of 0.703, 0.771, 0.725, and 0.742 for English, German, Latin, and Swedish, respectively; these results were obtained with an extensive parameter search given the gold standard available in the post-evaluation.1 In DIACR-Ita, the threshold and a few other hyperparameters had to be set heuristically to account for the limited number of possible submissions. In the next subsections we provide more details about the alignment methodology and the similarity function; details about how we set the hyperparameters are provided in Section 3.

2.1 Aligning Embeddings

Word2vec (Mikolov et al., 2013) is a useful methodology for generating word vectors, allowing us to study word similarity through vector similarity. However, due to the stochasticity of the training procedure, running word2vec on different corpora creates word vectors that are not comparable. Thus, an alignment procedure that puts the temporal word vectors in the same space is needed.

There are different approaches to generate these aligned embeddings (see, for example, the work by Hamilton et al. (2016b) and Yao et al. (2018)). In this paper, we generate aligned embeddings with Compass-Aligned Distributional Embeddings (CADE) (Bianchi et al., 2020); see Figure 1 for a schematic description of the model. CADE is a strategy, extending the word2vec Continuous Bag of Words (CBOW) model proposed by Mikolov et al. (2013), to align word embeddings that are specific to each time period. CADE can be used to generate aligned temporal word embeddings (i.e., time-specific vectors of words, like "amazon_1974") from the different slices.

Given as input a set of slices of text, where each slice corresponds to text coming from a specific period of time, the alignment procedure is as follows. First, the text from all the slices is concatenated and CBOW is run on this corpus in order to obtain a "compass" model, i.e., a model defining the embedding space. The CBOW model uses two matrices to generate the embeddings (U and C in Figure 1), one for the context words and one for the target words. The target word matrix of the compass is then used to initialize the target matrices for each new CBOW model fitted on each of the slices. During training, these new target matrices are frozen, i.e., they are not updated during the training on the slice. This ensures that at the end of the training process, the various temporal embeddings are all aligned in the same embedding space, making them comparable without losing their individual temporal distinctions. We use the publicly available online implementation of CADE.2

2.2 Computing Semantic Change

Once the embeddings are aligned, we need measures to evaluate the degree of semantic change. We compute the semantic shift of each word, i.e., the semantic distance between word vectors across time, using the combination of two different measures: Local Neighbors (ln), introduced by Hamilton et al. (2016a), and cosine similarity (cos), merging them with a weighted linear combination into a new measure called Move.

Local Neighbors. ln is based on the similarity between a word and its neighbor words in the two different time periods. Essentially, we compute the degree of semantic change of the word w in two slices by first collecting the nearest neighbors (NNs) of w_t and w_t+1 in the two respective slices; then, given the embeddings at time t, the similarities between the vector of w_t and the vectors of all the neighbors are computed.3 The same process is run for time t+1 with w_t+1, eventually giving us two vectors of similarity scores. These two vectors are then compared using cosine similarity. The higher the value of this measure, the less the vector has changed with respect to its neighbors, and thus the less the word should have shifted in meaning.

Cosine Similarity. The second measure we use is simply the cosine similarity of the vectors of a word in two different time periods. As before, the higher the value, the less the vector has changed, and thus the less the word should have shifted in meaning.

The Move Measure. We merge these measures using a weighted linear combination, that is:

    s(w_t, w_t+1) = (1 - λ) · ln(w_t, w_t+1) + λ · cos-sim(w_t, w_t+1)

1 Check the belerico entry in the challenge leaderboard at https://competitions.codalab.org/competitions/20948#results
2 http://github.com/vinid/cade
3 When a neighbor is missing in one time slice, we replace it with the average vector of the space.
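The non-comparability that motivates the alignment in Section 2.1 can be illustrated with a small simulation (our own toy sketch, not part of the system described above): two independent training runs behave like two random rotations of the same underlying word geometry, so each run preserves its internal similarity structure while raw cross-run comparisons of the same word become meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 10, 50
true_space = rng.normal(size=(n_words, dim))  # a hypothetical shared "true" geometry

def random_rotation(dim, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix:
    # it rotates a space without distorting distances or angles.
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

# Two independent word2vec runs, modeled as two random rotations of the
# same geometry: each keeps its internal structure, but the coordinate
# systems of the two runs do not match.
run_a = true_space @ random_rotation(dim, rng)
run_b = true_space @ random_rotation(dim, rng)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Within-space similarities survive each rotation exactly ...
within_drift = abs(cos(run_a[0], run_a[1]) - cos(true_space[0], true_space[1]))

# ... but comparing the same "word" across the two runs is uninformative:
cross_self_sim = float(np.mean([cos(run_a[i], run_b[i]) for i in range(n_words)]))

print(within_drift)    # ~0: internal structure preserved
print(cross_self_sim)  # typically nowhere near 1
```

Freezing the compass target matrix across slices pins every slice model to the same basis, which is what makes the cross-slice cosine comparisons of Section 2.2 meaningful.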
[Figure 1 diagram: slices D1 ... Dn; step 1) train the compass (matrices U and C) on the concatenation of all slices; step 2) initialize and freeze each slice-specific CBOW target matrix with the same U matrix; step 3) train each slice model, yielding context matrices C1 ... Cn.]

Figure 1: A high-level overview of the Compass Aligned Distributional Embeddings model.


with λ ∈ [0, 1]. In particular, λ expresses the relative weight of the two measures: a high λ shifts Move towards the cosine similarity, while a low one shifts it towards the ln measure. As introduced before, we classify whether the meaning has changed by defining a threshold over s (more details are presented in the next section).

3 Experimental Evaluation

The dataset provided by the challenge's organizers (Basile et al., 2020a) is a collection of documents extracted from newspapers written in the Italian language and labeled with temporal information. Participants must train their models only on the data provided, so a pre-processed corpus is given: tab-separated, with one token per line, where each token is accompanied by its part-of-speech (POS) tag and lemma, and sentences are separated by empty lines. The corpus is split into two slices, each belonging to a specific period of time, t1 and t2, where t1 < t2.

3.1 Dataset

For the training data we used the flat version with only the lemmas, obtained with the organizers' script (Basile et al., 2020a); in addition, we applied a pre-processing step in which we removed punctuation and non-alphanumeric symbols and kept only those sentences with at least two tokens.

3.2 Models Considered

We use the embeddings aligned with CADE and the move measure. The parameters of the move measure we need to consider are: the number of nearest neighbors (NNs) to be collected by ln, λ for the linear combination, and the threshold for the similarity. The threshold used to decide whether a word is stable or not is set to 0.7, with the decision given by:

    0 if s(w_t, w_t+1) ≥ 0.7
    1 otherwise

Essentially, the less the two vectors of the word (for cos) and of its neighbors (for ln) have changed, the more the word has been stable between the two time periods. As heuristics, we chose λ ∈ {0.3, 0.5, 0.7} to evaluate the relationship between the two measures used to build move, and we set to 22 the number of nearest neighbors considered by ln; this is the general setup that produced the results submitted to the challenge.

We trained CADE for 10 epochs to learn 100-dimensional vectors, with the window size set to 5, 10 negative examples for every positive one, and the initial learning rate set to 0.025 and decreased linearly during training.

As additional models, in the post-evaluation we also considered one that uses only the cos similarity measure (CADE (cos)) and one that uses only the ln metric (CADE (ln)), again with 0.7 as threshold and with the number of NNs for ln set to 22.

As baselines, the organizers propose baseline-freq, i.e., the absolute value of the difference between a word's frequencies in the two periods, and baseline-colloc, where the Bag-of-Collocations of a word in the two different periods is built and cosine similarity is then applied. A threshold is used on both metrics to define semantic change (Basile et al., 2020a). We also report the results of the other participants.
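The measures of Section 2.2 and the decision rule above can be sketched in plain NumPy. This is a minimal illustration under assumptions the paper leaves open: embedding spaces are plain word-to-vector dictionaries, the two slice-specific neighbor sets are merged by union, and, following footnote 3, a neighbor missing from one slice is replaced by the average vector of that space.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(word, space, k):
    """Top-k neighbors of `word` in `space` by cosine similarity, excluding itself."""
    v = space[word]
    ranked = sorted(space, key=lambda w: cos_sim(v, space[w]), reverse=True)
    return [w for w in ranked if w != word][:k]

def ln(word, space_t, space_t1, k=22):
    """Local Neighbors (Hamilton et al., 2016a): second-order similarity across slices."""
    neigh = sorted(set(nearest_neighbors(word, space_t, k))
                   | set(nearest_neighbors(word, space_t1, k)))
    def sim_vector(space):
        avg = np.mean(list(space.values()), axis=0)  # footnote 3: missing-neighbor fallback
        return np.array([cos_sim(space[word], space.get(n, avg)) for n in neigh])
    return cos_sim(sim_vector(space_t), sim_vector(space_t1))

def move(word, space_t, space_t1, lam=0.3, k=22):
    """s(w_t, w_t+1) = (1 - lam) * ln + lam * cos, as in Section 2.2."""
    return ((1 - lam) * ln(word, space_t, space_t1, k)
            + lam * cos_sim(space_t[word], space_t1[word]))

def is_changed(word, space_t, space_t1, lam=0.3, k=22, threshold=0.7):
    """Decision rule: 1 = changed, 0 = stable (threshold 0.7 in the submitted runs)."""
    return 0 if move(word, space_t, space_t1, lam, k) >= threshold else 1
```

With two identical spaces, move returns 1 and the word is classified as stable; the submitted configuration used λ ∈ {0.3, 0.5, 0.7}, 22 neighbors, and threshold 0.7.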
                 λ      Acc.
team1            /      0.944
team2            /      0.944
team3            /      0.889
CADE (move)†     0.3    0.833
team4            /      0.833
team5            /      0.833
team6            /      0.778
team7            /      0.722
team8            /      0.667
team9            /      0.611
baseline-colloc  /      0.611
baseline-freq    /      0.500
CADE (move)†     0.5    0.722
CADE (move)†     0.7    0.722
CADE (cos)       /      0.722
CADE (ln)        /      0.889

Table 1: Accuracy scores for the binary classification w.r.t. the other participants in the challenge. † identifies our submitted results.

3.3 Results

The evaluation metric used in this challenge is accuracy, that is, the fraction of correct predictions over the target data. Table 1 shows the results. Our model was the third most accurate. However, in the post-evaluation we discovered that just using the ln metric and ignoring cos (which is equivalent to using λ = 0 in our move measure) improves the performance, leading to the second-best accuracy score in the leaderboard.

4 Discussion

Our results show that CADE (Bianchi et al., 2020) is an effective method to generate aligned embeddings for the Italian language. This result, together with those obtained on the SemEval2020 data, suggests that CADE can support models of semantic shift detection in several languages. Indeed, we show that, in combination with some simple semantic change measures, it is possible to provide a good model for semantic change detection that can subsequently be extended with more features. Appendix A contains some more detailed examples of the words that CADE (ln) and CADE (move), with λ set to 0.3, could not classify correctly. We also show the neighborhoods of some of those words to give more context on why we get those errors. A more precise use of pre-processing techniques, combined with other metrics to compute semantic change, might help in reducing these errors.

References

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Federico Bianchi, Valerio Di Carlo, Paolo Nicoli, and Matteo Palmonari. 2020. Compass-aligned distributional embeddings for studying semantic differences across corpora. arXiv preprint arXiv:2004.06519.

Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. 2019. Training temporal word embeddings with a compass. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6326–6334.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas, November. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany, August. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. arXiv preprint arXiv:2007.11464.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to lexical semantic change. arXiv preprint arXiv:1811.06278.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 673–681.

A CADE Misclassifications

We report in Tables 2 and 3 CADE's misclassifications with the two best metrics, namely CADE (move) with λ = 0.3 and CADE (ln). We also show in Tables 4 and 5 some examples of neighborhoods for the target words.

     Word           Pred           True
     trasferibile   changed        not changed
     pacchetto      changed        not changed
     piovra         changed        not changed

Table 2: Wrong predictions made by CADE (move) with λ = 0.3.

     Word           Pred           True
     pacchetto      changed        not changed
     rampante       not changed    changed

Table 3: Wrong predictions made by CADE (ln).

Table 4 shows the top 10 nearest neighbors of the target word "pacchetto". We think CADE classifies its meaning as changed because at time t1 the meaning is more focused on the economic domain, as one can see from neighbors like "azionario", "obbligazione" or "contante" ("stock", as referring to the market, "bond" and "cash", respectively), while at time t2 it shifts to a more political sense, as shown by words such as "decreto" or "emendamento" ("decree" and "amendment", respectively).

     t1             t2
     azionario      maxiemendamento
     obbligazione   finanziaria
     azionista      decretone
     azionano       decreto
     edison         ddl
     casseforte     emendamento
     contante       liberalizzazioni
     siap           decretere
     shell          maxidecreto
     prestire       ecobonus

Table 4: First 10 nearest neighbors by cosine similarity of the word "pacchetto" from t1 and t2.

The same seems to happen for the target word "piovra", as one can see from Table 5. At time t1 CADE gathers senses both from considering it as the animal, for example from the word "tentacolo" ("tentacle"), and as someone tied to crime in general, given words such as "profittatore" or "ruberia" ("profiteer" and "robbery", respectively); while at time t2 it captures a shift towards the Italian crime TV series "La piovra", as emerges from words such as "fiction", "camorra" or "retequattro", the latter being an Italian television channel.

     t1             t2
     tentacolo      fiction
     ingordigia     sceneggiato
     profittatore   tentacolo
     somaro         camorrere
     feudatario     retequattro
     insaziabile    raidue
     impere         puntato
     ruberia        camorra
     zanne          gomorra
     putrido        miniserie

Table 5: First 10 nearest neighbors by cosine similarity of the word "piovra" from t1 and t2.