=Paper= {{Paper |id=Vol-2765/147 |storemode=property |title=UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a Compass for Semantic Change Detection in the Italian Language (short paper) |pdfUrl=https://ceur-ws.org/Vol-2765/paper147.pdf |volume=Vol-2765 |authors=Federico Belotti,Federico Bianchi,Matteo Palmonari |dblpUrl=https://dblp.org/rec/conf/evalita/BelottiBP20 }}
    UNIMIB @ DIACR-Ita: Aligning Distributional Embeddings with a
     Compass for Semantic Change Detection in the Italian Language
        Federico Belotti                        Federico Bianchi                 Matteo Palmonari
 University of Milano-Bicocca                  Bocconi University            University of Milano-Bicocca
    Viale Sarca 336, 20126                   Via Sarfatti 25, 20136            Viale Sarca 336, 20126
          Milan, Italy                            Milan, Italy                        Milan, Italy
 f.belotti8@campus.unimib.it                f.bianchi@unibocconi.it          matteo.palmonari@unimib.it


                       Abstract

In this paper, we present our results related to the EVALITA 2020 challenge, DIACR-Ita, on semantic change detection for the Italian language. Our approach is based on measuring the semantic distance across time-specific word vectors generated with Compass-Aligned Distributional Embeddings (CADE). We first generate temporal embeddings with CADE, a strategy to align word embeddings that are specific to each time period; the quality of this alignment is the main asset of our proposal. We then measure the semantic shift of each word by combining two different semantic shift measures. Finally, we classify a word's meaning as changed or not changed by defining a threshold over the semantic distance across time.

1 Introduction

Semantic change detection is the task of detecting whether a word has shifted in meaning between different periods of time (Tahmasebi et al., 2018; Kutuzov et al., 2018). The DIACR-Ita challenge (Basile et al., 2020a), held at EVALITA (Basile et al., 2020b), is meant to evaluate approaches for semantic change detection for the Italian language.

The task is described as follows: for training, two corpora t1 and t2, consisting of text coming from different periods, are given; for testing, a set of unlabeled target words is given, and for each of them a binary score has to be predicted: 1 identifies lexical change between t1 and t2, while 0 does not.

In this paper, we present our approach to semantic change detection, which is based on two components: 1) an alignment procedure to generate distributional vector spaces that are comparable for t1 and t2, and 2) the use of distance metrics to compute the degree of semantic change for a given word. Our alignment procedure is based on Compass-Aligned Distributional Embeddings (CADE), proposed by Bianchi et al. (2020) (note that the approach was introduced as Temporal Word Embeddings with a Compass by Di Carlo et al. (2019), but the name was changed to emphasize that the embeddings can be used to align more general corpora, not just diachronic ones). Given the aligned embeddings, we use two measures to compute the degree of change based on the similarities of the vectors in the embedded space. Our results show that our methodology for aligning spaces can be useful in detecting lexical semantic change.

      "Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

2 Description of the System: Semantic Change Detection with Compass-Aligned Embeddings

Our approach is based on measuring the semantic distance across time of time-specific word vectors generated with CADE and on the use of two measures for detecting semantic shifts, i.e., the semantic distance between word vectors across time. This distance can be interpreted as a function of a word's self-similarity across time, where the similarity is measured by a linear combination of cosine and second-order similarity (Hamilton et al., 2016a). Finally, a threshold over this self-similarity is used to classify a word as changed or not changed.

This methodology was also applied in the semantic shift detection challenge presented at SemEval2020 (Schlechtweg et al., 2020), in which we participated after the end of the challenge. That challenge allowed us to explore and understand how the alignment and our self-similarity measure behaved. In the classification task of the SemEval2020 challenge (the one similar to this task),
we eventually achieved accuracy scores of 0.703, 0.771, 0.725, and 0.742 for English, German, Latin, and Swedish, respectively; these results were obtained with an extensive parameter search given the gold standard available in the post-evaluation.1 In DIACR-Ita, the threshold and a few other hyperparameters had to be set heuristically to account for the limited number of possible submissions. In the next subsections we provide more details about the alignment methodology and the similarity function; details about how we set the hyperparameters are provided in Section 3.

2.1 Aligning Embeddings

Word2vec (Mikolov et al., 2013) is a useful methodology for generating word vectors, allowing us to study word similarity through vector similarity. However, due to the stochasticity of the training procedure, running word2vec on different corpora creates word vectors that are not comparable. Thus, an alignment procedure that puts the temporal word vectors in the same space is needed.

There are different approaches to generate these aligned embeddings (see, for example, the work by Hamilton et al. (2016b) and Yao et al. (2018)). In this paper, we generate aligned embeddings with Compass-Aligned Distributional Embeddings (CADE) (Bianchi et al., 2020); see Figure 1 for a schematic description of the model. CADE is a strategy, extending the word2vec Continuous Bag of Words (CBOW) model proposed by Mikolov et al. (2013), to align word embeddings that are specific to each time period. CADE can be used to generate aligned temporal word embeddings (i.e., time-specific vectors of words, like "amazon_1974") from the different slices.

Given as input a set of slices of text, where each slice corresponds to text coming from a specific period of time, the alignment procedure is as follows. First, the text from all the slices is concatenated and CBOW is run on this corpus in order to obtain a "compass" model, i.e., a model defining the embedding space. The CBOW model uses two matrices to generate the embeddings (U and C in Figure 1), one for the context words and one for the target words. The target word matrix of the compass is then used to initialize the target matrices for each new CBOW model fitted on each of the slices. During training, these new target matrices are frozen, i.e., they are not updated during the training on the slice. This ensures that at the end of the training process, the various temporal embeddings are all aligned in the same embedding space, making them comparable without losing their individual temporal distinctions. We use the publicly available online implementation of CADE.2

2.2 Computing Semantic Change

Once the embeddings are aligned, we need measures to evaluate the degree of semantic change. We compute the semantic shift of each word, i.e., the semantic distance between word vectors across time, using the combination of two different measures: Local Neighbors (ln), introduced by Hamilton et al. (2016a), and cosine similarity (cos), merging them with a weighted linear combination into a new measure called Move.

Local Neighbors. ln is based on the similarity between a word and its neighbor words in the two different time periods. Essentially, we compute the degree of semantic change of the word w in two slices by first collecting the nearest neighbors (NNs) of w_t and w_t+1 in the two respective slices; then, given the embeddings at time t, the similarities between the vector of w_t and the vectors of all the neighbors are computed.3 The same process is run for time t+1 with w_t+1, eventually giving us two vectors of similarity scores. These two vectors are then compared using cosine similarity. The higher the value of this measure, the less the vector has changed with respect to its neighbors, and thus the less the word should have shifted in meaning.

Cosine Similarity. The second measure we use is simply the cosine similarity of the vectors of a word in two different time periods. As before, the higher the value, the less the vector has changed, and thus the less the word should have shifted in meaning.

The Move Measure. We merge these measures using a weighted linear combination, that is:

    s(w_t, w_t+1) = (1 - λ) · ln(w_t, w_t+1) + λ · cos-sim(w_t, w_t+1)

1 Check the belerico entry in the challenge leaderboard at https://competitions.codalab.org/competitions/20948#results
2 http://github.com/vinid/cade
3 When a neighbor is missing in one time slice, we replace it with the average vector of the space.
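The non-comparability that motivates the alignment in Section 2.1 can be illustrated with a small simulation (our own toy sketch, not part of the system described above): two independent training runs behave like two random rotations of the same underlying word geometry, so each run preserves its internal similarity structure while raw cross-run comparisons of the same word become meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 10, 50
true_space = rng.normal(size=(n_words, dim))  # a hypothetical shared "true" geometry

def random_rotation(dim, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix:
    # it rotates a space without distorting distances or angles.
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

# Two independent word2vec runs, modeled as two random rotations of the
# same geometry: each keeps its internal structure, but the coordinate
# systems of the two runs do not match.
run_a = true_space @ random_rotation(dim, rng)
run_b = true_space @ random_rotation(dim, rng)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Within-space similarities survive each rotation exactly ...
within_drift = abs(cos(run_a[0], run_a[1]) - cos(true_space[0], true_space[1]))

# ... but comparing the same "word" across the two runs is uninformative:
cross_self_sim = float(np.mean([cos(run_a[i], run_b[i]) for i in range(n_words)]))

print(within_drift)    # ~0: internal structure preserved
print(cross_self_sim)  # typically nowhere near 1
```

Freezing the compass target matrix across slices pins every slice model to the same basis, which is what makes the cross-slice cosine comparisons of Section 2.2 meaningful.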
[Figure 1 diagram: slices D1 ... Dn; step 1) train the compass (matrices U and C) on the concatenation of all slices; step 2) initialize and freeze each slice-specific CBOW target matrix with the same U matrix; step 3) train each slice model, yielding context matrices C1 ... Cn.]

Figure 1: A high-level overview of the Compass Aligned Distributional Embeddings model.


with λ ∈ [0, 1]. In particular, λ expresses the relative weight of the two measures: a high λ shifts Move towards the cosine similarity, while a low one shifts it towards the ln measure. As introduced before, we classify whether the meaning has changed by defining a threshold over s (more details are presented in the next section).

3 Experimental Evaluation

The dataset provided by the challenge's organizers (Basile et al., 2020a) is a collection of documents extracted from newspapers written in the Italian language and labeled with temporal information. Participants must train their models only on the data provided, so a pre-processed corpus is given: tab-separated, with one token per line, where each token is accompanied by its part-of-speech (POS) tag and lemma, and sentences are separated by empty lines. The corpus is split into two slices, each belonging to a specific period of time, t1 and t2, where t1 < t2.

3.1 Dataset

For the training data we used the flat version with only the lemmas, obtained with the organizers' script (Basile et al., 2020a); in addition, we applied a pre-processing step in which we removed punctuation and non-alphanumeric symbols and kept only those sentences with at least two tokens.

3.2 Models Considered

We use the embeddings aligned with CADE and the move measure. The parameters of the move measure we need to consider are: the number of nearest neighbors (NNs) to be collected by ln, λ for the linear combination, and the threshold for the similarity. The threshold used to decide whether a word is stable or not is set to 0.7, with the decision given by:

    0 if s(w_t, w_t+1) ≥ 0.7
    1 otherwise

Essentially, the less the two vectors of the word (for cos) and of its neighbors (for ln) have changed, the more the word has been stable between the two time periods. As heuristics, we chose λ ∈ {0.3, 0.5, 0.7} to evaluate the relationship between the two measures used to build move, and we set to 22 the number of nearest neighbors considered by ln; this is the general setup that produced the results submitted to the challenge.

We trained CADE for 10 epochs to learn 100-dimensional vectors, with the window size set to 5, 10 negative examples for every positive one, and the initial learning rate set to 0.025 and decreased linearly during training.

As additional models, in the post-evaluation we also considered one that uses only the cos similarity measure (CADE (cos)) and one that uses only the ln metric (CADE (ln)), again with 0.7 as threshold and with the number of NNs for ln set to 22.

As baselines, the organizers propose baseline-freq, i.e., the absolute value of the difference between a word's frequencies in the two periods, and baseline-colloc, where the Bag-of-Collocations of a word in the two different periods is built and cosine similarity is then applied. A threshold is used on both metrics to define semantic change (Basile et al., 2020a). We also report the results of the other participants.
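The measures of Section 2.2 and the decision rule above can be sketched in plain NumPy. This is a minimal illustration under assumptions the paper leaves open: embedding spaces are plain word-to-vector dictionaries, the two slice-specific neighbor sets are merged by union, and, following footnote 3, a neighbor missing from one slice is replaced by the average vector of that space.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(word, space, k):
    """Top-k neighbors of `word` in `space` by cosine similarity, excluding itself."""
    v = space[word]
    ranked = sorted(space, key=lambda w: cos_sim(v, space[w]), reverse=True)
    return [w for w in ranked if w != word][:k]

def ln(word, space_t, space_t1, k=22):
    """Local Neighbors (Hamilton et al., 2016a): second-order similarity across slices."""
    neigh = sorted(set(nearest_neighbors(word, space_t, k))
                   | set(nearest_neighbors(word, space_t1, k)))
    def sim_vector(space):
        avg = np.mean(list(space.values()), axis=0)  # footnote 3: missing-neighbor fallback
        return np.array([cos_sim(space[word], space.get(n, avg)) for n in neigh])
    return cos_sim(sim_vector(space_t), sim_vector(space_t1))

def move(word, space_t, space_t1, lam=0.3, k=22):
    """s(w_t, w_t+1) = (1 - lam) * ln + lam * cos, as in Section 2.2."""
    return ((1 - lam) * ln(word, space_t, space_t1, k)
            + lam * cos_sim(space_t[word], space_t1[word]))

def is_changed(word, space_t, space_t1, lam=0.3, k=22, threshold=0.7):
    """Decision rule: 1 = changed, 0 = stable (threshold 0.7 in the submitted runs)."""
    return 0 if move(word, space_t, space_t1, lam, k) >= threshold else 1
```

With two identical spaces, move returns 1 and the word is classified as stable; the submitted configuration used λ ∈ {0.3, 0.5, 0.7}, 22 neighbors, and threshold 0.7.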
                 λ      Acc.
team1            /      0.944
team2            /      0.944
team3            /      0.889
CADE (move)†     0.3    0.833
team4            /      0.833
team5            /      0.833
team6            /      0.778
team7            /      0.722
team8            /      0.667
team9            /      0.611
baseline-colloc  /      0.611
baseline-freq    /      0.500
CADE (move)†     0.5    0.722
CADE (move)†     0.7    0.722
CADE (cos)       /      0.722
CADE (ln)        /      0.889

Table 1: Accuracy scores for the binary classification w.r.t. the other participants in the challenge. † identifies our submitted results.

3.3 Results

The evaluation metric used in this challenge is accuracy, that is, the fraction of correct predictions over the target data. Table 1 shows the results. Our model was the third most accurate. However, in the post-evaluation we discovered that just using the ln metric and ignoring cos (which is equivalent to using λ = 0 in our move measure) improves the performance, leading to the second-best accuracy score in the leaderboard.

4 Discussion

Our results show that CADE (Bianchi et al., 2020) is an effective method to generate aligned embeddings for the Italian language. This result, together with those obtained on the SemEval2020 data, suggests that CADE can support models of semantic shift detection in several languages. Indeed, we show that, in combination with some simple semantic change measures, it is possible to provide a good model for semantic change detection that can subsequently be extended with more features. Appendix A contains some more detailed examples of the words that CADE (ln) and CADE (move), with λ set to 0.3, could not classify correctly. We also show the neighborhoods of some of those words to give more context on why we get those errors. A more precise use of pre-processing techniques, combined with other metrics to compute semantic change, might help in reducing these errors.

References

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Federico Bianchi, Valerio Di Carlo, Paolo Nicoli, and Matteo Palmonari. 2020. Compass-aligned distributional embeddings for studying semantic differences across corpora. arXiv preprint arXiv:2004.06519.

Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. 2019. Training temporal word embeddings with a compass. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6326–6334.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas, November. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany, August. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. arXiv preprint arXiv:2007.11464.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to lexical semantic change. arXiv preprint arXiv:1811.06278.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 673–681.

A CADE Misclassifications

We report in Tables 2 and 3 CADE's misclassifications with the two best metrics, namely CADE (move) with λ = 0.3 and CADE (ln). We also show in Tables 4 and 5 some examples of neighborhoods for the target words.

     Word           Pred           True
     trasferibile   changed        not changed
     pacchetto      changed        not changed
     piovra         changed        not changed

Table 2: Wrong predictions made by CADE (move) with λ = 0.3.

     Word           Pred           True
     pacchetto      changed        not changed
     rampante       not changed    changed

Table 3: Wrong predictions made by CADE (ln).

Table 4 shows the top 10 nearest neighbors of the target word "pacchetto". We think CADE classifies its meaning as changed because at time t1 the meaning is more focused on the economic domain, as one can see from neighbors like "azionario", "obbligazione" or "contante" ("stock", as referring to the market, "bond" and "cash", respectively), while at time t2 it shifts to a more political sense, as shown by words such as "decreto" or "emendamento" ("decree" and "amendment", respectively).

     t1             t2
     azionario      maxiemendamento
     obbligazione   finanziaria
     azionista      decretone
     azionano       decreto
     edison         ddl
     casseforte     emendamento
     contante       liberalizzazioni
     siap           decretere
     shell          maxidecreto
     prestire       ecobonus

Table 4: First 10 nearest neighbors by cosine similarity of the word "pacchetto" from t1 and t2.

The same seems to happen for the target word "piovra", as one can see from Table 5. At time t1 CADE gathers senses both from considering it as the animal, for example from the word "tentacolo" ("tentacle"), and as someone tied to crime in general, given words such as "profittatore" or "ruberia" ("profiteer" and "robbery", respectively); while at time t2 it captures a shift towards the Italian crime TV series "La piovra", as emerges from words such as "fiction", "camorra" or "retequattro", the latter being an Italian television channel.

     t1             t2
     tentacolo      fiction
     ingordigia     sceneggiato
     profittatore   tentacolo
     somaro         camorrere
     feudatario     retequattro
     insaziabile    raidue
     impere         puntato
     ruberia        camorra
     zanne          gomorra
     putrido        miniserie

Table 5: First 10 nearest neighbors by cosine similarity of the word "piovra" from t1 and t2.