=Paper=
{{Paper
|id=Vol-2765/132
|storemode=property
|title=CL-IMS @ DIACR-Ita: Volente o Nolente: BERT is still not Outperforming SGNS on Semantic Change Detection
|pdfUrl=https://ceur-ws.org/Vol-2765/paper132.pdf
|volume=Vol-2765
|authors=Severin Laicher,Gioia Baldissin,Enrique Castaneda,Dominik Schlechtweg,Sabine Schulte Im Walde
|dblpUrl=https://dblp.org/rec/conf/evalita/LaicherBCSW20
}}
==CL-IMS @ DIACR-Ita: Volente o Nolente: BERT is still not Outperforming SGNS on Semantic Change Detection==
CL-IMS @ DIACR-Ita: Volente o Nolente: BERT does not Outperform SGNS on Semantic Change Detection
Severin Laicher, Gioia Baldissin, Enrique Castañeda
Dominik Schlechtweg, Sabine Schulte im Walde
Institute for Natural Language Processing, University of Stuttgart
{laichesn,baldisga,medinaeo,schlecdk,schulte}@ims.uni-stuttgart.de∗
∗ Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit Average Pairwise Distance of token-based BERT embeddings between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not translate to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

1 Introduction

Lexical Semantic Change (LSC) Detection has drawn increasing attention in the past years (Kutuzov et al., 2018; Tahmasebi et al., 2018). Recently, SemEval-2020 Task 1 provided a multi-lingual evaluation framework to compare the variety of proposed model architectures (Schlechtweg et al., 2020). The DIACR-Ita shared task extends parts of this framework to Italian by providing an Italian data set for SemEval's binary subtask (Basile et al., 2020a; Basile et al., 2020b). We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit Average Pairwise Distance of token-based BERT embeddings (Devlin et al., 2019) between time points and rank 5 (of 8) in the official ranking with an accuracy of .72. While we tune parameters on the English data set of SemEval-2020 Task 1 and reach high performance, this does not transfer to the Italian DIACR-Ita data set. Our results show that we do not manage to find robust ways to exploit BERT embeddings in lexical semantic change detection.

2 Related Work

Most existing approaches for LSC detection are type-based (Schlechtweg et al., 2019; Shoemark et al., 2019): not every word occurrence is considered individually (token-based); instead, a general vector representation is created that summarizes every occurrence of a word (including ambiguous words). The results of SemEval-2020 Task 1 (Martinc et al., 2020; Schlechtweg et al., 2020) showed that type-based approaches (Pražák et al., 2020b; Asgari et al., 2020) achieved better results than token-based approaches (Beck, 2020; Kutuzov and Giulianelli, 2020a). This is somewhat surprising, since in recent years contextualized token-based approaches have achieved significant improvements over static type-based approaches in several NLP tasks (Ethayarajh, 2019). Schlechtweg et al. (2020) suggest a range of possible reasons for this: (i) Contextual embeddings are new and lack proper usage conventions. (ii) They are pre-trained and may thus carry additional, and possibly irrelevant, information. (iii) The context of word uses in the SemEval data set was too narrow (one sentence). (iv) The SemEval corpora were lemmatized, while token-based models usually take the raw sentence as input. In the DIACR-Ita challenge, (iii) and (iv) are irrelevant because raw corpora with sufficient context are made available to participants. We tried to tackle (i) by extensively tuning parameters and system modules on the English SemEval data set. (ii) can be tackled by fine-tuning BERT on the target corpora; however, our experiments on the English SemEval data set show that exceptionally high performances can be reached even without fine-tuning.
3 Experimental setup

The DIACR-Ita task definition is taken from SemEval-2020 Task 1 Subtask 1 (binary change detection): Given a list of target words and a diachronic corpus pair C1 and C2, the task is to identify the target words which have changed their meaning between the respective time periods t1 and t2 (Basile et al., 2020a; Schlechtweg et al., 2020).[1] C1 and C2 have been extracted from Italian newspapers and books. Target words which have changed their meaning are labeled with the value '1'; the remaining target words are labeled with '0'. Gold data for the 18 target words was semi-automatically generated from Italian online dictionaries. According to the gold data, 6 of the 18 target words are subject to semantic change between t1 and t2. The gold data was only made public after the evaluation phase. During the evaluation phase, each team was allowed to submit up to 4 predictions for the full list of target words, which were scored using classification accuracy between the predicted labels and the gold data. The final competition ranking compares only the highest of the scores achieved by each team.

[1] The time periods t1 and t2 were not disclosed to participants.
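Since the scoring is central to the setup, a minimal sketch may help; the word names and labels below are hypothetical, and the snippet only illustrates the accuracy computation described above.

```python
# Minimal sketch of the official scoring: classification accuracy between
# predicted and gold binary change labels (hypothetical words and labels).
gold = {"parola_a": 1, "parola_b": 0, "parola_c": 1}       # gold labels
predicted = {"parola_a": 1, "parola_b": 1, "parola_c": 1}  # one submission

correct = sum(predicted[w] == gold[w] for w in gold)
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 0.67 in this toy example
```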
4 System Overview

Our model uses BERT to create token vectors and the Average Pairwise Distance to compare the token vectors from two time periods. The following sections present our model, how we trained it, and how we chose our submissions.

4.1 BERT

In 2018, Google released a pre-trained model trained on Wikipedia and books of different genres (Devlin et al., 2019): BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to find representations for text by analysing its left and right contexts (Devlin et al., 2019). Peters et al. (2018) show that contextual word representations derived from pre-trained bidirectional language models like BERT and ELMo yield significant improvements to the state of the art for a wide range of NLP tasks. BERT can be used to analyse the semantics of individual words by creating contextualized word representations, i.e., vectors that are sensitive to the context in which they appear (Ethayarajh, 2019). BERT can either create one vector for an input sentence (sentence embedding) or one vector for each input token (token embedding).[2]

Different pre-trained BERT models are available across languages. In this task, we used the bert-base-italian-xxl-cased model[3] to create token embeddings for Italian. The basic BERT version is transformer-based and processes text in 12 different layers. In each layer, a contextualized token vector representation can be created for each word in an input sentence. It has been claimed that each layer captures different aspects of the input: Jawahar et al. (2019) suggest that the lower layers capture surface features, the middle layers capture syntactic features, and the higher layers capture semantic features of the text. Each layer can serve as a representation for the corresponding token by itself, or within a combination of multiple layers.

[2] The code of our system is available at https://github.com/Garrafao/TokenChange.
[3] https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
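For illustration, the sketch below shows one way to obtain such token vectors and layer combinations with the Hugging Face transformers library. It is a simplified sketch under our assumptions (e.g. that the target word is not split into several sub-tokens), not an excerpt of our actual system (see footnote [2]).

```python
# Sketch: contextualized token vectors for a target word from selected
# BERT layers via Hugging Face transformers (illustrative, simplified).
import torch
from transformers import AutoModel, AutoTokenizer

name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

sentence = "Il cappuccio della penna è rosso."  # one use of the target word
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # embeddings + 12 layers,
                                                   # each (batch, seq, dim)

# Position of the target word (assumes it survives as a single word piece).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
pos = tokens.index("cappuccio")

# (i) average of the last four encoder layers
last_four = torch.stack(hidden_states[-4:]).mean(dim=0)[0, pos]
# (ii) average of the first and last encoder layer
first_last = (hidden_states[1] + hidden_states[12])[0, pos] / 2
```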
4.2 Average Pairwise Distance

Given two sets of token vectors from two time periods t1 and t2, the idea of Average Pairwise Distance (APD) is to randomly pick a number of vectors from both sets and measure their pairwise distance (Sagi et al., 2009; Schlechtweg et al., 2018; Giulianelli et al., 2020; Beck, 2020; Kutuzov and Giulianelli, 2020b). The LSC score of the word is the mean distance over all comparisons:

$$\mathrm{APD}(V, W) = \frac{1}{n_V \cdot n_W} \sum_{v \in V,\, w \in W} d(v, w)$$

where V and W are the two sets of vectors, n_V and n_W denote the numbers of vectors to be compared, and d(v, w) refers to a distance measure (we used cosine distance (Salton and McGill, 1983)).
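Under this definition, APD is straightforward to implement; the following NumPy sketch (an illustration, not necessarily our exact implementation) computes it with cosine distance:

```python
# Sketch of APD with cosine distance, following the formula above.
import numpy as np

def apd(V: np.ndarray, W: np.ndarray) -> float:
    """Average Pairwise Distance between (n_V, dim) and (n_W, dim) matrices
    of token vectors sampled from time periods t1 and t2."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    distances = 1.0 - V @ W.T        # (n_V, n_W) pairwise cosine distances
    return float(distances.mean())   # 1/(n_V * n_W) * sum over all pairs

# Higher APD indicates more change between the two time periods.
rng = np.random.default_rng(0)
score = apd(rng.normal(size=(100, 768)), rng.normal(size=(100, 768)))
```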
4.3 Tuning

The choice of BERT layers and the measure used to compare the resulting vectors (e.g. APD, COS or clustering) strongly influences the performance (Kutuzov and Giulianelli, 2020a). Hence, we tuned these parameters/modules on the English SemEval data (Schlechtweg et al., 2020). For the 40 English target words, we had access to the sentences that were used for the human annotation (in contrast to task participants, who only had access to the lemmatized larger corpora containing more target word uses than just the annotated ones).

We tested several change measures regarding their ability to find the actual changing words. As part of our tuning, the APD measure produced the binary and graded LSC scores that best matched the actual LSC scores. We also tested the token vectors from different layers in order to check which one fits our task best. The best layer combinations were the average of the last four layers and the average of the first and last layer of BERT. The highest F1-score for the binary subtask was .75, and the Spearman correlation for the graded subtask was .65. These results outperformed all official submissions of the shared task, of which the best were all type-based.
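Schematically, this tuning amounts to a small grid search over layer combinations (and change measures), scored against the gold change values. The sketch below assumes hypothetical helpers (extract_vectors, gold_graded, and a measure such as apd above); it is not our actual tuning code.

```python
# Sketch of the tuning loop: score each layer combination against gold
# graded change values via Spearman correlation.
# extract_vectors() and gold_graded are hypothetical placeholders.
from scipy.stats import spearmanr

LAYER_COMBOS = {"last_four": [9, 10, 11, 12], "first_last": [1, 12]}

def tune(targets, gold_graded, extract_vectors, measure):
    results = {}
    for combo, layers in LAYER_COMBOS.items():
        scores = [measure(extract_vectors(w, "corpus1", layers),
                          extract_vectors(w, "corpus2", layers))
                  for w in targets]
        rho, _ = spearmanr(scores, [gold_graded[w] for w in targets])
        results[combo] = rho
    return results  # pick the combination with the highest correlation
```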
4.4 Threshold Selection

We created four predicted change rankings for the target words with BERT+APD. By experience and consideration of the shared tasks (Schlechtweg et al., 2020), we assumed that at most half of all target words actually underwent change. Therefore, we always annotated at most 9 of the 18 words with '1'. First, we extracted for each target word a maximum of 200 sentences that contain the word in any token form; we limited the number of uses to 200 for reasons of computational efficiency. Then, for each occurrence, we extracted and averaged the token vectors of (i) the last four layers of BERT, and (ii) the first and last layer. For our first submission ('Last Four, 7') we labeled those 7 words with '1' that achieved the highest APD scores in layer combination (i). For our second submission ('First + Last, 7') we labeled those 7 words with '1' that achieved the highest APD scores in layer combination (ii). In (i) and (ii) the same 9 words had the highest APD scores; therefore, in our third submission ('Average, 9') exactly these 9 words were labeled with '1'. For our last submission ('Lemma, Average, 6') we extracted only sentences in which the target words were present in their lemma form and again created the token vectors for the two layer combinations of BERT mentioned above. In both layer combinations the same 6 words had the highest APD scores; therefore, in our last submission exactly these 6 words were labeled with '1' (similar to submission 1).
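The labeling step itself then reduces to a top-k cut on the APD ranking; a minimal sketch with hypothetical scores (k = 7 for our first two submissions):

```python
# Sketch: label the k highest-ranked words (by APD score) as changed ('1').
def label_top_k(scores: dict, k: int) -> dict:
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {w: int(w in ranked[:k]) for w in scores}

apd_scores = {"cappuccio": 0.41, "palmare": 0.28, "rampante": 0.22}  # toy values
labels = label_top_k(apd_scores, k=2)  # {'cappuccio': 1, 'palmare': 1, 'rampante': 0}
```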
5 Results

Table 1 shows the accuracy scores for the different submissions. The best result was achieved by combining the first and last layer of BERT ('First + Last, 7' with .72), just as on the SemEval data. The second-best result was obtained by using only the sentences where the target word occurred in its lemma form ('Lemma, Average, 6' with .67). Only these two submissions outperformed the task baselines and the majority class baseline. The two lowest results were achieved by combining the last four layers of BERT ('Last Four, 7' with .61) and by averaging the two layer combinations ('Average, 9' with .61). The accuracy of our best submission (.72) was ranked at position 5 of the shared task, where the best task result was achieved by two different submissions and reached an accuracy of .94. Both of these submissions were based on type-based embeddings (Pražák et al., 2020a; Kaiser et al., 2020), clearly outperforming our system.

Submission                Thresh.   Acc.
First + Last              7         .72
Lemma, Average            6         .67
Majority Class Baseline   -         .66
Average                   9         .61
Last Four                 7         .61
Collocations Baseline     -         .61
Frequency Baseline        -         .61

Table 1: Overview of accuracy scores for the four submissions and the official task baselines. We also report a majority class baseline of a classifier predicting '0' for all target words.

6 Analysis

As mentioned above, the best performance of our system, achieved with 'First + Last, 7', has an accuracy of .72. It erroneously predicts a meaning change for cappuccio, unico and campionato, while for palmare and rampante it does not detect the change given by the gold standard.

We compared both corpora in order to find out whether the target words are correctly labeled by the gold standard, as well as to identify possible reasons behind the wrong predictions of our model.
According to our analysis, the data matches the gold standard. Cappuccio is polysemous across both time periods t0 and t1 ("hood", "cap"). However, 31% of the uses in t1 are uppercased, namely proper nouns (in contrast to 4% in t0), which might imply a different sense compared to the above-mentioned ones:

(1) BENEVENTO Il desiderio di il potere , il potere di il desiderio : ruota intorno a questo inquietante ( e attualissimo ) spunto il Festival di Benevento diretto da Ruggero Cappuccio .
'BENEVENTO The desire of the power, the power of the desire: the Festival di Benevento directed by Ruggero Cappuccio revolves around this unsettling (and current) cue.'

This skewed distribution of proper names in the two corpora is a possible reason for the wrong prediction of our model.

Throughout all target words, we noticed that the context provided by the previous and the following sentences (given as input to our model) is often not topically related; in some instances it seems as if the sentences are headlines, since they refer to different topics:

(2) M ROMA Sono quindici gli articoli in cui è suddiviso il provvedimento « antiracket » [...]. Roberta Serra ha vinto ieri lo slalom gigante di il campionati italiani femminili .
'M ROMA The «antiracket» measure is divided into fifteen articles [...]. Roberta Serra won yesterday the giant slalom of the Italian female championship.'

(3) ... le uniche azioni pericolose fiorentine sono arrivate quando il pallone e statu giocato su i lati di il Campo . costruzione di centrali idroelettriche , di miniere , canali e strade ...
'...the only dangerous Florentine actions arrived when the ball was played on the sides of the field. Construction of hydroelectric power plants, mines, channels and streets...'

This "headlines effect" occurs across the whole corpus. It can be traced back to the extraction process of the original corpus and may be a main source of error in our model. Despite not being representative, the following example shows that in some cases no centered window of any size would avoid considering unrelated context.

(4) REPARTO CONFEZIONI UOMO GIACCA cameriere bianca , in tessuto L' unica cosa certa è che il governo ha ricevuto una dura lezione da i professori .
'MEN'S TAILORING DEPARTMENT white textile waiter JACKET The only certain thing is that the government has received a hard lesson from the professors.'

Unico is another example of a word that was erroneously predicted as changing. Due to its abstract meaning ("only", "single", "unique"), it exhibits heterogeneous contexts across both time periods. Additionally, it can belong to different word classes (noun and adjective in (5) and (6), respectively).

(5) Rischiamo di rimanere gli unici a non aver dato mano a la ristrutturazione di le Forze Armate .
'We risk remaining the only ones not having helped in the reorganization of the Armed Forces.'

(6) ... è chiaro che l' unica cosa da fare sarebbe l' unificazione di le due aziende comunali ...
'...it is clear that the only thing to do would be the unification of the two municipal companies...'

With regard to the undetected changes, the term palmare (polysemous within and across word classes) acquires a novel sense in t1. While it mostly has the meaning "evident" in the 22 sentences of t0 (see (7)), it additionally denotes "palmtop" in t1 (see (8)).

(7) ... con evidenza palmare , la impossibilità di difendere una causa perduta ...
'with undeniable evidence, the impossibility of defending a lost cause'

(8) Per i palestinesi occorre una sistemazione provvisoria in attesa che gli europei si accordino per accoglier li . Potremmo citare in il lungo elenco il palmare Apple Newton troppo in anticipo su i tempi
'A temporary arrangement is needed for the Palestinians while waiting for the Europeans to agree on hosting them. We could quote in the long list the palmtop Apple Newton, too far ahead of its time'

Note that in (8), too, the topics of the previous and the target sentence are unrelated.

Rampante is a further case of undetected change.
The phrase cavallino rampante, which metonymically denotes "Ferrari", dominates the usage of the word in t0 (70%) and covers a (slightly) relevant share of the uses in t1 (19%). We hypothesize that this leads to a large number of homogeneous usage pairs masking the change of rampante from "rampant", "unbridled" to "extremely ambitious".

7 Conclusion

Our system, comprising BERT+APD, was ranked 5th in the DIACR-Ita shared task. The combination of BERT and APD did not perform as well as expected, and much lower than the best type-based embeddings, but our best submission still outperformed all baselines. The high tuning results achieved on the SemEval data could not be transferred to the Italian data. One reason for this may be that a different BERT model was applied, trained on text of a different language; we have not tuned the Italian BERT model, so it is possible that the decrease in performance is due to the change of the underlying BERT model. Furthermore, given that our model also considers the previous and the following sentences as input, the presence of semantically unrelated context could have played a significant role in mislabeling the target words.

Acknowledgments

Dominik Schlechtweg was supported by the Konrad Adenauer Foundation and the CRETA center funded by the German Ministry for Education and Research (BMBF) during the conduct of this study. We thank the task organizers and reviewers for their efforts.
References

Ehsaneddin Asgari, Christoph Ringlstetter, and Hinrich Schütze. 2020. EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA 2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christin Beck. 2020. DiaSense at SemEval-2020 Task 1: Modeling sense change via pre-trained BERT embeddings. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online, July. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July. Association for Computational Linguistics.

Jens Kaiser, Dominik Schlechtweg, and Sabine Schulte im Walde. 2020. OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still rocks Semantic Change Detection. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Andrey Kutuzov and Mario Giulianelli. 2020a. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov and Mario Giulianelli. 2020b. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: A survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Matej Martinc, Syrielle Montariol, Elaine Zosa, and Lidia Pivovarova. 2020. Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings not Always Better Than Static for Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium, October-November. Association for Computational Linguistics.

Ondřej Pražák, Pavel Přibáň, and Stephen Taylor. 2020a. UWB @ DIACR-Ita: Lexical Semantic Change Detection with CCA and Orthogonal Transformation. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Ondřej Pražák, Pavel Přibáň, Stephen Taylor, and Jakub Sido. 2020b. UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104–111, Athens, Greece, March. Association for Computational Linguistics.

Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York.

Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, New Orleans, Louisiana, USA.

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2019. Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 66–76, Hong Kong, China. Association for Computational Linguistics.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change. arXiv:1811.06278.