University of Padova @ DIACR-Ita

Benyou Wang, Emanuele Di Buccio and Massimo Melucci
Department of Information Engineering, University of Padova, Padova, Italy
{wang,dibuccio,melo}@dei.unipd.it

Abstract

The semantic change detection task in a relatively low-resource language like Italian is challenging. By using contextualized word embeddings, we formalize the task as a distance metric for two flexible-size sets of vectors. Various distance metrics are used: the average Euclidean distance, the average Canberra distance, the Hausdorff distance, and the Jensen–Shannon divergence between cluster distributions obtained from K-means clustering and a Gaussian Mixture Model. The final prediction is given by an ensemble of the top-ranked words under each distance metric. The proposed method achieved better performance than the frequency and collocation based baselines.

1 Introduction

Lexical Semantic Change detection aims at identifying words that change meaning over time; this problem is of great interest for NLP, lexicography, and linguistics. A semantic change detection task in English, German, Latin, and Swedish was proposed by Schlechtweg et al. (2020). Recently, Basile et al. (2020a) organized a lexical semantic change detection task in Italian, called DIACR-Ita, at EVALITA 2020 (Basile et al., 2020b). This technical report describes the methodology designed and developed by the University of Padova for the participation in DIACR-Ita.

Some previous approaches to semantic change modelling were based on static word embeddings, where word vectors were trained on each time-stamped corpus and then aligned, e.g., by orthogonal projections (Hamilton et al., 2016), vector initialization (Kim et al., 2014), or temporal referencing (Dubossarsky et al., 2019). This work instead relies on contextualized word embeddings as the basic word representation component (Hu et al., 2019), since they have been shown to be effective in many NLP tasks, including document classification and question answering. Methods relying on contextualized word embeddings performed worse than those based on static word embeddings in semantic change detection tasks in many languages (Kutuzov and Giulianelli, 2020; Pömsl and Lyapin, 2020; Schlechtweg et al., 2020; Vani et al., 2020; Giulianelli et al., 2020; Giulianelli, 2019). However, in our opinion the use of contextualized word embeddings for this task is worth investigating, because (1) they have high expressive power, as demonstrated in many downstream tasks, e.g., document classification and question answering, and (2) they can handle fine-grained representations of individual contexts at the level of tokens.

With contextualized word embeddings, each word in a specific sentence is represented as a vector that depends on the neighboring words forming the context of the word; a word appearing many times in a corpus is therefore represented as a set of vectors, since one vector corresponds to each occurrence. In this paper, semantic change detection is addressed by computing the distance between two flexible-size sets of vectors derived from two time-stamped corpora. We investigated several distance metrics: the average Euclidean distance, the average Canberra distance, and the Hausdorff distance. Our methodology also applies a clustering algorithm (e.g., K-means clustering and the Gaussian Mixture Model) to the joint set and calculates the Jensen–Shannon divergence between the cluster distributions in the two sub-corpora. We aggregate the top-ranked words under each distance metric as the final prediction. The proposed method achieved better performance than the frequency and collocation based baselines, and finally ranked 8th among the 9 participating teams.
2 Problem definition

Unlike static word embeddings such as Word2vec (Mikolov et al., 2013) (see Wang et al. (2019) for an overview of word vectors), contextualized word embeddings like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) generate a word representation that depends on the context of the word, so that a word no longer has a unique mapping to a fixed word vector.

Let us denote a corpus with m sentences as C. In this paper, C is related to a time span t because of the task characteristics; however, the corpus could be tailored to any specific aspect, e.g., a specific domain such as news or books. For a word w_i appearing in C, its contextualized representation in the k-th sentence (if a word appears in a sentence more than once, we take the average of its vectors) is denoted by e_{i,k}^{(C)}. The representation of the word in the corpus is then the set

    Φ_i^{C} = { e_{i,1}^{(C)}, e_{i,2}^{(C)}, ..., e_{i,k}^{(C)}, ..., e_{i,m}^{(C)} }    (1)

To examine whether a word w_i exhibits a semantic change between two corpora C1 (in t1) and C2 (in t2), we check the difference between the two sets Φ_i^{C1} and Φ_i^{C2}. Let l_i be a human-annotated label indicating the semantic change degree; l_i usually ranges from 0 to 1, where 1 denotes a full semantic change. Let D be the dimension of the word vectors. We define the distance metric as a function

    f : {ℝ^D}^m × {ℝ^D}^n → ℝ    (2)

that yields a semantic change degree from the representations of a word in the two corpora, Φ_i^{C1} and Φ_i^{C2}. When labels are binary, one may simply apply a threshold to the values of f(·,·) to predict the binary label. Let δ be a function that generates a binary output, e.g., based on a hand-crafted threshold. We can then predict whether w_i exhibits a semantic change between C1 and C2 as

    l̄_i = δ(f(Φ_i^{C1}, Φ_i^{C2}))    (3)

where l̄_i is the predicted binary label. In conclusion, in our work the semantic change detection task is formalized as

    argmax_{f,δ} Σ_{w_i} 𝟙[ δ(f(Φ_i^{C1}, Φ_i^{C2})) = l_i ]    (4)

Since this is a closed task, we may not have enough annotated samples to train f by gradient descent. Therefore, a well-selected f is crucial.

3 Methodology

3.1 Contextualized Word Embedding

Using contextualized word embeddings like ELMo and BERT has been shown to improve performance in various downstream tasks thanks to their expressive power. In this paper we use multilingual BERT (https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip). The uncased model is adopted, since we assume that semantic change detection is insensitive to word case. The model follows the base setting, with 12 layers, 12 attention heads, and a hidden state dimension of 768. Only the last-layer output of BERT is used as the word representation.
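To make the representation step concrete, the following sketch shows how the occurrence set of Eq. (1) could be collected with the HuggingFace transformers library. The checkpoint id corresponds to the public uncased multilingual BERT base model; the function name and the single-piece token matching are illustrative assumptions, not the exact pipeline behind the submission.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def occurrence_vectors(sentences, target_word):
    """Collect one 768-d vector per sentence in which `target_word` occurs."""
    vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # last-layer output only
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        # Keep positions whose token matches the (lower-cased) target word;
        # multiple occurrences within one sentence are averaged, as in Section 2.
        hits = [hidden[i] for i, tok in enumerate(tokens) if tok == target_word.lower()]
        if hits:
            vectors.append(torch.stack(hits).mean(dim=0))
    return torch.stack(vectors) if vectors else torch.empty(0, model.config.hidden_size)
```

Running this once per target word and per corpus yields the two occurrence sets Φ_i^{C1} and Φ_i^{C2} that the distance metrics below compare.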
3.2 Measuring the Semantic Change Degree

3.2.1 Distance-based Methods

In this section, we introduce various methods to calculate the semantic change degree.

Average Geometric Distance. The Average Geometric Distance (AGD), also used by Kutuzov and Giulianelli (2020) and Giulianelli (2019), is defined as

    AGD(Φ_i^{C1}, Φ_i^{C2}) = (1/mn) Σ_{x∈Φ_i^{C1}, y∈Φ_i^{C2}} d(x, y)

The distance function d(·,·) can be the Euclidean distance, d(x, y) = ‖x − y‖_2; the Canberra distance (Lance and Williams, 1966), a normalized version of the Manhattan distance, d(x, y) = Σ_{i=1}^{D} |x_i − y_i| / (|x_i| + |y_i|); or any other distance function. In this paper, we also use the negative cosine similarity as a normalized distance metric.

Hausdorff distance. The Hausdorff distance (Rockafellar and Wets, 2009), HD for short, is generally used to measure the distance between two non-empty sets:

    HD(Φ_i^{C1}, Φ_i^{C2}) = max( sup_{x∈Φ_i^{C1}} inf_{y∈Φ_i^{C2}} ‖x − y‖_2 , sup_{x∈Φ_i^{C2}} inf_{y∈Φ_i^{C1}} ‖x − y‖_2 )    (5)
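As a concrete reference, the set-level distances above can be sketched in a few lines of NumPy/SciPy. Here `emb1` and `emb2` stand for the occurrence matrices Φ_i^{C1} and Φ_i^{C2} (one row per occurrence), and the function names are our own. Note that SciPy's 'cosine' metric is 1 minus the cosine similarity, which ranks words identically to the negative cosine similarity used in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def agd(emb1, emb2, metric="euclidean"):
    """Average Geometric Distance: mean over all pairwise distances between
    the two occurrence sets; `metric` may be 'euclidean', 'canberra', or 'cosine'."""
    return cdist(emb1, emb2, metric=metric).mean()

def hausdorff(emb1, emb2):
    """Symmetric Hausdorff distance of Eq. (5), built from SciPy's directed variant."""
    d12, _, _ = directed_hausdorff(emb1, emb2)
    d21, _, _ = directed_hausdorff(emb2, emb1)
    return max(d12, d21)
```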
3.2.2 Clustering-based Methods

By clustering the union of Φ_i^{C1} and Φ_i^{C2} into K clusters/categories, we obtain the category distributions p and q for Φ_i^{C1} and Φ_i^{C2}, respectively. We adopted two commonly used clustering methods: K-means clustering and the Gaussian Mixture Model. As the distance between the two distributions, we adopted the Jensen–Shannon Divergence (JSD), a symmetrized and smoothed version of the Kullback–Leibler divergence:

    JSD = (1/2) KL(p, q) + (1/2) KL(q, p)

where KL(p, q) = Σ_{i=1}^{K} p_i log(p_i / q_i).
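A minimal sketch of this clustering-based score with scikit-learn follows; the number of clusters K and the smoothing constant are illustrative choices, since, as discussed in Section 5.2, the optimal K is not known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_jsd(emb1, emb2, k=5, use_gmm=False, eps=1e-10):
    """Cluster the union of the two occurrence sets, then compute the
    symmetrized KL divergence between the per-corpus cluster histograms."""
    joint = np.vstack([emb1, emb2])
    if use_gmm:
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(joint)
    else:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(joint)
    # Normalized cluster distributions p, q; eps avoids log(0) and division by zero.
    p = np.bincount(labels[:len(emb1)], minlength=k) / len(emb1) + eps
    q = np.bincount(labels[len(emb1):], minlength=k) / len(emb2) + eps

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, q) + 0.5 * kl(q, p)
```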
In this section, models like BERT achieved much better results we will discuss some limitations of currently-used compared to static word embedding with a two- contextualized embedding based methods for se- stage training paradigm, where the two stages are mantic change detection. pre-training in language model (e.g., mask lan- There are typically two kinds of methods to use guage model) and fine-tuning in downstream tasks contextualized embeddings for semantic change (e.g., classifications). However, in the semantic detection: embedding-based distance metrics and change detection task, fine-tuning in downstream clustering-based distance metrics (Schlechtweg tasks is currently impossible because the anno- et al., 2020; Vani et al., 2020; Giulianelli et al., tated labels are insufficient to this aim; to some 2020; Giulianelli, 2019). The former are directly extent, the lack of fine-tuning stage may harm the calculated on the raw contextualized word embed- performance of the pre-trained language models. dings while the latter are based on the clustering results of contextualized word embeddings. 5.2 Clustering-based Distance Metrics 5.1 Embedding-based Distance Metrics After clustering, we used the Jensen–Shannon di- Can distance metrics distinguish semantic shift vergence (JSD) which is affected by the issues patterns? Many typical patterns of semantic mentioned in Section 5.1 like other distance met- shifts have been investigated (Grossmann and rics. Plus, the clustering algorithm may introduce Rainer, 2013; Basile et al., 2020a): 1) pejora- some errors of semantic change detection. First, tion or amelioration (when word meanings be- typical clustering algorithms may not necessarily come more negative or more positive); 2) broad- converge to an identical clustering result when the ening or narrowing (when it evolves as a general- seed centroids are changed. Moreover, the number ized/extended object or a restricted or specialized of clusters is crucial since the optimal number of one); 3) adding/deleting a sense; 4) totally shifted. clusters cannot easily be decided before clustering. Figure 1: Examples (i.e., ‘rampante’ and ‘palmare’) of predicted ”semantically-shifted” words. Red and blue points denote dimensionally-reduced vectors of two time-stamped corpora respectively. Figure 2: Examples (i.e., ‘cappuccio’ and ‘campanello’) of predicted ”semantically-unshifted” words. Red and blue points denote dimensionally-reduced vectors of two time-stamped corpora respectively. 6 Conclusions References This paper formalizes semantic change detection Pierpaolo Basile, Annalina Caputo, Tommaso as a distance metric between two variable-sized Caselli, Pierluigi Cassotti, and Rossella Var- sets of vectors. The final prediction is based on an vara. 2020a. DIACR-Ita @ EVALITA2020: ensemble of different distance metrics. The pro- Overview of the EVALITA 2020 Diachronic posed method outperformed weak frequency and Lexical Semantics (DIACR-Ita) Task. In collocation baselines, but it performed less well EVALITA 2020, Valerio Basile, Danilo Croce, than SOTA baselines. As a future work, this task Maria Di Maro, and Lucia C. Passaro (Eds.). may be largely improved via a supervised task CEUR.org, Online. in a unified multi-lingual framework; thus, any Valerio Basile, Danilo Croce, Maria Di Maro, human-annotated labels in other languages could and Lucia C. Passaro. 2020b. 
4 Experiments

4.1 Dataset and Evaluation Methodology

DIACR-Ita is the first task on lexical semantic change for Italian. DIACR-Ita aims at automatically detecting whether a word changes semantically over time. The task is to detect whether a set of words, called target words, change their meaning across two periods, t1 and t2, where t1 precedes t2. Participants are provided with two corpora C1 and C2 (corresponding to t1 and t2, respectively) and a set of target words. For instance, the meaning of the word 'imbarcata' has changed from t1 to t2: originally the word referred to an 'acrobatic manoeuvre of aeroplanes', but it is nowadays used to refer to the state of being deeply in love (Basile et al., 2020a), although the latter meaning is much less used than the former. The task is formulated as a closed task, namely, models must be trained solely on the provided data. The occurrences of the target words are reported in Table 1.

word            | # corpus C1 | # corpus C2
----------------|-------------|------------
egemonizzare    | 11          | 37
lucciola        | 64          | 226
campanello      | 109         | 628
trasferibile    | 7           | 60
brama           | 17          | 93
polisportiva    | 74          | 134
palmare         | 19          | 88
processare      | 39          | 594
pilotato        | 34          | 285
cappuccio       | 60          | 198
pacchetto       | 274         | 5690
ape             | 123         | 252
unico           | 4524        | 29620
discriminatorio | 110         | 262
rampante        | 26          | 462
campionato      | 3918        | 11871
tac             | 88          | 438
piovra          | 30          | 621

Table 1: Number of sentences in which each target word occurs in the two time-stamped corpora C1 and C2.

Labels in this task are binary and the task is considered a binary classification problem. The evaluation is based on accuracy:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

where T and F refer to 'True' and 'False', and P and N refer to 'Positive' and 'Negative'; for example, TP is the number of truly-predicted positive samples.

The task organizers provided two baselines. Frequencies: the absolute value of the difference between the word's frequencies in the two corpora is computed. Collocations: for each word, the cosine similarity between two Bag-of-Collocations (BoC) vector representations derived from C1 and C2 is computed. In both baseline models, a threshold is used to predict whether the word has changed its meaning.

4.2 Experimental Results

Experimental results are reported in Table 2 and show that the proposed method achieved better performance than the frequency and collocation based baselines.

method                                    | accuracy
------------------------------------------|---------
Frequencies                               | 0.50
Collocations                              | 0.61
Aggregated results (submitted)            | 0.67
Average negative cosine similarity        | 0.67
Average distance with Euclidean distance  | 0.61
Average distance with Canberra distance   | 0.61
Hausdorff distance                        | 0.50
JS divergence with K-means clustering     | 0.61
JS divergence with Gaussian Mixture Model | 0.61

Table 2: Results of the proposed methods.

4.3 Post-hoc Analysis

In this section, we provide a two-dimensional visualization of the word representations to intuitively understand how the contextualized word vectors work. For each word, we collect all of its contextualized word vectors (with a dimension of 768). To visualize a word in a 2D plane, we use a typical dimensionality reduction algorithm, t-SNE (Maaten and Hinton, 2008), to reduce the word vectors from 768 to 2 dimensions. Red and blue points denote the low-dimensional representations of the vectors from the two time-stamped corpora C1 (blue) and C2 (red).

For example, 'rampante' and 'palmare' are predicted positive samples, while 'cappuccio' and 'campanello' are predicted negative samples. As shown in Figure 1, the predicted semantically-shifted words exhibit a clear difference between the red and blue points of the two time-stamped corpora. For the predicted semantically-unshifted words (see Figure 2), the red and blue points are much harder to tell apart.
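The visualizations in Figures 1 and 2 can be reproduced along the following lines; the t-SNE settings and plotting details are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_word(emb1, emb2, word):
    """Project the 768-d occurrence vectors of one word to 2-D with t-SNE and
    colour them by corpus: C1 in blue, C2 in red."""
    joint = np.vstack([emb1, emb2])
    # Perplexity must stay below the number of points; lowered for small sets.
    points = TSNE(n_components=2, perplexity=min(30.0, len(joint) - 1),
                  random_state=0).fit_transform(joint)
    plt.scatter(points[:len(emb1), 0], points[:len(emb1), 1], c="blue", s=8, label="C1")
    plt.scatter(points[len(emb1):, 0], points[len(emb1):, 1], c="red", s=8, label="C2")
    plt.title(word)
    plt.legend()
    plt.show()
```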
5 Limitations

In Schlechtweg et al. (2020), semantic representations are mainly divided into two categories: average embeddings ('type embeddings') and contextualized embeddings ('token embeddings'). Schlechtweg et al. (2020) showed that the performance of token-based models is much lower than that of type-based embedding models. In this section, we discuss some limitations of the currently-used contextualized-embedding-based methods for semantic change detection.

There are typically two kinds of methods that use contextualized embeddings for semantic change detection: embedding-based distance metrics and clustering-based distance metrics (Schlechtweg et al., 2020; Vani et al., 2020; Giulianelli et al., 2020; Giulianelli, 2019). The former are calculated directly on the raw contextualized word embeddings, while the latter are based on the clustering results of the contextualized word embeddings.

5.1 Embedding-based Distance Metrics

Can distance metrics distinguish semantic shift patterns? Many typical patterns of semantic shift have been investigated (Grossmann and Rainer, 2013; Basile et al., 2020a): 1) pejoration or amelioration (when a word meaning becomes more negative or more positive); 2) broadening or narrowing (when it evolves into a generalized/extended sense or a restricted/specialized one); 3) adding/deleting a sense; 4) total shift. The patterns of semantic change are multifaceted, and we question whether a single distance metric can precisely distinguish all of the above typical semantic shift patterns.

Normalization. Most of the distance metrics are not normalized, except for the negative cosine similarity. The absolute values of unnormalized distance metrics may differ greatly among individual words; they are sometimes unexpectedly affected by the number of samples, which means the values of the metrics may not be comparable across words.

Outliers. Some distance metrics (e.g., the Hausdorff distance) are sensitive to outliers. For example, since the calculation of the Hausdorff distance is based on the infimum and supremum, a single outlier point may largely affect the final Hausdorff distance. As seen in Table 3, frequently-appearing words such as 'campionato' and 'unico' have the highest Hausdorff distance between C1 and C2; this is probably biased by the fact that the two words appear frequently (see Table 1) and are therefore likely to have more unexpected outliers.

Model fine-tuning. Contextualized word embeddings based on pre-trained language models like BERT achieve much better results than static word embeddings thanks to a two-stage training paradigm, where the two stages are pre-training with a language-model objective (e.g., masked language modeling) and fine-tuning on downstream tasks (e.g., classification). However, in the semantic change detection task, fine-tuning on the downstream task is currently impossible because the annotated labels are insufficient for this aim; to some extent, the lack of a fine-tuning stage may harm the performance of the pre-trained language models.

5.2 Clustering-based Distance Metrics

After clustering, we used the Jensen–Shannon divergence (JSD), which is affected by the issues mentioned in Section 5.1 like the other distance metrics. In addition, the clustering algorithm itself may introduce errors in semantic change detection. First, typical clustering algorithms do not necessarily converge to an identical clustering result when the seed centroids are changed. Moreover, the number of clusters is crucial, and the optimal number of clusters cannot easily be decided before clustering.

Figure 1: Examples (i.e., 'rampante' and 'palmare') of predicted "semantically-shifted" words. Red and blue points denote the dimensionally-reduced vectors of the two time-stamped corpora, respectively.

Figure 2: Examples (i.e., 'cappuccio' and 'campanello') of predicted "semantically-unshifted" words. Red and blue points denote the dimensionally-reduced vectors of the two time-stamped corpora, respectively.

6 Conclusions

This paper formalizes semantic change detection as a distance metric between two variable-sized sets of vectors. The final prediction is based on an ensemble of different distance metrics. The proposed method outperformed the weak frequency and collocation baselines, but it performed less well than state-of-the-art approaches. As future work, this task may be largely improved via supervised training in a unified multilingual framework, so that human-annotated labels from other languages could be exploited, since the number of annotated semantically-shifted words in any single language is currently limited.

Acknowledgments

This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.

A Appendix

Table 3 reports the predictions based on the various distance metrics.

word            | AGD-cosine | AGD-euclidean | AGD-canberra | Hausdorff | JSD-GMM | JSD-Kmeans
----------------|------------|---------------|--------------|-----------|---------|-----------
matematica      | 0.996      | 1.02          | 86.6         | 10.0      | 0.004   | 0.025
dettagliato     | 0.895      | 6.09          | 290.9        | 7.5       | 0.693   | 0.693
sanità          | 0.990      | 1.86          | 130.8        | 10.9      | 0.025   | 0.052
senatore        | 0.997      | 0.79          | 79.1         | 7.7       | 0.009   | 0.002
istruzione      | 0.854      | 6.14          | 333.7        | 14.4      | 0.275   | 0.279
egemonizzare    | 0.988      | 1.62          | 136.6        | 5.6       | 0.003   | 0.033
lucciola        | 0.970      | 2.58          | 187.3        | 8.4       | 0.414   | 0.154
campanello      | 0.990      | 1.13          | 131.7        | 10.8      | 0.003   | 0.003
trasferibile    | 0.873      | 4.25          | 300.7        | 7.2       | 0.059   | 0.073
brama           | 0.830      | 5.80          | 346.2        | 8.3       | 0.420   | 0.406
polisportiva    | 0.921      | 4.42          | 285.7        | 7.5       | 0.293   | 0.291
palmare         | 0.955      | 2.55          | 220.5        | 8.0       | 0.130   | 0.154
processare      | 0.986      | 1.76          | 159.9        | 6.9       | 0.105   | 0.067
pilotato        | 0.970      | 2.27          | 198.9        | 12.1      | 0.108   | 0.128
cappuccio       | 0.973      | 1.78          | 183.6        | 12.2      | 0.015   | 0.016
pacchetto       | 0.984      | 1.67          | 149.6        | 10.5      | 0.011   | 0.009
ape             | 0.953      | 2.09          | 216.7        | 15.3      | 0.033   | 0.031
unico           | 0.985      | 1.89          | 149.9        | 16.2      | 0.035   | 0.032
discriminatorio | 0.987      | 1.56          | 150.5        | 10.2      | 0.007   | 0.007
rampante        | 0.888      | 4.78          | 302.7        | 6.5       | 0.293   | 0.299
campionato      | 0.978      | 2.51          | 183.1        | 16.0      | 0.074   | 0.071
tac             | 0.815      | 5.25          | 366.2        | 9.9       | 0.301   | 0.391
piovra          | 0.976      | 2.27          | 189.6        | 9.7       | 0.033   | 0.033

Table 3: Calculated scores of the various distance metrics (top-ranked scores were highlighted in the original submission).

References

Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. 2020a. DIACR-Ita @ EVALITA2020: Overview of the EVALITA 2020 Diachronic Lexical Semantics (DIACR-Ita) Task. In EVALITA 2020, Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro (Eds.). CEUR.org, Online.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020b. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro (Eds.). CEUR.org, Online.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, and Dominik Schlechtweg. 2019. Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. arXiv preprint arXiv:1906.01688.

Mario Giulianelli. 2019. Lexical semantic change analysis with contextualised word representations. Unpublished master's thesis, University of Amsterdam, Amsterdam.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing Lexical Semantic Change with Contextualised Word Representations. arXiv preprint arXiv:2004.14118.

Maria Grossmann and Franz Rainer. 2013. La formazione delle parole in italiano. Walter de Gruyter.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In ACL. 1489–1501.

Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic Sense Modeling with Deep Contextualized Word Embeddings: An Ecological View. In ACL. 3899–3908.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. ACL 2014, 61.

Andrey Kutuzov and Mario Giulianelli. 2020. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. arXiv preprint arXiv:2005.00050.

Godfrey N. Lance and William T. Williams. 1966. Computer Programs for Hierarchical Polythetic Classification ("Similarity Analyses"). Comput. J. 9(1), 60–64.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. JMLR 9(Nov), 2579–2605.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL. 2227–2237.

Martin Pömsl and Roman Lyapin. 2020. CIRCE at SemEval-2020 Task 1: Ensembling Context-Free and Context-Dependent Word Representations. arXiv preprint arXiv:2005.06602.

R. Tyrrell Rockafellar and Roger J.-B. Wets. 2009. Variational Analysis. Vol. 317. Springer Science & Business Media.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. arXiv preprint arXiv:2007.11464.

K. Vani, Sandra Mitrovic, Alessandro Antonucci, and Fabio Rinaldi. 2020. SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces. arXiv preprint arXiv:2010.00857.

Benyou Wang, Emanuele Di Buccio, and Massimo Melucci. 2019. Representing Words in Vector Space and Beyond. In Quantum-Like Models for Information Retrieval and Decision-Making. Springer, 83–113.