     A bottom up approach to category mapping and meaning change
Haim Dubossarsky, The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University of Jerusalem, Jerusalem 91904, Israel. haim.dub@gmail.com
Yulia Tsvetkov, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. ytsvetko@cs.cmu.edu
Chris Dyer, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. cdyer@cs.cmu.edu
Eitan Grossman, Linguistics Department and the Language, Logic and Cognition Center, The Hebrew University of Jerusalem, Jerusalem 91904, Israel. eitan.grossman@mail.huji.ac.il

Abstract

   In this article, we use an automated bottom-up approach to identify semantic categories in an entire corpus. We conduct an experiment using a word vector model to represent the meaning of words. The word vectors are then clustered, giving a bottom-up representation of semantic categories. Our main finding is that the likelihood of changes in a word's meaning correlates with its position within its cluster.

   Copyright © by the paper's authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

1    Introduction

Modern theories of semantic categories, especially those influenced by Cognitive Linguistics (Geeraerts and Cuyckens, 2007), generally consider semantic categories to have an internal structure that is organized around prototypical exemplars (Geeraerts, 1997; Rosch, 1973).
   Historical linguistics uses this conception of semantic categories extensively, both to describe changes in word meanings over the years and to explain them. Such approaches tend to describe changes in the meaning of lexical items as changes in the internal structure of semantic categories. For example, Geeraerts (1999) hypothesizes that changes in the meaning of a lexical item are likely to be changes with respect to the prototypical 'center' of the category. Furthermore, he proposes that more salient (i.e., more prototypical) meanings will probably be more resistant to change over time than less salient (i.e., less prototypical) meanings.
   Despite the wealth of data and theories about changes in the meaning of words, the conclusions of most historical linguistic studies have been based on isolated case studies, ranging from a few single words to a few dozen words. Only recently, though, have usage-based approaches (Bybee, 2010) become prominent, in part due to their compatibility with quantitative research on large-scale corpora (Geeraerts et al., 2011; Hilpert, 2006; Sagi et al., 2011). Such approaches argue that meaning change, like other linguistic changes, is to a large extent governed by and reflected in the statistical properties of lexical items and grammatical constructions in corpora.
   In this paper, we follow such usage-based approaches in adopting Firth's famous maxim "You shall know a word by the company it keeps," an axiom that is built into nearly all diachronic corpus linguistics (see Hilpert and Gries, 2014, for a state-of-the-art survey). However, it is unclear how such 'semantic fields' are to be identified. Usually, linguists' intuitions are the primary evidence. In contrast to an intuition-based approach, we set out from the idea that categories can be extracted from a corpus using a 'bottom-up' methodology. We demonstrate this by automatically categorizing the entire lexicon of a corpus, using clustering on the output of a word embedding model.
   We analyze the resulting categories in light of the predictions proposed in historical linguistics regarding changes in word meanings, thus providing a full-scale quantitative analysis of changes in the meaning of words over an entire corpus. This approach is distinguished from previous research by two main characteristics: first, it provides an exhaustive analysis of an entire corpus; second, it is fully bottom-up, i.e., the categories obtained emerge from the data and are not in any way based on linguists' intuitions. As such, it provides an independent way of evaluating linguists' intuitions, and has the potential to turn up new, unintuitive or even counterintuitive facts about language usage, and hence, by hypothesis, about knowledge of language.
2    Literature review

Some recent work has examined meaning change in large corpora using a similar bottom-up approach and word embedding method (Kim et al., 2014). These works analyzed trajectories of meaning change for an entire lexicon, which enabled them to detect if and when each word changed, and to measure the degree of such changes. Although these works are highly useful for our purposes, they do not attempt to explain why words differ in their trajectories of change by relating observed changes to linguistic parameters.
   Wijaya and Yeniterzi (2011) used clustering to characterize the nature of meaning change. They were able to measure changes in meaning over time, and to identify which aspects of meaning had changed and how (e.g., the classical semantic changes known as 'broadening,' 'narrowing,' and 'bleaching'). Although innovative, only 20 clusters were used. Moreover, clustering was only used to describe patterns of change, rather than as a possible explanatory factor.

3    Method

A distributed word vector model was used to learn the contexts in which the words-of-interest are embedded. Each of these words is represented by a vector of fixed length. The model adjusts the vectors' values to maximize the probability with which, on average, these words predict their contexts. As a result, words that predict similar contexts are represented by similar vectors. This is much like linguistic items in a classical structuralist paradigm, whose interchangeability at a given point or 'slot' in the syntagmatic chain implies that they share certain aspects of function or meaning.
   The vectors' dimensions are opaque from a linguistic point of view, as it is still not clear how to interpret them individually. Only when the full range of the vectors' dimensions is taken together does meaning emerge in the semantic hyperspace they occupy. The similarity of words is computed using the cosine distance between two word vectors, with 0 being identical vectors and 2 being maximally different:

   (1)    1 - \frac{\sum_{i=1}^{d} W_i \times W'_i}{\sqrt{\sum_{i=1}^{d} (W_i)^2} \times \sqrt{\sum_{i=1}^{d} (W'_i)^2}}

where d is the vectors' dimensionality, and W_i and W'_i are the values at position i in the first and second word's vectors, respectively.
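For concreteness, (1) is simply one minus the cosine similarity of the two vectors. A minimal sketch in Python with numpy (our illustration, not code from the original study):

    import numpy as np

    def cosine_distance(w, w_prime):
        # Equation (1): one minus the cosine similarity of the two word vectors.
        # Returns 0 for identical vectors and 2 for maximally different (opposite) ones.
        dot = np.dot(w, w_prime)
        norms = np.linalg.norm(w) * np.linalg.norm(w_prime)
        return 1.0 - dot / norms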
   Since words with similar meanings have similar vectors, related words are closer to each other in the semantic space. This makes them ideal for clustering, as word clusters represent semantic 'areas,' and the position of a word relative to a cluster centroid represents its saliency with respect to the semantic concept captured by the cluster. This saliency is higher for words that are closer to their cluster centroid. In other words, a word's closeness to its cluster centroid is a measure of its prototypicality. To test for the optimal size of the 'semantic areas,' different numbers of clusters were tested; for each, the clustering procedure was run independently.
   To quantify diachronic word change, we train a word vector model on a historical corpus in an orderly, incremental manner. The corpus was sorted by year, and word vectors were created for each year such that the words' representations at the end of training on one year are used to initialize the model for the following year. This allows a yearly resolution of the word vector representations, which are in turn the basis for the later analyses. To detect and quantify meaning change for each word-of-interest, the distance between a word's vectors in two consecutive decades was computed, serving as the degree of meaning change the word underwent in that time period (with 2 being maximal change and 0 being no change).
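As an illustration of this change measure (our sketch, not the authors' code), assume two dictionaries mapping each word-of-interest to its vector in two consecutive decade snapshots; the degree of change is then (1) applied word by word:

    from scipy.spatial.distance import cosine  # cosine distance, i.e. 1 minus cosine similarity

    def meaning_change(vectors_t1, vectors_t2, words):
        # vectors_t1, vectors_t2: hypothetical dicts mapping each word-of-interest to its
        # vector in two consecutive decade snapshots (e.g., 1950 and 1960).
        # Returns the degree of meaning change per word: 0 = no change, 2 = maximal change.
        return {w: cosine(vectors_t1[w], vectors_t2[w]) for w in words}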
   Having two representational perspectives – synchronic and diachronic – we test the hypothesis that words that exhibit stronger cluster saliency in the synchronic model – i.e., are closer to the cluster centroid – are less likely to change over time in the diachronic model. We thus measure the correlation between the distance of a word from its cluster centroid at a specific point in time and the degree of change the word underwent over the next decade.

4    Experiment

We used the 2nd version of the Google Ngram corpus of English fiction, from which 10 million 5-grams were sampled for each year from 1850 to 2009 to serve as our corpus. All words were lowercased.
   Word2vec (Mikolov et al., 2013) was used as the distributed word vector model. The model was initialized with 50 dimensions for the word vectors' representations, and the context window size was set to 4, which is the maximum size given the constraints of the corpus. Words that appeared fewer than 10 times in the entire corpus were discarded from the model vocabulary. Training the model was done year by year, and versions of the model were saved at 10-year intervals from 1900 to 2000.
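A sketch of this incremental training regime, shown purely for illustration with the gensim implementation of word2vec (the paper does not say which implementation was used, and sentences_for_year is a hypothetical helper that yields the sampled, lowercased 5-grams for a given year):

    from gensim.models import Word2Vec

    # Settings taken from the description above: 50 dimensions, context window of 4,
    # words occurring fewer than 10 times discarded.
    model = Word2Vec(vector_size=50, window=4, min_count=10)

    first_year = True
    for year in range(1850, 2010):
        sentences = list(sentences_for_year(year))   # hypothetical corpus reader
        model.build_vocab(sentences, update=not first_year)
        model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
        first_year = False
        if 1900 <= year <= 2000 and year % 10 == 0:
            model.save("w2v_%d.model" % year)        # decade snapshots used in later analyses

The vectors from each saved snapshot then provide the yearly and decade-level representations described in Section 3.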
   The 7000 most frequent words in the corpus were chosen as words-of-interest, representing the entire lexicon. For each of these words, the cosine distance between its two vectors, at a specific year and 10 years later, was computed using (1) above to represent the degree of meaning change. A standard K-means clustering procedure was conducted on the vector representations of the words at the beginning of each decade from 1900 to 2000, for different numbers of clusters, from 500 to 5000 in increments of 500. The distances of words from their cluster centroids were computed for each cluster, using (1) above. These distances were correlated with the degree of change the words underwent in the following ten-year period. The correlation between the distance of words from random centroids of different clusters, on the one hand, and the degree of change, on the other hand, served as a control condition.
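The clustering and correlation analysis could be sketched along the following lines. This is our own illustration using scikit-learn and scipy rather than whatever implementation the authors used, and it assumes a Pearson correlation (the paper reports r values without naming the correlation type); the random-centroid control follows the description above:

    import numpy as np
    from scipy.spatial.distance import cosine
    from scipy.stats import pearsonr
    from sklearn.cluster import KMeans

    def centroid_distance_vs_change(vectors_start, change, words, n_clusters):
        # vectors_start: dict word -> vector at the start of a decade (e.g., 1950).
        # change: dict word -> degree of meaning change over the following decade, from (1).
        # words: the 7000 words-of-interest.
        X = np.array([vectors_start[w] for w in words])
        km = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
        # Distance of each word from its own cluster centroid, computed with (1).
        d_centroid = [cosine(X[i], km.cluster_centers_[km.labels_[i]]) for i in range(len(words))]
        # Control condition: distance from the centroid of a randomly chosen cluster.
        rng = np.random.default_rng(0)
        random_ids = rng.integers(n_clusters, size=len(words))
        d_random = [cosine(X[i], km.cluster_centers_[random_ids[i]]) for i in range(len(words))]
        changes = [change[w] for w in words]
        return pearsonr(d_centroid, changes), pearsonr(d_random, changes)

Run once per decade and per cluster count (500, 1000, ..., 5000), this mirrors the analysis whose results are reported in the next section.
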
4.1   Results

Table 1 shows six examples of clusters of words. The clusters contain semantically similar words, shown together with their distances from their cluster centroids. It is important to stress that a centroid is a mathematical entity, and is not necessarily identical to any particular exemplar. We suggest interpreting a word's distance from its cluster's centroid as the degree of its proximity to a category's prototype, or, more generally, as a measure of prototypicality. Defined in this way, sword is a more prototypical exemplar than spear or dagger, and windows, shutters or doors may be more prototypical exemplars of a cover of an entrance than blinds or gates. In addition, the clusters capture near-synonyms, like gallop and trot, and level-of-category relations, e.g., the modal predicates allowed, permitted, able. The very fact that the model captures clusters and distances of words which are intuitively felt to be semantically closer to or farther away from a category prototype is already an indication that the model is on the right track.

      sword, 0.06        allowed, 0.02
      spear, 0.07        permitted, 0.04
      dagger, 0.09       able, 0.06

      shutters, 0.04     hat, 0.03
      windows, 0.05      cap, 0.04
      doors, 0.08        napkin, 0.09
      curtains, 0.1      spectacles, 0.09
      blinds, 0.11       helmet, 0.13
      gates, 0.13        cloak, 0.14
                         handkerchief, 0.14
      gallop, 0.02
      trot, 0.02         cane, 0.15

   Table 1: Examples of word clusters (using 2000 clusters) and the words' distances from their centroids.

   Figure 1 shows the analysis of changes in word meanings for the years 1950-1960. We chose this decade at random, but the general trend observed here obtains over the entire period (1900-2000). There is a correlation between the words' distances from their centroids and the degree of meaning change they underwent in the following decade, and this correlation is observable for different numbers of clusters (e.g., for 500 clusters, 1000 clusters, and so on). The positive correlations (r > .3) mean that the more distal a word is from its cluster's centroid, the greater the change its word vectors exhibit in the following decade, and vice versa.

   [Figure 1. Change in the meanings of words correlated with distance from centroid, for different numbers of clusters, for the years 1950-1960.]

   Crucially, the correlations of the distances from the centroid outperform the correlations of the distances from the prototypical exemplar, which was defined as the exemplar closest to the centroid. Both the correlations of the distance from the cluster centroid and of the distance from the prototypical exemplar were significantly better than the correlations of the control condition (all p's < .001 under permutation tests).
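The paper does not spell out its permutation procedure. Purely as an illustration, one paired scheme for asking whether the centroid-distance correlation reliably beats the control correlation is to randomly swap, for each word, which of the two distances it contributes, and to recompute the difference in correlations on each permutation:

    import numpy as np
    from scipy.stats import pearsonr

    def permutation_test(dist_a, dist_b, change, n_perm=10000, seed=0):
        # dist_a: e.g., each word's distance from its cluster centroid.
        # dist_b: e.g., each word's distance from a random centroid (control).
        # change: each word's degree of meaning change over the following decade.
        # Note: this is only one possible scheme, not necessarily the one used in the paper.
        dist_a, dist_b, change = map(np.asarray, (dist_a, dist_b, change))
        observed = pearsonr(dist_a, change)[0] - pearsonr(dist_b, change)[0]
        rng = np.random.default_rng(seed)
        count = 0
        for _ in range(n_perm):
            swap = rng.random(len(change)) < 0.5   # per word, swap the two predictors at random
            a = np.where(swap, dist_b, dist_a)
            b = np.where(swap, dist_a, dist_b)
            if pearsonr(a, change)[0] - pearsonr(b, change)[0] >= observed:
                count += 1
        return count / n_perm                      # one-sided p-value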
   In other words, the likelihood of a word changing its meaning is better correlated with its distance from an abstract measure than with its distance from an actual word. For example, the likelihood of change in the sword-spear-dagger cluster is better predicted by a word's closeness to the centroid, which perhaps could be conceptualized as a non-lexicalized 'elongated weapon with a sharp point,' than by its closeness to an actual word, e.g., sword. This is a curious finding, which seems counter-intuitive for nearly all theories of lexical meaning and meaning change.
   The magnitude of the correlations is not fixed or randomly fluctuating, but rather depends on the number of clusters used. It peaks at about 3500 clusters, after which it drops sharply. Since a larger number of clusters necessarily means smaller 'semantic areas' that are shared by fewer words, this suggests that there is an optimal range for the size of clusters, which should be neither too small nor too large.

4.2   Theoretical implications

One of our findings matches what might be expected on the basis of Geeraerts's hypothesis, mentioned in Section 1: a word's distance from its cluster's most prototypical exemplar is quite informative with respect to how well it fits the cluster (Fig. 1). This could be taken to corroborate Roschian prototype-based views. However, another finding is more surprising, namely, that a word's distance from its real centroid, by definition an abstract average of the members of a category, is an even better predictor than the word's distance from the cluster's most prototypical exemplar.
   In fact, our findings are consonant with recent work in usage-based linguistics on attractors, 'the state(s) or patterns toward which a system is drawn' (Bybee and Beckner, 2015). Importantly, attractors are 'mathematical abstractions (potentially involving many variables in a multidimensional state space)'. We do not claim that the centroids of the categories identified in our work are attractors – although this may be the case – but rather make the more general point that an abstract mathematical entity might be relevant for knowledge of language and for language change.
   In the domain of meaning change, the fact that words farther from their cluster's centroid are more prone to change is in itself an innovative result, for at least two reasons. First, it shows on unbiased quantitative grounds that the internal structure of semantic categories or clusters is a factor in the relative stability of a word's meaning over time. Second, it demonstrates this on the basis of an entire corpus, rather than of individual words. Ideas in this vein have been proposed in the linguistics literature (Geeraerts, 1997), but on the basis of isolated case studies which were then generalized.

5    Conclusion

We have presented an automated bottom-up approach to category formation, applied to an entire corpus using the entire lexicon.
   We have used this approach to supply historical linguistics with a new quantitative tool for testing hypotheses about change in word meanings. Our main findings are that the likelihood of a word's meaning changing over time correlates with its closeness to its semantic cluster's most prototypical exemplar, defined as the word closest to the cluster's centroid. Crucially, even better than the correlation between distance from the prototypical exemplar and the likelihood of change is the correlation between the likelihood of change and the closeness of a word to its cluster's actual centroid, which is a mathematical abstraction. This finding is surprising, but it is comparable to the idea that attractors, which are also mathematical abstractions, may be relevant for language change.

Acknowledgements

We thank Daphna Weinshall (Hebrew University of Jerusalem) and Stéphane Polis (University of Liège) for their helpful and insightful comments. All errors are, of course, our own.

References

Joan Bybee. 2010. Language, usage and cognition. Cambridge: Cambridge University Press.

Joan Bybee and Clay Beckner. 2015. Emergence at the cross-linguistic level. In B. MacWhinney and W. O'Grady (eds.), The handbook of language emergence, 181-200. Wiley Blackwell.

Dirk Geeraerts. 1997. Diachronic prototype semantics. A contribution to historical lexicology. Oxford: Clarendon Press.

Dirk Geeraerts. 1999. Diachronic Prototype Semantics. A Digest. In A. Blank and P. Koch (eds.), Historical semantics and cognition. Berlin & New York: Mouton de Gruyter.

Dirk Geeraerts and Hubert Cuyckens (eds.). 2007. The Oxford handbook of cognitive linguistics. Oxford: Oxford University Press.

Dirk Geeraerts, Caroline Gevaerts, and Dirk Speelman. 2011. How Anger Rose: Hypothesis Testing in Diachronic Semantics. In J. Robinson and K. Allan (eds.), Current methods in historical semantics, 109-132. Berlin & New York: Mouton de Gruyter.

Martin Hilpert. 2006. Distinctive Collexeme Analysis
     and Diachrony. Corpus Linguistics and
     Linguistic Theory, 2 (2): 243–256.

Martin Hilpert and Stefan Th. Gries. 2014.
     Quantitative Approaches to Diachronic Corpus
     Linguistics. In M. Kytö and P. Pahta (eds.), The
     Cambridge Handbook of English Historical
     Linguistics. Cambridge: Cambridge University Press.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan
    Hegde, and Slav Petrov. 2014. Temporal
    Analysis of Language through Neural Language
    Models. Proceedings of the ACL 2014
    Workshop on Language Technologies and
    Computational     Social   Science,    61-65.
    Baltimore, USA.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
    2013. Linguistic Regularities in Continuous
    Space Word Representations. Proceedings of
    NAACL-HLT 2013: 746–751. Atlanta, Georgia.

Eleanor H. Rosch. 1973. Natural Categories.
     Cognitive Psychology 4 (3): 328–350.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2011.
     Tracing semantic change with latent semantic
     analysis. In K. Allan and J.A. Robinson (eds.),
     Current methods in historical semantics, 161-
     183. Berlin & New York: Mouton de Gruyter.

Derry T. Wijaya and Reyyan Yeniterzi. 2011.
     Understanding semantic change of words over
     centuries. In Proceedings of the 2011
     international workshop on DETecting and
     Exploiting Cultural diversiTy on the social web
     (DETECT ’11) 35-40. Glasgow, United
     Kingdom.



