<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Distinctive Collexeme Analysis
and Diachrony. Corpus Linguistics and
Linguistic Theory</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A bottom up approach to category mapping and meaning change</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>ytsvetko @cs.cmu.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>cdyer @cs.cmu.edu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chris Dyer Language Technologies Institute Carnegie Mellon University Pittsburgh</institution>
          ,
          <addr-line>PA 15213</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Copyright © by the paper's authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Eitan Grossman Linguistics Department and the Language, Logic and Cognition Center The Hebrew University of Jerusalem Jerusalem 91904</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Haim Dubossarsky The Edmond and Lily Safra Center for Brain Sciences The Hebrew University of Jerusalem Jerusalem 91904</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Yulia Tsvetkov Language Technologies Institute Carnegie Mellon University Pittsburgh</institution>
          ,
          <addr-line>PA 15213</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <volume>2</volume>
      <issue>2</issue>
      <fpage>66</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>In this article, we use an automated bottom-up approach to identify semantic categories in an entire corpus. We conduct an experiment using a word vector model to represent the meaning of words. The word vectors are then clustered, giving a bottom-up representation of semantic categories. Our main finding is that the likelihood of changes in a word's meaning correlates with its position within its cluster.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Modern theories of semantic categories,
especially those influenced by Cognitive Linguistics
(Geeraerts and Cuyckens, 2007), generally
consider semantic categories to have an internal
structure that is organized around prototypical
exemplars
        <xref ref-type="bibr" rid="ref4">(Geeraerts, 1997; Rosch, 1973)</xref>
        .
      </p>
      <p>Historical linguistics uses this conception of
semantic categories extensively, both to describe
changes in word meanings over the years and to
explain them. Such approaches tend to describe
changes in the meaning of lexical items as
changes in the internal structure of semantic
categories. For example, (Geeraerts, 1999)
hypothesizes that changes in the meaning of a lexical
item are likely to be changes with respect to the
prototypical ‘center’ of the category.
Furthermore, he proposes that more salient (i.e., more
prototypical) meanings will probably be more
resistant to change over time than less salient
(i.e., less prototypical) meanings.</p>
      <p>
        Despite the wealth of data and theories about
changes in the meaning of words, the
conclusions of most historical linguistic studies have
been based on isolated case studies, ranging from
few single words to few dozen words. Only
recently though, have usage-based approaches
        <xref ref-type="bibr" rid="ref1">(Bybee, 2010)</xref>
        become prominent, in part due to
their compatibility with quantitative research on
large-scale corpora
        <xref ref-type="bibr" rid="ref5">(Geeraerts et al., 2011;
Hilpert, 2006; Sagi et al., 2011)</xref>
        . Such
approaches argue that meaning change, like other
linguistic changes, are to a large extent governed by and
reflected in the statistical properties of lexical
items and grammatical constructions in corpora.
      </p>
      <p>In this paper, we follow such usage-based
approaches in adopting Firth’s famous maxim
“You shall know a word by the company it
keeps,” an axiom that is built into nearly all
diachronic corpus linguistics (see Hilpert and Gries,
2014 for a state-of-the-art survey). However, it is
unclear how such ‘semantic fields’ are to be
identified. Usually, linguists’ intuitions are the
primary evidence. In contrast to an
intuitionbased approach, we set out from the idea that
categories can be extracted from a corpus, using
a ‘bottom up’ methodology. We demonstrate this
by automatically categorizing the entire lexicon
of a corpus, using clustering on the output of a
word embedding model.</p>
      <p>We analyze the resulting categories in light of
the predictions proposed in historical linguistics
regarding changes in word meanings, thus
providing a full-scale quantitative analysis of
changes in the meaning of words over an entire
corpus. This approach is distinguished from
previous research by two main characteristics: first,
it provides an exhaustive analysis of an entire
corpus; second, it is fully bottom-up, i.e., the
categories obtained emerge from the data, and are
not in any way based on linguists’ intuitions. As
such, it provides an independent way of
evaluating linguists’ intuitions, and has the potential to
turn up new, unintuitive or even counterintuitive
facts about language usage, and hence, by
hypothesis, about knowledge of language.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Literature review</title>
      <p>Some recent work has examined meaning change
in large corpora using a similar bottom-up
approach and word embedding method (Kim et al.,
2014). These</p>
      <p>works analyzed trajectories of
meaning change for an entire lexicon, which
enabled them to detect if and when each word
changed, and to measure the degree of such
changes. Although these works are highly useful
for our purposes, they do not attempt to explain
why words differ in their trajectories of change
by relating observed changes to linguistic
parameters.</p>
      <p>Wijaya and Yeniterzi (2011) used clustering to
characterize the nature of meaning change. They
were able to measure changes in meaning over
time, and to identify which aspect of meaning
had changed and how (e.g., the classical
semantic changes known as ‘broadening,’ ‘narrowing,’
and ‘bleaching’). Although innovative, only 20
clusters were used. Moreover, clustering was
only used to describe patterns of change, rather
than as a possible explanatory factor.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>A distributed word vector model was used to
learn the context in which the words-of-interest
are embedded. Each of these words is
represented by a vector of fixed length. The model
changes the vectors’ values to maximize the
probability in which, on average, these words could
predict their context. As a result, words that predict
similar contexts would be represented with
similar vectors. This is much like linguistic items in a
classical structuralist paradigm,
whose
interchangeability at a given point or ‘slot’ in the
syntagmatic chain implies they share certain aspects
of function or meaning.</p>
      <p>The vectors’ dimensions are opaque from a
linguistic point of view, as it is still not clear how
to interpret them individually. Only when the full
range of the vectors’ dimensions is taken
together does meaning emerges in the semantic
hyperspace they occupy. The similarity of words is
computed using the cosine distance between two
word vectors, with 0 being identical vectors, and
2 being maximally different:
(1)
1 −</p>
      <p>∑ =1   ×  ′

∑ =1(  )2 ×</p>
      <p>∑ =1( ′ )2</p>
      <p>Where d is the vector’s dimension length, and Wi
and Wi’ represent two specific values at the same
vector point for the first and second words,
respectively.</p>
      <p>Since words with similar meaning have
similar vectors, related words are closer to each other
in the semantic space. This makes them ideal for
clustering, as word clusters represent semantic
‘areas,’ and the position of a word relative to a
cluster centroid represents its saliency with
respect to the semantic concept captured by the
cluster. This saliency is higher for words that are
closer to their cluster centroid. In other words, a
word’s closeness to its cluster centroid is a
measure of its prototypicality. To test for the
optimal size of the ‘semantic areas,’ different
numbers of clusters were tested. For each the
clustering procedure was done independently.</p>
      <p>To quantify diachronic word change, we train
a word vector model on a historical corpus in an
orderly incremental manner. The corpus was
sorted by year, and set to create word vectors for
each year such that the words’ representations at
the end of training of one year are used to
initialize the model of the following year. This allows
a yearly resolution of the word vector
representations, which are in turn the basis for later
analyses. To detect and quantify meaning change for
each</p>
      <p>word-of-interest, the distance between a
word’s vector in two consecutive decades was
computed, serving as the degree of meaning
change a word underwent in that time period
(with 2 being maximal change and 0 no change).</p>
      <p>Having two representational perspectives –
synchronic and diachronic – we test the
hypothesis that words that exhibit stronger cluster
saliency in the synchronic model – i.e., are closer to
the cluster centroid – are less likely to change
over time in the diachronic
model. We thus
measure the correlation between the distance of a
word to its cluster centroid at a specific point in
time and the degree of change the word
underwent over the next decade.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <p>We used the 2nd version of Google Ngram of
fiction English, from which 10 millions 5-grams
were sampled for each year from 1850-2009 to
serve as our corpus. All words were lower cased.</p>
      <p>
        Word2vec
        <xref ref-type="bibr" rid="ref3">(Mikolov et al., 2013)</xref>
        was used as
the distributed word vector model. The model
was initiated to 50 dimensions for the word
vectors’ representations, and the window size for
context set to 4, which is the maximum size
given the constraints of the corpus. Words that
appeared less than 10 times in the entire corpus
were discarded from the model vocabulary.
Training the model was done year by year, and
versions of the model were saved in 10 year
intervals from 1900 to 2000.
      </p>
      <p>The 7000 most frequent words in the corpus
were chosen as words-of-interest, representing
the entire lexicon. For each of these words, the
cosine distance between its two vectors, at a
specific year and 10 years later, was computed using
(1) above to represent the degree of meaning
change. A standard K-means clustering
procedure was conducted on the vector representations
of the words for the beginning of each decade
from 1900 to 2000 and for different number of
clusters from 500 until 5000 in increments of
500. The distances of words from their cluster
centroids were computed for each cluster, using
(1) above. These distances were correlated with
the degree of change the words underwent in the
following ten-year period. The correlation
between the distance of words from random
centroids of different clusters, on the one hand, and
the degree of change, on the other hand, served
as a control condition.
4.1</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Table 1 shows six examples of clusters of words.
The clusters contain words that are semantically
similar, as well as their distances from their
cluster centroids. It is important to stress that a
centroid is a mathematical entity, and is not
necessarily identical to any particular exemplar. We
suggest interpreting a word’s distance from its
cluster’s centroid as the degree of its proximity
to a category’s prototype, or, more generally, as
a measure of prototypicality. Defined in this
way, sword is a more prototypical exemplar than
spear or dagger, and windows, shutters or doors
may be more prototypical exemplars of a cover
of an entrance than blinds or gates. In addition,
the clusters capture near-synonyms, like gallop
and trot, and level-of-category relations, e.g., the
modal predicates allowed, permitted, able. The
very fact that the model captures clusters and
distances of words which are intuitively felt to be
semantically closer to or farther away from a
category prototype is already an indication that the
model is on the right track.</p>
      <p>sword, 0.06
spear, 0.07
dagger, 0.09
allowed, 0.02
permitted, 0.04
able, 0.06
shutters, 0.04 hat, 0.03
windows, 0.05 cap, 0.04
doors, 0.08 napkin, 0.09
curtains, 0.1 spectacles, 0.09
blinds, 0.11 helmet, 0.13
gates, 0.13 cloak, 0.14
gallop, 0.02 handkerchief, 0.14</p>
      <p>trot, 0.02 cane, 0.15
Table 1: Example for clusters of words using 2000
clusters and their distance from their centroids.</p>
      <p>Figure 1 shows the analysis of changes in
word meanings for the years 1950-1960. We
chose this decade at random, but the general
trend observed here obtains over the entire
period (1900-2000). There is a correlation between
the words’ distances from their centroids and the
degree of meaning change they underwent in the
following decade, and this correlation is
observable for different number of clusters (e.g., for
500 clusters, 1000 clusters, and so on). The
positive correlations (r&gt;.3) mean that the more distal
a word is from its cluster’s centroid, the greater
the change its word vectors exhibit the following
decade, and vice versa.</p>
      <p>Crucially, the correlations of the distances
from the centroid outperform the correlations of
the distances from the prototypical exemplar,
which was defined as the exemplar that is the
closest to the centroid. Both the correlations of
the distance from the cluster centroid and of the
distance from the prototypical exemplar were
significantly better than the correlations of the
control condition (all p’s &lt; .001 under
permutations tests).</p>
      <p>In other words, the likelihood of a word
changing its meaning is better correlated with the
distance from an abstract measure than with the
distance from an actual word. For example, the
likelihood of change in the sword-spear-dagger
cluster is better predicted by a word’s closeness
to the centroid, which perhaps could be
conceptualized as a non-lexicalized ‘elongated weapon
with a sharp point,’ than its closeness to an actual
word, e.g., sword. This is a curious finding,
which seems counter-intuitive for nearly all
theories of lexical meaning and meaning change.</p>
      <p>The magnitude of correlations is not fixed or
randomly fluctuating, but rather depends on the
number of clusters used. It peaks for about 3500
clusters, after which it drops sharply. Since a
larger number of clusters necessarily means
smaller ‘semantic areas’ that are shared by fewer
words, this suggests that there is an optimal
range for the size of clusters, which should not
be too small or too large.
4.2</p>
    </sec>
    <sec id="sec-6">
      <title>Theoretical implications</title>
      <p>One of our findings matches what might be
expected, based on Geeraert’s hypothesis,
mentioned in Section 1: a word’s distance from its
cluster’s most prototypical exemplar is quite
informative with respect to how well it fits the
cluster (Fig. 1). This could be taken to
corroborate Roschian prototype-based views. However,
another finding is more surprising, namely, that a
word’s distance from its real centroid, an abstract
average of the members of a category by
definition, is even better than the word’s distance from
the cluster’s most prototypical exemplar.</p>
      <p>
        In fact, our findings are consonant with recent
work in usage-based linguistics on attractors,
‘the state(s) or patterns toward which a system is
drawn’
        <xref ref-type="bibr" rid="ref2">(Bybee and Beckner, 2015)</xref>
        . Importantly,
attractors are ‘mathematical abstractions
(potentially involving many variables in a
multidimensional state space)’. We do not claim that the
centroids of the categories identified in our work
are attractors – although this may be the case –
but rather make the more general point that an
abstract mathematical entity might be relevant
for knowledge of language and for language
change.
      </p>
      <p>In the domain of meaning change, the fact that
words farther from their cluster’s centroid are
more prone to change is in itself an innovative
result, for at least two reasons. First, it shows on
unbiased quantitative grounds that the internal
structure of semantic categories or clusters is a
factor in the relative stability over time of a
word’s meaning. Second, it demonstrates this on
the basis of an entire corpus, rather than an
individual word. Ideas in this vein have been
proposed in the linguistics literature (Geeraerts,
1997), but on the basis of isolated case studies
which were then generalized.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We have shown an automated bottom-up
approach for category formation, which was done
on an entire corpus using the entire lexicon.</p>
      <p>We have used this approach to supply
historical linguistics with a new quantitative tool to
test hypotheses about change in word meanings.
Our main findings are that the likelihood of a
word’s meaning changing over time correlates
with its closeness to its semantic cluster’s most
prototypical exemplar, defined as the word
closest to the cluster’s centroid. Crucially, even
better than the correlation between distance from
the prototypical exemplar and the likelihood of
change is the correlation between the likelihood
of change and the closeness of a word to its
cluster’s actual centroid, which is a mathematical
abstraction. This finding is surprising, but is
comparable to the idea that attractors, which are
also mathematical abstractions, may be relevant
for language change.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We thank Daphna Weinshall (Hebrew University
of Jerusalem) and Stéphane Polis (University of
Liège) for their helpful and insightful comments.
All errors are, of course, our own.</p>
      <p>Dirk
Dirk</p>
      <p>Geeraerts. 1997. Diachronic
semantics. A contribution to
lexicology. Oxford: Clarendon Press.
prototype
historical</p>
      <p>Geeraerts. 1999. Diachronic Prototype
Semantics. A Digest. In: A. Blank and P. Koch
(eds.), Historical semantics and cognition.</p>
      <p>Berlin &amp; New York: Mouton de Gruyter.</p>
      <p>Dirk Geeraerts, and Hubert Cuyckens (eds.). 2007.</p>
      <p>The Oxford handbook of cognitive linguistics.</p>
      <p>Oxford: Oxford University Press.</p>
      <p>Dirk</p>
      <p>Geeraerts, Caroline Gevaerts, and Dirk
Speelman. 2011. How Anger Rose: Hypothesis</p>
      <p>Martin</p>
      <p>Yoon Kim, Yi-I Chiu, Kentaro Haraki, Darshan
Hegde, and Slav Petrov. 2014. Temporal
Analysis of Language through Neural Language
Models. Proceedings of the ACL 2014
Workshop on Language Technologies and
Computational Social Science, 61-65.</p>
      <p>Baltimore, USA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Joan</given-names>
            <surname>Bybee</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Language, usage and cognition</article-title>
          . Cambridge: Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Joan</given-names>
            <surname>Bybee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Clay</given-names>
            <surname>Beckner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Emergence at the cross linguistic level</article-title>
          . In B. MacWhinney and W. O'Grady (eds.),
          <source>The handbook of language emergence</source>
          ,
          <fpage>181</fpage>
          -
          <lpage>200</lpage>
          . Wiley Blackwell.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Wen-tau
          <string-name>
            <surname>Yih</surname>
            , and
            <given-names>Geoffrey</given-names>
          </string-name>
          <string-name>
            <surname>Zweig</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Linguistic Regularities in Continuous Space Word Representations</article-title>
          .
          <source>Proceedings of NAACL-HLT</source>
          <year>2013</year>
          :
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          . Atlanta, Georgia.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Eleanor H.</given-names>
            <surname>Rosch</surname>
          </string-name>
          .
          <year>1973</year>
          .
          <string-name>
            <given-names>Natural</given-names>
            <surname>Categories</surname>
          </string-name>
          .
          <source>Cognitive Psychology</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <fpage>328</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Eyal</given-names>
            <surname>Sagi</surname>
          </string-name>
          , Stefan Kaufmann, and
          <string-name>
            <given-names>Brady</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Tracing semantic change with latent semantic analysis</article-title>
          .
          <source>In K. Allan and J.A. Robinson (eds.)</source>
          ,
          <source>Current methods in historical semantics</source>
          ,
          <fpage>161</fpage>
          -
          <lpage>183</lpage>
          . Berlin &amp; New York: Mouton de Gruyter.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Derry</surname>
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Wijaya</surname>
            and
            <given-names>Reyyan</given-names>
          </string-name>
          <string-name>
            <surname>Yeniterzi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Understanding semantic change of words over centuries</article-title>
          .
          <source>In Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web (DETECT '11</source>
          )
          <fpage>35</fpage>
          -
          <lpage>40</lpage>
          . Glasgow, United Kingdom.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>