1st Workshop on AI + Informetrics - AII2021


                   Study concept drift in 150-year English
                                 literature

                               Ruiyuan Li, Pin Tian, and Shenghui Wang

               University of Twente, Drienerlolaan 5, 7522 NB Enschede, The Netherlands
                 {r.li-3,p.tian}@student.utwente.nl, shenghui.wang@utwente.nl


                 Abstract. The meaning of a concept or a word changes over time. Such
                 concept drift reflects the change of the social consensus as well. Study-
                 ing concept drift over time is valuable for researchers who are interested
                 in language or culture evolution. Recent word embedding technologies
                 inspire us to automatically detect concept drift in large-scale corpora.
                 However, comparing embeddings generated from different corpora is a
                 complex task. In this paper, we propose to use a simple approach for de-
                 tecting concept drift based on the change in word contexts from different
                 time periods and apply it to subsequent time periods so that the detailed
                 drift could be detected and visualised. We dive into certain words to track
                 how the meaning of a word changes gradually over a long time span with
                 relevant historical events which demonstrates the effect of our method.

                 Keywords: concept drifting · word embedding · historical event · word
                 context


         1     Introduction
         Concept drift or diachronic semantic shift studies how the meaning of a concept
         or a word changes over time [15, 12]. Concept drift reflects the change of the so-
         cial consensus. For example, the word gay was originally used to mean carefree,
         cheerful, or bright [7]. However, from the 1960s, the word gay started to describe
         homosexual men [3]. Studying concept drift over time is valuable for researchers
         who are interested in language or culture evolution. For people who want to
         identify societal changes in literature, who research on historical texts, such as
         librarians, historians or linguists, it is desirable if they can discover potential con-
         cept drift in large-scale textual content before conducting in-depth investigation.
         Automatically identifying concept drift over time can improve their efficiency.
             Recent word embedding technologies inspire us to automatically detect con-
         cept drift in large-scale corpora [9, 6, 5] . However, comparing embeddings gen-
         erated from different corpora is a complex task [6, 5]. How to visually inspect
         concept drift is also a challenge [16].
             In this paper, we propose to use a simple approach to quantify the concept
         drift based on their contexts generated from two time periods and apply it to
         subsequent time periods to study concept drift over a long period of time. We
         study more than 50 thousand English books published between 1800 and 1950.


Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
2      Ruiyuan Li, Pin Tian, and Shenghui Wang

We first divide the whole period into subsequent 20-year time spans. Based on
word embeddings corresponding to each individual time span, we calculate the
context of each word at that particular period. Secondly, we measure the concept
drift by comparing the contexts of the same word from different periods. Looking
how the context changes from the starting period to the ending period, we can
easily identify the most dynamic words over the 150 years. For these words which
potentially underwent drastic change in their meaning, we further measure how
their contexts change in subsequent time periods and visualise the change over
time. Some interesting associations to historical events are also discovered.


2   Related Work

Diachronic semantic shifts has gained much attention because of the availability
of large corpora and recent success in computing distributional semantics [4] in
natural languages [12]. When computing distributional semantics, words are rep-
resented by sparse or, more recently, dense vectors based on their co-occurrences
in a corpus. In other words, words are embedded in a semantic space and, more
importantly, similar words are embedded nearby each other in this space.
    To study semantic shift, it is therefore natural to first construct word em-
beddings in separated time periods before comparing these embeddings across
time. However, the similarity between the word embeddings generated from sep-
arate time-specific corpora cannot be computed directly because the stochastic
embedding algorithms could only roughly guarantee the stability of the pairwise
similarities between words but the numerical embeddings are often rotated after
each training, i.e. invariant under rotation. An earlier proposal was to incre-
mentally update diachronic embedding models [10], where the word embeddings
trained on the previous time period are used to initialise the training for the
current period. Later researchers proposed to align these spaces by unifying the
coordinates via local word alignments [11] or by projecting one space to another
via orthogonial procrustes [6].
    Another approach is to study the neighbours or the context of a word at
two different time periods to measure semantic shift. Azarbonyad et. al. in [1]
used the neighbors of a word to determine its stability. Their best model uses
the traditional alignment-based method weighting the neighbors’ rank and their
stability, requiring computation on whole vocabulary. Later, Gonen et al. [5]
simply took the top-k neighbors in each of the two corpora and measure the
overlap of these two lists. A smaller overlap suggests more drastic change. They
applied this method to corpora separated based on different criteria, such as age,
gender, profession, time of tweet. However, the concept stability of neighboring
words is unsure for a long period, for example over centuries. In this paper, we
apply this method to study English literature that spans over 150 years.
    More recently, researchers also proposed dynamic word embeddings models to
jointly learn word embeddings across different times periods [2, 17]. By enforcing
the alignments simultaneously, there is no need to train separate time-specific
                          Study concept drift in 150-year English literature    3

embeddings, i.e., the resulting word embeddings are time-aware already. We will
explore this aproach in the future.


3     Method

Here, we first describe the method which measures the changes in word meaning
based on its contexts at two different time periods. Secondly, we divide the whole
corpus into subsets corresponding to consequent time periods and study how the
meaning of a word change over time.


3.1   Measuring drift based on context

Inspired by the method proposed in [5], given two periods in time, we collect sep-
arate corpora corresponding to each period. As shown in Figure 1, we generate
embeddings for each word in the separate corpus. For each word, we calculate its
context as the top K most similar words. For the same word that occurs in both
corpora, we measure the similarity of its contexts at two different time periods.
This similarity reflects the changes in word meaning: the more similar the two
contexts are, the less change in meaning the word has.


                   Fig. 1. workflow for measuring concept drift


Word embedding First, we need to generate embeddings for each word that
occurs in each corpus. All words in the corpus are embedded as high-dimensional
vectors, and semantically similar words embedded near to each other in a se-
mantic space. The purpose of this step is to use numeric vectors to represent
the meaning of words, so that the similarity between words could be computed
easily.
4        Ruiyuan Li, Pin Tian, and Shenghui Wang

Word context After the embedding spaces, two semantic spaces are generated
from each corpus. These two semantic spaces can be aligned to use the same
coordinate axes before we could compares words in these two spaces, as proposed
in [6]. Here, we adopt the method proposed in [5] to use the closest neighbours of
a word, i.e. its context, to reflect the extensional meaning of the word, therefore,
the change in context at two different times reflects the drift in the meaning of
the word. For each word, we select the top K most similar words as its context.


Drift based on context similarity For a word that occurs in both time
periods, how much its context changes from one period to the other reflects the
change in its meaning. Since the context is defined as the set of top K most similar
words, we use Jaccard similarity coefficient [8] to measure the similarity between
two contexts of the same word but from two different time periods. Jaccard
similarity coefficient is a statistic used for gauging the similarity and diversity
of sample sets. Let A and B are two sets, the Jaccard similarity coefficient is
calculated as follows,

                             |A∩B |       |A∩B |
                 J(A, B) =          =                                           (1)
                             |A∪B |   |A|+|B |−|A∩B |

A high Jaccard similarity coefficient suggests that the context of the word has
not changed much, while a low value suggests that the meaning of the word
might have changed over time.


3.2    Analysing concept drift over time

Given a corpus which spans a long period of time, it is then possible to study
how individual words change over time. We divide the whole corpus into multi-
ple subsets corresponding to subsequent time periods. We measure the drift of
words from the beginning period to the last. This way, we could detect the most
dynamic and static words over the long period of time. It is also possible to look
more carefully when a word has undergone a critical moment when its meaning
changed dramatically.


4     Data set

We download the full text content of 51,625 English books from Project Guten-
berg.1 Unfortunately, the exact years of publication for these books are missing.
However, the birth and death years of the author are available in the data. We
therefore took the average of the birth and death year of the author as the
approximate year of publication for the book.
   After grouping books by their year of publication, we find that, although
the earliest books were written before 500 BC, the number of books published
1
    https://www.gutenberg.org/
                           Study concept drift in 150-year English literature   5

earlier than the 18th century is far less than that from the 18th century to the
mid 20th century. Because of the copyright restriction, books in recent years are
also limited. In this study we focus on the books published from 1800 to 1950.
We further divide the corpus into consequent groups of 20-year time spans, as
hown in Table 1.


                              Group Time span Book
                               1800 1790–1810 1230
                               1830 1820–1840 2214
                               1850 1840–1860 3712
                               1870 1860–1880 6595
                               1890 1880–1900 8661
                               1910 1900–1920 7721
                               1930 1920–1940 1734
                               1950 1940–1960 1926

                    Table 1. Number of Books in each period


5     Experiment and results

5.1    Word embedding and context computation

For each time period show in Table 1, we trained a word2vec model using the
gensim library2 using the full text content of the books published within that
period of time. We chose the continuous Bag-of-Words(CBOW) model [13], set
the vector size as 100, the minimum count 10 (ignoring words that occur less than
10 times), the window size 20 (taking 20 words behind and 20 words ahead as
the training context) and took the rest parameters as their default values. After
embedding, each word at each time period is represented as a 100-dimensional
vector. For each word at a particular time period, we calculated the top 20 most
similar words as its context in that period.


5.2    Measuring drift

After the context of a word is calculated for each time period, we can now
measure how much this word has drifted from one time period to another.

Sensibility of parameter K The model is generally stable. We examine how k
parameter, which defines the length of similar words lists, affect the stability.
We change k as [20,30,60,100], and the curves highly coincide with each other
in the whole plot for Fig.2, so in most part of these plot, we regard the same
2
    https://radimrehurek.com/gensim/models/word2vec.html
6      Ruiyuan Li, Pin Tian, and Shenghui Wang


       Fig. 2. Jaccard similarity in different K (length of similar words lists)


distribution as proof of the stability of our model. In the following experiments,
we take k = 20.
    We calculate the Jaccard similarity between the context at 1800 and that
at 1950. The distribution of the Jaccard similarity is shown in Fig. 3. As the
distribution shows, very few words have stable context throughout these 150
years. Words such as sister, daughter, mother, wife and husband are rather stable
word, while other words such as witch, foster, hive and potion have changed
dramatically.
    Many words have completely changed their contexts. However, these words
are mostly infrequent words, such as chestnut, hive, vantage, and coffin. Their
embeddings and consequent contexts are over sensitive to the corpus. We could
not make solid conclusions in terms of the drift of their meaning.
    This still helps us to identify interesting cases of concept drift among the
words that have a low context similarity. Once identified, we can dive into the
more granule time periods and inspect the drift more closely. For example, Fig. 4
shows the drift of word peace over time. The Jaccard similarities between the
current context and that of the previous period are plotted. The sharp decrease
from 1890 to 1930 suggests that there was a drastic change of the meaning of
the word peace in that period.


5.3   Visualising individual drift

The change in Jaccard similarity as shown in Fig. 4 only provides a signal of
drift, but not the content of drift. To dive deeper into what exactly happened for
specific words in specific periods, we present a visualization method that helps
to see how words have changed.
                            Study concept drift in 150-year English literature      7


Fig. 3. The distribution of Jaccard Similarity between starting(1800) and ending(1950)
period


Fig. 4. Jaccard Similarity of ’peace’ over time (comparing to the previous neighbor
time group)
8       Ruiyuan Li, Pin Tian, and Shenghui Wang


                Fig. 5. The drift for bishop between 1800s and 1950s


    Our visualization is inspired by the work of Wijaya and Yeniterzi[16]. In
their visualization, each word is a node and there is an edge between two words
if they co-occur in the same context. The width of the edges is the frequency of
co-occurrence. As shown in Fig. 5, our visualization consists of two clusters. One
is the target word with its top 20 context words at the first time span. The other
is the target word with its top 20 context words at the second time span. The
line width shows the cosine similarity between the target word and the context
word.
    In Fig. 5, the word bishop at 1910 is mostly associated with religious words,
such as prelate, church, cathedral, and dioce while at 1950, it is more related to
the chess game, as its context includes words like checkmate, pawn, and knight.
It is worth mentioning that ktxb, bxkt, and rxkt are notation words in chess but
not meaningless words.
    Compared to the popular t-SNE visualization [14] used in [5], our visual-
ization is more comprehensible and intuitive to compare the intersection and
the unique section. It can also be used for tracking how the concept of a word
changed gradually over a long time span. However, this visualization often be-
come unreadable because of the complexity when the number of the context
words increases.
    Fig.6 shows the sequential change of the word peace from 1800 to 1950. Fig.6
(b) shows that war is in the same context with peace at 1800 and 1910. It suggests
that peace and war were often mentioned together .
    However, the link between war and peace disappeared in 1950. New relation-
ships emerged, such as spirit, humanity, tyranny, and poverty. Link to the statistics
of war. There were high-intensity conflicts around the 1800s (Napoleonic Wars,
etc), 1860s (American Civil War, etc.), and 1910s (World War I). There were
few wars after World War II (1945). It makes sense that people transferred their
          Study concept drift in 150-year English literature   9


      (a) The word peace at 1800 and 1850


      (b) The word peace at 1800 and 1910


      (c)The word peace at 1800 and 1950

Fig. 6. Concept Drift of the word peace over time
10      Ruiyuan Li, Pin Tian, and Shenghui Wang

concerns about peace into other topics like spirit, humanity, tyranny, and poverty
in the 1950s.
     Because of the copyright restriction, Gutenberg data set only has few books
after 1970. It limits us to apply our method to recently published books. A
potential future work for this work is to apply this approach on possible con-
temporary book corpus. There would be more interesting findings closer to our
life.


6    conclusion
Concept drift reflects the change of the social consensus. Detecting word usage in
different periods is an important research method. We propose a computational
approach to discover the drastic concept drifts by their context in the historical
English books over centuries. It quantifies the extend of concept drift and makes
the rank of drastic change possible. We also present a new way to compare the
concept of a word in different periods. We show that our visualization is simple
and intuitive. It also has the unique advantage of demonstrating the gradual
change of concept overtime.


References
 1. Hosein Azarbonyad, Mostafa Dehghani, Kaspar Beelen, Alexandra Arkut, Maarten
    Marx, and Jaap Kamps. Words are malleable: Computing semantic shifts in po-
    litical and media discourse. In Proceedings of the 2017 ACM on Conference on
    Information and Knowledge Management, CIKM ’17, page 1509–1518, New York,
    NY, USA, 2017. Association for Computing Machinery.
 2. Robert Bamler and Stephan Mandt. Dynamic word embeddings. In Doina Precup
    and Yee Whye Teh, editors, Proceedings of the 34th International Conference on
    Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages
    380–389, International Convention Centre, Sydney, Australia, 06–11 Aug 2017.
    PMLR.
 3. Oxford English Dictionary and WALTON Street. Oxford english dictionary. Re-
    trieved February, 4:2019, 2019.
 4. John Firth. A synopsis of linguistic theory. 1930-1955. Blackwell, 1957.
 5. Hila Gonen, Ganesh Jawahar, Djamé Seddah, and Yoav Goldberg. Simple, inter-
    pretable and stable method for detecting words with usage change across corpora.
    In Proceedings of the 58th Annual Meeting of the Association for Computational
    Linguistics, pages 538–555, 2020.
 6. William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embed-
    dings reveal statistical laws of semantic change. In Proceedings of the 54th Annual
    Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
    pages 1489–1501, Berlin, Germany, August 2016. Association for Computational
    Linguistics.
 7. Archie Hobson. The Oxford dictionary of difficult words. Oxford University Press,
    USA, 2004.
 8. Paul Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist,
    11(2):37–50, 1912.
                            Study concept drift in 150-year English literature       11

 9. Adam Jatowt and Kevin Duh. A framework for analyzing semantic change of
    words across time. In IEEE/ACM Joint Conference on Digital Libraries, pages
    229–238. IEEE, 2014.
10. Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Tem-
    poral analysis of language through neural language models. In Proceedings of the
    ACL 2014 Workshop on Language Technologies and Computational Social Science,
    pages 61–65, Baltimore, MD, USA, June 2014. Association for Computational Lin-
    guistics.
11. Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically
    significant detection of linguistic change. In Proceedings of the 24th International
    Conference on World Wide Web, WWW ’15, page 625–635, Republic and Can-
    ton of Geneva, CHE, 2015. International World Wide Web Conferences Steering
    Committee.
12. Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. Diachronic
    word embeddings and semantic shifts: a survey. In Proceedings of the 27th Interna-
    tional Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New
    Mexico, USA, August 2018. Association for Computational Linguistics.
13. Tomas Mikolov, Kai Chen, G. S. Corrado, and J. Dean. Efficient estimation of
    word representations in vector space. In ICLR, 2013.
14. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Jour-
    nal of machine learning research, 9(11), 2008.
15. Shenghui Wang, Stefan Schlobach, and Michel Klein. Concept drift and how to
    identify it. Journal of Web Semantics, 9(3):247–265, 2011. Semantic Web Dynam-
    ics Semantic Web Challenge, 2010.
16. Derry Tanti Wijaya and Reyyan Yeniterzi. Understanding semantic change of
    words over centuries. In Proceedings of the 2011 international workshop on DE-
    Tecting and Exploiting Cultural diversiTy on the social web, pages 35–40, 2011.
17. Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic
    word embeddings for evolving semantic discovery. In Proceedings of the Eleventh
    ACM International Conference on Web Search and Data Mining, WSDM ’18, page
    673–681, New York, NY, USA, 2018. Association for Computing Machinery.