=Paper=
{{Paper
|id=Vol-1911/9
|storemode=property
|title=Temporal Semantic Analysis and Visualization of Words
|pdfUrl=https://ceur-ws.org/Vol-1911/9.pdf
|volume=Vol-1911
|authors=Zaikun Xu,Fabio Crestani
|dblpUrl=https://dblp.org/rec/conf/iir/XuC17
}}
==Temporal Semantic Analysis and Visualization of Words==
Zaikun Xu and Fabio Crestani, Faculty of Informatics, Universitá della Svizzera Italiana (USI), Lugano, Switzerland. {zaikun.xu,fabio.crestani}@usi.ch

Abstract. Many languages are spoken in the world today, among which English is one of the most widespread. However, English words have evolved considerably throughout history, to the point that it is difficult for contemporary readers to read old English texts. There are many kinds of change, such as the mutation of a word itself or the migration of word usage from one context to another. It is therefore very interesting to understand the temporal evolution of word semantics across a long span of time. In this paper we look at two datasets, the New York Times and the National Geographic, to study the temporal evolution of words. For this purpose, a model that can embed words into vectors is needed. Word2Vec is such a neural network model: it learns a vector representation for each word so that similar words are also close in the vector space, where similar means that the words tend to co-occur in the same contexts. To obtain a temporal representation, a temporal Word2Vec model is trained sequentially, with one individual Word2Vec model trained on the data of each time period. The temporal Word2Vec model allows us to explore different visualisation techniques for word semantic evolution; temporal Word Clouds, Heatmaps and t-distributed stochastic neighbour embedding (t-SNE) are some of the techniques that make such visualisation possible.

===1 Introduction===

Language understanding is a research issue that has been investigated for centuries. We are specifically interested in the dynamic nature of a language, and especially in the temporal semantic analysis of English words. A word might change its meaning over time; it might even disappear and be substituted by a new word. For example, 'car' is a new word that describes a new means of transportation in the 20th century, and the notion of a car has kept evolving since it was introduced. With the advancement of automated driving technology and the integration of smart embedded systems, the concept of a car in the 21st century will be fundamentally different from what it was in the past. In this sense, the evolution of word semantics is a reflection of the evolution of human history. It is thus very interesting to explore the dynamics of word evolution for the assessment of the dynamics of words, people and events. Moreover, for researchers to fully understand a language, its dynamically evolving nature should be considered. To address such questions, temporal semantic analysis and visualisation is an essential way to uncover the mysteries of word semantics.

There are many challenges related to the temporal semantic analysis of large amounts of data. First, one needs to collect a dataset that spans a period of time long enough for word semantics to evolve, and the collected data needs to be preprocessed before being fed into a model. Second, words need to be represented as numbers so that computers can understand and process them. Lastly, how to visualise the temporal semantics of a word is an open question too.

===2 Related Work===

Various models have been developed for language understanding. Bag of words (BoW) [3] is a naïve model that represents a document or a sentence as an n-dimensional vector in which each position holds the frequency of one word, where n is the total number of unique words.
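As a concrete illustration of the BoW representation, here is a minimal Python sketch (not code from the paper; the two toy sentences are invented): each document becomes an n-dimensional count vector over the corpus vocabulary.

```python
from collections import Counter

# Two toy documents, invented for illustration.
docs = [
    "the car stopped at the station",
    "the truck stopped near the car",
]

# Vocabulary: every unique word in the corpus, in a fixed order (size n).
vocab = sorted({w for doc in docs for w in doc.split()})

def bow_vector(doc):
    """Return the n-dimensional frequency vector of one document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)
for doc in docs:
    print(bow_vector(doc))
```

Each position of the vector holds the frequency of one vocabulary word, which is why the representation grows linearly with the number of unique words.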
The BoW model essentially employs a one-hot representation. The problem with a one-hot representation is that it scales linearly with the number of words and, more importantly, it cannot capture the inherent similarity between two words. For example, the words 'car' and 'truck' are semantically similar, but in a one-hot representation 'car' is as similar to any other word as it is to 'truck'. On the other hand, a distributed representation [4], which embeds each word into an n-dimensional vector space, allows word similarity to be compared directly in that space.

The Neural Probabilistic Language Model (NPLM), proposed by [2], is a neural network model that utilises distributed representations and has an input layer, a projection layer, a hidden layer and an output layer. The input layer takes each word and projects it into an n-dimensional vector in the projection layer through a shared matrix C. The embedding vector is then passed from the projection layer to the hidden layer, and the embedding is learned through training with the back-propagation (BP) algorithm; the objective is to maximise the log-likelihood of the training data. Compared with the n-gram model, the NPLM can model words at longer distances and its number of parameters scales only linearly with the number of unique words, while the n-gram model's complexity increases exponentially with the number of unique words. Despite its effectiveness, when the input size is large the number of neurons in the hidden layer and in the output layer has to be very large to capture the underlying complexity of the input data, which makes the model computationally inefficient and forces heavy parallelisation.

Word2Vec [6] was proposed by Mikolov et al. in 2013 for language modelling of the 6 billion tokens of the Google News corpus. As its name suggests, Word2Vec is a model that maps a word into a vector, called an embedding. More precisely, it is a neural network trained on a large corpus of sentences to learn word embeddings such that similar words occur in similar contexts. There are two popular neural network architectures of Word2Vec, CBOW (continuous BoW) and skip-gram, both of which are simple one-hidden-layer neural networks. This greatly reduces the network complexity and makes training computationally feasible. Recently, an interesting work by [5] trains Word2Vec sequentially over temporal time periods to learn a temporal word representation such that the representation of a word differs at different time points.

===3 Data Processing===

====3.1 Data Collection====

Temporal word semantic analysis requires access to a reasonable amount of text spanning a long period of time. In this era of big data, digitised texts are much easier to access: many old documents and books have been digitised, scanned and uploaded to the Internet, while more recent articles and books can be completely digitised, in different formats, at the very time they are created. In this study, we choose the National Geographic (NG) and the New York Times (NT), both of which have digitised texts spanning more than 100 years, facilitating our analysis at a large scale and over a long period of time. However, the format of the data available to us is non-uniform for NT: articles published before 1922 are scanned PDFs, while data after the 1970s is in HTML format. Data between those two periods is not available for public download.
The scanned PDFs contain huge variations of fonts, writing styles and scan quality, which puts great pressure on OCR technology when transforming such PDF documents into text; with open-source packages such as Tesseract, the transformation usually comes out in low quality. Besides, there are more than 3 million scanned PDFs, which would take more than a month to process on a single computer. Due to the low quality of the OCR output and the intensive computation required, we decided not to use the pre-1922 data for NT. Scrapy, an open-source library for data crawling, was applied and customised to crawl HTML data from the NT web site. NG, on the other hand, spans about 110 years, with all articles digitised onto 6 DVDs, which can be transformed into text with relative ease and good quality for pure-text images. Still, many errors can occur due to the non-uniform layout of the images. Figure 1 shows that NT has far more articles per year than NG has per decade; however, NT spans only 47 years (since 1973) while NG spans more than 100 years.

Fig. 1. Data statistics for NT and NG: number of articles of NT vs NG.

====3.2 Data Normalisation====

After the data was collected, we corrected errors and applied text normalisation techniques to the raw datasets. Figure 2 shows the text normalisation pipeline. The first step is concerned with the substitution of common misspellings. Common mistakes remain after OCR processing; for example, the character 'w' in the original PDF is mistakenly recognised as 'vv'. This kind of error is caused by the low quality of the input, which confuses the OCR engine. The next step is broken-word concatenation. In the OCR-processed text it is very common that one word is broken into two parts, especially at the end of a line; since this occurred a lot, we performed a thorough check to decide, for each line of text, whether concatenation is needed. Afterwards, non-English words were deleted and stop-words removed. Stemming is not applied in this work since we want to keep the original words.

Fig. 2. The Text Normalisation Pipeline.
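The following is a minimal sketch of such a normalisation pipeline. The substitution table, the stop-word set and the regular expressions are illustrative assumptions, not the exact rules used in this work.

```python
import re

# Illustrative OCR substitution table; the paper mentions 'vv' -> 'w' as an example.
OCR_FIXES = {"vv": "w"}

# Tiny illustrative stop-word list (the real pipeline uses a full list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "at"}

def normalise_line(line, english_vocab):
    """Apply the four normalisation steps to one line of OCR output."""
    # 1) common misspelling substitution
    for wrong, right in OCR_FIXES.items():
        line = line.replace(wrong, right)
    # 2) broken-word concatenation, e.g. "concate- nation" -> "concatenation"
    line = re.sub(r"(\w+)-\s+(\w+)", r"\1\2", line)
    # 3) keep only English words and 4) drop stop-words (no stemming)
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", line)]
    return [t for t in tokens if t in english_vocab and t not in STOPWORDS]

# Toy usage with a toy English vocabulary.
vocab = {"broken", "word", "concatenation"}
print(normalise_line("the broken vvord concate- nation", vocab))
# -> ['broken', 'word', 'concatenation']
```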
===4 Temporal Analysis===

Temporal analysis is a broad topic that tries to analyse any data with a temporal structure in it, whether financial time series or large collections of text spanning a long period of time. The importance of temporal analysis lies in viewing data as a dynamically evolving structure instead of a static one; its goal is to find temporal patterns in the data, patterns which are not revealed by methods that overlook the temporal dimension. In this work, we set up a simple framework by training a temporal Word2Vec model on the NG and NT datasets and analysing words' temporal semantic dynamics with respect to an anchor word.

====4.1 Problem Definition====

The goal of word embedding is to embed words into a vector space such that the similarities between words can be directly measured by vector operations, such as cosine similarity in the corresponding vector space. Word2Vec is such a model: it captures the semantic similarity between words that co-occur in similar contexts. In the temporal Word2Vec case, the goal is to embed words into a discrete vector space with an extra temporal dimension. Here we denote V_w as the temporal vector representing word w, with V_{w,t_i} being the vector representation of word w at time t_i:

V_w = {V_{w,t_1}, V_{w,t_2}, V_{w,t_3}, ..., V_{w,T}}

Ideally, we want the temporal Word2Vec model to capture the evolution dynamics of word semantics such that V_{w,t_i} represents the word w at time t_i. However, the obstacle to obtaining such a model is that it is difficult to evaluate a temporal Word2Vec model quantitatively: there is no ground truth for a word's semantic representation. We therefore focussed this work on the visualisation of the temporal similarity of words and on the evolution of such similarity over time (see the next section). We apply the common way of training a temporal Word2Vec model proposed by [5]. Briefly, the strategy is to train a sequence of Word2Vec models, one for each time period, where each model's weights are initialised from the model trained for the previous period. The problem of calculating the similarity of two words w_1, w_2 then becomes:

sim(w_1, w_2) = [sim(w_{1,t_1}, w_{2,t_1}), sim(w_{1,t_2}, w_{2,t_2}), ..., sim(w_{1,T}, w_{2,T})]

where w_{1,t_i} denotes the word w_1 at time t_i and sim(w_1, w_2) is the cosine similarity

sim(w_1, w_2) = (w_1 · w_2) / (|w_1| |w_2|)

with |·| the L2-norm.

====4.2 Model Training====

For training the temporal Word2Vec model we use Gensim [7], a Word2Vec library written in Python with Cython optimisations that achieves roughly a 70x speedup over a plain NumPy implementation. The two datasets are prepared by putting all documents of one time period into a single text file, one document per line, so there is a total of T files, where T is the number of time periods. For both datasets we trained a Word2Vec model for each time period and used each trained model to initialise the model of the next period. As for hyperparameters, the window size is set to 5 and the embedding dimension to 200. After training finishes, a model M_i is saved for each t_i, and the vector representation of each word is stored in M_i. Training the temporal Word2Vec model on NG is fast due to the relatively small size of the dataset, while training on NT takes much longer (around 2 hours). The training was done on a MacBook Pro with 16 GB of 1600 MHz DDR3 memory and a 2.2 GHz Intel Core i7 processor.
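Below is a minimal sketch of the sequential-initialisation training loop described above, using Gensim's Word2Vec. The file names, epoch handling and min_count value are assumptions; parameter names follow Gensim 4, whereas the paper used an earlier version.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One text file per time period, one document per line (assumed file names).
period_files = ["ng_1900s.txt", "ng_1910s.txt", "ng_1920s.txt"]

model = None
for i, path in enumerate(period_files):
    sentences = LineSentence(path)
    if model is None:
        # First period: fresh model with window size 5 and 200-dimensional embeddings.
        model = Word2Vec(sentences, vector_size=200, window=5, min_count=5)
    else:
        # Later periods: continue from the previous weights (sequential initialisation).
        model.build_vocab(sentences, update=True)
        model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(f"model_t{i}.w2v")   # M_i, the model for time period t_i

# Temporal similarity of two words: cosine similarity in each period's model.
models = [Word2Vec.load(f"model_t{i}.w2v") for i in range(len(period_files))]
sim_series = [m.wv.similarity("car", "train")
              for m in models if "car" in m.wv and "train" in m.wv]
print(sim_series)
```

Continuing training on the next period's corpus with build_vocab(update=True) is one way of realising the "initialise from the previous model" strategy of [5]; an alternative is to copy the previous weight matrices into a freshly created model.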
===5 Visual Temporal Analysis===

In this section, we visualise word semantic evolution by exploring different visualisation techniques and adapting them to temporal analysis. Three techniques, Word Cloud, Heatmap and t-SNE, are adopted to visualise the word vectors of the trained Word2Vec models. The example of the anchor word 'car' is shown for all three cases.

====5.1 Word Cloud====

A Word Cloud visualises the relative frequency of each word by making more frequent words bigger and less frequent words smaller. For its layout, words can be positioned horizontally, vertically or obliquely, and the Word Cloud generator produces a layout in which words do not overlap, subject to their size and position constraints. Since we are interested in studying the temporal evolution of word semantics, the question is whether the Word Cloud can be adapted to visualise this evolution. Originally, a Word Cloud is based on a single metric, the frequency of each word. To study word similarities with a Word Cloud, the simple change is to adopt the similarity of a word with respect to an anchor word as the metric: the more similar a word is to the anchor, the bigger its size.

We call this approach the Semantic Similarity Word Cloud (SSWC). With SSWCs generated for the individual time periods, a possible way to obtain a Temporal Semantic Similarity Word Cloud (TSSWC) is to generate one SSWC per time period and concatenate them sequentially into a GIF. From a visual point of view, it is desirable to keep each word's position fixed across the SSWCs to facilitate eye tracking. Algorithm 1 calculates the font size of each word proportionally to its similarity weight (a sketch of this computation is given at the end of this section). Figure 3 shows the SSWC plots of NG for two time periods, namely the first decade of the 20th and of the 21st century; of course, we cannot show in this paper a video of the entire sequence of years. Also, due to layout constraints, the exact location of the same word in the two SSWCs is not exactly the same, since each word's size can differ.

Fig. 3. SSWC for the 1900s (left) and 2000s (right) with the words most similar to 'car' according to the National Geographic.

Note that not all SSWCs are shown in Figure 3; only two representative ones appear here. As can be seen, the relative order of the words in each SSWC plot is fixed. One can immediately pick up some differences for the same word between the 1900s and the 2000s. In the 1900s, 'parlor' and 'pullman' are more similar to 'car'; in the 2000s, the notion of 'car' is less similar to 'parlor' and more similar to 'automobile'. In fact, in the USA the word 'pullman' specifically referred to railroad sleeping cars operated by the Pullman Company from 1867 to 1968; thus, by the 2000s, the word 'pullman' should no longer be similar to 'car' due to its disuse in car-related contexts. The word 'automobile' derives in part from the Ancient Greek word for 'self'; over time, it has become less used in Britain but remains widely used in North America.

Figure 4 shows the same kind of SSWC plot for the NT dataset. The two series of words for NT and NG are not exactly the same, since the contexts are different and a word that is similar to 'car' in NG is not necessarily similar according to NT. Overall, the notion of car has stayed fairly stable since the 1970s: four-wheeled, powered by gasoline, with famous brands including Ford, Toyota and BMW.

Fig. 4. SSWC for 1970 (left) and 2016 (right) with the words most similar to 'car' according to the New York Times.
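Since Algorithm 1 itself is not reproduced in this text, the sketch below illustrates one plausible way to turn similarities to the anchor word into font sizes; the minimum and maximum font sizes are assumed parameters, not values from the paper.

```python
def sswc_font_sizes(similarities, min_font=12, max_font=96):
    """Scale each word's font size linearly with its similarity to the anchor.

    `similarities` maps a word to its cosine similarity to the anchor word;
    the min/max font sizes are illustrative assumptions.
    """
    lo, hi = min(similarities.values()), max(similarities.values())
    span = (hi - lo) or 1.0                    # avoid division by zero
    return {w: min_font + (s - lo) / span * (max_font - min_font)
            for w, s in similarities.items()}

# Toy usage: words most similar to the anchor 'car' in one time period.
print(sswc_font_sizes({"automobile": 0.71, "pullman": 0.62, "parlor": 0.55}))
```

For a TSSWC, the same word order and positions would be reused in every frame so that the eye can follow a word, and the per-period frames would be concatenated into a GIF.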
====5.2 Heatmap====

In this section we explore another visualisation technique, the Heatmap. The idea of a Heatmap is that each value is associated with a colour. The task is again to visualise word similarities for each time period: in the temporal word semantic analysis case we let the y-axis be the list of words and the x-axis the list of time periods, so that for a given word we can quickly get an understanding of its temporal similarity to the anchor word from the change of colour. Figure 5 shows the Heatmap visualisation of the words most similar to the word 'car' for both NG and NT. Since the most similar words to a given word differ between time periods, we take the union of the top 10 most similar words of each time period and list them along the y-axis. Note that each word's similarity is normalised across all its time periods for a better contrast of colours:

sim'(w_{j,t}, w_ref) = sim(w_{j,t}, w_ref) / Σ_{t=1}^{T} sim(w_{j,t}, w_ref)

where w_{j,t} is the j-th word of the most similar word list at time period t, w_ref is the anchor word, and the sum runs over all T time periods.

Fig. 5. Heatmap visualisation of the words similar to the word 'car'.

One interesting phenomenon is that the right panel of Figure 5 contains more car brands than the left panel. There are 9 brands in the right panel: mercedes, bmw, jeep, chevy, volkswagen, cadillac, camaro, buick and corolla, while in the left panel only ford, honda, chandler and dodge are mentioned. One way to interpret this phenomenon is that globalisation led to many more international cars being introduced into the US market in 2016 than in 1970.

Now let us look at individual cases in the left panel of Figure 5. For example, the word 'chandler' is very similar to 'car' before the 1930s and less similar afterwards. Interestingly, when checking the history we find that the Chandler Motor Car Company, founded in 1913, saw its production peak in 1927 and was purchased by a competitor two years later. The history of the Chandler Motor Car Company is roughly synchronised with the dynamics of the semantic similarity between the word 'chandler' and the word 'car'. Another interesting word is 'honda'. The brand Honda became an important motorcycle and car provider in America; it went on to dominate the American motorcycle market, growing from a 0% market share in 1959 to as much as 63% in 1966. This history is reflected in the Heatmap colours. Interestingly, the word 'ford' is not related to the word 'car' during the early decades of the 20th century, despite the well-known Ford Model T that dominated the market at that time. One way to explain this inconsistency is that 'car' was more similar to 'train' at that time, showing that there can be a delay between a word's semantic meaning and the true concept behind the word. Another interesting word is 'dining', whose similarity peaks in the late 19th and early 20th century. According to Wikipedia, the concept of the 'dining car' can be traced back to the 1880s, when dining cars were a normal part of long-distance trains. This also explains why the word 'train' has roughly the same pattern of similarity to 'car' as 'dining'.
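A minimal sketch of how such a heatmap can be produced with matplotlib follows; the words, time periods and similarity values are invented for illustration, and each row is normalised over its time periods as in the formula above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Rows: union of the top-10 words most similar to the anchor 'car';
# columns: time periods. All values are invented for illustration.
words = ["automobile", "pullman", "dining", "train"]
periods = ["1900s", "1950s", "2000s"]
sims = np.array([
    [0.30, 0.55, 0.70],
    [0.60, 0.40, 0.10],
    [0.65, 0.35, 0.15],
    [0.70, 0.50, 0.30],
])

# Normalise each word's similarities across its time periods.
sims_norm = sims / sims.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
im = ax.imshow(sims_norm, aspect="auto", cmap="viridis")
ax.set_xticks(range(len(periods)))
ax.set_xticklabels(periods)
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words)
fig.colorbar(im, ax=ax, label="normalised similarity to 'car'")
plt.show()
```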
===6 Conclusions===

In this paper we studied the temporal semantic evolution of words in two datasets: the National Geographic and the New York Times. The National Geographic spans more than 100 years but has fewer articles and covers fewer topics. The New York Times (as available to us), on the other hand, covers only 47 years since 1970, but has many more articles and a much broader topic coverage. Both are American publications; NG focuses more on geography, history and culture, while NT focuses on politics, life, society and opinion.

We applied the Word2Vec model, a neural network language model that maps each individual word into a vector space where word similarity can be measured by cosine similarity. However, the Word2Vec model assumes that a word's meaning does not change over time, which is not true. In order to take word semantic evolution into account, we built a temporal Word2Vec model in which each word is mapped to a sequence of vectors, one for each time period t. There are other ways to train such a temporal Word2Vec model besides the sequential initialisation procedure; for example, Alburg [1] initialises all the models with the same pre-trained weights and adds a regularisation term to the objective.

Two visualisation techniques were explored: Word Clouds and Heatmaps. A Word Cloud visualises words based on their frequency; the TSSWC adapts it by using similarity to an anchor word and by fixing each word's position and horizontal orientation. Although the Word2Vec approach has advantages for temporal word analysis, it also has certain limitations. First, some words do not exist in the early years, which leads to their exclusion from the vocabulary of the Word2Vec model. For example, the word 'Internet' is not mentioned in the 1970s, which makes it impossible to compare the similarity between 'Internet' and, say, 'surfing'; likewise, 'computer' does not exist in the late 19th century portion of the NG dataset. Accordingly, the temporal Word2Vec framework does not have the flexibility to detect when a new word appears and then show its dynamics from that period onwards. Second, we treat a single word as the basic unit for constructing a sentence in English, but this is not always appropriate: for example, human names are composed of at least two words, a given name and a family name.

Nonetheless, the visualisation analysis in this paper is effective and fruitful: it captures many interesting trends in word semantic similarities. What is interesting is to explore the possible reasons behind such trends. Overall, the temporal Word2Vec model and the visualisation techniques introduced here provide researchers with a useful combination of tools to spot word semantic evolution, despite the presence of some noise.

Acknowledgments. This work was extracted from the first author's Master Thesis submitted at the Faculty of Informatics of the Universitá della Svizzera Italiana (USI) in January 2017. The work was supervised by the second author.

References

1. Alburg, H.: Tracking temporal evolution in word meaning with distributed word representation. Master thesis in Computer Science, Chalmers University of Technology (2015)
2. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (Mar 2003)
3. Harris, Z.: Distributional structure. Word 10(2–3), 146–162 (1954)
4. Hinton, G.E.: Learning distributed representations of concepts. In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society (1986)
5. Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D., Petrov, S.: Temporal analysis of language through neural language models. ArXiv e-prints (May 2014)
6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
7. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010)