=Paper=
{{Paper
|id=Vol-2831/paper3
|storemode=property
|title=Diffusion-based Temporal Word Embeddings
|pdfUrl=https://ceur-ws.org/Vol-2831/paper3.pdf
|volume=Vol-2831
|authors=Ahnaf Farhan,Roberto Camacho Barranco,M. Shahriar Hossain,Monika Akbar
|dblpUrl=https://dblp.org/rec/conf/aaai/FarhanBHA21
}}
==Diffusion-based Temporal Word Embeddings==
Ahnaf Farhan, Roberto Camacho Barranco, M. Shahriar Hossain, Monika Akbar
Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968
afarhan@miners.utep.edu, {rcamachobarranco, mhossain, makbar}@utep.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Semantics in natural language processing is largely dependent on contextual relationships between words and entities in documents. The context of a word may evolve. For example, the word "apple" currently has two contexts – a fruit and a technology company. The changes in the context of entities in biomedical publications can help us understand the evolution of a disease and relevant scientific interventions. In this work, we present a new diffusion-based temporal word embedding model that can capture short- and long-term changes in the semantics of biomedical entities. Our model captures how the context of each entity shifts over time. Existing dynamic word embeddings capture semantic evolution at a discrete/granular level, aiming to study how a language developed over a long period. Our approach provides smooth embeddings suitable for studying short-term as well as long-term changes. To evaluate the proposed model, we track the semantic evolution of entities in abstracts of biomedical publications. Our experiments demonstrate the superiority of the proposed model when compared to its state-of-the-art alternatives.

1 Introduction

Word embeddings are low-dimensional vector space models obtained by training a neural network using contextual information from a large text corpus. There are several variants of word embeddings with different features, such as word2vec (Mikolov et al. 2013b,a) and GloVe (Pennington, Socher, and Manning 2014). However, research on word embeddings that incorporate temporal shifts in the contextual meanings of words is still in its infancy. This paper focuses on generating word embeddings that account for and take advantage of the temporal nature of timestamped scientific documents (e.g., abstracts of biomedical publications). Our goal is to obtain a low-dimensional temporal vector space representation that allows us to study the semantic and contextual evolution of words/entities. Using the word embeddings generated by our framework, we demonstrate the task of tracking the semantic evolution of entities in a corpus of biomedical abstracts.

To generate word embeddings, our framework trains a model using a diffusion mechanism for evolving concepts within a scientific text corpus (Camacho et al. 2018; Angulo et al. 1980). A concept generally does not spike on a day and disappear immediately. Rather, concepts evolve with context. Existing temporal low-dimensional language representations fail to integrate the concept of temporal diffusion into language models effectively. Moreover, these existing models (Bamler and Mandt 2017; Marina Del Rey 2018; Rudolph and Blei 2018) cannot simultaneously capture both the short-term and long-term drifts in the meaning of words. As a result, sharply trending concepts, such as COVID-19 (coronavirus disease 2019), cannot be modeled in the embedding space when long-term drifts are considered. On the other hand, long-range effects – such as the change in the meaning of the word cloud – are not captured when these algorithms take only short-term drifts into account.

Our approach uses the model for temporal high-dimensional tf-idf representations introduced by Camacho et al. (2018) to construct a training set. Construction of the training set is a one-time cost. The temporal high-dimensional tf-idf representation (Camacho et al. 2018) is able to capture sudden short-term changes in the corpus. Additionally, it incorporates diffusion into the modeling to some extent by incorporating the time dimension smoothly. The framework presented by Camacho et al. (2018) generates smooth tf-idf vectors for each word of a corpus at every timestamp, making it suitable for generating training data for our proposed model. One of the challenges of (Camacho et al. 2018) is that each word vector has a length equal to the number of documents in the corpus, which is not practical for analyzing a corpus containing thousands of scientific documents. Our goal is to construct a contextual low-dimensional temporal embedding space that mimics this high-dimensional representation without losing the essential temporal diffusion information encoded in the vectors. We introduce a neural-network-based framework that generates temporal word embeddings while optimizing for multiple key objectives. The temporal tf-idf representation from (Camacho et al. 2018) is used to obtain a baseline expected cosine distance (1.0 − cosine similarity) between pairs of word vectors at each timestamp. The expected cosine distance is used in the output layer of our proposed neural network. New low-dimensional embedding vectors – driven by a rigorous objective function to smoothly bring contextual entities close to each other – are generated in the hidden layer. The generated low-dimensional vectors are contextual and allow the discovery of latent (transitive) relationships that cannot be observed in the temporal tf-idf representation. For example, if words A and B are close to C, we expect words A and B to be close to each other. We further explain the objective function and the neural network in Section 4.

The experimental results in Section 5 show that the proposed method performs significantly better than the state-of-the-art dynamic embedding models (Rudolph and Blei 2017; Carlo, Bianchi, and Palmonari 2019) in capturing both short-term and long-term changes in word semantics. Results show that our approach improves the continuity between the vectors across different timestamps. As a result, the embeddings for different timestamps combine into a homogeneous space, unlike the state-of-the-art models.

2 Related Work

Meanings of words in a language change over time depending on their use (Aitchison 2013; Yule 2017). Temporal syntactic and semantic shifts are called diachronic changes (Hamilton, Leskovec, and Jurafsky 2016). Several probabilistic approaches tackle the problem of modeling the temporal evolution of a vocabulary by converting a set of timestamped documents into a latent variable model (Radinsky, Davidovich, and Markovitch 2012; Yogatama et al. 2014; Tang, Qu, and Chen 2013; Naim, Boedihardjo, and Hossain 2017). Other approaches model diachronic changes using part-of-speech features (Mihalcea and Nastase 2012) or using graphs where the edges between nodes (that represent words) are stronger based on context information (Mitra et al. 2015). However, tracking semantic evolution is not possible using these techniques because they do not generate language models.

The state-of-the-art technique for language modeling is word2vec, introduced by Mikolov et al. (2013a,b). This method generates a static language model where every word is represented as a vector (also called an embedding) by training a neural network to mimic the contextual patterns observed in a text corpus. There are several variants of this method, including probabilistic approaches (Barkan 2017) as well as matrix-factorization-based techniques such as GloVe (Pennington, Socher, and Manning 2014). A major challenge with static representations is that they do not incorporate any temporal information that can be used for tracking semantic evolution. Our work focuses on incorporating the temporal dimension of text data into text embedding models so that the evolution of a vector space over time can be studied.

A proposed solution to tracking semantic evolution is to obtain a static representation for each timestamp in a corpus and then artificially couple these embeddings over time using regression or similar methods (Hamilton, Leskovec, and Jurafsky 2016; Rosin, Adar, and Radinsky 2017; Carlo, Bianchi, and Palmonari 2019). However, this approach has several drawbacks. First, it requires a significant number of occurrences for all words at all times, which is usually not the case since words can gain popularity or appear at different times. Second, the artificial coupling of embeddings across timestamps can introduce artifacts into the model that may lead to wrong conclusions. A potential solution to the sparsity problem is introduced by Camacho et al. (2018), which leverages diffusion theory (Angulo et al. 1980) to generate a robust temporal representation. The technique uses a temporal tf-idf representation in which the model changes size with the number of documents and, as a result, is not extensible.

The drawbacks of using static word embedding models to generate temporal representations have led to the development of new techniques that train the embeddings for different timestamps jointly. These models use filters or regularization terms to connect the embeddings over time. Yao et al. (Marina Del Rey 2018) propose to generate a co-occurrence-based matrix and factorize it to generate temporal embeddings. The embeddings over timestamps are aligned using a regularization term. Rudolph et al. (Rudolph and Blei 2018) apply Kalman filtering to exponential family embeddings to generate temporal representations. Bamler et al. (Bamler and Mandt 2017) use similar filtering but apply it to embeddings using a probabilistic variant of word2vec. According to Bamler et al. (2017), using a probabilistic method makes the model less sensitive to noise. All these methods focus primarily on capturing long-term semantic shifts, while our goal is to capture both long- and short-term shifts.

3 Problem Description

In this paper, we focus on timestamped text corpora, such as collections of scientific publications that have publication dates. Let D = {d_1, d_2, ..., d_|D|} be a corpus of |D| documents and W = {w_1, w_2, ..., w_|W|} be the set of |W| noun phrases and entities extracted from the text corpus D. We consider each of the noun phrases and entities a word. Each document d contains words from the vocabulary (W_d ⊂ W) in the same order as they appear in the original document d. Every document d ∈ D is labeled with a timestamp t_d ∈ T, where T is the ordered set of timestamps.

The goal of this paper is to obtain a temporal word embedding model U from corpus D. Thus, for every timestamp t ∈ T, we seek to obtain a vector representation u_it for every word w_i ∈ W. The word embeddings U are represented as a three-dimensional matrix of size |W| × |T| × |u|, where |u| is a user-given parameter that indicates the size of a vector for a particular word at a particular time. We use the shorthand U_i to describe the two-dimensional matrix of size |T| × |u| that represents word w_i ∈ W over time.
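To make the notation above concrete, the following minimal sketch shows how the temporal embedding tensor U and the shorthands u_it and U_i would be indexed. It is our illustration in NumPy; the sizes, variable names, and random initialization are ours and not part of the paper.

```python
import numpy as np

# Hypothetical sizes: |W| words, |T| timestamps, |u| embedding dimensions.
num_words, num_timestamps, emb_dim = 3000, 21, 64

# U is the |W| x |T| x |u| temporal embedding tensor of Section 3,
# initialized uniformly in [0, 1] as later described in Section 4.6.
U = np.random.uniform(0.0, 1.0, size=(num_words, num_timestamps, emb_dim))

i, t = 42, 9      # an arbitrary word index and timestamp index
u_it = U[i, t]    # vector u_it for word w_i at timestamp t
U_i = U[i]        # the |T| x |u| matrix U_i: the trajectory of w_i over time
```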
4 Methodology

Each of the subsections below describes a major component of our objective function for generating diffusion-based temporal word embeddings.

4.1 Training data for our model

We use the temporal tf-idf model (Camacho et al. 2018) to obtain high-dimensional time-reflective text representations of size |W| × |T| × |D| for training purposes. The vectors are formed using the temporal tf-idf weights of a word for every document in every timestamp. The temporal tf-idf weight of a word is computed using Eq. (1).

\hat{w}(w, d, t_d, t, \varsigma) = \frac{1}{\sqrt{2\pi\varsigma^2}} e^{-\frac{(t_d - t)^2}{2\varsigma^2}} \cdot \frac{(1 + \log f_{w,d}) \log\frac{|D|}{\lambda_w}}{\sqrt{\sum_{w' \in W_d} \left( (1 + \log f_{w',d}) \log\frac{|D|}{\lambda_{w'}} \right)^2}}    (1)

where \hat{w} is the weighted tf-idf value at timestamp t for the word w ∈ W in document d ∈ D, which was published at timestamp t_d. The term f_{w,d} represents the term frequency of word w in document d, λ_w is the number of documents that contain word w, and W_d is the set of words that appear in document d. The standard deviation of the Gaussian distribution function is represented by ς and is set by the user.

Next, we compute the cosine distance (1.0 − cosine similarity) between every pair of words and store these as a distance matrix ∆, where each element can be addressed as δ_ijt ∈ ∆. This distance matrix ∆ becomes the training data for the expected distance between a particular pair of words (w_i, w_j) ∈ W at time t ∈ T. We use the notation δ_ij to represent a vector of size |T| with the temporal tf-idf-based cosine distances between (w_i, w_j) ∈ W for all the timestamps. The cosine distances are later used in the output layer of our proposed neural network.
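The sketch below illustrates how a single temporal tf-idf weight in the spirit of Eq. (1) and the cosine-distance targets δ_ijt could be computed. It reflects our reading of the reconstructed equation (a Gaussian time kernel multiplied by a length-normalized tf-idf weight); the function names, arguments, and the normalization in the denominator are assumptions, not the authors' code.

```python
import numpy as np

def temporal_tfidf_weight(f_wd, lambda_w, doc_terms, t_d, t, n_docs, varsigma):
    """Sketch of a temporal tf-idf weight as we read Eq. (1).

    f_wd      : frequency of word w in document d
    lambda_w  : number of documents containing w
    doc_terms : list of (frequency, document count) pairs for every word in d,
                used for the length normalization in the denominator
    t_d, t    : publication timestamp of d and the timestamp being scored
    varsigma  : user-set standard deviation of the temporal Gaussian
    """
    gaussian = np.exp(-((t_d - t) ** 2) / (2 * varsigma ** 2)) / np.sqrt(2 * np.pi * varsigma ** 2)
    tfidf = (1 + np.log(f_wd)) * np.log(n_docs / lambda_w)
    norm = np.sqrt(sum(((1 + np.log(f)) * np.log(n_docs / lam)) ** 2 for f, lam in doc_terms))
    return gaussian * tfidf / norm

def cosine_distance(a, b):
    """1.0 - cosine similarity: the training target delta_ijt for a pair of word vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example with made-up numbers: a word appearing 3 times in a document
# published at t_d = 5, scored for timestamp t = 7.
w_hat = temporal_tfidf_weight(3, 120, [(3, 120), (1, 800), (2, 45)],
                              t_d=5, t=7, n_docs=10000, varsigma=1.0)
```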
4.2 Optimizing for similarity

One of our objectives is to obtain a low-dimensional word embedding model U such that computing the cosine distance between the word vectors results in a distance matrix that closely resembles ∆. Equation (2) formulates this objective as ϑ. In this case, we optimize the vectors in U to minimize the difference between the cosine distance of each pair of word vectors for every timestamp and the cosine distance from the temporal tf-idf model in ∆ (Eq. (1)). Minimizing this difference ensures that our model captures the same similarity as the temporal tf-idf model while providing low-dimensional contextual vectors.

In this paper, the term dist(A, B) refers to the cosine distance between vector A and vector B. The cosine distance between two word vectors is bounded between [0, 1]. A cosine distance of 0 between two word vectors means that both words share the same context, while a cosine distance of 1 means that the vectors are completely orthogonal and thus do not share contextual similarities. The variable α is introduced as a scaling factor to avoid numerical stability issues with values close to zero. The simplest form of our objective function is as follows.

\vartheta_1(U) = \sum_{i=1}^{|W|} \sum_{j=1}^{|W|} \sum_{t=1}^{|T|} \left( \alpha \cdot \mathrm{dist}(u_{it}, u_{jt}) - \alpha \cdot \delta_{ijt} \right)^2    (2)

4.3 Weighing relevance: Giving more importance to the neighborhood of each word

In our work, we focus on the task of studying the semantic evolution of a word based on changes to its context. Thus, it is more important that our word embedding model correctly captures the relevant neighborhood of a word. Our experiments demonstrated that each word has a small number of relevant neighbors. That is, each word shares context with a small number of words. To take this into account in the objective function, we introduce a penalty when the temporal tf-idf-based cosine distance δ_ijt is small, ensuring that our word embedding model captures the relevant context accurately.

\vartheta_2(U) = \sum_{i=1}^{|W|} \sum_{j=1}^{|W|} \sum_{t=1}^{|T|} \left( \alpha \cdot \mathrm{dist}(u_{it}, u_{jt}) - \alpha \cdot \delta_{ijt} \right)^2 \cdot e^{-\beta \delta_{ijt}}    (3)

where β is a scaling parameter to increase or decrease the importance given to the samples with a smaller distance. Notice that e^{-\beta \delta_{ijt}} in Eq. (3) imposes a higher penalty on examples with smaller baseline distances. The penalty is smaller when the distance from the temporal tf-idf model is large. Equation (3) supports the phenomenon that, for a specific word, most of the words in the vocabulary are at a relatively large distance. The large distances need not be a part of the penalty because the objective function is only concerned with neighbors that appear in the vicinity in the temporal tf-idf model.

4.4 Temporal diffusion filter

Based on diffusion theory (Angulo et al. 1980), we assume that the meaning of a word, and consequently its vector representation, diffuses (or drifts) over time. Thus, the word embeddings should evolve smoothly over time. To introduce this concept into our objective function, we model the effect of every word vector in all timestamps to some degree.

We use a Gaussian filter (Eq. (4)) to diffuse the contribution of each vector smoothly before and after the timestamp of the current sample. The filter uses a sliding window, going from the first to the last timestamp. σ is a user-settable parameter representing the standard deviation of the Gaussian distribution. A large value of σ means that the diffusion of word vectors is slow over time. A small standard deviation allows capturing short-term changes in meaning.

\gamma(t, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t_i - t)^2}{2\sigma^2}} \quad \text{with } t_i = 1, \ldots, |T|    (4)

Equation (5) presents the updated objective ϑ_3, which includes the temporal diffusion of the word embeddings.

\vartheta_3(U) = \sum_{i=1}^{|W|} \sum_{j=1}^{|W|} \sum_{t=1}^{|T|} \left( \alpha \cdot \mathrm{dist}(\gamma(t, \sigma) U_i, \gamma(t, \sigma) U_j) - \alpha \cdot \delta_{ijt} \right)^2 \cdot e^{-\beta \delta_{ijt}}    (5)
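The following sketch shows one way to read Eqs. (3)–(5): the Gaussian filter of Eq. (4) produces a weight per timestamp, γ(t, σ)U_i is taken to be the weighted combination of a word's vectors over time, and each (i, j, t) sample contributes a squared error scaled by e^{−βδ_ijt}. This is our illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_filter(num_timestamps, t, sigma):
    """Eq. (4): Gaussian weights gamma(t, sigma) over timestamps t_i = 1..|T|."""
    ti = np.arange(1, num_timestamps + 1)
    return np.exp(-((ti - t) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def diffused_vector(U_i, t, sigma):
    """Gaussian-weighted combination of a word's vectors over time, i.e., our
    reading of gamma(t, sigma) U_i in Eq. (5); U_i has shape |T| x |u|."""
    return gaussian_filter(U_i.shape[0], t, sigma) @ U_i

def theta3_term(U_i, U_j, delta_ijt, t, sigma, alpha=1.0, beta=1.0):
    """One (i, j, t) term of the diffusion-filtered objective in Eq. (5)."""
    a, b = diffused_vector(U_i, t, sigma), diffused_vector(U_j, t, sigma)
    cos_dist = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Squared error against the temporal tf-idf distance, down-weighted for
    # pairs that are far apart in the baseline (the e^{-beta delta} factor of Eq. (3)).
    return (alpha * cos_dist - alpha * delta_ijt) ** 2 * np.exp(-beta * delta_ijt)
```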
4.5 Smoothness penalty: Creating a homogeneous temporal embedding space

The second important goal that our word embedding model should achieve is to be spatially smooth over time. Continuous or smooth temporal embeddings are those where the distance (e.g., Manhattan or Euclidean) between two vectors of the same word for consecutive timestamps is small. Equation (6) captures the expected behavior by penalizing significant spatial changes.

\varepsilon_1(U) = \sum_{i=1}^{|W|} \sum_{t=1}^{|T|-1} \| u_{i,t+1} - u_{i,t} \|_2    (6)

The main issue with this expression is that, by forcing consecutive vectors to be very close together, we might lose important information when the vectors drift apart in the original data. Thus, we introduce weights ω_ϑ and ω_ε to control the effect of each objective. The final objective function takes the form of Eq. (7).

F_a(U) = \vartheta_3(U)^{\omega_\vartheta} \, \varepsilon_1(U)^{\omega_\varepsilon}    (7)

An alternative form would be

F_b(U) = \omega_\vartheta \log \vartheta_3(U) + \omega_\varepsilon \log \varepsilon_1(U)    (8)

or

F_c(U) = \omega_\vartheta \vartheta_3(U) + \omega_\varepsilon \varepsilon_1(U)    (9)

4.6 Implementation

We implemented a neural-network-based model using TensorFlow to generate our low-dimensional temporal word embeddings. An overall view of the architecture of our neural network is shown in Fig. 1. The goal of the neural network is to minimize Eq. (7). The embeddings for all words in all timestamps are generated in the hidden layer. We initialize the weights in the hidden layer in the range [0, 1]. The data used for training the model contains three inputs (one-hot encodings of a pair of words for which the cosine distance is known, and the timestamp) and one target value (the cosine distance). The inputs are the indices for two random words w_it and w_jt at timestamp t. The target value is the expected cosine distance between w_it and w_jt, obtained using the temporal tf-idf representations of Eq. (1).

Figure 1: The proposed neural network architecture for temporal embedding generation in the hidden layer.
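As a rough illustration of the training loop described in Section 4.6, the sketch below updates a (word pair, timestamp) sample toward its expected cosine distance using TensorFlow. It is a minimal sketch under our assumptions: the optimizer, learning rate, and tensor sizes are ours, and the diffusion filter and smoothness penalty of Eqs. (5)–(7) are omitted for brevity.

```python
import tensorflow as tf

num_words, num_timestamps, emb_dim = 3000, 21, 64
alpha = 1.0

# Hidden-layer weights: one |u|-dimensional vector per (word, timestamp),
# initialized in [0, 1] as stated in Section 4.6.
U = tf.Variable(tf.random.uniform((num_words, num_timestamps, emb_dim), 0.0, 1.0))
optimizer = tf.keras.optimizers.Adam(1e-3)  # optimizer and learning rate are assumptions

def train_step(i, j, t, delta_ijt):
    """One update for a (w_i, w_j, t) sample whose target distance is delta_ijt."""
    with tf.GradientTape() as tape:
        u_it = tf.nn.l2_normalize(U[i, t])
        u_jt = tf.nn.l2_normalize(U[j, t])
        cos_dist = 1.0 - tf.reduce_sum(u_it * u_jt)
        loss = tf.square(alpha * cos_dist - alpha * delta_ijt)
    grads = tape.gradient(loss, [U])
    optimizer.apply_gradients(zip(grads, [U]))
    return loss

# e.g., train_step(12, 857, 5, 0.35)
```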
5 Experimental Results

We performed experiments using three different datasets: a synthetic dataset, the PubMed Pandemic dataset, and the PubMed COVID dataset.

We generated the synthetic dataset consisting of 10,000 words and ten timestamps. For this dataset, we already know the 10 nearest neighbors of each word in every timestamp. Neighborhoods of larger sizes will contain random words starting at the 11th nearest neighbor.

The PubMed Pandemic dataset contains 328,908 abstracts of pandemic- and epidemic-related biomedical publications. The abstracts were published between the years 2000 and 2020. We selected the 3,000 most frequent biomedical entities for this dataset.

The PubMed COVID dataset contains 41,571 abstracts of biomedical papers related to COVID-19, published in 2020. The corpus was collected from the Kaggle COVID-19 Open Research Dataset Challenge (Wang et al. 2020). We selected the 2,000 most frequent biomedical entities for this dataset.

We extracted the biomedical entities from the PubMed abstracts using scispaCy's biomedical named entity recognition (Neumann et al. 2019). In this paper, we use the phrases temporal word embedding and temporal embedding to describe the core concepts, while in practice we performed temporal biomedical entity embedding.

We evaluate our temporal word embedding method by comparing its performance with that of a regular tf-idf model, the temporal tf-idf model (Camacho et al. 2018), dynamic Bernoulli embeddings (Rudolph and Blei 2018), and temporal word embeddings with a compass (TWEC) (Carlo, Bianchi, and Palmonari 2019). In all our experiments, we used an embedding size of 64.

We seek to answer the following questions.

1. What is the effect of introducing different penalty terms in our objective function? (Section 5.1)
2. How well do the models perform in terms of capturing the neighborhood of entities over time, compared to the temporal tf-idf? (Section 5.2)
3. How well do the models perform in terms of capturing changes in the neighborhood over time in the respective embedding spaces? (Section 5.3)
4. How well does our algorithm track the quick evolution of a specific entity, such as COVID, compared to other methods? (Section 5.4)
5. How well does our algorithm capture the semantic evolution of a general term, such as pandemic, compared to other methods? (Section 5.5)

5.1 Effect of penalty terms

In this experiment, we study the effect of the different versions of our objective function on the quality of the temporal word embedding model, focusing on the task of tracking semantic evolution. The versions under this study correspond to ϑ_1 (2), ϑ_2 (3), ϑ_3 (5), F_a (7), F_b (8), and F_c (9). We quantify the quality of the resulting vectors with two different metrics: similarity and continuity.

The similarity is measured as the number of intersections between the word neighborhoods obtained using the temporal tf-idf model and each of the different versions of our objective function. The goal of the similarity evaluation is to quantify how well our model mimics the temporal tf-idf model. It must be noted that we did not expect to have a perfect match in the neighborhoods of words since the temporal tf-idf representation does not take into account latent contextual relationships between words.

The continuity is measured using the average, maximum, and minimum mean squared errors (MSE) across consecutive timestamps for the word vectors obtained using the different versions of our objective function.

Fig. 2 shows the results for the similarity evaluation with the synthetic dataset described at the beginning of Section 5. The objective function labeled F_a in the figure performs significantly better than the other formulations. If we discard F_b and F_c, it is possible to see how the similarity improves with the progression in which we developed our objective function. Furthermore, taking into account that only the top-10 nearest neighbors are known and set as accurate in the synthetic data and the rest of the neighbors are random, having an average of 8 intersections means that our model can correctly capture the semantic evolution of the synthetic dataset.

Figure 2: Average number of intersections per timestamp for different neighborhood sizes (k) between the neighborhoods obtained with the baseline method and those obtained using the different versions of our objective function (embedding size = 64).

Evaluating continuity is required to ensure that there is a smooth transition between timestamps for the vectors of the same word. A high average or minimum MSE value indicates that there is significant movement of the word vectors over time in the embedding space. However, a small maximum MSE value would mean that the word embeddings are not following the trends observed in the temporal tf-idf model-based training. Thus, the best model is one that has high similarity with the temporal tf-idf model while maintaining a low MSE value.

Fig. 3 shows the results for the continuity evaluation. In this case, F_c has a continuity of 0.0, which, in conjunction with the similarity results, indicates that this objective function produces static, unusable vectors. The second smallest average MSE value is obtained with F_a, which also showed the best performance in terms of similarity. Thus, the final objective function is F_a (Eq. (7)), and we confirm that the smoothness penalty (Eq. (6)) has a positive effect on both the similarity and continuity results.

Figure 3: Average mean squared error (MSE) for different versions of our objective function. The average MSE is computed from the squared difference between vectors for the same word for every pair of consecutive timestamps (embedding size = 64).
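The two metrics of Section 5.1 can be sketched as follows: neighborhood intersection counts against the temporal tf-idf baseline (similarity) and the mean squared difference between consecutive vectors of the same word (continuity). The helper names and the cosine-based nearest-neighbor search are our assumptions for illustration.

```python
import numpy as np

def knn_indices(vectors, k=10):
    """Indices of the k nearest neighbors (by cosine similarity) of every word
    at a single timestamp; `vectors` is a |W| x |u| matrix."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # a word is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def neighborhood_intersections(model_vecs, baseline_vecs, k=10):
    """Similarity metric of Section 5.1: average overlap between the k-nearest
    neighborhoods of a model and of the temporal tf-idf baseline."""
    a, b = knn_indices(model_vecs, k), knn_indices(baseline_vecs, k)
    return float(np.mean([len(set(x) & set(y)) for x, y in zip(a, b)]))

def continuity_mse(U):
    """Continuity metric of Section 5.1: mean squared difference between vectors
    of the same word at consecutive timestamps, over the |W| x |T| x |u| tensor U."""
    return float(np.mean((U[:, 1:, :] - U[:, :-1, :]) ** 2))
```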
5.2 Capability to capture content neighborhood

A major purpose of any temporal word modeling is to capture content similarity over time. We compare three models – TWEC, Bernoulli embeddings, and our temporal word embedding – with temporal tf-idf (Camacho et al. 2018) in Fig. 4, using the PubMed (pandemic) dataset. We use temporal tf-idf (Camacho et al. 2018) for this comparison because it models content smoothly over time. Each line in the figure represents the average set-based Jaccard similarity between the 10 nearest neighbors of 1000 randomly selected entities using temporal tf-idf and the 10 nearest neighbors of the same entities using one of the three models. Fig. 4 demonstrates that our embedding model and TWEC have closer similarity to temporal tf-idf than Bernoulli embeddings. Additionally, our model has greater similarity with the neighborhood of temporal tf-idf in the earlier timestamps, compared to both TWEC and Bernoulli embeddings. Our model smoothly spreads word influence using diffusion over the years. As a result, our embedding model performs significantly better than the other methods even when the vocabulary is smaller in the earlier timestamps.

Figure 4: Jaccard similarity in the neighborhoods between temporal tf-idf (Camacho et al. 2018) and each of the three models – TWEC, Bernoulli embeddings, and our temporal word embedding model – averaged over 1000 randomly chosen entities from the 328,908 PubMed (pandemic) abstracts (embedding size = 64).
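The set-based Jaccard scores used in Fig. 4 and in Section 5.3 below could be computed as sketched here; the inputs are per-entity neighbor lists (e.g., from the knn_indices helper above), and the function names are illustrative.

```python
import numpy as np

def jaccard(a, b):
    """Set-based Jaccard similarity between two neighbor sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def avg_neighborhood_jaccard(neighbors_model, neighbors_tfidf):
    """Fig. 4 style score: average Jaccard similarity between the 10-nearest-neighbor
    sets of the same entities under a model and under temporal tf-idf."""
    return float(np.mean([jaccard(x, y) for x, y in zip(neighbors_model, neighbors_tfidf)]))

def avg_yearly_dissimilarity(neighbors_this_year, neighbors_prev_year):
    """Section 5.3 change score: 1.0 - Jaccard similarity between a word's
    neighborhoods in consecutive years, averaged over words."""
    return float(np.mean([1.0 - jaccard(x, y)
                          for x, y in zip(neighbors_this_year, neighbors_prev_year)]))
```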
5.3 Capability to detect changes in neighborhood

An objective of a temporal embedding technique is to capture changes in the neighborhood of each word over time. The ability to capture changes allows us to study the evolution of concepts. This subsection provides an experiment to investigate how much change occurs from one year to another in the neighborhood using different models. We quantify change in terms of set-based Jaccard dissimilarity (1.0 − Jaccard similarity) between the neighborhood of a word in the current year and the neighborhood of the same word in the previous year. The average Jaccard dissimilarity over many words in a certain year for a model gives an overall idea of how much the model can detect changes in the neighborhood. Fig. 5 shows the average Jaccard dissimilarity (change) at each year for five different models – our temporal embedding model, Bernoulli embeddings, TWEC, vanilla tf-idf computed independently at each year, and temporal tf-idf – using 1000 randomly selected entities from the PubMed (pandemic) dataset. The plot shows that our temporal embedding model detects more changes in terms of average Jaccard dissimilarity compared to the other models.

Figure 5: Comparison of average Jaccard dissimilarity (change) between the 10-neighbors of the current year and the previous year.

The Bernoulli embeddings capture the least amount of change. Based on further investigation (not covered in this paper), we noticed that Bernoulli embeddings rarely capture any changes. These embeddings capture only a few long-term changes, whereas our temporal embedding model captures both long-term and short-term changes. TWEC captures more changes than Bernoulli embeddings and temporal tf-idf, but fewer changes than the vanilla tf-idf. Our temporal word embedding performs even better than the vanilla tf-idf. Contextual changes are best captured using our temporal embedding because the objective function of our model spreads the effect of each word smoothly from the current year to other years. As a result, our model captures changes, in terms of average Jaccard dissimilarity, better than the regular tf-idf and temporal tf-idf models.

Our model is clearly superior in terms of the ability to capture changes. In subsection 5.4, we explain how this superiority in detecting changes in the neighborhood helps in analyzing evolving concepts, such as COVID-19.
5.4 Analyzing the neighborhood of COVID-19

In this experiment, we analyze the changes in the neighborhood of the word COVID in the PubMed (COVID) dataset. Fig. 6 presents how the similarities between the entity COVID and some of its nearest neighbors – China, epidemic, pandemic, and patients – change over time using (a) the TWEC model, (b) Bernoulli embeddings, (c) temporal tf-idf, and (d) our temporal embedding model. The data contains two-week ranges from January to July of 2020. As we already know, COVID-19 is, as of August 2020, considered a pandemic – which is a global outbreak rather than a local epidemic (Steffens 2020). In Fig. 6, we observe that (c) the temporal tf-idf and (d) our temporal embedding can detect the rising trend of pandemic and the falling trend of the word epidemic. This observation matches our knowledge regarding COVID-19. TWEC (Fig. 6a) is able to track this to some degree, but with zigzag patterns in the trends. Bernoulli embeddings (Fig. 6b) give higher similarity for pandemic than for epidemic with the word COVID, which is correct in July, but the timeline does not demonstrate any rising or falling trends for the words pandemic and epidemic.

Our temporal embedding (Fig. 6d) demonstrates that the word China had high similarity with COVID in the beginning. The similarity started to fall by the end of March. According to our model, starting at the end of March the word epidemic started to exhibit less similarity with COVID and the word pandemic started to show higher similarity. The temporal tf-idf model (Fig. 6c) demonstrates a similar trend. These trends match our common knowledge regarding the COVID-19 pandemic. Also, TWEC (Fig. 6a) has an overall downward trend for the word China, but with zigzag movements over the timeline. Bernoulli embeddings (Fig. 6b) do not demonstrate any change and capture a static similarity for the entire timeline. We noticed that the underlying vectors in Bernoulli embeddings change, but the neighbors of a word do not change much.

We know that the number of COVID-infected patients increased over the months of 2020. Our temporal embedding model (as well as the temporal tf-idf) captures the rising similarity of the word patients in the context of COVID quite smoothly (Fig. 6d). TWEC also has an upward trend, which is less smooth. However, the Bernoulli embeddings do not demonstrate any change in the similarity between the words patients and COVID.

This experiment demonstrates that our temporal embedding model captures the short-term changes in content (as shown by temporal tf-idf) while also capturing a context that we can track smoothly to study the evolution of a concept, such as COVID. In contrast, Bernoulli embeddings construct a context that is intractable in terms of similarity. TWEC provides noisy patterns that are difficult to interpret.

Figure 6: Evolution of the word COVID in PubMed COVID-19-related abstracts published in 2020 using four different models – (a) TWEC (temporal word embedding with a compass), (b) Bernoulli embeddings, (c) the temporal tf-idf model (vector size = 32,000), and (d) our model (Eq. (7), vector size = 64). Cosine similarity is used to compute the similarity between the vectors of the word COVID and any other word.
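A trajectory like those in Fig. 6 can be read directly off the embedding tensor, as in the sketch below: for each timestamp, compute the cosine similarity between the target entity and a handful of neighbors. The vocabulary lookup, entity names, and function signature are hypothetical and only illustrate the analysis, not the authors' tooling.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_trajectory(U, vocab_index, target="covid",
                          neighbors=("china", "epidemic", "pandemic", "patients")):
    """Cosine similarity between a target entity and selected neighbors at every
    timestamp, i.e., the kind of curves plotted in Fig. 6."""
    ti = vocab_index[target]
    num_timestamps = U.shape[1]
    return {w: [cosine_sim(U[ti, t], U[vocab_index[w], t]) for t in range(num_timestamps)]
            for w in neighbors}
```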
5.5 Analysis of the word Pandemic

With the rise of the COVID-19 pandemic, it has become essential to study how biomedical scientists have dealt with pandemics in past years. Such an analysis requires a model that can capture long-term changes. In this experiment, we attempt to track the closest term to the word pandemic in each year of the PubMed (pandemic) dataset, which spans biomedical abstracts from 2000 to 2020.

Each line of Fig. 7 plots the similarity of the top nearest neighbor of the word pandemic in each year. The five lines represent similarities using five different models – Bernoulli embeddings, our temporal embedding model, temporal tf-idf, vanilla tf-idf, and TWEC. Notice that our temporal embedding model demonstrates peak similarities in 2009/2010 and in 2020, when H1N1 influenza (swine flu) and COVID-19, respectively, became prominent. This signal from our temporal embedding model reflects the fact that the worst pandemics in the last 20 years are the H1N1 influenza in 2009 (Sullivan et al. 2010) and COVID-19 in 2020 (Cucinotta and Vanelli 2020). Note that other words, like concerns in 2004 and public in 2015, are detected as the top nearest neighbors; these are not highly similar to the word pandemic, which indicates that no entities appeared too close to the word pandemic in those years.

TWEC captures influenza and H1N1 in the middle of the timeline but fails to capture COVID-related keywords in 2020 as the closest entity to pandemic. In Fig. 7, the Bernoulli model can pick up coronavirus as the nearest neighbor of pandemic, but it was not able to pick up influenza in its trend. Moreover, coronavirus appears in all the years as the top nearest neighbor of pandemic, which is not correct because the coronavirus spread started in 2019 and became a pandemic in 2020 (Cucinotta and Vanelli 2020). Temporal tf-idf and vanilla tf-idf were able to pick up coronavirus/COVID. Temporal tf-idf and vanilla tf-idf were also able to pick up influenza subtype H1N1 (swine flu), but the respective similarities were not high.

Figure 7: Cosine similarity of the top nearest neighbor of pandemic in each year using all methods, on the PubMed (pandemic) abstracts. Nearest neighbors are written with selected peaks. Our temporal embedding method provides better context for pandemic. Embedding size = 64.

Based on the experiment presented in this subsection, our temporal embedding model has the ability to separate highly contextual words (such as H1N1 and COVID) of a concept (such as pandemic) via similarity peaks. Our model helps in determining prominent neighbors of a concept in the past. Our vectors are able to construct a peak for a prominent nearest neighbor because our method models diffusion. That is, a concept that appears today affects the past and the future to some extent, regardless of whether the concept directly appears in the contents or not.

6 Conclusions

This paper introduces a new technique to generate low-dimensional temporal word embeddings for timestamped scientific documents. We compare our model with existing temporal word embeddings. Our method generates a representation that (1) can track changes observed within a short period, (2) provides a smooth evolution of the word vectors over a continuous temporal vector space, (3) uses the concept of diffusion to capture trends better than the existing models, and (4) is low-dimensional. Unlike previous models, our proposed model creates a homogeneous space over every timestamp of the embeddings. As a result, the generated vectors over timestamps can be used for prediction using conventional signal-prediction algorithms. Extrapolation of the embedding vectors to forecast the future neighborhood of a scientific concept is a future direction of this work.

References

Aitchison, J. 2013. Language change: progress or decay? Cambridge University Press.

Angulo, J.; Pederneiras, C.; Ebner, W.; Kimura, E.; and Megale, P. 1980. Concepts of diffusion theory and a graphic approach to the description of the epidemic flow of contagious disease. Public Health Rep 95(5): 478–485.

Bamler, R.; and Mandt, S. 2017. Dynamic Word Embeddings. In Proceedings of the 34th ICML, volume 70, 380–389. PMLR.

Barkan, O. 2017. Bayesian Neural Word Embedding. In AAAI, 3135–3143. San Francisco, California, USA: AAAI Press.

Camacho, R.; Dos Santos, R. F.; Hossain, M. S.; and Akbar, M. 2018. Tracking the Evolution of Words with Time-reflective Text Representations. In 2018 IEEE International Conference on Big Data (Big Data), 2088–2097. Seattle, WA, USA: IEEE.

Carlo, V. D.; Bianchi, F.; and Palmonari, M. 2019. Training Temporal Word Embeddings with a Compass. In AAAI.

Cucinotta, D.; and Vanelli, M. 2020. WHO Declares COVID-19 a Pandemic. Acta Bio-Medica: Atenei Parmensis, 157–160.

Hamilton, W. L.; Leskovec, J.; and Jurafsky, D. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proc. of ACL, volume 1, 1489–1501. Berlin, Germany.

Marina Del Rey, CA, U. 2018. Dynamic Word Embeddings for Evolving Semantic Discovery. In Proc. of ACM WSDM, 673–681.

Mihalcea, R.; and Nastase, V. 2012. Word Epoch Disambiguation: Finding How Words Change over Time. In Proc. of ACL, 259–263.

Mikolov, T.; Chen, K.; Corrado, G. S.; and Dean, J. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed Representations of Words and Phrases and their Compositionality. ArXiv abs/1310.4546.

Mitra, S.; Mitra, R.; Maity, S.; Riedl, M.; Biemann, C.; Goyal, P.; and Mukherjee, A. 2015. An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering 21: 773–798.

Naim, S. M.; Boedihardjo, A. P.; and Hossain, M. S. 2017. A scalable model for tracking topical evolution in large document collections. In IEEE BigData, 726–735.

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In BioNLP@ACL.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global Vectors for Word Representation. In Proc. of Conf. on EMNLP, 1532–1543.

Radinsky, K.; Davidovich, S.; and Markovitch, S. 2012. Learning Causality for News Events Prediction. In Proc. of Int. Conf. on WWW, 909–918.

Rosin, G. D.; Adar, E.; and Radinsky, K. 2017. Learning Word Relatedness over Time. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1168–1178.

Rudolph, M.; and Blei, D. 2018. Dynamic Embeddings for Language Evolution. In Proceedings of the 2018 World Wide Web Conference, 1003–1011.

Rudolph, M. R.; and Blei, D. M. 2017. Dynamic Bernoulli Embeddings for Language Evolution. ArXiv abs/1703.08052.

Steffens, I. 2020. A hundred days into the coronavirus disease (COVID-19) pandemic. Euro Surveill. 25(14).

Sullivan, S. J.; Jacobson, R. M.; Dowdle, W. R.; and Poland, G. A. 2010. 2009 H1N1 influenza. Mayo Clin. Proc. 85(1): 64–76.

Tang, X.; Qu, W.; and Chen, X. 2013. Semantic Change Computation: A Successive Approach, 68–81.

Wang, L. L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; et al. 2020. CORD-19: The COVID-19 Open Research Dataset.

Yogatama, D.; Wang, C.; Routledge, B. R.; Smith, N. A.; and Xing, E. 2014. Dynamic Language Models for Streaming Text. Transactions of the Association for Computational Linguistics, 181–192.

Yule, G. 2017. The study of language. Cambridge University Press.