<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Sequential Modeling in Vector Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benyou Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Buccio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Melucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistical Sciences, University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>In Information Retrieval and Natural Language Processing, the representation of discrete objects, e.g., words, usually relies on embeddings in a vector space; this representation typically ignores sequential information. One instance of such sequential information is temporal evolution: for example, when the discrete objects are words, their meaning may smoothly change over time. For this reason, previous works proposed dynamic word embeddings to explicitly model this sequential information in word representations. This paper introduces a representation that relies on sinusoidal functions to capture the sequential order of discrete objects in vector space.</p>
      </abstract>
      <kwd-group>
        <kwd>sequential modeling</kwd>
        <kwd>vector space</kwd>
        <kwd>dynamic word embedding</kwd>
        <kwd>sinusoidal functions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In Information Retrieval and Natural Language Processing, discrete objects such as words are
usually represented as embeddings in a vector space [1]. However, such an embedding cannot deal
with the spatially or temporally sequential information of objects. One spatial scenario is encoding
word order in bag-of-words neural networks like the Transformer [2, 3]. Regarding the temporal
scenario, word meaning may change over time [4]. For instance, the word gay shifted from the
meaning cheerful in the 1900s to the meaning frolicsome in the 1950s and finally to the meaning
homosexual since the 1990s [5].</p>
      <p>In this work, we will focus on the temporally sequential aspect: temporal evolution. This
work adopts sinusoidal functions to encode the sequential evolution of word meaning in vector
space. The advantages over existing methods might be: 1) it is more efficient, since the proposed
method does not need to maintain a copy of the word representation for each timestamp, as
required by previous works [6]; 2) it can be more effective at modeling semantic evolution since,
thanks to their continuity, such functions can deal with long-term but gradual meaning changes.
In Section 3 we will show how the proposed method could approximate any word meaning
evolution.</p>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption>
          <p>Binary coding: the orders of 16 numbers (0–15) are encoded as the four-digit binary
numbers 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101,
1110, 1111. Observe that the last digit (in red) is a periodical sequence [0, 1, ...] with a period
of 2, the second-to-last digit (in blue) is a periodical sequence [0, 0, 1, 1, ...] with a period
of 4, and so on. The example is from
https://kazemnejad.com/blog/transformer_architecture_positional_encoding.</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition</title>
      <p>An object-agnostic order (e.g., position or time) embedding [7, 2] is defined as:</p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>g : \mathbb{R} \to \mathbb{R}^{d}</tex-math>
      </disp-formula>
      <p>One may consider binary coding for order embedding. However, binary coding is not
differentiable and thus unfriendly to neural networks. To this end, one may consider designing
a continuous coding with the same periodical property. Fig. 1 shows an alternative sinusoidal
encoding [2, 8] with periods of 2, 4, 8, 16. Such continuity will facilitate back-propagation if
such an embedding is used in neural networks.</p>
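      <p>A minimal NumPy sketch of such a continuous sinusoidal order encoding, assuming sin/cos
pairs whose periods are 2, 4, 8, 16; the function name and values are illustrative, not a trained
encoding:</p>
      <preformat>
# A continuous analogue of the binary coding in Fig. 1: an order value t is
# encoded as sin/cos pairs with periods 2, 4, 8, 16 (illustrative choice).
import numpy as np

def order_encoding(t, periods=(2, 4, 8, 16)):
    """Object-agnostic order embedding g: R -> R^d with d = 2 * len(periods)."""
    omegas = 2 * np.pi / np.asarray(periods, dtype=float)  # angular frequencies
    return np.concatenate([np.sin(omegas * t), np.cos(omegas * t)])

# Unlike binary coding, the encoding is differentiable in t, so gradients
# can flow through it when it is used inside a neural network.
print(order_encoding(3.0).round(3))
      </preformat>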
      <p>Object-aware dynamic evolution. Sequential encoding becomes more challenging when
the sequential evolution is not shared among objects; for example, an individual word may
change meaning over time, but other words may not share the same trend in meaning change.
Therefore, such dynamic evolution processes are object-aware. Formally, the evolution of an
object with index i can be formalized as a mapping from the object index (in ℕ) and time
(t ∈ ℝ) to a d-dimensional vector:</p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math>f : \mathbb{N} \times \mathbb{R} \to \mathbb{R}^{d}</tex-math>
      </disp-formula>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology: Dynamic Object Embedding</title>
      <p>To smoothly model object-aware dynamic evolution, we represent each object as a continuous
function: a specific object embedding at time t is represented as the values of the function when
the variable equals t. More formally, our approach aims to learn a mapping f that maps each
object to a function over time/order:</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math>f : \mathbb{N} \to (g : \mathbb{R} \to \mathbb{R}^{d})</tex-math>
      </disp-formula>
      <p>where f maps an object, e.g., w with index i, to a function g over a variable t ∈ ℝ.
Note that the output of g is a d-dimensional vector, g(t) ∈ ℝ^d. Let us denote f(w) as g_w.
An object w at time t is then represented as a d-dimensional vector:</p>
      <disp-formula id="eq4">
        <label>(4)</label>
        <tex-math>U_{w,t} = f(w)(t) = g_{w}(t)</tex-math>
      </disp-formula>
      <p>Examples of g are linear functions g(t) = b + kt with parameters b, k ∈ ℝ^d, or
sinusoidal functions g(t) = b + r sin(ωt + φ) with parameters b, r, ω, φ ∈ ℝ^d.</p>
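      <p>A small sketch of these two example families for g, with illustrative (untrained)
parameter values:</p>
      <preformat>
# Two example families for g (Eq. 3): a linear function g(t) = b + k t and a
# sinusoidal function g(t) = b + r sin(omega t + phi); values are illustrative.
import numpy as np

d = 4                                    # embedding dimensionality
b = np.zeros(d)                          # bias b in R^d
k = np.full(d, 0.1)                      # linear slope k in R^d
r = np.ones(d)                           # amplitude r in R^d
omega = np.array([0.5, 0.5, 1.0, 1.0])   # frequencies omega in R^d
phi = np.zeros(d)                        # phases phi in R^d

def g_linear(t):
    return b + k * t

def g_sinusoidal(t):
    return b + r * np.sin(omega * t + phi)

print(g_linear(2.0), g_sinusoidal(2.0))
      </preformat>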
      <p>A typical way to obtain word vectors is factoring positive point-wise mutual information
(PPMI) matrices [9]. Note that in a temporal scenario, PPMI matrices also change over time.
Assume that the PPMI between a word pair (w_i, w_j) at time t is PPMI_{i,j}(t); our goal is to
approximate PPMI_{i,j}(t) by a dot product between the dynamic word embedding of w_i, denoted
f(w_i)(t) ∈ ℝ^d, and a static compass [10] of w_j, denoted h(w_j) = v_j ∈ ℝ^d:</p>
      <disp-formula id="eq5">
        <label>(5)</label>
        <tex-math>\mathrm{PPMI}_{i,j}(t) \approx f(w_i)(t) \cdot h(w_j)</tex-math>
      </disp-formula>
      <p>Sinusoidal Parameterization. We formalize f(w_i)(t) as sinusoidal functions, i.e., a
mixture of cosine and sine functions plus a bias term:</p>
      <disp-formula id="eq6">
        <label>(6)</label>
        <tex-math>f(w_i)(t) = [\, b_{i,1} + r_{i,1}\sin(\omega_1 t);\; b_{i,2} + r_{i,2}\cos(\omega_1 t);\; \ldots;\; b_{i,d-1} + r_{i,d-1}\sin(\omega_{d/2} t);\; b_{i,d} + r_{i,d}\cos(\omega_{d/2} t)\,]</tex-math>
      </disp-formula>
      <p>Eq. 5 will then result in:</p>
      <disp-formula id="eq7">
        <label>(7)</label>
        <tex-math>f(w_i)(t) \cdot h(w_j) = \underbrace{\sum_{k=1}^{d} b_{i,k} v_{j,k}}_{\Delta} + \sum_{k=1}^{d/2} \big[ \underbrace{r_{i,2k-1} v_{j,2k-1}}_{\alpha_{i,j,k}} \sin(\omega_k t) + \underbrace{r_{i,2k} v_{j,2k}}_{\beta_{i,j,k}} \cos(\omega_k t) \big]</tex-math>
      </disp-formula>
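      <p>A numeric sanity check of Eq. 7 with randomly chosen, illustrative parameters (not
trained ones): the dot product of the sinusoidal embedding of Eq. 6 with a static compass vector
equals the constant Δ plus a weighted sum of sinusoids:</p>
      <preformat>
# Sanity check of Eq. 7: f(w_i)(t) . h(w_j) = Delta + weighted sinusoids.
import numpy as np

rng = np.random.default_rng(0)
d = 6
b = rng.normal(size=d)                      # biases b_{i,1..d}
r = rng.normal(size=d)                      # amplitudes r_{i,1..d}
omega = rng.uniform(0.1, 2.0, size=d // 2)  # frequencies omega_{1..d/2}
v = rng.normal(size=d)                      # static compass h(w_j) = v_j

def f(t):
    """Eq. 6: interleaved sin/cos components sharing frequency omega_k."""
    out = np.empty(d)
    out[0::2] = b[0::2] + r[0::2] * np.sin(omega * t)
    out[1::2] = b[1::2] + r[1::2] * np.cos(omega * t)
    return out

t = 1.7
lhs = f(t) @ v                              # left-hand side of Eq. 7
delta = b @ v                               # constant term Delta
alpha = r[0::2] * v[0::2]                   # sin coefficients alpha_{i,j,k}
beta = r[1::2] * v[1::2]                    # cos coefficients beta_{i,j,k}
rhs = delta + np.sum(alpha * np.sin(omega * t) + beta * np.cos(omega * t))
assert np.isclose(lhs, rhs)
      </preformat>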
      <p>Therefore, PPMI_{i,j}(t) is a weighted sum of sinusoidal functions plus a constant
term Δ, i.e., PPMI_{i,j}(t) = Δ + ∑_{k=1}^{d/2} [α_{i,j,k} sin(ω_k t) + β_{i,j,k} cos(ω_k t)],
where {α_{i,j,k}}_{k=1}^{d/2} and {β_{i,j,k}}_{k=1}^{d/2} are the coefficients and
{ω_k}_{k=1}^{d/2} are the corresponding frequencies. [11] states that linear combinations
of sine and cosine functions could approximate all continuous functions in C(I_n). Thus, Eq. 7
could approximate any PPMI_{i,j}(t) ∈ C(I_n), and therefore capture any word meaning evolution.
Static object vectors, e.g., [12], can be considered a special case of constant functions, g = b,
or a specific case of sinusoidal functions when r_i = 0 or ω_i is small enough. The additional
parameters ω_i and r_i are expected to capture the dynamic aspect of word meaning evolution.
Intuitively, long periods reflect long-range evolution, although in practice such sinusoidal
functions would not necessarily be periodical with an extremely long period in a limited
timespan [13].</p>
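      <p>To illustrate the approximation claim, the following sketch fits the coefficients Δ,
α_k, β_k of Eq. 7 to a synthetic, non-periodic trajectory by ordinary least squares over a set
of fixed frequencies; the target and all values are illustrative, not real PPMI data:</p>
      <preformat>
# Fitting Eq. 7's coefficients to an arbitrary smooth trajectory by least
# squares, as an illustration of the approximation claim (synthetic target).
import numpy as np

t = np.linspace(0.0, 10.0, 200)
target = 1.0 / (1.0 + np.exp(-(t - 5.0)))   # smooth, non-periodic drift
omega = 2 * np.pi * np.arange(1, 9) / 20.0  # fixed frequencies omega_1..omega_8

# Design matrix [1, sin(w_1 t), cos(w_1 t), ..., sin(w_8 t), cos(w_8 t)]
X = np.column_stack([np.ones_like(t)] +
                    [fn(w * t) for w in omega for fn in (np.sin, np.cos)])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
print("max abs error:", np.abs(X @ coef - target).max())
      </preformat>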
    </sec>
    <sec id="sec-4">
      <title>4. Ongoing and Future Work</title>
      <p>This paper proposes a sinusoidal parameterization to capture the sequential aspects of
objects embedded in vector space. We focused on modeling change in word meaning over time; the
considered parameterization is promising since, in principle, it could approximate any word
meaning evolution. We are currently focusing on the evaluation of the proposed approach to
investigate both its effectiveness and efficiency. Experiments will consider diverse tasks, e.g.,
temporal analogy [6] and semantic change detection [14]. Future work will consider other
discrete objects, e.g., user profiles. Moreover, further theoretical and empirical investigation
is needed to deal with the optimization issues that arise when sinusoidal activation functions
are used, i.e., infinitely many local minima [13].</p>
      <p>Acknowledgments. This work is supported by the Quantum Access and Retrieval Theory
(QUARTZ) project, which has received funding from the European Union's Horizon 2020 research
and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A vector space model for automatic indexing</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <year>1975</year>
          )
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>NIPS</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>Encoding word order in complex embeddings</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          ,
          <article-title>Representing words in vector space and beyond</article-title>
          ,
          <source>in: Quantum-Like Models for Information Retrieval and Decision-Making</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Diachronic word embeddings reveal statistical laws of semantic change</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1489</fpage>
          -
          <lpage>1501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Dynamic word embeddings for evolving semantic discovery</article-title>
          ,
          <source>in: WSDM</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>673</fpage>
          -
          <lpage>681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eghbali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sahota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poupart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brubaker</surname>
          </string-name>
          ,
          <article-title>Time2vec: Learning a vector representation of time</article-title>
          ,
          <source>arXiv preprint arXiv:1907.05321</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <article-title>On position embeddings in BERT</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Neural word embedding as implicit matrix factorization</article-title>
          ,
          <source>NIPS</source>
          <volume>27</volume>
          (
          <year>2014</year>
          )
          <fpage>2177</fpage>
          -
          <lpage>2185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Di Carlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <article-title>Training temporal word embeddings with a compass</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>6326</fpage>
          -
          <lpage>6334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cybenko</surname>
          </string-name>
          ,
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Mathematics of Control, Signals and Systems</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Parascandolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huttunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          ,
          <article-title>Taming the waves: sine as activation function in deep neural networks</article-title>
          ,
          <source>OpenReview preprint</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shoemark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Liza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <article-title>Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings</article-title>
          ,
          <source>in: EMNLP-IJCNLP</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>