-

Using Quantum Probability for Word Embedding Problem

0 ITMO University , Kronverksky Ave. 49, St.Petersburg, 197101, Russian Federation

Over the past years, there has been being a contradiction between the growth rate of data that is available to humanity and the possibilities of their intellectual processing. Most of the knowledge that mankind operates is stored in the form of text documents in natural languages, which are not accompanied by additional markup tools for automated text processing tools. Thus, the exponential increase of the amount of information in the baggage of knowledge of mankind is faced with the inability to process it e ectively. To resolve this contradiction, there are systems for the automatic processing of natural language data. Most intelligent data processing algorithms operate with numerical data, so the basic task of any process of working with natural language texts is to represent text units in numerical form. In our research we propose to use framework of of the quantum theory of probabilities. In this case we can operating correctly with as clean as entangled states of words. For implementation of calculation of the matrix for generalized context we using the machine learning technique, named gradient descent, and apply some of restrictions ensuring for elimination extra degrees of freedom. Our approach provides a probabilistic interpretation of research results. And it allows easy to nd a probability that word's context similar to another one. The proposed model of word will can be to integrate to di erent text data analysis processes. This paper presents the results of a comparison of our proposed method with other similar algorithms.

Word Embedding Natural Language Processing tum Probability Theory Machine Learning

QuanThe possibility of a qualitative analysis of the context of words is one of the most important tasks of our time. It is very important to nd a mathematical description that would make it possible to make predictions about the meanings of words with high accuracy. Today, the most promising approach is to describe Copyright c 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). the context of a word through its surround. We can imagine a word as a vector, which is the sum of all its contexts. And the purpose of this study is to nd a formal description of the vector space of various contexts of words on with help set of texts. In this article, to solve the problem of nding context, we use the mathematical framework of the quantum theory of probabilities. And to calculate the context in vector space, we use large sparse matrices is de ned by the power of the set of words. 2

Related Works

Today, there are a number of basic approaches to modeling such spaces. In the general case, all approaches can be divided into two large groups: distribution [1] and structural [2]. In the structural approach, parsing of text data is performed and a parsing tree is constructed, which is used to identify semantic relations between words. A set of such relations for a word de nes its semantic representation.The structural approach is most typical for knowledge base systems, for example, ontologies, in which the relations between concepts are speci ed explicitly, we can immediately determine the severity of these relations, which ultimately allows we to build some kind of analogue of the space in which the concepts are de ned and a metric for determining their similarities. However, in most cases, building knowledge bases is done manually or by automated means, which certainly a ects the speed of developing tools using such representations. In the case of the distributive approach, the Harris distribution hypothesis is used, which says that words with close semantic meanings will be more often found in texts that have similar sets of words. According to this hypothesis, the concept of a word context is introduced, which includes, in its simplest form, the set of surrounding words in a xed-size window taken in the original text sample. In this paper, we use the distribution hypothesis, and all further algorithms considered, including the approach we propose, are based on this hypothesis. All algorithms based on the distribution approach require large sets of texts. But also in most algorithms no manual markup is required. And we can build vector representations of words without a teacher. The simplest model that allows vectorization of words is the Bag-of-Word algorithm [3]. In order to obtain a vector representation of a word in this algorithm, we need to select a xed-size scanning window, and calculate the frequency of words, in such a context for a given word for the entire set of texts. The calculated word frequencies are used as values for the vector components corresponding to the word of their context. Finally, the third is an approach that uses neural networks, and which is now the most popular. In particular, the Word2Vec algorithm and its derivatives [4]. In this algorithm, a set of vectors is supplied to the input of a single-layer perceptron, each of which is a vector representing a word from the context, modeled for example by the Bag-of-Words model. At its output, we get the same vector, but for the central word in the window in question. This neural network in a hidden layer will have a vector of small dimension, about several hundred components, than the size of the entire lexicon. In a model trained in predicting contexts, a hidden layer is used to represent a word vector. Thus, the Word2Vec algortime constructs dense small-size vectors in contrast to the previous algorithms that generate sparse representations [7]. 3

Quantum probability theory in problem of vector representation of words 3.1

Passing word meanings through vector representation

With the work [8] begins the presentation of the possibilities of using the apparatus of quantum theory for information retrieval models. This paper uses quantum formalism to model documents, users, and ranking. Further, this approach was developed and a whole direction in information retrieval, called Quantum Information Retrieval, was born, which is engaged in modeling information retrieval processes using the framework of quantum mathematics and applies these models for the needs of information retrieval, for example ranking.

One of the most important areas in information retrieval is the solution of the problem of the vector representation of objects involved in information retrieval, such as words or documents. This direction is important because vector representations allow we to enter comparison operations using mathematical tools (for example, using a cosine metric), which certainly increases the interpretability of the model and the possibility of its use in other algorithms as opposed to algorithms based on classical machine learning.

Consider the problem of constructing a vector of word, that will to represent the meaning this word. To solve this problem, it is necessary to develop an algorithm, that takes a word and returns L-dimensional vector Vword = [v1v2:::vL]T in the space of real numbers as the output. This operation can be written as: a(W ) : W ! RL ( 1 ) where W is a set of all words, and w 2 W is a target word. In this paper, it is assumed that the meaning of the word in question is determined through its context, which corresponds to the Harris distribution hypothesis. If the context of the word is represented by a vector, then in this representation for the i-th position of the vector the number of repetitions of the i-th word in the dictionary W will be assigned in a window around the word w. So for the sentence "he want eat" and a dictionary consisting of these three words, the context of the word "want" will be determined by the vector v = [101]T . 3.2

Tomography of quantum states

Quantum physics allows us to describe the behavior of quantum particles that are not amenable to direct observation. So we cannot track, for example, the momentum and coordinate of an electron without changing its trajectory. Therefore, in quantum physics there is no way to describe the behavior of individual particles, and physicists tend to describe the state of ensembles of particles, resorting naturally to statistical methods.

In quantum physics, the described system can be represented by some vector j i in a Hilbert space. The distributions of all possible states that the system can take are described by the density matrix (operator), which was proposed in the works of L.D. Landau and J. von Neumann. The state of a quantum system is described in some basis of the chosen Hilbert space and the density operator is a function of the density distribution in the phase space of this system. In the general case, for a selected basis j 1i j ni, the state of a quantum system can be described by a density matrix according to the following formula: ( 2 ) ( 3 ) n = X pi j ii h ij ;

i=1 where h ij means the conjugate vector for the vector j ii, and pi is the probability of nding the system under study in the state j ii. If the system is in a pure state, then it can be uniquely described by only one vector in the selected basis and will have a density matrix = j i h j. In accordance with the classical theory of probability, a pure state corresponds to an elementary event in probability space. If the system cannot be described by a single vector, then it is described by expression ( 2 ) and this state is called mixed.

Thus the quantum probability theory uses vectors and matrices to describe quantum systems [5]. Thus, it is a convenient framework to get a vector representation of a word. Quantum probability theory, in isolation from quantum physics, can be interpreted by a geometric generalization of classical probability theory. There is an equation in this theory, which binds the state of the system under study, the expected state of the system, and the probability of observing the expected state: [6]:

hAi = T r(A ); where is the density matrix describing the distribution of the system states, A is the projector on subspace corresponding to the expected state of the system, hAi is the probability observation of the state. Matrices and A have the equal dimensions. Then, the probability that system is in state A is equal to calculating the trace of the product between density matrix D , and the projector to the subspace of the state. 3.3

Quantum probability theory analogy

In this work, we use the hypothesis that the framework of density matrices is applicable to modeling vector or matrix representations of natural language words. This hypothesis is based on two facts. First, there are experiments showing that the vectors obtained by modeling using word contexts have a quantum-like statistical structure. This gives some reason to assume that the quantum theory framework can be used to model vector representations of words. Secondly, if we represent the contexts in which words can appear (for example, bag-of-words vectors, thematic modeling vectors and others) as observable, that is, as some basic vectors, and the average for such expected is the probability of this word appearing in this context, when constructing projectors on context vectors, we can restore the density matrix for a given word. Consider the following approach. Let Ak be the matrix-projector on the subspace corresponding to some context k. This context is in set of N contexts.Consider context k, which is one of the known contexts N .To obtain such a projector, it is necessary to present the context of the word as a vector. Since the matrix ( 3 ) obtained from such vector must have the properties of a projector, the normalization condition of such vector is necessary.

L L T r(Ak ) = X X aij i=1 j=1 ji = Pk where PAk is the probability of a speci c context.

Ak = vk vT jvkj2k ;

We know the value of PAk , since we know the corpus of texts on which training is conducted. Then it is su cient to group the context vectors according to their exact coincidence and express the probability of their occurrence in terms of the frequency, and we get a probability distribution on N-contexts for the word under study. In the particular case, the probability of any such context will be 1 equal to , if all contexts are unique.

N 3.4

Projector-Matrix Search Task

Continuing the discussion of the preceding sections, the density matrix is a description of the state of the \system" corresponding to the word under study. It is this matrix that needs to be restored from expression 3, and using this to get the matrix representation of the generalized context of the word. Unlike quantum theory, we will further consider all matrices as objects over a eld of real numbers. Extending the model to the eld of complex numbers is the next step in our study. Thus, the problem of obtaining the representation of a word is reduced to the problem of nding a matrix satisfying equation 3. More detailed representation of the matrices is presented below:

2 11 : : : 1L 3 2a11 : : : a1L 3 = 64 ... . . . ... 75 ; Ak = 64 ... . . . ... 57 :

L1 : : : LL aL1 : : : aLL Thus, the following expression shows how can be to calculate the trace of the product between these matrices. ( 4 ) ( 5 ) ( 6 )

Density matrix recovery

In the work of Pivovarsky [9], we can trace the approach to obtaining density matrices based on obtaining a weighted sum of the matrices of projectors on the contexts of a word: 8> i;i 0 > ><T r = 1 >T r( 2 ) 1 > > : ij = ji; i 6= j

L L = X kAk; X k=1 k=1 = 1; k 0 In such a sum, the weights correspond to the frequencies of occurrence of the word contexts. This approach is simple from a computational point of view and it retains all the properties of the density matrix from formula: ( 7 ) ( 8 ) ( 9 ) ( 10 )

R min Q( ; P ) = DKL( jjP ) + X i( ) ! min; i=1 Here R is the number of regularizers . And the expression of the gradient in terms of the density matrix parameters for formula 10 will have the following form: However, it is not suitable for non-orthogonal contexts of words, as well as for systems that are mixed states. Such an algorithm does not restore the target distribution for contexts.

In our experiments, we determined that it is necessary to restore the density matrix based on a metric called Kullback{Leibler divergence:

DKL( jjP ) =

X k=1:::L

T r(

Ak) ln

T r( k

Ak) ;

This metric allows us to approximate the initial distribution in contexts to that which is actually in the training data. In this paper, we propose to use the approach often used in machine learning, based on solving the optimization problem by the gradient descent method. If we take metric 9 as the main objective function, and i( ) is considered as a set of regularizers for preserving the properties of density matrix 3, then we obtain the following optimization problem: T r(

Ak)

R 1)AkT + X r i( ): i=1

( 11 ) knei( k1 knei( k2 . . . kn 2 (13) (14) kn)3 kn)7 7 7 7 5 Once the expression for the gradient of the objective function is obtained, gradient descent methods can be applied to optimize and obtain the density matrix for the word. 4

Introducing a phase in computing A feature of most algorithms that reconstruct in one form or another the density matrix for processing natural language texts is that the developers of these algorithms strive to circumvent the use of the complex number eld, thus working in the eld of real numbers. However, as you know, if we take quantum probability theory as a probability theory with a quadratic measure, then degeneracy can be avoided when compiling the density matrix only when using the eld of complex numbers. An approximate solution in the eld of real numbers can be obtained according to the algorithm described above, but from the point of view of increasing the degrees of freedom in training and the accuracy of restoring the initial distribution, the use of complex numbers is a justi ed step. In this section, we will try to convert the original expressions used in solving the optimization problem to expressions working in the eld of complex numbers. During the transition to the eld of complex numbers, the initial vectors representing the basis by which the density matrix is reconstructed can be represented as follows: j ki = k1 ei k1 : : : kn ei kn = kn ei kn n=1:::N ; (12) where N is the dimension of the vector.

When receiving a projector on such a vector, we get a matrix of the following form: Ak = j ki h kj = 666 k1 6 4 2 kn

2 k1 k2ei( k2 . .

. k1ei( kn k1) k1) k1 k2 k2ei( k1

2 k2 : : : k2) : : : : : : . . . knei( kn k2) : : : k1 k2 And if we shorten the record:

h Ak = kn kmei( kn km i ; n; m 2 [1 : : : N ] : After receiving this projector expression, we can substitute it into the main expression to calculate the probability of the observed:

L P ln( T r( kAk) ) k=1 L P Akmn ln( T r( kAk) ) k=1 It should be noted that, in fact, the original expression has not changed, only now the values of Akmn are complex numbers. We write the process of deriving the partial derivative by phase:

From this expression, substituting it into the objective function ( 10 ), one can obtain expressions of partial gradients for performing gradient decent. Note that with such a statement of the problem, we do not know anything about what phase values have the basis vectors. We will also nd them using the gradient decent method, thus ful lling some semblance of automatic phase adjustment for the basis in the learning process. Therefore, we need to know two expressions for gradient decent. The rst expression is the partial derivative of the objective function with respect to the values of the destiny matrix, which we reconstruct. This expression is the rst part of the gradient of objective function ( 11 ): (16) (18) (19)

N X i [ mn m=1 kn kn ei( kn km) : (17) nm km kn ei( km kn)] The index m appears due to the fact that the phase kn appears several times in all expressions for the remaining phases. Next, we can perform the following replacement in order to simplify the recording and knowing the properties of the density matrix: mn = mn ei nm ; nm = mn; knm = kn km After performing this replacement, we can get the following expression:

N X i m=1 mn kn km hei( mn+ nm) e i( mn+ nm)i If we transforming the notation of complex exponentials in parentheses using the Euler formula, we can obtain the nal expression for the partial derivative in phase:

N 2 X m=1 mn kn km sin( mn + nm) (20) Thus, using expressions (16) and (20), we can go to the space of complex numbers to search for the density matrix with automatic phase adjustment.

Testing of algorithm

To evaluate the algorithm, the well-known WordSim353 package was chosen. This corpus consists of two parts with a total number of example strings equal to 353. Each row of this data set is a pair of English words and a set of 16 ratings of people re ecting the degree of similarity of this pair of words with each other. A couple of words can get a rating from 1 to 10, where 10 means the maximum semantic similarity between words in the opinion of a person. For evaluation needs developed of the algorithm, we used the average value of people's ratings for word pairs as a reference.

To give a comparative evaluation of the developed algorithm, two text data vectorization algorithms were chosen: an algorithm based on the idea of a word bag and tf-idf statistics and the word2vec algorithm (CBoW architecture was used) as the values of the coordinates of the context vectors. For both algorithms, the cosine distance was used as a measure of the proximity of the resulting vector representations: d(W1; W2) =

W1 W2 ; jW1j jW2j (21) The Tensor ow library was used to build the architecture and teach the Word2Vec model. The degree of closeness of the density matrices was estimated using expression 6. Formally, this expression is not suitable for obtaining the probability value that the words in question belong to the same semantic context, but the expression can still be used as an indicator of the similarity of the two density matrices. In addition, in the current implementation of the semantic tomography algorithm, phase auto-adjustment was not used due to which all the work on constructing density matrices was carried out in the eld of real numbers, which could potentially a ect the quality of the restoration of distributions over the Kullback-Leibler divergence.

As a training sample for algorithms, a slice of English Wikipedia was chosen. All texts were subjected to the following processing: paragraphs were glued together into one text, punctuation marks were deleted, and all words were rst reduced to lowercase and then stamped. All non-alphabetic characters (i.e. numbers, dashes, colons, etc.) were removed from the sample as well as words corresponding to conjunctions and prepositions. For punching, we used Porter [11] from the NLTK library [10] for the Python programming language, which is currently the standard tool for working with texts in natural languages. As a measure for comparing the algorithms with each other, the Pearson correlation coe cient was selected by evaluating the proximity of words by a person and the metric for this model. The algorithm for constructing the matrix representation is given below in the pseudo-code:

Data: Ds - set of documents, Cmax - maximum number of context clusters, - threshold for stopping gradient descent, Imax maximum number of iterations of gradient descent

Result: [ 1; 2; : : : ; L] - density matrices for size lexicon words L 1 Dict build dict(Ds)// building vocabulary of dimension L 2 BoW s make bag of words(Ds; Dict)// for each word generation of multiple word bags 3 result [] 4 // i - word index, ctxs - set bags of words 5 for (i; word; ctxs) BoW s do 6 [ctxs0; P ] clusterize(ctxs; Kmax) // context clustering, P probability distribution of clusters 7 N jctxs0j // total number of context clusters

T cont[evxjkvtkvj 8 A k : vk 2 ctxs0] // projectors on word contexts, vk - word

N 9 i N1 kP=1 Ak // initialization of the density matrix for the i-th +1, r0 0 1 : : : Imax do

N current loss P T r( i Ak) log T r(PikAk) // loss function

k=1 calculation Q( ; P ) if jcurrent loss lossj < then // if changes Q( ; P ) are minor, then stop the gradient descent break end riter

PN (log T r(PikAk) 1)AkT + PR k=1 j=1 calculation rQ( ; P ) at current point i adam grad( i; riter 1; riter) // performing adaptive gradient descent step norm( i) // matrix trace normalization if necessary r j ( i) // gradient value 19 i 20 21 22 end 23 return result end result.add( i) The comparison results of the algorithms are shown in table 1: The result of comparing algorithms for a sample of WordSim353. This article discusses the analogy between quantum tomography and the process of constructing vector representations of words for text documents. cops. Such an analogy allows we to adapt the mathematical apparatus used to describe the process of quantum tomography, which is used to describe the statistical properties of objects with quantum-like properties. From the point of view of modeling semantics, such an atematic apparatus allows one to take into account the properties of superposition and entanglement of word contexts that occur during vectorization of the analyzed word. In this case, the superposition is considered from the point of view of the dictionary - each individual word in the context is a semantic concept from the dictionary, and the analyzed word is, respectively, in a state of superposition of all words in its context, i.e. in a state of uncertainty about its meaning, expressed through the words of context. As for the analogy with a mixed state, the analyzed word is found in a number of contexts and, from this point of view, the representation of the word in the form of a density matrix, one should take into account the many encountered contexts as a mixture of density matrices. The analogy with entanglement can be carried out through the determination of the presence of correlation of the words encountered in the analyzed contexts. As for further research aimed at improving the presented vectorization algorithm, here we can distinguish a number of improvements, namely: 1. This model does not take into account the frequency characteristics of the words encountered in contexts. For example, there is no automatic ltering of the most garbage words, which can be performed using the tf-idf algorithm. The use of such normalization of context words seems to be a simple and at the same time useful step for cleaning the contexts of words; 2. Density matrices use the word bag model, learning on the contexts of the dimension of the entire lexicon, thus generating matrices of a very large dimension. Further research should be aimed at reducing the dimensionality of these objects and, as a consequence, reducing computational costs.

Harris . Distributional structure . Word , 10 ( 23 ): 146 { 162 , 1954 .

Chomsky . Three models for the description of language . IRE Transactions on Information Theory , 2 : 113 { 124 , 1956 .

3. Zhang, Y. , Rong

J. , Zhi-Hua Z.:

Understanding bag-of-words model. A statistical frame-work . International Journal of Machine Learning and Cybernetics . 1 . 43 - 52 . ( 2010 ).

4. Jeong , Y. , Song , M. : Applying content-based similarity measure to author cocitation analysis . In: Proceedings of Conference 2016 . ( 2016 ).

5. Haven , E. , Khrennikov , A. : Quantum probability and the mathematical modelling of deci-sion-making . Philosophical Transactions. Series A. Mathematical , physical, and engineering science . vol. 374 . ( 2016 ).

6. Gleason , Andrew M.: Measures on the closed subspaces of a Hilbert space . Indiana Uni-versity Mathematics Journal . 6 885 ( 1957 ).

Mikolov ,

Chen , G. Corrado, and

Dean . E cient estimation of word representations in vector space . CoRR, abs/1301.3781 , 2013 .

C. J. v.

Rijsbergen . The Geometry of Information Retrieval . Cambridge University Press, New York, NY, USA, 2004 .

Piwowarski and

Lalmas . A quantum-based model for interactive information retrieval . In L. Azzopardi, G. Kazai,

Robertson , S. Ru}ger,

Shokouhi ,

Song , and E. Yilmaz, editors, Advances in Information Retrieval Theory , pages 224 { 231 , Berlin, Heidelberg, 2009 . Springer Berlin Heidelberg.

10. E. Loper and

Bird . Nltk: The natural language toolkit . In In Proceedings of the ACL Workshop on E ective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguis.

11. P. M.F . An algorithm for su x stripping . 14 ( 3 ): 130 { 137 , Jan 1980 .