Using Quantum Probability for Word Embedding Problem

Aleksandr Trukhanov[0000-0002-5054-3439], Aleksei Platonov[0000-0002-8485-1296], and Igor Bessmertny[0000-0001-6711-6399]

ITMO University, Kronverksky Ave. 49, St. Petersburg, 197101, Russian Federation
{astrukhanov, avplatonov, bessmertny}@itmo.ru

Abstract. In recent years a contradiction has emerged between the rate at which data become available to humanity and our ability to process them intelligently. Most of the knowledge that mankind operates with is stored as text documents in natural languages, which carry no additional markup for automated text processing tools. Thus, the exponential growth of recorded knowledge collides with the inability to process it effectively. Systems for automatic processing of natural language data are intended to resolve this contradiction. Most intelligent data processing algorithms operate on numerical data, so the basic task of any pipeline that works with natural language texts is to represent text units in numerical form. In our research we propose to use the framework of quantum probability theory, which allows us to operate correctly with both pure and entangled states of words. To compute the matrix of a word's generalized context we use gradient descent, a standard machine learning technique, and apply constraints that eliminate extra degrees of freedom. Our approach provides a probabilistic interpretation of the results: it makes it easy to find the probability that the context of one word is similar to that of another. The proposed word model can be integrated into various text analysis processes. This paper presents the results of a comparison of the proposed method with other similar algorithms.

Keywords: Word Embedding · Natural Language Processing · Quantum Probability Theory · Machine Learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The possibility of a qualitative analysis of word contexts is one of the most important tasks of our time. It is essential to find a mathematical description that makes it possible to predict the meanings of words with high accuracy. Today the most promising approach is to describe the context of a word through its surroundings: a word can be represented as a vector that is the sum of all of its contexts. The purpose of this study is to find a formal description of the vector space of word contexts with the help of a set of texts. In this article, to solve the problem of finding the context, we use the mathematical framework of quantum probability theory, and to compute contexts in vector space we use large sparse matrices whose dimension is defined by the cardinality of the set of words.

2 Related Works

Today there are a number of basic approaches to modeling such spaces. In the general case, all of them can be divided into two large groups: distributional [1] and structural [2]. In the structural approach, text data are parsed and a parse tree is constructed, which is used to identify semantic relations between words. A set of such relations for a word defines its semantic representation. The structural approach is most typical of knowledge base systems, for example ontologies, in which the relations between concepts are specified explicitly and their strength can be determined immediately; this ultimately allows us to build an analogue of a space in which the concepts are defined, together with a metric for determining their similarity. However, in most cases knowledge bases are built manually or semi-automatically, which certainly limits the speed of developing tools that use such representations.

In the distributional approach, the Harris distributional hypothesis is used, which states that words with close semantic meanings tend to occur in texts with similar sets of surrounding words. This hypothesis introduces the notion of a word context, which in its simplest form is the set of surrounding words within a fixed-size window taken from the original text sample. In this paper we rely on the distributional hypothesis, and all algorithms considered below, including the approach we propose, are based on it.

All algorithms based on the distributional approach require large collections of texts, but most of them need no manual markup, so vector representations of words can be built in an unsupervised way. The simplest model that allows vectorization of words is the Bag-of-Words algorithm [3]. To obtain a vector representation of a word in this model, we select a fixed-size scanning window and count, over the entire collection of texts, the frequencies of the words that appear in this window around the given word. The calculated frequencies are used as the values of the vector components corresponding to the words of the context.
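As an illustration of this counting scheme, a minimal sketch could look as follows (Python; the tokenization, window size, and toy sentences are our simplifying assumptions and not part of the algorithms compared in this paper):

```python
from collections import Counter

def bow_context_vectors(sentences, window=2):
    """For every word, count the frequencies of the words seen in a
    fixed-size window around it (bag-of-words contexts)."""
    vocab = sorted({w for s in sentences for w in s})
    counts = {w: Counter() for w in vocab}
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for ctx in sent[lo:pos] + sent[pos + 1:hi]:
                counts[word][ctx] += 1
    # turn the counters into dense vectors over the vocabulary
    vectors = {w: [counts[w][v] for v in vocab] for w in vocab}
    return vocab, vectors

sentences = [["he", "want", "eat"], ["he", "want", "sleep"]]
vocab, vectors = bow_context_vectors(sentences, window=1)
print(vocab)            # ['eat', 'he', 'sleep', 'want']
print(vectors["want"])  # [1, 2, 1, 0]
```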
Finally, the third group of approaches uses neural networks and is currently the most popular; a typical representative is the Word2Vec algorithm and its derivatives [4]. In this algorithm, a set of vectors is fed to the input of a single-layer perceptron, each vector representing a word from the context and modeled, for example, with the Bag-of-Words model. At the output we obtain a vector of the same kind, but for the central word of the window in question. The hidden layer of this network has a dimension of only a few hundred components, much smaller than the size of the entire lexicon. Once the model has been trained to predict contexts, the hidden layer is used as the vector representation of a word. Thus, the Word2Vec algorithm constructs dense low-dimensional vectors, in contrast to the previous algorithms, which generate sparse representations [7].
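For reference, training such a model takes only a few lines with an off-the-shelf implementation. The sketch below uses the gensim library (version 4 or later is assumed); gensim is not used in this paper and the toy corpus is ours, so this is only an illustration of the CBoW setup:

```python
from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens
sentences = [["he", "want", "eat"], ["he", "want", "sleep"]]

# CBoW architecture (sg=0): predict the central word from its context window
model = Word2Vec(sentences, vector_size=100, window=2,
                 min_count=1, sg=0, epochs=50)

vec = model.wv["want"]                       # dense 100-dimensional vector
print(model.wv.similarity("eat", "sleep"))   # cosine similarity of two words
```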
3 Quantum probability theory in the problem of vector representation of words

3.1 Passing word meanings through vector representation

The work [8] opened up the use of the apparatus of quantum theory for information retrieval models; it applies quantum formalism to model documents, users, and ranking. This approach was later developed into a whole direction in information retrieval, called Quantum Information Retrieval, which models information retrieval processes with the framework of quantum mathematics and applies these models to information retrieval needs, for example ranking.

One of the most important areas in information retrieval is the vector representation of the objects involved in retrieval, such as words or documents. This direction is important because vector representations allow us to introduce comparison operations by mathematical means (for example, a cosine metric), which increases the interpretability of the model and the possibility of using it in other algorithms, as opposed to algorithms based on classical machine learning.

Consider the problem of constructing a vector of a word that represents the meaning of this word. To solve it, we need an algorithm that takes a word and returns an L-dimensional real vector V_word = [v_1 v_2 ... v_L]^T as the output. This operation can be written as:

a(w) : W \to \mathbb{R}^{L}, \qquad (1)

where W is the set of all words and w ∈ W is the target word. In this paper it is assumed that the meaning of the word in question is determined through its context, which corresponds to the Harris distributional hypothesis. If the context of a word is represented by a vector, then the i-th position of this vector holds the number of occurrences of the i-th dictionary word within a window around the word w. So for the sentence "he want eat" and a dictionary consisting of these three words, the context of the word "want" is given by the vector v = [1 0 1]^T.

3.2 Tomography of quantum states

Quantum physics describes the behavior of quantum particles that are not amenable to direct observation: we cannot track, for example, the momentum and coordinate of an electron without changing its trajectory. Therefore, quantum physics has no way to describe the behavior of individual particles, and physicists describe the state of ensembles of particles, naturally resorting to statistical methods.

In quantum physics, the system under study can be represented by some vector |Ψ⟩ in a Hilbert space. The distribution over all possible states that the system can take is described by the density matrix (operator), which was proposed in the works of L.D. Landau and J. von Neumann. The state of a quantum system is described in some basis of the chosen Hilbert space, and the density operator is a function of the density distribution in the phase space of this system. In the general case, for a selected basis |Ψ_1⟩, ..., |Ψ_n⟩, the state of a quantum system is described by a density matrix according to the following formula:

\rho = \sum_{i=1}^{n} p_i \, |\Psi_i\rangle\langle\Psi_i| , \qquad (2)

where ⟨Ψ_i| denotes the conjugate vector of |Ψ_i⟩, and p_i is the probability of finding the system under study in the state |Ψ_i⟩. If the system is in a pure state, it can be uniquely described by a single vector in the selected basis and has the density matrix ρ = |Ψ⟩⟨Ψ|. In accordance with classical probability theory, a pure state corresponds to an elementary event in the probability space. If the system cannot be described by a single vector, it is described by expression (2) and its state is called mixed.

Thus, quantum probability theory uses vectors and matrices to describe quantum systems [5], which makes it a convenient framework for obtaining a vector representation of a word. Taken in isolation from quantum physics, quantum probability theory can be interpreted as a geometric generalization of classical probability theory. There is an equation in this theory that binds the state of the system under study, the expected state of the system, and the probability of observing the expected state [6]:

\langle A \rangle = \mathrm{Tr}(A\rho), \qquad (3)

where ρ is the density matrix describing the distribution of the states of the system Ψ, A is the projector onto the subspace corresponding to the expected state of the system, and ⟨A⟩ is the probability of observing that state; the matrices ρ and A have equal dimensions. Thus, the probability that the system Ψ is in state A is given by the trace of the product of the density matrix ρ and the projector onto the subspace of that state.
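As a small numerical illustration of (2) and (3), the following sketch builds a mixed-state density matrix from two toy pure states and computes the probability of observing one of them via the trace formula (numpy and the example vectors are our assumptions):

```python
import numpy as np

def density_matrix(states, probs):
    """rho = sum_i p_i |psi_i><psi_i| for the given state vectors, eq. (2)."""
    dim = len(states[0])
    rho = np.zeros((dim, dim))
    for psi, p in zip(states, probs):
        psi = psi / np.linalg.norm(psi)   # normalize |psi_i|
        rho += p * np.outer(psi, psi)     # p_i |psi_i><psi_i|
    return rho

# a mixture of two toy pure states with probabilities 0.7 and 0.3
psi1 = np.array([1.0, 0.0, 1.0])
psi2 = np.array([0.0, 1.0, 0.0])
rho = density_matrix([psi1, psi2], [0.7, 0.3])

# projector A onto the subspace spanned by psi1
a = psi1 / np.linalg.norm(psi1)
A = np.outer(a, a)

print(np.trace(rho))      # ~1.0: rho has unit trace
print(np.trace(A @ rho))  # <A> = Tr(A rho) = 0.7, the probability of observing psi1
```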
3.3 Quantum probability theory analogy

In this work we use the hypothesis that the framework of density matrices is applicable to modeling vector or matrix representations of natural language words. This hypothesis rests on two observations. First, there are experiments showing that vectors obtained by modeling word contexts have a quantum-like statistical structure, which gives some reason to assume that the quantum theory framework can be used to model vector representations of words. Second, if we treat the contexts in which a word can appear (for example, bag-of-words vectors, topic modeling vectors and others) as observables, that is, as some basis vectors whose expectation values are the probabilities of the word appearing in these contexts, then by constructing projectors onto the context vectors we can reconstruct the density matrix of the given word.

Consider the following approach. Let A_k be the projector matrix onto the subspace corresponding to some context k, one of the N known contexts. To obtain such a projector, the context of the word must be represented as a vector, and since the matrix entering (3) must have the properties of a projector, this vector has to be normalized:

A_k = \frac{v_k v_k^{T}}{|v_k|^{2}} . \qquad (4)

We know the value of P_{A_k}, since we know the corpus of texts on which training is conducted: it is sufficient to group the context vectors by exact coincidence and express the probability of each of them through its frequency, which gives a probability distribution over the N contexts of the word under study. In the particular case when all contexts are unique, the probability of each context equals 1/N.

3.4 Projector-Matrix Search Task

Continuing the discussion of the preceding sections, the density matrix ρ describes the state of the "system" corresponding to the word under study. It is this matrix that has to be reconstructed from expression (3), and from it we obtain the matrix representation of the generalized context of the word. Unlike quantum theory, we will for now treat all matrices as objects over the field of real numbers; extending the model to the field of complex numbers is the next step of our study. Thus, the problem of obtaining the representation of a word is reduced to the problem of finding a matrix that satisfies equation (3). The matrices are written out in more detail below:

\rho = \begin{pmatrix} \rho_{11} & \ldots & \rho_{1L} \\ \vdots & \ddots & \vdots \\ \rho_{L1} & \ldots & \rho_{LL} \end{pmatrix}, \qquad A_k = \begin{pmatrix} a_{11} & \ldots & a_{1L} \\ \vdots & \ddots & \vdots \\ a_{L1} & \ldots & a_{LL} \end{pmatrix}. \qquad (5)

The following expression shows how the trace of the product of these matrices is calculated:

\mathrm{Tr}(A_k \rho) = \sum_{i=1}^{L} \sum_{j=1}^{L} a_{ij} \, \rho_{ji} = P_{A_k}, \qquad (6)

where P_{A_k} is the probability of the specific context.
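Under these definitions, preparing the inputs for the reconstruction described in the next subsection amounts to building one projector per distinct context and estimating its frequency. A possible sketch (the function name and toy data are ours):

```python
import numpy as np
from collections import Counter

def context_projectors(context_vectors):
    """Group identical context vectors, estimate their empirical
    probabilities nu_k and build the projectors A_k = v v^T / |v|^2, eq. (4)."""
    groups = Counter(tuple(v) for v in context_vectors)
    total = sum(groups.values())
    projectors, probs = [], []
    for vec, count in groups.items():
        v = np.asarray(vec, dtype=float)
        projectors.append(np.outer(v, v) / v.dot(v))  # eq. (4)
        probs.append(count / total)                   # frequency of this context
    return projectors, np.array(probs)

# toy contexts of one word over a three-word dictionary
contexts = [[1, 0, 1], [1, 0, 1], [0, 1, 1]]
A, nu = context_projectors(contexts)
print(nu)              # [0.667, 0.333]: the empirical distribution over contexts
print(np.trace(A[0]))  # 1.0: each A_k is a rank-1 projector
```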
3.5 Density matrix recovery

In the work of Piwowarski [9] one can trace an approach that obtains the density matrix as a weighted sum of the projectors onto the contexts of a word:

\rho = \sum_{k=1}^{L} \nu_k A_k, \qquad \sum_{k=1}^{L} \nu_k = 1, \qquad \nu_k \geq 0, \qquad (7)

where the weights correspond to the frequencies of occurrence of the word contexts. This approach is computationally simple and preserves all the properties of a density matrix:

\rho_{ii} \geq 0, \qquad \mathrm{Tr}\,\rho = 1, \qquad \mathrm{Tr}(\rho^{2}) \leq 1, \qquad \rho_{ij} = \rho_{ji}, \; i \neq j. \qquad (8)

However, it is not suitable for non-orthogonal word contexts, nor for systems in mixed states: such an algorithm does not reconstruct the target distribution over the contexts. In our experiments we determined that the density matrix should be reconstructed on the basis of a metric called the Kullback-Leibler divergence:

D_{KL}(\rho \| P) = - \sum_{k=1}^{L} \mathrm{Tr}(\rho A_k) \cdot \ln \frac{\nu_k}{\mathrm{Tr}(\rho A_k)} . \qquad (9)

This metric allows us to bring the distribution over contexts induced by ρ close to the one actually observed in the training data. In this paper we propose to use an approach common in machine learning: solving an optimization problem by the gradient descent method. If we take metric (9) as the main objective function and treat λ_i(ρ) as a set of regularizers that preserve the density matrix properties (8), we obtain the following optimization problem:

Q(\rho, P) = D_{KL}(\rho \| P) + \sum_{i=1}^{R} \lambda_i(\rho) \rightarrow \min_{\rho}, \qquad (10)

where R is the number of regularizers λ. The gradient of (10) with respect to the parameters of the density matrix then has the following form:

\nabla Q(\rho, P) = \sum_{k=1}^{K} \left( \ln \frac{\nu_k}{\mathrm{Tr}(\rho A_k)} - 1 \right) A_k^{T} + \sum_{i=1}^{R} \nabla \lambda_i(\rho). \qquad (11)

Once the expression for the gradient of the objective function is obtained, gradient descent methods can be applied to optimize it and obtain the density matrix for the word.
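A minimal real-valued sketch of this recovery step is given below. It differs from the exact formulas above in two hedged ways: the loss is the Kullback-Leibler divergence between the empirical frequencies ν_k and the model probabilities Tr(ρA_k) (a common variant of (9)), and the properties (8) are maintained by symmetrization and trace renormalization instead of explicit regularizers λ_i:

```python
import numpy as np

def recover_density_matrix(projectors, nu, lr=0.1, iters=2000):
    """Gradient-descent sketch for recovering rho from context projectors.
    Loss: KL(nu || q) with q_k = Tr(rho A_k); the constraints are enforced
    by symmetrizing rho and renormalizing its trace after every step."""
    rho = sum(p * A for p, A in zip(nu, projectors))  # initialization as in eq. (7)
    for _ in range(iters):
        q = np.array([np.trace(rho @ A) for A in projectors])   # Tr(rho A_k)
        grad = -sum((nk / qk) * A.T                              # gradient of KL(nu || q) w.r.t. rho
                    for nk, qk, A in zip(nu, q, projectors))
        rho = rho - lr * grad
        rho = (rho + rho.T) / 2.0      # keep rho symmetric
        rho = rho / np.trace(rho)      # keep Tr(rho) = 1
    return rho

# toy inputs: two non-orthogonal contexts over a three-word dictionary
a1, a2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
A = [np.outer(v, v) / v.dot(v) for v in (a1, a2)]
nu = np.array([2 / 3, 1 / 3])

rho = recover_density_matrix(A, nu)
print(np.trace(rho))                     # ~1.0: unit trace is preserved
print([np.trace(rho @ Ak) for Ak in A])  # model context probabilities, to compare with nu
```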
4 Introducing a phase in computing

A feature of most algorithms that reconstruct a density matrix, in one form or another, for natural language processing is that their developers strive to avoid the field of complex numbers and work over the real numbers. However, if quantum probability theory is regarded as a probability theory with a quadratic measure, degeneracy in the density matrix can be avoided only in the field of complex numbers. An approximate solution over the real numbers can be obtained with the algorithm described above, but from the point of view of increasing the degrees of freedom in training and the accuracy of reconstructing the initial distribution, the use of complex numbers is a justified step. In this section we convert the expressions used in the optimization problem into expressions over the field of complex numbers.

In the transition to the field of complex numbers, the initial vectors representing the basis from which the density matrix is reconstructed can be written as follows:

|\psi_k\rangle = \left[ \psi_{k1} e^{i\phi_{k1}} \; \ldots \; \psi_{kn} e^{i\phi_{kn}} \right] = \left[ \psi_{kn} e^{i\phi_{kn}} \right]_{n=1\ldots N}, \qquad (12)

where N is the dimension of the vector. Constructing the projector onto such a vector, we obtain a matrix of the following form:

A_k = |\psi_k\rangle\langle\psi_k| = \begin{pmatrix} \psi_{k1}^{2} & \psi_{k1}\psi_{k2} e^{i(\phi_{k1}-\phi_{k2})} & \ldots & \psi_{k1}\psi_{kn} e^{i(\phi_{k1}-\phi_{kn})} \\ \psi_{k1}\psi_{k2} e^{i(\phi_{k2}-\phi_{k1})} & \psi_{k2}^{2} & \ldots & \psi_{k2}\psi_{kn} e^{i(\phi_{k2}-\phi_{kn})} \\ \vdots & \vdots & \ddots & \vdots \\ \psi_{kn}\psi_{k1} e^{i(\phi_{kn}-\phi_{k1})} & \psi_{kn}\psi_{k2} e^{i(\phi_{kn}-\phi_{k2})} & \ldots & \psi_{kn}^{2} \end{pmatrix}, \qquad (13)

or, in shortened form,

A_k = \left[ \psi_{kn} \psi_{km} e^{i(\phi_{kn}-\phi_{km})} \right], \qquad n, m \in [1 \ldots N]. \qquad (14)

Substituting this expression for the projector into the main expression, we can calculate the probability of the observable:

\mathrm{Tr}(\rho A_k) = \sum_{m=1}^{N} \sum_{n=1}^{N} \rho_{mn} A_{k,nm} = \sum_{m=1}^{N} \sum_{n=1}^{N} \rho_{mn} \psi_{km} \psi_{kn} e^{i(\phi_{km}-\phi_{kn})}. \qquad (15)

Substituting this into the objective function (10), one can obtain the partial gradients needed for gradient descent. Note that in this statement of the problem we know nothing about the phase values of the basis vectors; we find them with the gradient descent method as well, thus performing a kind of automatic phase adjustment of the basis during training. Therefore, we need two expressions for gradient descent. The first is the partial derivative of the objective function with respect to the entries of the density matrix being reconstructed; it is the first part of the gradient of the objective function (11) and is obtained by the chain rule through Tr(ρA_k):

\frac{\partial D_{KL}(\rho\|P)}{\partial\,\mathrm{Tr}(\rho A_k)} \approx \ln\frac{\nu_k}{\mathrm{Tr}(\rho A_k)}, \qquad \frac{\partial D_{KL}(\rho\|P)}{\partial \rho_{nm}} \approx \sum_{k=1}^{L} A_{k,mn} \cdot \ln\frac{\nu_k}{\mathrm{Tr}(\rho A_k)}. \qquad (16)

It should be noted that the original expression has in fact not changed; only the values A_{k,mn} are now complex numbers. Next we write out the derivation of the partial derivative with respect to the phase:

\frac{\partial D_{KL}}{\partial \phi_{kn}} = i \sum_{m=1}^{N} \left[ \rho_{mn} \psi_{kn} \psi_{km} e^{i(\phi_{kn}-\phi_{km})} - \rho_{nm} \psi_{km} \psi_{kn} e^{i(\phi_{km}-\phi_{kn})} \right]. \qquad (17)

The index m appears here because the phase φ_{kn} occurs several times in the expressions for the remaining phases. Next, knowing the properties of the density matrix, we can perform the following substitution to simplify the notation:

\rho_{mn} = |\rho_{mn}| e^{i\alpha_{nm}}, \qquad \rho_{nm} = \rho_{mn}^{*}, \qquad \Delta\phi_{knm} = \phi_{kn} - \phi_{km}. \qquad (18)

After this substitution we obtain:

\frac{\partial D_{KL}}{\partial \phi_{kn}} = i \sum_{m=1}^{N} |\rho_{mn}| \psi_{kn} \psi_{km} \left[ e^{i(\alpha_{mn}+\Delta\phi_{nm})} - e^{-i(\alpha_{mn}+\Delta\phi_{nm})} \right]. \qquad (19)

Transforming the complex exponentials in brackets with the Euler formula, we obtain the final expression for the partial derivative with respect to the phase:

\frac{\partial D_{KL}}{\partial \phi_{kn}} = -2 \sum_{m=1}^{N} |\rho_{mn}| \psi_{kn} \psi_{km} \sin(\alpha_{mn}+\Delta\phi_{nm}). \qquad (20)

Thus, using expressions (16) and (20), we can move to the space of complex numbers when searching for the density matrix, with automatic phase adjustment.
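A small numpy sketch of the complex-valued construction (12)-(15): amplitudes and phases define the basis vector, the projector is the outer product with the complex conjugate, and the observed probability is still the trace of the product (real for a Hermitian ρ). The array names and toy values are ours:

```python
import numpy as np

def complex_projector(amplitudes, phases):
    """Build |psi><psi| for psi_n = amplitude_n * exp(i * phase_n), eqs. (12)-(13)."""
    psi = amplitudes * np.exp(1j * phases)
    psi = psi / np.linalg.norm(psi)   # normalization, as in eq. (4)
    return np.outer(psi, psi.conj())

amp = np.array([1.0, 0.0, 1.0])
phi = np.array([0.0, 0.3, 1.2])       # phases that would be tuned by gradient descent
A = complex_projector(amp, phi)

# a toy Hermitian density matrix with unit trace
rho = np.diag([0.5, 0.3, 0.2]).astype(complex)

p = np.trace(rho @ A)                 # eq. (15)
print(np.allclose(A, A.conj().T))     # True: the projector is Hermitian
print(p.real)                         # observed probability; the imaginary part is ~0
```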
5 Testing of the algorithm

To evaluate the algorithm, the well-known WordSim353 data set was chosen. This corpus consists of two parts with a total of 353 example rows. Each row of the data set is a pair of English words and a set of 16 human ratings reflecting the degree of similarity of this pair of words. A pair of words can receive a rating from 1 to 10, where 10 means the maximum semantic similarity between the words in the opinion of the annotator. For evaluating the developed algorithm, we used the average human rating of each word pair as a reference.

For a comparative evaluation of the developed algorithm, two text vectorization algorithms were chosen: an algorithm based on the bag-of-words idea with tf-idf statistics used as the values of the coordinates of the context vectors, and the word2vec algorithm (the CBoW architecture was used). For both algorithms, the cosine measure was used to assess the proximity of the resulting vector representations:

d(W_1, W_2) = \frac{W_1 \cdot W_2}{|W_1|\,|W_2|} . \qquad (21)

The TensorFlow library was used to build the architecture and train the Word2Vec model. The degree of closeness of the density matrices was estimated using expression (6). Formally, this expression is not suitable for obtaining the probability that the words in question belong to the same semantic context, but it can still be used as an indicator of the similarity of two density matrices. In addition, in the current implementation of the semantic tomography algorithm, phase auto-adjustment was not used, so all the work on constructing density matrices was carried out in the field of real numbers, which could potentially affect the quality of reconstructing the distributions with respect to the Kullback-Leibler divergence.

As a training sample for the algorithms, a slice of the English Wikipedia was chosen. All texts were subjected to the following processing: paragraphs were merged into one text, punctuation marks were deleted, and all words were first lowercased and then stemmed. All non-alphabetic tokens (numbers, dashes, colons, etc.) were removed from the sample, as well as words corresponding to conjunctions and prepositions. For stemming, we used the Porter stemmer [11] from the NLTK library [10] for the Python programming language, which is currently a standard tool for working with texts in natural languages. As a measure for comparing the algorithms with each other, the Pearson correlation coefficient between the human similarity judgments and the similarity metric of each model was selected.

The algorithm for constructing the matrix representation is given below in pseudo-code:

Data:   Ds - set of documents, Cmax - maximum number of context clusters,
        eps - threshold for stopping gradient descent,
        Imax - maximum number of iterations of gradient descent
Result: [rho_1, rho_2, ..., rho_L] - density matrices for the L words of the lexicon

Dict <- build_dict(Ds)                        // build a vocabulary of dimension L
BoWs <- make_bag_of_words(Ds, Dict)           // for each word, generate its set of word bags
result <- []
for (i, word, ctxs) in BoWs do                // i - word index, ctxs - set of bags of words
    [ctxs', P] <- clusterize(ctxs, Cmax)      // context clustering, P - probability distribution of clusters
    N <- |ctxs'|                              // total number of context clusters
    A <- [v_k v_k^T / |v_k|^2 : v_k in ctxs'] // projectors onto word contexts, v_k - a word context
    rho_i <- (1/N) * sum_{k=1..N} A_k         // initialization of the density matrix for the i-th word
    loss <- +inf, grad_0 <- 0
    for iter in 1..Imax do
        current_loss <- sum_{k=1..N} Tr(rho_i A_k) * log(P_k / Tr(rho_i A_k))  // loss function Q(rho, P)
        if |current_loss - loss| < eps then
            break                             // if Q(rho, P) changes only slightly, stop the gradient descent
        end
        grad_iter <- sum_{k=1..N} (log(P_k / Tr(rho_i A_k)) - 1) * A_k^T
                     + sum_{j=1..R} grad lambda_j(rho_i)      // gradient of Q(rho, P) at the current point
        rho_i <- adam_grad(rho_i, grad_{iter-1}, grad_iter)   // adaptive gradient descent step
        rho_i <- norm(rho_i)                  // matrix trace normalization if necessary
        loss <- current_loss
    end
    result.add(rho_i)
end
return result

The comparison results of the algorithms are shown in Table 1.

Table 1: The result of comparing the algorithms on the WordSim353 sample

                          tf-idf         word2vec (CBoW)   QST
Measure of similarity     cos(w1, w2)    cos(w1, w2)       Tr(A rho)
Part 1                    0.0831         0.1461            0.2213
Part 2                    0.1023         0.1644            0.2107
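The evaluation protocol above can be reproduced with a few helper functions. The sketch below is our assumption of how the comparison is wired together (scipy for the Pearson correlation; the word pairs, ratings and two-dimensional vectors are only illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    """Cosine similarity of two word vectors, eq. (21)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def qst_similarity(rho, A):
    """Similarity used for the QST model: Tr(A rho), cf. eq. (6)."""
    return float(np.trace(rho @ A))

def evaluate(pairs, human_scores, similarity):
    """Pearson correlation between human ratings and model similarities."""
    model_scores = [similarity(w1, w2) for w1, w2 in pairs]
    r, _ = pearsonr(human_scores, model_scores)
    return r

# illustrative word pairs with WordSim353-style ratings and toy 2-d vectors
pairs = [("tiger", "cat"), ("book", "paper"), ("king", "cabbage")]
human = [7.35, 7.46, 0.23]
vecs = {"tiger": np.array([1.0, 0.9]), "cat": np.array([0.9, 1.0]),
        "book": np.array([1.0, 0.8]), "paper": np.array([0.8, 1.0]),
        "king": np.array([1.0, 0.0]), "cabbage": np.array([0.0, 1.0])}

print(evaluate(pairs, human, lambda a, b: cosine(vecs[a], vecs[b])))
```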
6 Conclusion

This article discusses the analogy between quantum tomography and the process of constructing vector representations of words from text documents. Such an analogy allows us to adapt the mathematical apparatus of quantum tomography, which describes the statistical properties of objects with quantum-like properties. From the point of view of modeling semantics, this mathematical apparatus makes it possible to take into account the superposition and entanglement of word contexts that arise during vectorization of the analyzed word. Superposition is considered from the point of view of the dictionary: each individual word of the context is a semantic concept from the dictionary, and the analyzed word is accordingly in a superposition of all the words of its context, i.e. in a state of uncertainty about its meaning, expressed through the words of the context. As for the analogy with a mixed state, the analyzed word occurs in a number of contexts, and from this point of view the representation of the word as a density matrix should take the many encountered contexts into account as a mixture of density matrices. The analogy with entanglement can be drawn through the presence of correlations between the words encountered in the analyzed contexts.

As for further research aimed at improving the presented vectorization algorithm, we can single out a number of improvements:

1. The model does not take into account the frequency characteristics of the words encountered in contexts. For example, there is no automatic filtering of the noisiest words, which could be performed with the tf-idf algorithm. Such normalization of context words seems to be a simple and at the same time useful step for cleaning word contexts;
2. The density matrices use the bag-of-words model and are trained on contexts of the dimension of the entire lexicon, thus producing matrices of very large dimension. Further research should be aimed at reducing the dimensionality of these objects and, as a consequence, the computational cost.

References

1. Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954)
2. Chomsky, N.: Three models for the description of language. IRE Transactions on Information Theory 2, 113–124 (1956)
3. Zhang, Y., Jin, R., Zhou, Z.-H.: Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1, 43–52 (2010)
4. Jeong, Y., Song, M.: Applying content-based similarity measure to author co-citation analysis. In: Proceedings of Conference 2016 (2016)
5. Haven, E., Khrennikov, A.: Quantum probability and the mathematical modelling of decision-making. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2016)
6. Gleason, A.M.: Measures on the closed subspaces of a Hilbert space. Indiana University Mathematics Journal 6, 885 (1957)
7. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
8. van Rijsbergen, C.J.: The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA (2004)
9. Piwowarski, B., Lalmas, M.: A quantum-based model for interactive information retrieval. In: Azzopardi, L., Kazai, G., Robertson, S., Rüger, S., Shokouhi, M., Song, D., Yilmaz, E. (eds.) Advances in Information Retrieval Theory, pp. 224–231. Springer, Berlin, Heidelberg (2009)
10. Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. In: Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics (2002)
11. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)