<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Quantum Probability for Word Embedding Problem</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Kronverksky Ave. 49, St.Petersburg, 197101, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, a contradiction has emerged between the growth rate of the data available to humanity and the capacity for its intelligent processing. Most of the knowledge that humanity operates with is stored as text documents in natural languages, which are not accompanied by additional markup for automated text processing. Thus, the exponential growth of accumulated information runs into the inability to process it effectively. Systems for the automatic processing of natural language data aim to resolve this contradiction. Most intelligent data processing algorithms operate on numerical data, so the basic task of any process that works with natural language texts is to represent text units in numerical form. In our research we propose to use the framework of quantum probability theory, which allows us to operate correctly with both pure and entangled states of words. To compute the matrix of the generalized context we use the machine learning technique of gradient descent and apply restrictions that eliminate extra degrees of freedom. Our approach provides a probabilistic interpretation of the results and makes it easy to find the probability that the context of one word is similar to that of another. The proposed word model can be integrated into different text data analysis processes. This paper presents the results of a comparison of the proposed method with other similar algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>Word Embedding</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Quantum Probability Theory</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The qualitative analysis of the context of words is one of the most important tasks of our time. It is very important to find a mathematical description that makes it possible to predict the meanings of words with high accuracy. Today, the most promising approach is to describe the context of a word through its surroundings. We can imagine a word as a vector that is the sum of all its contexts. The purpose of this study is to find a formal description of the vector space of the various contexts of words with the help of a set of texts. In this article, to solve the problem of finding the context, we use the mathematical framework of quantum probability theory. To calculate the context in the vector space, we use large sparse matrices whose dimension is defined by the cardinality of the set of words.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>Today, there are several basic approaches to modeling such spaces. In general, all approaches can be divided into two large groups: distributional [1] and structural [2]. In the structural approach, the text is parsed and a parse tree is constructed, which is used to identify semantic relations between words. The set of such relations for a word defines its semantic representation. The structural approach is most typical for knowledge base systems, for example ontologies, in which the relations between concepts are specified explicitly. We can therefore immediately determine the strength of these relations, which ultimately allows us to build an analogue of the space in which the concepts are defined, together with a metric for determining their similarity. However, in most cases knowledge bases are built manually or by automated means, which certainly affects the speed of developing tools that use such representations.</p>
      <p>The distributional approach relies on the Harris distributional hypothesis, which states that words with close meanings occur more often in texts that have similar sets of words. This hypothesis leads to the concept of a word context, which in its simplest form is the set of surrounding words in a fixed-size window taken from the original text sample. In this paper we use the distributional hypothesis, and all algorithms considered below, including the approach we propose, are based on it. All algorithms based on the distributional approach require large sets of texts. On the other hand, most of them require no manual markup, so vector representations of words can be built without supervision. The simplest model that allows vectorization of words is the Bag-of-Words algorithm [3]. To obtain a vector representation of a word with this algorithm, we select a fixed-size scanning window and count, over the entire set of texts, how often each word occurs in that window around the given word. The resulting frequencies are used as the values of the vector components corresponding to the context words.</p>
      <p>Another approach, currently the most popular, uses neural networks, in particular the Word2Vec algorithm and its derivatives [4]. In this algorithm, a set of vectors is fed to the input of a single-layer perceptron, each vector representing a word from the context, modeled for example by the Bag-of-Words model. At the output we obtain a vector of the same kind, but for the central word of the window in question. The hidden layer of this network has a much smaller dimension (several hundred components) than the size of the entire lexicon. In a model trained to predict contexts, the hidden layer is used as the word vector. Thus, the Word2Vec algorithm constructs dense low-dimensional vectors, in contrast to the previous algorithms, which generate sparse representations [7].</p>
      <p>Quantum probability theory in the problem of vector representation of words</p>
      <sec id="sec-2-1">
        <title>Passing word meanings through vector representation</title>
        <p>The use of the apparatus of quantum theory in information retrieval models was first presented in [8], where the quantum formalism is used to model documents, users, and ranking. This approach was developed further and gave rise to a whole direction in information retrieval, called Quantum Information Retrieval, which models information retrieval processes using the framework of quantum mathematics and applies these models to retrieval tasks such as ranking.</p>
        <p>One of the most important areas in information retrieval is the vector representation of the objects involved, such as words or documents. This direction is important because vector representations allow us to introduce comparison operations using mathematical tools (for example, a cosine metric), which increases the interpretability of the model and makes it easier to use in other algorithms, as opposed to algorithms based on classical machine learning.</p>
        <p>
          Consider the problem of constructing a word vector that represents the meaning of the word. To solve this problem, it is necessary to develop an algorithm that takes a word and returns an $L$-dimensional vector $V_{word} = [v_1 v_2 \ldots v_L]^T$ in the space of real numbers. This operation can be written as:
          <disp-formula><label>(1)</label><tex-math>a : W \to \mathbb{R}^L,</tex-math></disp-formula>
          where $W$ is the set of all words and $w \in W$ is a target word. In this paper, the meaning of a word is assumed to be determined by its context, in accordance with the Harris distributional hypothesis. If the context of the word is represented by a vector, then the $i$-th position of this vector holds the number of occurrences of the $i$-th word of the dictionary $W$ in a window around the word $w$. So for the sentence "he want eat" and a dictionary consisting of these three words, the context of the word "want" is given by the vector $v = [1\ 0\ 1]^T$.
        </p>
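        <p>The context vector from the example above can be reproduced with a short sketch. The function name context_vector, its window parameter and the window size of one word on each side are illustrative assumptions, not part of the described algorithm.</p>
        <preformat>
```python
from collections import Counter


def context_vector(tokens, position, vocabulary, window=1):
    """Count vocabulary words in a symmetric window around tokens[position]."""
    start = max(0, position - window)
    # Words to the left and to the right of the target word (target excluded).
    neighbours = tokens[start:position] + tokens[position + 1:position + 1 + window]
    counts = Counter(neighbours)
    return [counts[word] for word in vocabulary]


vocabulary = ["he", "want", "eat"]
tokens = ["he", "want", "eat"]
v = context_vector(tokens, tokens.index("want"), vocabulary)
print(v)  # [1, 0, 1]
```
        </preformat>
        <p>Each position of the result corresponds to one dictionary word, so the vector length equals the dictionary size.</p>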
      </sec>
      <sec id="sec-2-2">
        <title>Tomography of quantum states</title>
        <p>Quantum physics allows us to describe the behavior of quantum particles that
are not amenable to direct observation. So we cannot track, for example, the
momentum and coordinate of an electron without changing its trajectory. Therefore,
in quantum physics there is no way to describe the behavior of individual
particles, and physicists tend to describe the state of ensembles of particles, resorting
naturally to statistical methods.</p>
        <p>
          In quantum physics, the system under study can be represented by a vector $|\psi\rangle$ in a Hilbert space. The distribution over all possible states of the system is described by the density matrix (operator), proposed in the works of L.D. Landau and J. von Neumann. The state of a quantum system is described in some basis of the chosen Hilbert space, and the density operator is a function of the density distribution in the phase space of this system. In the general case, for a selected basis $|\psi_1\rangle, \ldots, |\psi_n\rangle$, the state of a quantum system is described by the density matrix
          <disp-formula><label>(2)</label><tex-math>\rho = \sum_{i=1}^{n} p_i \, |\psi_i\rangle \langle\psi_i|,</tex-math></disp-formula>
          where $\langle\psi_i|$ is the conjugate vector of $|\psi_i\rangle$ and $p_i$ is the probability of finding the system in the state $|\psi_i\rangle$. If the system is in a pure state, it can be uniquely described by a single vector in the selected basis and has the density matrix $\rho = |\psi\rangle\langle\psi|$. In terms of classical probability theory, a pure state corresponds to an elementary event in the probability space. If the system cannot be described by a single vector, it is described by expression (2), and such a state is called mixed.
        </p>
        <p>Thus, quantum probability theory uses vectors and matrices to describe quantum systems [5], which makes it a convenient framework for obtaining a vector representation of a word. Taken in isolation from quantum physics, quantum probability theory can be interpreted as a geometric generalization of classical probability theory. An equation in this theory connects the state of the system under study, the expected state of the system, and the probability of observing the expected state [6]:</p>
        <p>
          <disp-formula><label>(3)</label><tex-math>\langle A \rangle = \mathrm{Tr}(A\rho),</tex-math></disp-formula>
          where $\rho$ is the density matrix describing the distribution of the system states, $A$ is the projector onto the subspace corresponding to the expected state of the system, and $\langle A \rangle$ is the probability of observing that state. The matrices $\rho$ and $A$ have equal dimensions. Thus, the probability that the system is in the state $A$ equals the trace of the product of the density matrix $\rho$ and the projector onto the subspace of that state.</p>
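        <p>The trace rule above can be illustrated with a small numerical sketch. The helper names projector, trace_product and mix, as well as the two-dimensional toy states, are assumptions made for this illustration only.</p>
        <preformat>
```python
def projector(v):
    """Return the projector v v^T / |v|^2 as a nested list."""
    norm_sq = sum(x * x for x in v)
    return [[a * b / norm_sq for b in v] for a in v]


def trace_product(a, b):
    """Tr(a b), computed as the double sum of a_ij * b_ji."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def mix(weights, projectors):
    """Mixed state: weighted sum of projectors, weights summing to one."""
    n = len(projectors[0])
    return [[sum(w * p[i][j] for w, p in zip(weights, projectors))
             for j in range(n)] for i in range(n)]


# Two orthogonal pure states mixed with probabilities 0.25 and 0.75.
rho = mix([0.25, 0.75], [projector([1, 0]), projector([0, 1])])

p_first = trace_product(projector([1, 0]), rho)  # 0.25
p_diag = trace_product(projector([1, 1]), rho)   # 0.5
```
        </preformat>
        <p>The first value recovers the mixing probability of the first state, and the second shows that a state lying between the two basis vectors is observed with an intermediate probability.</p>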
      </sec>
      <sec id="sec-2-3">
        <title>Quantum probability theory analogy</title>
        <p>
          In this work, we use the hypothesis that the framework of density matrices is applicable to modeling vector or matrix representations of natural language words. This hypothesis is based on two observations. First, there are experiments showing that the vectors obtained by modeling with word contexts have a quantum-like statistical structure, which gives some reason to assume that the quantum theory framework can be used to model vector representations of words. Second, suppose we treat the contexts in which a word can appear (for example, bag-of-words vectors, topic modeling vectors and others) as observables, that is, as basis vectors, and take the expectation of each observable to be the probability of the word appearing in that context. Then, by constructing projectors onto the context vectors, we can restore the density matrix for a given word. Consider the following approach. Let $A_k$ be the projector matrix onto the subspace corresponding to a context $k$, which is one of $N$ known contexts. To obtain such a projector, the context of the word must be represented as a vector. Since the matrix $A$ in (3) obtained from such a vector must have the properties of a projector, the vector must be normalized.
        </p>
        <p>
          <disp-formula><label>(4)</label><tex-math>A_k = \frac{v_k v_k^T}{|v_k|^2}.</tex-math></disp-formula>
          By (3), the probability of context $k$ for a word with density matrix $\rho$ is then $\mathrm{Tr}(A_k \rho) = P_k$, where $P_k$ is the probability of a specific context.
        </p>
        <p>We know the value of $P_k$, since we know the corpus of texts on which training is conducted. It is sufficient to group the context vectors by exact coincidence and express the probability of their occurrence in terms of frequency; this gives a probability distribution over the $N$ contexts of the word under study. In the particular case where all contexts are unique, the probability of each context is $\frac{1}{N}$.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Projector-Matrix Search Task</title>
        <p>Continuing the discussion of the preceding sections, the density matrix $\rho$ describes the state of the "system" corresponding to the word under study. It is this matrix that needs to be restored from expression (3) in order to obtain the matrix representation of the generalized context of the word. Unlike quantum theory, we consider all matrices as objects over the field of real numbers; extending the model to the field of complex numbers is the next step in our study. Thus, the problem of obtaining the representation of a word is reduced to the problem of finding a matrix $\rho$ satisfying equation (3). In more detail, the matrices have the form</p>
        <p>
          <disp-formula><label>(5)</label><tex-math><![CDATA[\rho = \begin{bmatrix} \rho_{11} & \ldots & \rho_{1L} \\ \vdots & \ddots & \vdots \\ \rho_{L1} & \ldots & \rho_{LL} \end{bmatrix}, \quad A_k = \begin{bmatrix} a_{11} & \ldots & a_{1L} \\ \vdots & \ddots & \vdots \\ a_{L1} & \ldots & a_{LL} \end{bmatrix}.]]></tex-math></disp-formula>
          The trace of the product of these matrices is then calculated as
          <disp-formula><label>(6)</label><tex-math>\mathrm{Tr}(A_k \rho) = \sum_{i=1}^{L} \sum_{j=1}^{L} a_{ij} \rho_{ji} = P_k.</tex-math></disp-formula>
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>Density matrix recovery</title>
        <p>
          The work of Piwowarski [9] describes an approach in which the density matrix is obtained as a weighted sum of the projector matrices onto the contexts of a word:
          <disp-formula><label>(7)</label><tex-math>\rho = \sum_{k=1}^{N} \lambda_k A_k, \quad \sum_{k=1}^{N} \lambda_k = 1, \quad \lambda_k \geq 0.</tex-math></disp-formula>
          In such a sum, the weights correspond to the frequencies of occurrence of the word contexts. This approach is computationally simple and retains all the properties of the density matrix:
          <disp-formula><label>(8)</label><tex-math><![CDATA[\begin{cases} \rho_{ii} \geq 0, \\ \mathrm{Tr}\,\rho = 1, \\ \mathrm{Tr}(\rho^2) \leq 1, \\ \rho_{ij} = \rho_{ji}, \quad i \neq j. \end{cases}]]></tex-math></disp-formula>
          However, it is not suitable for non-orthogonal contexts of words, nor for systems in mixed states: such an algorithm does not restore the target distribution over contexts.
        </p>
        <p>
          In our experiments, we determined that the density matrix should be restored using the Kullback–Leibler divergence:
          <disp-formula><label>(9)</label><tex-math>D_{KL}(\rho \| P) = \sum_{k=1}^{N} \mathrm{Tr}(\rho A_k) \ln \frac{\mathrm{Tr}(\rho A_k)}{P_k}.</tex-math></disp-formula>
          This metric allows us to bring the distribution over contexts produced by the model close to the one actually observed in the training data. In this paper, we propose to use an approach common in machine learning, namely solving the optimization problem by gradient descent. If we take metric (9) as the main objective function and treat $\Omega_i(\rho)$ as a set of regularizers that preserve the properties of the density matrix (8), we obtain the following optimization problem:
          <disp-formula><label>(10)</label><tex-math>Q(\rho, P) = D_{KL}(\rho \| P) + \sum_{i=1}^{R} \Omega_i(\rho) \to \min.</tex-math></disp-formula>
          Here $R$ is the number of regularizers $\Omega_i$. The gradient of (10) with respect to the parameters of the density matrix has the following form:
          <disp-formula><label>(11)</label><tex-math>\nabla Q(\rho, P) = \sum_{k=1}^{N} \left( \ln \frac{\mathrm{Tr}(\rho A_k)}{P_k} + 1 \right) A_k^T + \sum_{i=1}^{R} \nabla \Omega_i(\rho).</tex-math></disp-formula>
          Once the expression for the gradient of the objective function is obtained, gradient descent methods can be applied to optimize it and obtain the density matrix for the word.
        </p>
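        <p>As a hedged illustration of the divergence-based objective described above, the sketch below evaluates the sum of Tr(ρ A_k) ln(Tr(ρ A_k) / P_k) over contexts and its gradient with respect to the elements of ρ, and compares the analytic gradient with a finite-difference estimate. The regularizers are omitted, the matrix entries are treated as independent real parameters, and the toy contexts, probabilities and function names are assumptions of this sketch rather than the authors' implementation.</p>
        <preformat>
```python
import math


def projector(v):
    """Projector v v^T / |v|^2 onto a real context vector."""
    norm_sq = sum(x * x for x in v)
    return [[a * b / norm_sq for b in v] for a in v]


def trace_product(a, b):
    """Tr(a b), computed as the double sum of a_ij * b_ji."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def kl_loss(rho, projectors, probs):
    """Sum over contexts of Tr(rho A_k) * ln(Tr(rho A_k) / P_k)."""
    total = 0.0
    for a_k, p_k in zip(projectors, probs):
        q_k = trace_product(rho, a_k)
        total += q_k * math.log(q_k / p_k)
    return total


def kl_gradient(rho, projectors, probs):
    """Sum over contexts of (ln(Tr(rho A_k) / P_k) + 1) * A_k^T."""
    n = len(rho)
    grad = [[0.0] * n for _ in range(n)]
    for a_k, p_k in zip(projectors, probs):
        coeff = math.log(trace_product(rho, a_k) / p_k) + 1.0
        for i in range(n):
            for j in range(n):
                grad[i][j] += coeff * a_k[j][i]  # element (i, j) of A_k^T
    return grad


def numeric_partial(rho, projectors, probs, i, j, step=1e-6):
    """Central finite difference of the loss with respect to rho[i][j]."""
    plus = [row[:] for row in rho]
    minus = [row[:] for row in rho]
    plus[i][j] += step
    minus[i][j] -= step
    difference = kl_loss(plus, projectors, probs) - kl_loss(minus, projectors, probs)
    return difference / (2 * step)


# Toy data (hypothetical): two non-orthogonal contexts and their probabilities.
projectors = [projector([1.0, 0.0]), projector([1.0, 1.0])]
probs = [0.7, 0.3]
rho = [[0.6, 0.1], [0.1, 0.4]]

analytic = kl_gradient(rho, projectors, probs)
numeric = numeric_partial(rho, projectors, probs, 0, 1)
```
        </preformat>
        <p>Agreement between analytic[0][1] and numeric indicates that the gradient expression matches the loss; an actual optimizer would add the regularizer gradients and a trace normalization step.</p>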
        <p>Introducing a phase in computing</p>
        <p>A feature of most algorithms that reconstruct, in one form or another, a density matrix for processing natural language texts is that their developers try to avoid the field of complex numbers and work in the field of real numbers instead. However, if we treat quantum probability theory as a probability theory with a quadratic measure, degeneracy in the construction of the density matrix can be avoided only by using the field of complex numbers. An approximate solution over the real numbers can be obtained with the algorithm described above, but the use of complex numbers is justified because it increases the number of degrees of freedom in training and the accuracy of restoring the initial distribution. In this section, we convert the expressions used in solving the optimization problem to expressions over the field of complex numbers.</p>
        <p>
          In the transition to complex numbers, the vectors of the basis in which the density matrix is reconstructed can be represented as follows:
          <disp-formula><label>(12)</label><tex-math>|\psi_k\rangle = \left[ r_{k1} e^{i\varphi_{k1}} \; \ldots \; r_{kN} e^{i\varphi_{kN}} \right] = \left[ r_{kn} e^{i\varphi_{kn}} \right]_{n=1 \ldots N},</tex-math></disp-formula>
          where $r_{kn}$ and $\varphi_{kn}$ are the amplitude and the phase of the $n$-th component and $N$ is the dimension of the vector.
        </p>
        <p>
          The projector onto such a vector is a matrix of the following form:
          <disp-formula><label>(13)</label><tex-math><![CDATA[A_k = |\psi_k\rangle \langle\psi_k| = \begin{bmatrix} r_{k1}^2 & r_{k1} r_{k2} e^{i(\varphi_{k1} - \varphi_{k2})} & \ldots & r_{k1} r_{kN} e^{i(\varphi_{k1} - \varphi_{kN})} \\ r_{k2} r_{k1} e^{i(\varphi_{k2} - \varphi_{k1})} & r_{k2}^2 & \ldots & r_{k2} r_{kN} e^{i(\varphi_{k2} - \varphi_{kN})} \\ \vdots & \vdots & \ddots & \vdots \\ r_{kN} r_{k1} e^{i(\varphi_{kN} - \varphi_{k1})} & r_{kN} r_{k2} e^{i(\varphi_{kN} - \varphi_{k2})} & \ldots & r_{kN}^2 \end{bmatrix}.]]></tex-math></disp-formula>
          Or, in shortened form:
          <disp-formula><label>(14)</label><tex-math>A_k = \left[ r_{kn} r_{km} e^{i(\varphi_{kn} - \varphi_{km})} \right], \quad n, m \in [1 \ldots N].</tex-math></disp-formula>
          With this expression for the projector, we can substitute it into the main expression for the probability of the observable:
          <disp-formula><label>(15)</label><tex-math>\langle A_k \rangle = \mathrm{Tr}(\rho A_k) = \sum_{n=1}^{N} \sum_{m=1}^{N} \rho_{mn} A_{knm}.</tex-math></disp-formula>
          In fact, the original expression has not changed; only the values $A_{knm}$ are now complex numbers.
        </p>
        <p>
          Substituting this expression into the objective function (10), one can obtain the partial gradients needed for gradient descent. Note that in this formulation we know nothing about the phase values of the basis vectors. We also find them by gradient descent, thereby performing a kind of automatic phase adjustment of the basis during learning. Therefore, we need two expressions for gradient descent. The first is the partial derivative of the objective function with respect to the elements of the density matrix being reconstructed, which is the first part of the gradient of the objective function (11):
          <disp-formula><label>(16)</label><tex-math>\frac{\partial D_{KL}(\rho \| P)}{\partial \rho_{mn}} = \sum_{k} \left( \ln \frac{\mathrm{Tr}(\rho A_k)}{P_k} + 1 \right) \left( A_k^T \right)_{mn}.</tex-math></disp-formula>
          The second is the partial derivative with respect to the phase, which is derived as follows:
          <disp-formula><label>(17)</label><tex-math>\frac{\partial\, \mathrm{Tr}(\rho A_k)}{\partial \varphi_{kn}} = \sum_{m=1}^{N} i \left[ \rho_{mn} r_{kn} r_{km} e^{i(\varphi_{kn} - \varphi_{km})} - \rho_{nm} r_{km} r_{kn} e^{i(\varphi_{km} - \varphi_{kn})} \right].</tex-math></disp-formula>
          The index $m$ appears because the phase $\varphi_{kn}$ occurs several times, in the expressions involving all the remaining phases. Next, to simplify the notation and using the properties of the density matrix, we make the following substitution:
          <disp-formula><label>(18)</label><tex-math>\rho_{mn} = |\rho_{mn}| e^{i\theta_{mn}}, \quad \theta_{nm} = -\theta_{mn}, \quad \Delta_{knm} = \varphi_{kn} - \varphi_{km}.</tex-math></disp-formula>
          After this substitution we obtain the following expression:
          <disp-formula><label>(19)</label><tex-math>\sum_{m=1}^{N} i \, |\rho_{mn}| \, r_{kn} r_{km} \left[ e^{i(\theta_{mn} + \Delta_{knm})} - e^{-i(\theta_{mn} + \Delta_{knm})} \right].</tex-math></disp-formula>
          Rewriting the complex exponentials in brackets with Euler's formula, we obtain the final expression for the partial derivative with respect to the phase:
          <disp-formula><label>(20)</label><tex-math>\frac{\partial\, \mathrm{Tr}(\rho A_k)}{\partial \varphi_{kn}} = -2 \sum_{m=1}^{N} |\rho_{mn}| \, r_{kn} r_{km} \sin(\theta_{mn} + \Delta_{knm}).</tex-math></disp-formula>
          Thus, using expressions (16) and (20), we can move to the space of complex numbers and search for the density matrix with automatic phase adjustment.</p>
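        <p>The phase derivative can be checked numerically. The sketch below builds a complex projector from amplitudes and phases, evaluates the sine form of the derivative of Tr(ρ A_k) with respect to one phase, and compares it with a finite-difference estimate. The symbols r, phi and theta, the function names and the toy Hermitian matrix are assumptions of this illustration.</p>
        <preformat>
```python
import cmath
import math


def complex_projector(r, phi):
    """Matrix with elements r_n * r_m * exp(i * (phi_n - phi_m))."""
    n = len(r)
    return [[r[a] * r[b] * cmath.exp(1j * (phi[a] - phi[b])) for b in range(n)]
            for a in range(n)]


def trace_product(a, b):
    """Tr(a b), computed as the double sum of a_ij * b_ji."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def phase_derivative(rho, r, phi, n):
    """-2 * sum_m |rho_mn| * r_n * r_m * sin(theta_mn + phi_n - phi_m)."""
    total = 0.0
    for m in range(len(r)):
        theta_mn = cmath.phase(rho[m][n])  # phase of the element rho_mn
        delta_nm = phi[n] - phi[m]
        total += abs(rho[m][n]) * r[n] * r[m] * math.sin(theta_mn + delta_nm)
    return -2.0 * total


def numeric_phase_derivative(rho, r, phi, n, step=1e-6):
    """Central finite difference of Tr(rho A) with respect to phi[n]."""
    plus = list(phi)
    minus = list(phi)
    plus[n] += step
    minus[n] -= step
    f_plus = trace_product(rho, complex_projector(r, plus)).real
    f_minus = trace_product(rho, complex_projector(r, minus)).real
    return (f_plus - f_minus) / (2 * step)


# Toy Hermitian density matrix with unit trace and a normalized context vector.
rho = [[0.6, 0.2 + 0.1j], [0.2 - 0.1j, 0.4]]
r = [0.6, 0.8]
phi = [0.3, 1.1]

analytic = [phase_derivative(rho, r, phi, n) for n in range(2)]
numeric = [numeric_phase_derivative(rho, r, phi, n) for n in range(2)]
```
        </preformat>
        <p>Because ρ and the projector are both Hermitian, the trace is real, so only its real part is differentiated in the numerical check.</p>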
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Testing of algorithm</title>
      <p>To evaluate the algorithm, the well-known WordSim353 data set was chosen. It consists of two parts with a total of 353 rows. Each row contains a pair of English words and a set of ratings from human annotators (13 or 16, depending on the part) reflecting the degree of similarity of the two words. A pair of words can receive a rating from 0 to 10, where 10 means the maximum semantic similarity in the opinion of the annotator. To evaluate the developed algorithm, we used the average of the human ratings for each word pair as the reference.</p>
      <p>
        For a comparative evaluation of the developed algorithm, two text vectorization algorithms were chosen: an algorithm based on the bag-of-words model with tf-idf statistics as the values of the coordinates of the context vectors, and the word2vec algorithm (the CBoW architecture was used). For both algorithms, the cosine measure was used to assess the proximity of the resulting vector representations:
        <disp-formula><label>(21)</label><tex-math>d(W_1, W_2) = \frac{W_1 \cdot W_2}{|W_1| \, |W_2|}.</tex-math></disp-formula>
      </p>
      <p>The TensorFlow library was used to build the architecture and train the Word2Vec model. The degree of closeness of the density matrices was estimated using expression (6). Formally, this expression is not suitable for obtaining the probability that the words in question belong to the same semantic context, but it can still be used as an indicator of the similarity of two density matrices. In addition, the current implementation of the semantic tomography algorithm did not use phase auto-adjustment, so all the work on constructing density matrices was carried out in the field of real numbers, which could potentially affect the quality of restoring the distributions as measured by the Kullback–Leibler divergence.</p>
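      <p>The two proximity measures mentioned above can be sketched as follows: the cosine measure for word vectors and the trace of the product of two density matrices as a similarity indicator. The function names and the toy inputs are assumptions of this sketch and are not taken from the experiments.</p>
      <preformat>
```python
import math


def cosine_measure(w1, w2):
    """Dot product of two vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)


def trace_similarity(rho1, rho2):
    """Tr(rho1 rho2), used here only as an indicator of similarity."""
    n = len(rho1)
    return sum(rho1[i][j] * rho2[j][i] for i in range(n) for j in range(n))


# Toy word vectors (hypothetical values).
same_direction = cosine_measure([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_measure([1.0, 0.0], [0.0, 3.0])

# Toy density matrices: a maximally mixed state and a pure state.
rho_a = [[0.5, 0.0], [0.0, 0.5]]
rho_b = [[1.0, 0.0], [0.0, 0.0]]
overlap = trace_similarity(rho_a, rho_b)  # 0.5
```
      </preformat>
      <p>The trace of the product reaches one only when both matrices describe the same pure state, which is why it serves as an indicator rather than a probability.</p>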
      <p>A slice of English Wikipedia was chosen as the training sample for the algorithms. All texts were preprocessed as follows: paragraphs were joined into one text, punctuation marks were deleted, and all words were first reduced to lowercase and then stemmed. All non-alphabetic characters (numbers, dashes, colons, etc.) were removed from the sample, as were conjunctions and prepositions. For stemming, we used the Porter stemmer [11] from the NLTK library [10] for the Python programming language, which is currently a standard tool for working with natural language texts. To compare the algorithms with each other, we used the Pearson correlation coefficient between the human ratings of word proximity and the metric of the model in question. The algorithm for constructing the matrix representation is given below in pseudo-code:</p>
      <preformat>
Data: Ds - set of documents, Cmax - maximum number of context clusters,
      ε - threshold for stopping gradient descent,
      Imax - maximum number of iterations of gradient descent
Result: [ρ_1, ρ_2, ..., ρ_L] - density matrices for the L words of the lexicon

Dict ← build_dict(Ds)                       // build the vocabulary of dimension L
BoWs ← make_bag_of_words(Ds, Dict)          // for each word, generate multiple bags of words
result ← []
// i - word index, ctxs - set of bags of words
for (i, word, ctxs) ∈ BoWs do
    [ctxs', P] ← clusterize(ctxs, Cmax)     // context clustering, P - probability distribution of clusters
    N ← |ctxs'|                             // total number of context clusters
    A ← [v_k v_k^T / |v_k|^2 : v_k ∈ ctxs'] // projectors on word contexts, v_k - word context vector
    ρ_i ← (1/N) Σ_{k=1..N} A_k              // initialization of the density matrix for the i-th word
    loss ← +∞, ∇_0 ← 0
    for iter ∈ 1 ... Imax do
        current_loss ← Σ_{k=1..N} Tr(ρ_i A_k) log(Tr(ρ_i A_k) / P_k)   // loss function Q(ρ, P)
        if |current_loss − loss| &lt; ε then
            // if the changes of Q(ρ, P) are minor, stop the gradient descent
            break
        end
        ∇_iter ← Σ_{k=1..N} (log(Tr(ρ_i A_k) / P_k) + 1) A_k^T + Σ_{j=1..R} ∇Ω_j(ρ_i)   // ∇Q(ρ, P) at the current point
        ρ_i ← adam_grad(ρ_i, ∇_{iter−1}, ∇_iter)   // adaptive gradient descent step
        ρ_i ← norm(ρ_i)                     // matrix trace normalization if necessary
        loss ← current_loss
    end
    result.add(ρ_i)
end
return result
      </preformat>
      <p>The comparison results of the algorithms are shown in Table 1.</p>
      <p>Table 1. The result of comparing algorithms for a sample of WordSim353.</p>
      <p>This article discusses the analogy between quantum tomography and the process of constructing vector representations of words for text documents. Such an analogy allows us to adapt the mathematical apparatus used to describe quantum tomography, which describes the statistical properties of objects with quantum-like properties. From the point of view of modeling semantics, this mathematical apparatus makes it possible to take into account the superposition and entanglement of word contexts that arise during vectorization of the analyzed word. Here, superposition is considered from the point of view of the dictionary: each individual word in the context is a semantic concept from the dictionary, and the analyzed word is in a state of superposition of all words in its context, i.e. in a state of uncertainty about its meaning, expressed through the words of the context. As for the analogy with a mixed state, the analyzed word occurs in a number of contexts, so its representation as a density matrix should treat the set of encountered contexts as a mixture of density matrices. The analogy with entanglement can be drawn through the presence of correlations between the words encountered in the analyzed contexts.</p>
      <p>As for further research aimed at improving the presented vectorization algorithm, we can point out the following directions:</p>
      <p>1. The model does not take into account the frequency characteristics of the words encountered in contexts. For example, there is no automatic filtering of noise words, which could be performed using the tf-idf algorithm. Such normalization of context words seems to be a simple and at the same time useful step for cleaning the contexts of words.</p>
      <p>2. Density matrices use the bag-of-words model and are trained on contexts whose dimension equals that of the entire lexicon, which produces matrices of very large dimension. Further research should be aimed at reducing the dimensionality of these objects and, as a consequence, reducing computational costs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <article-title>Distributional structure</article-title>
          .
          <source>Word</source>
          ,
          <volume>10</volume>
          (
          <issue>23</issue>
          ):
          <volume>146</volume>
          –
          <fpage>162</fpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          .
          <article-title>Three models for the description of language</article-title>
          .
          <source>IRE Transactions on Information Theory</source>
          ,
          <volume>2</volume>
          :
          <fpage>113</fpage>
          –
          <fpage>124</fpage>
          ,
          <year>1956</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Zhang,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Rong</surname>
          </string-name>
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Zhi-Hua Z.:</surname>
          </string-name>
          <article-title>Understanding bag-of-words model: a statistical framework</article-title>
          .
          <source>International Journal of Machine Learning and Cybernetics</source>
          .
          <volume>1</volume>
          .
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Applying content-based similarity measure to author cocitation analysis</article-title>
          .
          <source>In: Proceedings of Conference</source>
          <year>2016</year>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Haven</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khrennikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Quantum probability and the mathematical modelling of decision-making</article-title>
          .
          <source>Philosophical Transactions. Series A. Mathematical</source>
          , physical, and
          <article-title>engineering science</article-title>
          . vol.
          <volume>374</volume>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gleason</surname>
            ,
            <given-names>Andrew M.:</given-names>
          </string-name>
          <article-title>Measures on the closed subspaces of a Hilbert space</article-title>
          .
          <source>Indiana University Mathematics Journal</source>
          .
          <volume>6</volume>
          <fpage>885</fpage>
          (
          <year>1957</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR, abs/1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C. J. v.</given-names>
            <surname>Rijsbergen</surname>
          </string-name>
          .
          <source>The Geometry of Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>A quantum-based model for interactive information retrieval</article-title>
          . In L. Azzopardi, G. Kazai,
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          , S. Rüger,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shokouhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          , and E. Yilmaz, editors,
          <source>Advances in Information Retrieval Theory</source>
          , pages
          <volume>224</volume>
          –
          <fpage>231</fpage>
          , Berlin, Heidelberg,
          <year>2009</year>
          . Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. E. Loper and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>Nltk: The natural language toolkit</article-title>
          .
          <source>In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguis.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <volume>130</volume>
          –
          <fpage>137</fpage>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>