       Spectral properties of a matrix of correspondences
                         between terms

                      Dmitry V. Bondarchuk1                       Alexander V. Martynenko1,2
                     dvbondarchuk@gmail.com                        amartynenko@rambler.ru


                     1 – Ural State University of Railway Transport (Yekaterinburg, Russia)
     2 – Institute of Economics, The Ural Branch of Russian Academy of Sciences (Yekaterinburg, Russia)




                                                        Abstract

                       The semantic core is widely used for a compact representation of doc-
                       uments in problems of classification and intelligent search. Various
                       methods are available for creating the semantic core, including the
                       singular value decomposition of a term-document matrix and the eigen-
                       value decomposition of a matrix of correspondences between terms. In
                       the present paper, we investigate how the lengths of documents influence
                       the semantic core created by these methods.
                       Keywords: intelligent search, semantic core, term-document matrix,
                       matrix of correspondences between terms.




1    Introduction
Vector models are a simple way to represent documents. Such models are based on the frequency of occurrence
of words and do not take into account grammar, semantics or other features of natural-language texts [1].
Vector models are often used to analyze relationships between a set of documents and the terms contained in
these documents. The relationships are represented as a high-dimensional matrix. In order to reduce the
dimension of the space, several methods have been proposed, such as latent semantic analysis (LSA) [2, 3]. This
method operates on a term-document matrix. A slightly different method, based on a matrix of correspondences
between terms (MCT), was used in [4, 5]. In the present paper, we consider some mathematical aspects of this
method.

2    Latent semantic analysis and MCT
Suppose we have a collection of text documents D = {d1, d2, ..., dm} with terms from the collection W =
{w1, w2, ..., wn}. We present the data as the m × n matrix X = (xij) with xij = tf(di, wj), where the term
frequency tf(d, w) ∈ N shows how many times the term w occurs in the document d. The matrix X is called a
term-document matrix. Rows of X correspond to the documents and columns of X correspond to the terms.
   The main idea of latent semantic analysis is as follows [3]. By the singular value decomposition, the
matrix X can be presented in the form

                                                X = U S V^T,

where U is an m × m real unitary matrix, S is an m × n rectangular diagonal matrix with non-negative real numbers
on the diagonal and V is an n × n real unitary matrix. The diagonal elements σi of S are known as the singular
values of X. The matrices U and V can be chosen such that the singular values are in descending order:

                                   σ1 ≥ σ2 ≥ ... ≥ σr > σr+1 = ... = σn = 0.

In order to reduce the dimension, we choose a critical value σ* > 0. In the matrices U, V and S we drop the
rows and columns corresponding to the singular values σi < σ*. As a result we obtain the matrices Ũ, Ṽ and S̃
respectively, which specify the matrix
                                                X̃ = Ũ S̃ Ṽ^T.                                             (1)
The matrix X̃ is a low-rank approximation of the matrix X [3]. Decomposition (1) allows one to build the semantic
core of the document collection D.
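   For illustration, decomposition (1) can be computed with a truncated SVD in a few lines of NumPy; this is
only a sketch, and the toy matrix and the threshold sigma_star are arbitrary choices made for this example:

    # Sketch of (1): truncate the SVD of a toy term-document matrix
    # at an arbitrary critical value sigma_star.
    import numpy as np

    X = np.array([[2., 1., 0., 1.],    # m = 3 documents (rows),
                  [1., 2., 1., 0.],    # n = 4 terms (columns)
                  [0., 0., 3., 2.]])

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    sigma_star = 1.0                   # critical value sigma*
    keep = s >= sigma_star             # drop singular values with sigma_i < sigma*
    X_tilde = U[:, keep] @ np.diag(s[keep]) @ Vt[keep, :]

    print(np.linalg.matrix_rank(X_tilde))   # rank of the low-rank approximation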
   Another way to identify the semantic core is to employ the matrix G = (gij), where gij is some measure of
proximity between the terms wi and wj. The matrix G is called a matrix of correspondences between terms. If
we put
                                     gij = gji = Σ_{k=1}^{m} tf(dk, wi) tf(dk, wj),

then
                                                         G = X^T X.
Since the matrix G is a symmetric non-negative definite matrix, it can be presented in the form

                                                        G = T Z T^T,

where T is a real unitary matrix and Z is a diagonal matrix with non-negative real numbers on the diagonal.
The diagonal elements λi of Z are known as the eigenvalues of G. If we truncate the matrices T and Z according to
the condition λi < λ*, we obtain
                                                G̃ = T̃ Z̃ T̃^T.                                            (2)
Note that decompositions (1) and (2) are equivalent, because λi = σi² and T̃ = Ṽ if λ* = (σ*)² [5].
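   The equivalence is easy to verify numerically. The following sketch (an illustration of ours with a random
matrix, not part of the original derivation) checks that the eigenvalues of G = X^T X coincide with the squared
singular values of X:

    # Sanity check of the equivalence of (1) and (2): lambda_i = sigma_i^2.
    import numpy as np

    X = np.random.rand(5, 4)            # arbitrary term-document matrix
    G = X.T @ X

    _, s, _ = np.linalg.svd(X, full_matrices=False)
    lam = np.linalg.eigvalsh(G)[::-1]   # eigenvalues of G, descending

    print(np.allclose(lam, s**2))       # True: lambda_i = sigma_i^2
    # The eigenvectors of G likewise match the right singular vectors
    # of X up to sign, so the truncated matrices T~ and V~ agree.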
  Let us consider the matrix Y = (yij), where
                                           yij = xij / Σ_{l=1}^{n} xil.

The matrix Y is called a normalized term-document matrix. This matrix can be used instead of X for building
the semantic core. When all documents have the same length, the matrices X and Y yield the same result;
otherwise the results differ significantly [5]. A natural question arises: how do the lengths of documents
influence the result of LSA? We answer this question in the next section.
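   In matrix terms the normalization simply divides each row of X by the length of the corresponding document;
a minimal NumPy sketch with an arbitrary toy matrix:

    # Row-normalized term-document matrix Y: y_ij = x_ij / sum_l x_il.
    import numpy as np

    X = np.array([[2., 1., 0., 1.],
                  [1., 2., 1., 0.],
                  [0., 0., 3., 2.]])

    Y = X / X.sum(axis=1, keepdims=True)
    print(Y.sum(axis=1))                # each row of Y sums to 1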

3      Main results
Let us consider the collection of text documents D = {d1, d2, ..., dm} with terms from the collection W =
{w1, w2, ..., wn}, and let us suppose that the lengths of the first k documents are Φ and the lengths of the
remaining documents are φ. It is easily seen that the term-document matrix X = (xij) (i = 1, ..., m,
j = 1, ..., n) has the elements

                                 xij = Φ aij   for i = 1, ..., k,      j = 1, ..., n,
                                 xij = φ bij   for i = k + 1, ..., m,  j = 1, ..., n,

where the real numbers Φ, φ, aij, bij satisfy the following conditions

                                 aij ≥ 0    ∀i = 1, ..., k,  j = 1, ..., n,                                  (3)

                                 bij ≥ 0    ∀i = k + 1, ..., m,  j = 1, ..., n,                              (4)

                                 Σ_{j=1}^{n} aij = 1    ∀i = 1, ..., k,                                      (5)

                                 Σ_{j=1}^{n} bij = 1    ∀i = k + 1, ..., m,                                  (6)

                                 Φ ≥ φ ≥ 1.                                                                  (7)
It is convenient to write the matrix X in the form

                                                     X = ΦA + φB,

where
                  ( a11     a12     ...  a1n )              ( 0       0       ...  0      )
                  ( ...     ...          ... )              ( ...     ...          ...    )
                  ( ak1     ak2     ...  akn )              ( 0       0       ...  0      )
              A = ( 0       0       ...  0   ),         B = ( bk+1,1  bk+1,2  ...  bk+1,n ).
                  ( ...     ...          ... )              ( ...     ...          ...    )
                  ( 0       0       ...  0   )              ( bm1     bm2     ...  bmn    )
For the matrix G = X^T X we obtain

                     G = (ΦA + φB)^T (ΦA + φB) = Φ² A^T A + Φφ A^T B + φΦ B^T A + φ² B^T B.

It is obvious that A^T B = B^T A = 0, thus
                                             G = Φ² G_A + φ² G_B,                                            (8)
where G_A = A^T A, G_B = B^T B.
   The matrices G, G_A and G_B are symmetric non-negative definite matrices. We also note that

                               rank G_A ≤ min{rank A^T, rank A} = rank A ≤ k.

These properties mean that all eigenvalues of the matrix G_A are non-negative real numbers and at least the
n − k smallest eigenvalues are equal to zero.
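   This construction is straightforward to reproduce numerically. The sketch below (our own illustration; the
sizes m, n, k and the values Φ, φ are arbitrary) builds such matrices A and B, and confirms decomposition (8)
and the rank bound:

    # Build X = Phi*A + phi*B with k long and m - k short documents.
    import numpy as np

    m, n, k = 6, 4, 2
    Phi, phi = 100.0, 1.0

    A = np.zeros((m, n)); B = np.zeros((m, n))
    A[:k] = np.random.rand(k, n)
    A[:k] /= A[:k].sum(axis=1, keepdims=True)   # condition (5): rows sum to 1
    B[k:] = np.random.rand(m - k, n)
    B[k:] /= B[k:].sum(axis=1, keepdims=True)   # condition (6): rows sum to 1

    X = Phi * A + phi * B
    G, GA, GB = X.T @ X, A.T @ A, B.T @ B

    print(np.allclose(G, Phi**2 * GA + phi**2 * GB))   # decomposition (8)
    print(np.linalg.matrix_rank(GA) <= k)              # rank G_A <= k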
   We are interested in the influence of the numbers Φ and φ on the eigenvalues of the matrix G under the
assumption that Φ is significantly greater than φ. In order to investigate this influence, it is necessary to
introduce additional notation. In particular, for any n × n matrix P = (pij) we denote

                                     ‖P‖_E = ( Σ_{i,j=1}^{n} pij² )^{1/2}.

The value ‖P‖_E is called the Euclidean norm of the matrix P. Note that for any matrices P and Q, we have

                                          ‖PQ‖_E ≤ ‖P‖_E ‖Q‖_E,                                              (9)

                                            ‖P^T‖_E = ‖P‖_E.                                                (10)
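   The norm ‖·‖_E is the Frobenius norm, available in NumPy as np.linalg.norm(·, 'fro'); a quick check of
properties (9) and (10) on random matrices (our own illustration):

    # Check submultiplicativity (9) and transpose invariance (10).
    import numpy as np

    P, Q = np.random.rand(4, 4), np.random.rand(4, 4)

    def fro(M):
        return np.linalg.norm(M, 'fro')    # Euclidean (Frobenius) norm

    print(fro(P @ Q) <= fro(P) * fro(Q))   # property (9)
    print(np.isclose(fro(P.T), fro(P)))    # property (10)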
   All eigenvalues of a symmetric matrix are real, so they can be ordered. Let λ1(P) ≥ λ2(P) ≥ ... ≥ λn(P) be
the eigenvalues of the symmetric matrix P sorted in decreasing order. It is obvious that λs(µP) = µλs(P) for any
real µ ≥ 0. For our purposes we need the following theorems ([6], [7]).
   Theorem 1. Let P and Q be symmetric matrices of the same size n × n. Then

                                |λs(P) − λs(Q)| ≤ ‖P − Q‖_E    ∀s = 1, ..., n.




   Theorem 2. Let P be a symmetric matrix and let Q be a non-negative definite matrix of the same size n × n.
Then
                                     λs(P + Q) ≥ λs(P)    ∀s = 1, ..., n.
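   Both theorems are classical; as a quick numerical sanity check (our own, with random symmetric matrices),
the inequality of Theorem 1 can be tested as follows:

    # Check |lambda_s(P) - lambda_s(Q)| <= ||P - Q||_E for symmetric P, Q.
    import numpy as np

    n = 5
    P = np.random.rand(n, n); P = (P + P.T) / 2   # random symmetric matrices
    Q = np.random.rand(n, n); Q = (Q + Q.T) / 2

    lp = np.linalg.eigvalsh(P)[::-1]              # eigenvalues, descending
    lq = np.linalg.eigvalsh(Q)[::-1]

    print(np.all(np.abs(lp - lq) <= np.linalg.norm(P - Q, 'fro')))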


    Using these theorems, we can estimate the influence of the numbers Φ and φ on the eigenvalues of the matrix G.
    Theorem 3. Let conditions (4) and (6) hold. Then

                      Φ² λs(G_A) ≤ λs(G) ≤ Φ² λs(G_A) + φ²(m − k)    ∀s = 1, ..., n.                        (11)


    Proof. Since the matrices G and G_A are symmetric, it follows from Theorem 1 that for any s = 1, ..., n

                 |λs(G) − λs(Φ² G_A)| ≤ ‖G − Φ² G_A‖_E = ‖φ² G_B‖_E = φ² ‖G_B‖_E = φ² ‖B^T B‖_E.            (12)
From conditions (4) and (6) we obtain 0 ≤ bij ≤ 1, hence Σ_{j=1}^{n} bij² ≤ Σ_{j=1}^{n} bij = 1 and

                 ‖B‖_E = ( Σ_{i=k+1}^{m} Σ_{j=1}^{n} bij² )^{1/2} ≤ ( Σ_{i=k+1}^{m} 1 )^{1/2} = √(m − k).


This inequality and properties (9) and (10) give

                     |λs(G) − λs(Φ² G_A)| ≤ φ² ‖B^T‖_E ‖B‖_E = φ² ‖B‖_E² = φ² (m − k).                      (13)

Since φ² G_B is a non-negative definite matrix, Theorem 2 gives for any s = 1, ..., n

                                           λs(G) ≥ λs(Φ² G_A).

Combining estimate (13) with the last inequality, we arrive at the desired estimate (11). The proof is completed.
   If λs(G_A) ≠ 0, then we can rewrite (11) in the form

                            1 ≤ λs(G) / (Φ² λs(G_A)) ≤ 1 + φ²(m − k) / (Φ² λs(G_A)).

From this it follows that for any ε > 0 there exists Φ* = Φ*(m, k, φ, A) such that for any Φ > Φ* the following
estimate holds:
                                      1 ≤ λs(G) / (Φ² λs(G_A)) ≤ 1 + ε.                                     (14)
If λs(G_A) = 0 (as shown above, this holds, in particular, for s > k), then estimate (11) can be rewritten in
the form
                                          0 ≤ λs(G) ≤ φ² (m − k).                                           (15)
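   The behaviour described by (14) and (15) can be observed numerically. The sketch below (our own illustration,
reusing the construction of A and B from the beginning of this section with arbitrary sizes) shows that the
ratio λs(G)/(Φ² λs(G_A)) approaches 1 as Φ grows:

    # As Phi grows, lambda_s(G) / (Phi^2 * lambda_s(G_A)) -> 1, cf. (14).
    import numpy as np

    m, n, k, phi = 6, 4, 2, 1.0
    A = np.zeros((m, n)); B = np.zeros((m, n))
    A[:k] = np.random.rand(k, n); A[:k] /= A[:k].sum(axis=1, keepdims=True)
    B[k:] = np.random.rand(m - k, n); B[k:] /= B[k:].sum(axis=1, keepdims=True)

    lamA = np.linalg.eigvalsh(A.T @ A)[::-1]
    nz = lamA > 1e-12                       # indices with lambda_s(G_A) != 0

    for Phi in (10.0, 100.0, 1000.0):
        X = Phi * A + phi * B
        lamG = np.linalg.eigvalsh(X.T @ X)[::-1]
        print(Phi, lamG[nz] / (Phi**2 * lamA[nz]))   # ratios tend to 1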

4    Conclusion
From estimates (14) and (15) it follows that long documents influence the semantic core obtained by LSA much
more strongly than short documents do. In particular, if the differences in lengths are large enough, then the
semantic core contains no information from the short documents. Hence it is necessary to use the normalized
term-document matrix whenever all documents must be taken into account.

References
 [1] G. Salton, A. Wong, C.-S. Yang. A vector space model for automatic indexing. Communications of the
     ACM, 18(11):613–620, 1975.

 [2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. Indexing by Latent Semantic
     Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.




[3] T. Landauer, P.W. Foltz, D. Laham. An Introduction to Latent Semantic Analysis. Discourse Processes,
    25:259–284, 1998.

[4] D.V. Bondarchuk. Intelligent method of selection of personal recommendations, which guarantees a non-empty
    result. Control Systems and Information Technology, 2(92):130–138, 2015.
[5] D.V. Bondarchuk, G.A. Timofeeva. Isolation of the semantic kernel based on the matrix of terms corre-
    spondence. Control Systems and Information Technology, 3.1(61):134–139, 2015.
[6] G.H. Golub, C.F. van Loan. Matrix Computations. Baltimore: Johns Hopkins University Press, 1996.

[7] E.F. Beckenbach, R. Bellman. Inequalities. New York: Springer-Verlag, 1961.



