=Paper=
{{Paper
|id=Vol-1662/mpr1
|storemode=property
|title=Spectral properties of a matrix of correspondences between terms
|pdfUrl=https://ceur-ws.org/Vol-1662/mpr1.pdf
|volume=Vol-1662
|authors=Dmitry V. Bondarchuk,Alexander V. Martynenko
}}
==Spectral properties of a matrix of correspondences between terms==
Dmitry V. Bondarchuk (1), Alexander V. Martynenko (1,2)
dvbondarchuk@gmail.com, amartynenko@rambler.ru
1 – Ural State University of Railway Transport (Yekaterinburg, Russia)
2 – Institute of Economics, The Ural Branch of the Russian Academy of Sciences (Yekaterinburg, Russia)

Abstract: The semantic core is widely used for a compact representation of documents in problems of classification and intelligent search. In order to create the semantic core, various methods are available, including the singular value decomposition of a term-document matrix and the eigenvalue decomposition of a matrix of correspondences between terms. In the present paper, we investigate how the lengths of documents influence the semantic core created by these methods.

Keywords: intelligent search, semantic core, term-document matrix, matrix of correspondences between terms.

===1 Introduction===

Vector models are a simple way to represent a document. Such models are based on the frequency of occurrence of words and do not take into account grammar, semantics or other features of texts in a natural language [1]. Vector models are often used to analyze relationships between a set of documents and the terms contained in these documents. The relationships are presented as a matrix of large dimension. In order to reduce the dimension of the space, several methods have been proposed, such as latent semantic analysis (LSA) [2, 3]. This method operates with a term-document matrix. A slightly different method, based on a matrix of correspondences between terms (MCT), was used in [4, 5]. In the present paper, we consider some mathematical aspects of this method.

===2 Latent semantic analysis and MCT===

Suppose we have a collection of text documents D = {d1, d2, ..., dm} with terms from the collection W = {w1, w2, ..., wn}. We present the data as the m × n matrix X = (xij) with xij = tf(di, wj), where the term frequency tf(d, w) ∈ N shows how many times the term w occurs in the document d. The matrix X is called a term-document matrix. The rows of X correspond to the documents and the columns of X correspond to the terms.

The main idea of latent semantic analysis is as follows [3]. By the singular value decomposition, the matrix X can be presented in the form X = U S V^T, where U is an m × m real unitary matrix, S is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and V is an n × n real unitary matrix. The diagonal elements σi of S are known as the singular values of X. The matrices U and V can be written such that the singular values are in descending order: σ1 ≥ σ2 ≥ ... ≥ σr > σr+1 = ... = σn = 0.

In order to reduce the dimension we choose a critical value σ* > 0. In the matrices U, V and S we drop the rows and columns corresponding to the singular values σi < σ*. As a result we obtain the matrices $\tilde U$, $\tilde V$ and $\tilde S$ respectively, which specify the matrix

$\tilde X = \tilde U \tilde S \tilde V^T.$   (1)

The matrix $\tilde X$ is a low-rank approximation of the matrix X [3]. Decomposition (1) makes it possible to build the semantic core of the document collection D.
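As a minimal illustration of this construction, the Python sketch below builds a small term-document matrix from a few toy documents, computes its singular value decomposition, and truncates it at an assumed critical value σ*; the documents and the threshold are illustrative stand-ins, not data from the paper.

<pre>
import numpy as np

docs = [
    "term frequency counts term",
    "counts and semantic core",
    "semantic core of documents",
]
vocab = sorted({w for d in docs for w in d.split()})

# X is m x n: rows are documents, columns are terms, x_ij = tf(d_i, w_j).
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Full SVD: X = U S V^T, singular values returned in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

sigma_star = 2.0                      # critical value sigma* (illustrative choice)
r = int(np.sum(s >= sigma_star))      # keep only singular values >= sigma*

# Truncated factors give the low-rank approximation of equation (1).
X_tilde = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print("singular values:", np.round(s, 3))
print("rank of X:", np.linalg.matrix_rank(X), "-> singular values kept:", r)
</pre>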
Another way to identify the semantic core is to employ the matrix G = (gij), where gij is some measure of proximity between the terms wi and wj. The matrix G is called a matrix of correspondences between terms. If we put

$g_{ij} = g_{ji} = \sum_{k=1}^{m} tf(d_k, w_i)\, tf(d_k, w_j),$

then G = X^T X. Since the matrix G is a symmetric non-negative definite matrix, it can be presented in the form G = T Z T^T, where T is a real unitary matrix and Z is a diagonal matrix with non-negative real numbers on the diagonal. The diagonal elements λi of Z are known as the eigenvalues of G. If we truncate the matrices T and Z according to the condition λi < λ*, we obtain

$\tilde G = \tilde T \tilde Z \tilde T^T.$   (2)

Note that decompositions (1) and (2) are equivalent, because λi = σi² and $\tilde T = \tilde V$ if λ* = (σ*)² [5].

Let us consider the matrix Y = (yij), where

$y_{ij} = \frac{x_{ij}}{\sum_{l=1}^{n} x_{il}}.$

The matrix Y is called a normalized term-document matrix. This matrix can be used instead of X for building the semantic core. When all documents have the same length, the matrices X and Y yield the same result. Otherwise, the results differ significantly from each other [5]. The natural question arises: how do the lengths of documents influence the result of LSA? We answer this question in the next section.

===3 Main results===

Let us consider the collection of text documents D = {d1, d2, ..., dm} with terms from the collection W = {w1, w2, ..., wn}, and suppose that the lengths of the first k documents are Φ and the lengths of the remaining documents are φ. It is easily seen that the term-document matrix X = (xij), i = 1, ..., m, j = 1, ..., n, has the elements

$x_{ij} = \Phi a_{ij}$ for i = 1, ..., k, j = 1, ..., n,
$x_{ij} = \varphi b_{ij}$ for i = k + 1, ..., m, j = 1, ..., n,

where the real numbers Φ, φ, aij, bij satisfy the following conditions:

$a_{ij} \ge 0 \quad \forall i = 1, \dots, k, \ j = 1, \dots, n,$   (3)
$b_{ij} \ge 0 \quad \forall i = k + 1, \dots, m, \ j = 1, \dots, n,$   (4)
$\sum_{j=1}^{n} a_{ij} = 1 \quad \forall i = 1, \dots, k,$   (5)
$\sum_{j=1}^{n} b_{ij} = 1 \quad \forall i = k + 1, \dots, m,$   (6)
$\Phi \ge \varphi \ge 1.$   (7)

It is convenient to write the matrix X in the form X = ΦA + φB, where

$A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \dots & a_{kn} \\ 0 & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & 0 \\ b_{k+1,1} & b_{k+1,2} & \dots & b_{k+1,n} \\ \vdots & \vdots & & \vdots \\ b_{m1} & b_{m2} & \dots & b_{mn} \end{pmatrix}.$

For the matrix G = X^T X we obtain

$G = (\Phi A + \varphi B)^T (\Phi A + \varphi B) = \Phi^2 A^T A + \Phi \varphi A^T B + \varphi \Phi B^T A + \varphi^2 B^T B.$

It is obvious that A^T B = B^T A = 0, thus

$G = \Phi^2 G_A + \varphi^2 G_B,$   (8)

where $G_A = A^T A$ and $G_B = B^T B$. The matrices G, G_A and G_B are symmetric non-negative definite matrices. We also note that

$\mathrm{rank}\, G_A \le \min\{\mathrm{rank}\, A^T, \mathrm{rank}\, A\} = \mathrm{rank}\, A \le k.$

These properties mean that all eigenvalues of the matrix G_A are non-negative real numbers and at least the smallest n − k of them are equal to zero.

We are interested in the influence of the numbers Φ and φ on the eigenvalues of the matrix G under the condition that Φ is sufficiently greater than φ. In order to investigate this influence, it is necessary to introduce additional notation. In particular, for any matrix P = (pij) of size n × n we denote

$\|P\|_E = \sqrt{\sum_{i,j=1}^{n} p_{ij}^2}.$

The value ‖P‖_E is called the Euclidean norm of the matrix P. Note that for any matrices P and Q we have

$\|PQ\|_E \le \|P\|_E \|Q\|_E,$   (9)
$\|P^T\|_E = \|P\|_E.$   (10)

All eigenvalues of a symmetric matrix are real, so they can be ordered. Let λ1(P) ≥ λ2(P) ≥ ... ≥ λn(P) be the eigenvalues of the symmetric matrix P sorted in decreasing order. It is obvious that λs(µP) = µλs(P) for any real µ ≥ 0.
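The equivalence of the truncations (1) and (2), the splitting (8) and the effect of the normalization can all be checked numerically. The following Python sketch uses assumed sizes m, n, k and assumed lengths Φ, φ (none of these values come from the paper): it builds matrices A and B satisfying conditions (3)–(6), forms X = ΦA + φB, and verifies that the eigenvalues of G = X^T X are the squared singular values of X, that G = Φ²G_A + φ²G_B, and that the normalized matrix Y equals A + B.

<pre>
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                                   # illustrative sizes
Phi, phi = 10.0, 2.0                                # assumed document lengths, Phi >= phi >= 1

# Rows of A (first k) and of B (last m - k) are nonnegative and sum to 1, as in (3)-(6).
A = np.zeros((m, n))
B = np.zeros((m, n))
A[:k] = rng.random((k, n))
A[:k] /= A[:k].sum(axis=1, keepdims=True)
B[k:] = rng.random((m - k, n))
B[k:] /= B[k:].sum(axis=1, keepdims=True)

X = Phi * A + phi * B                               # term-document matrix of Section 3
G = X.T @ X                                         # matrix of correspondences between terms

# Eigenvalues of G are the squared singular values of X, so (1) and (2) give the same core.
lam = np.sort(np.linalg.eigvalsh(G))[::-1]
sig = np.linalg.svd(X, compute_uv=False)
print(np.allclose(lam, sig ** 2))

# A^T B = 0, so G splits as in equation (8).
print(np.allclose(G, Phi ** 2 * (A.T @ A) + phi ** 2 * (B.T @ B)))

# Normalized term-document matrix: dividing each row by the document length removes Phi, phi.
Y = X / X.sum(axis=1, keepdims=True)
print(np.allclose(Y, A + B))
</pre>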
For our purposes we need the following theorems ([6], [7]).

Theorem 1. Let P and Q be symmetric matrices of the same size n × n. Then

$|\lambda_s(P) - \lambda_s(Q)| \le \|P - Q\|_E \quad \forall s = 1, \dots, n.$

Theorem 2. Let P be a symmetric matrix, let Q be a non-negative definite matrix, and let them have the same size n × n. Then

$\lambda_s(P + Q) \ge \lambda_s(P) \quad \forall s = 1, \dots, n.$

Using these theorems, we can estimate the influence of the numbers Φ and φ on the eigenvalues of the matrix G.

Theorem 3. Let (6) hold. Then

$\Phi^2 \lambda_s(G_A) \le \lambda_s(G) \le \Phi^2 \lambda_s(G_A) + \varphi^2 (m - k) \quad \forall s = 1, \dots, n.$   (11)

Proof. Since the matrices G and G_A are symmetric, it follows from Theorem 1 that for any s = 1, ..., n

$|\lambda_s(G) - \lambda_s(\Phi^2 G_A)| \le \|G - \Phi^2 G_A\|_E = \|\varphi^2 G_B\|_E = \varphi^2 \|G_B\|_E = \varphi^2 \|B^T B\|_E.$   (12)

From conditions (4) and (6) we obtain

$\|B\|_E = \sqrt{\sum_{i=k+1}^{m} \sum_{j=1}^{n} b_{ij}^2} \le \sqrt{\sum_{i=k+1}^{m} 1} = \sqrt{m - k}.$

This inequality and properties (9) and (10) give

$|\lambda_s(G) - \lambda_s(\Phi^2 G_A)| \le \varphi^2 \|B^T\|_E \|B\|_E = \varphi^2 \|B\|_E^2 = \varphi^2 (m - k).$   (13)

Since φ²G_B is a non-negative definite matrix, Theorem 2 gives, for any s = 1, ..., n,

$\lambda_s(G) \ge \lambda_s(\Phi^2 G_A).$

Using estimate (13) and the last inequality we arrive at the desired estimate (11). The proof is completed.

If λs(G_A) ≠ 0, then we can rewrite (11) in the form

$1 \le \frac{\lambda_s(G)}{\lambda_s(\Phi^2 G_A)} \le 1 + \frac{\varphi^2 (m - k)}{\Phi^2 \lambda_s(G_A)}.$

From this it follows that for any ε > 0 there exists Φ* = Φ*(m, k, φ, A) such that for any Φ > Φ* the following estimate holds:

$1 \le \frac{\lambda_s(G)}{\lambda_s(\Phi^2 G_A)} \le 1 + \varepsilon.$   (14)

If λs(G_A) = 0 (it was shown above that this is true, in particular, for s > k), then estimate (11) can be rewritten in the form

$0 \le \lambda_s(G) \le \varphi^2 (m - k).$   (15)

===4 Conclusion===

From estimates (14) and (15) it follows that the influence of long documents on the semantic core obtained by LSA is much stronger than the influence of short documents. In particular, if the differences in lengths are large enough, then the semantic core contains no information from the short documents. Hence it is necessary to use the normalized term-document matrix when all documents must be taken into account.

===References===

[1] G. Salton, A. Wong, C.-S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[3] T. Landauer, P.W. Foltz, D. Laham. Introduction to Latent Semantic Analysis. Discourse Processes, 25:259–284, 1998.

[4] D.V. Bondarchuk. Intelligent method of selection of personal recommendations, guarantees a non-empty result. Control Systems and Information Technology, 2(92):130–138, 2015.

[5] D.V. Bondarchuk, G.A. Timofeeva. Isolation of the semantic kernel based on the matrix of terms correspondence. Control Systems and Information Technology, 3.1(61):134–139, 2015.

[6] G.H. Golub, C.F. van Loan. Matrix Computations. Baltimore: Johns Hopkins University Press, 1996.

[7] E.F. Beckenbach, R. Bellman. Inequalities. New York: Springer-Verlag, 1961.
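As a closing numerical illustration of Theorem 3, the following Python sketch (again with purely illustrative, assumed values of m, n, k, Φ and φ) generates matrices A and B satisfying conditions (3)–(6), and verifies the two-sided estimate (11) together with the bound (15) for s > k.

<pre>
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 5, 3                                   # illustrative sizes
Phi, phi = 100.0, 2.0                               # long documents much longer than short ones

# Rows of A (first k) and of B (last m - k) are nonnegative and sum to 1, as in (3)-(6).
A = np.zeros((m, n))
B = np.zeros((m, n))
A[:k] = rng.random((k, n))
A[:k] /= A[:k].sum(axis=1, keepdims=True)
B[k:] = rng.random((m - k, n))
B[k:] /= B[k:].sum(axis=1, keepdims=True)

G = (Phi * A + phi * B).T @ (Phi * A + phi * B)     # G = X^T X for X = Phi*A + phi*B
G_A = A.T @ A

lam_G = np.sort(np.linalg.eigvalsh(G))[::-1]        # lambda_s(G), decreasing
lam_GA = np.sort(np.linalg.eigvalsh(G_A))[::-1]     # lambda_s(G_A), decreasing

# Estimate (11): Phi^2 lambda_s(G_A) <= lambda_s(G) <= Phi^2 lambda_s(G_A) + phi^2 (m - k).
lower = Phi ** 2 * lam_GA
upper = Phi ** 2 * lam_GA + phi ** 2 * (m - k)
tol = 1e-8
print(np.all(lower - tol <= lam_G) and np.all(lam_G <= upper + tol))

# For s > k the eigenvalues of G_A vanish, so by (15) lambda_s(G) <= phi^2 (m - k):
# the short documents contribute only a bounded, comparatively small part of the spectrum.
print(lam_G[k:], "<=", phi ** 2 * (m - k))
</pre>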