Introduction

Spectral properties of a matrix of correspondences between terms

1 { Ural State University of Railway Transport (Yekaterinburg, Russia)

0 0 2 { Institute of Economics, The Ural Branch of Russian Academy of Sciences , Yekaterinburg , Russia

186 190

The semantic core is widely used for a compact representation of documents in problems of classi cation and intelligent search. In order to create the semantic core, various methods are available, including the singular value decomposition of a term-document matrix and the eigenvalue decomposition of a matrix of correspondences between terms. In the present paper, we investigate how lengths of documents in uence the semantic core created by these methods.

intelligent search semantic core term-document matrix matrix of correspondences between terms

Introduction

Vector models are a simple way to represent a document. Such models are based on the frequency of occurrence of words and do not take into account grammar, semantics or other features of texts in a natural language [1]. The vector models are often used to analyze relationships between a set of documents and terms that are contained in these documents. The relationships are presented as a large-dimensional matrix. In order to reduce the dimension of the space, some methods were proposed, such as latent semantic analysis (LSA) [ 2, 3 ]. This method operates with a term-document matrix. A slightly di erent method based on matrix of correspondences between terms (MCT) was used in papers [ 4, 5 ]. In the present paper, we consider some mathematical aspects of this method.

Latent semantic analysis and MCT Suppose we have a collection of text documents D = fd1; d2; :::; dmg with terms from the collection W = fw1; w2; :::; wng. We present the data as the n m matrix X = (xij ) with xij = tf (di; wj ); where the term frequency tf (d; w) 2 N shows how many times the term w occurs in the document d. The matrix X is called a term-document matrix. Rows of X correspond to the documents and columns of X correspond to the terms.

The main idea of the latent semantic analysis is as follows [ 3 ]. In view of the singular value decomposition, the matrix X can be presented in the form where U is a m m real unitary matrix, S is a m n rectangular diagonal matrix with non-negative real numbers on the diagonal and V is a n n real unitary matrix. The diagonal elements i of S are known as the singular values of X. The matrices U and V can be written such that the singular values are in descending order: 1 2 r > r+1 = = n = 0: In order to reduce the dimension we choose the critical value rows and columns corresponding to the singular values i < respectively, which specify the matrix > 0. In the matrices U , V and S we drop the . As a result we obtain the matrices Ue , Ve and Se The matrix Xe is a low-rank approximation of the matrix X [ 3 ]. Decomposition (1) allows to build the semantic core of the documents collection D.

Another way to identify the semantic core is to employ the matrix G = fgij g, where gij is some measure of proximity between the terms wi and wj . The matrix G is called a matrix of correspondences between terms. If we put gij = gji = m X tf (dk; wi)tf (dk; wj ); k=1 then Since the matrix G is a symmetric non-negative de nite matrix, it can be presented in the form Xe = Ue SeVe T : G = XT X: G = T ZT T ;

Ge = TeZeTeT : yij =

xij Pn j=1 xij : (1) (2) (3) (4) where T is a real unitary matrix and Z is a diagonal matrix with non-negative real numbers on the diagonal. The diagonal elements i of Z are known as eigenvalues of G. If we truncate the matrices T and Z according to the condition i < , we obtain Note that decompositions (1) and (2) are equivalent, because i =

Let us consider the matrix Y = fyij g, where i2 and Te = Ve if = ( )2 [ 5 ].

The matrix Y is called a normalized term-document matrix. This matrix can be used instead of X for building of the semantic core. When all documents have the same length, the matrices X and Y yield the same result. In the opposite case, these results are signi cantly di erent from each other [ 5 ]. The natural question arises: How do the lengths of documents in uence the result of LSA? We answer to this question in the next section. 3

Main results

Let us consider the collection of text documents D = fd1; d2; :::; dmg with terms from the collection W = fw1; w2; :::; wng. And let us suppose that lengths of the rst k documents are and lengths of the rest of the m;n documents are . It is easily seen that the term-document matrix X = (xij )i=1;j=1 has the elements xij = aij bij for i = 1; k; j = 1; n; for i = k + 1; m; j = 1; n; where the real numbers ; ; aij ; bij satisfy the following conditions aij 0

8i = 1; k; j = 1; n; bij 0 8i = k + 1; m; j = 1; n; It is convenient to write the matrix X in the form where kP kE = uvut Xn pi2j :

i;j=1 kP QkE

kP kE kQkE ; kP T kE = kP kE : (5) (6) (7) (8) (9) (10) :::

n(P ) be s(P ) for any The value kP kE is called the Euclidian norm of the matrix P . Note, that for any matrices P and Q, we have

All eigenvalues of a symmetric matrix are real, so they can be ordered. Let 1(P ) 2(P ) eigenvalues of the symmetric matrix P sorted in decreasing order. It is obvious that s( P ) = real . For our purposes we need the following theorems ([ 6 ], [ 7 ]).

Theorem 1. Let P and Q be symmetric matrices of the same size n n, then j s(P ) s(Q)j kP

QkE

Theorem 2. Let P be a symmetric matrix, Q be a non-negative de nite matrix and let them have the same size n n, then s(P + Q) s(P ) Using these theorems, we can estimate the in uence of the numbers Theorem 3. Let (6) hold, then and

on eigenvalues of the matrix G. 2 s(GA) s(G) 2 s(GA) + 2(m k) kBkE = uvut Xm n X aij

2 i=k+1 j=1 utuv Xm 1 = pm i=k+1

k: j s(G) s( 2GA)j 2kBT kE kBkE = 2kBk2E = 2(m k): Since 2GB is non-negative de nite matrix, then from Theorem 2 we get for any s = 1; n Using estimate (13) and the last inequality we arrive at the desired estimate (11). The proof is completed.

If s(GA) 6= 0, then we can rewrite (11) in the form (11) (12) (13) (14) (15) s(G)

s( 2GA): 1

s(G) s( 2GA) 1 + 2(m 2 s(GA) k)

: 1 0

s(G) s( 2GA)

1 + ": s(G) 2(m

k): From this it follows that for any " > 0 there exists estimate holds = (m; k; ; A), such that for any > the following If s(GA) = 0 (it was shown above that this is true, in particular, for s > n rewritten in the form k), then estimate (11) can be 4

Conclusion References

From estimates (14) and (15) it follows that the in uence of long documents on the semantic core obtained by LSA is much stronger than the in uence of short documents. In particular, if the di erences in lengths are large enough, then the semantic core does not contain information from short documents. Hence it is necessary to use the normalized term-document matrix in the case when all documents must be taken into account. [1] G. Salton, A. Wong, C.-S. Yang. A vector space model for automatic indexing. Communications of the

ACM, 18(11):613{620, 1975. [2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. Indexing by Latent Semantic

Analysis. Journal of the American Society for Information Science, 41(6):391{407, 1999.

[3]

Landauer ,

P.W.

Foltz ,

Laham . Introduction to Latent Semantic Analysis . Discourse Processes , 25 : 259 { 284 , 1998 .

[4]

D.V.

Bondarchuk . Intelligent method of selection of personal recommendations, guarantees a non-empty result . Control Systems and Information Technology , 2 ( 92 ): 130 { 138 , 2015 .

[5]

D.V.

Bondarchuk ,

G.A.

Timofeeva . Isolation of the semantic kernel based on the matrix of terms correspondence . Control Systems and Information Technology , 3 .1( 61 ): 134 { 139 , 2015 .

[6]

G.H.

Golub ,

C.F. van Loan. Matrix

Computations . Baltimore: Johns Hopkins University Press, 1996 .

[7]

E.F.

Beckenbach ,

Bellman . Inequalities. New York: Springer-Verlag, 1961 .