<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spectral properties of a matrix of correspondences between terms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>1 { Ural State University of Railway Transport (Yekaterinburg, Russia)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>2 { Institute of Economics, The Ural Branch of Russian Academy of Sciences</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>186</fpage>
      <lpage>190</lpage>
      <abstract>
        <p>The semantic core is widely used for a compact representation of documents in problems of classi cation and intelligent search. In order to create the semantic core, various methods are available, including the singular value decomposition of a term-document matrix and the eigenvalue decomposition of a matrix of correspondences between terms. In the present paper, we investigate how lengths of documents in uence the semantic core created by these methods.</p>
      </abstract>
      <kwd-group>
        <kwd>intelligent search</kwd>
        <kwd>semantic core</kwd>
        <kwd>term-document matrix</kwd>
        <kwd>matrix of correspondences between terms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Vector models are a simple way to represent a document. Such models are based on the frequency of occurrence
of words and do not take into account grammar, semantics or other features of texts in a natural language
[1]. The vector models are often used to analyze relationships between a set of documents and terms that are
contained in these documents. The relationships are presented as a large-dimensional matrix. In order to reduce
the dimension of the space, some methods were proposed, such as latent semantic analysis (LSA) [
        <xref ref-type="bibr" rid="ref1">2, 3</xref>
        ]. This
method operates with a term-document matrix. A slightly di erent method based on matrix of correspondences
between terms (MCT) was used in papers [
        <xref ref-type="bibr" rid="ref2 ref3">4, 5</xref>
        ]. In the present paper, we consider some mathematical aspects
of this method.
      </p>
      <p>Latent semantic analysis and MCT
Suppose we have a collection of text documents D = fd1; d2; :::; dmg with terms from the collection W =
fw1; w2; :::; wng. We present the data as the n m matrix X = (xij ) with xij = tf (di; wj ); where the term
frequency tf (d; w) 2 N shows how many times the term w occurs in the document d. The matrix X is called a
term-document matrix. Rows of X correspond to the documents and columns of X correspond to the terms.</p>
      <p>
        The main idea of the latent semantic analysis is as follows [
        <xref ref-type="bibr" rid="ref1">3</xref>
        ]. In view of the singular value decomposition,
the matrix X can be presented in the form
where U is a m m real unitary matrix, S is a m n rectangular diagonal matrix with non-negative real numbers
on the diagonal and V is a n n real unitary matrix. The diagonal elements i of S are known as the singular
values of X. The matrices U and V can be written such that the singular values are in descending order:
1
2
r &gt; r+1 = =
n = 0:
In order to reduce the dimension we choose the critical value
rows and columns corresponding to the singular values i &lt;
respectively, which specify the matrix
&gt; 0. In the matrices U , V and S we drop the
. As a result we obtain the matrices Ue , Ve and Se
The matrix Xe is a low-rank approximation of the matrix X [
        <xref ref-type="bibr" rid="ref1">3</xref>
        ]. Decomposition (1) allows to build the semantic
core of the documents collection D.
      </p>
      <p>Another way to identify the semantic core is to employ the matrix G = fgij g, where gij is some measure of
proximity between the terms wi and wj . The matrix G is called a matrix of correspondences between terms. If
we put
gij = gji =
m
X tf (dk; wi)tf (dk; wj );
k=1
then
Since the matrix G is a symmetric non-negative de nite matrix, it can be presented in the form
Xe = Ue SeVe T :
G = XT X:
G = T ZT T ;</p>
      <p>Ge = TeZeTeT :
yij =</p>
      <p>xij
Pn
j=1 xij
:
(1)
(2)
(3)
(4)
where T is a real unitary matrix and Z is a diagonal matrix with non-negative real numbers on the diagonal.
The diagonal elements i of Z are known as eigenvalues of G. If we truncate the matrices T and Z according to
the condition i &lt; , we obtain
Note that decompositions (1) and (2) are equivalent, because i =</p>
      <p>
        Let us consider the matrix Y = fyij g, where
i2 and Te = Ve if
= ( )2 [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ].
      </p>
      <p>
        The matrix Y is called a normalized term-document matrix. This matrix can be used instead of X for building
of the semantic core. When all documents have the same length, the matrices X and Y yield the same result. In
the opposite case, these results are signi cantly di erent from each other [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ]. The natural question arises: How
do the lengths of documents in uence the result of LSA? We answer to this question in the next section.
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Main results</title>
      <p>Let us consider the collection of text documents D = fd1; d2; :::; dmg with terms from the collection W =
fw1; w2; :::; wng. And let us suppose that lengths of the rst k documents are and lengths of the rest of the
m;n
documents are . It is easily seen that the term-document matrix X = (xij )i=1;j=1 has the elements
xij =
aij
bij
for i = 1; k; j = 1; n;
for i = k + 1; m; j = 1; n;
where the real numbers ; ; aij ; bij satisfy the following conditions
aij
0</p>
      <p>8i = 1; k; j = 1; n;
bij
0
8i = k + 1; m; j = 1; n;
It is convenient to write the matrix X in the form
where
kP kE = uvut Xn pi2j :</p>
      <p>i;j=1
kP QkE</p>
      <p>kP kE kQkE ;
kP T kE = kP kE :
(5)
(6)
(7)
(8)
(9)
(10)
:::</p>
      <p>n(P ) be
s(P ) for any
The value kP kE is called the Euclidian norm of the matrix P . Note, that for any matrices P and Q, we have</p>
      <p>
        All eigenvalues of a symmetric matrix are real, so they can be ordered. Let 1(P ) 2(P )
eigenvalues of the symmetric matrix P sorted in decreasing order. It is obvious that s( P ) =
real . For our purposes we need the following theorems ([
        <xref ref-type="bibr" rid="ref4">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]).
      </p>
      <p>Theorem 1. Let P and Q be symmetric matrices of the same size n n, then
j s(P )
s(Q)j
kP</p>
      <p>QkE</p>
      <p>Theorem 2. Let P be a symmetric matrix, Q be a non-negative de nite matrix and let them have the same
size n n, then
s(P + Q)
s(P )
Using these theorems, we can estimate the in uence of the numbers
Theorem 3. Let (6) hold, then
and</p>
      <p>on eigenvalues of the matrix G.
2 s(GA)
s(G)
2 s(GA) +
2(m
k)
kBkE = uvut Xm
n
X aij</p>
      <p>2
i=k+1 j=1
utuv Xm 1 = pm
i=k+1</p>
      <p>k:
j s(G)
s( 2GA)j
2kBT kE kBkE =
2kBk2E =
2(m
k):
Since 2GB is non-negative de nite matrix, then from Theorem 2 we get for any s = 1; n
Using estimate (13) and the last inequality we arrive at the desired estimate (11). The proof is completed.</p>
      <p>If s(GA) 6= 0, then we can rewrite (11) in the form
(11)
(12)
(13)
(14)
(15)
s(G)</p>
      <p>s( 2GA):
1</p>
      <p>s(G)
s( 2GA)
1 +
2(m
2 s(GA)
k)</p>
      <p>:
1
0</p>
      <p>s(G)
s( 2GA)</p>
      <p>1 + ":
s(G)
2(m</p>
      <p>k):
From this it follows that for any " &gt; 0 there exists
estimate holds
=
(m; k; ; A), such that for any
&gt;
the following
If s(GA) = 0 (it was shown above that this is true, in particular, for s &gt; n
rewritten in the form
k), then estimate (11) can be
4</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion References</title>
      <p>From estimates (14) and (15) it follows that the in uence of long documents on the semantic core obtained by
LSA is much stronger than the in uence of short documents. In particular, if the di erences in lengths are large
enough, then the semantic core does not contain information from short documents. Hence it is necessary to use
the normalized term-document matrix in the case when all documents must be taken into account.
[1] G. Salton, A. Wong, C.-S. Yang. A vector space model for automatic indexing. Communications of the</p>
      <p>ACM, 18(11):613{620, 1975.
[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. Indexing by Latent Semantic</p>
      <p>Analysis. Journal of the American Society for Information Science, 41(6):391{407, 1999.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Laham</surname>
          </string-name>
          .
          <article-title>Introduction to Latent Semantic Analysis</article-title>
          .
          <source>Discourse Processes</source>
          ,
          <volume>25</volume>
          :
          <fpage>259</fpage>
          {
          <fpage>284</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Bondarchuk</surname>
          </string-name>
          .
          <article-title>Intelligent method of selection of personal recommendations, guarantees a non-empty result</article-title>
          .
          <source>Control Systems and Information Technology</source>
          ,
          <volume>2</volume>
          (
          <issue>92</issue>
          ):
          <volume>130</volume>
          {
          <fpage>138</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Bondarchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.A.</given-names>
            <surname>Timofeeva</surname>
          </string-name>
          .
          <article-title>Isolation of the semantic kernel based on the matrix of terms correspondence</article-title>
          .
          <source>Control Systems and Information Technology</source>
          ,
          <volume>3</volume>
          .1(
          <issue>61</issue>
          ):
          <volume>134</volume>
          {
          <fpage>139</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.H.</given-names>
            <surname>Golub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.F. van Loan. Matrix</given-names>
            <surname>Computations</surname>
          </string-name>
          . Baltimore: Johns Hopkins University Press,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.F.</given-names>
            <surname>Beckenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bellman</surname>
          </string-name>
          . Inequalities. New York: Springer-Verlag,
          <year>1961</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>