<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributional Semantic Representation for Text Classification and Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <email>barathiganesh.hb@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>m anandkumar@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <email>kp soman@.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Practice, Tata Consultancy Services</institution>
          ,
          <addr-line>Kochi - 682 042</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking (CEN)</institution>
          ,
          <institution>Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <institution>Amrita Vishwa Vidyapeetham, Amrita University</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <issue>0</issue>
      <abstract>
        <p>The objective of this experiment is to validate the performance of the distributional semantic representation of text in the classification (Question Classification) task and the Information Retrieval task. Following the distributional representation, first-level classification of the questions is performed and relevant tweets with respect to the given queries are retrieved. The distributional representation of text is obtained by performing Non-Negative Matrix Factorization on top of the Document-Term Matrix of the training and test corpus. To improve the semantic representation of the text, phrases are also considered along with the words. The proposed approach achieved an F-1 measure of 80% and a mean average precision of 0.0377 against the respective Mixed Script Information Retrieval Task 1 and Task 2 test sets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Information Retrieval (IR) and text classification are
classic applications in the text analytics domain, and they are
utilized across multiple domains and industries in various forms.
Given a text content, the classifier must be capable
of classifying it into a predefined set of classes, and given
a query, the search engine must be capable of
retrieving relevant text content from the stored collection of
text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These tasks become more complex when the text
contents are represented in more than one language, which
introduces problems during the representation as well as
while mining information out of it.
      </p>
      <p>The fundamental component in classification and retrieval
tasks is text representation, which represents the given
text in an equivalent numerical form. Later,
these numerical components are either utilized directly to perform
further actions or used to extract the features
required for performing further actions.</p>
      <p>
        Text representation methods have evolved over time to
improve the faithfulness of the representation, moving
from frequency-based representation methods to
semantic representation methods. Though other
methods such as set-theoretic Boolean systems are also available, this
paper focuses only on the Vector Space Model (VSM) and the
Vector Space Model of Semantics (VSMs) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        In VSM, the text is represented as a vector, based on
the occurrence of terms (binary matrix) or the frequency of
occurrence of terms (Term - Document Matrix, TDM) in
the given text. Here,
'terms' represents words or phrases [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Considering only the
term frequency is not sufficient, since it ignores the syntactic and
semantic information that lies within the text.
      </p>
      <p>
        The term-document matrix is inefficient due to the
biasing problem (i.e., a few terms get higher weights because
of unbalanced and uninformative data). To overcome this, the
Term Frequency - Inverse Document Frequency (TF-IDF)
representation method was introduced, which re-weighs the
terms based on their presence across the documents [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, it
tends to give higher weights to rarely
occurring words, and such words may simply be misspellings, which are
common in social media texts.
      </p>
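      <p>As a minimal sketch of this tendency, the following Python snippet computes inverse document frequencies for a toy corpus. The corpus, the misspelled token "gud" and the helper name inverse_document_frequency are illustrative assumptions, and the plain log(N / df) form is only one of several common IDF variants, not necessarily the weighting used in the cited work.</p>
      <preformat>
```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus (assumed for illustration); "gud" is a misspelling of "good".
corpus = [
    "the movie was good".split(),
    "the food was good".split(),
    "the match was gud".split(),
]
idf = inverse_document_frequency(corpus)
# Terms sorted by decreasing weight: rare tokens, including the
# misspelling, come first; terms present in every document get zero.
print(sorted(idf.items(), key=lambda kv: -kv[1]))
```
      </preformat>
      <p>In this toy example the misspelled token receives a larger weight than its correctly spelled and more frequent counterpart, which is the behaviour that motivates avoiding TF-IDF for noisy social media text.</p>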
      <p>
        The Vector Space Model of Semantics (VSMs) overcomes
the above-mentioned shortcomings by weighing terms based
on the context. This is achieved by applying
matrix factorization methods such as Singular Value
Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to the TDM
[
        <xref ref-type="bibr" rid="ref10 ref11 ref15">10, 15, 11</xref>
        ]. This makes it possible to weigh a term even though
it is not present in a given query, because matrix
factorization represents the TDM in terms of its
basis vectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this representation does not include
syntactic information, and it requires large data and is
computationally expensive because of its high dimension.
      </p>
      <p>
        Word embeddings along with the structure of the sentence
are utilized to represent short texts. This requires very
little data, and the dimension of the vector can be controlled.
However, developing the Word to Vector (Word2Vec) model
requires a very large corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It is not
considered here, since a large mixed-script text corpus is not available.
Following the representation, similarity measures are computed
in order to perform the question classification task. Here the
similarity measures are distance measures (cosine distance,
Euclidean distance, Jaccard distance, etc.) and a correlation
measure (Pearson correlation coefficient) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Considering the above-mentioned pros and cons, the proposed
approach is evaluated to observe the performance of the
distributional semantic representation of text in the classification
and retrieval tasks. The given questions are represented
as a TDM after the necessary preprocessing steps, and
NMF is applied to it to get the distributional representation.
Thereafter, distance and correlation measures
between the entropy vector of each class and the vector representation
of the question are computed in order to perform the
question classification task. In order to retrieve the relevant
tweets with respect to a given query, the cosine distance
between the query and the tweets is measured.</p>
    </sec>
    <sec id="sec-2">
      <title>DISTRIBUTIONAL REPRESENTATION</title>
      <p>This section describes the distributional
representation of the text, which is further used for the question
classification and retrieval tasks. The systematic approach
for the distributional representation is given in Figure 1.</p>
    </sec>
    <sec id="sec-3">
      <title>Problem Definition</title>
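      <p>Let <inline-formula><tex-math>Q = \{q_1, q_2, q_3, \ldots, q_n\}</tex-math></inline-formula> be the questions (<inline-formula><tex-math>q_i</tex-math></inline-formula> represents
the ith question) and <inline-formula><tex-math>C = \{c_1, c_2, \ldots, c_n\}</tex-math></inline-formula> be the classes under which the
questions fall, where n is the size of the corpus. Let <inline-formula><tex-math>T = \{t_1, t_2, \ldots, t_n\}</tex-math></inline-formula>
be the tweets to which the queries are related, where n is the size of the
corpus. The objective of the experimentation is to classify
each question into its respective predefined class in Task 1
and to retrieve the relevant tweets with respect to the input
query in Task 2.</p>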
    </sec>
    <sec id="sec-4">
      <title>Preprocessing</title>
      <p>A few terms that appear across multiple classes
create conflict in the classification; such terms
generally get low weights in the TF-IDF representation. Hence,
these terms are eliminated if they occur more than 3/4 times
across the classes and, in order to avoid sparsity of the
representation, terms with a document frequency of one
are also eliminated. The TF-IDF representation itself is not
used here, because it tends to give high weights to
rare words, which are more common in mixed-script texts.
Instead, the advantage of the TF-IDF representation is obtained indirectly
by handling the document frequency of the terms.</p>
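      <p>The two elimination rules above can be sketched as follows. This is only an illustration under stated assumptions: the function name prune_vocabulary, the toy questions and the class labels "A" and "B" are hypothetical, and "more than 3/4 times across the classes" is interpreted here as "present in more than three quarters of the classes", which the text does not state explicitly.</p>
      <preformat>
```python
def prune_vocabulary(questions, labels, max_class_ratio=0.75):
    """Return the set of terms kept after the two document-frequency filters.

    questions: list of token lists; labels: class label of each question.
    Assumption: a term is dropped when it occurs in more than
    max_class_ratio of the classes, or when its document frequency is one.
    """
    doc_freq = {}
    classes_of_term = {}
    for tokens, label in zip(questions, labels):
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
            classes_of_term.setdefault(term, set()).add(label)
    n_classes = len(set(labels))
    kept = set()
    for term, df in doc_freq.items():
        class_ratio = len(classes_of_term[term]) / n_classes
        # Keep terms seen in more than one question that are not
        # spread over too many classes.
        if df > 1 and not class_ratio > max_class_ratio:
            kept.add(term)
    return kept


# Toy data (assumed for illustration only).
questions = [
    "who won the match".split(),
    "who scored in the match".split(),
    "when is the movie release".split(),
    "when is the flight".split(),
]
labels = ["A", "A", "B", "B"]
print(sorted(prune_vocabulary(questions, labels)))
```
      </preformat>
      <p>In this toy corpus "the" is removed because it appears in every class, and terms that appear in a single question are removed to limit sparsity.</p>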
    </sec>
    <sec id="sec-6">
      <title>Vector Space Model : Term - Document Matrix</title>
      <p>In the TDM, the vocabulary is computed by finding the unique
words present in the given corpus. Then the number of times each
term occurs (term frequency) in each question is computed
against the vocabulary formed. The terms present in this
vocabulary act as first-level features.</p>
      <disp-formula id="eq1"><label>(1)</label><tex-math>A_{i,j} = \mathrm{TDM}(\mathrm{Corpus})</tex-math></disp-formula>
      <disp-formula id="eq2"><label>(2)</label><tex-math>A_{i} = \mathrm{termfrequency}(q_i)</tex-math></disp-formula>
      <p>Here, i represents the ith question and j represents the
jth term in the vocabulary. In order to improve the
representation, bi-gram and tri-gram phrases are also considered
along with the unigram words, after following the above-mentioned
preprocessing steps.</p>
    </sec>
    <sec id="sec-7">
      <title>Vector Space Model of Semantics : Distributional Representation</title>
      <p>NMF is applied to the TDM computed above to get the
distributional representation of the given corpus.</p>
      <disp-formula id="eq3"><label>(3)</label><tex-math>W_{i,r} = \mathrm{nmf}(A_{i,j})</tex-math></disp-formula>
      <p>
        In general, matrix factorization expresses a matrix as a
product of matrices, subject to the constraint that the reconstruction
error is low. The product components from the
factorization give the characteristics of the original matrix
[
        <xref ref-type="bibr" rid="ref10 ref15">10, 15</xref>
        ]. Here NMF is incorporated in the
proposed model to get the principal characteristics of the
matrix, known as basis vectors. Sentences may vary in length,
but their representation needs to be of fixed size for use
in various applications. The TDM representation followed by
Non-Negative Matrix Factorization (NMF) achieves this
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Mathematically, it can be represented as
      </p>
      <disp-formula id="eq4"><label>(4)</label><tex-math>A \approx W H^{T}</tex-math></disp-formula>
      <p>If A is the i × j original TDM, then W is the i × r basis
matrix and H is the j × r mixture matrix. A linear combination of the
basis vectors (column vectors) of W with the weights
in H gives the approximated original matrix A. While
factorizing, random values are initially assigned to W and H;
then the optimization function is applied to compute
appropriate W and H.</p>
      <disp-formula id="eq5"><label>(5)</label><tex-math>\min f_r(W, H) \equiv \lVert A - W H^{T} \rVert_F^{2} \quad \text{s.t.} \quad W, H \geq 0</tex-math></disp-formula>
      <p>Here F denotes the Frobenius norm and r is the parameter for
dimension reduction, which is set to 10 so that W is i × 10, giving a fixed-size
vector for each question.</p>
      <p>
        NMF is used for finding the basis vectors for the
following reasons: the non-negativity constraints make
interpretation more straightforward than with other factorization
methods; the selection of r is straightforward; and the basis
vectors in the semantic space are not constrained to be
orthogonal, which cannot be achieved by finding singular vectors or
eigenvectors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
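      <p>The pipeline of this section (term-document matrix over unigrams, bi-grams and tri-grams, followed by NMF) can be sketched in plain Python as below. This is a minimal sketch and not the implementation used in the experiments: the helper names, the toy questions, the number of iterations and the use of the multiplicative update rules of Lee and Seung are all assumptions, and r = 2 is used only because the toy corpus is tiny (the paper sets r = 10).</p>
      <preformat>
```python
import random


def build_tdm(questions, max_n=3):
    """Term-document matrix over word uni-, bi- and tri-grams (rows: questions)."""
    def terms(tokens):
        return [" ".join(tokens[s:s + n])
                for n in range(1, max_n + 1)
                for s in range(len(tokens) - n + 1)]
    vocab = sorted({t for q in questions for t in terms(q)})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in questions]
    for i, q in enumerate(questions):
        for t in terms(q):
            matrix[i][index[t]] += 1.0
    return matrix, vocab


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def frobenius_error(a, w, h):
    """Squared Frobenius norm of A - W H^T."""
    approx = matmul(w, transpose(h))
    return sum((a[i][j] - approx[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))


def nmf(a, r, iterations=200, seed=0, eps=1e-9):
    """Factorize A (i x j) into W (i x r) and H (j x r) with A ~ W H^T.

    W and H start from random non-negative values and are refined with
    multiplicative updates, so they stay non-negative throughout.
    """
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(r)] for _ in a]
    h = [[rng.random() for _ in range(r)] for _ in a[0]]
    a_t = transpose(a)
    for _ in range(iterations):
        # H = H * (A^T W) / (H W^T W), element-wise
        num, den = matmul(a_t, w), matmul(h, matmul(transpose(w), w))
        h = [[h[j][k] * num[j][k] / (den[j][k] + eps) for k in range(r)]
             for j in range(len(h))]
        # W = W * (A H) / (W H^T H), element-wise
        num, den = matmul(a, h), matmul(w, matmul(transpose(h), h))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(r)]
             for i in range(len(w))]
    return w, h


# Toy questions (assumed for illustration only).
questions = [
    "who won the match".split(),
    "who scored in the match".split(),
    "when is the movie release".split(),
]
A, vocab = build_tdm(questions)
W, H = nmf(A, r=2)
print(len(W), len(W[0]))  # number of questions, r
```
      </preformat>
      <p>Each row of W is then the fixed-size distributional vector of one question, regardless of how many words the question contains.</p>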
    </sec>
    <sec id="sec-8">
      <title>QUESTION CLASSIFICATION</title>
      <p>Question Answering (QA) systems are becoming necessary
units in industrial as well as non-industrial
domains. In particular, they try to automate the manual
effort required in personal assistance systems and virtual
agents. With this background, the remainder of this
section describes the proposed approach to the question
classification task.</p>
      <p>
        For this experiment, the data set was provided by the
Mixed Script Information Retrieval (MSIR) task committee
[
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ]. The detailed statistics of the training and
testing sets are given in Table 1.
      </p>
      <p>The objective of the task is to classify the given question into
its corresponding class. The distributional representations
of the given training and testing corpora are computed as
described in the previous section. The systematic diagram
for the remaining approach is given in Figure 2.</p>
      <p>After the representation, the reference vector for each
class is computed by summing up the question vectors in
that class. This reference vector acts as an entropy vector for
its corresponding class. This is mathematically represented
as</p>
      <disp-formula id="eq6"><label>(6)</label><tex-math>r_c = \sum q_i \quad \text{s.t.} \quad q_i \in c</tex-math></disp-formula>
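      <p>A minimal sketch of this summation is given below. The function name class_reference_vectors, the toy vectors and the labels "A" and "B" are hypothetical and serve only to illustrate the computation.</p>
      <preformat>
```python
def class_reference_vectors(question_vectors, labels):
    """Sum the question vectors of each class into one reference vector."""
    references = {}
    for vector, label in zip(question_vectors, labels):
        if label not in references:
            references[label] = [0.0] * len(vector)
        references[label] = [r + v for r, v in zip(references[label], vector)]
    return references


# Toy distributional vectors (assumed); two questions in class "A", one in "B".
vectors = [[1.0, 0.0], [0.5, 2.0], [0.0, 3.0]]
labels = ["A", "A", "B"]
refs = class_reference_vectors(vectors, labels)
print(refs)
```
      </preformat>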
      <p>Then the similarity measures between the question vector qi
and the reference vectors in R (the set of class reference vectors) are computed. The similarity
measures computed are given in Table 2. These similarity
measures are taken as the attributes for the
supervised classification algorithm.</p>
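      <p>The construction of these attributes can be sketched as follows. Since Table 2 is not reproduced here, the choice of cosine distance, Euclidean distance and Pearson correlation is an assumption based on the measures named in the introduction, and the function names and toy vectors are hypothetical.</p>
      <preformat>
```python
import math


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def similarity_features(question_vector, references):
    """One (cosine, Euclidean, Pearson) triple per class, in sorted class order."""
    features = []
    for label in sorted(references):
        ref = references[label]
        features.extend([cosine_distance(question_vector, ref),
                         euclidean_distance(question_vector, ref),
                         pearson_correlation(question_vector, ref)])
    return features


# Toy reference vectors for two hypothetical classes.
references = {"A": [1.5, 2.0, 0.5], "B": [0.0, 3.0, 1.0]}
features = similarity_features([1.0, 2.0, 0.0], references)
print(features)
```
      </preformat>
      <p>The resulting list has one group of values per class and can be passed as the attribute vector of a question to any supervised classifier.</p>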
      <p>A Random Forest Tree (RFT) classifier with nCpn number of
trees is utilized to perform the supervised classification.
In order to ensure the performance, 10-fold 10-cross
validation is performed during training, which yields 82%
precision. The proposed approach yields 79.44% accuracy
on the test set, and the statistics about the results
are tabulated in Table 3. In total, 3 runs were
submitted to the task committee, which differ in the final
classification algorithm. This paper describes
the approach that yields the best performance.</p>
    </sec>
    <sec id="sec-9">
      <title>INFORMATION RETRIEVAL</title>
      <p>The information shared through social media is huge,
and its representation poses various challenges. This
motivates research on extracting useful
information from it. IR is part of such research;
it is a basic component of text analytics and serves as an
input to other applications. One of the major problems in IR
is handling transliterated texts, which
make the representation problem more complex.</p>
      <sec id="sec-9-1">
        <p>Teams listed in the question classification results table (Table 3): AmritaCEN, AMRITA-CEN-NLP, Anuj, BITS PILANI, IINTU, IIT(ISM)D*, NLP-NITMZ.</p>
        <p>
          For this experiment, the data set was provided by the
Mixed Script Information Retrieval (MSIR) task committee
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The detailed statistics of the training and testing
sets are given in Table 4.
        </p>
        <p>The objective of this task is to retrieve the top 20 relevant
tweets from the corpus with respect to the input query.
First, the queries and the corpus are distributionally represented
as described in Section 3.</p>
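        <p>Using the cosine distance between query and tweets mentioned in the introduction, the retrieval step can be sketched as below. The function names and the toy vectors are hypothetical; the vectors stand in for the distributional representations of the query and the tweets.</p>
        <preformat>
```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vector, tweet_vectors, k=20):
    """Return the indices of the k tweets closest to the query.

    Ranking by decreasing cosine similarity is the same as ranking by
    increasing cosine distance (1 - similarity).
    """
    scored = [(cosine_similarity(query_vector, vec), idx)
              for idx, vec in enumerate(tweet_vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy tweet vectors (assumed for illustration only).
tweets = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.0, 0.0]]
top = retrieve_top_k([1.0, 0.0], tweets, k=2)
print(top)
```
        </preformat>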
        <p>
          The proposed approach, based on the distributional representation,
yields a mean average precision of 0.0377 on the test queries,
which is the best among the approaches proposed in this
task [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The statistics of the obtained results are given
in Table 5.
        </p>
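        <p>For reference, mean average precision in its standard binary-relevance form can be computed as sketched below. The function names and the toy query and tweet identifiers are hypothetical, and the official evaluation of the task may differ in details such as rank cut-offs or graded relevance.</p>
        <preformat>
```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of the per-query average precision values."""
    scores = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(scores) / len(scores)


# Toy rankings and relevance judgements (assumed for illustration only).
rankings = {"q1": ["t3", "t1", "t7"], "q2": ["t2", "t9"]}
relevance = {"q1": ["t1", "t7"], "q2": ["t5"]}
map_score = mean_average_precision(rankings, relevance)
print(map_score)
```
        </preformat>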
      </sec>
      <sec id="sec-9-2">
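        <table-wrap id="tab5">
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>Team</th><th>Mean Average Precision</th></tr>
            </thead>
            <tbody>
              <tr><td>UB</td><td>0.0217</td></tr>
              <tr><td>Anuj</td><td>0.0209</td></tr>
              <tr><td>Amrita CEN</td><td>0.0377</td></tr>
              <tr><td>NLP-NITMZ</td><td>0.0203</td></tr>
              <tr><td>NITA NITMZ</td><td>0.0047</td></tr>
              <tr><td>CEN@Amrita</td><td>0.0315</td></tr>
              <tr><td>IIT(ISM)D</td><td>0.0083</td></tr>
            </tbody>
          </table>
        </table-wrap>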
        <p>The classification task and the retrieval task are developed
based on the distributional representation of the text, by
utilizing the term-document matrix and non-negative matrix
factorization. The proposed approach performed well
in both tasks, but there is still room for
improvement. Though the distributional representation methods
performed well, they suffer from the well-known 'Curse
of Dimensionality' problem. This calls for further research in feature
engineering that directly reduces the dimension of the
term-document matrix. Hence, future work will focus
on improving the performance of the retrieval and reducing the
dimensionality of the representation basis vectors.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>A survey of text classification algorithms</article-title>
          .
          <source>In Mining text data</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Msir@fire: Overview of the mixed script information retrieval</article-title>
          .
          <source>In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Sudip</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>The first cross-script code-mixed question answering corpus</article-title>
          .
          <source>In Modelling, Learning and mining for Cross/Multilinguality Workshop</source>
          , pages
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Soman</surname>
          </string-name>
          .
          <article-title>Amrita cen at semeval-2016 task 1: Semantic relation from word embeddings in higher dimension</article-title>
          .
          <source>Proceedings of SemEval-2016</source>
          , pages
          <fpage>706</fpage>
          -
          <lpage>711</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Blacoe</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          .
          <article-title>A comparison of vector-based representations for semantic composition</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          , pages
          <fpage>546</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Cha</surname>
          </string-name>
          .
          <article-title>Comprehensive survey on distance/similarity measures between probability density functions</article-title>
          .
          <source>City</source>
          ,
          <volume>1</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Juan</surname>
          </string-name>
          .
          <article-title>Using tf-idf to determine word relevance in document queries</article-title>
          .
          <source>In Proceedings of the first instructional conference on machine learning</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Seung</surname>
          </string-name>
          .
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Manwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mahalle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chinchkhede</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Chavan</surname>
          </string-name>
          .
          <article-title>A vector space model for information retrieval: A matlab approach</article-title>
          .
          <source>Indian Journal of Computer Science and Engineering</source>
          ,
          <volume>3</volume>
          :
          <fpage>222</fpage>
          -
          <lpage>229</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pat</surname>
          </string-name>
          .
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Indian Journal of Computer Science and Engineering.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Reshma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Author identification based on word distribution in word space</article-title>
          .
          <source>In Advances in Computing, Communications and Informatics (ICACCI)</source>
          , pages
          <fpage>1519</fpage>
          -
          <lpage>1523</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roshdi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Roohparvar</surname>
          </string-name>
          . Review:
          <article-title>Information retrieval techniques and applications</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Anita</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chung-Shu</surname>
          </string-name>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>18</volume>
          :
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Dynamic pooling and unfolding recursive autoencoders for paraphrase detection</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>801</fpage>
          -
          <lpage>809</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          .
          <article-title>Document clustering based on non-negative matrix factorization</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>267</fpage>
          -
          <lpage>273</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          .
          <article-title>Comparing matrix methods in text-based information retrieval</article-title>
          .
          <source>Tech. Rep., School of Mathematical Sciences</source>
          , Peking University,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>