<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Statistical Semantics in Context Space: Amrita CEN@Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <email>barathiganesh.hb@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>m_anandkumar@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <email>kp_soman@amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Artificial Intelligence Practice, Tata Consultancy Services</institution>
          ,
          <addr-line>Kochi - 682 042, Kerala</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking, Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <institution>Amrita Vishwa Vidyapeetham, Amrita University</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The language people share differs due to diversity in their ethnicity, socioeconomic status, gender, religion, sexual orientation, geographical area, accent, pronunciation and word usage. This leads to the hypothesis that shared language follows an unknown hidden pattern. Under this hypothesis, determining attributes of a person such as age, gender, personality and nativity has multiple applications in social media, forensic science, marketing analysis, e-commerce and e-security. This work advances research on author profiling by overcoming existing language-dependent, domain-dependent and lexicon-based author profiling methods, finding a user's sociolect aspects from the author's statistical pattern of semantics in context space. The method proves to be domain and language independent, achieving near-constant performance over the English, Dutch and Spanish corpora.</p>
      </abstract>
      <kwd-group>
<kwd>Author Profiling</kwd>
        <kwd>Context Space</kwd>
        <kwd>Distributional Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        The amount of language shared through the Internet is growing rapidly with social media resources like Facebook, Twitter, LinkedIn and Pinterest and chat resources like Hike, WhatsApp and WeChat [<xref ref-type="bibr" rid="ref1">1</xref>]. This growth encourages recommendation and Internet marketing among the users of a particular resource, and business organizations use it for marketing, market analysis, advertising and connecting with customers [<xref ref-type="bibr" rid="ref2">2</xref>]. This creates the need for Author Profiling (AP) [15] to discover a user's sociolect aspects from shared language. The complication is that, unlike other natural language text, the language people share on social media is short, which makes it hard to extract information from it.
      </p>
<p>
        People have engaged with authorship tasks since the times of the ancient Greek playwrights: recognizing the age, gender, native language, personality and many more facets that frame the profile of a particular person. It finds application in different areas such as forensic security, literary research, marketing analysis, industry, online messengers, e-commerce, chats in mobile applications and medical applications to treat neuroticism [<xref ref-type="bibr" rid="ref2">2</xref>]. Forensic linguistics came into existence only after 1968. In this sector, the police register is one of the areas under security, in which statements taken down by the police act as source texts for Author Profiling (AP), and legal investigation continues its examination across all fields of suspicion.
      </p>
<p>
        In marketing, online customer reviews in blogs and sites help consumers decide what to buy. Detecting the age and gender of the person who posted feedback paves the way for owners to improve their business strategy [<xref ref-type="bibr" rid="ref2">2</xref>]. Industries benefit from customers' suggestions and reviews, from which they can group the most likely products based on gender and age. Twitter and Facebook are the most popular social media sites. A survey from last year shows that about 236 million users sign up every month to the micro-blogging site Twitter and 1.44 billion to Facebook, but among them 83.09 million are fake accounts [<xref ref-type="bibr" rid="ref1">1</xref>]. Authors under the age of 13 and authors holding more than one account are noted as fake accounts that have to be taken care of. There may also be anonymous users who maintain many fake IDs and post messages and chat with innocent people in order to trap them.
      </p>
<p>
        In general, a Machine Learning (ML) algorithm can attain this objective if given relevant features, and most existing methods follow this approach [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>]. The most commonly used features for AP are the author's style-based features (punctuation marks, usage of capitals, POS tags, sentence length, repeated usage of words, quotations), content-based features (topic-related words, words present in dictionaries), content and typographical ease, words that express sentiments and emotions together with emoticons, special words from which information can be extracted, collocations and n-grams. These features depend on a lexicon that varies with topic, genre and language. In ML, low-dimensional condensed vectors exhibiting the relation between terms, documents and the profile were built using Concise Semantic Analysis (CSA) to create second-order attributes (SOA), which were classified using a linear model that was sensitive to high-dimensional problems. This system was extended in 2014 to make profiling more precise: by generating highly informative attributes (creating sub-profiles) with the Expectation Maximization Clustering (EMC) algorithm, the extended system was able to group sub-classes within a cluster and exhibit relations between sub-profiles. Though this system was successful, it remained dependent on language and genre [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>].
      </p>
<p>
        The syntactic and lexical features utilized in earlier models vary with the morphological and agglutinative nature of the language, and also with the domain in which AP is performed. This makes it difficult for classifying algorithms to learn from these features a unified and effective classification model that is independent of domain and language, as can be observed from system performance in the PAN-AP shared tasks [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
      </p>
<p>
        To overcome these conflicts, this paper proposes a model based on statistical semantics extracted from an author's digital text. Statistical semantics advances research on relational similarity by including statistical features of word distribution along with the traditional semantic features utilized in Latent Semantic Analysis (LSA) [<xref ref-type="bibr" rid="ref6">6</xref>]. The sociolect aspects and vocabulary knowledge of a person vary due to human cognitive phenomena, which induce and also limit people of a particular gender and age group to a certain range of words to convey their message. By utilizing this word distribution in context space and its statistical features, the gender and age group of a particular author are identified in this work. The basic idea is to utilize the distributional representation of an author's document to aggregate the statistical semantic information and promote a set of constraints for finding related hypotheses about that author's document.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
<p>
        Burger et al. collected a large number of tweets and also evaluated human performance using Amazon Mechanical Turk (AMT). Their data included 213 million multilingual tweets in total from 18.5 million users. As tweets include many other contents like emoticons and images, feature extraction was limited to a particular n-gram length, with 15,572,522 total distinct features. Word-level and character-level n-grams were chosen; no language-specific processing was done, and only n-gram counts were taken into account. Once features were extracted, the classifiers SVM, Naive Bayes and Winnow2 were evaluated, of which Winnow2 performed exceptionally well with an overall accuracy of 92%. Their work addressed only gender classification [<xref ref-type="bibr" rid="ref7">7</xref>].
      </p>
<p>
        Liao et al. observed that access to the colossal amount of user-generated data enables the study of lifetime semantic variation of individuals. The central premise of their model is that age impacts the topic mixture of a user, and every topic has a distinct age distribution. They used a Gibbs EM algorithm to estimate the model and were able to find both the word distribution and the age distribution from the sample of Twitter data they collected. They treated tweets as bag-of-words content, thus performing well and effectively mapping topics to ages [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
      </p>
<p>
        López-Monroy et al. framed their methodology around the idea of second-order attributes (a low-dimensional and dense document representation), but went beyond consolidating information across each target profile. The proposed representation extended the analysis by fusing information among texts in the same profile; that is, they concentrated on sub-profiles. For this, they automatically discovered sub-profiles and built document vectors that represent more detailed relations between documents and sub-profile documents. The results show evidence of the usefulness of intra-profile information for determining gender and age profiles. The sub-profile or intra-profile information of each author was found using the Expectation Maximization Clustering (EMC) algorithm [<xref ref-type="bibr" rid="ref9">9</xref>].
      </p>
      </p>
<p>
        Maharjan et al. used the MapReduce programming paradigm for most parts of their processing pipeline, which makes their framework fast. Their framework uses word n-grams, including stopwords, punctuation and emoticons, as features and TF-IDF (term frequency-inverse document frequency) as the weighting scheme. These were fed to a logistic regression classifier that predicts the age and gender of the authors. MapReduce distributed the tasks among many machines and made the work easier and faster [<xref ref-type="bibr" rid="ref10">10</xref>].
      </p>
      </p>
<p>
        Unlike PAN 2016, in PAN 2013, PAN 2014 and PAN 2015 the training and testing were done on similar domains. In most of the work, authors' stylistic features, readability, domain-specific features (emoticons, hash tags), lexical features and LSA-based features, along with projection-based, regression-based and clustering-based classifiers, are used to achieve the objective, and most of the proposed systems vary in accuracy across different domains and languages [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Mathematical Background</title>
<p>This section first presents the problem definition, followed by the mathematical modeling of the idea described in section 1 for building the AP model.</p>
      <sec id="sec-3-1">
<title>Problem Definition</title>
<p>In general, the solution is to build a training model from the given problem set pt = {d1, d2, ..., dm} and to map each document's author to a specific gender and age group, pt → (gender, age group).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Training Phase</title>
<p>
          Step 1 - Construct the document-term matrix [Vi,j]m×n, where m is the total number of documents (the total number of authors) in pt and n is the size of the vocabulary [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], with
[Vi,j] = term frequency(vi,j), (1 ≤ i ≤ m) and (1 ≤ j ≤ n)
        </p>
<p>[V] = VSM(pt), (1 ≤ t ≤ m)</p>
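<p>Step 1 can be sketched as follows; this is a minimal illustration assuming scikit-learn's CountVectorizer as the VSM, with toy documents standing in for the PAN corpus.</p>

```python
# Build the document-term matrix [V]_{m x n} of raw term frequencies.
# Sketch only: CountVectorizer is assumed as the VSM; the documents are toy data.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great phone love the camera",
    "camera quality is great",
    "battery life could be better",
]

vectorizer = CountVectorizer()
V = vectorizer.fit_transform(docs).toarray()  # one row per document (author)

m, n = V.shape  # m documents, n vocabulary terms
```

<p>Each row of V is one author's document, so V directly instantiates the problem set pt of the previous paragraph.</p>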
<p>
          Step 2 - The underlying semantic information and relations between authors' documents can be obtained as latent vectors by finding the basis vectors of VV^T, which is the column space of V. This column space is called the context space with respect to the author's documents. The computed basis vectors span the context space by satisfying the following condition [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
V ≈ WH^T
(3)
min ||V − WH^T||F², subject to W ≥ 0, H ≥ 0
(4)
        </p>
<p>
          In equation 3, W is the m × r basis matrix and H is the n × r coefficient matrix. A linear combination of the basis vectors (column vectors) of W with the coefficients of H gives the context matrix V. While factorizing, W and H are first assigned random values, and the optimization function in equation 4 is then applied to compute appropriate W and H, where r is the reduced dimension and F is the Frobenius norm. Here r is fixed as m to obtain an m × m context matrix. The basis vectors in W are considered the basis vectors of the context space, which are linearly combined with the elements of H to recompute V. Singular-vector and eigenvector based computation methods are avoided here, since they are constrained and forced to find orthogonal basis vectors, which may not form the exact context space of the author's documents. Since the occurrence count of a word in a document cannot be negative, which NMF accommodates, the non-negativity constraints make interpretability more straightforward than in other factorization methods [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
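<p>Step 2 can be sketched with scikit-learn's NMF applied to the document-document co-occurrence matrix; the matrix V below is a toy stand-in, and r is fixed to m as stated above.</p>

```python
# Factorize the document-document co-occurrence matrix with NMF.
# Sketch only: V is a random toy term-frequency matrix, not the paper's data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(4, 10)).astype(float)  # toy m x n matrix, m = 4

C = V @ V.T                      # m x m co-occurrence (context) matrix
r = C.shape[0]                   # r fixed to m, so W is m x m
model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
W = model.fit_transform(C)       # non-negative basis vectors of the context space
H = model.components_            # non-negative coefficients; C is approx. W @ H
```

<p>Because C is non-negative by construction, the factorization keeps every basis vector interpretable as a (weighted) mixture of documents, which is the motivation given above for preferring NMF over SVD-style methods.</p>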
<p>
          Each element of the matrix W is a distributed representation of the semantic information of the author's documents in context space. This is known as the Vector Space Model of Semantics (VSMs) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], but in this application it captures the user's cognitive ability and will be called statistical semantics. Using these basis vectors it is possible to span the space in which the different representations of similar semantics lie.
        </p>
<p>[W]m×m = [x1, x2, ..., xm] (5)</p>
<p>[W] = VSMs([VV^T]) (6)</p>
<p>Step 3 - The statistical features of the semantic distribution in context space are computed in order to build a supervised classification model. The statistical features include the marginal decision boundaries with respect to the word distribution in each document vector Wi, based on each class to be classified. Performing NMF moves the values in W from discrete to continuous. Thus, by taking Wi as random variable 1 and fixing random variable 2 as one of several reference distributions (Normal, Gamma, Chi-square, Rayleigh and Pareto distributions), the correlation and null hypothesis between them are measured. This is expressed as,</p>
<p>[F]m×s = statistical features([W]m×m) (7)
Where s is the number of statistical features and F is the feature matrix for building the classification model. From the above it is clear that the extracted features depend only on how the author's semantic distribution lies in a document.</p>
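<p>One plausible reading of Step 3, sketched with scipy.stats: each basis vector Wi is compared against samples drawn from the named reference distributions, and the resulting correlations form that author's feature vector. The shape parameters chosen for the Gamma, Chi-square and Pareto distributions are illustrative assumptions, not values from the paper.</p>

```python
# Compare each row of W against reference distributions via correlation.
# Sketch only: W is a toy matrix and the distribution parameters are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
W = rng.random((4, 50))  # toy stand-in for the m x m basis matrix

reference = {
    "normal":   stats.norm,
    "gamma":    stats.gamma(a=2.0),       # shape parameter assumed
    "chi2":     stats.chi2(df=3),         # degrees of freedom assumed
    "rayleigh": stats.rayleigh,
    "pareto":   stats.pareto(b=2.5),      # tail index assumed
}

features = []
for row in W:
    feats = []
    for dist in reference.values():
        sample = dist.rvs(size=row.size, random_state=0)
        # Pearson correlation of the sorted values measures distributional fit.
        r, _ = stats.pearsonr(np.sort(row), np.sort(sample))
        feats.append(r)
    features.append(feats)

F = np.array(features)  # [F]_{m x s}, s = number of reference distributions
```

<p>A hypothesis-test statistic (e.g. a Kolmogorov-Smirnov p-value) could be appended per distribution in the same loop to capture the null-hypothesis measurements mentioned above.</p>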
<p>
          Step 4 - To build the classification model, the regression relation between the features and the respective class is constructed using the Random Forest tree algorithm, a collection of decision trees that formulates the classification rule based on randomly selected features in the training set. From L = {(yi, Fi); 1 ≤ i ≤ m}, the subsets Lb are formed and b aggregate predictors are built [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The final predictor is then built by,
φb(F) = argmax_J HJ
(8)
Where J ranges over the class labels voted by the decision trees and HJ = {φ(F, Lb) = J}.
        </p>
<p>The gender and age-group classification models are built using a hierarchical method. To constrain the model, the gender information, once found, is fed as an additional binary feature to the age-group classification model. In training, two models are therefore built, one for gender and one for age classification. This is further detailed in the following testing phase.</p>
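<p>The hierarchical training of Step 4 can be sketched with scikit-learn's RandomForestClassifier standing in for the Random Forest tree ensemble; the features and labels below are toy stand-ins.</p>

```python
# Train gender first, then feed its prediction as a binary feature to the
# age-group model. Sketch only: data and labels are randomly generated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
F = rng.random((40, 5))                  # toy m x s statistical feature matrix
y_gender = rng.integers(0, 2, size=40)   # 0 = male, 1 = female (toy labels)
y_age = rng.integers(0, 3, size=40)      # toy age-group labels

model_gen = RandomForestClassifier(n_estimators=100, random_state=0).fit(F, y_gender)

# The gender output becomes one extra binary column for the age-group model.
gender_feat = model_gen.predict(F).reshape(-1, 1)
F_age = np.hstack([F, gender_feat])
model_age = RandomForestClassifier(n_estimators=100, random_state=0).fit(F_age, y_age)
```

<p>Using 100 trees matches the forest size reported in the experiment section; each tree votes and the majority class is returned by predict.</p>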
      </sec>
      <sec id="sec-3-3">
        <title>Testing Phase</title>
<p>Step 5 - Similar to the training set, except for Step 4, the test set pt = {d1, d2, ..., dn1} follows all remaining steps to compute feature vectors. Classification of a document into gender and age group is then performed with the b aggregate predictors in a hierarchical manner, and the final class is assigned by voting. The test features [Ft] are initially classified into male/female, and the prediction is padded on as an additional feature for the subsequent age-group classification.</p>
<p>The algorithm for training and testing is shown below,
Input: pt = {d1, d2, ..., dn}
for i = 1 to n do
    [V] = VSM(di)
end
[W] = NMF(VV^T)
[F] = statistical features([W])
modelgen = rft([Ffinal, b])
modelage = rft([Ffinal, gender])
ygen = predict(modelgen, [F])
yage = predict(modelage, [F, ygen])</p>
<p>Algorithm 1: Training and Testing</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment and Observations</title>
<p>The model diagram for performing AP is given in Figure 1. The data-set chosen for this experimentation is from the PAN CLEF AP 2016 workshop [15][16], built with the challenges involved in real-world applications. The 2016 corpus uses Twitter data for training, while reviews and blog data of the authors are taken as test data. The corpus covers three languages (English, Dutch and Spanish). Among them, the Dutch data-set does not have age-group information.</p>
<p>During pre-processing, the author's text alone is extracted for further processing. As detailed in the problem definition section, the total documents (authors' tweets) are represented as a document-term matrix (m × n). This matrix is multiplied with its transpose to get the document-document co-occurrence matrix of size m × m. NMF is then applied on the document-document matrix to get the basis vector matrix (context matrix) with r = m. The basis vector of each author's document is considered a random variable, and its correlations with the reference distribution random variables mentioned in section 3 are measured as features. This final feature matrix is used to construct the classification model, built using Random Forest tree classifiers with 100 decision trees. Gender classification is performed first and its result is fed into the feature matrix, after which the age group is classified. The same process is applied to the three languages without any change. All of the above is done in Python and its packages (Scikit Learn and Scipy).
10-fold cross-validation is performed to measure the training performance, given in Table 1. The measures are reported on the individual (English, Dutch and Spanish) and combined (English and Spanish) data-sets. Though the proposed model is not superior in accuracy, the results show that it achieves near-constant accuracy over all the languages and genres. This indicates that the proposed model acts as a language and domain independent method.</p>
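<p>The 10-fold cross-validation described above can be sketched with scikit-learn's cross_val_score; the features and labels are toy stand-ins for the PAN data.</p>

```python
# Estimate training performance with 10-fold cross-validation.
# Sketch only: random toy features and gender labels replace the real corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
F = rng.random((100, 5))                 # toy statistical feature matrix
y = rng.integers(0, 2, size=100)         # toy gender labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, F, y, cv=10)   # one accuracy per fold
mean_accuracy = scores.mean()
```

<p>Running the same snippet per language (and on the combined set) would reproduce the layout of Table 1.</p>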
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>With the global need for author profiling systems, this experimentation has brought forth a simple, unified and reliable model for finding the demographic features of an individual by extracting the statistical semantics of context space. This is achieved by combining the document-term matrix, Non-negative Matrix Factorization and statistical features with the Random Forest tree classifier. From the results it can be concluded that this serves as a domain and language independent method; however, there is still room for improvement. Future work will extend and implement the proposed algorithm on distributed computation frameworks like Apache Hadoop and Apache Spark.
14. Leo Breiman: Random forests. Machine Learning, (2001)
15. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Evaluations Concerning Cross-genre Author Profiling. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)
16. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF '14). pp. 268-299. Springer, Berlin Heidelberg New York (Sep 2014)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Perrin</surname>
          </string-name>
          : Social Media Usage:
          <fpage>2005</fpage>
          -
          <lpage>2015</lpage>
          . 2015
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mangold</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Glynn</surname>
            , and
            <given-names>David J.</given-names>
          </string-name>
          <string-name>
            <surname>Faulds</surname>
          </string-name>
          :
          <article-title>Social media: The new hybrid element of the promotion mix</article-title>
          .
          <source>Business horizons</source>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rangel</surname>
          </string-name>
          , Francisco, Efstathios Stamatatos, Moshe Moshe Koppel, Giacomo Inches, and
<article-title>Paolo Rosso: Overview of the author profiling task at PAN 2013</article-title>
          .
          <source>In CLEF Conference on Multilingual and Multimodal Information Access Evaluation</source>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rangel</surname>
          </string-name>
          , Francisco, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, and
<article-title>Walter Daelemans: Overview of the 2nd author profiling task at PAN 2014</article-title>
          .
          <article-title>CLEF Evaluation Labs</article-title>
          and Workshop, (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rangel</surname>
            , Francisco, P. Rosso,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
            , and
            <given-names>W.</given-names>
          </string-name>
<article-title>Daelemans: Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>CLEF</given-names>
          </string-name>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          <string-name>
            <given-names>HB</given-names>
            ,
            <surname>Reshma</surname>
          </string-name>
          <string-name>
            <surname>U</surname>
          </string-name>
          , and Anand Kumar M:
<article-title>Author identification based on word distribution in word space</article-title>
          .
          <source>Advances in Computing, Communications and Informatics (ICACCI)</source>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>John D.</given-names>
          </string-name>
          , John Henderson, George Kim, and Guido Zarrella:
          <article-title>Discriminating gender on Twitter</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liao</surname>
          </string-name>
          , Lizi, Jing Jiang, Ying Ding,
          <article-title>Heyan Huang, and Ee Peng LIM: Lifetime lexical variation in social media</article-title>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
<surname>López-Monroy</surname>
          </string-name>
          ,
<article-title>Adrián Pastor, Manuel Montes-y-</article-title>
          <string-name>
<surname>Gómez</surname>
          </string-name>
          ,
<article-title>Hugo Jair Escalante, and Luis Villaseñor Pineda: Using Intra-Profile Information for Author Profiling</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Maharjan</surname>
            , Suraj,
            <given-names>Prasha</given-names>
          </string-name>
          <string-name>
            <surname>Shrestha</surname>
          </string-name>
          , and
          <article-title>Thamar Solorio: A Simple Approach to Author Pro ling in MapReduce</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Turney</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter D.</surname>
          </string-name>
          , and Patrick Pantel:
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
<source>Journal of artificial intelligence research</source>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Daniel D</given-names>
          </string-name>
          and Seung, H Sebastian:
          <article-title>Learning the parts of objects by nonnegative matrix factorization</article-title>
          .
          <source>Nature</source>
          Publishing Group, (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Xu</surname>
          </string-name>
          , Wei, Xin Liu, and Yihong Gong:
          <article-title>Document clustering based on non-negative matrix factorization</article-title>
          .
          <source>Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM</source>
          , (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>