<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INAOE's participation at PAN'15: Author Profiling task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technologies Laboratory, Department of Computer Science, Instituto Nacional de Astrofísica</institution>
          ,
          <addr-line>Óptica y Electrónica, Luis Enrique Erro No. 1, C.P. 72840, Pue. Puebla</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Miguel A. Álvarez-Carmona, A. Pastor López-Monroy, Manuel Montes-y-Gómez</institution>
          ,
          <addr-line>Luis Villaseñor-Pineda, and Hugo Jair Escalante</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe the participation of the Language Technologies Laboratory of INAOE at PAN 2015. Building on the Author Profiling (AP) literature, we take discriminative and descriptive information to a new, higher level by exploiting a combination of discriminative and descriptive representations. For this, we apply dimensionality reduction techniques on top of the typical discriminative and descriptive textual features for the AP task. The main idea is that each representation, using the full feature space, automatically highlights different stylistic and thematic properties of the documents. Specifically, we propose the joint use of Second Order Attributes (SOA) and Latent Semantic Analysis (LSA) to highlight discriminative and descriptive properties respectively. To evaluate our approach, we compare our proposal against the standard Bag-of-Words (BOW), SOA and LSA representations using the PAN 2015 corpus for AP. Experimental results show that the combination of SOA and LSA outperforms the BOW and each individual representation, which gives evidence of its usefulness for predicting gender, age and personality profiles. More importantly, according to the PAN 2015 evaluation, the proposed approach is in the top 3 positions on every dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Author Profiling (AP) task consists in learning as much as possible about an unknown author by analysing a given text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. AP has captured the attention of the scientific community in recent years, due in part to the potential of the huge amount of user-generated textual information on the internet. In this context, several applications related to AP are emerging, some of them in e-commerce, computer forensics and security. There are several ways to address the AP task. One of them is to approach it as a single-label multiclass classification problem, where the target profiles (e.g., male and female) are the classes to discriminate.
      </p>
      <p>
        Broadly speaking, text classification tasks involve three key procedures: i) the extraction of textual features, ii) the representation of the documents, and iii) the application of a learning algorithm. In the context of AP, for the first step, specific lexical features (e.g., simple words, function words) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and syntactic features (e.g., POS tags) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have proven to be highly discriminative for some specific profiles. Regarding the last two steps, the most common and effective approach for AP consists in using the Bag-of-Words representation (e.g., histograms of the presence/absence of textual features) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Support Vector Machines [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] respectively.
      </p>
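      <p>As an illustration, the BoW representation in steps ii)-iii) can be sketched as follows (a minimal sketch using scikit-learn; the toy documents are our own, not the PAN data):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only).
docs = ["the cat sat on the mat", "the dog barked"]

# Histogram of presence/absence of textual features (here: word unigrams).
bow = CountVectorizer(binary=True)
X = bow.fit_transform(docs)  # shape: (n_documents, vocabulary_size)
```

      <p>A linear SVM (e.g., sklearn.svm.LinearSVC) would then be trained on X.</p>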
      <p>According to the AP literature, most of the work has been devoted to the first step: identifying the most useful textual features for the target profiles. In spite of the usefulness of these features and the good results achieved by the BoW-SVM configuration, the research community has put little effort into the second and third steps: alternative representations and learning algorithms for the AP task. The main shortcomings of the BoW-SVM approach are well known from other text mining tasks.</p>
      <p>
        To overcome these shortcomings, in this paper we focus on the second step, the representation of the documents, in order to improve the representation of tweets. The main goal of our approach is to compute high-quality discriminative and descriptive features built on top of typical state-of-the-art textual features (e.g., content words, function words, punctuation marks, etc.). For this, we propose to combine two state-of-the-art dimensionality reduction techniques that automatically stress the contribution of the discriminative and descriptive textual features. According to the literature, the most frequent textual features (e.g., function words, stopwords, punctuation marks) provide important clues for discriminating among authors. We therefore need a representation based on term frequencies that stresses the contribution of such discriminative attributes and produces highly discriminative document representations. To capture this information among textual features we use Second Order Attributes (SOA), computed as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. On the other hand, relevant thematic information is usually found in descriptive terms, i.e., terms that are frequent only in some specific documents or classes. To represent documents we therefore bring ideas from the information retrieval field, exploiting Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. LSA projects terms and documents into a new semantic space. This is done by performing a singular value decomposition of a Term Frequency-Inverse Document Frequency (TFIDF) matrix. Under the LSA formulation the descriptive term and document representations are stressed, throwing out noise but emphasizing strong patterns and trends. To the best of our knowledge, the idea of representing documents by combining discriminative and descriptive high-level features through dimensionality reduction techniques has never been explored before in the AP task. Thus, it is promising to bring together two of the best document representations to improve AP; that is precisely the purpose of this work.
      </p>
      <p>The rest of this paper is organized as follows: Section 2 introduces the proposed representation, Section 3 briefly describes the PAN15 corpus, Section 4 explains the experiments and the results we obtained, and Section 5 presents our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Exploiting Discriminative and Descriptive features</title>
      <p>In this section we briefly describe each representation and the strategy used to compute the final representation of documents. Section 2.1 explains the SOA representation, used to obtain the discriminative features. Section 2.2 explains the LSA algorithm, with which we intend to obtain descriptive features. Finally, Section 2.3 explains how we join these representations for the AP task.</p>
      <sec id="sec-2-1">
        <title>Computing Discriminative Features</title>
        <p>
          Stylistic textual features have proven to be useful for the AP task [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Many of the stylistic textual attributes used in text mining tasks (e.g., Author Profiling, Authorship Attribution, Plagiarism Detection) are associated with highly frequent terms [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For example, the frequency of stopwords and punctuation marks exposes clues about the author of a document. In gender identification, the distribution of specific function words and determiners has also proven useful [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Second Order Attributes (SOA), proposed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], is a supervised frequency-based approach to build document vectors in a space of the target profiles. Under this representation, each value in the document vector represents the relationship of the document with one target profile.
        </p>
        <p>
          The representation described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] has two key steps: i) building word vectors in a space of profiles and ii) building document vectors in that same space. In the first step, for each vocabulary term tj, a vector tj = ⟨tp1j, ..., tpmj⟩ is computed, where each tpmj is a frequency-based value that represents the relationship between term tj and profile pm. In the second step, the representation of a document is built as a frequency-weighted aggregation of the vectors of the terms contained in the document (see Equation 1).
        </p>
        <p>dk = Σ_{tj ∈ Dk} (tfkj / length(dk)) · tj     (1)</p>
        <p>
          where Dk is the set of terms that belong to document dk and tfkj is the frequency of term tj in dk. For more details please refer to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
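        <p>A minimal sketch of the two SOA steps, under our reading of the description above (the toy corpus and profile labels are our own, illustrative assumptions):</p>

```python
import numpy as np

# Hypothetical toy corpus: each document is a token list with a profile label.
docs = [["the", "cat", "sat"], ["dog", "and", "cat"], ["stocks", "rose", "today"]]
profiles = ["female", "female", "male"]

vocab = sorted({t for d in docs for t in d})
classes = sorted(set(profiles))
t_idx = {t: i for i, t in enumerate(vocab)}
p_idx = {p: i for i, p in enumerate(classes)}

# Step i) term vectors in the profile space: tp[m][j] is the normalized
# frequency with which term j occurs in documents of profile m.
tp = np.zeros((len(classes), len(vocab)))
for d, p in zip(docs, profiles):
    for t in d:
        tp[p_idx[p], t_idx[t]] += 1
tp /= tp.sum(axis=1, keepdims=True)  # row-normalize per profile

# Step ii) document vectors: frequency-weighted sum of term vectors, as in
# Equation 1: dk = sum over tj in Dk of (tfkj / length(dk)) * tj
def soa_vector(doc):
    v = np.zeros(len(classes))
    for t in set(doc):
        if t in t_idx:
            v += (doc.count(t) / len(doc)) * tp[:, t_idx[t]]
    return v
```

        <p>Each document thus becomes a vector with one entry per target profile.</p>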
      </sec>
      <sec id="sec-2-2">
        <title>Computing Descriptive Features</title>
        <p>
          Besides the usefulness of stylistic features, thematic information has proven to be an important aspect of the AP task [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For example, several works have shown evidence that groups of people of the same age and gender generally write about the same topics. For this reason we exploit Latent Semantic Analysis (LSA). LSA is a technique that associates words with their contribution to automatically generated concepts (topics) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This is usually called the latent space, into which documents and terms are projected to produce a reduced, topic-based representation. We hypothesise that in this latent space we can better expose descriptive information relevant to the AP task.
        </p>
        <p>
          LSA is a method to extract and represent the meaning of words and documents. LSA is built from a matrix M, where mij is typically the TFIDF [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] weight of word i in document j. LSA uses the Singular Value Decomposition (SVD) to decompose M as follows.
        </p>
        <p>M = UΣV^T     (2)</p>
        <p>
          where the values of Σ are the singular values and U and V contain the left and right singular vectors respectively. U and V give reduced dimensional representations of words and documents respectively; they emphasize the strongest relationships and throw away the noise [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In other words, SVD makes the best possible reconstruction of the matrix M with the least possible information [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Using U and V computed only from the training documents, words and documents are represented for both training and test. For more details please refer to [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
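        <p>The decomposition and the latent-space projection can be sketched as follows (a toy TFIDF matrix of our own; the folding-in formula for new documents is the standard one, stated here as an assumption rather than taken from the paper):</p>

```python
import numpy as np

# Toy TFIDF matrix M: rows = terms, columns = documents (illustrative).
M = np.array([[0.2, 0.0, 0.5],
              [0.7, 0.1, 0.0],
              [0.0, 0.9, 0.3],
              [0.4, 0.4, 0.1]])

k = 2  # number of latent concepts (the paper uses k = 100)
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U @ diag(s) @ Vt
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]       # rank-k truncation

# Training documents in the latent space: columns of diag(s_k) @ Vt_k.
docs_latent = (s_k[:, None] * Vt_k).T  # shape (n_docs, k)

# A new (test) document vector d is folded into the same space via
# d_hat = Sigma_k^{-1} U_k^T d  (standard LSA folding-in).
def project(d):
    return (U_k.T @ d) / s_k
```

        <p>Only U_k and s_k from the training decomposition are needed to represent unseen documents, matching the train/test usage described above.</p>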
      </sec>
      <sec id="sec-2-3">
        <title>Exploiting the Joint Use of Discriminative-Descriptive Features</title>
        <p>The idea is to use representations built on the whole feature space to automatically highlight the discriminative and descriptive properties of documents. The intuition is to take advantage of both approaches in a single representation using early fusion. Let xj be the j-th training instance-profile under the LSA representation with k dimensions, and yj the same instance-profile under the SOA representation with m dimensions; the final representation is shown in Expression 3.</p>
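        <p>Expression 3 is a simple vector concatenation; a minimal sketch (the dimensions k and m are illustrative assumptions):</p>

```python
import numpy as np

k, m = 100, 4  # illustrative: 100 LSA concepts, 4 target profiles
x_j = np.random.rand(k)  # LSA representation of instance j
y_j = np.random.rand(m)  # SOA representation of instance j

# Early fusion: concatenate the two representations into z_j.
z_j = np.concatenate([x_j, y_j])
```
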
        <sec id="sec-2-3-1">
          <title>The collection of training documents are finally represented as:</title>
          <p>zj = hxj1, . . . , xjk, yj1, . . . , yjmi</p>
          <p>Z =
[ hzj, cji
dj∈D
(3)
(4)</p>
          <p>Where cj is the class of the j − th training instance-profile.
3</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data Collection</title>
      <p>We have approached the PAN 2015 AP task as a classification problem. The PAN 2015 corpus is composed of 4 datasets in different languages (Spanish, English, Italian and Dutch). Each dataset has labels for gender (male, female), age¹ (18-24, 25-34, 35-49, 50-xx) and five personality trait values (extroverted, stable, agreeable, conscientious, open) between -0.5 and 0.5. Table 1 shows the number of author-profiles per language.</p>
      <p>For personality identification, Table 2 shows the relevant information in terms of classes: for each language it gives the range and the number of classes for each trait². For personality we consider each trait value observed in the training corpus as a class. For example, if only two values (e.g., 0.2 and 0.3) are observed in the training corpus, then we build a two-class classifier (with classes 0.2 and 0.3)³.</p>
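      <p>Deriving the class set from the observed trait values can be sketched as follows (toy values of our own):</p>

```python
# Each trait value observed in the training corpus becomes one class.
train_extroverted = [0.2, 0.3, 0.2, -0.1, 0.3]
classes = sorted(set(train_extroverted))
# A classifier is then built over these observed-value classes.
```
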
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental Settings</title>
        <p>
          We use the following configuration for each experiment: i) as terms we use words, contractions, hyphenated words, punctuation marks and a set of common emoticons; ii) we consider only the terms with at least 5 occurrences in the corpus; iii) the number of concepts for LSA is set to k = 100. We perform a stratified 10-fold cross validation using the PAN15 training corpus and a LibLINEAR classifier [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In order to determine the full profile of a document (gender, age and the 5 personality traits), we built one classifier per target profile for each language.
        </p>
        <p>
          ¹ Age data for the Italian and Dutch languages are not available. ² Ranges marked with an asterisk indicate that a value within the range is missing; for example, in Spanish (extroverted and conscientious) the value -0.1 is missing. ³ The number of classes for each personality trait varies between languages; see Table 2.
        </p>
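        <p>Under the stated settings, the pipeline can be sketched with scikit-learn (our reading, not the authors' code; LinearSVC is backed by LibLINEAR, and min_df counts document frequency rather than raw corpus occurrences, so it only approximates setting ii):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# i)  the default token pattern is used here; the paper also keeps
#     contractions, hyphenated words, punctuation marks and emoticons.
# ii) min_df=5 approximates "terms with at least 5 occurrences".
# iii) k = 100 LSA concepts.
pipeline = make_pipeline(
    TfidfVectorizer(min_df=5),
    TruncatedSVD(n_components=100),
    LinearSVC(),
)
# One such classifier is trained per target profile and language, e.g.:
# pipeline.fit(train_texts, train_gender_labels)
```
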
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Results</title>
        <p>The aim of the first experiment is to analyse the performance of the LSA, SOA and BOW approaches in the AP tasks. We experiment with LSA and SOA separately and finally with the two approaches together. We are interested in observing the contribution of discriminative-stylistic information (captured by SOA) and descriptive-thematic information (captured by LSA) to the AP task. For gender prediction, Table 3 shows that among the individual representations LSA obtains the best results, outperforming the BOW approach in every language. When LSA and SOA are combined the result only improves in English, which is an important remark since English is the largest collection (see Table 1). The following conclusions can be outlined from Table 3:
– The descriptive information captured by LSA is the most relevant information for gender prediction in the PAN 2015 AP dataset, since LSA obtained the best average individual performance.
– The purely discriminative information captured by SOA only outperforms BOW on the Dutch documents, but the combination of LSA and SOA obtained an improvement of around 4% in accuracy for English gender detection. We think SOA could improve the results if more documents were available⁴.</p>
        <p>Table 4 shows the experimental results for age prediction. Recall that age data are available only for English and Spanish. As in the previous experiment LSA obtains the best individual performance, but here the combination of LSA and SOA obtains an improvement in both collections. It is worth noting that, despite the small datasets, SOA contributes to improving the classification for age prediction⁵.</p>
        <p>
          Finally, Table 5 shows the performance of BOW and of LSA plus SOA by language in the personality detection task. Although the results of this experiment seem promising, they should be taken with caution: given the lack of data and the number of classes we consider (one class per observed value), a single correctly or wrongly predicted instance is enough to change the results considerably. For this specific experiment we built the representation on the entire dataset and then evaluated using 10-fold cross validation. In general, the results suggest that the combination of LSA plus SOA obtains similar or better results than the typical BOW approach, giving evidence of the usefulness of the discriminative and descriptive features.
        </p>
        <p>
          ⁴ SOA has shown outstanding results in recent PAN AP tracks [
          <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
          ]. ⁵ The best results for SOA in previous PAN AP editions have been for age prediction.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        In this paper, we have explored a new combination of document representations for the AP task. The main aim of this work was to experiment with descriptive (LSA) and discriminative (SOA) features. We found that the descriptive information is very useful, which confirms several findings in the literature. Moreover, we also found that discriminative information can improve the results when it is combined with descriptive information. This indicates that LSA captures very important information, which in turn can be complemented with the SOA stylistic information.
      </p>
      <p>Acknowledgment. This work was partially funded by the program SEP-CONACYT-ANUIES-ECOS Nord under the project M11-H04. Álvarez-Carmona thanks CONACyT-Mexico for doctoral scholarship 401887, and López-Monroy thanks CONACyT-Mexico for doctoral scholarship 243957.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          :
          <article-title>Latent semantic analysis</article-title>
          .
          <source>Annual review of information science and technology 38(1)</source>
          ,
          <fpage>188</fpage>
          -
          <lpage>230</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Author profiling for english emails</article-title>
          .
          <source>In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING'07)</source>
          . pp.
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Liblinear: A library for large linear classification</article-title>
          .
          <source>The Journal of Machine Learning Research 9</source>
          ,
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zigdon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Determining an author's native language by mining a text for errors</article-title>
          .
          <source>In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining</source>
          . pp.
          <fpage>624</fpage>
          -
          <lpage>628</lpage>
          . ACM (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laham</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Discourse processes 25(2-3)</source>
          ,
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dennis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kintsch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Handbook of latent semantic analysis</article-title>
          . Psychology Press (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lopez-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villasenor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villatoro-Tello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Inaoe's participation at pan'13: Author profiling task</article-title>
          .
          <source>In: CLEF 2013 Evaluation Labs and Workshop</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trenkmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd author profiling task at pan 2014</article-title>
          .
          <source>In: Proceedings of the Conference and Labs of the Evaluation Forum (Working Notes)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inches</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the author profiling task at pan 2013</article-title>
          . Notebook Papers of CLEF pp.
          <fpage>23</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.W.:</given-names>
          </string-name>
          <article-title>Effects of age and gender on blogging</article-title>
          . In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Mining the web for synonyms: PMI-IR versus LSA on TOEFL</article-title>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Linguistic profiling for author recognition and verification</article-title>
          .
          <source>In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics</source>
          . p.
          <fpage>199</fpage>
          .
          Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoiem</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Building text features for object image classification</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          .
          <article-title>CVPR 2009</article-title>
          . IEEE Conference on. pp.
          <fpage>1367</fpage>
          -
          <lpage>1374</lpage>
          . IEEE (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wiemer-Hastings</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiemer-Hastings</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Latent semantic analysis</article-title>
          .
          <source>In: Proceedings of the 16th international joint conference on Artificial intelligence</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
          Citeseer
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>