<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adapting Cross-Genre Author Profiling to Language and Corpus</article-title>
      </title-group>
      <contrib-group>
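        <contrib contrib-type="author">
          <string-name>Ilia Markov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Gelbukh</string-name>
        </contrib>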
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politécnico Nacional, Center for Computing Research</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration was different for different languages at each stage.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Author Profiling (AP) is the task of identifying profiling aspects of an author
based solely on a sample of his or her writing. From the machine-learning
perspective, AP can be viewed as a multiclass, single-label classification problem, in which
automatic methods have to assign class labels (e.g., male, female) to objects (texts). AP
methods can be useful for security and marketing applications, and can also serve
forensic purposes when part of the evidence consists of texts.</p>
      <p>
        The rapid growth of social media in recent years has significantly contributed to the
increased interest in the task, giving rise to a substantial body of work in this field.
Most of these approaches are concerned with exploring different sets of features to
distinguish between specific profiles. According to the AP task literature, character n-grams
and lexical features have proved to be highly discriminative for this task, regardless of
the language the texts are written in [
        <xref ref-type="bibr" rid="ref14 ref17 ref7 ref8">7, 8, 14, 17</xref>
        ].
      </p>
      <p>
        Recently, different types of character n-grams were proposed by Sapkota et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
to tackle the task of Authorship Attribution (AA). The authors showed that some types
of character n-grams distinguish better between stylistic properties of an author than
other types, under both single-topic and cross-topic AA conditions. In this study, we apply
the approach proposed by Sapkota et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to the task of AP. We demonstrate that
using typed character n-grams along with lexical and non-textual features is also helpful
for distinguishing between profiling aspects of authors under cross-genre AP
conditions, that is, when the training corpus belongs to one genre and the test set to another.
We propose several pre-processing steps and apply the transition point technique based on
Zipf’s law [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to fine-tune the feature set. We examine various feature representations,
including second order attributes (SOA) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which are known to provide good results
for this task [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows. Section 2 presents the proposed
methodology. Section 3 provides some characteristics of the PAN Author Profiling 2016
corpus. Section 4 describes the conducted experiments. Section 5 provides the obtained
results and their evaluation. Section 6 draws the conclusions and points to the possible
directions of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <sec id="sec-2-1">
        <title>Pre-processing steps</title>
        <p>Since the provided training corpus (described further in Section 3) consists of Twitter
messages, and the evaluation corpus will be of another genre, we introduce the
following pre-processing steps, which are applied before the extraction of features, aiming to
reduce non-textual features to their semantic classes:</p>
        <p>Digits. We replace all digits with a single symbol (e.g., 345 → 0), which allows capturing
information about their occurrence while discarding the actual numbers, since the numbers
do not represent useful information concerning profiling aspects.</p>
        <p>URLs. In order to keep information about the presence of URL mentions without
extracting character n-grams from them, we replace all URL mentions with the same symbol.
However, we use the particular domain name to form
our feature set of domain names (e.g., https://www.instagram.com → 1, “instagram” →
feature set of domain names).</p>
        <p>@mentions. We replace all @mention instances with the same symbol in order to keep
track of their occurrence and remove information related to the specific username
mentioned (e.g., @mention → 2). If there is a space after the “@” symbol, in most cases, it
is followed by a specific location. A location mention is usually user-specific and does not
carry useful clues for distinguishing between communities of people who share
common profiling aspects. Therefore, we replace @_mention with a different symbol (e.g.,
@_mention → 3).</p>
        <p>Picture links. For the same purposes as in the previous steps, all picture links are replaced
with a single symbol (e.g., pic.twitter.com/vYpLShlHs7 → 4).</p>
        <p>Emoticons. Emoticons can provide useful information about the sentiments of a specific
user; however, we consider them not to be helpful for author profile identification,
especially under cross-genre conditions. Therefore, we are only interested in capturing
their presence (e.g., :) → 5).</p>
        <p>Furthermore, we apply the following normalization:</p>
        <p>Slang words. We expand slang words with their corresponding meanings, since slang
words are not used in the same way by all authors, especially taking into account that
the test set will be of another genre (e.g., 4u → for you).</p>
        <p>Punctuation marks. We split punctuation marks from adjacent words and from each
other to be able to capture their presence separately when using character n-gram
features (e.g., .” → . ” ).</p>
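        <p>The replacements above can be illustrated with a short Python sketch. It is not the code used in our experiments: the regular expressions, the two toy lookup tables, the function names preprocess and domain_name, and the choice to treat the single token after an isolated “@” as the location are assumptions made for the example. The placeholder symbols 0 to 5 follow the examples given above.</p>
        <preformat>
```python
import re
from urllib.parse import urlparse

# Toy stand-ins (assumptions); a real system would load full dictionaries.
EMOTICONS = {":)", ":(", ":D", ";)"}
SLANG = {"4u": "for you", "gr8": "great"}

URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
PIC_RE = re.compile(r"^pic\.twitter\.com/\S+$", re.IGNORECASE)
MENTION_RE = re.compile(r"^@\w+$")
PUNCT_RE = re.compile(r"([^\w\s])")


def domain_name(url: str) -> str:
    """Return the label before the top-level domain, e.g. 'instagram'."""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    labels = [label for label in host.split(".") if label != "www"]
    return labels[-2] if len(labels) >= 2 else host


def preprocess(text: str) -> tuple[str, list[str]]:
    """Map non-textual items to placeholder symbols and collect domain names."""
    out, domains = [], []
    tokens = iter(text.split())
    for tok in tokens:
        if PIC_RE.match(tok):
            out.append("4")                      # picture link
        elif URL_RE.match(tok):
            out.append("1")                      # URL mention
            domains.append(domain_name(tok))
        elif tok == "@":
            next(tokens, None)                   # assumption: one-token location
            out.append("3")
        elif MENTION_RE.match(tok):
            out.append("2")                      # @mention
        elif tok in EMOTICONS:
            out.append("5")
        elif tok.lower() in SLANG:
            out.append(SLANG[tok.lower()])       # slang expansion
        else:
            tok = re.sub(r"\d+", "0", tok)       # digits to a single symbol
            out.append(PUNCT_RE.sub(r" \1 ", tok))  # split punctuation marks
    return " ".join(" ".join(out).split()), domains
```
        </preformat>
        <p>For instance, preprocess("@user see https://www.instagram.com/p/abc 4u :)") returns the text "2 see 1 for you 5" together with the domain list ["instagram"]. Processing token by token keeps the digit replacement from touching URLs, slang entries such as “4u”, or the placeholder symbols themselves.</p>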
      </sec>
      <sec id="sec-2-2">
        <title>Features</title>
        <p>
          Our approach is based on the character n-grams categories introduced by Sapkota et
al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The authors defined 10 different character n-gram categories based on affixes,
words, and punctuation. Following the practice of Sapkota et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we examine three
cases according to which n-gram categories are used:
          <list list-type="order">
            <list-item><p>Untyped: the traditional approach to extracting n-grams, where the categories of n-grams are ignored. Any distinct n-gram is a different feature.</p></list-item>
            <list-item><p>Typed: n-grams of all the categories (affix+word+punctuation) are considered. Instances of the same n-gram may refer to different features.</p></list-item>
            <list-item><p>Affix+punctuation: the n-grams of the word category are excluded.</p></list-item>
          </list>
        </p>
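        <p>The three cases can be illustrated with the following Python sketch. It is a deliberate simplification: it collapses the ten categories of Sapkota et al. into the three groups named above (affix, word, punctuation), and the decision rules in ngram_type, like the function names themselves, are assumptions made for the example rather than the exact category definitions.</p>
        <preformat>
```python
from collections import Counter


def ngram_type(text: str, start: int, n: int) -> str:
    """Assign a coarse type to the n-gram text[start:start + n]."""
    gram = text[start:start + n]
    if any(not ch.isalnum() and not ch.isspace() for ch in gram):
        return "punct"
    if gram[0].isspace() or gram[-1].isspace():
        return "affix"                      # begins or ends with a space
    if any(ch.isspace() for ch in gram):
        return "word"                       # spans more than one word
    end = start + n
    before = text[start - 1] if start > 0 else " "
    after = text[end] if len(text) > end else " "
    at_start = not before.isalnum()
    at_end = not after.isalnum()
    if at_start != at_end:
        return "affix"                      # proper prefix or suffix of a word
    return "word"                           # whole word or word-internal n-gram


def char_ngrams(text: str, n: int, mode: str = "typed") -> Counter:
    """Count n-grams in 'untyped', 'typed', or 'affix+punct' mode."""
    feats = Counter()
    for start in range(len(text) - n + 1):
        gram = text[start:start + n]
        if mode == "untyped":
            feats[gram] += 1
            continue
        kind = ngram_type(text, start, n)
        if mode == "affix+punct" and kind == "word":
            continue
        feats[(kind, gram)] += 1            # same string, different feature per type
    return feats
```
        </preformat>
        <p>With n = 3, the string "the other then" contains the trigram "the" three times: as a whole word, inside "other", and as a prefix of "then". The untyped case counts a single feature with frequency 3, whereas the typed case separates the prefix occurrence (affix) from the other two (word), and the affix+punctuation case keeps only the former.</p>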
        <p>
          The main conclusion of Sapkota et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is that for the Authorship Attribution
task, models based on affix+punctuation features are more efficient than models trained
on all the features. In this study, we apply these three models to the task of AP and
examine which one of them is the most appropriate for it.
        </p>
        <p>
          In addition, we examine whether the effectiveness of the proposed models can be
enhanced when combined with lexical and non-textual features, since combining
different feature sets usually improves the performance of classification models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Transition point technique for feature selection</title>
        <p>
          Zipf’s law states that given a large enough corpus, the frequency ranks of words (terms)
are inversely proportional to the corresponding frequencies [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The transition point (TP)
technique is based on Zipf’s law and word occurrences. This technique splits the
vocabulary of a document into two sets of terms (low and high frequency). According
to Pinto et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the terms whose frequency is closer to the transition point value
(medium-frequency terms) have a higher semantic value and are therefore more
appropriate for document representation. These medium-frequency terms can be obtained
by setting lower (U1) and upper (U2) threshold values through selecting appropriate
neighbourhood values of the transition point (NTP).
        </p>
        <p>
          The formula to obtain the transition point is given in equation (1):
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>TP = \frac{\sqrt{1 + 8 I_1} - 1}{2}</tex-math>
          </disp-formula>
          where I1 represents the number of words with frequency equal to 1 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          The lower (U1) and upper (U2) threshold values are calculated from a given
neighbourhood value of TP (NTP ∈ [0, 1]).
        </p>
        <p>
          The transition point technique has been used in various areas of Natural Language
Processing (NLP) and has proved to perform better than traditional feature selection
methods for several classification tasks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. In this work, we apply the transition point
technique to our character n-gram and lexical feature sets. We further demonstrate that
this feature selection method can enhance the performance of a cross-genre AP system.
        </p>
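        <p>A minimal Python sketch of this selection step is given below. The transition point follows equation (1); the threshold form U1 = (1 − NTP) · TP and U2 = (1 + NTP) · TP is the neighbourhood definition commonly associated with the technique and is stated here as an assumption, as are the function names and the toy token list.</p>
        <preformat>
```python
import math
from collections import Counter


def transition_point(freqs: Counter) -> float:
    """TP = (sqrt(1 + 8 * I1) - 1) / 2, with I1 the number of terms seen once."""
    i1 = sum(1 for count in freqs.values() if count == 1)
    return (math.sqrt(1 + 8 * i1) - 1) / 2


def thresholds(tp: float, ntp: float) -> tuple[float, float]:
    """Assumed neighbourhood form: U1 = (1 - NTP) * TP, U2 = (1 + NTP) * TP."""
    return (1 - ntp) * tp, (1 + ntp) * tp


def select_terms(freqs: Counter, minimum: float) -> set[str]:
    """Keep the terms whose frequency is greater than or equal to a threshold."""
    return {term for term, count in freqs.items() if count >= minimum}


# Toy vocabulary: a (5), b (3), c (2) and six terms that occur once.
freqs = Counter("a a a a a b b b c c d e f g h i".split())
tp = transition_point(freqs)
u1, u2 = thresholds(tp, ntp=0.5)
kept_lower = select_terms(freqs, u1)   # terms at or above the lower threshold
kept_upper = select_terms(freqs, u2)   # terms at or above the upper threshold
```
        </preformat>
        <p>For the toy vocabulary, I1 = 6 gives TP = 3, and NTP = 0.5 gives U1 = 1.5 and U2 = 4.5, so the lower threshold keeps the terms a, b, and c, while the upper threshold keeps only a.</p>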
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Corpus</title>
      <p>
        The Author Profiling task at PAN 2016 consisted in predicting the age and gender of authors
under cross-genre AP conditions [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The provided training corpus is composed of
Twitter messages in English, Spanish, and Dutch. The English and Spanish training
datasets are labeled with age and gender, whereas the Dutch dataset is labeled only
with gender. The following age classes are considered: 18-24, 25-34, 35-49, 50-64, and
65-xx. The distribution of age and gender over the instances of the training set can be
seen in Table 1.
      </p>
      <p>The PAN Author Profiling 2016 training corpus is perfectly balanced in terms of
represented gender groups; however, it is highly unbalanced in terms of age classes.
The majority of participants falls into the 35-49 age category, whereas there are only a few
instances for the 65-xx age category, which makes the task more challenging.</p>
    </sec>
    <sec id="sec-4">
      <title>Adapting Procedures to Language and Corpus</title>
      <p>For the evaluation of the proposed approach, we conducted our experiments on both
the provided training dataset under 10-fold cross-validation and the PAN Author
Profiling 2014 training corpus composed of English and Spanish blogs, social media, and
reviews. We used the PAN Author Profiling 2014 training corpus as a test set for our
experiments. Following the proposed performance measure, we evaluated our system
by measuring classification accuracy on both corpora (PAN 2016 and PAN 2014).</p>
      <p>
        In order to perform the pre-processing steps as described in Section 2, we expanded
slang words and replaced emoticons using the dictionary developed by Gómez-Adorno
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The examined features, machine learning algorithms, feature representations, and
threshold values are shown in Table 2.</p>
      <p>We also experimented with Latent Semantic Analysis (LSA) of words and stems,
which did not yield good results. Furthermore, we measured the impact of tackling the
task as a single-label 10-class classification problem using 10 age-gender profiling
classes.</p>
      <p>We evaluated the performance of each of the feature sets separately and in
combinations. Regarding the character n-grams features, we conducted experiments with
different values of n ranging from 3 to 6 for untyped and from 3 to 4 for typed and
affix+punctuation character n-grams. In addition, we examined the contribution of each
category of character n-grams separately, as well as the performance of our system
when n-grams of different length are combined.</p>
      <p>Typed character trigrams generally provided a higher level of classification accuracy
than untyped and affix+punctuation character n-grams. They also proved to be
more predictive than typed character n-grams with higher values of n, and therefore
were included in the final system. Furthermore, their combination with word unigrams
(for English, Spanish, and Dutch) and domain names (for English and Dutch) features
allowed us to further enhance system performance. However, it is necessary to mention
that the models based on untyped and affix+punctuation character n-grams produced
nearly as high levels of classification accuracy as the model based on typed character
n-grams. Moreover, different values of n yielded only slight accuracy variations.</p>
      <p>
        We examined the performance of the machine learning classifiers, shown in Table 2,
using their scikit-learn [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] implementation. These classification algorithms are
considered among the best for text classification tasks [
        <xref ref-type="bibr" rid="ref14 ref9">9, 14</xref>
        ]. We evaluated the performance
of each of the classifiers separately and also examined an ensemble setup, which
combines the probability distributions provided by the individual classifiers based on
a majority voting scheme.
      </p>
      <p>
        Feature representations used in this work are shown in Table 2. We exploited second
order attributes (SOA) computed as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with age-gender pairs as profiles calculated
separately for n-grams and word unigrams. Applying SOA, we reduced the number of
features to 10 for each of the feature sets (n-grams and word unigrams).
      </p>
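      <p>The idea behind this reduction can be illustrated with a simplified Python sketch: every term is first described by its association with each profile, and a document is then represented as the frequency-weighted sum of the vectors of its terms, so that the number of features equals the number of profiles. The log-scaled weighting, the two normalisation steps, and all names below are assumptions made for the example and may differ in detail from the formulation we followed.</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict


def term_profile_vectors(docs, labels, profiles):
    """Describe every term by its normalised association with each profile."""
    raw = defaultdict(lambda: [0.0] * len(profiles))
    for tokens, label in zip(docs, labels):
        col = profiles.index(label)
        for term, tf in Counter(tokens).items():
            raw[term][col] += math.log2(1 + tf / len(tokens))
    # Normalise within each profile first, then within each term.
    totals = [sum(vec[c] for vec in raw.values()) for c in range(len(profiles))]
    vectors = {}
    for term, vec in raw.items():
        scaled = [v / t if t else 0.0 for v, t in zip(vec, totals)]
        norm = sum(scaled)
        vectors[term] = [s / norm for s in scaled]
    return vectors


def soa_vector(tokens, vectors, n_profiles):
    """Represent a document as the weighted sum of its term vectors."""
    doc = [0.0] * n_profiles
    for term, tf in Counter(tokens).items():
        for c, value in enumerate(vectors.get(term, [0.0] * n_profiles)):
            doc[c] += tf / len(tokens) * value
    return doc


# Toy example with two hypothetical profiles instead of ten age-gender pairs.
profiles = ["p1", "p2"]
docs = [["alpha", "beta"], ["alpha", "gamma"], ["delta", "epsilon"]]
labels = ["p2", "p2", "p1"]
vectors = term_profile_vectors(docs, labels, profiles)
doc_repr = soa_vector(["alpha", "delta"], vectors, len(profiles))
```
      </preformat>
      <p>In the toy example, each document is mapped to a vector with one value per profile; with the 10 age-gender pairs used as profiles, the same procedure yields 10 features per feature set.</p>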
      <p>
        Gelbukh and Sidorov [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] showed that Zipf’s law coefficients depend on language.
Therefore, when applying the transition point technique to our character n-grams set of
features, we evaluated threshold values for each of the languages separately based on
grid search. We selected all the n-grams with a frequency greater than or equal to the
upper threshold (U2), with NTP values of 0.90, 1.00, and –0.95 for the English,
Spanish, and Dutch datasets, respectively. We also used a fixed frequency cutoff, which
consisted in discarding the 10 most frequent n-grams for each language.
      </p>
      <p>In order to compose our lexical set of features, we first discarded the 100 most frequent
words from the English and Spanish datasets and the 50 most frequent words from the
Dutch one. In the same way as for character n-grams, we estimated the most appropriate
threshold values and selected all the words with a frequency greater than or equal to the
lower threshold (U1). The lower threshold NTP values for our lexical set of features
were 0.75, 0.90, and 0.90 for the English, Spanish, and Dutch datasets, respectively.</p>
      <p>Our non-textual set of features was composed of the 30 most frequent domain names
for each of the languages.</p>
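      <p>Both the fixed frequency cutoff and the selection of the most frequent domain names amount to simple counting, as the following Python sketch illustrates. The function names, the toy data, and the small value of k are assumptions made for the example.</p>
      <preformat>
```python
from collections import Counter


def most_frequent(items: list[str], k: int) -> list[str]:
    """Return the k most frequent items, most frequent first."""
    return [item for item, _ in Counter(items).most_common(k)]


def drop_most_frequent(freqs: Counter, k: int) -> Counter:
    """Fixed frequency cutoff: discard the k most frequent terms."""
    top = {term for term, _ in freqs.most_common(k)}
    return Counter({t: c for t, c in freqs.items() if t not in top})


def binary_vector(doc_items: list[str], vocabulary: list[str]) -> list[int]:
    """Binary representation of one document over a fixed vocabulary."""
    present = set(doc_items)
    return [int(item in present) for item in vocabulary]


# Toy data: domain names collected from all training documents.
corpus_domains = ["instagram", "twitter", "instagram",
                  "youtube", "instagram", "twitter"]
vocabulary = most_frequent(corpus_domains, k=2)   # k = 30 in our setup
doc_vector = binary_vector(["twitter", "bbc"], vocabulary)

# Toy word counts with the single most frequent word removed.
word_freqs = drop_most_frequent(Counter("a a a b b c".split()), k=1)
```
      </preformat>
      <p>Domain names outside the selected vocabulary (here “bbc”) are simply ignored when a document is encoded.</p>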
      <p>We submitted three systems for the final evaluation on the PAN Author Profiling
2016 test corpus. The best results were obtained with the configurations shown in
Table 3.</p>
      <p>The best performing system, by a small margin, trained libSVM (for English) and
liblinear (for Dutch) classifiers on the combination of typed
character trigrams, word unigrams, and domain name features using their binary
representation. The Spanish system trained a liblinear classifier on typed
character trigram and word unigram features using the SOA representation. Our final setup
for the libSVM classifier employed a linear kernel. Both libSVM and liblinear classifiers
used the “balanced” class weight mode.</p>
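      <p>A configuration of this kind can be approximated in scikit-learn as sketched below. This is an illustration under stated assumptions rather than our submitted system: untyped character trigrams produced by CountVectorizer stand in for typed trigrams, domain name features and feature selection are omitted, the toy texts and labels are invented, and the helper name build_model is hypothetical. In scikit-learn, SVC is backed by libSVM and LinearSVC by liblinear.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import SVC, LinearSVC


def build_model(use_libsvm: bool = False):
    """Binary character trigrams plus word unigrams fed to a linear SVM."""
    features = make_union(
        CountVectorizer(analyzer="char", ngram_range=(3, 3), binary=True),
        CountVectorizer(analyzer="word", ngram_range=(1, 1), binary=True),
    )
    if use_libsvm:
        clf = SVC(kernel="linear", class_weight="balanced")
    else:
        clf = LinearSVC(class_weight="balanced")
    return make_pipeline(features, clf)


# Invented toy data with two placeholder classes.
texts = [
    "the match starts tonight at the stadium",
    "great match and a late goal",
    "baking bread with fresh yeast",
    "a new recipe for rye bread",
]
labels = ["class_a", "class_a", "class_b", "class_b"]

model = build_model(use_libsvm=True).fit(texts, labels)
predicted = model.predict(["bread recipe"])
```
      </preformat>
      <p>Setting class_weight to "balanced" reweights the classes inversely to their frequency, which matters for the unbalanced age classes described in Section 3.</p>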
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>
        In Table 4, we present the results on the PAN Author Profiling 2016 test corpus for the
three submitted systems evaluated in TIRA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Systems 1 and 2 are based on the binary
feature representation with liblinear and libSVM classifiers, respectively. System 3
combines the SOA representation with a liblinear classifier. The three systems were evaluated for each of
the languages in order to examine their performance on the test set. The best results for
each language are in bold.
      </p>
      <p>In the case of age classification, the obtained results for English and Spanish were
almost equal, in spite of the different approaches used to tackle these two languages. The
accuracy of gender classification for Spanish was good, even though fewer training
instances were available for this language. The obtained results for the Dutch language were rather low; this
may be due to the fact that we did not tune the system under cross-genre conditions for
this language, as we did for English and Spanish.</p>
      <p>The main lesson learned was that each language required a different configuration at
each stage.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we presented an approach for cross-genre age and gender identification.
Our final system for the English and Dutch languages combined typed character
n-grams, lexical features, and non-textual features. LibSVM and liblinear classifiers were
used for the English and Dutch languages, respectively. We employed binary feature
encoding and the transition point technique to fine-tune the size of the feature set
depending on language. For the Spanish language, the system was composed of typed
character n-grams and lexical features to build a liblinear classifier. We employed the
second order attributes (SOA) technique, which yielded a higher classification accuracy
for this language than the other feature representations examined. For all three
languages, we applied the same pre-processing steps, which include reducing non-textual
features to their corresponding semantic classes.</p>
      <p>
        One of the directions for future work would be to conduct experiments
combining the proposed features with others of a distinct nature such as syntactic [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18</xref>
        ]
and corpus statistics features: lexical diversity, lexical sophistication, and lexical
density, among others. Moreover, we intend to develop a method for automatic definition
of optimal neighbourhood values of the transition point technique depending on both
language and corpus.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was done under partial support of the Mexican Government (CONACYT
project 240844, SNI, COFAA-IPN, SIP-IPN 20161947).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Álvarez-Carmona</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-</surname>
            y-Gómez,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jair-Escalante</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>INAOE's participation at PAN'15: Author profiling task</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          . CLEF '15, vol.
          <volume>1391</volume>
          . CEUR (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Buitinck</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louppe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculae</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , VanderPlas, J.,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
          </string-name>
          , G.:
          <article-title>API design for machine learning software: experiences from the scikit-learn project</article-title>
          .
          <source>In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning</source>
          . pp.
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics</source>
          . pp.
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          . PACLING '07
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Zipf and Heaps laws' coefficients depend on language</article-title>
          .
          <source>In: Proceedings of the 2nd International Conference on Intelligent Text Processing and Computational Linguistics</source>
          . pp.
          <fpage>332</fpage>
          -
          <lpage>335</lpage>
          . CICLing '01, Springer Berlin Heidelberg (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoppe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>TIRA: Configuring, executing, and disseminating information retrieval experiments</article-title>
          .
          <source>In: Proceedings of the 9th International Workshop on Text-based Information Retrieval at DEXA</source>
          . pp.
          <fpage>151</fpage>
          -
          <lpage>155</lpage>
          . TIR '12,
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fócil-Arias</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Compilación de un lexicón de redes sociales para la identificación de perfiles de autor</article-title>
          [
          <trans-title>Compiling a lexicon of social media for the author profiling task</trans-title>
          ] (in Spanish, abstract in English).
          <source>Research in Computing Science</source>
          <volume>115</volume>
          (accepted) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>González-Gallardo</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sierra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Núñez-Juárez</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salinas-López</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Tweets classification using corpus dependent tags, character and POS n-grams</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          . CLEF '15, vol.
          <volume>1391</volume>
          . CEUR (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Houvardas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>N-gram feature selection for authorship identification</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Artificial Intelligence: Methodologies, Systems, and Applications</source>
          . pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          . AIMSA '06, Springer Berlin Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kibriya</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Multinomial naive Bayes for text categorization revisited</article-title>
          .
          In:
          <source>Proceedings of the 17th Australian Joint Conference on Advances in Artificial Intelligence</source>
          . pp.
          <fpage>488</fpage>
          -
          <lpage>499</lpage>
          . AI '04 (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>
          .
          <source>Knowledge-Based Systems</source>
          <volume>89</volume>
          (C),
          <fpage>134</fpage>
          -
          <lpage>147</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villatoro-Tello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>INAOE's participation at PAN'13: Author profiling task</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2013 Evaluation Labs</source>
          . CLEF '13
          ,
          <publisher-name>CEUR</publisher-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez-Salazar</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clustering abstracts of scientific texts using the transition point technique</article-title>
          .
          In:
          <source>Proceedings of the 7th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          . pp.
          <fpage>536</fpage>
          -
          <lpage>546</lpage>
          . CICLing '06, Springer Berlin Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batyrshin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichardo-Lagunas</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Syntactic n-grams as features for the author profiling task</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          . CLEF '15
          , vol.
          <volume>1391</volume>
          .
          <publisher-name>CEUR</publisher-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2015 Labs and Workshops, Notebook Papers</source>
          . vol.
          <volume>1391</volume>
          .
          <publisher-name>CEUR</publisher-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2016 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Not all character n-grams are created equal: A study in authorship attribution</article-title>
          .
          In:
          <source>Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies</source>
          . pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          . NAACL-HLT '15,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          . In:
          <source>AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          . pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          . AAAI (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Computing text similarity using tree edit distance</article-title>
          .
          In:
          <source>Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society and 5th World Conference on Soft Computing</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . NAFIPS '15 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>Human behavior and the principle of least effort</article-title>
          . Cambridge, MA, Addison-Wesley (
          <year>1949</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>