<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profile-based Approach for Age and Gender Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ma. Jose Garciarena Ucelay</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ma. Paula Villegas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario G. Funez</string-name>
          <email>funezdariog@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leticia C. Cagnina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo L. Errecalde</string-name>
          <email>merrecaldeg@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Ramírez-de-la-Rosa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esau Villatoro-Tello</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consejo Nacional de Investigaciones Científicas y Técnicas</institution>
          ,
          <addr-line>CONICET</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIDIC Research Group, Universidad Nacional de San Luis</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Language and Reasoning Research Group, Information Technologies Dept., Universidad Autonoma Metropolitana (UAM) Unidad Cuajimalpa</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the joint participation of the LIDIC research group of the UNSL from Argentina and the Language and Reasoning research group of the UAM Cuajimalpa from Mexico in the PAN 2016 Author Profiling task. For the proposed method we adopted a profile-based approach, which has been successfully applied to the Authorship Attribution problem, and we propose a variation of this technique for tackling the Author Profiling task. The experiments showed that, using about the 8000 most frequent character n-grams for the construction of the different profiles, our method obtains better performance both for documents of the same genre and for the cross-genre scenario.</p>
      </abstract>
      <kwd-group>
        <kwd>Profile-based approach</kwd>
        <kwd>Author Profiling</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Lately, the Author Profiling (AP) task has been among the challenges most
attractive to the scientific community, especially in fields such as Natural
Language Processing, Forensics, Marketing, and Internet Security. The main goal
of AP is to distinguish, from a given text, among different categories of
authors rather than to identify the individual author; the latter task is known
as Authorship Attribution [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Thus, the AP task aims at modelling groups of authors through
a more general set of features. Ideally, such features
represent, to some extent, how different categories of authors use
language depending on their age, gender, native language, political preference,
personality, etc. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Among the very first works addressing the AP problem are [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ], which
showed the pertinence of statistical techniques for distinguishing among authors'
gender and age. Since then, many approaches have been proposed for
the AP challenge [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3,4,5,6,7</xref>
        ]. A common approach among these research works
is the use of textual representations, which have proven effective enough
when the documents are formal texts, for instance, news reports,
scientific papers, books, etc. Nonetheless, most traditional approaches face
several difficulties when the documents come from a more informal source,
such as blogs, chats, or social media texts (e.g., tweets).
      </p>
      <p>
        As part of the efforts to provide effective solutions to the AP challenge,
PAN@CLEF<sup>4</sup> proposes a competitive evaluation exercise for uncovering
plagiarism, authorship, and social software misuse. For this year's PAN campaign,
the focus of the AP shared task is on cross-genre age and gender identification [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
meaning that the training documents come from one genre (e.g., Twitter, blogs,
social media) and the evaluation is performed on a different one.
      </p>
      <p>The rest of this document is organized as follows. Section 2 describes some
of the most relevant research works that have tried to solve the problem of AP
with a profile-based paradigm. Section 3 describes the ideas that motivate this
work. Next, Section 4 describes our proposed method for approaching the AP
problem, and Section 5 shows the results obtained on the PAN 2016 dataset.
Finally, Section 6 presents our conclusions and ideas for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>In the field of Author Analysis, several tasks fall under the same
type of stylistic analysis: Author Attribution, Plagiarism
Detection and Author Profiling. In the Author Attribution problem, there are two
predominant paradigms: the instance-based paradigm and the profile-based paradigm.
The former is the more common one and is also the most used in the other related
tasks of Author Analysis; it treats each document of an author as
independent. In the profile-based paradigm, by contrast, all the documents
of the same author are treated as one; despite its simplicity, it is not very common.</p>
      <p>
        The most recent research that uses the profile-based paradigm is that of
Potha and Stamatatos [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. They evaluated the profile-based paradigm
for the author attribution task and tested it against the instance-based
methods of the PAN-2013 participants. The authors
established four parameters for their method: the length of the n-grams,
the length of the unknown document, the length of the profile and the
dissimilarity function. Results showed that their method, using a set of global and local
settings, outperforms the individual methods of the PAN 2013 participants in the
authorship track.
      </p>
      <p>
        Other studies, also on Author Attribution, use hybrid approaches, that
is, they take some characteristics from both paradigms (i.e., instance-based and
profile-based) [
        <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
        ]. In these studies each document of each
author is treated as independent, in the same way as in the instance-based paradigm, but
a profile is built for each author.
      </p>
      <p><sup>4</sup> http://pan.webis.de/clef16/pan16-web/</p>
      <p>
        As the previous works show, profile-based approaches have given
competitive results for author attribution tasks. In this sense, we want to test this
simple approach on another author analysis problem, namely the author profiling task.
As in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we set some parameters, such as the length of the profile and the length
of the n-grams, in a cross-domain scenario.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Profile-based approaches</title>
      <p>
        Profile-based methods have been successfully used for addressing problems
related to the authorship attribution (AA) task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In a typical AA problem, a
text of unknown authorship is assigned to one author from a set of
candidate authors for whom texts of undisputed authorship are available.
In this context, for each class of author these methods build a profile containing
information extracted from a collection of documents written by the author [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Figure 1 summarizes graphically the process of generating the profile of each
author.
      </p>
      <p>The information extracted from the documents for the construction of the
profiles can be related to the writing style or to the text content, as we briefly
describe below.</p>
      <p>
        – Style-based features: such as the frequency or number of pronouns, articles
and prepositions, the number of hyperlinks, word averages, etc. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Among the
most used are the frequencies of character n-grams, that is,
substrings of n consecutive characters [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In particular, for English,
character n-grams with n=3 have been shown to be effective. These
features capture interesting information depending on the gender and the
age of the author. For example, women in blogs use more pronouns and
affirmative-negative words.
      </p>
      <p>
        – Content-based features: consider the words related to different topics [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
For example, women usually write words related to personal concerns
such as shopping, mom, etc., whereas men usually write about politics
and technology.
      </p>
      <p>
        In order to obtain the author profiles, these methods take a set of
documents of each author and extract the set of features. As this set could be too
large, the profile keeps only the L most frequent features of the whole
set. Then, to classify a target document, the method constructs a
profile from that single document and, using a similarity measure with respect
to all authors' profiles, determines the authorship [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
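      <p>The profile construction just described can be illustrated with a minimal sketch. It assumes that a profile simply maps each of the L most frequent character n-grams to its raw count; the function names, the default values and the handling of ties between equally frequent n-grams are illustrative assumptions, not details taken from the methods cited above.</p>
      <preformat>
```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every substring of n consecutive characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(text, n=3, L=8000):
    """Map each of the L most frequent character n-grams to its count."""
    counts = Counter(char_ngrams(text, n))
    return dict(counts.most_common(L))


# Toy input; a real profile would be built from all the documents of an author.
profile = build_profile("banana bandana", n=3, L=2)
print(profile)  # the two most frequent 3-grams with their counts
```
      </preformat>
      <p>Keeping the counts (and not only the n-grams) leaves the profile usable both by measures that compare frequencies and by measures that only look at which n-grams are shared.</p>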
      <p>
        Some similarity (or distance) measures used in profile-based approaches
are:
      </p>
      <p>
        1. Keselj's Relative Distance (KRD) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: calculates the distance K between
two profiles P<sub>1</sub> and P<sub>2</sub> as:
        <disp-formula>
          <label>(1)</label>
          <tex-math>K = \sum_{x \in X_{P_1} \cup X_{P_2}} \left( \frac{2\,(P_1(x) - P_2(x))}{P_1(x) + P_2(x)} \right)^{2}</tex-math>
        </disp-formula>
where P<sub>i</sub>(x) is the frequency of the term x in profile P<sub>i</sub>, and X<sub>P<sub>i</sub></sub> is the set
of all terms that occur in the profile P<sub>i</sub>.
      </p>
      <p>
        2. Simplified Profile Intersection (SPI) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]: calculates the amount of features
that belong to both profiles P<sub>1</sub> and P<sub>2</sub> as:
        <disp-formula>
          <label>(2)</label>
          <tex-math>K = \sum_{x \in X_{P_1} \cap X_{P_2}} \left( \frac{2\,(P_1(x) - P_2(x))}{P_1(x) + P_2(x)} \right)^{2}</tex-math>
        </disp-formula>
      </p>
      <p>As profile-based approaches have been successfully used for the AA task, we
propose to use them for the Author Profiling task.</p>
    </sec>
    <sec id="sec-4">
      <title>Sistema de Perfiles: the proposed method</title>
      <p>
        Our study focuses on predicting the age and gender (female or
male) of the author, for English, Spanish and Dutch. For age, the task
considers the following ranges: 18-24, 25-34, 35-49, 50-64 and 65-xx years
old, only for the English and Spanish texts [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>In order to use a profile-based approach, we represent a specific class of
author with a profile. Then, for predicting the gender and the age, we built 10
different profiles, each combining a possible gender and age range
of the authors. Thus, we obtained profiles for the following categories:
female 18-24, male 18-24, female 25-34, male 25-34, female 35-49, male 35-49,
female 50-64, male 50-64, female 65-XX and male 65-XX.</p>
      <p>
        Regarding the features for the construction of the profiles, preliminary
experiments showed that character n-grams were adequate. The complete
system, named Sistema de Perfiles (SP), was implemented in two stages. In the
first one we constructed the profiles for each category, for each language
separately. We used the documents (i.e., the training set) provided by the Author Profiling
task at PAN-PC-2016 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To obtain the profile of each category (for each language
separately) we applied the following steps to the whole training set:
      </p>
      <p>– Unification of the separate xml files into a single txt file (concatenation), one
per category.</p>
      <p>– Preprocessing of the txt file obtained for each category: tags and images are
removed.</p>
      <p>– Generation of the n-grams from the txt file and calculation of the frequency of
each one. The n-grams are sorted with the most frequent first<sup>5</sup>. This
step is performed for each category.</p>
      <p>– Saving of the profile of the category, keeping only the L most frequent n-grams
obtained in the previous step.
      </p>
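      <p>The first stage can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of (category, text) pairs whose texts have already been preprocessed, the documents of a category are joined with a newline, and the names build_profile and train_profiles are hypothetical. It does not reproduce the Java implementation of SP.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def build_profile(text, n=3, L=8000):
    """Map each of the L most frequent character n-grams to its count."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(counts.most_common(L))


def train_profiles(documents, n=3, L=8000):
    """Build one profile per category from (category, text) pairs.

    The texts are assumed to be already preprocessed (tags and images removed).
    """
    merged = defaultdict(list)
    for category, text in documents:
        merged[category].append(text)  # unify all documents of a category
    return {
        category: build_profile("\n".join(texts), n, L)
        for category, texts in merged.items()
    }


# Hypothetical toy training set; the labels follow the categories named in the text.
training = [
    ("female 18-24", "aaa"),
    ("female 18-24", "aab"),
    ("male 18-24", "abc"),
]
profiles = train_profiles(training, n=3, L=2)
print(sorted(profiles))
```
      </preformat>
      <p>In this sketch one such dictionary of profiles would be built per language, since the profiles of each language are kept separate.</p>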
      <p>The second stage is the classification of a test document written in a particular
language (this information is provided). SP receives an input xml file, and then the
following steps are performed:</p>
      <p>– Preprocessing of the input file: tags and images are removed from the file and
it is saved as a txt file.</p>
      <p>– Obtaining the n-grams and sorting them, keeping only the L most
frequent (document profile).</p>
      <p>– Checking the similarity with the profile of each category using the SPI function
described above. The document profile is compared with the profile of
each category, and the label of the closest one is returned. Note
that the profiles considered in this step are those for the same language as
the input file.</p>
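      <p>A minimal sketch of the second stage is given below. It reads SPI as the number of n-grams shared by two profiles, following the prose description of the measure; whether this matches the exact function used in SP is an assumption. A KRD-style distance is included only for comparison and uses whatever values the profiles store as frequencies. All function names and the toy texts are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def build_profile(text, n=3, L=8000):
    """Map each of the L most frequent character n-grams to its count."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(counts.most_common(L))


def spi(p1, p2):
    """Count the n-grams present in both profiles (higher means more similar)."""
    return len(set(p1).intersection(p2))


def krd(p1, p2):
    """Relative distance summed over the union of n-grams (lower means closer)."""
    total = 0.0
    for x in set(p1).union(p2):
        f1, f2 = p1.get(x, 0), p2.get(x, 0)
        total += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return total


def classify(text, category_profiles, n=3, L=8000):
    """Return the label of the category profile sharing most n-grams with the text."""
    doc_profile = build_profile(text, n, L)
    return max(category_profiles, key=lambda c: spi(doc_profile, category_profiles[c]))


# Hypothetical toy profiles; in SP they would come from the first stage.
category_profiles = {
    "female 18-24": build_profile("lorem ipsum dolor"),
    "male 18-24": build_profile("sit amet consectetur"),
}
label = classify("ipsum lorem", category_profiles)
print(label)
```
      </preformat>
      <p>With the intersection count, a tie between two categories is resolved here by taking the first one in the dictionary; how SP resolves ties is not stated in the text.</p>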
    </sec>
    <sec id="sec-5">
      <title>Experiments and results</title>
      <sec id="sec-5-1">
        <title>Intra-Domain Study</title>
        <p>We first studied the performance of SP in intra-domain experiments.
Regarding the parameter L of SP, choosing an appropriate value is
important to achieve a balance between an acceptable execution time
and a good percentage of correctly classified instances. Moreover, if the value of L
is too small, underfitting occurs. On the contrary, if L is excessively
large, SP can overfit, because the
generated profiles would be adjusted too closely to the corpus used for training.</p>
        <p><sup>5</sup> We used the library Morphadorner for this step, which is an open-access Java library
for NLP supplied by Northwestern University.</p>
        <p>Then, we carried out some preliminary intra-domain experiments, using only
the training corpus provided by the PAN 2016 competition. Although the
competition stated that the Author Profiling task would focus on cross-genre age and gender
identification, we found it convenient to try different values of L using the same
corpus for both training and testing. The PAN 2016 corpus consists of 436 documents
written in English, 250 in Spanish and 384 in Dutch. We split this
collection, taking 80% for training and leaving the remaining 20% for testing.</p>
        <p>Tables 1 and 2 show the results of the experiments for gender in Dutch, English
and Spanish, as well as for age in the case of the latter two. As the
measure of performance we use accuracy, that is, the percentage of correctly
classified instances. The rows of Tables 1 and 2 indicate the
different values of L (from 2000 to 8000) and the columns indicate different representation
models, that is, only character 3-grams, the combination of
3-grams to 5-grams, and so on.</p>
        <p>We can observe that, in general, the best accuracy values were reached
when L was 4000 and 3-grams were used. In some cases, 5-grams perform
similarly to 3-grams, but we chose the latter because of
execution time: building the profiles based on 5-grams took
twice as long as building the profiles based on 3-grams.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Cross-genre Study</title>
        <p>
          As mentioned before, this year's PAN Author Profiling task was stated as
cross-genre classification [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In this context, "genre" refers to the type of source
from which the texts come, for example, Twitter, blogs and social media.
For the experiments we constructed the profiles for SP from the complete
training corpus provided by the PAN-2016 competition.
        </p>
        <p>
          In order to test our SP method in a cross-genre scenario, we used two different
corpora: a representative subset of the collection supplied by the
PAN-2014 competition [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], and the complete corpus of the PAN-2015 competition [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. From the
former collection we only considered the texts obtained from blogs and social
media, in both Spanish and English. From the latter collection we
used all the available texts, which were obtained from Twitter; for Dutch we
only evaluated the gender identification problem.
        </p>
        <p>First, we obtained a general baseline for comparison.
Using the training and test sets mentioned in the previous paragraph,
with the Naïve Bayes classifier and the tf-idf word representation, we reached
the results shown in Table 3.</p>
        <p>The results obtained with our SP method are shown in Table 4 and Table
5. We show the accuracy obtained for classification by gender and age, with
different L values using 3-grams. The results in both tables correspond to
testing on the PAN-2014 collection for English and Spanish. As we can see,
SP with L=8000 achieves, in most cases, the highest classification
accuracy (above the baseline).</p>
        <p>Table 6 shows the accuracy obtained using the PAN-2015 collection for testing
with different L values. Although no single L value is the best in
all languages for both age and gender, we can conclude that L=8000 still
performs well in most cases. In fact, for Dutch (for which only
gender identification was evaluated), SP obtained its best result with this value of L.</p>
        <p>
          Finally, for simplicity, we configured our SP system, for all categories and all
languages, with L=8000 and SPI as the similarity measure for the
final submission to the PAN competition. This decision was based
on the averages of the results shown in the tables above. All the
experiments were run using the TIRA platform [
          <xref ref-type="bibr" rid="ref21 ref22">21,22</xref>
          ].
        </p>
        <p>Figure 2 summarizes the performance of our system when it is tested
on different corpora using the PAN-2016 data set for building the profiles. It is
worth noting that in all considered cases (PAN-2014 and PAN-2015) the accuracy
values are good when L=8000.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>This paper described the joint participation of the LIDIC research group of the
UNSL from Argentina and the LyR research group of the UAM Cuajimalpa from
Mexico in the PAN-2016 Author Profiling task.</p>
      <p>We presented a profile-based method for the Author Profiling task. Our
proposal uses profiles of character 3-grams for representing information about the
different categories of authors. We performed experiments in intra- and cross-genre
scenarios and showed that, using the 8000 most frequent character
3-grams, our method obtains its best classification performance for gender and
age.</p>
      <p>In future work we plan to test different features for the construction of the
profiles and different similarity measures for comparing the profiles.</p>
      <p>Acknowledgments. This work was partially funded by CONACyT under the
Thematic Networks program (Language Technologies Thematic Network projects
260178, 271622). We also thank UAM Cuajimalpa, CONACyT (Project grant
number 258588) and SNI-CONACyT for their support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. <string-name><given-names>E.</given-names> <surname>Stamatatos</surname></string-name>, "<article-title>A survey of modern authorship attribution methods</article-title>," <source>J. Am. Soc. Inf. Sci. Technol.</source>, vol. <volume>60</volume>, pp. <fpage>538</fpage>–<lpage>556</lpage>, Mar. <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. <string-name><given-names>S.</given-names> <surname>Argamon</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Koppel</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Fine</surname></string-name>, and <string-name><given-names>A. R.</given-names> <surname>Shimoni</surname></string-name>, "<article-title>Gender, genre, and writing style in formal written texts</article-title>," <source>TEXT</source>, vol. <volume>23</volume>, pp. <fpage>321</fpage>–<lpage>346</lpage>, <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. <string-name><given-names>M.</given-names> <surname>Koppel</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Argamon</surname></string-name>, and <string-name><given-names>A. R.</given-names> <surname>Shimoni</surname></string-name>, "<article-title>Automatically categorizing written texts by author gender</article-title>," <source>Literary and Linguistic Computing</source>, vol. <volume>17</volume>, no. <issue>4</issue>, pp. <fpage>401</fpage>–<lpage>412</lpage>, <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. <string-name><given-names>J. D.</given-names> <surname>Burger</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Henderson</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Kim</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Zarrella</surname></string-name>, "<article-title>Discriminating gender on twitter</article-title>," in <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>, EMNLP '11, (Stroudsburg, PA, USA), pp. <fpage>1301</fpage>–<lpage>1309</lpage>, Association for Computational Linguistics, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. <string-name><given-names>C.</given-names> <surname>Peersman</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Daelemans</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Van Vaerenbergh</surname></string-name>, "<article-title>Predicting age and gender in online social networks</article-title>," in <source>Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents</source>, SMUC '11, (New York, NY, USA), pp. <fpage>37</fpage>–<lpage>44</lpage>, ACM, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. <string-name><given-names>D.</given-names> <surname>Nguyen</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Gravel</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Trieschnigg</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Meder</surname></string-name>, "<article-title>How old do you think i am?; a study of language and age in twitter</article-title>," in <source>Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media</source>, AAAI Press, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. <string-name><given-names>A. P.</given-names> <surname>Lopez-Monroy</surname></string-name>, <string-name><given-names>M. M.</given-names> <surname>y Gomez</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Villaseñor-Pineda</surname></string-name>, and <string-name><given-names>E.</given-names> <surname>Stamatatos</surname></string-name>, "<article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>," <source>Knowledge-Based Systems</source>, vol. <volume>89</volume>, pp. <fpage>134</fpage>–<lpage>147</lpage>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. <string-name><given-names>F.</given-names> <surname>Rangel</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Verhoeven</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Daelemans</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Potthast</surname></string-name>, and <string-name><given-names>B.</given-names> <surname>Stein</surname></string-name>, "<article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations</article-title>," in <source>Working Notes Papers of the CLEF 2016 Evaluation Labs</source>, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, Sept. <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. <string-name><given-names>N.</given-names> <surname>Potha</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Stamatatos</surname></string-name>, "<article-title>A Profile-Based Method for Authorship Verification</article-title>," in <source>Artificial Intelligence: Methods and Applications: 8th Hellenic Conference on AI, SETN 2014, Ioannina, Greece, May 15-17, 2014, Proceedings</source>, pp. <fpage>313</fpage>–<lpage>326</lpage>. Cham: Springer International Publishing, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. <string-name><given-names>H. V.</given-names> <surname>Halteren</surname></string-name>, "<article-title>Author verification by linguistic profiling: An exploration of the parameter space</article-title>," <source>ACM Trans. Speech Lang. Process.</source>, vol. <volume>4</volume>, pp. <fpage>1:1</fpage>–<lpage>1:17</lpage>, Feb. <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. <string-name><given-names>J.</given-names> <surname>Grieve</surname></string-name>, "<article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>," <source>Literary and Linguistic Computing</source>, vol. <volume>22</volume>, no. <issue>3</issue>, pp. <fpage>251</fpage>–<lpage>270</lpage>, <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>, <string-name><given-names>M. M.</given-names> <surname>y Gomez</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Solorio</surname></string-name>, "<article-title>A weighted profile intersection measure for profile-based authorship attribution</article-title>," in <source>Proceedings of MICAI 2011</source>, vol. <volume>7094</volume>, pp. <fpage>232</fpage>–<lpage>243</lpage>, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          , "
          <article-title>Effects of age and gender on blogging</article-title>
          ," in
          <source>AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          , pp.
          <fpage>199</fpage>
          –
          <lpage>205</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          , "
          <article-title>N-gram-based text categorization</article-title>
          ," in
          <source>Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval</source>
          , pp.
          <fpage>161</fpage>
          –
          <lpage>175</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>R.</given-names>
            <surname>Layton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Watters</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Dazeley</surname>
          </string-name>
          , "
          <article-title>Recentred local profiles for authorship attribution</article-title>
          ,"
          <source>Natural Language Engineering</source>
          , vol.
          <volume>18</volume>
          , pp.
          <fpage>293</fpage>
          –
          <lpage>312</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>V.</given-names>
            <surname>Keselj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cercone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomas</surname>
          </string-name>
          , "
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          ,"
          <source>Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>255</fpage>
          –
          <lpage>264</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>G.</given-names>
            <surname>Frantzeskou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gritzalis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Katsikas</surname>
          </string-name>
          , "
          <article-title>Source code author identification based on n-gram author profiles</article-title>
          ," in
          <source>Artificial Intelligence Applications and Innovations</source>
          , vol.
          <volume>204</volume>
          of IFIP, pp.
          <fpage>508</fpage>
          –
          <lpage>515</lpage>
          ,
          <publisher-name>Springer US</publisher-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. "
          <article-title>9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013)</article-title>
          ."
          <uri>http://pan.webis.de/</uri>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chugur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Trenkmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verhoeven</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          , "
          <article-title>Overview of the 2nd Author Profiling Task at PAN 2014</article-title>
          ," in
          <source>CLEF 2014 Evaluation Labs and Workshop</source>
          , pp.
          <fpage>15</fpage>
          –
          <lpage>18</lpage>
          ,
          <publisher-name>CEUR-WS.org</publisher-name>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          , "
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          ," in
          <source>CLEF 2015 Evaluation Labs and Workshop</source>
          , pp.
          <fpage>8</fpage>
          –
          <lpage>11</lpage>
          ,
          <publisher-name>CEUR-WS.org</publisher-name>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burrows</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          , "
          <article-title>TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</article-title>
          ," in
          <source>9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</source>
          (
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liddle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-D.</given-names>
            <surname>Schewe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , eds.), (Los Alamitos, California), pp.
          <fpage>151</fpage>
          –
          <lpage>155</lpage>
          , IEEE, Sept.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , "
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          ," in
          <source>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</source>
          (
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Toms</surname>
          </string-name>
          , eds.), (Berlin Heidelberg New York), pp.
          <fpage>268</fpage>
          –
          <lpage>299</lpage>
          , Springer, Sept.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>