<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Author Profiling using LSTMs: Notebook for PAN at CLEF 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roy Khristopher Bayot, Teresa Gonçalves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade de Évora, Évora</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper shows one approach of the Universidade de Évora to author profiling for PAN 2018. The approach mainly consists of using word vectors and LSTMs for gender classification. Using the PAN 2018 dataset, we achieved an accuracy of 67.60% for Arabic, 77.16% for English, and 68.73% for Spanish gender classification.</p>
      </abstract>
      <kwd-group>
        <kwd>author profiling</kwd>
        <kwd>twitter</kwd>
        <kwd>word vectors</kwd>
        <kwd>word2vec</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Communication methods have changed rapidly in recent years, especially
with the rise of different social media platforms such as Facebook, Instagram,
and Twitter. Aside from exchanges on these social media platforms, there are also
message boards, question answering sites, and recommendation
sites such as Reddit, Quora, Yelp, and Amazon that make up activity online.</p>
      <p>Fake profiles are one of the problems with these online communication
models. Incomplete information in someone's profile is another. Analyzing
authorship is thus one way to address these problems. One
component of analyzing the authorship of a text is profiling, that is,
determining aspects such as age, gender, or personality.</p>
      <p>
        Our work tries a method for gender author profiling in English, Spanish, and
Arabic Twitter text using long short-term memory (LSTM) recurrent neural networks.
The paper is organized as follows: Section 2 covers related literature, first
discussing previous author profiling endeavors, then methods in PAN, and then
long short-term memory recurrent neural networks in conjunction
with word vectors. Section 3 describes the author profiling task as well as the
dataset. Section 4 describes the methodology and results, beginning with the
creation of word2vec vectors, the model, details of the training, and then how it
was evaluated. Section 5 gives the conclusion and recommendations.
      </p>
      <p>
        Argamon et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used
content-based features such as the 1000 most frequent words in the text with high
information gain. The work also used style-based features such as the nodes of
a taxonomic tree made from systemic functional linguistics [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Schler et al. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] provide another example of author profiling. It is also
centered on gender and age classification and uses stylistic and content
features. Parts-of-speech tags, function words, hyperlinks, and non-dictionary words
composed the stylistic features, while word unigrams with high information gain
comprised the content features. These features were then used with a
Multi-Class Real Winnow for the classification.
      </p>
      <sec id="sec-1-1">
        <title>PAN Editions</title>
        <p>PAN is one of the initiatives at CLEF that has various tasks related to author
analysis. It covers author identification, obfuscation, and profiling. The author
profiling task has been running since 2013, with different aspects of the task
each year.</p>
        <p>
          The focus for PAN 2013 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] was age and gender profiling. The corpus used
then consisted of blogs in Spanish and English. The focus was extended to more sources
in PAN 2014 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. This edition had texts from blogs, reviews, Twitter, and
social media. The task was expanded again in PAN 2015 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. In that year, age
and gender classification was accompanied by regression for personality
traits. The personality traits included extroversion, stability, agreeableness,
conscientiousness, and openness. However, there was only a Twitter corpus, in four
languages: English, Spanish, Italian, and Dutch. The focus for PAN 2016 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]
was cross-genre evaluation. The idea was to train models on tweets and test
them on blogs, reviews, and social media. The languages included that
year were English, Spanish, and Dutch.
        </p>
        <p>
          The focus for PAN 2017 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] was on determining the author's origin from
a given text, in addition to gender classification. For instance, an English tweet
could come from an author from the US, UK, Canada, Ireland, New Zealand,
or Australia. A Portuguese tweet could be from Portugal or Brazil. An Arabic
tweet could be Egyptian, Gulf, Levantine, or Maghrebi. And a Spanish tweet
could be from Argentina, Chile, Colombia, Mexico, Peru, Spain, or Venezuela.
        </p>
        <p>
          The majority of the approaches are similar to Argamon et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and Schler et
al. [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], wherein features are content-based or style-based. There are also other
features that are n-gram based or information retrieval based. The classifiers
used also vary, and include logistic regression, multinomial Naive Bayes,
liblinear, random forests, Support Vector Machines, and decision tables.
        </p>
        <p>
          Among the variations is the work of Weren et al. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], which also
used length features such as the number of characters, words, and sentences. Their
approach also checks for information retrieval features such as cosine similarity, as
well as readability features such as the Flesch-Kincaid readability score. Marquardt
et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] also used a combination of content-based features (MRC, LIWC,
sentiments) and stylistic features (readability, HTML tags, spelling and
grammatical errors, emoticons, total number of posts, number of capitalized letters, and number
of capitalized words). Maharjan et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] used n-grams with stopwords,
punctuation, and emoticons. The work also included the idf count. Villena Roman
et al. [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] used a term vector model representation.
        </p>
        <p>
          One of the more prominent approaches in the previous editions is that of
Lopez-Monroy et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It is prominent in the sense that it worked best for
most tasks in most editions. They placed second for both English and Spanish in
2013, where they used a second-order representation based on relationships between
documents and profiles. Another work that placed first for English in 2013 was
that of Meina et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], while Santosh et al. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] worked well for Spanish.
The work of Meina et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] used collocations while Santosh et al. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] used
POS features. The work of Lopez-Monroy et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] gave the best result, with
an average accuracy of 28.95% on all corpus types and languages for PAN 2014.
They used the same method as the previous year [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          The work of Alvarez-Carmona et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] used second-order profiles similar to
previous years. They used them in conjunction with LSA to get the best results on
English, Spanish, and Dutch for PAN 2015. The work of Gonzalez-Gallardo et
al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], on the other hand, used character n-grams and POS n-grams, which gave
the best result for Italian.
        </p>
        <p>
          In 2016, there were multiple comparisons, since the test genre for the early
bird was different from that of the final evaluation, and there were
comparisons with the earlier years as well. However, looking at the final ranking, the top
3 are Busger et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Modaresi et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], and Bilan et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Individually,
the work of Bougiatiotis and Krithara [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is the top for English, the work
of Deneva et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is the top for Dutch, and Busger et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and Modaresi et
al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] are tied for Spanish. Busger et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] used combinations of stylistic
features such as function words, parts-of-speech, emoticons, and punctuation signs.
They combined this with a second-order representation and trained their models
with an SVM. Modaresi et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] used a combination of lexical features with word
and character n-grams together with stylometric features as inputs to a logistic
regression classifier. Bilan et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] used parts-of-speech, collocations, connective
words, and various other stylometric features for classification. Bougiatiotis
and Krithara [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] also used stylometric features with character n-grams and the
second-order representation in conjunction with an SVM.
        </p>
        <p>
          In 2017, although there were approaches more closely related to deep
learning, such as RNNs [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and CNNs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], most of the top results were obtained
using SVMs [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For instance, the top result came from Basile et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], who used
a combination of character and tf-idf n-grams to train an SVM. The second
result came from Martinc et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], who used a combination of character, word,
and POS n-grams, emojis, sentiments, character flooding, and lists of words per
variety as features for a logistic regression classifier. The third best result, by
Tellez et al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], also used an SVM.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>LSTM and word vectors</title>
        <p>Most of the previous approaches hinge on extracting predefined features such as
those for style and content. However, a recent trend is to use neural networks to
learn certain filters at run time and use the learned filters to generate a feature
representation suitable for classification. This approach needs two things: word
vectors and the neural network architecture.</p>
        <p>
          Word vectors, or word embeddings, need to be created to represent
words in a dictionary. These vectors capture some semantic relation between
the words, and word2vec is one of the prominent methods, developed by Mikolov
et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. To create the vectors, random numbers are initially used for words
from a dictionary of a corpus such as a Wikipedia dump. Then, by going through
the text in the corpus, a word's vector representation is learned by prediction
using adjacent words. Getting the vector can be done through either skip-grams
or continuous bag of words (CBOW). In CBOW, the word is predicted
given the context of adjacent words, while it is the opposite in skip-grams: the
context words are predicted given a word. The word vectors are updated
as these predictions are made, so as to reduce the prediction error.
        </p>
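        <p>The difference between the two training schemes can be made concrete with a minimal sketch that only generates the training examples each scheme would see. The function names and the toy sentence are hypothetical, and the sketch leaves out what a real word2vec implementation adds on top (the neural prediction step itself, subsampling, and negative sampling).</p>
        <preformat preformat-type="code">
```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=5):
    """Return (context_words, center) examples: the neighbours predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, center))
    return examples


sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
print(cbow_examples(sentence, window=1)[1])
```
        </preformat>
        <p>With a window of 1, the word "cat" yields two skip-gram pairs (one per neighbour) but a single CBOW example whose context holds both neighbours.</p>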
        <p>
          Choosing an architecture comes next after creating word vectors. Among
neural network architectures, recurrent neural networks are especially good for
sequences such as text, since they use the previous inputs along with the current
input for prediction. This can be seen in the simple recurrent network
developed by Jeff Elman [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, recurrent neural networks usually
suffer from the vanishing gradient problem, especially with long sequences. One way
this was dealt with was by using long short-term memory units, which were originally
proposed by Hochreiter and Schmidhuber [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The idea was to make
analog gates that control which information could be stored and which remembered
information could be used. This allowed the propagated errors to be more
constant and thus helped with the vanishing gradient problem.
        </p>
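        <p>For reference, the gating idea can be written out in the formulation commonly used by current libraries. This is a sketch under assumptions: it includes a forget gate, which was a later addition to the original design, and the symbols (input x_t, hidden state h_t, cell state c_t, weight matrices W and U, biases b, logistic sigmoid sigma, elementwise product) are our notation rather than that of the cited paper.</p>
        <preformat preformat-type="code">
```latex
\begin{gather*}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t = o_t \odot \tanh(c_t)
\end{gather*}
```
        </preformat>
        <p>Because the cell state is updated additively, an error signal passing from c_t back to c_{t-1} is scaled only by f_t, so it is roughly preserved whenever the forget gate stays close to 1.</p>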
        <p>
          An example of using LSTMs for classification is that of Rao and Spasojevic
[
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. They used LSTMs for two different datasets: one on customer service and
one on political leaning. Their problem was to classify text as either actionable
or non-actionable in the domain of customer service, and as either
Republican or Democrat in terms of political leaning.
        </p>
        <p>
          Another example is that of Tang et al. in [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] for document classification. In
their work, CNNs and LSTMs were used to learn sentence representations and
the results were encoded with a gated recurrent network. Their model was used
on reviews from IMDB and Yelp.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Methodology and Results</title>
      <p>Figure 1 gives an overview of the system, from how
the dataset is processed before being fed into the LSTM to how it is evaluated.
The details are described in the following subsections.</p>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>
          In the current edition of PAN 2018 [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] for author profiling [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], the task is to
predict gender based on text, images, or both. The current dataset has 1500
users for Arabic, and 3000 users each for English and Spanish. These are all
balanced, with an equal number of male and female users.
        </p>
        <p>
          Each user has 100 tweets and 10 images. The images do not necessarily
show the user; they are an assortment of images
the user has on the profile. The idea is to train a model to classify a user with
tweets alone, images alone, or both. The model is then submitted to the TIRA
server [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] for evaluation on a held-out test set.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Pre-trained Vectors</title>
        <p>
          Word embeddings were created from Wikipedia dumps. The February 05, 2016
Wikipedia dump was used for English and Spanish. The English Wikipedia dump
at that time was 11.8 GB compressed, while the Spanish one was 2.2 GB compressed.
The Arabic Wikipedia dump was from March 20, 2018, at about 600 MB
compressed. These dumps were extracted, transformed into lowercase, and
written as entries in one file. The word2vec implementation of gensim [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] was used
to generate our own vectors, using the Wikipedia text as input. In terms of
word2vec parameters, no lemmatization was done, and the window size used
was 5. Skip-gram rather than continuous bag of words was used as the method
to generate the vectors, and the size of the embeddings chosen was 300.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Preprocessing</title>
        <p>Before training the model, we prepared the training file. The training file was
created by preprocessing the XML files. Each user has one XML file, and the tweets
were extracted to form one training example. The examples were all converted to lower
case. No stop words were removed. Hashtags, numbers, mentions, shares, and
retweets were not processed or transformed into anything else. They were retained
as is, and each corresponds to its own item in the dictionary of words. The test
set from TIRA was processed in the same manner.</p>
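        <p>A minimal sketch of this extraction step follows. The element names (author, documents, document), the function name, and the two sample tweets are assumptions made for illustration; treating each tweet as its own text is also an assumption, suggested by the per-tweet predictions used in the evaluation.</p>
        <preformat preformat-type="code">
```python
import xml.etree.ElementTree as ET


def tweets_from_user(root):
    """Return the lowercased text of every document element under root.

    Hashtags, mentions, numbers, and stop words are left untouched.
    """
    return [(doc.text or "").lower() for doc in root.iter("document")]


# Build a tiny stand-in for one user's file (element names are assumptions).
author = ET.Element("author", {"lang": "en"})
documents = ET.SubElement(author, "documents")
for text in ["Loving #PAN18 with @friend", "RT 3 New Photos"]:
    ET.SubElement(documents, "document").text = text

print(tweets_from_user(author))
```
        </preformat>
        <p>For a file on disk, the root element would come from ET.parse(path).getroot() instead of being built by hand.</p>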
      </sec>
      <sec id="sec-2-4">
        <title>Model</title>
        <p>
          We have a basic model for this experiment that is implemented in Keras [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] with
a Theano [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] backend, run on an NVIDIA Tesla K20c GPU. The model mainly
consists of an embedding layer, one LSTM layer, and one dense layer.
        </p>
        <p>All words in the training set are turned into number indices, each of which corresponds
to a word vector. Each training example is thus represented by a sequence of
numbers. Raw sequence lengths vary, so the total number of indices in the
sequence is held at 64, and padding is done to ensure it. We then feed the sequence
into the system. Each number is looked up in the embedding layer and
converted to a word vector according to the pre-trained word vectors previously
discussed. The word vectors are then passed to an LSTM layer with an output size of 32,
a dropout value of 0.2, and a recurrent dropout value of 0.5. The output of the
LSTM layer is sent to a dense layer with 2 output units and a sigmoid activation
function.</p>
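        <p>The index encoding and the fixed length of 64 can be sketched as follows. Only the length of 64 comes from the text. The reserved indices, whitespace tokenization, left-side padding, and the truncation of longer texts are assumptions; they mirror the defaults of the Keras pad_sequences helper, but whether that helper was used is not stated.</p>
        <preformat preformat-type="code">
```python
PAD = 0        # index reserved for padding (assumption)
UNKNOWN = 1    # index for out-of-vocabulary words (assumption)
MAX_LEN = 64   # sequence length stated in the text


def build_vocab(examples):
    """Map each distinct token to an integer index, starting after the reserved ones."""
    vocab = {}
    for text in examples:
        for token in text.split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text, vocab, max_len=MAX_LEN):
    """Turn a text into exactly max_len indices, left-padding with PAD."""
    ids = [vocab.get(token, UNKNOWN) for token in text.split()]
    ids = ids[-max_len:]  # keep the last max_len tokens if the text is longer
    return [PAD] * (max_len - len(ids)) + ids


vocab = build_vocab(["good morning #monday", "good night"])
seq = encode("good night everyone", vocab)
print(len(seq), seq[-3:])
```
        </preformat>
        <p>Each index in the resulting sequence is what the embedding layer would look up to obtain a 300-dimensional word vector.</p>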
        <p>Stochastic gradient descent over shuffled mini-batches with the Adam update
rule was used for training. Each mini-batch had 4096 examples. The
development set comprised 20% of the training set. We capped the number
of epochs at 200 and used early stopping. We saved the best model
trained and used it for our test.</p>
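        <p>The 20% development split and the early-stopping rule can be sketched in plain Python. The 20% fraction and the 200-epoch cap come from the text. The patience value, the random seed, and the choice of development loss as the monitored quantity are assumptions, since the text does not specify them.</p>
        <preformat preformat-type="code">
```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out dev_fraction of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def best_epoch(dev_losses, patience=5, max_epochs=200):
    """Return the 0-based epoch with the lowest dev loss before training stops.

    Training stops once `patience` consecutive epochs fail to improve on the
    best dev loss, or when max_epochs is reached.
    """
    best, best_at, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(dev_losses[:max_epochs]):
        if best > loss:
            best, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited == patience:
                break
    return best_at


train, dev = split_train_dev(range(10))
print(len(train), len(dev))
print(best_epoch([0.9, 0.7, 0.8, 0.75, 0.6], patience=2))
```
        </preformat>
        <p>In the last call, training stops after two epochs without improvement, so the model saved at epoch 1 is the one kept, even though a later epoch would have been better. A larger patience trades training time for a lower chance of stopping too early.</p>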
      </sec>
      <sec id="sec-2-5">
        <title>Evaluation</title>
        <p>When the training finishes and the model is saved, we load the model for the
test file and apply it to the tweets of each user. After getting predictions for all
the tweets, we used the majority prediction as the final prediction for the user.</p>
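        <p>The majority vote over a user's tweets can be sketched as below. The function name and label strings are hypothetical. With 100 tweets per user an even split is possible, and the text does not say how ties were resolved, so the tie-breaking label here is an arbitrary assumption.</p>
        <preformat preformat-type="code">
```python
from collections import Counter


def user_prediction(tweet_predictions, tie_break="female"):
    """Return the label predicted for most of a user's tweets.

    tie_break is the label returned when the top two counts are equal; the
    source does not say how ties were handled, so this default is arbitrary.
    """
    counts = Counter(tweet_predictions).most_common(2)
    if not counts:
        raise ValueError("no tweet predictions for this user")
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]


print(user_prediction(["male"] * 60 + ["female"] * 40))
```
        </preformat>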
      </sec>
      <sec id="sec-2-6">
        <title>Results</title>
        <p>We used the model on the full training set and predicted the gender. The
confusion matrices of the results are given in Tables 1, 2, and 3 for Arabic, English, and
Spanish respectively.</p>
        <p>The model gives an accuracy of 81.80% for Arabic, a misclassification rate of 18.20%.
When it predicts male, it is 82.44% likely to be an actual male. It is
roughly similar for predicting females, which are 81.18% likely to be actually female.
English accuracy is 85.13%, and the misclassification rate is 14.87%. When it
predicts male, it is 86.55% likely to be actually male. When it predicts female, it
is 83.82% likely to be actually female. Finally, Spanish accuracy is 75.27%, with a
misclassification rate of 24.73%. When it predicts male, it is 73.51% likely to be actually
male. When it predicts female, it is 77.30% likely to be actually female.</p>
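        <p>The quantities reported above (accuracy, misclassification rate, and the share of correct predictions per predicted class, i.e. precision) all follow from a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical and chosen only for illustration; they are not the values of Tables 1 to 3.</p>
        <preformat preformat-type="code">
```python
def summarize(confusion):
    """Compute accuracy, misclassification rate, and per-class precision.

    confusion[actual][predicted] holds the number of users with that
    actual label who received that predicted label.
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[label][label] for label in labels)
    accuracy = correct / total
    precision = {}
    for label in labels:
        predicted_as_label = sum(confusion[actual][label] for actual in labels)
        precision[label] = confusion[label][label] / predicted_as_label
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "precision": precision,
    }


# Hypothetical counts for illustration only; these are not the paper's tables.
example = {
    "male": {"male": 40, "female": 10},
    "female": {"male": 5, "female": 45},
}
print(summarize(example))
```
        </preformat>
        <p>The sketch assumes every label is predicted at least once; otherwise the precision for that label is undefined and the division fails.</p>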
        <p>However, when this model was applied to the test set on the TIRA servers,
the results were lower than the accuracies given above. The results of our approach are
in Table 4. Compared with the results of the other participants, our approach ranked
18th globally. We ranked 20th out of 23 for Arabic, 17th out of 23 for English, and 19th
out of 23 for Spanish. The highest accuracy achieved for Arabic text was 0.8170,
for English text 0.8221, and for Spanish text 0.8200. The differences between
these accuracies and ours are 14.10, 5.05, and 13.27 percentage points for Arabic, English, and Spanish
respectively. English has the smallest gap, perhaps because of the word
vectors used. Since the English word vectors came from a bigger resource, 11.8 GB
against 2.2 GB and 600 MB for the other languages, this could have contributed to
better vectors for the classification problem.</p>
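        <p>The gaps quoted above can be checked directly from the reported accuracies. The sketch below uses only numbers stated in this paper (our test accuracies from the abstract and the best accuracies per language); the variable names are ours.</p>
        <preformat preformat-type="code">
```python
# Accuracies reported in the text: our test results and the best per language.
ours = {"Arabic": 0.6760, "English": 0.7716, "Spanish": 0.6873}
best = {"Arabic": 0.8170, "English": 0.8221, "Spanish": 0.8200}

# Gap in percentage points, rounded to two decimals.
gaps = {lang: round((best[lang] - ours[lang]) * 100, 2) for lang in ours}
closest = min(gaps, key=gaps.get)

print(gaps)
print(closest)
```
        </preformat>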
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Recommendation</title>
      <p>To summarize, we were able to use word vectors together with long short-term
memory networks for classification. Our accuracy is above 50%; however,
it is among the lowest of the submissions. We submitted a naive
approach to LSTMs, and there are multiple parameters that could still be explored.
Aside from the breadth of hyperparameters, it would also be interesting to see
if using vectors for characters instead of words would be useful. This could be
an interesting direction, since some of the past approaches to classification that
worked well used character n-grams. Another approach could be to find a way
to incorporate stylometric features into the model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Miguel A</given-names>
            <surname>Alvarez-Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Pastor</given-names>
            <surname>Lopez-Monroy</surname>
          </string-name>
          , Manuel Montes-y-Gomez, Luis Villaseñor-Pineda, and Hugo Jair Escalante.
          <article-title>Inaoe's participation at pan'15: Author profiling task</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2015 Evaluation Labs</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Shlomo</given-names>
            <surname>Argamon</surname>
          </string-name>
          , Moshe Koppel, James W Pennebaker, and
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Schler</surname>
          </string-name>
          .
          <article-title>Automatically profiling the author of an anonymous text</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>52</volume>
          (
          <issue>2</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Angelo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <article-title>N-GrAM: New Groningen author-profiling model</article-title>
          .
          <source>arXiv preprint arXiv:1707.03764</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Bilan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Desislava</given-names>
            <surname>Zhekova</surname>
          </string-name>
          .
          <article-title>Caps: A cross-genre author profiling system</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>824</fpage>
          -
          <lpage>835</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Konstantinos</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anastasia</given-names>
            <surname>Krithara</surname>
          </string-name>
          .
          <article-title>Author profiling using complementary second order attributes and stylometric features</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>836</fpage>
          -
          <lpage>845</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Francois</given-names>
            <surname>Chollet</surname>
          </string-name>
          . keras. https://github.com/fchollet/keras,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Corinna</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Jeffrey L</given-names>
            <surname>Elman</surname>
          </string-name>
          .
          <article-title>Finding structure in time</article-title>
          .
          <source>Cognitive science</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>211</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Pepa</given-names>
            <surname>Gencheva</surname>
          </string-name>
          , Martin Boyanov, Elena Deneva, Preslav Nakov,
          <string-name>
            <given-names>G</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Kiprov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I</given-names>
            <surname>Koychev</surname>
          </string-name>
          .
          <article-title>Pancakes team: a composite system of genre-agnostic features for author profiling</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Carlos E</given-names>
            <surname>Gonzalez-Gallardo</surname>
          </string-name>
          , Azucena Montes, Gerardo Sierra, J Antonio Nuñez-Juarez, Adolfo Jonathan Salinas-Lopez, and
          <string-name>
            <given-names>Juan</given-names>
            <surname>Ek</surname>
          </string-name>
          .
          <article-title>Tweets classification using corpus dependent tags, character and pos n-grams</article-title>
          .
          <source>In Proceedings of CLEF</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Halliday</surname>
          </string-name>
          , Christian MIM Matthiessen, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Matthiessen</surname>
          </string-name>
          .
          <article-title>An introduction to functional grammar</article-title>
          .
          <source>Routledge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          Sepp Hochreiter and Jürgen Schmidhuber.
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Leon Bottou, Yoshua Bengio, and Patrick Haffner.
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>86</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Adrian Pastor</given-names>
            <surname>Lopez-Monroy</surname>
          </string-name>
          , Manuel Montes-y-Gomez, Hugo Jair Escalante, Luis Villaseñor-Pineda, and Esau Villatoro-Tello.
          <article-title>Inaoe's participation at pan'13: Author profiling task</article-title>
          .
          <source>In CLEF 2013 Evaluation Labs and Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Adrian Pastor</given-names>
            <surname>Lopez-Monroy</surname>
          </string-name>
          , Manuel Montes-y-Gomez, Hugo Jair Escalante, and Luis Villaseñor-Pineda.
          <article-title>Using intra-profile information for author profiling</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>1116</fpage>
          –
          <lpage>1120</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Suraj</given-names>
            <surname>Maharjan</surname>
          </string-name>
          , Prasha Shrestha, and
          <string-name>
            <given-names>Thamar</given-names>
            <surname>Solorio</surname>
          </string-name>
          .
          <article-title>A simple approach to author profiling in MapReduce</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>1121</fpage>
          –
          <lpage>1128</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>James</given-names>
            <surname>Marquardt</surname>
          </string-name>
          , Golnoosh Farnadi, Gayathri Vasudevan,
          <string-name>
            <given-names>Marie-Francine</given-names>
            <surname>Moens</surname>
          </string-name>
          , Sergio Davalos, Ankur Teredesai, and Martine De Cock.
          <article-title>Age and gender identification in social media</article-title>
          .
          <source>Proceedings of CLEF 2014 Evaluation Labs</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Matej</given-names>
            <surname>Martinc</surname>
          </string-name>
          , Iza Skrjanec, Katja Zupan, and
          <string-name>
            <given-names>Senja</given-names>
            <surname>Pollak</surname>
          </string-name>
          .
          <article-title>PAN 2017: Author profiling - gender and language variety prediction</article-title>
          . In Cappellato et al. [13],
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Michal</given-names>
            <surname>Meina</surname>
          </string-name>
          , Karolina Brodzinska, Bartosz Celmer, Maja Czokow, Martyna Patera, Jakub Pezacki, and
          <string-name>
            <given-names>Mateusz</given-names>
            <surname>Wilk</surname>
          </string-name>
          .
          <article-title>Ensemble-based classification for author profiling using various features</article-title>
          .
          <source>Notebook Papers of CLEF</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and Jeffrey Dean.
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Pashutan</given-names>
            <surname>Modaresi</surname>
          </string-name>
          , Matthias Liebeck, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Conrad</surname>
          </string-name>
          .
          <article-title>Exploring the effects of cross-genre machine learning for author profiling in PAN 2016</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>970</fpage>
          –
          <lpage>977</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23. Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <article-title>GronUP: Groningen user profiling</article-title>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors,
          <source>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</source>
          , pages
          <fpage>268</fpage>
          –
          <lpage>299</lpage>
          , Berlin Heidelberg New York,
          <year>September 2014</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. Francisco Rangel, Paolo Rosso, Manuel Montes-y-Gomez, Martin Potthast, and Benno Stein.
          <article-title>Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</article-title>
          . In Linda Cappellato, Nicola Ferro,
          <string-name>
            <surname>Jian-Yun Nie</surname>
          </string-name>
          , and Laure Soulier, editors,
          <source>Working Notes Papers of the CLEF 2018 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org</source>
          ,
          <year>September 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and
          <string-name>
            <given-names>Giacomo</given-names>
            <surname>Inches</surname>
          </string-name>
          .
          <article-title>Overview of the author profiling task at PAN 2013</article-title>
          .
          <source>In CLEF Conference on Multilingual and Multimodal Information Access Evaluation</source>
          , pages
          <fpage>352</fpage>
          –
          <lpage>365</lpage>
          .
          <publisher-name>CELCT</publisher-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27. Francisco Rangel, Paolo Rosso,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and
          <string-name>
            <given-names>Walter</given-names>
            <surname>Daelemans</surname>
          </string-name>
          .
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In L Cappellato,
          <string-name>
            <given-names>N</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Gareth</surname>
          </string-name>
          , and E San Juan, editors,
          <source>CLEF 2015 Labs and Workshops</source>
          , Notebook Papers, volume
          <volume>1391</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016</article-title>
          . In Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald, editors,
          <source>Working Notes Papers of the CLEF 2016 Evaluation Labs</source>
          , volume
          <volume>1609</volume>
          of
          <series>CEUR Workshop Proceedings</series>
          , pages
          <fpage>750</fpage>
          –
          <lpage>784</lpage>
          .
          <publisher-name>CLEF and CEUR-WS.org</publisher-name>
          ,
          <year>September 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>Adithya</given-names>
            <surname>Rao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nemanja</given-names>
            <surname>Spasojevic</surname>
          </string-name>
          .
          <article-title>Actionable and political text classification using word embeddings and LSTM</article-title>
          .
          <source>arXiv preprint arXiv:1607.02501</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>Radim</given-names>
            <surname>Rehurek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , pages
          <fpage>45</fpage>
          –
          <lpage>50</lpage>
          ,
          <publisher-loc>Valletta, Malta</publisher-loc>
          , May
          <year>2010</year>
          . ELRA. http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>K</given-names>
            <surname>Santosh</surname>
          </string-name>
          , Romil Bansal, Mihir Shekhar, and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <article-title>Author profiling: Predicting age and gender from blogs</article-title>
          .
          <source>Notebook Papers of CLEF</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Schler</surname>
          </string-name>
          , Moshe Koppel, Shlomo Argamon, and James W Pennebaker.
          <article-title>Effects of age and gender on blogging</article-title>
          .
          <source>In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</source>
          , volume
          <volume>6</volume>
          , pages
          <fpage>199</fpage>
          –
          <lpage>205</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Francisco Rangel, Michael Tschuggnall, Mike Kestemont, Paolo Rosso, Benno Stein, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          .
          <article-title>Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          . In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18)</source>
          , Berlin Heidelberg New York,
          <year>September 2018</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>Duyu</given-names>
            <surname>Tang</surname>
          </string-name>
          , Bing Qin, and Ting Liu.
          <article-title>Document modeling with gated recurrent neural network for sentiment classification</article-title>
          .
          <source>In Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          , pages
          <fpage>1422</fpage>
          –
          <lpage>1432</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>Eric S</given-names>
            <surname>Tellez</surname>
          </string-name>
          , Sabino Miranda-Jimenez, Mario Graff, and
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Moctezuma</surname>
          </string-name>
          .
          <article-title>Gender and language variety identification with MicroTC</article-title>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37. Theano Development Team.
          <article-title>Theano: A Python framework for fast computation of mathematical expressions</article-title>
          .
          <source>arXiv e-prints</source>
          , abs/1605.02688, May
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38. Julio Villena Roman and Jose Carlos Gonzalez Cristobal.
          <article-title>Daedalus at PAN 2014: Guessing tweet author's gender and age</article-title>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <given-names>Edson RD</given-names>
            <surname>Weren</surname>
          </string-name>
          , Viviane Pereira Moreira, and Jose Palazzo M de Oliveira.
          <article-title>Exploring information retrieval features for author profiling</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , pages
          <fpage>1164</fpage>
          –
          <lpage>1171</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>