<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Twitter Users Using Autogenerated Features Invariant to Data Distribution</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research Council (CNR), Institute of Informatics and Telematics (IIT)</institution>
          <addr-line>via G. Moruzzi 1, 56124, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>With the diffusion of the Web and Social Media, automatic user profiling classifiers applied to digital contents have become extremely important in application contexts related to social and forensic studies. In many research papers on this topic, an important part of the work is devoted to a costly manual "feature engineering" phase, where semantic, syntactic, and often language-dependent features must be accurately chosen to be relevant for the profiling task. In contrast to this approach, in this work we propose a Twitter user profiling classifier which exploits deep learning techniques to automatically generate user features that are a) optimal for the user profiling task, and b) able to fight the covariate shift problem caused by data distribution differences between the training and test sets. In the best configuration found, the built system achieves very interesting accuracy results on both English and Spanish, with an average final accuracy of more than 0.83.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since the 19th century, authorship identification (AI) analyses have been used in several contexts [<xref ref-type="bibr" rid="ref13 ref28">13,28</xref>] in order to draw conclusions about specific cases where the original author was uncertain. In recent years, with the advent of the Web and Social Media, the attention of researchers has moved to the analysis of digital contents (e-mails, blogs, Twitter, etc.) in contexts related to social studies and forensic analysis. Author profiling, one of the main sub-tasks of AI, is a method of analyzing a corpus of texts in order to identify the main stylistic and content-based features characterizing the user profiles (e.g., gender, age, etc.) we are interested in. Every year, the PAN workshop proposes a specific challenge on this topic, offering multilingual annotated datasets on which participants can test their own techniques. This year, the "Bots and Gender Profiling" task is focused on profiling Twitter users considering both the user's type (bot vs. human) and, in the case of a human, the user's gender (male vs. female) [<xref ref-type="bibr" rid="ref18">18</xref>]. Differently from the multimodal data of the PAN 2018 author profiling task [<xref ref-type="bibr" rid="ref19">19</xref>], this year only textual contents are available for analysis, making the task similar to the 2017 edition [<xref ref-type="bibr" rid="ref20">20</xref>].
      </p>
      <p>
        Previous editions have shown a prevalence of software solutions based on traditional machine learning approaches relying on costly manual feature engineering phases. Except for a few works, the usage of language modeling and deep learning techniques for encoding textual features has been quite limited, and the level of accuracy reachable by using only these techniques is unclear. In this work we thus propose a software solution strongly based on these recent approaches, able to a) perform textual encoding in a language-agnostic way and with almost no manual feature engineering, and b) attenuate the data covariate shift [<xref ref-type="bibr" rid="ref24">24</xref>] between training and test datasets in order to improve classifier accuracy.
      </p>
      <p>The rest of this paper is organized as follows. In Section 2 we briefly introduce related work on this topic, and in Section 3 we show how we performed text encoding. After having introduced the covariate shift problem affecting the challenge datasets in Section 4, we describe in detail the design of the proposed solution in Section 5, the experimental results in Section 6, and the conclusions in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Until 2017, the "Author profiling" task of the PAN challenges was focused on age and gender identification [<xref ref-type="bibr" rid="ref21 ref22">22,21</xref>], while in 2017 the age part was replaced by a sub-task focused on language variety identification [<xref ref-type="bibr" rid="ref20">20</xref>]. In 2018, the main focus of the challenge was instead gender prediction, but using different types of data sources: text and images. In the following, we describe the types of software solutions proposed for the author profiling task in the last two editions of the challenge.
      </p>
      <p>
        Two main types of feature representation have been used in past works. The first is a traditional approach with features (content-based or style-based) selected through a manual "feature engineering" phase. Many works [<xref ref-type="bibr" rid="ref2 ref26 ref3 ref8 ref9">9,2,8,3,26</xref>] use character and word n-grams (or a combination of them) as basic feature representations, while others [<xref ref-type="bibr" rid="ref14 ref15 ref2 ref6">15,6,2,14</xref>] also used stylistic features obtained by counting or computing ratios for stopwords, hashtags, user mentions, character flooding, emoticons, and laughter expressions. The second type of feature representation, similar in spirit to what we propose in this work, is the usage of features obtained through neural networks. Some works [<xref ref-type="bibr" rid="ref1 ref10 ref27">1,10,27</xref>] exploited word embeddings, character embeddings, or their combination with traditional features to encode textual contents. In this work, in addition to these standard approaches, we also propose a deep learning network able to learn a latent syntax encoding of a text (see Section 3.2).
      </p>
      <p>
        The classification approaches used by many works [<xref ref-type="bibr" rid="ref1 ref26 ref3 ref9">1,9,26,3</xref>] are based on traditional machine learning algorithms like SVM, Random Forest, logistic regression, and Naive Bayes, or on ensembles of these methods. An increasing number of works [<xref ref-type="bibr" rid="ref10 ref12 ref25">12,10,25</xref>] instead use deep learning approaches, mainly adopting RNNs (like LSTM and GRU) and CNNs. Often these networks are combined together or mixed with traditional machine learning methods in order to get the best of both worlds, or to differentiate the strategy used to analyze a specific type of content (e.g., texts and images). In this work we heavily exploit deep learning algorithms to automatically generate a set of optimal and distribution-invariant user features for the problem representation, while we use an SVM over these computed features to build the final classifier.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Textual Encoding</title>
      <p>
        A common approach to encoding textual contents is to use an NLP (Natural Language Processing) pipeline [<xref ref-type="bibr" rid="ref23">23</xref>]. This type of solution suffers from a costly "feature engineering" phase, necessary to try and test several types of features in order to obtain sufficiently informative data representations for the problem at hand. Furthermore, domain knowledge and the availability of high-quality analysis tools for the target language are critical for successfully performing this step; unfortunately, these conditions cannot always be met. In order to overcome these limitations, in the following we propose three different approaches based on language modeling techniques and deep learning algorithms.
      </p>
      <sec id="sec-3-1">
        <title>W2V: a Semantic Encoding Based on Word2Vec</title>
        <p>
          The first type of text encoding used (called W2V) is based on a language modeling algorithm that has become very popular in recent years: word2vec [<xref ref-type="bibr" rid="ref11">11</xref>]. This technique allows computing word vectors (called embeddings) by analyzing a set of unannotated texts according to an objective function based on the distributional hypothesis, i.e., the observation that words appearing in similar contexts often have a similar meaning. The resulting vectors have very interesting properties matching the relationships existing between words (e.g., W("queen") ≈ W("king") - W("man") + W("woman")).
        </p>
        <p>
          Given a raw tweet, in order to obtain a vector representation to be used as text encoding with machine learning methods, we proceed as follows. We first convert the text to lowercase, then we replace each distinct URL with the word "url" and we remove all punctuation characters. If present, we also remove all the emoticons occurring in the text. The remaining textual content is then tokenized into tokens (possibly appearing more than once) and the final encoding vector is computed as
        </p>
        <p>
          V_{tweet} = \frac{1}{N} \sum_{i=1}^{N} we(token(i))   (1)
        </p>
        <p>
          where V_{tweet} is the final tweet vector, N is the total number of tokens in the tokenized tweet, token(i) is the function returning the token at position i in the tokenized vector, and we(x) is the function returning the word vector associated with token x.
        </p>
        <p>For both target languages, we used precomputed 300-dimensional embedding models built on big data collections: GoogleNews for English (available at http://tiny.cc/0vhz6y) and SBWCE for Spanish (available at https://bit.ly/2Q9DBzU).</p>
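        <p>For illustration, the following is a minimal sketch of the W2V preprocessing and the averaging of Eq. 1, assuming a gensim KeyedVectors model; the model path and the simple regular expressions are illustrative, not the exact ones used in our implementation.</p>
        <preformat>
import re
import numpy as np
from gensim.models import KeyedVectors

# Illustrative model path; we used GoogleNews (EN) and SBWCE (ES).
we = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"[^\w\s]")  # drops punctuation (and punctuation-based emoticons)

def encode_tweet(text, dim=300):
    text = URL_RE.sub(" url ", text.lower())  # lowercase, replace each URL with "url"
    text = PUNCT_RE.sub(" ", text)            # remove punctuation characters
    tokens = text.split()
    vectors = [we[t] for t in tokens if t in we]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)           # V_tweet = (1/N) * sum_i we(token(i))
        </preformat>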
      </sec>
      <sec id="sec-3-2">
        <title>SYN: a Syntax-Modeling Encoding Based on Deep Learning</title>
        <p>The main drawback of the W2V encoding described previously is that it tries to capture only semantic information from Twitter data, skipping all the remaining knowledge related to the writing style and syntax used by the different types of users when writing tweets. In order to extract syntax-related information from textual data, we propose an encoding method based on deep learning, as shown in Figure 1a. [Figure 1: (a) Syntax tweet encoding: encoder learning; (b) Chars tweet encoding: encoder learning.]</p>
        <p>The original tweet text is tokenized before being fed to the encoding neural network. The tokenization process is performed in the following way. We replace all distinct URLs with the word "url", but we remove neither the punctuation nor the emoticons found in the text. Each distinct user mention (e.g., @realdonaldtrump) and hashtag (e.g., #politic) is replaced by the words mention and hashtag, respectively. Each remaining word is transformed into one of these three words: uppercase_w for a completely uppercase word, lowercase_w for a completely lowercase word, and mixed_w for a word including both uppercase and lowercase characters.</p>
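        <p>A minimal sketch of this tokenization, under the assumption that punctuation and emoticon tokens are kept as-is (the exact splitting rules are ours to illustrate):</p>
        <preformat>
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def syn_tokenize(text):
    """Map a raw tweet to the coarse token alphabet used by the SYN encoder."""
    text = URL_RE.sub(" url ", text)
    text = MENTION_RE.sub(" mention ", text)
    text = HASHTAG_RE.sub(" hashtag ", text)
    tokens = []
    for tok in text.split():
        if tok in ("url", "mention", "hashtag") or not tok.isalpha():
            tokens.append(tok)            # punctuation/emoticons are kept as-is
        elif tok.isupper():
            tokens.append("uppercase_w")
        elif tok.islower():
            tokens.append("lowercase_w")
        else:
            tokens.append("mixed_w")
    return tokens
        </preformat>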
        <p>
          In order to learn a tweet encoder using the architecture proposed in Figure 1a, we used only the training data split and transformed the dataset of our original problem (meant to classify users into profiles) into a new dataset suitable for classifying tweets into user profiles. For each user profile in the training data, we take its tweets and its originally assigned label (bot, female, or male) and extract 100 different tweets, each one assigned the label of the user. At the end of this process we have produced two new training datasets, one for English and one for Spanish, consisting of 288,000 and 208,000 training documents respectively, which can next be used to learn the two profile classifiers. The proposed neural network uses two sub-networks to learn different feature types. A first part exploits a GRU net [<xref ref-type="bibr" rid="ref5">5</xref>] to find interesting temporal patterns in the sequence of input tokens. A second sub-network, based on a CNN [<xref ref-type="bibr" rid="ref7">7</xref>], instead tries to identify spatial features that are invariant with respect to their original position in the text. These two types of features are next concatenated together and used to find an optimal vector encoding for the tweet, before performing the final classification.
        </p>
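        <p>A sketch of this architecture in Keras (which we used for the deep learning part, see Section 6); the layer sizes and the CNN window size are illustrative assumptions, as they are not reported above:</p>
        <preformat>
from tensorflow.keras import layers, models

def build_syn_encoder(vocab_size, seq_len, n_classes=3, emb_dim=64, enc_dim=256):
    inp = layers.Input(shape=(seq_len,))
    emb = layers.Embedding(vocab_size, emb_dim)(inp)
    gru = layers.GRU(enc_dim)(emb)                          # temporal patterns
    cnn = layers.Conv1D(enc_dim, 5, activation="relu")(emb)
    cnn = layers.GlobalMaxPooling1D()(cnn)                  # position-invariant patterns
    merged = layers.concatenate([gru, cnn])
    encoder = layers.Dense(enc_dim, activation="relu", name="encoder_layer")(merged)
    out = layers.Dense(n_classes, activation="softmax")(encoder)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# After training, the tweet encoding is the output of "encoder_layer":
# encoder = models.Model(model.input, model.get_layer("encoder_layer").output)
        </preformat>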
        <p>The syntax classifiers so obtained are not interesting in themselves; what matters to us is the possibility of taking the trained nets and extracting, as the tweet encoding vector, the output produced by the "Encoder layer" layer. The resulting vector should in fact be an effective representation of the syntax information conveyed by the original tweet.</p>
      </sec>
      <sec id="sec-3-3">
        <title>CHARS: a Characters Encoding Based on Deep Learning</title>
        <p>This encoding, differently from the W2V method, aims to exploit data at the sub-word level in order to find interesting recurrent patterns and potentially useful information for user profile classification. The tokenization process of the original tweets is simpler than in the two encoders seen above. In this case, we have defined the set of valid symbols in the character dictionary as (A-Za-z0-9), the punctuation symbols, and all the distinct emoticons found in the training data split. The tokenization process thus simply consists in removing all invalid characters from the input raw text. In the same manner as described for the SYN encoder, starting from the training data split we create two new training datasets to solve a tweet → user_profile classification task. As shown in Figure 1b, the neural network architecture used to learn the encoder is slightly different from the one used in the SYN encoder. The network only uses a CNN sub-network for encoding the tweet's data, and the layer taken as the final tweet encoder is the CNN layer.</p>
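        <p>A minimal sketch of this character-level tokenization; the emoticon set extracted from the training split is omitted here for brevity:</p>
        <preformat>
import string

# Valid symbols: A-Za-z, 0-9, punctuation; the distinct emoticons found in the
# training split would be appended to this set.
VALID = set(string.ascii_letters + string.digits + string.punctuation + " ")

def chars_tokenize(text):
    """Keep only characters belonging to the dictionary; drop all the others."""
    return [c for c in text if c in VALID]
        </preformat>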
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data Mismatch in Contest Dataset</title>
      <p>
        Our first attempt to build a user profile classifier for the "Bots and Gender Profiling" task was to take all the data included in the train and validation splits, merge them together, and then perform a k-fold evaluation [<xref ref-type="bibr" rid="ref23">23</xref>] on the resulting dataset using one of the textual encodings defined in the previous section and a simple feed-forward neural network classifier. With this approach, we obtained a final accuracy on the profiling task of 0.973 for both the English and Spanish classifiers. Unfortunately, these results were much worse when we used the splits suggested by the organizers, with a final classifier accuracy of 0.785 for English and just 0.721 for Spanish. From additional tests, we verified that if we use only the training data split there is no evidence of either bias or variance [<xref ref-type="bibr" rid="ref4">4</xref>] in the built classifier; the accuracy discrepancy becomes evident only when the classifier learned on the training data is applied to the test data of the validation split. If we reasonably assume that the provided validation split matches the data distribution of the unknown test dataset used to evaluate the system for the challenge, we can infer that there is a "covariate shift" problem [<xref ref-type="bibr" rid="ref24">24</xref>] between the two distributions that needs to be treated appropriately.
      </p>
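      <p>A sketch of the two evaluations that exposed the mismatch, using scikit-learn; the classifier and variable names are illustrative:</p>
      <preformat>
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# X_train/y_train and X_val/y_val hold the encoded users of the official splits.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)

# a) k-fold on the merged data: optimistic (about 0.97 in our tests)
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
print(cross_val_score(clf, X_all, y_all, cv=5).mean())

# b) train on the training split, test on the validation split:
# a much lower accuracy reveals the distribution mismatch
print(clf.fit(X_train, y_train).score(X_val, y_val))
      </preformat>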
    </sec>
    <sec id="sec-5">
      <title>The Proposed User’s Profile Classifier</title>
      <p>
        For each language, we designed a single-label multiclass classifier [<xref ref-type="bibr" rid="ref23">23</xref>] able to assign a user to exactly one of the three defined classes: bot, male, and female. The high-level architecture of the software is shown in Figure 2. The tweets of a user timeline are encoded through a specific text encoder TE and then passed to a component called User Encoder, whose aim is to analyze the timeline and produce a user vector including useful information for the final profiling. The vector so obtained is then passed to an SVM classifier which determines the label to assign to the user.
      </p>
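      <p>In code, the flow of Figure 2 can be sketched as follows (the names are illustrative):</p>
      <preformat>
import numpy as np

def classify_user(timeline, tweet_encoder, user_encoder, svm):
    """Figure 2 flow: encode each tweet, encode the timeline, classify."""
    tweet_vecs = np.stack([tweet_encoder(t) for t in timeline])  # TE
    user_vec = user_encoder.predict(tweet_vecs[np.newaxis, :])   # User Encoder
    return svm.predict(user_vec)[0]                              # bot / male / female
      </preformat>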
      <p>We tested three different types of textual encodings:</p>
      <p>[Figure 2: Architecture of the final user's profile classifier. Tweets 1...100 of the timeline are each encoded by TE into tweet encodings, aggregated by the User Encoder, and classified by the SVM classifier into bot, male, or female.]</p>
      <sec id="sec-5-1">
        <title>Different types of User Encoder</title>
        <p>[Figure: the three tested user encoder configurations: CBUE (BUE encoder alone), CDUE (BUE encoder followed by the DUE encoder), and CmaskDUE (BUE encoder followed by the DUE encoder and the masked DUE vector).]</p>
      </sec>
      <sec id="sec-5-2">
        <title>Different types of Text Encoder</title>
        <p>W2V</p>
        <p>W2V-SYN</p>
        <p>W2V-SYN-CHARS</p>
        <p>In the remainder of this section, we will describe the two proposed core user encoders (the BUE and DUE encoders) used in the tested configurations, the strategy used to mask the most discriminating features, and how we built the final user's profile classifier.</p>
        <p>In this section we describe a first type of user encoder (called BUE, Base User Encoder), which is able to process user timelines and provide good vector representations of the user profiles. In order to auto-learn very informative features for the final task, we designed a classification system based on a deep learning architecture (see Figure 3a) and trained it using only the training data split. Thanks to its CNN sub-network (we also tested variants based on LSTM, GRU, and their combinations with CNN, but we finally chose CNN because it returned slightly better user vector encodings, giving better effectiveness in the final classifier), the built classifier learns how to properly encode the sequence of tweets submitted by the user by identifying the space-invariant inter-tweet patterns most relevant to solving the classification problem. In particular, the trained network can be used to compute a good user vector representation starting from the user's timeline. We tuned the network to have 512 as the vector size of the user encoding layer ("User encoding layer"), while for the CNN we used 512 different filters with a window size set to 10.</p>
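        <p>A Keras sketch of the BUE network with the reported hyperparameters (512 CNN filters, window size 10, 512-dimensional user encoding layer); the remaining details are illustrative assumptions:</p>
        <preformat>
from tensorflow.keras import layers, models

def build_bue(n_tweets=100, tweet_dim=300, n_classes=3):
    inp = layers.Input(shape=(n_tweets, tweet_dim))       # sequence of tweet encodings
    cnn = layers.Conv1D(512, 10, activation="relu")(inp)  # inter-tweet patterns
    cnn = layers.GlobalMaxPooling1D()(cnn)
    user = layers.Dense(512, activation="relu", name="user_encoding_layer")(cnn)
    out = layers.Dense(n_classes, activation="softmax")(user)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
        </preformat>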
        <sec id="sec-5-2-1">
          <title>Learning a Discriminant User Encoder</title>
          <p>User encodings produced by the BUE encoder are optimal user representations for the training data split and the final classification task but, as we have seen before, they are suboptimal for encoding and classifying the test datasets due to the different characteristics of the data. The main aim of this step is thus to start from the BUE vector representations and try to find similar user representations, still suitable for the classification task, but where the features highly discriminative between the training and test distributions are highlighted and easily recognizable. In order to tackle this issue, we propose a discriminative user encoder (called DUE, Discriminative User Encoder) based on the neural network architecture shown in Figure 3b. This encoder is trained on a new dataset composed of both train and validation data, but where each user has a different target label with respect to the original task: "True" if the user was originally included in the validation split, "False" otherwise. The neural network structure is very simple. It takes as input a user vector representation computed by the BUE encoder and then, in the first hidden dense layer (Discriminant user encoding), tries to learn a user representation optimal for this new binary classification task. To ensure that the net learns a user representation suitable also for the original "Bots and Gender Profiling" task, we introduce a constraint on what the net can learn at this level. In particular, we force the net to minimize two different losses in the gradient descent optimization: the loss on the labels related to the data distribution (L_distribution), and the loss due to the difference between the learned encoding and the encoding produced by the BUE encoder (L_user). With the imposed constraints, we aim to obtain a trained net which produces user vectors of dimension 512 very similar to those produced by the BUE encoder, but containing some specific features that are very discriminative in recognizing the original data distribution of a user.</p>
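          <p>A sketch of the DUE training with the two losses; the concrete loss forms (binary cross-entropy for L_distribution, mean squared error for L_user) are our illustrative assumptions:</p>
          <preformat>
from tensorflow.keras import layers, models

def build_due(dim=512):
    bue_vec = layers.Input(shape=(dim,))                  # BUE user vector
    due = layers.Dense(dim, activation="relu", name="due_encoding")(bue_vec)
    dist = layers.Dense(1, activation="sigmoid")(due)     # validation-split label
    model = models.Model(bue_vec, [dist, due])
    # L_distribution on the origin label + L_user pulling the DUE encoding
    # toward the input BUE vector itself.
    model.compile(optimizer="adam", loss=["binary_crossentropy", "mse"])
    return model

# model.fit(X_bue, [is_from_validation, X_bue], ...)
          </preformat>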
        </sec>
        <sec id="sec-5-2-2">
          <title>Masking most Discriminating Features</title>
          <p>
            The DUE encoder provides user vector representations containing some highly informative features for identifying the origin of the data distribution of a user. A feature is highly discriminative for the data distribution if, after having built a binary classifier using only that feature, we are able to distinguish quite well the distribution origin of a user. To weigh the discriminative power of a feature we can use ROC-AUC [<xref ref-type="bibr" rid="ref23">23</xref>], a measure telling how well a model is able to distinguish between classes. To measure the ROC values of all features, we built a new balanced dataset (Dmask) mixing some data from the train split and some from the validation split, and we measured the classification performance of each feature using a Random Forest classifier.
          </p>
          <p>In Figure 4 we show a typical distribution of the features with respect to their ROC-AUC values, as emerged from the tests. The majority of features have a ROC value around 0.55, but a long tail exists on the right, with a few features having high discriminative power (up to more than 0.8). These are the features we want to mask in the final user encoding vector. Indeed, without these features the user vector should be less sensitive to the distribution origin and therefore, in the successive learning phase, we force the classifier to concentrate on the other, distribution-invariant features to solve the final task.</p>
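          <p>A sketch of the per-feature masking procedure on Dmask:</p>
          <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def feature_mask(X, y_origin, threshold=0.68):
    """Mask features whose single-feature ROC-AUC on the train-vs-validation
    origin label exceeds the threshold (the threshold value is illustrative)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_origin, test_size=0.3)
    keep = np.ones(X.shape[1], dtype=bool)
    for j in range(X.shape[1]):
        rf = RandomForestClassifier(n_estimators=50)
        rf.fit(X_tr[:, j:j+1], y_tr)
        auc = roc_auc_score(y_te, rf.predict_proba(X_te[:, j:j+1])[:, 1])
        if auc >= threshold:
            keep[j] = False       # highly distribution-discriminative: mask it
    return keep                   # apply as X * keep (zero out masked features)
          </preformat>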
        </sec>
        <sec id="sec-5-2-3">
          <title>Building the Final Task Classifier</title>
          <p>The final user's profile classifier has been chosen and tuned using the remaining train and validation data not included in the dataset Dmask. We tested two different classifiers (Random Forest and SVM), but we finally adopted SVM because it performed better in terms of accuracy in all tested configurations. The classifier has been optimized by fine-tuning the model parameters through k-fold validation. For both languages, and independently of the text encoding used, the best configuration found uses an RBF kernel with the standard parameters "gamma" and "C" set to 0.01 and 0.1 respectively. The best threshold determining the minimal ROC-AUC value over which we apply feature masking depends on the adopted text encoding, and it will be reported in the experimental section.</p>
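          <p>A sketch of this tuning step with scikit-learn, centered on the best configuration found (RBF kernel, gamma = 0.01, C = 0.1); the grid itself is illustrative:</p>
          <preformat>
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.001, 0.01, 0.1], "C": [0.1, 1.0, 10.0]},
    cv=5, scoring="accuracy",
)
grid.fit(X_users_masked, y_profiles)  # masked user vectors, profile labels
svm = grid.best_estimator_
          </preformat>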
          <p>After finding the best model parameters and choosing a threshold for the ROC value, we built the final optimized classifier using the train split as training data and the validation split as test data to measure the accuracy of a specific solution. The final models submitted on the early test dataset for evaluation have instead been built using all the available data (the union of the data in the training and validation splits). Detailed results of the tested configurations will be presented in the experimental section.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>
        In this section we report all the results obtained in our experimentation. A detailed description of the dataset used in the challenge is available in [<xref ref-type="bibr" rid="ref18">18</xref>], while the final evaluation on the test sets has been obtained by using the TIRA platform [<xref ref-type="bibr" rid="ref16">16</xref>]. The base measure used to evaluate the goodness of the proposed software solutions is the accuracy [<xref ref-type="bibr" rid="ref23">23</xref>]. The software has been implemented entirely in Python using various external libraries for specific functions. In particular, the deep learning part has been developed using Keras (https://keras.io/) with a TensorFlow (https://www.tensorflow.org/) backend, while we used scikit-learn (https://scikit-learn.org/stable/) for traditional machine learning methods.
      </p>
      <p>For each language, in Table 1 we report the system's global accuracy on the validation dataset obtained a) using the three different strategies seen above for encoding tweets, and b) using three ways to encode the user vector before building the final classifier. In particular, we report the three types of user encoder configurations introduced in Section 5, together with the cut threshold for masking features (the "Masking threshold" column). The main aim of this first type of experiment is to identify, for each language and each tweet encoding, the best user encoding, which will next be used to test the system performance on the early test dataset. For this reason, for each language and each tweet encoding tested, we also report in bold the best user encoding found.</p>
      <p>For both languages, the results for the W2V-SYN and W2V-SYN-CHARS configurations show that the best results are obtained by using the CmaskDUE configuration, while for W2V the best ones make use of CDUE without feature masking. The optimal threshold identified for the various configurations is very similar and always lies between 0.65 and 0.70. It is also interesting to note that the accuracy for W2V-SYN always increases when passing from the CBUE to the CmaskDUE configuration. W2V-SYN-CHARS, probably thanks to a richer representation of the input data, instead seems to be a strong competitor even when using the base CBUE user encoding.</p>
      <p>In Table 2, we report a comparison between the accuracy obtained for both languages on the validation data and on the early test dataset provided by the challenge organizers. The reported results refer to the best configurations identified in Table 1 and show the accuracy obtained on the two sub-tasks defined in the competition (the "Type task" and "Gender task" columns) together with the averaged global accuracy of the system (the "Global" column). For each type of dataset and each language, we have also reported in bold the best results.</p>
      <p>Globally, the results clearly show that W2V-SYN with CmaskDUE is the best performer: on the early test dataset and for both languages, it always provided the best accuracy. W2V-SYN-CHARS seems to be competitive only on the validation dataset, while on the early test dataset its accuracy is not very good, especially on the "Gender task". A possible explanation of this worse performance with respect to W2V-SYN is that the addition of the features generated by the CHARS encoder confuses the classifier while it learns the best parameters to build a classification model invariant to the data distribution. W2V is instead the worst performer on both datasets, probably a sign that this type of text encoding has limited expressivity in representing the data of the classification task.</p>
      <p>If we observe the results from the point of view of languages and classification sub-tasks, we can note that the proposed configurations always perform better on English, with a remarkable difference in terms of accuracy between the two classification sub-tasks. In particular, solving the "Type" task seems much easier than solving the "Gender" task. This difference is probably due, on one side, to the greater number of training examples available for the "Type" task and, on the other side, to the greater difficulty of identifying differences in writing style between males and females: if we manually inspect the timeline of a human user of the dataset, it is not always easy to identify the user's gender, while a bot user is generally easy to recognize because certain types of writing patterns occur frequently in the timeline.</p>
      <p>Finally, in Table 3 we show the accuracy results obtained by our best configuration (W2V-SYN with the CmaskDUE encoder) on the test dataset used for the final evaluation of system performance in the challenge. In addition, in the table we also report the accuracy obtained by the various baseline methods proposed by the organizers. Our method outperforms all the baselines, and it seems especially effective on the Spanish language. For both languages, the global accuracy results obtained by our algorithm on the final test dataset are quite similar to those obtained on the early test dataset (0.8409 and 0.8366 vs. 0.8523 and 0.8250). An interesting observation is that on the final test dataset the difference in terms of accuracy between the two sub-tasks "Type" and "Gender" is very similar for both languages, suggesting that the proposed solution is a well-balanced system that is not sensitive to the target language.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>In this work we tackled the "Bots and Gender Profiling" task of the PAN 2019 challenge by proposing a software solution making extensive use of state-of-the-art language modeling and deep learning techniques. In particular, these types of algorithms have been used to automatically perform the feature encoding of textual contents, almost completely avoiding the usual costly and manual "feature engineering" phase. In addition, deep learning has also been exploited to find textual features invariant across the train and test datasets, helping to fight the data covariate shift which usually has a remarkable negative effect on the final classifier accuracy. The experimental results show that, among all tested configurations, the solution using the W2V-SYN textual encoding combined with the CmaskDUE user encoder is globally the best performer on both the validation and early test datasets. In particular, on the final test set, this configuration obtains a balanced multilingual classifier able to achieve very interesting accuracy results on both languages, with a global combined accuracy of 0.8387.</p>
      <p>Possible future works could try to use more advanced language modeling techniques like BERT and ELMo for textual encoding. These methods provide contextual embeddings for text representation, automatically disambiguating both the terms and the syntax used in the analyzed contents. A comparison with the textual encoding methods proposed here could thus be very helpful in order to address future research directions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akhtyamova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardiff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Twitter author profiling using word embeddings and logistic regression</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alrifai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebdawi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghneim</surname>
          </string-name>
          , N.:
          <article-title>Arabic tweeps gender and dialect prediction</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. von Daniken,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Grubenmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Cieliebak</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Word unigram weighing for author profiling at pan 2018</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          )
          <article-title>(</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Geman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bienenstock</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doursat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Neural networks and the bias/variance dilemma</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>58</lpage>
          (
          <year>Jan 1992</year>
          ). https://doi.org/10.1162/neco.
          <year>1992</year>
          .
          <volume>4</volume>
          .
          <issue>1</issue>
          .1, http://dx.doi.org/10.1162/neco.
          <year>1992</year>
          .
          <volume>4</volume>
          .
          <issue>1</issue>
          .
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jozefowicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.:</given-names>
          </string-name>
          <article-title>An empirical exploration of recurrent network architectures</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . pp.
          <fpage>2342</fpage>
          -
          <lpage>2350</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Karlgren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esposito</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gratton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship profiling without using topical information: Notebook for pan at clef 2018</article-title>
          .
          <source>In: 19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF</source>
          <year>2018</year>
          , Avignon, France, 10
          <year>September 2018</year>
          through
          <issue>14</issue>
          <year>September 2018</year>
          . vol.
          <volume>2125</volume>
          .
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Language-and subtask-dependent feature selection and classifier parameter tuning for author profiling</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Martinc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skrjanec</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Pan 2017:
          <article-title>Author profiling-gender and language variety prediction</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Martinc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skrlj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multilingual gender classification with multi-view deep learning</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          )
          <article-title>(</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          . NIPS'
          <volume>13</volume>
          , Curran Associates Inc.,
          <source>USA</source>
          (
          <year>2013</year>
          ), http://dl.acm.org/citation.cfm?id=
          <volume>2999792</volume>
          .
          <fpage>2999959</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Miura</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohkuma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Author profiling with word+ character neural attention network</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mosteller</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Inference and disputed authorship: The federalist (</article-title>
          <year>1964</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>R.F.O.</given-names>
          </string-name>
          :
          <article-title>Using character n-grams and style features for gender and language variety classification</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Patra</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>K.G.</given-names>
          </string-name>
          :
          <article-title>Dd. multimodal author profiling for arabic, english, and spanish</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          )
          <article-title>(</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In: Ferro,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (eds.)
          <article-title>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of</article-title>
          CLEF. Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: International Conference on Intelligent Text Processing and Computational Linguistics</source>
          . pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          , H. (eds.)
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR-WS.org (Sep</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th author profiling task at pan 2018: multimodal gender identification in twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th author profiling task at pan 2016: cross-genre evaluations</article-title>
          .
          <source>In: Working Notes Papers of the CLEF</source>
          <year>2016</year>
          <article-title>Evaluation Labs</article-title>
          . CEUR Workshop Proceedings/Balog, Krisztian [edit.]; et al. pp.
          <fpage>750</fpage>
          -
          <lpage>784</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Rangel</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.M.</given-names>
            ,
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at pan 2015</article-title>
          . In:
          <article-title>CLEF 2015 Evaluation Labs</article-title>
          and Workshop Working Notes Papers. pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>: Machine learning in automated text categorization</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>Mar 2002</year>
          ). https://doi.org/10.1145/505282.505283, http://doi.acm.
          <source>org/10</source>
          .1145/505282.505283
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakajima</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kashima</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buenau</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawanabe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Direct importance estimation with model selection and its application to covariate shift adaptation</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1433</fpage>
          -
          <lpage>1440</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tahara</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagatani</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miura</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohkuma</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text and image synergy with feature cross technique for gender identification</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tellez</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda-Jiménez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moctezuma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graff</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz-Bejar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gender identification through multi-modal tweet analysis using microtc and bag of visual words</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          )
          <article-title>(</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>27. Veenhoven, R., Snijders, S., van der Hall, D., van Noord, R.: Using translated data to improve deep learning author profiling models. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>28. Yule, G.U.: On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. Biometrika 30, 363-390 (1939)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>