Profiling Twitter Users Using Autogenerated Features Invariant to Data Distribution

Notebook for PAN at CLEF 2019

Tiziano Fagni and Maurizio Tesconi
National Research Council (CNR), Institute of Informatics and Telematics (IIT)
via G. Moruzzi 1, 56124, Pisa, Italy
{tiziano.fagni, maurizio.tesconi}@iit.cnr.it

Abstract. With the diffusion of the Web and Social Media, automatic user profiling classifiers applied to digital contents have become extremely important in application contexts related to social and forensic studies. In many research papers on this topic, an important part of the work is devoted to a costly manual "feature engineering" phase, where semantic, syntactic, and often language-dependent features must be accurately chosen to be relevant for the profiling task. Differently from this approach, in this work we propose a Twitter user profiling classifier which exploits deep learning techniques to automatically generate user features that are a) optimal for the user profiling task, and b) able to mitigate the covariate shift problem due to data distribution differences between training and test sets. In its best configuration, the system achieves very interesting accuracy results for both English and Spanish, with an average final accuracy of more than 0.83.

1 Introduction

Since the 19th century, authorship identification (AI) analysis has been used in several contexts [13,28] to draw conclusions about specific cases where the original author was uncertain. In recent years, with the advent of the Web and Social Media, the attention of researchers has moved to the analysis of digital contents (e-mail, blogs, Twitter, etc.) in contexts related to social studies and forensic analysis. Author profiling, one of the main sub-tasks of AI, is a method of analyzing a corpus of texts in order to identify the main stylistic and content-based features characterizing the user profiles (e.g., gender, age, etc.) we are interested in. Every year, the PAN workshop proposes a specific challenge on this topic, offering multilingual annotated datasets on which participants can test their own techniques. This year, the "Bots and Gender Profiling" task focuses on profiling Twitter users with respect to both the user's type (bot vs. human) and, for human users, the user's gender (male vs. female) [18]. Differently from the multimodal data of the PAN 2018 author profiling task [19], this year only textual contents are available for analysis, making the task similar to the 2017 edition [20].

Previous editions have shown a prevalence of software solutions based on traditional machine learning approaches, relying on costly manual feature engineering phases. Except for some works, the usage of language modeling and deep learning techniques for textual feature encoding has been quite limited, and the level of accuracy reachable by using only these techniques is unclear. In this work we thus propose a software solution strongly based on these recent approaches, able to a) perform textual encoding in a language-agnostic way with almost no manual feature engineering, and b) attenuate the data covariate shift [24] between training and test datasets in order to improve classifier accuracy.

The rest of this paper is organized as follows.
In Section 2 we briefly introduce related work on this topic, and in Section 3 we show how we performed text encoding. After introducing the covariate shift problem of the challenge datasets in Section 4, we describe in detail the design of the proposed solution in Section 5, the experimental results in Section 6, and the conclusions in Section 7.

2 Related Work

Until 2017, the "Author profiling" task in PAN challenges focused on age and gender identification [22,21], while in 2017 the age part was replaced by a sub-task focused on language variety identification [20]. In 2018, the main focus of the challenge was instead gender prediction, but using different types of data sources: text and images. In the following, we describe the types of software solutions proposed for the author profiling task in the last two editions of the challenge.

Two main types of feature representation have been used in past works. The first is a traditional approach with features (content-based or style-based) selected through a manual "feature engineering" phase. Many works [9,2,8,3,26] use character and word n-grams (or a combination of them) as basic feature representations, while others [15,6,2,14] also used stylistic features obtained by counting or computing ratios for stopwords, hashtags, user mentions, character flooding, emoticons, and laughter expressions. Another type of feature representation, similar in spirit to what we propose in this work, is the usage of features obtained through neural networks. Some works [1,10,27] exploited word embeddings, character embeddings, or their combination with traditional features to encode textual contents. In this work, in addition to these standard approaches, we also propose a deep learning network able to learn a latent syntax encoding of a text (see Section 3.2).

The classification approaches used by many works [1,9,26,3] are based on traditional machine learning algorithms like SVM, Random Forest, logistic regression, Naive Bayes, or ensembles of these methods. An increasing number of works [12,10,25] instead use deep learning approaches, adopting mainly RNN (like LSTM, GRU) and CNN nets. Often these networks are combined together, or mixed with traditional machine learning methods, in order to get the best of both worlds or to differentiate the strategy used to analyze a specific type of content (e.g., texts and images). In this work we heavily exploit deep learning algorithms to automatically generate a set of optimal and distribution-invariant user features for the problem representation, while we use an SVM over these computed features to build the final classifier.

3 Textual Encoding

A common approach to encode textual contents is to use an NLP (Natural Language Processing) pipeline [23]. This type of solution suffers from a costly "feature engineering" phase, necessary to try and test several types of features in order to obtain sufficiently informative data representations for the problem at hand. Furthermore, domain knowledge and the availability of high quality analysis tools for the target language are critical for successfully performing this step: unfortunately, these conditions cannot always be met. In order to overcome these limitations, in the following we propose three different approaches based on language modeling techniques and deep learning algorithms.
3.1 W2V: a Semantic Encoding Based on Word2Vec

The first type of text encoding used (called W2V) is based on a language modeling algorithm very popular in recent years: word2vec [11]. This technique allows computing word vectors (called embeddings) by analyzing a set of unannotated texts according to an objective function based on the distributional hypothesis¹. The resulting vectors have very interesting properties capturing the relationships existing between words (e.g., W("queen") ≅ W("king") − W("man") + W("woman")).

Given a raw tweet, in order to obtain a vector representation to be used as text encoding with machine learning methods, we proceed as follows. We first convert the text to lowercase, then we replace each distinct URL with the word "url" and we remove all punctuation characters. If present, we also remove all the emoticons occurring in the text. The remaining textual content is then tokenized into distinct tokens (possibly appearing more than once) and the final encoding vector is computed as

\[
V_{\mathrm{tweet}} = \frac{1}{N} \sum_{i=1}^{N} we(\mathrm{token}(i)) \tag{1}
\]

where V_tweet is the final tweet vector, N is the total number of tokens in the tokenized tweet, token(i) is the function returning the token at position i in the tokenized sequence, and we(x) is the function returning the word vector associated to token x. For both target languages, we used precomputed 300-dimensional embedding models built on big data collections: GoogleNews for English² and SBWCE for Spanish³.

¹ The meaning of this hypothesis is that words appearing in similar contexts often have a similar meaning.
² Available at http://tiny.cc/0vhz6y
³ Available at https://bit.ly/2Q9DBzU

3.2 SYN: a Syntax-Modeling Encoding Based on Deep Learning

The W2V encoding described previously has the main drawback of capturing only semantic information from Twitter data, skipping all the remaining knowledge related to the writing style and syntax used by the different types of users when writing tweets. In order to extract syntax-related information from textual data, we propose an encoding method based on deep learning technology, as shown in Figure 1a.

Figure 1: Deep learning architectures for the SYN (a) and CHARS (b) tweet encoders. Both networks start from an embeddings layer over the tokenized tweet and are trained on the bot/male/female prediction task; the SYN network combines a GRU layer and CNN layers (300 units each) whose outputs are concatenated into a 300-dimensional encoder layer used for the tweet encoding, while the CHARS network uses only CNN layers (300 units, encoding vector size 300).

The original tweet text is tokenized before being fed to the encoding neural network. The tokenization process is performed in the following way. We replace all distinct URLs with the word "url", but we remove neither the punctuation nor the emoticons found in the text. Each distinct user mention (e.g., @realdonaldtrump) and hashtag (e.g., #politic) is replaced by the words mention and hashtag respectively. Each remaining word is transformed into one of these 3 tokens: uppercase_w for a completely uppercase word, lowercase_w for a completely lowercase word, and mixed_w for a word including both uppercase and lowercase characters. A minimal sketch of these two preprocessing schemes is given below.
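To make the two preprocessing schemes concrete, the following is a sketch of the W2V averaging of Equation (1) and of the SYN tokenization. The regular expressions, the tokenizer, and the gensim-based embedding lookup are our own assumptions, not the authors' exact implementation.

```python
import re
import numpy as np
from gensim.models import KeyedVectors  # used in the commented usage example

# Assumption: simple stand-in patterns; the paper does not give the exact
# regexes used for URLs, emoticons, and punctuation.
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[\)\]\(\[dDpP/\\|}{@]")
PUNCT_RE = re.compile(r"[^\w\s]")

def w2v_tokenize(text):
    """W2V preprocessing: lowercase, URLs -> 'url', strip punctuation/emoticons."""
    text = URL_RE.sub("url", text.lower())
    text = EMOTICON_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return text.split()

def w2v_encode(text, embeddings, dim=300):
    """Equation (1): average of the word vectors of the tweet tokens."""
    vecs = [embeddings[t] for t in w2v_tokenize(text) if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def syn_tokenize(text):
    """SYN tokenization: keep punctuation, abstract away the word content."""
    text = URL_RE.sub(" URLTOKEN ", text)
    tokens = []
    # Split into mentions, hashtags, words, and single punctuation marks
    # (a simplification: multi-character emoticons come out char-by-char).
    for tok in re.findall(r"@\w+|#\w+|\w+|[^\w\s]", text):
        if tok == "URLTOKEN":
            tokens.append("url")
        elif tok.startswith("@"):
            tokens.append("mention")
        elif tok.startswith("#"):
            tokens.append("hashtag")
        elif tok.isalpha():
            tokens.append("uppercase_w" if tok.isupper()
                          else "lowercase_w" if tok.islower() else "mixed_w")
        else:
            tokens.append(tok)  # punctuation (and digits) kept as-is
    return tokens

# Usage (hypothetical path): load pretrained GoogleNews vectors, encode a tweet.
# w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors.bin", binary=True)
# vec = w2v_encode("Check this out! https://t.co/xyz", w2v)
```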
In order to learn a tweet encoder using the architecture proposed in Figure 1a, we used only the training data split and transformed the dataset of our original problem (designed to classify users into profiles) into a new dataset suitable for classifying tweets into user profiles. For each user profile in the training data, we take its tweets and its originally assigned label (bot, female, or male) and we extract 100 different tweets, each one labeled with the user's label. At the end of this process we have produced two new training datasets, one for English and one for Spanish, consisting of 288,000 and 208,000 training documents respectively, which can then be used to learn the two profile classifiers.

The proposed neural network uses two sub-networks to learn different feature types. A first part exploits a GRU net [5] to find interesting temporal patterns in the sequence of input tokens. A second sub-network based on a CNN [7] instead tries to identify spatial features that are invariant with respect to their original position in the text. These two types of features are then concatenated together and used to find an optimal vector encoding for the tweet, before performing the final classification. The syntax classifiers so obtained are not interesting in themselves; what matters for us is the possibility to take the trained nets and extract, as tweet encoding vector, the output produced by the "Encoder layer". The resulting vector should in fact be an effective representation of the syntax information conveyed by the original tweet.

3.3 CHARS: a Characters Encoding Based on Deep Learning

This encoding, differently from the W2V method, aims to exploit data at the sub-word level in order to find interesting recurrent patterns and potentially useful information for user profile classification. The tokenization process of the original tweets is simpler than in the other two encoders. In this case, we have defined the set of valid symbols in the characters dictionary as [A-Za-z0-9], the punctuation symbols, and all the distinct emoticons found in the training data split. The tokenization process thus simply consists in removing all invalid characters from the input raw text. In the same manner as described for the SYN encoder, starting from the training data split, we create two new training datasets to solve a tweet ⇒ user_profile classification task. As shown in Figure 1b, the neural network architecture used to learn the encoder is slightly different from that used in the SYN encoder: the network only uses a CNN sub-network for encoding the tweet's data, and the layer taken as final tweet encoder is the CNN layer.

4 Data Mismatch in the Contest Dataset

Our first attempt to build a user profile classifier for the "Bots and Gender Profiling" task was to take all the data included in the train and validation splits, merge them together, and then perform a k-fold evaluation [23] on the resulting dataset using one of the textual encodings defined in the previous section and a simple feed-forward neural network classifier. With this setup, we obtained a final accuracy on the profiling task of 0.973 for both the English and Spanish classifiers. Unfortunately, these results were much worse when we used the splits suggested by the organizers, with a final classifier accuracy of 0.785 for English and just 0.721 for Spanish.
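The two evaluation setups can be contrasted in a short sketch (ours, not the paper's code; the feature matrices, labels, and classifier choice are hypothetical):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def mismatch_check(X_train, y_train, X_val, y_val):
    """Compare k-fold accuracy on merged data vs. accuracy across splits."""
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)

    # Setup 1: merge the splits and run a k-fold evaluation. Both folds
    # mix the two distributions, so accuracy looks very high (~0.97).
    X = np.vstack([X_train, X_val])
    y = np.concatenate([y_train, y_val])
    kfold_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

    # Setup 2: respect the organizer splits. Train on one distribution and
    # test on the other; a large accuracy drop signals covariate shift.
    clf.fit(X_train, y_train)
    split_acc = clf.score(X_val, y_val)
    return kfold_acc, split_acc
```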
From additional tests, we verified that if we use only the training data split, there is no evidence of either bias or variance [4] in the built classifier; the accuracy discrepancy only becomes evident when the classifier learned on the training data is applied to the test data of the validation split. If we reasonably assume that the provided validation split matches the data distribution of the unknown test dataset used to evaluate the system in the challenge, we can infer that there is a covariate shift problem [24] between the two distributions that needs to be treated appropriately.

5 The Proposed User Profile Classifier

For each language, we designed a single-label multiclass classifier [23] able to assign a user to exactly one of the three defined classes: bot, male, and female. The high-level architecture of the software is shown in Figure 2. The tweets of a user timeline are encoded through a specific text encoder (TE) and then passed to a component called User Encoder, whose aim is to analyze the timeline and produce a user vector including useful information for the final profiling. The vector so obtained is then passed to an SVM classifier which determines the label to assign to the user.

Figure 2: High-level architecture of the proposed system. The 100 tweets of a user timeline are encoded by a text encoder (W2V, W2V-SYN, or W2V-SYN-CHARS), aggregated into a user vector by a user encoder (CBUE, CDUE, or CmaskDUE, optionally with masking), and finally classified as bot, male, or female by an SVM.

We tested three different types of textual encodings:

W2V This encoding is the same encoding as defined in Section 3.1. The idea here is to exploit only semantic information from the original data.

W2V-SYN This encoding is the concatenation of the W2V encoding and the encoding generated by the SYN encoder (described in Section 3.2). The idea in this case is to use a representation conveying both syntactic and semantic information from the original data.

W2V-SYN-CHARS This encoding is the concatenation of the W2V encoding, the SYN encoding, and the encoding generated by the CHARS encoder (described in Section 3.3). The idea here is the same as for the W2V-SYN representation, but with the additional ability to learn information at the sub-string level, useful for handling particular cases such as orthographic errors.

We used three different user encoder configurations, described in the following:

CBUE A configuration producing the final user vector by only applying a BUE encoder (described in Section 5.1).

CDUE A configuration producing the final user vector by applying in sequence first a BUE encoder and then a DUE encoder (described in Section 5.2).

CmaskDUE A configuration similar to CDUE but with the addition of a final step used to mask the vector features most discriminative for the data distributions (described in Section 5.3).

In the remainder of this section, we describe the two proposed core user encoders (BUE and DUE) used in the tested configurations, the strategy used to mask the most discriminating features, and how we built the final user profile classifier. A brief sketch of how the concatenated text encodings form the input to a user encoder is given below.
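As a minimal illustration (our own sketch, not code from the paper), the W2V-SYN-CHARS text encoding can be seen as a per-tweet concatenation of the three encoders' outputs, stacked into a timeline matrix that the user encoder consumes:

```python
import numpy as np

def encode_timeline(tweets, w2v_enc, syn_enc, chars_enc):
    """Build the timeline matrix for the W2V-SYN-CHARS encoding.

    w2v_enc, syn_enc, and chars_enc are assumed callables returning the
    300-dimensional vectors of Sections 3.1-3.3 for a single tweet.
    """
    rows = [np.concatenate([w2v_enc(t), syn_enc(t), chars_enc(t)])
            for t in tweets]   # one 900-dim vector per tweet
    return np.stack(rows)      # shape: (n_tweets, 900), typically (100, 900)
```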
Figure 3: Deep learning architectures for the BUE (a) and DUE (b) user encoders. The BUE network encodes the 100 tweet vectors of a timeline through a CNN layer into a 512-dimensional user encoding trained on the bot/male/female prediction task; the DUE network starts from the BUE user encoding and is trained against the train/validation gold standard with two losses, L_user and L_distribution.

5.1 Learning a Base Twitter User Encoder

In this section we describe a first type of user encoder (called BUE, Base User Encoder) which is able to process user timelines and provide good vector representations of the user profiles. In order to auto-learn very informative features for the final task, we designed a classification system based on a deep learning architecture (see Figure 3a) and trained it using only the training data split. The final classifier, thanks to its CNN sub-network⁴, learns how to properly encode the sequence of tweets submitted by the user by identifying the position-invariant inter-tweet patterns most relevant to solve the classification problem. In particular, the trained network can be used to compute a good user vector representation starting from the user's timeline. We tuned the network to have a vector size of 512 for the user encoding layer ("User encoding layer"), while for the CNN we used 512 different filters with a window size of 10.

⁴ We tested other variants of deep learning architectures (e.g., using sub-nets based on LSTM, GRU, or a combination of them with CNN) but we finally chose CNN because it returned slightly better user vector encodings (giving better effectiveness on the final classifier).

5.2 Learning a Discriminant User Encoder

User encodings produced by the BUE encoder are optimal user representations for the training data split and the final classification task but, as we have seen before, are suboptimal for encoding and classifying test datasets due to the different characteristics of the data. The main aim of this step is thus to start from the BUE vector representations and find similar user representations, still suitable for the classification task, but where the features highly discriminative between the training and test distributions are highlighted and easily recognizable. In order to tackle this issue, we propose a discriminant user encoder (called DUE, Discriminant User Encoder) based on the neural network architecture shown in Figure 3b. This encoder is trained on a new dataset composed of both train and validation data, where each user has a different target label with respect to the original task: "True" if the user was originally included in the validation split, "False" otherwise. The neural network structure is very simple. It takes as input a user vector representation computed by the BUE encoder and, in the first hidden dense layer (Discriminant user encoding), it tries to learn a user representation optimal for this new binary classification task. To ensure that the net learns a user representation suitable also for the original "Bots and Gender Profiling" task, we introduce a constraint on what the net can learn at this level. In particular, we force the net to minimize two different losses in the gradient descent optimization: the loss on the labels related to the data distribution (L_distribution), and the loss due to the difference between the learned encoding and the encoding produced by the BUE encoder (L_user). A minimal sketch of both user encoders follows.
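The following Keras sketch illustrates both networks under our own assumptions: the exact layer stack, activations, and loss weighting are not fully specified in the paper, so this is a plausible reconstruction rather than the authors' implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_TWEETS, TWEET_DIM, ENC_DIM = 100, 900, 512  # 900 = W2V-SYN-CHARS example

# --- BUE: CNN over the timeline, trained on the bot/male/female task ---
timeline = keras.Input(shape=(N_TWEETS, TWEET_DIM))
x = layers.Conv1D(512, 10, activation="relu")(timeline)  # 512 filters, window 10
x = layers.GlobalMaxPooling1D()(x)
user_enc = layers.Dense(ENC_DIM, activation="relu", name="user_encoding")(x)
profile_out = layers.Dense(3, activation="softmax")(user_enc)
bue = keras.Model(timeline, profile_out)
bue.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# After training, the net up to "user_encoding" serves as the BUE encoder:
bue_encoder = keras.Model(timeline, user_enc)

# --- DUE: learn a similar encoding that exposes distribution features ---
bue_vec = keras.Input(shape=(ENC_DIM,))
due_enc = layers.Dense(ENC_DIM, activation="relu", name="due_encoding")(bue_vec)
dist_out = layers.Dense(1, activation="sigmoid", name="distribution")(due_enc)
due = keras.Model(bue_vec, [dist_out, due_enc])
due.compile(
    optimizer="adam",
    loss={"distribution": "binary_crossentropy",  # L_distribution
          "due_encoding": "mse"},                 # L_user: stay close to BUE
    loss_weights={"distribution": 1.0, "due_encoding": 1.0},  # assumed weights
)
# Targets: split labels (1 = validation, 0 = train) and the BUE vectors
# themselves, so the MSE term penalizes drift from the original encoding.
# due.fit(X_bue, {"distribution": y_split, "due_encoding": X_bue}, ...)
```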
With the imposed constraints we aim to obtain a trained net producing user vectors of dimension 512 very similar to those produced by the BUE encoder, but containing some specific features that are very discriminative in recognizing the original data distribution of a user.

5.3 Masking the Most Discriminating Features

The DUE encoder provides user vector representations containing some highly informative features for identifying the origin of the data distribution of a user. A feature is highly discriminative for the data distribution if, after building a binary classifier using only that feature, we are able to distinguish quite well the distribution origin of a user. To weight the discriminative power of a feature we can use ROC-AUC [23], a measure of how well a model is able to distinguish between classes. To measure the ROC-AUC values of all features, we built a new balanced dataset (Dmask) mixing some data from the train split and some from the validation split, and measured the classification performance of each feature using a Random Forest classifier.

Figure 4: ROC-AUC values computed for all user features produced by the DUE encoder. Features with a ROC-AUC value at or above the chosen threshold are masked.

Figure 4 shows a typical distribution of the features with respect to their ROC-AUC values, as emerged from the tests. The majority of features have a ROC-AUC value around 0.55, but a long tail exists on the right with a few features having high discriminative power (up to more than 0.8). These are the features we want to mask in the final user encoding vector. Indeed, without these features the user vector should be less sensitive to the distribution origin and therefore, in the subsequent learning phase, we force the classifier to concentrate on the other, distribution-invariant features to solve the final task.

5.4 Building the Final Task Classifier

The final user profile classifier has been chosen and tuned using the remaining train and validation data not included in the dataset Dmask. We tested two different classifiers (Random Forest and SVM) but we finally adopted SVM because it performed better in terms of accuracy in all tested configurations. The classifier has been optimized by fine-tuning the model parameters through k-fold validation. For both languages, and independently of the used text encoding, the best configuration found uses an RBF kernel with the parameters "gamma" and "C" set to 0.01 and 0.1 respectively. The best threshold determining the minimal ROC-AUC value over which we apply feature masking depends on the adopted text encoding and will be reported in the experimental section.

After finding the best model parameters and choosing a threshold for the ROC-AUC value, we built the final optimized classifier using the train split as training data and the validation split as test data to measure the accuracy of a specific solution. The final models submitted on the early test dataset for evaluation have instead been built using all the available data (the union of the data in the training and validation splits). Detailed results of the tested configurations will be presented in the experimental section. A minimal sketch of the masking and final training steps is given below.
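As an illustration of the masking and final training steps (a sketch under our own assumptions; the per-feature AUC estimation and the choice of zeroing-out as the masking operation are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def feature_roc_auc(X_mask, y_split):
    """Per-feature ROC-AUC for distinguishing train vs. validation users.

    X_mask: DUE user vectors from the balanced D_mask dataset;
    y_split: 1 if the user comes from the validation split, else 0.
    """
    aucs = []
    for j in range(X_mask.shape[1]):
        rf = RandomForestClassifier(n_estimators=50)
        # Cross-validated probabilities avoid scoring on the training fold.
        scores = cross_val_predict(rf, X_mask[:, [j]], y_split,
                                   cv=3, method="predict_proba")[:, 1]
        aucs.append(roc_auc_score(y_split, scores))
    return np.array(aucs)

def train_masked_svm(X_train, y_train, aucs, threshold=0.65):
    """Zero out distribution-discriminative features, then fit the SVM."""
    keep = aucs < threshold                     # mask features with AUC >= threshold
    svm = SVC(kernel="rbf", gamma=0.01, C=0.1)  # best parameters from Section 5.4
    svm.fit(X_train * keep, y_train)            # masked features are set to 0
    return svm, keep
```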
6 Experimental Results

In this section we report all the results obtained in our experimentation. A detailed description of the dataset used in the challenge is available in [18], while the final evaluation on the test sets has been obtained by using the TIRA platform [16]. The base measure used to evaluate the goodness of the proposed software solutions is accuracy [23]. The software has been implemented entirely in Python using various external libraries for specific functions. In particular, the deep learning part has been developed using Keras⁵ with a Tensorflow⁶ backend, while we used scikit-learn⁷ for traditional machine learning methods.

⁵ Available at https://keras.io/
⁶ Available at https://www.tensorflow.org/
⁷ Available at https://scikit-learn.org/stable/

For each language, in Table 1 we report the global accuracy of the system on the validation dataset obtained a) using the three different strategies seen above for encoding tweets, and b) using the three ways to encode the user vector before building the final classifier. In particular, we report the 3 types of user encoder configurations introduced in Section 5, together with the cut threshold used for masking features ("Masking threshold" column). The main aim of this first experiment is to identify, for each language and each tweet encoding, the best user encoding, which is then used to test the system performance on the early test dataset. For this reason, for each language and each tweet encoding tested, we report in bold the best user encoding found.

Table 1: Classification global accuracy of the software on the validation data split.

    ENGLISH
    Tweet encoding    CBUE    CDUE    Masking threshold  CmaskDUE
    W2V               0.7960  0.8032  0.65               0.7998
    W2V-SYN           0.8065  0.8225  0.65               0.8282
    W2V-SYN-CHARS     0.8203  0.8169  0.70               0.8266

    SPANISH
    Tweet encoding    CBUE    CDUE    Masking threshold  CmaskDUE
    W2V               0.6920  0.6967  0.65               0.6945
    W2V-SYN           0.7670  0.7695  0.65               0.7741
    W2V-SYN-CHARS     0.7789  0.7771  0.70               0.7793

For both languages, the results for the W2V-SYN and W2V-SYN-CHARS configurations show that the best results are obtained by using the CmaskDUE configuration, while for W2V the best ones make use of CDUE without feature masking. The optimal threshold identified for the various configurations is very similar, always between 0.65 and 0.70. It is also interesting to note that the accuracy for W2V-SYN always increases when passing from the CBUE to the CmaskDUE configuration. W2V-SYN-CHARS, probably thanks to a richer representation of the input data, seems instead to be a strong competitor even when using the base CBUE user encoding.

In Table 2, we report a comparison between the accuracy obtained for both languages on the validation data and on the early test dataset provided by the challenge organizers. The reported results refer to the best configurations identified in Table 1 and show the accuracy obtained on the two sub-tasks defined in the competition ("Type task" and "Gender task" columns), together with the averaged global accuracy of the system ("Global" column). For each type of dataset and each language, we also report in bold the best results.

Table 2: Comparison of software accuracy between validation and early test datasets.

    ENGLISH
                                     VALIDATION SPLIT        EARLY TEST DATASET
    Tweet encoding  User encoding    Type    Gender  Global  Type    Gender  Global
    W2V             CBUE             0.8693  0.7370  0.8032  0.8902  0.7576  0.8239
    W2V-SYN         CmaskDUE         0.9080  0.7483  0.8282  0.9091  0.7955  0.8523
    W2V-SYN-CHARS   CmaskDUE         0.8967  0.7564  0.8266  0.8939  0.7424  0.8181

    SPANISH
    W2V             CBUE             0.8630  0.5304  0.6967  0.8778  0.6889  0.7833
    W2V-SYN         CmaskDUE         0.8863  0.6619  0.7741  0.8944  0.7556  0.8250
    W2V-SYN-CHARS   CmaskDUE         0.8826  0.6760  0.7793  0.8778  0.6722  0.7750
Globally, the results clearly show that W2V-SYN is the best performer; in particular, on the early test dataset and for both languages, it always provided the best accuracy. W2V-SYN-CHARS seems to be competitive only on the validation dataset, while on the early test dataset its accuracy is not very good, especially on the "Gender" task. A possible explanation of this worse performance with respect to W2V-SYN is that the addition of the features generated by the CHARS encoder confuses the classifier while learning the best parameters to build a classification model invariant to the data distribution. W2V is instead the worst performer on both datasets, probably a sign that this type of text encoding has limited expressivity in representing the data of the classification task.

If we observe the results from the point of view of languages and classification sub-tasks, we can note that the proposed configurations always perform better on English, with a remarkable difference in accuracy between the two classification sub-tasks. In particular, solving the "Type" task seems a lot easier than solving the "Gender" task. This difference is probably due, on the one hand, to the greater number of training examples available for the "Type" task and, on the other, to the greater difficulty⁸ of identifying the differences in writing style between males and females.

⁸ If we manually inspect the timeline of a human user of the dataset, it is not always easy to identify their gender. Generally, a bot user is instead easily recognizable because certain types of writing patterns occur frequently in the timeline.

Finally, in Table 3 we show the accuracy results obtained by our best configuration (W2V-SYN with the CmaskDUE encoder) on the test dataset used for the final evaluation of system performance in the challenge. In the table we also report the accuracy obtained by the various baseline methods proposed by the organizers.

Table 3: Final software accuracy on the challenge test set using the W2V-SYN and CmaskDUE configurations, together with the accuracy of the baselines proposed by the challenge organizers.

                               "TYPE" TASK         "GENDER" TASK
    Method                     English   Spanish   English   Spanish   Global
    MAJORITY [18]              0.5000    0.5000    0.5000    0.5000    0.5000
    RANDOM [18]                0.4905    0.4861    0.3716    0.3700    0.4295
    CHAR N-GRAMS [18]          0.9360    0.8972    0.7920    0.7289    0.8385
    WORD N-GRAMS [18]          0.9356    0.8833    0.7989    0.7244    0.8355
    WORD EMBEDDINGS [18]       0.9030    0.8444    0.7879    0.7156    0.8127
    LDSE [17]                  0.9054    0.8372    0.7800    0.6900    0.8031
    W2V-SYN with CmaskDUE      0.9148    0.9144    0.7670    0.7589    0.8387

Our method outperforms all the baselines and seems particularly effective on the Spanish language. For both languages, the global accuracy results obtained by our algorithm on the final test dataset are quite similar to those obtained on the early test dataset (0.8409 and 0.8366 vs. 0.8523 and 0.8250). An interesting observation is that on the final test dataset the difference in accuracy between the two sub-tasks "Type" and "Gender" is very similar for both languages, suggesting that the proposed solution is a well-balanced system, not sensitive to the target language.

7 Conclusions

In this work we tackled the "Bots and Gender Profiling" task of the PAN 2019 challenge by proposing a software solution making extensive use of state-of-the-art language modeling and deep learning techniques. In particular, these types of algorithms have been used to automatically perform the feature encoding of textual contents, avoiding almost completely the usual costly and manual "feature engineering" phase.
In addition, deep learning has also been exploited to find textual features invariant across the train and test datasets, helping to fight the data covariate shift which usually has a remarkable negative effect on the final classifier accuracy. The experimental results show that, among all tested configurations, the solution using the W2V-SYN textual encoding combined with the CmaskDUE user encoder is globally the best performer on both the validation and early test datasets. In particular, on the final test set, this configuration yields a balanced multilingual classifier achieving very interesting accuracy results on both languages, with a global combined accuracy of 0.8387.

Possible future works could try to use more advanced language modeling techniques like BERT and ELMo for textual encoding. These methods provide contextual embeddings for text representation by automatically disambiguating both the terms and the syntax used in the analyzed contents. A comparison with the textual encoding methods proposed here could thus be very helpful in order to address future research directions.

References

1. Akhtyamova, L., Cardiff, J., Ignatov, A.: Twitter author profiling using word embeddings and logistic regression. In: CLEF (Working Notes) (2017)
2. Alrifai, K., Rebdawi, G., Ghneim, N.: Arabic tweeps gender and dialect prediction. In: CLEF (Working Notes) (2017)
3. von Daniken, P., Grubenmann, R., Cieliebak, M.: Word unigram weighing for author profiling at PAN 2018. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
4. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Comput. 4(1), 1–58 (Jan 1992). https://doi.org/10.1162/neco.1992.4.1.1
5. Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: International Conference on Machine Learning. pp. 2342–2350 (2015)
6. Karlgren, J., Esposito, L., Gratton, C., Kanerva, P.: Authorship profiling without using topical information: Notebook for PAN at CLEF 2018. In: 19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2018, Avignon, France, 10–14 September 2018. vol. 2125. CEUR-WS (2018)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
8. Markov, I., Gómez-Adorno, H., Sidorov, G.: Language- and subtask-dependent feature selection and classifier parameter tuning for author profiling. In: CLEF (Working Notes) (2017)
9. Martinc, M., Skrjanec, I., Zupan, K., Pollak, S.: PAN 2017: Author profiling - gender and language variety prediction. In: CLEF (Working Notes) (2017)
10. Martinc, M., Skrlj, B., Pollak, S.: Multilingual gender classification with multi-view deep learning. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. pp. 3111–3119. NIPS'13, Curran Associates Inc., USA (2013), http://dl.acm.org/citation.cfm?id=2999792.2999959
12. Miura, Y., Taniguchi, T., Taniguchi, M., Ohkuma, T.: Author profiling with word+character neural attention network. In: CLEF (Working Notes) (2017)
13. Mosteller, F., Wallace, D.: Inference and Disputed Authorship: The Federalist (1964)
14. Oliveira, R.R., Neto, R.F.O.: Using character n-grams and style features for gender and language variety classification. In: CLEF (Working Notes) (2017)
15. Patra, B.G., Das, K.G., Das, D.: Multimodal author profiling for Arabic, English, and Spanish. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
16. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
17. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: International Conference on Intelligent Text Processing and Computational Linguistics. pp. 156–169. Springer (2016)
18. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
19. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter. Working Notes Papers of the CLEF (2018)
20. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. Working Notes Papers of the CLEF (2017)
21. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. pp. 750–784 (2016)
22. Rangel Pardo, F.M., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers. pp. 1–8 (2015)
23. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (Mar 2002). https://doi.org/10.1145/505282.505283
24. Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P.V., Kawanabe, M.: Direct importance estimation with model selection and its application to covariate shift adaptation. In: Advances in Neural Information Processing Systems. pp. 1433–1440 (2008)
25. Takahashi, T., Tahara, T., Nagatani, K., Miura, Y., Taniguchi, T., Ohkuma, T.: Text and image synergy with feature cross technique for gender identification. Working Notes Papers of the CLEF (2018)
26. Tellez, E.S., Miranda-Jiménez, S., Moctezuma, D., Graff, M., Salgado, V., Ortiz-Bejar, J.: Gender identification through multi-modal tweet analysis using MicroTC and bag of visual words. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
27. Veenhoven, R., Snijders, S., van der Hall, D., van Noord, R.: Using translated data to improve deep learning author profiling models. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
28. Yule, G.U.: On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. Biometrika 30, 363–390 (1939)