UOBIT @ TAG-it: Exploring a Multi-faceted Representation for Profiling Age, Topic and Gender in Italian Texts

Roberto Labadie Tamayo, Daniel Castro Castro and Reynier Ortega Bueno
Computer Science Department, University of Oriente, Santiago de Cuba, Cuba
roberto.labadie@estudiantes.uo.edu.cu, {danielcc, reynier}@uo.edu.cu

Abstract

English. This paper describes our system for participating in the TAG-it Author Profiling task at EVALITA 2020. The task aims to predict the age and gender of blog users from their posts, as well as the topic they wrote about. Our proposal combines representations learned by RNNs at word and sentence level, Transformer neural networks, and hand-crafted stylistic features. All these representations are mixed and fed into a fully connected layer of a feed-forward neural network in order to make predictions for the addressed subtasks. Experimental results show that our model achieves encouraging performance.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The growing integration of social media into people's daily lives has made this medium a common environment for deploying technologies that retrieve information useful for business activities, social outreach processes, forensic tasks, etc. People frequently upload and share content in these media for various purposes, such as sharing points of view about some topic or promoting a personal business. The analysis of the textual information contained in such data is one of the main reasons why this line of research has become a trend in the Natural Language Processing (NLP) field.

However, this information varies greatly in format, even when it comes from the same person, and textual sequences are unstructured, which makes analyzing them automatically challenging. The Author Profiling (AP) task aims at discovering marks or patterns (linguistic or not) in texts that allow a user to be characterized in terms of age, gender, personality or any other demographic attribute.

Because of the applicability of AP, many forums organize shared tasks aimed at mining features that predict this valuable information. These tasks commonly focus on widely spoken languages such as English and Spanish. Nevertheless, other languages are explored in important forums too; that is the case of EVALITA (http://www.evalita.it/), which promotes NLP tasks for the Italian language. Among the challenges of its previous campaign, EVALITA 2018, was the gender-oriented AP task GxG (Dell'Orletta and Nissim, 2018), which explored the gender-prediction problem.

The analysis of age, gender and the topic a text is related to are well-explored tasks, and most approaches employ data representations based on stylistic features, n-grams and/or word embeddings combined with Machine Learning (ML) methods such as Support Vector Machines (SVM) and Random Forest (Pizarro, 2019). Also, some authors using Deep Learning (DL) models such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks combined with stylistic features (Aragón and López-Monroy, 2018; Bayot and Gonçalves, 2018) have achieved encouraging performance.

In this work we address the automatic detection of the gender and age of authors, as well as the identification of the prevailing topic, in textual information from blogs. We describe the model we developed for participating in the TAG-it: Topic, Age and Gender prediction for Italian task (https://sites.google.com/view/tag-it-2020) (Cimino A., 2020) at EVALITA 2020 (Basile et al., 2020).

Taking into account the proven ability of DL models to learn abstract representations that are missed by hand-crafted feature engineering, our approach is mainly based on them, particularly on Bi-LSTM and Transformer networks (Vaswani et al., 2017). We combine the feature representations learned by DL models with hand-crafted ones based on Term Frequency-Inverse Document Frequency (tf-idf) and stylistic features. We hypothesize that building an ensemble of these deep representations and fusing it with the hand-crafted ones, as shown in Figure 1, could yield encouraging results on the proposed tasks.

This paper is organized as follows: the next section gives a brief description of the TAG-it subtasks. Then we present our proposal; specifically, we describe the data preprocessing as well as the DL methods and features used to represent the data. Finally, we present the experimental setting, the experiments conducted and the results achieved.
1 TAG-it Tasks

Three subtasks have been proposed in the TAG-it task:

• subtask 1: predicting the gender, the age (as an age range, e.g. 20-29) and the topic addressed by an author, given a collection of blog texts written by him/her, all three dimensions at once.

• subtask 2a: predicting gender only.

• subtask 2b: predicting age only.

For these tasks a training corpus of texts written by blog users, with possibly multiple posts per user, was provided. The information available for each user (i.e. the posts per user) varies in length and quantity, and the data for each subtask is unbalanced, mainly for the gender and topic prediction tasks, which adds some degree of complexity to the training stage of the models for these classification tasks.

2 Our Proposal

Deep Learning methods are able to learn and project relationships between elements of textual information that are beyond human abstract comprehension; therefore, using only hand-crafted representations may miss important patterns in textual information analysis. However, stylistic and linguistic features have proved to be good markers for determining some author characteristics. Among the DL models used in the AP field are LSTMs (Labadie-Tamayo et al., 2020) and Transformer networks, which rely on two different paradigms: the first analyzes the information sequentially, token by token, whereas the second analyzes all tokens at once, relating every one to each other. The opposite behavior of these two architectures implies learning different patterns, and each has individually proved to be an accurate way to synthesize the information.

[Figure 1: Representations Ensemble]

Our system combines four representations. The first (Transformer Block) is based on the Bidirectional Encoder Representations from Transformers (BERT) architecture (Devlin et al., 2018). The second is based on LSTM (Hochreiter and Schmidhuber, 1997) networks with a self-attention mechanism (Att-LSTM) over word embeddings (Recurrent Word-Level Block). The third is a condensed representation based on the combination of stylistic features and a vector with the tf-idf values of some key tokens from the text (Stylistic Block). Finally, the Recurrent Sentence-Level Block is another Att-LSTM-based representation, but this time analyzing the sequence information at sentence level.

All these representations are concatenated and fed into a dense layer with a Leaky Rectified Linear Unit (Leaky ReLU) activation function, which synthesizes the information extracted by each block; its output vector goes to a softmax dense layer with as many neurons as there are classes in the analyzed task, in order to make the predictions. For the three classification tasks we used the same architecture, but trained it separately for each of them, with different targets according to the task.
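To make the fusion step concrete, the sketch below builds the classification head described above in Keras. It is an illustration under assumptions rather than the authors' implementation: the per-block output dimensionalities and the size of the fusion layer are not fixed by the paper and are chosen here for the example; only the concatenation, the Leaky ReLU synthesis layer and the softmax output follow the description.

```python
# Minimal sketch (not the authors' code) of the fusion head combining the four
# block encodings. Per-block dimensionalities and the fusion-layer size are
# assumptions; the paper only fixes Leaky ReLU for the synthesis layer and a
# softmax output with one neuron per class.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fusion_head(num_classes,
                      dim_transformer=1536,   # assumed: first + last BERT vectors
                      dim_word_rnn=64,        # last hidden state, word-level LSTM
                      dim_sent_rnn=64,        # last hidden state, sentence-level LSTM
                      dim_stylistic=64,       # output of the stylistic dense layer
                      fusion_units=128):      # assumed size of the fusion layer
    # One input per representation block.
    t_in = layers.Input(shape=(dim_transformer,), name="transformer_block")
    w_in = layers.Input(shape=(dim_word_rnn,), name="recurrent_word_block")
    s_in = layers.Input(shape=(dim_sent_rnn,), name="recurrent_sentence_block")
    y_in = layers.Input(shape=(dim_stylistic,), name="stylistic_block")

    # Concatenate the four encodings and synthesize them with Leaky ReLU.
    fused = layers.Concatenate()([t_in, w_in, s_in, y_in])
    fused = layers.Dense(fusion_units)(fused)
    fused = layers.LeakyReLU()(fused)

    # Softmax output with as many neurons as classes in the addressed subtask.
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return Model(inputs=[t_in, w_in, s_in, y_in], outputs=out)

# e.g. gender prediction (2 classes); the same head is re-built and trained
# separately for each subtask, changing only num_classes.
model = build_fusion_head(num_classes=2)
```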
2.1 Preprocessing

In the preprocessing stage we concatenate the posts belonging to the same user, in order to treat them as a single super-document; between consecutive posts we place a tag, i.e. <post>, marking the end of one post and the beginning of the next. Afterwards, numbers and dates are recognized and replaced by wildcards that encode the meaning of these special tokens. Then the text is tokenized and morphologically analyzed by means of FreeLing (Padró and Stanilovsky, 2012). For computing the stylistic and tf-idf vectors, as well as for feeding the deep models in the prevailing-topic detection task, we removed the stop words from the document and lemmatized the tokens to their canonical form.
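The following sketch illustrates the post-concatenation and wildcard-replacement steps. The separator string, wildcard names and regular expressions are assumptions made for the example; tokenization and lemmatization with FreeLing are not reproduced here.

```python
# Minimal sketch (under assumptions) of the preprocessing described above:
# a user's posts are joined into one super-document with a <post> separator,
# and dates/numbers are replaced by wildcards. Separator and wildcard names
# are illustrative choices, not taken from the paper.
import re

POST_TAG = "<post>"          # assumed surface form of the separator tag
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+([.,]\d+)?\b")

def build_super_document(posts):
    """Concatenate a user's posts into one super-document."""
    return f" {POST_TAG} ".join(p.strip() for p in posts)

def replace_special_tokens(text):
    """Replace dates and numbers with wildcards encoding their meaning."""
    text = DATE_RE.sub("<date>", text)    # assumed wildcard name
    text = NUM_RE.sub("<number>", text)   # assumed wildcard name
    return text

posts = ["Oggi è il 12/05/2020.", "Ho corso 10 km stamattina."]
doc = replace_special_tokens(build_super_document(posts))
print(doc)  # Oggi è il <date>. <post> Ho corso <number> km stamattina.
```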
2.2 Transformer Block: BERT

BERT (Bidirectional Encoder Representations from Transformers) is an architecture that results from applying bidirectional training to the attention-based Transformer model, designed for language modeling. The Transformer model has two components: the first one, known as the encoder, is fed with the text and produces an encoded representation of the sequence; the second one, the decoder, produces the predicted tokens for language modeling one at a time, taking into account the encoder's output and the tokens predicted at previous time steps.

The main advantage of Transformer models w.r.t. traditional sequential architectures like the Gated Recurrent Unit (GRU) (Cho et al., 2014) is that, instead of analyzing the textual information in a single direction (e.g. right to left or left to right), they take the entire information into account at once by means of an attention mechanism, which relates each word in the text to its surrounding context.

Since the goal of BERT is to generate a language representation, only the encoder is necessary. It is structured as transformer blocks connected sequentially, each composed of attention heads working in parallel. Each transformer block gives its subsequent layer one representation for every element of the input text, and these representations are correlated with the entire input context. The original BERT model is trained on two subtasks: one consists of predicting masked words in a sentence, and the other consists of predicting whether two sentences are consecutive in the training corpus.

For the TAG-it tasks we employed a BERT model pre-trained on a multilingual corpus (multilingual L-12 H-768 A-12, https://github.com/google-research/bert) (Turc et al., 2019), which is fed with the super-document sequence. From this model we used only the first two transformer blocks, and as its output we keep the first and last vectors of the input sequence encoding, which are concatenated. We also fine-tuned BERT, adding an intermediate dense layer of 64 units with Leaky ReLU activation and training it with a multi-task objective that tries to predict age, topic and gender at once.
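As a hedged illustration of how such an encoding can be obtained with the Hugging Face transformers library (not necessarily the toolkit used by the authors), the sketch below takes the hidden states after the second transformer layer of a multilingual BERT checkpoint and concatenates the first and last token vectors; the checkpoint name and the maximum sequence length are assumptions.

```python
# Minimal sketch (not the authors' code) of the Transformer Block encoding:
# keep only the output of the second transformer layer and concatenate the
# first and last token vectors of the sequence encoding.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = TFBertModel.from_pretrained("bert-base-multilingual-cased",
                                   output_hidden_states=True)

def transformer_block_encoding(text, max_len=256):
    enc = tokenizer(text, truncation=True, max_length=max_len,
                    return_tensors="tf")
    outputs = bert(**enc)
    # hidden_states[0] is the embedding layer; index 2 is the output after the
    # second transformer block, approximating the paper's use of only the
    # first two blocks.
    layer2 = outputs.hidden_states[2][0]        # (seq_len, 768)
    first, last = layer2[0], layer2[-1]         # first and last token vectors
    return tf.concat([first, last], axis=0)     # (1536,)

vec = transformer_block_encoding("Questo è un esempio di super-documento.")
print(vec.shape)  # (1536,)
```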
2.3 Recurrent Word-Level Block

The second representation block of our system is based on LSTM networks. It takes as input a sequence from the preprocessed text, which is fed into an embedding layer whose weights are fixed to FastText (Grave et al., 2018) pretrained word embeddings (https://fasttext.cc/docs/en/crawl-vectors.html), obtaining a vector representation for each word of the sequence. Not all of the textual sequence carries information that is relevant to the task under analysis. In order to highlight the elements most important for encoding the message, instead of making the network pay equal attention to all elements, the tokens output by the embedding layer are scored by their relative importance over the other elements in their context with the Scaled Dot-Product Attention mechanism (Vaswani et al., 2017). Then the rescored sequence is fed into a Bidirectional LSTM (Bi-LSTM) (Schuster and Paliwal, 1997) layer with 64 neurons, which performs two analyses over the sequence, in forward and backward directions, so as to detect not only relations of an element with the previous ones, but also with the elements that appear after it. Afterwards, the hidden states of the Bi-LSTM layer are treated as a new sequence, which is fed into another LSTM, also with 64 neurons; from its output we take just the last hidden state, which constitutes the Recurrent Word-Level Block encoding. When training this block we applied dropout (Srivastava et al., 2014) to the neurons of the attention and LSTM layers in order to improve the generalization capability of the model.

2.3.1 Scaled Dot-Product Attention

This attention function first maps each sequence token to three representations (a query and a key-value pair) used to compute a compatibility index between every pair of elements. Then, for each token t_i, its compatibility w.r.t. every other sequence token t_j is evaluated by relating its query vector q_i to all the keys k_j; these compatibilities c_ij are normalized with a softmax function and used to weight the value vectors v_j for that specific query. Finally, the attention-based representation of t_i is computed as the weighted sum of these value vectors. This computation is defined as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V    (1)

where Q, K ∈ R^{n×d_k} and V ∈ R^{n×d_v} are matrices whose rows contain, respectively, the query, key and value mappings of the sequence tokens, n is the length of the sequence, and d_k, d_v are the dimensions of the key and value mapping vectors.
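A minimal sketch of Equation (1) and of the word-level block around it is given below, assuming the embeddings themselves play the roles of queries, keys and values; vocabulary size, embedding dimension, sequence length and dropout rates are illustrative choices.

```python
# Minimal sketch (a simplification, not the authors' implementation) of
# scaled dot-product self-attention over fixed embeddings, followed by a
# 64-unit Bi-LSTM and a second 64-unit LSTM whose last hidden state is the
# Recurrent Word-Level Block encoding.
import tensorflow as tf
from tensorflow.keras import layers, Model

def scaled_dot_product_attention(x):
    # Here the embedded tokens act as Q, K and V at once.
    d_k = tf.cast(tf.shape(x)[-1], tf.float32)
    scores = tf.matmul(x, x, transpose_b=True) / tf.sqrt(d_k)  # (n, n) compatibilities
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, x)                               # Equation (1)

def build_word_level_block(vocab_size=50000, emb_dim=300, max_len=512):
    tokens = layers.Input(shape=(max_len,), dtype="int32")
    # In the real system the embedding weights are fixed to FastText vectors.
    emb = layers.Embedding(vocab_size, emb_dim, trainable=False)(tokens)
    att = layers.Lambda(scaled_dot_product_attention)(emb)
    att = layers.Dropout(0.3)(att)                             # assumed rate
    hidden = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(att)
    encoding = layers.LSTM(64, dropout=0.3)(hidden)            # last hidden state
    return Model(tokens, encoding, name="recurrent_word_level_block")

block = build_word_level_block()
block.summary()
```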
2.3.2 LSTM

LSTM networks are a special kind of RNN specialized in analyzing sequential data. They have a main cell (the recurrent unit) that explores the data sequence one element at each time step, in left-to-right order. The network shares the information captured at previous steps when computing the new hidden state at the current time step. The main cell contains a gate structure that tells the network which information to preserve or forget from the hidden states of previous time steps for the current computation.

2.4 Stylistic Block: Stylistic Features

The representation based on stylistic features is twofold. On the one hand, to characterize a user for a given classification task we consider a vector containing the tf-idf of a set of key tokens from the text; on the other hand, we construct a statistical style-feature vector that captures information from distinct lexical and syntactic linguistic layers.

To construct the first vector we used a feature selection approach that scores every term employed by the users belonging to a given category of a classification task and then selects the most relevant ones. For scoring the tokens we use Information Gain (IG) (Sebastiani, 2002), which takes into account both the presence of a term in a category and its absence. The information gain of a term t for a class C is defined as:

IG(t, C) = \sum_{c \in \{C, \bar{C}\}} \sum_{x \in \{t, \bar{t}\}} P(x, c) \log_2 \frac{P(x, c)}{P(x)\,P(c)}    (2)

In this formula, probabilities are interpreted over an event space of documents (e.g. P(t̄, C) indicates the probability that, for a random document d, term t does not occur in d and d belongs to category C). Once the IG of every term occurring in documents of class c_i has been computed, the 500/l_c tokens with the highest IG are chosen to characterize this class, where l_c is the number of classes of the task. Finally, a 500-dimensional vector is constructed whose components are the tf-idf values of the representative terms of every class.

The second representation is computed independently of the addressed task as a 12-dimensional vector whose components are real numbers corresponding to statistical values from lexical and syntactic linguistic layers (e.g. sentence, paragraph and syntactic layers), such as:

• Paragraph layer: standard deviation of the length of the sentences written by the user.

• Text layer: number of stop words used.

• Sentence layer: average word length.

• Syntactic layer: proportion of nouns over adjectives.

These two representations are combined and fed into a 64-neuron dense layer that synthesizes the information before it is fused with the representations of the other blocks.
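The sketch below illustrates the term-selection step under stated assumptions: Equation (2) scores the terms of each class, the 500/l_c best terms per class are kept, and a tf-idf vector over the selected terms is built with scikit-learn's TfidfVectorizer. The toy corpus, the crude tokenization and the deduplication of terms shared across classes are choices made for the example, not details from the paper.

```python
# Minimal sketch (an illustration, not the authors' code) of the key-token
# selection of the Stylistic Block: score terms with Information Gain,
# keep the top 500 / l_c per class, and build tf-idf vectors over them.
import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def information_gain(term, doc_sets, labels, target_class):
    """IG(t, C) over an event space of documents, as in Equation (2)."""
    n = len(doc_sets)
    ig = 0.0
    for in_class in (True, False):                      # c in {C, not C}
        for has_term in (True, False):                  # x in {t, not t}
            joint = sum(1 for d, y in zip(doc_sets, labels)
                        if (term in d) == has_term
                        and (y == target_class) == in_class) / n
            p_x = sum(1 for d in doc_sets if (term in d) == has_term) / n
            p_c = sum(1 for y in labels if (y == target_class) == in_class) / n
            if joint > 0:
                ig += joint * math.log2(joint / (p_x * p_c))
    return ig

def select_key_terms(docs, labels, total_terms=500):
    doc_sets = [set(d.lower().split()) for d in docs]   # crude tokenization (assumption)
    classes = sorted(set(labels))
    per_class = total_terms // len(classes)             # 500 / l_c terms per class
    selected = []
    for c in classes:
        vocab = Counter(t for d, y in zip(doc_sets, labels) if y == c for t in d)
        ranked = sorted(vocab, reverse=True,
                        key=lambda t: information_gain(t, doc_sets, labels, c))
        selected.extend(ranked[:per_class])
    return list(dict.fromkeys(selected))                # drop terms shared across classes

docs = ["oggi parlo di calcio e di sport",
        "una ricetta di pasta al pomodoro",
        "la partita di calcio di ieri sera",
        "dolci e torte per la cena"]
labels = ["sport", "cucina", "sport", "cucina"]
key_terms = select_key_terms(docs, labels, total_terms=4)
vectorizer = TfidfVectorizer(vocabulary=key_terms)
tfidf_vectors = vectorizer.fit_transform(docs)  # one tf-idf vector per document
print(key_terms, tfidf_vectors.shape)
```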
2.5 Recurrent Sentence-Level Block

This block shares the structure of the Recurrent Word-Level Block but, instead of being fed with a sequence of word representations provided by an embedding layer, it is fed with a sequence obtained by encoding each sentence of the super-document by means of an encoder with a structure similar to that of the Transformer Block described above. For this block we trained the sentence encoder with the same multi-task objective as in the Transformer Block, but aiming to predict, for each sentence of a document, the annotated characteristics (i.e. age and gender) of the user it belongs to and the topic of its surrounding text. We then encode all the sentences of the super-document composed of the user's posts and treat them as the tokens of a sentence-level sequence. Afterwards, that sequence is fed into an Att-Bi-LSTM model with the same structure as the Recurrent Word-Level Block, taking as the user's profile encoding the last hidden state of the second LSTM layer, as in the word-level block.

3 Experiments and Results

The dataset used in this work was the one provided by the task organizers. It is unbalanced, mainly for the gender classification task, where the male class represents 82.6% of the examples. In order to prevent a biased training of the model we applied a class-weighting method, scaling the loss computed for every example according to the class it belongs to (i.e. for examples from the male class the loss is weighted by 0.3, whereas for female examples it is weighted by 0.7). In this way, when the parameters are updated by means of the gradients, the model pays more attention to the most heavily weighted class, specifically the under-represented one.
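A minimal sketch of this class weighting is shown below, assuming class index 0 for male and 1 for female; with Keras the same effect can be obtained by passing the weights directly to fit().

```python
# Minimal sketch (illustrative, with assumed class indices) of the class
# weighting used for the gender subtask: male examples weighted by 0.3,
# female examples by 0.7, so gradient updates emphasize the minority class.
import tensorflow as tf

class_weight = {0: 0.3, 1: 0.7}   # assumed: 0 = male, 1 = female

# With Keras, the weights can be passed directly to training:
#   model.fit(x_train, y_train, class_weight=class_weight, ...)

# Equivalently, the weighting applied to a per-example cross-entropy loss:
def weighted_categorical_crossentropy(y_true, y_pred, weights=(0.3, 0.7)):
    per_example = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    w = tf.reduce_sum(y_true * tf.constant(weights, dtype=y_pred.dtype), axis=-1)
    return per_example * w
```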
We pre-train the Transformer models of the Transformer Block and the sentence encoder of the Recurrent Sentence-Level Block independently of the full model, and then we fix the learned weights. For fine-tuning these BERT models we employ the Adam optimizer with a categorical cross-entropy loss function for every output layer, since we applied multi-task learning, over two epochs. The learning rate for this training was set to a low value (lr = 1e-5), since we wanted to preserve as much as possible the parameters learned during the original training on a huge amount of data while making the model focus on our addressed tasks; we also set a decay of 2e-3 in the learning rate scheduler.

We evaluated and selected the hyper-parameters, as well as the representations and features used by our model, with a cross-validation method, in order to obtain a more realistic and unbiased performance estimate, using 5 validation splits. At each cross-validation step the dataset was split into 20% for validation and 80% for training, keeping the distribution of examples relative to the split size.

The performance of the model at training stage was evaluated independently for each subtask using different combinations of the representations from the Recurrent Word-Level Block (RNN-W), the Recurrent Sentence-Level Block (RNN-S), the Transformer Block (T) and the Stylistic Block (STY). For age and gender prediction we used the Micro-F1 metric, whereas for topic prediction we used accuracy. In Table 1 we summarize the results obtained in terms of the average of these metrics over the cross-validation training.

Table 1: Model performance on training data (cross-validation averages).

Model            Age (AVG-F1)   Gender (AVG-F1)   Topic (Acc)
RNN(S+W)-STY-T   0.378          0.941             0.935
RNN(S+W)-T       0.203          0.946             0.885
RNNS-STY-T       0.348          0.940             0.931
RNNW-STY-T       0.339          0.919             0.903

As we can see, assembling the three deep representations with the stylistic one yields a good performance in all cases throughout the cross-validation process. However, the stylistic representation had a slight negative influence on the gender prediction task.

Regarding the official results, we submitted 3 runs as the UOBIT team; in each of them we employed the representations learned by the Transformer and Stylistic Blocks while varying the use of the Recurrent Blocks' encodings, as shown in Table 2.

Table 2: Model performance on test data.

run         Model                 Subtask 1              Subtask 2a   Subtask 2b
                                  Metric 1   Metric 2    Micro-F1     Micro-F1
run-1       RNN-W T STY           0.686      0.251       0.852        0.278
run-2       RNN-S T STY           0.674      0.243       0.883        0.370
run-3       RNN-W RNN-S T STY     0.699      0.251       0.893        0.308
Unofficial  RNN-W RNN-S T         0.680      0.248       0.898        0.468
Unofficial  RNN-W RNN-S           0.667      0.243       0.893        0.369
Unofficial  T                     0.436      0.067       0.835        0.283

After the evaluation phase we tried removing the representation based on stylistic features, and we found that this representation, possibly because it introduces some noise, worsens the performance of the model, at least on the tasks related to author attributes (i.e. gender and age), corresponding to subtasks 2a and 2b. We think the noise introduced by these features mainly comes from the fact that they are computed from key tokens of the text; such tokens may suggest to the model that texts on the same topic belong to the same class in the gender or age classification task. Using only the deep representations of the Recurrent and Transformer Blocks, our system reaches 0.4606 under the F1 metric on subtask 2b, which improves on the 0.409 reached by the best team, and this same combination improves on our best official run for subtask 2a. These results are shown in Table 2 in the rows labeled Unofficial.

The results show that considering both the stylistic representation and the deep representations learned by the Recurrent and Transformer models yields the best effectiveness, in terms of accuracy, for the topic classification task; this behavior changes for age and gender classification, due to the relationship between the syntactic structures of the text and the topic the user's posts are related to. We think that excluding the stylistic features, or at least those related to the frequency of tokens in the text, could be a way to increase the effectiveness of the ensemble, mainly on the age detection subtask. Also, analyzing the content of the posts at character level, given the informal origin of the texts, would mitigate the misidentification of some key words within the text. We would like to explore these ideas in future work.

4 Conclusions

In this paper we described our system for participating in the TAG-it Author Profiling task at EVALITA 2020. Our proposal is based on an ensemble of RNNs, Transformer neural networks and hand-crafted stylistic features. The system receives as input the textual information of a user's profile as a single super-document (sequence); this information is encoded in four different ways: first, by a Transformer Block, specifically a fine-tuned and reduced BERT model; second, by a Recurrent Block based on an Attention-Bi-LSTM model analyzing the information at word level; third, by a feature representation based on the combination of tf-idf information and stylistic features extracted from the text; and fourth, by the same recurrent structure as in the Recurrent Word-Level Block, but analyzing the information at sentence level. These four representations are mixed and fed into a dense layer that synthesizes them, and its output is received by another dense layer that classifies the profile according to the classes of the addressed subtask.

References

Mario Ezra Aragón and A-Pastor López-Monroy. 2018. A straightforward multimodal approach for author profiling. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018).

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Roy Khristopher Bayot and Teresa Gonçalves. 2018. Multilingual author profiling using LSTMs: Notebook for PAN at CLEF 2018. In CLEF (Working Notes).

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Cimino A., Dell'Orletta F., and Nissim M. 2020. TAG-it@EVALITA2020: Overview of the topic, age, and gender prediction task for Italian.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA 2018 cross-genre gender prediction (GxG) task. EVALITA Evaluation of NLP and Speech Tools for Italian, 12:35.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Roberto Labadie-Tamayo, Daniel Castro-Castro, and Reynier Ortega-Bueno. 2020. Fusing stylistic features with deep-learning methods for profiling fake news spreader - Notebook for PAN at CLEF 2020. In Linda Cappellato, Carsten Eickhoff, Nicola Ferro, and Aurélie Névéol, editors, CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org, September.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Juan Pizarro. 2019. Using n-grams to detect bots on Twitter. In CLEF (Working Notes).

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.