Unsupervised pretraining for text classification using siamese transfer learning
Notebook for PAN at CLEF 2019

Maximilian Bryan, J. Nathanael Philipp
Universität Leipzig
bryan@informatik.uni-leipzig.de, jonas_nathanael.philipp@uni-leipzig.de

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract
When training neural networks, large amounts of training data typically lead to better results. When only a small amount of training data is available, it has proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually only given pretrained word embeddings; the rest of the network is not pretrained, since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto vector representations. When a sentence pair appears coherently in our corpus, the vector representations should be similar; if not, they should be dissimilar. After pretraining this network, we enhance it and train it on a smaller dataset in order to have it classify textual data. We show that this kind of pretraining leads to better results than no pretraining or pretrained embeddings alone when performing text classification on a task with only a small amount of training data. For evaluation, we use the bots and gender profiling dataset provided by PAN 2019.

1 Introduction

When training a neural network for a given task, training data is needed. With this data, the network adjusts its weights in order to correctly map input data onto the labels provided with the training data. The more variance there is in the input data, the better the network performs on unseen data. Thus, the more training data is available, the better a network performs [13]. On the other hand, when little data is available, a neural network runs the risk of overfitting the training data, since it only learns patterns that can be seen in the training data.

In this article, we present an architecture for pretraining neural networks on textual data that not only pretrains word embeddings but trains a larger network that learns sentence representations. In section 3 we present a siamese architecture to pretrain a neural network on a huge text corpus. To show that this pretraining leads to better results when transferring the network to a specific task, we enhance the presented architecture and continue training it on a smaller dataset. The smaller dataset is taken from the bots and gender profiling task of PAN 2019 [6,23,24,22]. In section 4, we compare this pretrained network with a non-pretrained network.

2 Related Work

In order to circumvent the problem of having little training data for a specific task, a neural network can be trained on other, related data before being trained on the data for the desired use case. This approach is known as transfer learning [20] and is commonly used in the field of computer vision by taking networks pretrained on ImageNet [12], which contains 1.2 million images, and applying them to tasks like image classification [9], object detection [10], image segmentation [7] and others. In all these use cases, the network performs better because it has been pretrained on a large dataset of images.
Pretraining neural networks for NLP tasks is harder for two reasons. On the one hand, it is very easy to gather text data, but it is much more difficult to obtain a labeled corpus of comparable size. On the other hand, text data is ill-suited for traditional pretraining architectures like autoencoders [27]. When pretraining in an unsupervised fashion with an autoencoder, the neural network has to recreate the given input data. When the input data is a sequence of dictionary entries, the network has to contain a layer that maps onto the dictionary in order to, e.g., predict the probability of each token being present. Given that dictionaries typically contain several hundred thousand tokens, that single layer contains so many weights that it becomes technically hard to train such a network. Also, when using an autoencoder as a pretraining step, one is usually only interested in the encoding part of the network, but the decoding part still has to be trained, which results in bigger networks than actually needed. Another problem arises from the fact that text data is sequential. This makes autoencoders, which work well for images, unusable. One could think of a recurrent architecture that maps a sequence onto itself, but with that kind of architecture the network just has to map each token onto itself and does not need to consider contextual information. To circumvent these problems, architectures have been created which split the network into an encoding and a decoding part by mapping the sequence onto one vector representation and then recreating the sequence from that single vector [25,3].

To avoid the aforementioned problems when pretraining sequential networks in the context of NLP, networks most often contain only a pretrained input part, which maps words to a vector representation, often called an embedding and created using tools like Word2Vec [18], GloVe [21] or context2vec [17]. For example, [16] created an architecture for question answering using pretrained GloVe word vectors, [2] created a neural network for natural language inference using pretrained word vectors, and [26] used a pretrained embedding in a neural network for semantic role labeling.

3 Siamese Architecture

A siamese architecture belongs to the group of unsupervised learning methods. It does not aim to perform a classification as supervised approaches do, but tries to find patterns in data so that the given input data can be mapped onto a vector representation. In this respect it is similar to Restricted Boltzmann Machines [14] or autoencoders [27]. A siamese architecture resembles an autoencoder, but it only contains the first part, the encoder. Due to the lack of a decoding part, the siamese architecture cannot be trained to recreate the input data; it is instead trained using data pairs. The network is given two entries of the training set and has to map both inputs onto the same vector representation as long as both entries are considered to be similar (positive pair). The vector similarity can be calculated in several ways, e.g. using cosine similarity, Euclidean distance or the Jaccard index. It is also important to give the network input pairs whose two entries are considered dissimilar (negative pairs). Otherwise, the network would converge to the trivial solution of mapping all input data onto the same vector representation. A minimal sketch of such a pair objective is given below.
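To make the positive/negative training signal concrete, the following is a minimal sketch of such a pair objective in PyTorch, assuming cosine similarity as the similarity measure and targets of 1 for positive and 0 for negative pairs. The squared-error formulation and the generic `encoder` interface are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn.functional as F


def siamese_pair_loss(encoder, input_a, input_b, is_positive):
    """Siamese pair objective: positive pairs are pushed towards a cosine
    similarity of 1, negative pairs towards 0.

    encoder     -- any module mapping a batch of inputs to vectors
                   (both branches share the same weights)
    input_a/b   -- the two entries of each pair, already preprocessed
    is_positive -- float tensor of shape (batch,), 1.0 or 0.0
    """
    vec_a = encoder(input_a)
    vec_b = encoder(input_b)
    sim = F.cosine_similarity(vec_a, vec_b, dim=-1)
    # Squared error between the achieved and the desired similarity.
    return ((sim - is_positive) ** 2).mean()
```

Because both branches share the same encoder weights, the only way to satisfy both positive and negative pairs is to encode features that distinguish coherent from unrelated inputs.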
Using the positive/negative pair approach, the network has to find patterns in the input data so that positive input pairs with similar patterns result in similar network representations and, conversely, negative pairs result in different vector representations. To our knowledge, the first application of such an architecture was created by [1] for verifying signatures. A neural network mapped the input image onto a vector representation; the cosine similarity of two representations had to be 1 when the signatures were from the same person. Another siamese approach was created by [4]: a convolutional neural network with a linear output was trained to give face images of the same person a cosine similarity of 1, while images of two different persons should have a cosine similarity of 0. Other siamese architectures have been used for text. [19] fed text character-wise into LSTMs, using job descriptions that had previously been labeled by similarity. The network had to create vector representations whose cosine distance was then used as the similarity between inputs.

Network Structure
To train a neural network to learn semantic and structural patterns from sentences, we propose to use a recurrent network. The recurrent network is fed with a sequence of word vectors. To allow for out-of-vocabulary tokens, we add a second input per time step, which is a character n-gram representation of the current token. That way, word vectors are trained together with the character n-grams of the corresponding token. When the token is not in our dictionary, we use a zero word vector and weight the character n-gram representation with a factor of 2. As the type of recurrent layer, we advocate using a GRU [5]. As a sentence's vector representation, we use the last state of the recurrent network. Using a GRU over a simple RNN has the advantage that a GRU solves the vanishing gradient problem of plain RNN layers. When comparing a GRU with the also commonly used LSTM layer [15], the LSTM's hidden state poses a problem: the LSTM cell contains a state which is protected and is only output when the corresponding gate opens, while a freely accessible hidden state is passed along at each time step. That means there are two vector representations per LSTM cell at each time step. Since we neither want to decide which state to take nor want to combine two vectors with different purposes, we decide to use the GRU with its single vector representation. Also, since the GRU has fewer gates and thus fewer weights to train, the overall training speed increases compared to using an LSTM. The described structure can be seen in figure 1; a minimal code sketch of this encoder follows the figure.

Figure 1. Sentence encoder architecture. Tokens are given into a recurrent network as pairs of word vectors and character n-gram representations.
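A minimal sketch of this encoder, assuming PyTorch, is shown below. The GRU over concatenated word vectors and character n-gram representations, the zero word vector for out-of-vocabulary tokens, and the factor of 2 on their n-gram representation follow the description above; all dimensions and the way the n-gram features are precomputed and passed in are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SiameseSentenceEncoder(nn.Module):
    """Maps a sentence to a single vector: at each time step the word vector
    is concatenated with a character n-gram representation and fed into a
    GRU; the last hidden state is used as the sentence representation."""

    def __init__(self, vocab_size, word_dim=300, ngram_dim=100, hidden_dim=512):
        super().__init__()
        # Index 0 is reserved for out-of-vocabulary tokens (zero word vector).
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.gru = nn.GRU(word_dim + ngram_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids, ngram_feats):
        # token_ids:   (batch, seq_len)            word indices, 0 = OOV
        # ngram_feats: (batch, seq_len, ngram_dim) character n-gram vectors
        words = self.word_emb(token_ids)
        # Weight the character n-gram representation with a factor of 2
        # for out-of-vocabulary tokens, as described in the text.
        oov = (token_ids == 0).unsqueeze(-1).float()
        ngrams = ngram_feats * (1.0 + oov)
        _, last_hidden = self.gru(torch.cat([words, ngrams], dim=-1))
        return last_hidden.squeeze(0)  # (batch, hidden_dim)
```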
Training Data
For unsupervised training of a neural network on sequential textual data, we use a huge collection of crawled news articles [11]. Compared to books or tweets, news articles have the advantage that each article contains a rather small set of sentences which we know cover the same topic. When training the network using the positive/negative approach, we create positive pairs by choosing random coherent sentence pairs from the same news article. Negative pairs consist of random sentences from the whole corpus. The network has to learn to map sentences onto vector representations in such a way that it can tell which sentences are likely to be paired with a given sentence, which means that features of possibly coherent sentences have to be contained in a sentence's vector representation. At the same time, not all sentences can be mapped onto the same vector representation, because of the negative pairs. The effect is that sentences like "Python is a nice programming language." and "I like working with it." probably appear more often as a positive pair than as a negative pair, so the network has to learn to give both sentences similar vector representations. The fact that sentence pairs chosen as positive pairs can also be shown as negative pairs is called weak supervision ([8] among others). A sketch of this pair construction is given below.
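The weakly supervised pair construction can be sketched as follows. Drawing positive pairs as coherent sentences from the same article and negative pairs as random sentences from the whole corpus follows the description above; the corpus format (a list of articles, each a list of sentences) and the even split between positive and negative pairs are assumptions.

```python
import random


def sample_pair(articles):
    """Weakly supervised pair sampling.

    articles -- list of articles, each a list of sentences (strings);
                articles are assumed to contain at least two sentences.
    Returns (sentence_a, sentence_b, label): label 1.0 for a positive pair
    (two coherent sentences from the same article), 0.0 for a negative pair
    (two random sentences from the whole corpus).
    """
    if random.random() < 0.5:
        # Positive pair: two neighbouring sentences from one article.
        article = random.choice(articles)
        i = random.randrange(len(article) - 1)
        return article[i], article[i + 1], 1.0
    # Negative pair: two random sentences from anywhere in the corpus.
    sentence_a = random.choice(random.choice(articles))
    sentence_b = random.choice(random.choice(articles))
    return sentence_a, sentence_b, 0.0
```

Note that a negative pair can, by chance, consist of two coherent sentences from the same article; this is exactly the weak supervision property mentioned above.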
4 Evaluation

In this section, we evaluate our pretraining approach. We use a rather small dataset provided by PAN @ CLEF 2019 [6]. The dataset consists of tweets by different authors [24]. The authors have to be categorized as bot or human and, in the latter case, as male or female. For this task, tweets of 2880 authors are given as training data. Compared to the number of users a service like Twitter sees daily, this amount of authors is very small. With so little data, typical tweet patterns or even linguistic patterns may not be present often enough in the training data for the network to robustly learn to recognize them. Also, some patterns can occur disproportionately often in the training data, so that the network fits itself to the training data, which is disadvantageous when using it on data unseen during the test phase.

The architecture used for this challenge is the one described in section 3. Since the network only returns a vector representation of a single sentence and no classification for an author, we enhance it with a recurrent layer and two small output layers. The first has two outputs and indicates bot or no bot; the second has three outputs and indicates bot, male or female. Both new output layers are activated using softmax. The architecture can be seen in figure 2; a code sketch of this classification head is given at the end of this section.

Figure 2. Enhanced architecture for taking part in the author profiling challenge. The input into the recurrent layer is a tweet's vector representation created by the network shown in figure 1.

Table 1. Comparison of network accuracy for differently pretrained networks on the challenge's publicly available test data.

                        bot / no-bot        bot / male / female
                        English   Spanish   English   Spanish
not-pretrained          0.6514    0.5708    0.6758    0.6525
pretrained embeddings   0.7303    0.6101    0.6928    0.6897
pretrained siamese      0.8597    0.7862    0.7494    0.7169

We use three differently trained versions of this architecture for comparison, which differ in the parts that have been pretrained. Fine-tuning is done on the data provided for the challenge; testing is done on the publicly available test data. The first version of the architecture is not pretrained at all, i.e. it gets neither pretrained word embeddings nor a pretrained recurrent layer. The second version is only trained on the challenge data, but receives pretrained word embeddings. This approach is similar to the previously mentioned related projects, which create a neural network for an NLP task and provide it with pretrained word embeddings. The third version is pretrained using the siamese approach described in section 3; after pretraining, the architecture is enhanced and trained on the challenge's training data. Additionally, each of the three described versions is trained for English data and for Spanish data. As can be seen in table 1, the presented architecture's performance gets significantly better the more pretraining has been done.
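As an illustration of the enhanced architecture, the following is a hedged sketch of the classification head, again assuming PyTorch. The recurrent layer over tweet vectors and the two softmax output layers (bot/no-bot and bot/male/female) follow figure 2; the hidden size and the use of the last hidden state for both outputs are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn


class AuthorProfilingHead(nn.Module):
    """Recurrent layer over a sequence of tweet vectors (produced by the
    pretrained sentence encoder) with two softmax output layers:
    bot / no-bot (2 classes) and bot / male / female (3 classes)."""

    def __init__(self, tweet_dim=512, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(tweet_dim, hidden_dim, batch_first=True)
        self.out_bot = nn.Linear(hidden_dim, 2)      # bot / no-bot
        self.out_gender = nn.Linear(hidden_dim, 3)   # bot / male / female

    def forward(self, tweet_vectors):
        # tweet_vectors: (batch, num_tweets, tweet_dim)
        _, last_hidden = self.gru(tweet_vectors)
        h = last_hidden.squeeze(0)
        return (torch.softmax(self.out_bot(h), dim=-1),
                torch.softmax(self.out_gender(h), dim=-1))
```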
5 Conclusion

In this article, we tackled the problem of pretraining recurrent architectures on NLP data by transferring the siamese architecture to textual data. We train a network using positive/negative sentence pairs with weak supervision. The sentence pairs are chosen from a corpus of news articles: when sentences appear next to each other in an article, they are considered a positive pair, otherwise they form a negative pair. This pretrained network was then enhanced and trained on a smaller dataset to classify authors by their tweets. We showed that this yields significantly better results than performing no pretraining or using only pretrained word embeddings.

References

1. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a siamese time delay neural network. vol. 7, pp. 737–744 (1993)
2. Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H.: Enhancing and combining sequential and tree LSTM for natural language inference. CoRR abs/1609.06038 (2016), http://arxiv.org/abs/1609.06038
3. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014), http://arxiv.org/abs/1406.1078
4. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. vol. 1, pp. 539–546 (2005)
5. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555 (2014), http://arxiv.org/abs/1412.3555
6. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
7. Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. CoRR abs/1512.04412 (2015), http://arxiv.org/abs/1512.04412
8. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. CoRR abs/1704.08803 (2017), http://arxiv.org/abs/1704.08803
9. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 647–655. PMLR, Bejing, China (2014), http://proceedings.mlr.press/v32/donahue14.html
10. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013), http://arxiv.org/abs/1311.2524
11. Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (2012)
12. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
13. Hestness, J., Narang, S., Ardalani, N., Diamos, G.F., Jun, H., Kianinejad, H., Patwary, M.M.A., Yang, Y., Zhou, Y.: Deep learning scaling is predictable, empirically. CoRR abs/1712.00409 (2017), http://arxiv.org/abs/1712.00409
14. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006), https://science.sciencemag.org/content/313/5786/504
15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
16. Liu, X., Shen, Y., Duh, K., Gao, J.: Stochastic answer networks for machine reading comprehension. CoRR abs/1712.03556 (2017), http://arxiv.org/abs/1712.03556
17. Melamud, O., Goldberger, J., Dagan, I.: context2vec: Learning generic context embedding with bidirectional LSTM. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. pp. 51–61. Association for Computational Linguistics, Berlin, Germany (2016), https://www.aclweb.org/anthology/K16-1006
18. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space (2013), http://arxiv.org/abs/1301.3781
19. Neculoiu, P., Versteegh, M., Rotaru, M.: Learning text similarity with siamese recurrent networks (2016)
20. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2010), http://dx.doi.org/10.1109/TKDE.2009.191
21. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: EMNLP. vol. 14, pp. 1532–1543 (2014), https://nlp.stanford.edu/pubs/glove.pdf
22. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
23. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. pp. 156–169. Springer International Publishing, Cham (2018)
24. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014), http://arxiv.org/abs/1409.3215
26. Tan, Z., Wang, M., Xie, J., Chen, Y., Shi, X.: Deep semantic role labeling with self-attention. CoRR abs/1712.01586 (2017), http://arxiv.org/abs/1712.01586
27. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders (2008)