Fully Convolutional Networks for Text Classification

Jacob Anderson
Sentim LLC
Columbus, OH, USA
papers@sentimllc.com

Abstract

English. In this work I propose a new way of using fully convolutional networks for classification while allowing for input of any size. I additionally propose two modifications to the idea of attention and discuss the benefits and detriments of using the modified versions. Finally, I report suboptimal results on the ITAmoji 2018 tweet-to-emoji task, discuss why that might be the case, and propose a fix to further improve the results.

Italian. This work presents a new approach to the use of fully convolutional networks for classification that adapts to input data of any size. In addition, two modifications based on attention mechanisms are proposed, and their benefits and drawbacks are evaluated. Finally, the results of participating in the ITAmoji 2018 task on predicting the emoji associated with the text of a tweet are presented, discussing why the developed system's performance was not optimal and proposing possible improvements.

1 Introduction

The dominant approach in many natural language tasks is to use recurrent neural networks or convolutional neural networks (CNNs) (Conneau et al., 2016). For classification tasks, recurrent neural networks have a natural advantage because of their ability to take in any size input and output a fixed size output. This ability allows for greater generalization, as no data is removed or added in order for the inputs to match in length. While convolutional neural networks can also support input of any size, they lack the ability to generate a fixed size output from any sized input. In text classification tasks, this often means that the input is fixed in size in order for the output to also have a fixed size.

Other recent work in language understanding and translation uses a concept called attention. Attention is particularly useful for language understanding tasks as it creates a mechanism for relating different positions of a single sequence to each other (Vaswani et al., 2017).

In this work I propose a new way of using fully convolutional networks for classification that allows for any input length without adding or removing data. I also propose two modifications of attention and then discuss the benefits and detriments of using the modified versions as compared to the unmodified version.

2 Model Description

The overall architecture of my fully convolutional network design is shown in Figure 1. My model begins with a character embedding in which each character in the input maps to a vector of size 16. I then apply a causal convolution with 128 filters of size 3. After that, I apply a stack of 9 layers of residual dilated convolutions with skip connections, each of which uses 128 filters of size 7. The size of 7 was chosen by inspection, as it converged faster than size 3 or 5 while not consuming too much memory. Additionally, the dilation rate doubles with every layer of the stack, so the first layer has rate 1, the second layer has rate 2, then rate 4, and so on.

All of the skip connections are combined with a summation immediately followed by a ReLU to increase nonlinearity. Finally, the output of the network is computed using a convolution with 25 filters, each of size 1, followed by a global max pool operation. The global max pool operation reduces the 3D tensor of size (batch size, input length, 25) to (batch size, 25) in order to match the expected output.

Figure 1: Model Architecture
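As a rough illustration, the sketch below wires this architecture up in tf.keras (the implementation framework noted in the next paragraph). The character vocabulary size, optimizer, and loss setup are assumptions, and the residual dilated block is reduced here to a plain dilated convolution with a ReLU; a more faithful sketch of the block follows Section 2.5.

```python
# Minimal sketch of the overall architecture (Figure 1), assuming tf.keras.
# NUM_CHARS is an assumed character-vocabulary size; the residual dilated
# block is simplified here (see the more detailed sketch after Section 2.5).
import tensorflow as tf
from tensorflow.keras import layers

NUM_CHARS, NUM_CLASSES = 256, 25

inputs = layers.Input(shape=(None,), dtype='int32')        # any input length
x = layers.Embedding(NUM_CHARS, 16)(inputs)                # character embedding, size 16
x = layers.Conv1D(128, 3, padding='causal')(x)             # initial causal convolution

skips = []
for i in range(9):                                         # 9 residual dilated layers
    dilated = layers.Conv1D(128, 7, padding='causal',
                            dilation_rate=2 ** i, activation='relu')(x)
    skips.append(dilated)                                   # skip connection
    x = layers.add([layers.Conv1D(128, 1)(x), dilated])     # residual connection

x = layers.Activation('relu')(layers.add(skips))            # sum the skips, then ReLU
x = layers.Conv1D(NUM_CLASSES, 1)(x)                        # 25 filters of size 1
outputs = layers.GlobalMaxPooling1D()(x)                    # (batch, length, 25) -> (batch, 25)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```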
I implemented all code using a combination of TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015). During training I used a softmax cross-entropy loss with an L2 regularization penalty with a scale of 0.0001. I further reduced overfitting by adding spatial dropout (Tompson et al., 2015) with a drop probability of 10% in the residual dilated convolution layers.

2.1 Hardware Limitations

At the time of creating the models in this paper, I was limited to a Google Colab GPU, which came with a runtime restriction of 12 hours per day and half a gigabyte of GPU memory [1]. While it is possible to continue training after the restriction is reset, in order to maximize GPU usage I tried to design each iteration of the model so that it would finish training within a 12 hour period.

[1] They have since changed this limitation to 13 GB.

2.2 Residual Block

A residual connection is any connection which maps the input of one layer to the output of a layer further down in the network. Residual connections decrease training error, increase accuracy, and increase training speed (He et al., 2016).

2.3 Dilated Convolution

A dilated convolution is a convolution where the filter is applied over a larger area by skipping input values according to a dilation rate. This rate usually scales exponentially with the number of layers in the network, so the first layer looks at every input, the second at every other input, the third at every fourth input, and so on (van den Oord et al., 2016).

In this paper, I use dilated convolutions similar to WaveNet (van den Oord et al., 2016), where each convolution has both residual and skip connections. However, instead of the gated activation function from the WaveNet paper, I used local response normalization followed by a ReLU function. This activation function was proposed by Krizhevsky, Sutskever, and Hinton (2012), and I used it because I found it to achieve equal results with faster convergence.

2.4 Residual Dilated Convolution

A residual dilated convolution is a dilated convolution with a residual connection, as shown in Figure 2. First, I take a dilated convolution of the input and a linear projection of the input. The dilated convolution and the linear projection are added together and then output. The dilated convolution also feeds a skip connection, which is eventually summed together with every other skip connection later in the network.

Figure 2: Residual Dilated Convolution

2.5 Skip Connections

In this paper, I also use the idea of skip connections from Long, Shelhamer, and Darrell (2015). Skip connections simply connect previous layers to the layer right before the output in order to fuse local and global information from across the network. In this work, the connections are all fused together with a summation followed by a ReLU activation to increase nonlinearity.
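Following the description in Sections 2.3 and 2.4 and Figure 2, one such block could be written as in the sketch below. The use of a 1x1 convolution for the linear projection, the default LRN parameters, and placing the spatial dropout inside the block are assumptions on my part.

```python
# Sketch of one residual dilated convolution block (Figure 2), assuming tf.keras.
# The 1x1 convolution used as the linear projection and the default LRN
# parameters are assumptions not pinned down in the text.
import tensorflow as tf
from tensorflow.keras import layers

def lrn(x):
    # tf.nn.local_response_normalization expects a 4-D tensor, so temporarily
    # add (and then remove) a dummy spatial axis around (batch, length, channels).
    return tf.squeeze(tf.nn.local_response_normalization(tf.expand_dims(x, 1)), 1)

def residual_dilated_block(x, dilation_rate, filters=128, kernel_size=7):
    """Returns (residual output for the next layer, skip connection)."""
    y = layers.Conv1D(filters, kernel_size, padding='causal',
                      dilation_rate=dilation_rate,
                      kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    y = layers.Lambda(lrn)(y)                    # local response normalization ...
    y = layers.Activation('relu')(y)             # ... followed by a ReLU
    y = layers.SpatialDropout1D(0.1)(y)          # 10% spatial dropout
    projection = layers.Conv1D(filters, 1)(x)    # linear projection of the input
    residual = layers.add([projection, y])       # dilated conv + linear projection
    return residual, y                           # y also feeds the summed skip path

features = layers.Input(shape=(None, 128))       # output of the previous layer
residual, skip = residual_dilated_block(features, dilation_rate=4)
```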
2.6 Attention and Self-Attention

Attention can be described as mapping a query and a set of key-value pairs to an output (Vaswani et al., 2017). Specifically, when I say attention or 'normal' attention, I am referring to Scaled Dot-Product Attention, which is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are matrices representing the queries, keys, and values respectively, and d_k is the dimension of the keys (Vaswani et al., 2017).

Self-attention, then, is where Q, K, and V all come from the same source vector after a linear projection. This allows each position in the vector to attend to every other position in the same vector.

2.7 Simplified and Local Attention

Simplified and local attention can both be thought of as trying to reinforce the mapping of a key to a value by extracting extra information from the key. I compute a linear transformation followed by a softmax to get the weights on the values. These weights and the initial values are multiplied together element-wise in order to highlight which of the values are the most important for solving the problem. Simplified attention can also be thought of as reinforcing a one-to-one correspondence between the key and the value.

Figure 3: Simplified Attention

Local attention is like simplified attention except that, instead of performing a linear projection on the keys, it performs a convolutional projection on the keys. This allows the network to use local information in the keys to attend to the values.

2.8 Multi-Head Attention

In multi-head attention, attention is performed multiple times on different projections of the input (Vaswani et al., 2017). In this paper, I use either one or eight heads in every experiment with attention, in order to get the best results and to compare the different methods accurately.

2.9 Model Modifications for Attention

In this paper, I tested seven different models, six of which extend the base model using some type of attention. In the models with attention, self-attention is used right after the final convolution and right before the global pooling operation.

2.10 Global Max Pooling

While CNNs support input of any size, they lack the ability to generate a fixed size output and instead output a tensor that is proportional in size to the input. In order for the output of the network to have a fixed size of 25, I use max pooling (Scherer et al., 2010) along the time dimension of the last convolutional layer. I perform the max pooling globally, which simply means that I take the maximum value over the whole time dimension instead of over a sliding window of the time dimension.
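Before turning to the experiments, the sketch below contrasts normal scaled dot-product self-attention (Equation 1) with the simplified and local variants of Section 2.7. The projection sizes, the softmax axis, and the assumption that the projection width matches the number of input channels are my reading of the description rather than details fixed in the text.

```python
# Sketch of scaled dot-product self-attention (Eq. 1) next to the simplified and
# local variants of Section 2.7, assuming tf.keras. d_model is assumed to equal
# the number of input channels so the element-wise product is well defined.
import tensorflow as tf
from tensorflow.keras import layers

def scaled_dot_product_self_attention(x, d_model=128):
    q = layers.Dense(d_model)(x)      # Q, K, V are linear projections
    k = layers.Dense(d_model)(x)      # of the same source vector
    v = layers.Dense(d_model)(x)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(tf.cast(d_model, tf.float32))
    return tf.matmul(tf.nn.softmax(scores, axis=-1), v)

def simplified_attention(x, d_model=128):
    # a linear transformation followed by a softmax gives weights on the values,
    # which are multiplied element-wise with the initial values
    weights = tf.nn.softmax(layers.Dense(d_model)(x), axis=-1)
    return weights * x

def local_attention(x, d_model=128, kernel_size=3):
    # like simplified attention, but the keys get a convolutional projection so
    # that local context in the keys decides how to attend to the values
    weights = tf.nn.softmax(
        layers.Conv1D(d_model, kernel_size, padding='same')(x), axis=-1)
    return weights * x

x = tf.random.normal([2, 50, 128])                    # (batch, length, channels)
print(scaled_dot_product_self_attention(x).shape)     # (2, 50, 128)
print(simplified_attention(x).shape)                  # (2, 50, 128)
print(local_attention(x).shape)                       # (2, 50, 128)
```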
3 Experiment and Results

In this section, I go over the ITAmoji task description and limitations, as well as my results on the task.

3.1 ITAmoji Task

This model was initially designed for the ITAmoji task at EVALITA 2018 (Ronzano et al., 2018). The goal of the task is to predict which of 25 emojis (shown in Table 1) is most likely to appear in a given Italian tweet. The provided dataset consists of 250,000 Italian tweets with one emoji label per tweet, and no additional data is allowed for training the models. However, it is allowed to use additional data to train unsupervised systems such as word embeddings. All results in the following subsections were evaluated on the test dataset of 25,000 Italian tweets provided by the organizers.

Emoji label                       % of samples
red heart                         20.28
face with tears of joy            19.86
smiling face with heart eyes       9.45
winking face                       5.35
smiling face with smiling eyes     5.13
beaming face with smiling eyes     4.11
grinning face                      3.54
face blowing a kiss                3.34
smiling face with sunglasses       2.80
thumbs up                          2.57
rolling on the floor laughing      2.18
thinking face                      2.16
blue heart                         2.02
winking face with tongue           1.93
face screaming in fear             1.78
flexed biceps                      1.67
face savoring food                 1.55
grinning face with sweat           1.52
loudly crying face                 1.49
top arrow                          1.39
two hearts                         1.36
sun                                1.28
kiss mark                          1.12
rose                               1.06
sparkles                           1.06

Table 1: Each of the 25 different emojis used in the ITAmoji task, their labels, and the corresponding percent of samples in the test dataset.

3.2 Results

Table 2 shows my official results from the ITAmoji competition, alongside the first- and second-place group scores. Table 3 shows the best result of each of the seven models I trained during the competition, evaluated after the competition was complete and selected according to the macro F1 score; it also shows the micro F1 score from the same run as the best macro F1 score, for comparison. Table 4 shows the upper and lower bounds of the F1 scores after the scores stopped increasing and plateaued.

Model                         Macro F1    Micro F1
1st Place Group               0.365       0.477
2nd Place Group               0.232       0.401
Run 3: Simplified Attention   0.106       0.294
Run 2: 1 Head Attention       0.102       0.313
Run 1: No Attention [2]       0.019       0.064

Table 2: Official results from the ITAmoji competition, as compared to the first and second place groups.

[2] Due to an off-by-one error in the conversion from network output to emoji, the official results for the no attention network are much worse than in actuality.

Model                         Macro F1    Micro F1
8 Head Attention              0.113       0.316
1 Head Attention              0.105       0.339
Local Attention               0.106       0.341
8 Head Local                  0.106       0.337
Simplified Attention          0.106       0.341
8 Head Simplified             0.109       0.308
No Attention                  0.11        0.319

Table 3: The best results from the different models on the dataset, run after the competition was over.

Model                         Macro F1        Micro F1
8 Head Attention              [0.10, 0.11]    [0.30, 0.36]
1 Head Attention              [0.09, 0.11]    [0.30, 0.36]
Local Attention               [0.10, 0.11]    [0.30, 0.35]
8 Head Local                  [0.10, 0.11]    [0.34, 0.36]
Simplified Attention          [0.10, 0.11]    [0.32, 0.36]
8 Head Simplified             [0.10, 0.11]    [0.31, 0.36]
No Attention                  [0.10, 0.11]    [0.30, 0.36]

Table 4: The upper and lower bounds of the F1 scores of the different model types after the scores have plateaued in training and started oscillating.

While 8 head attention did outperform the 8 head local and simplified models, it is interesting to note that this is not the case for the 1 head versions. Additionally, the bounds of the scores overlap substantially, so there are no statistically significant gains for one method over another. This result, along with my comparatively worse scores overall, is probably because the max pooling at the end of my model was throwing away too much information in order to make the output size consistent.
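For reference, the macro and micro F1 scores reported above can be computed as in the following sketch, which uses scikit-learn and toy labels and is not the official ITAmoji evaluation script.

```python
# Sketch of the macro / micro F1 computation used to compare models.
# Integer emoji indices in [0, 24] are assumed; the official scorer may differ.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0]   # gold emoji indices (toy example)
y_pred = [0, 2, 2, 1, 1]   # predicted emoji indices

macro_f1 = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average='micro')  # computed from global TP/FP/FN counts
print(macro_f1, micro_f1)
```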
4 Discussion

In the upcoming sections, I discuss a possible problem with the design of my models and propose a few solutions to that problem. I further discuss the two new modifications of attention that I proposed and their possible uses.

4.1 Loss of Information While Pooling

For the problem of throwing away too much information during the pooling or downsampling phase, there are three main approaches that could be explored, each with its positives and negatives.

The first approach is to fix the size of the input and use fully connected layers or similar approaches to find the correct output. This is the approach currently taken by most researchers, and it has shown good results. The main negative here is that the input size must be fixed, and fixing the input size can mean throwing away information or adding information that is not naturally there.

The second approach is to use a recurrent neural network unit such as an LSTM or a GRU, with size equal to the output size, to parse the result and output single values for the final sequence. This would probably lead to better results but is going to be slower than the other approaches.

The last approach is to use convolutional layers with a large kernel size and stride (e.g., stride equal to the size of the kernel). This would allow the network to shrink the output size naturally, and it would be faster than using an LSTM. The issue here is that in order to maintain the property that the network can accept any input size, pooling or some other method of downsampling still has to be used, potentially throwing away useful data.

4.2 Potential Uses of Simplified and Local Attention

While the original idea behind simplifying attention in the manner presented in this paper was to reduce computational cost and encourage easier learning by enforcing a softmax distribution on the data, there did not seem to be any benefit in doing so. In most cases the computational cost of a couple of matrix multiplications versus an element-wise product is negligible, so it would usually be better to just apply normal attention in those cases, as it already covers the case of simplified attention in its implementation.

Similarly, it does not necessarily make sense to use local attention instead of normal attention for small input sizes. Instead, it might make sense to switch out the linear projection on the queries and keys in normal attention for a convolutional projection but otherwise perform the scaled dot-product attention normally. This could be useful if the problem being approached needs to map patterns to values instead of mapping values to values. One could of course extend this even further by also performing a convolutional projection on the values in order to map local patterns to other local patterns, and so on.

On the other hand, the local attention suggested in this paper could be useful in neural networks used for images and other large data, where it might not make sense to attend over the whole input. This is especially true in the initial layers of such networks, where the neurons are only looking at a small section of the input in the first place. Beyond the smaller memory demands compared to normal attention, local attention could be useful in these layers because it provides a method to naturally figure out which patterns are important at these early layers.

Of course, an alternative to local attention is to just take small patches of the image and apply the original formulation of scaled dot-product attention to get similar results. This idea was originally suggested as future work in Vaswani et al. (2017).
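As a concrete illustration of the second and third approaches from Section 4.1, the final global max pooling could be replaced with one of the heads sketched below. The layer sizes and the kernel and stride choices are assumptions; both heads accept a variable-length (batch size, input length, 25) input.

```python
# Sketches of alternative heads to replace the final global max pooling, following
# the second and third approaches in Section 4.1. Sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

features = layers.Input(shape=(None, 25))   # variable-length output of the last convolution

# Second approach: a recurrent unit with size equal to the output size reads the
# whole sequence and emits a single fixed-size vector.
lstm_head = layers.LSTM(25)(features)

# Third approach: a large-kernel, large-stride convolution shrinks the time
# dimension, although some pooling is still needed to fix the final size.
shrunk = layers.Conv1D(25, kernel_size=8, strides=8, padding='same')(features)
conv_head = layers.GlobalMaxPooling1D()(shrunk)

heads = tf.keras.Model(features, [lstm_head, conv_head])
```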
5 Conclusion

In this work I presented simplified and local attention and tested these methods against similar models with normal attention and without any kind of attention at all. I also introduced a new strategy for classifying data of any input size with fully convolutional networks.

The new model design was not without its own flaws, as it showed poor results for all modifications of the method. The poor results were probably due to the final pooling layer throwing away too much information. A better method would be to use LSTMs or specially designed convolutions in order to shrink the output to the correct size.

Future work will include further exploration of simplified and local attention to get a grasp of which tasks they are good at and where, if anywhere, they show better efficiency or results than normal attention. I will also further explore the new strategy for classification of any sized input with fully convolutional models and see what I can change and update in order to improve the results of the model.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. and Kudlur, M., 2016. TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).

Chollet, F., 2015. Keras.

Conneau, A., Schwenk, H., Barrault, L. and Lecun, Y., 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

Ronzano, F., Barbieri, F., Wahyu Pamungkas, E., Patti, V. and Chiusaroli, F., 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) task. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018) & Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018).

Scherer, D., Müller, A. and Behnke, S., 2010. Evaluation of pooling operations in convolutional architectures for object recognition. In Artificial Neural Networks - ICANN 2010 (pp. 92-101). Springer, Berlin, Heidelberg.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y. and Bregler, C., 2015. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 648-656).

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W. and Kavukcuoglu, K., 2016. WaveNet: A generative model for raw audio. In SSW (p. 125).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (pp. 649-657).