Fully Convolutional Networks for Text Classification

Jacob Anderson
Sentim LLC
Columbus, OH, USA
papers@sentimllc.com

Abstract

English. In this work I propose a new way of using fully convolutional networks for classification while allowing for input of any size. I additionally propose two modifications to the idea of attention and discuss the benefits and detriments of using the modified versions. Finally, I report suboptimal results on the ITAmoji 2018 tweet-to-emoji task, discuss why that might be the case, and propose a fix to further improve the results.

Italian. This work presents a new approach to the use of fully convolutional networks for classification that adapts to input data of any size. In addition, two modifications based on attention mechanisms are proposed, and their benefits and drawbacks are evaluated. Finally, the results of participating in the ITAmoji 2018 task on predicting the emoji associated with the text of a tweet are presented, discussing why the developed system's performance was not optimal and proposing possible improvements.

1 Introduction

The dominant approach in many natural language tasks is to use recurrent neural networks or convolutional neural networks (CNNs) (Conneau et al., 2016). For classification tasks, recurrent neural networks have a natural advantage because of their ability to take in any size input and output a fixed size output. This ability allows for greater generalization, as no data is removed or added in order for the inputs to match in length. While convolutional neural networks can also support input of any size, they lack the ability to generate a fixed size output from any sized input. In text classification tasks, this often means that the input is fixed in size in order for the output to also have a fixed size.

Other recent work in language understanding and translation uses a concept called attention. Attention is particularly useful for language understanding tasks as it creates a mechanism for relating different positions of a single sequence to each other (Vaswani et al., 2017).

In this work I propose a new way of using fully convolutional networks for classification that allows for any input length without adding or removing data. I also propose two modifications of attention and then discuss the benefits and detriments of using the modified versions as compared to the unmodified version.

2 Model Description

The overall architecture of my fully convolutional network design is shown in Figure 1. My model begins with a character embedding in which each character in the input maps to a vector of size 16. I then apply a causal convolution with 128 filters of size 3. After that, I apply a stack of 9 layers of residual dilated convolutions with skip connections, each of which uses 128 filters of size 7. The size of 7 was chosen by inspection, as it converged faster than size 3 or 5 while not consuming too much memory. Additionally, the dilation rate doubles with every layer of the stack, so the first layer has rate 1, the second layer has rate 2, then rate 4, and so on.

All of the skip connections are combined with a summation immediately followed by a ReLU to increase nonlinearity. Finally, the output of the network is computed using a convolution with 25 filters, each of size 1, followed by a global max pool operation. The global max pool operation reduces the 3D tensor of size (batch size, input length, 25) to (batch size, 25) in order to match the expected output.

Figure 1: Model Architecture
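As a rough illustration, the sketch below wires this architecture up in tf.keras (the implementation framework noted in the next paragraph). The character vocabulary size, optimizer, and loss setup are assumptions, and the residual dilated block is reduced here to a plain dilated convolution with a ReLU; a more faithful sketch of the block follows Section 2.5.

```python
# Minimal sketch of the overall architecture (Figure 1), assuming tf.keras.
# NUM_CHARS is an assumed character-vocabulary size; the residual dilated
# block is simplified here (see the more detailed sketch after Section 2.5).
import tensorflow as tf
from tensorflow.keras import layers

NUM_CHARS, NUM_CLASSES = 256, 25

inputs = layers.Input(shape=(None,), dtype='int32')        # any input length
x = layers.Embedding(NUM_CHARS, 16)(inputs)                # character embedding, size 16
x = layers.Conv1D(128, 3, padding='causal')(x)             # initial causal convolution

skips = []
for i in range(9):                                         # 9 residual dilated layers
    dilated = layers.Conv1D(128, 7, padding='causal',
                            dilation_rate=2 ** i, activation='relu')(x)
    skips.append(dilated)                                   # skip connection
    x = layers.add([layers.Conv1D(128, 1)(x), dilated])     # residual connection

x = layers.Activation('relu')(layers.add(skips))            # sum the skips, then ReLU
x = layers.Conv1D(NUM_CLASSES, 1)(x)                        # 25 filters of size 1
outputs = layers.GlobalMaxPooling1D()(x)                    # (batch, length, 25) -> (batch, 25)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```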
I implemented all code using a combination of TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015). During training I used a softmax cross-entropy loss with an L2 regularization penalty with a scale of 0.0001. I further reduced overfitting by adding spatial dropout (Tompson et al., 2015) with a drop probability of 10% in the residual dilated convolution layers.

2.1 Hardware Limitations

At the time of creating the models in this paper, I was limited to a Google Colab GPU, which came with a runtime restriction of 12 hours per day and half a gigabyte of GPU memory [1]. While it is possible to continue training after the restriction is reset, in order to maximize GPU usage I tried to design each iteration of the model so that it would finish training within a 12 hour period.

[1] They have since changed this limitation to 13 GB.

2.2 Residual Block

A residual connection is any connection which maps the input of one layer to the output of a layer further down in the network. Residual connections decrease training error, increase accuracy, and increase training speed (He et al., 2016).

2.3 Dilated Convolution

A dilated convolution is a convolution where the filter is applied over a larger area by skipping input values according to a dilation rate. This rate usually scales exponentially with the number of layers in the network, so the first layer looks at every input, the second at every other input, the third at every fourth input, and so on (van den Oord et al., 2016).

In this paper, I use dilated convolutions similar to WaveNet (van den Oord et al., 2016), where each convolution has both residual and skip connections. However, instead of the gated activation function from the WaveNet paper, I used local response normalization followed by a ReLU function. This activation function was proposed by Krizhevsky, Sutskever, and Hinton (2012), and I used it because I found it to achieve equal results with faster convergence.

2.4 Residual Dilated Convolution

A residual dilated convolution is a dilated convolution with a residual connection, as shown in Figure 2. First, I take a dilated convolution of the input and a linear projection of the input. The dilated convolution and the linear projection are added together and then output. The dilated convolution also feeds a skip connection, which is eventually summed together with every other skip connection later in the network.

Figure 2: Residual Dilated Convolution

2.5 Skip Connections

In this paper, I also use the idea of skip connections from Long, Shelhamer, and Darrell (2015). Skip connections simply connect previous layers to the layer right before the output in order to fuse local and global information from across the network. In this work, the connections are all fused together with a summation followed by a ReLU activation to increase nonlinearity.
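Following the description in Sections 2.3 and 2.4 and Figure 2, one such block could be written as in the sketch below. The use of a 1x1 convolution for the linear projection, the default LRN parameters, and placing the spatial dropout inside the block are assumptions on my part.

```python
# Sketch of one residual dilated convolution block (Figure 2), assuming tf.keras.
# The 1x1 convolution used as the linear projection and the default LRN
# parameters are assumptions not pinned down in the text.
import tensorflow as tf
from tensorflow.keras import layers

def lrn(x):
    # tf.nn.local_response_normalization expects a 4-D tensor, so temporarily
    # add (and then remove) a dummy spatial axis around (batch, length, channels).
    return tf.squeeze(tf.nn.local_response_normalization(tf.expand_dims(x, 1)), 1)

def residual_dilated_block(x, dilation_rate, filters=128, kernel_size=7):
    """Returns (residual output for the next layer, skip connection)."""
    y = layers.Conv1D(filters, kernel_size, padding='causal',
                      dilation_rate=dilation_rate,
                      kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    y = layers.Lambda(lrn)(y)                    # local response normalization ...
    y = layers.Activation('relu')(y)             # ... followed by a ReLU
    y = layers.SpatialDropout1D(0.1)(y)          # 10% spatial dropout
    projection = layers.Conv1D(filters, 1)(x)    # linear projection of the input
    residual = layers.add([projection, y])       # dilated conv + linear projection
    return residual, y                           # y also feeds the summed skip path

features = layers.Input(shape=(None, 128))       # output of the previous layer
residual, skip = residual_dilated_block(features, dilation_rate=4)
```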
2.6 Attention and Self-Attention

Attention can be described as mapping a query and a set of key-value pairs to an output (Vaswani et al., 2017). Specifically, when I say attention or 'normal' attention, I am referring to Scaled Dot-Product Attention, which is computed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are matrices representing the queries, keys, and values respectively, and d_k is the dimension of the keys (Vaswani et al., 2017).

Self-attention, then, is where Q, K, and V all come from the same source vector after a linear projection. This allows each position in the vector to attend to every other position in the same vector.

2.7 Simplified and Local Attention

Simplified and local attention can both be thought of as trying to reinforce the mapping of a key to a value by extracting extra information from the key. I compute a linear transformation followed by a softmax to get the weights on the values. These weights and the initial values are multiplied together element-wise in order to highlight which of the values are the most important for solving the problem. Simplified attention can also be thought of as reinforcing a one-to-one correspondence between the key and the value.

Figure 3: Simplified Attention

Local attention is like simplified attention except that, instead of performing a linear projection on the keys, it performs a convolutional projection on the keys. This allows the network to use local information in the keys to attend to the values.

2.8 Multi-Head Attention

In multi-head attention, attention is performed multiple times on different projections of the input (Vaswani et al., 2017). In this paper, I use either one or eight heads in every experiment with attention, in order to get the best results and to compare the different methods accurately.

2.9 Model Modifications for Attention

In this paper, I tested seven different models, six of which extend the base model using some type of attention. In the models with attention, self-attention is used right after the final convolution and right before the global pooling operation.

2.10 Global Max Pooling

While CNNs support input of any size, they lack the ability to generate a fixed size output and instead output a tensor that is proportional in size to the input. In order for the output of the network to have a fixed size of 25, I use max pooling (Scherer et al., 2010) along the time dimension of the last convolutional layer. I perform the max pooling globally, which simply means that I take the maximum value over the whole time dimension instead of over a sliding window of the time dimension.
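Before turning to the experiments, the sketch below contrasts normal scaled dot-product self-attention (Equation 1) with the simplified and local variants of Section 2.7. The projection sizes, the softmax axis, and the assumption that the projection width matches the number of input channels are my reading of the description rather than details fixed in the text.

```python
# Sketch of scaled dot-product self-attention (Eq. 1) next to the simplified and
# local variants of Section 2.7, assuming tf.keras. d_model is assumed to equal
# the number of input channels so the element-wise product is well defined.
import tensorflow as tf
from tensorflow.keras import layers

def scaled_dot_product_self_attention(x, d_model=128):
    q = layers.Dense(d_model)(x)      # Q, K, V are linear projections
    k = layers.Dense(d_model)(x)      # of the same source vector
    v = layers.Dense(d_model)(x)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(tf.cast(d_model, tf.float32))
    return tf.matmul(tf.nn.softmax(scores, axis=-1), v)

def simplified_attention(x, d_model=128):
    # a linear transformation followed by a softmax gives weights on the values,
    # which are multiplied element-wise with the initial values
    weights = tf.nn.softmax(layers.Dense(d_model)(x), axis=-1)
    return weights * x

def local_attention(x, d_model=128, kernel_size=3):
    # like simplified attention, but the keys get a convolutional projection so
    # that local context in the keys decides how to attend to the values
    weights = tf.nn.softmax(
        layers.Conv1D(d_model, kernel_size, padding='same')(x), axis=-1)
    return weights * x

x = tf.random.normal([2, 50, 128])                    # (batch, length, channels)
print(scaled_dot_product_self_attention(x).shape)     # (2, 50, 128)
print(simplified_attention(x).shape)                  # (2, 50, 128)
print(local_attention(x).shape)                       # (2, 50, 128)
```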
3 Experiment and Results

In this section, I go over the ITAmoji task description and limitations, as well as my results on the task.

3.1 ITAmoji Task

This model was initially designed for the ITAmoji task at EVALITA 2018 (Ronzano et al., 2018). The goal of the task is to predict which of 25 emojis (shown in Table 1) is most likely to appear in a given Italian tweet. The provided dataset consists of 250,000 Italian tweets with one emoji label per tweet, and no additional data is allowed for training the models. However, it is allowed to use additional data to train unsupervised systems such as word embeddings. All results in the following subsections were evaluated on the test dataset of 25,000 Italian tweets provided by the organizers.

Emoji label                       % of samples
red heart                         20.28
face with tears of joy            19.86
smiling face with heart eyes       9.45
winking face                       5.35
smiling face with smiling eyes     5.13
beaming face with smiling eyes     4.11
grinning face                      3.54
face blowing a kiss                3.34
smiling face with sunglasses       2.80
thumbs up                          2.57
rolling on the floor laughing      2.18
thinking face                      2.16
blue heart                         2.02
winking face with tongue           1.93
face screaming in fear             1.78
flexed biceps                      1.67
face savoring food                 1.55
grinning face with sweat           1.52
loudly crying face                 1.49
top arrow                          1.39
two hearts                         1.36
sun                                1.28
kiss mark                          1.12
rose                               1.06
sparkles                           1.06

Table 1: Each of the 25 different emojis used in the ITAmoji task, their labels, and the corresponding percent of samples in the test dataset.

3.2 Results

Table 2 shows my official results from the ITAmoji competition, alongside the first- and second-place group scores. Table 3 shows the best result of each of the seven models I trained during the competition, evaluated after the competition was complete and selected according to the macro F1 score; it also shows the micro F1 score from the same run as the best macro F1 score, for comparison. Table 4 shows the upper and lower bounds of the F1 scores after the scores stopped increasing and plateaued.

Model                         Macro F1    Micro F1
1st Place Group               0.365       0.477
2nd Place Group               0.232       0.401
Run 3: Simplified Attention   0.106       0.294
Run 2: 1 Head Attention       0.102       0.313
Run 1: No Attention [2]       0.019       0.064

Table 2: Official results from the ITAmoji competition, as compared to the first and second place groups.

[2] Due to an off-by-one error in the conversion from network output to emoji, the official results for the no attention network are much worse than in actuality.

Model                         Macro F1    Micro F1
8 Head Attention              0.113       0.316
1 Head Attention              0.105       0.339
Local Attention               0.106       0.341
8 Head Local                  0.106       0.337
Simplified Attention          0.106       0.341
8 Head Simplified             0.109       0.308
No Attention                  0.11        0.319

Table 3: The best results from the different models on the dataset, run after the competition was over.

Model                         Macro F1        Micro F1
8 Head Attention              [0.10, 0.11]    [0.30, 0.36]
1 Head Attention              [0.09, 0.11]    [0.30, 0.36]
Local Attention               [0.10, 0.11]    [0.30, 0.35]
8 Head Local                  [0.10, 0.11]    [0.34, 0.36]
Simplified Attention          [0.10, 0.11]    [0.32, 0.36]
8 Head Simplified             [0.10, 0.11]    [0.31, 0.36]
No Attention                  [0.10, 0.11]    [0.30, 0.36]

Table 4: The upper and lower bounds of the F1 scores of the different model types after the scores have plateaued in training and started oscillating.

While 8 head attention did outperform the 8 head local and simplified models, it is interesting to note that this is not the case for the 1 head versions. Additionally, the bounds of the scores overlap substantially, so there are no statistically significant gains for one method over another. This result, along with my comparatively worse scores overall, is probably because the max pooling at the end of my model was throwing away too much information in order to make the output size consistent.
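For reference, the macro and micro F1 scores reported above can be computed as in the following sketch, which uses scikit-learn and toy labels and is not the official ITAmoji evaluation script.

```python
# Sketch of the macro / micro F1 computation used to compare models.
# Integer emoji indices in [0, 24] are assumed; the official scorer may differ.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0]   # gold emoji indices (toy example)
y_pred = [0, 2, 2, 1, 1]   # predicted emoji indices

macro_f1 = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average='micro')  # computed from global TP/FP/FN counts
print(macro_f1, micro_f1)
```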
4 Discussion

In the upcoming sections, I discuss a possible problem with the design of my models and propose a few solutions to that problem. I further discuss the two new modifications of attention that I proposed and their possible uses.

4.1 Loss of Information While Pooling

For the problem of throwing away too much information during the pooling or downsampling phase, there are three main approaches that could be explored, each with its positives and negatives.

The first approach is to fix the size of the input and use fully connected layers or similar approaches to find the correct output. This is the approach currently taken by most researchers, and it has shown good results. The main negative here is that the input size must be fixed, and fixing the input size can mean throwing away information or adding information that is not naturally there.

The second approach is to use a recurrent neural network unit such as an LSTM or a GRU, with size equal to the output size, to parse the result and output single values for the final sequence. This would probably lead to better results but is going to be slower than the other approaches.

The last approach is to use convolutional layers with a large kernel size and stride (e.g., stride equal to the size of the kernel). This would allow the network to shrink the output size naturally, and it would be faster than using an LSTM. The issue here is that in order to maintain the property that the network can accept any input size, pooling or some other method of downsampling still has to be used, potentially throwing away useful data.

4.2 Potential Uses of Simplified and Local Attention

While the original idea behind simplifying attention in the manner presented in this paper was to reduce computational cost and encourage easier learning by enforcing a softmax distribution on the data, there did not seem to be any benefit in doing so. In most cases the computational cost of a couple of matrix multiplications versus an element-wise product is negligible, so it would usually be better to just apply normal attention in those cases, as it already covers the case of simplified attention in its implementation.

Similarly, it does not necessarily make sense to use local attention instead of normal attention for small input sizes. Instead, it might make sense to switch out the linear projection on the queries and keys in normal attention for a convolutional projection but otherwise perform the scaled dot-product attention normally. This could be useful if the problem being approached needs to map patterns to values instead of mapping values to values. One could of course extend this even further by also performing a convolutional projection on the values in order to map local patterns to other local patterns, and so on.

On the other hand, the local attention suggested in this paper could be useful in neural networks used for images and other large data, where it might not make sense to attend over the whole input. This is especially true in the initial layers of such networks, where the neurons are only looking at a small section of the input in the first place. Beyond the smaller memory demands compared to normal attention, local attention could be useful in these layers because it provides a method to naturally figure out which patterns are important at these early layers.

Of course, an alternative to local attention is to just take small patches of the image and apply the original formulation of scaled dot-product attention to get similar results. This idea was originally suggested as future work in Vaswani et al. (2017).
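As a concrete illustration of the second and third approaches from Section 4.1, the final global max pooling could be replaced with one of the heads sketched below. The layer sizes and the kernel and stride choices are assumptions; both heads accept a variable-length (batch size, input length, 25) input.

```python
# Sketches of alternative heads to replace the final global max pooling, following
# the second and third approaches in Section 4.1. Sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

features = layers.Input(shape=(None, 25))   # variable-length output of the last convolution

# Second approach: a recurrent unit with size equal to the output size reads the
# whole sequence and emits a single fixed-size vector.
lstm_head = layers.LSTM(25)(features)

# Third approach: a large-kernel, large-stride convolution shrinks the time
# dimension, although some pooling is still needed to fix the final size.
shrunk = layers.Conv1D(25, kernel_size=8, strides=8, padding='same')(features)
conv_head = layers.GlobalMaxPooling1D()(shrunk)

heads = tf.keras.Model(features, [lstm_head, conv_head])
```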
5 Conclusion

In this work I presented simplified and local attention and tested these methods against similar models with normal attention and without any kind of attention at all. I also introduced a new strategy for classifying data of any input size with fully convolutional networks.

The new model design was not without its own flaws, as it showed poor results for all modifications of the method. The poor results were probably due to the final pooling layer throwing away too much information. A better method would be to use LSTMs or specially designed convolutions in order to shrink the output to the correct size.

Future work will include further exploration of simplified and local attention to get a grasp of which tasks they are good at and where, if anywhere, they show better efficiency or results than normal attention. I will also further explore the new strategy for classification of any sized input with fully convolutional models and see what I can change and update in order to improve the results of the model.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. and Kudlur, M., 2016. TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).

Chollet, F., 2015. Keras.

Conneau, A., Schwenk, H., Barrault, L. and Lecun, Y., 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

Ronzano, F., Barbieri, F., Wahyu Pamungkas, E., Patti, V. and Chiusaroli, F., 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAmoji) task. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018) & Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018).

Scherer, D., Müller, A. and Behnke, S., 2010. Evaluation of pooling operations in convolutional architectures for object recognition. In Artificial Neural Networks - ICANN 2010 (pp. 92-101). Springer, Berlin, Heidelberg.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y. and Bregler, C., 2015. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 648-656).

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W. and Kavukcuoglu, K., 2016. WaveNet: A generative model for raw audio. In SSW (p. 125).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

Zhang, X., Zhao, J. and LeCun, Y., 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (pp. 649-657).