Hyperparameter Tuning for Deep Learning in Natural Language Processing

Ahmad Aghaebrahimian
Zurich University of Applied Sciences, Switzerland
agha@zhaw.ch

Mark Cieliebak
Zurich University of Applied Sciences, Switzerland
ciel@zhaw.ch

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Deep Neural Networks have advanced rapidly over the past several years. However, it still seems like a black art for many people to make use of them efficiently. The reason for this complexity is that obtaining consistent and outstanding results from a deep architecture requires optimizing many parameters known as hyperparameters. Hyperparameter tuning is an essential task in deep learning and can lead to significant changes in network performance. This paper is the essence of over 3000 GPU hours spent optimizing a network for a text classification task over a wide array of hyperparameters. We provide a list of hyperparameters to tune, together with the impact of tuning each of them on network performance. The hope is that such a listing will give interested researchers a means to prioritize their efforts and to modify their deep architectures to get the best performance with the least effort.

1 Introduction

The application of Deep Neural Networks (DNN) such as Convolutional Neural Networks (CNN) (LeCun et al., 1989) or Recurrent Neural Networks (RNN) (Rumelhart et al., 1986) and their variants (e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014)) has accelerated since the beginning of this decade, partly due to the abundance of data available for training.
Over the past several years, DNNs have found their way into many areas of Artificial Intelligence (AI), such as image processing or Natural Language Processing (NLP), and have yielded superior performance in almost all of them. However, a DNN comes with a series of hyperparameters that need to be tuned if one expects to obtain state-of-the-art or even better results with it. Some of these hyperparameters, such as the number of layers or the number of neurons per layer, are bound directly to the deep neural architecture, while others, such as the drop-out rate, are independent of the architecture. In addition to these hyperparameters, there are other network choices, such as the classifier type, that affect network performance to a large extent. Our list of parameters to tune includes both of these hyperparameters and network choices. Since none of these parameters, including network choices and hyperparameters, can be learned within the network directly, from now on we use the term hyperparameter to refer to both.

Recognizing the best choice of hyperparameters is often such a cumbersome process that some people consider it a "black art" (Snoek et al., 2012). The scarcity of proper research on the impact of these parameters on network performance often leads to a great deal of wasted time, especially for younger researchers with little experience. In this paper, we adopt a state-of-the-art multi-label classifier to investigate the impact of 12 categories of hyperparameters on the task of multi-label text classification. The task in multi-label text classification is to assign one or more labels to each text.

Word embedding types, word embedding sizes, word embedding updating, character embeddings, deep architectures (CNN, LSTM, GRU), optimizers, gradient control, classifiers, drop out, deep vs. wide networks, and pooling are the settings studied in this work. To make the experiment manageable, several groups of these parameters are set on individual grids, which serves as an ad-hoc grid search scheme for finding the most promising hyperparameters by focusing on the most promising region of the search space.

We provide the reader with an insight into the impact of each hyperparameter on this specific task. The study was performed by running over 400 different configurations in over 3000 GPU hours. The contribution of this work is to provide a prioritized list of hyperparameters to optimize.

2 Related Work

Hyperparameter tuning is often performed using grid search (brute force), where all possible combinations of the hyperparameters and their values form a grid, and an algorithm is trained for each combination. However, this method becomes computationally infeasible already for small numbers of hyperparameters. For instance, in our study with 12 categories of hyperparameters, each with four instances on average, we would have a grid with several million nodes, which would be far too expensive to explore exhaustively. To address this issue, Bergstra et al. (2013) proposed a method for randomized parameter tuning and showed that for each of their datasets there are only a few impactful parameters for which more values should be tried. However, due to the random mechanism of this approach, each trial is independent of the others; hence, it does not learn anything from the other experiments. To address this problem, Snoek et al. (2012) proposed a Bayesian optimization method that uses a statistical model to map hyperparameters to an objective function. However, Bayesian optimization adds another layer of complexity to the problem, and the method has therefore not gained much popularity since its proposal.

The most effective and straightforward method for hyperparameter tuning is still ad-hoc grid search (Hutter et al., 2015), where the researcher manually tries the most correlated parameters on the same grid to gradually and iteratively find the most impactful set of hyperparameters with the best values. A minimal sketch of this scheme is given below.
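To illustrate the ad-hoc grid search scheme, the following sketch groups a few correlated parameters on one grid, trains one model per combination, and keeps the best setting before moving on to the next grid. It assumes a hypothetical build_and_evaluate(config) helper that trains a model and returns its validation F1 (Micro); the parameter names and values are illustrative, not the exact grids used in this paper.

```python
# A minimal sketch of ad-hoc grid search over one group of correlated
# parameters; build_and_evaluate is a hypothetical training helper.
from itertools import product

def ad_hoc_grid_search(build_and_evaluate):
    # Only the most correlated parameters share a grid (here: embedding
    # choices); all other hyperparameters stay fixed during this pass.
    grid = {
        "embedding_type": ["word2vec", "glove-840", "elmo"],
        "embedding_size": [100, 300],
        "trainable_embeddings": [False, True],
    }
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = build_and_evaluate(config)  # e.g. validation F1 (Micro)
        if score > best_score:
            best_score, best_config = score, config
    # The winning setting of this grid is then frozen and carried over
    # to the next grid (architecture, optimizer, ...).
    return best_config, best_score
```

In practice, each subsection in Section 4 corresponds to one such grid: the winning configuration of one grid is kept fixed while the next group of parameters is explored.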
3 Multi-Label Classification

Multi-label text classification is the task of assigning one or more labels to each text. News classification is an example of such a task. For this task, we adopted a state-of-the-art architecture for multi-label classification (Aghaebrahimian and Cieliebak, 2019). The schema of the model is illustrated in Figure 1.

Figure 1: The system architecture.

The architecture consists of two channels of bi-GRU deep structures with an attention mechanism and a dense sigmoid layer on top. The illustrated schema is the optimized network which produced the best results for the task. One channel is devoted to the most informative words for each class, which are extracted using the χ2 method; the other channel is used for the input tokens. For more information about the architecture, please refer to Aghaebrahimian and Cieliebak (2019).

The dataset used for this experiment is a proprietary dataset with roughly 60K articles and a total of 28 labels. The dataset contains about 250K distinct words and assigns 2.5 labels to each article on average. It is randomly divided into 80%, 10%, and 10% parts for training, validation, and testing, respectively.

The textual data is preprocessed by removing non-alphanumeric characters and replacing numeric values with a unique symbol. The resulting strings are tokenized and truncated to 3K tokens. Shorter texts are padded with 0 so that all texts have the same length.

Two measures are used for evaluation. F1 (Micro) is used as a measure of performance; it is computed by calculating the F1 score for each article and averaging these scores over all articles in the test data. The second metric, Epochs, is reported as a measure of the time required for a network with a specific setting to converge. Early stopping is used as the criterion for convergence: training stops when no decrease in validation loss is observed for three consecutive epochs. All models are trained in batches of 64 instances. The sketch below illustrates this protocol.
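The following sketch shows the training and evaluation protocol described above, assuming a Keras-style model object (the paper does not specify a framework, so this is an illustration rather than the authors' code). The 0.5 decision threshold is an assumption; note that the per-article averaging described above corresponds to average="samples" in scikit-learn, while average="micro" pools all label decisions.

```python
# A minimal sketch of the evaluation protocol: early stopping with
# patience 3 on validation loss, batch size 64, and F1 (Micro) on the
# thresholded multi-label predictions.
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import EarlyStopping

def train_and_evaluate(model, x_train, y_train, x_val, y_val, x_test, y_test):
    early_stop = EarlyStopping(monitor="val_loss", patience=3,
                               restore_best_weights=True)
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=64, epochs=100,
                        callbacks=[early_stop])
    # Threshold each sigmoid output at 0.5 (an assumed threshold).
    y_pred = (model.predict(x_test) >= 0.5).astype("int32")
    f1_micro = f1_score(y_test, y_pred, average="micro")
    epochs_to_converge = len(history.history["loss"])
    return f1_micro, epochs_to_converge
```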
4 Experimental results

There are 12 categories of hyperparameters that are tuned in this study. Some of the hyperparameters, such as the deep architecture or the classifier type, are network choices, while others, such as the embedding type or the dropout rate, are variables pertaining to different parts of the network. The results of hyperparameter optimization for each criterion are reported in the following subsections.

All parameters except the parameter under investigation in each experiment are kept constant. All other parameters that are not part of this study, such as the random seed or the batch size, are also kept constant throughout all experiments.

4.1 Word Embeddings Grid

In this grid, we tune the word embedding type, the size, and the method of updating. Low-dimensional dense word vectors known as word embeddings have proven to be highly effective for representing words and often lead to significantly better performance (Collobert et al., 2011). Depending on the method used for their training, they can provide different levels of syntactic and semantic information about each word. Many factors can affect the quality of word embeddings, including the data on which they were trained, their number of dimensions, their domain, and the preprocessing steps involved in the training. We investigated five widely studied pre-trained word embeddings: Word2Vec (Mikolov et al., 2013) trained on the Google News dataset with 100 billion tokens, Glove (Pennington et al., 2014) with three variants (one trained on Wikipedia with 6 billion tokens and two others trained on the Common Crawl, one with 42 and the other with 840 billion tokens), FastText (Bojanowski et al., 2016), dependency-based embeddings (Levy and Goldberg, 2014), and ELMo (Peters et al., 2018). As shown in Table 1, the Glove embeddings trained on the Common Crawl yield significantly better results than the other embeddings except for ELMo. ELMo and Glove-840 yield roughly similar results; however, due to the much larger word vector size of ELMo, it is much more computationally expensive and takes much longer to converge.

  Word embedding type                    Epochs   Results
  Word2Vec (Mikolov et al., 2013)        26       81.9 %
  Glove-6 (Pennington et al., 2014)      25       81.7 %
  Glove-42 (Pennington et al., 2014)     26       82.9 %
  Glove-840 (Pennington et al., 2014)    29       84.5 %
  FastText (Bojanowski et al., 2016)     24       79.2 %
  Dependency (Levy and Goldberg, 2014)   22       81.4 %
  ELMo (Peters et al., 2018)             32       84.6 %

Table 1: Embedding type tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Each pre-trained embedding comes with a specific vector size. The Glove embeddings are available as 50-, 100-, 200-, and 300-dimensional word vectors, ELMo provides 1024-dimensional vectors, and all other embeddings come with 300-dimensional word vectors. The results of size tuning are reported in Table 2. Except for the 50-dimensional vectors, which are sub-optimal, all other dimensions yield superior results with a negligible difference in the number of Epochs.

  Word embedding size   Epochs   Results
  50                    22       81.8 %
  100                   25       82.9 %
  200                   27       83.6 %
  300                   29       84.3 %
  1024                  32       84.6 %

Table 2: Embedding size tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Word embeddings provide a means of transfer learning: word vectors are initially learned on a large dataset containing several billion tokens and are afterwards fine-tuned on a smaller dataset for a specific task. This mechanism can be controlled by keeping the word vectors frozen or fine-tuning them during training. Depending on the size of the dataset on which the word embeddings are being refined, updating them can improve the performance. However, as observed in Table 3, fine-tuning the word vectors yielded no significant improvement over the original pre-trained ones, since our dataset was not large enough.

  Word embedding updating   Epochs   Results
  Disabled                  29       84.3 %
  Enabled                   31       84.5 %

Table 3: Embedding update method tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).
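As an illustration of how these three settings (type, size, updating) enter a model, the following sketch builds an embedding layer from pre-trained vectors, assuming TensorFlow/Keras and a hypothetical glove_index dictionary mapping tokens to NumPy vectors (e.g., loaded from a GloVe file); it is not the authors' exact implementation.

```python
# A minimal sketch: initialize an embedding layer with pre-trained vectors
# and choose between frozen ("Disabled" in Table 3) and fine-tuned
# ("Enabled") embeddings via the trainable flag.
import numpy as np
import tensorflow as tf

def build_embedding_layer(word_index, glove_index, dim=300, trainable=False):
    # Row i holds the pre-trained vector of the word with index i;
    # out-of-vocabulary words keep a zero vector.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = glove_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0],
        output_dim=dim,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=trainable,
    )
```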
4.2 Character embedding

Word-level features are not the only features used in text analytics. Character-level features are also reported to improve model performance, especially in tasks such as Named Entity Recognition (NER) (Akbik et al., 2018) or Part-Of-Speech (POS) tagging (Anastasiev et al., 2018), where knowing the function of individual character groups such as prefixes, suffixes, or even infixes is beneficial.

We used two different character encoding mechanisms, one CNN-based (Ma and Hovy, 2016) and the other LSTM-based (Lample et al., 2016), to investigate the impact of character-level features on network performance. As we expected, using character-level features had no added value in our label classification task, where labels are bound to words and their syntactic and semantic attributes rather than to their characters.

Character embeddings and the best of the embeddings grid were tuned on the same grid. This means that in this grid we disregard the sub-optimal settings of the embeddings grid and only keep its winning setting. Given the winning setting, we tune the character embedding settings to investigate the impact of character embeddings (Table 4).

  Character embedding                  Epochs   Results
  Disabled                             29       84.3 %
  Enabled-CNN (Ma and Hovy, 2016)      31       84.7 %
  Enabled-LSTM (Lample et al., 2016)   36       84.8 %

Table 4: Character embedding tuning results. Character embeddings and the best of embeddings are in the same grid search (14 configurations).

4.3 Deep architectures

The choice of deep architecture, either a Convolutional Neural Network (CNN) (LeCun et al., 1989) or a variant of Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014), can have a huge effect on the performance of a model.

The deep architecture type, the number of deep layers, and the number of units in each layer, as well as the optimizers, are highly dependent on each other. Therefore, we optimize all of them on the same grid with 270 different configurations. For the CNN model, we adapted the model of Kim (2014), and for the RNN models, we used both the LSTM and GRU variants as single and bidirectional architectures. As seen in Table 5, although the CNN models converge faster than the RNNs, they cannot beat the RNNs' performance. Among the RNN models, the bidirectional GRU yields significantly better results.

  Deep architecture                          Epochs   Results
  LSTM (Hochreiter and Schmidhuber, 1997)    30       78.2 %
  Bi-LSTM                                    37       82.9 %
  GRU (Cho et al., 2014)                     21       79.8 %
  Bi-GRU                                     29       84.3 %
  CNN (single channel) (Kim, 2014)           18       81.7 %
  CNN (double channel) (Kim, 2014)           23       82.5 %

Table 5: Deep architecture tuning results. Deep architectures, deep and wide networks, and optimizers are in the same grid (270 configurations).

4.4 Deep vs. wide networks

Using more deep layers and more units in each layer has been beneficial in some tasks. Adding more layers helps in more complex tasks to generate more layers of abstraction, while adding more units to each layer contributes to generating more features. Still, adding extra layers in depth and width without enough training data usually leads to overfitting. In all of our configurations, we got the best performance with 128 units per layer and only one layer in depth (Table 6).

  Deep vs. wide network   Epochs   Results
  Deep-1                  29       84.3 %
  Deep-2                  26       83.7 %
  Deep-3                  18       74.6 %
  Wide-64                 30       82.9 %
  Wide-128                29       84.3 %
  Wide-256                25       83.5 %

Table 6: Deep and wide network tuning results. Deep and wide networks, deep architectures, and optimizers are in the same grid (270 configurations).

4.5 Optimizer

The job of an optimizer is to minimize the loss of the objective function. Gradient-based methods in general, and Stochastic Gradient Descent (SGD) in particular, are among the most widely used classes of optimizers for minimizing objective functions in machine learning. Due to the high sensitivity of SGD to the learning rate, other optimizers such as Adagrad (Duchi et al., 2011), RMSProp (Hinton, 2012), Adam (Kingma and Ba, 2015), and Nadam (Dozat, 2015) have been proposed in recent years. In all our configurations, we got the best performance using Adam. Nadam yields almost the same performance while converging faster (Table 7).

  Optimizer                      Epochs   Results
  SGD                            22       78.4 %
  Adagrad (Duchi et al., 2011)   25       82.7 %
  RMSProp (Hinton, 2012)         27       83.9 %
  Adam (Kingma and Ba, 2015)     29       84.3 %
  Nadam (Dozat, 2015)            24       84.2 %

Table 7: Optimizer tuning results. Optimizers, deep and wide networks, and deep architectures are in the same grid (270 configurations).
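The following sketch puts the winning choices of this grid together: a single bidirectional GRU layer with 128 units, trained with Adam. It assumes TensorFlow/Keras and the embedding layer from the previous sketch, and it is a simplified single-channel illustration rather than the authors' exact two-channel network.

```python
# A minimal sketch of the best-performing recurrent block found in this
# grid: one Bi-GRU layer ("Deep-1") with 128 units ("Wide-128").
import tensorflow as tf

def build_bigru_encoder(embedding_layer, max_len=3000, units=128):
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    x = embedding_layer(inputs)
    # return_sequences=True keeps per-token states so that pooling
    # (Section 4.6) can be applied on top of the recurrent outputs.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True))(x)
    return tf.keras.Model(inputs, x)

# Adam gave the best score in Table 7; Nadam converged faster with
# almost the same result, so either is a reasonable default.
optimizer = tf.keras.optimizers.Adam()
```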
4.6 Pooling

Either in a CNN after the convolutional filters or in an RNN after the recurrent layers, pooling has proven to be a useful tool for extracting the most relevant features for a given task. We investigated three types of pooling, namely average pooling, max pooling, and the concatenation of both, together with the best of the optimizer configurations on the same grid with 15 settings. The results are reported in Table 8 and show that using both yields the best performance for our task.

  Pooling   Epochs   Results
  Average   29       83.2 %
  Max       29       83.5 %
  Both      29       84.2 %

Table 8: Pooling tuning results. Pooling and the best of optimizers are in the same grid search (15 configurations).

4.7 Gradient control

The derivatives computed during backpropagation at training time in a DNN with many layers get smaller and smaller to the point of vanishing. This is particularly true for RNNs, which have a large number of layers when unrolled over time, and it makes training difficult and time-consuming. There are two widely practiced mechanisms to address this issue, known as gradient vanishing: gradient clipping (Mikolov, 2012) and gradient normalization (Pascanu et al., 2013). We set the gradient control mechanism with the best of the deep architectures from Sub-section 4.3 on the same grid with 18 configurations. In all of these configurations, we got better results using gradient normalization (Table 9).

  Gradient control                       Epochs   Results
  Disabled                               28       82.9 %
  Clipping (Mikolov, 2012)               31       83.1 %
  Normalization (Pascanu et al., 2013)   29       84.2 %

Table 9: Gradient control tuning results. Gradient control and the best of deep architectures are in the same grid search (18 configurations).

4.8 Classifier

The last layer in a classification model is considered the most crucial layer, since all the computed features are projected in this layer to their appropriate classes. Therefore, the choice of this layer has an essential impact on network performance. This choice is highly dependent on the assumptions we make about the task at hand. If the labels are independently distributed, the Sigmoid and the Softmax yield better results, while if they are conditioned on their adjacent labels (e.g., POS tagging), a Conditional Random Field (CRF) (Lafferty et al., 2001) works better. If we expect a multinomial distribution over the labels, the Softmax is the best classifier to choose, while if we expect a Bernoulli distribution, the Sigmoid is the right choice. All of these considerations follow from the assumptions behind each of these statistical functions.

We investigated the performance of these three classifiers with the best of the deep architectures from Sub-section 4.3 on the same grid with 18 configurations. As observed in the results presented in Table 10, the Sigmoid obtains statistically significantly better results than the two other functions. As expected, due to the independence among the labels of different samples, the CRF did not perform very well. Likewise, due to the freedom among the labels within each sample, the Softmax also performed poorly.

  Classifier   Epochs   Results
  Softmax      30       78.4 %
  Sigmoid      29       84.2 %
  CRF          31       77.1 %

Table 10: Classifier tuning results. The classifiers and the best of deep architectures are in the same grid search (18 configurations).
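The sketch below ties together the choices from Tables 8-10 on top of the Bi-GRU encoder from the previous sketch, again assuming TensorFlow/Keras: concatenated max- and average-pooling, a sigmoid output unit per label (28 labels, as in our dataset) with binary cross-entropy, and a gradient-norm cap in the spirit of the normalization of Pascanu et al. (2013). The clipping threshold is an assumption, not a value reported in the paper.

```python
# A minimal sketch of pooling, classifier, and gradient control choices.
import tensorflow as tf

def build_classifier(encoder, n_labels=28):
    inputs = encoder.input
    sequence = encoder.output                      # (batch, tokens, features)
    # Pooling: concatenating max- and average-pooled features ("Both")
    # worked best in Table 8.
    pooled = tf.keras.layers.Concatenate()([
        tf.keras.layers.GlobalMaxPooling1D()(sequence),
        tf.keras.layers.GlobalAveragePooling1D()(sequence),
    ])
    # Classifier: independent labels, so one sigmoid unit per label with a
    # binary cross-entropy loss (rather than softmax or CRF).
    outputs = tf.keras.layers.Dense(n_labels, activation="sigmoid")(pooled)
    model = tf.keras.Model(inputs, outputs)
    # Gradient control: cap the gradient norm (threshold assumed here).
    model.compile(optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
                  loss="binary_crossentropy")
    return model
```

A usage sketch: model = build_classifier(build_bigru_encoder(embedding_layer)) followed by the train_and_evaluate helper from Section 3.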
4.9 Drop out value

Deep neural networks tend to memorize or overfit, which is not a desirable behavior, since we are mostly interested in the ability of the network to generalize. Drop out (Srivastava et al., 2014) is an effective tool for enhancing generalizability. The first technique, known as simple or naive drop out, was proposed as a mechanism that randomly removes the connections between deep layers. Gal and Ghahramani (2016) proposed a new drop out mechanism called variational drop out, which improves on simple drop out by defining static masks for removing the connections between deep layers ('inter-layer') as well as between the units inside deep layers ('intra-layer'). We placed the drop out methods with the best of the deep architectures from Sub-section 4.3 on the same grid with 90 configurations. The results are reported in Table 11 and Table 12. As expected, the configuration with both inter- and intra-layer variational masks yields the best performance.

  Drop out value   Epochs   Results
  Disabled         24       80.2 %
  Simple 0.2       26       83.2 %
  Simple 0.5       27       83.8 %
  Simple 0.7       29       81.5 %
  Variational      32       84.2 %

Table 11: Simple drop out tuning results. The drop out and the best of deep architectures are in the same grid search (90 configurations).

  Variational drop out method   Epochs   Results
  Inter                         31       83.5 %
  Intra                         30       83.2 %
  Both                          32       84.2 %

Table 12: Variational drop out value tuning results.
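As an illustration, the sketch below adds drop out to the recurrent block, assuming TensorFlow/Keras. In Keras, the dropout and recurrent_dropout arguments of a GRU reuse the same mask at every time step, which roughly corresponds to the variational scheme of Gal and Ghahramani (2016) and to the 'inter-layer' and 'intra-layer' masks discussed above; the rates shown are illustrative, not the exact values tuned in Tables 11 and 12.

```python
# A minimal sketch of variational-style drop out on a Bi-GRU layer.
import tensorflow as tf

gru_with_variational_dropout = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(
        128,
        return_sequences=True,
        dropout=0.5,            # mask on the layer inputs ("inter-layer")
        recurrent_dropout=0.5,  # mask on the recurrent state ("intra-layer")
    )
)
```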
5 Conclusion

In this study, we investigated various settings for a Deep Neural Network for multi-label classification. Considering the characteristics of the dataset and the task, we observed the following results. Using Sigmoid in the last layer yields statistically significantly better results than CRF or Softmax. The Glove embeddings (Pennington et al., 2014) with more than 100-dimensional vectors and without updating yield statistically significantly better results than the other word vectors. Compared to the other deep architectures, Bi-GRU yields better results when it is used as a single layer with 128 units. Adam and Nadam obtain roughly the same results, while Nadam converges much faster. Pooling is best used as the concatenation of max- and average-pooled tensors, and it is better to use normalization (Pascanu et al., 2013) as a means of gradient control against gradient vanishing. It is also good practice to use variational drop out (Gal and Ghahramani, 2016) both between layers and inside recurrent units to control overfitting. Finally, we did not observe any improvement from using character embeddings.

The order in which these parameters are mentioned reflects the magnitude of their importance for the final performance. Parameters not mentioned here did not have any noticeable impact on the system results.

References

Ahmad Aghaebrahimian and Mark Cieliebak. 2019. Towards integration of statistical hypothesis tests into deep neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Florence, Italy.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA.

D. G. Anastasiev, I. O. Gusev, and E. M. Indenbom. 2018. Improving part-of-speech tagging via multi-task learning and character-level word representations. In Proceedings of the International Conference Dialogue, Computational Linguistics and Intellectual Technologies.

James Bergstra, Daniel Yamins, and David D. Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML). Atlanta, GA, USA.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Timothy Dozat. 2015. Incorporating Nesterov momentum into Adam.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., USA.

Geoffrey Hinton. 2012. Neural networks for machine learning. Lecture 6a: Overview of mini-batch gradient descent.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.

Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. 2015. Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1064–1074.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov. 2012. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML).

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS). USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.