Hyperparameter Tuning for Deep Learning in Natural Language Processing

Ahmad Aghaebrahimian
Zurich University of Applied Sciences, Switzerland
agha@zhaw.ch

Mark Cieliebak
Zurich University of Applied Sciences, Switzerland
ciel@zhaw.ch

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Deep Neural Networks have advanced rapidly over the past several years. However, it still seems like a black art for many people to make use of them efficiently. The reason for this complexity is that obtaining consistent and outstanding results from a deep architecture requires optimizing many parameters known as hyperparameters. Hyperparameter tuning is an essential task in deep learning and can lead to significant changes in network performance. This paper is the essence of over 3000 GPU hours spent optimizing a network for a text classification task over a wide array of hyperparameters. We provide a list of hyperparameters to tune, together with the impact of tuning each of them on network performance. The hope is that such a listing will give interested researchers a means to prioritize their efforts and to modify their deep architectures to get the best performance with the least effort.

1 Introduction

The application of Deep Neural Networks (DNN) such as Convolutional Neural Networks (CNN) (LeCun et al., 1989) or Recurrent Neural Networks (RNN) (Rumelhart et al., 1986) and their variants (e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014)) has accelerated since the beginning of this decade, partly due to the abundance of data available for training.
Over the past several years, DNNs have found their way into many areas of Artificial Intelligence (AI), such as image processing or Natural Language Processing (NLP), and have yielded superior performance in almost all of them. However, a DNN comes with a series of hyperparameters that need to be tuned if one expects to obtain state-of-the-art or even better results with it. Some of these hyperparameters, such as the number of layers or the number of neurons per layer, are bound directly to the deep neural architecture, while others, such as the drop-out rate, are independent of the architecture. In addition to these hyperparameters, there are other network choices, such as the classifier type, that affect network performance to a large extent. Our list of parameters to tune includes both of these hyperparameters and network choices. Since none of these parameters, including network choices and hyperparameters, can be learned within the network directly, from now on we use the term hyperparameter to refer to both.

Recognizing the best choice of hyperparameters is often such a cumbersome process that some people consider it a "black art" (Snoek et al., 2012). The scarcity of proper research on the impact of these parameters on network performance often leads to a great deal of wasted time, especially for younger researchers with little experience. In this paper, we adopt a state-of-the-art multi-label classifier to investigate the impact of 12 categories of hyperparameters on the task of multi-label text classification. The task in multi-label text classification is to assign one or more labels to each text.

Word embedding types, word embedding sizes, word embedding updating, character embeddings, deep architectures (CNN, LSTM, GRU), optimizers, gradient control, classifiers, drop out, deep vs. wide networks, and pooling are the settings studied in this work. To make the experiment manageable, several groups of these parameters are set on individual grids, which serves as an ad-hoc grid search scheme for finding the most promising hyperparameters by focusing on the most promising region of the search space.

We provide the reader with an insight into the impact of each hyperparameter on this specific task. The study was performed by running over 400 different configurations in over 3000 GPU hours. The contribution of this work is to provide a prioritized list of hyperparameters to optimize.

2 Related Work

Hyperparameter tuning is often performed using grid search (brute force), where all possible combinations of the hyperparameters and their values form a grid, and an algorithm is trained for each combination. However, this method becomes computationally infeasible already for small numbers of hyperparameters. For instance, in our study with 12 categories of hyperparameters, each with four instances on average, we would have a grid with several million nodes, which would be far too expensive to explore exhaustively. To address this issue, Bergstra et al. (2013) proposed a method for randomized parameter tuning and showed that for each of their datasets there are only a few impactful parameters for which more values should be tried. However, due to the random mechanism of this approach, each trial is independent of the others; hence, it does not learn anything from the other experiments. To address this problem, Snoek et al. (2012) proposed a Bayesian optimization method that uses a statistical model to map hyperparameters to an objective function. However, Bayesian optimization adds another layer of complexity to the problem, and the method has therefore not gained much popularity since its proposal.

The most effective and straightforward method for hyperparameter tuning is still ad-hoc grid search (Hutter et al., 2015), where the researcher manually tries the most correlated parameters on the same grid to gradually and iteratively find the most impactful set of hyperparameters with the best values. A minimal sketch of this scheme is given below.
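To illustrate the ad-hoc grid search scheme, the following sketch groups a few correlated parameters on one grid, trains one model per combination, and keeps the best setting before moving on to the next grid. It assumes a hypothetical build_and_evaluate(config) helper that trains a model and returns its validation F1 (Micro); the parameter names and values are illustrative, not the exact grids used in this paper.

```python
# A minimal sketch of ad-hoc grid search over one group of correlated
# parameters; build_and_evaluate is a hypothetical training helper.
from itertools import product

def ad_hoc_grid_search(build_and_evaluate):
    # Only the most correlated parameters share a grid (here: embedding
    # choices); all other hyperparameters stay fixed during this pass.
    grid = {
        "embedding_type": ["word2vec", "glove-840", "elmo"],
        "embedding_size": [100, 300],
        "trainable_embeddings": [False, True],
    }
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = build_and_evaluate(config)  # e.g. validation F1 (Micro)
        if score > best_score:
            best_score, best_config = score, config
    # The winning setting of this grid is then frozen and carried over
    # to the next grid (architecture, optimizer, ...).
    return best_config, best_score
```

In practice, each subsection in Section 4 corresponds to one such grid: the winning configuration of one grid is kept fixed while the next group of parameters is explored.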
3 Multi-Label Classification

Multi-label text classification is the task of assigning one or more labels to each text. News classification is an example of such a task. For this task, we adopted a state-of-the-art architecture for multi-label classification (Aghaebrahimian and Cieliebak, 2019). The schema of the model is illustrated in Figure 1.

Figure 1: The system architecture.

The architecture consists of two channels of bi-GRU deep structures with an attention mechanism and a dense sigmoid layer on top. The illustrated schema is the optimized network which produced the best results for the task. One channel is devoted to the most informative words for each class, which are extracted using the χ2 method; the other channel is used for the input tokens. For more information about the architecture, please refer to Aghaebrahimian and Cieliebak (2019).

The dataset used for this experiment is a proprietary dataset with roughly 60K articles and a total of 28 labels. The dataset contains about 250K distinct words and assigns 2.5 labels to each article on average. It is randomly divided into 80%, 10%, and 10% parts for training, validation, and testing, respectively.

The textual data is preprocessed by removing non-alphanumeric characters and replacing numeric values with a unique symbol. The resulting strings are tokenized and truncated to 3K tokens. Shorter texts are padded with 0 so that all texts have the same length.

Two measures are used for evaluation. F1 (Micro) is used as a measure of performance; it is computed by calculating the F1 score for each article and averaging these scores over all articles in the test data. The second metric, Epochs, is reported as a measure of the time required for a network with a specific setting to converge. Early stopping is used as the criterion for convergence: training stops when no decrease in validation loss is observed for three consecutive epochs. All models are trained in batches of 64 instances. The sketch below illustrates this protocol.
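The following sketch shows the training and evaluation protocol described above, assuming a Keras-style model object (the paper does not specify a framework, so this is an illustration rather than the authors' code). The 0.5 decision threshold is an assumption; note that the per-article averaging described above corresponds to average="samples" in scikit-learn, while average="micro" pools all label decisions.

```python
# A minimal sketch of the evaluation protocol: early stopping with
# patience 3 on validation loss, batch size 64, and F1 (Micro) on the
# thresholded multi-label predictions.
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import EarlyStopping

def train_and_evaluate(model, x_train, y_train, x_val, y_val, x_test, y_test):
    early_stop = EarlyStopping(monitor="val_loss", patience=3,
                               restore_best_weights=True)
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=64, epochs=100,
                        callbacks=[early_stop])
    # Threshold each sigmoid output at 0.5 (an assumed threshold).
    y_pred = (model.predict(x_test) >= 0.5).astype("int32")
    f1_micro = f1_score(y_test, y_pred, average="micro")
    epochs_to_converge = len(history.history["loss"])
    return f1_micro, epochs_to_converge
```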
4 Experimental results

There are 12 categories of hyperparameters that are tuned in this study. Some of the hyperparameters, such as the deep architecture or the classifier type, are network choices, while others, such as the embedding type or the dropout rate, are variables pertaining to different parts of the network. The results of hyperparameter optimization for each criterion are reported in the following subsections.

All parameters except the parameter under investigation in each experiment are kept constant. All other parameters that are not part of this study, such as the random seed or the batch size, are also kept constant throughout all experiments.

4.1 Word Embeddings Grid

In this grid, we tune the word embedding type, the size, and the method of updating. Low-dimensional dense word vectors known as word embeddings have proven to be highly effective for representing words and often lead to significantly better performance (Collobert et al., 2011). Depending on the method used for their training, they can provide different levels of syntactic and semantic information about each word. Many factors can affect the quality of word embeddings, including the data on which they were trained, their number of dimensions, their domain, and the preprocessing steps involved in the training. We investigated five widely studied pre-trained word embeddings: Word2Vec (Mikolov et al., 2013) trained on the Google News dataset with 100 billion tokens, Glove (Pennington et al., 2014) with three variants (one trained on Wikipedia with 6 billion tokens and two others trained on the Common Crawl, one with 42 and the other with 840 billion tokens), FastText (Bojanowski et al., 2016), dependency-based embeddings (Levy and Goldberg, 2014), and ELMo (Peters et al., 2018). As shown in Table 1, the Glove embeddings trained on the Common Crawl yield significantly better results than the other embeddings except for ELMo. ELMo and Glove-840 yield roughly similar results; however, due to the much larger word vector size of ELMo, it is much more computationally expensive and takes much longer to converge.

  Word embedding type                    Epochs   Results
  Word2Vec (Mikolov et al., 2013)        26       81.9 %
  Glove-6 (Pennington et al., 2014)      25       81.7 %
  Glove-42 (Pennington et al., 2014)     26       82.9 %
  Glove-840 (Pennington et al., 2014)    29       84.5 %
  FastText (Bojanowski et al., 2016)     24       79.2 %
  Dependency (Levy and Goldberg, 2014)   22       81.4 %
  ELMo (Peters et al., 2018)             32       84.6 %

Table 1: Embedding type tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Each pre-trained embedding comes with a specific vector size. The Glove embeddings are available as 50-, 100-, 200-, and 300-dimensional word vectors, ELMo provides 1024-dimensional vectors, and all other embeddings come with 300-dimensional word vectors. The results of size tuning are reported in Table 2. Except for the 50-dimensional vectors, which are sub-optimal, all other dimensions yield superior results with a negligible difference in the number of Epochs.

  Word embedding size   Epochs   Results
  50                    22       81.8 %
  100                   25       82.9 %
  200                   27       83.6 %
  300                   29       84.3 %
  1024                  32       84.6 %

Table 2: Embedding size tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Word embeddings provide a means of transfer learning: word vectors are initially learned on a large dataset containing several billion tokens and are afterwards fine-tuned on a smaller dataset for a specific task. This mechanism can be controlled by keeping the word vectors frozen or fine-tuning them during training. Depending on the size of the dataset on which the word embeddings are being refined, updating them can improve the performance. However, as observed in Table 3, fine-tuning the word vectors yielded no significant improvement over the original pre-trained ones, since our dataset was not large enough.

  Word embedding updating   Epochs   Results
  Disabled                  29       84.3 %
  Enabled                   31       84.5 %

Table 3: Embedding update method tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).
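As an illustration of how these three settings (type, size, updating) enter a model, the following sketch builds an embedding layer from pre-trained vectors, assuming TensorFlow/Keras and a hypothetical glove_index dictionary mapping tokens to NumPy vectors (e.g., loaded from a GloVe file); it is not the authors' exact implementation.

```python
# A minimal sketch: initialize an embedding layer with pre-trained vectors
# and choose between frozen ("Disabled" in Table 3) and fine-tuned
# ("Enabled") embeddings via the trainable flag.
import numpy as np
import tensorflow as tf

def build_embedding_layer(word_index, glove_index, dim=300, trainable=False):
    # Row i holds the pre-trained vector of the word with index i;
    # out-of-vocabulary words keep a zero vector.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = glove_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0],
        output_dim=dim,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=trainable,
    )
```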
4.2 Character embedding

Word-level features are not the only features used in text analytics. Character-level features are also reported to improve model performance, especially in tasks such as Named Entity Recognition (NER) (Akbik et al., 2018) or Part-Of-Speech (POS) tagging (Anastasiev et al., 2018), where knowing the function of individual character groups such as prefixes, suffixes, or even infixes is beneficial.

We used two different character encoding mechanisms, one CNN-based (Ma and Hovy, 2016) and the other LSTM-based (Lample et al., 2016), to investigate the impact of character-level features on network performance. As we expected, using character-level features had no added value in our label classification task, where labels are bound to words and their syntactic and semantic attributes rather than to their characters.

Character embeddings and the best of the embeddings grid were tuned on the same grid. This means that in this grid we disregard the sub-optimal settings of the embeddings grid and only keep its winning setting. Given the winning setting, we tune the character embedding settings to investigate the impact of character embeddings (Table 4).

  Character embedding                  Epochs   Results
  Disabled                             29       84.3 %
  Enabled-CNN (Ma and Hovy, 2016)      31       84.7 %
  Enabled-LSTM (Lample et al., 2016)   36       84.8 %

Table 4: Character embedding tuning results. Character embeddings and the best of embeddings are in the same grid search (14 configurations).

4.3 Deep architectures

The choice of deep architecture, either a Convolutional Neural Network (CNN) (LeCun et al., 1989) or a variant of Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014), can have a huge effect on the performance of a model.

The deep architecture type, the number of deep layers, and the number of units in each layer, as well as the optimizers, are highly dependent on each other. Therefore, we optimize all of them on the same grid with 270 different configurations. For the CNN model, we adapted the model of Kim (2014), and for the RNN models, we used both the LSTM and GRU variants as single and bidirectional architectures. As seen in Table 5, although the CNN models converge faster than the RNNs, they cannot beat the RNNs' performance. Among the RNN models, the bidirectional GRU yields significantly better results.

  Deep architecture                          Epochs   Results
  LSTM (Hochreiter and Schmidhuber, 1997)    30       78.2 %
  Bi-LSTM                                    37       82.9 %
  GRU (Cho et al., 2014)                     21       79.8 %
  Bi-GRU                                     29       84.3 %
  CNN (single channel) (Kim, 2014)           18       81.7 %
  CNN (double channel) (Kim, 2014)           23       82.5 %

Table 5: Deep architecture tuning results. Deep architectures, deep and wide networks, and optimizers are in the same grid (270 configurations).

4.4 Deep vs. wide networks

Using more deep layers and more units in each layer has been beneficial in some tasks. Adding more layers helps in more complex tasks to generate more layers of abstraction, while adding more units to each layer contributes to generating more features. Still, adding extra layers in depth and width without enough training data usually leads to overfitting. In all of our configurations, we got the best performance with 128 units per layer and only one layer in depth (Table 6).

  Deep vs. wide network   Epochs   Results
  Deep-1                  29       84.3 %
  Deep-2                  26       83.7 %
  Deep-3                  18       74.6 %
  Wide-64                 30       82.9 %
  Wide-128                29       84.3 %
  Wide-256                25       83.5 %

Table 6: Deep and wide network tuning results. Deep and wide networks, deep architectures, and optimizers are in the same grid (270 configurations).

4.5 Optimizer

The job of an optimizer is to minimize the loss of the objective function. Gradient-based methods in general, and Stochastic Gradient Descent (SGD) in particular, are among the most widely used classes of optimizers for minimizing objective functions in machine learning. Due to the high sensitivity of SGD to the learning rate, other optimizers such as Adagrad (Duchi et al., 2011), RMSProp (Hinton, 2012), Adam (Kingma and Ba, 2015), and Nadam (Dozat, 2015) have been proposed in recent years. In all our configurations, we got the best performance using Adam. Nadam yields almost the same performance while converging faster (Table 7).

  Optimizer                      Epochs   Results
  SGD                            22       78.4 %
  Adagrad (Duchi et al., 2011)   25       82.7 %
  RMSProp (Hinton, 2012)         27       83.9 %
  Adam (Kingma and Ba, 2015)     29       84.3 %
  Nadam (Dozat, 2015)            24       84.2 %

Table 7: Optimizer tuning results. Optimizers, deep and wide networks, and deep architectures are in the same grid (270 configurations).
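The following sketch puts the winning choices of this grid together: a single bidirectional GRU layer with 128 units, trained with Adam. It assumes TensorFlow/Keras and the embedding layer from the previous sketch, and it is a simplified single-channel illustration rather than the authors' exact two-channel network.

```python
# A minimal sketch of the best-performing recurrent block found in this
# grid: one Bi-GRU layer ("Deep-1") with 128 units ("Wide-128").
import tensorflow as tf

def build_bigru_encoder(embedding_layer, max_len=3000, units=128):
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    x = embedding_layer(inputs)
    # return_sequences=True keeps per-token states so that pooling
    # (Section 4.6) can be applied on top of the recurrent outputs.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True))(x)
    return tf.keras.Model(inputs, x)

# Adam gave the best score in Table 7; Nadam converged faster with
# almost the same result, so either is a reasonable default.
optimizer = tf.keras.optimizers.Adam()
```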
4.6 Pooling

Either in a CNN after the convolutional filters or in an RNN after the recurrent layers, pooling has proven to be a useful tool for extracting the most relevant features for a given task. We investigated three types of pooling, namely average pooling, max pooling, and the concatenation of both, together with the best of the optimizer configurations on the same grid with 15 settings. The results are reported in Table 8 and show that using both yields the best performance for our task.

  Pooling   Epochs   Results
  Average   29       83.2 %
  Max       29       83.5 %
  Both      29       84.2 %

Table 8: Pooling tuning results. Pooling and the best of optimizers are in the same grid search (15 configurations).

4.7 Gradient control

The derivatives computed during backpropagation at training time in a DNN with many layers get smaller and smaller to the point of vanishing. This is particularly true for RNNs, which have a large number of layers when unrolled over time, and it makes training difficult and time-consuming. There are two widely practiced mechanisms to address this issue, known as gradient vanishing: gradient clipping (Mikolov, 2012) and gradient normalization (Pascanu et al., 2013). We set the gradient control mechanism with the best of the deep architectures from Sub-section 4.3 on the same grid with 18 configurations. In all of these configurations, we got better results using gradient normalization (Table 9).

  Gradient control                       Epochs   Results
  Disabled                               28       82.9 %
  Clipping (Mikolov, 2012)               31       83.1 %
  Normalization (Pascanu et al., 2013)   29       84.2 %

Table 9: Gradient control tuning results. Gradient control and the best of deep architectures are in the same grid search (18 configurations).

4.8 Classifier

The last layer in a classification model is considered the most crucial layer, since all the computed features are projected in this layer to their appropriate classes. Therefore, the choice of this layer has an essential impact on network performance. This choice is highly dependent on the assumptions we make about the task at hand. If the labels are independently distributed, the Sigmoid and the Softmax yield better results, while if they are conditioned on their adjacent labels (e.g., POS tagging), a Conditional Random Field (CRF) (Lafferty et al., 2001) works better. If we expect a multinomial distribution over the labels, the Softmax is the best classifier to choose, while if we expect a Bernoulli distribution, the Sigmoid is the right choice. All of these considerations follow from the assumptions behind each of these statistical functions.

We investigated the performance of these three classifiers with the best of the deep architectures from Sub-section 4.3 on the same grid with 18 configurations. As observed in the results presented in Table 10, the Sigmoid obtains statistically significantly better results than the two other functions. As expected, due to the independence among the labels of different samples, the CRF did not perform very well. Likewise, due to the freedom among the labels within each sample, the Softmax also performed poorly.

  Classifier   Epochs   Results
  Softmax      30       78.4 %
  Sigmoid      29       84.2 %
  CRF          31       77.1 %

Table 10: Classifier tuning results. The classifiers and the best of deep architectures are in the same grid search (18 configurations).
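The sketch below ties together the choices from Tables 8-10 on top of the Bi-GRU encoder from the previous sketch, again assuming TensorFlow/Keras: concatenated max- and average-pooling, a sigmoid output unit per label (28 labels, as in our dataset) with binary cross-entropy, and a gradient-norm cap in the spirit of the normalization of Pascanu et al. (2013). The clipping threshold is an assumption, not a value reported in the paper.

```python
# A minimal sketch of pooling, classifier, and gradient control choices.
import tensorflow as tf

def build_classifier(encoder, n_labels=28):
    inputs = encoder.input
    sequence = encoder.output                      # (batch, tokens, features)
    # Pooling: concatenating max- and average-pooled features ("Both")
    # worked best in Table 8.
    pooled = tf.keras.layers.Concatenate()([
        tf.keras.layers.GlobalMaxPooling1D()(sequence),
        tf.keras.layers.GlobalAveragePooling1D()(sequence),
    ])
    # Classifier: independent labels, so one sigmoid unit per label with a
    # binary cross-entropy loss (rather than softmax or CRF).
    outputs = tf.keras.layers.Dense(n_labels, activation="sigmoid")(pooled)
    model = tf.keras.Model(inputs, outputs)
    # Gradient control: cap the gradient norm (threshold assumed here).
    model.compile(optimizer=tf.keras.optimizers.Adam(clipnorm=1.0),
                  loss="binary_crossentropy")
    return model
```

A usage sketch: model = build_classifier(build_bigru_encoder(embedding_layer)) followed by the train_and_evaluate helper from Section 3.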
4.9 Drop out value

Deep neural networks tend to memorize or overfit, which is not a desirable behavior, since we are mostly interested in the ability of the network to generalize. Drop out (Srivastava et al., 2014) is an effective tool for enhancing generalizability. The first technique, known as simple or naive drop out, was proposed as a mechanism that randomly removes the connections between deep layers. Gal and Ghahramani (2016) proposed a new drop out mechanism called variational drop out, which improves on simple drop out by defining static masks for removing the connections between deep layers ('inter-layer') as well as between the units inside deep layers ('intra-layer'). We placed the drop out methods with the best of the deep architectures from Sub-section 4.3 on the same grid with 90 configurations. The results are reported in Table 11 and Table 12. As expected, the configuration with both inter- and intra-layer variational masks yields the best performance.

  Drop out value   Epochs   Results
  Disabled         24       80.2 %
  Simple 0.2       26       83.2 %
  Simple 0.5       27       83.8 %
  Simple 0.7       29       81.5 %
  Variational      32       84.2 %

Table 11: Simple drop out tuning results. The drop out and the best of deep architectures are in the same grid search (90 configurations).

  Variational drop out method   Epochs   Results
  Inter                         31       83.5 %
  Intra                         30       83.2 %
  Both                          32       84.2 %

Table 12: Variational drop out value tuning results.
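As an illustration, the sketch below adds drop out to the recurrent block, assuming TensorFlow/Keras. In Keras, the dropout and recurrent_dropout arguments of a GRU reuse the same mask at every time step, which roughly corresponds to the variational scheme of Gal and Ghahramani (2016) and to the 'inter-layer' and 'intra-layer' masks discussed above; the rates shown are illustrative, not the exact values tuned in Tables 11 and 12.

```python
# A minimal sketch of variational-style drop out on a Bi-GRU layer.
import tensorflow as tf

gru_with_variational_dropout = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(
        128,
        return_sequences=True,
        dropout=0.5,            # mask on the layer inputs ("inter-layer")
        recurrent_dropout=0.5,  # mask on the recurrent state ("intra-layer")
    )
)
```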
5 Conclusion

In this study, we investigated various settings for a Deep Neural Network for multi-label classification. Considering the characteristics of the dataset and the task, we observed the following results. Using Sigmoid in the last layer yields statistically significantly better results than CRF or Softmax. The Glove embeddings (Pennington et al., 2014) with more than 100-dimensional vectors and without updating yield statistically significantly better results than the other word vectors. Compared to the other deep architectures, Bi-GRU yields better results when it is used as a single layer with 128 units. Adam and Nadam obtain roughly the same results, while Nadam converges much faster. Pooling is best used as the concatenation of max- and average-pooled tensors, and it is better to use normalization (Pascanu et al., 2013) as a means of gradient control against gradient vanishing. It is also good practice to use variational drop out (Gal and Ghahramani, 2016) both between layers and inside recurrent units to control overfitting. Finally, we did not observe any improvement from using character embeddings.

The order in which these parameters are mentioned reflects the magnitude of their importance for the final performance. Parameters not mentioned here did not have any noticeable impact on the system results.

References

Ahmad Aghaebrahimian and Mark Cieliebak. 2019. Towards integration of statistical hypothesis tests into deep neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Florence, Italy.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA.

D. G. Anastasiev, I. O. Gusev, and E. M. Indenbom. 2018. Improving part-of-speech tagging via multi-task learning and character-level word representations. In Proceedings of the International Conference Dialogue, Computational Linguistics and Intellectual Technologies.

James Bergstra, Daniel Yamins, and David D. Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML). Atlanta, GA, USA.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Timothy Dozat. 2015. Incorporating Nesterov momentum into Adam.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., USA.

Geoffrey Hinton. 2012. Neural networks for machine learning. Lecture 6a: Overview of mini-batch gradient descent.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.

Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. 2015. Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1064–1074.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov. 2012. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML).

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS). USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.