=Paper=
{{Paper
|id=Vol-1749/paper_015
|storemode=property
|title=Building the state–of–the–art in POS tagging of Italian Tweets
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_015.pdf
|volume=Vol-1749
|authors=Andrea Cimino,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/clic-it/CiminoD16
}}
==Building the state-of-the-art in POS tagging of Italian Tweets==
Andrea Cimino and Felice Dell'Orletta
Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{andrea.cimino, felice.dellorletta}@ilc.cnr.it

Abstract

English. In this paper we describe our approach to the EVALITA 2016 POS tagging for Italian Social Media Texts (PoSTWITA) task. We developed a two-branch bidirectional Long Short Term Memory (LSTM) recurrent neural network, where the first bi-LSTM uses a typical vector representation for the input words, while the second one uses a newly introduced word-vector representation able to encode information about the characters in the words while avoiding the increase in computational cost caused by hierarchical character-based LSTM architectures. The vector representations calculated by the two LSTMs are then merged by the sum operation. Even though participants were allowed to use other annotated resources in their systems, we used only the distributed data set to train our system. When evaluated on the official test set, our system outperformed all the other systems, achieving the highest accuracy score in EVALITA 2016 PoSTWITA, with a tagging accuracy of 93.19%. Further experiments carried out after the official evaluation period allowed us to develop a system able to achieve an even higher accuracy. These experiments showed the central role played by handcrafted features even when machine learning algorithms based on neural networks are used.

1 Description of the system

Our approach to the EVALITA 2016 PoSTWITA task (Bosco et al., 2016) was implemented in a software prototype operating on tokenized sentences which assigns to each token a score expressing its probability of belonging to a given part-of-speech class. The highest score represents the most probable class.

Differently from the previous EVALITA part-of-speech tagging tasks (Tamburini (2007), Attardi and Simi (2009)), in EVALITA 2016 PoSTWITA the participants must tackle the problem of analyzing text with low conformance to common writing practices. For example, capitalization rules may be ignored; excessive punctuation, particularly repeated ellipses and question marks, may be used; or spacing may be irregular (Agichtein et al., 2008). Our development strategy took this issue into account. In particular, we implemented a multiple-input bidirectional Long Short Term Memory recurrent neural network (LSTM) model. We developed a two-branch bidirectional LSTM (bi-LSTM) where the first bi-LSTM uses a typical vector representation of the input words commonly used for different classification tasks, while the second one uses a newly introduced word-vector representation specifically designed to handle the peculiarities of the ill-formed or non-standard texts typical of social media.

To create the input vectors for the two branches we use a combination of different components extracted from three different word embedding lexicons, from a manually created morpho-syntactic lexicon and from handcrafted features specifically defined to improve the accuracy of the system when tested on social media texts.

In this work we used the Keras deep learning framework (Chollet, 2016) to generate the neural network models.

1.1 Lexicons

In order to improve the overall accuracy of our system, we developed three word embedding lexicons (freely available at http://www.italianlp.it/) and we used a manually created morpho-syntactic lexicon.

1.1.1 Word Embedding lexicons

Since the lexical information in tweets can be very sparse, to overcome this problem we built three word embedding lexicons.

For this purpose, we trained two predict models using the word2vec toolkit (http://code.google.com/p/word2vec/) (Mikolov et al., 2013). As recommended in (Mikolov et al., 2013), we used the CBOW model, which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window. For our experiments, we considered a context window of 5 words. These models learn lower-dimensional word embeddings: embeddings are represented by a set of latent (hidden) variables, and each word is a multidimensional vector that represents a specific instantiation of these variables. We built two word embedding lexicons starting from the following corpora (a training sketch is given after the list):

• The first lexicon was built using a tokenized version of the itWaC corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora). The itWaC corpus is a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

• The second lexicon was built from a tokenized corpus of tweets. This corpus was collected using the Twitter APIs and is made up of 10,700,781 Italian tweets.
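As an illustration, here is a minimal sketch of how such a lexicon can be trained. It uses the gensim reimplementation of word2vec rather than the original C toolkit employed in the paper, and a two-tweet toy corpus as a stand-in for the real one; the hyperparameters (CBOW, a symmetric 5-word window, summed context vectors, 128-dimensional embeddings as reported in Section 1.2.1) follow the values stated in the text.

```python
# Minimal sketch (not the authors' code): training a CBOW word2vec
# lexicon with gensim on a tokenized tweet corpus. The toy sentences
# below stand in for the 10.7M-tweet corpus described above.
from gensim.models import Word2Vec

sentences = [
    ["@user", "che", "bella", "giornata", "!"],
    ["il", "gatto", "dorme", "sul", "divano"],
]

model = Word2Vec(
    sentences,
    vector_size=128,  # 128-dimensional embeddings, as in Section 1.2.1
    window=5,         # symmetric context window of 5 words
    sg=0,             # CBOW architecture
    cbow_mean=0,      # predict from the *sum* of the context vectors
    min_count=1,      # keep every token in this toy example
)
print(model.wv["gatto"].shape)  # -> (128,)
```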
In addition to these two lexicons, we built another word embedding lexicon based on fastText (Bojanowski et al., 2016), a library for efficient learning of word representations and sentence classification. FastText makes it possible to overcome the problem of out-of-vocabulary words which affects the underlying word2vec methodology. Generating out-of-vocabulary word embeddings is a typical issue for morphologically rich languages with large vocabularies and many rare words. FastText overcomes this limitation by representing each word as a bag of character n-grams: a vector representation is associated to each character n-gram, and the word is represented as the sum of these character n-gram representations. To build the lexicon based on fastText, we adopted as learning corpus the same set of tokenized tweets used to build the word2vec-based lexicon.
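A comparable sketch for the fastText lexicon, again using gensim's reimplementation instead of the fastText library itself, shows the key practical difference: thanks to the subword n-grams, a vector can be produced even for out-of-vocabulary forms.

```python
# Minimal sketch (not the authors' code): a fastText-style lexicon
# trained with gensim. Words are represented as bags of character
# n-grams, so unseen or misspelled forms still receive a vector.
from gensim.models import FastText

sentences = [
    ["@user", "che", "bella", "giornata", "!"],
    ["il", "gatto", "dorme", "sul", "divano"],
]

model = FastText(sentences, vector_size=128, window=5, min_count=1)

# "gattoooo" never occurs in the corpus, but its character n-grams
# overlap with "gatto", so a 128-dimensional vector is still returned
print(model.wv["gattoooo"].shape)  # -> (128,)
```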
1.1.2 Morpho-syntactic lexicon

We used a large Italian lexicon of about 1,300,000 forms, developed as part of the SemaWiki project (http://medialab.di.unipi.it/wiki/SemaWiki). The full-form lexicon was generated from a base lexicon of 65,500 lemmas, initially inspired by the Zanichelli dictionary (Zingarelli: Il nuovo Zingarelli minore, 2008), updated over several years and cross-checked with other online dictionaries (Aldo Gabrielli: Il Grande Dizionario di Italiano; Tullio De Mauro: Il Dizionario della lingua italiana). For each form the lexicon contains all the possible parts-of-speech and provides information on morpho-syntactic features, but uses a different tagset (the ISST-TANL tagset, http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf) with respect to the one used for PoSTWITA.

1.2 The POS tagger architecture

The LSTM unit was initially proposed by Hochreiter and Schmidhuber (1997). LSTM units are able to propagate an important feature that came early in the input sequence over a long distance, thus capturing potential long-distance dependencies. This type of neural network was recently tested on Sentiment Analysis tasks (Tang et al., 2015), (Xu et al., 2016), where it has been proven to outperform commonly used learning algorithms in several sentiment analysis tasks (Nakov et al., 2016), showing improvements of 3-4 points. Similarly big improvements have not been obtained in tagging tasks, such as part-of-speech tagging. This is mostly due to the fact that state-of-the-art systems for part-of-speech tagging exploit strongly performing learning algorithms and heavy feature engineering; in addition, a little knowledge of the surrounding context is enough to reach very high tagging performance. On the contrary, LSTM networks perform very well with respect to other learning algorithms when word dependencies are long. Although without a big improvement, POS tagging systems which exploit LSTMs as learning algorithm have been proven to reach state-of-the-art performance both when analyzing text at the character level (Ling et al., 2015) and at the word level (Wang et al., 2016). More specifically, they used a bidirectional LSTM, which allows the network to capture long-range dependencies from both directions of a sentence by constructing bidirectional links in the network (Schuster and Paliwal, 1997). In addition, Plank et al. (2016) have proposed a model which takes into account at the same time both word-level and character-level information, showing very good results for many languages.

As proposed by these systems, we employed a bidirectional LSTM architecture. We implemented a two-branch bidirectional LSTM, but instead of using a character-based branch we introduced another specific word-level branch in order to avoid the computational cost of the hierarchical LSTM introduced by character-based LSTM architectures. This branch encodes information about the characters in each word of a sentence. The vector representations calculated by the two LSTMs are then merged by the sum operation. For what concerns the optimization process, categorical cross-entropy is used as the loss function and the optimization is performed by the rmsprop optimizer (Tieleman and Hinton, 2012). Each bidirectional LSTM branch is configured to have 24 units. In addition, we applied a dropout factor to both the input gates and the recurrent connections in order to prevent overfitting, which is a typical issue of neural networks (Gal and Ghahramani, 2015). As suggested in (Gal and Ghahramani, 2015), we chose a dropout factor value in the optimum range [0.3, 0.5], more specifically 0.35 for each branch.

Figure 1: Diagram of the two-branched bi-LSTM architecture (the Word-based and Bag-of-Character Word-based LSTM branches feed merged vectors into a categorical cross-entropy output).
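Based on the description above, the architecture can be sketched in Keras as follows. This is a reconstruction, not the authors' released code: the sequence length, tagset size and one-hot target encoding are assumptions, while the branch sizes (24 units), dropout (0.35 on inputs and recurrent connections), sum merge, loss and optimizer follow the reported values.

```python
# Reconstruction sketch of the two-branch bi-LSTM (Figure 1); MAX_LEN
# and N_TAGS are assumed values, not taken from the paper.
from keras.layers import LSTM, Bidirectional, Dense, Input, add
from keras.models import Model

MAX_LEN = 50  # assumed maximum tweet length in tokens
N_TAGS = 22   # assumed size of the PoSTWITA tagset

def branch(input_dim):
    """One bidirectional LSTM branch with 24 units and a 0.35 dropout
    factor on both the inputs and the recurrent connections."""
    inp = Input(shape=(MAX_LEN, input_dim))
    out = Bidirectional(
        LSTM(24, return_sequences=True,
             dropout=0.35, recurrent_dropout=0.35))(inp)
    return inp, out

word_in, word_out = branch(979)  # word-level vectors (Section 1.2.1)
char_in, char_out = branch(316)  # bag-of-character vectors (Section 1.2.2)

merged = add([word_out, char_out])                  # merge by element-wise sum
tags = Dense(N_TAGS, activation="softmax")(merged)  # one tag per token

model = Model(inputs=[word_in, char_in], outputs=tags)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.summary()
```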
1.2.1 Word-based bi-LSTM

In this section, we describe the Word-based bidirectional LSTM branch of the proposed neural network architecture and the word-level information given in input to this layer. Each word is represented by a low-dimensional, continuous and real-valued vector, also known as a word embedding, and all the word vectors are stacked in a word embedding matrix. To train this LSTM branch, each input word in the tweet is represented by a 979-dimensional vector composed of the following components (a sketch of the resulting vector follows the list):

Word2vec word embeddings: the concatenation of the two word embeddings extracted from the two available word2vec word embedding lexicons (128 components for each word embedding, thus resulting in a total of 256 components); for each word embedding an extra component was added in order to handle the "unknown word" case (2 components).

FastText word embeddings: the word embedding extracted from the fastText word embedding lexicon (128 components).

Morpho-syntactic category: the parts-of-speech and the corresponding morpho-syntactic features obtained by exploiting the morpho-syntactic lexicon, resulting in 293 components.

Spell checker: the parts-of-speech and the corresponding morpho-syntactic features obtained by analyzing the current word with a spell checker (pyenchant, http://pythonhosted.org/pyenchant/) and then exploiting the morpho-syntactic lexicon, resulting in 295 components.

Word length: a component representing the length of the analyzed word.

Is URL: a component indicating whether the "http" substring is contained in the analyzed word.

Is uppercase: a component indicating whether the analyzed word is uppercase.

Is capitalized: a component indicating whether the analyzed word is capitalized.

End of sentence: a component indicating whether or not the sentence was totally read.
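To make the layout of this 979-dimensional vector concrete, the following sketch assembles it with the lexicon-derived blocks mocked as zero vectors (the real system fills them from the embedding lexicons, the morpho-syntactic lexicon and pyenchant); only the handcrafted tail components are computed for real, and the concatenation order is an assumption.

```python
# Illustration of the 979-dimensional word vector (Section 1.2.1).
# The first five blocks are placeholders here; their sizes match the
# component counts given in the text.
import numpy as np

def word_vector(word, sentence_read=False):
    w2v_itwac   = np.zeros(129)  # itWaC word2vec + unknown-word flag
    w2v_twitter = np.zeros(129)  # Twitter word2vec + unknown-word flag
    fasttext    = np.zeros(128)  # fastText embedding
    morpho      = np.zeros(293)  # morpho-syntactic lexicon categories
    spell       = np.zeros(295)  # categories of the spell-checked form

    handcrafted = np.array([
        len(word),              # word length
        float("http" in word),  # is URL
        float(word.isupper()),  # is uppercase
        float(word.istitle()),  # is capitalized
        float(sentence_read),   # end of sentence
    ])
    return np.concatenate(
        [w2v_itwac, w2v_twitter, fasttext, morpho, spell, handcrafted])

print(word_vector("Ciao").shape)  # -> (979,)
```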
1.2.2 Bag-of-Character Word-based bi-LSTM

In this section, we describe the Bag-of-Character Word-based bidirectional LSTM branch of the proposed neural network architecture and the word-level information given in input to this layer. Differently from the Word-based LSTM branch, in this branch we did not use pretrained vectors. To train this LSTM branch, each input word in the tweet is represented by a 316-dimensional vector composed of the following components (a sketch follows the list):

Characters: a vector representing the set of characters which compose the current word. Since our considered alphabet is composed of 173 different characters, this results in a 173-dimensional vector.

Lowercased characters: 134 components representing the set of lowercased characters which compose the current word.

Has numbers: a component indicating whether or not the current word contains a number.

Contains not numbers: a component indicating whether or not the current word contains non-number characters.

Contains lowercased: a component indicating whether or not the current word contains lowercase characters.

Contains uppercased: a component indicating whether or not the current word contains uppercase characters.

Contains alphanumeric: a component indicating whether or not the current word contains alphanumeric characters.

Contains not alphanumeric: a component indicating whether or not the current word contains non-alphanumeric characters.

Contains alphabetics: a component indicating whether or not the current word contains alphabetic characters.

Contains not alphabetics: a component indicating whether or not the current word contains non-alphabetic characters.

End of sentence: a component indicating whether the sentence was totally read.
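The bag-of-character vector can be illustrated in the same way. The 173-character alphabet and its 134-character lowercased counterpart are not distributed with the paper, so printable ASCII stands in for them here; the nine binary flags mirror the components listed above.

```python
# Illustration of the bag-of-character word vector (Section 1.2.2),
# with stand-in alphabets (the paper's have 173 and 134 symbols).
import numpy as np

ALPHABET = [chr(c) for c in range(32, 127)]       # stand-in, 95 symbols
LOWERCASE = [c for c in ALPHABET if c.islower()]  # stand-in, 26 symbols

def char_vector(word, sentence_read=False):
    chars = np.array([float(c in word) for c in ALPHABET])
    lower = np.array([float(c in word.lower()) for c in LOWERCASE])
    flags = np.array([
        float(any(c.isdigit() for c in word)),      # has numbers
        float(any(not c.isdigit() for c in word)),  # contains non-numbers
        float(any(c.islower() for c in word)),      # contains lowercased
        float(any(c.isupper() for c in word)),      # contains uppercased
        float(any(c.isalnum() for c in word)),      # contains alphanumeric
        float(any(not c.isalnum() for c in word)),  # contains non-alphanumeric
        float(any(c.isalpha() for c in word)),      # contains alphabetics
        float(any(not c.isalpha() for c in word)),  # contains non-alphabetics
        float(sentence_read),                       # end of sentence
    ])
    return np.concatenate([chars, lower, flags])

# with the real alphabets the result would have 173 + 134 + 9 = 316 components
print(char_vector("Ciao!").shape)  # -> (130,) with the stand-in alphabets
```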
2 Results and Discussion

To develop our system, we created an internal development set of 368 tweets randomly selected from the training set distributed by the task organizers. The first row in Table 1 reports the accuracy achieved by our final system on the internal development set and on the official test set (row Two-branch bi-LSTM).

Configuration                  Devel   Test
Two-branch bi-LSTM             96.55   93.19
Word bi-LSTM                   96.03   92.35
Bag-of-Char. Word bi-LSTM      84.47   80.77
No Morpho-syntactic lexicon    96.48   93.54
No spell checker               96.49   93.31
No word2vec lexicons           93.23   89.87
No fastText lexicon            95.85   92.43
No feature engineering         96.39   93.06

Table 1: Tagging accuracy (in percentage) of the different learning models on our development set and the official test set.

We tested different configurations of our system in order to evaluate the contribution to the tagging accuracy of: i) each branch in the proposed architecture, ii) the different word embedding and morpho-syntactic lexicons and iii) the handcrafted features. We carried out different experiments that reflect the questions we wanted to answer, more specifically:

• (a) what are the contributions of the Word-based bi-LSTM and of the Bag-of-Character Word-based bi-LSTM?

• (b) what is the contribution of the morpho-syntactic lexicon?

• (c) what is the contribution of the spell checker?

• (d) what is the contribution of fastText with respect to the word2vec word embedding lexicons?

In order to answer question (a), we first ran the Word-based bi-LSTM excluding the Bag-of-Character Word-based bi-LSTM branch; then we excluded the Word-based bi-LSTM to verify the contribution of the Bag-of-Character Word-based bi-LSTM. The results of these experiments are reported in the Word bi-LSTM and Bag-of-Char. Word bi-LSTM rows of Table 1. The Word-based bi-LSTM is clearly the best performer with respect to the Bag-of-Character one, but remarkably our proposed two-branch architecture shows an improvement of about 0.5 points on the development set with respect to the best single bi-LSTM. The same behaviour is shown on the test set, where the combined system achieves an improvement of 0.84 points with respect to the single Word-based bi-LSTM.

In order to answer question (b), we excluded from the input vectors of the Word-based bi-LSTM branch the morpho-syntactic category components extracted from the morpho-syntactic lexicon. The No Morpho-syntactic lexicon row reports the results and shows that this information gives a negligible improvement on the development set and, unexpectedly, a slight drop on the test set.

For what concerns question (c), we excluded the morpho-syntactic category components of the word obtained using the spell checker. The results are reported in the No spell checker row. Similarly to what happened in experiment (b), such information does not contribute to increasing the tagging performance.

In order to compare the contributions of the fastText and word2vec lexicons (question (d)), we considered two different system configurations: one removing the two word2vec lexicons (No word2vec lexicons row) and one removing the fastText and itWaC word2vec lexicons (No fastText lexicon row). In this second configuration we removed also the itWaC word2vec lexicon in order to compare fastText and word2vec using the same learning corpus (the Twitter corpus described in Section 1.1.1). In both configurations we excluded the other Word-based LSTM components, while we left all the components of the Bag-of-Character Word-based LSTM. The results show that word2vec seems to be a better choice with respect to fastText, both on the development and the test set. This is in contrast with what we would have expected, considering that fastText learns the word embedding representation using subword information that should be particularly useful for the analysis of non-standard texts such as social media ones.

2.1 Single bi-LSTM and Handcrafted features

After the submission of the final system results, we devised two further experiments. The first one was devoted to testing the tagging performance of a single word-based bi-LSTM architecture with respect to the presented two-branch bi-LSTM. The second experiment was aimed at studying the effect of handcrafted features combined with the learned ones. To this aim, we developed a part-of-speech tagger based on a single word-based bi-LSTM, where each input word vector is the concatenation of the two input word representations of the bi-LSTMs presented in Section 1.2.1 and Section 1.2.2.

Table 2 reports the results of these experiments. As shown in the Single bi-LSTM row, the use of the single architecture instead of the two-branch one does not hurt tagging results; actually, the single bi-LSTM slightly outperforms the two-branch architecture when tested on the test set (+0.48%).

In order to evaluate the effect of handcrafted features, we conducted a last experiment where we removed all the components from the input vectors of the single word-based bi-LSTM with the exception of the word2vec and fastText word embeddings. The No handcrafted features row shows the relevance of the handcrafted features, which yield an improvement of 1.34% and 1.68% on the development and test sets respectively. These results show the important role of feature engineering even when neural network learning algorithms are used.

Configuration              Devel   Test
Single bi-LSTM             96.39   93.67
No handcrafted features    95.22   91.99

Table 2: Tagging accuracy of the single word-based bi-LSTM on our development set and the official test set.
3 Conclusion

In this paper we reported the results of our participation in the EVALITA 2016 POS tagging for Italian Social Media Texts (PoSTWITA) task. By resorting to a two-branch bidirectional LSTM, word embeddings, morpho-syntactic lexicons and handcrafted features, we achieved the best score. In particular, we showed the relevance of handcrafted features, which allowed an improvement of more than one percentage point in tagging accuracy on both the development and test sets when combined with learned features such as word embedding lexicons. As a future research direction, we will test the contribution of a pure character-based LSTM with respect to character handcrafted features.

References

Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis and Gilad Mishne. 2008. Finding High-quality Content in Social Media. In Proceedings of the 2008 International Conference on Web Search and Data Mining. New York, USA.

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA 2009 Part-of-Speech Tagging Task. In Proceedings of Evalita '09, Evaluation of NLP and Speech Tools for Italian. December, Reggio Emilia, Italy.

Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

François Chollet. 2016. Keras. Software available at https://github.com/fchollet/keras/tree/master/keras.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

Yarin Gal and Zoubin Ghahramani. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo and Tiago Luís. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530, Lisbon, Portugal. ACL.
Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Barbara Plank, Anders Søgaard and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). August, Berlin, Germany.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Fabio Tamburini. 2007. Evalita 2007: The Part-of-Speech Tagging Task. In Proceedings of Evalita '07, Evaluation of NLP and Speech Tools for Italian. September, Rome, Italy.

Duyu Tang, Bing Qin and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432, Lisbon, Portugal.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He and Hai Zhao. 2016. Learning Distributed Word Representations for Bidirectional LSTM Recurrent Neural Network. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 527–533, San Diego, CA, USA. ACL.

XingYi Xu, HuiZhi Liang and Timothy Baldwin. 2016. UNIMELB at SemEval-2016 Tasks 4A and 4B: An Ensemble of Neural Networks and a Word2Vec Based Model for Sentiment Classification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).