From General to Specific: Leveraging Named Entity Recognition for Slot
         Filling in Conversational Language Understanding

Samuel Louvan
University of Trento / Fondazione Bruno Kessler
slouvan@fbk.eu

Bernardo Magnini
Fondazione Bruno Kessler
magnini@fbk.eu


                  Abstract

English. Slot filling techniques are often adopted in language understanding components for task-oriented dialogue systems. In recent approaches, neural models for slot filling are trained on domain-specific datasets, making it difficult to port them to similar domains when few or no training data are available. In this paper we use multi-task learning to leverage general knowledge of a task, namely Named Entity Recognition (NER), to improve slot filling performance on a semantically similar domain-specific task. Our experiments show that, for some datasets, transfer learning from NER can achieve competitive performance compared with the state of the art and can also help slot filling in low-resource scenarios.

Italiano. Many task-oriented dialogue systems use slot filling techniques for utterance understanding. The most recent approaches rely on neural models trained on datasets specialized for a given domain, which makes porting to similar domains difficult when little or no training data is available. In this contribution we use multi-task learning to exploit the general knowledge coming from a task, namely Named Entity Recognition (NER), to improve slot filling performance on specific, semantically similar domains. Our experiments show that transfer learning from NER helps slot filling in low-resource domains and achieves results that are competitive with the state of the art.
1   Introduction

In dialogue systems, the semantic information of an utterance is generally represented with a semantic frame, a data structure consisting of a domain, an intent, and a number of slots (Tur, 2011). For example, given the utterance "I'd like a United Airlines flight on Wednesday from San Francisco to Boston", the domain would be flight, the intent is booking, and the slot fillers are United Airlines (for the slot airline name), Wednesday (booking time), San Francisco (origin), and Boston (destination). Automatically extracting this information involves domain identification, intent classification, and slot filling, which is the focus of our work.
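To make the representation concrete, a minimal sketch of the frame for this utterance is shown below; the field and slot names are illustrative only and do not correspond to a specific dataset schema.

    # Illustrative only: a minimal semantic-frame representation of the
    # example utterance; slot names follow the running example, not a
    # particular dataset schema.
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class SemanticFrame:
        domain: str
        intent: str
        slots: Dict[str, str] = field(default_factory=dict)

    frame = SemanticFrame(
        domain="flight",
        intent="booking",
        slots={
            "airline_name": "United Airlines",
            "booking_time": "Wednesday",
            "origin": "San Francisco",
            "destination": "Boston",
        },
    )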

Slots are usually domain specific, as they are predefined for each domain. For instance, in the flight domain the slots might be airline name, booking time, and airport name, while in the bus domain the slots might be pickup time, bus name, and travel duration. Recent successful approaches to slot filling (Wang et al., 2018; Liu and Lane, 2017a; Goo et al., 2018) are based on variants of recurrent neural network architectures. In general, there are two ways of approaching the task: (i) training a single model for each domain; or (ii) performing domain adaptation, which results in a model that learns better feature representations across domains. All these approaches directly train the models on domain-specific slot filling datasets.

In our work, instead of using a domain-specific slot filling dataset, which can be expensive to obtain because it is task specific, we propose to leverage knowledge gained from a more "general", but semantically related, task, referred to as the auxiliary task, and then transfer the learned knowledge to the more specific task, namely slot filling, referred to as the target task, through transfer learning. In the literature, the term transfer learning is used in different ways. We follow the definition of (Mou et al., 2016), in which transfer learning is viewed as a paradigm that enables a model to use knowledge from auxiliary tasks to help the target task. There are several ways to train such a model: we can directly use the trained parameters of the auxiliary tasks to initialize the parameters of the target task (pre-train and fine-tune), or train a model on the auxiliary and target tasks simultaneously, with some parameters shared (multi-task learning).

We propose to train a slot filling model jointly with Named Entity Recognition (NER) as an auxiliary task through multi-task learning (Caruana, 1997). Recent studies have shown the potential of multi-task learning for NLP models: for example, (Mou et al., 2016) empirically evaluate transfer learning on sentence and question classification tasks, and (Yang et al., 2017) propose an approach for transfer learning in sequence tagging tasks.

NER is chosen as the auxiliary task for several reasons. First, named entities frequently occur as slot values in several domains, which makes them relevant general knowledge to exploit. The same NER type can refer to different slots in the same utterance: in the example utterance above, the NER labels are LOC for both San Francisco and Boston, and ORG for United Airlines. Second, the state-of-the-art performance of NER (Lample et al., 2016; Ma and Hovy, 2016) is relatively high, so we expect the transferred feature representation to be useful for slot filling. Third, large annotated NER corpora are easier to obtain than domain-specific slot filling datasets.

The contributions of this work are as follows: we investigate the effectiveness of leveraging Named Entity Recognition as an auxiliary task to learn general knowledge, and of transferring this knowledge to slot filling as the target task in a multi-task learning setting. To our knowledge, there is no reported work that uses NER transfer learning for slot filling in conversational language understanding. Our experiments show that for some datasets multi-task learning achieves better overall performance than previously published results, and performs better in some low-resource scenarios.

2   Related Work

Recent approaches to slot filling for conversational agents are based mostly on neural models. (Wang et al., 2018) introduce a bi-model Recurrent Neural Network (RNN) structure that considers the cross-impact between intent detection and slot filling. (Liu and Lane, 2016) propose an attention mechanism on an encoder-decoder model for joint intent classification and slot filling. (Goo et al., 2018) extend the attention mechanism with a slot-gated model that learns relationships between slot and intent attention vectors. (Hakkani-Tür et al., 2016) use a bidirectional RNN as a single model that handles multiple domains by adding a final state containing a domain identifier. (Jha et al., 2018; Kim et al., 2017) use expert-based domain adaptation, while (Jaech et al., 2016) propose a multi-task learning approach to guide the training of a model for new domains. All of these studies train their models solely on slot filling datasets, whereas our focus is to leverage more "general" resources, such as NER, by training the model on them simultaneously with slot filling through multi-task learning.

3   Model

In this section we describe the base model that we use for the slot filling task and the transfer learning model between NER and slot filling.

3.1   Base Model

The model that we use is a hierarchical neural model, as such models have been shown to achieve state-of-the-art results in sequence tagging tasks such as named entity recognition (Ma and Hovy, 2016; Lample et al., 2016).
Figure 1: Multi-task Learning Network architecture.

Figure 1 depicts the overall architecture of the model. The model consists of several stacked bidirectional RNNs and a CRF layer on top to compute the final output. The input of the model consists of both the words and the characters of the sentence. Each word is represented with a word embedding, which is simply a lookup table, concatenated with its character representation. The character representation itself can be composed from the concatenation of the final states of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) over the characters of a word, or extracted with a Convolutional Neural Network (CNN) (LeCun et al., 1998). The concatenation of word and character embeddings is then passed to an LSTM cell. The output of the LSTM at each time step is fed to a CRF layer. Finally, the output of the CRF layer is the slot tag for each word in the sentence, as shown in Table 1.

Sentence   find   flights   from   Atlanta     to   Boston
Slot       O      O         O      B-fromloc   O    B-toloc

Table 1: An example output from the model.
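As a concrete reference, the following is a minimal PyTorch-style sketch of such a base tagger, under our own naming and dimension assumptions; the CRF layer that operates on the emission scores (e.g. as provided by the pytorch-crf package) is only indicated in a comment, and this is not the exact implementation we use.

    # Sketch of the base tagger (assumed names/dimensions): word embeddings are
    # concatenated with CNN character features, fed to a BiLSTM, and projected
    # to per-tag emission scores. A CRF over these scores is omitted for brevity.
    import torch
    import torch.nn as nn

    class BaseTagger(nn.Module):
        def __init__(self, vocab_size, char_vocab_size, num_tags,
                     word_dim=300, char_dim=30, hidden_dim=100):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.char_emb = nn.Embedding(char_vocab_size, char_dim)
            # Character-level CNN, max-pooled over the characters of a word.
            self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(word_dim + char_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.emissions = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, words, chars):
            # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
            batch, seq_len, max_len = chars.shape
            w = self.word_emb(words)                                   # (B, T, word_dim)
            c = self.char_emb(chars.reshape(batch * seq_len, max_len)) # (B*T, L, char_dim)
            c = self.char_cnn(c.transpose(1, 2)).max(dim=2).values     # (B*T, char_dim)
            c = c.reshape(batch, seq_len, -1)
            h, _ = self.lstm(torch.cat([w, c], dim=-1))
            # Per-tag emission scores; a CRF layer would decode these into tags.
            return self.emissions(h)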
3.2   Transfer Learning Model

In the context of NLP, recent studies have applied transfer learning to tasks such as POS tagging, NER, and semantic sequence tagging (Yang et al., 2017; Alonso and Plank, 2017). A popular mechanism is multi-task learning with a network that optimizes the feature representation for two or more tasks simultaneously, where some of the tasks are designated as target tasks and others as auxiliary tasks. In our case, the target task is slot filling and the auxiliary task is NER. Both tasks use the base model described in the previous section, with a task-specific CRF layer on top.
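A minimal sketch of this sharing scheme is shown below, with assumed names: the word/character encoder is shared between the two tasks, while each task keeps its own output layer (a task-specific CRF in our setup, replaced here by a linear layer for brevity).

    # Sketch of the multi-task setup (assumed names): a shared encoder with
    # one task-specific output head per task.
    import torch.nn as nn

    class MultiTaskTagger(nn.Module):
        def __init__(self, shared_encoder, num_slot_tags, num_ner_tags, hidden_dim=100):
            super().__init__()
            # shared_encoder: any module mapping (words, chars) to per-token
            # features of size 2 * hidden_dim, e.g. the embedding + BiLSTM
            # layers of the base model without its emission layer.
            self.encoder = shared_encoder
            self.heads = nn.ModuleDict({
                "slot": nn.Linear(2 * hidden_dim, num_slot_tags),
                "ner": nn.Linear(2 * hidden_dim, num_ner_tags),
            })

        def forward(self, words, chars, task):
            h = self.encoder(words, chars)   # shared representation (B, T, 2*hidden_dim)
            return self.heads[task](h)       # task-specific scores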
4   Experimental Setup

The objective of our experiments is to validate the hypothesis that training a slot filling model together with a semantically related task, such as NER, can improve slot filling performance. We compare the performance of Single Task Learning (STL) and Multi-Task Learning (MTL). STL uses the Bi-LSTM + CRF model described in Section 3.1 and is trained directly on the target slot filling task. MTL refers to the model of Section 3.2, in which models for slot filling and NER are trained simultaneously and some parameters are shared.

Data. We use three conversational slot filling datasets to evaluate the performance of our approach: the ATIS dataset on Airline Travel Information Systems (Tür et al., 2010), and the MIT Restaurant and MIT Movie datasets (https://groups.csail.mit.edu/sls/downloads/) (Liu et al., 2013; Liu and Lane, 2017a) on restaurant reservations and movie information respectively. Each dataset provides a number of conversational user utterances, where the tokens of each utterance are annotated with their domain-specific slot. For NER, we use two datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 (Pradhan et al., 2013); for OntoNotes, we use the Newswire section. Table 2 shows the statistics and example labels of each dataset. We use the training-test split provided by the developers of the datasets, and further split the training data into 80% training and 20% development sets.

Dataset          #sents   #tokens   #label   Label examples
Slot Filling
ATIS             4478     869       79       airport name, airline name, return date
MIT Restaurant   6128     3385      20       restaurant name, dish, price, hours
MIT Movie        7820     5953      8        actor, director, genre, title, character
NER
CoNLL 2003       14987    23624     4        person, location, organization
OntoNotes 5.0    34970    39490     18       organization, gpe, date, money, quantity

Table 2: Training data statistics.

Implementation. We use the multi-task learning implementation from (Reimers and Gurevych, 2017), adapted for our experiments. We consider slot filling as the target task and NER as the auxiliary task. We use pretrained embeddings from (Komninos and Manandhar, 2016) to initialize the word embedding layer. We did not tune the hyperparameters extensively, but followed the suggestions of a comprehensive study of hyperparameters for sequence labeling tasks (Reimers and Gurevych, 2017). The word embedding dimension, character embedding dimension, and dropout rate are set to 300, 30, and 0.25 respectively. The LSTM size is set to 100, following (Lample et al., 2016), and we use a CNN to generate the character embeddings as in (Ma and Hovy, 2016). In each training epoch we train both the target task and the auxiliary task, keeping the amount of data between them proportional. We train the network with the Adam optimizer (Kingma and Ba, 2014). Each model is trained for 50 epochs with early stopping on the target task. We evaluate the target task by computing the F1-score on the test data following the standard CoNLL-2000 evaluation (https://www.clips.uantwerpen.be/conll2000/chunking/output.html).
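As an illustration, one epoch of such proportional multi-task training could look like the following sketch; the helper names and batch layout are placeholders, and this is not the exact training code of (Reimers and Gurevych, 2017).

    # Sketch of one training epoch that alternates between the target (slot
    # filling) and auxiliary (NER) tasks. Batches are assumed to be
    # (words, chars, tags) tuples; `model`, `optimizer`, and `loss_fn` are
    # placeholders.
    import random

    def train_epoch(model, optimizer, loss_fn, slot_batches, ner_batches):
        # One way to keep the data size between the tasks proportional:
        # subsample the larger task to the size of the smaller one.
        n = min(len(slot_batches), len(ner_batches))
        mixed = [("slot", b) for b in random.sample(slot_batches, n)] + \
                [("ner", b) for b in random.sample(ner_batches, n)]
        random.shuffle(mixed)
        for task, (words, chars, tags) in mixed:
            optimizer.zero_grad()
            scores = model(words, chars, task)   # shared encoder, task-specific head
            loss = loss_fn(scores, tags)
            loss.backward()
            optimizer.step()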
5   Results and Analysis

Overall performance. Table 3 compares our Single Task Learning (STL) and Multi-Task Learning (MTL) models with the current state-of-the-art performance on each dataset. On the ATIS dataset, the performance of the STL model is comparable to most of the state-of-the-art approaches; however, not all MTL models lead to an increase in performance. On MIT Restaurant, both the STL and MTL models achieve better performance than the previously published results (Liu and Lane, 2017a). On the MIT Movie dataset, STL achieves better results than MTL by a small margin, and both STL and MTL perform better than the previous approach. When we combine CoNLL and OntoNotes into three tasks in the MTL setting, the overall performance tends to decrease across datasets compared to MTL with OntoNotes only.

Model                                      ATIS    MIT Restaurant   MIT Movie
Bi-model based (Wang et al., 2018)         96.89   -                -
Slot gated model (Goo et al., 2018)        95.20   -                -
Recurrent Attention (Liu and Lane, 2016)   95.78   -                -
Adversarial (Liu and Lane, 2017b)          95.63   74.47            85.33
Base model (STL)                           95.68   78.58            87.34
MTL with CoNLL 2003                        95.43   78.82            87.31
MTL with OntoNotes                         95.78   79.81††          87.20
MTL with CoNLL 2003 + OntoNotes            95.69   78.52            86.93

Table 3: F1 score comparison of MTL, STL and the state-of-the-art approaches. †† indicates a significant improvement over the STL baseline with p < 0.05 using approximate randomization testing.
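For reference, the significance test can be sketched as follows; the scorer f1 and the per-sentence prediction lists are placeholders, and the number of trials is an assumption.

    # Sketch of approximate randomization testing for the difference between
    # two systems' F1 scores. `f1(gold, pred)` stands for the CoNLL-style
    # scorer; `preds_a` / `preds_b` hold per-sentence predictions per system.
    import random

    def approx_randomization(gold, preds_a, preds_b, f1, trials=1000):
        observed = abs(f1(gold, preds_a) - f1(gold, preds_b))
        hits = 0
        for _ in range(trials):
            swapped_a, swapped_b = [], []
            for a, b in zip(preds_a, preds_b):
                if random.random() < 0.5:   # randomly swap the systems' outputs
                    a, b = b, a
                swapped_a.append(a)
                swapped_b.append(b)
            diff = abs(f1(gold, swapped_a) - f1(gold, swapped_b))
            if diff >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)    # p-value with add-one smoothing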
                                                                           want to know whether slots that are related to
from (Komninos and Manandhar, 2016) to initial-                            CoNLL tags perform better through MTL com-
ize the word embedding layer. We did not tune                              pared to STL, as evidence of transferable knowl-
the hyperparameters extensively, although we fol-                          edge. To this goal, we manually created a map-
lowed the suggestions in a comprehensive study of                          ping between NER CoNLL tags and slot tags
hyperparameters in sequence labeling tasks from                            for each dataset. For example in the ATIS
(Reimers and Gurevych, 2017). The word and                                 dataset, some of the slots that are related to the
character embedding dimensions, and dropout rate                           LOC tags are fromloc.airport name and
are set to 300, 30, and 0.25 respectively. The                             fromloc.city name. We compute the micro-
LSTM size is set to 100 following (Lample et al.,                          F1 scores for the slots based on this mapping. Ta-
2016). We use CNN to generate the character em-                            ble 4 shows the performance of the slots related
bedding as in (Ma and Hovy, 2016). For each                                to CoNLL tags on the development set. For the
epoch in the training, we train both the target task                       ATIS and MIT Restaurant datasets we can see
and the auxiliary task and keep the data size be-                          that MTL improves the performance in recogniz-
tween them proportional. We train the network us-                          ing LOC related tags. While for the MIT Movie
ing Adam (Kingma and Ba, 2014) optimizer. Each                             dataset, MTL suffers from performance decrease
model is trained for 50 epochs with early stopping                         on PER tag. There are three slots related to PER
on the target task. We evaluate the performance                            in MIT Movie namely CHARACTER, ACTOR, and
of the target task by computing the F1-score of                            DIRECTOR. We found that the decrease is on
the test data following the standard CoNLL-2000                            DIRECTOR while for ACTOR and CHARACTER
evaluation2 .                                                              there is actually an improvement. We sample 10
                                                                           sentences in which the model makes mistakes on
5     Results and Analysis                                                 DIRECTOR tag. Of these sentences, four sen-
                                                                           tences are wrongly annotated. Another four sen-
Overall performance. Table 3 shows the com-                                tences are errors by the model although the sen-
parison of our Single Task Learning (STL) and                              tence seems easy, typically the model is confused
Multi-Task Learning (MTL) models with the cur-                             between DIRECTOR and ACTOR. The rests are
rent state of the art performance for each dataset.                        difficult sentences. For example, the sentence:
For the ATIS dataset, the performance of the STL                           “Can you name Akira Kurusawas first color film”.
model is comparable to most of the state-of-the-art                        This sentence is somewhat general and the model
   2
     https://www.clips.uantwerpen.be/conll2000/chunking/                   needs more information to discriminate between
output.html                                                                ACTOR and DIRECTOR.
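A sketch of this mapping-based evaluation is given below; the mapping shown is only a small illustrative subset (not our full mapping), and the span representation is an assumption.

    # Sketch of the per-NER-tag evaluation: slots are grouped by a manually
    # defined mapping to CoNLL tags and a micro-averaged F1 is computed per
    # group. The mapping below is only an illustrative subset.
    ATIS_SLOT_TO_CONLL = {
        "fromloc.airport_name": "LOC",
        "fromloc.city_name": "LOC",
        # ... further entries omitted
    }

    def micro_f1_for_tag(gold_spans, pred_spans, slot_to_conll, conll_tag):
        # gold_spans / pred_spans: sets of (sentence_id, start, end, slot) tuples.
        gold = {s for s in gold_spans if slot_to_conll.get(s[3]) == conll_tag}
        pred = {s for s in pred_spans if slot_to_conll.get(s[3]) == conll_tag}
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)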
Low resource scenario. In Table 5 we compare STL and MTL under varying numbers of training sentences to simulate low-resource scenarios. We did not run MTL with both CoNLL and OntoNotes, as the results in Table 3 show that performance tends to degrade when both resources are included. For MIT Restaurant, MTL consistently gives better results in all low-resource scenarios, and it is evident that the fewer training sentences are available, the more helpful MTL is. For ATIS and MIT Movie, MTL performs better than STL except in the 400-sentence scenario. We suspect that a different training strategy is needed to obtain a more consistent MTL improvement across low-resource scenarios. In our current experiments, the amount of training data is kept proportional between the target and auxiliary tasks. In the future, we would like to try other training strategies, such as using the full training data of the auxiliary task: since the target-task data is much smaller, we plan to repeat the batches of the target task until all batches of the auxiliary task have been processed in an epoch, a strategy similar to (Jaech et al., 2016).

Dataset          #training sents   STL     MTL-C     MTL-O
ATIS             200               84.37   83.15     84.97
                 400               87.04   86.54     86.93
                 800               90.67   91.15     91.58††
MIT Restaurant   200               54.65   56.95††   56.79
                 400               62.91   63.91     62.29
                 800               68.15   68.52     68.47
MIT Movie        200               69.97   71.11††   69.78
                 400               75.88   75.23     75.18
                 800               79.33   80.28††   78.65

Table 5: Performance comparison in low-resource scenarios. MTL-C and MTL-O are MTL models trained on the CoNLL and OntoNotes datasets respectively. †† indicates a significant improvement over STL with p < 0.05 using approximate randomization testing.
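The alternative strategy described above, using all auxiliary batches and cycling over the smaller set of target-task batches, can be sketched as follows; the names are placeholders and this is not an implemented part of the present experiments.

    # Sketch of the alternative epoch strategy: use every auxiliary (NER) batch
    # and cycle over the smaller set of target (slot filling) batches until the
    # auxiliary data is exhausted. Names are placeholders.
    from itertools import cycle

    def train_epoch_full_aux(model, optimizer, loss_fn, slot_batches, ner_batches):
        for slot_batch, ner_batch in zip(cycle(slot_batches), ner_batches):
            for task, (words, chars, tags) in (("slot", slot_batch), ("ner", ner_batch)):
                optimizer.zero_grad()
                loss = loss_fn(model(words, chars, task), tags)
                loss.backward()
                optimizer.step()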
Regarding the variation of results between CoNLL and OntoNotes, we believe that selecting promising auxiliary tasks, or selecting data from a particular auxiliary task, is important to alleviate negative transfer, as has also been shown empirically by (Ruder and Plank, 2017; Bingel and Søgaard, 2017). Another way to reduce negative transfer, which would be interesting to explore in the future, is to use a model that can decide which knowledge to share (or not to share) among tasks (Ruder et al., 2017; Meyerson and Miikkulainen, 2017).
6   Conclusion

In this work we train a domain-specific slot filling model with added NER information, under the assumption that NER provides useful "general" labels and that NER data is cheaper to obtain than task-specific slot filling datasets. We use multi-task learning to transfer the knowledge learned from NER to the slot filling task. Our experiments show evidence that we can achieve comparable or better performance with respect to state-of-the-art approaches and to single task learning, both with full training data and in low-resource scenarios. In the future, we are interested in working on datasets in Italian and in exploring more sophisticated multi-task learning strategies.

Acknowledgments

We would like to thank the three anonymous reviewers and Simone Magnolini, Marco Guerini, and Serra Sinem Tekiroğlu for helpful comments and feedback. This work was supported by a Fondazione Bruno Kessler PhD scholarship.
References

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In 15th Conference of the European Chapter of the Association for Computational Linguistics.

Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 164–169. Association for Computational Linguistics.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 753–757.

Dilek Z. Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In INTERSPEECH.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Aaron Jaech, Larry P. Heck, and Mari Ostendorf. 2016. Domain adaptation of recurrent neural networks for natural language understanding. In INTERSPEECH.

Rahul Jha, Alex Marin, Suvamsh Shivaprasad, and Imed Zitouni. 2018. Bag of experts architectures for model reuse in conversational language understanding. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 153–161.

Young-Bum Kim, Karl Stratos, and Dongchan Kim. 2017. Domain attention with an ensemble of experts. In ACL.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Alexandros Komninos and Suresh Manandhar. 2016. Dependency based embeddings for sentence classification tasks. In HLT-NAACL.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016.

Bing Liu and Ian Lane. 2017a. Multi-domain adversarial learning for slot filling in spoken language understanding. In NIPS Workshop on Conversational AI.

Bing Liu and Ian Lane. 2017b. Multi-Domain Adversarial Learning for Slot Filling in Spoken Language Understanding.

Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers, and James R. Glass. 2013. Query understanding enhanced by hierarchical parsing structures. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pages 72–77.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.

Elliot Meyerson and Risto Miikkulainen. 2017. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. arXiv preprint arXiv:1711.00108.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? arXiv preprint arXiv:1603.06111.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark.

Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.

Gökhan Tür, Dilek Z. Hakkani-Tür, and Larry P. Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, pages 19–24.

Gokhan Tur. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley and Sons, New York, NY, January.

Yu Wang, Yilin Shen, and Hongxia Jin. 2018. A bi-model based RNN semantic frame parsing model for intent detection and slot filling. In NAACL.

Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345.