<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From General to Specific: Leveraging Named Entity Recognition for Slot Filling in Conversational Language Understanding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samuel Louvan</string-name>
          <email>slouvan@fbk.eu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <email>magnini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Fondazione Bruno Kessler</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Slot filling techniques are often adopted in language understanding components for task-oriented dialogue systems. In recent approaches, neural models for slot filling are trained on domain-specific datasets, making it difficult to port them to similar domains when few or no training data are available. In this paper we use multi-task learning to leverage general knowledge of a task, namely Named Entity Recognition (NER), to improve slot filling performance on a semantically similar domain-specific task. Our experiments show that, for some datasets, transfer learning from NER can achieve competitive performance compared with the state of the art and can also help slot filling in low resource scenarios.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In dialogue systems, semantic information of an
utterance is generally represented with a semantic
frame, a data structure consisting of a domain, an
intent, and a number of slots
        <xref ref-type="bibr" rid="ref27">(Tur, 2011)</xref>
        . For
example, given the utterance “I’d like a United
Airlines flight on Wednesday from San Francisco to
Boston”, the domain is flight, the intent
is booking, and the slot fillers are United
Airlines (for the slot airline name), Wednesday
(booking time), San Francisco (origin),
and Boston (destination). Automatically
extracting this information involves domain
identification, intent classification, and slot filling, which
is the focus of our work.
      </p>
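      <p>For illustration, such a semantic frame can be represented as a simple data structure. The following Python sketch is ours, not part of any specific dialogue toolkit; the field and slot names are illustrative.</p>
      <preformat>
# A minimal sketch of a semantic frame for the example utterance.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SemanticFrame:
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)

frame = SemanticFrame(
    domain="flight",
    intent="booking",
    slots={
        "airline_name": "United Airlines",
        "booking_time": "Wednesday",
        "origin": "San Francisco",
        "destination": "Boston",
    },
)
</preformat>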
      <p>
        Slots are usually domain specific as they are
predefined for each domain. For instance, in the
flight domain the slots might be airline name,
booking time, and airport name, while in
the bus domain the slots might be pickup time,
bus name, and travel duration. Recent
successful approaches related to slot filling tasks
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref19 ref2 ref22 ref23 ref24 ref28 ref29 ref4 ref9">(Wang et al., 2018; Liu and Lane, 2017a; Goo et
al., 2018)</xref>
        are based on variants of recurrent
neural network architectures. In general, there are two
ways of approaching the task: (i) by training a
single model for each domain; or (ii) by
performing domain adaptation, which results in a model
that learns better feature representations across
domains. All these approaches directly train the
models on domain-specific slot filling datasets.
      </p>
      <p>
        In our work, instead of using a domain-specific
slot filling dataset, which can be expensive to
obtain because it is task specific, we propose to leverage
knowledge gained from a more “general”, but
semantically related, task, referred to as the auxiliary
task, and then transfer the learned knowledge to
the more specific task, namely slot filling, referred
to as the target task, through transfer learning. In the
literature, the term transfer learning can be used
in different ways. We follow the definition from
        <xref ref-type="bibr" rid="ref20">(Mou et al., 2016)</xref>
        , in which transfer learning is
viewed as a paradigm which enables a model to
use knowledge from auxiliary tasks to help the
target task. There are several ways to train this
model: we can directly use the trained parameters
of the auxiliary tasks to initialize the parameters
in the target task (pre-train &amp; fine-tuning), or train
a model of auxiliary and target tasks
simultaneously, where some parameters are shared
(multi-task learning).
      </p>
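      <p>As a minimal sketch of the first option (pre-train and fine-tune), assuming PyTorch-style models that expose a compatible encoder submodule (an assumption of this illustration):</p>
      <preformat>
def init_from_auxiliary(target_model, auxiliary_model):
    """Pre-train and fine-tune: copy the encoder parameters trained on
    the auxiliary task into the target-task model, which is then
    fine-tuned on the target task alone. Both models are assumed to
    expose a compatible .encoder submodule."""
    target_model.encoder.load_state_dict(auxiliary_model.encoder.state_dict())
</preformat>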
      <p>
        We propose to train a slot filling model jointly
with Named Entity Recognition (NER) as an
auxiliary task through multi-task learning
        <xref ref-type="bibr" rid="ref3">(Caruana,
1997)</xref>
        . Recent studies have shown the potential
of multi-task learning in NLP models. For
example,
        <xref ref-type="bibr" rid="ref20">(Mou et al., 2016)</xref>
        empirically evaluate
transfer learning in sentence and question classification
tasks.
        <xref ref-type="bibr" rid="ref29">(Yang et al., 2017)</xref>
        propose an approach for
transfer learning in sequence tagging tasks.
      </p>
      <p>
        NER is chosen as the auxiliary task for several
reasons. First, named entities frequently occur as
slot values in several domains, which makes them
relevant general knowledge to exploit. The same
NER type can refer to different slots in the same
utterance: in the previous example utterance,
the NER labels are LOC for both San Francisco
and Boston, and ORG for United Airlines.
Second, state-of-the-art performance of NER
        <xref ref-type="bibr" rid="ref11 ref12 ref14 ref18 ref7">(Lample et al., 2016; Ma and Hovy, 2016)</xref>
        is relatively
high, therefore we expect that the transferred
feature representation can be useful for slot filling
tasks. Third, large annotated NER corpora are
easier to obtain compared to domain-specific slot
filling datasets.
      </p>
      <p>
        The contributions of this work are as
follows: we investigate the effectiveness of
leveraging Named Entity Recognition as an auxiliary
task to learn general knowledge, and transfer this
knowledge to slot filling as the target task in a
multi-task learning setting. To our knowledge,
there is no reported work that uses NER
transfer learning for slot filling in conversational
language understanding. Our experiments show that
for some datasets multi-task learning achieves
better overall performance compared to previous
published results, and performs better in some low
resource scenarios.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2 Related Work</title>
      <p>
        Recent approaches to slot filling for
conversational agents are based mostly on neural models.
The work by
        <xref ref-type="bibr" rid="ref28">(Wang et al., 2018)</xref>
        introduces a
bi-model Recurrent Neural Network (RNN) structure
to consider the cross-impact between intent detection
and slot filling.
        <xref ref-type="bibr" rid="ref14">(Liu and Lane, 2016)</xref>
        propose
an attention mechanism on the encoder-decoder
model for joint intent classification and slot filling.
        <xref ref-type="bibr" rid="ref4">(Goo et al., 2018)</xref>
        extend the attention mechanism
using a slot-gated model to learn relationships
between slot and intent attention vectors. The work
by
        <xref ref-type="bibr" rid="ref5">(Hakkani-Tür et al., 2016)</xref>
        uses a bidirectional
RNN as a single model that handles multiple
domains by adding a final state that contains a domain
identifier.
        <xref ref-type="bibr" rid="ref8 ref9">(Jha et al., 2018; Kim et al., 2017)</xref>
        use
expert-based domain adaptation, while
        <xref ref-type="bibr" rid="ref7">(Jaech et al.,
2016)</xref>
        propose a multi-task learning approach to
guide the training of a model for new domains.
All of these studies train their models solely on
slot filling datasets, while our focus is to
leverage more “general” resources, such as NER, by
training the model simultaneously with slot filling
through multi-task learning.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Model</title>
      <p>In this section we describe the base model that we
use for the slot filling task and the transfer learning
model between NER and slot filling.</p>
      <sec id="sec-2-1">
        <title>3.1 Base Model</title>
        <p>The model that we use is a hierarchical neural
model, as this architecture has been shown to achieve state-of-the-art
results in sequence tagging tasks such as named
entity recognition (Ma and Hovy, 2016; Lample et al., 2016).</p>
        <p>[Table 1: An example utterance with its slot tags, e.g., Boston tagged as B-toloc.]</p>
        <p>[Table 2: Statistics (#sents, #tokens, #labels) and example labels for each dataset: ATIS, 79 labels (airport name, airline name, return date); MIT Restaurant, 20 (restaurant name, dish, price, hours); MIT Movie, 8 (actor, director, genre, title, character); CoNLL 2003, 4 (person, location, organization); OntoNotes, 18 (organization, gpe, date, money, quantity).]</p>
        <p>[Figure 1: Overall architecture; the slot filling and NER models are trained simultaneously and some parameters are shared.]</p>
        <p>
          Figure 1 depicts the overall
architecture of the model. The model consists of
several stacked bidirectional RNNs and a CRF layer
on top to compute the final output. The input of
the model are both words and characters in the
sentence. Each word is represented with a word
embedding, which is simply a lookup table. Each
word embedding is concatenated with its character
representation. The character representation itself
can be composed from a concatenation of the
final state of bidirectional LSTM
          <xref ref-type="bibr" rid="ref6">(Hochreiter and
Schmidhuber, 1997)</xref>
          over characters in a word or
extracted using a Convolutional Neural Network
(CNN)
          <xref ref-type="bibr" rid="ref13">(LeCun et al., 1998)</xref>
          . The concatenation of
word and character embeddings is then passed to a
LSTM cell. The output of the LSTM in each time
step is then fed to a CRF layer. Finally, the output
of the CRF layer is the slot tag for a word in the
sentence, as shown in Table 1.
        </p>
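        <p>As a concrete illustration, the following PyTorch-style sketch assembles the layers described above (word embeddings, a character CNN, and a Bi-LSTM). It is a simplified approximation of our model: for brevity the CRF layer is replaced by a per-token linear projection over tag scores.</p>
        <preformat>
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Sketch of the base model: word embeddings concatenated with a
    CNN-based character representation, fed to a Bi-LSTM. A CRF layer
    would replace the final linear projection used here."""

    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=300, char_dim=30, lstm_size=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # CNN over the characters of each word, max-pooled to one vector.
        self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_dim, lstm_size,
                            bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * lstm_size, n_tags)  # stand-in for CRF

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, s, w = chars.shape
        c = self.char_emb(chars).view(b * s, w, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        h, _ = self.lstm(x)
        return self.emissions(h)  # per-token tag scores
</preformat>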
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Transfer Learning Model</title>
        <p>
          In the context of NLP, recent studies have applied
transfer learning in tasks such as POS tagging,
NER, and semantic sequence tagging
          <xref ref-type="bibr" rid="ref1 ref15 ref16 ref19 ref2 ref22 ref23 ref24 ref29 ref29 ref9">(Yang et al.,
2017; Alonso and Plank, 2017)</xref>
          . In general, a
popular mechanism is multi-task learning with a
network that optimizes the feature representation
for two or more tasks simultaneously. In
particular, among the tasks we can designate target tasks and
auxiliary tasks. In our case, the target task is
slot filling and the auxiliary task is NER.
Both tasks use the base model
explained in the previous section, with a task-specific
CRF layer on top.
        </p>
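        <p>A minimal sketch of this setup under our assumptions: a shared encoder (e.g., the embedding and Bi-LSTM layers of the base model, returning hidden states) with one task-specific output layer per task; the task-specific CRF layers of the actual model are again approximated here by linear projections.</p>
        <preformat>
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Shared encoder with a task-specific output layer per task
    (slot filling and NER). encoder(words, chars) is assumed to
    return Bi-LSTM hidden states of size `hidden`."""

    def __init__(self, encoder, hidden, n_tags_per_task):
        super().__init__()
        self.encoder = encoder  # shared embeddings + Bi-LSTM
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden, n_tags)
            for task, n_tags in n_tags_per_task.items()
        })

    def forward(self, task, words, chars):
        h = self.encoder(words, chars)  # shared representation
        return self.heads[task](h)      # task-specific tag scores
</preformat>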
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Experimental Setup</title>
      <p>The objective of our experiments is to validate the
hypothesis that training a slot filling model jointly
with a semantically related task, such as NER, can
improve slot filling performance. We
compare the performance of Single Task Learning
(STL) and Multi-Task Learning (MTL). STL uses
the Bi-LSTM + CRF model described in Section 3.1
and is trained directly on the target slot filling
task. MTL refers to the setup of Section 3.2, in which models for
slot filling and NER are trained simultaneously.</p>
      <p>
        Data. We use three conversational slot filling
datasets to evaluate the performance of our
approach: the ATIS dataset on Airline Travel
Information Systems
        <xref ref-type="bibr" rid="ref26">(Tu¨r et al., 2010)</xref>
        , the MIT
Restaurant and the MIT Movie datasets1
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref17 ref19 ref2 ref22 ref23 ref24 ref29 ref9">(Liu
et al., 2013; Liu and Lane, 2017a)</xref>
        on
restaurant reservations and movie information
respectively. Each dataset provides a number of
conversational user utterances, where tokens in the
utterance are annotated with their domain-specific
slot. As for NER, we use two datasets:
CoNLL 2003
        <xref ref-type="bibr" rid="ref25">(Tjong Kim Sang and De Meulder,
2003)</xref>
        and Ontonotes 5.0
        <xref ref-type="bibr" rid="ref21">(Pradhan et al., 2013)</xref>
        .
For OntoNotes, we use the Newswire section for
our experiments. Table 2 shows the statistics
and example labels of each dataset. We use the
training-test split provided by the developers of
the datasets, and have further split the training data
into 80% training and 20% development sets.
      </p>
      <p>[Table 3: Slot filling F1 on the test sets. ATIS: Bi-model based (Wang et al., 2018) 96.89; Slot gated model (Goo et al., 2018) 95.20; Recurrent Attention (Liu and Lane, 2016) 95.78; Adversarial (Liu and Lane, 2017b) 95.63; Base model (STL) 95.68; MTL with CoNLL 2003 95.43; MTL with OntoNotes 95.78; MTL with CoNLL 2003 + OntoNotes 95.69. MIT Restaurant: Adversarial 74.47; Base model (STL) 78.58; MTL with CoNLL 2003 78.82; MTL with OntoNotes 79.81; MTL with CoNLL 2003 + OntoNotes 78.52.]</p>
      <p>
        Implementation. We use the multi-task
learning implementation from
        <xref ref-type="bibr" rid="ref22">(Reimers and Gurevych,
2017)</xref>
        and have adapted it for our experiments. We
consider slot filling as the target task and NER as
the auxiliary task. We use a pretrained embedding
from
        <xref ref-type="bibr" rid="ref11">(Komninos and Manandhar, 2016)</xref>
        to
initialize the word embedding layer. We did not tune
the hyperparameters extensively, although we
followed the suggestions in a comprehensive study of
hyperparameters in sequence labeling tasks from
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref19 ref2 ref22 ref23 ref24 ref29 ref9">(Reimers and Gurevych, 2017)</xref>
        . The word and
character embedding dimensions, and dropout rate
are set to 300, 30, and 0.25 respectively. The
LSTM size is set to 100 following
        <xref ref-type="bibr" rid="ref12">(Lample et al.,
2016)</xref>
        . We use CNN to generate the character
embedding as in
        <xref ref-type="bibr" rid="ref11 ref14 ref18 ref7">(Ma and Hovy, 2016)</xref>
        . In each
training epoch, we train both the target task
and the auxiliary task and keep the data size
between them proportional. We train the network
using the Adam
        <xref ref-type="bibr" rid="ref10">(Kingma and Ba, 2014)</xref>
        optimizer. Each
model is trained for 50 epochs with early stopping
on the target task. We evaluate the performance
of the target task by computing the F1-score on
the test data following the standard CoNLL-2000
evaluation (https://www.clips.uantwerpen.be/conll2000/chunking/output.html).
      </p>
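      <p>The following sketch illustrates this training regime under our assumptions; the loss and development-F1 functions are placeholders supplied by the caller, not part of a specific library.</p>
      <preformat>
import random
import torch

def interleave(slot_batches, ner_batches):
    """Tag each batch with its task and shuffle, so that within every
    epoch the two tasks are trained in proportion to their data sizes."""
    mixed = [("slots", b) for b in slot_batches]
    mixed += [("ner", b) for b in ner_batches]
    random.shuffle(mixed)
    return mixed

def train_mtl(model, loss_fn, dev_f1_fn, slot_batches, ner_batches,
              epochs=50, patience=5):
    """Adam, up to 50 epochs, early stopping on the target task's dev F1.
    loss_fn(model, task, batch) and dev_f1_fn(model) are caller-supplied."""
    opt = torch.optim.Adam(model.parameters())
    best_f1, bad = 0.0, 0
    for _ in range(epochs):
        for task, batch in interleave(slot_batches, ner_batches):
            opt.zero_grad()
            loss_fn(model, task, batch).backward()
            opt.step()
        f1 = dev_f1_fn(model)  # target-task (slot filling) dev F1
        if f1 > best_f1:
            best_f1, bad = f1, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_f1
</preformat>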
    </sec>
    <sec id="sec-4">
      <title>5 Results and Analysis</title>
      <sec id="sec-4-1">
        <title>Overall performance</title>
        <p>
          Table 3 shows the comparison of our Single Task Learning (STL) and
Multi-Task Learning (MTL) models with the
current state of the art performance for each dataset.
For the ATIS dataset, the performance of the STL
model is comparable to most of the state-of-the-art
approaches; however, not all MTL models lead to
an increase in performance. As for the MIT
Restaurant, both STL and MTL models achieve
better performance compared to the previously
published results
          <xref ref-type="bibr" rid="ref1 ref15 ref16 ref19 ref2 ref22 ref23 ref24 ref29 ref9">(Liu and Lane, 2017a)</xref>
          . For the
MIT Movie dataset, STL achieves better results by
a small margin over MTL. Both STL and MTL
perform better than the previous approach on the
MIT Movie dataset. When we combine CoNLL
and OntoNotes into three tasks in the MTL setting,
the overall performance tends to decrease across
datasets compared to MTL with OntoNotes only.
Per slot performance. Although MTL does not
necessarily improve overall
performance, we analyze the per slot performance on
the development set to get a better
understanding of the model’s behaviour. In particular, we
want to know whether slots that are related to
CoNLL tags perform better through MTL
compared to STL, as evidence of transferable
knowledge. To this goal, we manually created a
mapping between NER CoNLL tags and slot tags
for each dataset. For example in the ATIS
dataset, some of the slots that are related to the
LOC tags are fromloc.airport name and
fromloc.city name. We compute the
micro-F1 scores for the slots based on this mapping.
Table 4 shows the performance of the slots related
to CoNLL tags on the development set. For the
ATIS and MIT Restaurant datasets we can see
that MTL improves the performance in
recognizing LOC related tags. For the MIT Movie
dataset, in contrast, MTL suffers a performance decrease
on the PER tag. There are three slots related to PER
in MIT Movie, namely CHARACTER, ACTOR, and
DIRECTOR. We found that the decrease is on
DIRECTOR, while for ACTOR and CHARACTER
there is actually an improvement. We sampled 10
sentences in which the model makes mistakes on the
DIRECTOR tag. Of these sentences, four
are wrongly annotated. Another four
are errors by the model even though the
sentence seems easy; typically the model is confused
between DIRECTOR and ACTOR. The rest are
difficult sentences. For example, consider the sentence:
“Can you name Akira Kurusawas first color film”.
This sentence is somewhat general and the model
needs more information to discriminate between
ACTOR and DIRECTOR.
        </p>
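        <p>A sketch of this analysis, assuming the manual mapping is a dictionary from slot name to CoNLL tag (the entries and underscore naming below are illustrative) and that gold and predicted slots are given as (label, span) pairs per sentence:</p>
        <preformat>
from collections import Counter

# Illustrative fragment of the manual slot-to-CoNLL-tag mapping (ATIS).
SLOT_TO_NER = {
    "fromloc.airport_name": "LOC",
    "fromloc.city_name": "LOC",
    "airline_name": "ORG",
}

def micro_f1_by_ner_tag(gold, pred, mapping):
    """Micro-F1 over slots grouped by their mapped CoNLL tag.
    gold/pred: per-sentence lists of (slot_label, span) pairs."""
    counts = Counter()
    for g_sent, p_sent in zip(gold, pred):
        g = {x for x in g_sent if x[0] in mapping}
        p = {x for x in p_sent if x[0] in mapping}
        for slot, _ in g.intersection(p):
            counts[(mapping[slot], "tp")] += 1
        for slot, _ in p.difference(g):
            counts[(mapping[slot], "fp")] += 1
        for slot, _ in g.difference(p):
            counts[(mapping[slot], "fn")] += 1
    scores = {}
    for tag in {t for t, _ in counts}:
        tp, fp, fn = (counts[(tag, k)] for k in ("tp", "fp", "fn"))
        denom = 2 * tp + fp + fn
        scores[tag] = 2 * tp / denom if denom else 0.0
    return scores
</preformat>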
      </sec>
      <sec id="sec-4-2">
        <title>Low resource scenario</title>
        <p>
          In Table 5 we compare STL and MTL under varying numbers of training
sentences to simulate low resource scenarios. We
did not perform MTL including both CoNLL and
OntoNotes, as the results from Table 3 show that
performance tends to degrade when we include
both resources. For the MIT Restaurant dataset, MTL consistently gives
better results in all
low resource scenarios; it is
evident that the fewer training sentences
we have, the more helpful MTL is. For
ATIS and MIT Movie, MTL performs better than
STL except in the 400-sentence training scenario.
We suspect that to have a more consistent MTL
improvement in different low resource scenarios,
a different training strategy is needed. In our
current experiments, the amount of training data is kept
proportional between the target task and the auxiliary
task. In the future, we would like to try other
training strategies, such as using the full training data
from the auxiliary task. As the data from the target
task is much smaller, we plan to repeat the batch
of the target task until we finish training all the
batches from the auxiliary task in an epoch. This
strategy is similar to
          <xref ref-type="bibr" rid="ref7">(Jaech et al., 2016)</xref>
          .
        </p>
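        <p>A sketch of this alternative strategy (our own illustration of the planned setup, not the implementation used in the experiments above):</p>
        <preformat>
from itertools import cycle

def paired_batches(target_batches, auxiliary_batches):
    """Cycle the small set of target-task batches so that the full
    auxiliary data is consumed in each epoch; yields one
    (target_batch, auxiliary_batch) pair per auxiliary batch."""
    repeated_target = cycle(target_batches)
    for aux in auxiliary_batches:
        yield next(repeated_target), aux
</preformat>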
        <p>
          Regarding the variation in results that we get
from CoNLL or OntoNotes, we believe that
selecting promising auxiliary tasks, or selecting data
from a particular auxiliary task, is important to
alleviate negative transfer. This has also been
shown empirically in
          <xref ref-type="bibr" rid="ref1 ref1 ref15 ref15 ref16 ref16 ref19 ref19 ref2 ref2 ref22 ref22 ref23 ref23 ref24 ref24 ref29 ref29 ref9 ref9">(Ruder and Plank, 2017;
Bingel and Søgaard, 2017)</xref>
          . Another alternative to
reduce negative transfer, which would be
interesting to try in the future, is by using a model which
can decide which knowledge to share (or not to
share) among tasks
          <xref ref-type="bibr" rid="ref1 ref15 ref16 ref19 ref2 ref22 ref23 ref23 ref24 ref24 ref29 ref9">(Ruder et al., 2017; Meyerson
and Miikkulainen, 2017)</xref>
          .
6
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion</title>
      <p>In this work we train a domain-specific slot filling
model with added NER information, under the
assumption that NER introduces useful “general”
labels and that NER data is cheaper to obtain than
task-specific slot filling datasets. We use
multi-task learning to transfer the knowledge learned
from NER to the slot filling task. Our experiments
show evidence that we can achieve comparable or
better performance with respect to state-of-the-art
approaches and to single task learning, both in
full training data and in low resource scenarios. In
the future, we are interested in working on datasets
in Italian and in exploring more sophisticated
multi-task learning strategies.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the three anonymous
reviewers and Simone Magnolini, Marco Guerini, and Serra
Sinem Tekiroğlu for helpful comments and
feedback. This work was supported by a
Fondazione Bruno Kessler PhD scholarship.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Héctor</given-names>
            <surname>Martínez Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Plank</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>When is multitask learning effective? semantic sequence prediction under varying data conditions</article-title>
          .
          <source>In 15th Conference of the European Chapter of the Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Bingel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anders</given-names>
            <surname>Søgaard</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Identifying beneficial task relations for multi-task learning in deep neural networks</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>169</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Multitask learning</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>28</volume>
          :
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Chih-Wen</given-names>
            <surname>Goo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guang</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yun-Kai</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chih-Li</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tsung-Chieh</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Keng-Wei</given-names>
            <surname>Hsu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yun-Nung</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Slot-gated modeling for joint slot filling and intent prediction</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers)</source>
          , pages
          <fpage>753</fpage>
          -
          <lpage>757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Dilek Z.</given-names>
            <surname>Hakkani-Tür</surname>
          </string-name>
          , Gökhan Tür, Asli Çelikyilmaz,
          <string-name>
            <surname>Yun-Nung</surname>
            <given-names>Chen</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <surname>Ye-Yi Wang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Multi-domain joint semantic frame parsing using bi-directional rnn-lstm</article-title>
          .
          <source>In INTERSPEECH.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Jaech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Larry P.</given-names>
            <surname>Heck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mari</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Domain adaptation of recurrent neural networks for natural language understanding</article-title>
          .
          <source>In INTERSPEECH.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Rahul</given-names>
            <surname>Jha</surname>
          </string-name>
          , Alex Marin, Suvamsh Shivaprasad, and
          <string-name>
            <given-names>Imed</given-names>
            <surname>Zitouni</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bag of experts architectures for model reuse in conversational language understanding</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>3</volume>
          (Industry Papers)
          , pages
          <fpage>153</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Young-Bum</given-names>
            <surname>Kim</surname>
          </string-name>
          , Karl Stratos, and
          <string-name>
            <given-names>Dongchan</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Domain attention with an ensemble of experts</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and Jimmy Ba
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Alexandros</given-names>
            <surname>Komninos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Suresh</given-names>
            <surname>Manandhar</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Dependency based embeddings for sentence classification tasks</article-title>
          .
          <source>In HLT-NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          , Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Dyer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          , pages
          <fpage>260</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Yann</surname>
            <given-names>LeCun</given-names>
          </string-name>
          , Léon Bottou, Yoshua Bengio, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Haffner</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>86</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Lane</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Attention-based recurrent neural network models for joint intent detection and slot filling</article-title>
          .
          <source>In Interspeech</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Lane</surname>
          </string-name>
          .
          <article-title>2017a. Multi-domain adversarial learning for slot filling in spoken language understanding</article-title>
          .
          <source>In NIPS Workshop on Conversational AI</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Lane</surname>
          </string-name>
          .
          <article-title>2017b. Multi-Domain Adversarial Learning for Slot Filling in Spoken Language Understanding</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Jingjing</given-names>
            <surname>Liu</surname>
          </string-name>
          , Panupong Pasupat,
          <string-name>
            <given-names>Yining</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Cyphers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James R.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Query understanding enhanced by hierarchical parsing structures</article-title>
          .
          <source>In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding</source>
          , Olomouc, Czech Republic, December 8-12, 2013, pages
          <fpage>72</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Xuezhe</given-names>
            <surname>Ma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>End-to-end sequence labeling via bi-directional lstm-cnns-crf</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>1064</fpage>
          -
          <lpage>1074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Elliot</given-names>
            <surname>Meyerson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Risto</given-names>
            <surname>Miikkulainen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Beyond shared hierarchies: Deep multitask learning through soft layer ordering</article-title>
          .
          <source>arXiv preprint arXiv:1711.00108</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Lili</given-names>
            <surname>Mou</surname>
          </string-name>
          , Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and
          <string-name>
            <given-names>Zhi</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>How transferable are neural networks in NLP applications?</article-title>
          <source>arXiv preprint arXiv:1603.06111</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Pradhan</surname>
          </string-name>
          , Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and
          <string-name>
            <given-names>Zhi</given-names>
            <surname>Zhong</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Towards robust linguistic analysis using ontonotes</article-title>
          .
          <source>In Proceedings of the Seventeenth Conference on Computational Natural Language Learning</source>
          , pages
          <fpage>143</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>348</lpage>
          , Copenhagen, Denmark, September.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Plank</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning to select data for transfer learning with bayesian optimization</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>372</fpage>
          -
          <lpage>382</lpage>
          , Copenhagen, Denmark, September. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          , Joachim Bingel, Isabelle Augenstein, and
          <string-name>
            <given-names>Anders</given-names>
            <surname>Søgaard</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning what to share between loosely related tasks</article-title>
          .
          <source>arXiv preprint arXiv:1705.08142</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Erik F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          and Fien De Meulder
          .
          <year>2003</year>
          .
          <article-title>Introduction to the conll-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume</source>
          <volume>4</volume>
          , pages
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Gökhan</given-names>
            <surname>Tür</surname>
          </string-name>
          , Dilek Z. Hakkani-Tür, and
          <string-name>
            <given-names>Larry P.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>What is left to be understood in ATIS?</article-title>
          In
          <source>2010 IEEE Spoken Language Technology Workshop</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Gokhan</given-names>
            <surname>Tur</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Spoken Language Understanding: Systems for Extracting Semantic Information from Speech</article-title>
          . John Wiley and Sons, New York, NY, January.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yilin</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hongxia</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A bi-model based rnn semantic frame parsing model for intent detection and slot filling</article-title>
          .
          <source>In NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and William W Cohen.
          <year>2017</year>
          .
          <article-title>Transfer learning for sequence tagging with hierarchical recurrent networks</article-title>
          .
          <source>arXiv preprint arXiv:1703.06345</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>