A hybrid convolutional and recurrent network approach for conversational AI in spoken language understanding

1st Bassel Zaity, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, bassel.zaity@gmail.com
2nd Hazem Wannous, IMT Lille Douai, CRIStAL UMR 9189, University of Lille, Lille, France, hazem.wannous@univ-lille.fr
3rd Zein Shaheen, Computer Intelligent Technologies, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, shahin.z@edu.spbstu.ru
4th Igor Chernoruckiy, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, igcher@spbstu.ru
5th Pavel Drobintsev, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, drob@ics2.ecd.spbstu.ru
6th Vadim Pak, Computer Intelligent Technologies, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, vadim.pak@cit.icc.spbstu.ru

Abstract—The deep learning revolution has had an impact on almost all parts of our lives: it has brought us improved instant machine translation and modern, human-like conversational voice assistants such as Siri, Alexa, and Alisa. This revolution became possible because of deep learning methods that use multiple processing layers to learn a hierarchical representation of data and have achieved state-of-the-art results in many domains. In this paper we focus on one of the best-known NLP (Natural Language Processing) problems, slot filling, and approach state-of-the-art results on the ticketing problem in order to make Spoken Dialogue Systems work more efficiently. We propose a hybrid architecture, a combination of Recurrent Neural Network and Convolutional Neural Network models, for slot filling in Spoken Language Understanding. In particular, our network model is built from stacked units of 1-dimensional CNNs (Convolutional Neural Networks) across the temporal domain, whose outputs are used to train an RNN (Recurrent Neural Network) layer that models dependencies in the temporal domain. Experimental tests show extensive comparisons between different models for NER (Named Entity Recognition). The results demonstrate the effectiveness of hybrid models that combine the benefits of both RNN and CNN architectures, compared with pure RNN and CNN models as well as with other traditional models. Experimental results show that our model achieves an F1-score of 95.11 on the benchmark ATIS dataset.

Index Terms—SLU, slot filling, hybrid CNN and RNN, deep learning

I. INTRODUCTION

The methodological revolution in spoken language research started about twenty years ago, when machine learning algorithms began to take hold in the programming community. However, the last five years have brought the real change: new deep learning architectures led to a new level of solutions, and Spoken Dialogue Systems (SDS) are one of the fields that have improved most noticeably. SDS and chatbots are taking a wider place day by day in scientific conferences as a case study, and they already have great commercial potential, owing to the change in the way humans interact with machines. The improvement of deep learning in general, and of Natural Language Processing (NLP) research in particular, has placed many difficult problems under the microscope, and research teams around the world are testing different architectures to obtain state-of-the-art results on these problems. Nowadays the importance of chatbots has increased: most websites tend to have their own chatbots to communicate with customers and facilitate their work. The goal of such bots is to understand users' needs and respond in their natural language. This leads to a better understanding of users' queries when communicating with users in a natural way through these chatbots. It also helps to ask users about whatever points are missing in order to give the most accurate answers; such assistants could help disabled people and bring more solutions to the market to build a more intelligent world.
The implementation of a voice assistant involves several components, such as speech-to-text and text-to-speech models, but the most challenging part is the NLP task of extracting the needs of the user and recognizing the intent of the conversation. The processing pipeline here consists of two parts: intent classification, and slot filling once the intent is known. At this stage, the bot needs to generate a response to the user and give feedback about whatever data is still missing. The whole system that organizes this process is the dialogue manager, which processes the user's input, extracts the meaning, and generates the desired response. From a research perspective, the design of spoken dialogue systems presents a number of significant challenges, as these systems depend on solving several difficult NLP and decision-making tasks and combining them into a functional dialogue system pipeline [1].

Intent detection and slot filling are usually processed separately. Intent detection can be treated as a semantic utterance classification problem, and popular classifiers such as support vector machines (SVMs) [2] and deep neural network methods [3] can be applied. Slot filling can be treated as a sequence labeling task. Popular approaches to solving sequence labeling problems include maximum entropy Markov models (MEMMs) [4], conditional random fields (CRFs) [5], and recurrent neural networks (RNNs) [6] [7] [8]. Joint models for intent detection and slot filling have also been proposed in the literature [9] [10]. Such a joint model simplifies the spoken language understanding (SLU) system, as only one model needs to be trained and fine-tuned for the two tasks.

This work focuses on the slot-filling part by building a model that extracts information from text in a reliable way. Before the era of deep learning, the task of Named Entity Recognition (NER) was solved with grammar-based models and rule-based approaches; these models achieve good results in terms of precision but fail to capture all the varieties of human text, so their recall is poor. Probabilistic approaches came with models built on HMMs, which were the state of the art for many years and achieved impressive results. With the recent revolution, many deep learning methods have replaced the traditional ones and pushed the state of the art for these tasks. Recurrent Neural Network (RNN) models have replaced HMM-based models: an RNN solves the same task in a simpler way, and deep RNNs are able to capture complex representations of the input. The problem with such models is that they must process the input token by token; therefore, these structures cannot be parallelized, and the models become slow to train and run when the network is deep. Convolutional Neural Networks (CNNs) added a way to extract relations between tokens by mixing them, in a manner similar to extracting n-grams in traditional NLP tasks. Architectures that contain a CNN can be optimized through parallelization, so adding a convolutional layer can reduce the complexity and control the size of the neural network. In this paper we discuss different approaches to solving slot filling for the ticketing task as a NER problem, and we present architectures based on pure RNNs, pure CNNs, and hybrid models. We conducted many experiments with different values of the hyper-parameters and different optimization methods.

II. RELATED WORK

Rule-based approaches are built manually: all the rules needed to achieve the goal must first be written by hand. This operation is time-consuming and therefore not very efficient. The recall is usually poor, because it is so difficult to write rules for all the varieties of input, but on the positive side the precision of rule-based approaches is quite high [11]. The most widely used formal system for modeling constituent structure in English and other natural languages is the Context-Free Grammar, or CFG. A context-free grammar consists of a set of rules or productions, each of which expresses the ways that symbols of the language can be grouped and ordered together, and a lexicon of words and symbols [12].

For machine learning methods, we need a dataset of annotated text in which each word is assigned a tag; this is the slot filling problem. The first step is feature engineering: for example, checking whether a word is capitalized or is the name of a city (some city names consist of two words), or looking at the previous and next words (the context). Probabilistic modeling with Conditional Random Fields not only assumes that features depend on each other but also considers future observations while learning a pattern [21]. This combines the best of both HMMs and MEMMs; in terms of performance it was previously considered the best method for the entity recognition problem. Another paper presented a comprehensive investigation of RNNs for the task of slot filling in SLU; the authors implemented and compared several RNN architectures, including the Elman-type and Jordan-type networks and their variants [18].
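To make the feature-engineering step above concrete, the following minimal Python sketch builds a per-token feature dictionary from capitalization and the neighboring words. The function name and the exact feature set are our own illustrative choices; they are not taken from the paper and are not tied to any particular CRF toolkit.

```python
# Minimal sketch of classical per-token feature extraction for a CRF-style
# slot-filling model (illustrative only; feature names are our own choice).
def token_features(tokens, i):
    """Return a feature dict for tokens[i], using its left/right context."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),
        "word.is_digit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

if __name__ == "__main__":
    sentence = "show me flights from Moscow to London Today".split()
    print(token_features(sentence, 4))  # features for "Moscow"
```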
III. DEEP LEARNING METHODS

A. Recurrent Neural Network (RNN)

Recurrent Neural Networks (Fig. 1) are used for sequence modeling. An RNN accepts an input x_t at time step t together with a hidden state h_t, uses this hidden state to produce an output y_t, and passes the hidden state on to the next time step. We can therefore think of the hidden state as a summary of the previous inputs to the network; an activation function such as tanh or ReLU is used to compute it. The output y_t is the prediction of the next tag, a vector of probabilities over the vocabulary. The following formulas express the general form of an RNN:

h_t = f(U x_t + W h_{t-1})   (1)
y_t = softmax(V h_t)   (2)

Fig. 1. General form of a Recurrent Neural Network

Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) are used as RNN units; these units can capture long-term dependencies. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through; they are composed of a sigmoid neural net layer and a pointwise multiplication operation [19]. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through", while a value of one means "let everything through". An LSTM has three of these gates (Fig. 2) to protect and control the cell state. The following formulas describe how an LSTM cell works:

f_t = σ(W_f [h_{t-1}, x_t] + b_f)   (3)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)   (4)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)   (5)
C_t = f_t · C_{t-1} + i_t · C̃_t   (6)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)   (7)
h_t = o_t · tanh(C_t)   (8)

Fig. 2. LSTM unit

The GRU has a simpler design (Fig. 3); it was introduced by Cho et al. (2014) [20]. The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update), whereas an LSTM has three gates (namely input, output, and forget) [13]. The GRU unit controls the flow of information like the LSTM unit, but without a separate memory cell: it simply exposes the full hidden content without any control. The GRU is relatively new but computationally more efficient. The following formulas describe the GRU mechanism:

z_t = σ(W_z · [h_{t-1}, x_t])   (9)
r_t = σ(W_r · [h_{t-1}, x_t])   (10)
h̃_t = tanh(W · [r_t · h_{t-1}, x_t])   (11)
h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t   (12)

Fig. 3. GRU unit

In our experiments we used both GRU and LSTM units and compared them. Other sequence architectures, such as the encoder-decoder architecture, could also be used for this task: first the whole input is encoded into a hidden representation (encoder), and then this hidden representation is used to produce the sequence of tags (decoder). Some architectures use an attention mechanism to attend to parts of the input sequence and use this information when producing each output token.
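As a concrete illustration of Eqs. (9)-(12), the following NumPy sketch runs a single GRU step over a toy sequence. The weight shapes and names are our own assumptions (biases are omitted, as in the equations above); this is not the implementation used in our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step; each weight has shape (d_h, d_h + d_x) and acts on [h_prev, x_t]."""
    hx = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                        # update gate, Eq. (9)
    r_t = sigmoid(W_r @ hx)                        # reset gate,  Eq. (10)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate, Eq. (11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # new hidden state, Eq. (12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_x, d_h = 4, 3
    W_z, W_r, W_h = (rng.standard_normal((d_h, d_h + d_x)) for _ in range(3))
    h = np.zeros(d_h)
    for x in rng.standard_normal((5, d_x)):        # run over a toy 5-token sequence
        h = gru_step(x, h, W_z, W_r, W_h)
    print(h)
```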
B. Convolutional Neural Networks for Sequences

RNNs operate sequentially: the output for the second input depends on the first one, so an RNN cannot be parallelized. Convolutions have no such problem: each patch a convolutional kernel operates on is independent of the others, so the entire input layer can be processed concurrently. Convolutions grow a larger receptive field as more layers are stacked, which means that, by default, each step in the convolutional representation views all of the input within its receptive field, both before and after it (Fig. 4).

Fig. 4. Convolutional Neural Network for sequences

In our experiments we used a 1D convolution to mix the tokens and extract relations between consecutive tokens; it is equivalent to an n-gram relation where n is the size of the filter (for example, if we care about the last 3 tokens, we use a filter of size 3). Using a CNN brings some benefits: it runs faster than an RNN and beats the RNN on some tasks. If we split the convolution output into two parts, A and B, one of which gates the other through element-wise multiplication, with A linear and B passed through a sigmoid, we obtain a GLU (gated linear unit). The receptive field is increased as shown in the following formulas:

A = X · W + b   (13)
B = σ(X · V + c)   (14)
h(X) = A ⊗ B   (15)
h(X) = (X · W + b) ⊗ σ(X · V + c)   (16)
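To illustrate Eqs. (13)-(16), the following NumPy sketch applies a gated 1-D convolution (GLU) over a toy sequence of token embeddings, flattening each window of k consecutive embeddings before the linear branch A and the gating branch B. The shapes, names, and the explicit loop are illustrative assumptions chosen for clarity, not the implementation behind the reported results.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_conv1d(X, W, V, b, c, k):
    """X: (seq_len, d_in) token embeddings; W, V: (k * d_in, d_out); returns (seq_len - k + 1, d_out)."""
    outputs = []
    for t in range(X.shape[0] - k + 1):
        window = X[t:t + k].reshape(-1)        # flatten k consecutive embeddings (an n-gram)
        A = window @ W + b                     # linear branch, Eq. (13)
        B = sigmoid(window @ V + c)            # gating branch, Eq. (14)
        outputs.append(A * B)                  # element-wise gate, Eqs. (15)-(16)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seq_len, d_in, d_out, k = 8, 6, 4, 3       # filter size 3 ~ trigram mixing
    X = rng.standard_normal((seq_len, d_in))
    W, V = rng.standard_normal((2, k * d_in, d_out))
    b, c = rng.standard_normal((2, d_out))
    print(glu_conv1d(X, W, V, b, c, k).shape)  # (6, 4)
```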
C. Hybrid CNN-RNN Model

This model combines the benefits of both the CNN and the RNN: the RNN helps to capture the dependencies between tokens in the user's query, and using LSTM or GRU units yields a model that captures long-range dependencies between tokens through the memory cell in their architecture, while the CNN helps to mix consecutive tokens and extract relations between them [14]. In the task of slot filling, the hybrid architecture contains several convolution layers stacked with "same" padding, and the output of these layers is the input to the RNN layers, as shown in Fig. 5; several RNN layers can also be stacked. After the RNN layers there is a dense layer with softmax activations, which produces the output of the network.

Fig. 5. Hybrid CNN/RNN model

IV. EXPERIMENTS

A. Dataset

The ATIS (Airline Travel Information System) corpus (Tur et al., 2010) is one of the main data resources used in many studies over the past two decades for SLU research in spoken dialog systems, e.g. [15] [16] [17]. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). The dataset contains audio recordings of people making flight reservations. The training set contains 4,478 utterances and the test set contains 893 utterances; we use another 500 utterances as a development set. There are 120 slot labels and 21 intent types in the training set [22].

The IOB format (inside, outside, beginning) is a common format for tagging tokens in a chunking task in computational linguistics. The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix indicates that the tag is inside a chunk; the B- tag is used only when a tag is followed by a tag of the same type with no O tokens between them. An O tag indicates that a token belongs to no chunk.

TABLE I
ATIS DATASET EXAMPLE

Sentence        show  me  flights  from  Moscow     to  London   Today
Slots/Concepts  O     O   O        O     B-fromLoc  O   I-toLoc  B-departDate
Named Entity    O     O   O        O     S-city     O   I-city   O
Intent          Find Flight
Domain          Airline Travel

Table I shows an example from the ATIS dataset with the annotation of slot/concept, named entity, and intent, as well as domain. The latter two annotations are for the other two tasks in SLU: domain detection and intent determination. We can see that slot filling is quite similar to the NER task, following the IOB tagging representation, but with a more specific granularity.

1) Training Details: In training, we compared different models for the NER (Named Entity Recognition) system; all models were trained for 100 epochs. We tuned our models with different dropout values (0.1, 0.25, 0.5) and different optimization methods (ADAM, RMSProp, SGD). For the embedding layer, we represent each token by a vector of size 100; for the convolution layer we used 64 filters of size 5 with ReLU as the activation function; the hidden size of the GRU/LSTM unit is 100 (Fig. 5). Our architecture is as follows: the input layer is a sequence of tokens represented by indices using a bag of words; an embedding layer represents each token by a vector whose size is a hyperparameter of the network; and this embedding layer is followed by one of the main choices discussed above, namely a recurrent neural network, a convolutional neural network, or a hybrid model that contains a CNN layer followed by an RNN layer.
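The following tf.keras sketch assembles the hybrid tagger with the hyper-parameters listed above (embedding size 100, 64 convolution filters of width 5 with ReLU and "same" padding, a GRU with hidden size 100, and a per-token softmax over slot labels, trained with RMSProp). The framework choice, the placeholder values for vocab_size, num_slot_labels, and max_len, and the padding/masking details are our assumptions for illustration, not the exact training setup.

```python
import tensorflow as tf

def build_hybrid_tagger(vocab_size: int, num_slot_labels: int, max_len: int) -> tf.keras.Model:
    """Minimal hybrid Conv1D + GRU slot tagger, sketched from the description above."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, 100),                        # token -> 100-d vector
        tf.keras.layers.Conv1D(64, 5, padding="same", activation="relu"),  # mix 5-token windows
        tf.keras.layers.GRU(100, return_sequences=True),                   # temporal dependencies
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(num_slot_labels, activation="softmax")   # one slot label per token
        ),
    ])
    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
    return model

if __name__ == "__main__":
    # Placeholder sizes for illustration only (120 slot labels as in ATIS).
    model = build_hybrid_tagger(vocab_size=900, num_slot_labels=120, max_len=46)
    model.summary()
```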
2) Evaluation Metrics: For evaluation, we computed precision, recall, and the F1 score on the training and validation sets, and we picked the model with the best F1 score. For slot filling, the error rate can be computed in two ways; the more common metric is the F-measure using slots as units. This metric is similar to what is used for other sequence classification tasks in the natural language processing community, such as parsing and named entity extraction. In this technique the IOB schema is usually adopted, where each word is tagged with its position in the slot: beginning (B), inside (I), or other (O). Recall and precision values are then computed over the slots; a slot is considered correct if both its range and its type are correct. The F-measure is defined as the harmonic mean of recall and precision:

F1-score = 2 × (Recall × Precision) / (Recall + Precision)   (17)

where:

Recall = #correct slots found / #true slots   (18)
Precision = #correct slots found / #found slots   (19)
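The slot-level metrics in Eqs. (17)-(19) can be computed by first extracting (span, type) chunks from the IOB sequences and then counting exact matches, as in the following sketch. This is an illustrative implementation of the definition above, not the evaluation script used for the reported numbers; the chunk-extraction conventions (e.g. how orphan I- tags are handled) are our own choices.

```python
def extract_slots(tags):
    """Return a set of (start, end, slot_type) chunks from an IOB tag sequence."""
    slots, start, slot_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):                 # sentinel to close the last chunk
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != slot_type):
            if start is not None:
                slots.add((start, i, slot_type))           # close the open chunk
                start, slot_type = None, None
            if tag.startswith("B-"):
                start, slot_type = i, tag[2:]              # open a new chunk
        # an I- tag continuing the current chunk needs no action
    return slots

def slot_f1(gold_seqs, pred_seqs):
    correct = found = true = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_slots(gold), extract_slots(pred)
        correct += len(g & p)                              # correct slots found
        true += len(g)                                     # denominator of Eq. (18)
        found += len(p)                                    # denominator of Eq. (19)
    recall = correct / true if true else 0.0
    precision = correct / found if found else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    gold = [["O", "O", "B-fromLoc", "O", "B-toLoc", "B-departDate"]]
    pred = [["O", "O", "B-fromLoc", "O", "B-toLoc", "O"]]
    print(slot_f1(gold, pred))  # (1.0, 0.666..., 0.8)
```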
B. Results

During the evaluation we focused on the differences between the neural network architectures; we also compared different optimization methods for the best network structure, and finally we included a comparison based on the type of recurrent unit used in the model. We ran each experiment 25 times, took the mean over the runs, and calculated the standard error. The results are reported in the tables below.

Our results show that hybrid architectures perform better than pure RNN or pure CNN models (Table II): with dropout 0.25 and the RMSProp optimization method, we obtained an F1-score of 95.04 for the hybrid model, compared with 91.16 for the convolutional model and 93.07 for the recurrent model. Our results also show that RMSProp produced the best models according to the F1-score metric (Table III): with the same dropout of 0.25 and the hybrid model, the F1-score was 95.04 for RMSProp, compared with 94.83 for ADAM and 94.44 for SGD. The results show that the hybrid Convolution1D and RNN/GRU structure without dropout, trained with the RMSProp optimizer, gives the best F1-score of 95.11 compared with other levels of dropout on the same architecture (Table IV). Based on the recurrent unit used in our experiments, GRU-based hybrid models reach an F1-score of 95.04, compared with 94.67 for LSTM-based hybrid models, so the GRU units improve the score by 0.37% (Table V). Finally, our results show that the hybrid CNN/RNN-based models outperform the Bi-directional Jordan-RNN baseline by 1.13% on the ATIS benchmark (Table VI).

TABLE II
COMPARISON BETWEEN DIFFERENT DEEP LEARNING STRUCTURES

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25             94.47      95.61   95.04     94.89 ±0.15
Convolution1D, dropout 0.25                                91.75      90.57   91.16     91.01 ±0.10
RNN/GRU, dropout 0.25                                      93.02      93.12   93.07     92.48 ±0.42

TABLE III
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON OPTIMIZATION METHOD

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/GRU, dropout 0.25; ADAM       94.71      94.95   94.83     94.67 ±0.23
Hybrid Convolution1D and RNN/GRU, dropout 0.25; SGD        94.22      94.65   94.44     94.23 ±0.16

TABLE IV
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON DROPOUT VALUE

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, no dropout; RMSProp      94.98      95.47   95.11     94.69 ±0.47
Hybrid Convolution1D and RNN/GRU, dropout 0.1; RMSProp     94.29      95.31   94.82     94.42 ±0.28
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/GRU, dropout 0.5; RMSProp     93.30      94.60   93.95     93.25 ±0.43

TABLE V
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON THE RECURRENT UNIT USED

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/LSTM, dropout 0.25; RMSProp   94.36      95.17   94.76     94.40 ±0.35

TABLE VI
COMPARISON WITH BASELINE MODELS ON THE ATIS BENCHMARK

Models                       Precision  Recall  F1-score
Jordan-RNN [18]              92.76      93.87   93.31
Bi-dir. Jordan-RNN [18]      93.82      94.15   93.98
Hybrid structure (ours)      94.98      95.47   95.11

Fig. 6. Best model architecture: a convolution layer followed by an RNN layer of GRU units, without dropout

V. CONCLUSION

This paper addresses the problem of slot filling in Spoken Language Understanding. In particular, we focused on slot tagging without paying attention to the intent classification part. We formulated our learning architecture as a hierarchy of spatial CNN features followed by RNNs to model dependencies in the temporal domain. Experimental results on the ATIS dataset consistently demonstrate the effectiveness of the proposed approach. It is worth mentioning that combined models that solve the two tasks at the same time can be implemented, and such models have been shown to lead to better performance. Still, on the way to implementing a full chatbot, we will also need to generate human-like text in response to user input. In future work, we intend to explore the incorporation of an attention mechanism in our model, which could provide additional information for slot label prediction, and to train our architecture on other datasets to generalize the results.

REFERENCES

[1] P. Su, N. Mrksic, I. Casanueva, and I. Vulic, "Deep Learning for Conversational AI," NAACL 2018 Tutorial, PolyAI and University of Cambridge, 2018.
[2] P. Haffner, G. Tur, and J. H. Wright, "Optimizing SVMs for complex call classification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2003, pp. I-632.
[3] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, "Deep belief nets for natural language call-routing," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5680-5683.
[4] A. McCallum, D. Freitag, and F. C. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proc. ICML, vol. 17, 2000, pp. 591-598.
[5] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proc. INTERSPEECH, 2007, pp. 1605-1608.
[6] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, "Spoken language understanding using long short-term memory neural networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 189-194.
[7] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al., "Using recurrent neural networks for slot filling in spoken language understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530-539, 2015.
[8] B. Liu and I. Lane, "Recurrent neural network structured output prediction for spoken language understanding," in Proc. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions, 2015.
[9] D. Guo, G. Tur, W.-t. Yih, and G. Zweig, "Joint semantic utterance classification and slot filling with recursive neural networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 554-559.
[10] P. Xu and R. Sarikaya, "Convolutional neural network based triangular CRF for joint intent detection and slot filling," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 78-83.
[11] M. Surdeanu and H. Ji, "Overview of the English Slot Filling Track at the TAC2014 Knowledge Base Population Evaluation," 3rd International Workshop on Knowledge Discovery on the WEB, 2017.
[12] U. Schade and M. R. Hieb, "Formalizing Battle Management Language: A Grammar for Specifying Orders," 06S-SIW-068, Spring 2006.
[13] G. Kurata, B. Xiang, B. Zhou, and M. Yu, "Leveraging sentence-level information with encoder LSTM for natural language understanding," arXiv preprint arXiv:1601.01530, 2016.
[14] S. T. Hsu, C. Moon, P. Jones, and N. F. Samatova, "A Hybrid CNN-RNN Alignment Model for Phrase-Aware Sentence Classification," in Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
[15] Y. He and S. Young, "A data-driven spoken language understanding system," in Proc. IEEE ASRU, 2003.
[16] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proc. INTERSPEECH, 2007.
[17] G. Tur, D. Hakkani-Tur, and L. Heck, "What is left to be understood in ATIS?" in Proc. IEEE SLT, 2010.
[18] G. Mesnil, X. He, L. Deng, and Y. Bengio, "Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding," in Proc. INTERSPEECH, 2013, pp. 3771-3775.
[19] B. Qu, X. Li, D. Tao, and X. Lu, "Deep semantic understanding of high resolution remote sensing image," in Proc. International Conference on Computer, Information and Telecommunication Systems (CITS), 2016.
[20] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," arXiv:1412.3555 [cs.NE], 2014.
[21] C. Sutton, A. McCallum, and K. Rohanimanesh, "Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data," Journal of Machine Learning Research (JMLR), 2007, pp. 693-723.
[22] C.-W. Goo, G. Gao, Y.-K. Hsu, C.-L. Huo, T.-C. Chen, K.-W. Hsu, and Y.-N. Chen, "Slot-Gated Modeling for Joint Slot Filling and Intent Prediction," in Proc. NAACL-HLT 2018, pp. 753-757.