A hybrid convolutional and recurrent network approach for conversational AI in spoken language understanding

1st Bassel Zaity, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, bassel.zaity@gmail.com
2nd Hazem Wannous, IMT Lille Douai, CRIStAL UMR 9189, University of Lille, Lille, France, hazem.wannous@univ-lille.fr
3rd Zein Shaheen, Computer Intelligent Technologies, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, shahin.z@edu.spbstu.ru
4th Igor Chernoruckiy, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, igcher@spbstu.ru
5th Pavel Drobintsev, Graduate School of Software Engineering, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, drob@ics2.ecd.spbstu.ru
6th Vadim Pak, Computer Intelligent Technologies, Peter the Great St. Petersburg Polytechnic University (SPbPU), Saint Petersburg, Russia, vadim.pak@cit.icc.spbstu.ru

Abstract—The deep learning revolution has had an impact on almost all parts of our lives: it has brought us improved instant machine translation and modern, human-like conversational voice assistants such as Siri, Alexa, and Alisa. This revolution became possible because of deep learning methods that use multiple processing layers to learn a hierarchical representation of data and have achieved state-of-the-art results in many domains. In this paper we focus on one of the best-known NLP (Natural Language Processing) problems, slot filling, and approach state-of-the-art results on the ticketing problem in order to make Spoken Dialogue Systems work more efficiently. We propose a hybrid architecture, a combination of Recurrent Neural Network and Convolutional Neural Network models, for slot filling in Spoken Language Understanding. In particular, our network model is built from stacked units of 1-dimensional CNNs (Convolutional Neural Networks) across the temporal domain, whose outputs are used to train an RNN (Recurrent Neural Network) layer that models dependencies in the temporal domain. Experimental tests show extensive comparisons between different models for NER (Named Entity Recognition). The results demonstrate the effectiveness of hybrid models that combine the benefits of both RNN and CNN architectures, compared with pure RNN and CNN models as well as with other traditional models. Experimental results show that our model achieves an F1-score of 95.11 on the benchmark ATIS dataset.

Index Terms—SLU, slot filling, hybrid CNN and RNN, deep learning

I. INTRODUCTION

The methodological revolution in spoken language research started about twenty years ago, when machine learning algorithms began to take hold in the programming community. However, the last five years have brought the real change: new deep learning architectures led to a new level of solutions, and Spoken Dialogue Systems (SDS) are one of the fields that have improved most noticeably. SDS and chatbots are taking a wider place day by day in scientific conferences as a case study, and they already have great commercial potential, owing to the change in the way humans interact with machines. The improvement of deep learning in general, and of Natural Language Processing (NLP) research in particular, has placed many difficult problems under the microscope, and research teams around the world are testing different architectures to obtain state-of-the-art results on these problems. Nowadays the importance of chatbots has increased: most websites tend to have their own chatbots to communicate with customers and facilitate their work. The goal of such bots is to understand users' needs and respond in their natural language. This leads to a better understanding of users' queries when communicating with users in a natural way through these chatbots. It also helps to ask users about whatever points are missing in order to give the most accurate answers; such assistants could help disabled people and bring more solutions to the market to build a more intelligent world.
The implementation of a voice assistant involves several components, such as speech-to-text and text-to-speech models, but the most challenging part is the NLP task of extracting the needs of the user and recognizing the intent of the conversation. The processing pipeline here consists of two parts: intent classification, and slot filling once the intent is known. At this stage, the bot needs to generate a response to the user and give feedback about whatever data is still missing. The whole system that organizes this process is the dialogue manager, which processes the user's input, extracts the meaning, and generates the desired response. From a research perspective, the design of spoken dialogue systems presents a number of significant challenges, as these systems depend on solving several difficult NLP and decision-making tasks and combining them into a functional dialogue system pipeline [1].

Intent detection and slot filling are usually processed separately. Intent detection can be treated as a semantic utterance classification problem, and popular classifiers such as support vector machines (SVMs) [2] and deep neural network methods [3] can be applied. Slot filling can be treated as a sequence labeling task. Popular approaches to solving sequence labeling problems include maximum entropy Markov models (MEMMs) [4], conditional random fields (CRFs) [5], and recurrent neural networks (RNNs) [6] [7] [8]. Joint models for intent detection and slot filling have also been proposed in the literature [9] [10]. Such a joint model simplifies the spoken language understanding (SLU) system, as only one model needs to be trained and fine-tuned for the two tasks.

This work focuses on the slot-filling part by building a model that extracts information from text in a reliable way. Before the era of deep learning, the task of Named Entity Recognition (NER) was solved with grammar-based models and rule-based approaches; these models achieve good results in terms of precision but fail to capture all the varieties of human text, so their recall is poor. Probabilistic approaches came with models built on HMMs, which were the state of the art for many years and achieved impressive results. With the recent revolution, many deep learning methods have replaced the traditional ones and pushed the state of the art for these tasks. Recurrent Neural Network (RNN) models have replaced HMM-based models: an RNN solves the same task in a simpler way, and deep RNNs are able to capture complex representations of the input. The problem with such models is that they must process the input token by token; therefore, these structures cannot be parallelized, and the models become slow to train and run when the network is deep. Convolutional Neural Networks (CNNs) added a way to extract relations between tokens by mixing them, in a manner similar to extracting n-grams in traditional NLP tasks. Architectures that contain a CNN can be optimized through parallelization, so adding a convolutional layer can reduce the complexity and control the size of the neural network. In this paper we discuss different approaches to solving slot filling for the ticketing task as a NER problem, and we present architectures based on pure RNNs, pure CNNs, and hybrid models. We conducted many experiments with different values of the hyper-parameters and different optimization methods.

II. RELATED WORK

Rule-based approaches are built manually: all the rules needed to achieve the goal must first be written by hand. This operation is time-consuming and therefore not very efficient. The recall is usually poor, because it is so difficult to write rules for all the varieties of input, but on the positive side the precision of rule-based approaches is quite high [11]. The most widely used formal system for modeling constituent structure in English and other natural languages is the Context-Free Grammar, or CFG. A context-free grammar consists of a set of rules or productions, each of which expresses the ways that symbols of the language can be grouped and ordered together, and a lexicon of words and symbols [12].

For machine learning methods, we need a dataset of annotated text in which each word is assigned a tag; this is the slot filling problem. The first step is feature engineering: for example, checking whether a word is capitalized or is the name of a city (some city names consist of two words), or looking at the previous and next words (the context). Probabilistic modeling with Conditional Random Fields not only assumes that features depend on each other but also considers future observations while learning a pattern [21]. This combines the best of both HMMs and MEMMs; in terms of performance it was previously considered the best method for the entity recognition problem. Another paper presented a comprehensive investigation of RNNs for the task of slot filling in SLU; the authors implemented and compared several RNN architectures, including the Elman-type and Jordan-type networks and their variants [18].
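To make the feature-engineering step above concrete, the following minimal Python sketch builds a per-token feature dictionary from capitalization and the neighboring words. The function name and the exact feature set are our own illustrative choices; they are not taken from the paper and are not tied to any particular CRF toolkit.

```python
# Minimal sketch of classical per-token feature extraction for a CRF-style
# slot-filling model (illustrative only; feature names are our own choice).
def token_features(tokens, i):
    """Return a feature dict for tokens[i], using its left/right context."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),
        "word.is_digit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

if __name__ == "__main__":
    sentence = "show me flights from Moscow to London Today".split()
    print(token_features(sentence, 4))  # features for "Moscow"
```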
III. DEEP LEARNING METHODS

A. Recurrent Neural Network (RNN)

Recurrent Neural Networks (Fig. 1) are used for sequence modeling. An RNN accepts an input x_t at time step t together with a hidden state h_t, uses this hidden state to produce an output y_t, and passes the hidden state on to the next time step. We can therefore think of the hidden state as a summary of the previous inputs to the network; an activation function such as tanh or ReLU is used to compute it. The output y_t is the prediction of the next tag, a vector of probabilities over the vocabulary. The following formulas express the general form of an RNN:

h_t = f(U x_t + W h_{t-1})   (1)
y_t = softmax(V h_t)   (2)

Fig. 1. General form of a Recurrent Neural Network

Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) are used as RNN units; these units can capture long-term dependencies. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through; they are composed of a sigmoid neural net layer and a pointwise multiplication operation [19]. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through", while a value of one means "let everything through". An LSTM has three of these gates (Fig. 2) to protect and control the cell state. The following formulas describe how an LSTM cell works:

f_t = σ(W_f [h_{t-1}, x_t] + b_f)   (3)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)   (4)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)   (5)
C_t = f_t · C_{t-1} + i_t · C̃_t   (6)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)   (7)
h_t = o_t · tanh(C_t)   (8)

Fig. 2. LSTM unit

The GRU has a simpler design (Fig. 3); it was introduced by Cho et al. (2014) [20]. The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update), whereas an LSTM has three gates (namely input, output, and forget) [13]. The GRU unit controls the flow of information like the LSTM unit, but without a separate memory cell: it simply exposes the full hidden content without any control. The GRU is relatively new but computationally more efficient. The following formulas describe the GRU mechanism:

z_t = σ(W_z · [h_{t-1}, x_t])   (9)
r_t = σ(W_r · [h_{t-1}, x_t])   (10)
h̃_t = tanh(W · [r_t · h_{t-1}, x_t])   (11)
h_t = (1 - z_t) · h_{t-1} + z_t · h̃_t   (12)

Fig. 3. GRU unit

In our experiments we used both GRU and LSTM units and compared them. Other sequence architectures, such as the encoder-decoder architecture, could also be used for this task: first the whole input is encoded into a hidden representation (encoder), and then this hidden representation is used to produce the sequence of tags (decoder). Some architectures use an attention mechanism to attend to parts of the input sequence and use this information when producing each output token.
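As a concrete illustration of Eqs. (9)-(12), the following NumPy sketch runs a single GRU step over a toy sequence. The weight shapes and names are our own assumptions (biases are omitted, as in the equations above); this is not the implementation used in our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step; each weight has shape (d_h, d_h + d_x) and acts on [h_prev, x_t]."""
    hx = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                        # update gate, Eq. (9)
    r_t = sigmoid(W_r @ hx)                        # reset gate,  Eq. (10)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate, Eq. (11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # new hidden state, Eq. (12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_x, d_h = 4, 3
    W_z, W_r, W_h = (rng.standard_normal((d_h, d_h + d_x)) for _ in range(3))
    h = np.zeros(d_h)
    for x in rng.standard_normal((5, d_x)):        # run over a toy 5-token sequence
        h = gru_step(x, h, W_z, W_r, W_h)
    print(h)
```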
B. Convolutional Neural Networks for Sequences

RNNs operate sequentially: the output for the second input depends on the first one, so an RNN cannot be parallelized. Convolutions have no such problem: each patch a convolutional kernel operates on is independent of the others, so the entire input layer can be processed concurrently. Convolutions grow a larger receptive field as more layers are stacked, which means that, by default, each step in the convolutional representation views all of the input within its receptive field, both before and after it (Fig. 4).

Fig. 4. Convolutional Neural Network for sequences

In our experiments we used a 1D convolution to mix the tokens and extract relations between consecutive tokens; it is equivalent to an n-gram relation where n is the size of the filter (for example, if we care about the last 3 tokens, we use a filter of size 3). Using a CNN brings some benefits: it runs faster than an RNN and beats the RNN on some tasks. If we split the convolution output into two parts, A and B, one of which gates the other through element-wise multiplication, with A linear and B passed through a sigmoid, we obtain a GLU (gated linear unit). The receptive field is increased as shown in the following formulas:

A = X · W + b   (13)
B = σ(X · V + c)   (14)
h(X) = A ⊗ B   (15)
h(X) = (X · W + b) ⊗ σ(X · V + c)   (16)
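To illustrate Eqs. (13)-(16), the following NumPy sketch applies a gated 1-D convolution (GLU) over a toy sequence of token embeddings, flattening each window of k consecutive embeddings before the linear branch A and the gating branch B. The shapes, names, and the explicit loop are illustrative assumptions chosen for clarity, not the implementation behind the reported results.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_conv1d(X, W, V, b, c, k):
    """X: (seq_len, d_in) token embeddings; W, V: (k * d_in, d_out); returns (seq_len - k + 1, d_out)."""
    outputs = []
    for t in range(X.shape[0] - k + 1):
        window = X[t:t + k].reshape(-1)        # flatten k consecutive embeddings (an n-gram)
        A = window @ W + b                     # linear branch, Eq. (13)
        B = sigmoid(window @ V + c)            # gating branch, Eq. (14)
        outputs.append(A * B)                  # element-wise gate, Eqs. (15)-(16)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seq_len, d_in, d_out, k = 8, 6, 4, 3       # filter size 3 ~ trigram mixing
    X = rng.standard_normal((seq_len, d_in))
    W, V = rng.standard_normal((2, k * d_in, d_out))
    b, c = rng.standard_normal((2, d_out))
    print(glu_conv1d(X, W, V, b, c, k).shape)  # (6, 4)
```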
C. Hybrid CNN-RNN Model

This model combines the benefits of both the CNN and the RNN: the RNN helps to capture the dependencies between tokens in the user's query, and using LSTM or GRU units yields a model that captures long-range dependencies between tokens through the memory cell in their architecture, while the CNN helps to mix consecutive tokens and extract relations between them [14]. In the task of slot filling, the hybrid architecture contains several convolution layers stacked with "same" padding, and the output of these layers is the input to the RNN layers, as shown in Fig. 5; several RNN layers can also be stacked. After the RNN layers there is a dense layer with softmax activations, which produces the output of the network.

Fig. 5. Hybrid CNN/RNN model

IV. EXPERIMENTS

A. Dataset

The ATIS (Airline Travel Information System) corpus (Tur et al., 2010) is one of the main data resources used in many studies over the past two decades for SLU research in spoken dialog systems, e.g. [15] [16] [17]. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). The dataset contains audio recordings of people making flight reservations. The training set contains 4,478 utterances and the test set contains 893 utterances; we use another 500 utterances as a development set. There are 120 slot labels and 21 intent types in the training set [22].

The IOB format (inside, outside, beginning) is a common format for tagging tokens in a chunking task in computational linguistics. The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix indicates that the tag is inside a chunk; the B- tag is used only when a tag is followed by a tag of the same type with no O tokens between them. An O tag indicates that a token belongs to no chunk.

TABLE I
ATIS DATASET EXAMPLE

Sentence        show  me  flights  from  Moscow     to  London   Today
Slots/Concepts  O     O   O        O     B-fromLoc  O   I-toLoc  B-departDate
Named Entity    O     O   O        O     S-city     O   I-city   O
Intent          Find Flight
Domain          Airline Travel

Table I shows an example from the ATIS dataset with the annotation of slot/concept, named entity, and intent, as well as domain. The latter two annotations are for the other two tasks in SLU: domain detection and intent determination. We can see that slot filling is quite similar to the NER task, following the IOB tagging representation, but with a more specific granularity.

1) Training Details: In training, we compared different models for the NER (Named Entity Recognition) system; all models were trained for 100 epochs. We tuned our models with different dropout values (0.1, 0.25, 0.5) and different optimization methods (ADAM, RMSProp, SGD). For the embedding layer, we represent each token by a vector of size 100; for the convolution layer we used 64 filters of size 5 with ReLU as the activation function; the hidden size of the GRU/LSTM unit is 100 (Fig. 5). Our architecture is as follows: the input layer is a sequence of tokens represented by indices using a bag of words; an embedding layer represents each token by a vector whose size is a hyperparameter of the network; and this embedding layer is followed by one of the main choices discussed above, namely a recurrent neural network, a convolutional neural network, or a hybrid model that contains a CNN layer followed by an RNN layer.
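The following tf.keras sketch assembles the hybrid tagger with the hyper-parameters listed above (embedding size 100, 64 convolution filters of width 5 with ReLU and "same" padding, a GRU with hidden size 100, and a per-token softmax over slot labels, trained with RMSProp). The framework choice, the placeholder values for vocab_size, num_slot_labels, and max_len, and the padding/masking details are our assumptions for illustration, not the exact training setup.

```python
import tensorflow as tf

def build_hybrid_tagger(vocab_size: int, num_slot_labels: int, max_len: int) -> tf.keras.Model:
    """Minimal hybrid Conv1D + GRU slot tagger, sketched from the description above."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, 100),                        # token -> 100-d vector
        tf.keras.layers.Conv1D(64, 5, padding="same", activation="relu"),  # mix 5-token windows
        tf.keras.layers.GRU(100, return_sequences=True),                   # temporal dependencies
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(num_slot_labels, activation="softmax")   # one slot label per token
        ),
    ])
    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
    return model

if __name__ == "__main__":
    # Placeholder sizes for illustration only (120 slot labels as in ATIS).
    model = build_hybrid_tagger(vocab_size=900, num_slot_labels=120, max_len=46)
    model.summary()
```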
2) Evaluation Metrics: For evaluation, we computed precision, recall, and the F1 score on the training and validation sets, and we picked the model with the best F1 score. For slot filling, the error rate can be computed in two ways; the more common metric is the F-measure using slots as units. This metric is similar to what is used for other sequence classification tasks in the natural language processing community, such as parsing and named entity extraction. In this technique the IOB schema is usually adopted, where each word is tagged with its position in the slot: beginning (B), inside (I), or other (O). Recall and precision values are then computed over the slots; a slot is considered correct if both its range and its type are correct. The F-measure is defined as the harmonic mean of recall and precision:

F1-score = 2 × (Recall × Precision) / (Recall + Precision)   (17)

where:

Recall = #correct slots found / #true slots   (18)
Precision = #correct slots found / #found slots   (19)
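The slot-level metrics in Eqs. (17)-(19) can be computed by first extracting (span, type) chunks from the IOB sequences and then counting exact matches, as in the following sketch. This is an illustrative implementation of the definition above, not the evaluation script used for the reported numbers; the chunk-extraction conventions (e.g. how orphan I- tags are handled) are our own choices.

```python
def extract_slots(tags):
    """Return a set of (start, end, slot_type) chunks from an IOB tag sequence."""
    slots, start, slot_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):                 # sentinel to close the last chunk
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != slot_type):
            if start is not None:
                slots.add((start, i, slot_type))           # close the open chunk
                start, slot_type = None, None
            if tag.startswith("B-"):
                start, slot_type = i, tag[2:]              # open a new chunk
        # an I- tag continuing the current chunk needs no action
    return slots

def slot_f1(gold_seqs, pred_seqs):
    correct = found = true = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_slots(gold), extract_slots(pred)
        correct += len(g & p)                              # correct slots found
        true += len(g)                                     # denominator of Eq. (18)
        found += len(p)                                    # denominator of Eq. (19)
    recall = correct / true if true else 0.0
    precision = correct / found if found else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    gold = [["O", "O", "B-fromLoc", "O", "B-toLoc", "B-departDate"]]
    pred = [["O", "O", "B-fromLoc", "O", "B-toLoc", "O"]]
    print(slot_f1(gold, pred))  # (1.0, 0.666..., 0.8)
```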
B. Results

During the evaluation we focused on the differences between the neural network architectures; we also compared different optimization methods for the best network structure, and finally we included a comparison based on the type of recurrent unit used in the model. We ran each experiment 25 times, took the mean over the runs, and calculated the standard error. The results are reported in the tables below.

Our results show that hybrid architectures perform better than pure RNN or pure CNN models (Table II): with dropout 0.25 and the RMSProp optimization method, we obtained an F1-score of 95.04 for the hybrid model, compared with 91.16 for the convolutional model and 93.07 for the recurrent model. Our results also show that RMSProp produced the best models according to the F1-score metric (Table III): with the same dropout of 0.25 and the hybrid model, the F1-score was 95.04 for RMSProp, compared with 94.83 for ADAM and 94.44 for SGD. The results show that the hybrid Convolution1D and RNN/GRU structure without dropout, trained with the RMSProp optimizer, gives the best F1-score of 95.11 compared with other levels of dropout on the same architecture (Table IV). Based on the recurrent unit used in our experiments, GRU-based hybrid models reach an F1-score of 95.04, compared with 94.67 for LSTM-based hybrid models, so the GRU units improve the score by 0.37% (Table V). Finally, our results show that the hybrid CNN/RNN-based models outperform the Bi-directional Jordan-RNN baseline by 1.13% on the ATIS benchmark (Table VI).

TABLE II
COMPARISON BETWEEN DIFFERENT DEEP LEARNING STRUCTURES

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25             94.47      95.61   95.04     94.89 ±0.15
Convolution1D, dropout 0.25                                91.75      90.57   91.16     91.01 ±0.10
RNN/GRU, dropout 0.25                                      93.02      93.12   93.07     92.48 ±0.42

TABLE III
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON OPTIMIZATION METHOD

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/GRU, dropout 0.25; ADAM       94.71      94.95   94.83     94.67 ±0.23
Hybrid Convolution1D and RNN/GRU, dropout 0.25; SGD        94.22      94.65   94.44     94.23 ±0.16

TABLE IV
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON DROPOUT VALUE

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, no dropout; RMSProp      94.98      95.47   95.11     94.69 ±0.47
Hybrid Convolution1D and RNN/GRU, dropout 0.1; RMSProp     94.29      95.31   94.82     94.42 ±0.28
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/GRU, dropout 0.5; RMSProp     93.30      94.60   93.95     93.25 ±0.43

TABLE V
COMPARISON BETWEEN HYBRID STRUCTURES BASED ON THE RECURRENT UNIT USED

Structure description                                      Precision  Recall  F1-score  Average
Hybrid Convolution1D and RNN/GRU, dropout 0.25; RMSProp    94.47      95.61   95.04     94.89 ±0.15
Hybrid Convolution1D and RNN/LSTM, dropout 0.25; RMSProp   94.36      95.17   94.76     94.40 ±0.35

TABLE VI
COMPARISON WITH BASELINE MODELS ON THE ATIS BENCHMARK

Models                       Precision  Recall  F1-score
Jordan-RNN [18]              92.76      93.87   93.31
Bi-dir. Jordan-RNN [18]      93.82      94.15   93.98
Hybrid structure (ours)      94.98      95.47   95.11

Fig. 6. Best model architecture: a convolution layer followed by an RNN layer of GRU units, without dropout

V. CONCLUSION

This paper addresses the problem of slot filling in Spoken Language Understanding. In particular, we focused on slot tagging without paying attention to the intent classification part. We formulated our learning architecture as a hierarchy of spatial CNN features followed by RNNs to model dependencies in the temporal domain. Experimental results on the ATIS dataset consistently demonstrate the effectiveness of the proposed approach. It is worth mentioning that combined models that solve the two tasks at the same time can be implemented, and such models have been shown to lead to better performance. Still, on the way to implementing a full chatbot, we will also need to generate human-like text in response to user input. In future work, we intend to explore the incorporation of an attention mechanism in our model, which could provide additional information for slot label prediction, and to train our architecture on other datasets to generalize the results.

REFERENCES

[1] P. Su, N. Mrksic, I. Casanueva, and I. Vulic, "Deep Learning for Conversational AI," NAACL 2018 Tutorial, PolyAI and University of Cambridge, 2018.
[2] P. Haffner, G. Tur, and J. H. Wright, "Optimizing SVMs for complex call classification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2003, pp. I-632.
[3] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, "Deep belief nets for natural language call-routing," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5680-5683.
[4] A. McCallum, D. Freitag, and F. C. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proc. ICML, vol. 17, 2000, pp. 591-598.
[5] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proc. INTERSPEECH, 2007, pp. 1605-1608.
[6] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, "Spoken language understanding using long short-term memory neural networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 189-194.
[7] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al., "Using recurrent neural networks for slot filling in spoken language understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530-539, 2015.
[8] B. Liu and I. Lane, "Recurrent neural network structured output prediction for spoken language understanding," in Proc. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions, 2015.
[9] D. Guo, G. Tur, W.-t. Yih, and G. Zweig, "Joint semantic utterance classification and slot filling with recursive neural networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 554-559.
[10] P. Xu and R. Sarikaya, "Convolutional neural network based triangular CRF for joint intent detection and slot filling," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 78-83.
[11] M. Surdeanu and H. Ji, "Overview of the English Slot Filling Track at the TAC2014 Knowledge Base Population Evaluation," 3rd International Workshop on Knowledge Discovery on the WEB, 2017.
[12] U. Schade and M. R. Hieb, "Formalizing Battle Management Language: A Grammar for Specifying Orders," 06S-SIW-068, Spring 2006.
[13] G. Kurata, B. Xiang, B. Zhou, and M. Yu, "Leveraging sentence-level information with encoder LSTM for natural language understanding," arXiv preprint arXiv:1601.01530, 2016.
[14] S. T. Hsu, C. Moon, P. Jones, and N. F. Samatova, "A Hybrid CNN-RNN Alignment Model for Phrase-Aware Sentence Classification," in Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
[15] Y. He and S. Young, "A data-driven spoken language understanding system," in Proc. IEEE ASRU, 2003.
[16] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proc. INTERSPEECH, 2007.
[17] G. Tur, D. Hakkani-Tur, and L. Heck, "What is left to be understood in ATIS?" in Proc. IEEE SLT, 2010.
[18] G. Mesnil, X. He, L. Deng, and Y. Bengio, "Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding," in Proc. INTERSPEECH, 2013, pp. 3771-3775.
[19] B. Qu, X. Li, D. Tao, and X. Lu, "Deep semantic understanding of high resolution remote sensing image," in Proc. International Conference on Computer, Information and Telecommunication Systems (CITS), 2016.
[20] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," arXiv:1412.3555 [cs.NE], 2014.
[21] C. Sutton, A. McCallum, and K. Rohanimanesh, "Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data," Journal of Machine Learning Research (JMLR), 2007, pp. 693-723.
[22] C.-W. Goo, G. Gao, Y.-K. Hsu, C.-L. Huo, T.-C. Chen, K.-W. Hsu, and Y.-N. Chen, "Slot-Gated Modeling for Joint Slot Filling and Intent Prediction," in Proc. NAACL-HLT 2018, pp. 753-757.