Concept Tagging for Natural Language Understanding: Two Decadelong Algorithm Development Jacopo Gobbi Evgeny A. Stepanov Giuseppe Riccardi University of Trento VUI, Inc. University of Trento Trento, Italy Trento, Italy Trento, Italy jacopo.gobbi eas@vui.com giuseppe.riccardi @studenti.unitn.it @unitn.it Abstract development of the concept tagging (a.k.a. slot filling or entity extraction) task. It aims at com- English. Concept tagging is a type of puting a sequence of concept units, C = c1 ..cM , structured learning needed for natural lan- from a sequence of words in natural language, guage understanding (NLU) systems. In W = w1 ..wN . The task can be seen as a struc- this task, meaning labels from a domain tured learning problem where words are the input ontology are assigned to word sequences. and concepts are the output labels. In other words, In this paper, we review the algorithms the objective is to map a sentence (utterance) “I developed over the last twenty five years. want to go from Boston to Atlanta on Monday” to We perform a comparative evaluation of the sequence of domain labels “null null null generative, discriminative and deep learn- null null fromloc.city null toloc.city ing methods on two public datasets. We null depart date.day name”, that would allow report on the statistical variability perfor- to identify, for instance that Boston is a departure mance measurements. The third contribu- city . Difficulties may arise from different factors, tion is the release of a repository of the such as the variable token span of concepts, the algorithms, datasets and recipes for NLU long-distance word dependencies, a large and evaluation. ever changing vocabulary, or subtle semantic Italiano. L’annotazione automatica dei implications that might be hard to capture at concetti è un tipo di apprendimento a surface level or without some prior context strutturato necessario per i sistemi di knowledge. comprensione del linguaggio naturale Since the early nineties (Pieraccini and Levin, (NLU). In questo processo le etichette di 1992), the task has been designed as a core compo- un’ontologia di dominio sono assegnate nent of the natural language understanding process a sequenze di parole. In questo articolo in domain-limited conversational systems. Over esaminiamo gli algoritmi sviluppati negli the years, algorithms have been developed for gen- ultimi venticinque anni. Eseguiamo una erative, discriminative and, more recently, for deep valutazione comparativa dei metodi di ap- learning frameworks. In this paper, we provide a prendimento generativo, discriminatorio e comprehensive review of the algorithms, their pa- approfondito su due set di dati pubblici. Il rameters and their respective state-of-the-art per- secondo contributo é un’analisi della vari- formances. We discuss the relative advantages and abilitá delle misure di valutazione. Il terzo differences amongst algorithms in terms of perfor- contributo è il rilascio di un archivio degli mances and statistical variability and the optimal algoritmi, dei sets di dati e delle ricette per parameter settings. Last but not least, we have de- la valutazione dell’NLU. signed and provided a repository of the data, al- gorithms, implementations and parameter settings on two public datasets. The GitHub repository1 is 1 Introduction intended as a reference both for practitioners and The NLU component of a conversational system for algorithm development researchers. requires an automatic extraction of concept tags, With the conversational AI gaining popularity, dialogue acts, domain labels and entities. In the area of NLU is too vast to mention all relevant 1 this paper we describe and review the algorithm www.github.com/fruttasecca/concept-tagging-with-neural-networks or even recent studies. Moreover the objective learned from the training data. Additionally, we of this paper is to benchmark an important sub- experiment with word embeddings as additional task of NLU, concept tagging used by advanced features for CRFs (CRF+EMB). conversational systems. We benchmark genera- Recurrent Neural Networks (RNN). The first tive, discriminative and deep learning approaches neural network architecture4 we have considered to NLU, the work is in-line with the works of is an Elman RNN (Elman, 1990; Übeyli and (Raymond and Riccardi, 2007; Mesnil et al., 2015; Übeyli, 2012). In RNN, a hidden state depends Bechet and Raymond, 2018). Unlike previously on the current input and the previous hidden state. mentioned comparative performance analysis, in The output (label), on the other hand, depends on this paper, we benchmark deep learning architec- the new hidden state. tures and compare them to a generative and tradi- Long-Short Term Memory (LSTM) RNNs tional discriminative algorithms. To the best of our (Hochreiter and Schmidhuber, 1997) try to tackle knowledge, this is the first comprehensive compar- the vanishing gradient problem by introducing a ison of concept tagging algorithms at this scale on more complex mechanisms to address information public datasets and shared algorithm implementa- propagation and deletion, with the cost of a more tions (and their parameter settings). complex model with more parameters to train due to the system of gates it uses. The memory of 2 Algorithms the model is represented by the cell state and the Among the algorithms considered for benchmark- hidden state, which also represents the output for ing, we include a representative from the gen- the current token. We experimented with a sim- erative class, the weighted finite state transduc- ple LSTM, an LSTM which receives as input the ers (WFSTs), and two discriminative algorithms: word embedding concatenated with character em- Support Vector Machines (SVMs), Conditional beddings obtained through a convolutional layer Random Fields (CRFs), and a set of base neural (Józefowicz et al., 2016) (LSTM-CHAR-REP), networks architectures and their combinations. and an LSTM with pre-trained embeddings and Weighted Finite State Transducers2 cast con- dynamic embeddings learned from training data cept tagging as a translation problem from words (LSTM-2CH). In LSTM-2CH two separate LSTM to concepts (Raymond and Riccardi, 2007), and modules run in parallel and their outputs are con- usually consist of two components. The first catenated for each word. Similar to the rest of the component transduces words to concepts based deep learning models, the output is then fed to a on a score that can be either induced from data fully connected layer to map every token to the or manually designed; the second component is concept tag space. a stochastic conceptual language model, which Gated Recurrent Units (GRU) (Cho et al., re-scores concept sequences. The two com- 2014) use a reset and an update gate, which are ponents are composed to perform sequence-to- two vectors of weights that decide what informa- sequence translation and infer the best sequence tion is deleted (or re-scaled) from the current hid- using Viterbi algorithm. den state and how it will contribute to the new Support Vector Machines (SVM) are used hidden state, which is also the output for the cur- within Yamcha tool (Kudo and Matsumoto, 2001) rent input. Compared to the LSTM model, this that performs sequence labeling using forward and allows to train fewer parameters, but introduces a backward moving classifiers. Automatic labels as- constraint on memory, since it is also used as an signed to preceding tokens are used as dynamic output. features for the current token’s label decision. Convolutional Neural Networks (CONV) Conditional Random Fields (CRF)3 (Lafferty (Majumder et al., 2017; Kim, 2014) consider each et al., 2001) is a discriminative model based on a sentence as a matrix of shape (# words in sentence, dependency graph G and a set of features. Each size of embedding) for convolution using kernels feature fk has an associated weight λk . Features of different sizes to pass over the input sequence are generally hand-crafted and their weights are token-by-token, bigram by bigram and trigram by 2 We use OpenFST (http://www.openfst.org) and Open- trigram. The result of convolution is used as a GRM (http://www.opengrm.org) libraries. 3 4 We use CRFSUITE (Okazaki, 2007) implementation of All neural architectures are implemented within the Py- CRFs in out experiments. Torch framework (https://pytorch.org) starting hidden memory for a GRU RNN. GRU Model Parameters # Params F1 order 4, kneser ney (7907 states, 842178 arcs) 82.96 RNN is used on embedded tokens and starts with WFST order 4, kneser ney (4124 states, 76000 arcs) 93.08 the information on the sequence at a global level. SVM (4, 4) window of tokens, (- 10364 83.74 1, 0) of POS tag and pre- FC-INIT is similar to CONV. The difference is fix. Postfix and lemma of in the pre-elaboration of the hidden state, which is current word. Previous two done by fully connected layers elaborating on the labels. (6, 4) window of tokens, (- 16361 92.91 whole sequence. 1, 0) of prefix and postfix. ENCODER architecture (Cho et al., 2014) Previous two labels . casts the problem as a sequence-to-sequence trans- (4, 4) window of token, (- 1,200K 83.80 CRF 1, 0) of POS tag and prefix. lation and consists of two GRU RNNs. Encoder, Postfix and lemma of cur- the first GRU RNN, encodes the input sequence rent word. Previous + cur- to a fixed vector (the hidden state). Decoder, an- rent word conjunction, cur- rent + next word conjunc- other GRU RNN, uses the output of the encoder as tion. Bigram model. a starting hidden state. At each step, the decoder (6, 4) window of tokens, 2,201K 93.98 receives the label predicted at the previous step as (-1, 0) of prefix. Postfix of current word. Previous an input, starting with a special token. + current word conjunction. ATTENTION architecture is similar to EN- Bigram model. CODER with the addition of an attention mech- all above + (4, 4) word 1,390K 85.85 CRF+EMB embs + current token char anism (Bahdanau et al., 2014) on the outputs of embeddings the encoder. This allows the network to focus on all above + (6, 4) word 3,185K 94.00 embs + current token char a specific parts of the input sequence. The atten- embeddings tion weights are computed with a single fully con- nected layer that receives as input the embedding Table 1: F1 -scores for the WFST, SVM and of the current word concatenated to the last hidden CRF (with and without embeddings) algorithms state. on the MOVIES (top row) and ATIS (bottom row) LSTM-CRF (Yao et al., 2014; Zheng et al., datasets. 2015) is an architecture where the LSTM provides class scores for each token, and the Viterbi algo- from NL2SparQL (Chen et al., 2014) corpus semi- rithm decides on the labels of the sequence at a automatically aligning SPARQL query values to global level using bigrams and transition proba- utterance tokens. The dataset follows the split of bilities that are trained with the rest of the pa- the original corpus having 3,338 sentences (with rameters. We also experimented with a variant 1,728 unique tokens) and 1,084 sentences (with that considers character level information (LSTM- 1,039 tokens) in the training and test sets, respec- CRF-CHAR-REP). tively. The average length of a sentence is 6.50 and the OOV rate is 0.24. There are 43 concept 3 Corpora tags in the dataset. Given the Google embeddings, The evaluation of algorithms is performed on two once we consider every number as a class number, datasets. The Air Travel Information System we obtain 66 token types without an embedding (ATIS) dataset consists of sentences from users for the training set and 26 for the test set. querying for information about flights, departure dates, arrivals, etc. The training set consists of 4 Performance Analysis 4,978 sentences, while there are 893 sentences that One of our first observations is the fact that mod- constitute the test set. The average length of a sen- els such as WFST, SVM and CRF yield competi- tence is around 11 tokens, and there are a total of tive results with simple setups and few hyperpa- 127 unique tags (with IOB prefixes). Moreover, rameters to be tuned. The training of our deep the large majority of tokens missing an embedding learning models and the search of their hyperpa- are either numbers or airport/basis/aircraft codes. rameters would have been unfeasible without ded- The training set has a total of 18 types missing an icated hardware, while it took a fraction of the ef- embedding, and the test set has 9. fort for WFST, SVM and CRF. Moreover, adding The second corpus (MOVIES)5 was produced word embeddings as features to the CRF allowed 5 https://github.com/esrel/NL2SparQL4NLU it to outperform most of the deep neural networks. Model hidden epochs batch lr drop emb # of min F1 avg F1 best F1 size rate norm params 200 15 50 0.001 0.30 4 1,264K 81.00 82.55 83.96 RNN 400 10 50 0.001 0.25 2 580K 91.80 93.79 95.03 200 15 20 0.001 0.70 6 1,505K 82.67 83.76 84.57 LSTM 200 15 10 0.001 0.50 8 675K 87.82 94.53 95.36 400 20 20 0.001 0.70 4 2,085K 82.00 84.28 85.41 LSTM-CHAR-REP 400 15 10 0.001 0.50 6 1,272K 81.00 94.19 95.39 200 20 15 0.001 0.30 8 1,310K 81.22 82.68 83.76 LSTM-2CH 400 10 100 0.010 0.70 6 1,022K 93.10 94.61 95.38 200 20 20 0.001 0.50 4 1,424K 76.56 84.29 85.47 GRU 100 15 10 0.005 0.50 10 446K 91.53 94.28 95.28 200 20 20 0.001 0.50 4 2,646K 84.05 85.02 86.17 CONV 100 15 10 0.005 0.00 2 625K 91.51 94.22 95.38 100 30 20 0.001 0.30 4 2,805K 82.22 83.93 84.95 FC-INIT 400 15 50 0.010 0.25 4 7,144K 87.39 94.67 95.39 200 30 20 0.001 0.70 4 1,559K 71.25 76.39 79.00 ENCODER 200 25 5 0.001 0.70 6 730K 70.01 78.16 80.85 200 15 20 0.001 0.30 4 1,712K 71.86 79.77 82.67 ATTENTION 200 25 5 0.001 0.25 10 894K 92.47 94.09 94.98 200 10 1 0.001 0.70 6 1,507K 84.75 86.11 87.47 LSTM-CRF 400 15 10 0.001 0.50 6 1,200K 94.39 94.72 95.01 200 15 1 0.001 0.70 8 1,555K 85.07 86.08 87.05 LSTM-CRF-CHAR-REP 200 20 5 0.001 0.50 4 740K 94.45 94.91 95.12 Table 2: All models are bidirectional and have been trained with unfrozen Google embeddings, except for CONV and LSTM-2CH. Min, average and best F1 scores are obtained training the same model with the same hyperparameters, but different parameter initializations. Averages are from 50 runs for MOVIES and 25 for ATIS. For each architecture, the first row reports F1 -score for the MOVIES dataset and the second for ATIS. Hyperparameter search has been done randomly over ranges of values taken from published work. The number of parameters refers to the network parameters plus the embeddings, when those are unfrozen. Given a hidden layer size X reported in hidden column, each component in the bidirectional architecture would have a hidden layer size of X/2. Similarly, each of the two LSTM components in the LSTM-2CH model would have X/2 as a hidden layer size; and each bidirectional component would thus have a hidden layer size equal to X/4. We attribute this to two factors: (1) since these 4.1 Statistical Significance Testing models, unlike neural networks, do not learn fea- The best performing algorithms in our experi- ture representation from data, they are simpler and mental settings are LSTM-CRF and LSTM-CRF- faster to train; and, most importantly, (2) these CHAR-REP; however, they are not very far from models usually perform global optimization over CRF+EMB and CRF algorithms. In order to com- the label sequence, while neural networks usually pare the performances in terms of statistical signif- do not. Augmenting neural networks with CRF is icance, we perform Welch’s unequal variances t- not expensive in terms of parameters. Having a test (Welch, 1947), which, compared to more pop- CRF component on top of an LSTM increments ular Student’s t-test, does not assume equal vari- the number of parameters up to the square of the ances. The choice of test is motivated by the ob- tag-set size (about 2,500 for the MOVIES dataset), servation that neural architectures generally yield and provides the best performing model. higher variances than, for instance, CRF. There seems to be no strong correlation between The performances are compared on 10-fold the number of parameters and the variance of a cross-validation outputs on the training set for model performance with respect to the random ini- both ATIS and MOVIES datasets. Due to the tialization of its parameters. This is surprising, higher variance of neural network architectures, given the intuition that more parameters can po- a better way to test would be to perform many tentially lead to a lower probability of being stuck runs with different random initializations for each in a local minima. The case may be that differ- fold, and take the average of these results; how- ent initializations lead to different training times ever, such a procedure is computationally very de- required to get to good local minimas. manding. LSTM-CRF-CHAR-REP 4.2 Error Analysis Both MOVIES and ATIS datasets have imbal- anced distribution of concept labels. The imbal- anced distribution of labels is known to affect LSTM-CRF the performance of the minority classes. Conse- CRF-EMB quently, we correlate the distribution of labels in the training set to the percent of their mis-labeling CRF ALGORITHMS in the test set (by any model). As expected, the mis-labeling chance is inversely correlated to the MOVIES percentage of instances the label has in the training CRF set (e.g. given that a label amounts to less than 1% CRF-EMB * of a dataset, it usually has a mis-labeling chance LSTM-CRF * greater than 10%). For both datasets, the Kendall LSTM-CRF-CHAR-REP * rank correlation coefficients (Kendall, 1938) are ATIS approximately 0.6. CRF Independent of the distribution, there are certain CRF-EMB concepts that are mis-labeled more often. For ex- LSTM-CRF * ample, this is the case for producer name, person LSTM-CRF-CHAR-REP * * name, and director name in MOVIES, and city name, state name, and airport name in ATIS. It Table 3: Results of statistical significance test- is not surprising given that these concepts share ing using Welch’s t-test for MOVIES and ATIS the values (e.g. the same person may be an ac- datasets. Algorithms on rows with statistically sig- tor, director, and producer) and frequently lexical nificant differences in performance with p < 0.05 contexts. in comparison to the algorithms on columns are Supporting the observations in (Bechet and marked with ‘*’. Raymond, 2018) for ATIS, some errors stem from inconsistent labeling. For instance, in the The results of the statistical significance testing MOVIES dataset, “classic cars” is mapped to “O are reported in Table 3. For the MOVIES dataset, O”, but “are there any documentaries on clas- all the compared models (CRF-EMB, LSTM- sic cars” appears as “O O O B-movie.genre O CRF, LSTM-CRF-CHAR-REP) significantly out- B-movie.subject I-movie.subject”. perform the CRF model with p < 0.05. How- 5 Conclusion ever, these models do not yield statistically signif- icant differences among themselves. Specifically, One of the main outcomes of our experiments is using embeddings with CRF (i.e. CRF-EMB) pro- that sequence-level optimization is key to achieve duces statistically significant differences in perfor- the best performance. Moreover, augmenting any mance on top of CRF. Using CRF with LSTM, neural architecture with a CRF layer on top has even though produces better average F1 than CRF- a very low cost in terms of parameters and a EMB, the gain is not statistically significant, irre- very good return in terms of performance. Our spective of the type of embeddings used. best performing models (in terms of average F1 ) For the ATIS dataset, on the other hand, use are LSTM-CRF and LSTM-CRF-CHAR-REP. In of embeddings with CRF does not yield sta- general we may say that adding a sequence level tistically significant differences with respect to control to different type of NN architectures leads plain CRF. Neural architectures (LSTM-CRF and to very good model performances. Another im- LSTM-CRF-CHAR-REP), on the other hand, do portant observation is the variance of performance produce statistically significant difference in per- of NN models with respect to initialization pa- formance in comparison to CRF. Moreover, un- rameters. Consequently, we strongly believe that like for MOVIES dataset, the use of character em- this variability should be taken into consideration beddings in LSTM-CRF architecture significantly and reported (with the lowest and highest perfor- outperforms the CRF-EMB model. mances) to improve the reliability and replicability of the published results. References Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xi- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua aodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Bengio. 2014. Neural machine translation by 2015. Using recurrent neural networks for slot fill- jointly learning to align and translate. CoRR, ing in spoken language understanding. IEEE/ACM abs/1409.0473. Transactions on Audio, Speech, and Language Pro- Frederic Bechet and Christian Raymond. 2018. Is cessing, 23(3):530–539. ATIS too shallow to go deeper for benchmarking Naoaki Okazaki. 2007. Crfsuite: a fast implemen- spoken language understanding models? In Inter- tation of conditional random fields (crfs). URL speech. http://www. chokkan. org/software/crfsuite. Yun-Nung Chen, Dilek Hakkani-Tür, and Gokan Tur. Roberto Pieraccini and Esther Levin. 1992. Stochastic 2014. Deriving local relational surface forms from representation of semantic structure for speech un- dependency-based entity embeddings for unsuper- derstanding. Speech Communication, 11(2):283 – vised spoken language understanding. In Spoken 288. Eurospeech ’91. Language Technology Workshop (SLT), 2014 IEEE, pages 242–247. IEEE. Christian Raymond and Giuseppe Riccardi. 2007. Kyunghyun Cho, Bart van Merrienboer, Çaglar Generative and discriminative algorithms for spoken Gülçehre, Fethi Bougares, Holger Schwenk, and language understanding. In INTERSPEECH, pages Yoshua Bengio. 2014. Learning phrase representa- 1605–1608. ISCA. tions using RNN encoder-decoder for statistical ma- Elif Derya Übeyli and Mustafa Übeyli. 2012. Case chine translation. CoRR, abs/1406.1078. studies for applications of elman recurrent neural Jeffrey L. Elman. 1990. Finding structure in time. networks. COGNITIVE SCIENCE, 14(2):179–211. B. L. Welch. 1947. The generalization of ‘student’s’ Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long problem when several different population variances short-term memory. Neural Comput., 9(8):1735– are involved. Biometrika, 34(1-2):28–35. 1780, November. Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Yu, Xiaolong(Shiao-Long) Li, and Feng Gao. 2014. Shazeer, and Yonghui Wu. 2016. Exploring the lim- Recurrent conditional random field for language un- its of language modeling. CoRR, abs/1602.02410. derstanding. In ICASSP 2014. IEEE International Conference on Acoustics, Speech, and Signal Pro- M. G. Kendall. 1938. A new measure of rank correla- cessing (ICASSP), January. tion. Biometrika, 30(1-2):81–93. Shuai Zheng, Sadeep Jayasumana, Bernardino Yoon Kim. 2014. Convolutional neural networks for Romera-Paredes, Vibhav Vineet, Zhizhong Su, sentence classification. In Proceedings of the 2014 Dalong Du, Chang Huang, and Philip H. S. Torr. Conference on Empirical Methods in Natural Lan- 2015. Conditional random fields as recurrent neural guage Processing, EMNLP 2014, October 25-29, networks. In Proceedings of the 2015 IEEE Inter- 2014, Doha, Qatar, A meeting of SIGDAT, a Special national Conference on Computer Vision (ICCV), Interest Group of the ACL, pages 1746–1751. ICCV ’15, pages 1529–1537, Washington, DC, USA. IEEE Computer Society. Taku Kudo and Yuji Matsumoto. 2001. Chunking with support vector machines. In Proceedings of the Second Meeting of the North American Chap- ter of the Association for Computational Linguistics on Language Technologies, NAACL ’01, pages 1– 8, Stroudsburg, PA, USA. Association for Computa- tional Linguistics. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling se- quence data. In Proceedings of the Eighteenth Inter- national Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Mor- gan Kaufmann Publishers Inc. Navonil Majumder, Soujanya Poria, Alexander Gel- bukh, and Erik Cambria. 2017. Deep learning- based document modeling for personality detection from text. IEEE Intelligent Systems, 32(2):74–79, March.