<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="it">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Concept Tagging for Natural Language Understanding: Two Decadelong Algorithm Development</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jacopo</forename><surname>Gobbi</surname></persName>
							<email>jacopo.gobbi@studenti.unitn.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Trento Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Evgeny</forename><forename type="middle">A</forename><surname>Stepanov</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">VUI, Inc. Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Riccardi</surname></persName>
							<email>giuseppe.riccardi@unitn.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Concept Tagging for Natural Language Understanding: Two Decadelong Algorithm Development</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">341E12CAC1873B93BF465AF008E4ECE7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>English. Concept tagging is a type of structured learning needed for natural language understanding (NLU) systems. In this task, meaning labels from a domain ontology are assigned to word sequences. In this paper, we review the algorithms developed over the last twenty five years. We perform a comparative evaluation of generative, discriminative and deep learning methods on two public datasets. We report on the statistical variability performance measurements. The third contribution is the release of a repository of the algorithms, datasets and recipes for NLU evaluation.</p><p>Italiano. L'annotazione automatica dei concetti è un tipo di apprendimento strutturato necessario per i sistemi di comprensione del linguaggio naturale (NLU). In questo processo le etichette di un'ontologia di dominio sono assegnate a sequenze di parole. In questo articolo esaminiamo gli algoritmi sviluppati negli ultimi venticinque anni. Eseguiamo una valutazione comparativa dei metodi di apprendimento generativo, discriminatorio e approfondito su due set di dati pubblici. Il secondo contributo é un'analisi della variabilitá delle misure di valutazione. Il terzo contributo è il rilascio di un archivio degli algoritmi, dei sets di dati e delle ricette per la valutazione dell'NLU.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="it">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The NLU component of a conversational system requires an automatic extraction of concept tags, dialogue acts, domain labels and entities. In this paper we describe and review the algorithm development of the concept tagging (a.k.a. slot filling or entity extraction) task. It aims at computing a sequence of concept units, C = c 1 ..c M , from a sequence of words in natural language, W = w 1 ..w N . The task can be seen as a structured learning problem where words are the input and concepts are the output labels. In other words, the objective is to map a sentence (utterance) "I want to go from Boston to Atlanta on Monday" to the sequence of domain labels "null null null null null fromloc.city null toloc.city null depart date.day name", that would allow to identify, for instance that Boston is a departure city . Difficulties may arise from different factors, such as the variable token span of concepts, the long-distance word dependencies, a large and ever changing vocabulary, or subtle semantic implications that might be hard to capture at a surface level or without some prior context knowledge.</p><p>Since the early nineties <ref type="bibr" target="#b14">(Pieraccini and Levin, 1992)</ref>, the task has been designed as a core component of the natural language understanding process in domain-limited conversational systems. Over the years, algorithms have been developed for generative, discriminative and, more recently, for deep learning frameworks. In this paper, we provide a comprehensive review of the algorithms, their parameters and their respective state-of-the-art performances. We discuss the relative advantages and differences amongst algorithms in terms of performances and statistical variability and the optimal parameter settings. Last but not least, we have designed and provided a repository of the data, algorithms, implementations and parameter settings on two public datasets. The GitHub repository<ref type="foot" target="#foot_0">1</ref> is intended as a reference both for practitioners and for algorithm development researchers.</p><p>With the conversational AI gaining popularity, the area of NLU is too vast to mention all relevant or even recent studies. Moreover the objective of this paper is to benchmark an important subtask of NLU, concept tagging used by advanced conversational systems. We benchmark generative, discriminative and deep learning approaches to NLU, the work is in-line with the works of <ref type="bibr" target="#b15">(Raymond and Riccardi, 2007;</ref><ref type="bibr" target="#b12">Mesnil et al., 2015;</ref><ref type="bibr" target="#b1">Bechet and Raymond, 2018)</ref>. Unlike previously mentioned comparative performance analysis, in this paper, we benchmark deep learning architectures and compare them to a generative and traditional discriminative algorithms. To the best of our knowledge, this is the first comprehensive comparison of concept tagging algorithms at this scale on public datasets and shared algorithm implementations (and their parameter settings).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Algorithms</head><p>Among the algorithms considered for benchmarking, we include a representative from the generative class, the weighted finite state transducers (WFSTs), and two discriminative algorithms: Support Vector Machines (SVMs), Conditional Random Fields (CRFs), and a set of base neural networks architectures and their combinations.</p><p>Weighted Finite State Transducers<ref type="foot" target="#foot_1">2</ref> cast concept tagging as a translation problem from words to concepts <ref type="bibr" target="#b15">(Raymond and Riccardi, 2007)</ref>, and usually consist of two components. The first component transduces words to concepts based on a score that can be either induced from data or manually designed; the second component is a stochastic conceptual language model, which re-scores concept sequences. The two components are composed to perform sequence-tosequence translation and infer the best sequence using Viterbi algorithm.</p><p>Support Vector Machines (SVM) are used within Yamcha tool <ref type="bibr" target="#b9">(Kudo and Matsumoto, 2001)</ref> that performs sequence labeling using forward and backward moving classifiers. Automatic labels assigned to preceding tokens are used as dynamic features for the current token's label decision.</p><p>Conditional Random Fields (CRF)<ref type="foot" target="#foot_2">3</ref>  <ref type="bibr" target="#b10">(Lafferty et al., 2001</ref>) is a discriminative model based on a dependency graph G and a set of features. Each feature f k has an associated weight λ k . Features are generally hand-crafted and their weights are learned from the training data. Additionally, we experiment with word embeddings as additional features for CRFs (CRF+EMB).</p><p>Recurrent Neural Networks (RNN). The first neural network architecture<ref type="foot" target="#foot_3">4</ref> we have considered is an Elman RNN <ref type="bibr" target="#b4">(Elman, 1990;</ref><ref type="bibr" target="#b16">Übeyli and Übeyli, 2012)</ref>. In RNN, a hidden state depends on the current input and the previous hidden state. The output (label), on the other hand, depends on the new hidden state.</p><p>Long-Short Term Memory (LSTM) RNNs <ref type="bibr" target="#b5">(Hochreiter and Schmidhuber, 1997)</ref> try to tackle the vanishing gradient problem by introducing a more complex mechanisms to address information propagation and deletion, with the cost of a more complex model with more parameters to train due to the system of gates it uses. The memory of the model is represented by the cell state and the hidden state, which also represents the output for the current token. We experimented with a simple LSTM, an LSTM which receives as input the word embedding concatenated with character embeddings obtained through a convolutional layer <ref type="bibr" target="#b6">(Józefowicz et al., 2016)</ref> (LSTM-CHAR-REP), and an LSTM with pre-trained embeddings and dynamic embeddings learned from training data (LSTM-2CH). In LSTM-2CH two separate LSTM modules run in parallel and their outputs are concatenated for each word. 
Similar to the rest of the deep learning models, the output is then fed to a fully connected layer that maps every token to the concept tag space.</p><p>Gated Recurrent Units (GRU) <ref type="bibr" target="#b3">(Cho et al., 2014)</ref> use a reset and an update gate, two vectors of weights that decide what information is deleted (or re-scaled) in the current hidden state and how it contributes to the new hidden state, which is also the output for the current input. Compared to the LSTM model, this requires fewer parameters to train, but introduces a constraint on memory, since the hidden state is also used as the output.</p><p>Convolutional Neural Networks (CONV) <ref type="bibr" target="#b11">(Majumder et al., 2017;</ref><ref type="bibr" target="#b8">Kim, 2014)</ref> treat each sentence as a matrix of shape (number of words in the sentence, embedding size) and apply convolutions with kernels of different sizes that pass over the input sequence token by token, bigram by bigram and trigram by trigram. The result of the convolution is used as the initial hidden state of a GRU RNN, which is then run over the embedded tokens and thus starts with information about the sequence at a global level.</p><p>FC-INIT is similar to CONV. The difference lies in how the initial hidden state is computed: fully connected layers process the whole sequence.</p><p>The ENCODER architecture <ref type="bibr" target="#b3">(Cho et al., 2014</ref>) casts the problem as sequence-to-sequence translation and consists of two GRU RNNs. The encoder, the first GRU RNN, encodes the input sequence into a fixed vector (the hidden state). The decoder, another GRU RNN, uses the output of the encoder as its initial hidden state. At each step, the decoder receives the label predicted at the previous step as input, starting with a special token.</p><p>The ATTENTION architecture is similar to ENCODER with the addition of an attention mechanism <ref type="bibr" target="#b0">(Bahdanau et al., 2014)</ref> over the outputs of the encoder. This allows the network to focus on specific parts of the input sequence. The attention weights are computed with a single fully connected layer that receives as input the embedding of the current word concatenated with the last hidden state.</p><p>LSTM-CRF <ref type="bibr" target="#b18">(Yao et al., 2014;</ref><ref type="bibr" target="#b19">Zheng et al., 2015)</ref> is an architecture where the LSTM provides class scores for each token, and the Viterbi algorithm decides on the labels of the sequence at a global level, using bigrams and transition probabilities that are trained with the rest of the parameters. We also experimented with a variant that considers character-level information (LSTM-CRF-CHAR-REP).</p></div>
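<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal PyTorch sketch of the basic bidirectional LSTM tagger described above (not the released implementation; the class name, layer sizes and usage values are illustrative), each token is embedded, passed through the LSTM, and mapped to the concept tag space by a fully connected layer:</p><code>
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Embeddings -> bidirectional LSTM -> fully connected layer that maps
    every token to the concept tag space (a sketch, not the paper's code)."""
    def __init__(self, vocab_size, emb_dim, hidden_size, num_tags):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # each direction gets hidden_size // 2 units, as described for Table 2
        self.lstm = nn.LSTM(emb_dim, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.emb(token_ids)        # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(embedded)       # (batch, seq_len, hidden_size)
        return self.fc(hidden)                # per-token scores over the tags

# usage with illustrative sizes:
# tagger = LSTMTagger(vocab_size=2000, emb_dim=300, hidden_size=200, num_tags=50)
# scores = tagger(torch.randint(0, 2000, (1, 7)))   # shape (1, 7, 50)
</code></div>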
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Corpora</head><p>The evaluation of algorithms is performed on two datasets. The Air Travel Information System (ATIS) dataset consists of sentences from users querying for information about flights, departure dates, arrivals, etc. The training set consists of 4,978 sentences, while there are 893 sentences that constitute the test set. The average length of a sentence is around 11 tokens, and there are a total of 127 unique tags (with IOB prefixes). Moreover, the large majority of tokens missing an embedding are either numbers or airport/basis/aircraft codes.</p><p>The training set has a total of 18 types missing an embedding, and the test set has 9.</p><p>The second corpus (MOVIES)<ref type="foot" target="#foot_4">5</ref> was produced  Table <ref type="table">1</ref>: F 1 -scores for the WFST, SVM and CRF (with and without embeddings) algorithms on the MOVIES (top row) and ATIS (bottom row) datasets.</p><p>from NL2SparQL <ref type="bibr" target="#b2">(Chen et al., 2014)</ref> corpus semiautomatically aligning SPARQL query values to utterance tokens. The dataset follows the split of the original corpus having 3,338 sentences (with 1,728 unique tokens) and 1,084 sentences (with 1,039 tokens) in the training and test sets, respectively. The average length of a sentence is 6.50 and the OOV rate is 0.24. There are 43 concept tags in the dataset. Given the Google embeddings, once we consider every number as a class number, we obtain 66 token types without an embedding for the training set and 26 for the test set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Performance Analysis</head><p>One of our first observations is the fact that models such as WFST, SVM and CRF yield competitive results with simple setups and few hyperparameters to be tuned. The training of our deep learning models and the search of their hyperparameters would have been unfeasible without dedicated hardware, while it took a fraction of the effort for WFST, SVM and CRF. Moreover, adding word embeddings as features to the CRF allowed it to outperform most of the deep neural networks. For each architecture, the first row reports F 1 -score for the MOVIES dataset and the second for ATIS. Hyperparameter search has been done randomly over ranges of values taken from published work. The number of parameters refers to the network parameters plus the embeddings, when those are unfrozen. Given a hidden layer size X reported in hidden column, each component in the bidirectional architecture would have a hidden layer size of X/2. Similarly, each of the two LSTM components in the LSTM-2CH model would have X/2 as a hidden layer size; and each bidirectional component would thus have a hidden layer size equal to X/4.</p><p>We attribute this to two factors: (1) since these models, unlike neural networks, do not learn feature representation from data, they are simpler and faster to train; and, most importantly, (2) these models usually perform global optimization over the label sequence, while neural networks usually do not. Augmenting neural networks with CRF is not expensive in terms of parameters. Having a CRF component on top of an LSTM increments the number of parameters up to the square of the tag-set size (about 2,500 for the MOVIES dataset), and provides the best performing model.</p><p>There seems to be no strong correlation between the number of parameters and the variance of a model performance with respect to the random initialization of its parameters. This is surprising, given the intuition that more parameters can potentially lead to a lower probability of being stuck in a local minima. The case may be that different initializations lead to different training times required to get to good local minimas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Statistical Significance Testing</head><p>The best performing algorithms in our experimental settings are LSTM-CRF and LSTM-CRF-CHAR-REP; however, they are not very far from CRF+EMB and CRF algorithms. In order to compare the performances in terms of statistical significance, we perform Welch's unequal variances ttest <ref type="bibr" target="#b17">(Welch, 1947)</ref>, which, compared to more popular Student's t-test, does not assume equal variances. The choice of test is motivated by the observation that neural architectures generally yield higher variances than, for instance, CRF.</p><p>The performances are compared on 10-fold cross-validation outputs on the training set for both ATIS and MOVIES datasets. Due to the higher variance of neural network architectures, a better way to test would be to perform many runs with different random initializations for each fold, and take the average of these results; however, such a procedure is computationally very demanding.  The results of the statistical significance testing are reported in Table <ref type="table" target="#tab_2">3</ref>. For the MOVIES dataset, all the compared models (CRF-EMB, LSTM-CRF, LSTM-CRF-CHAR-REP) significantly outperform the CRF model with p &lt; 0.05. However, these models do not yield statistically significant differences among themselves. Specifically, using embeddings with CRF (i.e. CRF-EMB) produces statistically significant differences in performance on top of CRF. Using CRF with LSTM, even though produces better average F 1 than CRF-EMB, the gain is not statistically significant, irrespective of the type of embeddings used.</p><p>For the ATIS dataset, on the other hand, use of embeddings with CRF does not yield statistically significant differences with respect to plain CRF. Neural architectures (LSTM-CRF and LSTM-CRF-CHAR-REP), on the other hand, do produce statistically significant difference in performance in comparison to CRF. Moreover, unlike for MOVIES dataset, the use of character embeddings in LSTM-CRF architecture significantly outperforms the CRF-EMB model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Error Analysis</head><p>Both MOVIES and ATIS datasets have imbalanced distribution of concept labels. The imbalanced distribution of labels is known to affect the performance of the minority classes. Consequently, we correlate the distribution of labels in the training set to the percent of their mis-labeling in the test set (by any model). As expected, the mis-labeling chance is inversely correlated to the percentage of instances the label has in the training set (e.g. given that a label amounts to less than 1% of a dataset, it usually has a mis-labeling chance greater than 10%). For both datasets, the Kendall rank correlation coefficients <ref type="bibr" target="#b7">(Kendall, 1938</ref>) are approximately 0.6.</p><p>Independent of the distribution, there are certain concepts that are mis-labeled more often. For example, this is the case for producer name, person name, and director name in MOVIES, and city name, state name, and airport name in ATIS. It is not surprising given that these concepts share the values (e.g. the same person may be an actor, director, and producer) and frequently lexical contexts.</p><p>Supporting the observations in <ref type="bibr" target="#b1">(Bechet and Raymond, 2018)</ref> for ATIS, some errors stem from inconsistent labeling. For instance, in the MOVIES dataset, "classic cars" is mapped to "O O", but "are there any documentaries on classic cars" appears as "O O O B-movie.genre O B-movie.subject I-movie.subject".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>One of the main outcomes of our experiments is that sequence-level optimization is key to achieve the best performance. Moreover, augmenting any neural architecture with a CRF layer on top has a very low cost in terms of parameters and a very good return in terms of performance. Our best performing models (in terms of average F 1 ) are LSTM-CRF and LSTM-CRF-CHAR-REP. In general we may say that adding a sequence level control to different type of NN architectures leads to very good model performances. Another important observation is the variance of performance of NN models with respect to initialization parameters. Consequently, we strongly believe that this variability should be taken into consideration and reported (with the lowest and highest performances) to improve the reliability and replicability of the published results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>* LSTM-CRF-CHAR-REP * *</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>All models are bidirectional and have been trained with unfrozen Google embeddings, except for CONV and LSTM-2CH. Min, average and best F 1 scores are obtained training the same model with the same hyperparameters, but different parameter initializations. Averages are from 50 runs for MOVIES and 25 for ATIS.</figDesc><table><row><cell>Model</cell><cell cols="3">hidden epochs batch</cell><cell>lr</cell><cell>drop</cell><cell>emb</cell><cell>#</cell><cell>of</cell><cell cols="2">min F1 avg F1 best F1</cell></row><row><cell></cell><cell></cell><cell>size</cell><cell></cell><cell></cell><cell>rate</cell><cell>norm</cell><cell cols="2">params</cell><cell></cell></row><row><cell>RNN</cell><cell>200 400</cell><cell>15 10</cell><cell>50 50</cell><cell>0.001 0.001</cell><cell>0.30 0.25</cell><cell></cell><cell cols="2">4 1,264K 2 580K</cell><cell>81.00 91.80</cell><cell>82.55 93.79</cell><cell>83.96 95.03</cell></row><row><cell>LSTM</cell><cell>200 200</cell><cell>15 15</cell><cell>20 10</cell><cell>0.001 0.001</cell><cell>0.70 0.50</cell><cell></cell><cell cols="2">6 1,505K 8 675K</cell><cell>82.67 87.82</cell><cell>83.76 94.53</cell><cell>84.57 95.36</cell></row><row><cell>LSTM-CHAR-REP</cell><cell>400 400</cell><cell>20 15</cell><cell>20 10</cell><cell>0.001 0.001</cell><cell>0.70 0.50</cell><cell></cell><cell cols="2">4 2,085K 6 1,272K</cell><cell>82.00 81.00</cell><cell>84.28 94.19</cell><cell>85.41 95.39</cell></row><row><cell>LSTM-2CH</cell><cell>200 400</cell><cell>20 10</cell><cell>15 100</cell><cell>0.001 0.010</cell><cell>0.30 0.70</cell><cell></cell><cell cols="2">8 1,310K 6 1,022K</cell><cell>81.22 93.10</cell><cell>82.68 94.61</cell><cell>83.76 95.38</cell></row><row><cell>GRU</cell><cell>200 100</cell><cell>20 15</cell><cell>20 10</cell><cell>0.001 0.005</cell><cell>0.50 0.50</cell><cell cols="3">4 1,424K 10 446K</cell><cell>76.56 91.53</cell><cell>84.29 94.28</cell><cell>85.47 95.28</cell></row><row><cell>CONV</cell><cell>200 100</cell><cell>20 15</cell><cell>20 10</cell><cell>0.001 0.005</cell><cell>0.50 0.00</cell><cell></cell><cell cols="2">4 2,646K 2 625K</cell><cell>84.05 91.51</cell><cell>85.02 94.22</cell><cell>86.17 95.38</cell></row><row><cell>FC-INIT</cell><cell>100 400</cell><cell>30 15</cell><cell>20 50</cell><cell>0.001 0.010</cell><cell>0.30 0.25</cell><cell></cell><cell cols="2">4 2,805K 4 7,144K</cell><cell>82.22 87.39</cell><cell>83.93 94.67</cell><cell>84.95 95.39</cell></row><row><cell>ENCODER</cell><cell>200 200</cell><cell>30 25</cell><cell>20 5</cell><cell>0.001 0.001</cell><cell>0.70 0.70</cell><cell></cell><cell cols="2">4 1,559K 6 730K</cell><cell>71.25 70.01</cell><cell>76.39 78.16</cell><cell>79.00 80.85</cell></row><row><cell>ATTENTION</cell><cell>200 200</cell><cell>15 25</cell><cell>20 5</cell><cell>0.001 0.001</cell><cell>0.30 0.25</cell><cell cols="3">4 1,712K 10 894K</cell><cell>71.86 92.47</cell><cell>79.77 94.09</cell><cell>82.67 94.98</cell></row><row><cell>LSTM-CRF</cell><cell>200 400</cell><cell>10 15</cell><cell>1 10</cell><cell>0.001 0.001</cell><cell>0.70 0.50</cell><cell></cell><cell cols="2">6 1,507K 6 1,200K</cell><cell>84.75 94.39</cell><cell>86.11 94.72</cell><cell>87.47 95.01</cell></row><row><cell>LSTM-CRF-CHAR-REP</cell><cell>200 200</cell><cell>15 20</cell><cell>1 5</cell><cell>0.001 0.001</cell><cell>0.70 0.50</cell><cell></cell><cell cols="2">8 1,555K 4 740K</cell><cell>85.07 94.45</cell><cell>86.08 
94.91</cell><cell>87.05 95.12</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Results of statistical significance testing using Welch's t-test for MOVIES and ATIS datasets. Algorithms on rows with statistically significant differences in performance with p &lt; 0.05 in comparison to the algorithms on columns are marked with '*'.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">www.github.com/fruttasecca/concept-tagging-with-neural-networks</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">We use OpenFST (http://www.openfst.org) and Open-GRM (http://www.opengrm.org) libraries.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"> 3  We use CRFSUITE<ref type="bibr" target="#b13">(Okazaki, 2007)</ref> implementation of CRFs in out experiments.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">All neural architectures are implemented within the Py-Torch framework (https://pytorch.org)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://github.com/esrel/NL2SparQL4NLU</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Neural machine translation by jointly learning to align and translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>CoRR, abs/1409.0473</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Is ATIS too shallow to go deeper for benchmarking spoken language understanding models?</title>
		<author>
			<persName><forename type="first">Frederic</forename><surname>Bechet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Raymond</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding</title>
		<author>
			<persName><forename type="first">Yun-Nung</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dilek</forename><surname>Hakkani-Tür</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gokan</forename><surname>Tur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Spoken Language Technology Workshop (SLT)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="page" from="242" to="247" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Learning phrase representations using RNN encoder-decoder for statistical machine translation</title>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bart</forename><surname>Van Merrienboer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>¸aglar Gülc ¸ehre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fethi</forename><surname>Bougares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Holger</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>CoRR, abs/1406.1078</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Finding structure in time</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">L</forename><surname>Elman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">COGNITIVE SCIENCE</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="179" to="211" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">Sepp</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Comput</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997-11">1997. November</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Exploring the limits of language modeling</title>
		<author>
			<persName><forename type="first">Rafal</forename><surname>Józefowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<idno>CoRR, abs/1602.02410</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A new measure of rank correlation</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Kendall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrika</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="81" to="93" />
			<date type="published" when="1938">1938</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Convolutional neural networks for sentence classification</title>
		<author>
			<persName><forename type="first">Yoon</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014</title>
				<meeting>the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014<address><addrLine>Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-10-25">2014. October 25-29, 2014</date>
			<biblScope unit="page" from="1746" to="1751" />
		</imprint>
	</monogr>
	<note>, A meeting of SIGDAT, a Special Interest Group of the ACL</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Chunking with support vector machines</title>
		<author>
			<persName><forename type="first">Taku</forename><surname>Kudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuji</forename><surname>Matsumoto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL &apos;01</title>
				<meeting>the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL &apos;01<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><forename type="middle">C N</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighteenth International Conference on Machine Learning, ICML &apos;01</title>
				<meeting>the Eighteenth International Conference on Machine Learning, ICML &apos;01<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Deep learningbased document modeling for personality detection from text</title>
		<author>
			<persName><forename type="first">Navonil</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Soujanya</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Gelbukh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Cambria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="74" to="79" />
			<date type="published" when="2017-03">2017. March</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Using recurrent neural networks for slot filling in spoken language understanding</title>
		<author>
			<persName><forename type="first">Grégoire</forename><surname>Mesnil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yann</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kaisheng</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Li</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dilek</forename><surname>Hakkani-Tur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaodong</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Larry</forename><surname>Heck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gokhan</forename><surname>Tur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dong</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="530" to="539" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Naoaki</forename><surname>Okazaki</surname></persName>
		</author>
		<ptr target="http://www.chokkan.org/software/crfsuite" />
		<title level="m">Crfsuite: a fast implementation of conditional random fields (crfs</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Stochastic representation of semantic structure for speech understanding</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Pieraccini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Esther</forename><surname>Levin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eurospeech &apos;91</title>
				<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="283" to="288" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Generative and discriminative algorithms for spoken language understanding</title>
		<author>
			<persName><forename type="first">Christian</forename><surname>Raymond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giuseppe</forename><surname>Riccardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH</title>
				<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1605" to="1608" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Case studies for applications of elman recurrent neural networks</title>
		<author>
			<persName><forename type="first">Elif</forename><surname>Derya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Übeyli</forename></persName>
		</author>
		<author>
			<persName><forename type="first">Mustafa</forename><surname>Übeyli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The generalization of &apos;student&apos;s&apos; problem when several different population variances are involved</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">L</forename><surname>Welch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrika</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="28" to="35" />
			<date type="published" when="1947">1947</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Recurrent conditional random field for language understanding</title>
		<author>
			<persName><forename type="first">Kaisheng</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Baolin</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Zweig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dong</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaolong</forename><forename type="middle">(</forename><surname>Shiao-Long) Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Feng</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2014-01">2014. January</date>
		</imprint>
	</monogr>
	<note>ICASSP 2014</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Conditional random fields as recurrent neural networks</title>
		<author>
			<persName><forename type="first">Shuai</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sadeep</forename><surname>Jayasumana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernardino</forename><surname>Romera-Paredes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vibhav</forename><surname>Vineet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhizhong</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dalong</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chang</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">H S</forename><surname>Torr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV &apos;15</title>
				<meeting>the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV &apos;15<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1529" to="1537" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
