<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Concept Tagging for Natural Language Understanding: Two Decadelong Algorithm Development</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evgeny A. Stepanov</string-name>
          <email>eas@vui.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Riccardi</string-name>
          <email>giuseppe.riccardi@unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Gobbi</string-name>
          <email>jacopo.gobbi@studenti.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VUI, Inc.</institution>
          ,
          <addr-line>Trento, Italy</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Trento, Italy</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. Concept tagging is a type of structured learning needed for natural language understanding (NLU) systems. In this task, meaning labels from a domain ontology are assigned to word sequences. In this paper, we review the algorithms developed over the last twenty-five years. We perform a comparative evaluation of generative, discriminative and deep learning methods on two public datasets. The second contribution is an analysis of the statistical variability of the performance measurements. The third contribution is the release of a repository of the algorithms, datasets and recipes for NLU evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Italiano. Automatic concept annotation is a type of structured learning needed for natural language understanding (NLU) systems. In this process, the labels of a domain ontology are assigned to word sequences. In this article we review the algorithms developed over the last twenty-five years. We perform a comparative evaluation of generative, discriminative and deep learning methods on two public datasets. The second contribution is an analysis of the variability of the evaluation measures. The third contribution is the release of a repository of the algorithms, the datasets and the recipes for NLU evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>The NLU component of a conversational system
requires the automatic extraction of concept tags,
dialogue acts, domain labels and entities. In
this paper we describe and review the algorithm
development for the concept tagging (a.k.a. slot
filling or entity extraction) task. It aims at
computing a sequence of concept units, C = c_1, ..., c_M,
from a sequence of words in natural language,
W = w_1, ..., w_N. The task can be seen as a
structured learning problem where words are the input
and concepts are the output labels. In other words,
the objective is to map a sentence (utterance) “I
want to go from Boston to Atlanta on Monday” to
the sequence of domain labels “null null null
null null fromloc.city null toloc.city
null depart_date.day_name”, which allows us
to identify, for instance, that Boston is the departure
city. Difficulties may arise from different factors,
such as the variable token span of concepts,
long-distance word dependencies, a large and
ever-changing vocabulary, or subtle semantic
implications that might be hard to capture at
the surface level or without prior context
knowledge.</p>
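      <p>As an illustration of the task’s input/output structure, the following minimal Python sketch (our own, not part of the reviewed systems; label names follow the ATIS-style example above) pairs words with concept tags and collects the labelled spans into concept-value pairs.</p>
      <preformat>
# A minimal sketch of the concept tagging input/output structure.
words = "i want to go from boston to atlanta on monday".split()
tags = ["null", "null", "null", "null", "null", "fromloc.city",
        "null", "toloc.city", "null", "depart_date.day_name"]

def extract_concepts(words, tags):
    """Collect words carrying a non-null tag into (concept, value) pairs."""
    return [(tag, word) for word, tag in zip(words, tags) if tag != "null"]

print(extract_concepts(words, tags))
# [('fromloc.city', 'boston'), ('toloc.city', 'atlanta'),
#  ('depart_date.day_name', 'monday')]
      </preformat>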
      <p>
        Since the early nineties
        <xref ref-type="bibr" rid="ref15">(Pieraccini and Levin,
1992)</xref>
        , the task has been designed as a core
component of the natural language understanding process
in domain-limited conversational systems. Over
the years, algorithms have been developed for
generative, discriminative and, more recently, for deep
learning frameworks. In this paper, we provide a
comprehensive review of the algorithms, their
parameters and their respective state-of-the-art
performances. We discuss the relative advantages of and
differences amongst the algorithms in terms of performance,
statistical variability and optimal parameter settings.
Last but not least, we have designed and provided a
repository of the data, algorithms, implementations and
parameter settings on two public datasets. The GitHub
repository (www.github.com/fruttasecca/concept-tagging-with-neural-networks)
is intended as a reference both for practitioners and
for algorithm development researchers.
      </p>
      <p>
        With conversational AI gaining popularity,
the area of NLU is too vast to mention all relevant
or even recent studies. Moreover, the objective
of this paper is to benchmark an important
subtask of NLU: concept tagging as used by advanced
conversational systems. We benchmark
generative, discriminative and deep learning approaches
to NLU; this work is in line with the works of
        <xref ref-type="bibr" rid="ref13 ref16 ref2">(Raymond and Riccardi, 2007; Mesnil et al., 2015;
Bechet and Raymond, 2018)</xref>
        . Unlike the previously
mentioned comparative performance analyses, in
this paper we benchmark deep learning
architectures and compare them to generative and
traditional discriminative algorithms. To the best of our
knowledge, this is the first comprehensive
comparison of concept tagging algorithms at this scale on
public datasets with shared algorithm
implementations (and their parameter settings).
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 Algorithms</title>
      <p>Among the algorithms considered for
benchmarking, we include a representative of the
generative class, weighted finite state
transducers (WFSTs); two discriminative algorithms,
Support Vector Machines (SVMs) and Conditional
Random Fields (CRFs); and a set of basic neural
network architectures and their combinations.</p>
      <p>
        Weighted Finite State Transducers (implemented
with the OpenFST, http://www.openfst.org, and
OpenGRM, http://www.opengrm.org, libraries) cast
concept tagging as a translation problem from words
to concepts
        <xref ref-type="bibr" rid="ref16">(Raymond and Riccardi, 2007)</xref>
        , and
usually consist of two components. The first
component transduces words to concepts based
on a score that can be either induced from data
or manually designed; the second component is
a stochastic conceptual language model, which
re-scores concept sequences. The two
components are composed to perform
sequence-to-sequence translation and to infer the best
sequence using the Viterbi algorithm.
      </p>
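      <p>A schematic sketch of this two-component idea (a toy pure-Python Viterbi over assumed emission and concept language model scores, not the actual OpenFST/OpenGRM pipeline):</p>
      <preformat>
import math

def viterbi(words, concepts, emit, lm):
    """emit[(word, concept)] and lm[(prev, concept)] are negative log
    probabilities (lower is better); returns the best concept sequence."""
    best = {c: emit.get((words[0], c), math.inf) for c in concepts}
    back = []
    for word in words[1:]:
        scores, ptrs = {}, {}
        for c in concepts:
            # Best previous concept under the bigram concept LM.
            cand = {p: best[p] + lm.get((p, c), math.inf) for p in concepts}
            ptrs[c] = min(cand, key=cand.get)
            scores[c] = cand[ptrs[c]] + emit.get((word, c), math.inf)
        best = scores
        back.append(ptrs)
    # Follow back-pointers from the best final concept.
    path = [min(best, key=best.get)]
    for ptrs in reversed(back):
        path.insert(0, ptrs[path[0]])
    return path
      </preformat>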
      <sec id="sec-3-1">
        <title>Support Vector Machines (SVM)</title>
        <p>
          SVMs are used within the Yamcha tool
          <xref ref-type="bibr" rid="ref10">(Kudo and Matsumoto, 2001)</xref>
          , which performs sequence labeling using forward- and
backward-moving classifiers. Automatic labels
assigned to preceding tokens are used as dynamic
features for the current token’s label decision.
        </p>
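        <p>A sketch of such dynamic features (our own illustration, not Yamcha’s internals; the classifier interface is an assumption): labels predicted for preceding tokens are fed back as features while moving forward through the sentence.</p>
        <preformat>
def token_features(tokens, i, predicted_labels):
    feats = {"word": tokens[i].lower(), "prefix3": tokens[i][:3]}
    for offset in (1, 2):
        if i - offset >= 0:
            feats["word_prev%d" % offset] = tokens[i - offset].lower()
            # Dynamic feature: the label already predicted upstream.
            feats["label_prev%d" % offset] = predicted_labels[i - offset]
    return feats

def tag_forward(tokens, classifier):
    """Forward-moving pass; `classifier.predict` is an assumed interface."""
    labels = []
    for i in range(len(tokens)):
        labels.append(classifier.predict(token_features(tokens, i, labels)))
    return labels
        </preformat>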
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conditional Random Fields (CRF)</title>
      <p>
        CRFs
        <xref ref-type="bibr" rid="ref11">(Lafferty et al., 2001)</xref>
        are a discriminative model based on a
dependency graph G and a set of features. Each
feature f_k has an associated weight λ_k. Features
are generally hand-crafted and their weights are
learned from the training data. We use the CRFSUITE
        <xref ref-type="bibr" rid="ref14">(Okazaki, 2007)</xref>
        implementation of CRFs in our experiments.
Additionally, we experiment with word embeddings
as additional features for CRFs (CRF+EMB).
      </p>
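      <p>A sketch of the CRF+EMB setup under our assumptions (using the sklearn-crfsuite wrapper around CRFSUITE; names and sizes are illustrative): each dimension of a pre-trained word embedding becomes a real-valued feature next to the hand-crafted ones.</p>
      <preformat>
import sklearn_crfsuite  # assumes the sklearn-crfsuite package

def features(tokens, i, embeddings):
    feats = {"word": tokens[i].lower(), "prefix3": tokens[i][:3]}
    vector = embeddings.get(tokens[i].lower())  # assumed dict of vectors
    if vector is not None:
        for d, value in enumerate(vector):
            feats["emb_%d" % d] = float(value)  # real-valued feature
    return feats

def sentence_features(tokens, embeddings):
    return [features(tokens, i, embeddings) for i in range(len(tokens))]

# X: list of sentences as lists of feature dicts; y: lists of labels.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1)
# crf.fit(X_train, y_train); y_pred = crf.predict(X_test)
      </preformat>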
      <sec id="sec-4-1">
        <title>Recurrent Neural Networks (RNN)</title>
        <p>
          The first neural network architecture we have considered
is an Elman RNN
          <xref ref-type="bibr" rid="ref17 ref5">(Elman, 1990; Übeyli and
Übeyli, 2012)</xref>
          . All neural architectures are implemented within the
PyTorch framework (https://pytorch.org). In an RNN, the
hidden state depends on the current input and the
previous hidden state. The output (label), on the other
hand, depends on the new hidden state.
        </p>
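        <p>A minimal PyTorch sketch of such a tagger (sizes are illustrative, not the paper’s settings): the RNN updates its hidden state from the embedded input and the previous hidden state, and a linear layer maps each hidden state to concept-tag scores.</p>
        <preformat>
import torch
import torch.nn as nn

class RnnTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, word_ids):          # (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(word_ids))
        return self.out(hidden_states)    # (batch, seq_len, num_tags)

scores = RnnTagger(1000, 50, 100, 43)(torch.randint(0, 1000, (1, 11)))
        </preformat>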
        <p>
          Long Short-Term Memory (LSTM) RNNs
          <xref ref-type="bibr" rid="ref6">(Hochreiter and Schmidhuber, 1997)</xref>
          try to tackle
the vanishing gradient problem by introducing a
more complex mechanism to address information
propagation and deletion, at the cost of a more
complex model with more parameters to train due
to the system of gates it uses. The memory of
the model is represented by the cell state and the
hidden state, which also represents the output for
the current token. We experimented with a
simple LSTM, an LSTM that receives as input the
word embedding concatenated with character
embeddings obtained through a convolutional layer
          <xref ref-type="bibr" rid="ref7">(Józefowicz et al., 2016)</xref>
          (LSTM-CHAR-REP),
and an LSTM with pre-trained embeddings and
dynamic embeddings learned from the training data
(LSTM-2CH). In LSTM-2CH two separate LSTM
modules run in parallel and their outputs are
concatenated for each word. As in the rest of the
deep learning models, the output is then fed to a
fully connected layer to map every token to the
concept tag space.
        </p>
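        <p>A sketch of the LSTM-CHAR-REP variant under assumed sizes: character embeddings are convolved and max-pooled into one vector per word, concatenated with the word embedding, and fed to the LSTM.</p>
        <preformat>
import torch
import torch.nn as nn

class CharRepTagger(nn.Module):
    def __init__(self, vocab, chars, w_dim, c_dim, hidden, tags):
        super().__init__()
        self.w_emb = nn.Embedding(vocab, w_dim)
        self.c_emb = nn.Embedding(chars, c_dim)
        self.conv = nn.Conv1d(c_dim, c_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(w_dim + c_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tags)

    def forward(self, words, chars):  # words: (B, T); chars: (B, T, L)
        B, T, L = chars.shape
        c = self.c_emb(chars.view(B * T, L)).transpose(1, 2)
        c = self.conv(c).max(dim=2).values.view(B, T, -1)  # char rep per word
        x = torch.cat([self.w_emb(words), c], dim=2)
        h, _ = self.lstm(x)
        return self.out(h)            # (B, T, tags)
        </preformat>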
        <p>
          Gated Recurrent Units (GRU)
          <xref ref-type="bibr" rid="ref1 ref4">(Cho et al.,
2014)</xref>
          use a reset and an update gate, which are
two vectors of weights that decide what
information is deleted (or re-scaled) from the current
hidden state and how it contributes to the new
hidden state, which is also the output for the
current input. Compared to the LSTM model, this
allows training fewer parameters, but introduces a
constraint on memory, since the hidden state is also
used as the output.
        </p>
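        <p>For concreteness, a from-scratch sketch of the gating just described (in practice one would use PyTorch’s nn.GRU):</p>
        <preformat>
import torch
import torch.nn as nn

class GruCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.reset = nn.Linear(in_dim + hid_dim, hid_dim)
        self.update = nn.Linear(in_dim + hid_dim, hid_dim)
        self.candidate = nn.Linear(in_dim + hid_dim, hid_dim)

    def forward(self, x, h):
        r = torch.sigmoid(self.reset(torch.cat([x, h], dim=1)))
        z = torch.sigmoid(self.update(torch.cat([x, h], dim=1)))
        # Candidate state from the input and the re-scaled memory.
        n = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        # The update gate interpolates old and candidate state; the
        # result is both the new memory and the output.
        return (1 - z) * n + z * h
        </preformat>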
        <p>
          Convolutional Neural Networks (CONV)
          <xref ref-type="bibr" rid="ref12 ref9">(Majumder et al., 2017; Kim, 2014)</xref>
          consider each
sentence as a matrix of shape (# words in sentence,
size of embedding) for convolution using kernels
of different sizes that pass over the input sequence
token by token, bigram by bigram and trigram by
trigram. The result of the convolution is used as the
starting hidden memory for a GRU RNN. The GRU
RNN is then run on the embedded tokens, starting
with information on the sequence at a global level.
        </p>
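        <p>A sketch of this architecture under assumed sizes: kernels of width 1, 2 and 3 pass over the embedded sentence, their max-pooled outputs initialize the GRU’s hidden state, and the GRU is then run over the same tokens.</p>
        <preformat>
import torch
import torch.nn as nn

class ConvInitTagger(nn.Module):
    def __init__(self, vocab, emb_dim, hidden, tags):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, hidden, k, padding=k - 1) for k in (1, 2, 3)])
        self.to_h0 = nn.Linear(3 * hidden, hidden)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tags)

    def forward(self, words):                 # (B, T)
        x = self.emb(words)                   # (B, T, emb_dim)
        m = x.transpose(1, 2)                 # (B, emb_dim, T)
        pooled = [conv(m).max(dim=2).values for conv in self.convs]
        h0 = torch.tanh(self.to_h0(torch.cat(pooled, dim=1))).unsqueeze(0)
        h, _ = self.gru(x, h0)  # starts from a global view of the sequence
        return self.out(h)
        </preformat>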
        <p>FC-INIT is similar to CONV. The difference is
in the initialization of the hidden state, which is
computed by fully connected layers operating on the
whole sequence.</p>
        <p>
          ENCODER architecture
          <xref ref-type="bibr" rid="ref1 ref4">(Cho et al., 2014)</xref>
          casts the problem as sequence-to-sequence
translation and consists of two GRU RNNs. The encoder,
the first GRU RNN, encodes the input sequence
into a fixed vector (the hidden state). The decoder,
another GRU RNN, uses the output of the encoder as
its starting hidden state. At each step, the decoder
receives the label predicted at the previous step as
input, starting with a special token.
        </p>
        <p>
          ATTENTION architecture is similar to
ENCODER with the addition of an attention
mechanism
          <xref ref-type="bibr" rid="ref1">(Bahdanau et al., 2014)</xref>
          on the outputs of
the encoder. This allows the network to focus on
specific parts of the input sequence. The
attention weights are computed with a single fully
connected layer that receives as input the embedding
of the current word concatenated with the last hidden
state.
        </p>
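        <p>A sketch of that attention step in isolation (sizes assumed; scoring over a fixed maximum length, as in common PyTorch seq2seq tutorials):</p>
        <preformat>
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, emb_dim, hid_dim, max_len):
        super().__init__()
        # Single fully connected layer producing one score per position.
        self.score = nn.Linear(emb_dim + hid_dim, max_len)

    def forward(self, word_emb, last_hidden, encoder_outputs):
        # word_emb: (B, emb_dim); last_hidden: (B, hid_dim)
        # encoder_outputs: (B, T, hid_dim), T bounded by max_len
        T = encoder_outputs.size(1)
        scores = self.score(torch.cat([word_emb, last_hidden], dim=1))
        weights = torch.softmax(scores[:, :T], dim=1)
        # Weighted sum of encoder outputs: the context for this step.
        return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        </preformat>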
        <p>
          LSTM-CRF
          <xref ref-type="bibr" rid="ref19 ref20">(Yao et al., 2014; Zheng et al.,
2015)</xref>
          is an architecture where the LSTM provides
class scores for each token, and the Viterbi
algorithm decides on the labels of the sequence at a
global level using bigrams and transition
probabilities that are trained with the rest of the
parameters. We also experimented with a variant
that considers character-level information
(LSTM-CRF-CHAR-REP).
        </p>
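        <p>A decoding sketch for this combination (our own illustration, with assumed tensors): given per-token class scores from the LSTM and a learned tag-transition matrix, Viterbi search selects the label sequence at the global level.</p>
        <preformat>
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (T, num_tags); transitions[i, j]: score of tag i to j."""
    T, num_tags = emissions.shape
    score = emissions[0]  # best score ending in each tag so far
    back = []
    for t in range(1, T):
        # total[i, j]: reach tag j at step t coming from tag i.
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, ptr = total.max(dim=0)
        back.append(ptr)
    best = [int(score.argmax())]
    for ptr in reversed(back):
        best.insert(0, int(ptr[best[0]]))
    return best
        </preformat>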
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3 Corpora</title>
      <p>The evaluation of algorithms is performed on two
datasets. The Air Travel Information System
(ATIS) dataset consists of sentences from users
querying for information about flights, departure
dates, arrivals, etc. The training set consists of
4,978 sentences, while the test set contains 893
sentences. The average length of a
sentence is around 11 tokens, and there are a total of
127 unique tags (with IOB prefixes). Moreover,
the large majority of tokens missing an embedding
are either numbers or airport/basis/aircraft codes.
The training set has a total of 18 types missing an
embedding, and the test set has 9.</p>
      <p>
        The second corpus (MOVIES, available at
https://github.com/esrel/NL2SparQL4NLU) was produced
from the NL2SparQL
        <xref ref-type="bibr" rid="ref3">(Chen et al., 2014)</xref>
        corpus by
semi-automatically aligning SPARQL query values to
utterance tokens. The dataset follows the split of
the original corpus, having 3,338 sentences (with
1,728 unique tokens) and 1,084 sentences (with
1,039 tokens) in the training and test sets,
respectively. The average length of a sentence is 6.50
and the OOV rate is 0.24. There are 43 concept
tags in the dataset. Given the Google embeddings,
once we consider every number as a number class,
we obtain 66 token types without an embedding
for the training set and 26 for the test set.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Parameter settings of the WFST, SVM, CRF and CRF+EMB models for the MOVIES and ATIS datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>MOVIES</th><th>ATIS</th></tr>
          </thead>
          <tbody>
            <tr><td>WFST</td><td>order 4, Kneser-Ney</td><td>order 4, Kneser-Ney</td></tr>
            <tr><td>SVM</td><td>(4, 4) window of tokens, (1, 0) of POS tag and prefix; postfix and lemma of current word; previous two labels</td><td>(6, 4) window of tokens, (1, 0) of prefix and postfix; previous two labels</td></tr>
            <tr><td>CRF</td><td>(4, 4) window of tokens, (1, 0) of POS tag and prefix; postfix and lemma of current word; previous + current and current + next word conjunctions; bigram model</td><td>(6, 4) window of tokens, (-1, 0) of prefix; postfix of current word; previous + current word conjunction; bigram model</td></tr>
            <tr><td>CRF+EMB</td><td>all of the above + (6, 4) window of word embeddings + current token character embeddings</td><td>all of the above + (4, 4) window of word embeddings + current token character embeddings</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>4 Performance Analysis</title>
      <p>One of our first observations is that
models such as WFST, SVM and CRF yield
competitive results with simple setups and few
hyperparameters to tune. The training of our deep
learning models and the search for their
hyperparameters would have been infeasible without
dedicated hardware, while it took a fraction of the
effort for WFST, SVM and CRF. Moreover, adding
word embeddings as features to the CRF allowed
it to outperform most of the deep neural networks.</p>
      <p>We attribute this to two factors: (1) since these
models, unlike neural networks, do not learn
feature representation from data, they are simpler and
faster to train; and, most importantly, (2) these
models usually perform global optimization over
the label sequence, while neural networks usually
do not. Augmenting neural networks with a CRF is
not expensive in terms of parameters. Having a
CRF component on top of an LSTM increases
the number of parameters by up to the square of the
tag-set size (about 2,500 for the MOVIES dataset),
and provides the best performing model.</p>
      <p>There seems to be no strong correlation between
the number of parameters and the variance of a
model’s performance with respect to the random
initialization of its parameters. This is surprising,
given the intuition that more parameters can
potentially lead to a lower probability of being stuck
in a local minimum. It may be that
different initializations lead to different training times
required to reach good local minima.</p>
      <sec id="sec-6-1">
        <title>4.1 Statistical Significance Testing</title>
        <p>
          The best performing algorithms in our
experimental settings are LSTM-CRF and
LSTM-CRF-CHAR-REP; however, they are not very far from
the CRF+EMB and CRF algorithms. In order to
compare the performances in terms of statistical
significance, we perform Welch’s unequal variances
t-test
          <xref ref-type="bibr" rid="ref18">(Welch, 1947)</xref>
          , which, unlike the more
popular Student’s t-test, does not assume equal
variances. The choice of test is motivated by the
observation that neural architectures generally yield
higher variances than, for instance, CRF.
        </p>
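        <p>Concretely, the comparison can be run with SciPy’s implementation (the per-fold F1 lists below are placeholders, not our measurements):</p>
        <preformat>
from scipy import stats

# Placeholder per-fold F1 scores for two models (illustrative only).
f1_model_a = [82.1, 81.7, 83.0, 82.4, 81.9, 82.6, 82.2, 81.5, 82.8, 82.0]
f1_model_b = [82.9, 81.2, 83.8, 82.1, 83.4, 81.7, 83.1, 82.5, 84.0, 81.9]

# equal_var=False selects Welch's unequal-variances t-test.
t_stat, p_value = stats.ttest_ind(f1_model_a, f1_model_b, equal_var=False)
        </preformat>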
        <p>The performances are compared on 10-fold
cross-validation outputs on the training set for
both ATIS and MOVIES datasets. Due to the
higher variance of neural network architectures,
a better way to test would be to perform many
runs with different random initializations for each
fold, and take the average of these results;
however, such a procedure is computationally very
demanding.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Statistical significance testing of CRF, CRF-EMB, LSTM-CRF and LSTM-CRF-CHAR-REP on the MOVIES and ATIS datasets.</p>
          </caption>
        </table-wrap>
        <p>The results of the statistical significance testing
are reported in Table 3. For the MOVIES dataset,
all the compared models (CRF-EMB, LSTM-CRF,
LSTM-CRF-CHAR-REP) significantly
outperform the CRF model with p &lt; 0.05.
However, these models do not yield statistically
significant differences among themselves. Specifically,
using embeddings with CRF (i.e. CRF-EMB)
produces statistically significant improvements in
performance over plain CRF. Using CRF with LSTM,
even though it produces a better average F1 than
CRF-EMB, does not yield a statistically significant gain,
irrespective of the type of embeddings used.</p>
        <p>
          For the ATIS dataset, on the other hand, the use
of embeddings with CRF does not yield
statistically significant differences with respect to
plain CRF. The neural architectures (LSTM-CRF and
LSTM-CRF-CHAR-REP), on the other hand, do
produce statistically significant differences in
performance in comparison to CRF. Moreover,
unlike for the MOVIES dataset, the use of character
embeddings in the LSTM-CRF architecture significantly
outperforms the CRF-EMB model.
        </p>
        <p>
          Both the MOVIES and ATIS datasets have an
imbalanced distribution of concept labels. An
imbalanced distribution of labels is known to hurt
performance on the minority classes.
Consequently, we correlate the distribution of labels in
the training set with the percentage of their mis-labelings
in the test set (by any model). As expected, the
mis-labeling chance is inversely correlated with the
percentage of instances the label has in the training
set (e.g. if a label amounts to less than 1%
of a dataset, it usually has a mis-labeling chance
greater than 10%). For both datasets, the Kendall
rank correlation coefficients
          <xref ref-type="bibr" rid="ref8">(Kendall, 1938)</xref>
          are
approximately 0.6.
        </p>
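        <p>The correlation itself can be computed with SciPy (the paired lists below are illustrative placeholders for label frequency and mis-labeling rate):</p>
        <preformat>
from scipy import stats

# Share of training instances per label (%) and mis-labeling rate (%).
train_share = [9.1, 4.9, 2.2, 1.3, 0.8, 0.4]     # illustrative values
mislabeling = [2.0, 4.5, 8.0, 11.0, 14.0, 21.0]  # illustrative values

tau, p_value = stats.kendalltau(train_share, mislabeling)
# A negative tau of this magnitude reflects the inverse correlation
# between label frequency and mis-labeling chance reported above.
        </preformat>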
        <p>Independent of the distribution, there are certain
concepts that are mis-labeled more often. For
example, this is the case for producer name, person
name, and director name in MOVIES, and city
name, state name, and airport name in ATIS. This
is not surprising, given that these concepts share
values (e.g. the same person may be an
actor, director, and producer) and, frequently, lexical
contexts.</p>
        <p>
          Supporting the observations in
          <xref ref-type="bibr" rid="ref2">(Bechet and
Raymond, 2018)</xref>
          for ATIS, some errors stem
from inconsistent labeling. For instance, in the
MOVIES dataset, “classic cars” is mapped to “O
O”, but “are there any documentaries on
classic cars” appears as “O O O B-movie.genre O
B-movie.subject I-movie.subject”.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5 Conclusion</title>
      <p>One of the main outcomes of our experiments is
that sequence-level optimization is key to achieving
the best performance. Moreover, augmenting any
neural architecture with a CRF layer on top has
a very low cost in terms of parameters and a
very good return in terms of performance. Our
best performing models (in terms of average F1)
are LSTM-CRF and LSTM-CRF-CHAR-REP. In
general, we may say that adding sequence-level
control to different types of NN architectures leads
to very good model performance. Another
important observation is the variance of the performance
of NN models with respect to parameter
initialization. Consequently, we strongly believe that
this variability should be taken into consideration
and reported (with the lowest and highest
performances) to improve the reliability and replicability
of published results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          . CoRR, abs/1409.0473.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Bechet</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Raymond</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Is ATIS too shallow to go deeper for benchmarking spoken language understanding models?</article-title>
          <source>In Interspeech</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Yun-Nung</given-names>
            <surname>Chen</surname>
          </string-name>
          , Dilek Hakkani-Tür, and
          <string-name>
            <given-names>Gokhan</given-names>
            <surname>Tur</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding</article-title>
          .
          <source>In Spoken Language Technology Workshop (SLT)</source>
          ,
          <year>2014</year>
          IEEE, pages
          <fpage>242</fpage>
          -
          <lpage>247</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>CoRR, abs/1406.1078</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey L.</given-names>
            <surname>Elman</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Finding structure in time</article-title>
          .
          <source>COGNITIVE SCIENCE</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput.</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          , November.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Rafal</given-names>
            <surname>Józefowicz</surname>
          </string-name>
          , Oriol Vinyals, Mike Schuster, Noam Shazeer, and
          <string-name>
            <given-names>Yonghui</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Exploring the limits of language modeling</article-title>
          .
          <source>CoRR, abs/1602.02410</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kendall</surname>
          </string-name>
          .
          <year>1938</year>
          .
          <article-title>A new measure of rank correlation</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>30</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29</source>
          , Doha, Qatar. A meeting of SIGDAT, a Special Interest Group of the ACL, pages
          <fpage>1746</fpage>
          -
          <lpage>1751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Taku</given-names>
            <surname>Kudo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yuji</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Chunking with support vector machines</article-title>
          .
          <source>In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies</source>
          ,
          <source>NAACL '01</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>John D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01</source>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          , San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Navonil</given-names>
            <surname>Majumder</surname>
          </string-name>
          , Soujanya Poria, Alexander Gelbukh, and
          <string-name>
            <given-names>Erik</given-names>
            <surname>Cambria</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deep learning-based document modeling for personality detection from text</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>74</fpage>
          -
          <lpage>79</lpage>
          ,
          <year>March</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Grégoire</given-names>
            <surname>Mesnil</surname>
          </string-name>
          , Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al.
          <year>2015</year>
          .
          <article-title>Using recurrent neural networks for slot filling in spoken language understanding</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>3</issue>
          ):
          <fpage>530</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Naoaki</given-names>
            <surname>Okazaki</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Crfsuite: a fast implementation of conditional random fields (crfs)</article-title>
          . URL http://www.chokkan.org/software/crfsuite.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Pieraccini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Esther</given-names>
            <surname>Levin</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Stochastic representation of semantic structure for speech understanding</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>283</fpage>
          -
          <lpage>288</lpage>
          . Eurospeech '91.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Raymond</surname>
          </string-name>
          and
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Riccardi</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Generative and discriminative algorithms for spoken language understanding</article-title>
          .
          <source>In INTERSPEECH</source>
          , pages
          <fpage>1605</fpage>
          -
          <lpage>1608</lpage>
          . ISCA.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Elif Derya</given-names>
            <surname>Übeyli</surname>
          </string-name>
          and Mustafa Übeyli.
          <year>2012</year>
          .
          <article-title>Case studies for applications of elman recurrent neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Welch</surname>
          </string-name>
          .
          <year>1947</year>
          .
          <article-title>The generalization of 'student's' problem when several different population variances are involved</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>34</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>28</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Kaisheng</given-names>
            <surname>Yao</surname>
          </string-name>
          , Baolin Peng, Geoffrey Zweig, Dong Yu, Xiaolong (Shiao-Long) Li, and
          <string-name>
            <given-names>Feng</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Recurrent conditional random field for language understanding</article-title>
          .
          <source>In ICASSP 2014. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source>
          ,
          <year>January</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Shuai</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su,
          <string-name>
            <given-names>Dalong</given-names>
            <surname>Du</surname>
          </string-name>
          , Chang Huang, and
          <string-name>
            <given-names>Philip H. S.</given-names>
            <surname>Torr</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Conditional random fields as recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 2015 IEEE International Conference on Computer Vision</source>
          (ICCV),
          <source>ICCV '15</source>
          , pages
          <fpage>1529</fpage>
          -
          <lpage>1537</lpage>
          , Washington, DC, USA. IEEE Computer Society.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>