Concept Tagging for Natural Language Understanding: Two Decadelong Algorithm Development

Jacopo Gobbi, University of Trento, Trento, Italy (jacopo.gobbi@studenti.unitn.it)
Evgeny A. Stepanov, VUI, Inc., Trento, Italy (eas@vui.com)
Giuseppe Riccardi, University of Trento, Trento, Italy (giuseppe.riccardi@unitn.it)

Abstract

English. Concept tagging is a type of structured learning needed for natural language understanding (NLU) systems. In this task, meaning labels from a domain ontology are assigned to word sequences. In this paper, we review the algorithms developed over the last twenty-five years. We perform a comparative evaluation of generative, discriminative and deep learning methods on two public datasets. Our second contribution is an analysis of the statistical variability of the performance measurements. The third contribution is the release of a repository of the algorithms, datasets and recipes for NLU evaluation.

Italiano. Automatic concept annotation is a type of structured learning required by natural language understanding (NLU) systems. In this process, labels from a domain ontology are assigned to word sequences. In this paper we review the algorithms developed over the last twenty-five years. We perform a comparative evaluation of generative, discriminative and deep learning methods on two public datasets. The second contribution is an analysis of the variability of the evaluation measures. The third contribution is the release of a repository of the algorithms, datasets and recipes for NLU evaluation.

1 Introduction

The NLU component of a conversational system requires the automatic extraction of concept tags, dialogue acts, domain labels and entities. In this paper we describe and review algorithm development for the concept tagging (a.k.a. slot filling or entity extraction) task. It aims at computing a sequence of concept units, C = c_1 ... c_M, from a sequence of words in natural language, W = w_1 ... w_N. The task can be seen as a structured learning problem where words are the input and concepts are the output labels. In other words, the objective is to map a sentence (utterance) "I want to go from Boston to Atlanta on Monday" to the sequence of domain labels "null null null null null fromloc.city null toloc.city null depart_date.day_name", which would allow us to identify, for instance, that Boston is a departure city. Difficulties may arise from different factors, such as the variable token span of concepts, long-distance word dependencies, a large and ever-changing vocabulary, or subtle semantic implications that might be hard to capture at the surface level or without some prior context knowledge.
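To make the input-output structure of the task concrete, the snippet below (a minimal illustration, not taken from the released repository) pairs each token of the example utterance with its concept label; the variable names and formatting are illustrative.

```python
# Concept tagging as token-aligned sequence labeling: one label per word.
# The utterance and concept labels reproduce the example discussed above.
words = "I want to go from Boston to Atlanta on Monday".split()
labels = ["null", "null", "null", "null", "null", "fromloc.city",
          "null", "toloc.city", "null", "depart_date.day_name"]

assert len(words) == len(labels)  # structured output aligned with the input
for word, label in zip(words, labels):
    print(f"{word:10s} -> {label}")
```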
Since the early nineties (Pieraccini and Levin, 1992), the task has been designed as a core component of the natural language understanding process in domain-limited conversational systems. Over the years, algorithms have been developed for generative, discriminative and, more recently, deep learning frameworks. In this paper, we provide a comprehensive review of the algorithms, their parameters and their respective state-of-the-art performances. We discuss the relative advantages of and differences amongst algorithms in terms of performance, statistical variability and optimal parameter settings. Last but not least, we have designed and provided a repository of the data, algorithms, implementations and parameter settings for two public datasets. The GitHub repository[1] is intended as a reference both for practitioners and for algorithm development researchers.

[1] www.github.com/fruttasecca/concept-tagging-with-neural-networks

With conversational AI gaining popularity, the area of NLU has become too vast to cover all relevant or even recent studies. Moreover, the objective of this paper is to benchmark an important subtask of NLU: concept tagging as used by advanced conversational systems. We benchmark generative, discriminative and deep learning approaches to NLU; this work is in line with (Raymond and Riccardi, 2007; Mesnil et al., 2015; Bechet and Raymond, 2018). Unlike the previously mentioned comparative performance analyses, in this paper we benchmark deep learning architectures and compare them to generative and traditional discriminative algorithms. To the best of our knowledge, this is the first comprehensive comparison of concept tagging algorithms at this scale on public datasets, with shared algorithm implementations (and their parameter settings).

2 Algorithms

Among the algorithms considered for benchmarking, we include a representative of the generative class, weighted finite state transducers (WFSTs); two discriminative algorithms, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs); and a set of basic neural network architectures and their combinations.

Weighted Finite State Transducers[2] cast concept tagging as a translation problem from words to concepts (Raymond and Riccardi, 2007), and usually consist of two components. The first component transduces words to concepts based on a score that can be either induced from data or manually designed; the second component is a stochastic conceptual language model, which re-scores concept sequences. The two components are composed to perform sequence-to-sequence translation and to infer the best sequence using the Viterbi algorithm.

[2] We use the OpenFST (http://www.openfst.org) and OpenGRM (http://www.opengrm.org) libraries.

Support Vector Machines (SVM) are used within the Yamcha tool (Kudo and Matsumoto, 2001), which performs sequence labeling using forward- and backward-moving classifiers. Automatic labels assigned to preceding tokens are used as dynamic features for the current token's label decision.

Conditional Random Fields (CRF)[3] (Lafferty et al., 2001) is a discriminative model based on a dependency graph G and a set of features. Each feature f_k has an associated weight λ_k. Features are generally hand-crafted and their weights are learned from the training data. Additionally, we experiment with word embeddings as additional features for CRFs (CRF+EMB).

[3] We use the CRFSUITE (Okazaki, 2007) implementation of CRFs in our experiments.

Recurrent Neural Networks (RNN). The first neural network architecture[4] we have considered is an Elman RNN (Elman, 1990; Übeyli and Übeyli, 2012). In an RNN, the hidden state depends on the current input and the previous hidden state; the output (label), in turn, depends on the new hidden state.

[4] All neural architectures are implemented within the PyTorch framework (https://pytorch.org).

Long Short-Term Memory (LSTM) RNNs (Hochreiter and Schmidhuber, 1997) try to tackle the vanishing gradient problem by introducing more complex mechanisms for information propagation and deletion, at the cost of a more complex model with more parameters to train, due to the system of gates it uses. The memory of the model is represented by the cell state and the hidden state, which also represents the output for the current token. We experimented with a simple LSTM; an LSTM which receives as input the word embedding concatenated with character embeddings obtained through a convolutional layer (Józefowicz et al., 2016) (LSTM-CHAR-REP); and an LSTM with pre-trained embeddings and dynamic embeddings learned from the training data (LSTM-2CH). In LSTM-2CH, two separate LSTM modules run in parallel and their outputs are concatenated for each word. Similar to the rest of the deep learning models, the output is then fed to a fully connected layer to map every token to the concept tag space.
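For reference, below is a minimal PyTorch sketch of the plain bidirectional LSTM tagger described above: word embeddings fed to a BiLSTM, followed by a fully connected layer mapping each token to the concept tag space. The dimensions and class names are illustrative assumptions, not the exact configuration used in the repository.

```python
import torch
import torch.nn as nn

class LstmTagger(nn.Module):
    """Word embeddings -> bidirectional LSTM -> per-token concept tag scores."""

    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden_size=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # hidden_size // 2 per direction, so the concatenated output is hidden_size
        self.lstm = nn.LSTM(emb_dim, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(embedded)      # (batch, seq_len, hidden_size)
        return self.fc(outputs)               # (batch, seq_len, num_tags)

# Toy usage: one 10-token sentence, a 1000-word vocabulary, 43 concept tags.
model = LstmTagger(vocab_size=1000, num_tags=43)
print(model(torch.randint(0, 1000, (1, 10))).shape)  # torch.Size([1, 10, 43])
```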
Gated Recurrent Units (GRU) (Cho et al., 2014) use a reset and an update gate, two vectors of weights that decide what information is deleted (or re-scaled) from the current hidden state and how it will contribute to the new hidden state, which is also the output for the current input. Compared to the LSTM model, this allows training fewer parameters, but introduces a constraint on memory, since the hidden state is also used as the output.

Convolutional Neural Networks (CONV) (Majumder et al., 2017; Kim, 2014) treat each sentence as a matrix of shape (number of words in the sentence, embedding size) and convolve it with kernels of different sizes that pass over the input sequence token by token, bigram by bigram and trigram by trigram. The result of the convolution is used as the starting hidden memory of a GRU RNN; the GRU RNN is then run on the embedded tokens and thus starts with information about the sequence at a global level.

FC-INIT is similar to CONV. The difference is in the pre-computation of the initial hidden state, which is done by fully connected layers operating on the whole sequence.

The ENCODER architecture (Cho et al., 2014) casts the problem as sequence-to-sequence translation and consists of two GRU RNNs. The encoder, the first GRU RNN, encodes the input sequence into a fixed vector (its hidden state). The decoder, another GRU RNN, uses the output of the encoder as its starting hidden state. At each step, the decoder receives the label predicted at the previous step as input, starting with a special token.

The ATTENTION architecture is similar to ENCODER, with the addition of an attention mechanism (Bahdanau et al., 2014) over the outputs of the encoder. This allows the network to focus on specific parts of the input sequence. The attention weights are computed with a single fully connected layer that receives as input the embedding of the current word concatenated with the last hidden state.

LSTM-CRF (Yao et al., 2014; Zheng et al., 2015) is an architecture where the LSTM provides class scores for each token, and the Viterbi algorithm decides on the labels of the sequence at a global level, using bigrams and transition probabilities that are trained with the rest of the parameters. We also experimented with a variant that considers character-level information (LSTM-CRF-CHAR-REP).
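To make the sequence-level decoding step of LSTM-CRF concrete, the sketch below runs Viterbi decoding over per-token class scores and a matrix of tag-transition scores (in the actual model these come from the LSTM and from learned parameters, respectively). It is a simplified, self-contained illustration under those assumptions, not the repository implementation.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   (seq_len, num_tags) per-token class scores from the LSTM.
    transitions: (num_tags, num_tags) scores for moving from tag i to tag j.
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0]          # best score of any path ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j] for every pair (i, j)
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow the backpointers from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    return list(reversed(path))

# Toy example: 5 tokens, 4 tags, random scores.
print(viterbi_decode(torch.randn(5, 4), torch.randn(4, 4)))
```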
3 Corpora

The evaluation of the algorithms is performed on two datasets. The Air Travel Information System (ATIS) dataset consists of sentences from users querying for information about flights, departure dates, arrivals, etc. The training set consists of 4,978 sentences, while 893 sentences constitute the test set. The average length of a sentence is around 11 tokens, and there are a total of 127 unique tags (with IOB prefixes). Moreover, the large majority of tokens missing an embedding are either numbers or airport, fare-basis, or aircraft codes. The training set has a total of 18 token types missing an embedding, and the test set has 9.

The second corpus (MOVIES)[5] was produced from the NL2SparQL (Chen et al., 2014) corpus by semi-automatically aligning SPARQL query values to utterance tokens. The dataset follows the split of the original corpus, having 3,338 sentences (with 1,728 unique tokens) in the training set and 1,084 sentences (with 1,039 unique tokens) in the test set. The average length of a sentence is 6.50 tokens and the OOV rate is 0.24. There are 43 concept tags in the dataset. Given the Google embeddings, once every number is mapped to a number class, we obtain 66 token types without an embedding in the training set and 26 in the test set.

[5] https://github.com/esrel/NL2SparQL4NLU
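Coverage statistics like those above (average sentence length, OOV rate, token types without a pre-trained embedding) are straightforward to recompute from the raw data; the sketch below assumes hypothetical tokenized sentence lists, an embedding vocabulary, and a type-level definition of OOV, so it only illustrates how such counts can be obtained.

```python
def corpus_stats(train_sents, test_sents, embedding_vocab):
    """train_sents / test_sents: lists of tokenized sentences (lists of strings).
    embedding_vocab: set of token types covered by the pre-trained embeddings."""
    train_types = {tok for sent in train_sents for tok in sent}
    test_types = {tok for sent in test_sents for tok in sent}
    avg_len = sum(len(s) for s in train_sents) / len(train_sents)
    oov_rate = len(test_types - train_types) / len(test_types)  # type-level OOV
    missing_train = len(train_types - embedding_vocab)
    missing_test = len(test_types - embedding_vocab)
    return avg_len, oov_rate, missing_train, missing_test

# Tiny illustrative call with made-up data (not the ATIS/MOVIES corpora).
print(corpus_stats([["show", "me", "films"]], [["show", "movies"]], {"show", "me"}))
```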
4 Performance Analysis

One of our first observations is that models such as WFST, SVM and CRF yield competitive results with simple setups and few hyperparameters to tune. Training our deep learning models and searching over their hyperparameters would have been infeasible without dedicated hardware, while it took a fraction of the effort for WFST, SVM and CRF. Moreover, adding word embeddings as features to the CRF allowed it to outperform most of the deep neural networks.

Table 1: F1-scores of the WFST, SVM and CRF (with and without embeddings) algorithms on the MOVIES and ATIS datasets.

Model   | Dataset | Parameters                                                                                                                                     | # Params | F1
WFST    | MOVIES  | order 4, Kneser-Ney (7907 states, 842178 arcs)                                                                                                 | -        | 82.96
WFST    | ATIS    | order 4, Kneser-Ney (4124 states, 76000 arcs)                                                                                                  | -        | 93.08
SVM     | MOVIES  | (4, 4) window of tokens, (-1, 0) of POS tag and prefix; postfix and lemma of current word; previous two labels                                 | 10,364   | 83.74
SVM     | ATIS    | (6, 4) window of tokens, (-1, 0) of prefix and postfix; previous two labels                                                                    | 16,361   | 92.91
CRF     | MOVIES  | (4, 4) window of tokens, (-1, 0) of POS tag and prefix; postfix and lemma of current word; previous+current and current+next word conjunctions; bigram model | 1,200K   | 83.80
CRF     | ATIS    | (6, 4) window of tokens, (-1, 0) of prefix; postfix of current word; previous+current word conjunction; bigram model                           | 2,201K   | 93.98
CRF+EMB | MOVIES  | all of the above + (4, 4) window of word embeddings + current token character embeddings                                                      | 1,390K   | 85.85
CRF+EMB | ATIS    | all of the above + (6, 4) window of word embeddings + current token character embeddings                                                      | 3,185K   | 94.00

We attribute this to two factors: (1) since these models, unlike neural networks, do not learn feature representations from data, they are simpler and faster to train; and, most importantly, (2) these models usually perform global optimization over the label sequence, while neural networks usually do not. Augmenting neural networks with a CRF is not expensive in terms of parameters: having a CRF component on top of an LSTM increases the number of parameters by up to the square of the tag-set size (about 2,500 for the MOVIES dataset), and provides the best performing model.

There seems to be no strong correlation between the number of parameters and the variance of a model's performance with respect to the random initialization of its parameters. This is surprising, given the intuition that more parameters could lead to a lower probability of being stuck in a local minimum. It may be that different initializations lead to different training times required to reach a good local minimum.

Table 2: Hyperparameters, parameter counts and F1-scores of the neural models. All models are bidirectional and have been trained with unfrozen Google embeddings, except for CONV and LSTM-2CH. Min, average and best F1-scores are obtained by training the same model with the same hyperparameters but different parameter initializations; averages are over 50 runs for MOVIES and 25 for ATIS. Hyperparameter search has been done randomly over ranges of values taken from published work. The number of parameters refers to the network parameters plus the embeddings, when those are unfrozen. Given a hidden layer size X reported in the hidden column, each component of a bidirectional architecture has a hidden layer size of X/2; similarly, each of the two LSTM components of the LSTM-2CH model has a hidden layer size of X/2, and each of their bidirectional components thus has a hidden layer size of X/4.

Model             | Dataset | hidden size | epochs | batch size | lr    | dropout | emb norm | # params | min F1 | avg F1 | best F1
RNN               | MOVIES  | 200         | 15     | 50         | 0.001 | 0.30    | 4        | 1,264K   | 81.00  | 82.55  | 83.96
RNN               | ATIS    | 400         | 10     | 50         | 0.001 | 0.25    | 2        | 580K     | 91.80  | 93.79  | 95.03
LSTM              | MOVIES  | 200         | 15     | 20         | 0.001 | 0.70    | 6        | 1,505K   | 82.67  | 83.76  | 84.57
LSTM              | ATIS    | 200         | 15     | 10         | 0.001 | 0.50    | 8        | 675K     | 87.82  | 94.53  | 95.36
LSTM-CHAR-REP     | MOVIES  | 400         | 20     | 20         | 0.001 | 0.70    | 4        | 2,085K   | 82.00  | 84.28  | 85.41
LSTM-CHAR-REP     | ATIS    | 400         | 15     | 10         | 0.001 | 0.50    | 6        | 1,272K   | 81.00  | 94.19  | 95.39
LSTM-2CH          | MOVIES  | 200         | 20     | 15         | 0.001 | 0.30    | 8        | 1,310K   | 81.22  | 82.68  | 83.76
LSTM-2CH          | ATIS    | 400         | 10     | 100        | 0.010 | 0.70    | 6        | 1,022K   | 93.10  | 94.61  | 95.38
GRU               | MOVIES  | 200         | 20     | 20         | 0.001 | 0.50    | 4        | 1,424K   | 76.56  | 84.29  | 85.47
GRU               | ATIS    | 100         | 15     | 10         | 0.005 | 0.50    | 10       | 446K     | 91.53  | 94.28  | 95.28
CONV              | MOVIES  | 200         | 20     | 20         | 0.001 | 0.50    | 4        | 2,646K   | 84.05  | 85.02  | 86.17
CONV              | ATIS    | 100         | 15     | 10         | 0.005 | 0.00    | 2        | 625K     | 91.51  | 94.22  | 95.38
FC-INIT           | MOVIES  | 100         | 30     | 20         | 0.001 | 0.30    | 4        | 2,805K   | 82.22  | 83.93  | 84.95
FC-INIT           | ATIS    | 400         | 15     | 50         | 0.010 | 0.25    | 4        | 7,144K   | 87.39  | 94.67  | 95.39
ENCODER           | MOVIES  | 200         | 30     | 20         | 0.001 | 0.70    | 4        | 1,559K   | 71.25  | 76.39  | 79.00
ENCODER           | ATIS    | 200         | 25     | 5          | 0.001 | 0.70    | 6        | 730K     | 70.01  | 78.16  | 80.85
ATTENTION         | MOVIES  | 200         | 15     | 20         | 0.001 | 0.30    | 4        | 1,712K   | 71.86  | 79.77  | 82.67
ATTENTION         | ATIS    | 200         | 25     | 5          | 0.001 | 0.25    | 10       | 894K     | 92.47  | 94.09  | 94.98
LSTM-CRF          | MOVIES  | 200         | 10     | 1          | 0.001 | 0.70    | 6        | 1,507K   | 84.75  | 86.11  | 87.47
LSTM-CRF          | ATIS    | 400         | 15     | 10         | 0.001 | 0.50    | 6        | 1,200K   | 94.39  | 94.72  | 95.01
LSTM-CRF-CHAR-REP | MOVIES  | 200         | 15     | 1          | 0.001 | 0.70    | 8        | 1,555K   | 85.07  | 86.08  | 87.05
LSTM-CRF-CHAR-REP | ATIS    | 200         | 20     | 5          | 0.001 | 0.50    | 4        | 740K     | 94.45  | 94.91  | 95.12

4.1 Statistical Significance Testing

The best performing algorithms in our experimental setting are LSTM-CRF and LSTM-CRF-CHAR-REP; however, they are not very far from the CRF+EMB and CRF algorithms. In order to compare the performances in terms of statistical significance, we perform Welch's unequal variances t-test (Welch, 1947), which, unlike the more popular Student's t-test, does not assume equal variances. The choice of test is motivated by the observation that neural architectures generally yield higher variances than, for instance, CRF.
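For reference, Welch's unequal-variances t-test is available off the shelf in SciPy; the sketch below uses made-up per-fold F1 scores (not the paper's results) purely to show the call.

```python
from scipy import stats

# Hypothetical per-fold F1 scores for two systems (illustrative only).
f1_crf_emb  = [85.1, 84.7, 86.0, 85.5, 84.9, 85.8, 85.2, 84.6, 85.9, 85.3]
f1_lstm_crf = [86.3, 84.9, 87.1, 85.0, 86.8, 85.4, 86.9, 84.8, 86.5, 86.0]

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(f1_lstm_crf, f1_crf_emb, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant at 0.05: {p_value < 0.05}")
```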
The performances are compared on 10-fold cross-validation outputs on the training set for both the ATIS and MOVIES datasets. Due to the higher variance of neural network architectures, a better way to test would be to perform many runs with different random initializations for each fold and take the average of these results; however, such a procedure is computationally very demanding.

Table 3: Results of statistical significance testing using Welch's t-test for the MOVIES and ATIS datasets. Algorithms on rows with statistically significant differences in performance (p < 0.05) in comparison to the algorithms on columns are marked with '*'.

ALGORITHMS         | CRF | CRF+EMB | LSTM-CRF | LSTM-CRF-CHAR-REP
MOVIES             |     |         |          |
CRF                |     |         |          |
CRF+EMB            | *   |         |          |
LSTM-CRF           | *   |         |          |
LSTM-CRF-CHAR-REP  | *   |         |          |
ATIS               |     |         |          |
CRF                |     |         |          |
CRF+EMB            |     |         |          |
LSTM-CRF           | *   |         |          |
LSTM-CRF-CHAR-REP  | *   | *       |          |

The results of the statistical significance testing are reported in Table 3. For the MOVIES dataset, all the compared models (CRF+EMB, LSTM-CRF, LSTM-CRF-CHAR-REP) significantly outperform the CRF model with p < 0.05. However, these models do not yield statistically significant differences among themselves. Specifically, using embeddings with CRF (i.e. CRF+EMB) produces a statistically significant improvement over plain CRF; using a CRF on top of an LSTM, even though it produces a better average F1 than CRF+EMB, does not yield a statistically significant gain, irrespective of the type of embeddings used.

For the ATIS dataset, on the other hand, the use of embeddings with CRF does not yield statistically significant differences with respect to plain CRF. The neural architectures (LSTM-CRF and LSTM-CRF-CHAR-REP), however, do produce statistically significant differences in performance in comparison to CRF. Moreover, unlike for the MOVIES dataset, the use of character embeddings in the LSTM-CRF architecture significantly outperforms the CRF+EMB model.

4.2 Error Analysis

Both the MOVIES and ATIS datasets have an imbalanced distribution of concept labels, which is known to affect the performance on minority classes. Consequently, we correlate the distribution of labels in the training set with the percentage of their mis-labelings in the test set (by any model). As expected, the mis-labeling chance is inversely correlated with the percentage of instances a label has in the training set (e.g. a label that amounts to less than 1% of a dataset usually has a mis-labeling chance greater than 10%). For both datasets, the Kendall rank correlation coefficients (Kendall, 1938) are approximately 0.6.
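The rank correlation used here can be computed directly with SciPy; the values below are hypothetical label frequencies and mis-labeling rates, used only to show the call (note that with a strictly inverse relation such as this one the coefficient comes out negative).

```python
from scipy import stats

# Hypothetical per-label share of the training set (%) and mis-labeling rate (%).
label_freq   = [0.4, 0.9, 1.5, 3.0, 6.2, 12.5, 24.0]
mislabel_pct = [18.0, 12.5, 11.0, 7.5, 4.0, 2.5, 1.0]

tau, p_value = stats.kendalltau(label_freq, mislabel_pct)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```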
best performing models (in terms of average F1 ) For the ATIS dataset, on the other hand, use are LSTM-CRF and LSTM-CRF-CHAR-REP. In of embeddings with CRF does not yield sta- general we may say that adding a sequence level tistically significant differences with respect to control to different type of NN architectures leads plain CRF. Neural architectures (LSTM-CRF and to very good model performances. Another im- LSTM-CRF-CHAR-REP), on the other hand, do portant observation is the variance of performance produce statistically significant difference in per- of NN models with respect to initialization pa- formance in comparison to CRF. Moreover, un- rameters. Consequently, we strongly believe that like for MOVIES dataset, the use of character em- this variability should be taken into consideration beddings in LSTM-CRF architecture significantly and reported (with the lowest and highest perfor- outperforms the CRF-EMB model. mances) to improve the reliability and replicability of the published results. References Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xi- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua aodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Bengio. 2014. Neural machine translation by 2015. Using recurrent neural networks for slot fill- jointly learning to align and translate. CoRR, ing in spoken language understanding. IEEE/ACM abs/1409.0473. Transactions on Audio, Speech, and Language Pro- Frederic Bechet and Christian Raymond. 2018. Is cessing, 23(3):530–539. ATIS too shallow to go deeper for benchmarking Naoaki Okazaki. 2007. Crfsuite: a fast implemen- spoken language understanding models? In Inter- tation of conditional random fields (crfs). URL speech. http://www. chokkan. org/software/crfsuite. Yun-Nung Chen, Dilek Hakkani-Tür, and Gokan Tur. Roberto Pieraccini and Esther Levin. 1992. Stochastic 2014. Deriving local relational surface forms from representation of semantic structure for speech un- dependency-based entity embeddings for unsuper- derstanding. Speech Communication, 11(2):283 – vised spoken language understanding. In Spoken 288. Eurospeech ’91. Language Technology Workshop (SLT), 2014 IEEE, pages 242–247. IEEE. Christian Raymond and Giuseppe Riccardi. 2007. Kyunghyun Cho, Bart van Merrienboer, Çaglar Generative and discriminative algorithms for spoken Gülçehre, Fethi Bougares, Holger Schwenk, and language understanding. In INTERSPEECH, pages Yoshua Bengio. 2014. Learning phrase representa- 1605–1608. ISCA. tions using RNN encoder-decoder for statistical ma- Elif Derya Übeyli and Mustafa Übeyli. 2012. Case chine translation. CoRR, abs/1406.1078. studies for applications of elman recurrent neural Jeffrey L. Elman. 1990. Finding structure in time. networks. COGNITIVE SCIENCE, 14(2):179–211. B. L. Welch. 1947. The generalization of ‘student’s’ Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long problem when several different population variances short-term memory. Neural Comput., 9(8):1735– are involved. Biometrika, 34(1-2):28–35. 1780, November. Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Yu, Xiaolong(Shiao-Long) Li, and Feng Gao. 2014. Shazeer, and Yonghui Wu. 2016. Exploring the lim- Recurrent conditional random field for language un- its of language modeling. CoRR, abs/1602.02410. derstanding. In ICASSP 2014. IEEE International Conference on Acoustics, Speech, and Signal Pro- M. G. Kendall. 1938. A new measure of rank correla- cessing (ICASSP), January. tion. 
Navonil Majumder, Soujanya Poria, Alexander Gelbukh, and Erik Cambria. 2017. Deep learning-based document modeling for personality detection from text. IEEE Intelligent Systems, 32(2):74–79, March.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Naoaki Okazaki. 2007. CRFsuite: a fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite.

Roberto Pieraccini and Esther Levin. 1992. Stochastic representation of semantic structure for speech understanding. Speech Communication, 11(2):283–288. Eurospeech '91.

Christian Raymond and Giuseppe Riccardi. 2007. Generative and discriminative algorithms for spoken language understanding. In INTERSPEECH, pages 1605–1608. ISCA.

Elif Derya Übeyli and Mustafa Übeyli. 2012. Case studies for applications of Elman recurrent neural networks.

B. L. Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1-2):28–35.

Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Yu, Xiaolong (Shiao-Long) Li, and Feng Gao. 2014. Recurrent conditional random field for language understanding. In ICASSP 2014, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), January.

Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15), pages 1529–1537, Washington, DC, USA. IEEE Computer Society.