<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta />
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
<p>correct result with the classifier’s highest ranked result and
get a measure of the quality of the classifier.</p>
<p>The testing process is performed as follows:
1. Run the test data through the classifier.
2. Collect the results.
3. Compare the results with the correct values.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-1">
        <title>Test Evaluation</title>
<p>We can compare the classifier results and the correct results
in a number of ways. One basic measure is accuracy.
Accuracy measures the percentage of test inputs that are
classified correctly by the classifier. More formally, accuracy is calculated
by summing all correct classifications and dividing this by
the number of inputs in the test data. Let c_j be the
correct class label for t_i and let N_test be the number of test inputs
in T = {t_1, ..., t_N_test}. Next consider the indicator function I
defined as:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>-</title>
      <p>I(t_i) = 1 if the classifier maps t_i to its correct class c_j, and 0 otherwise.
Now accuracy is calculated as:
accuracy = (1 / N_test) Σ_{i=1}^{N_test} I(t_i).</p>
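      <p>The accuracy calculation above can be sketched in a few lines of Python. This is a minimal illustration; the function and variable names are ours, not notation from this paper.</p>
      <preformat>
```python
def accuracy(predicted, correct):
    """Fraction of test inputs whose predicted class equals the correct class:
    accuracy = (1 / N_test) * sum of I(t_i) over the test set."""
    assert len(predicted) == len(correct)
    hits = sum(1 for p, c in zip(predicted, correct) if p == c)
    return hits / len(correct)

print(accuracy(["a", "b", "b"], ["a", "b", "c"]))  # 2 of 3 correct
```
      </preformat>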
      <p>Another way to use accuracy is to calculate the accuracy
for each class. Let N_j be the number of test inputs whose
correct class is c_j, let {t_1,c_j, ..., t_n,c_j} = T_j be the test inputs
whose correct class is c_j, and let I_j be the indicator function
as defined above restricted to the class c_j. Then the accuracy of the classifier for
each class c_j is:
acc_c_j = ( Σ_{t in T_j} I_j(t) ) / N_j</p>
      <p>In many cases the per-class accuracy will reveal that some
classes have higher accuracy than other classes.
This is important information when we are trying to
analyze error and improve classifiers (i.e., improve the accuracy
of the classifier). We can extract a little more from this by looking at
the closely related confusion matrix.</p>
      <sec id="sec-3-1">
        <title>The Confusion Matrix</title>
        <p>The idea of the confusion matrix is to represent the degree
to which the classifier is ’confused’ between the correct class and its
selected class. If the classifier is not confused on class c_j, then it maps
all T_j inputs to c_j. If the classifier is confused on c_j, then it will map
some of the inputs in T_j to other classes. It is very rare for a
classifier to map all the test inputs correctly, so we always
expect some degree of confusion.</p>
        <p>To represent this confusion, first consider the inputs
T_ji that are labelled c_j but that the classifier maps to c_i. Next redefine
the indicator function to be more specific:</p>
        <p>I_ji(t) = 1 if the classifier maps t, whose correct label is c_j, to class c_i, and 0 otherwise.</p>
        <p>With this background, the confusion matrix contains the
sums of test inputs that the classifier placed in one class but that were
actually labelled as another class, as well as the correctly
classified inputs as in the per-class accuracy. The matrix
looks like this for a classifier defined on a domain of K
classes:</p>
        <p>CM =
         c_1    c_2   ...  c_K
  c_1 ( N_11   N_12  ...  N_1K )
  c_2 ( N_21   N_22  ...  N_2K )
   .  (  .      .     .    .   )
  c_K ( N_K1   N_K2  ...  N_KK )</p>
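        <p>The construction of the confusion matrix, and the per-class accuracy read off its diagonal, can be sketched as follows. The function names are ours, introduced only for illustration.</p>
        <preformat>
```python
def confusion_matrix(correct, predicted, classes):
    """cm[i][j]: number of test inputs whose correct class is classes[i]
    but which the classifier mapped to classes[j]."""
    index = {c: k for k, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for c, p in zip(correct, predicted):
        cm[index[c]][index[p]] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal entry divided by the row total: acc_c_j for each class."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

cm = confusion_matrix(["A", "A", "B"], ["A", "B", "B"], ["A", "B"])
print(cm)                     # [[1, 1], [0, 1]]
print(per_class_accuracy(cm)) # [0.5, 1.0]
```
        </preformat>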
      </sec>
      <sec id="sec-3-2">
        <title>Example</title>
        <p>The concepts above can be illustrated by considering a small
classifier with 10 test inputs labelled with 3 classes. Table 1
shows the data used to test a simple classifier and the raw
results of the test. In this case accuracy measures the ratio
of correct classifications to the number of
test inputs. In this example</p>
        <p>accuracy = 6/10 = 60%.
Per-class accuracy measures the ratio of correct
classifications per class. This gives one value for each class,
or:
accuracy_A = 2/2 = 1
accuracy_B = 2/4 = 0.5
accuracy_C = 2/4 = 0.5</p>
        <p>From this we can see that, though the average accuracy is
60%, the accuracy associated with class A is perfect. Further
refinement can be achieved by looking at the unnormalized
confusion matrix for this test data and classifier. The confusion
matrix is:</p>
        <p>CM = [a 3 x 3 matrix with rows and columns labelled A, B, C]</p>
        <p>The confusion matrix reveals even more details
regarding the classifier. For example, the confusion matrix tells
us that if, given some input, the classifier emits class A, then we can’t
tell whether the actual class is A, B, or C. We can estimate
the likelihood of each class from the confusion matrix. The
likelihood of emitting class A is 5/10 = 0.5. The
likelihood of emitting class A if the actual class is A is 1, but since the classifier
confuses other classes with A frequently, we can’t tell if
the actual class is A. This fact is very hard to see without the
confusion matrix.</p>
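        <p>The numbers quoted in this example can be checked with a short script. The label assignment below is hypothetical (Table 1’s rows are not reproduced in this text); it is simply one assignment consistent with the stated accuracies and the 5/10 likelihood of emitting A.</p>
        <preformat>
```python
# Hypothetical test data consistent with the quoted results; Table 1 itself
# is not reproduced here, so this assignment is only an illustration.
correct   = ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
predicted = ["A", "A", "B", "B", "A", "A", "C", "C", "A", "B"]

overall = sum(p == c for p, c in zip(predicted, correct)) / len(correct)
print(overall)  # 0.6

for cls in ("A", "B", "C"):
    rows = [k for k, c in enumerate(correct) if c == cls]
    print(cls, sum(predicted[k] == cls for k in rows) / len(rows))

# The classifier emits A for 5 of the 10 inputs, matching the 5/10 likelihood.
print(predicted.count("A"))  # 5
```
        </preformat>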
        <p>The confusion matrix gives us a snapshot of the classifier
with respect to the classes it thinks are relevant to the input
data. Later we describe how to use the confusion matrix to
determine which is the most likely class for unlabeled inputs
when multiple classifiers are used. Before exploring
multiple classifiers, this paper discusses the use of intent as a class
for natural language chat inputs.</p>
        <sec id="sec-3-2-1">
          <title>The Idea of ’Intent’ in Human Discourse</title>
          <p>
            It is generally accepted that human discourse takes place on
(at least) two different levels: one is the level of symbols and
their accepted arrangements. This is the domain of syntax.
The other is the level of ideas and concepts. This is generally
known as semantics. These are traditional areas of research
for both the linguist and the Natural Language Processing
(NLP) researcher. For an excellent survey of these topics and
classification methods within their scope, see
            <xref ref-type="bibr" rid="ref4">(Jurafsky and
Martin 2009)</xref>
            .
          </p>
          <p>Another important area of research in NLP is the idea of
intent in discourse, specifically in the domain of chat
discourse. In this paper, intent is defined as an interpretation
of a statement or question that allows one to formulate the
’best’ response to the statement. What makes intent
significant to NLP (specifically chat) applications is that, if a
computer program can determine intent, then the program can
derive a meaningful response and thus can conduct a
conversation with a person in a chat context.</p>
          <p>For example, in an airline domain, where the system
should answer inquiries regarding flying, some examples of
intents might be:</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>-</title>
      <p>Number of Bags Accepted
Max Age of Lap Child
Procedure for Changing a Ticket</p>
      <p>Each of these is an example of an intent. That is, one
can capture the inquiries that are asking for information on
these intents (i.e., ”I need to change my flight.”, ”How do I
change my ticket?”, etc.) and then formulate a response that
describes the needed outcome.</p>
      <p>But how can one create a classifier system that will
identify intent in the way described in the example above?</p>
      <sec id="sec-4-1">
        <title>Building Intent Classifiers</title>
        <p>
          It turns out that there are many ways to train a system to
identify intent. Information on some of these methods can
be found in
          <xref ref-type="bibr" rid="ref1">(Chakrabarti 2014)</xref>
          . In order to build an intent
classifier one should first recognize that intent is not a unique
feature. In other words, the first step in developing these
classifiers is to determine what is meant by ’intent’ in the
domain being studied. One easy way to identify intent is
to let people classify a set of chats into classes with similar
answers. The chats become the training data and each class
becomes an intent. Since different people will have different
ideas about how to classify chats, the class of a chat will
often be determined by a vote of multiple people.
        </p>
        <sec id="sec-4-1-1">
          <title>Training</title>
          <p>Once the intent classes have been identified and the training
data has been associated to intents, one can train a classifier
to recognize the best intent to associate to new chats. As
with the association, there are many methods for training a
classifier.</p>
          <p>
            One method is to directly associate a text string, a part or
all of a chat, with the intent. Then any new chat that
contains that string will be associated with the same intent (the
assumption is that the same response can be given to both
the training chats and the new chat). Another similar
technique uses regular expressions to identify patterns found in
chats associated with a single intent. This paper identifies
classifiers trained in this way as symbolic classifiers. Then
new chats that include these patterns will be
associated with the same intent. This is the method used by the
AliceBot system
            <xref ref-type="bibr" rid="ref10">(Wallace 2004)</xref>
            and similar systems. Other,
more complex classifiers can be designed around this
relatively simple manual model.
          </p>
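          <p>A symbolic classifier of this kind can be sketched in a few lines. The patterns and intent names below are hypothetical illustrations, not taken from AliceBot or any deployed system.</p>
          <preformat>
```python
import re

# Ordered list of (pattern, intent) rules; the first match wins.
PATTERNS = [
    (re.compile(r"\bchange\b.*\b(ticket|flight)\b", re.IGNORECASE), "ChangeTicket"),
    (re.compile(r"\bhow many bags\b", re.IGNORECASE), "BagAllowance"),
]

def match_intent(chat):
    """Return the intent of the first rule whose pattern matches, else None."""
    for pattern, intent in PATTERNS:
        if pattern.search(chat):
            return intent
    return None

print(match_intent("How do I change my ticket?"))  # ChangeTicket
```
          </preformat>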
          <p>
            Another method uses statistical machine learning to
generate a classifier from the training data. In this case, the
researcher selects a learning algorithm and then treats the
training data as a labeled data set. The training data is
often preprocessed to produce more features and to reduce
noise in the set. Then a classifier is ’trained’ on the training
data. Finally new chats are run through the system to let the
classifier evaluate their intents. Intents can then be used in
other parts of the system to generate appropriate responses.
This paper identifies classifiers trained in this way as
statistical classifiers. There are many references both for
machine learning methods in NLP
            <xref ref-type="bibr" rid="ref4 ref7">(Jurafsky and Martin 2009;
Manning and Schütze 1999)</xref>
            and for intent learning
            <xref ref-type="bibr" rid="ref6">(Li
2010)</xref>
            .
          </p>
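          <p>A minimal statistical intent classifier in this spirit is a bag-of-words Naive Bayes trained on labeled chats. The sketch below uses only the standard library; the training chats and intent labels are hypothetical.</p>
          <preformat>
```python
import math
from collections import Counter, defaultdict

def train(chats):
    """chats: list of (text, intent). Returns word/class counts and vocabulary."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, intent in chats:
        class_counts[intent] += 1
        for w in text.lower().split():
            word_counts[intent][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Pick the intent with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for intent, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[intent].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[intent][w] + 1) / denom)
        if score > best_score:
            best, best_score = intent, score
    return best

model = train([("change my ticket", "ChangeTicket"),
               ("how many bags can i check", "BagAllowance")])
print(classify(model, "i want to change my flight ticket"))  # ChangeTicket
```
          </preformat>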
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Does Intent Classification Work?</title>
        <p>The two methods described above produce classifiers that
work reasonably well when implemented in a limited
domain with not too many intents. We have conducted
experiments to compare different types of intent classifiers. These
experiments are not the topic of this paper, but the result is
that, in general, all of these methods perform about equally.</p>
        <p>Classifier performance can be measured in a number of
ways. Our experiments used a measure of accuracy (as
defined earlier) and measured the accuracy of classifiers by
classifying a set of labelled test data that were not part of
the training data.</p>
        <p>Many classifiers created from very different methods
have comparable accuracy. Furthermore, each of these
classifiers can be improved over time by increasing the amount
of (labeled) training data and retraining the classifier. Table
2 shows the result of one such evaluation of classifier
accuracy. In this case, the two classifiers do not recognize intent
particularly well, but perhaps they are not tuned and thus are
very quick to train. In any case intent classification is
possible, though we will always expect to see some error in the
evaluation process.</p>
        <p>Looking at Table 2, the classifiers compared in this case
produce almost equivalent accuracy and so in some sense are
interchangeable. But this view of intent classification misses
some important information which can be seen better in
Table 3. In this table, we see the accuracy of the same
classifiers compared based on how much their accuracies overlap.
In the case of these two classifiers, on 60% of the test data,
the intents returned by each of the classifiers are identical
and correct. In 25% of the test data both classifiers return the
incorrect intent.</p>
        <p>These two views of classifier test results raise a very
interesting point. Table 2 shows us that using one or the other of
these classifiers gives us approximately the same ability to
recognize intent. But Table 3 suggests that, if we can select
the classifier which gives a correct answer when available,
we can increase the accuracy of our combined classifier to
75% in the optimal case. We next discuss methods to help
determine if it is possible to improve the performance of a
system by combining multiple classifiers.</p>
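        <p>The 75% optimal figure is the "oracle" upper bound: a combined system fails only when every member classifier is wrong. Using the overlap figures quoted above (60% both correct, 25% both incorrect, the remainder with exactly one classifier correct), the bound is:</p>
        <preformat>
```python
# Oracle upper bound for two combined classifiers: the combination can be
# correct whenever at least one classifier is correct.
both_correct = 0.60
both_wrong = 0.25
at_least_one_correct = 1.0 - both_wrong
print(at_least_one_correct)  # 0.75
```
        </preformat>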
      </sec>
      <sec id="sec-4-3">
        <title>Combining Classifiers</title>
        <p>The remainder of this paper discusses some of the
possibilities and considerations of combining multiple intent
classifiers to produce better results than using a single classifier.
When one attempts to combine the results of classifiers, a
number of issues come up right away. This paper will
address the two following issues, the answers to which can get
us all the way to a functional classifier combiner.</p>
        <p>
          One major issue with combining classifiers is deciding on
the domain and training data of each classifier. In the
example above, each of the classifiers is trained on the same
training data and the same set of classes. We call these
classifiers with overlapping domains. This paper will present an
analysis of a method to compare two classifiers with overlapping
domains
          <xref ref-type="bibr" rid="ref5">(Kuncheva 2004)</xref>
          .
        </p>
        <p>Another important way to train classifiers is to train each
classifier on different domains. The combination of
classifiers trained on different domains means that each classifier
knows different sets of intents. This allows the creation of
more complete systems of intent classification by combining
smaller domains of knowledge.</p>
        <p>Another important consideration is to recognize that
classifiers trained using different methods will typically return
very different types of score information. Scores are
important in classifiers because the score provides a way to rank
the classifier’s results so that a ’winning’ class can be
determined.</p>
        <p>For example, if a classifier that is trained on K classes
receives an input, the classifier will typically return a set of
classes with a score for each class. The score is typically
a floating point number, allowing the classifier to order the
results from most likely to least likely. Some classifiers
generate a score that is very close to the posterior probability
of the class, but this is not true in general. In regular
expression-based text classifiers, for example, the score is
often some multiple of the number of characters or tokens
matched for the candidate class.</p>
        <p>Given the variety of different scoring methods possible,
how can one hope to compare results of different classifiers
to provide the hoped-for improvement in intent matching?</p>
        <sec id="sec-4-3-1">
          <title>-</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Overlapped Classifiers with Naive Bayes Combination</title>
          <p>
            In order to compare classifiers in a general way, we require
a means to either normalize the scores of each classifier so
that they can be compared, or to explore a way to ignore the
scores but still rank the classifier results. Naive Bayes
Combination is a technique to achieve the latter. The principles
and theory behind Naive Bayes classification can be found
in
            <xref ref-type="bibr" rid="ref2">(Duda, Hart, and Stork 2012)</xref>
            .
          </p>
          <p>Consider a system of L classifiers {Φ_1, Φ_2, ..., Φ_L} = Φ
trained on the same training data and the same set of K
classes. Given an input x, each Φ_i will output a class in
{c_1, c_2, ..., c_K} = C. Let
Φ(x) = [Φ_1(x), Φ_2(x), ..., Φ_L(x)]
be the vector of all the classes returned by the system given
input x. For example, for a given x, Φ(x) might return the
vector
Φ(x_example) = [c_2, c_4, c_1, c_4, c_4, c_1].
What we want is to discover an expression for the most
probable class given all the results in Φ(x) for any x. To do
this, we first use the Naive Bayes assumption that the entries
in Φ(x) are independent given any particular class
c_i. This lets us calculate the joint probability of any group
of results as the product of the individual
conditional probabilities. This is the conditional independence
assumption of Naive Bayes.</p>
          <p>Let P(Φ(x) | c_i) be the conditional probability that Φ(x) is
returned given that the correct class is c_i. Then the Naive
Bayes assumption says:
P(Φ(x) | c_i) = Π_{j=1}^{L} P(Φ_j(x) | c_i).
The posterior conditional probability of class c_i given the
classifier results is then:
P(c_i | Φ(x)) = P(c_i) P(Φ(x) | c_i) / P(Φ(x)).</p>
          <p>Using the conditional independence assumption described
above, we can now express the support for each class as:
μ_i(x) ∝ P(c_i) P(Φ(x) | c_i) = P(c_i) Π_{j=1}^{L} P(Φ_j(x) | c_i).
If we calculated the confusion matrix for each of the L
classifiers during the testing phase, then we have everything
we need to evaluate this expression for each μ_i(x).</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>Using The Confusion Matrix to Select a Class</title>
          <p>In a previous section, the confusion matrix CM_k for
classifier Φ_k was defined to be the matrix which sums the number
of test data which were assigned class c_j when the correct
class was c_i, for 1 ≤ i ≤ K, 1 ≤ j ≤ K. This generates a
K × K matrix of integers. Then CM_k(i, j) is the cell
containing the number of test data where Φ_k generated c_j but
c_i was the correct class. The diagonal of CM_k holds all the
test results where test data were correctly classified.</p>
          <p>Using CM_k we can estimate
P(Φ_k(x) | c_i) ≈ CM_k(i, j) (1 / N_i),
where j is the index of the class output Φ_k(x) and
N_i is the total number of test inputs whose correct
classification is c_i. We can also estimate P(c_i) ≈ N_i / N,
where N is the total number of test inputs.</p>
          <p>To arrive at a class from the L classifiers, we now only
need to calculate
μ_i(x) ∝ P(c_i) Π_{k=1}^{L} P(Φ_k(x) | c_i)
       ≈ (N_i / N) Π_{k=1}^{L} CM_k(i, j_k) (1 / N_i)
       = (1 / (N N_i^(L-1))) Π_{k=1}^{L} CM_k(i, j_k)
for each i, 1 ≤ i ≤ K, where j_k is the index of the class
returned by Φ_k(x). The i with the highest
proportional value will determine the class.</p>
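          <p>This combination rule can be sketched directly. The first matrix below reuses the numbers from the worked example that follows; the second matrix and the classifier outputs are hypothetical stand-ins introduced only to make the sketch runnable.</p>
          <preformat>
```python
def nb_combine(cms, outputs, class_totals):
    """Score each candidate class i by mu_i = (N_i / N) * prod_k CM_k[i][j_k] / N_i,
    where j_k is the class index output by classifier k."""
    n = sum(class_totals)
    scores = []
    for i, n_i in enumerate(class_totals):
        mu = n_i / n
        for cm, j in zip(cms, outputs):
            mu *= cm[i][j] / n_i
        scores.append(mu)
    return scores.index(max(scores)), scores

cm1 = [[5, 2, 1], [2, 3, 0], [0, 3, 4]]  # confusion matrix from the example below
cm2 = [[4, 3, 1], [1, 4, 0], [1, 2, 4]]  # hypothetical second confusion matrix
# Classifier 1 outputs c_1 (index 0), classifier 2 outputs c_2 (index 1).
best, scores = nb_combine([cm1, cm2], [0, 1], class_totals=[8, 5, 7])
print(best, scores)
```
          </preformat>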
        </sec>
        <sec id="sec-4-3-4">
          <title>Example: Naive Bayesian Estimation</title>
          <p>Suppose that we have two classifiers trained over the same
data and tested on 20 test inputs. There are three classes,
c_1, c_2, c_3 (K = 3), and the numbers of test questions that are
labeled with each class are N_1 = 8, N_2 = 5, N_3 = 7
respectively. The two classifiers Φ_1, Φ_2 have respective confusion
matrices:</p>
          <p>CM_1 =
( 5 2 1 )
( 2 3 0 )
( 0 3 4 )
CM_2 = ( ... )</p>
          <p>
Given an input x, suppose that Φ_1(x) = c_1 and Φ_2(x) = c_2.
We can use the technique described above to decide
which class is correct.
One problem with the method described above is that zero
entries in the confusion matrix can cause values to go to
zero. Since for large values of K the confusion matrix is
sparse (most of the entries are zeros), this technique is not
very robust by itself. There are a number of ways to
introduce smoothing into the system
            <xref ref-type="bibr" rid="ref9">(Titterington et al. 1981)</xref>
            .
We will look at one simple method for smoothing zeros by
including an additive element of the inverse of the number
of classes K
            <xref ref-type="bibr" rid="ref5">(Kuncheva 2004)</xref>
            .
          </p>
          <p>To accomplish this we rewrite the Naive Bayes
combination as:</p>
          <p>P(Φ(x) | c_i) ∝ Π_{k=1}^{L} ( CM_k(i, j_k) + 1/K ).
Note that two things happened here. One is that the zero
entry now has some weight. With sparse confusion matrices
this prevents all the values from disappearing from the Naive
Bayes test.</p>
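          <p>The smoothed rule is a one-line change to the combination sketch: add 1/K to each confusion-matrix entry before taking the product. As before, the second matrix is a hypothetical stand-in; only the first reuses numbers from the example above.</p>
          <preformat>
```python
K = 3  # number of classes

def smoothed_scores(cms, outputs, class_totals):
    """Like the unsmoothed combination, but each CM_k(i, j_k) becomes
    CM_k(i, j_k) + 1/K, so a single zero entry no longer zeroes a class."""
    n = sum(class_totals)
    scores = []
    for i, n_i in enumerate(class_totals):
        mu = n_i / n
        for cm, j in zip(cms, outputs):
            mu *= (cm[i][j] + 1.0 / K) / n_i
        scores.append(mu)
    return scores

cm1 = [[5, 2, 1], [2, 3, 0], [0, 3, 4]]  # from the example above
cm2 = [[4, 3, 1], [1, 4, 0], [1, 2, 4]]  # hypothetical second confusion matrix
scores = smoothed_scores([cm1, cm2], [0, 1], class_totals=[8, 5, 7])
print(all(s > 0 for s in scores))  # True: no score collapses to zero
```
          </preformat>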
          <p>
            The other notable result from this smoothing method is
that the answer changed to class c1! This is a common
consequence of smoothing. The smoothing method used here
is Uniform Smoothing where we borrow some probability
mass from all the nonzero masses and redistribute them to
remove zeros. More complex smoothing methods can be
devised to have less direct effect on the outcome. For examples
of more complex smoothing methods in probabilistic natural
language models see
            <xref ref-type="bibr" rid="ref3">(Dumoulin 2012)</xref>
            .
          </p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Conclusion</title>
        <p>This paper has described a theoretical framework for
generalized intent classifiers and a method for combining them.
Currently Next IT researchers are working on building a
framework for training and running a ’Panel of Experts’
utilizing the mathematical framework outlined here to compare
classifier results. The goal is to compare classifier outcomes
with each other and with human judgement (yet another kind
of classifier). These results have yet to be assembled but we
expect to study the panel and use it to improve
understanding of intent classifiers and their combination.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Acknowledgments</title>
        <p>The author would like to thank Ian Beaver and Cynthia Wu
for their original investigation of this topic and their work in
investigating effective machine learning methods for intent
classification. Thanks also to Mark Zartler and Tanya Miller
for their work on advancing our understanding of symbolic
classifiers. Thanks also to Next IT Corp. for its sponsorship
of this fascinating research area.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Artificial Conversations for Chatter Bots Using Knowledge Representation, Learning, and Pragmatics</article-title>
          . Ph.D. Dissertation, University of New Mexico.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Duda</surname>
            ,
            <given-names>R. O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Stork</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Pattern classification</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dumoulin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Smoothing of ngram language models of human chats</article-title>
          .
          <source>In Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS)</source>
          ,
          <year>2012</year>
          Joint 6th International Conference on,
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          <year>2009</year>
          . Speech &amp; Language Processing 2nd Ed. Prentice Hall.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L. I.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Combining pattern classifiers: methods and algorithms</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Understanding the semantic structure of noun phrase queries</article-title>
          .
          <source>In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>1337</fpage>
          -
          <lpage>1345</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Foundations of statistical natural language processing</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Ripley</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Pattern recognition and neural networks</article-title>
          . Cambridge university press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Titterington</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Spiegelhalter</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Skene</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Habbema</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gelpke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>Comparison of discrimination techniques applied to a complex data set of head injured patients</article-title>
          .
          <source>Journal of the Royal Statistical Society, Series A (General)</source>
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>From eliza to alice</article-title>
          . Article available at http://www.alicebot.org/articles/wallace/eliza.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>