<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta />
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
<p>correct result with the classifier’s highest ranked result and
get a measure of the quality of the classifier.</p>
<p>The testing process is performed as follows:
1. Run the test data through the classifier.
2. Collect the results.
3. Compare the results with the correct values.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-1">
        <title>Test Evaluation</title>
<p>We can compare the classifier results and the correct results
in a number of ways. One basic measure is accuracy.
Accuracy measures the percentage of test inputs that are
classified correctly by the classifier. More formally, accuracy is calculated
by summing all correct classifications and dividing this by
the number of inputs in the test data. Let c_j be the
correct class label for t_i and let N_test be the number of test inputs
in T = {t_1, ..., t_N_test}. Next consider the indicator function I
defined as:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>-</title>
      <p>I(t_i) = 1 if the classifier maps t_i to its correct class c_j, and 0 otherwise.
Now accuracy is calculated as:
accuracy = (1 / N_test) Σ_{i=1}^{N_test} I(t_i).</p>
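      <p>The accuracy calculation above can be sketched in a few lines of Python. This is a minimal illustration; the function and variable names are ours, not notation from this paper.</p>
      <preformat>
```python
def accuracy(predicted, correct):
    """Fraction of test inputs whose predicted class equals the correct class:
    accuracy = (1 / N_test) * sum of I(t_i) over the test set."""
    assert len(predicted) == len(correct)
    hits = sum(1 for p, c in zip(predicted, correct) if p == c)
    return hits / len(correct)

print(accuracy(["a", "b", "b"], ["a", "b", "c"]))  # 2 of 3 correct
```
      </preformat>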
      <p>Another way to use accuracy is to calculate the accuracy
for each class. Let N_j be the number of test inputs whose
correct class is c_j, let {t_1,c_j, ..., t_n,c_j} = T_j be the test inputs
whose correct class is c_j, and let I_j be the indicator function
as defined above restricted to the class c_j. Then the accuracy of the classifier for
each class c_j is:
acc_c_j = ( Σ_{t in T_j} I_j(t) ) / N_j</p>
      <p>In many cases the per-class accuracy will reveal that some
classes have higher accuracy than other classes.
This is important information when we are trying to
analyze error and improve classifiers (i.e., improve the accuracy
of the classifier). We can extract a little more from this by looking at
the closely related confusion matrix.</p>
      <sec id="sec-3-1">
        <title>The Confusion Matrix</title>
        <p>The idea of the confusion matrix is to represent the degree
to which the classifier is ’confused’ between the correct class and its
selected class. If the classifier is not confused on class c_j, then it maps
all T_j inputs to c_j. If the classifier is confused on c_j, then it will map
some of the inputs in T_j to other classes. It is very rare for a
classifier to map all the test inputs correctly, so we always
expect some degree of confusion.</p>
        <p>To represent this confusion, first consider the inputs
T_ji that are labelled c_j but that the classifier maps to c_i. Next redefine
the indicator function to be more specific:</p>
        <p>I_ji(t) = 1 if the classifier maps t, whose correct label is c_j, to class c_i, and 0 otherwise.</p>
        <p>With this background, the confusion matrix contains the
sums of test inputs that the classifier placed in one class but that were
actually labelled as another class, as well as the correctly
classified inputs as in the per-class accuracy. The matrix
looks like this for a classifier defined on a domain of K
classes:</p>
        <p>CM =
         c_1    c_2   ...  c_K
  c_1 ( N_11   N_12  ...  N_1K )
  c_2 ( N_21   N_22  ...  N_2K )
   .  (  .      .     .    .   )
  c_K ( N_K1   N_K2  ...  N_KK )</p>
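        <p>The construction of the confusion matrix, and the per-class accuracy read off its diagonal, can be sketched as follows. The function names are ours, introduced only for illustration.</p>
        <preformat>
```python
def confusion_matrix(correct, predicted, classes):
    """cm[i][j]: number of test inputs whose correct class is classes[i]
    but which the classifier mapped to classes[j]."""
    index = {c: k for k, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for c, p in zip(correct, predicted):
        cm[index[c]][index[p]] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal entry divided by the row total: acc_c_j for each class."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

cm = confusion_matrix(["A", "A", "B"], ["A", "B", "B"], ["A", "B"])
print(cm)                     # [[1, 1], [0, 1]]
print(per_class_accuracy(cm)) # [0.5, 1.0]
```
        </preformat>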
      </sec>
      <sec id="sec-3-2">
        <title>Example</title>
        <p>The concepts above can be illustrated by considering a small
classifier with 10 test inputs labelled with 3 classes. Table 1
shows the data used to test a simple classifier and the raw
results of the test. In this case accuracy measures the ratio
of correct classifications to the number of
test inputs. In this example</p>
        <p>accuracy = 6/10 = 60%.
Per-class accuracy measures the ratio of correct
classifications per class. This gives one value for each class,
or:
accuracy_A = 2/2 = 1
accuracy_B = 2/4 = 0.5
accuracy_C = 2/4 = 0.5</p>
        <p>From this we can see that, though the average accuracy is
60%, the accuracy associated with class A is perfect. Further
refinement can be achieved by looking at the unnormalized
confusion matrix for this test data and classifier. The confusion
matrix is:</p>
        <p>CM = [a 3 x 3 matrix with rows and columns labelled A, B, C]</p>
        <p>The confusion matrix reveals even more details
regarding the classifier. For example, the confusion matrix tells
us that if, given some input, the classifier emits class A, then we can’t
tell whether the actual class is A, B, or C. We can estimate
the likelihood of each class from the confusion matrix. The
likelihood of emitting class A is 5/10 = 0.5. The
likelihood of emitting class A if the actual class is A is 1, but since the classifier
confuses other classes with A frequently, we can’t tell if
the actual class is A. This fact is very hard to see without the
confusion matrix.</p>
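        <p>The numbers quoted in this example can be checked with a short script. The label assignment below is hypothetical (Table 1’s rows are not reproduced in this text); it is simply one assignment consistent with the stated accuracies and the 5/10 likelihood of emitting A.</p>
        <preformat>
```python
# Hypothetical test data consistent with the quoted results; Table 1 itself
# is not reproduced here, so this assignment is only an illustration.
correct   = ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
predicted = ["A", "A", "B", "B", "A", "A", "C", "C", "A", "B"]

overall = sum(p == c for p, c in zip(predicted, correct)) / len(correct)
print(overall)  # 0.6

for cls in ("A", "B", "C"):
    rows = [k for k, c in enumerate(correct) if c == cls]
    print(cls, sum(predicted[k] == cls for k in rows) / len(rows))

# The classifier emits A for 5 of the 10 inputs, matching the 5/10 likelihood.
print(predicted.count("A"))  # 5
```
        </preformat>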
        <p>The confusion matrix gives us a snapshot of the classifier
with respect to the classes it thinks are relevant to the input
data. Later we describe how to use the confusion matrix to
determine which is the most likely class for unlabeled inputs
when multiple classifiers are used. Before exploring
multiple classifiers, this paper discusses the use of intent as a class
for natural language chat inputs.</p>
        <sec id="sec-3-2-1">
          <title>The Idea of ’Intent’ in Human Discourse</title>
          <p>
            It is generally accepted that human discourse takes place on
(at least) two different levels: one is the level of symbols and
their accepted arrangements. This is the domain of syntax.
The other is the level of ideas and concepts. This is generally
known as semantics. These are traditional areas of research
for both the linguist and the Natural Language Processing
(NLP) researcher. For an excellent survey of these topics and
classification methods within their scope, see
            <xref ref-type="bibr" rid="ref4">(Jurafsky and
Martin 2009)</xref>
            .
          </p>
          <p>Another important area of research in NLP is the idea of
intent in discourse, specifically in the domain of chat
discourse. In this paper, intent is defined as an interpretation
of a statement or question that allows one to formulate the
’best’ response to the statement. What makes intent
significant to NLP (specifically chat) applications is that, if a
computer program can determine intent, then the program can
derive a meaningful response and thus can conduct a
conversation with a person in a chat context.</p>
          <p>For example, in an airline domain, where the system
should answer inquiries regarding flying, some examples of
intents might be:</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>-</title>
      <p>Number of Bags Accepted
Max Age of Lap Child
Procedure for Changing a Ticket</p>
      <p>Each of these is an example of an intent. That is, one
can capture the inquiries that are asking for information on
these intents (i.e., ”I need to change my flight.”, ”How do I
change my ticket?”, etc.) and then formulate a response that
describes the needed outcome.</p>
      <p>But how can one create a classifier system that will
identify intent in the way described in the example above?</p>
      <sec id="sec-4-1">
        <title>Building Intent Classifiers</title>
        <p>
          It turns out that there are many ways to train a system to
identify intent. Information on some of these methods can
be found in
          <xref ref-type="bibr" rid="ref1">(Chakrabarti 2014)</xref>
          . In order to build an intent
classifier one should first recognize that intent is not a unique
feature. In other words, the first step in developing these
classifiers is to determine what is meant by ’intent’ in the
domain being studied. One easy way to identify intent is
to let people classify a set of chats into classes with similar
answers. The chats become the training data and each class
becomes an intent. Since different people will have different
ideas about how to classify chats, the class of a chat will
often be determined by a vote of multiple people.
        </p>
        <sec id="sec-4-1-1">
          <title>Training</title>
          <p>Once the intent classes have been identified and the training
data has been associated to intents, one can train a classifier
to recognize the best intent to associate to new chats. As
with the association, there are many methods for training a
classifier.</p>
          <p>
            One method is to directly associate a text string, a part or
all of a chat, with the intent. Then any new chat that
contains that string will be associated with the same intent (the
assumption is that the same response can be given to both
the training chats and the new chat). Another similar
technique uses regular expressions to identify patterns found in
chats associated with a single intent. This paper identifies
classifiers trained in this way as symbolic classifiers. Then
new chats that include these patterns will be
associated with the same intent. This is the method used by the
AliceBot system
            <xref ref-type="bibr" rid="ref10">(Wallace 2004)</xref>
            and similar systems. Other,
more complex classifiers can be designed around this
relatively simple manual model.
          </p>
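          <p>A symbolic classifier of this kind can be sketched in a few lines. The patterns and intent names below are hypothetical illustrations, not taken from AliceBot or any deployed system.</p>
          <preformat>
```python
import re

# Ordered list of (pattern, intent) rules; the first match wins.
PATTERNS = [
    (re.compile(r"\bchange\b.*\b(ticket|flight)\b", re.IGNORECASE), "ChangeTicket"),
    (re.compile(r"\bhow many bags\b", re.IGNORECASE), "BagAllowance"),
]

def match_intent(chat):
    """Return the intent of the first rule whose pattern matches, else None."""
    for pattern, intent in PATTERNS:
        if pattern.search(chat):
            return intent
    return None

print(match_intent("How do I change my ticket?"))  # ChangeTicket
```
          </preformat>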
          <p>
            Another method uses statistical machine learning to
generate a classifier from the training data. In this case, the
researcher selects a learning algorithm and then treats the
training data as a labeled data set. The training data is
often preprocessed to produce more features and to reduce
noise in the set. Then a classifier is ’trained’ on the training
data. Finally new chats are run through the system to let the
classifier evaluate their intents. Intents can then be used in
other parts of the system to generate appropriate responses.
This paper identifies classifiers trained in this way as
statistical classifiers. There are many references both for
machine learning methods in NLP
            <xref ref-type="bibr" rid="ref4 ref7">(Jurafsky and Martin 2009;
Manning and Schütze 1999)</xref>
            and for intent learning
            <xref ref-type="bibr" rid="ref6">(Li
2010)</xref>
            .
          </p>
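          <p>A minimal statistical intent classifier in this spirit is a bag-of-words Naive Bayes trained on labeled chats. The sketch below uses only the standard library; the training chats and intent labels are hypothetical.</p>
          <preformat>
```python
import math
from collections import Counter, defaultdict

def train(chats):
    """chats: list of (text, intent). Returns word/class counts and vocabulary."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, intent in chats:
        class_counts[intent] += 1
        for w in text.lower().split():
            word_counts[intent][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(model, text):
    """Pick the intent with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for intent, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[intent].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[intent][w] + 1) / denom)
        if score > best_score:
            best, best_score = intent, score
    return best

model = train([("change my ticket", "ChangeTicket"),
               ("how many bags can i check", "BagAllowance")])
print(classify(model, "i want to change my flight ticket"))  # ChangeTicket
```
          </preformat>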
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Does Intent Classification Work?</title>
        <p>The two methods described above produce classifiers that
work reasonably well when implemented in a limited
domain with not too many intents. We have conducted
experiments to compare different types of intent classifiers. These
experiments are not the topic of this paper, but the result is
that, in general, all of these methods perform about equally.</p>
        <p>Classifier performance can be measured in a number of
ways. Our experiments used a measure of accuracy (as
defined earlier) and measured the accuracy of classifiers by
classifying a set of labelled test data that were not part of
the training data.</p>
        <p>Many classifiers created from very different methods
have comparable accuracy. Furthermore, each of these
classifiers can be improved over time by increasing the amount
of (labeled) training data and retraining the classifier. Table
2 shows the result of one such evaluation of classifier
accuracy. In this case, the two classifiers do not recognize intent
particularly well, but perhaps they are not tuned and thus are
very quick to train. In any case intent classification is
possible, though we will always expect to see some error in the
evaluation process.</p>
        <p>Looking at Table 2, the classifiers compared in this case
produce almost equivalent accuracy and so in some sense are
interchangeable. But this view of intent classification misses
some important information which can be seen better in
Table 3. In this table, we see the accuracy of the same
classifiers compared based on how much their accuracies overlap.
In the case of these two classifiers, on 60% of the test data,
the intents returned by each of the classifiers are identical
and correct. In 25% of the test data both classifiers return the
incorrect intent.</p>
        <p>These two views of classifier test results raise a very
interesting point. Table 2 shows us that using one or the other of
these classifiers gives us approximately the same ability to
recognize intent. But Table 3 suggests that, if we can select
the classifier which gives a correct answer when available,
we can increase the accuracy of our combined classifier to
75% in the optimal case. We next discuss methods to help
determine if it is possible to improve the performance of a
system by combining multiple classifiers.</p>
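        <p>The 75% optimal figure is the "oracle" upper bound: a combined system fails only when every member classifier is wrong. Using the overlap figures quoted above (60% both correct, 25% both incorrect, the remainder with exactly one classifier correct), the bound is:</p>
        <preformat>
```python
# Oracle upper bound for two combined classifiers: the combination can be
# correct whenever at least one classifier is correct.
both_correct = 0.60
both_wrong = 0.25
at_least_one_correct = 1.0 - both_wrong
print(at_least_one_correct)  # 0.75
```
        </preformat>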
      </sec>
      <sec id="sec-4-3">
        <title>Combining Classifiers</title>
        <p>The remainder of this paper discusses some of the
possibilities and considerations of combining multiple intent
classifiers to produce better results than using a single classifier.
When one attempts to combine the results of classifiers, a
number of issues come up right away. This paper will
address the two following issues, the answers to which can get
us all the way to a functional classifier combiner.</p>
        <p>
          One major issue with combining classifiers is deciding on
the domain and training data of each classifier. In the
example above, each of the classifiers is trained on the same
training data and the same set of classes. We call these
classifiers with overlapping domains. This paper will present an
analysis of a method to compare two classifiers with overlapping
domains
          <xref ref-type="bibr" rid="ref5">(Kuncheva 2004)</xref>
          .
        </p>
        <p>Another important way to train classifiers is to train each
classifier on different domains. The combination of
classifiers trained on different domains means that each classifier
knows different sets of intents. This allows the creation of
more complete systems of intent classification by combining
smaller domains of knowledge.</p>
        <p>Another important consideration is to recognize that
classifiers trained using different methods will typically return
very different types of score information. Scores are
important in classifiers because the score provides a way to rank
the classifier’s results so that a ’winning’ class can be
determined.</p>
        <p>For example, if a classifier that is trained on K classes
receives an input, the classifier will typically return a set of
classes with a score for each class. The score is typically
a floating point number, allowing the classifier to order the
results from most likely to least likely. Some classifiers
generate a score that is very close to the posterior probability
of the class, but this is not true in general. In regular
expression-based text classifiers, for example, the score is
often some multiple of the number of characters or tokens
matched for the candidate class.</p>
        <p>Given the variety of different scoring methods possible,
how can one hope to compare results of different classifiers
to provide the hoped-for improvement in intent matching?</p>
        <sec id="sec-4-3-1">
          <title>-</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Overlapped Classifiers with Naive Bayes Combination</title>
          <p>
            In order to compare classifiers in a general way, we require
a means to either normalize the scores of each classifier so
that they can be compared, or to explore a way to ignore the
scores but still rank the classifier results. Naive Bayes
Combination is a technique to achieve the latter. The principles
and theory behind Naive Bayes classification can be found
in
            <xref ref-type="bibr" rid="ref2">(Duda, Hart, and Stork 2012)</xref>
            .
          </p>
          <p>Consider a system of L classifiers {Φ_1, Φ_2, ..., Φ_L} = Φ
trained on the same training data and the same set of K
classes. Given an input x, each Φ_i will output a class in
{c_1, c_2, ..., c_K} = C. Let
Φ(x) = [Φ_1(x), Φ_2(x), ..., Φ_L(x)]
be the vector of all the classes returned by the system given
input x. For example, for a given x, Φ(x) might return the
vector
Φ(x_example) = [c_2, c_4, c_1, c_4, c_4, c_1].
What we want is to discover an expression for the most
probable class given all the results in Φ(x) for any x. To do
this, we first use the Naive Bayes assumption that the entries
in Φ(x) are independent given any particular class
c_i. This lets us calculate the joint probability of any group
of results as the product of the individual
conditional probabilities. This is the conditional independence
assumption of Naive Bayes.</p>
          <p>Let P(Φ(x) | c_i) be the conditional probability that Φ(x) is
returned given that the correct class is c_i. Then the Naive
Bayes assumption says:
P(Φ(x) | c_i) = Π_{j=1}^{L} P(Φ_j(x) | c_i).
The posterior conditional probability of class c_i given the
classifier results is then:
P(c_i | Φ(x)) = P(c_i) P(Φ(x) | c_i) / P(Φ(x)).</p>
          <p>Using the conditional independence assumption described
above, we can now express the support for each class as:
μ_i(x) ∝ P(c_i) P(Φ(x) | c_i) = P(c_i) Π_{j=1}^{L} P(Φ_j(x) | c_i).
If we calculated the confusion matrix for each of the L
classifiers during the testing phase, then we have everything
we need to evaluate this expression for each μ_i(x).</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>Using The Confusion Matrix to Select a Class</title>
          <p>In a previous section, the confusion matrix CM_k for
classifier Φ_k was defined to be the matrix which sums the number
of test data which were assigned class c_j when the correct
class was c_i, for 1 ≤ i ≤ K, 1 ≤ j ≤ K. This generates a
K × K matrix of integers. Then CM_k(i, j) is the cell
containing the number of test data where Φ_k generated c_j but
c_i was the correct class. The diagonal of CM_k holds all the
test results where test data were correctly classified.</p>
          <p>Using CM_k we can estimate
P(Φ_k(x) | c_i) ≈ CM_k(i, j) (1 / N_i),
where j is the index of the class output Φ_k(x) and
N_i is the total number of test inputs whose correct
classification is c_i. We can also estimate P(c_i) ≈ N_i / N,
where N is the total number of test inputs.</p>
          <p>To arrive at a class from the L classifiers, we now only
need to calculate
μ_i(x) ∝ P(c_i) Π_{k=1}^{L} P(Φ_k(x) | c_i)
       ≈ (N_i / N) Π_{k=1}^{L} CM_k(i, j_k) (1 / N_i)
       = (1 / (N N_i^(L-1))) Π_{k=1}^{L} CM_k(i, j_k)
for each i, 1 ≤ i ≤ K, where j_k is the index of the class
returned by Φ_k(x). The i with the highest
proportional value will determine the class.</p>
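          <p>This combination rule can be sketched directly. The first matrix below reuses the numbers from the worked example that follows; the second matrix and the classifier outputs are hypothetical stand-ins introduced only to make the sketch runnable.</p>
          <preformat>
```python
def nb_combine(cms, outputs, class_totals):
    """Score each candidate class i by mu_i = (N_i / N) * prod_k CM_k[i][j_k] / N_i,
    where j_k is the class index output by classifier k."""
    n = sum(class_totals)
    scores = []
    for i, n_i in enumerate(class_totals):
        mu = n_i / n
        for cm, j in zip(cms, outputs):
            mu *= cm[i][j] / n_i
        scores.append(mu)
    return scores.index(max(scores)), scores

cm1 = [[5, 2, 1], [2, 3, 0], [0, 3, 4]]  # confusion matrix from the example below
cm2 = [[4, 3, 1], [1, 4, 0], [1, 2, 4]]  # hypothetical second confusion matrix
# Classifier 1 outputs c_1 (index 0), classifier 2 outputs c_2 (index 1).
best, scores = nb_combine([cm1, cm2], [0, 1], class_totals=[8, 5, 7])
print(best, scores)
```
          </preformat>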
        </sec>
        <sec id="sec-4-3-4">
          <title>Example: Naive Bayesian Estimation</title>
          <p>Suppose that we have two classifiers trained over the same
data and tested on 20 test inputs. There are three classes,
c_1, c_2, c_3 (K = 3), and the numbers of test questions that are
labeled with each class are N_1 = 8, N_2 = 5, N_3 = 7
respectively. The two classifiers Φ_1, Φ_2 have respective confusion
matrices:</p>
          <p>CM_1 =
( 5 2 1 )
( 2 3 0 )
( 0 3 4 )
CM_2 = ( ... )</p>
          <p>
Given an input x, suppose that Φ_1(x) = c_1 and Φ_2(x) = c_2.
We can use the technique described above to decide
which class is correct.
One problem with the method described above is that zero
entries in the confusion matrix can cause values to go to
zero. Since for large values of K the confusion matrix is
sparse (most of the entries are zeros), this technique is not
very robust by itself. There are a number of ways to
introduce smoothing into the system
            <xref ref-type="bibr" rid="ref9">(Titterington et al. 1981)</xref>
            .
We will look at one simple method for smoothing zeros by
including an additive element of the inverse of the number
of classes K
            <xref ref-type="bibr" rid="ref5">(Kuncheva 2004)</xref>
            .
          </p>
          <p>To accomplish this we rewrite the Naive Bayes
combination as:</p>
          <p>P(Φ(x) | c_i) ∝ Π_{k=1}^{L} ( CM_k(i, j_k) + 1/K ).
Note that two things happened here. One is that the zero
entry now has some weight. With sparse confusion matrices
this prevents all the values from disappearing from the Naive
Bayes test.</p>
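          <p>The smoothed rule is a one-line change to the combination sketch: add 1/K to each confusion-matrix entry before taking the product. As before, the second matrix is a hypothetical stand-in; only the first reuses numbers from the example above.</p>
          <preformat>
```python
K = 3  # number of classes

def smoothed_scores(cms, outputs, class_totals):
    """Like the unsmoothed combination, but each CM_k(i, j_k) becomes
    CM_k(i, j_k) + 1/K, so a single zero entry no longer zeroes a class."""
    n = sum(class_totals)
    scores = []
    for i, n_i in enumerate(class_totals):
        mu = n_i / n
        for cm, j in zip(cms, outputs):
            mu *= (cm[i][j] + 1.0 / K) / n_i
        scores.append(mu)
    return scores

cm1 = [[5, 2, 1], [2, 3, 0], [0, 3, 4]]  # from the example above
cm2 = [[4, 3, 1], [1, 4, 0], [1, 2, 4]]  # hypothetical second confusion matrix
scores = smoothed_scores([cm1, cm2], [0, 1], class_totals=[8, 5, 7])
print(all(s > 0 for s in scores))  # True: no score collapses to zero
```
          </preformat>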
          <p>
            The other notable result from this smoothing method is
that the answer changed to class c1! This is a common
consequence of smoothing. The smoothing method used here
is Uniform Smoothing where we borrow some probability
mass from all the nonzero masses and redistribute them to
remove zeros. More complex smoothing methods can be
devised to have less direct effect on the outcome. For examples
of more complex smoothing methods in probabilistic natural
language models see
            <xref ref-type="bibr" rid="ref3">(Dumoulin 2012)</xref>
            .
          </p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Conclusion</title>
        <p>This paper has described a theoretical framework for
generalized intent classifiers and a method for combining them.
Currently Next IT researchers are working on building a
framework for training and running a ’Panel of Experts’
utilizing the mathematical framework outlined here to compare
classifier results. The goal is to compare classifier outcomes
with each other and with human judgement (yet another kind
of classifier). These results have yet to be assembled but we
expect to study the panel and use it to improve
understanding of intent classifiers and their combination.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Acknowledgments</title>
        <p>The author would like to thank Ian Beaver and Cynthia Wu
for their original investigation of this topic and their work in
investigating effective machine learning methods for intent
classification. Thanks also to Mark Zartler and Tanya Miller
for their work on advancing our understanding of symbolic
classifiers. Thanks also to Next IT Corp. for its sponsorship
of this fascinating research area.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Artificial Conversations for Chatter Bots Using Knowledge Representation, Learning, and Pragmatics</article-title>
          . Ph.D. Dissertation, University of New Mexico.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Duda</surname>
            ,
            <given-names>R. O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Stork</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Pattern classification</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dumoulin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Smoothing of ngram language models of human chats</article-title>
          .
          <source>In Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS)</source>
          ,
          <year>2012</year>
          Joint 6th International Conference on,
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          <year>2009</year>
          . Speech &amp; Language Processing 2nd Ed. Prentice Hall.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L. I.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Combining pattern classifiers: methods and algorithms</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Understanding the semantic structure of noun phrase queries</article-title>
          .
          <source>In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>1337</fpage>
          -
          <lpage>1345</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Foundations of statistical natural language processing</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Ripley</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Pattern recognition and neural networks</article-title>
          . Cambridge university press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Titterington</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Spiegelhalter</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Skene</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Habbema</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gelpke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>Comparison of discrimination techniques applied to a complex data set of head injured patients</article-title>
          .
          <source>Journal of the Royal Statistical Society, Series A (General)</source>
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>From eliza to alice</article-title>
          . Article available at http://www.alicebot.org/articles/wallace/eliza.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>