iClass – Applying Multiple Multi-Class Machine Learning Classifiers combined with Expert Knowledge to Roper Center Survey Data*

Marmar Moussa 1, Marc Maynard 2

1 University of Connecticut, CT, USA, marmar.moussa@uconn.edu
2 Roper Center for Public Opinion Research, CT, USA, mmaynard@ropercenter.org

Abstract. As one of the largest public opinion data archives in the world, the Roper Center [1] collects datasets of polled survey questions as they are released from numerous media outlets and organizations with varying degrees of format ambiguity. The volume of data introduces search complexities over survey questions asked since the 1930s and poses challenges when analyzing search trends. Up to this point, Roper Center question-level retrieval applications have relied on human metadata experts to assign topics to content. This has been insufficient to reach the required level of consistency in catalogued data and provides an inadequate base for creating an advanced search experience for research clients. The objective of this work is to combine the human expert teams' knowledge of the nature of the poll questions and the concepts and topics these questions express with the ability of multi-label classifiers to learn this knowledge and apply it in an automated, fast and accurate classification mechanism. This approach cuts down question analysis and tagging time significantly and provides enhanced consistency and scalability for topic descriptions. At the same time, creating an ensemble of machine learning classifiers combined with expert knowledge is expected to enhance the search experience and provide much needed analytic capabilities for the survey question databases. In our design, we use classification from several machine learning algorithms, such as SVM and Decision Trees, combined with expert knowledge in the form of handcrafted rules, data analysis and result review. We consolidate this into a ‘Multipath Classifier’ with a ‘Confidence’ point system that decides on the relevance of topics assigned to poll questions with nearly perfect accuracy.

Keywords: ensembles; expert knowledge; knowledge base; machine learning; multi-label classifiers; supervised learning; survey datasets

Copyright © 2015 by the paper's authors. Copying permitted only for private and academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015, published at http://ceur-ws.org

1 Introduction

In this paper, we present an overview of our work at the Roper Center on applying machine learning to public opinion survey datasets in order to classify questions into their most relevant set of topics/classes. The application, iClass, is a collection of modules of autonomous classifiers and a knowledge expert ‘Admin’ module which allows us to combine both human knowledge and machine learning in the classification and review processes. This paper presents the nearly complete first phase of iClass. We describe the business context and motivation behind this development, a design overview and preliminary evaluation results. The last section describes the business payoff and directions for future phases.

1.1 Context and Motivation

The Roper Center collects datasets of survey questions from polls conducted by think tanks, media outlets, and academic organizations. The data has been gathered since the 1930s with varying degrees of format ambiguity.
The volume of legacy data introduces search complexities and poses challenges when analyzing search trends. Homegrown backend systems serve several data retrieval and analysis services to Roper Center members. The primary two are 1) iPOLL, a question-level retrieval database containing over 650,000 polling questions and answers, and 2) RoperExpress, a catalog of survey datasets conducted in the US and around the globe.

Historically, datasets, iPOLL questions, and secondary material have been managed and cataloged by separate teams, which led to different descriptive practices. Dataset expert teams use free-text keyword descriptors to assign topics to content. As a result, even though there are clear topical and other connections among the content, the lack of consistent description creates rifts that make these connections elusive (Fig. 1). It results in costly string operations for even simple tasks, as well as costly retrospective updates when topic definitions change or new topics are added. This approach also does not allow for any further data analytics capabilities.

Fig. 1. Example of inconsistent topics assigned to ‘similar’ content

Our objective is therefore to develop a scalable system for concept-based classification of questions that implements an intelligent automated approach for identifying conceptual links between content at the point of acquisition/creation using machine learning classifiers, while at the same time leveraging existing expert knowledge.

1.2 Related Work

In statistics and machine learning, ensemble methods achieve improved performance by combining the opinions of multiple learners [2]. There are different ways of combining base learners into ensembles [3]. We decided to design a combining method tailored to our specific goals, such as scalability and utilizing available expert knowledge. This is required to accommodate changes in topic definitions over time and the emergence of new topics from newly acquired studies. Our combining method is a mix of weighting, majority voting and performance weighting. In weighting methods, a classifier has strength proportional to its assigned weight. In a voting scheme, the number of classifiers that decide on a specific label is counted and the label with the highest number of votes is chosen. In performance weighting [4], the weight of each classifier is set proportional to its accuracy on a given validation set.
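To make this combining idea concrete, the following is a minimal Python sketch of performance-weighted voting over per-classifier topic votes. The classifier names, accuracies, votes and the 0.5 threshold are hypothetical placeholders; the production combining actually used in iClass is the confidence point system described in Section 3.2 and is implemented in Oracle PL/SQL.

```python
from collections import defaultdict

def weighted_vote(votes, weights, threshold=0.5):
    """Performance-weighted voting over per-classifier topic predictions.

    votes   -- dict: classifier name -> set of topics it predicts for one question
    weights -- dict: classifier name -> accuracy on a validation set (its weight)
    Returns the topics whose normalized weighted vote meets the threshold.
    """
    scores = defaultdict(float)
    total_weight = sum(weights.values())
    for clf, topics in votes.items():
        for topic in topics:
            scores[topic] += weights[clf] / total_weight
    return {t: s for t, s in scores.items() if s >= threshold}

# Hypothetical example: three base classifiers voting on topics for one question.
votes = {
    "svm":   {"Abortion", "Courts"},
    "dtree": {"Abortion", "Religion"},
    "rules": {"Abortion", "Courts", "Supreme Court"},
}
weights = {"svm": 0.92, "dtree": 0.85, "rules": 0.80}  # validation accuracies

# 'Abortion' collects every vote (score 1.0), 'Courts' about 0.67;
# 'Religion' and 'Supreme Court' fall below the 0.5 threshold and are dropped.
print(weighted_vote(votes, weights))
```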
2 Design Overview

2.1 Data Analysis and Tools

Several housekeeping steps had to take place before we could develop a reliable system with high accuracy. The first was a data cleanup. The initial classification tests revealed numerous discrepancies and inconsistencies between the actual concepts of questions and their assigned topics, as described in Section 1.1. The review also revealed the need for a number of new topics and a three-level topic hierarchy. This meant defining categories at the parent level in a new topics hierarchy (Fig. 2), as well as refining existing topic definitions to achieve consistency. The effort resulted in 119 topics for the current question bank, with over 20 new topics identified as a result of the initial classification test reviews and analysis. The topics were arranged into 6 main categories and 3 levels of hierarchy.

We also needed to implement the necessary workflow changes, including a review of testing results and of the ‘Before & After’ list of topics associated with each question. An ‘Admin’ role with the necessary expert knowledge reviews the results, ‘approves’ the topics assigned, and selects some of the accepted question-topic pairs to be fed back into the training set.

Fig. 2. New Categories Hierarchy

Roper Center metadata is stored in an Oracle 11g database, prompting an examination of the machine learning algorithms supported by Oracle classifier functions. We conducted tests using the RTextTools package [5] over datasets exported from the Oracle database, as well as tests and evaluation using the Python scikit-learn package [6] over exported data. The main modules, however, use Oracle PL/SQL for analysis as well as for training and classification, for compatibility with the Roper Center's architecture.

2.2 Machine Learning Algorithms

We used two machine learning techniques for this phase of iClass: Support Vector Machine (SVM) and Decision Tree. SVM is known to perform with significant accuracy even on sparse data; SVM classification attempts to separate the target classes with the widest possible margin, and it is very fast. Distinct versions of SVM use different kernel functions to handle different types of data sets; linear and Gaussian (nonlinear) kernels are supported. We used linear kernels in this phase of iClass. SVM, however, does not produce human-readable rules. In contrast, the Decision Tree (DT) algorithm produces human-readable and extendable rules. Decision trees extract predictive information in the form of human-understandable rules, essentially if-then-else expressions that explain the decisions that led to the prediction. DT handles missing values well, is fast, and performs with good accuracy [7].

3 Design Details

3.1 Classification Process Flow

Assembling a comprehensive training set that represents all topics and their features was challenging yet critical for success. The data analysis and initial tests resulted in a selected set of expert-classified questions to use as the seed for the training set. The training set also included handcrafted question samples for under-represented topics. For new topic definitions, we used a set of SQL queries for a fine-grained selection of questions to be assigned the new topics.

After the training set is constructed, SVM and Decision Tree classifiers are trained to produce a set of rules for each topic. Each topic also gets an additional set of Admin/Expert-defined rules in the form of keywords to look for or exclude from the question text. These manually defined rules form the third set of rules to process. Three modules (the DT, SVM and Rule-Based Classifiers) use these sets of rules and ‘vote’ with different scores over the topics to be assigned. A fourth path for classification is formed by the direct SQL queries representing the more complex expert-defined rules that are not included in the Rule-Based Classifier. For this path too, the implementation assigns confidence scores to the selected question-topic pairs. The four paths' (sources') results construct a vector for each question-topic pair, containing the source and the designated score/confidence. Fig. 3 below describes this process flow in iClass.
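The SVM and Decision Tree training itself runs inside Oracle, but as noted in Section 2.1 the approach was also evaluated with scikit-learn [6]. The sketch below illustrates, with hypothetical sample questions and topic labels, how the SVM (linear kernel) and Decision Tree paths can be trained as one-vs-rest multi-label text classifiers; it is an illustrative analogue, not the production PL/SQL code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training questions and their expert-assigned topic sets.
questions = [
    "Do you approve of the Supreme Court ruling on abortion?",
    "How important is religion in your family life?",
]
topics = [["Abortion", "Courts", "Supreme Court"], ["Religion", "Family"]]

# Bag-of-words / TF-IDF features over the question text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions)

# One binary problem per topic, mirroring the per-topic rule sets in iClass.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(topics)

svm_path = OneVsRestClassifier(LinearSVC())                 # SVM path, linear kernel
tree_path = OneVsRestClassifier(DecisionTreeClassifier())   # Decision Tree path
svm_path.fit(X, Y)
tree_path.fit(X, Y)

# Each path then votes on topics for a new, unseen question.
new_q = vectorizer.transform(["Should the courts restrict abortion rights?"])
print(binarizer.inverse_transform(svm_path.predict(new_q)))
print(binarizer.inverse_transform(tree_path.predict(new_q)))
```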
Three values are then considered when combining the information from this ensemble of classifiers: 1) the (weighted) number of sources/votes that assigned a topic to a specific question, 2) the threshold (possibly one per topic and source) that determines whether this classification counts as a true positive or a false positive, and 3) the confidence/score values. A combined confidence/score is formed, and the classified question-topic pairs are then reviewed by an expert who approves or rejects them. Approved results can be fed back into the training set pool for a new round of training. This is needed as the dataset grows with incoming poll questions from newer studies acquired by the Roper Center.

Fig. 3. iClass Components and Process Flow details

3.2 Confidence

Fig. 4. Confidence Points, 1 (low) to 12 (high)

As described in previous sections, the current phase of iClass has four different sources of classification: the SVM, DT, Rule-Based and Expert/Manual direct selection paths. To combine them, we applied a ‘Confidence Points System’. Confidence/relevance levels (Low, Medium, and High) from each classification algorithm/path aggregate into an (N*3) point system, where N is the number of classification paths. As we currently implement 4 paths, there are 12 Points of Confidence (Fig. 4). For sources 1 and 2, SVM and DT, the confidence is calculated via the classifier functions as a value greater than 0 and less than 100; we convert this to a value from 1 to 3 using a pseudo-count, scaling, and rounding. For the Expert/Manual path, where the direct analysis process is implemented in the various SQL scripts, the confidence for each topic is configured directly based on the Admin's analysis. For the Rule-Based Classifier, each topic is assigned a rule confidence level associated with the keyword and exclusion rules defined for that topic.

A classified question-topic pair therefore has 1 to 12 possible confidence points: if 4 sources vote for a topic with the maximum confidence points (3) each, the total confidence for this question-topic classification is 12. If, on the other hand, only one source votes for this assignment and with the lowest possible confidence (1), the total will be 1 point (Fig. 5). The Admin sets different thresholds for different functionalities, for instance a threshold of >3 for a topic to appear in search results, or a threshold of ≤2 for the admin review process to look at the weakest items.

Fig. 5. Example of question-topics assigned with highest and lowest confidence

A tradeoff exists between adding more (perhaps distantly related) topics, which could cause a degree of confusion to the reviewer/user, versus being extra cautious in assigning topics and risking that related questions might not appear in the results of related but non-primary topic searches. Fig. 6 is an example of this tradeoff, and also an example of how iClass identified more relevant topics than were assigned by a human cataloger. The ‘Family’ and ‘Religion’ topics in this example, although both long available in the system, were not initially assigned by manual classification during data entry. iClass assigned lower confidence to these topics compared to other more relevant topics, such as ‘Abortion’ and ‘Courts’. The topic ‘Supreme Court’ is a new topic; it is also very relevant and is correctly captured and assigned a high confidence level.

Fig. 6. Example of the new classification results
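The following is a minimal sketch of this confidence-point bookkeeping for a single question-topic pair. The exact pseudo-count, scaling and rounding used to map a raw classifier score onto 1 to 3 points is not spelled out above, so that mapping, like the example votes and path names, is an illustrative assumption.

```python
def points_from_probability(prob, pseudo_count=1, scale=3, max_prob=100):
    """Map a raw classifier score in (0, 100) to 1-3 confidence points.

    Illustrative pseudo-count/scaling/rounding; the exact formula used in
    the PL/SQL implementation is not given in the paper.
    """
    return max(1, min(3, round((prob + pseudo_count) * scale / max_prob)))

def combine(votes, search_threshold=3):
    """Aggregate per-path points for one question-topic pair.

    votes -- dict mapping a path name ('SVM', 'DT', 'RULES', 'EXPERT')
             to its 1-3 confidence points; paths that did not vote are absent.
    Returns the total points (1-12 with four paths) and whether the pair
    clears the threshold for appearing in search results.
    """
    total = sum(votes.values())
    return total, total > search_threshold

# Example: three of the four paths vote for a hypothetical (question, topic) pair.
votes = {"SVM": points_from_probability(88), "DT": 2, "RULES": 3}
print(combine(votes))   # (8, True): shown in search, not flagged as a weak item
```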
4 Evaluation

The evaluation of multiclass classifiers is inherently challenging, as most of the usual metrics make the most sense when applied to binary classifiers. One way to explore the performance of a multiclass classifier is to construct the confusion matrix (Fig. 7) and extend it to an (N×N) matrix, where N is the number of classes (topics).

Fig. 7. Binary Confusion Matrix

Aside from having multiple classes in our system, we have a further complexity: a question can be assigned multiple topics with varying confidence/relevance levels. A threshold then determines whether questions with lower confidence points are counted towards FP or TP. The cutoff between TP and FP is therefore somewhat blurred.

Results based on the SVM, DT and Rule-Based Classifier modules only:

Hits: avg. # of questions with all topics TP (657,850 total questions): 576,902
Average accuracy ((TP+TN)/total): 0.917
False negative / miss rate: 0.026
"False positive" rate: 0.030
# Newly identified question-topic pairs (not present in training set): 695,792
New correct topic assignment rate (added value): 0.455

Fig. 8. Evaluation Results

Our evaluation of only 3 paths (the SVM, DT and Rule-Based Classifier) showed an accuracy of 91.7%, and over 99% when enabling all 4 paths. This is expected, as the 4th path involves more direct human knowledge in the classification decision.
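To make this threshold-dependent accounting concrete, the sketch below counts per-topic TP, FP and FN for multi-topic assignments, treating a predicted topic as assigned only when its confidence points clear a cutoff. The question IDs, gold topics and scores are hypothetical.

```python
from collections import Counter

def per_topic_counts(gold, predicted, cutoff=3):
    """Count TP/FP/FN per topic over multi-topic question assignments.

    gold      -- dict: question id -> set of expert-approved topics
    predicted -- dict: question id -> dict of topic -> confidence points
    cutoff    -- predictions at or below this score are treated as
                 'not assigned', mirroring the Admin-set threshold.
    """
    counts = Counter()
    for qid, true_topics in gold.items():
        kept = {t for t, pts in predicted.get(qid, {}).items() if pts > cutoff}
        for topic in kept & true_topics:
            counts[(topic, "TP")] += 1
        for topic in kept - true_topics:
            counts[(topic, "FP")] += 1
        for topic in true_topics - kept:
            counts[(topic, "FN")] += 1
    return counts

# Hypothetical gold labels and iClass confidence points for two questions.
gold = {"Q1": {"Abortion", "Courts"}, "Q2": {"Religion"}}
predicted = {
    "Q1": {"Abortion": 10, "Courts": 4, "Family": 2},   # 'Family' falls below the cutoff
    "Q2": {"Religion": 5, "Abortion": 6},               # 'Abortion' counts as a false positive
}
print(per_topic_counts(gold, predicted))
```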
5 Conclusion & Future Work

The development of the first phase of iClass, combining machine learning and expert knowledge, introduced performance and administrative benefits to the business process at the Roper Center. To name a few, the automated classification contributed to better consistency in topic definitions and faster, streamlined topic assignment. The expert/Admin review process is dramatically shortened, as it focuses on low-confidence items. The process is now change-tolerant: when adding or updating topics, we can reflect the updates over the entire question bank retrospectively. In terms of performance, the elimination of costly string operations improves functionalities such as search and navigation by topic. Although still a work in progress, iClass is scalable in terms of thresholds and confidence level configurations, as well as adding entire extra classification paths to the system. Analytic capabilities, such as efficient metadata statistics, especially about topic trends, are now part of the system.

From the application perspective, several components of iClass need further work in the next phase. A more user-friendly Admin module is planned; the system currently supports only one set of handwritten rules per topic definition and only one admin user, both of which need to be extended for business needs. The classification of the datasets only lays the groundwork for better data analytics, which is currently not fully leveraged. There is also a business need to extend the functionality of iClass to knowledge base facets other than topics, such as survey sample classes. In addition, there is still a great deal to be explored about the learning techniques that best fit the business. As the classification process is prepared to accept more classification paths, future work includes using other machine learning algorithms to create additional classification paths, as well as studying other ensemble classification methods for combining weights and votes and comparing the results of the different methods.

6 References

1. Roper Center for Public Opinion Research, http://www.ropercenter.uconn.edu/
2. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6(3) (2006)
3. Rokach, L.: Ensemble-based classifiers. Artificial Intelligence Review 33(1-2), 1-39 (2010)
4. Opitz, D., Shavlik, J.: Generating accurate and diverse members of a neural network ensemble. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 535-541. MIT Press, Cambridge (1996)
5. Jurka, T.P., Collingwood, L., Boydstun, A.E., Grossman, E., van Atteveldt, W.: RTextTools: Automatic Text Classification via Supervised Learning (2014)
6. scikit-learn: Machine Learning in Python, 0.17.dev0 documentation
7. Smola, A., Vishwanathan, S.V.N.: Introduction to Machine Learning. Cambridge University Press, Cambridge (2008)