<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>A text classification framework based on optimized Error Correcting Output Code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Locci</string-name>
          <email>locci.mario@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuliano Armano</string-name>
          <email>giuliano.armano@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIEE Dept. of Electrical and Electronic Engineering, University of Cagliari</institution>
          ,
<addr-line>Piazza d'Armi 09123, Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, there has been increasing interest in using text classifiers for retrieving and filtering information from web sources. As the number of categories in this kind of software application can be high, Error-Correcting Output Coding (ECOC) can be a valid approach to perform multi-class classification. This paper explores the use of ECOC for learning text classifiers using two kinds of dichotomizers and compares them to the corresponding monolithic classifiers. We propose a simulated annealing approach to calculate the coding matrix using an energy function similar to the electrostatic potential energy of a system of charges, which makes it possible to maximize the average distance between codewords with low variance. In addition, we use a new criterion for selecting features, a feature (in this specific context) being any term that may occur in a document. This criterion defines a measure of discriminant capability and allows terms to be ordered according to it. Three different measures have been tested to perform feature ranking / selection, in a comparative setting. Experimental results show that reducing the set of features used to train classifiers does not affect classification performance. Notably, feature selection is not a preprocessing activity valid for all dichotomizers: features are selected for each dichotomizer that occurs in the coding matrix, typically giving rise to a different subset of features depending on the dichotomizer at hand.</p>
      </abstract>
      <kwd-group>
<kwd>ECOC classifiers</kwd>
        <kwd>Simulated Annealing</kwd>
        <kwd>Feature extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Multi-class classification consists of assigning a given pattern x to a category taken from a predefined set, say $c \in C$, with $C = \{c_1, c_2, c_3, \ldots, c_m\}$. Several approaches have been devised to directly handle multi-class problems (e.g., decision trees [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CART [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Other algorithms, originally designed to handle binary problems, have been extended to handle multi-class problems; multi-class support vector machines (SVM) [?] are a notable example of this strategy. Other methods turn multi-class problems into a set of binary problems. Classical examples of this approach are one-against-all and one-against-one. The former handles an m-class problem with m binary classifiers, each trained to discriminate the i-th class against the others. The latter uses a binary classifier to discriminate between each pair $\langle c_i, c_j \rangle$, $i \neq j$, of categories. In so doing, the overall number of classifiers ends up being $m(m-1)/2$.
      </p>
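      <p>As an aside, both decompositions are available off the shelf in common libraries. The following sketch is purely illustrative (it is not the code used in this paper) and assumes scikit-learn together with the 20 newsgroups corpus described in Section 4; it only shows how the number of binary classifiers grows under each scheme.</p>
      <preformat>
# Illustrative sketch: one-against-all vs. one-against-one decompositions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
X = TfidfVectorizer().fit_transform(train.data)
y = train.target

ova = OneVsRestClassifier(LinearSVC()).fit(X, y)  # m binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # m(m-1)/2 binary classifiers
print(len(ova.estimators_), len(ovo.estimators_))  # 20 and 190 for m = 20
      </preformat>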
      <p>
        An alternative approach to solving multi-class learning tasks is to adopt Error-Correcting Output Coding (ECOC). Error correcting codes are widely used in data transmission, being in charge of correcting errors when messages are transmitted through a noisy channel. A simple encoding strategy in data transmission is to add extra bits to any given message, so that the receiver will typically be able to correct it in the presence of noise. A variation of this basic principle is applied with success in the field of machine learning, to improve the performance of multi-class classifiers. The basic ECOC strategy is to assign a binary string of length n (i.e., a codeword) to each category, trying to separate each codeword from the others as much as possible. The set of codewords can also be viewed as a coding matrix, in which binary classifiers are related to columns, whereas categories are related to rows. Hence, the i-th classifier will consider samples taken from the j-th category as negative or positive depending on the value, i.e., -1 or +1, found at position $\langle i, j \rangle$ of the coding matrix. This approach was first used in the NETtalk system [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Dietterich and Bakiri [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have shown that ECOC can improve the generalization performance of both decision trees (experiments have been made with C4.5) and neural networks (using backpropagation) on several benchmark datasets. They have also shown that ECOC is robust with respect to changes in the size of training samples as well as to changes of codeword assignments. Interesting experimental results have been obtained by Berger [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on several real-world datasets of documents. The author has shown that ECOC can offer significant improvements in accuracy over conventional algorithms on three of the four datasets used for the experiments. In that work, the author used Naive Bayes (NB) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as base classifier, whereas the codeword assignments were chosen randomly.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Coding strategies</title>
      <p>Since the first ECOC systems were designed, many experiments have shown that, to achieve a good generalization capacity, codewords must be well separated, which implies that the corresponding binary classifiers are trained on different subsets of data. The most commonly used distance measure is the Hamming distance. Given two vectors x and y, with components $x_i, y_i \in \{-1, +1\}$, the Hamming distance $d(x, y)$ is defined as:</p>
      <p>$d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{2}$ (1)</p>
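      <p>Equivalently, in code (a minimal sketch, assuming NumPy and codewords with components in {-1, +1}):</p>
      <preformat>
import numpy as np

def hamming_distance(x, y):
    # Equation 1: each differing component contributes |x_i - y_i| / 2 = 1.
    return int(np.abs(np.asarray(x) - np.asarray(y)).sum() // 2)

print(hamming_distance([+1, -1, +1, +1], [-1, -1, +1, -1]))  # 2
      </preformat>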
      <p>In the training phase, n binary classifiers are trained with samples relabeled in accordance with the coding matrix, say $\Omega$. The trained classifiers produce an output vector y, $y_j$ being the output of the j-th binary classifier. The decoding strategy is to assign to the output vector y the category that corresponds to the closest codeword. In symbols (with $\omega_j$ the codeword of the j-th category and d the adopted distance function):</p>
      <p>$\arg\min_j d(\omega_j, y)$ (2)</p>
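      <p>A minimal sketch of this decoding rule follows; the 3-class, 5-bit matrix is hypothetical and only serves to show that a single flipped output bit is corrected:</p>
      <preformat>
import numpy as np

def decode(outputs, codewords):
    # Equation 2: pick the class whose codeword is closest, in Hamming
    # distance, to the vector of binary-classifier outputs.
    distances = np.abs(np.asarray(codewords) - np.asarray(outputs)).sum(axis=1) / 2
    return int(np.argmin(distances))

M = np.array([[+1, +1, +1, -1, -1],
              [-1, +1, -1, +1, -1],
              [-1, -1, +1, +1, +1]])
print(decode([+1, +1, -1, -1, -1], M))  # 0: one flipped bit, still class 0
      </preformat>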
      <p>
        ECOC has been used successfully in many application areas. It has been shown that randomly generated matrices often perform well, sometimes even better than those generated by heuristic methods. Random codes were theoretically studied in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], showing that the condition to obtain an optimal Bayes classifier is equidistance between each pair of codewords: when the random code is large enough, the ECOC classifier tends asymptotically to the optimal Bayes classifier, provided that the base classifier is an optimal Bayes classifier.
      </p>
      <p>
        Although maximizing the distance between any pair of codewords helps to remove individual classification errors, decoding errors may still occur [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The effect on the decoding error can be understood by analyzing the decoding strategy and the Bayes decision rule. An ECOC matrix performs a linear transformation between spaces: the original output q of the optimal Bayes classifier is transformed by the ECOC matrix into the corresponding output p. With q a probability vector (i.e., $q_i$ is the probability of the i-th class), the output vector p is:
      </p>
      <p>$p = \Omega^T q$ (3)</p>
      <p>When all pairs of codewords are equidistant, Equation 2 implies maximizing the posterior probability:</p>
      <p>$\arg\max_j q_j$ (4)</p>
      <p>
        An interesting class of ECOC codes is BCH (from R. C. Bose and D. K. Ray-Chaudhuri), which form a class of cyclic error-correcting codes constructed using finite fields. The key feature of this type of coding is the precise control of the number of correctable symbols. An example of an algorithm for generating BCH codes is described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This algorithm uses a polynomial of degree m to build the Galois finite field $GF(2^m)$. The length L of the binary code fulfills the following constraints: $2^{m-1} - 1 &lt; L \leq 2^m - 1$. Moreover, given the parameter t, which represents the number of correctable errors, the minimum distance between pairs of codewords is $d = 2t + 1$.
      </p>
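      <p>The two constraints are easy to check numerically; the snippet below (a plain arithmetic check, not a BCH encoder) reproduces the configuration used later in Section 6, where t = 6 and codewords are up to 63 bits long:</p>
      <preformat>
def bch_parameters(m, t):
    # Length constraint: 2^(m-1) - 1 &lt; L &lt;= 2^m - 1; minimum distance d = 2t + 1.
    return {"L_range": (2 ** (m - 1), 2 ** m - 1), "min_distance": 2 * t + 1}

print(bch_parameters(m=6, t=6))  # L in 32..63, minimum distance 13
      </preformat>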
    </sec>
    <sec id="sec-3">
      <title>Feature selection</title>
      <p>A characteristic of text categorization problems is the high dimensionality of the feature space. Each document is typically represented using a bag of words: each word being a base vector that generates the feature space, a document can be represented as a linear combination of these base vectors. A major problem is that there can be hundreds of thousands of terms even for small text collections. This amount of words is prohibitively high for many learning algorithms. Hence, reducing the original space without losing accuracy is highly desirable.</p>
      <p>
        Many methods to reduce the dimensionality of the feature space have been devised. Most of them select words according to the score obtained by means of a suitable performance measure, devised to check to which extent the word at hand is in agreement (or disagreement) with the category under analysis. $\chi^2$ [?] and Information Gain (IG) (e.g., [?]) are well-known measures used to perform feature (i.e., word) ranking. Yang and Pedersen [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] measure the goodness of a term globally, with respect to all categories on average, defining general versions of IG and $\chi^2$ for multi-class problems. In their experiments with kNN and LLSF, they found IG and $\chi^2$ most effective for aggressive term removal without losing accuracy. Rogati and Yang [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] analyzed 100 variants of five major feature selection criteria and found that the methods based on the $\chi^2$ statistics outperformed those based on other criteria. The problem of selecting features for ECOC has not been specifically addressed in the literature, even though in our view it is very important.
      </p>
      <p>The remainder of this paper is organized into five sections: Section 2 describes the proposed approach for code optimization; Section 3 introduces a selection method based on the configuration of the coding matrix; Section 4 describes the real datasets used and the experimental settings; Section 5 reports and discusses experimental results; and Section 6 ends the paper.</p>
      <sec id="sec-3-1">
        <title>The proposed Simulated Annealing approach for optimizing ECOC</title>
        <p>
          In this section, we propose a method based on simulated annealing (SA) to optimize the coding matrix. SA is a very robust algorithm, often able to find a global optimum and less likely to fail on difficult tasks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In our case, SA explores a space D of coding matrices characterized by m rows (the set of codewords) and n columns (the number of binary classifiers). Let us denote with $\Omega \in D$ the optimal (or sub-optimal) coding matrix.
        </p>
        <p>The standard SA algorithm starts with an initial temperature $T = T_0$ and moves randomly in the neighborhood of the current tentative solution $\omega$. SA is a local search algorithm, whose strategy consists of always accepting any new solution that improves the current one. However, to avoid local minima, SA may also accept worse solutions, with a probability that decreases as the temperature T is lowered. The convergence of the algorithm is guaranteed by decreasing T as the search goes on. The search continues until the maximum number of iterations has been performed or no relevant change has been observed between two consecutive steps.</p>
        <p>A solution in the neighborhood of $\omega$ is calculated by the neighbor function, described by Equation 5 (with z a uniform random variable in [0, 1) and $p_1$, $p_2$, $p_3$ given constants). In the specific setting of searching for the (sub)optimal ECOC coding matrix, a neighbor is generated from $\omega$ by i) randomly changing a bit from +1 to -1 or vice versa (if $z &lt; p_1$), ii) adding a random column vector (if $p_1 \leq z &lt; p_2$), or iii) removing a random column vector (if $p_2 \leq z &lt; p_3$):</p>
        <p>$\mathrm{neighbor}(\omega) = \begin{cases} \text{change a random bit of } \omega &amp; \text{if } z &lt; p_1 \\ \text{add a random column vector to } \omega &amp; \text{if } p_1 \leq z &lt; p_2 \\ \text{remove a random column vector from } \omega &amp; \text{if } p_2 \leq z &lt; p_3 \end{cases}$ (5)</p>
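        <p>A direct transcription of Equation 5 might look as follows (a sketch, assuming the matrix is stored as an m-row list of lists with entries in {-1, +1} and that the constants satisfy $p_1 &lt; p_2 &lt; p_3$):</p>
        <preformat>
import random

def neighbor(omega, p1, p2, p3):
    # Equation 5: mutate a copy of the coding matrix omega.
    omega = [row[:] for row in omega]
    m, n = len(omega), len(omega[0])
    z = random.random()
    if z &lt; p1:                      # i) flip one random bit
        i, j = random.randrange(m), random.randrange(n)
        omega[i][j] = -omega[i][j]
    elif z &lt; p2:                    # ii) add a random column vector
        for i in range(m):
            omega[i].append(random.choice((-1, 1)))
    elif z &lt; p3 and n &gt; 1:          # iii) remove a random column vector
        j = random.randrange(n)
        for i in range(m):
            del omega[i][j]
    return omega
        </preformat>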
        <p>In the proposed variant of the SA algorithm, the cost function is analogous to the potential energy of a system of electric charges, and is defined by Equation 6, where $\omega_i$ and $\omega_j$ are codewords of $\Omega$:</p>
        <p>$f(\omega) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \frac{1}{d(\omega_i, \omega_j)}$ (6)</p>
        <p>The ECOC optimization method that makes use of SA will be denoted as SAE hereinafter. Moreover, the SAE variant that makes use of classifiers of type $\langle x \rangle$ will be denoted SAE$\langle x \rangle$.</p>
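        <p>Putting the pieces together, a compact SAE loop could look like the sketch below. The 1/d form of the energy is our reading of the electrostatic analogy in Equation 6, and the geometric cooling schedule is an assumption (the parameter values actually used are listed in Section 6); neighbor() is the function sketched above.</p>
        <preformat>
import math, random

def energy(omega):
    # Equation 6: sum of 1/d(w_i, w_j) over codeword pairs, the analogue of
    # the Coulomb potential of a system of unit charges.
    m, f = len(omega), 0.0
    for i in range(m):
        for j in range(i + 1, m):
            d = sum(abs(a - b) for a, b in zip(omega[i], omega[j])) / 2
            f += 1.0 / d if d &gt; 0 else float("inf")
    return f

def sae(omega0, t0=1.0, t_min=0.01, steps=30, alpha=0.9, p=(0.6, 0.8, 1.0)):
    omega, f, t = omega0, energy(omega0), t0
    while t &gt; t_min:
        for _ in range(steps):
            cand = neighbor(omega, *p)
            fc = energy(cand)
            # Metropolis rule: always accept improvements; accept worse
            # solutions with probability exp(-(fc - f) / t).
            if fc &lt; f or random.random() &lt; math.exp((f - fc) / t):
                omega, f = cand, fc
        t *= alpha  # cooling schedule (assumed geometric)
    return omega
        </preformat>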
      </sec>
      <sec id="sec-3-2">
<title>ECOC-dependent feature selection</title>
        <p>As text categorization involves a very high-dimensional feature space (a typical order of magnitude is 10,000), a feature selection method is needed. Our approach is enforced after having found the coding matrix, as in our view each individual binary classifier should have its own subset of features.</p>
        <p>Many selection methods are based on the estimation of word probabilities, class probabilities, and the joint probability of words and classes. These estimates are usually computed considering only the corpus of documents, independently of the way classifiers group the data. This is reasonable if the adopted kind of classifier is inherently multi-class (e.g., NB classifiers). However, an ECOC classifier actually embodies a set of n dichotomizers (n being the length of the codewords). In particular, given a dichotomizer $g_j$, a category $c_i$ can be considered as a source of negative or positive samples, depending on which symbol appears at position $\langle i, j \rangle$ of the coding matrix (-1 for negative samples and +1 for positive samples). This is the reason why performing feature selection for each individual dichotomizer appears a reasonable choice. To help the reader better understand the whole process, let us summarize the procedure (a sketch of steps 2-4 is given after this list):
1. the coding matrix is calculated;
2. the given set of samples, say S, is split in two sets (i.e., S+ and S-), in accordance with the content of the coding matrix;
3. features are ordered in descending order, starting from the highest score;
4. the set of features is reduced by selecting the first K features (where K is a given constant).¹</p>
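        <p>A minimal sketch of steps 2-4 follows, assuming NumPy, integer class labels, and a hypothetical scoring function delta_scores (any of the three measures of Section 3.1 would do):</p>
        <preformat>
import numpy as np

def relabel(y, omega, j):
    # Step 2: class i becomes +1 or -1 for dichotomizer j, as dictated by
    # column j of the coding matrix.
    return np.array([omega[yi][j] for yi in y])

def top_k_features(scores, k):
    # Steps 3-4: sort features by score, descending, and keep the first K.
    return np.argsort(scores)[::-1][:k]

# Hypothetical use, repeated for every dichotomizer j:
#   y_j = relabel(y_train, omega, j)
#   keep = top_k_features(delta_scores(X_train, y_j), K)
#   train dichotomizer j on X_train[:, keep]
        </preformat>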
        <p>Feature ranking has been performed according to three measures of discriminant capability, which are described in the next subsection.</p>
        <p>¹ Typical values of K range from 5% to 40% of the original feature space dimension.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Measures of Discriminant Capability</title>
      <p>Three measures of discriminant capability have been tested to perform feature ranking: $\chi^2$, IG, and $\delta$. The first and the second measures are well known. Let us spend a few words on the method denoted as $\delta$. It originates from the proposal of Armano [?], focused on the definition of an unbiased² two-dimensional measure space, called $\varphi$-$\delta$. In particular, $\varphi$ has been devised to measure the so-called characteristic capability, i.e., the ability of the feature at hand to be spread ($\varphi = +1$) or absent ($\varphi = -1$) over the given dataset. Conversely, $\delta$ has been devised to measure the so-called discriminant capability, i.e., the ability of the feature at hand to be in accordance ($\delta = +1$) or in discordance ($\delta = -1$) with the category under investigation. It is worth pointing out that the actual discriminant capability of a feature can be made coincident with the absolute value of $\delta$, as the ability of separating positive from negative samples is high when $|\delta| \approx 1$, regardless of whether the feature is highly covariant or highly contravariant with the given category.</p>
      <p>² In the jargon of the author, a measure is "unbiased" when it is independent from the imbalance between positive and negative samples. Notable examples in this category of measures are sensitivity and specificity.</p>
      <p>Focusing on the selected measure (i.e., $\delta$), let us recall its definition:</p>
      <p>$\delta = tp - fp$ (7)</p>
      <p>where tp and fp are respectively the true and false positive rates of the main class.</p>
      <p>A definition of this measure, in the event that samples are a corpus of documents and features are the terms (or words) found in the corresponding dictionary, is the following:</p>
      <p>$\delta(t, c) = \frac{\#(t, c)}{|c|} - \frac{\#(t, \bar{c})}{|\bar{c}|}$ (8)</p>
      <p>where t denotes a word and c a category. Moreover, $\#(t, c)$ and $\#(t, \bar{c})$ denote the number of documents belonging to the main (c) or to the alternate ($\bar{c}$) category in which t appears, respectively. Of course, $|c|$ is the number of documents of the main category and $|\bar{c}|$ the number of documents of the alternate category.</p>
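      <p>With a binary document-term matrix, Equation 8 can be computed for all terms at once; a minimal sketch, assuming NumPy and labels +1 (main category) and -1 (alternate category):</p>
      <preformat>
import numpy as np

def delta(X, y):
    # Equation 8: fraction of main-category documents containing each term,
    # minus the same fraction over the alternate category.
    X, y = np.asarray(X), np.asarray(y)
    tp = X[y == +1].mean(axis=0)  # #(t, c) / |c|
    fp = X[y == -1].mean(axis=0)  # #(t, c-bar) / |c-bar|
    return tp - fp                # in [-1, +1]; rank terms by |delta|
      </preformat>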
      <sec id="sec-4-1">
        <title>Experimental settings</title>
        <p>
          In all the experiments, the base binary classifiers were of two kinds: NB and SVM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The following datasets have been selected:
- Industry sector. A collection of web pages extracted from the web sites of companies from various economic sectors. The leaves of this hierarchy are web pages, the parent directory being an industry sector, or class. The data is publicly available at [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This dataset contains a total of 9,555 documents divided into 105 classes. A small fraction of these documents (about 20) belongs to multiple classes, but in our experiments they have been removed from the corpus. Web pages have been preprocessed to filter out the HTML code.
- 20 newsgroups. This is a well-known dataset for text classification [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It is a collection of 20,000 messages posted by the users of UseNet, the worldwide distributed discussion system. The dataset collects posted messages taken from 20 different discussion groups. Each discussion group covers a topic: 5 groups are about computers and 3 are focused on religious topics. Other topics are: politics, sports, sciences, and miscellaneous.
- Library and multimedia materials. A collection of library and multimedia materials classified manually by librarians. The dataset is a collection of recorded metadata that use the MARC format (MAchine-Readable Cataloging). MARC standards are a set of digital formats for the description of items catalogued by libraries. Each field in a MARC record provides particular information about the item the record is describing, such as author, title, publisher, date, language, media type, abstract, ISBN, and subject. In this dataset each item is classified using the Dewey decimal classification taxonomy. The dataset contains 75,207 items, of which 23,760 are duplicated (abstract, author, and some other fields are equal for duplicated items) and 11,655 are unclassified. The remaining 39,786 items, which are unique and classified, have been used in the experiments. We have performed experiments using a reduced form of the Dewey taxonomy, which considers the granularity of details from the root to the third level (the first three digits of the Dewey code). The resulting number of classes is 647.
- The four universities dataset. A collection of HTML web pages from the computer science departments of various universities [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The documents that appear therein have been collected since January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group. The dataset contains 8,282 web pages divided into 7 classes, extracted from the web sites of four universities. The dataset is organized as a directory, in which each file is an HTML page. Web pages have also been preprocessed to remove the HTML code.
        </p>
        <p>For each dataset, we first processed the text of each document by removing punctuation and stopwords.³ For each experiment, we split the dataset at hand into two randomly-selected subsets (70% for training and 30% for testing). Classification accuracy has been used as performance measure. For each test, we ran 10 experiments using different data samples, then we computed the mean and variance of the corresponding accuracies.</p>
        <p>³ As for stopwords, we used two different blacklists, one for the Italian and one for the English language, as one corpus of documents (i.e., the one concerning libraries) is in Italian.</p>
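        <p>For reference, the evaluation protocol just described can be written in a few lines (a sketch assuming scikit-learn; build_classifier is a hypothetical factory for the classifier under test):</p>
        <preformat>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate(X, y, build_classifier, runs=10, test_size=0.30):
    # 10 runs on different random 70/30 splits; report mean and variance.
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                                  random_state=r)
        clf = build_classifier().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.mean(scores), np.var(scores)
        </preformat>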
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Results</title>
      </sec>
    </sec>
    <sec id="sec-5">
<title>Comparison of base classifiers to ECOC classifiers</title>
      <p>To show the advantages of ECOC classifiers whose codeword matrix has been optimized with SA, their accuracy is reported together with the one obtained with the corresponding base classifiers. Table 1 reports experimental results (the best results are highlighted in bold). In particular, we observed that:
- ECOC classifiers generally perform better than base classifiers. However, better results are obtained with base classifiers on the four universities dataset. Let us also note that improvements are not statistically significant for the library dataset. These two datasets have in common the fact of being highly unbalanced.
- There are significant differences between the performance of the SVM and NB classifiers, and this difference also affects the performance of the corresponding ECOC classifiers.</p>
    </sec>
    <sec id="sec-6">
<title>Comparative analysis of SAE, Random and BCH ECOC</title>
      <p>In these experiments we imposed the same codeword length for all ECOC classifiers (i.e., 63 bits). The algorithms have been configured as follows:
- Random (RA): the values -1 and +1 of the matrix bits are chosen with the same probability;
- BCH: the minimum value of the corrective capacity is chosen equal to t = 6, so that the minimum distance between codewords is $d = 2t + 1 = 13$;
- SA: the initial matrix state is obtained by using the RA algorithm. The relevant parameters have been set as follows: $T_0 = f_0/5$, $T_{min} = 0.01$, $L_0 = 30$, and N = 100.</p>
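      <p>The RA baseline, and the pairwise-distance statistics discussed next, are straightforward to reproduce (a sketch assuming NumPy):</p>
      <preformat>
import numpy as np

def random_coding_matrix(m, n, seed=0):
    # RA: each bit takes the value -1 or +1 with equal probability.
    return np.random.default_rng(seed).choice([-1, 1], size=(m, n))

def pairwise_distance_stats(omega):
    # Mean and standard deviation of the Hamming distances between all
    # pairs of codewords (the mu and sigma reported below).
    m = omega.shape[0]
    d = [np.abs(omega[i] - omega[j]).sum() / 2
         for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(d)), float(np.std(d))

print(pairwise_distance_stats(random_coding_matrix(100, 104)))
      </preformat>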
      <p>We used the same training partition of each dataset to train the ECOC matrices obtained with the three different algorithms. We ran ten experiments, computing the mean and variance of the accuracy; Table 2 shows the experimental results. We also calculated the mean ($\mu$) and standard deviation ($\sigma$) of the distances between pairs of codewords for a matrix of size $100 \times 104$: the matrix calculated by the RA algorithm has $\mu = 49.96$ and $\sigma = 5.9$, whereas the one calculated by the SA algorithm has $\mu = 50.48$ and $\sigma = 2.77$. We observed that:
- SA reduces the gap between the minimum and maximum distance of codeword pairs, increasing the minimum and the mean distance while reducing the variance;
- SAE can achieve better performance than the others for most of the datasets.</p>
      <p>Selecting the best terms, able to ensure good performance in terms of time and memory consumption, plays a fundamental role in text classification, in particular when the selected corpus contains many documents and/or the corresponding dictionary contains many words. This section reports a comparative assessment of the selected score functions. Table 3 reports experimental results, the best results being highlighted in bold. In particular, we found that the ordering among score functions (from the best downwards) is the following: $\chi^2$, $\delta$, and IG.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>In this paper a novel approach for building ECOC classifiers has been proposed. The corresponding algorithm is based on simulated annealing, whose energy function is analogous to the potential of a system of charges. Experimental results show that in the configuration of minimum energy the distances between codewords have high mean and low variance. A method for feature extraction based on the coding matrix has also been presented, and three score functions for selecting words have been compared. As for future work, more detailed experiments will be made on the ability of score functions to guarantee good classification performance. In particular, the generalized version of $\delta$, able to deal with unbalanced datasets, will be tested in a comparative setting.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          .
<article-title>Error-correcting output coding for text classification</article-title>
          .
<source>IJCAI-99 Workshop on Machine Learning for Information Filtering</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Stone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Olshen</surname>
          </string-name>
          .
<article-title>Classification and regression trees</article-title>
          . CRC press,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Corana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marchesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ridella</surname>
          </string-name>
          .
<article-title>Minimizing multimodal functions of continuous variables with the simulated annealing algorithm</article-title>
          .
          <source>ACM Transactions on Mathematical Software (TOMS)</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
<fpage>262</fpage>
          -
          <lpage>280</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
<surname>DiPasquo</surname>
          </string-name>
,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          .
          <article-title>Learning to extract symbolic knowledge from the world wide web</article-title>
          .
          <source>Technical report, DTIC Document</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bakiri</surname>
          </string-name>
          .
<article-title>Solving multiclass learning problems via error-correcting output codes</article-title>
          .
          <source>arXiv preprint cs/9501101</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
6.
          <string-name>
            <given-names>R.-E.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-R.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
7.
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Goffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Ferrier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          .
          <article-title>Global optimization of statistical functions with simulated annealing</article-title>
          .
          <source>Journal of Econometrics</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ):
<fpage>65</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Inc</surname>
          </string-name>
. Industry sector dataset,
          <year>2011</year>
          . Online; accessed
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>James</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          .
<article-title>The error coding method and PICTs</article-title>
          .
          <source>Journal of Computational and Graphical Statistics</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
<fpage>377</fpage>
          -
          <lpage>387</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Lang</surname>
          </string-name>
.
          <article-title>NewsWeeder: Learning to filter netnews</article-title>
          .
          <source>In Proceedings of the 12th international conference on machine learning</source>
          , pages
<fpage>331</fpage>
          -
          <lpage>339</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
<string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
<article-title>Naive (Bayes) at forty: The independence assumption in information retrieval</article-title>
          .
          <source>In Machine learning: ECML-98</source>
, pages
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
<string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <source>Error Control Coding: Fundamentals and Applications</source>
          , volume
          <volume>1</volume>
          . Prentice Hall,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
<string-name>
            <given-names>J.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <source>C4.5: Programs for Machine Learning</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Rogati</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
<article-title>High-performing feature selection for text classification</article-title>
          .
          <source>In Proceedings of the eleventh international conference on Information and knowledge management</source>
          , pages
<fpage>659</fpage>
          -
          <lpage>661</lpage>
          . ACM,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
15.
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Sejnowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          .
<article-title>Parallel networks that learn to pronounce English text</article-title>
          .
          <source>Complex systems</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
<fpage>145</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>T.</given-names>
            <surname>Windeatt</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ghaderi</surname>
          </string-name>
          .
          <article-title>Coding and decoding strategies for multi-class learning problems</article-title>
          .
          <source>Information Fusion</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
<fpage>11</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In ICML</source>
          , volume
          <volume>97</volume>
          , pages
<fpage>412</fpage>
          -
          <lpage>420</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>