<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>A text classification framework based on optimized Error Correcting Output Code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Locci</string-name>
          <email>locci.mario@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuliano Armano</string-name>
          <email>giuliano.armano@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIEE Dept. of Electrical and Electronic Engineering, University of Cagliari</institution>
          ,
<addr-line>Piazza d'Armi 09123, Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, there has been increasing interest in using text classifiers for retrieving and filtering information from web sources. As the number of categories in this kind of software application can be high, Error-Correcting Output Coding (ECOC) can be a valid approach to perform multi-class classification. This paper explores the use of ECOC for learning text classifiers using two kinds of dichotomizers and compares them to the corresponding monolithic classifiers. We propose a simulated annealing approach to calculate the coding matrix using an energy function similar to the electrostatic potential energy of a system of charges, which makes it possible to maximize the average distance between codewords with low variance. In addition, we use a new criterion for selecting features, a feature (in this specific context) being any term that may occur in a document. This criterion defines a measure of discriminant capability and allows terms to be ordered according to it. Three different measures have been tested to perform feature ranking / selection, in a comparative setting. Experimental results show that reducing the set of features used to train classifiers does not affect classification performance. Notably, feature selection is not a preprocessing activity valid for all dichotomizers: features are selected for each dichotomizer that occurs in the coding matrix, typically giving rise to a different subset of features depending on the dichotomizer at hand.</p>
      </abstract>
      <kwd-group>
<kwd>ECOC classifiers</kwd>
        <kwd>Simulated Annealing</kwd>
        <kwd>Feature extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Multi-class classification consists of assigning a given pattern x to a category taken from a predefined set, say $c \in C$, with $C = \{c_1, c_2, c_3, \ldots, c_m\}$. Several approaches have been devised to directly handle multi-class problems (e.g., decision trees [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CART [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Other algorithms, originally designed to handle binary problems, have been extended to handle multi-class problems; multi-class support vector machines (SVM) [?] are a notable example of this strategy. Other methods turn multi-class problems into a set of binary problems. Classical examples of this approach are one-against-all and one-against-one. The former handles an m-class problem with m binary classifiers, each trained to discriminate the i-th class against the others. The latter uses a binary classifier to discriminate between each pair $\langle c_i, c_j \rangle$, $i \neq j$, of categories. In so doing, the overall number of classifiers ends up being $m(m-1)/2$.
      </p>
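      <p>As an aside, both decompositions are available off the shelf in common libraries. The following sketch is purely illustrative (it is not the code used in this paper) and assumes scikit-learn together with the 20 newsgroups corpus described in Section 4; it only shows how the number of binary classifiers grows under each scheme.</p>
      <preformat>
# Illustrative sketch: one-against-all vs. one-against-one decompositions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
X = TfidfVectorizer().fit_transform(train.data)
y = train.target

ova = OneVsRestClassifier(LinearSVC()).fit(X, y)  # m binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # m(m-1)/2 binary classifiers
print(len(ova.estimators_), len(ovo.estimators_))  # 20 and 190 for m = 20
      </preformat>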
      <p>
        An alternative approach to solving multi-class learning tasks is to adopt Error-Correcting Output Coding (ECOC). Error correcting codes are widely used in data transmission, being in charge of correcting errors when messages are transmitted through a noisy channel. A simple encoding strategy in data transmission is to add extra bits to any given message, so that the receiver will typically be able to correct it in the presence of noise. A variation of this basic principle is applied with success in the field of machine learning, to improve the performance of multi-class classifiers. The basic ECOC strategy is to assign a binary string of length n (i.e., a codeword) to each category, trying to separate each codeword from the others as much as possible. The set of codewords can also be viewed as a coding matrix, in which binary classifiers are related to columns, whereas categories are related to rows. Hence, the i-th classifier will consider samples taken from the j-th category as negative or positive depending on the value, i.e., -1 or +1, found at position $\langle i, j \rangle$ of the coding matrix. This approach was first used in the NETtalk system [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Dietterich and Bakiri [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have shown that ECOC can improve the generalization performance of both decision trees (experiments have been made with C4.5) and neural networks (using backpropagation) on several benchmark datasets. They have also shown that ECOC is robust with respect to changes in the size of training samples as well as to changes of codeword assignments. Interesting experimental results have been obtained by Berger [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on several real-world datasets of documents. The author has shown that ECOC can offer significant improvements in accuracy over conventional algorithms on three of the four datasets used for the experiments. In that work, the author used Naive Bayes (NB) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as base classifier, whereas the codeword assignments were chosen randomly.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Coding strategies</title>
      <p>Since the first ECOC systems were designed, many experiments have shown that, to achieve a good generalization capacity, codewords must be well separated, which implies that the corresponding binary classifiers are trained on different subsets of data. The most commonly used distance measure is the Hamming distance. Given two vectors x and y, with components $x_i, y_i \in \{-1, +1\}$, the Hamming distance $d(x, y)$ is defined as:</p>
      <p>$d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{2}$ (1)</p>
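      <p>Equivalently, in code (a minimal sketch, assuming NumPy and codewords with components in {-1, +1}):</p>
      <preformat>
import numpy as np

def hamming_distance(x, y):
    # Equation 1: each differing component contributes |x_i - y_i| / 2 = 1.
    return int(np.abs(np.asarray(x) - np.asarray(y)).sum() // 2)

print(hamming_distance([+1, -1, +1, +1], [-1, -1, +1, -1]))  # 2
      </preformat>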
      <p>In the training phase, n binary classifiers are trained with samples relabeled in accordance with the coding matrix, say $\Omega$. The trained classifiers produce an output vector y, $y_j$ being the output of the j-th binary classifier. The decoding strategy is to assign to the output vector y the category that corresponds to the closest codeword. In symbols (with $\omega_j$ the codeword of the j-th category and d the adopted distance function):</p>
      <p>$\arg\min_j d(\omega_j, y)$ (2)</p>
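      <p>A minimal sketch of this decoding rule follows; the 3-class, 5-bit matrix is hypothetical and only serves to show that a single flipped output bit is corrected:</p>
      <preformat>
import numpy as np

def decode(outputs, codewords):
    # Equation 2: pick the class whose codeword is closest, in Hamming
    # distance, to the vector of binary-classifier outputs.
    distances = np.abs(np.asarray(codewords) - np.asarray(outputs)).sum(axis=1) / 2
    return int(np.argmin(distances))

M = np.array([[+1, +1, +1, -1, -1],
              [-1, +1, -1, +1, -1],
              [-1, -1, +1, +1, +1]])
print(decode([+1, +1, -1, -1, -1], M))  # 0: one flipped bit, still class 0
      </preformat>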
      <p>
        ECOC has been used successfully in many application areas. It has been shown that randomly generated matrices often perform well, sometimes even better than those generated by heuristic methods. Random codes were theoretically studied in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], showing that the condition to obtain an optimal Bayes classifier is equidistance between each pair of codewords: when the random code is large enough, the ECOC classifier tends asymptotically to the optimal Bayes classifier, provided that the base classifier is an optimal Bayes classifier.
      </p>
      <p>
        Although maximizing the distance between any pair of codewords helps to remove individual classification errors, decoding errors may still occur [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The effect on the decoding error can be understood by analyzing the decoding strategy and the Bayes decision rule. An ECOC matrix performs a linear transformation between spaces: the original output q of the optimal Bayes classifier is transformed by the ECOC matrix into the corresponding output p. With q a probability vector (i.e., $q_i$ is the probability of the i-th class), the output vector p is:
      </p>
      <p>$p = \Omega^T q$ (3)</p>
      <p>When all pairs of codewords are equidistant, Equation 2 implies maximizing the posterior probability:</p>
      <p>$\arg\max_j q_j$ (4)</p>
      <p>
        An interesting class of ECOC codes is BCH (from R. C. Bose and D. K. Ray-Chaudhuri), which form a class of cyclic error-correcting codes constructed using finite fields. The key feature of this type of coding is the precise control of the number of correctable symbols. An example of an algorithm for generating BCH codes is described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This algorithm uses a polynomial of degree m to build the Galois finite field $GF(2^m)$. The length L of the binary code fulfills the following constraints: $2^{m-1} - 1 &lt; L \leq 2^m - 1$. Moreover, given the parameter t, which represents the number of correctable errors, the minimum distance between pairs of codewords is $d = 2t + 1$.
      </p>
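      <p>The two constraints are easy to check numerically; the snippet below (a plain arithmetic check, not a BCH encoder) reproduces the configuration used later in Section 6, where t = 6 and codewords are up to 63 bits long:</p>
      <preformat>
def bch_parameters(m, t):
    # Length constraint: 2^(m-1) - 1 &lt; L &lt;= 2^m - 1; minimum distance d = 2t + 1.
    return {"L_range": (2 ** (m - 1), 2 ** m - 1), "min_distance": 2 * t + 1}

print(bch_parameters(m=6, t=6))  # L in 32..63, minimum distance 13
      </preformat>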
    </sec>
    <sec id="sec-3">
      <title>Feature selection</title>
      <p>A characteristic of text categorization problems is the high dimensionality of the feature space. Each document is typically represented using a bag of words: each word being a base vector that generates the feature space, a document can be represented as a linear combination of these base vectors. A major problem is that there can be hundreds of thousands of terms even for small text collections. This amount of words is prohibitively high for many learning algorithms. Hence, reducing the original space without losing accuracy is highly desirable.</p>
      <p>
        Many methods to reduce the dimensionality of the feature space have been devised. Most of them select words according to the score obtained by means of a suitable performance measure, devised to check to which extent the word at hand is in agreement (or disagreement) with the category under analysis. $\chi^2$ [?] and Information Gain (IG) (e.g., [?]) are well-known measures used to perform feature (i.e., word) ranking. Yang and Pedersen [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] measure the goodness of a term globally, with respect to all categories on average, defining general versions of IG and $\chi^2$ for multi-class problems. In their experiments with kNN and LLSF, they found IG and $\chi^2$ most effective for aggressive term removal without losing accuracy. Rogati and Yang [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] analyzed 100 variants of five major feature selection criteria and found that the methods based on the $\chi^2$ statistics outperformed those based on other criteria. The problem of selecting features for ECOC has not been specifically addressed in the literature, even though in our view it is very important.
      </p>
      <p>The remainder of this paper is organized into five sections: Section 2 describes the proposed approach for code optimization; Section 3 introduces a selection method based on the configuration of the coding matrix; Section 4 describes the real datasets used and the experimental settings; Section 5 reports and discusses experimental results; and Section 6 ends the paper.</p>
      <sec id="sec-3-1">
        <title>The proposed Simulated Annealing approach for optimizing ECOC</title>
        <p>
          In this section, we propose a method based on simulated annealing (SA) to optimize the coding matrix. SA is a very robust algorithm, often able to find a global optimum and less likely to fail on difficult tasks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In our case, SA explores a space D of coding matrices characterized by m rows (the set of codewords) and n columns (the number of binary classifiers). Let us denote with $\Omega \in D$ the optimal (or sub-optimal) coding matrix.
        </p>
        <p>The standard SA algorithm starts with an initial temperature $T = T_0$ and moves randomly in the neighborhood of the current tentative solution $\omega$. SA is a local search algorithm, whose strategy consists of always accepting any new solution that improves the current one. However, to avoid local minima, SA may also accept worse solutions, with a probability that decreases as the temperature T is lowered. The convergence of the algorithm is guaranteed by decreasing T as the search goes on. The search continues until the maximum number of iterations has been performed or no relevant change has been observed between two consecutive steps.</p>
        <p>A solution in the neighborhood of $\omega$ is calculated by the neighbor function, described by Equation 5 (with z a uniform random variable in [0, 1) and $p_1$, $p_2$, $p_3$ given constants). In the specific setting of searching for the (sub)optimal ECOC coding matrix, a neighbor is generated from $\omega$ by i) randomly changing a bit from +1 to -1 or vice versa (if $z &lt; p_1$), ii) adding a random column vector (if $p_1 \leq z &lt; p_2$), or iii) removing a random column vector (if $p_2 \leq z &lt; p_3$):</p>
        <p>$\mathrm{neighbor}(\omega) = \begin{cases} \text{change a random bit of } \omega &amp; \text{if } z &lt; p_1 \\ \text{add a random column vector to } \omega &amp; \text{if } p_1 \leq z &lt; p_2 \\ \text{remove a random column vector from } \omega &amp; \text{if } p_2 \leq z &lt; p_3 \end{cases}$ (5)</p>
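        <p>A direct transcription of Equation 5 might look as follows (a sketch, assuming the matrix is stored as an m-row list of lists with entries in {-1, +1} and that the constants satisfy $p_1 &lt; p_2 &lt; p_3$):</p>
        <preformat>
import random

def neighbor(omega, p1, p2, p3):
    # Equation 5: mutate a copy of the coding matrix omega.
    omega = [row[:] for row in omega]
    m, n = len(omega), len(omega[0])
    z = random.random()
    if z &lt; p1:                      # i) flip one random bit
        i, j = random.randrange(m), random.randrange(n)
        omega[i][j] = -omega[i][j]
    elif z &lt; p2:                    # ii) add a random column vector
        for i in range(m):
            omega[i].append(random.choice((-1, 1)))
    elif z &lt; p3 and n &gt; 1:          # iii) remove a random column vector
        j = random.randrange(n)
        for i in range(m):
            del omega[i][j]
    return omega
        </preformat>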
        <p>In the proposed variant of the SA algorithm, the cost function is analogous to the potential energy of a system of electric charges, and is defined by Equation 6, where $\omega_i$ and $\omega_j$ are codewords of $\Omega$:</p>
        <p>$f(\omega) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \frac{1}{d(\omega_i, \omega_j)}$ (6)</p>
        <p>The ECOC optimization method that makes use of SA will be denoted as SAE hereinafter. Moreover, the SAE variant that makes use of classifiers of type $\langle x \rangle$ will be denoted SAE$\langle x \rangle$.</p>
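        <p>Putting the pieces together, a compact SAE loop could look like the sketch below. The 1/d form of the energy is our reading of the electrostatic analogy in Equation 6, and the geometric cooling schedule is an assumption (the parameter values actually used are listed in Section 6); neighbor() is the function sketched above.</p>
        <preformat>
import math, random

def energy(omega):
    # Equation 6: sum of 1/d(w_i, w_j) over codeword pairs, the analogue of
    # the Coulomb potential of a system of unit charges.
    m, f = len(omega), 0.0
    for i in range(m):
        for j in range(i + 1, m):
            d = sum(abs(a - b) for a, b in zip(omega[i], omega[j])) / 2
            f += 1.0 / d if d &gt; 0 else float("inf")
    return f

def sae(omega0, t0=1.0, t_min=0.01, steps=30, alpha=0.9, p=(0.6, 0.8, 1.0)):
    omega, f, t = omega0, energy(omega0), t0
    while t &gt; t_min:
        for _ in range(steps):
            cand = neighbor(omega, *p)
            fc = energy(cand)
            # Metropolis rule: always accept improvements; accept worse
            # solutions with probability exp(-(fc - f) / t).
            if fc &lt; f or random.random() &lt; math.exp((f - fc) / t):
                omega, f = cand, fc
        t *= alpha  # cooling schedule (assumed geometric)
    return omega
        </preformat>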
      </sec>
      <sec id="sec-3-2">
<title>ECOC-dependent feature selection</title>
        <p>As text categorization involves a very high-dimensional feature space (a typical order of magnitude is 10,000), a feature selection method is needed. Our approach is enforced after having found the coding matrix, as in our view each individual binary classifier should have its own subset of features.</p>
        <p>Many selection methods are based on the estimation of word probabilities, class probabilities, and the joint probability of words and classes. These estimates are usually computed considering only the corpus of documents, independently of the way classifiers group the data. This is reasonable if the adopted kind of classifier is inherently multi-class (e.g., NB classifiers). However, an ECOC classifier actually embodies a set of n dichotomizers (n being the length of the codewords). In particular, given a dichotomizer $g_j$, a category $c_i$ can be considered as a source of negative or positive samples, depending on which symbol appears at position $\langle i, j \rangle$ of the coding matrix (-1 for negative samples and +1 for positive samples). This is the reason why performing feature selection for each individual dichotomizer appears a reasonable choice. To help the reader better understand the whole process, let us summarize the procedure (a sketch of steps 2-4 is given after this list):
1. the coding matrix is calculated;
2. the given set of samples, say S, is split in two sets (i.e., S+ and S-), in accordance with the content of the coding matrix;
3. features are ordered in descending order, starting from the highest score;
4. the set of features is reduced by selecting the first K features (where K is a given constant).¹</p>
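        <p>A minimal sketch of steps 2-4 follows, assuming NumPy, integer class labels, and a hypothetical scoring function delta_scores (any of the three measures of Section 3.1 would do):</p>
        <preformat>
import numpy as np

def relabel(y, omega, j):
    # Step 2: class i becomes +1 or -1 for dichotomizer j, as dictated by
    # column j of the coding matrix.
    return np.array([omega[yi][j] for yi in y])

def top_k_features(scores, k):
    # Steps 3-4: sort features by score, descending, and keep the first K.
    return np.argsort(scores)[::-1][:k]

# Hypothetical use, repeated for every dichotomizer j:
#   y_j = relabel(y_train, omega, j)
#   keep = top_k_features(delta_scores(X_train, y_j), K)
#   train dichotomizer j on X_train[:, keep]
        </preformat>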
        <p>Feature ranking has been performed according to three measures of discriminant capability, which are described in the next subsection.</p>
        <p>¹ Typical values of K range from 5% to 40% of the original feature space dimension.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Measures of Discriminant Capability</title>
      <p>Three measures of discriminant capability have been tested to perform feature ranking: $\chi^2$, IG, and $\delta$. The first and the second measures are well known. Let us spend a few words on the method denoted as $\delta$. It originates from the proposal of Armano [?], focused on the definition of an unbiased² two-dimensional measure space, called $\varphi$-$\delta$. In particular, $\varphi$ has been devised to measure the so-called characteristic capability, i.e., the ability of the feature at hand to be spread ($\varphi = +1$) or absent ($\varphi = -1$) over the given dataset. Conversely, $\delta$ has been devised to measure the so-called discriminant capability, i.e., the ability of the feature at hand to be in accordance ($\delta = +1$) or in discordance ($\delta = -1$) with the category under investigation. It is worth pointing out that the actual discriminant capability of a feature can be made coincident with the absolute value of $\delta$, as the ability of separating positive from negative samples is high when $|\delta| \approx 1$, regardless of whether the feature is highly covariant or highly contravariant with the given category.</p>
      <p>² In the jargon of the author, a measure is "unbiased" when it is independent from the imbalance between positive and negative samples. Notable examples in this category of measures are sensitivity and specificity.</p>
      <p>Focusing on the selected measure (i.e., $\delta$), let us recall its definition:</p>
      <p>$\delta = tp - fp$ (7)</p>
      <p>where tp and fp are respectively the true and false positive rates of the main class.</p>
      <p>A definition of this measure, in the event that samples are a corpus of documents and features are the terms (or words) found in the corresponding dictionary, is the following:</p>
      <p>$\delta(t, c) = \frac{\#(t, c)}{|c|} - \frac{\#(t, \bar{c})}{|\bar{c}|}$ (8)</p>
      <p>where t denotes a word and c a category. Moreover, $\#(t, c)$ and $\#(t, \bar{c})$ denote the number of documents belonging to the main (c) or to the alternate ($\bar{c}$) category in which t appears, respectively. Of course, $|c|$ is the number of documents of the main category and $|\bar{c}|$ the number of documents of the alternate category.</p>
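      <p>With a binary document-term matrix, Equation 8 can be computed for all terms at once; a minimal sketch, assuming NumPy and labels +1 (main category) and -1 (alternate category):</p>
      <preformat>
import numpy as np

def delta(X, y):
    # Equation 8: fraction of main-category documents containing each term,
    # minus the same fraction over the alternate category.
    X, y = np.asarray(X), np.asarray(y)
    tp = X[y == +1].mean(axis=0)  # #(t, c) / |c|
    fp = X[y == -1].mean(axis=0)  # #(t, c-bar) / |c-bar|
    return tp - fp                # in [-1, +1]; rank terms by |delta|
      </preformat>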
      <sec id="sec-4-1">
        <title>Experimental settings</title>
        <p>
          In all the experiments, the base binary classifiers were of two kinds: NB and SVM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The following datasets have been selected:
- Industry sector. A collection of web pages extracted from the web sites of companies from various economic sectors. The leaves of this hierarchy are web pages, the parent directory being an industry sector, or class. The data is publicly available at [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This dataset contains a total of 9,555 documents divided into 105 classes. A small fraction of these documents (about 20) belongs to multiple classes, but in our experiments they have been removed from the corpus. Web pages have been preprocessed to filter out the HTML code.
- 20 newsgroups. This is a well-known dataset for text classification [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It is a collection of 20,000 messages posted by the users of UseNet, the worldwide distributed discussion system. The dataset collects posted messages taken from 20 different discussion groups. Each discussion group covers a topic: 5 groups are about computers and 3 are focused on religious topics. Other topics are: politics, sports, sciences, and miscellaneous.
- Library and multimedia materials. A collection of library and multimedia materials classified manually by librarians. The dataset is a collection of recorded metadata that use the MARC format (MAchine-Readable Cataloging). MARC standards are a set of digital formats for the description of items catalogued by libraries. Each field in a MARC record provides particular information about the item the record is describing, such as author, title, publisher, date, language, media type, abstract, ISBN, and subject. In this dataset each item is classified using the Dewey decimal classification taxonomy. The dataset contains 75,207 items, of which 23,760 are duplicated (abstract, author, and some other fields are equal for duplicated items) and 11,655 are unclassified. The remaining 39,786 items, which are unique and classified, have been used in the experiments. We have performed experiments using a reduced form of the Dewey taxonomy, which considers the granularity of details from the root to the third level (the first three digits of the Dewey code). The resulting number of classes is 647.
- The four universities dataset. A collection of HTML web pages from the computer science departments of various universities [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The documents that appear therein have been collected since January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group. The dataset contains 8,282 web pages divided into 7 classes, extracted from the web sites of four universities. The dataset is organized as a directory, in which each file is an HTML page. Web pages have also been preprocessed to remove the HTML code.
        </p>
        <p>For each dataset, we first processed the text of each document by removing punctuation and stopwords.³ For each experiment, we split the dataset at hand into two randomly-selected subsets (70% for training and 30% for testing). Classification accuracy has been used as performance measure. For each test, we ran 10 experiments using different data samples, then we computed the mean and variance of the corresponding accuracies.</p>
        <p>³ As for stopwords, we used two different blacklists, one for the Italian and one for the English language, as one corpus of documents (i.e., the one concerning libraries) is in Italian.</p>
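        <p>For reference, the evaluation protocol just described can be written in a few lines (a sketch assuming scikit-learn; build_classifier is a hypothetical factory for the classifier under test):</p>
        <preformat>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate(X, y, build_classifier, runs=10, test_size=0.30):
    # 10 runs on different random 70/30 splits; report mean and variance.
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                                  random_state=r)
        clf = build_classifier().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.mean(scores), np.var(scores)
        </preformat>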
      </sec>
      <sec id="sec-4-2">
        <title>Experimental Results</title>
      </sec>
    </sec>
    <sec id="sec-5">
<title>Comparison of base classifiers to ECOC classifiers</title>
      <p>To show the advantages of ECOC classifiers whose codeword matrix has been optimized with SA, their accuracy is reported together with the one obtained with the corresponding base classifiers. Table 1 reports experimental results (the best results are highlighted in bold). In particular, we observed that:
- ECOC classifiers generally perform better than base classifiers. However, better results are obtained with base classifiers on the four universities dataset. Let us also note that improvements are not statistically significant for the library dataset. These two datasets have in common the fact of being highly unbalanced.
- There are significant differences between the performance of the SVM and NB classifiers, and this difference also affects the performance of the corresponding ECOC classifiers.</p>
    </sec>
    <sec id="sec-6">
<title>Comparative analysis of SAE, Random and BCH ECOC</title>
      <p>In these experiments we imposed the same codeword length for all ECOC classifiers (i.e., 63 bits). The algorithms have been configured as follows:
- Random (RA): the values -1 and +1 of the matrix bits are chosen with the same probability;
- BCH: the minimum value of the corrective capacity is chosen equal to t = 6, so that the minimum distance between codewords is $d = 2t + 1 = 13$;
- SA: the initial matrix state is obtained by using the RA algorithm. The relevant parameters have been set as follows: $T_0 = f_0/5$, $T_{min} = 0.01$, $L_0 = 30$, and N = 100.</p>
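      <p>The RA baseline, and the pairwise-distance statistics discussed next, are straightforward to reproduce (a sketch assuming NumPy):</p>
      <preformat>
import numpy as np

def random_coding_matrix(m, n, seed=0):
    # RA: each bit takes the value -1 or +1 with equal probability.
    return np.random.default_rng(seed).choice([-1, 1], size=(m, n))

def pairwise_distance_stats(omega):
    # Mean and standard deviation of the Hamming distances between all
    # pairs of codewords (the mu and sigma reported below).
    m = omega.shape[0]
    d = [np.abs(omega[i] - omega[j]).sum() / 2
         for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(d)), float(np.std(d))

print(pairwise_distance_stats(random_coding_matrix(100, 104)))
      </preformat>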
      <p>We used the same training partition of each dataset to train the ECOC matrices obtained with the three different algorithms. We ran ten experiments, computing the mean and variance of the accuracy; Table 2 shows the experimental results. We also calculated the mean ($\mu$) and standard deviation ($\sigma$) of the distances between pairs of codewords for a matrix of size $100 \times 104$: the matrix calculated by the RA algorithm has $\mu = 49.96$ and $\sigma = 5.9$, whereas the one calculated by the SA algorithm has $\mu = 50.48$ and $\sigma = 2.77$. We observed that:
- SA reduces the gap between the minimum and maximum distance of codeword pairs, increasing the minimum and the mean distance while reducing the variance;
- SAE can achieve better performance than the others for most of the datasets.</p>
      <p>Selecting the best terms, able to ensure good performance in terms of time and memory consumption, plays a fundamental role in text classification, in particular when the selected corpus contains many documents and/or the corresponding dictionary contains many words. This section reports a comparative assessment of the selected score functions. Table 3 reports experimental results, the best results being highlighted in bold. In particular, we found that the ordering among score functions (from the best downwards) is the following: $\chi^2$, $\delta$, and IG.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>In this paper a novel approach for building ECOC classifiers has been proposed. The corresponding algorithm is based on simulated annealing, whose energy function is analogous to the potential of a system of charges. Experimental results show that in the configuration of minimum energy the distances between codewords have high mean and low variance. A method for feature extraction based on the coding matrix has also been presented, and three score functions for selecting words have been compared. As for future work, more detailed experiments will be made on the ability of score functions to guarantee good classification performance. In particular, the generalized version of $\delta$, able to deal with unbalanced datasets, will be tested in a comparative setting.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          .
<article-title>Error-correcting output coding for text classification</article-title>
          .
<source>IJCAI-99 Workshop on Machine Learning for Information Filtering</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Stone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Olshen</surname>
          </string-name>
          .
<article-title>Classification and regression trees</article-title>
          . CRC press,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Corana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marchesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ridella</surname>
          </string-name>
          .
<article-title>Minimizing multimodal functions of continuous variables with the simulated annealing algorithm</article-title>
          .
          <source>ACM Transactions on Mathematical Software (TOMS)</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
<fpage>262</fpage>
          -
          <lpage>280</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
<surname>DiPasquo</surname>
          </string-name>
,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          .
          <article-title>Learning to extract symbolic knowledge from the world wide web</article-title>
          .
          <source>Technical report, DTIC Document</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bakiri</surname>
          </string-name>
          .
<article-title>Solving multiclass learning problems via error-correcting output codes</article-title>
          .
          <source>arXiv preprint cs/9501101</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
6.
          <string-name>
            <given-names>R.-E.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-R.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
7.
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Goffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Ferrier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          .
          <article-title>Global optimization of statistical functions with simulated annealing</article-title>
          .
          <source>Journal of Econometrics</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ):
<fpage>65</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Inc</surname>
          </string-name>
. Industry sector dataset,
          <year>2011</year>
          . Online; accessed
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>James</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          .
<article-title>The error coding method and PICTs</article-title>
          .
          <source>Journal of Computational and Graphical Statistics</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
<fpage>377</fpage>
          -
          <lpage>387</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>K.</given-names>
            <surname>Lang</surname>
          </string-name>
.
          <article-title>NewsWeeder: Learning to filter netnews</article-title>
          .
          <source>In Proceedings of the 12th international conference on machine learning</source>
          , pages
<fpage>331</fpage>
          -
          <lpage>339</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
<string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
<article-title>Naive (Bayes) at forty: The independence assumption in information retrieval</article-title>
          .
          <source>In Machine learning: ECML-98</source>
, pages
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
<string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <source>Error Control Coding: Fundamentals and Applications</source>
          , volume
          <volume>1</volume>
          . Prentice Hall,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
<string-name>
            <given-names>J.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <source>C4.5: Programs for Machine Learning</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Rogati</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
<article-title>High-performing feature selection for text classification</article-title>
          .
          <source>In Proceedings of the eleventh international conference on Information and knowledge management</source>
          , pages
<fpage>659</fpage>
          -
          <lpage>661</lpage>
          . ACM,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
15.
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Sejnowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          .
<article-title>Parallel networks that learn to pronounce English text</article-title>
          .
          <source>Complex systems</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
<fpage>145</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>T.</given-names>
            <surname>Windeatt</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ghaderi</surname>
          </string-name>
          .
          <article-title>Coding and decoding strategies for multi-class learning problems</article-title>
          .
          <source>Information Fusion</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
<fpage>11</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In ICML</source>
          , volume
          <volume>97</volume>
          , pages
<fpage>412</fpage>
          -
          <lpage>420</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>