<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural pairwise classification models created by ignoring irrelevant alternatives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ondrej Šuch</string-name>
          <email>ondrejs@savbb.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Kontšek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tinajová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mathematical Institue, Slovak Academy of Sciences 955 01 Banská Bystrica</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Žilinská Univerzita v Žiline</institution>
          ,
          <addr-line>Univerzitná 8215/1, 010 26 Žilina</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>It is possible to construct multiclass classification models from binary classifiers trained in pairwise (one-on-one) manner. Important examples of models created in this way are support vector machines applied to multiclass problems. In this work we examine feasibility of this approach for convolutional neural networks. We examine multiple ways to train pairwise classification networks for MNIST dataset, and multiple ways to combine them into a multiclass classifier for MNIST classification problem. Our experimental results show definite promise of this approach, especially in reducing complexity of deep neural networks. An important unresolved question of our approach is how to choose the best pairwise network to include into a full multi-class model.</p>
      </abstract>
      <kwd-group>
        <kwd>MNIST</kwd>
        <kwd>convolutional network</kwd>
        <kwd>pairwise coupling</kwd>
        <kwd>one-on-one classification</kwd>
        <kwd>binary classification</kwd>
        <kwd>dropout</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Deep neural networks are currently the most powerful type
of classifiers applicable for a multitude of machine
learning problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Perhaps their biggest drawback is their
complexity, which manifests in multiple ways. First, they
require preparation and use of large datasets to attain the
best precision [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They need a lot of specialized
computing power for training [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the training process may
take a long time. Finally, the classification process is
obscured by their complexity, which makes it harder to
understand their weaknesses and guarantee performance on
unseen instances. In this article we consider the question,
whether the classification process using deep neural
networks could be made more modular, alleviating the
drawbacks resulting from the complexity.
      </p>
      <p>
        The approach is inspired by research on support
vector machines. Support vector machines were proposed by
Vapnik as a general purpose classifier [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They are still
popular to this date for a variety of classification tasks.
Since SVM work by dividing feature space (or its
embedding into a higher-dimensional space [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) into two parts
by a hyperplane, they are naturally suited for two-class
problems. Multi-class classification with SVM is
accomplished by training SVM for pairs of classes, then fitting
sigmoid to obtain pairwise probabilities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and finally
using a pairwise coupling method to obtain multiclass
prediction probabilities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The same approach can be applied to deep neural
networks. An additional simplification compared to SVM is
that the step of fitting sigmoid is not necessary, since
neural networks typically use soft-max for their final layers
that directly output prediction probabilities. On the other
hand, compared to SVM, the process of training of neural
networks allows for many more hyperparameters.</p>
      <p>
        In this paper we will carry out the process on MNIST
digit classification task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] illustrating the potential and
possible pitfalls of the approach (Figure 1). Even for the
basic MNIST task we had to severely limit the number
of investigated training procedures. A key restriction we
adopted is that we trained the two-class networks only
with examples belonging to the corresponding two classes.
Such approach, dubbed ‘ignoring irrelevant alternatives’,
promises to speed up the training process for the two-class
networks by reducing the size of the training dataset as
well as to cut the time needed to train the whole multi-class
model. Moreover, it is philosophically consistent with the
assumption of the independence of irrelevant alternatives
in the softmax layer commonly used in neural networks.
      </p>
      <p>
        Our restriction, and pairwise decomposition of
multiclass classification itself are not without potential
problems. A key issue is that of extrapolating prediction
probabilities to classes that a two-class classifier has never seen
during training (e.g. the insight of G. Hinton described in
the work of Hastie and Tibshirani [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Another potential
problem to guard against is that the proposed
classification scheme may require much more parameters since as
much as 10 x 9 / 2 = 45 neural networks need to be trained
instead of one.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology outline</title>
      <p>
        MNIST dataset of handwritten digits (zero to nine) is a
widely used benchmark task in which convolutional
networks proved quite successful [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It consists of 60000
training samples and 10000 testing samples. Throughout
this work we will use 8-layer feed-forward networks
derived from the network structure of the sample network
defined for solving MNIST classification in Keras
framework [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The network has eight layers totaling 1,199,882
parameters. The underlying tensor shapes and number of
parameters are given in the Table 1. The optimization
criterion is crossentropy loss and the model is trained with
Adadelta optimizer.
      </p>
      <p>The networks we used differed from this one by
changing the dropout probabilities in layers dropout_1
and dropout_2 and the number of kernels in conv2d_1,
conv2d_2 and the number of neurons dense_1 layer.
Moreover, since we use smaller training datasets, we
compensated by increasing the number of training epochs to
24 from 12.</p>
      <p>The network achieves about 99.13% average success
rate classifying all digits (Fig. 2 left). The trained
instance of the network can be viewed also as a two-class
classifier for any pair of digits. The right part of Figure 2
summarizes measured test accuracy as a 2-7 classifier.</p>
      <p>
        For the coupling method we used the method of
WuLin-Weng [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is commonly applied in SVM
libraries [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The method is nontrainable i.e. there are no
adjustable parameters.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Models using large CNN</title>
      <p>By a large CNN studied in this section we mean a
convolutional network that differs from the Keras sample
network only in the values of the dropout parameters in layers
dropout_1 and dropout_2. Moreover, all large CNN
networks were trained only on pairs of digits, and the number
of training epochs was increased to 24.
3.1</p>
      <sec id="sec-3-1">
        <title>Dependence on dropout parameters</title>
        <p>
          Dropout layers serve to suppress overfitting [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The
number of parameters of Keras neural networks greatly
outnumbers the number of training samples, and without
dropout it would be difficult to achieve high accuracy
reliably. The optimal value of dropout probabilities may be
different from the original Keras network, since a) the task
has changed, and so did the training set, b) we increased
the number of epochs. Therefore we systematically
sampled dropout parameter space and measured mean
performance of the network. The results are shown in Figure
3. Overall the performance did not vary too much.
Moreover the Figure implies that larger values of dropout are
appropriate for the first dropout layer and smaller values
of dropout are appropriate for the the second dropout layer.
Notable however is that training only on relevant examples
i.e. samples of twos and sevens resulted in lower
performance than obtained using original Keras network (Fig. 2
right) for any choice of dropout parameters.
2
_
t
u
poo40
r
d
20
Coupling models using to build complete pairwise models
have built-in redudancy. This redundancy has been shown
to ameliorate subpar performance of individual pairwise
classifiers in simulations with synthetic data [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is
therefore worthwhile to evaluate the performance of
complete pairwise models. We did so for several choices of
dropout parameters: one for the values provided in the
original Keras network and the rest for dropout values
proved moderately better in classifying 2’s vs. 7’s (Figure
3). Only one experiment for each value was conducted,
since the number of parameters of the complete model is
impractcally large and the results are only for establishing
a comparison benchmark. The results are shown in Table
2. We can surmise that the performance dropped slightly
(by 0.1%) compared to the Keras network, despite the fact
that at the same time the number of parameters increased
45-fold.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Models built using small CNN</title>
      <p>By a small CNN studied in this section we mean a neural
network that differs from the Keras network by having
using 4 kernels at conv2d_1 layer (rather than 32), 4 kernels
at conv2d_2 layers (rather than 64), 16 neurons at dense_1
layer (rather than 128), and possibly having a different
dropout probabilities for layers dropout_1 and dropout_2.
This ad hoc choice was made so that the network uses only
9,454 parameters, so a complete model built from such
networks would use 425,430 parameters, which is about
35% of the size of the original Keras network. Again, all
small CNN networks considered within this section were
trained only on samples for pairs of digits and the number
of epochs was increased to 24.
4.1</p>
      <sec id="sec-4-1">
        <title>Dependence on dropout parameters</title>
        <p>Since the small CNN networks have so few parameters
one may hypothesize that dropout is not needed to
prevent overfitting. In order to verify this hypothesis we again
systematically explored varying dropout values as shown
in Figure 4. The results are consistent with the hypothesis.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Average performance of pairwise models</title>
        <p>In view of the dropout experiments further small networks
in this section were trained without dropout. We created
two series of series of sets of networks: (A) trained on
all available training samples for corresponding digits, and
(B) setting aside 10% of the training samples as a
validation dataset, to be used for model selection in Section 4.3.
Each series consisted of 20 sets, and each set contained 45
pairwise networks corresponding to every pair of distinct
digits. The mean error rates are shown in Table 3.
Having trained multiple sets of pairwise models, we
proceeded to complete multi-class MNIST classification
models using pairwise coupling. There are two distinct
ways how to select pairwise networks into a complete
model – we can take all networks from a single set, or
we can choose for each pair of digits a model from
various sets for which we expect the smallest error. There
are again multiple ways to do the latter – we may choose
based on its (pairwise) error rate, or its (pairwise) loss, and
we may measure those on either the training set or the
validation set. The results for various combinations of these
paarameters are shown in Table 4.</p>
        <p>From the table we can see that the best result was error
rate 1.24% achieved by a model built from A series of
networks which were trained on the full training set without
setting aside a validation subset.</p>
        <p>In order to contemplate the hypothesis that none of these
selection criteria is reliable (e.g. if sever overfitting occurs
on the train set), we can also compare selection based on
the results on the test pairwise dataset. The results for this
last choice are of course not indicative of real performance
of a method, but could be considered as an estimate of an
upper bound based on a hypothetical method for
choosing the best individual pairwise classifiers. The results
shown in Table 5 indicate that there would be significant
improvement, which would rival the precision achieved by
the original Keras network (see Fig. 1 left).
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Our experimental results are tantalizing. At the end of
Section 4 we were able to exhibit neural pairwise
classification models with weights trained only on training
data from MNIST that show comparable performance to
Keras classification network, yet needing only 35% of its
weights. Moreover, they are modular and individual
networks in the models can be trained in parallel. However,
we are unable to provide an algorithm that would arrive at
such classification models.</p>
      <p>Similarly there are plenty of large CNN networks of
the structure considered in this paper that top
classification rate 99.5% on the problem of classifying twos from
sevens in MNIST dataset. Yet, we were not able to find a
combination of dropout hyperparameters that would yield
the same performance, when training the network only on
twos and sevens. The unsettling suggestion is that it is
advantageous to use irrelevant examples for training deep
convolutional neural networks.</p>
      <p>It seems more attention should be paid to binary
image classification problems with convolutional networks.
pair Series</p>
      <p>of trained
digits without
validation
subset
train
0-1 0.02
0-2 0.03
0-3 0.04
0-4 0.02
0-5 0.02
0-6 0.13
0-7 0.02
0-8 0.09
0-9 0.13
1-2 0.08
1-3 0.03
1-4 0.09
1-5 0.01
1-6 0.02
1-7 0.16
1-8 0.13
1-9 0.09
2-3 0.08
2-4 0.03
2-5 0
2-6 0.01
2-7 0.12
2-8 0.08
2-9 0.03
3-4 0.01
3-5 0.12
3-6 0.01
3-7 0.05
3-8 0.05
3-9 0.10
4-5 0.00
4-6 0.02
4-7 0.03
4-8 0.05
4-9 0.25
5-6 0.19
5-7 0.00
5-8 0.12
5-9 0.08
6-7 0.00
6-8 0.08
6-9 0.00
7-8 0.12
7-9 0.08
8-9 0.14
Average 0.07
test
0.06
0.29
0.10
0.09
0.27
0.44
0.25
0.45
0.27
0.29
0.13
0.02
0.13
0.18
0.25
0.18
0.24
0.29
0.28
0.09
0.32
0.72
0.43
0.36
0.12
0.53
0.07
0.41
0.35
0.45
0.04
0.28
0.16
0.20
0.77
0.63
0.14
0.36
0.35
0.13
0.39
0.26
0.39
0.57
0.53
0.29</p>
      <p>
        Series
trained
validation
subset
On the surface they may seem trivial and impractical, but
one may expect niche applications such as classification of
medical images [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] would naturally have only two classes
to consider. Binary classification problems of images may
also be more amenable to theoretical analysis.
      </p>
      <p>The pairwise models considered in this paper were
formed from share-nothing networks. The hope was that
the pairwise coupling method would be able to
counteract their smaller individual capacity/precision and achieve
equal or better multiclass classification performance by
combining and extracting diversity of information
contained within these networks.</p>
      <p>
        Share-nothing approach is not viable for problems with
larger number of categories than MNIST. There are other
approaches to pairwise multi-class classification with
neural networks that use partial sharing of learning capacity
by pairwise classifiers. It is also possible to create
ensemble models without pairwise coupling [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. We hope
to investigate these questions in future works.
training
dataset
training
dataset
validation
dataset
training
dataset
training
dataset
testing
dataset
criterion
measured
      </p>
      <p>on
testing
dataset
testing
dataset</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>Work on this paper was partially supported by grants
VEGA 2/0144/18 and APVV-14-0560. We also thank
Google Inc. for providing us with education credit
#32870744 that we used on Google Compute cloud
service.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>LeCun</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          <volume>521</volume>
          , (
          <year>2015</year>
          ),
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Hestness</surname>
            <given-names>J.</given-names>
          </string-name>
          et al,
          <article-title>Deep learning is predictable, empirically</article-title>
          , arXiV:
          <fpage>1712</fpage>
          .
          <fpage>00409</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jouppi</surname>
            <given-names>N.P.</given-names>
          </string-name>
          et al,
          <article-title>In-datacenter performance analysis of a tensor processing unit,"</article-title>
          <source>in Proceedings of ISCA'17</source>
          , Toronto, ON, Canada, (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cortes</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V.N.</given-names>
          </string-name>
          ,
          <article-title>Support-vector networks</article-title>
          ,
          <source>Machine learning</source>
          ,
          <volume>20</volume>
          , (
          <issue>3</issue>
          ), (
          <year>1995</year>
          ),
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Boser</surname>
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guyon</surname>
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V.N.</given-names>
          </string-name>
          ,
          <article-title>A training algorithm for optimal margin classifiers</article-title>
          ,
          <source>Proceedings of the fifth annual workshop on Computational learning theory - COLT</source>
          '
          <fpage>92</fpage>
          , (
          <year>1992</year>
          ), p.
          <fpage>144</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Platt</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Probabilities for SVM machines, in Advances in Large Margin classifiers, Smola A</article-title>
          .J.,
          <string-name>
            <surname>Bartlett</surname>
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholkopf</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurmans</surname>
          </string-name>
          , D., Eds. MIT Press, (
          <year>2000</year>
          ),
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C-C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>C-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LIBSVM:</surname>
          </string-name>
          <article-title>A library for support vector machines</article-title>
          ,
          <source>ACM Transactions on Intelligent System and Technology</source>
          , vol
          <volume>2</volume>
          ., issue
          <volume>3</volume>
          , (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <source>The MNIST database of handwritten digits</source>
          , (
          <year>1998</year>
          ), http://yan.lecun.com/exdb/mnist
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hastie</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Classification by pairwise coupling</article-title>
          , http://fisher.utstat.toronto.edu/pub/tibs/coupling.ps
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Chollet</surname>
            <given-names>F.</given-names>
          </string-name>
          et al, Keras, (
          <year>2015</year>
          ), https://keras.io
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Wu</surname>
            <given-names>T-F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>C-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <article-title>"Probability estimates for multi-class classification by pairwise coupling"</article-title>
          ,
          <source>Jounral of Machine Learning Reserach</source>
          , (
          <year>2004</year>
          ),
          <fpage>975</fpage>
          -
          <lpage>1005</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Srivastava</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          ,
          <source>Jounral of Machine Learning Research</source>
          ,
          <volume>15</volume>
          , (
          <year>2014</year>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Litjens</surname>
            <given-names>G.</given-names>
          </string-name>
          et al,
          <article-title>A survey of deep learning in medical image analysis</article-title>
          ,
          <source>Medical Image Analysis</source>
          ,
          <volume>42</volume>
          , (
          <year>2017</year>
          ),
          <fpage>60</fpage>
          -
          <lpage>88</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Hartono</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>Ensemble of Perceptrons with Confidence Measure for Piecewise Linear Decomposition</article-title>
          ,
          <source>IEEE Int. Joint Conf. on Neural Networks (IJCNN</source>
          <year>2011</year>
          ), (
          <year>2011</year>
          ),
          <fpage>648</fpage>
          -
          <lpage>653</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hartono</surname>
          </string-name>
          ,
          <article-title>Ensemble of Linear Experts as an Interpretable Piecewise Linear Classifier</article-title>
          ,
          <source>Innovative Computing, Information and Control Express</source>
          Letters Vol.
          <volume>2</volume>
          , No.
          <volume>3</volume>
          , (
          <year>2008</year>
          ),
          <fpage>295</fpage>
          -
          <lpage>303</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>