            Convolutional Neural Networks for
                 Subfigure Classification

 David Lyndon¹, Ashnil Kumar¹,³, Jinman Kim¹,³, Philip H. W. Leong²,³, and
                              Dagan Feng¹,³
       ¹ School of Information Technologies, University of Sydney, Australia
 ² School of Electrical and Information Engineering, University of Sydney, Australia
    ³ Institute of Biomedical Engineering and Technology, University of Sydney,
                                        Australia
                              dlyn9602@uni.sydney.edu.au
        {ashnil.kumar,jinman.kim,philip.leong,dagan.feng}@sydney.edu.au



       Abstract. A major challenge for Medical Image Retrieval (MIR) is
       the discovery of relationships between low-level image features (inten-
       sity, gradient, texture, etc.) and high-level semantics such as modal-
       ity, anatomy or pathology. Convolutional Neural Networks (CNNs) have
       been shown to have an inherent ability to automatically extract hier-
       archical representations from raw data. Their successful application in
       a variety of generalised imaging tasks suggests great potential for MIR.
       However, a major hurdle to their deployment in the medical domain is
       the relative lack of robust training corpora when compared to general
       imaging benchmarks such as ImageNET and CIFAR. In this paper, we
       present the adaptation of CNNs to the subfigure classification subtask
       of the medical classification task at ImageCLEF 2015.

       Keywords: Deep Learning, Convolutional Neural Networks, Medical
       Image Retrieval


1    Introduction
This paper documents the submissions of the Biomedical Engineering and Tech-
nology (BMET) team from the University of Sydney to the ImageCLEF 2015 [1]
Medical Classification task [2]. Specifically, BMET's work was directed at the
Subfigure Modality Classification subtask.
    The objective of our experiments was to evaluate the effectiveness of Convo-
lutional Neural Networks (CNNs) for this subtask. In particular, we propose a
deep learning framework that learns high-level representations of different image
modalities and uses these to classify the modality of each subfigure.


2    Background
Convolutional Neural Networks, a type of deep learning algorithm, have been
used to produce state-of-the-art results for a variety of machine learning tasks
such as image recognition, acoustic recognition and natural language processing
since 2012 [3–5]. CNNs share the common features of all deep learning algo-
rithms: stacked layers of neuronal subunits that learn hierarchical representa-
tions (allowing the data to be understood at various levels of abstraction, in
isolation or in combination [3]), the ability to perform unsupervised pre-training
on unlabeled data, and efficient parallelization on many-core GPUs, which can
result in improvements of up to 5000% over CPU-only implementations [4].
     A more subtle implication of deep learning is that it can automatically extract
features from raw data [3–5]. A key factor in the success of conventional machine
learning algorithms is the extraction of salient features from the raw data.
Taking image recognition as an example, a feature set such as edges or SIFT [6]
would be extracted from the raw data, and it is these new features, on their own
or in combination with the original raw data, that would be fed into the machine
learning algorithm. While some aspects of this process can be automated or
implemented with well-known algorithms, a major drawback is that it generally
requires expert domain knowledge to define which features should be used and
to evaluate their success.
     Deep learning algorithms, however, are able to directly utilise raw data in-
stead of hand-crafted features. By feeding the data sequentially through many
successive layers of subunits, the higher levels of the system are able to under-
stand the data in terms of successively more abstract representations [3].
     Medical Image Retrieval (MIR) tasks, such as the tests devised for Image-
CLEF, require learning precisely these kinds of highly abstract representations,
i.e. image modality or the anatomical semantics of the image. However, to the
best of our knowledge, deep learning is not yet a well-established method in this
domain. This is due not only to the inherent challenges of medical images [7],
but also to the fact that state-of-the-art deep learning results are typically
obtained using huge sets of labelled training data⁴ on tasks that are arguably
less subtle. As a justification for these claims, consider that the ImageNET
general object recognition corpus consists of millions of robustly labelled images
and was created with the assistance of crowdsourcing via Amazon Mechanical
Turk [9]. On the other hand, medical imaging datasets require careful labelling
by domain experts, often specialists in a particular area [7, 10, 11], and as a
result are generally much smaller.
     Large training sets are a current necessity of very deep systems because they
contain many millions of internal parameters that must be estimated from the
data. Too little data can result in the higher-level neurons' activations being
driven by idiosyncratic features of the training set rather than reflecting genuine
high-level representations. If this 'overfitting' occurs, the system's ability to
generalise to new data is severely impaired [12].
⁴ Krizhevsky et al. [8] used approximately 1.2 million labelled examples for their
  breakthrough result in ImageNET in 2012.

     In addition to the issues regarding the volume of data required, it must be
mentioned that while deep learning can automatically perform excellent feature
extraction, this comes at the significant cost of a larger number of hyperpa-
rameters that must be evaluated in order to find an optimal system [13]. For
example, compared to a commonly used machine learning algorithm such as the
Support Vector Machine (SVM), whose basic hyperparameter search space has
only the choice of kernel, the regularization constant and a kernel hyperpa-
rameter as dimensions, even the simplest implementation of a CNN requires
fundamental choices about the number and type of layers, the filter size and
number of filters per layer, and the learning rate. More advanced implementa-
tions add factors such as the unit activation function and the use of dropout.
While there are guidelines for these choices in the literature [13], the difficulty
of even a small parameter search is compounded by the increased computational
requirements of training the system.
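    To make this contrast concrete, the sketch below counts the configurations in
two small, purely illustrative grids; the specific values are assumptions chosen
to show how quickly the CNN search space grows, not grids that we evaluated.

# Illustrative hyperparameter grids; the values are assumptions for
# demonstration only, not settings used in our experiments.
svm_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1.0, 10.0],            # regularization constant
    "gamma": [0.01, 0.1],             # kernel hyperparameter
}

cnn_grid = {
    "n_conv_layers": [1, 2, 3],
    "filter_size": [5, 9, 15],
    "filters_per_layer": [20, 50],
    "learning_rate": [0.005, 0.007, 0.05],
    "activation": ["tanh", "relu"],
}

def grid_size(grid):
    """Number of distinct configurations in a full grid search."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total

print("SVM configurations:", grid_size(svm_grid))   # 12
print("CNN configurations:", grid_size(cnn_grid))   # 108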


3     Methods

3.1   Image Preprocessing

Our classifiers required uniformly sized input vectors; however, the supplied
training data varied greatly in size. We therefore square-cropped each image to
500x500px, and any dimension of the image smaller than 500px was padded with
black pixels.
    Even prior to training the CNN, we were aware that the computational re-
quirements were quite demanding and this would be exacerbated by using large
images. We resized the images to 160x160px to reduce the computational over-
head that would have been required by using higher resolution images. Good
results have been reported in the literature for complex tasks with 48x48px
images [14] and Krizhevsky et al. [8] achieved state-of-the-art general ob-
ject recognition with 256x256px images (technically, the system had an input of
224x224px, but these were subimages of the original 256x256px images).
    After resizing, the images were 160x160x3px, with the third dimension de-
scribing the three colour channels. For simplicity, and to further reduce the
computational requirements, we reduced the three-channel colour representation
to a single channel (red).
    We randomly divided the training data into a 70/30 split for training and
validation.
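
    A minimal sketch of this preprocessing pipeline and the 70/30 split is shown
below (PIL and NumPy). The cropping anchor, the function names and the fixed
random seed are our own illustrative assumptions rather than a record of the
exact code behind our submissions.

import numpy as np
from PIL import Image

def preprocess(path, crop_size=500, out_size=160):
    """Square-crop/pad to crop_size, resize to out_size, keep the red channel."""
    img = Image.open(path).convert("RGB")
    w, h = img.size

    # Crop any dimension larger than crop_size (anchor assumed at top-left),
    # then paste onto a black canvas so smaller dimensions are zero-padded.
    img = img.crop((0, 0, min(w, crop_size), min(h, crop_size)))
    canvas = Image.new("RGB", (crop_size, crop_size))   # black by default
    canvas.paste(img, (0, 0))

    # Downsample to reduce the computational cost of training.
    canvas = canvas.resize((out_size, out_size))

    # Keep only the red channel and flatten to a uniform-length vector.
    red = np.asarray(canvas, dtype=np.float32)[:, :, 0] / 255.0
    return red.ravel()

def random_split(n_examples, train_fraction=0.7, seed=0):
    """Random 70/30 split into training and validation indices."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_examples)
    cut = int(train_fraction * n_examples)
    return order[:cut], order[cut:]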


3.2   Softmax Classification

We evaluated the effectiveness of CNN-derived features by comparing them to
the results achieved by a Softmax classifier on the raw data. This experiment is
important because the CNN's final layer is the input to a Softmax classifier. It
can therefore be used to quantify the effectiveness of the automatic feature
extraction performed by the CNN.
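
    For illustration, the Softmax baseline can be written as multinomial logistic
regression trained on the flattened pixel vectors with batch gradient descent;
the NumPy sketch below is generic and is not the exact implementation used for
sf run 1 and sf run 5.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.05, epochs=1000):
    """X: (n_samples, n_features) raw pixel vectors; y: integer class labels."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)                    # predicted class probabilities
        G = (P - Y) / n                           # gradient of mean cross-entropy loss
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)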
3.3     Convolutional Neural Network

The architecture of the CNN used for our experimentation was based on a
simplified version of Yann LeCun et al.'s [15] LeNet-5⁵. This CNN is capable of
correctly classifying the MNIST handwritten digit database with 1.7% test error.
We modified the input to accommodate larger images and the output to produce
a greater number of classes. The network consists of two convolutional-pooling
layers and one fully connected hidden layer. The features output by the hidden
layer are used for classification by a Softmax classifier. The architecture of the
system is shown in Figure 1.




               Fig. 1. The architecture of the CNN used for the experiments



      The specifications of the convolutional-pooling layers are detailed in Table 1.



                   Table 1. Details of Convolutional Pooling Layers

                           Hyperparameter Layer0 Layer1
                          Number of Filters   20      50
                           Size of Filters 15x15px 15x15px
                            Max Pooling      2x2     2x2
                               Stride         1       1




⁵ http://deeplearning.net/tutorial/lenet.html
     Other hyperparameters for the CNN are detailed in Table 2.

                          Table 2. Other details for CNN

                               Hyperparameter                Value
                 Number of Units in Fully Connected Layer 500
                               Batch Size                  20
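
    As a sanity check of the hyperparameters in Tables 1 and 2, the feature-map
sizes can be traced layer by layer. The sketch below assumes 'valid' convolutions
and non-overlapping 2x2 max pooling with truncation, as in the LeNet-5 tutorial
from which the network was adapted.

def conv_pool_out(in_size, filter_size, pool=2):
    """Side length after a 'valid' convolution and 2x2 max pooling (truncated)."""
    conv_size = in_size - filter_size + 1
    return conv_size, conv_size // pool

conv0, pool0 = conv_pool_out(160, 15)    # layer 0: 146x146 conv maps, pooled to 73x73
conv1, pool1 = conv_pool_out(pool0, 15)  # layer 1: 59x59 conv maps, pooled to 29x29

flat = 50 * pool1 * pool1                # 50 feature maps of 29x29 = 42,050 values
print(conv0, pool0, conv1, pool1, flat)  # 146 73 59 29 42050

# These 42,050 values feed the 500-unit fully connected layer,
# whose output is classified by the final Softmax layer.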



    As mentioned earlier, the CNN requires a great deal of computational resources
to run. It took approximately 3.5 hours to train a single epoch for each model,
while training two models simultaneously on the CPU of a powerful system⁶.
However, the models were not able to converge before the submission deadline.
As such, the runs that we submitted were based on only partially converged
models. The details of the six runs submitted are given in Table 3.

                  Table 3. Run-specific details of all submissions.

                Submission Model Learning Rate Training Epochs
                 sf run 1 Softmax        0.05             1000
                 sf run 2  CNN           0.005             47
                 sf run 3  CNN           0.005             55
                 sf run 4  CNN           0.007             46
                 sf run 5 Softmax        0.05             1000
                 sf run 6  CNN           0.005             59




4     Validation Results
The validation error for all runs is displayed in Table 4.

                   Table 4. Validation error of all submissions.

                            Submission Validation Error
                             sf run 1         0.0%
                             sf run 2       13.85%
                             sf run 3        8.94%
                             sf run 4       10.53%
                             sf run 5         5.3%
                             sf run 6        6.53%



⁶ Azure Standard A4 VM: 8-core 2.1 GHz CPU, 14 GB RAM
4.1   Softmax Classification

The first Softmax model, trained for 1000 epochs, produced a 0% validation er-
ror. This was interpreted as the result of severe overfitting to the supplied
training data. Although this classification scheme was essentially a baseline for
evaluating the performance of CNN-extracted representations over the raw data,
it was thought prudent to perform a second run, with less training and hopefully
less overfitting, in order to see the results of a more general model. Thus, for
sf run 5 we submitted the results of training the same model for only 155 epochs,
which resulted in a 5.3% error rate on the validation set.


4.2   Convolutional Neural Networks

The validation errors displayed in Table 4 for the CNN runs (sf run 2, 3, 4 and 6)
demonstrate a clear correlation between the number of epochs the models were
trained for and their performance (decreasing validation error). This is shown
visually in Figure 2.




Fig. 2. The validation error of the models decreases as they approach convergence.
The scatter points correspond to the epoch and validation error for each of the four
test submissions.




5     Test Results

The test results for the six runs as supplied by ImageCLEF are displayed in Table
5. The CNNs demonstrated improved performance over the Softmax classifica-
tion and their accuracy approximately corresponded to the amount of training
that was performed.
                 Table 5. Test results as supplied by ImageCLEF.

                         Submission Correctly Classified
                           sf run 1        37.56%
                           sf run 2        43.62%
                           sf run 3        45.63%
                           sf run 4        44.34%
                           sf run 5        37.56%
                           sf run 6        45.00%



5.1    Softmax Classification

The test accuracy for both runs of the Softmax classifier was 37.56%. This indi-
cates that, despite the training for sf run 5 being cut short compared to sf run 1,
both models had effectively the same representation of the data when it came
to classifying the test data.


5.2    Convolutional Neural Networks

Compared to the validation results for the CNNs, the improvements with respect
to the number of training epochs are not so clear-cut. For the model trained with
a learning rate of 0.005 there is a clear improvement between the run submitted
at epoch 47 and the run submitted at epoch 55. However, the test accuracy
decreased in the run submitted at epoch 59.


6     Analysis of Results

6.1    Validation vs. Test variance

An examination of the validation error and test results in Tables 4 and 5 is
very illuminating. Clearly the 70/30 training/validation method we applied was
inappropriate in this case, as demonstrated by the significant variance between
the validation and test performance. While it is possible that the test set was
substantially different from the training set, it is more likely that the 30% chosen
for validation was not fully representative of the data. Given that the training data
was not evenly distributed across all classes, it is likely that the models overfit the
data corresponding to the more common classes and that the validation set
was heavily skewed towards the common classes. However, we still believe that
CNNs are suitable for this task despite the evidence of overfitting in this case.
Techniques for overcoming this issue are discussed in Section 7.
    Given the tremendous computational demands of the CNNs, more robust
validation procedures such as 10-fold cross-validation were clearly not feasible
with the system employed in this test. That said, it may
be possible to perform this kind of validation on a simpler model such as Softmax,
in order to discover a more indicative training/validation split. Another option
would be to take a more manual approach to splitting the sets, ensuring that all
classes are evenly represented in the validation set.
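
    A per-class (stratified) split is one simple way to build the evenly represented
validation set suggested above; the NumPy sketch below is illustrative and was not
part of our submitted pipeline.

import numpy as np

def stratified_split(labels, val_fraction=0.3, seed=0):
    """Split indices so every class contributes val_fraction of its examples
    to the validation set and the remainder to the training set."""
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        cut = int(round(len(idx) * (1.0 - val_fraction)))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)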
    It is worth noting that the models used for testing were only trained on the
70% training split. In future, we can expect better results by retraining the best
model (based on some validation metric) on the entire dataset.


6.2   CNN Training

As alluded to earlier, although the CNNs did not converge during training,
they may have already begun to overfit the training data, with the result that
the test performance actually decreased for the model at epoch 59 compared to
the model at epoch 55. However, this is not entirely certain, as it is also possible
that the model at epoch 59 was a better fit for the validation data (Table 4)
at that point, but simultaneously a worse fit for the test data. Had the models
been able to train for longer, we may have had a clearer indication of their true
performance.


6.3   CNN-Learnt Features

The CNNs were able to extract improved representations from raw data without
the requirement for domain knowledge. This is an important result both for
this task and for MIR generally, as it suggests that there is potential in using
CNNs or other deep learning strategies as a 'black box', whereby excellent
machine learning performance can be achieved without the need for expert-
designed feature extraction or domain knowledge.
    We would have liked to train the networks further, but needed to halt training
prematurely for the purposes of submission. We believe that additional training
would yield better results.


7     Perspectives for Future Work

We believe that these results can be significantly improved upon by making
use of a variety of techniques. Primarily, we would want to explore training
the CNNs using GPUs, as this would allow us to expand our hyperparameter and
architecture search. Rectified Linear Units (ReLUs), as opposed to the Tanh units
used in our network, are also known to improve training performance [16, 17].
    Although this network is very capable of learning quality representations
of the MNIST dataset, it is both less deep and less dense than networks used
to achieve state-of-the-art results in more sophisticated tasks [8]. For instance,
Krizhevsky et al. [8] used a network with 2 convolutional-max pooling layers, 3
convolutional layers and 3 fully connected layers, all of which were more neuron-
dense than ours, to achieve their result in ImageNET 2012. Improved training
performance will allow us to implement a larger and deeper network along these
lines.
    Larger and deeper networks introduce issues with overfitting, but we believe
this can be controlled using well-tried techniques such as dropout [8, 12, 18], data
augmentation [8, 19] and unsupervised pretraining [20, 21].
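
    As an illustration of the data augmentation mentioned above, simple label-
preserving transformations such as horizontal flips and small random crops can
multiply the effective size of the training set. The parameters below (crop window,
flip probability) are assumptions for the sketch, not values we have evaluated.

import numpy as np
from PIL import Image

def augment(img, rng, crop=144, out=160):
    """Return one randomly transformed copy of an out x out PIL image."""
    # Random horizontal flip (label-preserving for modality classification).
    if rng.rand() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    # Random crop of a slightly smaller window, resized back to the input size.
    x = rng.randint(0, out - crop + 1)
    y = rng.randint(0, out - crop + 1)
    return img.crop((x, y, x + crop, y + crop)).resize((out, out))

rng = np.random.RandomState(0)
# Several augmented copies of each training image could be generated per epoch.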
    Finally, the validation method we utilised for these experiments did not pro-
duce an accurate understanding of the performance of our systems. In approach-
ing this task in future we would be careful to construct a more representative
validation set or use the 2015 test data for validation.

Acknowledgements
This work was supported in part by a Microsoft Azure for Research grant, which
provided the cloud infrastructure to conduct our experiments.


References
1. M. Villegas, H. Müller, A. Gilbert, L. Piras, J. Wang, K. Mikolajczyk, A. G. S. de
   Herrera, S. Bromuri, M. A. Amin, M. K. Mohammed, B. Acar, S. Uskudarli, N.
   B. Marvasti, J. F. Aldana, and M. del Mar Roldán García, General Overview of
   ImageCLEF at the CLEF 2015 Labs, Springer International Publishing, 2015.
2. A. García Seco de Herrera, H. Müller, and S. Bromuri, Overview of the ImageCLEF
   2015 medical classification task, in Working Notes of CLEF 2015 (Cross Language
   Evaluation Forum), 2015.
3. Y. Bengio, A. Courville, and P. Vincent, Representation learning: a review and new
   perspectives, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828,
   Aug. 2013.
4. Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, pp.
   436–444, May 2015.
5. J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw., vol.
   61, pp. 85–117, Jan. 2015.
6. D. G. Lowe, Object recognition from local scale-invariant features, in Computer
   Vision, 1999. The Proceedings of the Seventh IEEE International Conference on,
   1999, vol. 2, pp. 1150–1157.
7. A. Kumar, J. Kim, W. Cai, M. Fulham, and D. Feng, Content-based medical image
   retrieval: a survey of applications to multidimensional and multimodality data, J.
   Digit. Imaging, vol. 26, no. 6, pp. 1025–1039, Dec. 2013.
8. A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep
   Convolutional Neural Networks, in Advances in Neural Information Processing Sys-
   tems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran
   Associates, Inc., 2012, pp. 1097–1105.
9. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-
   scale hierarchical image database, in Computer Vision and Pattern Recognition,
   2009. CVPR 2009. IEEE Conference on, 2009, pp. 248–255.
10. J. Kalpathy-Cramer, A. G. S. de Herrera, D. Demner-Fushman, S. Antani, S.
   Bedrick, and H. Müller, Evaluating performance of biomedical image retrieval
   systems: an overview of the medical image retrieval task at ImageCLEF 2004–2013,
   Comput. Med. Imaging Graph., vol. 39, pp. 55–61, 2015.
11. H. Müller, N. Michoux, and D. Bandon, A review of content-based image retrieval
   systems in medical applications: clinical benefits and future directions, Int. J. Med.
   Inform., vol. 73, no. 1, pp. 1–23, 2004.
12. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
   Dropout: A Simple Way to Prevent Neural Networks from Overfitting, J. Mach.
   Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014.
13. Y. Bengio, Practical recommendations for gradient-based training of deep archi-
   tectures, arXiv [cs.LG], 24-Jun-2012.
14. D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, A committee of neural net-
   works for traffic sign classification, in Neural Networks (IJCNN), The 2011 Interna-
   tional Joint Conference on, 2011, pp. 1918–1921.
15. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied
   to document recognition, Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
16. V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann
   machines, in Proceedings of the 27th International Conference on Machine Learning
   (ICML-10), 2010, pp. 807–814.
17. A. L. Maas, A. Y. Hannun, and A. Y. Ng, Rectifier Nonlinearities Improve Neural
   Network Acoustic Models, JMLR W&CP, vol. 28, 2013.
18. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov,
   Improving neural networks by preventing co-adaptation of feature detectors, arXiv
   [cs.NE], 03-Jul-2012.
19. S. Dieleman, Classifying plankton with deep neural networks. [Online]. Available:
   http://benanne.github.io/2015/03/17/plankton.html. [Accessed: 30-May-2015].
20. X. Glorot, A. Bordes, and Y. Bengio, Domain adaptation for large-scale sentiment
   classification: A deep learning approach, in Proceedings of the 28th International
   Conference on Machine Learning (ICML-11), 2011, pp. 513–520.
21. Y. Bar, I. Diamant, L. Wolf, and H. Greenspan, Deep learning with non-medical
   training used for chest pathology identification, in SPIE Medical Imaging, 2015, p.
   94140V.