<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IPL at CLEF 2016 Medical Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonidas Valavanis</string-name>
          <email>valavanisleonidas@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Spyridon Stathopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theodore Kalamboukis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Processing Laboratory, Department of Informatics, Athens University of Economics and Business</institution>
          ,
          <addr-line>76 Patission Str, 10434, Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we present the image classification techniques used by the IPL Group for the subfigure classification subtask of the ImageCLEF 2016 Medical Task. For the visual representation of images we adopted several state-of-the-art visual features, such as bag-of-visual-words histograms computed with the pyramid-histogram-of-visual-words (PHOW) descriptor and a quad-tree bag-of-colors model. We present the results of our runs and of our extensive experiments applying early or late fusion to the outputs of a multi-class linear-kernel support vector machine. Our top run was ranked 3rd among 34 runs.</p>
      </abstract>
      <kwd-group>
        <kwd>pyramid-histogram of visual words</kwd>
        <kwd>bag of visual words</kwd>
        <kwd>bag of colors</kwd>
        <kwd>early-fusion</kwd>
        <kwd>late-fusion</kwd>
<kwd>textual classification</kwd>
        <kwd>support vector machines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction</title>
<p>Image classification is perhaps the most important and challenging task within the field of computer vision, with applications in several domains. A broad area of image-processing approaches is driven by image classification: the automated assignment of unknown images to a set of predefined categories.</p>
      <p>
        In the medical domain, Content-Based Image Retrieval (CBIR) plays an important role in supporting diagnosis, treatment and teaching [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Visual image classification into a relatively small number of classes has been shown to deliver good results in several benchmarks, and approaches combining both visual and textual techniques have proved promising in medical image classification tasks. Here we should mention the substantial contribution of the ImageCLEFmed task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which has focused on medical image CBIR and classification tasks for over a decade.
      </p>
      <p>
        The ImageCLEF 2016 Medical Task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] consists of five subtasks: compound figure detection, figure separation, multi-label classification, subfigure classification, and caption prediction. Subfigures extracted from compound images are classified into 30 heterogeneous classes, ranging from diagnostic images to various biomedical illustrations. Some image categories are represented by only a few training examples, so enriching the original collection was necessary to counteract the imbalanced dataset. In past editions of the contest there was a large class of compound images containing sub-images of several modalities, which made it difficult to train a classifier. This year there are no compound images in the subfigure classification subtask. However, both the training and the test sets remain unbalanced, with one very large category (GFIG, 2085 images) and some categories containing just a few images (GPLI, 2; DSEE, 3).
      </p>
      <p>
        This year our group participated only in the subfigure classification subtask. Details of the task can be found in the overview paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and on the web page of the contest (http://www.imageclef.org/2016/medical). Our approach to classification is based on merging two well-known models, the BoW model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a generalized version of the bag-of-colors (BoC) model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], combined with early or late fusion, which gave us the third best performing position.
      </p>
      <p>In the next section we present a detailed description of the modelling techniques and the data fusion used. Section 4 describes the classification tools and parameters, as well as the submitted runs and our results. Finally, Section 5 concludes our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Image Visual Representation</title>
      <p>Inspired by text retrieval, the bag-of-visual-words (BoW) approach has shown promising results in the field of image retrieval and classification. In this vein, we based our approach on the BoW model for the image classification task. In this section we describe the methodology used for the visual and textual representation of images.</p>
      <sec id="sec-2-1">
        <title>2.1 Pyramid Histogram of Visual Words (PHOW)</title>
        <p>
          PHOW is an extension of the BoW model used for image classification. In this model we identify small regions (local interest points), known as salient image patches, that carry rich local information about the image. To extract such keypoints, the SIFT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] or Dense SIFT [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] descriptors are employed. However, the number of features extracted from local interest points may vary from image to image. To obtain a fixed number of feature dimensions, a visual codebook is created by clustering the extracted local interest points of a number of sample images with the k-means clustering algorithm. Each cluster (visual word) represents a different local pattern shared by similar interest points. The histogram of an image is created by a vector quantization that assigns each keypoint to its closest cluster (visual word) [8]. However, the BoW model loses the spatial information of the local descriptors in the clustering step, which severely limits their discriminative power. The Pyramid Histogram of Visual Words (PHOW) addresses this problem by dividing the image into increasingly fine sub-regions of equal size, called pyramids. The histogram of visual words is computed in each local sub-region of the image, and the sub-region histograms are then concatenated into a single feature vector [9]. For our experiments, we partition the image into 2×2 and 4×4 sub-regions and combine the generated quantizations. As for the size of the visual codebook, after experimentation with several values we selected 1536 visual words. Each image is thus represented by a vector of 30720 features (2×2×1536 + 4×4×1536).
        </p>
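        <p>To make the spatial-pyramid pooling concrete, here is a minimal sketch (our own illustrative code with made-up keypoints, not the actual PHOW implementation): each keypoint carries image coordinates and a visual-word id from the k-means codebook, and per-cell histograms over the 2×2 and 4×4 grids are concatenated.</p>

```python
# Sketch of spatial-pyramid histogram pooling (assumed simplification of PHOW).
# Each keypoint is (x, y, word_id), with word_id from a k-means visual codebook.

def pyramid_histogram(keypoints, width, height, vocab_size, grids=(2, 4)):
    """Concatenate per-cell visual-word histograms over g x g grids."""
    features = []
    for g in grids:
        cells = [[0] * vocab_size for _ in range(g * g)]
        for x, y, word in keypoints:
            col = min(int(x * g / width), g - 1)   # cell column of the keypoint
            row = min(int(y * g / height), g - 1)  # cell row of the keypoint
            cells[row * g + col][word] += 1
        for hist in cells:
            features.extend(hist)
    return features

# With a 1536-word codebook and 2x2 plus 4x4 grids the vector has
# (4 + 16) * 1536 = 30720 dimensions, matching the paper.
kps = [(10, 10, 0), (90, 90, 1)]                   # hypothetical keypoints
vec = pyramid_histogram(kps, width=100, height=100, vocab_size=1536)
```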
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Quad-Tree Bag-of-Colors Model (QBoC)</title>
        <p>
          In the BoC model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] a color vocabulary is learned from a subset of the image collection. This vocabulary is used to extract a color histogram for each image. Experiments have shown that a learned color vocabulary improves retrieval performance over a flat quantization of the color space. Furthermore, this model has been successfully fused with the SIFT descriptor into a compact binary signature [10], further increasing classification performance. The BoC model was used for the classification of biomedical images in [11], where it was shown to combine successfully with the BoW-SIFT model in a late fusion manner. As with the BoW model, the main drawback of BoC is the lack of spatial information. Moreover, the construction of the vocabulary, and in particular the selection of its size, is another weak point of the algorithm. To address this problem, we have extended the BoC model by applying a quad-tree decomposition of the images [12]. Quad-tree decomposition sub-divides an image into regions of homogeneous color: at each step the image is split into four equal-size squares, and the process continues until a sub-region of size 1×1 pixel is reached (see figure 1b). To speed up the pre-processing of the images, the quad-tree decomposition may stop at sub-regions of 2×2 pixels. Similar colors within a sub-region are quantized to the same color; this is tuned with an extra parameter, which was set to 0.15 in all our runs. In both models the TFIDF weights of the visual words were calculated and the image vectors were normalized with the L1 norm.
        </p>
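        <p>The recursive splitting can be sketched as follows (an illustrative simplification on a single-channel toy image, not the actual QBoC code): a region is kept as a leaf when its values are homogeneous within the threshold or it reaches 1×1 pixels; otherwise it is split into four equal quadrants.</p>

```python
# Minimal quad-tree decomposition sketch (assumed simplification of the QBoC
# pre-processing), on a square single-channel image given as a list of rows.

def quadtree_regions(img, x=0, y=0, size=None, threshold=0.15):
    """Return the (x, y, size, mean_value) leaves of the quad-tree."""
    if size is None:
        size = len(img)
    vals = [img[y + i][x + j] for i in range(size) for j in range(size)]
    if size == 1 or max(vals) - min(vals) <= threshold:
        return [(x, y, size, sum(vals) / len(vals))]   # homogeneous leaf
    h = size // 2
    leaves = []
    for dy in (0, h):                                  # split into 4 quadrants
        for dx in (0, h):
            leaves += quadtree_regions(img, x + dx, y + dy, h, threshold)
    return leaves

# 4x4 image, uniform except one bright pixel: only that quadrant splits
# further, so we get 3 coarse leaves plus 4 fine ones.
img = [[0.0] * 4 for _ in range(4)]
img[0][0] = 1.0
leaves = quadtree_regions(img)
```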
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Textual Representation</title>
      <p>The text representation of the subfigure images is derived from the captions of their corresponding compound figures: the caption of a compound figure is assigned to all its constituent subfigures. This makes it difficult to distinguish between sub-images and is a point to be improved in the future. For text retrieval we used the vector space model with TFIDF term weights. While we did not submit runs using textual information due to a misunderstanding, experimentation outside the competition showed that stemming significantly degrades categorization performance (see section 4.3).</p>
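      <p>The TFIDF weighting with L1 normalization, which we apply to both textual terms and visual words, can be sketched as follows (toy code with made-up caption tokens, not our actual indexing pipeline):</p>

```python
import math

# Toy TF-IDF weighting with L1 normalisation; the caption tokens below are
# invented examples, not items from the collection.

def tfidf_l1(docs):
    """Return (sorted vocabulary, one L1-normalised TF-IDF vector per doc)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}
    vectors = []
    for d in docs:
        w = [d.count(t) * idf[t] for t in vocab]       # raw tf * idf
        s = sum(abs(v) for v in w) or 1.0
        vectors.append([v / s for v in w])             # L1 normalisation
    return vocab, vectors

docs = [["ct", "scan", "liver"], ["mri", "scan"], ["liver", "biopsy"]]
vocab, vecs = tfidf_l1(docs)
```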
      <p>[Fig. 1: panels (a) and (b); panel (b) shows the quad-tree decomposition of an image referenced in section 2.2.]</p>
    </sec>
    <sec id="sec-4">
      <title>4 Image Classification</title>
      <p>All our experiments were conducted with several combinations of the two models described in section 2. For classification, the LibLinear classifier was employed, an open-source library for large-scale linear classification [13] (https://www.csie.ntu.edu.tw/~cjlin/liblinear/). Linear SVMs are in general much faster to train and to apply than non-linear ones, and they can approximate large-scale non-linear SVMs given a suitable feature map. Efficient feature mapping can be achieved with additive kernels, commonly employed in computer vision, the homogeneous kernel map being the most common [14]. The homogeneous kernel map covers the intersection, Hellinger's, Jensen-Shannon and Chi2 kernels, allowing large-scale training of non-linear SVMs: the transformation of the data yields a compact linear representation that reproduces the desired kernel to a very good level of approximation, making linear SVM solvers applicable (see http://www.robots.ox.ac.uk/~vgg/software/homkermap/#r1 and http://vision.princeton.edu/pvt/SiftFu/SiftFu/SIFTransac/vlfeat/doc/api/homkermap.html). In our experiments the homogeneous kernel map of VLFeat is used, more specifically with the Chi2 kernel. The VLFeat implementation does not require any parameters, but experiments have shown that results can be improved slightly by changing the Gamma parameter, which sets the homogeneity degree of the kernel. The SVM model was tuned with n-fold cross-validation to find the best cost: LibLinear has an embedded grid search which conducts n-fold cross-validation with different costs and selects the best one. Besides the cost parameter, discovered by grid search, the bias multiplier and kernel type were also given; results were not greatly affected when varying either. After experimentation with several parameters, the best performance was obtained with cost 10, Gamma 0.5 and L2-regularized L2-loss support vector classification.</p>
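      <p>The idea behind the explicit kernel map can be illustrated for the Chi2 kernel k(x, y) = 2xy/(x + y): following the Vedaldi-Zisserman construction used by VLFeat, each non-negative feature is mapped to a small vector whose dot products approximate the kernel, so a linear solver can stand in for a Chi2-kernel SVM. The sketch below is our own illustration; the sampling order n and period L are illustrative choices, not VLFeat's defaults.</p>

```python
import math

# Explicit feature map for the chi2 kernel k(x, y) = 2xy/(x + y), sketched
# after the homogeneous-kernel-map construction; n and L are illustrative.

def chi2_feature_map(x, n=3, L=0.5):
    """Map a non-negative scalar feature to 2n + 1 dimensions; dot products
    of mapped features approximate the chi2 kernel."""
    if x == 0:
        return [0.0] * (2 * n + 1)
    kappa = lambda w: 1.0 / math.cosh(math.pi * w)  # spectrum of sech(l/2)
    psi = [math.sqrt(x * L * kappa(0.0))]
    for j in range(1, n + 1):
        scale = math.sqrt(2.0 * x * L * kappa(j * L))
        psi.append(scale * math.cos(j * L * math.log(x)))
        psi.append(scale * math.sin(j * L * math.log(x)))
    return psi

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, y = 0.3, 0.7
exact = 2 * x * y / (x + y)                         # chi2 kernel value
approx = dot(chi2_feature_map(x), chi2_feature_map(y))
# approx stays within about 1% of exact, which is why a linear SVM on the
# mapped features behaves like a chi2-kernel SVM.
```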
      <sec id="sec-4-1">
        <title>4.2 Early and Late Data Fusion</title>
        <p>In early fusion, also referred to as feature fusion [15], the image representation features extracted from different models are integrated into a single unified representation. Normalization techniques may be applied before the integration so that the features are on the same scale. There is only one learning phase, which handles all multimodal features together; five of our submitted runs used early fusion. In late fusion, also referred to as decision-level fusion, the probabilistic output scores obtained from separate classifiers are combined into a single vector to form the final decision: the models are trained and applied separately, and their respective outputs are combined. In contrast to early fusion, late fusion requires two learning phases and risks losing correlations present in the mixed feature space. Nevertheless, late fusion does not suffer from the integration problem of early fusion and can be easily used thanks to its simplicity and scalability; the remaining five of our submitted runs used late fusion.</p>
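        <p>The two schemes can be sketched as follows (illustrative code with hypothetical histograms and class probabilities, not our experimental pipeline): early fusion concatenates the normalized, optionally weighted, per-model feature vectors before a single training phase, while late fusion averages the probabilistic outputs of separately trained classifiers.</p>

```python
# Toy illustration of early vs. late fusion; all numbers are hypothetical.

def l1_normalize(v):
    s = sum(abs(x) for x in v) or 1.0
    return [x / s for x in v]

def early_fusion(feature_sets, weights=None):
    """Concatenate (optionally weighted) L1-normalised feature vectors."""
    weights = weights or [1.0] * len(feature_sets)
    fused = []
    for w, feats in zip(weights, feature_sets):
        fused += [w * x for x in l1_normalize(feats)]
    return fused

def late_fusion(score_sets):
    """Average the probabilistic outputs of separately trained classifiers."""
    n = len(score_sets)
    return [sum(s) / n for s in zip(*score_sets)]

boc_hist, phow_hist = [3, 1, 0], [0, 2, 2, 4]      # per-model histograms
fused = early_fusion([boc_hist, phow_hist], weights=[0.6, 0.4])

svm_a = [0.7, 0.2, 0.1]                            # per-class probabilities
svm_b = [0.5, 0.4, 0.1]
decision = late_fusion([svm_a, svm_b])             # class 0 wins
```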
      </sec>
      <sec id="sec-4-2">
        <title>4.3 Submitted Runs and Results</title>
        <p>In this year's contest we submitted ten visual runs for the subfigure classification subtask; the results are presented in table 1. Early tests on the learning curves of our model on the ImageCLEF 2013 dataset showed that the test error drops continuously as the number of training instances increases, suggesting that with a larger dataset the test error would drop even further. We therefore enriched the poorest training categories with new images; these were the following 14 of the 30 categories: DRAN, DRCO, DRCT, DRPE, DRUS, DRXR, DSEC, DSEE, DSEM, DVDM, DVEN, GFLO, GMAT, GPLI. Thus we used two datasets in our runs:
– Original Dataset: the original training collection distributed for the subfigure classification task of the ImageCLEF 2016 Medical task, containing 6776 images; and
– Enriched Dataset: the original training collection enriched with 482 images from the ImageCLEF 2013 Modality Classification training collection [16]. The enriched dataset contains 7258 images.</p>
        <p>The name of each run describes the methods and parameters used. For example, the first run in table 1 corresponds to an early fusion experiment on the enriched dataset combining:
– the QBoC model, using a quad-tree decomposition of the image terminating at blocks of size 1×1 and a codebook of 256 colors in the RGB color space, and
– the BoW model, with the default PHOW 2-level descriptor and 1536 features.</p>
        <p>A color option is used to compute the color variant of the descriptor, e.g. RGB; the parameter value "default" denotes that the gray-scale variant of the descriptor is computed. From the confusion matrix corresponding to our first run (figure 2), we see that three categories received zero true positives: PET (DRPE), where the majority of the examples were classified as Computerized Tomography (DRCT), and Electrocardiography (DSEC) and Electromyography (DSEM), where most of the examples were classified as statistical figures, graphs and charts (GFIG). These three categories happen to have the smallest training sets even after the enrichment with new images, with 30, 39 and 23 training images respectively.</p>
        <p>Although we submitted runs exclusively for visual categorization, for completeness we also report our results for textual and mixed classification. Our textual representation of images was based on a naive TFIDF bag-of-words model with stopword removal and stemming. Textual classification on the enriched dataset attained an accuracy of 63.68% with stemming and 70.07% without stemming. Our mixed run, combining QBoC, PHOW and text in an early fusion mode with weights (0.5, 0.3, 0.2) respectively, attained an accuracy of 86.9%.</p>
        <p>Table 1. Run IDs of our submitted runs:
SC enriched GBOC 1x1 256 RGB Phow Default 1500 EarlyFusion
SC enriched GBOC 1x1 128 HSV Phow RGB 1500 EarlyFusion
SC enriched GBOC 1x1 128 HSV Phow RGB 1500 LateFusion
SC original GBOC 1x1 256 RGB w 0.6 Phow Default 1500 w 0.4 EarlyFusion
SC original GBOC 1x1 256 RGB Phow Default 1500 EarlyFusion
SC original GBOC 1x1 128 RGB Phow Default 1500 EarlyFusion
SC original GBOC 1x1 256 RGB Phow Default 1500 LateFusion
SC original GBOC 1x1 128 HSV Phow RGB 1500 LateFusion
SC original GBOC 1x1 128 RGB Phow Default 1500 LateFusion</p>
        <p>In this paper we presented the image classification techniques applied by the IPL Group to the subfigure classification subtask of the ImageCLEF 2016 Medical Task. For our runs we used early and late fusion over two bag-of-visual-words models: the first a novel generalized version of the BoC model, and the second the classical BoW model with the PHOW descriptor. Our experiments show that early or late fusion performs better than either of the two models on its own, and that complementing the visual image representation with a textual representation proved beneficial for classification accuracy. The results so far with our new QBoC model are encouraging, and several new directions have emerged that need further investigation.</p>
        <p>8. Yang, J., Jiang, Y.G., Hauptmann, A.G., Ngo, C.W.: Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the International Workshop on Multimedia Information Retrieval. MIR '07, New York, NY, USA, ACM (2007) 197–206</p>
        <p>9. Khaligh-Razavi, S.: What you need to know about the state-of-the-art computational models of object-vision: A tour through the models. CoRR abs/1407.2776 (2014)</p>
        <p>10. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. International Journal of Computer Vision 87(3) (2010) 316–336</p>
        <p>11. García Seco de Herrera, A., Markonis, D., Müller, H.: Bag-of-colors for biomedical document image classification. In: Medical Content-Based Retrieval for Clinical Decision Support. Springer (2013) 110–121</p>
        <p>12. Yin, X., Düntsch, I., Gediga, G.: Quadtree representation and compression of spatial data. Springer (2011)</p>
        <p>13. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9 (June 2008) 1871–1874</p>
        <p>14. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell. 34(3) (March 2012) 480–492</p>
        <p>15. Zhou, X., Depeursinge, A., Müller, H.: Information fusion for combining visual and textual image retrieval in ImageCLEF@ICPR. In: Proceedings of the 20th International Conference on Recognizing Patterns in Signals, Speech, Images, and Videos. ICPR'10, Berlin, Heidelberg, Springer-Verlag (2010) 129–137</p>
        <p>16. García Seco de Herrera, A., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S., Müller, H.: Overview of the ImageCLEF 2013 medical tasks. CEUR-WS, Volume 1179 (2013)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Muller, H.,
          <string-name>
            <surname>Michoux</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Geissbuhler, A.:
          <article-title>A review of content-based image retrieval systems in medical applications - clinical bene ts and future directions</article-title>
          .
          <source>I. J. Medical Informatics</source>
          <volume>73</volume>
          (
          <issue>1</issue>
          ) (
          <year>2004</year>
          )
          <volume>1</volume>
          {
          <fpage>23</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Kalpathy-Cramer, J., García Seco de Herrera, A., Demner-Fushman, D., Antani, S., Bedrick, S., Müller, H.: Evaluating performance of biomedical image retrieval systems - an overview of the medical image retrieval task at ImageCLEF 2004–2013. Computerized Medical Imaging and Graphics (2014)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. García Seco de Herrera, A., Schaer, R., Bromuri, S., Müller, H.: Overview of the ImageCLEF 2016 medical task. In: Working Notes of CLEF 2016 (Cross Language Evaluation Forum) (September 2016)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Li, F.F., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: CVPR (2). (2005) 524–531
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Wengert, C., Douze, M., Jégou, H.: Bag-of-colors for improved image search. In: Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28 - December 1, 2011. (2011) 1437–1440
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision - Volume 2. ICCV '99, Washington, DC, USA, IEEE Computer Society (1999) 1150–1157
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Bosch, A., Zisserman, A., Muñoz, X.: Image classification using random forests and ferns. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007. (2007) 1–8
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>