=Paper=
{{Paper
|id=Vol-1609/16090279
|storemode=property
|title=DUTh at the ImageCLEF 2016 Image Annotation Task: Content Selection
|pdfUrl=https://ceur-ws.org/Vol-1609/16090279.pdf
|volume=Vol-1609
|authors=Georgios Barlas,Maria Ntonti,Avi Arampatzis
|dblpUrl=https://dblp.org/rec/conf/clef/BarlasNA16
}}
==DUTh at the ImageCLEF 2016 Image Annotation Task: Content Selection==
DUTh at the ImageCLEF 2016 Image Annotation Task: Content Selection

Georgios Barlas, Maria Ntonti, Avi Arampatzis

Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi 67 100, Greece
{gbarlas,mntonti,avi}@ee.duth.gr

Abstract. This report describes our experiments in the Content Selection subtask of the Image Annotation task of ImageCLEF 2016 [7, 13]. Our approach is based on the fact that the human visual system concentrates mostly on local features [12]. In this respect, we trained an SVM classifier with descriptors that are based on the local features of the image, such as edges and corners. For the experimentation process we used the set of 500 images provided for the task, divided into a training and a test set. This set was created specifically for this year's new subtask, Content Selection, although the concepts are the same as last year's. Through experimentation we determined which descriptors give the best results for the given task. For the main experiment, the SVM classifier was trained on the aforementioned set of 500 images using a subset of the top-performing features. Subsequently, the SVM processed the new set of 450 images and selected the boxes that best describe them conceptually.

1 Introduction

Content Selection is the intermediate step between identifying the objects of an image and generating a natural language caption. The objective of this task is to identify which bounded objects of an image are important enough to be included in the annotator's caption. In addition, since each object is labeled, the result would be a conceptually accurate description of the image. As requested by the ImageCLEF competition, we developed a system that receives as input the labeled objects of an image and identifies the objects that are referred to in the corresponding image caption.

Since participants of the content selection subtask have so far concentrated on using only text features or bounding box information, this paper may provide a novel contribution in exploring features based on visual keypoints only. We are not aware of other work using keypoints in the ways we do (e.g., the ratio of keypoints in a bounding box to keypoints in the whole image).

Our approach relies on finding suitable descriptors, from the given data, in order to train an SVM classifier. In this work we followed an image-only approach, without processing the textual information provided as ground-truth. Inspired by the fact that the human visual system concentrates mostly on local features, we incorporated several such descriptors using well-known image processing algorithms. We thereby created a set of 17 descriptors and grouped them into several subsets referring to similar features. Then, after experimenting with the descriptors standalone and in groups, we arrived at a set of the 9 most useful descriptors.

Complementary to testing different feature subsets, experiments were also conducted using various SVM kernel functions. Linear, polynomial, and Gaussian radial basis function kernels were tested. The polynomial and Gaussian kernels performed similarly, but the latter was chosen as it produced slightly better results.

The rest of this report is organized as follows. In the next section we describe the dataset provided as well as the methodology we followed to tackle the problem. Specifically, we analyze the descriptors that were used during experimentation and describe the SVM classifier. Section 3 describes the evaluation methodology along with our experimental results. Conclusions and directions for further research are summarized in Section 4.
2 Data and Methods

2.1 Dataset

The dataset provided by the ImageCLEF competition consists of 500 images covering various concepts, accompanied by a set of bounding boxes for each image and a label for each box. Furthermore, a set of textual descriptions is given for each image as ground-truth, with a minimum of 5 and a mean of 9.5 sentences per image [7]. This set was initially split into a training and a validation set and was used for experimentation. After experimentally determining the best configuration, the whole set was used to train the SVM classifier. A second set of 450 images was later released by ImageCLEF and was used as the test set.

2.2 Feature Extraction

Initially, 17 descriptors were created. After experimentation we settled on a subset of 9 that was found to have the maximum contribution. Correlation information between the descriptors and the purpose of each one was used to cluster them into categories. In the rest of this section, we elaborate on each descriptor category.

Position Descriptors. For each bounding box, two points are given that define its position in the image. The coordinates of these points are used individually as four values $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$, divided by the corresponding dimension of the image, as follows:

$d_1 = \frac{x_{\min}}{w}, \quad d_2 = \frac{x_{\max}}{w}, \quad d_3 = \frac{y_{\min}}{h}, \quad d_4 = \frac{y_{\max}}{h}$

where $w$, $h$ denote the width and the height of the image, respectively. Additionally, two more features are formed that correspond to the relative position of the center of the bounding box, abscissa and ordinate respectively:

$d_5 = \frac{d_1 + d_2}{2}, \quad d_6 = \frac{d_3 + d_4}{2}$

The purpose of these descriptors is to investigate the correlation between the position of a bounding box and its importance [6].

Size Descriptors. The descriptors of this group aim to calculate the portion of the image that is occupied by the bounding box, separately for the two dimensions as well as combined:

$d_7 = \frac{w_b}{w}, \quad d_8 = \frac{h_b}{h}, \quad d_9 = d_7 d_8$

where $w_b$, $h_b$ denote the width and height of the bounding box, respectively.

Descriptors based on Local Features. For the calculation of the descriptors of this group we used several well-established algorithms for local feature detection (i.e., the Canny edge detector [2], Harris and Stephens [8], BRISK [9], SURF [1], and FAST [11]). These algorithms imitate the way that humans process visual information. The calculated features are based on the hypothesis that an elevated number of the detected key-points or key-regions will be located in conceptually important boxes. In this direction, we propose calculating the percentage of the image's local features detected in each box:

$d_{10} = \frac{\mathrm{Canny}_{box}}{\mathrm{Canny}_{image}}, \quad d_{11} = \frac{\mathrm{Harris}_{box}}{\mathrm{Harris}_{image}}, \quad d_{12} = \frac{\mathrm{BRISK}_{box}}{\mathrm{BRISK}_{image}}, \quad d_{13} = \frac{\mathrm{SURF}_{box}}{\mathrm{SURF}_{image}}, \quad d_{14} = \frac{\mathrm{FAST}_{box}}{\mathrm{FAST}_{image}}$

Entropy Descriptors. Image entropy is a quantity that is used to describe the amount of information contained in an image. In this regard, the motivation behind this feature group is to quantify the amount of information held by the content of a bounding box, proportionally to that of the whole image. As a first step we produced three different versions of the image. The first two correspond to the edge maps generated by the Canny edge detector [2] and the Structured Forests edge detection method (ESF) [5], respectively. The last one corresponds to the image reduced to gray-level pixel values. Subsequently, we calculate the entropy contained in a bounding box in all three versions, divided by the total entropy of the image:

$d_{15} = \frac{E(\mathrm{Canny}_{box})}{E(\mathrm{Canny}_{image})}, \quad d_{16} = \frac{E(\mathrm{ESF}_{box})}{E(\mathrm{ESF}_{image})}, \quad d_{17} = \frac{E(\mathrm{Grayscale}_{box})}{E(\mathrm{Grayscale}_{image})}$

where $E(x)$ denotes the entropy of an image $x$, defined as

$E(x) = -\sum_{i=1}^{N} h(i) \log_2 h(i)$

where $h(i)$ is the count of pixels assigned to the $i$-th bin of the image histogram and $N$ is the total number of bins.
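To make the descriptor definitions concrete, the following is a minimal sketch of the position and size descriptors in Python. The (x_min, y_min, x_max, y_max) box representation and all names are our own illustration, not the authors' code (the original experiments were run in MATLAB).

```python
# A minimal sketch of the position (d1-d6) and size (d7-d9) descriptors.
# The box representation and function names are illustrative assumptions.

def position_size_descriptors(box, img_w, img_h):
    """Return [d1, ..., d9] for one bounding box in an img_w x img_h image."""
    x_min, y_min, x_max, y_max = box
    d1 = x_min / img_w             # relative left edge
    d2 = x_max / img_w             # relative right edge
    d3 = y_min / img_h             # relative top edge
    d4 = y_max / img_h             # relative bottom edge
    d5 = (d1 + d2) / 2             # relative abscissa of the box center
    d6 = (d3 + d4) / 2             # relative ordinate of the box center
    d7 = (x_max - x_min) / img_w   # fraction of image width covered
    d8 = (y_max - y_min) / img_h   # fraction of image height covered
    d9 = d7 * d8                   # fraction of image area covered
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]
```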
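The keypoint-ratio descriptors can be sketched with OpenCV as below, shown for the Canny (d10) and FAST (d14) cases; the Harris, BRISK, and SURF ratios (d11-d13) follow the same pattern (SURF additionally requires the opencv-contrib non-free build). The detector thresholds, file name, and box values are assumptions, since the paper does not specify them.

```python
# A sketch of the local-feature ratio descriptors using OpenCV.
# Detector parameters are assumptions; the paper does not specify them.
import cv2
import numpy as np

def keypoint_ratio(keypoints, box):
    """Fraction of the image's keypoints falling inside the box (d11-d14)."""
    x_min, y_min, x_max, y_max = box
    if not keypoints:
        return 0.0
    in_box = sum(x_min <= kp.pt[0] <= x_max and y_min <= kp.pt[1] <= y_max
                 for kp in keypoints)
    return in_box / len(keypoints)

def canny_ratio(gray, box):
    """Fraction of the image's Canny edge pixels inside the box (d10)."""
    edges = cv2.Canny(gray, 100, 200)      # thresholds are assumptions
    total = np.count_nonzero(edges)
    if total == 0:
        return 0.0
    x_min, y_min, x_max, y_max = box
    return np.count_nonzero(edges[y_min:y_max, x_min:x_max]) / total

gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file
box = (10, 20, 200, 180)                                # hypothetical box
d10 = canny_ratio(gray, box)
d14 = keypoint_ratio(cv2.FastFeatureDetector_create().detect(gray, None), box)
```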
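Similarly, the entropy ratios d15-d17 can be sketched as follows; it applies to any of the three image versions (Canny edge map, ESF edge map, grayscale image). The 256-bin histogram and the normalization of counts to probabilities (so the log terms are well defined) are our assumptions, as the paper only states that h(i) counts pixels per bin.

```python
# A sketch of the entropy descriptors d15-d17. The 256-bin histogram and
# the normalization of counts to probabilities are our assumptions.
import numpy as np

def entropy(region, bins=256):
    """E(x) = -sum_i h(i) log2 h(i) over the region's histogram."""
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    p = hist / max(hist.sum(), 1)   # normalize counts to probabilities
    p = p[p > 0]                    # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def entropy_ratio(image, box):
    """Box entropy divided by whole-image entropy (d15, d16, or d17)."""
    x_min, y_min, x_max, y_max = box
    e_image = entropy(image)
    return entropy(image[y_min:y_max, x_min:x_max]) / e_image if e_image else 0.0
```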
2.3 Support Vector Machine (SVM) Classification

For the purpose of this task, we trained a binary SVM in order to classify each bounding box as 'important' or 'not important'. The SVM tackles the problem of non-linear classification by determining a hyperplane that separates the two classes in a space of higher dimensionality than the feature space, using a kernel function [3, 4]. We used MATLAB's default function to train and use the SVM. For improved performance, we experimented with different kernel functions, namely the linear, polynomial, and Gaussian radial basis function kernels, concluding that the Gaussian kernel performed best.

During the experiments, the training dataset was divided into two subsets, a training and a validation subset. We experimented with training subsets of 100 and 250 randomly selected images. We investigated the performance of different sets of descriptors, settling on a set of 9. Table 1 presents the results of the experiments with different configurations.

In the cases where the SVM classified only a small number of boxes as important, we used the SURF descriptor as a selection criterion. Specifically, if the result contained fewer than two boxes, then the $n/2 + 3$ boxes with the largest $d_{13}$ value were added to the result set, where $n$ is the number of boxes in the image. If the image contained fewer than $n/2 + 3$ boxes, then all of them were selected. The value $n/2 + 3$ was chosen after experimentation: we initially started with $n/2$ and then settled on $n/2 + 3$.
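Since the original experiments used MATLAB's SVM routines, the following is only an approximate sketch of the classification and fallback logic using scikit-learn; X_train, y_train, and the integer rounding of n/2 are our assumptions.

```python
# A sketch of the classification step, with scikit-learn standing in for
# MATLAB's SVM routines. X_train (one 9-descriptor row per box) and
# y_train (1 = important, 0 = not) are assumed to be prepared already;
# rounding n/2 down is also our assumption.
import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel="rbf")        # Gaussian radial basis function kernel
clf.fit(X_train, y_train)

def select_boxes(X_image, d13):
    """Return indices of selected boxes for one image.

    X_image: (n_boxes, 9) descriptor matrix; d13: SURF ratio per box."""
    selected = set(np.flatnonzero(clf.predict(X_image) == 1))
    if len(selected) < 2:                          # fallback criterion
        n = len(X_image)
        k = min(n, n // 2 + 3)                     # take all boxes if too few
        selected |= set(np.argsort(d13)[::-1][:k]) # top-k boxes by d13
    return sorted(selected)
```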
3 Experimental Evaluation

3.1 Evaluation Measures

According to the instructions, Subtask 3 is evaluated using the content selection metric, which is the F1 score averaged across all test images. Each F1 score is computed from the precision and recall metrics averaged over all gold standard descriptions for the image. The precision $P^{I_i}$ for test image $I_i$ is computed as:

$P^{I_i} = \frac{1}{M} \sum_{m=1}^{M} \frac{|G_m^{I_i} \cap S^{I_i}|}{|S^{I_i}|}$ (1)

The recall $R^{I_i}$ for test image $I_i$ is computed as:

$R^{I_i} = \frac{1}{M} \sum_{m=1}^{M} \frac{|G_m^{I_i} \cap S^{I_i}|}{|G_m^{I_i}|}$ (2)

where

$I = \{I_1, I_2, \ldots, I_N\}$ is the set of test images,
$G^{I_i} = \{G_1^{I_i}, G_2^{I_i}, \ldots, G_M^{I_i}\}$ is the set of gold standard descriptions,
$S^{I_i}$ is the resulting set of unique bounding box instances,
$M$ is the number of gold standard descriptions for image $I_i$.

The content selection score $F^{I_i}$ for image $I_i$ is computed as:

$F^{I_i} = \frac{2\, P^{I_i} R^{I_i}}{P^{I_i} + R^{I_i}}$ (3)

The final $P$, $R$, and $F$ scores are computed as the mean $P$, $R$, and $F$ scores across all test images.
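Equations (1)-(3) can be checked with a few lines of Python; modelling bounding-box instances as sets of identifiers is our own simplification.

```python
# A sketch of the content selection metric of Eqs. (1)-(3). Bounding-box
# instances are modelled as sets of identifiers (our simplification);
# gold is the list of M gold-standard box sets for one image.
def content_selection_scores(gold, selected):
    """Return (P, R, F) for one test image."""
    M = len(gold)
    P = (sum(len(g & selected) / len(selected) for g in gold) / M
         if selected else 0.0)
    R = sum(len(g & selected) / len(g) for g in gold) / M
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

# Two gold descriptions mention {b1, b2} and {b1}; the system selected
# {b1, b3}: P = 0.5, R = 0.75, F = 0.6.
print(content_selection_scores([{"b1", "b2"}, {"b1"}], {"b1", "b3"}))
```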
3.2 Experimental Results

For the experiments we used the 500 images of the ImageCLEF dataset. First, the images were split randomly into two sets, a training set and a test set. As training set, 100 or 250 images were used, i.e., 20% or 50% of the set, respectively. As expected, the bigger training dataset gave better results. For this reason, it was decided to use all 500 provided images as the training set for the submission run. Experiments took place with various combinations of descriptors. As shown in Table 2, descriptors are managed as groups.

As a decision criterion for the cases where the SVM classified only a small number of boxes as important, we experimented with the $d_9$ and $d_{13}$ descriptors. The SURF descriptor ($d_{13}$) produced better results, as it is based on a well-established and robust algorithm, in contrast to descriptor $d_9$ which is more abstract. Table 2 shows the setup of each experiment. For example, in experiment 1 the descriptors $d_{1...4}$, $d_{11}$, $d_{13}$, and $d_{17}$ were used, the SVM kernel was the radial basis function, 100 of the 500 images were used as the training set, and $d_9$ was used as the criterion.

Before the experiments, each group of descriptors was tested separately, so the behavior of each was known. This meant there was no need to include or exclude all descriptors of a group in each experiment, only the best representatives of the group. Furthermore, the correlation matrix between the descriptors (Fig. 1) was taken into account; this is why, for example, $d_9$ is not included in our experiments, as shown in Table 2. Finally, as seen in Table 1, the F-measure was not the only quantity taken into consideration for our final choice, but also the balance of precision and recall.

Our approach, using a Gaussian kernel SVM classifier and the 9 descriptors of experiment 14, achieves an overall F-measure of 54.59% ± 15.33, the best result of the two groups participating in the subtask.

Table 1. Experimental results for various combinations of descriptors

Experiment  F-score           Precision         Recall
 1          0.5885 ± 0.2300   0.5972 ± 0.2451   0.6758 ± 0.2963
 2          0.5873 ± 0.2297   0.5946 ± 0.2442   0.6783 ± 0.2946
 3          0.5882 ± 0.2312   0.5991 ± 0.2465   0.6726 ± 0.2982
 4          0.5888 ± 0.2294   0.5963 ± 0.2446   0.6761 ± 0.2978
 5          0.5980 ± 0.2224   0.5994 ± 0.2390   0.6892 ± 0.2955
 6          0.5976 ± 0.2235   0.5982 ± 0.2392   0.6893 ± 0.2962
 7          0.5891 ± 0.2296   0.5969 ± 0.2446   0.6758 ± 0.2986
 8          0.5889 ± 0.2301   0.5985 ± 0.2449   0.6746 ± 0.2973
 9          0.5585 ± 0.2108   0.4939 ± 0.2081   0.7461 ± 0.2994
10          0.6380 ± 0.1899   0.5683 ± 0.2210   0.8203 ± 0.2253
11          0.5314 ± 0.1602   0.3865 ± 0.1620   0.9671 ± 0.0761
12          0.5709 ± 0.2175   0.5298 ± 0.2257   0.7235 ± 0.2850
13          0.5894 ± 0.2283   0.5950 ± 0.2430   0.6817 ± 0.2930
14          0.5888 ± 0.2322   0.6018 ± 0.2475   0.6663 ± 0.3042
15          0.5893 ± 0.2303   0.6001 ± 0.2457   0.6717 ± 0.3000
16          0.5893 ± 0.2312   0.6014 ± 0.2462   0.6688 ± 0.3030
17          0.5888 ± 0.2300   0.5973 ± 0.2450   0.6757 ± 0.2975

Table 2. Setup of each experiment: SVM kernel, training-set size, and selection criterion (each experiment also used a different subset of the descriptors $d_1$-$d_{17}$; e.g., experiment 1 used $d_{1...4}$, $d_{11}$, $d_{13}$, and $d_{17}$)

Experiment    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17
Kernel        rbf  rbf  poly poly rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf  rbf
Training set  100  100  100  100  250  250  250  250  100  100  100  100  100  100  100  100  100
Criterion     d9   d9   d9   d9   d9   d9   d9   d9   d9   d9   d9   d9   d9   d13  d13  d9   d13

Fig. 1. Correlation calculation between every descriptor investigated.

4 Conclusion

This report describes the methodology, the experimentation, and the results of DUTh's participation in Subtask 3 of the Image Annotation task of ImageCLEF 2016. Our novelty is that we tackled the task using only visual keypoints/features, in contrast to the other participants so far, who used only text features or bounding box information. We investigated the performance of seventeen image descriptors combined with three SVM configurations corresponding to different SVM kernels. The experimental evaluation highlighted the significance of a feature subset in determining the conceptually important boxes. These features mostly relate to the edges and corners detected by well-established algorithms. Our approach, using a Gaussian kernel SVM classifier and nine descriptors, achieves an overall F-measure of 54.59% ± 15.33, the best result of the two groups participating in the subtask.

Taking a step further, we believe that the proposed methodology can be improved in two directions. The first concerns the improvement of the feature extraction methods. Motivated by the high performance of local feature detection demonstrated in this work, we would like to additionally incorporate the local feature descriptors that correspond to the detected keypoints. The statistical analysis and comparison of these descriptors may provide useful information concerning the importance of each keypoint. Room for improvement also exists in the exploitation of textual analysis of the proposed annotation terms. Textual features based on term and document frequencies can provide useful insight for determining the importance of every box label. Furthermore, textual features concerning term significance may be extracted by exploiting word ontologies or semantic networks, such as WordNet [10].

References

1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2008)
2. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 679–698 (1986)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press (2000)
5. Dollár, P., Zitnick, C.L.: Structured forests for fast edge detection. In: ICCV, International Conference on Computer Vision (December 2013), http://research.microsoft.com/apps/pubs/default.aspx?id=202540
6. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer Series in Statistics, Springer, Berlin (2001)
7. Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A., Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task. In: CLEF2016 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Évora, Portugal (September 5-8, 2016)
8. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference. vol. 15, p. 50. Citeseer (1988)
9. Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: Binary robust invariant scalable keypoints. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 2548–2555. IEEE (2011)
10. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
11. Rosten, E., Porter, R., Drummond, T.: Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 105–119 (2010)
12. Shapley, R., Tolhurst, D.: Edge detectors in human vision. The Journal of Physiology 229(1), 165 (1973)
13. Villegas, M., Müller, H., García Seco de Herrera, A., Schaer, R., Bromuri, S., Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A., Dellandrea, E., Gaizauskas, R., Mikolajczyk, K., Puigcerver, J., Toselli, A.H., Sánchez, J.A., Vidal, E.: General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science, Springer International Publishing (2016)