Automatic Annotation of Liver CT Images: the Submission of the BMET Group to ImageCLEFmed 2014

Ashnil Kumar1,3, Shane Dyer2, Changyang Li1,3, Philip H. W. Leong2,3, and Jinman Kim1,3

1 School of Information Technologies, University of Sydney, Australia
2 School of Electrical and Information Engineering, University of Sydney, Australia
3 Institute of Biomedical Engineering and Technology, University of Sydney, Australia
{ashnil.kumar,changyang.li,philip.leong,jinman.kim}@sydney.edu.au

Abstract. In this paper we present the strategies that were designed and applied by the Institute of Biomedical Engineering and Technology (BMET) team to the liver image annotation task of ImageCLEF 2014. This was the first year this challenge was held and as such our strategies form the basis for future exploration in this area. The major challenge of the liver annotation task was limited training data, with some annotation labels having very few positive samples and other labels having no samples at all. We propose two strategies for annotating the liver images given the unbalanced nature of the training dataset. Our first method uses a multi-class classification scheme where each label has a classifier that is trained to separate it from the other labels. Our second method uses the similarity scores from an image retrieval algorithm as weights for a majority voting scheme, thereby reducing the inherent bias towards labels that have a high number of samples. We also investigate the performance of our methods using different feature sets. In total, BMET submitted 8 runs to the ImageCLEF liver annotation task. All of our runs achieved high scores (> 90%) during evaluation. We also achieved the highest score out of all submissions to the ImageCLEF 2014 liver annotation task.

Keywords: SVM, Image Retrieval, Multi-class Classification, Image Annotation, Liver, Computed Tomography

1 Introduction

ImageCLEF [1] is the image retrieval track of the Cross Language Evaluation Forum (CLEF). In the past, one of the major focuses of ImageCLEF [2–8] has been medical imaging, with tasks ranging from modality classification to case-based retrieval. In 2014, for the first time, the objective of the ImageCLEF medical imaging task was the automatic annotation of medical images [9]. The aim of the challenge was to generate a structured report based on an analysis of the image features in computed tomography (CT) images of the liver, with the goal of detecting subtle differences in image features and annotating them using a standard terminology.

One of the major challenges was the limited amount of training data compared to the number of annotations that needed to be recognised. In particular, there were some annotations that did not occur at all in the dataset. Similarly, there were also instances where all of the training samples had the same annotation. Our methods were thus designed to account for the unbalanced dataset by reducing the bias towards annotations with more samples.

In this paper, we outline our submission to the ImageCLEF 2014 liver annotation challenge. We present two methods for the annotation of liver CT images, one based on multi-class classification and another using image retrieval. Our aim was to perform the annotation using visual features only. As such, we do not utilise any of the information in the ONtology of the LIver for RAdiology (ONLIRA) [10], such as the semantic distance between related terms [11]. Our method treats the challenge as multiple multi-class classification problems.
We adapted well-established classification and retrieval techniques to investigate their performance on the annotation of liver CT images. We envision that this will establish a baseline for improvements in future iterations of the challenge.

The following terminology is employed in the remainder of this paper. A question refers to a specific annotation task, i.e., an element of the structured report that needs to be automatically filled. A label is a possible annotation that can be assigned to a question. An answer is the label that is assigned to the question.

2 Materials

The training dataset contained 50 CT volumes cropped to the region around the liver; the volumes had varied resolutions (x: 190–308 pixels, y: 213–387 pixels, slices: 41–588) and spacings (x, y: 0.674–1.007 mm, slice: 0.399–2.5 mm). A mask of the liver pixels and the bounding box for a selected lesion were provided for each image in the training dataset. The training data also included a set of 60 well-established image features (with a total dimensionality of 458) that had been extracted from the images in the dataset. The answers to 73 questions were provided for each training image.

The test dataset contained 10 CT volumes, with varied resolutions and pixel spacings, cropped to the region around the liver. The test data also included a mask of the liver pixels, a bounding box for a lesion, and a set of 60 well-established image features (with a total dimensionality of 458) for each image in the dataset.

3 Methods

3.1 Overview

Our aim was to investigate annotation using variations of two methods: support vector machine (SVM) classification [12] and content-based image retrieval [13]. The specific variations that we used were:

• Method 1: Two-stage classification using SVMs with linear kernels.
• Method 2: Two-stage classification using SVMs with radial basis function (RBF) kernels.
• Method 3: Content-based image retrieval.
• Method 4: Content-based image retrieval with feature selection.

The two-stage classification method is described in Section 3.3 and the image retrieval method is described in Section 3.4. We also investigated the effect of expanding the feature set on all of these methods. The feature set expansion is described in Section 3.2.

Our method was applied to a subset of the annotations; seven questions where the label sets were unbounded (e.g., measurements in millimetres or counts of particular objects) were excluded. We also excluded one question that accepted multiple labels as the answer. In total, we answered 64 of the 73 questions.

3.2 Feature Sets

We used two feature sets for the annotation challenge. These were:

• Feature Set 1: The well-established features included with the dataset, after cleaning as described below (total dimensionality = 446).
• Feature Set 2: Feature Set 1 plus bag-of-visual-words (BoVW) features, constructed as described below (total dimensionality = 1446).

Dataset Features. The image features in the dataset included features extracted from the liver, the hepatic vasculature of the liver, and the selected lesion. Features extracted globally across all lesions were also included. The features described object shape properties (e.g., volume, surface area, sphericity, solidity, convexity, Hu shape invariants [14]), texture information (e.g., Haralick [15], Gabor [16], Tamura [17], Haar [18]), and pixel intensity information.

We cleaned the feature data by removing feature dimensions that had a not-a-number (NaN) value or that were used to scale other features. We removed the Anatomical Location feature (5 dimensions) of the lesion since one training image had NaN values for this feature. The Hu Moments feature (3 dimensions) of the lesion was also removed for the same reason. We also removed the first two dimensions (upper and lower bounds) of the Histogram feature of the lesion and of the HistogramOfAllLesions (a feature extracted across all lesions). The cleaned feature set had a dimensionality of 446. Readers are directed to the task documentation for more information about the image features.
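A minimal sketch of the NaN-based part of this cleaning step is given below. It is an illustration only: we assume the provided features have already been loaded into NumPy arrays, and the variable and function names are hypothetical. The removal of the scaling dimensions (the histogram bounds) is a simple column selection by index and is not shown.

```python
import numpy as np

def drop_nan_dimensions(train_feats, test_feats):
    """Remove any feature dimension that contains a NaN value in either split.

    train_feats, test_feats: 2D arrays of shape (num_images, num_dimensions)
    holding the per-image feature vectors (hypothetical names).
    """
    # Keep a column only if every training and test image has a valid value in it.
    keep = ~(np.isnan(train_feats).any(axis=0) | np.isnan(test_feats).any(axis=0))
    return train_feats[:, keep], test_feats[:, keep], keep

# Toy example: 50 training and 10 test images with 458-dimensional features,
# where one training image has missing values in a 5-dimensional feature.
train_feats = np.random.rand(50, 458)
test_feats = np.random.rand(10, 458)
train_feats[0, 100:105] = np.nan
train_clean, test_clean, keep = drop_nan_dimensions(train_feats, test_feats)
print(train_clean.shape, test_clean.shape)  # (50, 453) (10, 453)
```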
Bag-of-Visual-Words. We extracted Scale Invariant Feature Transform (SIFT) descriptors [19] from key points detected in the 2D slices of the CT images. There were a total of 4,524,946 descriptors extracted from the training dataset and 433,846 descriptors extracted from the test dataset.

We created a visual codebook from the SIFT descriptors extracted from the training dataset. We randomly sampled 5% of the descriptors and grouped the subsampled data using k-means clustering with k = 1000. The cluster centres were treated as the visual words in the codebook. A single visual word was assigned to every descriptor in both datasets; this assignment was determined by finding the visual word whose cluster centre had the minimum Euclidean distance from the descriptor. A BoVW descriptor was then created for every image using a 1000-bin histogram of the visual words in that image [20].

3.3 Annotation using Two Stage Classification

Our two-stage classification approach for image annotation is shown in Figure 1. Each stage consisted of a bank of several SVM classifiers. This two-stage approach was repeated separately for each question. Due to the unbalanced training dataset, we expected that the classifiers for labels with few samples would have relatively low accuracy. For this reason, we used the two-stage approach to introduce further discriminative power, especially in the case of ties.

Fig. 1: An overview of the classification scheme for annotation.

Let Ω be a question. Also let LΩ be the set of labels for Ω with |LΩ| = l. For every label A ∈ LΩ, we trained an A-vs-rest (1-vs-all) SVM classifier, hence forming l 1-vs-all classifiers. We also trained A-vs-B (1-vs-1) SVMs for every pair of labels A, B ∈ LΩ where A ≠ B, forming a total of (l² − l)/2 1-vs-1 classifiers. For every question, our first stage was composed of the 1-vs-all classifiers and the second stage was composed of the 1-vs-1 classifiers.

After the classifiers had been trained, our annotation process proceeded as follows. An un-annotated image (from the test dataset) was classified using the first stage. If only one of the 1-vs-all SVMs returned a positive classification (i.e., there was no tie) then the label corresponding to that classifier was adopted as the answer. If the classifiers in the first stage assigned multiple labels (i.e., multiple 1-vs-all classifiers returned positive results) then the second stage was activated. Let L+ ⊆ LΩ be the set of labels given positive responses by the first stage of classifiers. During the second stage, we classified the un-annotated image using the 1-vs-1 classifiers for all the labels in L+ (i.e., the 1-vs-1 classifiers for the tied labels). A majority voting scheme was used to select the answer.
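A minimal sketch of this two-stage scheme for a single question is given below. It is an illustration only: we assume the scikit-learn SVM implementation and in-memory NumPy arrays, all function and variable names are hypothetical, and the RBF parameter shown (gamma) merely stands in for the scaling factor mentioned in the text, since the exact parameterisation is toolkit-dependent. Labels with no variation in the training data would need special handling that is omitted here.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_two_stage(X, y, kernel="rbf", gamma=1.0):
    """Train the stage-1 (1-vs-all) and stage-2 (1-vs-1) SVM banks for one question."""
    labels = sorted(set(y))
    stage1 = {a: SVC(kernel=kernel, gamma=gamma).fit(X, (y == a).astype(int))
              for a in labels}                                 # A-vs-rest classifiers
    stage2 = {}
    for a, b in combinations(labels, 2):                       # A-vs-B classifiers
        mask = (y == a) | (y == b)
        clf = SVC(kernel=kernel, gamma=gamma)
        stage2[(a, b)] = clf.fit(X[mask], (y[mask] == a).astype(int))
    return stage1, stage2

def answer_question(x, stage1, stage2, fallback="other"):
    """Annotate one un-annotated image x for one question."""
    positives = [a for a, clf in stage1.items() if clf.predict([x])[0] == 1]
    if len(positives) == 1:
        return positives[0]                                    # stage 1: no tie
    if not positives:
        return fallback                                        # no positive response
    votes = {a: 0 for a in positives}                          # stage 2: vote among tied labels
    for a, b in combinations(sorted(positives), 2):
        winner = a if stage2[(a, b)].predict([x])[0] == 1 else b
        votes[winner] += 1
    top = max(votes.values())
    winners = [a for a, v in votes.items() if v == top]
    return winners[0] if len(winners) == 1 else fallback       # tied vote: fall back
```

The fallback label would be replaced by "N/A" or "false" for questions without an "other" option, as described below.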
Two tiebreaker situations remained after both classification stages were completed: the case where the first stage did not return any positive label, and the case where there was a tie in the vote during the second stage (multiple labels had the highest majority vote). In both of these situations, we set the answer to "other", noted in the task description as the label selected when the radiologist was unsure of the correct annotation. For such ties in questions Ω where "other" ∉ LΩ, we selected the label "N/A" if it was available, or "false" for questions that expected a Boolean answer.

During training, we tested our algorithm with various SVM kernels (linear, quadratic, radial basis function (RBF), multilayer perceptron, polynomial) and parameters using 10-fold cross-validation. We discovered that the best overall accuracy was achieved by the RBF kernel with a scaling factor of 1. There were only five questions in which the RBF kernel was beaten by other kernels, and in each of these cases the difference was not significant. We therefore selected the commonly used linear kernel and the RBF kernel for our classification approach (Methods 1 and 2, respectively).

3.4 Annotation using Image Retrieval

Our image retrieval-based approach for annotation used the most similar training images to select the answers for an un-annotated image. While the classification approach (Section 3.3) trained separate classifiers for each question, the retrieval approach attempted to answer all of the questions together. An overview of the method is shown in Figure 2.

Fig. 2: An overview of the retrieval scheme for annotation.

We defined the similarity of the un-annotated image (U) and a training image (T) as the Euclidean distance between their respective feature vectors:

s(U, T) = \sqrt{\sum_{i=1}^{d} (u_i - t_i)^2}    (1)

where u_i was the i-th feature in the feature vector of U, t_i was the i-th feature in the feature vector of T, and d was the dimensionality of the feature set (see Section 3.2). Under this formulation, lower values of s indicated greater similarity, with s(U, T) = 0 implying that U and T had identical feature vectors.

We selected the n most similar images from the training data. Let S = {s_1, ..., s_n} be the similarity values of these images (sorted from most similar to least similar). A weighted voting scheme was used to select the answer for each question. The weights for the voting scheme were determined from the set of similarity values. The weight w_i for the i-th most similar image was calculated as:

w_i = \frac{c \cdot s_1}{s_i}    (2)

where c was a constant scaling factor and s_i ∈ S was the similarity value of the i-th most similar image. In our experiments, we empirically set n = 10 and c = 10.

Our weighting scheme adjusted the value of the vote based on the calculated similarity. Images with a higher similarity would thus have a stronger vote compared to images with a lower similarity. The weighting was necessary due to the unbalanced nature of the dataset. If an unweighted majority voting scheme were used, then the labels that had a higher frequency in the dataset would have a higher chance of being selected as the answer (depending on the value of n).
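The sketch below illustrates this retrieval-and-voting scheme. It is an illustration only: we assume NumPy arrays for the features and a list of per-image answer dictionaries for the training annotations, all names are hypothetical, and a small offset is added to the distances to avoid division by zero when an identical training image exists.

```python
import numpy as np

def annotate_by_retrieval(test_feat, train_feats, train_answers, n=10, c=10.0):
    """Answer every question for one test image by weighted voting over its
    n most similar training images.

    train_feats: array of shape (num_train, d); test_feat: array of shape (d,).
    train_answers: one dict per training image mapping each question to its label
    (a hypothetical structure for the challenge annotations).
    """
    # Equation (1): Euclidean distance; lower values indicate greater similarity.
    dists = np.sqrt(((train_feats - test_feat) ** 2).sum(axis=1))
    order = np.argsort(dists)[:n]          # indices of the n most similar images
    s = dists[order] + 1e-12               # small offset guards against s_1 = 0
    weights = c * s[0] / s                 # Equation (2): strongest vote for the nearest image
    answers = {}
    for q in train_answers[order[0]]:      # answer all questions together
        votes = {}
        for w, idx in zip(weights, order):
            label = train_answers[idx][q]
            votes[label] = votes.get(label, 0.0) + w
        top = max(votes.values())
        winners = [lab for lab, v in votes.items() if v == top]
        answers[q] = winners[0] if len(winners) == 1 else "other"  # tie: fall back to "other"
    return answers
```

Method 4 would additionally restrict the columns of train_feats and test_feat, for each question, to the subset chosen by sequential feature selection before computing the distances.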
In the case of a tie (multiple labels having the same weighted vote), we set the answer to "other". When "other" was not part of the label set, we selected the label "N/A" if it was available, or "false" for questions that expected a Boolean answer.

We also designed an alternate question-based version of our retrieval scheme for annotation; this is noted as Method 4 in Section 3.1. In the alternate scheme, we applied sequential feature selection [21] to use the most discriminating features for each question during the similarity calculation (Equation 1). This ensured that optimised features were used to retrieve the most similar images, i.e., image retrieval was performed using the features most suited for answering a particular question.

4 Results and Discussion

We submitted eight runs to the ImageCLEF 2014 liver annotation challenge. The runs were created using a combination of the four methods listed in Section 3.1 and the two feature sets listed in Section 3.2. The runs were evaluated on completeness, the percentage of questions that were answered, and accuracy, the percentage of completed questions with a correct answer. Only 65 questions formed part of the evaluation; questions with unbounded labels (e.g., measurements) were not evaluated as part of the 2014 challenge. Table 1 shows the mean completeness and accuracy of our eight runs.

Table 1: Summary of results.*
Run  Method  Feature Set  Completeness  Accuracy  Score
1    1       1            0.98          0.89      0.935
2    1       2            0.98          0.90      0.939
3    2       1            0.98          0.89      0.933
4    2       2            0.98          0.90      0.939
5    3       1            0.98          0.91      0.947
6    3       2            0.98          0.87      0.927
7    4       1            0.98          0.91      0.947
8    4       2            0.98          0.87      0.926

There were 20 registered participants for the ImageCLEF 2014 liver annotation challenge. However, only three groups (including ours) submitted runs for evaluation. A comparison of the groups is shown in Table 2. In 2014, our submission achieved the highest score of all the participants.

Table 2: Comparison of results.*
Group             Completeness  Accuracy  Score
BMET (our group)  0.98          0.91      0.94
CASMIP            0.95          0.91      0.93
piLabVAVlab       0.51          0.39      0.45

* These results are from http://www.imageclef.org/2014/liver#Results.

The results show that all of our runs achieved high scores (> 0.92). We achieved a completeness score of 0.98 for every run because we always answered 64 of the 65 questions. The question that we excluded from our submission was the one that accepted multiple labels as the answer. We could answer this question and achieve a perfect completeness score by removing the tiebreaker from both of our annotation methods. This would be an exception for only this particular question, i.e., all other questions would still go through a tiebreaker process if necessary.

In general, there were no large differences between the two variants of the classification method. That is, the accuracies of the two classification methods (Methods 1 and 2) were approximately the same. This suggests that the choice of kernel was not a major factor in the overall accuracy of the annotation. This outcome was contrary to the training stage (Section 3.3), where we selected the RBF kernel (Method 2) because of its higher accuracy compared to the other kernels. We attribute this difference to the unbalanced training dataset, which may not have reflected the labels of the test dataset. However, our high scores (> 0.93) demonstrate that our classification approach for annotation performs well despite the unbalanced training dataset.
The score of the retrieval method with feature selection (Method 4) was equal to or less than that of the retrieval method with no feature selection (Method 3). This result is counter-intuitive, as we expected feature selection to improve the accuracy of the annotation. One explanation could be that the feature selection reduces the similarity scores calculated during retrieval (since fewer features are used), which in turn negatively impacts the weighted vote by returning voting power to labels with a larger number of training samples.

It is interesting to note that the classification methods performed best when using the expanded feature set (Feature Set 2) while the retrieval methods performed best when using the original feature set (Feature Set 1). This suggests that one of the major considerations in the annotation of the liver is the combination of features and methods. That is, the findings indicate that one cannot choose a method for annotation without considering which features will be used, and vice versa.

5 Conclusions

This paper described the methods and results of the BMET group's submission to the liver annotation task of ImageCLEF 2014. Our eight runs investigated different combinations of methods and feature sets. While all of our runs achieved high scores, they also revealed areas in which our methods could be optimised. Our future work will investigate building associations between image features and ONLIRA terms to create classifiers for labels with no samples in the training dataset [22].

Acknowledgments. The authors would like to thank the organisers of the ImageCLEF 2014 liver annotation task for their assistance in procuring the detailed results.

References

1. Caputo, B., Müller, H., Martinez-Gomez, J., Villegas, M., Acar, B., Patricia, N., Marvasti, N., Üsküdarlı, S., Paredes, R., Cazorla, M., Garcia-Varea, I., Morell, V.: ImageCLEF 2014: Overview and analysis of the results. In: CLEF proceedings. Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2014)
2. Müller, H., Deselaers, T., Deserno, T., Kalpathy-Cramer, J., Kim, E., Hersh, W.: Overview of the ImageCLEFmed 2007 medical retrieval and medical annotation tasks. In Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D., Peñas, A., Petras, V., Santos, D., eds.: Advances in Multilingual and Multimodal Information Retrieval. Volume 5152 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2008) 472–491
3. Müller, H., Kalpathy-Cramer, J., Kahn, C., Hatt, W., Bedrick, S., Hersh, W.: Overview of the ImageCLEFmed 2008 medical image retrieval task. In Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G., Kurimo, M., Mandl, T., Peñas, A., Petras, V., eds.: Evaluating Systems for Multilingual and Multimodal Information Access. Volume 5706 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2009) 512–522
4. Müller, H., Kalpathy-Cramer, J., Eggel, I., Bedrick, S., Radhouani, S., Bakke, B., Kahn, C., Hersh, W.: Overview of the CLEF 2009 medical image retrieval track. In Peters, C., Caputo, B., Gonzalo, J., Jones, G., Kalpathy-Cramer, J., Müller, H., Tsikrika, T., eds.: Multilingual Information Access Evaluation II. Multimedia Experiments. Volume 6242 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2010) 72–84
5. Kalpathy-Cramer, J., Müller, H., Bedrick, S., Eggel, I., de Herrera, A.G.S., Tsikrika, T.: Overview of the CLEF 2011 medical image classification and retrieval tasks. In: CLEF (Notebook Papers/Labs/Workshop). (2011)
6. Müller, H., de Herrera, A.G.S., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S., Eggel, I.: Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In: CLEF (Online Working Notes/Labs/Workshop). (2012)
7. de Herrera, A.G.S., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S., Müller, H.: Overview of the ImageCLEF 2013 medical tasks. Working notes of CLEF (2013)
8. Kalpathy-Cramer, J., de Herrera, A.G.S., Demner-Fushman, D., Antani, S., Bedrick, S., Müller, H.: Evaluating performance of biomedical image retrieval systems—an overview of the medical image retrieval task at ImageCLEF 2004–2013. Computerized Medical Imaging and Graphics (2014) doi: 10.1016/j.compmedimag.2014.03.004
9. Marvasti, N., Kökciyan, N., Türkay, R., Yazıcı, A., Yolum, P., Üsküdarlı, S., Acar, B.: ImageCLEF Liver CT Image Annotation Task 2014. In: CLEF 2014 Evaluation Labs and Workshop, Online Working Notes. (2014)
10. Kokciyan, N., Turkay, R., Uskudarli, S., Yolum, P., Bakir, B., Acar, B.: Semantic description of liver CT images: An ontological approach. IEEE Journal of Biomedical and Health Informatics (2014) doi: 10.1109/JBHI.2014.2298880
11. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1) (2006) 13–47
12. Scholkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA (2002)
13. Kumar, A., Kim, J., Cai, W., Feng, D.: Content-based medical image retrieval: a survey of applications to multidimensional and multimodality data. Journal of Digital Imaging 26(6) (2013) 1025–1039
14. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory 8(2) (1962) 179–187
15. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics 3(6) (1973) 610–621
16. Jain, A., Farrokhnia, F.: Unsupervised texture segmentation using Gabor filters. In: IEEE International Conference on Systems, Man and Cybernetics. (1990) 14–19
17. Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 8(6) (1978) 460–473
18. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. (2001) 511–518
19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2004) 91–110
20. Zhou, X., Stern, R., Müller, H.: Case-based fracture image retrieval. International Journal of Computer Assisted Radiology and Surgery 7 (2012) 401–411
21. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognition Letters 15(11) (1994) 1119–1125
22. Lampert, C., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3) (2014) 453–465