ImageSem at ImageCLEFmed Caption 2019 Task: a Two-stage Medical Concept Detection Strategy

Zhen Guo, Xuwen Wang, Yu Zhang, Jiao Li
Institute of Medical Information / Medical Library, Chinese Academy of Medical Sciences &
Peking Union Medical College, Beijing 100020, China

       Abstract. This paper presents the participation of the Image Semantics group
       (ImageSem) of the Institute of Medical Information in the ImageCLEFmed Caption
       task, launched by ImageCLEF 2019. The Concept Detection subtask aims at
       identifying 5,528 semantic concepts from 70,786 training images and 10,000 test
       images. In this study, we propose a two-stage concept detection strategy that
       combines medical image pre-classification based on body parts with a transfer
       learning-based multi-label classification model. We submitted a total of 10 runs
       to the final evaluation. Our best run achieved an F1 score of 0.2235 and ranked
       8th overall. There is still considerable room for improving the performance of
       concept detection on large-scale medical images.
       Keywords: Concept Detection; Transfer Learning; Multi-label Classification;

1      Introduction

The ImageCLEF task [1] contributes to advancing computational methods for making
medical images machine-understandable [2, 3]. ImageCLEFmed Caption 2019 [4] focuses
on the concept detection task, which aims to identify the UMLS [5] Concept Unique
Identifiers (CUIs) for a given medical image from the biomedical literature. On behalf
of the Institute of Medical Information, Chinese Academy of Medical Sciences, our
Image Semantics group (ImageSem) participated in the concept detection task of
ImageCLEFmed Caption 2019 and submitted 10 runs to the final evaluation.
    Fig. 1 shows the workflow and submissions of ImageSem in ImageCLEFmed Caption
2019. After data analysis and preprocessing, we applied two kinds of concept detection
methods. First, following our previous work in the ImageCLEFcaption 2018 task [6], we
applied the transfer learning-based multi-label classification model to the overall
training set to predict high-frequency concepts. Second, we proposed a two-stage
medical concept detection strategy. Specifically, for a given medical image, a
pre-classification model was used to determine which body part the image belongs to,
and concepts were then predicted by the corresponding multi-label classification model.
Finally, we collected useful concepts using different concept selection strategies.

          Fig. 1. Workflow of ImageSem at the ImageCLEFmed Caption 2019 Task

This paper is organized as follows. Section 2 analyses the concept detection data set of
the ImageCLEFmed Caption 2019 task and describes our data preprocessing. Section 3
presents the methods for concept detection. Section 4 lists all of our submitted runs.
Section 5 concludes the paper.

2      Data

2.1    Data analysis
The ImageCLEFmed Caption 2019 task provides a subset of the Radiology Objects in
COntext (ROCO) dataset [7]. To focus on radiology images and non-compound figures,
automatic filtering with deep learning systems as well as manual revisions were
applied, reducing the dataset to 70,786 radiology images of several medical imaging
modalities. The dataset is further divided into a training set (56,629 images) and a
validation set (14,157 images). In the concept detection task, a set of CUIs was
provided for each image, for a total of 5,528 annotated concepts (CUIs). Table 1 shows
the concept distribution in the overall dataset, and Table 2 shows the top-ranked
concepts in the training set. The high-frequency concepts (those in the frequency bins
from 10 upward in Table 1) account for most (about 97.7%) of all concept occurrences in
the dataset.

2.2    Data preprocessing
Selecting concepts and images for multi-label classification models. Considering the
uneven concept distribution in Table 1, we define the problem of detecting
high-frequency concepts from medical images as a multi-label classification task. For
training the multi-label classification model, we selected the 87 CUIs that appeared in
more than 1,000 medical images, the 548 CUIs that appeared in 100 to 1,000 images, and
the 1,263 CUIs that appeared in 10 to 100 images. We then extracted all the medical
images containing high-frequency CUIs from the training set and constructed the
corresponding subsets, namely F1000, F100, and F10.
For each medical image, we filtered out low-frequency
    Backtracking semantic types of CUIs and manual image annotation for
pre-classification. To pre-classify medical images by body part, we backtracked the
semantic types of all CUIs in the UMLS and selected useful type unique identifiers
(TUIs) for automatically assigning images to body parts. For example, T023 (“Body Part,
Organ, or Organ Component”) covers multiple body-related CUIs. Concepts with T023 were
automatically extracted and manually assigned to the corresponding body parts. We then
extracted the images annotated with the pre-defined concepts of each category and
manually checked each image subset.
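
The frequency-based selection of concepts and images described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `annotations` mapping, the function name `build_subset`, the toy image ids, and the inclusive lower bound (`>= min_count`) are all hypothetical choices, and the CUIs are borrowed from Table 2 purely as sample values.

```python
from collections import Counter


def build_subset(annotations, min_count):
    """Keep only CUIs that occur in at least `min_count` images.

    `annotations` maps an image id to the list of CUIs annotated for it.
    Returns (kept_cuis, subset); `subset` contains only the images that
    still have at least one label after low-frequency CUIs are removed.
    """
    # Count in how many images each CUI appears (once per image).
    freq = Counter(cui for cuis in annotations.values() for cui in set(cuis))
    kept = {cui for cui, n in freq.items() if n >= min_count}

    subset = {}
    for image_id, cuis in annotations.items():
        labels = [c for c in cuis if c in kept]
        if labels:  # drop images left without any label
            subset[image_id] = labels
    return kept, subset


# Toy data: hypothetical image ids, sample CUIs taken from Table 2.
toy = {
    "img1": ["C0441633", "C0043299"],
    "img2": ["C0441633"],
    "img3": ["C0441633", "C0817096"],
    "img4": ["C0817096"],
    "img5": ["C0221198"],
}
kept_cuis, toy_subset = build_subset(toy, min_count=2)
```

With real data, calling the function with thresholds such as 1000, 100 and 10 would yield subsets in the spirit of F1000, F100 and F10; whether those subsets are cumulative or disjoint is not stated in the text, so the cumulative reading here is an assumption.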

            Table 1. Statistics of the concepts from the training set and the validation set.

    Frequency    Number of CUIs    Proportion of CUIs    Occurrences    Proportion of occurrences
    0-10                   3630                65.67%           9987                        2.31%
    10-100                 1263                22.85%          45630                       10.54%
    100-1000                548                 9.91%         173472                       40.09%
    1000+                    87                 1.57%         203664                       47.06%
    Total                  5528               100.00%         432753                      100.00%
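
As a quick arithmetic check, the proportions in Table 1 and the roughly 97.7% coverage figure quoted in Section 2.1 can be re-derived from the raw counts. The snippet below is only a sketch: the counts are copied from Table 1, while the variable and function names are illustrative.

```python
# Counts copied from Table 1: (frequency bin, number of CUIs, occurrences).
rows = [
    ("0-10", 3630, 9987),
    ("10-100", 1263, 45630),
    ("100-1000", 548, 173472),
    ("1000+", 87, 203664),
]
total_cuis = sum(n for _, n, _ in rows)
total_occ = sum(o for _, _, o in rows)


def pct(part, whole):
    """Percentage rounded to two decimals, as reported in Table 1."""
    return round(100.0 * part / whole, 2)


# Share of all occurrences covered by the three bins from 10 upward.
covered = sum(o for name, _, o in rows if name != "0-10")
coverage_pct = pct(covered, total_occ)

for name, n, o in rows:
    print(f"{name:>9}  {pct(n, total_cuis):6.2f}%  {pct(o, total_occ):6.2f}%")
print(f"coverage of bins from 10 upward: {coverage_pct}%")
```

The three upper bins together cover 422,766 of the 432,753 occurrences, which is where the "about 97.7%" figure comes from.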

                     Table 2. Top 10 high-frequency concepts in the training set.

       CUI                Associated Image                           UMLS Term
    C0441633                     8425                             diagnostic scanning
    C0043299                     7906                               x-ray procedure
    C1962945                     7902                                   radiogr
    C0040395                     7697                                   tomogr
    C0034579                     7564                                 pantomogr
    C0817096                     7470                                   thoracics
    C0040405                     7164                    x-ray computer assisted tomography
    C1548003                     6428                                 radiograph
    C0221198                     5678                                visible lesion
    C0772294                     5677                                   alesion

3          Methods

In the ImageCLEFcaption 2018 task, we applied two methods to identify multiple con-
cepts for a specific image, including the transfer learning-based multi-label classifica-
tion model and the image retrieval-based topic model [6]. The experimental results in-
dicated that the transfer learning-based multi-label classification method was robust on
high-frequency concept detection across different data sets, while the image retrieval-
based topic models identified the high-frequency concepts and low-frequency concepts
at the same time, but depended heavily on the quality of the retrieved images.
    In the ImageCLEFmed Caption 2019 task, we continued to use the transfer
learning-based multi-label classification model to identify high-frequency concepts. In
addition, we paid more attention to how labels differ between images of different body
parts, and classified medical images by body part before the concept detection process.

3.1    Transfer learning-based multi-label classification
The problem of detecting high-frequency concepts from medical images was viewed as a
multi-label classification task, and Convolutional Neural Networks (CNNs) were employed
to assign one or multiple CUIs to a specific medical image. We used Inception-V3 [8]
and ResNet152 [9], both pre-trained on the ImageNet dataset, which includes 1.2 million
images with more than 1,000 common object classes [10]. The fully-connected layer
before the last softmax layer was replaced, and the parameters of the pre-trained CNN
model were transferred as the initial parameters of our multi-label classification
model.
   During the training process, we selected the 87 CUIs that appeared in more than 1,000
medical images in the training set as high-frequency labels, and collected the
corresponding medical images from the training set as the F1000 subset. We then
fine-tuned the network weights layer by layer and adjusted parameters based on the
validation set. For a given test image, the top N concepts whose predicted probability
exceeded the threshold were selected as the predicted labels.
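
The label selection rule at the end of this subsection (top N concepts, restricted to those above a threshold) might look like the sketch below. It assumes the model output is available as a mapping from CUI to predicted probability; the function name `select_concepts`, its parameters, and the example scores are hypothetical.

```python
def select_concepts(scores, top_n=None, threshold=None):
    """Pick predicted CUIs from a {cui: probability} mapping.

    Concepts are ranked by probability. Those not strictly above
    `threshold` (if given) are dropped, then at most `top_n` (if given)
    of the remaining concepts are kept.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(cui, p) for cui, p in ranked if p > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [cui for cui, _ in ranked]


# Hypothetical per-concept probabilities for one test image.
example_scores = {
    "C0441633": 0.91,
    "C0043299": 0.40,
    "C0817096": 0.15,
    "C0040405": 0.04,
}
picked = select_concepts(example_scores, top_n=3, threshold=0.1)
```

Leaving `top_n` or `threshold` as `None` gives the pure threshold or pure top-N variants that the run descriptions in Section 4 refer to.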

3.2    Medical image pre-classification based on body parts
By observing the radiology images of the ROCO dataset from the ImageCLEFmed Caption
2019 task and analyzing the semantic types of some CUIs, we were inspired to group the
images into different categories based on body parts.
    First, we defined four body part-related categories based on a diagnostic atlas for
medical image reading [11]: “abdomen”, “chest”, “head and neck” and “skeletal muscle”.
Second, we grouped concepts in the training set according to their semantic type; e.g.,
concepts with the TUI T023 (Body Part, Organ, or Organ Component) or T029 (Body
Location or Region) were automatically extracted and assigned to the corresponding
categories. Third, a portion of the medical images with annotated concepts in the
training set were classified into the categories. We manually double-checked the
images assigned to each category and created four body part-based image-concept
subsets. Finally, we employed the AlexNet [12] model to automatically classify the rest
of the medical images in the training set, as well as the validation set and the test
set, into the categories; it achieved a best accuracy of 84.73% on the validation set.
We also tried other networks for pre-classification, such as ResNet152 and Inception
V3; however, the more complex network structures showed no significant advantage in
classification performance. Table 3 shows the distribution of medical images across the
body part categories. We could then train a separate multi-label classification model
for each medical image category.

       Table 3. Statistics of medical images pre-classified into different body part categories.

      Dataset               Abdomen     Chest     Head and neck     Skeletal muscle     Total
      Manually annotated       7546      5406              6000                4000     22952
      Training                19430     12458             15445                9296     56629
      Validation               4802      3040              4003                2312     14157
      Test                     3578      2277              2607                1538     10000

3.3       Two-stage medical concept detection
Building on the above work, we proposed a two-stage medical concept detection model.
For a given medical image, the pre-classification step first determines which body part
the image belongs to. Multiple labels are then predicted by the corresponding
multi-label classification model, for which we used Inception V3 in this study.
Different concept selection strategies were applied to different categories, such as
restricting the labels to concepts with frequency higher than 100, outputting the top N
concepts, or outputting the concepts with a score above a specific threshold. We then
combined the best outputs of the different categories, which involved many
combinations.
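
The two-stage flow can be sketched as a small routing function. This is an illustration under assumptions rather than the authors' implementation: the body-part classifier and the per-category concept models are replaced by stub callables, and the names `two_stage_detect`, `apply_strategy`, the strategy tuples, and the example scores are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
# A strategy is ("top_n", N) or ("threshold", t), mirroring the two kinds
# of per-category selection rule mentioned in the text.
Strategy = Tuple[str, float]


def apply_strategy(scores: Scores, strategy: Strategy) -> List[str]:
    """Select CUIs from `scores` according to one selection strategy."""
    kind, value = strategy
    ranked = sorted(scores, key=scores.get, reverse=True)
    if kind == "top_n":
        return ranked[: int(value)]
    if kind == "threshold":
        return [cui for cui in ranked if scores[cui] > value]
    raise ValueError(f"unknown strategy kind: {kind}")


def two_stage_detect(
    image,
    body_part_classifier: Callable,
    concept_models: Dict[str, Callable],
    strategies: Dict[str, Strategy],
) -> Tuple[str, List[str]]:
    # Stage 1: route the image to a body-part category.
    category = body_part_classifier(image)
    # Stage 2: score concepts with that category's multi-label model,
    # then apply the category-specific selection strategy.
    scores = concept_models[category](image)
    return category, apply_strategy(scores, strategies[category])


# Stubs standing in for trained networks (hypothetical outputs).
stub_classifier = lambda image: image["hint"]
stub_models = {
    "chest": lambda image: {"C0817096": 0.8, "C0043299": 0.3, "C0040405": 0.1},
    "abdomen": lambda image: {"C0040405": 0.7, "C0441633": 0.6, "C0043299": 0.2},
}
stub_strategies = {"chest": ("threshold", 0.2), "abdomen": ("top_n", 2)}

category, concepts = two_stage_detect(
    {"hint": "chest"}, stub_classifier, stub_models, stub_strategies
)
```

Keeping the strategies in a per-category dictionary makes it easy to try different combinations, since each combination is just a different mapping from category to strategy.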

4         Submitted Runs

We submitted the following 10 runs of concept detection to the ImageCLEFmed Cap-
tion 2019 task (see Table 4):

      Table 4. Submission runs by the ImageSem group in ImageCLEFmed Caption 2019 task

                Submission Run                         Rank overall            Mean F1 Score
                  F1TOP1.txt                                8                   0.2235690
                  F1TOP2.txt                                9                   0.2227917
               F1TOP5_Pmax.txt                             10                   0.2216225
                  F1TOP3.txt                               11                   0.2190201
             07Comb_F1Top1.txt                             12                   0.2187337
               F1TOP5_Rmax.txt                             13                   0.2147437
               08Comb_Pmax.txt                             18                   0.1912173
            09Comb_Rmax_new.txt                            40                   0.1121941
         yu_1000_inception_v3_top6.csv                     52                   0.0009450
          yu_1000_resnet_152_top6.csv                      53                   0.0008925
Run1 (F1TOP1): This submission employed the two-stage concept detection strategy, in
which medical images were first pre-classified into different body parts using AlexNet,
and multiple concepts were then predicted for the given image using the multi-label
classification model trained on the corresponding image subset. The max epoch was set
to 30 and the learning rate was set to 0.001. Concepts with frequency above 100 in the
training set were used as the training labels. For a test image classified into the
“abdomen” or the “chest” subset, we output the top 7 concepts of the corresponding
multi-label classification model; for an image in the “head & neck” or the “skeletal
muscle” subset, we output the top 5 concepts. Finally, we combined all of the selected
concepts as the overall results.
Run2 (F1TOP2): The same training process as F1TOP1, except that we selected the top 5
concepts for images belonging to the “abdomen” subset, concepts with a score above 0.2
for the “chest” subset, the top 7 concepts for the “head & neck” subset, and concepts
with a score above 0.1 for the “skeletal muscle” subset.
Run3 (F1TOP5_Pmax): The same training process as F1TOP1, except that we selected the
top 5 concepts for images belonging to the “abdomen”, the “chest” and the “head & neck”
subsets, and the top 3 concepts for the “skeletal muscle” subset.
Run4 (F1TOP3): The same training process as F1TOP1, except that we selected the top 10
concepts for images belonging to the “abdomen” subset, the top 5 concepts for the
“chest” subset, concepts with a score above 0.1 for the “head & neck” subset, and the
top 7 concepts for the “skeletal muscle” subset.
Run5 (07Comb_F1Top1): The same training process as F1TOP1, except that we selected the
top 7 concepts for images belonging to the “abdomen” subset, concepts with a score
above 0.3 for the “chest” subset, the top 5 concepts for the “head & neck” subset, and
concepts with a score above 0.25 for the “skeletal muscle” subset.
Run6 (F1TOP5_Rmax): The same training process as F1TOP1, except that we selected the
top 10 concepts for images belonging to the “abdomen” subset, concepts with a score
above 0.1 for the “chest” subset, the top 7 concepts for the “head & neck” subset, and
the top 10 concepts for the “skeletal muscle” subset.
Run7 (08Comb_Pmax): The same training process as F1TOP1, except that we selected the
top 3 concepts for images belonging to the “abdomen”, the “chest”, the “head & neck”
and the “skeletal muscle” subsets. This combination of parameters achieved the highest
precision in our validation experiments.
Run8 (09Comb_Rmax_new): The same training process as F1TOP1, except that we selected
the concepts with a score above 0.05 for images belonging to the “abdomen”, the
“chest”, the “head & neck” and the “skeletal muscle” subsets. This combination of
parameters achieved the highest recall in our validation experiments.
Run9 (yu_1000_inception_v3_top6): This submission used the transfer learning-based
multi-label classification method with the Inception V3 model pre-trained on the
ImageNet dataset. The batch size was set to 20, the max epoch was set to 30 and the
learning rate was set to 0.003. The 87 concepts with frequency above 1,000 in the
training set were used as the training labels, and the top 6 concepts were output for
each test image.
Run10 (yu_1000_resnet_152_top6): This submission employed transfer learning-based
concept detection using the ResNet152 model pre-trained on the ImageNet dataset. The
batch size was set to 20, the max epoch was set to 30 and the learning rate was set to
0.003. As in Run9, the 87 concepts with frequency above 1,000 in the training set were
used as the training labels, and the top 6 concepts were output for each test image.

Fig. 2. An example of concept detection from the validation set of the ImageCLEFmed Caption
 2019 task. The GT concepts were ground truth provided by the ImageCLEF organizers, while
      the Predict Concepts were results of our two-stage medical concept detection model.

Fig. 2 shows an example of concept detection from the validation set of the
ImageCLEFmed Caption 2019 collection. Four of the predicted concepts (in red) matched
the ground truth concepts, while the unmatched concept (C0102410983129; Lung) was also
meaningful for the given image. The good data quality, as well as the
pre-classification based on body parts, contributed to the improved performance in
detecting semantic concepts from large-scale medical images. In summary, we achieved an
F1 score of 0.2235 and ranked 8th in the overall submission results, but there is still
considerable room for improvement in further research.
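
Table 4 reports mean F1 scores. As a rough sketch of how such a score can be computed, the code below assumes a set-based F1 per image, averaged over all images; the organizers' actual evaluation script may treat edge cases differently, and the function names, placeholder concept ids, and the ground-truth size in the example are hypothetical.

```python
def image_f1(predicted, ground_truth):
    """Set-based F1 between predicted and ground-truth CUIs for one image."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted and not ground_truth:
        return 1.0  # assumption: two empty sets count as a perfect match
    matched = len(predicted & ground_truth)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average the image-level F1 over every image in `references`."""
    scores = [image_f1(predictions.get(i, []), gt) for i, gt in references.items()]
    return sum(scores) / len(scores)


# Hypothetical case: 5 predicted concepts, 4 of them among 6 ground-truth ones.
pred = ["a", "b", "c", "d", "x"]
truth = ["a", "b", "c", "d", "e", "f"]
example_f1 = image_f1(pred, truth)  # precision 4/5, recall 4/6, F1 = 8/11

overall = mean_f1({"img1": pred, "img2": ["x"]}, {"img1": truth, "img2": ["y"]})
```

In this hypothetical case a prediction with four matches out of five still scores only about 0.73, because the two missed ground-truth concepts lower recall.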

5      Conclusions

This paper presents the participation of the Image Semantics group (ImageSem) in the
ImageCLEFmed Caption 2019 task. We submitted 10 runs to the concept detection task.
Multiple concepts were identified for interpreting medical images with the two-stage
concept detection strategy, which combines medical image pre-classification based on
body parts with transfer learning-based multi-label classification. The evaluation
results showed that we achieved an F1 score of 0.2235, which was superior to our former
result in ImageCLEFcaption 2018. The improvement may be due to the good data quality,
as well as the pre-classification of medical images based on pre-defined body part
categories.
   However, the work of semantic concept detection on large-scale open medical im-
ages still needs further research, and we will try to seek more useful semantic clues
from external labelled data.

6      Acknowledgement

This study was supported by the Non-profit Central Research Institute Fund of Chinese
Academy of Medical Sciences (Grant No. 2018-I2M-AI-016, Grant No. 2017PT63010
and Grant No. 2018PT33024); the National Natural Science Foundation of China
(Grant No. 81601573) and the Fundamental Research Funds for the Central Universi-
ties (Grant No. 3332018153).

