 ImageSem Group at ImageCLEFmed Caption 2021 Task:
 Exploring the Clinical Significance of the Textual Descriptions
 Derived from Medical Images
 Xuwen Wang 1, Zhen Guo 1, Chunyuan Xu 2, Lianglong Sun 1 and Jiao Li 1*
1 Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100020, China
2 School of Life Science, Beijing Institute of Technology, Beijing, 100081, China

Abstract
This paper presents the work of the ImageSem group in the ImageCLEFmed Caption 2021 task. In the concept detection subtask, we employed a transfer learning-based multi-label classification (MLC) model as our baseline. We also trained multiple fine-grained MLC models based on manually annotated semantic categories, namely Imaging Type, Anatomic Structure, and Findings, which may reveal clinical insights into radiology images. We submitted 9 runs to the concept detection subtask and achieved an F1 score of 0.419, ranking 3rd on the leaderboard. In the caption prediction subtask, our first method simply combines detected concepts according to sentence patterns; the second method uses a dual-path CNN model to match images and captions. We submitted 4 runs to the caption prediction subtask and achieved a BLEU score of 0.257, ranking 6th among the participating teams.

Keywords
Concept detection, caption prediction, multi-label classification, fine-grained semantic labelling

 1. Introduction
The medical track of ImageCLEF [1] aims at promoting research on computer-aided radiology image analysis and interpretation. ImageCLEFmed Caption 2021 [2] is one of the ImageCLEFmedical tasks, which focuses on mapping the visual information of radiology images to textual descriptions. It consists of two subtasks, namely Concept Detection and Caption Prediction. On behalf of the Institute of Medical Information and Library, Chinese Academy of Medical Sciences, our Image Semantics group (ImageSem) participated in both subtasks.
The concept detection subtask aims to identify the UMLS [3] Concept Unique Identifiers (CUIs) for a given radiology image. Following our previous work in ImageCLEF 2019 [4], we employed transfer learning-based multi-label classification (MLC) [5, 6] as our first method for modeling all the concepts in the training set. In order to annotate each image with more meaningful concepts, we manually classified the concepts into three categories according to their UMLS semantic types, namely Imaging Type, Anatomic Structure, and Findings. We then trained MLC sub-models separately for the different concept categories as our second method.
The caption prediction subtask asks participants to generate a coherent caption for a given image as a whole, which requires higher accuracy and better semantic interpretability. We also employed two methods for caption prediction. The first was the pattern-based combination of concepts identified in the previous subtask. The second was based on the dual-path CNN model [7], which



is commonly used in the image-text retrieval field to match images and captions for instance-level retrieval.
This paper is organized as follows. Section 2 describes the dataset of the ImageCLEFmed Caption 2021 task. Section 3 presents the methods for concept detection and caption prediction. Section 4 lists all of our submitted runs. Section 5 concludes with a brief summary.


 2. Dataset
The ImageCLEFmed Caption 2021 task is in its 5th edition this year. Compared with previous years, the released images were strictly limited to radiology, and the numbers of images and associated UMLS concepts were reduced: there were 222,314 images with 111,156 concepts in 2018 [8], 70,786 radiology images with 5,528 concepts in 2019 [9], 80,747 radiology images with 3,047 concepts in 2020 [10], and 3,256 radiology images with 1,586 concepts and 3,256 captions in 2021. Another improvement is that the validation and test sets contain real radiology images annotated by medical doctors, which increases the medical relevance of the UMLS concepts. On the one hand, the reduced concept scope and vocabulary size lower the difficulty of concept identification; on the other hand, the much smaller number of images is not conducive to training large-scale neural networks.
The organizers provided UMLS concepts along with their imaging modality information for training purposes. We observed that most images were assigned concepts indicating the diagnostic procedure or medical device, while some images were accompanied by concepts indicating the body part, organ, or clinical findings. As shown in Table 1, the high-frequency concepts are concentrated in a few specific semantic types. We exploited this characteristic and manually defined three concept categories for building fine-grained multi-label classification models.

Table 1 High-frequency concepts in the training and validation sets of the ImageCLEFmed Caption 2021 task.
  CUI        #Num  Term String                                      TUI   Semantic Type
  C0040398   1400  Tomography, Emission-Computed                    T060  Diagnostic Procedure
  C0024485    796  Magnetic Resonance Imaging                       T060  Diagnostic Procedure
  C1306645    627  Plain x-ray                                      T060  Diagnostic Procedure
  C0041618    373  Ultrasonography                                  T060  Diagnostic Procedure
  C0009924    283  Contrast Media                                   T130  Indicator, Reagent, or Diagnostic Aid
  C0577559    274  Mass of body structure                           T033  Finding
  C0002978    119  angiogram                                        T060  Diagnostic Procedure
  C0221198    108  Lesion                                           T033  Finding
  C1322687    107  Endoscopes, Gastrointestinal Tract, Upper Tract  T074  Medical Device
  C0205400     92  Thickened                                        T033  Finding
  C1881358     78  Large Mass                                       T033  Finding
  C0202823     60  Chest CT                                         T060  Diagnostic Procedure
  C0005910     59  Body Weight                                      T032  Organism Attribute
  C0150312     55  Present                                          T033  Finding
  C0180459     53  Disks (device)                                   T073  Manufactured Object
  C0003617     52  Appendix                                         T023  Body Part, Organ, or Organ Component
  C0228134     50  Spinal epidural space                            T030  Body Space or Junction
  C0016658     47  Fracture                                         T037  Injury or Poisoning
  C0005889     47  Body Fluids                                      T031  Body Substance
  C0227613     47  Right kidney                                     T023  Body Part, Organ, or Organ Component
 3. Methods
This section describes the methods we used in the two subtasks. Figure 1 shows the workflow and submissions of ImageSem in the ImageCLEFmed Caption 2021 task.




               Figure 1: Workflow of ImageSem in the ImageCLEFmed Caption 2021 task


3.1.     Concept detection
In the concept detection subtask, on the one hand, we employed the transfer learning-based multi-label classification model to identify concepts over the full label set; on the other hand, we paid particular attention to distinguishing labels of different semantic types, focusing on three major categories of concepts that may reveal clinical insights into radiology images.


3.1.1. Transfer learning-based multi-label classification
In our previous work, we employed a transfer learning-based multi-label classification model to assign multiple CUIs to a given medical image. This is a classic approach when the label set is limited and the concepts occur with high frequency. In our first method, for modeling the overall concepts, we applied Inception-V3 [5] and DenseNet-201 [6], which were pre-trained on the ImageNet dataset [11]. The fully connected layer before the last softmax layer was replaced, and the parameters of the pre-trained CNN model were transferred as the initial parameters of our MLC model.
During the training process, we collected 1,586 CUIs from the training and validation sets as our labels. We then fine-tuned the models on the validation set. For a given test image, concepts with predicted probabilities above a threshold were selected as the prediction labels; a minimal sketch of this setup is shown below. Empirically, we adjusted the threshold gradually from 0.1 to 0.7 on the basis of the validation set.
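The following sketch illustrates the transfer-learning MLC setup, assuming PyTorch/torchvision; the paper does not name a framework, so the identifiers below are illustrative rather than the authors' actual code.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 1586  # CUIs collected from the training and validation sets

# Load Inception-V3 pre-trained on ImageNet and transfer its parameters.
model = models.inception_v3(pretrained=True)
# Replace the final fully connected layer with a multi-label head.
model.fc = nn.Linear(model.fc.in_features, NUM_CONCEPTS)

# Multi-label classification: an independent sigmoid per concept, trained with
# binary cross-entropy over logits instead of a softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def predict_cuis(logits, threshold=0.3):
    """Select label indices whose probability exceeds the threshold
    (tuned between 0.1 and 0.7 on the validation set)."""
    probs = torch.sigmoid(logits)
    return (probs > threshold).nonzero(as_tuple=True)[0].tolist()
```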


3.1.2. Fine-grained multi-label classification
In this method, according to the UMLS semantic types, we further divided the ImageCLEF concepts into four semantic categories, namely Imaging Type (IT), Anatomic Structure (AS), Findings (FDs), and Others. Based on the official training and validation sets, we re-processed the images and associated concepts via our medical image annotation platform.
As shown in Figure 2, for a given radiology image there are three sources of related concepts. The first is the ImageCLEF concepts annotated by concept extraction tools and medical doctors. These concepts are semantically related but often incomplete, since many images have only one concept. The second source consists of concepts automatically extracted from the image captions using the MetaMap tool [12] together with UMLS 2020AB. These concepts are more comprehensive but also introduce noisy terms. The third source is a set of expanded concepts that we summarized manually from the high-frequency ImageCLEF concepts, for convenience of labelling.
We invited graduate students majoring in medical imaging to label images with reference to the visual information, the caption descriptions, and the above three sources of concepts. Under the labelling protocol, each radiology image was assigned at least one IT label, zero or more AS labels, and zero or more FDs labels. ImageCLEF concepts that were difficult to classify into the above categories were assigned to 'Others'.
We then built three image-concept sub-collections for training the fine-grained MLC models. These collections share the same training and validation images but differ in their associated concepts. Table 2 shows the distribution of the concept categories.
We evaluated our MLC models on the re-annotated validation set. The experimental results showed that our model performs well on the prediction of Imaging Type labels, with an F1 score of 0.9273. However, the predictions for the other two kinds of labels were far from satisfactory. One possible reason is that there are too few images relative to the number of labels. It is intuitively understandable that images of the same or similar cases would share similar Anatomic Structure or Findings labels, whereas the data of this subtask are not organized around specific diseases, which makes it difficult to predict the correct body part, organ, or findings. A sketch of how the per-category predictions can be merged into one concept set is given below.
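An illustrative sketch (not the authors' exact implementation) of how the three fine-grained sub-models could be merged into a single concept set per image, as in the combined runs of Section 4; predict_probs is a stand-in for a trained sub-model's inference function, and the per-category thresholds are assumed to be tuned on the validation set.

```python
def merge_predictions(image, sub_models, thresholds):
    """sub_models and thresholds are dicts keyed by category:
    'IT' (Imaging Type), 'AS' (Anatomic Structure), 'FDs' (Findings)."""
    cuis = set()
    for category, predict_probs in sub_models.items():
        probs = predict_probs(image)      # dict: CUI -> predicted probability
        t = thresholds[category]          # tuned per category on validation data
        cuis.update(cui for cui, p in probs.items() if p > t)
    return cuis
```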




       Figure 2: Process of manual re-annotation and fine-grained MLC model training and validation

Table 2 Distribution of concepts from different semantic categories
  Category            #Concepts  Concept Sample
  Imaging Type             99    C0040398 Tomography, Emission-Computed
  Anatomic Structure      786    C0228134 Spinal epidural space
  Findings                854    C0577559 Mass of body structure


3.2.      Caption Prediction

3.2.1. Pattern-based caption generation
For generating reasonable image captions, the first method was the pattern-based combination of concepts identified in the previous subtask. We designed simple sentence patterns based on the characteristics of the captions in the training and validation sets (see Table 3; a sketch of the filling procedure follows the table). Obviously, the accuracy of the concept detection results directly determines the quality of the generated sentences.

Table 3 Sentence patterns for caption generation
  Pattern                                                    Sample
  <IT> of <AS> demonstrates/shows/suggests <FDs>             synpic24243: Sagittal T1-weighted image of the
                                                             cervical spine demonstrates cord expansion.
  <IT> demonstrates/shows/suggests <FDs> in/of/within <AS>   synpic19193: Lateral radiograph of the skull shows
                                                             lytic lesions in the temporoparietal region.
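A minimal sketch of filling the first pattern in Table 3 with detected concepts; the template wording and helper names are assumptions for illustration only.

```python
def generate_caption(it_labels, as_labels, fd_labels):
    """Instantiate "<IT> of <AS> shows <FDs>" from detected concept labels."""
    caption = it_labels[0] if it_labels else "Image"
    if as_labels:
        caption += " of the " + " and ".join(as_labels)
    if fd_labels:
        caption += " shows " + ", ".join(fd_labels)
    return caption + "."

# Example:
# generate_caption(["Sagittal T1-weighted image"], ["cervical spine"], ["cord expansion"])
# returns "Sagittal T1-weighted image of the cervical spine shows cord expansion."
```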



3.2.2. Image matching for caption prediction
In this method, we employed an algorithm commonly used in the image-text retrieval field to match images and captions for instance-level retrieval. It is based on the unsupervised assumption that every image/text group can be viewed as one class, so each class contains 1+m samples (1 image vs. m descriptions).
We used the model proposed by Zheng et al. [7], which contains two convolutional neural networks that learn visual and textual representations simultaneously. At test time, we first extract the image feature with the image CNN and the text feature with the text CNN, and then use the cosine similarity to evaluate how well the image matches each candidate sentence.
● Data Preparation
In this field, most existing works use two generic retrieval datasets (Flickr30k and MSCOCO), which contain more than 30,000 images, each annotated with around five sentences. We therefore expanded the captions from one to five sentences per image through back-translation: each caption was first translated into Chinese, Japanese, German, and French, and then translated back into English. We used the GoogleNews word2vec model trained by Google, which contains 2,000,000 words, to build our dictionary. The resulting dictionary has 6,039 words, each associated with a 1×300 vector.
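The dictionary construction can be sketched as follows, assuming the gensim library; the vocabulary-filtering details are our assumption, as the paper does not spell them out.

```python
from gensim.models import KeyedVectors

# Pre-trained GoogleNews word2vec model with 300-dimensional vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def build_dictionary(corpus_tokens):
    """Keep only the corpus words covered by the pre-trained vectors,
    mapping each word to its 1x300 embedding."""
    vocab = sorted({tok for tok in corpus_tokens if tok in w2v})
    return {word: w2v[word] for word in vocab}
```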
● Train
Given a sentence, we convert it into a matrix T of size n × d, where n is the length of the sentence and d denotes the size of the dictionary; T is used as the input of the text CNN. Given an image, we resize it and randomly crop it to 224 × 224 pixels.
The training process includes two stages. In the first stage, we use the instance loss to learn fine-grained differences between intra-modal samples with similar semantics. In the second stage, we use the ranking loss to focus on the distance between the two modalities and to build the relationship between image and text; a sketch of such a ranking objective is given below.
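An illustrative sketch of the second-stage ranking objective, written as a bidirectional triplet loss over a batch of matched (image, text) pairs; the margin value and function names are assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_feat, txt_feat, margin=0.2):
    """Rows of img_feat and txt_feat are matched pairs; the off-diagonal
    entries of the similarity matrix serve as in-batch negatives."""
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)
    sim = img @ txt.t()                 # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)       # positives on the diagonal
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(eye, 0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(eye, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```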
● Test
In this experiment, we used 16,280 sentences from the training and validation sets as candidate captions; each sentence corresponds to a text feature extracted by the text CNN. For each test image, we first extract the image feature with the image CNN and then rank the candidate sentences by the cosine similarity between the image feature and each candidate text feature, as sketched below.
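A minimal sketch of this retrieval step, assuming NumPy; the feature arrays stand in for the outputs of the image CNN and text CNN described above.

```python
import numpy as np

def retrieve_caption(image_feature, candidate_features, candidate_captions):
    """Return the candidate caption whose text feature has the highest
    cosine similarity to the query image feature."""
    img = image_feature / np.linalg.norm(image_feature)
    txt = candidate_features / np.linalg.norm(
        candidate_features, axis=1, keepdims=True)
    scores = txt @ img                  # one similarity score per candidate
    return candidate_captions[int(np.argmax(scores))]
```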
When we used the model trained on the ImageCLEF dataset, we obtained almost the same top-10 sentences from the 16,280 candidate captions for every query, because the caption features learned by the text CNN are not discriminative. When we instead tested the model trained on the MSCOCO dataset, each query image retrieved different sentences, but these did not match the images either.


 4. Submitted runs
Table 4 presents the 9 runs we submitted to the concept detection subtask, along with the official rankings. We take the Inception-V3 model trained on the overall concepts as the baseline. We also submitted the concepts of the three semantic categories predicted by the sub-MLC models, either as single categories or combined with the baseline results. To our surprise, the Imaging Type concepts alone achieved the best F1 score of 0.419, indicating the high precision and coverage of this kind of concept in radiology images. Concepts from the other categories and the baseline results introduced more unmentioned terms and reduced the overall score. However, in view of our experience with manual labeling, we believe that some unmentioned terms may still help in interpreting a given image. Figure 3 shows two examples of our method on the validation set.
Table 5 shows the 4 runs we submitted to the caption prediction subtask. We take the dual-path CNN model as our baseline, which achieved a BLEU score of 0.137. The pattern-based method achieved a BLEU score of 0.257, which is still far from the readable and interpretable descriptions that doctors need.

 Table 4 Submissions of ImageSem in the concept detection subtask
                              Approach                                             F1 Score Ranking
    03ImagingTypes                                                                    0.419        14
    02Comb_ImagingTypes_Baseline                                                      0.400        16
    07Intersect_06_baseline                                                           0.396        17
    04Comb_ImagingTypes_AnatomicStructure                                             0.370        19
    05Comb_ImagingTypes_MedicalFindings                                               0.355        22
    06Comb_ImagingTypes_AnatomicStructure_Findings                                    0.327        24
    08AnatomicStructure                                                               0.037        28
    09Findings                                                                        0.019        29
    01baseline                                                                        0.380        18
 Table 5 Submissions of ImageSem in the caption prediction subtask
                                     Approach                                                 BLEU
    04pattern1+ImagingTypes_AnatomicStructure_Findings                                        0.203
    05pattern2+ImagingTypes_AnatomicStructure_Findings                                        0.181
    06pattern3+ImagingTypes_AnatomicStructure_Findings                                        0.257
    03baseline_Dual_Path_CNN Model                                                            0.137




       Figure 3: Examples of concepts and captions predicted by ImageSem on the validation set
 5. Conclusions
This paper presents the participation of the ImageSem group in the ImageCLEFmed Caption 2021 task. We tried different strategies for both subtasks. In the concept detection subtask, we used the transfer learning-based MLC model to detect concepts over the full set of 1,586 CUIs. We also trained multiple fine-grained MLC models based on manually annotated semantic categories. One lesson is that we now have a much clearer picture of which concepts are clinically relevant to radiology images; to obtain better predictions, the semantic labels of images should be more focused and specific. Furthermore, how to generate readable descriptions from clear and clinically meaningful concepts is still worth exploring.


 6. Acknowledgements
This work has been supported by the National Natural Science Foundation of China (Grant No. 61906214) and the Beijing Natural Science Foundation (Grant No. Z200016).


     References
[1] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, D. Demner-Fushman, S. Hasan, M. Sarrouti, O.
     Pelka, C. Friedrich, A. Herrera, J. Jacutprakart, V. Kovalev, S. Kozlovski, V. Liauchuk,Y. Dicente
     Cid, J. Chamberlain, A. Clark, A. Campello, H. Moustahfid, A. Popescu, The 2021 ImageCLEF
     Benchmark: Multimedia Retrieval in Medical, Nature, Internet and Social Media Applications, 2021,
     pp. 616–623.
[2] O. Pelka, C. M. Friedrich, A. Herrera, H. Müller, Overview of the ImageCLEFmed 2021 concept &
     caption prediction task, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-
     WS.org, Bucharest, Romania, 2021.
[3] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology,
     Nucleic Acids Research 32 (2004) 267–270.
[4] Z. Guo, X. Wang, Y. Zhang, J. Li, ImageSem at ImageCLEFmed Caption 2019 task: a two-stage
     medical concept detection strategy, in: CLEF2019 Working Notes, CEUR-WS.org, Lugano,
     Switzerland, 2019.
[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception architecture for
     computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2016, pp. 2818–2826.
[6] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks,
     in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[7] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, Y.-D. Shen, Dual-path convolutional image-text
     embedding with instance loss, ACM Transactions on Multimedia Computing, Communications, and
     Applications 16 (2020) 1–23. doi:10.1145/3383184.
[8] Y. Zhang, X. Wang, Z. Guo, J. Li, ImageSem at ImageCLEF 2018 Caption task: image retrieval and
     transfer learning, in: CLEF2018 Working Notes, CEUR-WS.org, Avignon, France, 2018.
[9] V. Kougia, J. Pavlopoulos, I. Androutsopoulos, AUEB NLP group at ImageCLEFmed Caption 2019,
     in: CLEF2019 Working Notes, CEUR-WS.org, Lugano, Switzerland, 2019.
[10] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, et al., Overview of the ImageCLEF 2020:
     multimedia retrieval in medical, lifelogging, nature, and internet applications, Lecture Notes in
     Computer Science (2020).
[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
     M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge,
     International Journal of Computer Vision 115 (2015) 211–252.
[12] A. R. Aronson, Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap
     program, in: Proceedings of the AMIA Symposium, 2001, pp. 17–21.