NLIP-Essex-ITESM at ImageCLEFcaption 2021 task:
Deep Learning-based Information Retrieval and
Multi-label Classification towards improving
Medical Image Understanding
Janadhip Jacutprakart1, Francisco Parrilla Andrade2, Rodolfo Cuan2,
Arely Aceves Compean2, Giorgos Papanastasiou1 and Alba G. Seco de Herrera1
1 University of Essex, Wivenhoe Park, Colchester CO4 3SQ, United Kingdom
2 Instituto Tecnológico y de Estudios Superiores de Monterrey, Av. Eugenio Garza Sada 2501 Sur, Tecnológico, 64849 Monterrey, N.L., Mexico


                  Abstract
                  This work presents the NLIP-Essex-ITESM team’s participation in the concept detection sub-task of
                  the ImageCLEFcaption 2021 task. We developed methods to predict health outcomes from medical
                  images by processing concepts from radiology reports and their associated medical images. Our aim
                  is to improve medical image understanding and provide sophisticated tools to automate the thorough
                  analysis of multi-modal medical images. In this paper, two deep learning- and k-NN-based methods,
                  a) Information Retrieval and b) Multi-label Classification, were developed and assessed. In addition, a
                  Densenet-121 and an EfficientNet were used to train models and extract imaging features. Our team achieved
                  the second-highest score with the Information Retrieval method (F1-score of 0.469). Further
                  investigations are underway towards improving health outcome predictions from multi-modal
                  medical images. Code and pre-trained models are available at
                  https://github.com/fjpa121197/ImageCLEF2021.

                   Keywords
                   ImageCLEF, image understanding, concept detection, medical image retrieval, Densenet, EfficientNet,
                   k-NN




1. Introduction
This paper presents the contributions of the NLIP-Essex-ITESM team to the ImageCLEFmed
caption 2021 task. The team is composed of the Natural Language and Information Processing
research group1 at the School of Computer Science and Electronic Engineering (CSEE) of the
University of Essex, and the Instituto Tecnológico y de Estudios Superiores de Monterrey. Since
2003, ImageCLEF [1, 2] has held an evaluation campaign as part of the Cross-Language Evaluation
Forum (CLEF), creating a free online resource on topics and subjects related to cross-language
information retrieval.
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" j.jacutprakart@essex.ac.uk (J. Jacutprakart); A00820996@itesm.mx (F. P. Andrade); fitocuan@gmail.com
(R. Cuan); arelyac01@gmail.com (A. A. Compean); g.papanastasiou@essex.ac.uk (G. Papanastasiou);
alba.garcia@essex.ac.uk (A. G. S. d. Herrera)
    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org)
      1
        https://essexnlip.uk/
The ImageCLEFmed caption 2021 task has two sub-tasks: concept detection and caption
prediction. The NLIP-Essex-ITESM team participated in the concept detection sub-task, which
focuses on detecting concepts (i.e., UMLS® Concept Unique Identifiers) in a large corpus of
radiology images. A detailed description of this year’s sub-tasks and data is provided by
Pelka et al. [3].
   For 2021, we propose two methods: one based on Information Retrieval (IR) and the other
based on Multi-label classification (MLC). The IR method used two deep learning models
(Densenet-121 [4] and EfficientNet [5]) whilst the MLC used only a Densenet-121 model.
   In the ImageCLEFcaption 2020 edition [6], the best results were achieved by the AUEB NLP
Group [7]. They examined various Convolutional Neural Network (CNN) models, such as
ConceptCXN and DenseNet121, combined with two different approaches: a feed-forward Neural
Network (FFNN) or k-Nearest-Neighbours (k-NN). In this work, we applied the k-NN technique
based on the AUEB NLP Group’s model and built on our last year’s submission [8] to evaluate
the data and compute the distances. In the IR method, we also implemented the k-NN approach
with various distance metrics to improve the computational time of calculating the distances
between a query image and the indexed images when retrieving a similar image. Additionally,
a new approach using semantic types was implemented in the MLC method.
   We developed and implemented several methods on the training and validation datasets. We
selected the best approaches based on their F1-score evaluation performance. Code and
pre-trained models used in this paper are fully publicly available2 .
   The paper is structured as follows. Section 2 presents the data collections used in this work.
Section 3 details the overall methodology and the two main modelling techniques proposed
in this paper (IR and Multi-label classification), including a detailed description of the runs
submitted to the ImageCLEFmed caption 2021 task. The results are presented in Section 4 and
the conclusions in Section 5.


2. Collection & evaluation
In this work, we used the dataset provided by the ImageCLEFmed caption 2021 task [3]. The
dataset consists of:

    • Training set including 2,756 image-concept pairs;
    • Validation set including 500 image-concept pairs;
    • Test set including 444 images.

   Each image is associated with multiple Unified Medical Language System® (UMLS) Concept
Unique Identifiers (CUIs). The training and validation datasets included 3,256 medical images
together with their associated UMLS CUIs. However, the UMLS CUIs of the test dataset were
not distributed; instead, the ImageCLEFcaption task [3] organisers used the F1-score to evaluate
the submitted runs (see Section 3).




   2
       https://github.com/fjpa121197/ImageCLEF2021
3. Methodology
This work proposes two distinct methods: an Information Retrieval (IR)-based and a Multi-label
Classification (MLC)-based approach. The IR method used two deep learning models (Densenet-
121 and EfficientNet) for image training and feature extraction, whereas the MLC method used
Densenet-121 only. In addition, for the concept selection step of the IR method, we implemented
k-NN with different distance metrics to evaluate the differences between a query image (whose
features were obtained using the same extraction process) and each element of the indexed set.
In both methods, the modelling pipeline takes the preprocessed images as input and produces
the corresponding concepts as output.

3.1. Information Retrieval based approach
This approach is based on how a content-based image retrieval system works. Both deep
learning models implemented were used for optimally extracting imaging features. The overall
approach was separated into three main processes: model training, feature extraction and
concept selection.




Figure 1: Overview of the model training process.



3.1.1. Model Training.
For this process, both the Densenet-121 and EfficientNet models were used, along with pre-trained
(ImageNet) weights, as base models. For Densenet-121, the input image remained at the maximum
input size, 224 × 224; for EfficientNet, following careful performance evaluations across different
input image resolutions, two input sizes were selected and examined further: EfficientNet B0
(224 × 224) and EfficientNet B3 (300 × 300).
   As shown in Figure 1, the process starts by modifying a pre-trained model, Densenet-121
or EfficientNet, replacing the classification layer with a layer whose output corresponds to
the unique concepts (1,585 concepts). In addition, following transfer learning (via fine-tuning)
principles, the pre-trained layers of the base model were kept frozen, allowing only the
classification layer to be trained [11]. By default, the activation function of the last
(classification) layer is sigmoid for Densenet-121 [4] and softmax for EfficientNet [5]. The
classification layer of the model was first trained for 15 epochs. Once the initial training was
over, a portion of the frozen layers was unfrozen, resulting in more trainable layers. The model
was then re-trained for 13 more epochs, resulting in a fine-tuned model that was subsequently
used in the next step of the process (feature extraction).
Figure 2: The architecture of EfficientNet B0 [9]




Figure 3: The architecture of DenseNet-121 with 4 dense blocks and 3 transition layers [10]


The specific parameters used for
training are the following (a minimal code sketch of this setup follows the list):
    • Optimizer: Adam
    • Learning rate: 0.001
    • Validation split: 0.2
    • Batch size: 32
    • Loss function: Binary Cross-Entropy
    • Epochs: 15 for initial training and 13 for the second training phase
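
A minimal Keras sketch of this two-phase fine-tuning setup is given below. The data pipeline,
the unfreezing cut-off and the commented fit calls are illustrative assumptions; the exact
configuration is in our released code.

```python
# Sketch of the two-phase fine-tuning described above (assumptions noted in text).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CONCEPTS = 1585  # unique concepts in the training data

# Phase 1: frozen ImageNet base, train only the new classification layer.
base = tf.keras.applications.DenseNet121(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(NUM_CONCEPTS, activation="sigmoid"),  # multi-label head
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
# model.fit(train_images, train_labels, validation_split=0.2,
#           batch_size=32, epochs=15)

# Phase 2: unfreeze part of the base model and re-train.
base.trainable = True
for layer in base.layers[:-50]:  # the cut-off point is an assumption
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
# model.fit(train_images, train_labels, validation_split=0.2,
#           batch_size=32, epochs=13)
```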

3.1.2. Feature Extraction.
This process consists of using a fine-tuned model to extract features from a set of images. A
fine-tuned model (based on Densenet-121 or EfficientNet) was trained as described in the
previous process. The same preprocessing steps used in training were followed, and the images
were passed through the fine-tuned model. In last year’s (ImageCLEF 2020) participation, the
output of the batch-normalisation layer was used to obtain the image features. However,
following evaluation, we decided that for Densenet-121 the average-pooling layer would be
used to obtain the features. By contrast, EfficientNet used no pooling, the default value of the
model; as a result, the model output is the 4D tensor of the last convolutional layer (sample
size, image width, image height, colour depth). Afterwards, the features were saved in
preparation for the next step (the concept selection process; see the following subsection). A
minimal sketch of this extraction step is shown below.
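
Continuing from the training sketch above, the extraction step can be sketched as follows;
flattening EfficientNet’s 4D output into one vector per image is an assumption.

```python
# Sketch of the feature-extraction step (variable names are illustrative).
import numpy as np

def extract_features(base_model, preprocessed_images):
    """Features from the fine-tuned base model: the 1024-d average-pooling
    output for DenseNet-121, or the flattened 4D tensor for EfficientNet."""
    feats = base_model.predict(preprocessed_images)
    return feats.reshape(len(feats), -1)  # no-op for 2D, flattens 4D outputs

# index_features = extract_features(base, train_images)
# np.save("indexed_features.npy", index_features)  # saved for concept selection
```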
Figure 4: Overview of the feature extraction process.


3.1.3. Concept selection.
For the concept selection process, the features obtained from the feature extraction process
were indexed by their corresponding images. These indexed features served as the database
used to evaluate the differences between a query image (whose features were obtained using
the same extraction process) and each element of the set. The differences were evaluated as
distances computed with the k-NN algorithm. Several experiments were run with different
metrics to find the one that yielded the best results; the distance metrics considered were
Canberra, Cosine and Bray–Curtis [12], the latter a measure of dissimilarity between samples.
Based on the F1-score, Cosine and Bray–Curtis were the best metrics for Densenet-121 and
EfficientNet, respectively. Although this year’s concept selection process resembles our team’s
method from last year [8], a distance comparison followed by the selection of a set of most
similar images, the process was modified to be more efficient, generating results similar to
last year’s in terms of computing time with k-NN. A number of different values of k were
tested, and the best outcome was achieved with k=1, using this year’s dataset only. As a result,
the concepts assigned to the query image correspond to the concepts of the closest indexed
image. Since only one image was retrieved, the ranking process used for last year’s concept
selection was also removed. A minimal sketch of this selection step is shown below.
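
A minimal sketch of this step, assuming the indexed features and concepts are held in memory
as arrays, could look as follows (scikit-learn is used here for illustration):

```python
# Sketch of the k-NN concept selection with k = 1 (names are illustrative).
from sklearn.neighbors import NearestNeighbors

# index_features: (n_images, d) array; index_concepts: list of concept sets
knn = NearestNeighbors(n_neighbors=1, metric="cosine")  # or "canberra"/"braycurtis"
knn.fit(index_features)

def predict_concepts(query_features):
    """Copy the concepts of the single closest indexed image to the query."""
    _, idx = knn.kneighbors(query_features.reshape(1, -1))
    return index_concepts[idx[0, 0]]
```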

3.2. Multi-label Classification
The main characteristic of this approach is that it uses only deep learning to predict the concepts
for an image. Since an image can have multiple concepts assigned to it, the problem becomes a
multi-label classification problem, and consequently a pre-trained Densenet-121 model was
adapted and used for this task. This approach consists of three main processes: concepts file
preprocessing, model training and concepts prediction.

3.2.1. Concepts preprocessing.
In this process, we propose that the concepts given as output in the competition can be
classified by their semantic type (e.g. Diagnostic Procedure and Body Part or Organ).
Figure 5: Overview of the concept selection process.


We used umls-api, a Python package that wraps the UMLS REST API. The UMLS REST API
offers a collection of convenient Uniform Resource Identifier (URI) patterns with JSON output,
providing links to important UMLS entities such as CUIs, atoms, and subsets. For each concept,
queried by its Concept Unique Identifier (CUI), the API returns the following information: the
name from the source vocabulary, URLs referring to its definition and relation(s), the date
added, the semantic type, the status, and Unique Identifiers (UIs). The Python library umls-api3
was used to access the mentioned API; a hedged sketch of such a lookup is shown below.
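
As an illustration only, a direct lookup against the UMLS REST API might look like the sketch
below; the endpoint layout follows the public UTS documentation, while the API-key
authentication and the JSON field names are assumptions (the paper used the umls-api wrapper
instead).

```python
# Hypothetical direct UMLS REST API lookup of a concept's semantic types.
import requests

UTS_BASE = "https://uts-ws.nlm.nih.gov/rest"

def semantic_types(cui, api_key):
    """Return the semantic type names recorded for a UMLS CUI."""
    resp = requests.get(f"{UTS_BASE}/content/current/CUI/{cui}",
                        params={"apiKey": api_key})
    resp.raise_for_status()
    return [st["name"] for st in resp.json()["result"]["semanticTypes"]]

# semantic_types("C0024485", api_key)  # e.g. ["Diagnostic Procedure"] (assumed)
```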




Figure 6: Overview of the preprocessing process.


   In order to optimise the number of labels, the semantic type of each concept in the training
and validation datasets was obtained. As a result, 33 different semantic types were found. The
following list shows the top 5 most frequent semantic types:
   1. Diagnostic Procedure
   2. Body Part, Organ or Organ Component
   3. Finding
   4. Body Location or Region
   5. Disease or Syndrome
  An individual dataset was made for each type based on the top 5 categories, with an extra
dataset for the remaining semantic types (see the sketch below). Each dataset consisted of
medical images as input and, for each image, the related concepts of that specific semantic type.
   3
       The Python library can be found at https://github.com/odwyersoftware/umls-api
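
An illustrative sketch of this split, assuming the image-concept pairs and a CUI-to-semantic-type
mapping are already available, is shown below.

```python
# Sketch of splitting image-concept pairs into per-semantic-type datasets.
from collections import defaultdict

TOP_TYPES = {"Diagnostic Procedure", "Body Part, Organ or Organ Component",
             "Finding", "Body Location or Region", "Disease or Syndrome"}

def split_by_semantic_type(pairs, cui_to_type):
    """pairs: iterable of (image_id, [cui, ...]); cui_to_type: CUI -> type."""
    datasets = defaultdict(lambda: defaultdict(list))  # type -> image -> CUIs
    for image_id, cuis in pairs:
        for cui in cuis:
            stype = cui_to_type.get(cui)
            group = stype if stype in TOP_TYPES else "Other"  # extra dataset
            datasets[group][image_id].append(cui)
    return datasets
```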

3.2.2. Model training.
As previously mentioned, the preprocessing step creates six new concepts files, each one pairing
the images with their corresponding concepts of that semantic type. As a result, six models
were trained, where each model corresponds to a specific semantic type. Nonetheless, the
model training process remains the same for all of them (except for finding the best threshold).
   As illustrated in Figure 7, the process starts by modifying a pre-trained Densenet-121 model
(with ImageNet weights) for which the base layers have been frozen. A classification layer is
added to the model, with N outputs (the number of unique concepts in the corresponding
concepts file). The model (with only the classification layer unfrozen) is trained for a certain
number of epochs, varying according to each semantic type. Then, a particular portion of the
base model is unfrozen, and training resumes until it reaches the maximum number of epochs
(100) or is stopped by the assigned callback. The specific parameters used for training are the
following:

    • Optimiser: Adam
    • Initial Learning rate used only in classification layer: 0.0001
    • Learning rate used in second training: 0.00001
    • Validation split: 0.2
    • Batch size: 32
    • Loss function: Binary Cross-Entropy
    • Epochs: This varies depending on the semantic type (see code).

   Following the two training phases, the output given by the classification layer is a probability
(between 0 and 1) that a concept is present in a given image; a decision threshold therefore
needs to be found using unseen data. After finding the threshold that gives the best F1-score on
unseen data, the model and this threshold are saved for the concept prediction step. A minimal
sketch of this threshold search is shown below.
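
A minimal sketch of this search, assuming a simple grid of candidate thresholds and
micro-averaged F1, is given below.

```python
# Sketch of the per-model decision-threshold search on held-out data.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob):
    """y_true, y_prob: (n_samples, n_concepts) arrays of labels/probabilities."""
    grid = np.arange(0.05, 0.95, 0.05)  # candidate thresholds (an assumption)
    scores = [f1_score(y_true, (y_prob >= t).astype(int),
                       average="micro", zero_division=0) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# threshold, score = best_threshold(val_labels, model.predict(val_images))
```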




Figure 7: Overview of the training process.
3.2.3. Concepts prediction.
For the prediction of the concepts, the query image passes through the same preprocessing and
training method as in IR, but using only the DenseNet-121 model. In this process, the best
threshold found for each model was selected, and only the concepts that passed that threshold
were assigned to the image. In the end, the predictions of each model were merged and used as
the final output (see the sketch below).
   While testing the models and their scores on the validation set, it was noted that only the
Diagnostic Procedure and Body Part or Organ semantic types had a positive effect on the
overall F1-score; therefore, it was decided to include only these two models when predicting
on the test set.
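
An illustrative sketch of this merging step follows; the model/threshold pairs and the per-model
concept vocabularies are assumptions.

```python
# Sketch of merging per-semantic-type model predictions into the final output.
def predict_final_concepts(image, models_with_thresholds, concept_vocabs):
    """models_with_thresholds: list of (model, threshold) pairs;
    concept_vocabs: per-model lists mapping output index -> CUI."""
    predicted = set()
    for (model, threshold), vocab in zip(models_with_thresholds, concept_vocabs):
        probs = model.predict(image[None, ...])[0]  # single-image batch
        predicted |= {vocab[i] for i, p in enumerate(probs) if p >= threshold}
    return predicted
```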




Figure 8: Overview of the concepts prediction process.



3.3. Runs
This section provides a detailed description of the runs submitted to the ImageCLEFcaption
2021 task.
    • Run 132945: For this run, the IR method was implemented, using DenseNet-121 for
      feature extraction (average-pooling layer) with image inputs loaded as 224 × 224. The
      length of the resulting feature vector for each image is 1024. Then, k-NN (k = 1, metric =
      Cosine) is used to retrieve the closest image, and the concepts of the closest image are
      assigned to the query image.
    • Run 136379: For this run, the IR method was implemented, using EfficientNet B3 for
      feature extraction (no pooling layers) with image inputs loaded as 300 × 300. k-NN (k = 1,
      metric = Canberra) is used to retrieve the closest image, and the concepts of the closest
      image are assigned to the query image.
    • Run 136400: For this run, the IR method was implemented, using EfficientNet B0 for
      feature extraction (no pooling layers) with image inputs loaded as 224 × 224. k-NN (k = 1,
      metric = Canberra) is used to retrieve the closest image, and the concepts of the closest
      image are assigned to the query image.
    • Run 136404: For this run, the IR method was implemented, using EfficientNet B0 for
      feature extraction (no pooling layers) with image inputs loaded as 224 × 224. k-NN (k = 1,
      metric = Cosine) is used to retrieve the closest image, and the concepts of the closest
      image are assigned to the query image.
    • Run 136429: For this run, the IR method was implemented, using EfficientNet B0 for
      feature extraction (no pooling layers) with image inputs loaded as 224 × 224. k-NN (k = 1,
      metric = Bray–Curtis) is used to retrieve the closest image, and the concepts of the closest
      image are assigned to the query image.
    • Run 133912: For this run, the MLC method was implemented and, based on the F1-scores
      obtained on the validation set, only the models for the Diagnostic Procedure and Body
      Part or Organ semantic types were included in the final predictions. The threshold used
      for assigning a concept was 0.4 for the Diagnostic Procedure model and 0.1 for the Body
      Part or Organ model. The other semantic types were not included because, when testing
      on the validation set, adding them negatively affected the overall F1-score.


4. Results
Table 1 summarises the techniques used by each run and presents the official results our

Table 1
Description and performance of the runs submitted to ImageCLEF 2021 Concept Detection Task and
their ranks compared with all the 29 runs submitted by the 5 participating teams.
  Run ID   Input Image Size   Method                       DL Model          Similarity measure   F1-Score   Ranking

  132945   224 × 224          IR                           DenseNet-121      Cosine               0.469        6
  133912   224 × 224          Multi-label classification   DenseNet-121      N/A                  0.412        15
  136379   300 × 300          IR                           EfficientNet B3   Canberra             0.355        21
  136400   224 × 224          IR                           EfficientNet B0   Canberra             0.423        13
  136404   224 × 224          IR                           EfficientNet B0   Cosine               0.440        12
  136429   224 × 224          IR                           EfficientNet B0   Bray–Curtis          0.451        11
  Best ImageCLEF2021          -                            -                 -                    0.505         1


team achieved in the ImageCLEF 2021 Concept Detection Task, together with their ranking
among the 29 runs submitted by the five participating teams. Our team achieved second place
overall, with our best result obtained by run 132945 (F1-score of 0.469), close to several
submissions from the first-place team, AUEB NLP Group, who achieved F1-scores of 0.505,
0.495, 0.493 and 0.490.
   Notably, our best submission used DenseNet-121 for fine-tuning and feature extraction
(average-pooling layer), with image inputs loaded as 224 × 224, within the IR method: k-NN
with the Cosine distance retrieved the closest image, whose concepts were assigned to the
query image. Based on the final results our team achieved, it is clear that, within the IR method,
different distance metrics improve the score differently for DenseNet-121 and EfficientNet. The
MLC method used the same DenseNet-121 as the IR method but a different process in the latter
part, and performed worse; this may be due to the size of the dataset, with a bigger dataset
possibly required for the MLC method to be effective. Between the two deep learning models
used this year, there is only a slight difference in the results of DenseNet-121 and EfficientNet.
Similarly, among the three distance metrics (Canberra, Cosine and Bray–Curtis), Cosine gave a
slightly better score with DenseNet-121 and Bray–Curtis with EfficientNet.


5. Conclusions
This paper describes our contributions to the ImageCLEFcaption 2021 task. Two different
methods were developed and used in this paper: an information retrieval method using two
deep learning models, a DenseNet-121 and an EfficientNet, to train and extract features from
the data collections, and a multi-label classification method implemented with a DenseNet-121
only. Building on the baseline model from last year [8], we differentiated and optimised our
modelling pipeline to further generalise our approach and improve outcomes. Our DenseNet-121
model showed the highest performance when incorporated in the IR method; following this
method, we achieved the second-best performance (F1-score of 0.469). Unlike the previous
year, no additional modality information was provided, which added additional complexity to
our processing pipeline. Further investigations on developing and customising deep learning
model architectures and fine-tuning are already underway, so that we can further improve
model performance.


References
 [1] J. Kalpathy-Cramer, A. García Seco de Herrera, D. Demner-Fushman, S. Antani, S. Bedrick,
     H. Müller, Evaluating performance of biomedical image retrieval systems—an overview of
     the medical image retrieval task at ImageCLEF 2004–2013, Computerized Medical Imaging
     and Graphics 39 (2015) 55–61.
 [2] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [3] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich,
     H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task,
     in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest,
     Romania, 2021.
 [4] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolu-
     tional networks, in: Proceedings of the IEEE conference on computer vision and pattern
     recognition, 2017, pp. 4700–4708.
 [5] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks,
     in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
 [6] B. Ionescu, H. Müller, R. Péteri, A. B. Abacha, V. Datla, S. A. Hasan, D. Demner-Fushman,
     S. Kozlovski, V. Liauchuk, Y. D. Cid, V. Kovalev, O. Pelka, C. M. Friedrich, A. G. S. de Herrera,
     V.-T. Ninh, T.-K. Le, L. Zhou, L. Piras, M. Riegler, P. Halvorsen, M.-T. Tran, M. Lux, C. Gur-
     rin, D.-T. Dang-Nguyen, J. Chamberlain, A. Clark, A. Campello, D. Fichou, R. Berari, P. Brie,
     M. Dogariu, L. D. Ştefan, M. G. Constantin, Overview of the ImageCLEF 2020: Multimedia
     retrieval in medical, lifelogging, nature, and internet applications, in: Experimental IR
     Meets Multilinguality, Multimodality, and Interaction, volume 12260 of Proceedings of the
     11th International Conference of the CLEF Association (CLEF 2020), LNCS Lecture Notes in
     Computer Science, Springer, Thessaloniki, Greece, 2020.
 [7] B. Karatzas, J. Pavlopoulos, V. Kougia, I. Androutsopoulos, AUEB NLP Group at Image-
     CLEFmed Caption 2020, in: CLEF 2020 Working Notes, Thessaloniki, Greece, September
     22-25, 2020, 2020.
 [8] A. G. S. de Herrera, F. P. Andrade, L. Bentley, A. A. Compean, Essex at Image-CLEFcaption
     2020 task, in: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS. org,
     Thessaloniki, Greece (September 22-25 2020), 2020.
 [9] T. A. Putra, S. I. Rufaida, J.-S. Leu, Enhanced Skin Condition Prediction Through Machine
     Learning Using Dynamic Training and Testing Augmentation, IEEE Access 8 (2020)
     40536–40546.
[10] L. Sarker, M. M. Islam, T. Hannan, Z. Ahmed, COVID-DenseNet: A deep learning architecture
     to detect COVID-19 from chest radiology images (2020).
[11] P. Dutta, P. Upadhyay, M. De, R. Khalkar, Medical image analysis using deep convolu-
     tional neural networks: CNN architectures and transfer learning, in: 2020 International
     Conference on Inventive Computation Technologies (ICICT), IEEE, 2020, pp. 175–180.
[12] N. Thakur, D. Mehrotra, A. Bansal, M. Bala, Analysis and Implementation of the Bray–
     Curtis Distance-Based Similarity Measure for Retrieving Information from the Medical
     Repository, in: International Conference on Innovative Computing and Communications,
     Springer, 2019, pp. 117–125.