NeuralDynamicsLab at ImageCLEFmedical 2022

Georgios Moschovis1,3, Erik Fransén1,2
1 KTH Royal Institute of Technology, Lindstedtsvägen 5, 114 28 Stockholm, Sweden
2 Science for Life (SciLife) Laboratory, Tomtebodavägen 23A, Gamma 6, 171 65 Solna, Sweden
3 Corresponding author

Abstract
Diagnostic Captioning is the automatic generation of text from a collection of X-RAY images; it can assist inexperienced doctors and radiologists in reducing clinical errors and help experienced professionals increase their productivity. Tools that help doctors and radiologists produce higher quality reports in less time could therefore be of high interest for medical imaging departments, and could also significantly impact deep learning research within the biomedical domain. With our participation in the ImageCLEFmedical 2022 Caption evaluation campaign, we attempted to address both the concept detection and the caption prediction tasks by developing baselines based on Deep Neural Networks, including image encoders, classifiers and text generators. Our group, NeuralDynamicsLab at KTH Royal Institute of Technology, within the School of Electrical Engineering and Computer Science, ranked 4th in the former and 5th in the latter task.

Keywords
Neural networks, Speech and language technology, Natural Language Processing (NLP), Deep learning, Generative deep networks, Convolutional neural networks (CNN), Text generation, Information retrieval, Diagnostic captioning, Image captioning, Concept prediction, Classification, Image encoders, Transformers, Encoder-Decoder architecture, Abstractive summarization

1. Introduction
One of the most exciting technological developments nowadays is Machine Learning's potential to transform the world we live in, primarily due to its resurgence through Deep Learning (DL). The increasing size of biomedical data has allowed researchers to demonstrate the evolving capabilities of Deep Learning in biomedical applications, through the development of advanced computing and imaging systems in biomedical engineering, machine learning-based biomedical data mining algorithms [1], and baselines for Diagnostic Captioning. The latter has recently attracted researchers' attention, with the goal of reducing the time required by a doctor or radiologist to produce medical texts and the number of clinical errors, while also increasing the throughput of medical imaging departments [2]. In this work, we attempted to develop Diagnostic Captioning baselines based on novel Deep Learning approaches, to investigate to what extent deep networks are capable of automatically generating a diagnostic text from a set of medical images, and how much their interpretation of medical images can assist doctors and radiologists in producing better quality diagnoses at an increased throughput [2].
Towards this objective, the first step is concept detection, which boils down to predicting relevant tags for X-RAY images, while the end goal is caption generation. In the ImageCLEFmedical 2022 evaluation campaign, we experimented with both the concept detection and the caption prediction tasks in order to obtain a quantitative measure of our proposed architectures' performance [3].

2. Dataset
In this section, we describe the data provided in the ImageCLEFmedical 2022 evaluation campaign. Specifically, we provide details about the ImageCLEFmedical 2022 concept detection and caption prediction datasets, which include images from different radiological modalities, although no imaging modality information is provided. The dataset provided for both subtasks of the ImageCLEFmedical 2022 evaluation campaign [4] consists of 90920 images that constitute a subset of the extended Radiology Objects in COntext (ROCO) dataset [5], without imaging modality information. As in previous editions, the dataset originates from biomedical articles of the PMC OpenAccess subset.

After merging the initially provided train and validation data, we shuffle them, having manually set the random seeds to eliminate randomness across consecutive runs while tuning our hyperparameters, and then keep 80% as our training set, 10% as our validation set used for hyperparameter tuning, and the remaining 10% as our development set used for model selection. Since the dataset is large, we perform neither cross-validation nor data augmentation. We experimented with adding noise to the images, in the form of random rotations and translations, which however did not provide any additional benefit in our baselines' quantitative evaluation.

Regarding the concept detection subtask, there are 8374 concept tags that can be assigned to the X-RAY images, while each image in the training, validation and development sets is assigned 5 tags on average. Regarding the caption prediction subtask, the training set contains 72736 captions in total, of which 70879 are unique, and the average caption length is 108 words, including 28 unique words. The validation set contains 9092 captions, of which 8984 are unique, with an average caption length of 107 words, including 26 unique words. The development set contains 9092 captions, of which 8977 are unique, with an average caption length of 108 words, including 28 unique words. These counts verify that the aforementioned sets are balanced in terms of their statistics.

"The concepts were generated using a reduced subset of the Unified Medical Language System (UMLS) 2020 AB release, which includes the sections (restriction levels) 0, 1, 2, and 9" [4]. The UMLS is a set of files and software that collects multiple health and biomedical vocabularies and standards to enable interoperability between computer systems. To improve the feasibility of recognizing concepts from the images, concepts were filtered based on their semantic type, and concepts with very low frequency were removed. In each caption, tokens containing numbers and all punctuation were removed, captions were converted to lower case, and lemmatization was applied using the spaCy toolkit [3].
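As a rough illustration of this preprocessing and of the deterministic 80/10/10 split, the sketch below follows the description above; the spaCy model name and the exact order of the cleaning steps are assumptions of ours, not taken from the original pipeline.

```python
import random
import re

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # assumed spaCy pipeline


def preprocess_caption(caption: str) -> str:
    # Lower-case, drop punctuation and tokens containing numbers, then lemmatize.
    caption = re.sub(r"[^\w\s]", " ", caption.lower())
    tokens = [t for t in caption.split() if not any(ch.isdigit() for ch in t)]
    return " ".join(tok.lemma_ for tok in nlp(" ".join(tokens)))


def split_80_10_10(samples: list, seed: int = 0):
    # Deterministic shuffle followed by an 80/10/10 train/validation/development split.
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    return samples[: int(0.8 * n)], samples[int(0.8 * n): int(0.9 * n)], samples[int(0.9 * n):]
```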
3. Methods and results
In this section, we describe the core components used to encode the X-RAYs with dense embeddings and explain in detail the baseline networks that we proposed in the ImageCLEFmedical 2022 evaluation campaign for both subtasks, in order of performance. These baselines are built on shared core components that rely on pre-trained architectures which are extremely popular in computer vision. Specifically, we describe how we designed backbone networks as generic image encoders, relying on Convolutional Neural Network (CNN) architectures that are popular for vision tasks on generic images, such as classification and semantic segmentation; these encoders are shared across all baselines in both ImageCLEFmedical Caption tasks. Furthermore, we describe the components of each model and give details on the selected hyper-parameters. For all our models, we set in advance all random seeds equal to 0, set the cuDNN backend to deterministic mode and disabled the cuDNN benchmark, to ensure consistency of the aforementioned splits across consecutive runs for hyper-parameter selection. This procedure was applied for both subtasks of the evaluation campaign.

3.1. Backbone Networks: image encoders
One of the principal components of the proposed architectures, shared between both subtasks, is the image encoder. The encoders are existing state-of-the-art architectures, pretrained on the ImageNet classification dataset [6] and obtained from the torchvision models library, while any additional components, such as a multi-label classification head or a caption generation architecture, are appended to the output of the image encoder; in this context these models are referred to as "backbone networks". The goal of these networks is to encode the images into dense numerical representations. Since Deep Learning became popular, different initialization strategies for the weights and the biases have been proposed. We used Glorot initialization, shown below [7], to initialize the weights of the classification heads. We also experimented with non-pretrained image encoders that we initialized using the same strategy and fully fine-tuned; their performance, however, was inferior in concept prediction.

Glorot: $W_{i,j} \sim \mathcal{U}\left(-\sqrt{\frac{6}{f_{\mathrm{in}}+f_{\mathrm{out}}}},\ \sqrt{\frac{6}{f_{\mathrm{in}}+f_{\mathrm{out}}}}\right)$

The Convolutional Neural Network (CNN) encoders that we attempted to use include variants of AlexNet [8], ResNet [9], DenseNet [10], VGG [11] and EfficientNet [12], obtained from the torchvision models library as mentioned above. We also experimented with another architectural choice, Vision Transformers (ViT) [13]; however, the performance obtained was poor compared to the CNN encoders. That outcome is in line with the observation in [14] that Vision Transformers and "Hybrid-ViT architectures are inferior to the CNN-based ones".
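A minimal sketch of how such a backbone can be assembled with torchvision, under the seeding and Glorot-initialisation choices described above; the helper names are ours, and the head shown here is the single-layer (Perceptron) variant discussed in section 3.2.

```python
import torch
import torch.nn as nn
from torchvision import models


def set_deterministic(seed: int = 0) -> None:
    # Random seeds fixed to 0, cuDNN backend deterministic, cuDNN benchmark disabled,
    # mirroring the settings listed at the beginning of section 3.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def build_backbone_classifier(num_concepts: int = 8374, train_backbone: bool = False) -> nn.Module:
    # ImageNet-pretrained DenseNet161 encoder; only the appended head is trainable unless
    # train_backbone is set (as in the fully fine-tuned variant of section 3.2.6).
    net = models.densenet161(pretrained=True)
    for p in net.parameters():
        p.requires_grad = train_backbone
    head = nn.Linear(net.classifier.in_features, num_concepts)
    nn.init.xavier_uniform_(head.weight)  # Glorot (Xavier) uniform initialisation [7]
    nn.init.zeros_(head.bias)
    net.classifier = head                 # the new head's parameters are trainable by default
    return net
```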
The above summarizes the first step in the design of the image encoders, namely model selection based on their performance on a development set. Model selection is then followed by a model collaboration design principle based on ensemble learning. In this case, we used the aforementioned models as members of the ensemble, i.e. weak learners in a pool of encoders trained with different hyper-parameter values (e.g. learning rates, decision thresholds for the positive class, number of epochs) as well as with different architectures, to seek diversity and exploit the "Wisdom of the crowd" [15] among the fine-tuned models. In this context, we take into consideration the "votes" of all the different CNNs by averaging their outputs, in order to make decisions on the generated tags or guesses on the assigned captions.

3.2. Concept prediction subtask
As mentioned in section 3.1, "backbone networks" refer to image encoders, which are state-of-the-art architectures pretrained on the ImageNet classification dataset [6] and shared between both subtasks. In the case of concept prediction, an additional classification head, either a Perceptron or a Multi-layered Perceptron, was added on top of these "backbone networks", and its weights were initialized using the Glorot initialization strategy [7].

3.2.1. Pre-trained DenseNet161 with fine-tuned classification head, learning rate $10^{-3}$, Adam optimizer and gradient clipping
The first two models correspond to a DenseNet161 convolutional network that is pretrained on the ImageNet classification dataset, with a Perceptron head that is further fine-tuned on the ImageCLEFmedical 2022 data using a sigmoid activation function in the output units, whose number equals the number of concepts (8374 nodes), a constant learning rate equal to $10^{-3}$ and the negative $F_1$ score as a minimization criterion. For each image, we assign the concepts that have predicted probabilities above 50%, while the tags obtain their numerical IDs in their order of appearance before shuffling. Furthermore, we clip the gradients computed during training to lie in $[-1, 1]$, to increase numerical stability.

When performing stochastic or minibatch Gradient Descent, if the loss changes quickly in one direction and slowly in another, Gradient Descent will progress slowly along the shallow dimension and jitter along the steep one. To overcome this issue, we used the Adam optimizer [16], so that progress along steep directions is damped while progress along flat directions is accelerated. Adam uses an exponentially decaying average to discard old history, as well as momentum as an estimate of the first-order gradient. It has bias corrections for the first-order and second-order moments and converges rapidly after finding a local convex bowl. If $t$ represents the current time step, the Adam updates are:

$v^{(t+1)} = \rho_1 v^{(t)} + (1-\rho_1)\, g^{(t)}, \quad \rho_1 \in \mathbb{R}^+$
$r^{(t+1)} = \rho_2 r^{(t)} + (1-\rho_2)\, \big(g^{(t)}\big)^2, \quad \rho_2 \in \mathbb{R}^+$
$w^{(t+1)} = w^{(t)} - \epsilon\, \frac{v^{(t)}}{\delta + \sqrt{r^{(t)}}}, \quad \delta, \epsilon \in \mathbb{R}^+$

Our best performing model (with submission ID 181750) is an instance of the aforementioned architecture trained on all the provided data, i.e. after merging again the training, validation and development sets described in section 2, and achieves $F_1 = 0.43601$. The next model corresponds to the same network architecture but is trained only on the training set (with submission ID 181715) and achieves $F_1 = 0.43567$. For the latter case, where we have measured performance on all sets, we present plots with the evolution of the $F_1$ score and accuracy during training in Figure 1(a).
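The paper uses the negative $F_1$ score as the minimization criterion; one common way to make this differentiable is a "soft" $F_1$ computed over the sigmoid probabilities. The sketch below assumes that reading (the exact formulation is not spelled out above) and illustrates the gradient clipping, thresholding and Adam setup described in this section.

```python
import torch


def soft_f1_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A differentiable "soft" F1: sigmoid probabilities stand in for hard 0/1 decisions,
    # and 1 - F1 (i.e. the negative F1 up to a constant) is returned so it can be minimised.
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()
    fp = (probs * (1.0 - targets)).sum()
    fn = ((1.0 - probs) * targets).sum()
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1


def train_step(model, images, targets, optimizer):
    optimizer.zero_grad()
    loss = soft_f1_loss(model(images), targets)
    loss.backward()
    # Clip every gradient component to [-1, 1] for numerical stability (section 3.2.1).
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
    optimizer.step()
    return loss.item()


# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # constant learning rate
# predictions = torch.sigmoid(model(images)) > 0.5            # assign concepts above 50%
```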
3.2.2. Pre-trained DenseNet161 with fine-tuned classification head, learning rate $5 \times 10^{-4}$, AdamW optimizer and gradient clipping
The next model corresponds to another DenseNet161 convolutional network pretrained on the ImageNet classification dataset, with a Perceptron head that is further fine-tuned on the ImageCLEFmedical 2022 data using a sigmoid activation function in the output units, whose number equals the number of concepts (8374 nodes), a constant learning rate equal to $5 \times 10^{-4}$ and the negative $F_1$ score as a minimization criterion. For each image, we assign the concepts that have predicted probabilities above 50%, while the tags obtain their numerical IDs in their order of appearance before shuffling. Furthermore, we clip the gradients computed during training to lie in $[-1, 1]$, to ensure numerical stability.

On this occasion we used an improved version of the Adam optimizer, called AdamW [17], in which weight decay is performed only after controlling the parameter-wise step size, which yields models that generalize much better. Compared to the Adam optimizer discussed in section 3.2.1, as well as other adaptive gradient algorithms, where the potential benefit of weight decay regularization is limited because "the weights do not decay multiplicatively but by an additive constant factor" [17], the AdamW optimizer may overcome this issue, while training much faster than stochastic or minibatch Gradient Descent. Our model is an instance of the aforementioned network architecture, trained only on the training set (with submission ID 181753), and achieves $F_1 = 0.43558$, although we would expect training with AdamW to perform better. Since the gain from re-training the model after merging all the splits is almost negligible, as we already noticed in section 3.2.1, the remaining models are not re-trained on the entire dataset. Once again, we present plots with the evolution of the $F_1$ score and accuracy in Figure 1(b).

3.2.3. Pre-trained DenseNet161 with fine-tuned classification head, learning rate $5 \times 10^{-4}$ and Adam optimizer
The subsequent model is yet another DenseNet161 convolutional network pretrained on the ImageNet classification dataset, with a Perceptron head that is further fine-tuned on the ImageCLEFmedical 2022 data using a sigmoid activation function in the output units, whose number equals the number of concepts (8374 nodes), a constant learning rate equal to $5 \times 10^{-4}$ and the negative $F_1$ score as a minimization criterion. For each image, we assign the concepts that have predicted probabilities above 50%, while the tags obtain their numerical IDs in their order of appearance before shuffling, and we train the network using the Adam optimizer, as described extensively in section 3.2.1. Our model is an instance of the aforementioned network architecture (with submission ID 182152) and achieves $F_1 = 0.43539$; however, in this baseline we omit gradient clipping, in contrast to the models described above in sections 3.2.1 and 3.2.2. Furthermore, as for both previous best-performing models, we present plots with the evolution of the $F_1$ score and accuracy in Figure 1(c).

Figure 1: $F_1$ and accuracy score plots per epoch for the models described (a) in section 3.2.1, (b) in section 3.2.2 and (c) in section 3.2.3. We observe that the classification heads, which we fine-tune on ImageCLEFmedical 2022 data, appear to be sufficiently regularized (thus there is no overfitting) and to have used their maximum capacity.
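For illustration, the optimizer swap described in sections 3.2.2 and 3.2.3 amounts to a one-line change in PyTorch; the weight-decay coefficient below is a placeholder of ours, since the value used is not stated here.

```python
import torch

head = torch.nn.Linear(2208, 8374)  # stand-in for the DenseNet161 classification head
adam = torch.optim.Adam(head.parameters(), lr=1e-3)    # section 3.2.1
adamw = torch.optim.AdamW(head.parameters(), lr=5e-4,  # section 3.2.2
                          weight_decay=1e-2)           # decay coefficient is illustrative only
```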
3.2.4. Ensemble of pre-trained DenseNet CNNs with fine-tuned classification heads
The following model, our best performing mixture of individual networks, combines the 10 best performing DenseNet CNNs, including instances of the DenseNet161 and DenseNet121 architectures, and reflects our quest for diversity in order to exploit the "Wisdom of the crowd" [15]. In this context, we take into account the "votes" of all the different CNNs to make decisions on the assigned tags. The voting scheme consists of averaging the probabilities computed by the different weak learners before assigning to each image the concepts that have average predicted probabilities above 50%, while the tags, as usual, obtain their numerical IDs in their order of appearance before shuffling. We also experimented with alternative voting policies over the pool of fine-tuned networks, such as computing the union or the intersection of the tags assigned by each weak learner, where an assignment means a predicted probability above 50%, but they performed poorly.

Table 1 summarizes the architecture of all individual networks in the pool of encoders. This includes the type of backbone network, the optimizer, the learning rate value and whether it decays per epoch, the batch size, the number of epochs and the submission ID; the weak learners of sections 3.2.1, 3.2.2 and 3.2.3 performed better than the ensemble altogether and were thus also submitted individually. Note that the classification head is always a Perceptron which is further fine-tuned on the ImageCLEFmedical 2022 data using a sigmoid activation function in the output units, whose number equals the number of concepts. Moreover, when linear decay is applied, the learning rate is updated by $\eta_{t+1} = \eta_0 \times \left(1 - \frac{t}{T}\right)$, where $t$ represents the current time step, $T$ the total number of epochs and $\eta_0$ the learning rate at the beginning of the training procedure. The performance of this mixture of experts (with submission ID 182338) equals $F_1 = 0.43496$.

Table 1
Summary of weak learners' architecture and training regime in model 182338
Backbone Net. | Optimizer | Learning Rate | Linear Decay | Batch size | Epochs | Subm. ID
DenseNet121 | AdamW | $5 \times 10^{-4}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-3}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-4}$ | False | 60 | 20 | -
DenseNet161 | Adam | $10^{-3}$ | False | 120 | 20 | 181750, 181715
DenseNet161 | AdamW | $10^{-3}$ | True | 120 | 20 | -
DenseNet161 | Adam | $5 \times 10^{-4}$ | False | 120 | 20 | -
DenseNet161 | Adam | $5 \times 10^{-4}$ | False | 120 | 20 | 181753
DenseNet161 | AdamW | $5 \times 10^{-4}$ | False | 120 | 20 | 182152
DenseNet161 | AdamW | $10^{-4}$ | False | 120 | 50 | -
DenseNet161 | AdamW | $10^{-4}$ | False | 120 | 20 | -

3.2.5. Ensembles of various pre-trained CNNs with fine-tuned classification heads
Although Dense Convolutional Networks (DenseNets) appear to outperform other network architectures, which is in line with their extensive use in biomedical applications that include X-RAY processing [18], we also experimented with a plethora of CNN backbone networks, as mentioned in section 3.1. Consequently, the ensuing three models constitute ensembles whose members include different architectures, with varying hyperparameter values to encourage diversity of training regimes. During the voting process, we average the probabilities computed by the sigmoid output layer of all the different weak learners before assigning to each image the tags that have average predicted probabilities above 50%, as sketched below.
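A minimal sketch of this probability-averaging voting scheme; the function name is ours, and `models` is assumed to be the list of fine-tuned weak learners.

```python
import torch


@torch.no_grad()
def ensemble_vote(models, images, threshold: float = 0.5) -> torch.Tensor:
    # Average the per-concept probabilities ("votes") of all weak learners, then assign
    # every concept whose mean probability exceeds the decision threshold.
    probs = torch.stack([torch.sigmoid(m(images)) for m in models], dim=0)  # [M, B, 8374]
    return probs.mean(dim=0) > threshold                                    # [B, 8374] boolean mask
```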
Table 2
Summary of weak learners' architecture and training regime in model 181546
Backbone Net. | Optimizer | Learning Rate | Linear Decay | Batch size | Epochs | Subm. ID
AlexNet | AdamW | $10^{-4}$ | False | 60 | 20 | -
AlexNet | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $5 \times 10^{-4}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-3}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-4}$ | False | 60 | 20 | -
DenseNet161 | Adam | $10^{-3}$ | False | 120 | 20 | 181750, 181715
DenseNet161 | AdamW | $10^{-3}$ | True | 120 | 20 | -
DenseNet161 | Adam | $5 \times 10^{-4}$ | False | 120 | 20 | -
DenseNet161 | Adam | $5 \times 10^{-4}$ | False | 120 | 20 | 181753
DenseNet161 | AdamW | $5 \times 10^{-4}$ | False | 120 | 20 | 182152
ResNet50 | AdamW | $10^{-4}$ | False | 60 | 20 | -
ResNet101 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-13 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-16 | AdamW | $10^{-4}$ | False | 60 | 20 | -

Table 3
Summary of weak learners' architecture and training regime in model 182155
Backbone Net. | Optimizer | Learning Rate | Linear Decay | Batch size | Epochs | Subm. ID
AlexNet | AdamW | $10^{-4}$ | False | 60 | 20 | -
AlexNet | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $5 \times 10^{-4}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-3}$ | False | 60 | 20 | -
DenseNet161 | Adam | $10^{-3}$ | False | 120 | 20 | 181750, 181715
DenseNet161 | AdamW | $10^{-3}$ | True | 120 | 20 | -
ResNet50 | AdamW | $10^{-4}$ | False | 60 | 20 | -
ResNet50 | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -
ResNet101 | AdamW | $10^{-4}$ | False | 60 | 20 | -
ResNet101 | AdamW | $5 \times 10^{-4}$ | False | 60 | 20 | -
VGG-13 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-13 | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -
VGG-16 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-16 | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -

Table 4
Summary of weak learners' architecture and training regime in model 182154
Backbone Net. | Optimizer | Learning Rate | Linear Decay | Batch size | Epochs | Subm. ID
AlexNet | AdamW | $10^{-4}$ | False | 60 | 20 | -
AlexNet | AdamW | $5 \times 10^{-5}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $5 \times 10^{-4}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-3}$ | False | 60 | 20 | -
DenseNet121 | AdamW | $10^{-4}$ | False | 60 | 20 | -
DenseNet161 | Adam | $10^{-3}$ | False | 120 | 20 | 181750, 181715
DenseNet161 | AdamW | $10^{-3}$ | True | 120 | 20 | -
DenseNet161 | Adam | $5 \times 10^{-4}$ | False | 120 | 20 | -
ResNet50 | AdamW | $10^{-4}$ | False | 60 | 20 | -
ResNet101 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-13 | AdamW | $10^{-4}$ | False | 60 | 20 | -
VGG-16 | AdamW | $10^{-4}$ | False | 60 | 20 | -

These three mixtures of experts (with submission IDs 181546, 182155 and 182154) achieve scores $F_1 = 0.43404$, $F_1 = 0.43130$ and $F_1 = 0.42957$, respectively. Tables 2, 3 and 4 summarize the architecture of all individual networks in each pool of encoders. Their format is identical to that used in section 3.2.4, and they likewise list the hyper-parameter values of each weak learner. Note that the classification head is always a Perceptron which is further fine-tuned on the ImageCLEFmedical 2022 data using a sigmoid activation function in the output units, whose number equals the number of concepts. Moreover, when linear decay is applied, the learning rate is updated by $\eta_{t+1} = \eta_0 \times \left(1 - \frac{t}{T}\right)$, where $t$ represents the current time step, $T$ the total number of epochs and $\eta_0$ the initial learning rate.
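A minimal sketch of how this per-epoch linear decay could be set up in PyTorch; the module, base learning rate and epoch count are placeholders.

```python
import torch

model = torch.nn.Linear(2208, 8374)  # stand-in for a classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
total_epochs = 20

# eta_{t+1} = eta_0 * (1 - t/T): the base learning rate is multiplied by (1 - t/T) at epoch t.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: 1.0 - t / total_epochs)

for epoch in range(total_epochs):
    # ... one training epoch ...
    scheduler.step()
```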
3.2.6. Fully fine-tuned DenseNet161 with cyclical learning rate and AdamW optimizer
The succeeding model corresponds to a DenseNet161 convolutional network that is now fully fine-tuned on the ImageCLEFmedical 2022 data, using a sigmoid activation function in the output units, whose number equals the number of concepts (8374 nodes), a scheduled learning rate [19] and the negative $F_1$ score as a minimization criterion. For each image, we assign the concepts that have predicted probabilities above 50%, while the tags obtain their numerical IDs in their order of appearance before shuffling. One important aspect of minibatch or stochastic gradient descent is the choice of the learning rate $\eta$, which controls the size of the update applied to the parameters in every iteration.

Constant learning rates have traditionally been used to train Deep Neural Networks with the back-propagation algorithm, although they do not guarantee an optimal convergence rate according to Stochastic Approximation Theory [20]; more precisely, the network parameters hover around a minimum at an average distance proportional to the learning rate, with a variance that depends on the objective function and the exemplar set [21]. To this end, cyclical learning rates have been proposed as a method for setting the learning rate by cyclically varying its value between reasonable boundary values, which increases classification accuracy when training CNNs on generic images [22].

Figure 2: (a) Schematic illustration of the error landscape with a high learning rate, (b) example plot of a cyclical learning rate with $\eta_{\min} = 0.01$, $\eta_{\max} = 0.30$, $n_s = 2$ and (c) $F_1$ and accuracy score plots per epoch for the model described in section 3.2.6.

A high value of $\eta$ will make the network take large steps above the minimum of the error function but never converge to it, as illustrated in Figure 2(a). A small value of $\eta$ will delay convergence, preventing the network from finding a minimum of the error function if the number of epochs is limited. A cyclical learning rate ranges linearly between two values $\eta_{\min}$ and $\eta_{\max}$. One maximization of the learning rate followed by a minimization is called a cycle. In Figure 2(b) we present an example of a cyclical learning rate, where $\eta_{\min} = 0.01$, $\eta_{\max} = 0.30$, $n_s = 2$, and we denote by $2 n_s$ the time required for one cycle of the learning rate to complete. In our model we set $\eta_{\min} = 10^{-5}$, $\eta_{\max} = 0.1$, $n_s = 4$ for the first 80 epochs and then set the learning rate to a constant value $\eta = 10^{-3}$ for 30 additional epochs. This network (with submission ID 182156) achieves $F_1 = 0.31687$, a considerably lower score compared to the models pre-trained on the ImageNet classification dataset [6], which achieve more than 10% higher $F_1$ results on the test set. Moreover, we present plots with the evolution of the $F_1$ score and accuracy per training epoch in Figure 2(c); training is quite unstable while the learning rate varies.
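A minimal sketch of such a triangular cyclical schedule with PyTorch's built-in scheduler, under the boundary values stated above; the module is a placeholder and stepping once per epoch is our assumption about how $n_s$ is counted.

```python
import torch

model = torch.nn.Linear(2208, 8374)  # stand-in module
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Triangular cyclical schedule: the learning rate ramps linearly from eta_min to eta_max and
# back again; stepping once per epoch makes one full cycle last 2 * n_s epochs (n_s = 4 here).
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=0.1,
    step_size_up=4, mode="triangular",
    cycle_momentum=False,  # required for Adam-type optimizers, which expose no "momentum" field
)

for epoch in range(80):
    # ... one training epoch ...
    scheduler.step()

# Afterwards: a constant learning rate of 1e-3 for the 30 additional epochs.
for group in optimizer.param_groups:
    group["lr"] = 1e-3
```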
3.2.7. Nearest Neighbours Baseline
The ensuing model is a generalization of the 1-NN baseline proposed in [23]. We remind the reader that, for every image in the test set, the 1-NN baseline assigns as output the tags of the visually most similar image from the training set; consequently, for every test image $\hat{x}$, it outputs the set of concepts $y^*$ of the most similar training image $x^*$ [2]. Therefore, if we denote by $e(\cdot)$ the output of the employed image encoder among those mentioned in section 3.1, 1-NN predicts $(\hat{x}, \hat{y}) = (\hat{x}, y^*)$, where $(x^*, y^*) = \arg\max_{(x, y)} \cos\left(e(\hat{x}), e(x)\right)$ over the training pairs. Our generalized Nearest Neighbours baseline instead takes into account $k \in \mathbb{Z}^+$ neighbours, and not necessarily only the one with the closest representation. Our model (with submission ID 182331) uses $k = 1$ with a VGG-16 encoder pre-trained on the ImageNet classification dataset and achieves only $F_1 = 0.25061$, which indicates the importance of fine-tuning.

3.3. Performance summary
Table 5 below summarizes several characteristics of the proposed baselines for concept detection, in order of performance with respect to their $F_1$ scores, together with their respective submission IDs. We observe that DenseNet161 image encoders with fine-tuned classification heads are the top performing configurations and outperform other CNN architectures, which is in accordance with their extensive use in X-RAY processing [18], while fully fine-tuning the backbone networks and using retrieval-based heuristics that capture representation similarities, such as the 1-NN baseline [23], achieve lower scores.

Table 5
Summary of our configurations' characteristics and statistics
Backbone Network | Section described | Type of model | F1 score | Submission ID
DenseNet161 | Section 3.2.1 | Deep Network Head | 0.43601 | 181750
DenseNet161 | Section 3.2.1 | Deep Network Head | 0.43567 | 181715
DenseNet161 | Section 3.2.2 | Deep Network Head | 0.43558 | 181753
DenseNet161 | Section 3.2.3 | Deep Network Head | 0.43539 | 182152
DenseNet variants | Section 3.2.4 | Ensemble of Networks | 0.43496 | 182338
Various networks | Section 3.2.5 | Ensemble of Networks | 0.43404 | 181546
Various networks | Section 3.2.5 | Ensemble of Networks | 0.43130 | 182155
Various networks | Section 3.2.5 | Ensemble of Networks | 0.42957 | 182154
DenseNet161 | Section 3.2.6 | Deep Network (full) | 0.31687 | 182156
VGG-16 | Section 3.2.7 | Nearest Neighbour | 0.25061 | 182331

3.4. Caption generation subtask
In the ImageCLEFmedical 2022 evaluation campaign, "the first step to automatic image captioning and scene understanding boils down to identifying the presence and location of relevant concepts within a large corpus of medical images that is followed by caption generation in captioning. Based on medical images content, the concept prediction task provides the building blocks for scene understanding by identifying the individual components, referred to as image tags, from which captions are composed. The assigned concepts can be further applied for context-based image and information retrieval purposes" [3]. "On the basis of the vocabulary $\mathcal{V}$ identified during concept prediction task, as well as the visual information of their interaction in the image, caption generation task refers to composing coherent captions for each entire image. For the medical captioning task, rather than the mere coverage of visual concepts, detecting the interplay of visible elements can be crucial for strong performance" [3]. In the following, we describe our proposed models for Diagnostic Captioning, in which the generalized Nearest Neighbours baseline introduced in section 3.2.7 plays a crucial role, despite performing poorly on its own.

3.4.1. $(1 + k)$-NN image retriever with Pegasus summarizer
Our best performing models extend the Nearest Neighbours baseline for caption generation. Precisely, 1-NN [23] constitutes one of the model components: for every image in the test set, it produces the diagnostic text of the visually most similar image from the training set as output, i.e. it assigns the corresponding caption $y^*$ of the most similar training image $x^*$ [2]. Thus, if we denote by $e(\cdot)$ the output of the employed image encoder among those mentioned in section 3.1, 1-NN predicts $(\hat{x}, \hat{y}) = (\hat{x}, y^*)$, where $(x^*, y^*) = \arg\max_{(x, y)} \cos\left(e(\hat{x}), e(x)\right)$ over the training pairs. This prediction constitutes the first part of the models' generated caption.
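A minimal sketch of this cosine-similarity retrieval, serving both the concept baseline of section 3.2.7 and the caption retriever described here; the function name is ours, and in practice the training-set embeddings would be precomputed rather than re-encoded per query.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_neighbours(encoder, query_image, train_images, train_outputs, k: int = 1):
    # Rank training images by cosine similarity between their dense representations e(.)
    # and the representation of the query image, then return the outputs (tag sets for
    # section 3.2.7, captions for section 3.4.1) of the k most similar training images.
    q = F.normalize(encoder(query_image.unsqueeze(0)).flatten(1), dim=1)  # [1, d]
    keys = F.normalize(encoder(train_images).flatten(1), dim=1)           # [N, d]
    similarities = (keys @ q.T).squeeze(1)                                # cosine similarities
    top = torch.topk(similarities, k=k).indices
    return [train_outputs[i] for i in top.tolist()]
```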
In the generalized baseline, however, apart from the neighbour with the closest representation, we retrieve the top-$(k + 1)$ nearest neighbours, concatenate their outputs excluding that of the most similar image, and feed them as input to an abstractive summarizer, Pegasus [24]. Pegasus is based on the transformer architecture [25], an idea that revolutionized Natural Language Processing, and is trained with a Masked Language Modelling objective, which became popular within the research community through BERT [26]. For our models we employed an AlexNet CNN pre-trained on the ImageNet classification dataset as our image encoder and merged the training, validation and development sets described in section 2, in order to benefit from an extensive set of training data when computing similarities with the test images. For each test image we keep the caption of the visually most similar image, concatenate the captions of the $k$ following ones and give them as input to the Pegasus summarizer, which we allow to produce a summary of maximum length $n$ tokens to eliminate repetitions. We exclude from our generated summaries phrases such as "All images are copyrighted." and "Images courtesy of AFP, EPA, Getty", which were probably included in Pegasus' training set. The predicted captions are the concatenation of the 1-NN baseline and Pegasus summarizer outputs. Table 6 below presents all configurations' hyper-parameter values, namely $k$ and $n$, together with their submission IDs and BLEU scores in decreasing order [27].

Table 6
Summary of our configurations' hyper-parameters and statistics
Backbone Network | Captions $k$ | Tokens $n$ | BLEU score | Submission ID
AlexNet | $k = 9$ | $n = 15$ | 0.29166 | 182337
AlexNet | $k = 4$ | $n = 15$ | 0.28343 | 182286
AlexNet | $k = 3$ | $n = 15$ | 0.27855 | 182284
AlexNet | $k = 2$ | $n = 15$ | 0.27007 | 182285
AlexNet | $k = 4$ | $n = 5$ | 0.25521 | 182271
AlexNet | $k = 3$ | $n = 5$ | 0.25334 | 182272
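A minimal sketch of the summarization step with the Hugging Face transformers library; the checkpoint shown is an illustrative stand-in, as the exact Pegasus checkpoint used is not specified here.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

checkpoint = "google/pegasus-xsum"  # illustrative checkpoint
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
summarizer = PegasusForConditionalGeneration.from_pretrained(checkpoint)

BOILERPLATE = ("All images are copyrighted.", "Images courtesy of AFP, EPA, Getty")


def summarize_neighbours(neighbour_captions, max_tokens: int = 15) -> str:
    # Concatenate the captions of the k additional neighbours and summarize them abstractively,
    # capping the summary at n = max_tokens tokens; boilerplate phrases are stripped afterwards.
    batch = tokenizer(" ".join(neighbour_captions), truncation=True, return_tensors="pt")
    ids = summarizer.generate(**batch, max_length=max_tokens)
    summary = tokenizer.decode(ids[0], skip_special_tokens=True)
    for phrase in BOILERPLATE:
        summary = summary.replace(phrase, "")
    return summary.strip()


# predicted_caption = caption_of_1nn_image + " " + summarize_neighbours(next_k_captions)
```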
3.4.2. $k$-NN image retriever with Retrieval Augmented Generation
General-purpose sequence-to-sequence models have become remarkably powerful: they capture world knowledge in their parameters, achieve strong results on a wide range of tasks and are applicable to almost anything. However, they often hallucinate, may struggle to access and apply knowledge, and are difficult to update. On the other hand, modern Information Retrieval (IR) is also attractive, as externally retrieved knowledge can be useful for a huge variety of NLP tasks. Modern IR provides a precise and accurate knowledge access mechanism and is trivial to update; by "modern" IR we refer to dense retrieval, which is starting to outperform traditional IR. On the negative side, it still needs retrieval supervision or heuristics such as BM25, as well as some task-specific way to integrate retrieval into downstream tasks. The goal of Retrieval Augmented Generation (RAG) [28] is to combine the strengths of sequence-to-sequence models and explicit knowledge retrieval. We used RAG as a model component, pretrained on Wikipedia, with a FAISS index [29] built on 42% of PubMed 2022, including recent publications related to the fields of neuroscience and computational biology. RAG is also blended with the 1-NN baseline: its outputs are concatenated with the caption of the visually most similar image from the training set to produce the caption predictions.

This model uses either an AlexNet or a VGG-16 CNN pre-trained on the ImageNet classification dataset as backbone network, and we again merge the training, validation and development sets described in section 2 in order to take advantage of more input-output pairs $(x, t)$. Despite containing a non-parametric memory in addition to the information stored in the parameters of its sequence-to-sequence generative model, a Bidirectional and Auto-Regressive Transformers (BART) generator [30], it achieves a lower BLEU score than its predecessors described in section 3.4.1, as shown in Table 7 below. These results could possibly improve if we stored extracts from patients' previous diagnoses instead of biomedical articles.

Table 7
Summary of our configurations' image encoders and statistics
Backbone Network | BLEU score | Submission ID
AlexNet | 0.25127 | 181712
VGG-16 | 0.23958 | 181860

In the RAG approach [28], dual memory components are pre-trained and pre-loaded with extensive knowledge, encapsulating information in their representations without further training; the generator $p_\theta$ acts as a parametric memory, while the retriever $p_\eta$ embodies a non-parametric memory in the query encoder $q(\cdot)$ and also includes a Dense Passage Retriever (DPR) [31]. To train the retriever $p_\eta$ and the generator $p_\theta$ end-to-end, we can treat the retrieved document as a latent variable $z$, with the embedding of the closest document representation denoted by $d(z)$. The Maximum Inner Product Search (MIPS) algorithm [32] is used to compute the top $k$ retrieved documents with respect to $p_\eta(z|x)$, where

$p_\eta(z|x) \propto \exp\left(d(z)^\top q(x)\right).$

Finally, the generated caption $y$ is produced by marginalizing over the predictions. The generator $p_\theta$ is a sequence-to-sequence model, precisely a BART [30] instance, which conditions on the latent documents $z$ together with each input $x$ to generate each output. As an overall component, it produces $p_\theta(y_i \mid x, z, y_{1:i-1})$, defining a Language Model (LM) over the token vocabulary $\mathcal{V}$ given the latent documents $z$ and the queries $x$, which are the outputs of the 1-NN baseline. During training, we treat question-answer pairs as input-output pairs $(x, t)$ and train RAG-token by directly minimizing the negative marginal log-likelihood of generating the output sequences $y$ from the input sequences $x$. If $\mathcal{D} = \{x_j, t_j\}_j$ is the complete dataset, our training objective is:

$\ell_{\mathrm{cross}}(x, t; \theta, \eta) = -\log p(y \mid x; \theta, \eta), \qquad \sum_j \ell_{\mathrm{cross}}(x_j, t_j; \theta, \eta) = -\sum_j \log p(y_j \mid x_j; \theta, \eta)$
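A minimal sketch of querying a RAG-token model with the Hugging Face transformers library; the public Wikipedia-based checkpoint and its small demo index are stand-ins, since the custom PubMed FAISS index described above would be supplied to the retriever through an indexed dataset instead.

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

# Illustrative checkpoint and dummy index, not the exact setup used in this work.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# The query x is the caption returned by the 1-NN baseline; the retriever scores latent
# documents z via MIPS and the BART generator marginalises over them to produce the output y.
query = "caption of the visually most similar training image"
inputs = tokenizer(query, return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"], max_length=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```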
3.4.3. 1-NN image retrieval baseline
Last but not least, we attempted to use the 1-NN baseline [2] as is, to generate the diagnostic text of the captions; this, however, achieved a lower score than all the aforementioned approaches. At first glance, one could interpret this as showing that the RAG models examined in section 3.4.2, in which the generator acts as a parametric memory while the retriever $p_\eta$ embodies a non-parametric memory in the query encoder $q$, perform better than the 1-NN baseline alone; yet when the latter is combined with abstractive summarization of the diagnostic texts of $k$ additional visually similar training images, with $k \in \mathbb{Z}^+$, it can perform even better, as indicated in section 3.4.1 and Table 6. Our models use an AlexNet or a VGG-16 CNN pre-trained on the ImageNet classification dataset as image encoder, with the training, validation and development sets described in section 2 merged together, and achieve the BLEU scores listed in Table 8.

Table 8
Summary of our configurations' image encoders and statistics
Backbone Network | BLEU score | Submission ID
AlexNet | 0.24064 | 181711
VGG-16 | 0.22757 | 181859

4. Directions for future work
In this work, we developed CNN-based image encoders that are either trained end-to-end for tag assignment or combined with heuristics such as the 1-NN baseline, for concept prediction and caption generation respectively. Although the 1-NN baseline is really simple, it performs rather well when combined with abstractive summarization algorithms, as highlighted in section 3.4.1 as well as in the study in [2], where this baseline by itself performs well on the Indiana University chest X-ray Collection [33] (IU chest X-ray dataset). Future work could focus on the use of task-specific models for summarization, such as BioBERT [34], on further tuning the number of neighbours $k$ and the maximum summary length $n$ of section 3.4.1, and on considering potential associations between the two subtasks when extending the 1-NN baseline. Furthermore, although higher quantitative accuracy is most often better, there are also categorical differences between DC methods, which relate to their qualitative evaluation and indicate their practical usefulness. It remains an open question how to obtain practical information about the quality of the generated captions.

5. Ethical considerations
Development of Diagnostic Captioning systems based on novel DL architectures could have both positive and negative societal impacts. Our proposed work, for example, may be used for analyzing medical image data in undeveloped regions or countries under development. This is related to the 3rd of the United Nations Sustainable Development Goals, about ensuring good health and well-being, and to the 10th goal, about reduced inequalities. On the other hand, privacy issues might arise from the use of medical data and "concerns over the sensitive information security and privacy" [35], which also relate to the General Data Protection Regulation (GDPR) and EU legislation.

References
[1] C. E. Lawson, J. M. Martí, T. Radivojevic, S. V. R. Jonnalagadda, R. Gentz, N. J. Hillson, S. Peisert, J. Kim, B. A. Simmons, C. J. Petzold, S. W. Singer, A. Mukhopadhyay, D. Tanjore, J. G. Dunn, H. Garcia Martin, Machine learning for metabolic engineering: A review, Metabolic Engineering 63 (2021) 34–60. URL: https://www.sciencedirect.com/science/article/pii/S109671762030166X. doi:10.1016/j.ymben.2020.10.005. Tools and Strategies of Metabolic Engineering.
[2] J. Pavlopoulos, V. Kougia, I. Androutsopoulos, D. Papamichail, Diagnostic captioning: a survey, Knowledge and Information Systems (2022) 1–32. doi:10.1007/s10115-022-01684-7.
[3] J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2022 – caption prediction and concept detection, in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[4] B. Ionescu, H. Müller, R. Peteri, J. Rückert, A. B. Abacha, A. G. S. de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, S. Kozlovski, Y. D. Cid, V. Kovalev, L.-D. Stefan, M. G. Constantin, M. Dogariu, A. Popescu, J. Deshayes-Chossart, H. Schindler, J. Chamberlain, A. Campello,
A. Clark, Overview of the ImageCLEF 2022: Multimedia retrieval in medical, social media and nature applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 13th International Conference of the CLEF Association (CLEF 2022), LNCS Lecture Notes in Computer Science, Springer, Bologna, Italy, 2022.
[5] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology Objects in COntext (ROCO): A Multimodal Image Dataset, in: D. Stoyanov, Z. Taylor, S. Balocco, R. Sznitman, A. L. Martel, L. Maier-Hein, L. Duong, G. Zahnd, S. Demirci, S. Albarqouni, S. Lee, S. Moriconi, V. Cheplygina, D. Mateus, E. Trucco, E. Granger, P. Jannin (Eds.), Intravascular Imaging and Computer Assisted Stenting - and - Large-Scale Annotation of Biomedical Data and Expert Label Synthesis - 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings, volume 11043 of Lecture Notes in Computer Science, Springer, 2018, pp. 180–189. URL: https://doi.org/10.1007/978-3-030-01364-6_20. doi:10.1007/978-3-030-01364-6_20.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision 115 (2014). doi:10.1007/s11263-015-0816-y.
[7] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Y. W. Teh, M. Titterington (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, PMLR, Chia Laguna Resort, Sardinia, Italy, 2010, pp. 249–256. URL: https://proceedings.mlr.press/v9/glorot10a.html.
[8] A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, CoRR abs/1404.5997 (2014). URL: http://arxiv.org/abs/1404.5997. arXiv:1404.5997.
[9] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[11] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv 1409.1556 (2014).
[12] M. Tan, Q. V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, CoRR abs/1905.11946 (2019). URL: http://arxiv.org/abs/1905.11946. arXiv:1905.11946.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.
[14] I. Athanasiadis, G. Moschovis, A. Tuoma, Weakly-Supervised Semantic Segmentation via Transformer Explainability, in: ML Reproducibility Challenge 2021 (Fall Edition), 2022. URL: https://openreview.net/forum?id=rcEDhGX3AY.
[15] J. Surowiecki, The Wisdom of Crowds, Anchor, 2005.
[16] D. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, International Conference on Learning Representations (2014).
[17] I. Loshchilov, F. Hutter, Fixing Weight Decay Regularization in Adam, 2018. URL: https://openreview.net/forum?id=rk6qdGgCZ.
[18] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Y. Ding, A. Bagul, C. P. Langlotz, K. S. Shpanskaya, M. P. Lungren, A. Y. Ng, CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning, CoRR abs/1711.05225 (2017). URL: http://arxiv.org/abs/1711.05225. arXiv:1711.05225.
[19] J. Konar, P. Khandelwal, R. Tripathi, Comparison of Various Learning Rate Scheduling Techniques on Convolutional Neural Network, in: 2020 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS), 2020, pp. 1–5. doi:10.1109/SCEECS48394.2020.94.
[20] H. Robbins, S. Monro, A stochastic approximation method, Annals of Mathematical Statistics 22 (1951) 400–407.
[21] C. J. Darken, J. E. Moody, Note on Learning Rate Schedules for Stochastic Optimization, in: NIPS, 1990.
[22] L. N. Smith, No More Pesky Learning Rate Guessing Games, CoRR abs/1506.01186 (2015). URL: http://arxiv.org/abs/1506.01186. arXiv:1506.01186.
[23] G. Liu, T. H. Hsu, M. B. A. McDermott, W. Boag, W. Weng, P. Szolovits, M. Ghassemi, Clinically Accurate Chest X-Ray Report Generation, CoRR abs/1904.02633 (2019). URL: http://arxiv.org/abs/1904.02633. arXiv:1904.02633.
[24] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization, in: Proceedings of the 37th International Conference on Machine Learning, ICML'20, JMLR.org, 2020.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[27] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[28] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 9459–9474. URL: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
[29] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.
[30] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[31] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense Passage Retrieval for Open-Domain Question Answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://aclanthology.org/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
[32] S. Mussmann, S. Ermon, Learning and Inference via Maximum Inner Product Search, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 2587–2596. URL: https://proceedings.mlr.press/v48/mussmann16.html.
[33] D. Demner-Fushman, M. Kohli, M. Rosenman, S. Shooshan, L. Rodriguez, S. Antani, G. Thoma, C. McDonald, Preparing a collection of radiology examinations for distribution and retrieval, Journal of the American Medical Informatics Association (JAMIA) 23 (2016) 304–310. doi:10.1093/jamia/ocv080.
[34] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. doi:10.1093/bioinformatics/btz682.
[35] K. Abouelmehdi, A. Beni-Hssane, H. Khaloufi, M. Saadi, Big data security and privacy in healthcare: A Review, Procedia Computer Science 113 (2017) 73–80. URL: https://www.sciencedirect.com/science/article/pii/S1877050917317015. doi:10.1016/j.procs.2017.08.292. The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2017) / The 7th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH-2017) / Affiliated Workshops.