<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Techniques for Medical Concept Detection from Multi-Modal Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohit Sonker</string-name>
          <email>rohit.sonker@pwc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayush Mishra</string-name>
          <email>ayush.mishra@pwc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palvika Bansal</string-name>
          <email>palvika.lnu@pwc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anup Pattnaik</string-name>
          <email>anup.a.pattnaik@pwc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PricewaterhouseCoopers US Advisory</institution>
          ,
          <addr-line>Mumbai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the increasing availability of medical images coming from different modalities (X-Ray, CT, PET, MRI, Ultrasound, etc.), the task of automatic medical image captioning is emerging as a key component in medical research. ImageCLEF 2020 is dedicated to extracting relevant concepts from a large corpus of radiology medical images with different image modalities by learning the visual contents of the images. The variability between modalities and the expertise required in interpreting radiology images often represent a bottleneck in clinical diagnosis pipelines. Therefore, we propose a reliable automatic classification method, which is highly desired as assistance for human radiologists in producing reports more accurately and efficiently. Throughout the experiment, we leveraged CNN architectures, NLP, and clustering techniques to come up with our best system. In this paper, we introduce a novel technique of band classification, where we first cluster the vocabulary of concepts into bands and then build customized classification architectures for each band. Predictions of one band are given as input to subsequent bands to aid the learning of associated concepts. We also systematically explored several pre-processing approaches to handle variations in contrast and intensity across images of different modalities. In the final evaluation of ImageCLEF 2020, we submitted 9 runs, out of which our best systems ranked 3rd, 4th, and 5th. Overall, our team ranked 2nd among 41 participants globally.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Captioning</kwd>
        <kwd>Medical Imaging Modalities</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Concept Detection</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The healthcare industry has been witnessing an increasing shift towards
digitization across the world. With more and more hospitals now saving their patient
data and medical images electronically, the platform is set to leverage AI
capabilities to assist doctors in their diagnosis and enhance the entire healthcare
ecosystem. Medical images, ranging from MRI, CT, and PET scans to
X-rays, are used for the diagnosis and treatment of many diseases such as cancer,
pneumonia, and pneumothorax. Medical domain experts go through the
medical scans of the patient and subsequently write a condensed textual report,
which is a time-consuming process and also leads to an increase in the cost of
such treatment. The reading and interpretation of medical images, like all other
human processes, are prone to error.</p>
      <p>The above stated problems and the abundance of medical images in the
current scenario have motivated us to use AI techniques to semi-automate the
report generation process. We intend to build a pipeline that will take a medical
image and caption out the keywords stating the abnormalities and the type of
machine used to produce the image. The output caption will not be free-flowing
natural-language text, but simply a list of keywords
next to each other. We hope this system will be an efficient secondary check for
the medical experts and will also speed up the report writing process.</p>
      <p>ImageCLEF [4] hosted the 4th edition of its Medical Image Captioning task
where we were provided with a subset of Radiology Objects in Context (ROCO)
dataset [8]. We have used multiple approaches to tackle the problem,
leveraging deep learning architectures and NLP techniques. Before the images are passed
into the deep neural network models, they first go through certain
pre-processing steps, which are further explained in Section 3.</p>
      <p>We have used the following methods to extract the keywords from a given
patient medical image:</p>
    </sec>
    <sec id="sec-2">
      <title>1. ResNet18 fine-tuned + Custom CNN</title>
      <p>2. ResNet18 on Scan Type
3. Band Classification
4. KNN with ResNet101 embeddings, modality-wise
5. KNN on ResNet18 embeddings with weighted label combination
6. Concept Clustering based on data segregation
We explain all of the above methods in detail in Section 4.</p>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>ImageCLEF 2020 MedCaption task [7] focused on extracting information from
radiology images. The dataset provided as part of the challenge is a subset of the
extended ROCO dataset, with additional imaging modality information. A total
of 6,031,814 image-caption pairs were extracted. To focus on radiology images
and non-compound figures, automatic filtering with deep learning systems as
well as manual revisions were applied. After filtering, a total of 80,183 images
were provided to us, out of which 64,753 images were part of the training set and
15,970 images were part of the validation set. There were 3047 unique concepts
in the training set. For a given image, the number of concepts was in the range
of 1-140. The average number of concepts per image was close to 6. All the
images were in JPEG format. The number of channels within the images was not
consistent.</p>
        <p>There were several challenges within the dataset. There was significant noise
in most of the images. The noise included doctor signatures, patient IDs, and
other random numbers inscribed on top of the radiology images. While most
of the images were restricted to a given organ of the body (Fig. 1), some of the
images showed the entire human body together with a zoomed-in scan of a particular organ.
Extracting captions from these types of images is quite difficult, as the AI
system has to focus only on the zoomed-in part to understand the abnormalities.
The zoomed scan in such images occupied very limited space, making it more
difficult to focus on the relevant part. Many images also had arrows, straight
lines, and the watermark of the hospital or organization where the scan was
taken, as shown in Fig. 2.</p>
        <p>(a) Image: ROCO2 CLEF 05912</p>
        <p>(b) Image: ROCO2 CLEF 31469</p>
        <p>The data was quite skewed in terms of the number of images in which a
particular concept occurred. This number varied from 34 to 20031, which shows the
level of skewness. As part of the exploration, we also extracted the text descriptions
of the concept IDs that were provided to us. The caption was processed using
QuickUMLS [9] to produce the gold UMLS concept unique identifiers (CUIs).
We further used the extracted text in one of the techniques. However, the
text descriptions for 12 concepts were not available.</p>
        <p>Fig.3 shows the distribution of the concepts. There are very few concepts with
frequency&gt;1000 in the training set and most of them are in the 0-500 bucket.
Concepts follow a similar distribution in the validation set as well.</p>
        <p>Table 1 shows a list of the 10 most frequently occurring concept IDs along
with their descriptions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Data Pre-processing</title>
        <p>We explored different pre-processing approaches for different systems of models.
In general, the CLAHE technique was used in most of the systems.</p>
        <sec id="sec-2-2-1">
          <title>Contrast Limited Adaptive Histogram Equalization (CLAHE)</title>
          <p>Images of different modalities have varying brightness, intensity, and contrast. To
handle these variations and to enhance feature detection, we used the CLAHE
[11] technique as the first step. CLAHE equalizes brightness and contrast among
images. An image is divided into regions and each region is histogram equalized,
so that each pixel is transformed based on the histogram of its surrounding region.
To limit noise amplification, contrast limiting is applied: CLAHE limits the
amplification by clipping the histogram at a predefined value called the clip limit.
This strengthens feature extraction from the edges of each region in the image.
During our experiments, we used a clip limit of 2.5, as this value gave the best
results. An example is shown in Fig. 4.</p>
          <p>Intensity Normalization: MRI images have larger intensity variations, due to
different scanners or parameters used during image acquisition. To
normalize the intensity, linear normalization was applied to MRI images. This changes
the range of pixel intensity values. We transformed images to new intensity
values in the range (0, 255). The motivation was to bring all the images
into a similar intensity range. This helps the deep learning network
to learn faster compared to CLAHE. An example is shown in Fig. 5.</p>
          <p>Fig. 5: Intensity Normalisation on Image ROCO2 CLEF 31504</p>
          <p>Range Normalization: We leveraged simple range normalization for the CT
images, where intensities are comparable across different scanners and feature
extraction from these images benefits from clipping or rescaling. This
normalization mapped intensities to [-1, 1] using the transformation below, where
shift and scale were identified from the minimum and maximum intensities. The
images were reconstructed through this transformation to handle contrasts. For
intensity I,</p>
          <p>
            <disp-formula id="eq-1">
              <label>(1)</label>
              <tex-math>I_{\mathrm{norm}} = \left(\frac{I - \mathit{shift}}{\mathit{scale}}\right) \times 2 - 1</tex-math>
            </disp-formula>
          </p>
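          <p>The two normalization steps can be illustrated with a minimal sketch. It assumes NumPy, reads "identified from minimum and maximum intensities" as shift = minimum and scale = maximum minus minimum, and uses the hypothetical helper names linear_normalize and range_normalize; it is an illustration under these assumptions, not the implementation used in our experiments.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def linear_normalize(image, new_min=0.0, new_max=255.0):
    """Linearly rescale pixel intensities to [new_min, new_max]."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast; map it to the lower bound.
        return np.full_like(image, new_min)
    return (image - lo) / (hi - lo) * (new_max - new_min) + new_min


def range_normalize(image):
    """Map intensities to [-1, 1] via ((I - shift) / scale) * 2 - 1."""
    image = np.asarray(image, dtype=np.float64)
    shift = image.min()                # assumption: shift is the minimum intensity
    scale = image.max() - image.min()  # assumption: scale is the intensity span
    if scale == 0:
        return np.zeros_like(image)
    return (image - shift) / scale * 2.0 - 1.0


# Toy arrays standing in for a CT slice and an MRI slice (made-up values).
ct_slice = np.array([[-1000.0, 0.0], [500.0, 1000.0]])
mri_slice = np.array([[10.0, 20.0], [30.0, 50.0]])

ct_norm = range_normalize(ct_slice)     # values now lie in [-1, 1]
mri_norm = linear_normalize(mri_slice)  # values now lie in [0, 255]
```
]]></preformat>
          <p>Under these assumptions the smallest and largest input intensities map exactly onto the ends of the target range, and all other values are placed proportionally in between.</p>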
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data Augmentation</title>
      <p>The effect of range normalization is shown in Fig. 6.</p>
      <p>To increase the diversity of data available for training
the models, various data augmentation techniques were used. The
image is first rotated within a range of 20 degrees, after which it is translated
horizontally by a fraction of 0.2 of the total width. A shear transformation is then applied
to the image in the counter-clockwise direction with an angle of 0.2 radians. The
image is then zoomed to a scale ranging from 0.8 to 1.2 and flipped
horizontally. All of the above transformations are applied on a random basis using the Keras
ImageDataGenerator.</p>
      <sec id="sec-3-1">
        <title>Techniques</title>
        <p>The input images were first converted into 3-channel RGB images and
then resized to 224 × 224 × 3, which is the input shape required by ResNet
architectures [3]. The images were then passed through a series of pre-processing
steps for noise removal and enhancement of brightness and contrast. We
used the Keras ImageDataGenerator in all of our techniques to load the images on
the fly, avoiding the need to load all the training images at once, which
would require a lot of memory. We also performed data augmentation using the
Keras ImageDataGenerator. The images in the training as well as the validation set
were given in 7 different folders, each representing the scan type of the images.
Since each image could have multiple CUIs tagged to it, we created a CUI index
dictionary mapping concept IDs to integers ranging from 0 to 3046. We created
a target vector of length 3047 for each image. The vector has a 1 at every index
whose corresponding concept ID is present in the
caption of the image, and a 0 elsewhere. All the operations were done using Python 3.6.2 and the Keras
framework. The deep learning models were trained on a virtual Ubuntu server
machine equipped with 2 NVIDIA Tesla P4 GPU accelerators. The GPUs were
accessed using the Google Cloud Platform. We now describe each technique
variant in detail below.</p>
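        <p>The construction of the CUI index dictionary and the target vectors can be sketched as follows. The function names, the toy annotations, and the choice to sort the concept IDs before indexing are assumptions made for illustration; with the full training set the same code would yield vectors of length 3047.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def build_cui_index(captions):
    """Map every concept ID (CUI) seen in the captions to an integer index."""
    # Sorting gives a reproducible ordering; the ordering itself is an assumption.
    unique_cuis = sorted({cui for cuis in captions.values() for cui in cuis})
    return {cui: idx for idx, cui in enumerate(unique_cuis)}


def to_target_vector(cuis, cui_index):
    """Return a list with 1 at the index of every concept present, 0 elsewhere."""
    target = [0] * len(cui_index)
    for cui in cuis:
        target[cui_index[cui]] = 1
    return target


# Hypothetical toy annotations: image ID -> concept IDs in its caption.
captions = {
    "img_a": ["C0040405", "C0817096"],
    "img_b": ["C0040405"],
    "img_c": ["C1306645", "C0817096"],
}

cui_index = build_cui_index(captions)
targets = {img: to_target_vector(cuis, cui_index) for img, cuis in captions.items()}
```
]]></preformat>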
        <sec id="sec-3-1-1">
          <title>System 1: ResNet18 on All Data</title>
          <p>Training images varied a lot from each other across scan types as well as within
the same type of scan. Despite using multiple pre-processing steps to remove
noise, performing multi-label classification with 3047 concepts was a daunting
task. A convolutional neural network with enough depth was the obvious choice
to learn the features and patterns and perform this task. We
chose a pre-trained ResNet18 as our baseline model. In this approach,
we keep all the layers trainable. We start off with weights trained on ImageNet
data [2]. These weights are trained on images that are completely different from
medical images, but they are still a better starting point than
random weights, assuming that the model will be able to extract the high-level
features of the image and learn the custom features when trained on our medical
images. We took the output of the penultimate layer of ResNet18 and passed
that tensor through a convolutional layer and a max-pooling layer and finally through
3 fully connected layers. We used the sigmoid function as the activation for the
last layer to enable multi-label outputs. We used binary cross-entropy as the
loss function and the Adam optimizer [5] with a learning rate of 0.0001. The batch
size was fixed at 32. The model was trained for 50 epochs and we observed that the
validation F1 score saturated after 42 epochs. Training each epoch took close to
840 seconds. The model architecture is described in Fig. 7.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>System 2: ResNet18 on Scan Type</title>
          <p>The given dataset had seven types of scanned medical images in their respective
folders. One type of scanned image might have different features from another
due to different scanners or the way the image is captured; for example, an MRI
image will have larger intensity variations than a CT image. Our approach
in this method was to train a separate ResNet18 network on each folder (based on the type of scan)
so that the model better captures the input features of the
given image, with those features limited to the type of scan. The number
of images per folder varied from 500 in PET scans to 20000 in Ultrasound scans.
The two folders with fewer than 1000 images (DRCO and DRPE) were merged
and passed through a single ResNet18 network; the others were passed into
their respective networks. All the models had the following
custom layers: Conv2d(128,3,3) + MaxPool(2,2) + FC(1024) + FC(512) +
FC(No. of Concepts) + Sigmoid.</p>
          <p>The model architecture is described in Fig. 8. The threshold used to decide
label presence is chosen individually for each network by checking the best
F1 score on the validation set.</p>
        </sec>
        <sec id="sec-3-1-2b">
          <title>System 3: Band Classification</title>
          <p>In this approach, we aim to handle separately the concepts that are predicted
correctly and the concepts that are not predicted well. The idea behind
this approach is to use different network parameters for different sets of labels
and to allow correlated labels to enhance performance. The bands consist of
different sets of labels categorized by their prediction performance on the complete
ResNet18 network. The images are preprocessed using CLAHE and the training
set is augmented.
          </p>
          <p>Hence, we first filter out the concepts that are predicted well by our
overall ResNet18 model. We calculate the F1 score corresponding to each
concept and select the concepts that have an F1 score &gt; 0.2. These form our first
band of target concepts. The remaining concepts are considered as
band 2. Based on our threshold of 0.2, band 1 contains 25 concepts and band
2 contains 3022 concepts. Further decomposition into more bands is also possible
by repeating the process; however, we noticed that in band 2 there is no wide
separation as to which concepts perform well, hence no further decomposition
is considered.</p>
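          <p>The band split can be sketched as below, assuming binary ground-truth and prediction matrices (one row per validation image, one column per concept) are available from the overall ResNet18 model. The helper names and the toy matrices are hypothetical; only the cut-off of 0.2 comes from the text.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def per_concept_f1(y_true, y_pred):
    """Compute one F1 score per concept (column) from binary label matrices."""
    n_concepts = len(y_true[0])
    scores = []
    for j in range(n_concepts):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # A concept that is never present and never predicted gets F1 = 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def split_into_bands(f1_scores, cutoff=0.2):
    """Band 1: concepts with F1 above the cutoff; band 2: all remaining concepts."""
    band1 = [j for j, score in enumerate(f1_scores) if score > cutoff]
    band2 = [j for j, score in enumerate(f1_scores) if score <= cutoff]
    return band1, band2


# Toy validation labels and predictions for 4 images and 3 concepts.
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]

f1_scores = per_concept_f1(y_true, y_pred)
band1, band2 = split_into_bands(f1_scores)
```
]]></preformat>
          <p>In the toy data only the first concept is predicted well, so it alone lands in band 1 while the other two concepts fall into band 2.</p>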
          <p>We train two separate neural networks to predict the concepts in band 1 and
band 2, for all images. The images are first passed through the band 1 network and
their predictions are generated. In the second band, the predictions of the first
band are added as an auxiliary input to the network. This is done to include
associative information between concepts, which may aid in learning concepts
that were not being predicted well in a combined network. We use a ResNet18
model with additional layers for both band networks. The networks vary
slightly in their architecture, due to the addition of more layers and an auxiliary
input in band 2.</p>
          <p>In band 1, the ResNet18 architecture (without top layer) is followed by a 2D
convolution layer (128,3,3), max pooling (2,2), and two fully connected layers
of 512 and 256 neurons respectively. The fully connected layers have ReLU
activation. All additional layers have a dropout of 0.2 while training. Finally, we
have an output layer (25 neurons) with sigmoid activation.</p>
          <p>In band 2, we have a similar structure with ResNet18 (without top layer)
followed by a 2D convolution layer (128,3,3) with max pooling (2,2) and fully
connected (FC) layers of 1024 neurons each. We also have an auxiliary input,
which is the prediction result of band 1. This input is passed through a layer of
256 neurons and concatenated to the FC layer mentioned above. Finally, the
combined input is passed through an FC layer with 1024 neurons and attached
to a sigmoid output layer of 3022 neurons. The architecture is shown in Fig. 9.</p>
          <p>The outputs of both networks are probabilities for each label. To convert them
into a binary label vector, we set a threshold based on the maximum F1 score
on the validation set. This is done for both bands; the thresholds used in band 1
and band 2 were 0.3 and 0.25 respectively. The binarized outputs from both bands
are then combined to get the complete prediction for all 3047 concepts.</p>
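          <p>Choosing a threshold by maximizing the validation F1 score can be sketched as a simple sweep over candidate values. The sketch assumes the F1 score is computed per image and then averaged, and the candidate grid, helper names, and toy data are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def sample_f1(true_row, pred_row):
    """F1 score between the true and predicted binary labels of one image."""
    tp = sum(1 for t, p in zip(true_row, pred_row) if t and p)
    denom = sum(true_row) + sum(pred_row)
    return 2 * tp / denom if denom else 0.0


def mean_f1(y_true, probs, threshold):
    """Binarize probabilities at the threshold and average F1 over all images."""
    total = 0.0
    for true_row, prob_row in zip(y_true, probs):
        pred_row = [1 if p >= threshold else 0 for p in prob_row]
        total += sample_f1(true_row, pred_row)
    return total / len(y_true)


def best_threshold(y_true, probs, candidates):
    """Return the candidate with the highest mean F1 (first one on ties)."""
    return max(candidates, key=lambda t: mean_f1(y_true, probs, t))


# Toy validation labels and predicted probabilities (2 images, 3 concepts).
y_true = [[1, 0, 1], [0, 1, 0]]
probs = [[0.9, 0.22, 0.37], [0.12, 0.6, 0.28]]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold = best_threshold(y_true, probs, candidates)
score = mean_f1(y_true, probs, threshold)
```
]]></preformat>
          <p>The same sweep can be run separately on the outputs of each network, which is how a different threshold per band or per scan type would arise.</p>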
        </sec>
        <sec id="sec-3-1-4">
          <title>System 4: K-Nearest Neighbour with ResNet101 Embeddings, Modality-wise</title>
          <p>In this approach, we use K-Nearest Neighbour algorithm [1] on ResNet101
embeddings [3]. For each test image, the K-most similar images from the training
set are retrieved and their labels are used to predict labels of the test image.
This approach is implemented independently for each modality.</p>
          <p>The training images are first converted to embeddings using a ResNet101
encoder. We use ResNet101 pre-trained on ImageNet data, without further training
of the network. The input images are preprocessed using CLAHE. No
augmentation of the training set is performed here. The embeddings of dimension (7,7,512)
are extracted from ResNet101 and then flattened.</p>
          <p>These training embeddings are added as a new layer to the encoder network,
such that for each input image, we get a similarity value corresponding to each
training image. This architecture allows us to compute the cosine similarity of
each input with respect to the training images. The architecture is depicted in Fig. 10.
Hence, each test image is first encoded and then its cosine similarity with respect
to each training image embedding is computed by the network.</p>
          <p>To compute the output labels, the following approach is used. The K training
images most similar to the test image are selected. The labels
of these K images are taken as binary label vectors, which are first added and then
normalized. Hence, we get a vector with values ranging from 0 to 1 for all labels.
We then select a threshold value t to convert this combined vector to a binary
label vector. Note that this process is done separately for each modality.
Hence, we have a total of 7 models, each with its own values of K and t. K
is chosen in proportion to the size of the training set. The threshold value is
chosen by analyzing the best performance on the validation set. The values of
K and t chosen are given in Table 2.</p>
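          <p>The label voting step can be sketched in plain Python as follows. The sketch computes cosine similarities directly rather than through a network layer, and it assumes that "normalized" means dividing the summed label vectors by K; the function names, embeddings, and labels are toy values chosen for illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(test_emb, train_embs, train_labels, k, t):
    """Vote over the labels of the k most similar training images."""
    sims = [cosine_similarity(test_emb, emb) for emb in train_embs]
    nearest = sorted(range(len(train_embs)), key=lambda i: sims[i], reverse=True)[:k]
    n_labels = len(train_labels[0])
    # Add the neighbours' binary label vectors, then divide by k (assumption).
    votes = [sum(train_labels[i][j] for i in nearest) / k for j in range(n_labels)]
    predicted = [1 if v >= t else 0 for v in votes]
    return predicted, votes


# Toy 2-dimensional embeddings and labels for 4 training images.
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]

predicted, votes = knn_predict([1.0, 0.05], train_embs, train_labels, k=2, t=0.5)
```
]]></preformat>
          <p>Here the two nearest neighbours are the first two training images, so the first label receives a full vote and the third label half a vote.</p>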
        </sec>
        <sec id="sec-3-1-5">
          <title>System 5: KNN on ResNet18 Embeddings with weighted label combination</title>
          <p>In this approach, we use a K-Nearest Neighbour algorithm on ResNet18
embeddings. A single ResNet18 network is used here for all modalities. The images are
preprocessed using CLAHE and the training set is augmented using the
methodology described earlier. The ResNet18 network is first trained on the training
set as described in the section on ResNet18 (all data) and then the top layer is
removed. This forms the encoder network.</p>
          <p>The training images are converted into embeddings using the encoder and
these embeddings are used as layers on top of the encoder network. This leads to
architecture similar to the previous approach however all modalities are taken
together here. A test image is first encoded and its cosine similarity to all training
images is computed. We select a K of 200 and all label vectors of these closest
training images are retrieved.</p>
          <p>In this approach, we define a new method to combine these K binary label
vectors. We use a method similar to term frequency-inverse document frequency
(TF-IDF) to combine values for each label. The combined value is proportional
to the occurrence of a label in the K images and inversely proportional to its
occurrence in the complete training set. This allows rare concepts to be predicted
more often. The formula is described below. Let L be the sum
of the K binary label vectors of the nearest neighbours for a particular
test point. Let R be the resulting combined label vector for our test point. Then,
for each label i, the value R_i is defined as</p>
          <p>
            <disp-formula id="eq-2">
              <label>(2)</label>
              <tex-math>R_i = \frac{L_i}{\max(L)} \times \log\left(\frac{\text{size of training set}}{\text{frequency of label } i \text{ in training set}}\right)</tex-math>
            </disp-formula>
          </p>
          <p>Once the vector R has been computed, we must convert this vector of values
to a binary vector to predict labels. For this, we set the threshold to the value
that gives the best performance on the validation set, which was t = 1.17.</p>
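          <p>The weighted label combination can be sketched as below. The natural logarithm is assumed because the base is not specified in the text; the function name and the toy frequencies are hypothetical, and every label is assumed to occur at least once in the training set so that the division is defined.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def combine_labels(neighbour_labels, label_frequency, train_size):
    """Combine K binary label vectors with a TF-IDF-like weighting."""
    n_labels = len(label_frequency)
    # L_i: how many of the K neighbours carry label i.
    counts = [sum(row[i] for row in neighbour_labels) for i in range(n_labels)]
    max_count = max(counts)
    if max_count == 0:
        return [0.0] * n_labels
    return [
        (counts[i] / max_count) * math.log(train_size / label_frequency[i])
        for i in range(n_labels)
    ]


# Toy example: K = 4 neighbours, 3 labels, training set of 1000 images.
neighbour_labels = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
label_frequency = [500, 10, 50]  # images per label in the training set

R = combine_labels(neighbour_labels, label_frequency, train_size=1000)
predicted = [1 if value >= 1.17 else 0 for value in R]  # t = 1.17 as in the text
```
]]></preformat>
          <p>In this toy case the most frequent label appears in three of the four neighbours but is still rejected, while the rare second label is accepted on a single occurrence, which is the intended effect of the inverse-frequency term.</p>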
        </sec>
        <sec id="sec-3-1-6">
          <title>System 6: Concept Clustering based data segregation</title>
          <p>The Concept Unique Identifiers (alphanumeric codes for concepts) associated with
an image must have some relation to each other; for example, an aneurysm
would be more closely related to a blood clot than to a fractured bone. In this
approach, we tried to group together concepts that were closely related to
each other. However, this relationship would be hard to determine using the CUIs
assigned to a given image, hence we converted the unique identifiers
into human-readable text using UMLS conversion [9]. Once we had the
UMLS-converted concepts, we extracted embeddings for each using BioWordVec
[10], a word-to-vector model trained on vocabulary frequently used
in the medical field. We then calculated the closeness of each concept to the
others using cosine similarity and grouped the concepts into 5 clusters using the
k-means algorithm [6].</p>
          <p>Once we have a data table with the similarity scores of the concepts with each other,
it is divided into 5 clusters using k-means clustering. Each cluster had between
30,000 and 42,000 images and 400 to 600 concepts associated with it, except for
the 6th cluster, which contained all the images with concepts whose embedding was
not found using BioWordVec; it had approximately 26,000 images and 170
concepts. The process is described in Fig. 11.</p>
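          <p>The grouping step can be illustrated with a small NumPy sketch that builds the cosine similarity table and runs k-means on its rows. The deterministic farthest-point initialisation, the iteration cap, the function names, and the 2-dimensional toy embeddings are all assumptions made so that the example is reproducible; they do not describe the exact configuration used in our experiments.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between all concept embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means on the rows of points; returns cluster indices."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(dists))])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster becomes empty.
        updated = np.array([
            points[assignment == c].mean(axis=0) if np.any(assignment == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assignment


# Toy embeddings for 6 concepts forming two obvious groups.
embeddings = np.array([
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.05],
    [0.1, 1.0], [0.2, 1.0], [0.05, 0.9],
])

similarity = cosine_similarity_matrix(embeddings)  # the "data table" of scores
clusters = kmeans(similarity, k=2)
```
]]></preformat>
          <p>Each concept is represented by its row of similarity scores, so concepts that are similar to the same set of other concepts end up in the same cluster.</p>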
          <p>A data frame was created for each cluster and was used to train a ResNet18 model
(without top layer) with the following custom layers: Conv2d(128,3,3) + MaxPool(2,2)
+ FC(1024) + FC(512) + FC(No. of Concepts) + Sigmoid.</p>
          <p>All the clusters were trained individually with a dedicated model. The
architecture is shown in Fig. 12. The test image was passed through each of these
6 models and their predictions were appended to make the final prediction of
concepts.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>We submitted our concept predictions on the test set in a text file, in which
each row corresponds to an image ID followed by the concept IDs predicted for
that image by our models. The predictions were evaluated using the F1 score, by
comparing the ground truth vector y_true to the predicted concept vector y_pred
and then averaging across all test images. Both vectors were 3047 in length,
which is equal to the number of unique classes present. The KNN approach
using ResNet101 embeddings gave the best F1 score of 0.392. All results are
shown in Table 3.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Conclusions and Future Work</title>
        <p>Throughout the challenge, we experimented with visual features of images as
well as the UMLS embeddings and tried to benefit from concept grouping
through the clustering mechanism. The setup is motivated by the variability in
modalities of radiology images and the motivation to extract value from all the data
that is provided. Our best model, KNN with ResNet101 embeddings, achieved an
F1 score of 0.39238 and ranked 3rd. The band classification approach that we
introduced in this paper allows the usage of different network architectures for
different sets of labels. Incorporating predictions from a set of bands in making
predictions for other bands promises an increase in performance in scenarios
where the concepts in bands are associated with each other.</p>
        <p>In future work, we aim to experiment with attention mechanisms to focus
on important features of images to improve concept detection and
the interpretability of the models. We also aim to improve the performance of the band
classification approach. Exploring the skewness in data availability for different concepts
could be an interesting extension to our models.</p>
        <p>8. Pelka, O., Koitka, S., Ruckert, J., Nensa, F., Friedrich, C.M.: Radiology objects
in context (ROCO): A multimodal image dataset. In: Intravascular Imaging and
Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and
Expert Label Synthesis, pp. 180–189. Springer (2018)</p>
        <p>9. Soldaini, L.: QuickUMLS: a fast, unsupervised approach for medical concept
extraction (2016)</p>
        <p>10. Yijia, Z., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, improving biomedical
word embeddings with subword information and MeSH. Scientific Data 6 (12 2019).
https://doi.org/10.1038/s41597-019-0055-0</p>
        <p>11. Zuiderveld, K.: Contrast limited adaptive histogram equalization. Graphics Gems
pp. 474–485 (1994)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cover</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Nearest neighbor pattern classification</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <fpage>21</fpage>
          –
          <lpage>27</lpage>
          (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In: CVPR09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <fpage>770</fpage>
          –
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefan</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020)</source>
          , vol.
          <volume>12260</volume>
          .
          <series>LNCS Lecture Notes in Computer Science</series>
          ,
          <publisher-name>Springer</publisher-name>
          ,
          <publisher-loc>Thessaloniki, Greece</publisher-loc>
          (September 22–25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>MacQueen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Some methods for classification and analysis of multivariate observations</article-title>
          .
          <source>In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>281</fpage>
          –
          <lpage>297</lpage>
          . Oakland, CA, USA (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEFmed 2020 concept prediction task: Medical image understanding</article-title>
          .
          <source>In: CLEF2020 Working Notes. CEUR Workshop Proceedings</source>
          ,
          <publisher-name>CEUR-WS.org</publisher-name>
          ,
          <publisher-loc>Thessaloniki, Greece</publisher-loc>
          (September 22–25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>