<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Label and Cross-Modal Based Concept Detection in Biomedical Images by MORGAN CS at ImageCLEF2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oyebisi Layode</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Mahmudur Rahman</string-name>
          <email>md.rahman@morgan.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Morgan State University</institution>
          ,
          <addr-line>Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automating the detection of concepts from medical images remains a challenging task, which requires further research and exploration. Since the manual annotation of medical images is a cumbersome and error-prone task, the development of a concept detection system would reduce the burden of annotating and interpreting medical images while providing a decision support system for medical practitioners. This paper describes the participation of the CS department at Morgan State University, Baltimore, USA (Morgan CS) in the medical Concept Detection task of the ImageCLEF2020 challenge. The task involves generating appropriate Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) for corresponding radiology images. We approached the concept detection task as a multi-label classification problem by training a classifier on several deep features extracted using pre-trained Convolutional Neural Networks (CNNs) and also by training a deep autoencoder. We also explored a recurrent concept sequence generator based on a multimodal technique of combining text and image features for recurrent sequence prediction. Training and evaluation were performed on the dataset (training, validation, and test sets) provided by the CLEF organizers, and we achieved our best F1 score of 0.167 by using DenseNet-based deep features.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical imaging</kwd>
        <kwd>Image annotation</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Concept detection</kwd>
        <kwd>Multi-label classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Diagnostic analysis of medical images such as radiographs or biopsies mostly
involves interpretation based on observed visual characteristics. In essence,
visual characteristics or features of images can be mapped to their corresponding
semantic annotations. Over the last two decades, neural networks have been
successfully modeled to learn such mappings from data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Consequently, this
paper addresses the annotation of medical images to generate condensed textual
descriptions in the form of UMLS (Unified Medical Language System) CUIs
(Concept Unique Identifiers) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] using the dataset under ImageCLEFmed 2020
concept detection task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is a subset of a larger Radiology Objects in
COntext (ROCO) dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The main objective of this challenge is to
automatically identify the presence of concepts (CUIs) in a large corpus of medical
images based on the visual image features. The concept detection task began in
2017 under the ImageCLEF challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the participants were tasked with
developing methods for predicting captions and detecting the multilabel
concepts over a range of medical and non-medical images in a corpus. For example,
our previous participation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the ImageCLEFmed 2018 challenge involved
the use of LSTM architectures to approach the concept
detection task as a language-modeling problem: the model predicts the probability of
the next word (concept) occurring in a text sequence given the features of an
input image and the words (concepts) already predicted. This year, the task was
limited strictly to concept detection in radiology images [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The evaluation criterion for the submitted results is the F1 score between the
predicted concepts and the ground-truth concept labels.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>
        The dataset contains 64,753 radiology images from different modality classes as
the training set, 15,970 radiology images as the validation set, and 3,534 radiology
images from the same modality classes as the test set [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The training images are
annotated with 3047 unique UMLS concepts serving as the image captions. The
maximum number of concepts annotated per image is 140 and the minimum is 1.
The frequency distribution of the 3047 UMLS concepts across the training
images is presented in Table 1.
      </p>
      <sec id="sec-2-1">
        <title>Methods</title>
        <p>We approached the concept detection task by comparing elementary CUI
multi-label classification with recurrent CUI sequence generation, using features
extracted from several deep learning architectures. The multi-label classification
involves feeding the outputs of a feature extraction network into a fully
connected network to obtain a sigmoid activation output representing the CUI label
predictions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Feature Extraction</title>
      <p>
        Feature extraction is a critical component of medical image analysis. The
descriptiveness and discriminative power of the features extracted from medical images are
critical to achieving good classification and retrieval performance. Instead of using
any hand-crafted features, transfer learning techniques can be used to extract
features of images from a relatively small dataset using pre-trained Convolutional
Neural Network (CNN) models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Visual Feature Extraction To perform deep feature extraction, we chose
DenseNet169 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ResNet50 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as our pre-trained CNN models. These
models have been trained on the ImageNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] dataset consisting of 1000 categories.
The DenseNet architecture consists of dense blocks of convolution layers, with
consecutive operations of batch normalization (BN) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] followed by a rectified
linear unit (ReLU) [15], and it provides direct connections from any layer in a
block to all subsequent block layers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. ResNet, short for Residual Network, is a
classic neural network that is implemented with double- or triple-layer skips
that combine features within a residual block of layers, with
nonlinearities (ReLU) and batch normalization in between [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We used the DenseNet169
and ResNet50 pre-trained models, which are a 169-layer dense network and a
50-layer residual network, respectively. Both models have been trained on 1.28
million images [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. For feature extraction, both models are modified to exclude
the final 1000-D classification layer, and the output before this classification layer
is saved. To obtain our deep features, the input images are first resized to the
required input size of 224 × 224 and further preprocessed using the Keras [16]
preprocess_input function, which converts the input into the format the model
requires. Since the DenseNet model had been modified to exclude the final
1000-D classification output, a 4096-D feature vector is obtained as the output of
the last average pooling layer. Similarly, a 2048-D feature vector was obtained by
passing the 224 × 224 input images through the modified pre-trained ResNet50
model. The extracted features are utilized for transfer learning with the multi-label
and recurrent CUI sequence classification models, which are built on the DenseNet
features and on a feature fusion of the DenseNet and ResNet extracted features.
Feature Fusion Feature fusion methods have been demonstrated to be
effective for many computer vision-based applications [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Combining features
learned from various architectures creates an expanded feature learning space.
We combined the features obtained from the pre-trained DenseNet169 and
ResNet50 models by computing the partial least squares canonical correlation
analysis (PLS-CCA) [17] of both feature vectors. The canonical correlation
finds a linear combination of the feature elements from both vectors such that
the correlation between the vectors is maximized. Before computing the
PLS-CCA, the ResNet50-based deep features are resized from the 2048-D vector to a
4096-D output. Since PLS-CCA requires both vectors to have the same
dimension, the resized 4096-D vector is obtained by duplicating each element of the
2048-D vector. The PLS-CCA is computed by combining the 4096-D DenseNet
features with the resized 4096-D ResNet-based deep features. For feature vectors X
(4096-D DenseNet) and Y (4096-D ResNet), the first and second component vectors u and
v are obtained such that the correlation corr(X, Y) is maximized [17]:

corr((X, u), (Y, v)) = \frac{u^{T} X^{T} Y v}{\sqrt{u^{T} X^{T} X u}\;\sqrt{v^{T} Y^{T} Y v}}    (1)

where u = a_1 X_1 + a_2 X_2 + \dots + a_n X_n and v = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n.
      </p>
      <p>
        Vectors u and v are obtained by computing the weight vectors [a_1, a_2, \dots, a_n]
and [b_1, b_2, \dots, b_n]. We selected the first-component 4096-D feature vector from
the PLS-CCA computation. The result obtained is representative of the features
from the maximized correlation of the DenseNet169 and ResNet50 features.
      </p>
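      <p>As a concrete illustration of the feature extraction and fusion steps above, the following
is a minimal sketch assuming the TensorFlow/Keras pre-trained application models and the
scikit-learn PLSCanonical implementation; the helper names, the element-duplication routine,
and the number of retained components are our assumptions rather than the exact pipeline
used for the submitted runs.</p>
      <preformat>
# Sketch: deep feature extraction with pre-trained CNNs and PLS-CCA based fusion.
import numpy as np
from tensorflow.keras.applications import DenseNet169, ResNet50
from tensorflow.keras.applications.densenet import preprocess_input as densenet_preprocess
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet_preprocess
from sklearn.cross_decomposition import PLSCanonical

# Backbones without the final 1000-D classification layer; global average pooling
# gives one feature vector per image (its length depends on the backbone).
densenet = DenseNet169(weights="imagenet", include_top=False, pooling="avg")
resnet = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3), already resized."""
    dense_feat = densenet.predict(densenet_preprocess(images.copy()))
    res_feat = resnet.predict(resnet_preprocess(images.copy()))
    return dense_feat, res_feat

def resize_by_duplication(feat, target_dim):
    """Duplicate each element (the paper's resizing step) until target_dim is reached."""
    reps = -(-target_dim // feat.shape[1])  # ceiling division
    return np.repeat(feat, reps, axis=1)[:, :target_dim]

def fuse_pls_cca(dense_feat, res_feat, n_components=1):
    """Fuse the two views with PLS-CCA; the per-image scores serve as the fused feature here."""
    res_resized = resize_by_duplication(res_feat, dense_feat.shape[1])
    pls = PLSCanonical(n_components=n_components)
    dense_scores, _ = pls.fit_transform(dense_feat, res_resized)
    return dense_scores
      </preformat>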
      <p>
        Feature Extraction based on Autoencoder We also use an
encoder-decoder-based framework (Fig. 1) to extract deep feature representations unique to the
dataset. Autoencoders are a type of unsupervised neural network (i.e., no class
labels or labeled data) that consist of an encoder and a decoder model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. When
trained, the encoder takes input data and learns a latent-space representation
of the data. This latent-space representation is a compressed representation of
the data, allowing the model to represent it in far fewer parameters than the
original data.
      </p>
      <p>The encoder region contracts normalized pixel-wise data from input images
into smaller dimensional feature maps using sequential layers of 2D convolutions,
batch normalization and ReLU activation. The output from the convolutional
blocks is passed to a fully connected layer that represents a 256-D feature space.
The decoder expands the 256-D fully connected output by applying transposed
convolutions that upsample the features back to the original input size. Batch
normalization and ReLU activation are also added at each step of the transposed
convolution sequence, and the encoder filter sizes mirror the decoder filter sizes.
The 256-D output from the encoder is taken as the auto-encoded deep
feature representation of the input image. The autoencoder was trained using
the Adam optimizer [18] and a mean squared error loss on the ROCO training
dataset for 20 epochs with a batch size of 50. The initial Adam learning rate was
set to 0.001.</p>
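      <p>The encoder-decoder just described can be sketched as follows; this is a minimal
illustration assuming TensorFlow/Keras, and the filter counts, strides, and input size are
our assumptions rather than the exact configuration used in the experiments.</p>
      <preformat>
# Sketch: convolutional autoencoder with a 256-D bottleneck, trained with Adam (lr=0.001) and MSE.
from tensorflow.keras import layers, models, optimizers

def build_autoencoder(input_shape=(224, 224, 3), latent_dim=256):
    inp = layers.Input(shape=input_shape)

    # Encoder: Conv2D + BatchNorm + ReLU blocks that contract the input.
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    shape_before_flatten = tuple(int(d) for d in x.shape[1:])
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, name="encoded_feature")(x)  # 256-D deep feature

    # Decoder: mirror the encoder with transposed convolutions that upsample.
    y = layers.Dense(shape_before_flatten[0] * shape_before_flatten[1] * shape_before_flatten[2])(latent)
    y = layers.Reshape(shape_before_flatten)(y)
    for filters in (128, 64, 32):
        y = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
    out = layers.Conv2D(input_shape[-1], 3, padding="same", activation="sigmoid")(y)

    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, latent)
    autoencoder.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
    return autoencoder, encoder

# Usage: autoencoder.fit(train_images, train_images, epochs=20, batch_size=50);
# encoder.predict(images) then yields the 256-D auto-encoded deep features.
      </preformat>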
      <p>
        Text Feature Extraction The deep text features are extracted from the image
concepts by learning and mapping deep feature embeddings that represent the
sequence of image concepts. The embeddings are learned during training when
a fixed-length CUI sequence is passed to a neural embedding layer. Before
passing the CUI sequences to the embedding layer, for each input image, the
image concept sequence is tokenized using the Keras text preprocessing library.
Since a fixed-length tokenized CUI sequence is required for the embedding
layer, the differences in CUI sequence length across input images are
accommodated by zero-padding the tokenized sequences up to the maximum CUI
sequence length of 140. During training, the embedding layer uses a mask to
ignore the padded values, and its output is passed to a long short-term memory
(LSTM) layer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] with 256 memory units. The output of the text encoding
block of the embedding and LSTM layers is a 256-D vector holding recurrent
information that may be mapped back to the input concept sequence.</p>
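      <p>The text encoding block described above can be sketched as follows, assuming the Keras
text preprocessing utilities; the vocabulary handling and embedding dimension are
illustrative assumptions.</p>
      <preformat>
# Sketch: tokenized CUI sequences -> zero-padded input -> masked Embedding -> LSTM(256).
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_CUI_LEN = 140  # maximum CUI sequence length in the training set

def tokenize_concepts(concept_strings):
    """concept_strings: one space-separated CUI string per training image."""
    tokenizer = Tokenizer(filters="", lower=False)
    tokenizer.fit_on_texts(concept_strings)
    sequences = tokenizer.texts_to_sequences(concept_strings)
    padded = pad_sequences(sequences, maxlen=MAX_CUI_LEN, padding="post")  # zero-pad to fixed length
    return tokenizer, padded

def build_text_encoder(vocab_size, embed_dim=128):
    seq_in = layers.Input(shape=(MAX_CUI_LEN,))
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)  # mask ignores the padding
    text_feat = layers.LSTM(256)(x)                                      # 256-D text feature
    return models.Model(seq_in, text_feat)
      </preformat>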
      <p>
        The high number of classification (CUI) labels (3047) and the imbalance in the label
frequencies introduce a strong bias into the multi-label classification problem.
The concept set was therefore split into groups based on the concept frequencies (Table
1), and separate models were trained for classification within each concept
group. The DenseNet feature, the fused DenseNet-ResNet feature, and the
auto-encoded feature are passed to a stack of fully connected layers for the
multi-label prediction in the different dataset groups, as shown in Fig. 2. The fully
connected network is composed of Dense layers stacked together to learn weights
for a final sigmoid classification of the concept labels. The expected input of
the fully connected classifier is the deep encoded feature vector corresponding to
an image, while the output is the binary multi-label classification of the concepts
associated with the input image features.
      </p>
      <p>The fully connected classifier was trained over 20 epochs with a learning rate
of 1e-3 for the Adam optimizer. Since the concept set was split into groups and
a different classifier was trained for each concept group, the overall CUI prediction
for an input image is obtained by combining the predictions from all concept
group classifiers.</p>
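      <p>A minimal sketch of one per-group fully connected classifier follows; the hidden layer
sizes and the binary cross-entropy loss are assumptions, since only the optimizer, learning
rate, and sigmoid output are specified above.</p>
      <preformat>
# Sketch: fully connected multi-label classifier head over a deep feature vector.
from tensorflow.keras import layers, models, optimizers

def build_multilabel_classifier(feature_dim, num_concepts_in_group):
    feat_in = layers.Input(shape=(feature_dim,))
    x = layers.Dense(1024, activation="relu")(feat_in)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(num_concepts_in_group, activation="sigmoid")(x)  # one score per CUI
    model = models.Model(feat_in, out)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy")  # assumed loss for the multi-label setting
    return model

# One such classifier is trained (20 epochs) for each concept-frequency group;
# the overall prediction for an image is the union of the CUIs predicted by all groups.
      </preformat>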
    </sec>
    <sec id="sec-4">
      <title>Concept Sequence Generation</title>
      <p>The CUI sequence generator involves training a recurrent classifier on a fusion
of the extracted image features and the embedded textual features. The text
features are obtained by learning the embeddings at training time from the
embedding layer stacked with an LSTM layer to give a 256-D text feature output.
Since combining the image and text feature vectors requires equal
feature vector lengths, the 4096-D image feature is down-sampled by passing it through
a dense layer with 256 units to give a 256-D feature output. The 256-D image
feature and the 256-D text feature are passed to a concatenation layer, and the
combined output is passed to a final dense classification layer for the prediction of
the next word in the CUI sequence. The CUI sequence prediction begins when a
start signal is passed as the first element of the CUI sequence, and the prediction
ends when a stop signal is predicted by the classification model, as shown in Fig.
3. The recurrent classifier was trained over 30 epochs with a learning rate of
1e-3 for the Adam optimizer and a batch size of 50.</p>
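      <p>The generator described above can be sketched as follows, assuming TensorFlow/Keras;
the embedding dimension, the loss, and the concatenation that yields a 512-D merged vector
in this sketch are our assumptions, and the greedy decoding loop is one simple way to realize
the start-to-stop prediction procedure.</p>
      <preformat>
# Sketch: recurrent CUI sequence generator merging image and text features.
import numpy as np
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_CUI_LEN = 140

def build_sequence_generator(image_feat_dim, vocab_size, embed_dim=128):
    img_in = layers.Input(shape=(image_feat_dim,))
    img_feat = layers.Dense(256, activation="relu")(img_in)  # down-sample image feature to 256-D

    seq_in = layers.Input(shape=(MAX_CUI_LEN,))
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)
    txt_feat = layers.LSTM(256)(x)                           # 256-D text feature

    merged = layers.concatenate([img_feat, txt_feat])        # combined image-text representation
    out = layers.Dense(vocab_size, activation="softmax")(merged)  # next CUI in the sequence

    model = models.Model([img_in, seq_in], out)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy")
    return model

def generate_concepts(model, image_feature, start_id, stop_id):
    """Greedy decoding from the start signal until the stop signal is predicted."""
    seq = [start_id]
    for _ in range(MAX_CUI_LEN):
        padded = pad_sequences([seq], maxlen=MAX_CUI_LEN, padding="post")
        probs = model.predict([image_feature[None, :], padded], verbose=0)[0]
        next_id = int(np.argmax(probs))
        if next_id == stop_id:
            break
        seq.append(next_id)
    return seq[1:]  # predicted CUI token ids, excluding the start signal
      </preformat>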
      <sec id="sec-4-1">
        <title>Results and Discussions</title>
        <p>Using the provided test dataset, multiple runs were submitted based on the
multi-label classification with DenseNet, DenseNet-ResNet and auto-encoded
features. The result from the recurrent concept sequence generator with DenseNet
encoded features was also submitted, and the F1 evaluations are presented in
Table 2. Our best result, with an F1 score of 0.167, was obtained from the
multi-label classification of the DenseNet features.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>F1 scores of the submitted runs.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Run</th><th>Method</th><th>F1 Score</th></tr>
            </thead>
            <tbody>
              <tr><td>MSU dense fcn</td><td>DenseNet169 + multi-label classification</td><td>0.167</td></tr>
              <tr><td>MSU dense resnet fcn 1</td><td>(DenseNet169 + ResNet50) + multi-label classification</td><td>0.153</td></tr>
              <tr><td>MSU dense feat</td><td>DenseNet169 + multi-label classification</td><td>0.139</td></tr>
              <tr><td>MSU dense fcn 2</td><td>DenseNet169 + multi-label classification</td><td>0.094</td></tr>
              <tr><td>MSU dense fcn 3</td><td>DenseNet169 + multi-label classification</td><td>0.089</td></tr>
              <tr><td>MSU autoenc fcn</td><td>Autoencoder + multi-label classification</td><td>0.063</td></tr>
              <tr><td>MSU lstm dense fcn</td><td>DenseNet169 + recurrent concept generator</td><td>0.062</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <list list-type="order">
          <list-item>
            <p>MSU dense fcn: This run utilized a multi-label classification model with
the training parameters (described in Section 2.2) based on the features
extracted from a pre-trained DenseNet169. The threshold for the prediction
score, which ranges from 0 to 1, is set at 0.4 for the multi-label sigmoid classification;
concept labels with prediction scores below 0.4 are considered irrelevant to the
input image (a minimal sketch of this thresholding step appears after this list).</p>
          </list-item>
          <list-item>
            <p>MSU dense resnet fcn 1: In this run, the PLS-CCA of the DenseNet169 and
ResNet50 features is computed to obtain fused features for the multi-label
classification. The prediction score threshold for this run is also set at 0.4
for the final multi-label sigmoid classification.</p>
          </list-item>
          <list-item>
            <p>MSU dense feat, MSU dense fcn 2, MSU dense fcn 3: These runs
are variations of the MSU dense fcn run with different prediction score
thresholds of 0.5, 0.3 and 0.25, respectively.</p>
          </list-item>
          <list-item>
            <p>MSU autoenc fcn: The encoder-decoder model is utilized for this run to
obtain the encoded features of the input images. The multi-label classification
model (with the same parameters as in runs 1, 2 and 3) is trained on
the auto-encoded features. The threshold for the prediction score from the
classification model is also set to 0.4.</p>
          </list-item>
          <list-item>
            <p>MSU lstm dense fcn: This run involved the recurrent generation of
concepts by utilizing image features extracted from DenseNet169 combined with
embedded concept sequences, as described in Section 2.3.</p>
          </list-item>
        </list>
        <p>The obtained results clearly show that the concept prediction challenge is more of a
classification problem than a sequence generation task, since all multi-label
classification approaches performed better than the sequence generator.</p>
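        <p>For reference, a minimal sketch of the thresholding step used in the classification runs
is given below; the index-to-CUI mapping is an assumed helper structure.</p>
        <preformat>
# Sketch: convert sigmoid prediction scores into a CUI set with a fixed threshold (0.4 in most runs).
import numpy as np

def scores_to_cuis(scores, index_to_cui, threshold=0.4):
    """scores: 1-D array of sigmoid outputs; index_to_cui: list mapping label index to CUI string."""
    keep = np.where(scores >= threshold)[0]
    return [index_to_cui[i] for i in keep]
        </preformat>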
      </sec>
      <sec id="sec-4-2">
        <title>Conclusions</title>
        <p>This article describes the participation strategies of the Morgan CS group
for the concept detection task of ImageCLEF2020. We performed multi-label
classification of CUIs in different deep feature spaces. We achieved comparable
results considering the limited resources (computing and memory power) we had
at the time of submission. Since the ROCO dataset is grouped into different
modalities, we plan to perform separate multi-label classification for the different
modalities in the future.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Acknowledgment</title>
        <p>This work is supported by an NSF grant (Award ID 1601044), HBCU-UP
Research Initiation Award (RIA).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          <article-title>: Show and tell: A neural image caption generator</article-title>
          .
          <source>2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , Boston, MA, pp.
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          , (
          <year>2015</year>
          ) https://doi.org/10.1109/CVPR.2015.7298935
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          <article-title>: The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>32</volume>
          (
          <issue>5</issue>
          ),
          D267-D270
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          <article-title>: Overview of the ImageCLEFmed 2020 Concept Prediction Task: Medical Image Understanding</article-title>
          .
          <source>CEUR Workshop Proceedings (CEUR- WS.org)</source>
          ,
          <source>ISSN</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          , J. Ruckert,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa und C. M.</surname>
          </string-name>
          <article-title>Friedrich : Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</article-title>
          .
          <source>Proceedings of the MICCAI Workshop on Large-scale Annotation of Biomedical data and Expert Label Synthesis (MICCAI LABELS</source>
          <year>2018</year>
          ), Granada, Spain,
          <year>September 16</year>
          ,
          <year>2018</year>
          , Lecture Notes in Computer Science (LNCS) 11043, pp
          <volume>180</volume>
          -
          <fpage>189</fpage>
          , (
          <year>2018</year>
          ) https://doi.org/10.1007/978-3-030-01364-6_20
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , H. Muller,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          , D. Demner-Fushman,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kozlovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dicente Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ninh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fichou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Berari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          , L. D. Ștefan, M. G. Constantin:
          <article-title>Overview of the ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature, and Internet Applications In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the 11th International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), Thessaloniki, Greece,
          <source>LNCS Lecture Notes in Computer Science</source>
          ,
          <volume>12260</volume>
          , Springer, September
          <volume>22</volume>
          - 25, (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          <article-title>: A cross modal deep learning based approach for caption prediction and concept detection by CS Morgan State</article-title>
          . Working Notes of CLEF 2018 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , Avignon, France,
          <source>September 10-14</source>
          ,
          <year>2018</year>
          , CEUR Workshop Proceedings, 2125, CEUR-WS.org, (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          <article-title>: Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          . CoRR, abs/1409.1556 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          <article-title>: Deep Residual Learning for Image Recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision</source>
          and
          <article-title>Pattern Recognition (CVPR), Las Vegas</article-title>
          , NV, pp.
          <volume>770</volume>
          -
          <fpage>778</fpage>
          , (
          <year>2016</year>
          ) https://doi.org/10.1109/CVPR.2016.90
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          <article-title>: Densely Connected Convolutional Networks</article-title>
          .
          <source>2017 IEEE Conference on Computer Vision</source>
          and
          <article-title>Pattern Recognition (CVPR), Honolulu</article-title>
          , HI, pp.
          <volume>2261</volume>
          -
          <fpage>2269</fpage>
          , (
          <year>2017</year>
          ) https://doi.org/10.1109/CVPR.2017.243
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>J. Deng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            and
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Fei-Fei</surname>
          </string-name>
          <article-title>: ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>2009 IEEE Conference on Computer Vision</source>
          and Pattern Recognition, Miami, FL, pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          , (
          <year>2009</year>
          ). https://doi.org/10.1109/CVPR.2009.5206848
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>T.</given-names>
            <surname>Akilan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q. M. J. Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            and
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Safaei</surname>
          </string-name>
          <article-title>: Fusion of transfer learning features and its application in image classification</article-title>
          .
          <source>2017 IEEE 30th Canadian Conference on Electrical and Computer</source>
          Engineering (CCECE), Windsor, ON, pp.
          <volume>1</volume>
          -
          <issue>5</issue>
          , (
          <year>2017</year>
          ) https://doi.org/10.1109/CCECE.2017.7946733
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          , I. Lajoie,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Manzagol</surname>
          </string-name>
          <article-title>: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          .
          <volume>11</volume>
          , pp.
          <volume>3371</volume>
          -
          <issue>3408</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>Schmidhuber : Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ), pp.
          <volume>1735</volume>
          -
          <issue>80</issue>
          (
          <year>1997</year>
          ), https://doi.org/10.1162/neco.1997.9.8.1735
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</article-title>
          .
          <source>ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning</source>
          .
          <volume>37</volume>
          , pp.
          <volume>448</volume>
          -
          <fpage>456</fpage>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. R.H.R. Hahnloser, R. Sarpeshkar, M.A. Mahowald, R.J. Douglas and H.S. Seung: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature. 405, pp. 947-951 (2000). https://doi.org/10.1038/35016072</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. F. Chollet: keras, GitHub. https://github.com/fchollet/keras. Last accessed 29 Jul 2020</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. H. Hotelling: Relations Between Two Sets of Variates. In: Breakthroughs in Statistics: Methodology and Distribution, S. Kotz and N. L. Johnson, Eds. New York, NY: Springer, pp. 162-190 (1992)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. D.P. Kingma and J.L. Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG], International Conference on Learning Representations (ICLR), (2015)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>