<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Fine-grained visual classification of fish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piotr Żerdziński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Fine-grained visual classification (FGVC) is the task of classifying images belonging to the same metaclass. This problem is challenging due to the small differences between classes and the small amount of available data. In this paper, a fine-grained classification model based on the attention mechanism is proposed. Attention allows the model to focus on the small differences that determine class membership. The model was tested on the Croatian Fish Dataset and achieved an accuracy of 94.375%.</p>
      </abstract>
      <kwd-group>
<kwd>cnn</kwd>
        <kwd>attention</kwd>
        <kwd>classification</kwd>
        <kwd>fine-grained visual classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Image processing is one of the fundamental categories of problems faced by AI, and image classification is its main subtype. There is great potential in approaching data classification through diverse methodologies. In the case of artificial neural networks, the most important tool is the convolutional network. However, it is not possible to identify a single architecture that can solve every classification problem; hence, various solutions are modeled that can focus on the classification of different features. An example is the use of transfer learning, i.e. neural models trained on huge databases, whose weights are reused and fine-tuned on new data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The capabilities of neural networks are also supported by additional techniques such as attention modules, as shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where an attention module was used in recurrent networks.
      </p>
      <p>
        The problem of fine-grained visual classification, on the other hand, involves the classification of images belonging to the same metaclass. It is thus a more challenging task, since the differences between classes are small and may concern only a small part of the image. For example, classifying the bird species occurring in a given area may require perceiving differences only in the shape and color of the feathers [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], while the classification of aircraft assumes recognizing differences in, for example, the shape of the wings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper, a fine-grained visual classification of the Croatian Fish Dataset is carried out. In this case, it is important not only to focus on individual small differences between species, but also to address the problems arising from the specifics of the dataset, i.e., poor visibility and noise.
      </p>
      <p>
        Therefore, due to the specificity of the problem, it is necessary to find and focus on the most informative regions, those that constitute the difference between classes [
        <xref ref-type="bibr" rid="ref6 ref7 ref8 ref9 ref10 ref11">6, 7, 8, 9, 10, 11</xref>
        ].
      </p>
      <p>For this reason, it became necessary to implement an attention mechanism to extract the
most important fragments.</p>
      <p>In this paper, a Fine-Grained Visual Classification (FGVC) model is presented. First, simple preprocessing coupled with image augmentation is performed. Second, the CNN architecture implementing the attention mechanism crucial to the analyzed problem is presented, together with skip connections, which support the extraction of the most relevant image elements. The main contributions of this paper are:
• a new network architecture based on the attention module and skip connections,
• a new CNN-based FGVC model,
• a scheme for further expansion with more efficient photo preprocessing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>Working on images taken underwater, we are forced to solve problems caused by the quality of the analyzed images: noise, discoloration and distortion. Water and particles dispersed in it (pollution, plankton, etc.) absorb, scatter and reflect sunlight. The extreme wavelengths of the visible light range are particularly strongly absorbed. The dominant colors of the images are therefore green and blue, and for this reason any differences due to the color of the fish are lost and almost invisible.</p>
      <p>
        In other works, the proposed architectures implemented generative models to improve the quality of the analyzed images [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Potentially, denoising and color-correction methods, also based on generative models, could be applied to the fish classification problem [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17 ref18">14, 15, 16, 17, 18</xref>
        ]. However, the common feature of such solutions is a significant increase in the complexity of the architecture. The presented architecture focuses on achieving maximum efficiency with minimized complexity, so the analyzed images were processed only by simple transformations. After preprocessing, the data was divided into a training and a test set at a ratio of 80% to 20%. Due to the initial imbalance in the number of elements in each class, this imbalance also carried over to the training and test splits. The data prepared in this way was then used for training and evaluation of the model.
      </p>
      <p>The proposed model, shown in Figure 3, uses images with a size of 64x64 px. This size was determined based on the minimum size of the photos in the database and represents a compromise between the size required to keep the photos informative and the excessive deformation caused by upscaling them. A higher input dimensionality did not improve the efficiency of the model, as the initial images were small, and only increased the computational complexity.</p>
      <p>The basic elements of the presented architecture are two convolutional blocks with the fixed composition shown in Figure 2. The first contains a convolution layer, followed by batch normalization, a GELU activation function and a pooling layer, whose output is subjected to spatial dropout with a probability of 20%. The second lacks only the pooling layer. Another important component, which follows each of the blocks except the last, is CBAM, the Convolutional Block Attention Module, which implements the attention mechanism in the model. The skip connection mechanism, which passes the information after the first block deep into the model, requires a convolution layer with a kernel size of 1x1; in this way, it is possible to change the dimension of the layer, which allows the outputs of the two blocks to be combined. At the very end, two fully connected layers are implemented. The first is preceded by a dropout with a probability of 20% and followed by the GELU function. The output of the second is passed through the LogSoftmax function, which returns the log-probability for each class.</p>
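      <p>To make the block composition concrete, below is a minimal PyTorch sketch of the two block types; the kernel size of 3 and the channel arguments are illustrative assumptions, since the paper states only that the filters are small.</p>
      <preformat>
import torch.nn as nn

class ConvBlock(nn.Module):
    # Block type 1: convolution, batch normalization, GELU, average pooling,
    # spatial dropout (20%). Block type 2 (pool=False) lacks only the pooling layer.
    def __init__(self, c_in, c_out, pool=True):
        super().__init__()
        layers = [
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # small filters (assumed 3x3)
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        ]
        if pool:
            layers.append(nn.AvgPool2d(kernel_size=2))  # average pooling, filter size 2
        layers.append(nn.Dropout2d(p=0.2))  # spatial dropout removes whole feature maps
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
      </preformat>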
      <sec id="sec-2-1">
        <title>2.1. Preprocessing images</title>
        <p>In the proposed architecture, the image preprocessing phase has been reduced to a minimum. As the dataset is loaded, each image is brightened by increasing the value of each pixel by 50%. For very blurry images, the brightening significantly separates the fish from the background, while for the more accurate ones, the fish’s features become more visible. An example application of this transformation is shown in Figure 1.</p>
        <p>The next step is the augmentation of the training data. It is necessary due to the small size of the initial set (see Table 1). Orientation is irrelevant in the analyzed dataset, so two transformations were applied without fear of losing informativeness: vertical reflection and horizontal reflection. Both transformations are applied with a probability of 90%, as sketched below.</p>
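        <p>In the paper the flipped copies enlarge the training set offline; an equivalent on-the-fly version of the described preprocessing can be sketched with torchvision as follows, where the resize step and its settings are assumptions.</p>
        <preformat>
import torch
from torchvision import transforms

def brighten(img):
    # increase every pixel value by 50%, clamping to the valid [0, 1] range
    return torch.clamp(img * 1.5, 0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),             # common input size (Section 2)
    transforms.ToTensor(),                   # pixel values scaled to [0, 1]
    transforms.Lambda(brighten),             # brightening applied to every image
    transforms.RandomHorizontalFlip(p=0.9),  # each flip applied with probability 90%
    transforms.RandomVerticalFlip(p=0.9),
])

test_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Lambda(brighten),             # test images are brightened but not flipped
])
        </preformat>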
        <p>Figure 1: Example of the brightening transformation. (a) Original photo. (b) Photo brightened by 50%.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. CNN model</title>
        <p>To analyze the images and classify them, convolutional neural networks (CNNs), which are
dedicated solutions to computer vision problems, were used.</p>
        <p>CNN’s operating principle is based on analyzing an image, represented as a matrix, through a series of layers and functions aimed at classifying or segmenting it. The most important of these is the convolution layer, which detects the basic features of an image through the process of convolution, that is, element-wise multiplication and summation between an image and a set of filters. The filters, also represented as matrices, are initialized randomly and corrected during the learning process. The problem of fine-grained visual classification requires a detailed analysis of the image due to the small differences between the classes, so the size of the filters remained small in the proposed solution.</p>
        <p>Pooling layers reduce the spatial dimensions of the input feature maps. The proposed model uses average pooling with a filter size of 2. Importantly, despite the reduction of dimensionality, the pooling layer does not cause a significant loss of informativeness of the data.</p>
        <p>This operation is described by the formula:
$$Y_{c,i,j} = \frac{1}{k^2} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{c,\,ki+m,\,kj+n}, \tag{1}$$
where $X \in \mathbb{R}^{C \times W \times H}$ is the input feature map, $C$ is the number of channels, $W$ is the width, $H$ is the height, and $k = 2$ is the pooling filter size.</p>
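        <p>As a small numeric illustration of Eq. (1), average pooling with filter size 2 replaces each non-overlapping 2x2 window with its mean:</p>
        <preformat>
import torch
import torch.nn.functional as F

# one image, one channel, 4x4 feature map
x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 7., 6., 8.],
                    [1., 1., 2., 2.],
                    [3., 3., 4., 4.]]]])
y = F.avg_pool2d(x, kernel_size=2)
print(y)  # tensor([[[[4., 5.], [2., 3.]]]]) - each value is the mean of a 2x2 window
        </preformat>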
        <p>The other elements present in each block of the presented model (see Figure 2) are batch normalization, the GELU activation function and dropout. Due to the small size of the analyzed dataset, it became necessary to prevent overfitting. For this reason, spatial dropout is used, which removes entire feature maps, while classical dropout disables individual neurons. The use of the GELU function introduces nonlinearity, which allows the model to learn more complex relationships; moreover, it is more efficient than ReLU and ELU [19]. Batch normalization, on the other hand, stabilizes and speeds up the learning process, reduces internal covariate shift, improves convergence and provides slight regularization [20]. This process can be represented by the formula:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \tag{2}$$
where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the current batch, $\epsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.</p>
        <p>The fully connected layer, also known as the dense layer, implements the classic linear approach, in which each neuron in a given layer is connected to each neuron in the previous layer and passes its activation to each neuron in the next layer. The proposed model uses two dense layers: the first applies dropout, in the form described above, before the linear transformation and passes its result to the GELU activation function. The last layer, with an output size corresponding to the number of classes, passes its result through the LogSoftmax function. The task of this function is to normalize the results of the model into a log-probability distribution. It is expressed by the formula:
$$\operatorname{LogSoftmax}(x_i) = \log \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}. \tag{3}$$</p>
        <p>The presented model also implements an attention mechanism that selectively focuses on parts of the analyzed image, assigning different weights to different areas. The implemented attention mechanism is based on an attention block (CBAM, the Convolutional Block Attention Module [21]) consisting of Channel Attention and Spatial Attention.</p>
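        <p>A compact sketch of such a CBAM block is given below; the reduction ratio of 8 and the spatial kernel size of 7 follow the defaults suggested in [21] and are not values reported in this paper.</p>
        <preformat>
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeezes spatial dimensions with average- and max-pooling,
    # then learns a per-channel weighting through a shared MLP.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))    # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))     # max-pooled channel descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    # Builds a 2-channel map (mean and max over channels)
    # and learns where in the image to attend.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    # Channel attention followed by spatial attention, as in Woo et al. [21].
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
        </preformat>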
        <p>Due to the depth of the presented model and the small amount of data in the analyzed set, the proposed model uses a skip-connection mechanism to combat the degradation problem. The most important feature of skip connections is the ability to transfer low-level features, captured in the initial layers of the network, to deeper layers, where they are mixed with high-level features. In the proposed model, skip connections are arranged according to the architecture of DenseNets [22], i.e. the result of the first block is passed to each subsequent layer; the results of subsequent layers, however, are not combined further. At each stage, the convolutional blocks analyze the current feature maps combined with the low-level features from the first convolutional block, as in the sketch below.</p>
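        <p>Combining the pieces, one possible realization of the whole network, reusing the ConvBlock and CBAM sketches above, is given below; all channel counts and the hidden width of the dense layer are assumptions, as the paper does not report them.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class FGVCNet(nn.Module):
    # Attention-based CNN with DenseNet-style skip connections from block 1.
    def __init__(self, num_classes=12):
        super().__init__()
        self.block1 = ConvBlock(3, 32)                 # 64x64 input, pooled to 32x32
        self.cbam1 = CBAM(32)                          # CBAM after each block except the last
        self.skip = nn.Conv2d(32, 32, kernel_size=1)   # 1x1 conv for dimension matching
        self.block2 = ConvBlock(64, 64)                # current maps + skip, pooled to 16x16
        self.cbam2 = CBAM(64)
        self.block3 = ConvBlock(96, 128, pool=False)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.2),                         # dropout before the first dense layer
            nn.Linear(128 * 16 * 16, 256),
            nn.GELU(),
            nn.Linear(256, num_classes),
            nn.LogSoftmax(dim=1),                      # log-probability per class
        )

    def forward(self, x):
        low = self.cbam1(self.block1(x))               # low-level features from block 1
        skip = self.skip(low)                          # re-projected for later fusion
        z = self.cbam2(self.block2(torch.cat([low, skip], dim=1)))
        # block-1 features are average-pooled so spatial sizes agree, then fused again
        z = self.block3(torch.cat([z, F.avg_pool2d(skip, 2)], dim=1))
        return self.classifier(z)
        </preformat>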
        <p>During training, the model tries to match the real data as closely as possible, and for this reason it is necessary to use a loss function, which measures how much the model’s predictions deviate from the actual data. Minimizing this function is therefore the main goal of training. The proposed solution uses the cross-entropy loss function, expressed by the formula:
$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i, \tag{4}$$
where $C$ is the number of classes, $y$ is the true distribution and $\hat{y}$ is the predicted distribution. Complementary to the task of minimizing the loss function is the selection of new values of the model parameters based on the value of this function. This role is assumed by the optimization algorithm, which in the proposed solution is ADAM (Adaptive Moment Estimation).</p>
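        <p>A minimal training sketch under the settings reported in Section 3 (200 epochs, batch size 64, Adam) could look as follows; the learning rate and the train_loader are assumptions, and NLLLoss is applied because the model already outputs log-probabilities, which together with LogSoftmax is equivalent to cross-entropy on raw logits.</p>
        <preformat>
import torch
import torch.nn as nn

model = FGVCNet(num_classes=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate assumed
criterion = nn.NLLLoss()  # negative log-likelihood on LogSoftmax outputs

for epoch in range(200):
    model.train()
    for images, labels in train_loader:      # assumed DataLoader with batch_size=64
        optimizer.zero_grad()
        log_probs = model(images)            # log-probabilities, Eq. (3)
        loss = criterion(log_probs, labels)  # cross-entropy loss, Eq. (4)
        loss.backward()
        optimizer.step()                     # ADAM updates the parameters
        </preformat>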
        <sec id="sec-2-2-1">
          <title>3. Experiments</title>
      <p>This section is devoted to analyzing the results obtained for the Croatian Fish Dataset. The results were obtained for two approaches: in the first, the images were neither preprocessed nor augmented, while in the second, the full preprocessing described in Section 2.1 was applied. The results obtained were compared with those of the authors of the dataset.</p>
      <sec id="sec-3-1">
        <title>3.1. Database</title>
        <p>The analyzed dataset was prepared by researchers from Fulda University of Applied Sciences, Friedrich Schiller University Jena and the University of Zadar [23]. The Croatian Fish Dataset contains 764 photos of 12 species of fish found in the Adriatic Sea in Croatia (see Table 1). The images are a subset of the main dataset, which consists of 1280x960 px and 1920x1080 px resolution videos. Each detected fish in the source set was marked with a bounding box and extracted as a separate photo. For this reason, the sizes of the images in the analyzed database vary from over 500 × 200 px down to 19 × 23 px. Also, because these are photos cut from a larger image, the position of the fish and their visibility, as well as the type of background and its lighting, vary.</p>
        <p>Table 1: Fish species included in the Croatian Fish Dataset: Chromis chromis, Coris julis female, Coris julis male, Diplodus annularis, Diplodus vulgaris, Oblada melanura, Serranus scriba, Spondyliosoma cantharus, Spicara maena, Symphodus melanocercus, Symphodus tinca, Sarpa salpa.</p>
        <p>Table 2: Maximum accuracy of the compared methods.
Jäger et al. (2015) [23]: 66.75%
Qiu et al. (2018) [24]: 83.92%
Sudhakara et al. (2022) [25]: 95.64%
Proposed architecture without preprocessing: 91.41%
Proposed architecture with preprocessing: 96.88%</p>
        <p>Figure 4: Accuracy and loss function during the training process before and after preprocessing. (a) Accuracy before and after preprocessing. (b) Loss before and after preprocessing.</p>
        <p>Figure 5: Evaluation metrics before and after preprocessing. (a) Accuracy. (b) Precision. (c) Recall. (d) F1-Score.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results and discussion</title>
        <p>The model was trained for 200 epochs with a batch size of 64. The graphs of the accuracy and of the loss function for both models are shown in Figures 4a and 4b. In the final stage of training, the graphs for the models trained on images with and without preprocessing converge to nearly identical values: accuracy reaches almost 100% with a loss value of about 0.1. However, the accuracy graph for the model with full preprocessing converges faster to the maximum value, and the difference in the loss value during training is also evident. Moreover, the loss for the case without preprocessing is more erratic even in the final stage, which translates into fluctuations in the accuracy values.</p>
        <p>After each epoch, the models were evaluated on the test set to check their accuracy, precision, recall and F1-Score. The corresponding graphs can be seen in Figures 5a, 5b, 5c and 5d. The model based on the preprocessed data reaches higher values for all metrics in the vast majority of epochs, the exception being around epochs 150 and 165. The regularity from the training stage is also observed in the evaluation stage: the model trained on the preprocessed data converges to the maximum values faster, reaching 80% accuracy in epoch 14, 90% in epoch 43, and ultimately its maximum value of 96.88% in epoch 174. Meanwhile, the model trained on the non-preprocessed set reaches an accuracy of 80% only in the 30th epoch and surpasses 90% only 4 times, with its maximum value of 91.41% achieved in the 151st epoch. The other metrics present similar patterns: the values for the preprocessed data converge more quickly to the maximum, reaching values equal to or higher than 90%, while the model for the data without preprocessing does not reach this level except in the same 4 epochs where accuracy also did.</p>
        <p>Significantly, during evaluation there was much more fluctuation in the value of each metric for both models. During the training stage, such behavior was evident only for the model based on the non-preprocessed data, as the training set in that case contained only 635 elements, while the more stable training of the second model had 1778 images at its disposal. Thus, it can be assumed that the stability of the results is a product of, among other things, the size of the data set. In the case of the test data, it was equal to 159 elements for both models, which, together with the imbalance of class sizes, leads to visible fluctuations close to the maximum value of each metric.</p>
        <p>A summary of the obtained results can be found in Table 2, which lists the maximum results obtained by the presented architecture, divided into the models based on data with and without preprocessing. Also included is the accuracy obtained by the authors of the analyzed database, equal to 66.75%, achieved by using a pre-trained CNN with an SVM for the classification part [23]. Another well-known solution in the literature is transfer learning [24], where the authors achieved an accuracy of 83.92%. A similar approach was shown in [25], where a deep-learning CNN was described; the accuracy reached was 95.64%. Compared to these works known from the literature, the proposed solution achieves a higher accuracy of 96.88%. This is due to the deep network extended with an attention module, which allowed the classifier to focus on the important features of the classified objects.</p>
      </sec>
        </sec>
        <sec id="sec-2-2-2">
          <title>4. Conclusion</title>
      <p>This paper proposes an attention-based CNN model supported by a simple preprocessing process. The architecture was tested on the Croatian Fish Dataset twice, once with the data subjected to preprocessing and once without. The results achieved are the highest among the compared solutions. In the future, emphasis should be placed on:
• achieving a better method of image preprocessing and more efficient data augmentation based on generative models,
• achieving a more efficient and accurate attention mechanism that would more precisely select the key elements of the image.</p>
        </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaszcz</surname>
          </string-name>
          ,
          <article-title>VGG16-based approach for side-scan sonar image analysis</article-title>
          ,
          <source>IVUS 2022: 27th International Conference on Information Technology</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaszcz</surname>
          </string-name>
          ,
          <article-title>Energy consumption prediction model for smart homes via decentralized federated learning with lstm</article-title>
          ,
          <source>IEEE Transactions on Consumer Electronics</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-J.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.06150.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tjiputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Fine-grained visual classification using self assessment classifier</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.10529.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahtu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kannala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blaschko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>Fine-grained visual classification of aircraft</article-title>
          ,
          <year>2013</year>
          . arXiv:1306.5151.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <article-title>Destruction and construction learning for fine-grained image recognition</article-title>
          ,
          <source>in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5152</fpage>
          -
          <lpage>5161</lpage>
          . doi:10.1109/CVPR.2019.00530.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>SnapMix: Semantically proportional mixing for augmenting fine-grained data</article-title>
          ,
          <year>2020</year>
          . arXiv:2012.04846.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Learning multi-attention convolutional neural network for fine-grained image recognition</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          , pp.
          <fpage>5219</fpage>
          -
          <lpage>5227</lpage>
          . doi:10.1109/ICCV.2017.557.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-J.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.06150.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tjiputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Fine-grained visual classification using self assessment classifier</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.10529.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaszcz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wawrzyniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zaniewicz</surname>
          </string-name>
          ,
          <article-title>Bilinear pooling with poisoning detection module for automatic side scan sonar data analysis</article-title>
          ,
          <source>IEEE Access</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Improving transfer learning and squeeze-and-excitation networks for small-scale fine-grained fish image classification</article-title>
          ,
          <source>IEEE Access 6</source>
          (
          <year>2018</year>
          )
          <fpage>78503</fpage>
          -
          <lpage>78512</lpage>
          . doi:10.1109/ACCESS.2018.2885055.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaszcz</surname>
          </string-name>
          ,
          <article-title>Heuristic feedback for generator support in generative adversarial</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>