Fine-grained visual classification of fish*
                                 Piotr Żerdziński1,∗,†
                                 1
                                  Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND


                                               Abstract
Fine-grained visual classification (FGVC) is the task of classifying images belonging to the same
metaclass. This problem is challenging due to the small differences between classes and the small
amount of available data. In this paper, a fine-grained classification model based on the attention mechanism is
                                               proposed. Attention allows the model to focus on small differences that determine class membership.
The model was tested on the Croatian Fish Dataset and achieved an accuracy of 96.88%.
                                               Keywords
                                               cnn, attention, classification, fine-grained visual classification


                                 1. Introduction
Image processing is a fundamental class of problems in AI, and image classification is one of its
main subtypes. Data classification can be approached through many different methodologies.
Among artificial neural networks, the most important tool is the convolutional network.
However, it is not possible to identify a single architecture that solves all classification
problems. Hence, various solutions are designed that focus on the classification of different
features. One example is transfer learning, i.e. reusing neural models trained on huge
databases: the pretrained weights are taken as a starting point and trained further on new
data [1]. The capabilities of neural networks are
also supported by additional techniques such as attention modules, as shown in [2], where the
attention module was used in recurrent networks.
   The problem of fine-grained visual classification, on the other hand, involves the classification of
images belonging to the same metaclass. It is a more challenging task, since the
differences between classes are small and may involve only a small part of the image. For example, the
classification of bird species occurring in a given area may require perceiving the difference
only in the shape and color of the feathers [3, 4], while the classification of aircraft requires
recognizing the difference in, for example, the shape of the wings [5]. In this paper, a fine-grained
visual classification of the Croatian Fish Dataset is performed. In this case, it is important
not only to focus on individual small differences between species, but also to address the
problem arising from the specifics of the dataset, i.e., poor visibility and noise.
   Therefore, due to the specificity of the problem, it is necessary to find and focus on the most
informative regions, i.e. those that constitute the difference between classes [6, 7, 8, 9, 10, 11].
For this reason, an attention mechanism was implemented to extract the
most important fragments.
   In this paper, a Fine-Grained Visual Classification (FGVC) model is presented. First, a simple
preprocessing step coupled with image augmentation is applied. Second, a CNN
architecture is presented that implements the attention mechanism, which is crucial in the
analyzed problem, as well as skip connections, which support the extraction of the most relevant image
elements. The main contributions of this paper are:


    • new network architecture based on the attention module and skip connections,
    • new CNN-based FGVC model,
    • scheme for further expansion with more efficient photo preprocessing.

* IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author.
† These authors contributed equally.
piotzer046@student.polsl.pl (P. Żerdziński)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Methodology
Working on images taken underwater, we are forced to solve problems caused by the quality of
the analyzed images: noise, discoloration and distortion. Water and particles dispersed in water
(pollution, plankton, etc.) absorb, scatter and reflect sunlight. The extreme wavelengths in the
visible light range are particularly strongly absorbed. The dominant colors of the images are
therefore green and blue, and for this reason, any differences due to the color of the fish are lost
and are almost invisible.
   In other works, the proposed architectures used generative models to improve the
quality of the analyzed images [12, 13]. Denoising and color correction methods, also based
on generative models, could potentially be applied to the fish classification problem as well [14, 15, 16, 17, 18]. However,
the common feature of both approaches is a significant increase in the complexity of the proposed
solution. The presented architecture focuses on achieving maximum efficiency with minimized
complexity, so the analyzed images were processed only by simple transformations. After
preprocessing, the data was divided into a training set and a test set at a ratio of 80% to 20%. Because
the number of images per class is imbalanced in the original dataset, the class sizes in the
resulting training and test sets are imbalanced as well. The data prepared in this way was then
used for training and evaluation of the model.
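The split described above can be sketched as follows. This is only a minimal illustration, assuming a simple shuffled (non-stratified) split; the function name `split_dataset`, the fixed seed, and the use of integer indices in place of images are assumptions, not details of the original implementation.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items and split them into a training list and a test list."""
    shuffled = list(items)
    # A fixed seed is used only to make this sketch reproducible (assumption).
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# 794 placeholder indices stand in for the images (total taken from Table 1).
train, test = split_dataset(range(794))
```

With 794 images, this yields 635 training and 159 test elements, which matches the set sizes reported in Section 3.2.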
   The proposed model, shown in Figure 3, uses images with a size of 64x64 px. This size
was determined based on the minimum size of the photos in the database and represents a
compromise between the size required to keep the photos informative and the excessive
deformation caused by upscaling small photos. A higher input resolution
did not improve the results of the model, as the initial images were small; it only
increased the computational complexity.
   The basic elements of the presented architecture are two types of convolutional blocks, with a fixed
composition visible in Figure 2. The first contains a convolution layer, followed by batch
normalization, a GELU activation function, and a pooling layer, whose output is subjected
to spatial dropout with a probability of 20%. The second differs only in that it lacks the
pooling layer. Another important component, which follows each of the blocks except the last,
is CBAM (Convolutional Block Attention Module), which adds the attention mechanism to
the model. The skip connection mechanism, which passes the information after the first block
deeper into the model, requires a convolution layer with a kernel size of 1x1. In this way, it is
possible to change the dimension of the layer, which allows the outputs of the two blocks to
be combined. At the very end, two fully connected layers are implemented. The first one is
preceded by dropout with a probability of 20% and followed by the GELU function.
The output of the second passes through the LogSoftmax function, which returns the log-probability for each
class.

    2.1. Preprocessing images
    In the proposed architecture, the image preprocessing phase has been reduced to a minimum. As
    the dataset is loaded, each image is brightened by increasing the value of each pixel by 50%.
    For very blurry images, the brightening allows the fish to be clearly separated from the
    background, while for the sharper ones, the fish’s features become more visible. An
    example of this transformation is shown in Figure 1.
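A minimal sketch of the brightening step is given below. It assumes 8-bit pixel values stored as nested lists for a single channel, and it assumes that values exceeding 255 are clipped; the clipping rule and the function name `brighten` are assumptions, since the text only states that each pixel value is increased by 50%.

```python
def brighten(image, factor=1.5, max_value=255):
    """Scale every pixel by `factor` and clip the result to the 8-bit range."""
    return [[min(max_value, round(pixel * factor)) for pixel in row] for row in image]

# A tiny 2x2 single-channel image used only for illustration.
gray = [[10, 100], [170, 200]]
bright = brighten(gray)
```

Dark pixels are scaled proportionally, while pixels that are already bright saturate at the maximum value.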
       The next step is the augmentation of the training data. It is necessary due to the
    small size of the initial set (see Table 1). The orientation of the fish in the analyzed dataset is
    irrelevant, so two transformations were applied without fear of losing information: vertical
    reflection and horizontal reflection. Each transformation is applied with a probability of 90%.
       Below is the initial photo and its modification:


                      (a) Original photo.                           (b) Photo brightened by 50%.

    Figure 1: Image before and after preprocessing.
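    The augmentation step from Section 2.1 can be sketched as follows. The sketch assumes that each flipped image is added to the training set as a new sample and that the two flips are drawn independently; the text does not state this explicitly, so the functions `augment` and `expected_size` are hypothetical illustrations.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right (reverse every row)."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror the image top to bottom (reverse the order of the rows)."""
    return image[::-1]

def augment(images, p=0.9, seed=0):
    """Return the originals plus flipped copies, each flip drawn with probability p."""
    rng = random.Random(seed)
    result = list(images)
    for image in images:
        if rng.random() < p:
            result.append(vertical_flip(image))
        if rng.random() < p:
            result.append(horizontal_flip(image))
    return result

def expected_size(n_images, p=0.9):
    """Expected number of images after augmentation with two independent flips."""
    return n_images * (1 + 2 * p)
```

Under these assumptions, 635 training images grow to about 1778 on average, which agrees with the training-set size reported in Section 3.2.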


    2.2. CNN model
    To analyze and classify the images, convolutional neural networks (CNNs), which are
    dedicated solutions to computer vision problems, were used.
        A CNN analyzes an image, represented as a matrix, through
    a series of layers and functions aimed at classifying or segmenting images. The most important of
    these is the convolution layer, which detects the basic features of an image through the process of
    convolution, that is, element-wise multiplication and summation between an image patch and each of a set of
    filters. The filters are also represented as matrices; they are initialized randomly and updated during the
    learning process. The problem of fine-grained visual classification requires a detailed analysis
    of the image due to the small differences between the classes, so the size of the filters was kept
    small in the proposed solution.
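    The convolution operation described above can be illustrated with a short sketch. It assumes a single-channel image, stride 1 and no padding; the 2x2 filter is a hypothetical example chosen for illustration, not a filter from the trained model.

```python
def conv2d(image, kernel):
    """Valid 2D convolution as used in CNNs (no kernel flip, stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Element-wise multiplication of the patch and the filter, then summation.
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
diagonal_filter = [[1, 0], [0, -1]]  # hypothetical 2x2 filter
features = conv2d(image, diagonal_filter)
```

Each output value summarizes one small patch of the input, which is why small filters are suited to capturing fine local details.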
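        Pooling layers reduce the spatial dimensions of the input feature maps. The proposed
    model uses average pooling with a filter size equal to 2. Importantly, the pooling layer, despite
    the reduction of dimensionality, does not cause a significant loss of information.
    Assuming a stride equal to the filter size, this operation is described by the formula:

    \[ P^{c}_{i,j} = \frac{1}{k^{2}} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} F^{c}_{ki+m,\,kj+n}, \qquad k = 2, \tag{1} \]

       where $F^{c}$ is channel $c$ of the input feature map with dimensions $C \times H \times W$, $C$ is the number of channels, $W$
    is the width, and $H$ is the height; $P^{c}_{i,j}$ denotes the pooled output at position $(i, j)$ and $k$ is the filter size.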
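    A minimal sketch of average pooling with a filter size of 2 on a single channel is shown below. It assumes a stride equal to the filter size, which is an assumption, as the stride is not stated in the text.

```python
def avg_pool2d(feature_map, k=2):
    """Average pooling with a k x k window and stride k on one channel."""
    out_h, out_w = len(feature_map) // k, len(feature_map[0]) // k
    return [
        [
            sum(feature_map[k * i + m][k * j + n] for m in range(k) for n in range(k)) / (k * k)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

channel = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled = avg_pool2d(channel)
```

Each spatial dimension is halved, so a 64x64 input channel would become 32x32 after one such layer.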
   The other elements present in each block forming the presented model (see Figure 2) are
batch normalization, the GELU activation function and dropout. Due to the small size of the
analyzed dataset, it became necessary to prevent overfitting. For this reason, spatial dropout is
used, which removes entire feature maps, whereas classical dropout disables individual neurons. The
GELU function introduces nonlinearity, which allows the model to learn
more complex relationships. Moreover, it has been reported to perform better than ReLU and ELU [19]. On the
other hand, batch normalization stabilizes and speeds up the learning process, reduces internal
covariate shift, improves convergence and provides slight regularization [20]. This process can be
represented by the formula:

\[ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \tag{2} \]

where $\mu_B$ and $\sigma_B^{2}$ are the mean and variance of the mini-batch, $\epsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.
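Batch normalization and the GELU activation can be sketched for a mini-batch of scalar activations as follows. The sketch assumes the exact (erf-based) form of GELU and training-time batch statistics; the values of `gamma`, `beta` and `eps` are illustrative defaults, not parameters reported in the paper.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    # Biased variance over the mini-batch, as used during training.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
activated = [gelu(v) for v in normalized]
```

In a convolutional network the same normalization is applied per channel, with the statistics computed over the batch and spatial dimensions.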


Figure 2: Two convolution blocks forming the presented model.


   The fully connected layer, also known as the dense layer, implements the classic linear
approach, in which each neuron in a given layer is connected to each neuron in the previous
layer and passes its activation to each neuron in the next layer.
The proposed model uses two dense layers. The first applies dropout in the
form described above before the linear transformation and passes its result to the
GELU activation function. The last layer, with an output size corresponding to the number of
classes, passes its result through the LogSoftmax function. The task of this function is to normalize
the outputs of the model into a distribution of log-probabilities. It is expressed by the
formula:

\[ \mathrm{LogSoftmax}(x_i) = \log \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}, \tag{3} \]

where $x_i$ is the output of the last layer for class $i$.
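The LogSoftmax function can be sketched in a numerically stable form as follows. Subtracting the maximum before exponentiation is a common implementation choice and an assumption here; it does not change the result. The input values are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: x_i - log(sum_j exp(x_j))."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    log_sum = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_sum for x in logits]

log_probs = log_softmax([2.0, 1.0, 0.1])
probs = [math.exp(lp) for lp in log_probs]
```

The outputs are non-positive log-probabilities; exponentiating them gives class probabilities that sum to one.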
    The presented model also implements an attention mechanism that selectively focuses on
parts of the analyzed image, assigning different weights to different areas. The implemented
attention mechanism is based on the CBAM attention block (Convolutional Block Attention
Module [21]), consisting of a Channel Attention module and a Spatial Attention module.
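The channel attention part of CBAM can be sketched as below, following the general formulation in [21]: average-pooled and max-pooled channel descriptors pass through a shared two-layer MLP, are summed, and are squashed by a sigmoid. The feature map and the MLP weights are hypothetical values for illustration, and the spatial attention part is omitted for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_mlp(vector, w1, w2):
    """Two-layer MLP with a ReLU hidden layer, shared by both pooled descriptors."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feature_map, w1, w2):
    """Reweight each channel by sigmoid(MLP(avg-pooled) + MLP(max-pooled))."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(map(max, ch)) for ch in feature_map]
    scores = [a + b for a, b in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    weights = [sigmoid(s) for s in scores]
    refined = [[[wgt * v for v in row] for row in ch] for ch, wgt in zip(feature_map, weights)]
    return refined, weights

# Two 2x2 channels and hypothetical MLP weights (2 channels -> 1 hidden unit -> 2 channels).
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
refined, weights = channel_attention(fmap, w1, w2)
```

Each channel receives a weight between 0 and 1, so informative channels are emphasized and the remaining ones are suppressed.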
    Due to the depth of the presented model and the small amount of data in the analyzed set,
the proposed model uses a skip-connection mechanism to help counter the degradation
problem. The most important feature of skip connections is the ability to
transfer low-level features captured at the initial layers of the network to deeper layers, where
they are mixed with high-level features. In the proposed model, skip connections are arranged
similarly to the architecture of DenseNets [22], i.e. the result of the first block is passed to
each subsequent layer. However, unlike in DenseNets, the results of the subsequent layers are not combined with one another. At
each stage, convolutional blocks analyze the current feature maps combined with low-level features from
the first convolutional block.
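The role of the 1x1 convolution in the skip connection can be sketched as follows. The sketch assumes that the projected low-level features are combined with the deeper features by element-wise addition and that both maps already share the same spatial size; the text does not specify the combination operator, so these are assumptions, and all values are hypothetical.

```python
def conv1x1(feature_map, weights):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels.

    feature_map: list of C_in channels, each an H x W list of lists.
    weights: C_out rows, each holding C_in coefficients.
    """
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            [sum(w * ch[i][j] for w, ch in zip(row, feature_map)) for j in range(width)]
            for i in range(height)
        ]
        for row in weights
    ]

def add_skip(deep, skip):
    """Element-wise sum of two feature maps with identical shapes."""
    return [
        [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(deep, skip)
    ]

low_level = [[[1, 2]], [[3, 4]]]  # 2 channels of size 1x2 from the first block
deep = [[[10, 20]]]               # 1 channel of size 1x2 from a deeper block
projected = conv1x1(low_level, [[1, 1]])  # project 2 channels down to 1
combined = add_skip(deep, projected)
```

The projection changes only the number of channels, which is what makes the two outputs compatible.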
    During training, the model tries to match the real data as closely as possible; for this reason,
it is necessary to use a loss function, which measures how much the model’s
predictions deviate from the actual data. Minimizing this function is therefore the main goal of
training. The proposed solution uses the cross-entropy loss function, expressed by the formula:

\[ L = -\sum_{i=1}^{N} t_i \log(p_i), \tag{4} \]

where $N$ is the number of classes, $t$ is the true distribution, and $p$ is the predicted distribution. Complementary to the
task of minimizing the loss function is the selection of new values of the model parameters, based
on the value of this function. This role is assumed by the optimization algorithm, which in the
proposed solution is Adam (Adaptive Moment Estimation).
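The loss and the optimizer step can be sketched for a single sample and a single scalar parameter as follows. The hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are the commonly used Adam defaults and are assumptions, since the paper does not report the values used; the distributions and the gradient are hypothetical.

```python
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross-entropy L = -sum_i t_i * log(p_i) for one sample."""
    # eps guards against log(0) for zero predicted probabilities.
    return -sum(t * math.log(p + eps) for t, p in zip(true_dist, pred_dist))

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

loss = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

With a one-hot true distribution the loss reduces to the negative log-probability of the correct class, and the first Adam step moves the parameter by approximately the learning rate in the direction opposite to the gradient.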


Figure 3: CNN architecture.
3. Experiments
This section is devoted to analyzing the results obtained for the Croatian Fish Dataset. The
results were obtained for two approaches: in the first, the images were not preprocessed at all
and were not augmented, while in the second, the full preprocessing described in Section
2.1 was implemented. The results obtained were compared to the results of the authors of the
dataset.

3.1. Database
The analyzed dataset was prepared by researchers from Fulda University of Applied Sciences,
Friedrich Schiller University Jena and the University of Zadar [23]. The Croatian Fish Dataset
contains 794 photos of 12 species of fish found in the Croatian part of the Adriatic Sea (see Table 1).
The images are a subset of the main dataset, which includes 1280x960 px and 1920x1080 px
resolution videos. Each detected fish in the output set was marked with a bounding box and
extracted as a separate photo. For this reason, the sizes of the images in the analyzed database
vary from over 500 × 200 px to 19 × 23 px. Also, because these photos were cut from larger
images, the position of the fish and their visibility, as well as the type of background and its
lighting, vary.

                            Species                        Number of images
                            Chromis chromis                106
                            Coris julis female             57
                            Coris julis male               57
                            Diplodus annularis             94
                            Diplodus vulgaris              111
                            Oblada melanura                57
                            Serranus scriba                56
                            Spondyliosoma cantharus        51
                            Spicara maena                  49
                            Symphodus melanocercus         105
                            Symphodus tinca                34
                            Sarpa salpa                    17
                            Total                          794
Table 1
Number of images per species.
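The class imbalance visible in Table 1 can be quantified with a short calculation. The counts below are copied from Table 1; the variable names are illustrative.

```python
counts = {
    "Chromis chromis": 106,
    "Coris julis female": 57,
    "Coris julis male": 57,
    "Diplodus annularis": 94,
    "Diplodus vulgaris": 111,
    "Oblada melanura": 57,
    "Serranus scriba": 56,
    "Spondyliosoma cantharus": 51,
    "Spicara maena": 49,
    "Symphodus melanocercus": 105,
    "Symphodus tinca": 34,
    "Sarpa salpa": 17,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)
# Ratio between the most and the least represented class.
imbalance_ratio = counts[largest] / counts[smallest]
```

The most numerous class (Diplodus vulgaris) has roughly 6.5 times as many images as the least numerous one (Sarpa salpa).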


Table 2
The results of the proposed method compared to other algorithms.
                     Method                                            Accuracy (%)
                     Jäger et al. (2015) [23]                             66.75
                     Qiu et al. (2018) [24]                               83.92
                     Sudhakara et al. (2022) [25]                         95.64
                     Proposed architecture without preprocessing          91.41
                     Proposed architecture with preprocessing             96.88
    (a) Accuracy before and after preprocessing                 (b) Loss before and after preprocessing

Figure 4: Accuracy and loss function during training process before and after preprocessing.


      (a) Accuracy before and after preprocessing.              (b) Precision before and after preprocessing.


         (c) Recall before and after preprocessing.               (d) F1-Score before and after preprocessing.

Figure 5: Performance metrics before and after preprocessing.
3.2. Results and discussion
The model was trained for 200 epochs with a batch size of 64. The graphs of the
accuracy and the loss function value for both models are shown in Figures 4a and 4b. In the final
stage of training, the graphs for the models trained on images with and without preprocessing
converge to nearly identical values: accuracy reaches nearly 100% with a loss function value of about
0.1. However, the accuracy graph for the model with full preprocessing converges faster to the
maximum value, and the difference in the loss function value during training is also evident.
Moreover, the value of the loss function for the case without preprocessing is more erratic even in
the final stage, which translates into fluctuations in accuracy values.
   After each epoch, the models were evaluated on the test set to check their accuracy, precision,
recall, and F1-Score. The corresponding graphs can be seen in Figures 5a, 5b, 5c, and 5d. The model
based on the preprocessed data reaches higher values for all metrics in the vast majority of
epochs, the exceptions being around epochs 150 and 165. The pattern seen in the training stage
is also observed in the evaluation stage: the model trained on the preprocessed data converges to
its maximum values faster, reaching an accuracy of 80% in epoch 14 and 90% in epoch 43, and ultimately reaching its
maximum value of 96.88% in epoch 174. Meanwhile, the model trained on the set without
preprocessing reaches an accuracy of 80% in the 30th epoch and surpasses 90% only 4
times, with its maximum value of 91.41% achieved in the 151st epoch. The other
metrics present similar patterns: the values for the preprocessed data converge more quickly to
their maxima, reaching values equal to or higher than 90%, while the model for data
without preprocessing does not reach this value except for the same 4 epochs where accuracy
also reached that level.
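   The four evaluation metrics can be computed from the true and predicted labels as sketched below. The sketch assumes macro averaging over classes for precision, recall and F1-Score; the paper does not state which averaging was used, so this is an assumption, and the labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy and macro-averaged precision, recall and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical labels for a two-class toy problem.
accuracy, precision, recall, f1 = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```

With macro averaging every class contributes equally, so errors on small classes such as Sarpa salpa affect the score as much as errors on large ones.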
   Notably, the metric values on the test set fluctuated much more than during training for both models.
During the training stage, such behavior was evident only for the model based on data without
preprocessing, whose training set contained only 635 elements, while the more stable
training stage for the second model had 1778 images at its disposal. Thus, it can be assumed
that the stability of the results obtained depends, among other things, on the size of the
set used. The test set contained 159 elements for both models, which, together
with the imbalance of class sizes, leads to visible fluctuations close to the maximum value for
each metric.
   A summary of the obtained results can be found in Table 2, which lists the maximum results
obtained by the presented architecture for the models based on data with and without
preprocessing. Also included is the accuracy obtained by the authors of the
analyzed database, which equals 66.75% and was obtained by using a pre-trained CNN with
an SVM for the classification part [23]. Another solution known from the literature is transfer
learning [24], with which the authors achieved an accuracy of 83.92%. A similar approach was shown
in [25], where a deep-learning CNN was described; the reported accuracy was 95.64%.
Compared to these works, the proposed solution achieves a higher
accuracy of 96.88%. This is likely due to the deep network, which was extended with
an attention module. This solution allowed the classifier to focus on the important features of
the classified objects.
4. Conclusion
This paper proposes an attention-based CNN model supported by a simple preprocessing step. The
architecture was tested on the Croatian Fish Dataset twice, once with
preprocessing and once without. The results achieved are the highest among the compared methods (see Table 2). In the
future, emphasis should be placed on:
    • developing a better method of image preprocessing and more efficient data augmentation
      based on generative models,
    • developing a more efficient attention mechanism that would more accurately select
      key elements of the image.

References
 [1] A. Jaszcz, Vgg16-based approach for side-scan sonar image analysis, IVUS 2022: 27th
     International Conference on Information Technology (2022).
 [2] D. Połap, G. Srivastava, A. Jaszcz, Energy consumption prediction model for smart homes
     via decentralized federated learning with lstm, IEEE Transactions on Consumer Electronics
     (2023).
 [3] H. Zheng, J. Fu, Z.-J. Zha, J. Luo, Looking for the devil in the details: Learning trilinear
     attention sampling network for fine-grained image recognition, 2019. arXiv:1903.06150.
 [4] T. Do, H. Tran, E. Tjiputra, Q. D. Tran, A. Nguyen, Fine-grained visual classification
     using self assessment classifier, 2022. arXiv:2205.10529.
 [5] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, A. Vedaldi, Fine-grained visual classification of
     aircraft, 2013. arXiv:1306.5151.
 [6] Y. Chen, Y. Bai, W. Zhang, T. Mei, Destruction and construction learning for fine-
     grained image recognition, in: 2019 IEEE/CVF Conference on Computer Vision and
     Pattern Recognition (CVPR), 2019, pp. 5152–5161. doi:10.1109/CVPR.2019.00530.
 [7] S. Huang, X. Wang, D. Tao, Snapmix: Semantically proportional mixing for augmenting
     fine-grained data, 2020. arXiv:2012.04846.
 [8] H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network
     for fine-grained image recognition, in: 2017 IEEE International Conference on Computer
     Vision (ICCV), 2017, pp. 5219–5227. doi:10.1109/ICCV.2017.557.
 [9] H. Zheng, J. Fu, Z.-J. Zha, J. Luo, Looking for the devil in the details: Learning trilinear
     attention sampling network for fine-grained image recognition, 2019. arXiv:1903.06150.
[10] T. Do, H. Tran, E. Tjiputra, Q. D. Tran, A. Nguyen, Fine-grained visual classification
     using self assessment classifier, 2022. arXiv:2205.10529.
[11] D. Połap, A. Jaszcz, N. Wawrzyniak, G. Zaniewicz, Bilinear pooling with poisoning
     detection module for automatic side scan sonar data analysis, IEEE Access (2023).
[12] C. Qiu, S. Zhang, C. Wang, Z. Yu, H. Zheng, B. Zheng, Improving transfer learning and
     squeeze-and-excitation networks for small-scale fine-grained fish image classification,
     IEEE Access 6 (2018) 78503–78512. doi:10.1109/ACCESS.2018.2885055.
[13] D. Połap, A. Jaszcz, Heuristic feedback for generator support in generative adversarial
     network, Proceedings of the 16th International Conference on Agents and Artificial
     Intelligence 3 (2024) 863–870.
[14] C.-H. Yeh, C.-H. Huang, C.-H. Lin, Deep learning underwater image color correction and
     contrast enhancement based on hue preservation, in: 2019 IEEE Underwater Technology
     (UT), 2019, pp. 1–6. doi:10.1109/UT.2019.8734469.
[15] Y. Wang, J. Zhang, Y. Cao, Z. Wang, A deep cnn method for underwater image
     enhancement, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp.
     1382–1386. doi:10.1109/ICIP.2017.8296508.
[16] J. H. Park, D. Han, H. Ko, Adaptive weighted multi-discriminator cyclegan for underwater
     image enhancement, Journal of Marine Science and Engineering 7 (2019) 200.
     doi:10.3390/jmse7070200.
[17] X. Chen, P. Zhang, L. Quan, C. Yi, C. Lu, Underwater image enhancement based on
     deep learning and image formation model, 2021. arXiv:2101.00991.
[18] M. J. Islam, Y. Xia, J. Sattar, Fast underwater image enhancement for improved visual
     perception, 2020. arXiv:1903.09766.
[19] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), 2023. arXiv:1606.08415.
[20] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing
     internal covariate shift, 2015. arXiv:1502.03167.
[21] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, 2018.
     arXiv:1807.06521.
[22] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional
     networks, 2018. arXiv:1608.06993.
[23] J. Jäger, M. Simon, J. Denzler, V. Wolff, K. Fricke-Neuderth, C. Kruschel, Croatian fish
     dataset: Fine-grained classification of fish species in their natural habitat, 2015, pp. 6.1–6.7.
     doi:10.5244/C.29.MVAB.6.
[24] C. Qiu, S. Zhang, C. Wang, Z. Yu, H. Zheng, B. Zheng, Improving transfer learning and
     squeeze-and-excitation networks for small-scale fine-grained fish image classification,
     IEEE Access 6 (2018) 78503–78512. doi:10.1109/ACCESS.2018.2885055.
[25] M. Sudhakara, M. J. Meena, K. R. Madhavi, P. Anjaiah, L. P. K, Fish classification using
     deep learning on small scale and low-quality images, International Journal of Intelligent
     Systems and Applications in Engineering 10 (2022) 279 –.
     URL: https://ijisae.org/index.php/IJISAE/article/view/2292.