=Paper=
{{Paper
|id=Vol-3885/paper2
|storemode=property
|title=Fine-Grained Visual Classification of Fish
|pdfUrl=https://ceur-ws.org/Vol-3885/paper2.pdf
|volume=Vol-3885
|authors=Piotr Żerdziński
|dblpUrl=https://dblp.org/rec/conf/ivus/Zerdzinski24
}}
==Fine-Grained Visual Classification of Fish==
Piotr Żerdziński
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland
Abstract
Fine-grained visual classification (FGVC) is the task of classifying images belonging to the same
metaclass. The problem is challenging due to the small differences between classes and the small
amount of available data. In this paper, a fine-grained classification model based on the attention
mechanism is proposed. Attention allows the model to focus on the small differences that determine
class membership. The model was tested on the Croatian Fish Dataset and achieved an accuracy of 94.375%.
Keywords
cnn, attention, classification, fine-grained visual classification
1. Introduction
Image processing is one of the fundamental problem areas in AI, and image classification is its main
subtype. There is great potential in diverse approaches to data classification through different
methodologies. In the case of artificial neural networks, the most important tool is the convolutional
network. However, no single architecture can solve every classification problem, so various solutions
are modeled that focus on classifying different features. One example is transfer learning, i.e.,
reusing neural models trained on huge databases: their weights serve as a starting point and are then
trained further on new data [1]. The capabilities of neural networks are also supported by additional
techniques such as attention modules, as shown in [2], where an attention module was used in
recurrent networks.
The problem of fine-grained visual classification, on the other hand, concerns the classification of
images belonging to the same metaclass. It is thus a more challenging task, since the differences
between classes are small and may involve only a small part of the image. For example, classifying
bird species occurring in a given area may require perceiving differences only in the shape and color
of the feathers [3, 4], while classifying aircraft requires recognizing differences in, for example,
the shape of the wings [5]. In this paper, a fine-grained visual classification of the Croatian Fish
Dataset is performed. Here it is important not only to focus on the individual small differences
between species but also to address problems arising from the specifics of the dataset, i.e., poor
visibility and noise. Due to the specificity of the problem, it is therefore necessary to find and
focus on the most informative regions, those that constitute the difference between classes
[6, 7, 8, 9, 10, 11]. For this reason, an attention mechanism was implemented to extract the most
important fragments.
In this paper, a Fine-Grained Visual Classification (FGVC) model is presented. First, simple
preprocessing coupled with image augmentation is performed. Second, the CNN architecture is
presented, implementing the attention mechanism crucial to the analyzed problem, as well as skip
connections, which support the extraction of the most relevant image elements. The main
contributions of this paper are:
* IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author: piotzer046@student.polsl.pl (P. Żerdziński)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
• a new network architecture based on an attention module and skip connections,
• a new CNN-based FGVC model,
• a scheme for further expansion with more efficient photo preprocessing.
2. Methodology
Working on images taken underwater, we are forced to solve problems caused by the quality of
the analyzed images: noise, discoloration and distortion. Water and particles dispersed in water
(pollution, plankton, etc.) absorb, scatter and reflect sunlight. The extreme wavelengths in the
visible light range are particularly strongly absorbed. The dominant colors of the images are
therefore green and blue, and for this reason, any differences due to the color of the fish are lost
and are almost invisible.
In other works, the proposed architectures implemented generative models to improve the quality of
the analyzed images [12, 13]. Potentially, denoising and color-correction methods, also based on
generative models, could be applied to the fish classification problem [14, 15, 16, 17, 18]. However,
a common feature of both approaches is a significant increase in the complexity of the proposed
solution. The presented architecture focuses on achieving maximum efficiency with minimized
complexity, so the analyzed images were processed only with simple transformations. After
preprocessing, the data was divided into two sets, training and test, at a ratio of 80% to 20%. Due
to the initial imbalance in the number of elements in each class, an imbalance also remained after
dividing the data into training and test sets. The data prepared in this way was then used for
training and evaluating the model.
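As a sketch of this step (assuming PyTorch, with dummy tensors standing in for the real images), the 80/20 split can be performed with `random_split`:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Dummy tensors standing in for the 794 preprocessed 64x64 fish images.
images = torch.randn(794, 3, 64, 64)
labels = torch.randint(0, 12, (794,))
dataset = TensorDataset(images, labels)

# 80% / 20% split; the split is random, not stratified, so the class
# imbalance of the source data carries over into both subsets.
n_train = int(0.8 * len(dataset))          # 635 training images
train_set, test_set = random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(0),  # reproducible split
)
```

With 794 images this yields 635 training and 159 test elements, matching the set sizes reported in Section 3.2.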
The proposed model, shown in Figure 3, uses images with a size of 64×64 px. This size was
determined based on the minimum size of the photos in the database and represents a compromise
between the size required to make the photos informative and the excessive deformation caused by
enlarging them. Higher image dimensionality did not improve the efficiency of the model, as the
initial images were small, but only increased the computational complexity.
The basic elements of the presented architecture are two convolutional blocks with a fixed
composition, visible in Figure 2. The first contains a convolution layer, followed by batch
normalization, a GELU activation function, and a pooling layer, whose output is subjected to
spatial dropout with a probability equal to 20%. The second differs only in lacking the pooling
layer. Another important component, which follows each of the blocks except the last, is CBAM
(Convolutional Block Attention Module), which adds the attention mechanism to the model. The
skip-connection mechanism, which passes information from after the first block deep into the model,
requires a convolution layer with a kernel size of 1×1; in this way, it is possible to change the
dimension of the layer, which allows the outputs of the two blocks to be combined. At the very end,
two fully connected layers are implemented. The first is preceded by a dropout with a probability
equal to 20% and followed by the GELU function. The output of the second is passed through the
LogSoftmax function, which returns the log-probability for each class.
2.1. Preprocessing images
In the proposed architecture, the image preprocessing phase has been reduced to a minimum. As the
dataset is loaded, each image is brightened by increasing the value of each pixel by 50%. For very
blurry images, the brightening allows the fish to be clearly separated from the background, while
for the more accurate ones, the fish's features become more visible. An example application of this
transformation is shown in Figure 1.
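The brightening step can be sketched as follows (a minimal NumPy version; the paper does not specify the exact implementation, so the clipping to the 8-bit range is an assumption):

```python
import numpy as np

def brighten(pixels: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Increase each pixel value by 50% (factor 1.5), clipping to the 0-255 range."""
    out = pixels.astype(np.float32) * factor
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy 2x2 grayscale patch: dark pixels are lifted, bright ones saturate at 255.
dark = np.array([[40, 120], [200, 255]], dtype=np.uint8)
bright = brighten(dark)  # -> [[60, 180], [255, 255]]
```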
The next step is the process of augmenting the training data. It is necessary due to the
small size of the initial set (see Table 1). The orientation in the case of the analyzed dataset is
irrelevant, so two transformations were applied without fear of losing informativeness: vertical
reflection and horizontal reflection. Both transformations are applied with a probability of 90%.
Figure 1: Image before and after preprocessing: (a) original photo, (b) photo brightened by 50%.
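The two reflections described above can be sketched with NumPy (the paper applies each with probability 90%; the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def augment(image: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Apply vertical and horizontal reflection, each with probability p."""
    if rng.random() < p:
        image = np.flipud(image)   # vertical reflection
    if rng.random() < p:
        image = np.fliplr(image)   # horizontal reflection
    return image

original = np.arange(16).reshape(4, 4)
flipped = augment(original)        # same shape and pixel values, reordered
```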
2.2. CNN model
To analyze the images and classify them, convolutional neural networks (CNNs), which are
dedicated solutions to computer vision problems, were used.
CNN’s operating principle is based on analyzing an image, represented as a matrix, through
a series of layers and functions aimed at classifying or segmenting images. The most important of
these is the convolution layer, which detects the basic features of an image through a process of
convolution, that is, element-wise multiplication and summation between an image and a set of
filters. The filters are also represented as matrices, are initialized randomly, and are corrected in the
learning process. The problem of fine-grained visual classification requires a detailed analysis
of the image due to the small differences between the classes, so the size of the filters remained
small in the proposed solution.
Pooling layers reduce the size of spatial dimensions of input feature maps. The proposed
model uses average pooling with a filter size equal to 2. Importantly, the pooling layer, despite
the reduction of dimensionality, does not cause a significant loss of informativeness of the data.
This operation, for a filter of size 2, is described by the formula:

$$Y_{c,i,j} = \frac{1}{4}\sum_{m=0}^{1}\sum_{n=0}^{1} F_{c,\,2i+m,\,2j+n}, \qquad (1)$$

where $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, $C$ is the number of
channels, $W$ is the width, and $H$ is the height.
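The average-pooling operation can be checked with a short PyTorch snippet:

```python
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=2)   # 2x2 windows, stride 2

# One 2x2 map: the single window averages to (1 + 3 + 5 + 7) / 4 = 4.
x = torch.tensor([[[[1., 3.],
                    [5., 7.]]]])     # shape (N=1, C=1, H=2, W=2)
y = pool(x)

# On a realistic feature map the spatial dimensions are halved, channels untouched.
z = pool(torch.randn(1, 32, 64, 64))  # -> shape (1, 32, 32, 32)
```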
The other elements present in each block forming the presented model (see Figure 2) are
batch normalization, GELU activation function and dropout. Due to the small size of the
analyzed dataset, it became necessary to prevent overfitting. For this reason, spatial dropout is
used, which removes entire feature maps, while classical dropout disables neurons. The use of
the GELU function is aimed at introducing nonlinearity, which leads to the model learning
more complex relationships. Moreover, it is more efficient than ReLU and ELU [19]. Batch
normalization, on the other hand, stabilizes and speeds up the learning process, reduces internal
covariate shift, improves convergence, and provides slight regularization [20]. This process can be
represented by the formula:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \qquad (2)$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the batch, $\epsilon$ is a small
constant for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.
Figure 2: Two convolution blocks forming the presented model.
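A sketch of one block from Figure 2 in PyTorch (the kernel size of 3 and the channel counts are illustrative assumptions; the paper only fixes the layer order, the GELU activation, the dropout probability of 20%, and the optional pooling):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> batch norm -> GELU -> (optional) average pooling ->
    spatial dropout (p = 0.2). The second block type omits the pooling layer."""

    def __init__(self, in_ch: int, out_ch: int, pool: bool = True):
        super().__init__()
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        ]
        if pool:
            layers.append(nn.AvgPool2d(2))
        layers.append(nn.Dropout2d(0.2))   # spatial dropout: drops whole feature maps
        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

out = ConvBlock(3, 32)(torch.randn(1, 3, 64, 64))  # pooling halves 64x64 to 32x32
```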
The fully connected layer, also known as the dense layer, implements the classic linear approach,
in which each neuron in a given layer is connected to each neuron in the previous layer and passes
its activation to each neuron in the next layer. The proposed model uses two dense layers, the
first of which applies dropout, in the form described above, before its linear transformation and
passes its result to the GELU activation function. The last layer, with an output size
corresponding to the number of classes, passes its result through the LogSoftmax function. The
task of this function is to normalize the results of the model into a distribution of
log-probabilities. It is expressed by the formula:

$$\mathrm{LogSoftmax}(x_i) = \log\left(\frac{e^{x_i}}{\sum_{j} e^{x_j}}\right) = x_i - \log\sum_{j} e^{x_j}. \qquad (3)$$
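The classifier head described above can be sketched as follows (the hidden sizes 128 and 64 are assumptions, as the paper does not list them; only the 12 outputs, one per fish species, come from the dataset):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Dropout(0.2),        # dropout precedes the first dense layer
    nn.Linear(128, 64),
    nn.GELU(),
    nn.Linear(64, 12),      # output size = number of classes
    nn.LogSoftmax(dim=1),   # log-probabilities over the classes
)

head.eval()                              # disable dropout for this check
log_probs = head(torch.randn(4, 128))    # batch of 4 feature vectors
probs = log_probs.exp()                  # exponentiating recovers a valid distribution
```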
The presented model also implements an attention mechanism that selectively focuses on parts of the
analyzed image, assigning different weights to different areas. The implemented mechanism is based
on the attention block CBAM (Convolutional Block Attention Module [21]), consisting of Channel
Attention and Spatial Attention.
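A compact sketch of CBAM following [21] (the reduction ratio of 8 and the 7×7 spatial kernel are common defaults from that paper, not values stated here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared MLP applied to both global-average- and global-max-pooled features.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))           # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))            # global max pooling
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale                              # reweight channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Stack channel-wise mean and max maps, convolve to a single attention map.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))   # reweight spatial positions

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Woo et al. [21]."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

attended = CBAM(16)(torch.randn(2, 16, 8, 8))  # same shape, reweighted features
```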
Due to the depth of the presented model and the small amount of data in the analyzed set, the
proposed model uses a skip-connection mechanism to help fight the degradation problem. The most
important feature of skip connections is the ability to transfer low-level features, captured at
the initial layers of the network, to deeper layers, where they are mixed with high-level features.
In the proposed model, skip connections are arranged according to the architecture of DenseNets
[22], i.e., the result of the first block is passed to each subsequent layer. However, the results
of subsequent layers are not combined further. At each stage, the convolutional blocks analyze the
current feature maps combined with the low-level features from the first convolutional block.
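The role of the 1×1 convolution in the skip connection can be sketched as follows (channel counts are illustrative, and the snippet assumes the skipped features have already been brought to the same spatial size as the deeper maps):

```python
import torch
import torch.nn as nn

low = torch.randn(1, 16, 32, 32)    # low-level features from the first block
deep = torch.randn(1, 64, 32, 32)   # deeper feature maps of the same spatial size

# A 1x1 convolution adjusts the channel dimension of the skipped features
# so they can be concatenated with the deeper maps, DenseNet-style [22].
project = nn.Conv2d(16, 64, kernel_size=1)
merged = torch.cat([deep, project(low)], dim=1)   # channels add up: 64 + 64 = 128
```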
During training, the model tries to match the real data as closely as possible; for this reason, it
is necessary to use a loss function, which quantifies how much the model's predictions deviate from
the actual data. Minimizing this function is therefore the main goal of training. The proposed
solution uses the cross-entropy loss function, expressed by the formula:

$$L = -\sum_{i=1}^{N} t_i \log(p_i), \qquad (4)$$

where $N$ is the number of classes, $t$ is the true distribution, and $p$ is the predicted
distribution. Complementary to minimizing the loss function is the selection of new values of the
model parameters based on the value of this function. This role is assumed by the optimization
algorithm, which in the proposed solution is ADAM (Adaptive Moment Estimation).
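A minimal training-step sketch: a toy linear classifier stands in for the full CNN of Figure 3, and since the model emits log-probabilities, `NLLLoss` is the matching form of the cross-entropy loss of Eq. (4). The learning rate and step count are illustrative:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full architecture (real training: 200 epochs, batch size 64).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 12), nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()                 # cross-entropy on log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(64, 3, 64, 64)      # one dummy batch
targets = torch.randint(0, 12, (64,))

start = criterion(model(images), targets).item()
for _ in range(20):                      # a few optimization steps on the same batch
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
end = criterion(model(images), targets).item()  # loss decreases
```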
Figure 3: CNN architecture.
3. Experiments
This section is devoted to analyzing the results obtained for the Croatian Fish Dataset. The
results were obtained for two approaches: in the first, the images were not preprocessed at all
and were not augmented, while in the second, the full preprocessing described in Section
2.1 was implemented. The results obtained were compared to the results of the authors of the
dataset.
3.1. Database
The analyzed dataset was prepared by researchers from Fulda University of Applied Sciences,
Friedrich Schiller University Jena and the University of Zadar [23]. The Croatian Fish Dataset
contains 764 photos of 12 species of fish found in the Adriatic Sea in Croatia (see Table 1).
The images are a subset of the main dataset, which includes 1280x960 px and 1920x1080 px
resolution videos. Each detected fish in the output set was marked with a bounding box and
extracted as a separate photo. For this reason, the sizes of the images in the analyzed database
vary from over 500 × 200 px to 19 × 23 px. Also because these are photos cut from a larger
image the position of the fish and their visibility, as well as the type of background and its
lighting, varies.
Table 1
Number of images per species.

Species                   Number of images
Chromis chromis           106
Coris julis female        57
Coris julis male          57
Diplodus annularis        94
Diplodus vulgaris         111
Oblada melanura           57
Serranus scriba           56
Spondyliosoma cantharus   51
Spicara maena             49
Symphodus melanocercus    105
Symphodus tinca           34
Sarpa salpa               17
Total                     794
Table 2
The results of the proposed method compared to other algorithms.

Method                                         Accuracy (%)
Jäger et al. (2015) [23]                       66.75
Qiu et al. (2018) [24]                         83.92
Sudhakara et al. (2022) [25]                   95.64
Proposed architecture without preprocessing    91.41
Proposed architecture with preprocessing       96.88
Figure 4: Accuracy and loss function during the training process, before and after preprocessing: (a) accuracy, (b) loss.
Figure 5: Performance metrics before and after preprocessing: (a) accuracy, (b) precision, (c) recall, (d) F1-score.
3.2. Results and discussion
The model was trained for 200 epochs with a batch size of 64. The graphs of the accuracy and
loss-function values for both models are shown in Figures 4a and 4b. In the final stage of
training, the graphs for the models trained on images with and without preprocessing converge to
nearly identical values: accuracy reaches nearly 100% with a loss-function value of about 0.1.
However, the accuracy graph for the model with full preprocessing converges faster to the maximum
value, and the difference in the loss-function value during training is also evident. Moreover, the
loss-function value for the case without preprocessing is more chaotic even in the final stage,
which translates into fluctuations in the accuracy values.
After each epoch, the models were evaluated on a test set to check their accuracy, precision,
recall, and F1-score. The corresponding graphs can be seen in Figures 5a, 5b, 5c, and 5d. The model
based on the preprocessed data reaches higher values for all metrics in the vast majority of
epochs, the exceptions being around epochs 150 and 165. The regularity observed in the training
stage recurs in the evaluation stage: the model trained on the preprocessed data converges to
maximum values faster, reaching 80% in epoch 14 and 90% in epoch 43, and ultimately reaching its
maximum value of 96.88% in epoch 174. Meanwhile, the model trained on the non-preprocessed set
reaches an accuracy of 80% only in the 30th epoch and surpasses 90% only 4 times, with its maximum
value of 91.41% achieved in the 151st epoch. The other metrics present similar patterns: the values
for the preprocessed data converge more quickly to their maxima, reaching values equal to or higher
than 90%, while the model for data without preprocessing does not reach this level except in the
same 4 epochs in which accuracy also did.
Significantly, there was much more fluctuation in the value of each metric for both models during
evaluation than during training. During the training stage, such behavior was evident only for the
model based on non-preprocessed data, whose training set contained only 635 elements, while the
more stable training of the second model had 1778 images at its disposal after augmentation. Thus,
it can be assumed that the stability of the results is a product of, among other things, the size
of the data set. The test set contained 159 elements for both models, which, together with the
imbalance of class sizes, leads to visible fluctuations close to the maximum value of each metric.
A summary of the obtained results can be found in Table 2, which shows the maximum results achieved
by the presented architecture, divided into models based on data with and without preprocessing.
Also included is the accuracy obtained by the authors of the analyzed database, 66.75%, achieved
using a pre-trained CNN with an SVM for the classification part [23]. Another well-known solution
in the literature is transfer learning [24], where the authors achieved an accuracy of 83.92%. A
similar approach was shown in [25], where a deep-learning CNN was described, reaching an accuracy
of 95.64%. Compared to these works known from the literature, the proposed solution achieves a
higher accuracy of 96.88%. This is due to the deep network extended with an attention module, which
allowed the classifier to focus on the important features of the classified objects.
4. Conclusion
This paper proposes an attention-based CNN model supported by a simple preprocessing process. The
architecture was tested on the Croatian Fish Dataset twice, once with the data subjected to
preprocessing and once without. The results achieved are the highest reported for this dataset. In
the future, emphasis should be placed on:
• a better method of image preprocessing and more efficient data augmentation based on generative
models,
• a more efficient and accurate attention mechanism that would more precisely select the key
elements of the image.
References
[1] A. Jaszcz, Vgg16-based approach for side-scan sonar image analysis, IVUS 2022: 27th
International Conference on Information Technology (2022).
[2] D. Połap, G. Srivastava, A. Jaszcz, Energy consumption prediction model for smart homes
via decentralized federated learning with lstm, IEEE Transactions on Consumer Electronics
(2023).
[3] H. Zheng, J. Fu, Z.-J. Zha, J. Luo, Looking for the devil in the details: Learning trilinear
attention sampling network for fine-grained image recognition, 2019. arXiv:1903.06150.
[4] T. Do, H. Tran, E. Tjiputra, Q. D. Tran, A. Nguyen, Fine-grained visual classification
using self assessment classifier, 2022. arXiv:2205.10529.
[5] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, A. Vedaldi, Fine-grained visual classification of
aircraft, 2013. arXiv:1306.5151.
[6] Y. Chen, Y. Bai, W. Zhang, T. Mei, Destruction and construction learning for fine-
grained image recognition, in: 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2019, pp. 5152–5161. doi:10.1109/CVPR.2019.00530.
[7] S. Huang, X. Wang, D. Tao, Snapmix: Semantically proportional mixing for augmenting
fine-grained data, 2020. arXiv:2012.04846.
[8] H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network
for fine-grained image recognition, in: 2017 IEEE International Conference on Computer
Vision (ICCV), 2017, pp. 5219–5227. doi:10.1109/ICCV.2017.557.
[9] H. Zheng, J. Fu, Z.-J. Zha, J. Luo, Looking for the devil in the details: Learning trilinear
attention sampling network for fine-grained image recognition, 2019. arXiv:1903.06150.
[10] T. Do, H. Tran, E. Tjiputra, Q. D. Tran, A. Nguyen, Fine-grained visual classification
using self assessment classifier, 2022. arXiv:2205.10529.
[11] D. Połap, A. Jaszcz, N. Wawrzyniak, G. Zaniewicz, Bilinear pooling with poisoning
detection module for automatic side scan sonar data analysis, IEEE Access (2023).
[12] C. Qiu, S. Zhang, C. Wang, Z. Yu, H. Zheng, B. Zheng, Improving transfer learning and
squeeze-and-excitation networks for small-scale fine-grained fish image classification,
IEEE Access 6 (2018) 78503–78512. doi:10.1109/ACCESS.2018.2885055.
[13] D. Połap, A. Jaszcz, Heuristic feedback for generator support in generative adversarial
network, Proceedings of the 16th International Conference on Agents and Artificial
Intelligence 3 (2024) 863–870.
[14] C.-H. Yeh, C.-H. Huang, C.-H. Lin, Deep learning underwater image color correction and
contrast enhancement based on hue preservation, in: 2019 IEEE Underwater Technology
(UT), 2019, pp. 1–6. doi:10.1109/UT.2019.8734469.
[15] Y. Wang, J. Zhang, Y. Cao, Z. Wang, A deep cnn method for underwater image
enhancement, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp.
1382–1386. doi:10.1109/ICIP.2017.8296508.
[16] J. H. Park, D. Han, H. Ko, Adaptive weighted multi-discriminator cyclegan for underwater
image enhancement, Journal of Marine Science and Engineering 7 (2019) 200. doi:10.3390/jmse7070200.
[17] X. Chen, P. Zhang, L. Quan, C. Yi, C. Lu, Underwater image enhancement based on
deep learning and image formation model, 2021. arXiv:2101.00991.
[18] M. J. Islam, Y. Xia, J. Sattar, Fast underwater image enhancement for improved visual
perception, 2020. arXiv:1903.09766.
[19] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), 2023. arXiv:1606.08415.
[20] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing
internal covariate shift, 2015. arXiv:1502.03167.
[21] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, 2018.
arXiv:1807.06521.
[22] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional
networks, 2018. arXiv:1608.06993.
[23] J. Jäger, M. Simon, J. Denzler, V. Wolff, K. Fricke-Neuderth, C. Kruschel, Croatian fish
dataset: Fine-grained classification of fish species in their natural habitat, 2015, pp. 6.1–6.7.
doi:10.5244/C.29.MVAB.6.
[24] C. Qiu, S. Zhang, C. Wang, Z. Yu, H. Zheng, B. Zheng, Improving transfer learning and
squeeze-and-excitation networks for small-scale fine-grained fish image classification,
IEEE Access 6 (2018) 78503–78512. doi:10.1109/ACCESS.2018.2885055.
[25] M. Sudhakara, M. J. Meena, K. R. Madhavi, P. Anjaiah, L. P. K, Fish classification using
deep learning on small scale and low-quality images, International Journal of Intelligent
Systems and Applications in Engineering 10 (2022) 279. URL: https://ijisae.org/index.php/IJISAE/article/view/2292.