A study on rapid identification of medical vector organisms based on improved Transformer

Yan Zhou1,†, Jie Zhong1,∗,†, Xin Fang1,†, Juan Huang1,† and Lingyu Yan2,∗,†

1 Hubei International Travel Healthcare Center (Wuhan Customs Port Outpatient Department), Wuhan, China

2 School of Computer Science, Hubei University of Technology, Wuhan, China




Abstract
The identification of medical vector species at ports and among imported goods is an extremely important issue for customs work that bears directly on public health safety. Traditional methods of vector identification rely mainly on manual labor. To reduce personnel workload, this paper proposes a new lightweight image classification network model for vector biometric recognition. The backbone of the model is a network architecture based on the combination of convolutional neural networks and Transformers, built as a convolutional neural network model with fewer parameters. Firstly, a fusion module is introduced that replaces the depthwise separable convolution in ShuffleNet with an ordinary 3×3 convolution, and h-swish replaces the ReLU activation function, which reduces cost while better reducing the number of model parameters. Then, the paper uses a network architecture that combines convolution with an improved Transformer, feeding the image into the Transformer as sequences to obtain global feature information, ensuring that the network model can extract both local and global features of the image. At the same time, the Transformer is improved to reduce the number of model parameters.

Keywords
Biometric identification, Image classification, Convolutional neural networks, Transformer



                                1. Introduction
                                Accurate and timely identification of medical vector organisms (hereinafter referred to as
                                "vector organisms") at ports of entry and exit is an important issue in the work of the
                                Customs and Excise Department.
Flies, mosquitoes, and cockroaches are among the most important vector organisms. It is therefore a meaningful task to conduct research on the identification of

                                CITI’2024: 2nd International Workshop on Computer Information Technologies in Industry 4.0, June 12–14, 2024,
                                Ternopil, Ukraine
                                ∗ Corresponding author.
                                † These authors contributed equally.

zhouyan@163.com (Y. Zhou); zhongjie@163.com (J. Zhong); fangxin@163.com (X. Fang); huangjuan@163.com (J. Huang); yanranyaya2024@163.com (L. Yan)
0000-0003-3412-1639 (Y. Zhou); 0000-0002-5385-5761 (J. Zhong); 0000-0002-9421-8566 (X. Fang); 0009-0006-6171-7293 (J. Huang); 0000-0002-8434-5473 (L. Yan)
                                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




vector organisms at ports of entry and exit. The traditional method of vector organism identification [1-2] relies mainly on the experience that staff accumulate in practical identification work: the physical and morphological characteristics of a specimen, distinguished by the naked eye (or with the help of microscopic magnification equipment), must be matched against the descriptive text in species classification references, which depends heavily on the practical experience of the staff. However, the number of vector species is large, some species have similar apparent characteristics, and not all staff are experts with rich practical identification experience. Relying only on staff experience to determine the species of vector organisms, especially in the initial screening stage when the sample size is large, is slow, offers poor real-time performance, and is subjective, and the morphology of fresh samples may degrade before identification is complete. This makes the traditional method very time-consuming and labor-intensive.
At present, applying deep learning methods to the rapid identification of vector organisms is the main trend toward intelligent screening [3-5]. However, most deep learning models have many parameters and a large model size, demand high memory and computational resources, and consume too much time in inference, so they are difficult to deploy on small mobile devices, which restricts the popularization and application of such network models. The traditional solution is to deploy the model on a server, perform the computation there, and feed the results back to the phone, but this method depends too heavily on the network: recognition efficiency is greatly affected by network latency or the absence of a connection. Lightweight convolutional neural networks [6-8] have therefore gradually emerged. Compared with mainstream CNNs [9], a lightweight network substantially reduces the number of model parameters and the amount of computation at the cost of only a small reduction in recognition performance. It can be applied to devices with limited computational resources, such as mobile and edge devices, which effectively improves the practicality of deep learning for classification, recognition, and other tasks. A vector biometric identification method based on a lightweight convolutional neural network therefore has high research value.
MobileNet [10] is an efficient lightweight network architecture aimed at providing higher performance so that mobile and embedded devices can better meet the needs of various vision applications while making full use of limited computing power and storage space. The key advantage of MobileNet [11] is its use of depthwise separable convolutions: the standard convolution operation is split into smaller units, a depthwise convolution operation and a pointwise convolution operation. This approach significantly reduces the computational cost and parameter requirements of the convolution operations, making better use of limited computing power and storage space while also significantly reducing the complexity of the entire network, and it enables deep convolutional neural networks to fit the constraints of mobile and embedded devices. MobileNetV2 [12] optimizes the depthwise separable convolution module with a more advanced and efficient design: it proposes an inverted residual module with a linear bottleneck, which achieves higher accuracy and faster processing than MobileNetV1.
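As a concrete illustration, here is a minimal PyTorch sketch (ours, not the MobileNet reference code) of the depthwise separable factorization and its parameter savings:

```python
# Minimal sketch of a depthwise separable convolution: a per-channel 3x3
# "depthwise" conv followed by a 1x1 "pointwise" conv, versus one standard
# 3x3 convolution with the same input/output shape.
import torch.nn as nn

c_in, c_out = 64, 128

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

depthwise_separable = nn.Sequential(
    # depthwise: one 3x3 filter per input channel (groups=c_in)
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),
    # pointwise: 1x1 conv mixes information across channels
    nn.Conv2d(c_in, c_out, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))  # 73856 vs 8960 parameters
```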
ShuffleNet [13] is a lightweight network model with extremely high computational efficiency, which can effectively support mobile devices with weaker computing power. Its network structure contains two important breakthroughs: pointwise group convolution and channel shuffle. With group convolution, the feature maps are divided evenly along the channel dimension and each group of features is extracted by its own convolution kernels, so multiple convolutions are computed in parallel and computational efficiency improves greatly. Randomly mixing the feature maps of the different groups, a method called channel shuffle, then lets information flow between groups so that the whole network remains effective.
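The channel shuffle operation can be written as a pair of reshapes; the following is a small sketch (the function name is ours), showing how channels from different groups become interleaved:

```python
# Sketch of ShuffleNet-style channel shuffle: after grouped convolution,
# channels are interleaved across groups so information can flow between
# groups in the next grouped layer.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and channel axes
    return x.view(b, c, h, w)                  # flatten back

x = torch.arange(8.0).view(1, 8, 1, 1)         # channels 0..7 in two groups
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0] - the two groups are interleaved
```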
In this paper, we propose a new lightweight image classification network model whose backbone is a network architecture combining a convolutional neural network with a Transformer, built as a convolutional neural network model with fewer parameters. Firstly, a fusion module is introduced that replaces the depthwise separable convolution in ShuffleNet with an ordinary 3×3 convolution, and the ReLU activation function [14] is replaced with h-swish, which better reduces the number of model parameters while lowering cost. Then the paper uses a network architecture that combines convolution with an improved Transformer, feeding the image into the Transformer as sequences to obtain global feature information, ensuring that the network model can extract both local and global features of the image; at the same time, the Transformer itself is improved to reduce the number of model parameters.

2. Methods
In this paper, we use three sizes of network model with exactly the same network structure: the numbers of Transformer layers in the three ShuffleViT blocks are [2, 4, 3], and a final grouped pointwise convolution reduces the feature map. The overall architecture of the model is shown in Fig. 1. Only the input dimensions of the Transformer differ: the ShuffleViT-XXS input token dimensions are [64, 80, 96], ShuffleViT-XS uses [96, 120, 144], and ShuffleViT-S uses [144, 192, 240]. The network structure is shown in Table 1.
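For reference, the three variants can be summarized in a small configuration sketch (the dictionary layout and names are ours; the values come from the text and Table 1):

```python
# Shared Transformer depths per stage; only the token dimensions differ.
SHUFFLEVIT_CONFIGS = {
    "xxs": {"depths": [2, 4, 3], "dims": [64, 80, 96]},
    "xs":  {"depths": [2, 4, 3], "dims": [96, 120, 144]},
    "s":   {"depths": [2, 4, 3], "dims": [144, 192, 240]},
}
```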

2.1. Lightweight Module
The ShuffleNet network is a lightweight convolutional neural network. Its motivating idea is that when computational resources are limited, the number of channels of the feature maps in a convolutional neural network is also limited. ShuffleNet therefore proposes two techniques that increase the number of feature map channels without significantly increasing the computational effort of the network: point-by-point (pointwise) group convolution and a bottleneck-like structure.
Figure 1: Lightweight image classification network model diagram
Table 1
ShuffleViT network structure

Layer                    Output size   Output stride   Repeat   Output channels (XXS / XS / S)
Image                    256×256       1               –        –
Conv-3×3, ↓2             128×128       2               1        16 / 16 / 16
SV2                      128×128       2               1        16 / 32 / 32
SV2, ↓2                  64×64         4               1        24 / 48 / 64
SV2                      64×64         4               2        24 / 48 / 64
SV2, ↓2                  32×32         8               1        48 / 64 / 96
ShuffleViT block (L=2)   32×32         8               1        48 (d=64) / 64 (d=96) / 96 (d=144)
SV2, ↓2                  16×16         16              1        64 / 80 / 128
ShuffleViT block (L=4)   16×16         16              1        64 (d=80) / 80 (d=120) / 128 (d=192)
SV2, ↓2                  8×8           32              1        80 / 96 / 160
ShuffleViT block (L=3)   8×8           32              1        80 (d=96) / 96 (d=144) / 160 (d=240)
Conv-1×1                 8×8           32              1        320 / 384 / 640
Global pool              1×1           256             1        –
Linear                   1×1           256             1        1000 / 1000 / 1000
Parameters                                                      1.22M / 2.21M / 5.1M

Convolutional neural networks are divided into multiple modules for feature extraction; for example, Xception and ResNeXt combine depthwise separable convolution with group convolution to form neural networks with better performance, achieving a balance between performance and computational consumption. However, these networks use a large number of pointwise convolutions, which incur high computational complexity. Constraining the pointwise convolutions in a lightweight convolutional neural network limits the feature channels and can drastically reduce the performance of the network model. Pointwise group convolution instead applies the convolution within each channel group, and this grouping reduces the computational cost of the model. ShuffleNet improves on the residual structure: it first performs a 1×1 group convolution, followed by a channel shuffle operation, then a 3×3 depthwise separable convolution, and finally a group convolution that transforms the output size so it can be summed directly with the input. When the stride is not 1, a convolution operation must be applied to the input to match its size before concatenating it with the output. The subsequent V2 version, ShuffleNetV2, proposed that an efficient network structure should keep the convolution widths equal, use convolutions with low cost, reduce fragmented element-wise operations, and keep the network model as lean as possible. Since the pointwise group convolution and bottleneck structure of ShuffleNetV1 add too much overhead for a lightweight network model, ShuffleNetV2 replaces the group convolution with ordinary convolution; because this removes the channel-shuffle effect that followed the group convolution, the channel shuffle is moved to after the concatenation of the input and output branches. A schematic of the improved ShuffleNet structure is shown in Figure 2.
In this paper, the convolution part is further optimized on the basis of ShuffleNetV2: the 1×1 convolution followed by the 3×3 depthwise separable convolution at the start of the branch is replaced with an ordinary 3×3 convolution. The activation function ReLU is then replaced with the h-swish activation function:

    h-swish(x) = x · ReLU6(x + 3) / 6,                      (1)

    ReLU6(x) = min(max(x, 0), 6),                           (2)
By contrast, the ReLU6 activation function limits the ReLU function to a maximum output value of 6, which allows good numerical resolution on low-precision devices; when the ReLU function is not limited to a maximum value, it cannot accurately represent a large range of values on low-precision devices, resulting in a loss of accuracy. The h-swish activation function approximates the smooth swish activation using only these cheap, piecewise-linear operations, so it retains the accuracy benefit of swish while remaining inexpensive to compute on such devices.
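Eqs. (1)-(2) correspond directly to the following sketch; PyTorch's built-in nn.ReLU6 and nn.Hardswish implement the same functions:

```python
import torch

def relu6(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0.0, max=6.0)   # min(max(x, 0), 6), Eq. (2)

def h_swish(x: torch.Tensor) -> torch.Tensor:
    return x * relu6(x + 3.0) / 6.0           # x * ReLU6(x + 3) / 6, Eq. (1)
```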




Figure 2: Schematic diagram of the improved ShuffleNet structure
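The following is a minimal sketch of the improved stride-1 unit under our reading of the text and Figure 2; the normalization placement and the layer count inside the branch are assumptions:

```python
# Channel split and channel shuffle are kept from ShuffleNetV2, while the
# branch's 1x1 conv + 3x3 depthwise pair becomes one ordinary 3x3 conv and
# ReLU becomes h-swish.
import torch
import torch.nn as nn

class ImprovedShuffleUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.Hardswish(),                    # h-swish replaces ReLU
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)             # channel split
        out = torch.cat([x1, self.branch(x2)], dim=1)
        b, c, h, w = out.shape                 # channel shuffle with 2 groups
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```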
2.2. Attention Module
To compensate for the Transformer's weak spatial inductive bias, this paper combines a convolutional neural network with a Transformer. Convolutional neural networks have an excellent spatial inductive bias: because an image has a strong two-dimensional local structure, spatially neighboring pixels are often highly correlated, and by using local receptive fields, shared weights, and spatial subsampling, a convolutional neural network captures this bias and achieves a degree of invariance to transformations within the image. The Transformer, in turn, is better at capturing global features through its multi-head attention mechanism.
    For the part that combines the convolutional neural network with the Transformer, we first pass the feature map through a convolution with kernel size n×n to obtain local feature modeling, and then adjust the number of channels of the feature map through a convolution layer with kernel size 1×1. A tiling operation is then performed on the adjusted feature map, similar to the patching operation of Vision Transformer, except that the tiling in this paper introduces channel mixing to reduce computation. The feature map is first divided into blocks; for example, with a block size of 2×2, each block consists of 2×2 pixels. Since a convolution with kernel size n×n has already been applied to the input feature map to model local features, letting every token attend to every other token during the global attention operation would waste computation, as the local relations it would recover have already been captured by the preceding convolution. This paper therefore first transforms the data into the format needed for self-attention. The common attention mechanism directly flattens the height and width dimensions into a single sequence of tokens; here, instead, we gather the pixels that occupy the same position within each block and perform the attention operation among tokens at the same position. The feature map is thus divided into different sequences according to the pixel position within each block, as shown in Fig. 3.




Figure 3: Schematic diagram of ShuffleViT image slice structure

Each pixel attends only to the pixels at the same position in the other blocks. The number of sequences formed equals the number of pixels in each block. Assuming the width, height, and number of channels of the feature map are W, H, and C respectively, each sequence contains W×H/P tokens, where P is the number of pixels in a block. The Transformer operation is performed on these sequences once they are obtained.




Figure 4: Schematic diagram of the data flow through the ShuffleViT block
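To make the tiling concrete, the following sketch (ours, not the authors' code) gathers the same-position pixels of every 2×2 block into P = 4 separate sequences using only reshapes:

```python
import torch

def to_sequences(x: torch.Tensor, ph: int = 2, pw: int = 2) -> torch.Tensor:
    b, c, h, w = x.shape
    x = x.view(b, c, h // ph, ph, w // pw, pw)   # split H and W into blocks
    # group by intra-block position (ph*pw groups), flatten the block grid
    return x.permute(0, 3, 5, 2, 4, 1).reshape(b, ph * pw, (h * w) // (ph * pw), c)

x = torch.randn(1, 64, 32, 32)
print(to_sequences(x).shape)   # torch.Size([1, 4, 256, 64]): 4 sequences of 256 tokens
```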

The computation of multi-head attention is as follows:

    [q, k, v] = x W_qkv,                                    (3)

where q, k, and v denote the query, key, and value of the attention mechanism, respectively, and W_qkv denotes the learned projection used to compute q, k, and v.

    A = softmax(q kᵀ / √d),                                 (4)

where A ∈ R^(N×N) is the attention matrix between the queries and keys of a single head, with N the sequence length and d the head dimension.

    SA(x) = A v,                                            (5)

where SA(x) denotes the output of a single self-attention head.

    MSA(x) = [SA_1(x); SA_2(x); …; SA_h(x)] W_o,            (6)

where MSA(x) denotes multi-head self-attention and W_o ∈ R^(h·d×C). It can be concluded that multi-head attention accounts for the majority of the computation of the entire encoding module, and its cost grows quadratically with the sequence length: splitting the N tokens into P independent sequences of length N/P reduces the attention cost from N²·d to P·(N/P)²·d = N²·d/P, so with 2×2 blocks (P = 4) the multi-head attention of the improved Transformer requires one quarter of the computation of the corresponding ViT attention. Dividing the input feature map into sequences therefore significantly reduces the computation of the model. After the attention computation, a stacking operation is performed on the output: the output feature values are restored from the one-dimensional sequences to the feature map according to their original positions. The number of channels is then adjusted back to the original size by a convolutional layer with kernel size 1×1, and the result is concatenated with the original input feature map through a shortcut branch.
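A small sketch of attention over the P same-position sequences, together with the cost comparison above; using q = k = v is our simplification, whereas the model learns the W_qkv projection of Eq. (3):

```python
import torch

def grouped_attention(seqs: torch.Tensor) -> torch.Tensor:
    # seqs: (b, P, n, c) as produced by the to_sequences() sketch above
    scale = seqs.shape[-1] ** -0.5
    attn = torch.softmax(seqs @ seqs.transpose(-2, -1) * scale, dim=-1)
    return attn @ seqs   # each of the P sequences attends only within itself

# Cost check: full attention ~ N^2*d; grouped ~ P*(N/P)^2*d = N^2*d/P.
N, P, d = 1024, 4, 64
print(N**2 * d, P * (N // P)**2 * d)   # 67108864 vs 16777216, i.e. 4x fewer
```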
3. Experiments
In this paper, we use a dataset containing 3 species of cockroach, 6 species of mosquito, and 20 species of fly, for a total of 338 images. The background of the images is a single, unobstructed laboratory environment, and the vector organisms of each category are displayed as pin-mounted specimens. The dataset suffers from an uneven distribution of images across categories and from differences in the display angle of the specimens, with the number of sample images per category ranging from 9 to 53. Fig. 5 shows the main categories of the dataset.



Figure 5: Presentation of the main categories of the dataset (rows: cockroaches, mosquitoes, Musca flies)

Table 2
Accuracy on the recognition task for ResNet-18, ResNet-34, ResNet-50, MobileNetV1, MobileNetV2, MobileNetV3, and ShuffleNetV2 models vs. our model

Model                MaxAcc (%)   Acc (%)   #Params
ResNet-18            81.27        80.27     11.18M
ResNet-34            81.57        81.93     21.29M
ResNet-50            81.60        81.20     23.53M
MobileNetV1          78.57        78.36     4.22M
MobileNetV2/1.0      79.36        79.07     2.24M
MobileNetV2/2.0      81.02        80.89     8.72M
MobileNetV3-Large    80.28        80.06     3.88M
MobileNetV3-Small    77.15        77.12     1.84M
ShuffleNet_V2_0.5    76.71        76.32     1.42M
ShuffleNet_V2_1.0    79.24        79.22     2.28M
ShuffleViT (ours)    82.60        82.12     2.21M
The results in Table 2 show that ShuffleViT outperforms the other convolutional neural networks. Against the ResNet series, ShuffleViT's accuracy on the image classification task is 1.33% higher than ResNet-18, 1.03% higher than ResNet-34, and 1.00% higher than ResNet-50. Against the MobileNet family, ShuffleViT is 4.03% higher than MobileNetV1, 3.24% higher than MobileNetV2/1.0, 1.58% higher than MobileNetV2/2.0, 2.32% higher than MobileNetV3-Large, and 5.45% higher than MobileNetV3-Small. Against the ShuffleNet family, ShuffleViT is 5.89% more accurate than ShuffleNet_V2_0.5 and 3.36% more accurate than ShuffleNet_V2_1.0 on the image classification task.

4. Conclusion
Improving performance on the image classification task requires a very large dataset: when the dataset is too small, overfitting may occur, whereas with sufficient data a larger network structure brings better classification accuracy. With the deepening of neural networks, image classification tasks have achieved astonishing accuracy, but improving performance demands state-of-the-art network structures and ever larger computing resources. In the current era of widespread intelligent mobile devices, this enormous amount of computation far exceeds the capabilities of many mobile and embedded devices, so successfully deploying image classification on mobile devices has become a key research issue. This article proposes a new lightweight image classification network model for vector organism recognition. Firstly, a fusion module is introduced that replaces the depthwise separable convolution in ShuffleNet with an ordinary 3×3 convolution, and h-swish replaces the ReLU activation function, which reduces cost while better reducing the number of model parameters. Then, the article uses a network architecture that combines convolution with an improved Transformer, feeding the image into the Transformer as sequences to obtain global feature information, ensuring that the network model can extract both local and global features of the image. At the same time, the Transformer is improved to reduce the number of model parameters.

References
[1] Tatfeng Y M, Usuanlele M U, Orukpe A, et al. Mechanical transmission of pathogenic
    organisms: the role of cockroaches[J]. Journal of vector borne diseases, 2005, 42(4):
    129.
[2] Nicholson W L, Allen K E, McQuiston J H, et al. The increasing recognition of rickettsial
    pathogens in dogs and people[J]. Trends in parasitology, 2010, 26(4): 205-212.
[3] Gourisaria M K, Das S, Sharma R, et al. A deep learning model for malaria disease
    detection and analysis using deep convolutional neural networks[J]. International
    Journal of Emerging Technologies, 2020, 11(2): 699-704.
[4] Noureddine S, Zineeddine B, Toumi A, et al. A new predictive medical approach based
     on data mining and Symbiotic Organisms Search algorithm[J]. International Journal of
     Computers and Applications, 2022, 44(5): 465-479.
[5] Rani P, Kotwal S, Manhas J, et al. Machine learning and deep learning based
     computational approaches in automatic microorganisms image recognition:
     methodologies, challenges, and developments[J]. Archives of Computational Methods
     in Engineering, 2022, 29(3): 1801-1837.
[6] Liu F, Xu H, Qi M, et al. Depth-wise separable convolution attention module for
     garbage image classification[J]. Sustainability, 2022, 14(5): 3099.
[7] Huang T, Chen J, Jiang L. DS-UNeXt: depthwise separable convolution network with
     large convolutional kernel for medical image segmentation[J]. Signal, Image and Video
     Processing, 2023, 17(5): 1775-1783.
[8] Xia Q, Dong S, Peng T. An Abnormal Traffic Detection Method for IoT Devices Based on
     Federated       Learning      and     Depthwise     Separable    Convolutional   Neural
     Networks[C]//2022          IEEE     International    Performance,     Computing,   and
     Communications Conference (IPCCC). IEEE, 2022: 352-359.
[9] Jiang K, Zhang C, Wei B, Li Z, Kochan O. Fault diagnosis of RV reducer based on
     denoising time–frequency attention neural network [J]. Expert Systems with
     Applications, 2024, 238: 121762.
[10] Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks
     for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[11] Liu X, Qi P, Siarry P, et al. Mining security assessment in an underground environment
     using a novel face recognition method with improved multiscale neural network [J].
     Alexandria Engineering Journal, 2023, 80: 217-228.
[12] Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear
     bottlenecks[C]//Proceedings of the IEEE conference on computer vision and pattern
     recognition. 2018: 4510-4520.
[13] Zhang X, Zhou X, Lin M, et al. Shufflenet: An extremely efficient convolutional neural
     network for mobile devices[C]//Proceedings of the IEEE conference on computer
     vision and pattern recognition. 2018: 6848-6856.
[14] Xu X, Przystupa K, Kochan O. Social Recommendation Algorithm Based on Self-
     Supervised Hypergraph Attention [J]. Electronics. 2023, 12(4):906.