<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Food Intake Measurement using Deep Learning Based Semantic Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haruhiro TAKAHASHI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryuto ISHIBASHI</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hayata KANEKO</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin MENG</string-name>
          <email>menglin@fc.ritsumei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, Japan 525-8577</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>In the super-aging society, the number of caregivers for the elderly is insufficient. Furthermore, the increasing demand for quality care exacerbates the labor shortage problem. Hence, care technology has become increasingly important. This paper focuses on automatically measuring food intake to monitor the health of the elderly. The proposed approach utilizes a semantic segmentation model based on deep learning methods. Specifically, the U-Net architecture is employed as a foundational module for segmenting leftover food on the after-meal plate. Attention modules are integrated into the U-Net to enhance the Mean Intersection over Union (mIoU) while minimizing the number of model parameters and FLOPs (Floating Point Operations). Furthermore, an Attention Search method is proposed to find the optimal Attention module insertion position in the U-Net. The experimental results demonstrate that the proposed method achieves a high score, an mIoU of 87.1%, with only a slight increase in the number of parameters and FLOPs, confirming its effectiveness. Optimizing the proposal and applying it in practice is important future work. To further optimize the number of parameters and FLOPs, the Attention insertion position should be explored with a technique called Neural Architecture Search (NAS).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The proportion of people aged 65 and over in the world's population has been rising
yearly and is expected to continue to rise [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In Japan, about 29% of the total population is
65 and over, making it a super-aged society. Moreover, only 12.1% of care providers can provide
care to people in need of care, and labor shortages in the nursing care industry are becoming
serious [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In addition to their usual duties, such as caring for residents, caregivers keep food
and water intake records. The problem is that food and water intake is recorded manually,
and the recording standard is not uniform across caregivers. Recently, AI-based image recognition
methods have been widely used and have achieved significant results in various fields [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>
        ].
      </p>
      <p>Therefore, this study mitigates these manual tasks by implementing AI-based image
segmentation and unifying the food intake criteria.</p>
      <p>This paper proposes a model using semantic segmentation, one of the AI-based image
segmentation methods, to detect the leftover food area. However, highly accurate deep-learning
image recognition models are computationally expensive, and limited computing resources make
them challenging to apply in nursing homes. To solve this problem, an
Attention module that improves accuracy with a small increase in computational complexity
is adopted, and its performance is validated. A further problem is that manually searching for the
optimal Attention module insertion position in the U-Net takes much time. Therefore, Attention
Search is proposed to decide the optimal Attention module insertion position automatically.</p>
      <p>The major contributions of this paper are as follows:
• Propose an AI-based semantic segmentation model to reduce caregivers' food intake
measurement tasks.
• Optimize the Attention module insertion position using Attention Search.
• The proposed U-Net model achieves higher accuracy with less computation than conventional
semantic segmentation models.</p>
      <p>The remaining parts of this paper are organized as follows. Related works are listed in Section
2. Section 3 proposes our U-Net model. Section 4 shows the dataset, the experimental method,
experimental results, and the discussion. Finally, this paper is concluded in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Semantic Segmentation Model</title>
        <p>Image recognition technology consists of image classification, object detection, and image
segmentation. Semantic segmentation, the image segmentation method used in this
paper, labels what appears in each pixel of an image.</p>
        <sec id="sec-2-1-0">
          <title>U-Net</title>
          <p>
            U-Net [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] is a semantic segmentation model that consists of an encoder and
a decoder. The encoder convolves the input image several times to extract features.
This part of the network structure is called the backbone, and a model pretrained on image
classification tasks, such as ResNet [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], can be used as the backbone. The decoder takes
the features extracted by the encoder, performs deconvolution, and outputs a probability
map of the same size as the input image. In addition, U-Net concatenates the encoder
feature map to the decoder feature map at each hierarchy level.
          </p>
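          <p>As an illustration of this encoder-decoder structure, the following is a minimal PyTorch sketch of a U-Net-style network with one skip connection; the layer sizes and channel counts are illustrative assumptions, not the configuration used in this paper.</p>
          <preformat>
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Minimal U-Net sketch: one encoder level, one decoder level, one skip."""
    def __init__(self, in_ch=3, base=16, n_classes=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                      # halve the spatial size
        self.mid = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)  # deconvolution
        # the decoder sees upsampled features concatenated with the encoder skip
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, n_classes, 1)        # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                                  # encoder feature map
        m = self.mid(self.down(e))
        d = self.up(m)
        d = self.dec(torch.cat([d, e], dim=1))           # skip concatenation
        return self.head(d)                              # same H x W as the input

logits = MiniUNet()(torch.randn(1, 3, 224, 224))         # -> (1, 3, 224, 224)
          </preformat>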
        </sec>
        <sec id="sec-2-1-1">
          <title>DeepLabv3</title>
          <p>
            DeepLabv3 [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] is another model for semantic segmentation. DeepLabv3 uses models
pretrained on image classification tasks as the backbone. The backbone extracts
image feature maps, Atrous Spatial Pyramid Pooling (ASPP) is performed in parallel, and then
several convolutions output the segmented image. ASPP is inspired by
Spatial Pyramid Pooling (SPP) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. In SPP, pooling is performed on the image feature
maps at multiple scales, and the results are concatenated and output. SPP therefore has the advantage
of coping with inputs of various sizes and with multiple scales. ASPP performs SPP
efficiently by applying atrous convolution to the pooling while maintaining the advantage of
SPP.
          </p>
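          <p>The PyTorch sketch below illustrates the parallel atrous branches of ASPP described above; the dilation rates and channel counts are illustrative assumptions taken from common DeepLab configurations, not values stated in this paper.</p>
          <preformat>
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """ASPP sketch: parallel atrous convolutions at several dilation rates,
    concatenated and fused with a 1x1 convolution."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # 1x1 fusion conv

    def forward(self, x):
        # each branch sees a different effective receptive field
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

y = ASPP()(torch.randn(1, 256, 28, 28))  # -> (1, 256, 28, 28)
          </preformat>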
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Attention Module</title>
        <p>An Attention module improves accuracy with only a small increase in the number of
parameters. It is inserted at several points in the model. In Figure 1, r is the compression ratio of
the number of channels, and the computational complexity becomes smaller as r is increased.</p>
        <sec id="sec-2-2-1">
          <title>Squeeze and Excitation Module</title>
          <p>
            The Squeeze-and-Excitation Module (SE-Module) [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] is one of the Attention modules, and
its structure is shown in Figure 1 (a). The Squeeze part uses global average pooling to
summarize the features of the entire image into one value per channel. The Excitation part passes
these per-channel values through several layers to obtain a weight for each
channel, which is then multiplied by the input feature maps to produce the output.
          </p>
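          <p>A minimal PyTorch sketch of the SE-Module as described above, with r as the channel compression ratio; the default r = 16 and the layer widths are assumptions.</p>
          <preformat>
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation sketch: global average pooling (Squeeze), then a
    small bottleneck MLP (Excitation) that produces per-channel weights."""
    def __init__(self, channels, r=16):   # r: channel compression ratio
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # Squeeze: (N, C) global descriptor
        w = self.fc(w).view(n, c, 1, 1)   # Excitation: per-channel weights
        return x * w                      # reweight the input feature maps

y = SEModule(64)(torch.randn(1, 64, 56, 56))  # same shape as the input
          </preformat>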
        </sec>
        <sec id="sec-2-2-2">
          <title>Coordinate Attention Module</title>
          <p>
            Coordinate Attention Module(CA-Module) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] is one of Attention modules and its
structure is shown in Figure 1 (b). Unlike the SE-Module, the CA-Module is divided into
W and H directions for pooling and weighting. By dividing, not only the computational
complexity is reduced, but also spatial information can be included in the weighting.
          </p>
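          <p>A simplified PyTorch sketch of the CA-Module's directional pooling: features are pooled along the W and H directions separately, encoded jointly, and split into two directional weight maps. The exact layer layout is an assumption that follows the cited paper only loosely.</p>
          <preformat>
import torch
import torch.nn as nn

class CAModule(nn.Module):
    """Coordinate Attention sketch: directional pooling, joint encoding,
    then two directional sigmoid weight maps."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)                 # compressed channel count
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, channels, 1)   # weights along H
        self.conv_w = nn.Conv2d(mid, channels, 1)   # weights along W

    def forward(self, x):
        n, c, h, w = x.shape
        ph = x.mean(dim=3, keepdim=True)            # (N, C, H, 1): pool over W
        pw = x.mean(dim=2, keepdim=True)            # (N, C, 1, W): pool over H
        y = torch.cat([ph, pw.permute(0, 1, 3, 2)], dim=2)  # (N, C, H+W, 1)
        y = self.act(self.conv1(y))
        yh, yw = y.split([h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (N, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * ah * aw                          # spatially aware reweighting

y = CAModule(64)(torch.randn(1, 64, 56, 56))  # same shape as the input
          </preformat>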
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. MobileNetV3</title>
        <p>
          MobileNet is a family of models with a parameter count small enough to run on a mobile
phone and usable for various tasks. MobileNetV1 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] uses Depthwise Separable
Convolution, a convolution that performs Depthwise convolution followed by Pointwise
convolution, to reduce the number of parameters compared with normal convolution. MobileNetV2
introduces the Inverted Residual block to reduce the large calculation amount of Pointwise Convolution
in MobileNetV1. The Inverted Residual is an improvement on ResNet's Residual Block that
approximates convolution with a small number of calculations by sandwiching a
Depthwise Convolution between Pointwise Convolutions. In MobileNetV2, the number of channels
is reduced compared with MobileNetV1, and only inside the Inverted Residual block is the number of
channels expanded so that features are sufficiently extracted, thereby reducing the parameter count.
MobileNetV3 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] introduces the SE-Module into the Inverted Residual block and changes some non-linear
activations from ReLU to Hardswish. In addition, hardware-aware Neural Architecture Search
(NAS) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is used to tailor the model to the phone's CPU.
        </p>
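        <p>A minimal PyTorch sketch of the Inverted Residual pattern described above (pointwise expansion, depthwise convolution, pointwise projection, residual connection); the expansion factor of 4 is an illustrative assumption.</p>
        <preformat>
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual sketch: 1x1 expand -> 3x3 depthwise -> 1x1 project,
    with a residual connection when the input and output shapes match."""
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand                          # expand the channel count
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1), nn.ReLU6(),     # pointwise expansion
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depthwise
            nn.ReLU6(),
            nn.Conv2d(hidden, ch, 1),                 # pointwise projection
        )

    def forward(self, x):
        return x + self.block(x)                      # residual connection

y = InvertedResidual(32)(torch.randn(1, 32, 56, 56))  # -> (1, 32, 56, 56)
        </preformat>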
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>3.1. Our U-Net model</title>
        <p>In this paper, a U-Net model for semantic segmentation and Attention Search, which optimizes the
positions of the Attention modules, are proposed to realize higher accuracy with less computation.
The proposed U-Net model increases accuracy with only a small number of additional parameters by
adding Attention modules without increasing the number of channels. Its structure is shown
in Figure 2. In the Two Convolution Block (TCB) of Figure 2, a 3×3 convolution, batch normalization,
and ReLU (CBR) are performed twice, and the Attention Points where an Attention module can be
inserted are after each CBR.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Attention Search</title>
        <p>Attention Search is a method for inserting an Attention module at the optimal position
to improve accuracy. The Attention modules used in this paper are the SE-Module and the CA-Module.
For each of the SE-Module and CA-Module, the method searches for the best TCBs in which to
insert the Attention module at Attention Point 1 and 2, respectively.</p>
        <p>Attention Search is performed as shown in Algorithm 1. 𝒜 includes the SE-Module and
CA-Module, 𝒫 includes Attention Point 1 and 2, and N indicates the number of TCBs in our
U-Net model. a indicates the Attention module to be inserted, p indicates the Attention Point at which
it is inserted, S indicates the set of models to save, i indicates the number of the pattern to be executed, m
indicates the model to be saved, m* indicates the maximum-score model in the set S, and ℳ indicates
the set of maximum-score models to save. Algorithm 1 proceeds as follows. In
line 2, the Attention module a is selected as the SE-Module or CA-Module. In line 3, the Attention Point p
is selected as 1 or 2. In line 5, the insertion pattern i to be executed is selected.
In lines 6 through 8, the Attention module is inserted into each TCB corresponding to the selected
pattern. In lines 10 through 16, the selected model is trained and saved when the epoch is
E × 0.8 or greater. In lines 18 through 19, the model with the maximum score among the saved
models is selected; the score used is mIoU. Over lines 1 through 21, Attention
Search thus explores the model with the maximum score.</p>
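        <p>A Python sketch of this search loop, enumerating the 2^N insertion patterns as bitmasks; train_and_score is a placeholder for the training-and-evaluation step of Algorithm 1, and N = 4 is an illustrative assumption.</p>
        <preformat>
from itertools import product

N = 4                                    # number of TCBs (illustrative)
modules = ["SE", "CA"]                   # the set A of Attention modules
points = [1, 2]                          # the set P of Attention Points
best_models = []                         # the set M of best models

def train_and_score(module, point, tcb_flags):
    """Placeholder: build the U-Net with `module` inserted at `point` of every
    TCB whose flag is 1, train it, return the best late-epoch mIoU."""
    return 0.0                           # stub value for the sketch

for a, p in product(modules, points):    # lines 2-3 of Algorithm 1
    scored = []
    for i in range(2 ** N):              # line 5: every insertion pattern
        flags = [(i >> t) & 1 for t in range(N)]   # lines 6-9: the bits of i
        scored.append((train_and_score(a, p, flags), a, p, flags))
    best_models.append(max(scored))      # lines 18-19: keep the max-mIoU model
        </preformat>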
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The dataset used in this study was created by taking pictures before, during, and after meals
and labeling the areas of rice and miso soup. The image sizes are 456 (or 455)×608 and 608×456
(or 455), and the input images of the model are resized to 224×224. There are 241 samples in total;
about 85% (206 samples) are used for training and about 15% (35 samples) are used for testing.
The dataset with the correct labels is shown in Figure 3.</p>
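        <p>A small torchvision sketch of this preprocessing; only the 224×224 resize, the raw image sizes, and the 206/35 split come from the text, while the dataset object itself is a stand-in.</p>
        <preformat>
import torch
from torch.utils.data import TensorDataset, random_split
from torchvision import transforms

resize = transforms.Resize((224, 224))       # model input size
img = torch.zeros(3, 456, 608)               # one raw image size from the text
print(resize(img).shape)                     # torch.Size([3, 224, 224])

# With 241 labeled samples, an ~85%/15% split gives 206 train / 35 test.
dummy = TensorDataset(torch.zeros(241, 3, 224, 224))
train_set, test_set = random_split(
    dummy, [206, 35], generator=torch.Generator().manual_seed(0))
print(len(train_set), len(test_set))         # 206 35
        </preformat>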
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments Method</title>
        <sec id="sec-4-2-1">
          <title>Implementational Conditions</title>
          <p>Experiments in this paper are conducted on an Intel(R) Xeon(R) Gold 6342 CPU, a single
NVIDIA RTX A6000 GPU, and an NVIDIA Jetson Nano. The operating system is Ubuntu
22.04.6 LTS, the CUDA version is 12.2, and the JetPack version is 4.6.1 (the TensorRT
version is 8.2). The PyTorch library is used as the framework for implementing deep learning.</p>
          <p>Algorithm 1 Attention Search
Input: 𝒜: set of Attention modules,
𝒫: set of Attention Points,
N: number of TCBs in U-Net,
E: total epochs
Output: ℳ: Best Models
1: ℳ = {}
2: for a in 𝒜 do
3: for p in 𝒫 do
4: S = {}
5: for i = 0 to 2^N − 1 do
6: if i ≡ 1 (mod 2) then ▷ checked for each TCB index t = 0, …, N − 1
7: a is inserted at p of TCB[t]
8: end if
9: i = i // 2 ▷ shift to the next bit of the insertion pattern
10: for h = 1 to E do
11: Training
12: if h ≥ E × 0.8 then
13: m ← Model
14: S ← S ∪ {m}
15: end if
16: end for
17: end for
18: m* ← Maximum Score of S
19: ℳ ← ℳ ∪ {m*}
20: end for
21: end for</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Learning Conditions</title>
          <p>The model is trained under the following conditions.</p>
          <p>
            • Loss Function: The loss function used in this paper is Tversky Loss [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. Equation
(1) is the formula for calculating Tversky Loss, where p is the predicted label and g is the
correct label. In this paper, α = 0.7 and β = 0.3.
          </p>
          <p>Tversky_Loss(p, g) = 1 − Σ p·g / (Σ p·g + α Σ p·(1 − g) + β Σ (1 − p)·g)   (1)</p>
          <p>
            • Optimizer: Adam [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]
• Learning Rate: 1e-4
• Batch Size: 4
• Epoch: 100
          </p>
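          <p>A PyTorch implementation of Equation (1) under the notation above; the eps term is an added numerical-stability assumption, not part of the paper's formula.</p>
          <preformat>
import torch

def tversky_loss(p, g, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky Loss per Equation (1): p is the predicted probability map,
    g the binary ground-truth map; alpha and beta weight false positives
    and false negatives respectively."""
    tp = (p * g).sum()                 # sum of p*g
    fp = (p * (1 - g)).sum()           # predicted but not labeled
    fn = ((1 - p) * g).sum()           # labeled but not predicted
    return 1 - tp / (tp + alpha * fp + beta * fn + eps)

p = torch.sigmoid(torch.randn(4, 1, 224, 224))   # dummy predictions
g = (torch.rand(4, 1, 224, 224) > 0.5).float()   # dummy labels
print(tversky_loss(p, g))
          </preformat>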
        </sec>
        <sec id="sec-4-2-3">
          <title>Evaluation index</title>
          <p>The evaluation indexes used in this paper are IoU, Throughput, Params, and FLOPs.
• IoU: Equation (2) is the formula for calculating IoU (Intersection over Union), where p is
the predicted label and g is the correct label.</p>
          <p>IoU(p, g) = Σ p·g / (Σ p + Σ g − Σ p·g)   (2)</p>
          <p>• Throughput: Throughput is a measure of how many images can be processed per
second.
• Params: Params is the number of parameters of the model.
• FLOPs: FLOPs (FLoating-point OPerationS) is the computational complexity of the
model.</p>
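          <p>A PyTorch sketch of Equation (2) and a simplified throughput measurement; the eps term and the timing-loop details (dummy inputs, no GPU synchronization, a tiny stand-in model) are assumptions.</p>
          <preformat>
import time
import torch
import torch.nn as nn

def iou(p, g, eps=1e-7):
    """IoU per Equation (2) for binary masks p (prediction) and g (label)."""
    inter = (p * g).sum()
    union = p.sum() + g.sum() - inter
    return inter / (union + eps)

def throughput(model, size=(1, 3, 224, 224), iters=50):
    """Images processed per second on dummy inputs (simplified timing)."""
    x = torch.zeros(size)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return iters * size[0] / (time.perf_counter() - start)

print(iou((torch.rand(224, 224) > 0.5).float(),
          (torch.rand(224, 224) > 0.5).float()))
print(throughput(nn.Conv2d(3, 3, 3, padding=1)))
          </preformat>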
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Result</title>
        <p>The experimental results are summarized in Table 1, and example segmentation outputs are shown in Figure 3.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Discussion and Future Work</title>
        <p>From the results, inserting an Attention module is an effective way to improve mIoU. However,
depending on the number of Attention module insertions, the number of parameters and FLOPs
increases and the throughput decreases. The Attention Search results show that
mIoU is improved more by inserting at Attention Point 2 than at Attention Point 1, in both the SE
and CA cases, without much increase in the parameter count. In addition, SE is more effective
than CA at Attention Points 1 and 2, because its mIoU is higher while the number of parameters
is similar. From the throughput results (Table 1), the throughput
corresponds to the number of parameters when executing on the CPU, where the number of cores is not
sufficient, and to FLOPs when executing on the GPU and Jetson, where the number of cores
is sufficient. As shown in Figure 3, segmentation image
quality improves in correspondence with mIoU. According to the results in Table 1, Type B of
the proposed model has a higher score on all the evaluation indexes, which shows the superiority of
Type B of the proposed model. Based on the discussion so far, the model that adds
the Attention module without increasing the number of channels has an advantage
over the DeepLabv3 models with MobileNetV3 as the backbone for the task of this paper with
its small dataset.</p>
        <p>
          The amount of leftover food is detected by calculating the volume corresponding to the
obtained area. The food intake is estimated from the volumes obtained from photographs taken
before and after meals. However, this method can only be used if the plate size does not change
between the photographs taken before and after meals. Future work includes an object detection
[
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ] component for detecting the plate, allowing estimation even if the plate
size changes. Additionally, future work covers not only rice and miso soup but also other dishes.
Furthermore, we constructed a model with high mIoU using Attention Search; however, brute
force is not efficient. For this reason, it is necessary to search for a model using NAS in future
work.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The aging of society is progressing worldwide, and the labor shortage in the nursing care industry
is becoming serious in Japan. This paper proposes a semantic segmentation model
to measure food intake, which is one of the tasks in the nursing care industry. The proposed
model with Attention modules improves the IoU while suppressing the increase in the number
of parameters and FLOPs of the model. In conclusion, our proposed method is suitable for
measuring leftover food areas under the limitations of computational resources. Future work
will implement a model with the addition of object detection to measure food intake without being
affected by plate size. Additionally, pruning models using NAS is an interesting direction for
future work. Further, this work is expected to be applied to food service in robot-based food
industry automation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The dataset used in this work is provided by MOUantAI, “Meshi-Pasha Project”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>United Nations</string-name>
          , Population by broad age groups,
          <year>2022</year>
          . https://population.un.org/wpp/Graphs/DemographicProfiles/Line/900.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>Cabinet Office of Japan</string-name>
          ,
          <source>Annual report on the ageing society</source>
          ,
          <year>2023</year>
          . https://www8.cao.go.jp/kourei/whitepaper/w-2023/html/zenbun/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Hardware-aware approach to deep neural network optimization</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>559</volume>
          (
          <year>2023</year>
          )
          <fpage>126808</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>YUE</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>MENG</surname>
          </string-name>
          ,
          <article-title>Yolo-sm: a lightweight single-class multi-deformation object detection network</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>YUE</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>MENG</surname>
          </string-name>
          ,
          <article-title>Yolo-msa: A multi-scale stereoscopic attention network for empty-dish recycling robots</article-title>
          ,
          <source>IEEE Transactions on Instrumentation and Measurement</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Iot-based automatic deep learning model generation and the application on empty-dish recycling robots</article-title>
          ,
          <source>Internet of Things</source>
          <volume>25</volume>
          (
          <year>2024</year>
          )
          <fpage>101047</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Facial expression recognition with face mask using attention mechanism and metric loss</article-title>
          ,
          <source>in: 2023 International Conference on Advanced Mechatronic Systems (ICAMechS)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: Medical image computing and computer-assisted intervention - MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Rethinking atrous convolution for semantic image segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:1706.05587</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Spatial pyramid pooling in deep convolutional networks for visual recognition</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>37</volume>
          (
          <year>2015</year>
          )
          <fpage>1904</fpage>
          -
          <lpage>1916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Squeeze-and-excitation networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Coordinate attention for efficient mobile network design</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>13713</fpage>
          -
          <lpage>13722</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andreetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Mobilenets: Efficient convolutional neural networks for mobile vision applications</article-title>
          ,
          <source>arXiv preprint arXiv:1704.04861</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          , et al.,
          <article-title>Searching for mobilenetv3</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1314</fpage>
          -
          <lpage>1324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural architecture search with reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01578</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. S. M.</given-names>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erdogmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gholipour</surname>
          </string-name>
          ,
          <article-title>Tversky loss function for image segmentation using 3d fully convolutional deep networks</article-title>
          ,
          <source>in: International workshop on machine learning in medical imaging</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>Ssd: Single shot multibox detector</article-title>
          ,
          <source>in: Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I 14</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>