<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Method of Wisdom Site Safety Helmet Detection Based on Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaxiang Guo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huiyi Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Tao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rencai Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>China Technical Quality Department, China MCC17 Group Co., LTD</institution>
          ,
          <addr-line>Maanshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Computer, Anhui University of Technology</institution>
          ,
          <addr-line>Maanshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>168</fpage>
      <lpage>176</lpage>
      <abstract>
        <p>To detect helmet wearing in real time on construction sites, where targets are small, features are incomplete and interference factors are numerous, multi-scale, multi-branch feature extraction and a feature reconstruction module are applied to YOLOv5s to rebuild the model. Each level then contains convolution kernels of different sizes and depths, which capture scene details under receptive fields of different sizes. Feature reconstruction extracts finer-grained features and improves the robustness of the model. Experiments on a self-made construction site data set show that, compared with the YOLOv5s model, the improved algorithm recognizes helmet wearing better for long-distance, small-pixel targets and under heavy occlusion, while real-time performance is not affected, so it can meet the application needs of smart construction sites.</p>
      </abstract>
      <kwd-group>
        <kwd>YOLOv5s</kwd>
        <kwd>feature reconstruction</kwd>
        <kwd>target detection</kwd>
        <kwd>wisdom site</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic detection and management of safety helmets by means of image recognition is one of the main building blocks of a wisdom site. The construction site environment is complex: detection targets are blocked by various objects or by other people, so helmet detection in real scenes is subject to considerable interference. Because the camera position is fixed, the varying distance between target and camera adds further interference, and at long range the helmet becomes a small target for real-time automatic detection.</p>
      <p>
        Helmet detection methods have progressed from radio frequency identification technology to image processing technology. Early mobile radio frequency identification technology[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] could not confirm whether the helmet was actually worn, because the reader had a limited working range and could only detect whether the helmet was close to the worker. The RibGaiya and Silva algorithm[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] combined the frequency domain information of the image with the histogram of oriented gradients (HOG) to detect the human body, and then used the circular Hough transform to detect helmet wearing, which solved the difficulty of distinguishing the skin color of the human body from the helmet. However, it is easily disturbed by the occlusions and lighting changes of real scenes, which lowers the detection accuracy. In reference[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a hybrid descriptor consisting of local binary patterns, color histograms and Hu moment invariants is proposed to extract the features of hard hats, and a hierarchical support vector machine is then constructed to classify them, which reduces the influence of environmental changes. But this method is not accurate enough for small objects such as safety helmets. The improved YOLOv3 method[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses dimension clustering of the target boxes to optimize target box selection, which improves helmet detection accuracy, but the model is complex and responds slowly.
      </p>
      <p>The YOLOv5 model offers fast response and high detection accuracy. It is regarded as one of the effective algorithms for real-time image recognition and is widely used in autonomous driving, wisdom sites and other fields. However, the YOLOv5 network model is also susceptible to noise in some multi-object scenes.</p>
      <p>To solve this problem, this paper builds on YOLOv5s from the YOLOv5 series and applies a multi-scale feature extraction module in its model head to improve the extraction of features of different sizes. Furthermore, a feature reconstruction module is proposed to improve the ability of the model to extract fine-grained features and to improve the robustness of the algorithm, making it better suited to safety helmet inspection on wisdom sites.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Algorithm design</title>
      <p>Figure 1. Block diagram of the improved YOLOv5s model, with some modules annotated.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Multi-scale feature extraction method</title>
      <p>
        Since people wearing safety helmets appear at different positions and shooting angles relative to the camera, a network model that extracts features with several different receptive fields recognizes the target object more accurately. The YOLOv5s model adopts ordinary convolution with a single type of kernel; the calculation process is shown in figure 2. The input feature is <inline-formula><tex-math>x</tex-math></inline-formula>. Each kernel has a single spatial size <inline-formula><tex-math>K_1^2</tex-math></inline-formula> and a depth equal to the number of input feature maps <inline-formula><tex-math>FM_i</tex-math></inline-formula>. Applying <inline-formula><tex-math>FM_o</tex-math></inline-formula> kernels with the same spatial resolution and depth to the <inline-formula><tex-math>FM_i</tex-math></inline-formula> input feature maps yields <inline-formula><tex-math>FM_o</tex-math></inline-formula> output feature maps.
      </p>
      <p>
        The multi-scale feature extraction module instead uses different types of convolution kernel to process the input feature[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The input features pass through multiple layers of convolution kernels <inline-formula><tex-math>K_1^2, K_2^2, \ldots, K_n^2</tex-math></inline-formula> with different sizes and types, each layer producing its own output features, and the outputs are finally concatenated together. The convolution kernel in each layer has a different size, and the kernel depth gradually decreases as the layer index increases. Generally, a small convolution kernel has a smaller receptive field and captures local details, whereas a large convolution kernel has a larger receptive field and captures the global semantic information of large targets. The output features of all layers have the same spatial size, so they can be concatenated.
      </p>
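      <p>Assume that the input of the multi-scale extraction module contains <inline-formula><tex-math>FM_i</tex-math></inline-formula> channels, that the convolution kernel resolutions of the layers are <inline-formula><tex-math>K_1^2, K_2^2, \ldots, K_n^2</tex-math></inline-formula> with depths <inline-formula><tex-math>FM_i, \frac{FM_i}{K_2^2/K_1^2}, \ldots, \frac{FM_i}{K_n^2/K_1^2}</tex-math></inline-formula>, and that the corresponding output feature dimensions are <inline-formula><tex-math>FM_{o1}, FM_{o2}, \ldots, FM_{on}</tex-math></inline-formula>. The dimension of the final output feature is</p>
      <disp-formula><tex-math>FM_{o1} + FM_{o2} + \cdots + FM_{on} = FM_o</tex-math></disp-formula>
      <p>The number of parameters and the floating-point operations (FLOPs) required by ordinary convolution are then</p>
      <disp-formula><label>(1)</label><tex-math>\mathrm{parameters} = K_1^2 \cdot FM_i \cdot FM_o</tex-math></disp-formula>
      <disp-formula><label>(2)</label><tex-math>\mathrm{FLOPs} = K_1^2 \cdot FM_i \cdot FM_o \cdot (W \cdot H)</tex-math></disp-formula>
      <p>The number of parameters and the FLOPs of the multi-scale feature extraction module are</p>
      <disp-formula><label>(3)</label><tex-math>\mathrm{parameters} = \sum_{l=1}^{n} K_l^2 \cdot \frac{FM_i}{K_l^2/K_1^2} \cdot FM_{ol}</tex-math></disp-formula>
      <disp-formula><label>(4)</label><tex-math>\mathrm{FLOPs} = \sum_{l=1}^{n} K_l^2 \cdot \frac{FM_i}{K_l^2/K_1^2} \cdot FM_{ol} \cdot (W \cdot H)</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>W</tex-math></inline-formula> and <inline-formula><tex-math>H</tex-math></inline-formula> are the width and height of the output feature map.</p>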
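      <p>As a rough illustration of how such parameter and FLOP counts behave, the Python sketch below compares one ordinary 3×3 convolution with a two-level multi-scale layer that uses 3×3 and 5×5 kernels. It counts k × k × depth × output channels parameters per level and multiplies by the output map area for the operation count. The helper names `conv_cost` and `multi_scale_cost`, the use of a `groups` argument to set the kernel depth, and all channel and feature-map sizes are assumptions made for this example, not values from the experiments.</p>
      <preformat><![CDATA[
```python
def conv_cost(k, in_ch, out_ch, out_w, out_h, groups=1):
    """Return (parameters, multiply-accumulate count) of one k x k convolution.

    Bias terms are ignored. The kernel depth is in_ch // groups, so a larger
    group count means a shallower kernel.
    """
    if in_ch % groups or out_ch % groups:
        raise ValueError("channel counts must be divisible by groups")
    depth = in_ch // groups
    params = k * k * depth * out_ch
    flops = params * out_w * out_h
    return params, flops


def multi_scale_cost(levels, in_ch, out_w, out_h):
    """Sum the cost of several levels given as (kernel_size, out_channels, groups)."""
    total_params = 0
    total_flops = 0
    for k, out_ch, groups in levels:
        params, flops = conv_cost(k, in_ch, out_ch, out_w, out_h, groups)
        total_params += params
        total_flops += flops
    return total_params, total_flops


# Hypothetical sizes: 64 input channels, 64 output channels, 40 x 40 output map.
standard = conv_cost(3, 64, 64, 40, 40)
# Two levels with 32 output channels each; the 5 x 5 level uses 4 groups (depth 16).
two_level = multi_scale_cost([(3, 32, 1), (5, 32, 4)], 64, 40, 40)

print("ordinary convolution:", standard)
print("two-level multi-scale:", two_level)
```
]]></preformat>
      <p>With these assumed sizes the larger kernel is paid for by its smaller depth, so the two-level layer needs no more parameters than the single 3×3 convolution.</p>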
      <p>If every layer of the multi-scale extraction method has the same number of output channels, the parameters and the computational cost are distributed evenly across the layers. The measured results (see Table 1) show that after adding the multi-scale feature extraction and feature reconstruction modules, the number of parameters increases from 16.3×10<sup>6</sup> to 29.6×10<sup>6</sup> and the number of iterations per second decreases by 0.31, so the detection ability of the model for small target objects improves at a small cost in speed.</p>
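      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Number of parameters and iterations per second of each model.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>Number of parameters</th>
              <th>Iterations per second</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>YOLOv5s</td>
              <td>16.3×10<sup>6</sup></td>
              <td>2.44</td>
            </tr>
            <tr>
              <td>YOLOv5s + multi-scale feature extraction</td>
              <td>21.1×10<sup>6</sup></td>
              <td>2.26</td>
            </tr>
            <tr>
              <td>YOLOv5s + multi-scale feature extraction + feature reconstruction</td>
              <td>29.6×10<sup>6</sup></td>
              <td>2.13</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To counter the vanishing gradient problem caused by increasing the depth of a deep neural network, a skip (residual) connection is adopted in the model: the original features are added back after the multi-layer convolution, as shown in figure 5.</p>
      <p>The input features are processed by two groups of different convolution kernels, and the results are concatenated to give the output features. Let <inline-formula><tex-math>x</tex-math></inline-formula> represent the input feature of the multi-scale feature extraction module and <inline-formula><tex-math>\mathrm{MS}(\cdot)</tex-math></inline-formula> represent the multi-scale convolution. With the residual link, the final output feature <inline-formula><tex-math>A</tex-math></inline-formula> is</p>
      <disp-formula><label>(5)</label><tex-math>A = x + \mathrm{MS}(x)</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>x</tex-math></inline-formula> and <inline-formula><tex-math>A</tex-math></inline-formula> are real-valued feature tensors of the same size.</p>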
    </sec>
    <sec id="sec-4">
      <title>2.2. Feature reconstruction module</title>
      <p>To improve the extraction of fine-grained features, a Feature Reconstruction Module (FRM) is further constructed. FRM takes the output <inline-formula><tex-math>A</tex-math></inline-formula> of the multi-scale feature extraction module as its input and introduces an attention mechanism, as shown in figure 6. The feature reconstruction module includes three convolution layers (shift convolution, ordinary convolution and cyclic grouping convolution) and an attention mechanism layer. The results of the three convolutions are summed with the output of the attention mechanism under a certain weight, and the features are reconstituted into feature maps of the same size as before.</p>
      <p>Shift convolutional layer: this layer cuts the input feature map into two parts and rejoins them. The purpose is to force the network to learn from the disconnected feature map, so that it attends to small features that go unnoticed under normal conditions.</p>
      <disp-formula><label>(6)</label><tex-math>A_s = \mathrm{Conv}(\mathrm{shift}(A_1, A_2))</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>A_1, A_2</tex-math></inline-formula> represent the two parts of the original feature and <inline-formula><tex-math>\mathrm{shift}(\cdot)</tex-math></inline-formula> represents the reassembly of the feature map, which is followed by convolution.</p>
      <p>Ordinary convolutional layer:</p>
      <disp-formula><label>(7)</label><tex-math>A_c = \mathrm{Conv}(A)</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\mathrm{Conv}(\cdot)</tex-math></inline-formula> represents the ordinary convolution operation.</p>
      <p>
        The cyclic grouping convolutional layer extracts information with convolution kernels of different scales inside a single convolutional layer[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: a different dilation rate is adopted for each input channel, and the same set of kernels and dilation rates is reused cyclically. Finally, group convolution is also used to improve the computational efficiency.
      </p>
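      <p>The idea of giving every (filter, input channel) pair its own dilation rate can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `cyclic_dilated_conv`, the cyclic rule `rates[(c + k) % len(rates)]`, the zero padding and the stride of 1 are all choices made for this example, and the channel grouping step is omitted.</p>
      <preformat><![CDATA[
```python
def cyclic_dilated_conv(feature, kernel, rates):
    """Convolve with a dilation rate that cycles over filters and channels.

    feature: nested list [C_in][H][W]
    kernel:  nested list [C_out][C_in][K][K], K odd
    rates:   list of dilation rates; filter c applied to channel k uses
             rates[(c + k) % len(rates)] (an assumed cyclic pattern)
    Returns a nested list [C_out][H][W] (stride 1, zero padding).
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    c_out, k_size = len(kernel), len(kernel[0][0])
    half = (k_size - 1) // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(c_out)]
    for c in range(c_out):
        for k in range(c_in):
            d = rates[(c + k) % len(rates)]
            for x in range(height):
                for y in range(width):
                    acc = 0.0
                    for i in range(-half, half + 1):
                        for j in range(-half, half + 1):
                            xi, yj = x + i * d, y + j * d
                            # Positions outside the map count as zero.
                            if 0 <= xi < height and 0 <= yj < width:
                                acc += kernel[c][k][i + half][j + half] * feature[k][xi][yj]
                    out[c][x][y] += acc
    return out


# One 5 x 5 channel with a single impulse in the centre, one 3 x 3 kernel of ones.
feature = [[[0.0] * 5 for _ in range(5)]]
feature[0][2][2] = 1.0
kernel = [[[[1.0] * 3 for _ in range(3)]]]

dense = cyclic_dilated_conv(feature, kernel, rates=[1])    # ordinary convolution
dilated = cyclic_dilated_conv(feature, kernel, rates=[2])  # dilation rate 2
```
]]></preformat>
      <p>With a rate of 1 the impulse spreads over the 3×3 neighbourhood of the centre; with a rate of 2 the same nine responses land two pixels apart, which shows how a larger dilation rate widens the receptive field without adding parameters.</p>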
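      <p>Let <inline-formula><tex-math>F \in \mathbb{R}^{C_{in} \times h \times w}</tex-math></inline-formula> denote the input feature and <inline-formula><tex-math>G \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}</tex-math></inline-formula> denote the convolution kernel, and let <inline-formula><tex-math>H, H' \in \mathbb{R}^{C_{out} \times h \times w}</tex-math></inline-formula> represent the output features of ordinary convolution and cyclic grouping convolution respectively. The conventional convolution is defined as</p>
      <disp-formula><label>(8)</label><tex-math>H_{c,x,y} = \sum_{k=1}^{C_{in}} \sum_{i=-\frac{K-1}{2}}^{\frac{K-1}{2}} \sum_{j=-\frac{K-1}{2}}^{\frac{K-1}{2}} G_{c,k,i,j} \, F_{k,\,x+i,\,y+j}</tex-math></disp-formula>
      <p>and the cyclic grouping convolution as</p>
      <disp-formula><label>(9)</label><tex-math>H'_{c,x,y} = \sum_{k=1}^{C_{in}} \sum_{i=-\frac{K-1}{2}}^{\frac{K-1}{2}} \sum_{j=-\frac{K-1}{2}}^{\frac{K-1}{2}} G_{c,k,i,j} \, F_{k,\,x+iD_{(c,k)},\,y+jD_{(c,k)}}</tex-math></disp-formula>
      <p>In equation (9), <inline-formula><tex-math>D_{(c,k)}</tex-math></inline-formula> is an element of <inline-formula><tex-math>D \in \mathbb{R}^{C_{out} \times C_{in}}</tex-math></inline-formula>, a matrix composed of the dilation rates along two orthogonal dimensions, the channel level and the filter level. Each element <inline-formula><tex-math>D_{(c,k)}</tex-math></inline-formula> is associated with a particular channel in a filter, so the entire matrix <inline-formula><tex-math>D</tex-math></inline-formula> can be interpreted as a mathematical representation of a lattice of convolution kernels in its dilation rate subspace. Connections exist between every input channel and every output channel, and the computational efficiency is not reduced.</p>
      <p>The output of the cyclic grouping convolutional layer is</p>
      <disp-formula><tex-math>A_p = \mathrm{PSConv}(A)</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\mathrm{PSConv}(\cdot)</tex-math></inline-formula> represents the cyclic grouping convolution, <inline-formula><tex-math>A</tex-math></inline-formula> is the input feature and <inline-formula><tex-math>A_p</tex-math></inline-formula> is the output feature.</p>
      <p>
        In order to increase the sensitivity of the network model to key targets, an additional feature attention mechanism layer[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is added to the feature reconstruction module. Figure 8 shows its calculation process.
      </p>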
      <p>In Figure 8, Csp2-1 is an existing module of YOLOv5s, and the output features of the multi-scale feature extraction are taken as the input of the attention layer. FC denotes a fully connected layer. A global maximum pooling operation is first performed on the input feature A, which reduces its dimension to 1×1×C. The first fully connected layer then reduces the dimension, the result is activated by a rectified linear unit (ReLU), and the second fully connected layer restores the original dimension. The activation value of each channel is then multiplied by the original feature, and the product is used as the input feature of the next level. By learning a weight coefficient for each channel, the layer enhances important features and weakens unimportant ones, so that the extracted features are more targeted.</p>
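      <p>The channel attention steps described above (global maximum pooling, dimension reduction, ReLU, dimension restoration, channel-wise multiplication) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `channel_attention`, the omission of bias terms, the toy weights and the final sigmoid gate (borrowed from squeeze-and-excitation networks) are assumptions of this example and are not specified in the text.</p>
      <preformat><![CDATA[
```python
import math


def channel_attention(feature, w1, w2):
    """Re-weight the channels of a feature map.

    feature: nested list [C][H][W]
    w1:      [C_reduced][C] weights of the first fully connected layer
    w2:      [C][C_reduced] weights of the second fully connected layer
    Returns (scaled feature, per-channel gates).
    """
    # Global maximum pooling: one value per channel (1 x 1 x C).
    squeezed = [max(max(row) for row in channel) for channel in feature]
    # First fully connected layer reduces the dimension, then ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(weights, squeezed))) for weights in w1]
    # Second fully connected layer restores the original dimension.
    restored = [sum(w * h for w, h in zip(weights, hidden)) for weights in w2]
    # Sigmoid gate (assumed, as in SE-Net) keeps each channel weight in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-v)) for v in restored]
    # Multiply every channel of the original feature by its gate.
    scaled = [[[g * v for v in row] for row in channel]
              for g, channel in zip(gates, feature)]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, reduced to 1 hidden unit.
feature = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.0, 0.0], [0.0, 1.0]]]
w1 = [[1.0, -1.0]]
w2 = [[1.0], [-1.0]]
scaled, gates = channel_attention(feature, w1, w2)
```
]]></preformat>
      <p>In this toy case the first channel receives a gate close to 1 and the second a gate close to 0, which is the enhancing and weakening effect the attention layer is meant to learn.</p>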
      <p>The final output feature of the feature reconstruction module combines the outputs of the three convolution layers with the output of the attention mechanism, using a weight that is a hyper-parameter. The reconstructed feature is then fed into the YOLOv5s model to improve its real-time detection of safety helmets on smart construction sites.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Analysis of experimental results</title>
    </sec>
    <sec id="sec-6">
      <title>3.1. Data sources and preprocessing</title>
      <p>The experimental data set in this paper is built from pictures taken by cameras on construction sites, related pictures crawled from Google and Baidu, and pictures from the publicly released Safety-Helmet-Wearing-Dataset. The Safety-Helmet-Wearing-Dataset contains a large number of classroom self-study images taken by cameras, which do not match the detection task in real scenes, so these images are deleted to clean the public data set, and more than 500 images with many disturbing factors are added. Compared with the publicly available hardhat data sets, the newly created data set includes more images with occlusions and more images in which the helmet is a small target. The collected data also cover workers with and without helmets in different environments, at different resolutions and on different construction sites.</p>
      <p>The data set consists of 7,495 images. It was annotated in XML format (PASCAL VOC format[8]) with the labeling software LabelImg; people wearing helmets were labeled as hat and people without helmets were labeled as people. The data set is divided into a training set and a test set at a ratio of 8:2: 6874 images are used for training and 1621 images for testing. The annotated objects include heads wearing helmets (front view) and normal heads (no helmet, or in profile). Once annotated, each image corresponds to an XML file with the same name as the image, which a script converts to a TXT file in YOLO format. Each line of the TXT file represents one labeled instance and has 5 columns, from left to right: the label category, the ratio of the box center abscissa to the image width, the ratio of the box center ordinate to the image height, the ratio of the box width to the image width, and the ratio of the box height to the image height.</p>
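      <p>The conversion from a PASCAL VOC annotation to the five-column YOLO line can be sketched as below. This is not the script used for the data set: the function name `voc_to_yolo`, the class index order in `CLASSES` and the sample annotation are assumptions made for illustration; only the label names hat and people and the column layout come from the text.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Label names from the text; the index order is an assumption of this sketch.
CLASSES = ["hat", "people"]


def voc_to_yolo(xml_text, classes=CLASSES):
    """Convert one PASCAL VOC annotation string into a list of YOLO lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = classes.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Box centre and size, each normalised by the image width or height.
        x_center = (xmin + xmax) / 2.0 / img_w
        y_center = (ymin + ymax) / 2.0 / img_h
        box_w = (xmax - xmin) / img_w
        box_h = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}")
    return lines


# Hypothetical annotation: one helmet box in a 640 x 480 image.
sample = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>hat</name>
    <bndbox><xmin>160</xmin><ymin>120</ymin><xmax>480</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""

yolo_lines = voc_to_yolo(sample)
print(yolo_lines[0])
```
]]></preformat>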
    </sec>
    <sec id="sec-7">
      <title>3.2. Analysis of experimental results</title>
      <p>The performance of the proposed algorithm is evaluated with the common evaluation indexes of object detection algorithms: mean average precision (mAP), precision (P) and recall (R). Experiments show that the multi-scale feature extraction module, which is added first, achieves the best effect with a double-layer structure. The module contains two groups of convolution kernels with different sizes, 3×3 and 5×5 respectively. In addition, the hyper-parameters of the multi-scale feature extraction module, such as the number of layers, the number of output channels in each layer, the kernel depths and the number of groups, can be adjusted flexibly to adapt to different detection tasks. The hyper-parameter in the reconstruction module is set to 0.1. The experimental results are shown in Table 2.</p>
      <p>Table 2. mAP upgrade: 0.4%, 0.7%.</p>
      <p>As can be seen from Table 2, the algorithm in this paper effectively improves the detection accuracy for both workers wearing safety helmets and workers not wearing them. With the original YOLOv5s model, the mean average precision (mAP) over the two classes was 91.7%. After adding the multi-scale feature extraction module and the feature reconstruction module (FRM), the mAP increased by 0.7 percentage points to 92.4%. Among them, the accuracy of detecting workers without helmets increased by 0.6 percentage points to 93.6%.</p>
      <p>A total of 210 images from the data set with obstructions and long shooting distances (small field of view) were used in the anti-interference experiment (Table 3), in which the mAP of the improved model was 1.9% higher. This shows that the proposed algorithm detects more accurately than the YOLOv5s model under varying distances, pixel sizes and obstacles, and that it can meet the accuracy requirements of helmet inspection in complex working environments.</p>
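      <p>For reference, the evaluation indexes used in this section (precision, recall and mAP) can be computed as in the following sketch. The text does not state which average precision variant was used, so the all-point interpolation of the PASCAL VOC development kit is assumed here; the function names and the sample numbers are hypothetical and are not results from the experiments.</p>
      <preformat><![CDATA[
```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(matches, num_ground_truth):
    """Area under the precision-recall curve for one class.

    matches: detections sorted by descending confidence; True marks a true positive.
    Uses all-point interpolation (an assumption of this sketch).
    """
    tp = fp = 0
    recalls, precisions = [], []
    for is_true_positive in matches:
        if is_true_positive:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    recalls = [0.0] + recalls + [1.0]
    precisions = [0.0] + precisions + [0.0]
    # Make the precision envelope non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical counts and detections, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([True, False, True], num_ground_truth=2)
m_ap = mean_average_precision({"hat": 0.9, "people": 0.8})
```
]]></preformat>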
    </sec>
    <sec id="sec-8">
      <title>4. Conclusion</title>
      <p>YOLOv5s is currently recognized as one of the effective real-time image detection algorithms and is widely used in fields with demanding real-time requirements. To apply it to the real-time detection of helmet wearing on construction sites, where targets are small, features are incomplete and interference factors are numerous, this paper applies multi-scale, multi-branch feature extraction and feature reconstruction to YOLOv5s to rebuild the model. Experiments show that, compared with the YOLOv5s model, the improved algorithm recognizes scenes better when detecting remote, small-pixel targets and targets under heavy occlusion while remaining real-time, and that it can meet the application requirements of wisdom construction sites.</p>
      <p>To further improve the detection accuracy, multiple cameras can be arranged to observe the same scene from different angles, and their detections can then be combined.</p>
    </sec>
    <sec id="sec-9">
      <title>5. References</title>
      <p>[8] EVERINGHAM M, WINN J. The PASCAL visual object classes challenge 2012 (VOC2012)
development kit. (2019-03-20) [2022-8-31]. URL:
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>KELM</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LAUSSAT</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MEINS-BECKER</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>Mobile passive radio frequency identification (RFID) portal for automated and rapid control of personal protective equipment (PPE) on construction sites</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <year>2013</year>
          ,
          <volume>36</volume>
          :
          <fpage>38</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>LI</surname>
            <given-names>Q R</given-names>
          </string-name>
          .
          <article-title>A Research and Implementation of Safety-helmet Video Detection System Based on Human Body Recognition</article-title>
          . Chengdu: University of Electronic Science and Technology of China,
          <year>2017</year>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <fpage>34</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>WU</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ZHAO</surname>
            <given-names>J S</given-names>
          </string-name>
          .
          <article-title>An intelligent vision-based approach for helmet identification for work safety</article-title>
          .
          <source>Computers in Industry</source>
          ,
          <year>2018</year>
          ,
          <volume>100</volume>
          :
          <fpage>267</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>SHI</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>CHEN</surname>
            <given-names>X Q</given-names>
          </string-name>
          ,
          <string-name>
            <surname>YANG</surname>
            <given-names>Y</given-names>
          </string-name>
          , et al.
          <article-title>Safety helmet wearing detection method of improved YOLOv3</article-title>
          .
          <source>Computer Engineering and Applications</source>
          ,
          <year>2019</year>
          ,
          <volume>55</volume>
          :
          <fpage>213</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>LIU</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ZHU</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SHAO</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Pyramidal convolution: rethinking convolutional neural networks for visual recognition</article-title>
          . (2020-06-20) [2022-8-31]. URL: https://arxiv.org/pdf/2006.11538.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>LI</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>YAO</surname>
            <given-names>A B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>CHEN</surname>
            <given-names>Q F</given-names>
          </string-name>
          .
          <article-title>PSConv: squeezing feature pyramid into one compact poly-scale convolutional layer</article-title>
          . In:
          <source>European Conference on Computer Vision</source>
          .
          <year>2020</year>
          :
          <fpage>615</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>HU</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SHEN</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SUN</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          //
          <source>Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          . Piscataway: IEEE
          ,
          <year>2018</year>
          :
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>