<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel DWT-based Encoder for Human Pose Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Romano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janusz Starczewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Intelligence, Częstochowa University of Technology</institution>
          ,
          <addr-line>al. Armii Krajowej 36, Częstochowa, 42-200</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Via dei Taurini 19, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>33</fpage>
      <lpage>40</lpage>
      <abstract>
        <p>The proposed approach to pose estimation is based on a Convolutional Neural Network with an encoding-decoding structure, a spatial pyramid based on the WASP structure in its bottleneck, and a Discrete Wavelet Transform (DWT) encoder. These techniques have already shown their capability to solve one of the main problems in the state of the art: the different fields of view (FoV) required to analyze the different possible sizes of a specific subject. We aim to address the faulty structure of the encoding part of modern CNN-based networks by using a DWT encoder and WASP. This work also aims to demonstrate, from a more general point of view, the advantages of a DWT encoder in any CNN-based approach to pose estimation and object detection, for example with several subjects in the same image or within a video, given the almost redundant use of the most common CNN encoding structures such as ResNet-101, U-Net or VGG16-19. We run our tests using a U-Net-based CNN in order to evaluate the importance of the results of the DWT encoder also in the decoding part, by cropping them at the last layers of the network. This is necessary because border pixels that could be useful for evaluating the result are lost during encoding.</p>
      </abstract>
      <kwd-group>
        <kwd>Discrete Wavelet Transform</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>WASP</kwd>
        <kwd>Atrous Convolution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>… by different works in image classification such as [7, 8], in object detection approaches as in [1, 9], or in other solutions [10, 11]. With these premises, our work focuses on the application of a U-Net-based model with a multi-level decomposition (MLD) of the image as an encoder parallel to U-Net. The information gathered at each layer of the DWT encoder is concatenated to the corresponding layer of the U-Net encoder and propagated to the decoder. For feature extraction at the U-Net bottleneck we instead use a WASP structure, which produces the feature map of the image at different fields of view after the encoding part and passes it to the decoder. In this way the encoder does not lose the information eliminated by the down-sampling operation, because it is reconstructed from the wavelet information passed forward to the decoder.</p>
      <p><bold>2. Related Works</bold></p>
      <p>Two main aspects are considered in this work: first, the works that apply wavelets to different fields such as image segmentation or object detection; second, the common structures and modern approaches currently used in pose estimation. This work analyzes approaches from both aspects, used in combination with common pre-processing practice for the task of human pose estimation. There are many pose estimation systems [12, 13, 14]; among them, UniPose [15] is a pose estimation method based on a so-called WASP structure as the central part of the bottleneck of the CNN. It is based on layers with different dilations, obtained by applying a rate parameter according to the formula:
        <disp-formula>
          <label>(1)</label>
          <tex-math>y[i, j] = \sum_{n_1=0}^{N_1} \sum_{n_2=0}^{N_2} I[i - r\,n_1,\; j - r\,n_2] \, k[n_1, n_2]</tex-math>
        </disp-formula>
        where <italic>r</italic> is the rate of dilation, <italic>I</italic> is the image, <italic>k</italic> is the kernel over the x- and y-axes, and <inline-formula><tex-math>n_1</tex-math></inline-formula>, <inline-formula><tex-math>n_2</tex-math></inline-formula> give the pixel position. In this way it is possible to obtain a larger FoV for the image and to connect it to a depth-wise convolution, obtaining features at a higher level of abstraction for the latent representation. UniPose reaches a Percentage of Correct Parts (PCP) that is considerably high for the current state of the art, but it is far from being as robust as more complex methods based on 3D models of poses. One of the most important approaches of this kind, based on a Cascaded Pyramid Network and aimed at giving a 3D representation of the scene, is [16]; we do not examine it further because it is too computationally expensive and is applicable only in specific environments, as in the case of multi-camera detection of the scene. A detailed explanation of the problems that can be encountered in this kind of task is given by PersonLab [17], where the Atrous approach is also analyzed for multiple subjects in the image. In that case a heatmap is established for the evaluation of the key points in a Gaussian map, and in addition a segmentation of the image is considered together with long-, medium- and short-range offsets for each key point of the subject, in order to establish the relationship between key points belonging to different subjects and those in the same segmented part of the image. In these cases it is also important to mention, as in [18], the WASPv2 version used in combination with HRNet structures without a following decoding structure, which performs a cascade of Atrous convolutions at increasing rates to gain efficiency.</p>
      <p>What follows explores past work on the application of the DWT to a variety of state-of-the-art methods; Section 3 explains how our approach differs, and the conclusion how the results are improved by it. Regarding possible applications of the Discrete Wavelet Transform (DWT) to different state-of-the-art methods, we consider fundamental the contribution given by D-Unet, a dual encoder used for different purposes such as image segmentation and object detection [19]. The DWT-based encoder is an important addition to the state of the art because it demonstrated its superiority in reconstructing information lost in the encoding part of the neural network, as well as a superior capability with respect to other methods such as the steganalysis rich model (SRM), in which the layer extracts the image noise, providing additional evidence for the classification of multiple types of image key points such as shoulders, elbows and faces. In [20], the problems related to the use of down-sampling and up-sampling operations have instead been demonstrated, together with the corresponding interpolation needed to reconstruct the original image from the global feature map at the start of the decoding part. As a last consideration, it is important to cite the capabilities demonstrated by spatial pyramids in obtaining low-resolution feature maps with global features when used with different FoV and dilation parameters, the so-called Atrous Spatial Pyramid Pooling (ASPP), which handles different scales, sizes and aspect ratios of the subjects better than simpler architectures.</p>
      <p><bold>3. Model Architecture</bold></p>
      <p>The complete structure of the model is separated into three parts: the encoder and decoder structure typical of U-Net, with propagation of the information from encoding to decoding layers; the bottleneck, where the waterfall Atrous spatial pyramid (WASP) is used to obtain feature maps at different fields of view; and the parallel encoder used for the multi-level decomposition of each image channel, with the corresponding concatenation of the low-pass representation in the layers of U-Net delegated to the forward propagation of the information to the decoder. The complete architecture is shown in the architecture figure, where each block represents a layer of U-Net with a down-sampling operation, dropout and batch normalization. We selected two datasets for training and validation. The first is COCO, containing 40000 images. A more specific and generally more challenging dataset for this kind of task is LSP, which is used in combination with a small part of COCO to obtain a single dataset of 3600 samples, with the remaining samples used for testing and validation. The LSP dataset includes data modified with added noise, which gives a good assessment of the network performance for single-person pose estimation; even in this case, one of the main problems is occluded limbs.</p>
      <p>As mentioned in [21], Fully Convolutional Networks (FCN) are the most used kind of CNN in this field, and all are structured as an encoder-decoder with an up-sampling procedure that reconstructs the resolution and restores the data lost in the encoding part. In this section we assume that the structure is similar to U-Net. However, it is important to remark that these kinds of structures already establish state-of-the-art results without thorough consideration of other methods of image feature extraction. As a first approach we considered a solution based on the generation of a heat map, but in that case its construction is very similar to the binary segmentation of [19], which can lead to problems such as the need for a separate heat map for each key point in order to connect them to each other correctly even when they overlap. A simple binary segmentation is therefore not useful for overlapping limbs and articulations, and working separately on each key point increases the computational complexity. For these reasons we chose a regression layer with a vector representing the scaled coordinates of the points in the image that our model has to interpolate. The reason for choosing the U-Net structure instead of the more common VGG or ResNet is the possibility of propagating the information produced by the wavelet coefficients everywhere in the layers between encoding and decoding. U-Net allows the network to propagate context information to higher-resolution layers; exploiting this capability, the information propagated in our neural network is the information gathered from the wavelets in the encoder layers, concatenated layer by layer to the information obtained from the network's encoder. This leads to a distance function, weighted with respect to the subject dimension in the image, that we have to minimize (e.g. a simple MSE with respect to the position) in order to find the best function that interpolates the position of the key points with respect to the image information. In this way we want to be able to build augmented information for an image of very low dimension and use it to infer information that is invisible to a simple U-Net encoder-decoder.</p>
      <p>WASP is designed with the goal of reducing the number of parameters, in order to deal with memory constraints, and of solving the main issue of Atrous convolutions by using different FoV for the global feature representation of the image. In this part we deal with the manipulation of the latent representation and with how it influences the decoding part. By varying the parameters of this part it is possible to notice how these variations lead to different results from a PCP viewpoint and to more robustness to distant subjects. In one of our tests in particular, the subject of the image was varied from subjects very near to the camera to more distant ones, taken from a different dataset used for human pose estimation from UAV. The results for a single person in front of the camera are superior, but when the limbs of the subject consist of very small groups of pixels a more pixel-by-pixel analysis is needed. In terms of results, we obtained a level of PCP for 50-epoch training that shows the challenging properties of the images. The parameters modified are not only the kernel sizes but also the dilation (rate) parameters, obtaining in this way a general FoV of the image. We also tried different sizes for the latent representation, which is needed in order to use high dilation, a larger FoV, and more parallel levels of FoV concatenated for the decoding part. The rate parameters of the layers vary among 1, 2 and 6, while the kernel dimensions vary among 3, 5 and 7. Another important fact is the presence of 1x1 convolutions before the concatenation, which are useful to manipulate the dimension of the data without loss of features from the local to the global feature map.</p>
      <p><bold>4. The DWT-based Encoder</bold></p>
      <p>The methodologies applied to construct the information related to the analysis of the image in frequency at different scales led us to many different choices, from the application of Gaussian or Sobel filters to the use of an SRM structure as in [22]. In the end, the most significant result was established by the DWT encoder with the multi-scale decomposition of the image. To build it, we constructed a sequence of layers that apply low-pass and high-pass filters to an image and generate relevant Haar features for the localization of relevant key points. The multi-level decomposition method, called the DWT-based encoder in this work, generates these coefficients over all three directions in the image (vertical, diagonal and horizontal) together with the approximation, with the detail coefficients passed to the next layer over different thresholds, as explained in [23], [24] or [25]. From a more mathematical viewpoint, each frequency component can be defined in matrix form for a 2D input as:
        <disp-formula>
          <label>(2)</label>
          <tex-math>X_{ll} = \Gamma X \Gamma^{T}</tex-math>
        </disp-formula>
        where <inline-formula><tex-math>X</tex-math></inline-formula> is the input, <inline-formula><tex-math>X_{ll}</tex-math></inline-formula> is the low-frequency component, and the high-frequency components are denoted <inline-formula><tex-math>X_{lh}</tex-math></inline-formula>, <inline-formula><tex-math>X_{hl}</tex-math></inline-formula> and <inline-formula><tex-math>X_{hh}</tex-math></inline-formula>. The low-pass and high-pass filters are defined as <inline-formula><tex-math>\Gamma</tex-math></inline-formula> and <inline-formula><tex-math>K</tex-math></inline-formula>:
      </p>
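      <p>
        <disp-formula>
          <tex-math><![CDATA[
\Gamma = \begin{pmatrix}
\cdots & \cdots & \cdots &        &        &        \\
\cdots & l_{-1} & l_{0}  & l_{1}  & \cdots &        \\
       &        & \cdots & l_{-1} & l_{0}  & l_{1} \; \cdots \\
       &        &        &        & \cdots & \cdots
\end{pmatrix},
\qquad
K = \begin{pmatrix}
\cdots & \cdots & \cdots &        &        &        \\
\cdots & h_{-1} & h_{0}  & h_{1}  & \cdots &        \\
       &        & \cdots & h_{-1} & h_{0}  & h_{1} \; \cdots \\
       &        &        &        & \cdots & \cdots
\end{pmatrix}
]]></tex-math>
        </disp-formula>
      </p>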
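      <p>The matrix form of the decomposition can be illustrated with the Haar filters mentioned in this section. The following is a minimal sketch, assuming NumPy and orthonormal Haar filters; the function names <monospace>haar_analysis_matrices</monospace>, <monospace>dwt2_matrix</monospace> and <monospace>idwt2_matrix</monospace> are hypothetical and are not taken from the authors' implementation, and the assignment of the names lh and hl to rows or columns is an assumed convention. The inverse is included only as a sanity check of the decomposition.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def haar_analysis_matrices(n):
    """Return the (n/2 x n) low-pass and high-pass Haar analysis matrices.

    Each row holds one copy of the two filter taps, shifted by two
    columns per row, so the product also performs the down-sampling.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    low = np.zeros((n // 2, n))
    high = np.zeros((n // 2, n))
    s = 1.0 / np.sqrt(2.0)
    for i in range(n // 2):
        low[i, 2 * i] = s
        low[i, 2 * i + 1] = s
        high[i, 2 * i] = s
        high[i, 2 * i + 1] = -s
    return low, high


def dwt2_matrix(x):
    """One-level 2D Haar DWT of a 2D array, written as matrix products."""
    rows, cols = x.shape
    l_r, h_r = haar_analysis_matrices(rows)
    l_c, h_c = haar_analysis_matrices(cols)
    x_ll = l_r @ x @ l_c.T  # approximation (low-pass on rows and columns)
    x_lh = h_r @ x @ l_c.T  # detail: high-pass along rows (assumed naming)
    x_hl = l_r @ x @ h_c.T  # detail: high-pass along columns (assumed naming)
    x_hh = h_r @ x @ h_c.T  # detail: high-pass on both
    return x_ll, x_lh, x_hl, x_hh


def idwt2_matrix(x_ll, x_lh, x_hl, x_hh):
    """Inverse of dwt2_matrix, used here only to check the decomposition."""
    rows, cols = 2 * x_ll.shape[0], 2 * x_ll.shape[1]
    l_r, h_r = haar_analysis_matrices(rows)
    l_c, h_c = haar_analysis_matrices(cols)
    return (l_r.T @ x_ll @ l_c + h_r.T @ x_lh @ l_c
            + l_r.T @ x_hl @ h_c + h_r.T @ x_hh @ h_c)
```
]]></preformat>
      <p>Because the stacked low-pass and high-pass Haar matrices are orthogonal, the four sub-bands together carry the same energy as the input, which is why no information is lost when they are passed alongside the down-sampled feature maps.</p>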
      <p>The information produced for each layer is concatenated. The decomposition is used in the encoder, but not as a down-sampling operation that substitutes the one in the encoder; instead, its down-sampled version is added hierarchically to the layers as a parallel encoder, providing the three encoder layers (and, since the structure is U-Net, also the decoder) with the features stressed by each layer of the multi-level decomposition. As can be seen, low-pass filters are applied for the first, fifth and ninth images of each layer, and high-pass filters for the rest. This concatenated information is added hierarchically to the encoder layers, in particular to the 2nd, 3rd and 4th; each level divides the dimension of the image by <inline-formula><tex-math>2^{j}</tex-math></inline-formula>, where <italic>j</italic> is the number of the layer. To recap how wavelets work we refer to Haar wavelets as the mother wavelet, but we also analyze the performance of other wavelets such as Daubechies, which have already proved their capabilities in isolating high-frequency from low-frequency components in images and in isolating edges in all directions at different scales and resolutions, as in [2] and [24].</p>
      <sec id="sec-1-1">
        <p>Another remarkable fact is that we do not need to use the IDWT in decoding, since we have a simple multi-level decomposition without variation of the filters. We have just one gradient in common to consider for the loss minimization, evaluating the additional DWT features directly in the same loss of the CNN, as in the case of [15] with a different loss for each heat map, used to find for each key point a unique connection to the others for the skeleton construction. Our approach is based on the analysis of the information produced by these filters and on the improvement given by the analysis of the image by the DWT encoder as a substitute for the one based on SRM.</p>
        <p>To better understand the use of the DWT in this part, we give a recap of the basic concepts. Given a window function <italic>g</italic>, the windowed Fourier transform is usually found in the common form:
          <disp-formula>
            <label>(11)</label>
            <tex-math>F(\omega, \tau) = \int_{-\infty}^{+\infty} f(t)\, g(t - \tau)\, e^{-i \omega t}\, dt</tex-math>
          </disp-formula>
          It can be interpreted as the Fourier transform of <italic>f</italic> at the frequency <inline-formula><tex-math>\omega</tex-math></inline-formula>, localized by the window <italic>g</italic> in the neighborhood of <inline-formula><tex-math>\tau</tex-math></inline-formula>. Multiplying the signal <italic>f</italic>(<italic>t</italic>) by <italic>g</italic> and computing the Fourier coefficients gives an indication of the frequency content of <italic>f</italic> in a neighborhood of <inline-formula><tex-math>\tau</tex-math></inline-formula>; shifting the window from 0 yields a sequence of coefficients that give a representation of the image sensitive to certain frequencies. Now, considering <inline-formula><tex-math>g^{(\omega,\tau)}</tex-math></inline-formula> as the family of functions generated from a single <inline-formula><tex-math>L^{2}(\mathbb{R})</tex-math></inline-formula> function by phase-space translations <inline-formula><tex-math>(\omega, \tau)</tex-math></inline-formula>, where <inline-formula><tex-math>\omega = 1/s</tex-math></inline-formula> ("coherent states"), an important property of this family is the capability to completely reconstruct <italic>f</italic> from the phase-space projections <inline-formula><tex-math>\langle g^{(\omega,\tau)}, f \rangle</tex-math></inline-formula>. This is due to the fact that this mapping is an isometry, which, as mentioned in [24], [26] or [27], is given by the so-called resolution of the identity; it implies that <italic>f</italic> can be written as:
          <disp-formula>
            <label>(12)</label>
            <tex-math>f = \frac{1}{2\pi} \int d\omega \int d\tau \, \langle g^{(\omega,\tau)}, f \rangle \, g^{(\omega,\tau)}</tex-math>
          </disp-formula>
          In a similar way, wavelets are a family of functions <inline-formula><tex-math>\psi^{(a,b)}</tex-math></inline-formula> derived from a single function but indexed by two labels, <italic>a</italic> for frequency (with <inline-formula><tex-math>s = 1/\omega</tex-math></inline-formula> as scale factor) and <italic>b</italic> for position (translation), for which the resolution of the identity is written as:
          <disp-formula>
            <label>(13)</label>
            <tex-math>f = C_{\psi}^{-1} \int \frac{da}{a^{2}} \int db \, \langle \psi^{(a,b)}, f \rangle \, \psi^{(a,b)}</tex-math>
          </disp-formula>
          Taking (14) into account, we can in this way completely redefine <italic>f</italic> with a set of coefficients over a direction, generated by simple filter application, re-defining the image using:
          <disp-formula>
            <label>(14)</label>
            <tex-math>W(a, b) = \frac{1}{\sqrt{|a|}} \sum_{t} f(t)\, \psi\!\left[\frac{t - b}{a}\right]</tex-math>
          </disp-formula>
          The parameters are usually chosen as <inline-formula><tex-math>a = 2^{j}</tex-math></inline-formula> for the dilation, in order to have a discrete dilation by taking powers of a fixed base with exponent <italic>j</italic>, and <inline-formula><tex-math>b = k\,2^{j}</tex-math></inline-formula> for the translation of the wavelet, with <italic>k</italic> as direction, rewriting <inline-formula><tex-math>\psi^{(a,b)}</tex-math></inline-formula> as <inline-formula><tex-math>\psi_{j,k}</tex-math></inline-formula>. Given a function <italic>f</italic>(<italic>t</italic>) as the input signal, which will be our image, the wavelet coefficient has a large amplitude near sharp transitions of pixels such as edges; the coefficients with <inline-formula><tex-math>\langle \psi_{j,k}, f \rangle \ge T</tex-math></inline-formula> are obtained over the threshold <italic>T</italic>, which is varied over the frequency. From the corresponding high-pass and low-pass filters we obtain four results, the approximation and the three detail coefficients, for each of the three layers, to pass to the next one in the multi-level decomposition, each defined as:
          <disp-formula>
            <label>(15)</label>
            <tex-math>\begin{cases}
A_{2^{j}} f = \big( f(x, y) * \phi_{2^{j}}(-x)\, \phi_{2^{j}}(-y) \big)(2^{-j} n,\, 2^{-j} m) \\
D^{1}_{2^{j}} f = \big( f(x, y) * \phi_{2^{j}}(-x)\, \psi_{2^{j}}(-y) \big)(2^{-j} n,\, 2^{-j} m) \\
D^{2}_{2^{j}} f = \big( f(x, y) * \psi_{2^{j}}(-x)\, \phi_{2^{j}}(-y) \big)(2^{-j} n,\, 2^{-j} m) \\
D^{3}_{2^{j}} f = \big( f(x, y) * \psi_{2^{j}}(-x)\, \psi_{2^{j}}(-y) \big)(2^{-j} n,\, 2^{-j} m)
\end{cases}</tex-math>
          </disp-formula>
        </p>
        <p>Another important remark concerns how the output is associated with and added to the neural network, which is related to the method of concatenation (fusion) and the corresponding result. With hierarchical fusion, an increase has been shown both in the speed of loss convergence and in the evaluated PCP metric.</p>
        <p>Table 1: Parameters used in the tests.</p>
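        <p>The hierarchical fusion described in this section can be sketched as follows. This is a minimal illustration assuming NumPy, orthonormal Haar filters, a 160x160 three-channel input and three decomposition levels; the encoder feature maps and their channel counts (64, 128, 256) are hypothetical placeholders, and <monospace>haar_subbands</monospace>, <monospace>multilevel_features</monospace> and <monospace>fuse</monospace> are illustrative names rather than the authors' code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def haar_subbands(x):
    """One Haar level on an (H, W, C) array with even H and W.

    Returns the four sub-bands stacked on the channel axis, shape
    (H/2, W/2, 4C), together with the approximation band alone.
    """
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # approximation
    lh = (a + b - c - d) / 2.0  # detail across rows
    hl = (a - b + c - d) / 2.0  # detail across columns
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=-1), ll


def multilevel_features(image, levels=3):
    """Multi-level decomposition: level j has spatial size H/2^j x W/2^j."""
    feats = []
    current = image
    for _ in range(levels):
        bands, current = haar_subbands(current)
        feats.append(bands)
    return feats


def fuse(encoder_maps, wavelet_feats):
    """Concatenate wavelet features to encoder maps of equal resolution."""
    return [np.concatenate([e, w], axis=-1)
            for e, w in zip(encoder_maps, wavelet_feats)]


image = np.random.default_rng(0).random((160, 160, 3))
wavelet_feats = multilevel_features(image, levels=3)
# Hypothetical encoder outputs at the three fused layers.
encoder_maps = [np.zeros((80, 80, 64)),
                np.zeros((40, 40, 128)),
                np.zeros((20, 20, 256))]
fused = fuse(encoder_maps, wavelet_feats)
```
]]></preformat>
        <p>Only the approximation band is fed to the next level, so each level halves the spatial size again while the fused tensors gain twelve wavelet channels per layer under these assumptions.</p>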
      </sec>
      <sec id="sec-1-2">
        <p>The loss is computed on the key point coordinates as:
          <disp-formula>
            <label>(16)</label>
            <tex-math>L_{2} = (x - \hat{x})^{2} + (y - \hat{y})^{2}</tex-math>
          </disp-formula>
          where <inline-formula><tex-math>(x, y)</tex-math></inline-formula> are the true coordinates of a key point and <inline-formula><tex-math>(\hat{x}, \hat{y})</tex-math></inline-formula> the predicted ones. With this kind of loss we can define a suitable metric that depends on these coordinates, with the evaluation based on a threshold that indicates whether we are approaching the desired result. The PCK denotes an evaluation of the joints by a formula of the form:
          <disp-formula>
            <label>(17)</label>
            <tex-math>L_{2}(\hat{p}, p) \le 0.5\, L_{2}(s)</tex-math>
          </disp-formula>
          In other words, the segment given by the predicted endpoints must lie within a fraction of the length of the ground-truth segment: the distance between the prediction <inline-formula><tex-math>\hat{p}</tex-math></inline-formula> and the ground truth <inline-formula><tex-math>p</tex-math></inline-formula> has to be smaller than half of the effective length of the ground-truth segment <inline-formula><tex-math>s</tex-math></inline-formula> (threshold = 0.5), as mentioned in [1]. Alternatively, it is possible to use as metric the Object Keypoint Similarity (OKS), in the form:
          <disp-formula>
            <label>(18)</label>
            <tex-math>\mathrm{OKS} = \frac{\sum_{i} \exp\!\left( -d_{i}^{2} / 2 s^{2} k_{i}^{2} \right) \delta(v_{i} &gt; 0)}{\sum_{i} \delta(v_{i} &gt; 0)}</tex-math>
          </disp-formula>
          where <inline-formula><tex-math>d_{i}</tex-math></inline-formula> is the distance between the predicted and the true key point <italic>i</italic>, <inline-formula><tex-math>s</tex-math></inline-formula> is the object scale, <inline-formula><tex-math>k_{i}</tex-math></inline-formula> is a per-key-point constant and <inline-formula><tex-math>v_{i}</tex-math></inline-formula> is the visibility flag.
        </p>
        <p>Another possible metric, the one adopted in this work, is the PCP, in which a detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold. In this work you can find an example of the results in terms of accuracy.</p>
      </sec>
      <sec id="sec-1-3">
        <p>The tests are evaluated for both the COCO and LSP datasets, but in the end the evaluation was done on a combination of the two. The accuracy results are evaluated not from the first epoch but from the 50th, while the loss is shown from the beginning. It is important to mention that the approaches were tested on 160x160 images but also on different dimensions; increasing them, the results improved even more with the DWT encoder structure than with a simple CNN, with the loss over 100 epochs going from a value of 110.07 to 120.3. This is not a big increase, but considering that the PCP went from 65% to 71%, and that the MSE loss of an interpolation problem usually starts above 1000 for 160x160 images and at 300 for 80x80 images, we can deduce that the increased complexity of the interpolation problem is compensated by the information provided by the DWT, confirming its usefulness for the analysis of complex data. It would be interesting to try 1280x720 images as a future improvement. Some of the most challenging aspects of this kind of problem are reported in the fourth image at the top, evaluated with our method and compared, at parity of training epochs and parameters, to a simple CNN with WASP without the addition of the DWT. Better results can be obtained, without changing the parameters, by eliminating the down-sampling and using images with a better resolution (instead of 160x160), as in [15], where the image dimensions are around 1280x720, even with denoised images.</p>
        <p>The images have been chosen to give a basic representation of the main problems of human body pose estimation, due to the complexity of the pose and the occlusion of the limbs of the subject. Looking at the first line, the first picture, beyond a small negligible error, denotes a very good result in the pose estimation of the subject. The third image is instead more complex to evaluate because of the occlusion of the limbs and the complexity of the pose. A different kind of problem is represented by the pose of the second and third images, where the entire image, down-sampled by a very high factor, leads to a low border resolution of the subject and therefore to imprecision in the evaluation of the key point positions. Note that the presence of padding in the CNN leads to the shifted results in the figure, as in object detection [28]. In another important test we compared the DWT-based method on the COCO and LSPII datasets, where the data are more problematic and many subjects assume complex poses. These tests were also evaluated with respect to a PCK value, but in the end we used the PCP because it is more appropriate for a bottom-up approach and easier to implement and evaluate; similar results were initially obtained with respect to the PCK.</p>
        <p>Figure 2: Test with Haar DWT used with pretrained CNN Daubechies.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>7. Conclusion</title>
      <p>The wavelet procedure actually increases the capabilities of the network through the DWT encoder, and this result shows how it can also be extended to different fields that include this kind of structure. Different results are obtained using more epochs or layers in the convolution structure: going from 4 to 3 layers gave lower results in PCP terms, and more complicated CNNs are not able to compete with the wavelet encoder and show clear signs of overfitting, so we preferred not to train for more than 100 epochs and to use just 4 convolution layers. We extended these results to a different topic, human pose estimation; the initial objective was to establish results with an object detection approach to pose estimation, but it was discarded because, as already said, the top-down approach proved to be superior. In addition, we addressed the faulty general encoder-decoder structure common to the most used CNNs in this field, such as VGG-16 and ResNet, showing a lower loss of information during encoding by using the DWT and the analysis of the wavelet information of the images.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eichner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marin-Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <article-title>2D articulated human pose estimation and retrieval in (almost) unconstrained still images</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>99</volume>
          (
          <year>2012</year>
          )
          <fpage>190</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Isaacs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Foo</surname>
          </string-name>
          ,
          <article-title>Hand pose estimation for American Sign Language recognition</article-title>
          ,
          <source>in: Proceedings of the Thirty-Sixth Southeastern Symposium on System Theory</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
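          <article-title>Human attention assessment using a machine learning approach with GAN-based data augmentation technique trained using a custom dataset</article-title>
          ,
          <source>OBM Neurobiology</source>
          <volume>6</volume>
          (
          <year>2022</year>
          ). doi:
          <pub-id pub-id-type="doi">10.21926/obm.neurobiol.2204139</pub-id>
          .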
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vincelli</surname>
          </string-name>
          ,
          <article-title>Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3118</volume>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi, G. Sciuto,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in igss by using wrnn predictors and gpu parallel solutions</article-title>
          , in:
          <source>2014 International Symposium on Power Electronics, Electrical Drives, Automation and Motion, SPEEDAM 2014</source>
          ,
          <publisher-name>IEEE Computer Society</publisher-name>
          ,
          <year>2014</year>
          , pp.
          <fpage>1077</fpage>
          -
          <lpage>1084</lpage>
          . doi:10.1109/SPEEDAM.2014.6872127.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <article-title>Toward work groups classification based on probabilistic neural network approach</article-title>
          , in:
          <source>Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)</source>
          , volume
          <volume>9119</volume>
          ,
          <publisher-name>Springer Verlag</publisher-name>
          ,
          <year>2015</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>89</lpage>
          . doi:10.1007/978-3-319-19324-3_8.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <article-title>Wavelet integrated cnns for noise-robust image classification</article-title>
          , in:
          <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <article-title>A multiscale image compressor with rbfnn and discrete wavelet decomposition</article-title>
          , in:
          <source>Proceedings of the International Joint Conference on Neural Networks</source>
          , volume
          <volume>2015-September</volume>
          ,
          <publisher-name>Institute of Electrical and Electronics Engineers Inc.</publisher-name>
          ,
          <year>2015</year>
          . doi:10.1109/IJCNN.2015.7280461.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <article-title>Automatic rgb inference based on facial emotion recognition</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>3092</volume>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>Using modularity metrics to assist move method refactoring of large systems</article-title>
          , in:
          <source>Proceedings - 2013 7th International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2013</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          . doi:10.1109/CISIS.2013.96.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>A multithread nested neural network architecture to model surface plasmon polaritons propagation</article-title>
          ,
          <source>Micromachines</source>
          <volume>7</volume>
          (
          <year>2016</year>
          ). doi:10.3390/mi7070110.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caprari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Vision-based holistic scene understanding for context-aware human-robot interaction</article-title>
          ,
          <volume>13196</volume>
          LNAI (
          <year>2022</year>
          )
          <fpage>310</fpage>
          -
          <lpage>325</lpage>
          . doi:10.1007/978-3-031-08421-8_21.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Brociek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cardia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Coppa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>Contagion prevention of covid-19 by means of touch detection for retail stores</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>3092</volume>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Avanzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Beritelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vaccaro</surname>
          </string-name>
          ,
          <article-title>Yolov3-based mask and face recognition algorithm for individual protection applications</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2768</volume>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Artacho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savakis</surname>
          </string-name>
          ,
          <article-title>Unipose: Unified human pose estimation in single images and videos</article-title>
          , in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Fast and robust multi-person 3d pose estimation from multiple views</article-title>
          ,
          <year>2019</year>
          . arXiv:1901.04111.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          ,
          <article-title>Innovative second-generation wavelets construction with recurrent neural networks for solar radiation forecasting</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>23</volume>
          (
          <year>2012</year>
          )
          <fpage>1805</fpage>
          -
          <lpage>1815</lpage>
          . doi:10.1109/TNNLS.2012.2216546.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Artacho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savakis</surname>
          </string-name>
          ,
          <article-title>Omnipose: A multi-scale framework for multi-person pose estimation</article-title>
          ,
          <year>2021</year>
          . arXiv:2103.10180.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>D-unet: A dimension-fusion u shape network for chronic stroke lesion segmentation</article-title>
          ,
          <source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
          <volume>18</volume>
          (
          <year>2021</year>
          )
          <fpage>940</fpage>
          -
          <lpage>950</lpage>
          . doi:10.1109/TCBB.2019.2939522.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Wavelet pooling for convolutional neural networks</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          ,
          <year>2015</year>
          . arXiv:1411.4038.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-I.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Cnn-based ternary classification for image steganalysis</article-title>
          ,
          <source>Electronics</source>
          <volume>8</volume>
          (
          <year>2019</year>
          ). URL: https://www.mdpi.com/2079-9292/8/11/1225. doi:10.3390/electronics8111225.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Q.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>A robust dwt-based video watermarking algorithm</article-title>
          ,
          <source>2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No.02CH37353)</source>
          <volume>3</volume>
          (
          <year>2002</year>
          ) III-III.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Daubechies</surname>
          </string-name>
          ,
          <article-title>The wavelet transform, time-frequency localization and signal analysis</article-title>
          ,
          <source>IEEE Transactions on Information Theory</source>
          <volume>36</volume>
          (
          <year>1990</year>
          )
          <fpage>961</fpage>
          -
          <lpage>1005</lpage>
          . doi:10.1109/18.57199.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sturm</surname>
          </string-name>
          ,
          <article-title>Stéphane Mallat: A wavelet tour of signal processing, 2nd edition</article-title>
          ,
          <source>Computer Music Journal - COMPUT MUSIC J</source>
          <volume>31</volume>
          (
          <year>2007</year>
          )
          <fpage>83</fpage>
          -
          <lpage>85</lpage>
          . doi:10.1162/comj.2007.31.3.83.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Daubechies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <article-title>Time-frequency localisation operators-a geometric phase space approach: Ii. the use of dilations</article-title>
          ,
          <source>Inverse Problems</source>
          <volume>4</volume>
          (
          <year>1988</year>
          )
          <fpage>661</fpage>
          -
          <lpage>680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mallat</surname>
          </string-name>
          ,
          <article-title>A theory for multiresolution signal decomposition: the wavelet representation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>11</volume>
          (
          <year>1989</year>
          )
          <fpage>674</fpage>
          -
          <lpage>693</lpage>
          . doi:10.1109/34.192463.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Spatial pyramid pooling in deep convolutional networks for visual recognition</article-title>
          ,
          <source>Lecture Notes in Computer Science</source>
          (
          <year>2014</year>
          )
          <fpage>346</fpage>
          -
          <lpage>361</lpage>
          . URL: http://dx.doi.org/10.1007/978-3-319-10578-9_23. doi:10.1007/978-3-319-10578-9_23.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>