A Novel DWT-based Encoder for Human Pose Estimation
Giorgio De Magistris1 , Matteo Romano1 , Janusz Starczewski2 and Christian Napoli1,3
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
2 Department of Computational Intelligence, Częstochowa University of Technology, al. Armii Krajowej 36, Częstochowa, 42-200, Poland
3 Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy


                                             Abstract
The proposed approach to pose estimation builds a Convolutional Neural Network (CNN) with an
encoder-decoder structure, a spatial pyramid based on the WASP structure in its bottleneck, and a
Discrete Wavelet Transform (DWT) encoder. These techniques have already shown that they can address
a main problem in the state of the art: the different fields of view (FoV) required to analyze the
different possible sizes of a subject. We aim to overcome the weaknesses of the encoding part of
modern CNNs by using the DWT encoder together with WASP. This work also aims to show, from a more
general point of view, the advantages that a DWT encoder could bring to any CNN-based approach to
pose estimation and object detection, for example with several subjects in the same image or in
video, given the almost redundant reliance on the best-known CNN encoding structures such as
ResNet-101, U-Net or VGG16-19. We run our tests on a U-Net-based CNN in order to evaluate how the
outputs of the DWT encoder also contribute to the decoding part, where they are cropped and passed
to the last layers of the network. This is necessary because border pixels that could be useful
for evaluating the result are lost during encoding.

                                             Keywords
                                             Discrete Wavelet Transform, Convolutional Neural Network, WASP, Atrous Convolution


SYSYEM 2022: 8th Scholar’s Yearly Symposium of Technology, Engineering and Mathematics, Brunek, July 23, 2022
demagistris@diag.uniroma1.it (G. De Magistris); janusz.starczewski@pcz.pl (J. Starczewski); cnapoli@diag.uniroma1.it (C. Napoli)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Pose estimation is important in many areas, from human detection and human pose estimation to
navigation systems for autonomous cars, and also in related fields such as object detection and
image segmentation. Different types of approach are possible. A top-down approach first detects
the human with a bounding box, as an object detection task, and then applies the pose estimation
algorithm, as in [1]. A bottom-up approach instead estimates the points from the image and then
recreates the human form through the "skeleton" given by the connection of these points. In our
case we use the top-down method to estimate these points, because the result we want to emphasize
is the improvement in performance at the encoding level, which concerns all structures of this
kind, even those used for object detection. Certain CNNs face various challenges in obtaining good
results in complex situations, and the novelty of our study could affect the capabilities of these
CNN structures in a different way. Current CNN-based methods for key point detection and for
estimating the pose of human limbs are all developed through encoder-decoder structures that are
very similar to each other. More specifically, these structures are based on a representation of
the image in a latent space, built through feature maps that go from a local to a global viewpoint
by means of a spatial pyramid used as the bottleneck of the CNN. For this type of task, various
technologies in the state of the art improve the performance of these CNNs, including the
management of the different fields of view (FoV) of the image needed to evaluate objects at
different scales; what we want to add with this work is the resolution of problems localized in
the encoding part of these networks. For single-subject pose estimation, the final product of this
kind of structure is either a simple set of points to interpolate or a heatmap of the image. The
heatmap is obtained by generating, in the pre-processing phase, a ground truth (GT) image with
Gaussian noise in the key point areas, which is used to evaluate a loss function based on the
percentage of correct key points between the GT image and the predicted key points. The challenges
in the state of the art are related not only to the technologies currently in use, but also to
possible occlusion of limbs due to the presence of several persons in the image, which requires
multi-pose estimation, and to low image quality, as in video, where very fast objects moving in
the image need less smooth corners for the analysis. Structures of this kind are used beyond human
pose estimation, for example for hand pose as in [2, 3, 4], where many kinds of wavelet applied to
the purpose are analyzed, and in other applications in the state of the art [5, 6], as also noted


Giorgio De Magistris et al. CEUR Workshop Proceedings                                                           33–40


by different works in image classification such as [7, 8], in object detection approaches such as
[1, 9], and in other solutions [10, 11]. On these premises, our work focuses on a U-Net-based
model in which a multi-level decomposition (MLD) of the image acts as an encoder parallel to the
U-Net one: the information gathered at each layer of the DWT encoder is concatenated to the
corresponding layer of the U-Net encoder and propagated to the decoder. For feature extraction at
the U-Net bottleneck we instead use a WASP structure, which produces the feature map of the image
at different fields of view after the encoding part and passes it to the decoder. In this way the
encoder does not lose the information eliminated by the down-sampling operation, because it is
reconstructed from the wavelet information passed forward to the decoder.

2. Related Works

This work builds on two main lines of research: first, works that apply wavelets to different
fields such as image segmentation or object detection; second, the common structures and modern
approaches currently used in pose estimation. We analyze approaches from both lines, used in
combination with the pre-processing practices common to the task of human pose estimation. There
are many pose estimation systems [12, 13, 14]; among them, UniPose [15] is a pose estimation
method based on a so-called WASP structure as the central part of the CNN bottleneck. It relies on
layers with different dilations, obtained by applying a rate parameter according to the formula:

\[ y[m, n] = \sum_{i=0}^{L_1} \sum_{j=0}^{L_2} I[m - r i,\; n - r j] \, k[i, j] \tag{1} \]

where r is the dilation rate, I is the image, k is the kernel over the x/y axes, (m, n) is the
output pixel position, and L_1 and L_2 bound the pixel positions spanned by the kernel. In this
way it is possible to obtain a larger FoV on the image and to connect it to a depth-wise
convolution, yielding more abstract features for the latent representation. UniPose reached a
Percentage of Correct Parts (PCP) that is considerably high for the current state of the art, but
it is still far from being as robust as more complex methods based on 3D models of poses. One of
the most important of these approaches is based on a Cascaded Pyramid Network and aims at a 3D
representation of the scene, as in [16]; we do not focus on it, however, because it is too
computationally expensive and applicable only in specific environments, such as multi-camera
capture of the scene. A detailed account of the problems that can be encountered in this kind of
task is given by PersonLab [17], where the atrous approach is also analyzed for multiple subjects
in the image. In that case we not only establish a heatmap, a Gaussian map, for evaluating the key
points, but also have to consider a segmentation of the image and the evaluation of long-, medium-
and short-range offsets for each key point, in order to establish the relationship between key
points belonging to different subjects and those in the same segmented part of the image. In these
cases it is also important to mention, as in [18], the WASPv2 version used in combination with
HRNet structures, without a decoding structure following it, which performs a cascade of atrous
convolutions at increasing rates to gain efficiency. In the following we explore past work
applying the DWT to a variety of state-of-the-art methods; Section 3 explains how our approach
differs, and the conclusion how it improves the results.

Regarding possible applications of the Discrete Wavelet Transform (DWT) across the state of the
art, we consider fundamental the contribution of D-Unet, a dual encoder used for different
purposes such as image segmentation and object detection [19]. The DWT-based encoder is an
important addition to the state of the art because it demonstrated its superiority in
reconstructing information lost in the encoding part of the neural network, and also its superior
capability with respect to other methods such as the steganalysis rich model (SRM). In that case
the layer extracts the image noise, providing additional evidence for the classification of
multiple types of image key points such as shoulders, elbows and faces. In [20], instead, the
problems related to down-sampling and up-sampling operations are demonstrated, together with the
corresponding interpolation needed to reconstruct the original image from the global feature map
at the start of the decoding part. As a last consideration, it is important to mention the
capability of spatial pyramids to obtain low-resolution feature maps with global features when
used with different FoV and dilation parameters, the so-called Atrous Spatial Pyramid Pooling
(ASPP), which handles different scales, sizes and aspect ratios of the subjects better than other,
simpler architectures.

3. Model Architecture

The complete structure of the model is divided into three parts: the encoder and decoder structure
typical of U-Net, with propagation of the information from the encoding to the decoding layers;
the bottleneck, where the Waterfall Atrous Spatial Pyramid (WASP) is used to obtain feature maps
at different fields of view; and the parallel encoder used for the multi-level decomposition of
each channel of the image, with the corresponding concatenation of the low-pass representation in
the U-Net layers delegated to the forward propagation of




the information to the decoder. The complete architecture is shown in the corresponding figure,
where each block represents a U-Net layer with down-sampling, dropout and batch normalization. We
selected two datasets for training and validation. The first is COCO, containing 40000 images; the
second, LSP, is more specific and generally more challenging for this kind of task. LSP is used in
combination with a small part of COCO to obtain a single dataset of 3600 samples, with the
remaining data used for testing and validation. The LSP dataset includes data modified by added
noise, which gives a good assessment of the network performance on single-person pose estimation;
even in this case, one of the main problems is occluded limbs.

   As mentioned in [21], Fully Convolutional Networks (FCNs) are the most widely used kind of CNN
in this field, and all are structured as encoder-decoders with an up-sampling procedure that
reconstructs the resolution and restores the data lost in the encoding part. In this section we
assume that the structure is similar to U-Net. It is important to remark, however, that structures
of this kind already achieve state-of-the-art results without thorough consideration of other
methods of image feature extraction. As a first approach we considered a solution based on
generating a heat map, but its construction is very similar to the binary segmentation of [19],
which can lead to problems such as the need for a separate heat map for each key point in order to
connect them correctly even when they overlap. A simple binary segmentation is therefore not
useful when limbs and joints overlap, and working separately on each key point adds computational
complexity. For these reasons we chose a regression layer whose output vector represents the
scaled coordinates of the points in the image that our model has to interpolate. U-Net was chosen
over the more common VGG or ResNet because it makes it possible to propagate the information
carried by the wavelet coefficients everywhere in the layers between encoding and decoding. The
network propagates context information to higher-resolution layers; exploiting this capability of
U-Net, the information propagated in our network is that gathered from the wavelets in the encoder
layers, concatenated layer by layer to the information obtained from the network's encoder. This
leads to a distance function, weighted by the size of the subject in the image, that we have to
minimize (e.g. a simple MSE on the position) in order to find the function that best interpolates
the positions of the key points given the image information. In this way we want to build
augmented information for an image of very low dimension and use it to infer information that is
invisible to a simple U-Net encoder-decoder.

   WASP is designed to reduce the number of parameters, in order to deal with memory constraints,
and to solve the main issue of atrous convolutions by using different FoVs for the global feature
representation of the image. In this part we deal with the manipulation of the latent
representation and with how it influences the decoding part. Varying the parameters of this part,
it is possible to see how these variations lead to different results in terms of PCP and to more
robustness to distant subjects. In one of our tests in particular, we varied the subject of the
image from subjects very close to the camera to more distant ones, taken from a different dataset
used for human pose estimation from UAVs. The results are better in the case of a single person in
front of the camera, but when the limbs of the subject consist of very small groups of pixels a
more pixel-by-pixel analysis is needed. In terms of results, the PCP level obtained after 50
training epochs shows how challenging these images are. The parameters modified are not only the
kernel sizes but also the dilation (rate) parameters, which determine the overall FoV on the
image. We also tried different sizes for the latent representation, which are needed to use high
dilation, larger FoVs and more parallel FoV levels concatenated for the decoding part. The rate
parameters of the layers vary among 1, 2 and 6, while the kernel sizes vary among 3, 5 and 7.
Another important point is the presence of 1×1 convolutions before the concatenation, which are
useful for manipulating the dimension of the data without losing features from the local to the
global feature map.

4. The DWT-based Encoder

The methodologies applied to build the information from the frequency analysis of the image at
different scales led us to consider many options, from Gaussian or Sobel filters to the SRM
structure as in [22]. In the end, the most significant result was obtained with the DWT encoder
and its multi-scale decomposition of the image. To build it, we stacked a sequence of layers that
apply low-pass and high-pass filters to an image and generate Haar features relevant for the
localization of key points. The multi-level decomposition method, called the DWT-based encoder in
this work, generates these coefficients over all three directions in the image, vertical, diagonal
and horizontal (LH_i, HL_i, HH_i, LL_i), with detail and approximation coefficients over different
thresholds to pass to the next layer, as explained in [23], [24] or [25]. From a more mathematical
viewpoint, each frequency component can be defined in matrix form for a 2D input as:

\[ \chi_{ll} = \Gamma \chi \Gamma^{T} \tag{2} \]
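As a numerical illustration of the matrix form in (2) and of the three analogous products with the high-pass matrix that follow it, the sketch below applies one decomposition level to a small array. It is only a minimal example under stated assumptions: it uses Haar filters (each low-pass row averages two neighbouring samples, each high-pass row takes their difference, both scaled by 1/√2) and folds the factor-2 down-sampling into the matrices. The helper names `haar_matrices`, `matmul`, `transpose` and `haar_dwt2` are hypothetical and are not taken from the authors' implementation.

```python
import math


def haar_matrices(n):
    """Return (gamma, kappa): Haar low-pass and high-pass analysis matrices.

    Each matrix has shape (n // 2, n). The factor-2 down-sampling is folded
    into the layout, so row r only touches samples 2r and 2r + 1.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    s = 1.0 / math.sqrt(2.0)
    gamma = [[0.0] * n for _ in range(n // 2)]
    kappa = [[0.0] * n for _ in range(n // 2)]
    for r in range(n // 2):
        gamma[r][2 * r], gamma[r][2 * r + 1] = s, s    # local average
        kappa[r][2 * r], kappa[r][2 * r + 1] = s, -s   # local difference
    return gamma, kappa


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def haar_dwt2(x):
    """One decomposition level of a 2D input x (list of rows, even sides)."""
    g_rows, k_rows = haar_matrices(len(x))
    g_cols, k_cols = haar_matrices(len(x[0]))
    ll = matmul(matmul(g_rows, x), transpose(g_cols))  # Gamma x Gamma^T
    lh = matmul(matmul(g_rows, x), transpose(k_cols))  # Gamma x K^T
    hl = matmul(matmul(k_rows, x), transpose(g_cols))  # K x Gamma^T
    hh = matmul(matmul(k_rows, x), transpose(k_cols))  # K x K^T
    return ll, lh, hl, hh


# A 4x4 ramp image: pixel value = 4 * row + column.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
ll, lh, hl, hh = haar_dwt2(image)
print([[round(v, 6) for v in row] for row in ll])  # [[5.0, 9.0], [21.0, 25.0]]
```

On this ramp the differences between neighbouring pixels are the same everywhere, so `lh` and `hl` are constant and `hh` is zero up to floating-point rounding. Because the assumed Haar matrices are orthonormal, the squared coefficients of the four sub-bands sum to the energy of the input, which is one way to see that no information is discarded by the decomposition itself.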




\[ \chi_{lh} = \Gamma \chi K^{T} \tag{3} \]

\[ \chi_{hl} = K \chi \Gamma^{T} \tag{4} \]

\[ \chi_{hh} = K \chi K^{T} \tag{5} \]

where \chi is the input, \chi_{ll} is the low-frequency component and \chi_{hl}, \chi_{lh},
\chi_{hh} are the high-frequency ones. The low-pass and high-pass filters, with coefficients
\gamma_i and \kappa_i respectively, are defined as:

\[
\Gamma = \begin{pmatrix}
\cdots & \cdots & \cdots & \cdots & \cdots \\
\gamma_{-2} & \gamma_{-1} & \gamma_{0} & \cdots & \cdots \\
\gamma_{-1} & \gamma_{0} & \gamma_{1} & \cdots & \cdots \\
\cdots & \gamma_{1} & \gamma_{2} & \gamma_{3} & \cdots \\
\cdots & \cdots & \gamma_{1} & \gamma_{2} & \cdots
\end{pmatrix},
\quad
K = \begin{pmatrix}
\cdots & \cdots & \cdots & \cdots & \cdots \\
\kappa_{-2} & \kappa_{-1} & \kappa_{0} & \cdots & \cdots \\
\kappa_{-1} & \kappa_{0} & \kappa_{1} & \cdots & \cdots \\
\cdots & \kappa_{1} & \kappa_{2} & \kappa_{3} & \cdots \\
\cdots & \cdots & \kappa_{1} & \kappa_{2} & \cdots
\end{pmatrix}
\tag{6}
\]

The information produced at each layer is concatenated to the layers of the encoder but, given the
U-Net structure, it is also added at the last layers through forward propagation, where it is
taken into account by the convolution weights and updated by back-propagation, with G denoting the
gradient propagated back from the following layer:

\[ \frac{\partial \chi_{ll}}{\partial \chi} = \Gamma^{T} G \, \Gamma \tag{7} \]

\[ \frac{\partial \chi_{lh}}{\partial \chi} = \Gamma^{T} G \, K \tag{8} \]

\[ \frac{\partial \chi_{hl}}{\partial \chi} = K^{T} G \, \Gamma \tag{9} \]

\[ \frac{\partial \chi_{hh}}{\partial \chi} = K^{T} G \, K \tag{10} \]

The decomposition is used in the encoder, but not as a substitute for its down-sampling operation:
instead, its down-sampled version is added hierarchically to the layers as a parallel encoder,
thus providing the three encoders, and also the decoder given the U-Net structure, with the
features emphasized by each level of the multi-level decomposition. As can be seen, low-pass
filters are applied for the first, fifth and ninth images of each layer, and high-pass filters for
the rest. This concatenated information is added hierarchically to the encoder layers, in
particular the 2nd, 3rd and 4th; each level divides the dimension of the image by 2^j, where j is
the number of the layer. To recap how wavelets work we refer to Haar wavelets as the mother
wavelet, but we also analyze the performance obtained with other wavelets, such as Daubechies,
which have already proved able to separate high-frequency from low-frequency components in images
and to isolate edges in all directions at different scales and resolutions, as in [2] and [24].
Another remarkable fact is that we do not need the IDWT in decoding, because we use a simple
multi-level decomposition without varying filters. We therefore have a single shared gradient to
consider in the loss minimization, and the additional DWT features are evaluated directly in the
same loss as the CNN, as in the case of [15], which uses a different loss for each heat map to
find, for each key point, a unique connection to the others for the skeleton construction. Our
approach is based on the analysis of the information produced by these filters and on the
improvement given by analyzing the image with the DWT encoder as a substitute for the one based on
SRM.

To better understand the use of the DWT in this part, we give a recap of the basic concepts.
Consider a window function such as the one used for the windowed Fourier transform, usually found
in the common form:

\[ F(\tau, \epsilon) = \int_{-\infty}^{+\infty} f(t) \, g(t - \tau) \, e^{-i t \epsilon} \, dt \tag{11} \]

It can be interpreted as a Fourier transform of f at the frequency \epsilon, localized by the
window g in the neighborhood of \tau. Multiplying the signal f(t) by g and computing the Fourier
coefficients gives an indication of the frequency content of f in a neighborhood of \tau; shifting
the window starting from 0 yields a sequence of coefficients that give a representation of the
image sensitive to certain frequencies. Now consider g as the family of functions generated from a
single L^2(R) function by phase-space translations (\tau, \epsilon), where \epsilon = 1/s
("coherent states"). An important property of this family is that f can be completely
reconstructed from the phase-space projections \langle g^{(\tau,q)}, f \rangle. This is because
the mapping is an isometry which, as mentioned in [24], [26] or [27], follows from the so-called
resolution of the identity, implying that f can be written as:

\[ f = \frac{1}{2\pi} \int d\tau \int dq \, \langle g^{(\tau,q)}, f \rangle \, g^{(\tau,q)} \tag{12} \]

In a similar way, wavelets are families of functions, analogous to the g^{(\tau,q)}, derived from
a single function but indexed by two labels, one for position and one for frequency, with
s = 1/\epsilon as the scale factor and \tau as the translation. The resolution of the identity is
then written as:

\[ f = C_{\psi}^{-1} \int \frac{dq}{q^{2}} \int d\tau \, \langle \psi^{(\tau,q)}, f \rangle \, \psi^{(\tau,q)} \tag{13} \]

Taking this identity into account, f can be completely redefined by a set of coefficients over a
direction, generated by simple filter application, and the image can be re-defined using:

\[ F(\tau, s) = \frac{1}{\sqrt{|s|}} \sum_{t=1}^{p-1} f(t) \, \psi^{k}\!\left[ \frac{t - \tau}{s} \right] \tag{14} \]

The parameters are usually chosen as s = 2^j for the dilation, so that the dilations are discrete
powers of 2 indexed by j, \tau = 2^j n for the translation of the wavelet, and k for the
direction, rewriting g^{(\tau,q)} as \psi^{k}_{j,n}. Given a function f(t) as the input signal,
which in our case is the image, the coefficients have a large amplitude near sharp transitions of
pixels such as



edges, obtaining the coefficients above the threshold, $\langle \psi^{k}_{j,n}, f \rangle \geq T$, and varying it over the frequency. What this produces is a set of corresponding high-pass and low-pass filters, from which we obtain four results, the approximation and the detail coefficients, for each of the three layers, to be passed to the next level of the multi-level decomposition, where each of them is defined as:

$$
\begin{cases}
A^{d}_{2^j} f = \left(f \ast \gamma_{2^j}(-x)\,\gamma_{2^j}(-y)\right)(2^{-j}n,\ 2^{-j}m) \\
D^{1}_{2^j} f = \left(f \ast \gamma_{2^j}(-x)\,\psi_{2^j}(-y)\right)(2^{-j}n,\ 2^{-j}m) \\
D^{2}_{2^j} f = \left(f \ast \psi_{2^j}(-x)\,\gamma_{2^j}(-y)\right)(2^{-j}n,\ 2^{-j}m) \\
D^{3}_{2^j} f = \left(f \ast \psi_{2^j}(-x)\,\psi_{2^j}(-y)\right)(2^{-j}n,\ 2^{-j}m)
\end{cases} \tag{15}
$$

Another important remark concerns how the output is associated with and added to the neural network; this depends on the method of concatenation (fusion) and on the corresponding result. With hierarchical fusion, an improvement has been observed both in the speed of loss convergence and in the PCP metric.

5. Experimental Setup

Some guidance on how performance is evaluated is necessary here. We used the common Percentage of Correct Parts (PCP) and Percentage of Correct Key-points (PKC) metrics to evaluate the results. In particular, considering a simple MSE function to minimize the error between predicted and correct coordinates, we have to minimize a simple Euclidean distance (as in a regression problem), in the simple form:

$$d^2 = (x_{predicted} - x_{GT})^2 + (y_{predicted} - y_{GT})^2 \tag{16}$$

With this kind of loss, we can define a suitable evaluation metric that depends on these coordinates and on a threshold, in order to understand whether the prediction is approaching the desired result. The PKC denotes an evaluation of the joints by a formula of the form:

$$d^{2}_{pred}\cos(\theta) \leq 0.5\, d^{2}_{true} \tag{17}$$

In other words, the segment given by the predicted endpoints must lie within a fraction of the length of the ground-truth segment: the distance computed from the prediction has to be smaller than half of the actual length (threshold = 0.5), as mentioned in [1]. Alternatively, it is possible to use as a metric the Object Keypoint Similarity (OKS), in the form:

$$\frac{\sum_i \exp\left(-d_i^2 / 2 s^2 k_i^2\right) f(v_i > 0)}{\sum_i f(v_i > 0)} \tag{18}$$

Another possible metric, the one adopted in this work, is the PCP, in which a detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold. This work includes an example of the results in terms of PCP accuracy on a single image for the adopted methods, as it varies during training. It is relevant to note that many coefficients have to be saved when computing the wavelet transform, so this computation is performed before the training function. In this way we save the time of repeating the computation at each epoch. However, this becomes a challenging problem from the memory point of view, because we have to store each image together with its coefficients.

Parameter             Variations      Diff. score
output feature        (36×36)         -
input images          (256×256)       -
Optimizer             Adam/SGD        +0.17 loss/epoch
Activation function   Lrelu/Relu      -
Num.layer             3/4             ±0.1 pcp
N.layer(loss)         1/3             -0.08 loss/epoch
Batch                 16/32           +0.2% pcp
Learning rate         3e−3/0.0025     +0.2% loss/epoch
Epochs                10/30/120       +0.11→+0.5% pcp
Dropout               0.2/0.5         3.0% loss/epoch
Wavelet               Haar/Daub.      +0.05→+0.14% pcp

Table 1: Parameters used in the tests

6. Results

The tests were run on both the COCO and LSP datasets, but the final evaluation was done on a combination of the two. Accuracy is reported not from the first epoch but from the 50th epoch, while the loss is shown from the beginning. It is important to mention that the approaches were tested on 160x160 images but also on other dimensions; as the dimensions increased, the results improved even more with the DWT encoder structure than with a simple CNN, with the loss over 100 epochs going from a value of 110.07 to 120.3. This is not a big increase, but the PCP went from 65% to 71%, and the loss of an interpolation problem usually starts above 1000 (initial MSE loss) for 160x160 images and at about 300 for 80x80 images. We can therefore deduce that the increased complexity of the interpolation problem is compensated by the information provided by the DWT, confirming its usefulness for the analysis of complex data. It could be interesting to try 1280x720 images as a future improvement.

Figure 1: Behaviour of the loss over 80 epochs with the WASP bottleneck on the LSPII dataset, computed over smaller 80x80 images, and on the COCO dataset. Beyond the lower loss, what is clearly visible is the irregularity of the function, due in this case to noisy data, which is the main difference between LSPII and the selected COCO subset, and to the lower resolution of the images.

Some of the most challenging aspects of this kind of problem are reported in the 4th image at the top, evaluated with our method and compared, with the same number of training epochs and the same parameters, against a simple CNN with WASP and without the DWT. Better results can be obtained, without changing the parameters, by eliminating the downsampling and using images with a better resolution (instead of a 160x160 image), as in [15], where image dimensions are around 1280x720, even with denoised images. The images have been chosen to give a
basic representation of the main problems of human body pose estimation, due to the complexity of the pose and the occlusion of the limbs of the subject. Looking at the first row, the first picture, apart from a small negligible error, shows a very good result in the pose estimation of the subject. In contrast, the third image is more complex to evaluate due to the occlusion of the limbs and the complexity of the pose. A different kind of problem is instead represented by the pose of the second and third images, where downsampling the entire image by a very high factor leads to a low resolution of the subject's borders, giving imprecision in the evaluation of the key point positions. Note carefully that the presence of padding in the CNN leads to the shifted results in the figure, as in object detection in [28]. In another important test we compared the DWT-based method between the COCO and LSPII datasets, where the data are more problematic and many subjects assume complex poses. These tests were initially also evaluated with respect to the PKC, with similar results, but in the end we used the PCP because it is more appropriate for a bottom-up approach and easier to implement and evaluate.

Figure 2: Test with Haar DWT used with pretrained CNN Daubechies.

Figure 3: Examples of qualitative results of our model, including some incorrect classifications in image 3, where the PCP is very low. Note how the most problematic ones are the first and second, in the parts where there is limb occlusion.

Figure 4: Another test with different kernel sizes in the WASP, which obtained better results but with padding that decreased the resulting PCP (see the colab).

Figure 5: Final results over the epochs for DWT+WASP and Daubechies, compared between the LSPII+COCO and COCO-only datasets.

Method                          COCO      LSPII
$CNN_{wasp+DWT,haar}$           76.45%    71%
$CNN_{DWT,haar}$                75.02%    -%
$CNN_{wasp}$                    77%       63%
$CNN_{wasp+DWT,Daubechies}$     74.1%     71%
U-net                           73.7%     69.3%
$CNN_{wasp,SRM}$                55.33%    60.21%
$CNN_{DWT,Daubechies}$          61%       -%

Table 2: Results with respect to the PCP metric, evaluated over 100 epochs with batches of 40 elements each, tested on the LSPII dataset and on COCO for U-net and for the CNN with different structures. The - values are not of interest with respect to the previous results in the table (e.g. same % for U-net and $CNN_{DWT,haar}$, or for $CNN_{DWT,haar}$ and $CNN_{DWT,Daubechies}$ in COCO).

7. Conclusion

The wavelet procedure actually increases the capabilities of the network through the DWT encoder, and this result shows how it is possible to extend it to different fields that include this kind of structure. Different results are obtained using more epochs or a different number of layers in the convolutional structure: going from 4 to 3 layers gave lower results in terms of PCP, and more complicated CNNs are not able to compete with the wavelet encoder, showing clear signs of overfitting, so we preferred not to train for more than 100 epochs and to use just 4 convolutional layers. We extended these results to a different topic, human pose estimation. The initial objective was to establish results with an object detection approach to pose estimation, but it was discarded because, as already said, the top-down approach proved to be superior. In addition, we addressed the faulty general encoder-decoder structure common to the most used CNNs in this field, such as VGG-16 and ResNet, showing a lower loss of information during encoding by using the DWT and the analysis of the wavelet information of the images.

References

[1] M. Eichner, M. Marin-Jimenez, A. Zisserman, V. Ferrari, 2D articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision 99 (2012) 190–214.
[2] J. C. Isaacs, S. Y. Foo, Hand pose estimation for american sign language recognition, Thirty-Sixth Southeastern Symposium on System Theory, 2004. Proceedings of the (2004) 132–136.
[3] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[4] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51–63.
[5] F. Bonanno, G. Capizzi, G. Sciuto, C. Napoli, G. Pappalardo, E. Tramontana, A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in igss by using wrnn predictors and gpu parallel solutions, in: 2014 International Symposium on Power Electronics, Electrical Drives, Automation and Motion, SPEEDAM 2014, IEEE Computer Society, 2014, pp. 1077–1084. doi:10.1109/SPEEDAM.2014.6872127.
[6] C. Napoli, G. Pappalardo, E. Tramontana, R. Nowicki, J. Starczewski, M. Woźniak, Toward work groups classification based on probabilistic neural network approach, in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), volume 9119, Springer Verlag, 2015, pp.
79–89. doi:10.1007/978-3-319-19324-3_8.
[7] Q. Li, L. Shen, S. Guo, Z. Lai, Wavelet integrated cnns for noise-robust image classification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. Nowicki, J. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, in: Proceedings of the International Joint Conference on Neural Networks, volume 2015-September, Institute of Electrical and Electronics Engineers Inc., 2015. doi:10.1109/IJCNN.2015.7280461.
[9] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 66–74.
[10] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, in: Proceedings - 2013 7th International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2013, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[11] G. Capizzi, G. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[12] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction 13196 LNAI (2022) 310–325. doi:10.1007/978-3-031-08421-8_21.
[13] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89–94.
[14] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, volume 2768, CEUR-WS, 2020, pp. 41–45.
[15] B. Artacho, A. Savakis, Unipose: Unified human pose estimation in single images and videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] J. Dong, W. Jiang, Q. Huang, H. Bao, X. Zhou, Fast and robust multi-person 3d pose estimation from multiple views, 2019. arXiv:1901.04111.
[17] G. Capizzi, C. Napoli, F. Bonanno, Innovative second-generation wavelets construction with recurrent neural networks for solar radiation forecasting, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 1805–1815. doi:10.1109/TNNLS.2012.2216546.
[18] B. Artacho, A. Savakis, Omnipose: A multi-scale framework for multi-person pose estimation, 2021. arXiv:2103.10180.
[19] Y. Zhou, W. Huang, P. Dong, Y. Xia, S. Wang, D-unet: A dimension-fusion u shape network for chronic stroke lesion segmentation, IEEE/ACM Transactions on Computational Biology and Bioinformatics 18 (2021) 940–950. doi:10.1109/TCBB.2019.2939522.
[20] T. Williams, R. Li, Wavelet pooling for convolutional neural networks, 2018.
[21] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, 2015. arXiv:1411.4038.
[22] S. Kang, H. Park, J.-I. Park, Cnn-based ternary classification for image steganalysis, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/11/1225. doi:10.3390/electronics8111225.
[23] H. Liu, N. Chen, J. Huang, X. Huang, Y. Q. Shi, A robust dwt-based video watermarking algorithm, 2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No.02CH37353) 3 (2002) III–III.
[24] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Transactions on Information Theory 36 (1990) 961–1005. doi:10.1109/18.57199.
[25] B. Sturm, Stéphane mallat: A wavelet tour of signal processing, 2nd edition, Computer Music Journal - COMPUT MUSIC J 31 (2007) 83–85. doi:10.1162/comj.2007.31.3.83.
[26] I. Daubechies, T. Paul, Time-frequency localisation operators-a geometric phase space approach: Ii. the use of dilations, Inverse Problems 4 (1988) 661–680.
[27] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 674–693. doi:10.1109/34.192463.
[28] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, Lecture Notes in Computer Science (2014) 346–361. URL: http://dx.doi.org/10.1007/978-3-319-10578-9_23. doi:10.1007/978-3-319-10578-9_23.