A Novel DWT-based Encoder for Human Pose Estimation

Giorgio De Magistris1, Matteo Romano1, Janusz Starczewski2 and Christian Napoli1,3

1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
2 Department of Computational Intelligence, Częstochowa University of Technology, al. Armii Krajowej 36, Częstochowa, 42-200, Poland
3 Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy

Abstract
The proposed approach to pose estimation is a Convolutional Neural Network (CNN) with an encoder-decoder structure, a spatial pyramid based on the WASP structure in its bottleneck, and a Discrete Wavelet Transform (DWT) encoder. These techniques have already shown that they can address one of the main problems in the state of the art: the different fields of view (FoV) required to analyze the different possible sizes of a specific subject. We want to fix the faulty structure of the encoding part of modern CNN-based networks using the DWT encoder and WASP. This work also aims to show, from a more general point of view, the advantages of a DWT encoder in any CNN-based approach to pose estimation and object detection, for instance with several subjects in the same image or in video, given the almost redundant use of the most popular CNN encoding structures such as ResNet-101, U-Net or VGG16-19. We run our tests on a U-Net-based CNN in order to evaluate the importance of the DWT encoder's results in the decoding part as well, by cropping them at the last layers of the network. This is necessary because border pixels that could be useful for evaluating the result are lost during encoding.

Keywords
Discrete Wavelet Transform, Convolutional Neural Network, WASP, Atrous Convolution

SYSTEM 2022: 8th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Brunek, July 23, 2022
demagistris@diag.uniroma1.it (G. De Magistris); janusz.starczewski@pcz.pl (J. Starczewski); cnapoli@diag.uniroma1.it (C. Napoli)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073. Giorgio De Magistris et al., CEUR Workshop Proceedings 33–40.

1. Introduction

Pose estimation is important in many areas, from human detection and pose estimation to navigation systems for autonomous cars, and in related fields such as object detection and image segmentation. Different types of approach can be used. A top-down approach first detects the human with a bounding box, as in an object detection task, and then applies the pose estimation algorithm, as in [1]. An alternative bottom-up approach estimates the points from the image and then recreates the human form from the "skeleton" obtained by joining these points. In our case we use the top-down method to estimate these points, because the result we want to emphasize is the improvement of performance at the encoding level, which concerns all structures of this kind, including those used for object detection.

Certain CNNs face different challenges in obtaining good results in complex situations, and the novelty of our study could affect the capabilities of these CNN structures in different ways. Current CNN-based methods for key-point detection and for estimating the pose of human limbs are all developed through encoder-decoder structures that are very similar to each other. More specifically, these structures are based on a representation of the image in a latent space, built from feature maps that go from a local to a global viewpoint through a spatial pyramid used as the bottleneck of the CNN. For this type of task, various technologies are currently available in the state of the art to improve the performance of these CNNs, including the handling of the different fields of view (FoV) needed to evaluate objects at different scales. What we want to add with this work is the resolution of problems localized in the encoding part of these networks.

For single-subject pose estimation, the final product of this kind of structure is either a simple set of points to interpolate or a heatmap of the image. The heatmap is evaluated against a ground truth (GT) image, generated in the pre-processing phase with Gaussian noise in the key-point areas, through a loss function based on the percentage of correct key points between the GT image and the predicted key points. The state of the art faces challenges related not only to the technologies in use but also to the possible occlusion of limbs caused by the presence of several people in the image, which requires multi-person pose estimation, and to low image quality, as in video where very fast moving objects require less smooth corners for the analysis. Structures of this kind are used beyond human pose estimation, for example for hand pose as in [2, 3, 4], where many kinds of wavelets are analyzed for the purpose, and in other applications in the state of the art [5, 6].
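The ground-truth heatmap mentioned in the introduction can be illustrated with a short sketch. It assumes the common construction in which each key point is rendered as a 2D Gaussian centred on its coordinates; the function name `gaussian_heatmap`, the image size and the value of `sigma` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_heatmap(height, width, keypoint, sigma=2.0):
    """Return a (height, width) map with a Gaussian bump centred on keypoint=(x, y).

    Assumption: one heatmap per key point, peak value 1.0 at the key point.
    """
    ys, xs = np.mgrid[0:height, 0:width]  # row (y) and column (x) index grids
    x0, y0 = keypoint
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

# Toy example: a single key point at x=20, y=30 in a 64x64 image.
heatmap = gaussian_heatmap(64, 64, keypoint=(20, 30))
peak_y, peak_x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

A larger `sigma` spreads the target over more pixels, which makes the regression target smoother at the cost of localization precision.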
The use of such structures has also been noted in works on image classification [7, 8], object detection [1, 9] and other solutions [10, 11]. On these premises, our work focuses on a U-Net-based model with a multi-level decomposition (MLD) of the image acting as an encoder parallel to U-Net. The information gathered at each layer of the DWT encoder is concatenated to the U-Net encoder and propagated to the decoder. For feature extraction at the U-Net bottleneck we instead use a WASP structure, which produces the feature map of the image from different fields of view after the encoding part and passes it to the decoder. In this way the encoder does not lose the information eliminated by the down-sampling operation, because it is reconstructed from the wavelet information passed forward to the decoder. Section 2 explores past work on the DWT applied to a variety of state-of-the-art methods, Section 3 explains how our approach differs, and the conclusion shows how it improves the results.

2. Related Works

This work considers two main aspects: first, works on wavelets applied to different fields such as image segmentation or object detection; second, the common structures and modern approaches currently used in pose estimation. We analyze approaches from both aspects, used in combination with common pre-processing practice for the task of human pose estimation. There are many pose estimation systems [12, 13, 14]. Among them, UniPose [15] is a pose estimation method based on a so-called WASP structure as the central part of the bottleneck of the CNN. It uses layers with different dilations, applying a rate parameter according to the formula:

$$y[m, n] = \sum_{i=0}^{L_1} \sum_{j=0}^{L_2} I[m - r i,\; n - r j]\, k[i, j] \tag{1}$$

with $r$ the rate of dilation, $I$ the image, $k$ the kernel over the x/y axes, $[m, n]$ the pixel position, and $L_1$, $L_2$ the bounds of the kernel positions along the two axes. In this way it is possible to obtain a larger FoV for the image and connect it to a depth-wise convolution to obtain higher-level abstract features for the latent representation. UniPose reached a considerably high Percentage of Correct Parts (PCP) for the current state of the art, but it is far from being as robust as more complex methods based on 3D models of poses. One of the most important approaches, based on a Cascaded Pyramid Network, aims to give a 3D representation of the scene as in [16]; we do not focus on it because it is too computationally expensive and applicable only in specific environments, such as multi-camera detection of the scene.

A detailed explanation of the problems that can be encountered in this kind of task is given by PersonLab [17], where the atrous approach is also analyzed for multiple subjects in the image. In that case one not only establishes a heatmap for the evaluation of the key points in a Gaussian map but must also consider a segmentation of the image and the evaluation of long-, medium- and short-range offsets for each key point of the subject, in order to establish the relationship between key points belonging to different subjects and those in the same segmented part of the image. In these cases it is also important to mention, as in [18], the WASPv2 version used in combination with HRNet structures without a subsequent decoding structure, which performs a cascade of atrous convolutions at increasing rates to gain efficiency.

Regarding applications of the Discrete Wavelet Transform (DWT) in the state of the art, a fundamental contribution is D-Unet, a dual encoder used for different purposes such as image segmentation and object detection [19]. The DWT-based encoder is an important addition to the state of the art because it demonstrated its superiority in reconstructing information lost in the encoding part of the network, and also its superior capability with respect to other methods such as the steganalysis rich model (SRM). In that case the layer extracts the image noise, providing additional evidence for the classification of multiple types of image key points such as shoulders, elbows and faces. In [20] the problems related to the use of down-sampling and up-sampling operations were instead demonstrated, together with the corresponding interpolation needed to reconstruct the original image from the global feature map at the start of the decoding part. As a last consideration, it is important to cite the capabilities of the spatial pyramid in obtaining low-resolution feature maps with global features when used with different FoV and dilation parameters, the so-called Atrous Spatial Pyramid Pooling (ASPP), compared with simpler architectures, to handle different scales, sizes and aspect ratios of the subjects.

3. Model Architecture

The complete structure of the model is divided into three parts: the encoder-decoder structure typical of U-Net, with propagation of information from the encoding to the decoding layers; the bottleneck, where the Waterfall Atrous Spatial Pyramid (WASP) is used to obtain feature maps at different fields of view; and the parallel encoder, used for the multi-level decomposition of each image channel, whose low-pass representation is concatenated in the U-Net layers that forward information to the decoder. In the complete architecture figure, each block represents a U-Net layer with a down-sampling operation, dropout and batch normalization. We selected two datasets for training and validation. The first is COCO, containing 40000 images. The second, LSP, is more specific and generally more challenging for this kind of task; it is used in combination with a small part of COCO to obtain a single dataset of 3600 samples, with the remainder used for testing and validation. The LSP dataset includes data modified by added noise, allowing a good assessment of the network's performance on single-person pose estimation, and even in this case one of the main problems is occluded limbs.

As mentioned in [21], Fully Convolutional Networks (FCN) are the most widely used kind of CNN in this field, and all are structured as an encoder-decoder with an up-sampling procedure to reconstruct the resolution and restore the data lost in the encoding part. In this section we assume that the structure is similar to U-Net. However, it is important to remark that these structures already establish state-of-the-art results without thorough consideration of other methods of image feature extraction. As a first approach we considered a solution based on the generation of a heatmap, but its construction is very similar to the binary segmentation of [19], which can lead to problems such as the need for a separate heatmap for each key point in order to connect them correctly even when they overlap. A simple binary segmentation is therefore not useful when limbs and joints overlap, and working separately on each key point adds computational complexity. For these reasons we chose a regression layer whose output is a vector of the scaled coordinates of the points in the image, which our model has to interpolate.

U-Net was chosen over the more common VGG or ResNet because it allows the information produced by the wavelet coefficients to be propagated everywhere in the layers between encoding and decoding. U-Net propagates context information to higher-resolution layers; exploiting this capability, the information propagated in our network is the information gathered from the wavelets in the encoder layers, concatenated layer by layer to the information obtained from the network's encoder. This leads to a distance function, weighted with respect to the size of the subject in the image, that we have to minimize (e.g., a simple MSE on the positions) in order to find the function that best interpolates the positions of the key points from the image information. In this way we want to build augmented information for an image of very low dimension and use it to infer information that is invisible to a simple U-Net encoder-decoder.

WASP is designed with the goal of reducing the number of parameters in order to deal with memory constraints and to solve the main issue of atrous convolutions by using different FoV for the global feature representation of the image. In this part we deal with the manipulation of the latent representation and how it influences the decoding part. By varying the parameters of this part it is possible to see how these variations lead to different results in terms of PCP and to more robustness to distant subjects. In particular, one of our tests varied the subject of the image from subjects very near the camera to more distant ones, taken from a different dataset used for human pose estimation from UAVs. The results for a single person in front of the camera are superior, but when the limbs of the subject consist of very small groups of pixels a more pixel-by-pixel analysis is needed. The PCP obtained after 50 training epochs shows how challenging these images are. The parameters modified are not only the kernel sizes but also the dilation or rate parameters, obtaining in this way a general FoV of the image. We also tried different sizes for the latent representation, which is needed in order to use high dilation, a larger FoV and more parallel FoV levels concatenated for the decoding part. The rate parameters of the layers vary among 1, 2 and 6, while the kernel dimensions vary among 3, 5 and 7. Another important fact is the presence of 1×1 convolutions before the concatenation, which are useful for manipulating the dimension of the data without loss of features from the local to the global feature map.

4. The DWT-based Encoder

The methods considered for building information from the frequency analysis of the image at different scales led to many different choices, from the application of Gaussian or Sobel filters to the use of an SRM structure as in [22]. In the end, the most significant result was obtained by the DWT encoder with the multi-scale decomposition of the image. To build it, we constructed a sequence of layers that apply low-pass and high-pass filters to an image and generate Haar features relevant for the localization of key points. The multi-level decomposition method, called the DWT-based encoder in this work, generates these coefficients over all three directions in the image, vertical, diagonal and horizontal ($LH_i$, $HL_i$, $HH_i$, $LL_i$), with coefficients for the details and the approximation over different thresholds to pass to the next layer, as explained in [23], [24] or [25]. From a more mathematical viewpoint, each frequency component can be defined in matrix form for a 2D input as:

$$\chi_{ll} = \Gamma \chi \Gamma^T \tag{2}$$
$$\chi_{lh} = \Gamma \chi K^T \tag{3}$$
$$\chi_{hl} = K \chi \Gamma^T \tag{4}$$
$$\chi_{hh} = K \chi K^T \tag{5}$$

where $\chi$ is the input, $\chi_{ll}$ is the low-frequency component and $\chi_{hl}$, $\chi_{lh}$, $\chi_{hh}$ are the high-frequency components. Denoting the low-pass and high-pass filter coefficients by $\gamma_i$ and $\kappa_i$ respectively, the matrices are:

$$\Gamma = \begin{pmatrix} \cdots & \cdots & \cdots & \cdots & \cdots \\ \gamma_{-2} & \gamma_{-1} & \gamma_{0} & \cdots & \cdots \\ \gamma_{-1} & \gamma_{0} & \gamma_{1} & \cdots & \cdots \\ \cdots & \cdots & \gamma_{1} & \gamma_{2} & \cdots \\ \cdots & \gamma_{1} & \gamma_{2} & \gamma_{3} & \cdots \end{pmatrix}, \quad K = \begin{pmatrix} \cdots & \cdots & \cdots & \cdots & \cdots \\ \kappa_{-2} & \kappa_{-1} & \kappa_{0} & \cdots & \cdots \\ \kappa_{-1} & \kappa_{0} & \kappa_{1} & \cdots & \cdots \\ \cdots & \cdots & \kappa_{1} & \kappa_{2} & \cdots \\ \cdots & \kappa_{1} & \kappa_{2} & \kappa_{3} & \cdots \end{pmatrix} \tag{6}$$

The information produced at each layer is concatenated to the layers of the encoder. Since the structure is a U-Net, it is also added at the last layers through forward propagation, is taken into account by the convolution weights, and is updated by back-propagation, where $G$ denotes the gradient propagated back to the corresponding component:

$$\frac{\partial \chi_{ll}}{\partial \chi} = \Gamma^T G \Gamma \tag{7}$$
$$\frac{\partial \chi_{lh}}{\partial \chi} = \Gamma^T G K \tag{8}$$
$$\frac{\partial \chi_{hl}}{\partial \chi} = K^T G \Gamma \tag{9}$$
$$\frac{\partial \chi_{hh}}{\partial \chi} = K^T G K \tag{10}$$

This is used in the encoder, but not as a replacement for the encoder's down-sampling operation. Instead, its down-sampled version is added hierarchically to the layers as a parallel encoder, providing the three encoder layers (and also the decoder, since the structure is a U-Net) with the features emphasized by each layer of the multi-level decomposition. As can be seen, the low-pass filters are applied for the first, fifth and ninth image of each layer, and the high-pass filters for the rest. This concatenated information is added hierarchically to the encoder layers, in particular the 2nd, 3rd and 4th; each level divides the dimension of the image by $2^j$, with $j$ the number of the layer. To recap how wavelets work we refer to the Haar mother wavelet, but we also analyze the performance of other wavelets such as Daubechies, which have already proved capable of isolating high- from low-frequency components in images and of isolating edges in all directions at different scales and resolutions, as in [2] and [24]. Another remarkable fact is that we do not need the IDWT in decoding, since we use a simple multi-level decomposition without varying filters. We therefore have a single common gradient to consider for loss minimization, evaluating the additional DWT features directly in the same loss as the CNN, whereas in the case of [15] a different loss is used for each heatmap to find, for each key point, a unique connection to the others for the skeleton construction. Our approach is based on the analysis of the information produced by these filters and on the improvement given by the DWT encoder's analysis of the image, as a substitute for the one based on SRM.

To better understand the use of the DWT, this part gives a recap of the basic concepts. Consider a window function such as the one used for the Fourier transform, usually found in the common form:

$$F(\tau, \epsilon) = \int_{-\infty}^{+\infty} f(t)\, g(t - \tau)\, e^{-i t \epsilon}\, dt \tag{11}$$

It can be interpreted as the Fourier transform of $f$ at the frequency $\epsilon$, localized by the window $g$ in the neighborhood of $\tau$. Multiplying the signal $f(t)$ by $g$ and computing the Fourier coefficients gives an indication of the frequency content of $f$ in a neighborhood of $\tau$; shifting the window from 0, we obtain a sequence of coefficients that gives a representation of the image sensitive to certain frequencies. Now, considering $g$ as the family of functions generated from a single $L^2(\mathbb{R})$ function by phase-space translations $(\tau, \epsilon)$, where $\epsilon = 1/s$ ("coherent states"), an important property of this family is the capability to completely reconstruct $f$ from the phase-space projections $\langle g^{(\tau,q)}, f \rangle$. This is due to the mapping being an isometry, which, as mentioned in [24], [26] or [27], follows from the so-called resolution of the identity, implying that $f$ can be written as:

$$f = \frac{1}{2\pi} \int\!\!\int d\tau\, dq\, \langle g^{(\tau,q)}, f \rangle\, g^{(\tau,q)} \tag{12}$$

Similarly, wavelets are families of functions that, like the $g^{(\tau,q)}$, are derived from a single function but are indexed by two labels, one for position and one for frequency, with $s = 1/\epsilon$ as scale factor and $\tau$ as translation, where the resolution of the identity is written as:

$$f = C_\psi^{-1} \int \frac{dq}{q^2} \int d\tau\, \langle \psi^{(\tau,q)}, f \rangle\, \psi^{(\tau,q)} \tag{13}$$

In this way $f$ can be completely redefined by a set of coefficients along a direction, generated by simple filter application, and the image is re-defined using:

$$F(\tau, s) = \frac{1}{\sqrt{|s|}} \sum_{t=1}^{p-1} f(t)\, \psi^{k}\!\left[\frac{t - \tau}{s}\right] \tag{14}$$

Usually the parameters are chosen as $s = 2^j$ for the dilation, in order to have a discrete dilation by taking powers of a fixed base, $\tau = 2^j n$ as the translation of the wavelet, and $k$ as the direction, rewriting $g^{(\tau,q)}$ as $\psi^{k}_{j,n}$. Given a function $f(t)$ as input signal, which will be our image, the wavelet coefficients have a large amplitude near sharp transitions of pixels such as edges.
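The matrix form of equations (2)–(5) and the dyadic scaling $s = 2^j$ can be made concrete with a small numerical sketch. It assumes the orthonormal Haar filters, so that $\Gamma$ and $K$ contain $1/\sqrt{2}$ entries on pairs of adjacent samples; the helper names `haar_matrices`, `haar_dwt2` and `multilevel` are hypothetical, and this is an illustration of the decomposition only, not the authors' network code.

```python
import numpy as np

def haar_matrices(n):
    """Build the (n/2 x n) low-pass (Gamma) and high-pass (K) Haar analysis matrices."""
    assert n % 2 == 0, "this sketch assumes an even dimension"
    gamma = np.zeros((n // 2, n))
    kappa = np.zeros((n // 2, n))
    for r in range(n // 2):
        gamma[r, 2 * r] = gamma[r, 2 * r + 1] = 1.0 / np.sqrt(2.0)  # local average
        kappa[r, 2 * r] = 1.0 / np.sqrt(2.0)                        # local difference
        kappa[r, 2 * r + 1] = -1.0 / np.sqrt(2.0)
    return gamma, kappa

def haar_dwt2(x):
    """One decomposition level following the matrix form of equations (2)-(5)."""
    g_rows, k_rows = haar_matrices(x.shape[0])
    g_cols, k_cols = haar_matrices(x.shape[1])
    ll = g_rows @ x @ g_cols.T  # chi_ll = Gamma chi Gamma^T
    lh = g_rows @ x @ k_cols.T  # chi_lh = Gamma chi K^T
    hl = k_rows @ x @ g_cols.T  # chi_hl = K chi Gamma^T
    hh = k_rows @ x @ k_cols.T  # chi_hh = K chi K^T
    return ll, lh, hl, hh

def multilevel(x, levels=3):
    """Repeatedly decompose the LL band; level j has side length input / 2**j."""
    out = []
    for _ in range(levels):
        x, lh, hl, hh = haar_dwt2(x)
        out.append((x, lh, hl, hh))
    return out

# Toy single-channel "image"; a real input would be one image channel.
image = np.arange(64 * 64, dtype=float).reshape(64, 64)
pyramid = multilevel(image, levels=3)
shapes = [bands[0].shape for bands in pyramid]
```

Because the stacked Haar matrices are orthogonal, the four sub-bands of one level carry exactly the energy of the input, which is why no information is discarded even though each band has half the resolution.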
The coefficients retained are those above a threshold $T$, $\langle \psi^{k}_{j,n}, f \rangle \ge T$, which is varied over the frequency. Applying the corresponding high-pass and low-pass filters yields four results, the approximation and the three detail coefficients, for each of the three layers, to be passed to the next layer of the multi-level decomposition. Each of them is defined as:

$$\begin{cases} A^{d}_{2^j} f = \left(f * \gamma_{2^j}(-x)\,\gamma_{2^j}(-y)\right)(2^{-j} n,\, 2^{-j} m) \\ D^{1}_{2^j} f = \left(f * \gamma_{2^j}(-x)\,\kappa_{2^j}(-y)\right)(2^{-j} n,\, 2^{-j} m) \\ D^{2}_{2^j} f = \left(f * \kappa_{2^j}(-x)\,\gamma_{2^j}(-y)\right)(2^{-j} n,\, 2^{-j} m) \\ D^{3}_{2^j} f = \left(f * \kappa_{2^j}(-x)\,\kappa_{2^j}(-y)\right)(2^{-j} n,\, 2^{-j} m) \end{cases} \tag{15}$$

Another important remark concerns how the output is associated with and added to the neural network, that is, the method of concatenation (fusion) and the corresponding result. Hierarchical fusion proved to increase both the speed of loss convergence and the PCP metric.

5. Experimental Setup

Some notes on performance evaluation are necessary. We used the common Percentage of Correct Parts (PCP) and Percentage of Correct Key-points (PCK) to evaluate the results. In particular, using a simple MSE function to minimize the error between predicted and correct coordinates, we minimize a simple Euclidean distance (as in a regression problem) of the form:

$$d^2 = (x_{predicted} - x_{GT})^2 + (y_{predicted} - y_{GT})^2 \tag{16}$$

With this kind of loss we can define a suitable metric on these coordinates, based on a threshold, to understand whether we are approaching the desired result. The PCK evaluates the joints through a formula of the form:

$$d_{pred}^{2} \cos(\theta) \le 0.5\, d_{true}^{2} \tag{17}$$

In other words, the segment given by the predicted endpoints must lie within a fraction of the length of the ground-truth segment: the distance computed from the prediction has to be smaller than half of the effective length (threshold = 0.5), as mentioned in [1]. Alternatively, the Object Keypoint Similarity (OKS) can be used as a metric, in the form:

$$\frac{\sum_i \exp\!\left(-d_i^2 / 2 s^2 k_i^2\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} \tag{18}$$

where $d_i$ is the distance between the predicted and the true position of key point $i$, $s$ is the object scale, $k_i$ is a per-key-point constant, and $\delta(v_i > 0)$ selects the key points labeled as visible.

Another possible metric, the one adopted in this work, is the PCP, in which a detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold. This work includes an example of the results in terms of PCP accuracy on a single image for the adopted methods as it varies during training. It is relevant to consider that many coefficients must be saved when computing the wavelet, so this computation is performed before the training function. In this way we save the time of a computation that would otherwise be repeated at each epoch. This, however, becomes a very challenging problem in terms of memory, because space is needed for each image and its corresponding coefficients.

| Parameter | Tested variations | Score difference |
|---|---|---|
| Output feature | (36×36) | - |
| Input images | (256×256) | - |
| Optimizer | Adam/SGD | +0.17 loss/epoch |
| Activation function | LeakyReLU/ReLU | - |
| Num. layers | 3/4 | ±0.1 PCP |
| N. layers (loss) | 1/3 | -0.08 loss/epoch |
| Batch | 16/32 | +0.2% PCP |
| Learning rate | 3e-3/0.0025 | +0.2% loss/epoch |
| Epochs | 10/30/120 | +0.11→+0.5% PCP |
| Dropout | 0.2/0.5 | 3.0% loss/epoch |
| Wavelet | Haar/Daub. | +0.05→+0.14% PCP |

Table 1: Parameters used in the tests

6. Results

The tests were run on both the COCO and LSP datasets, but in the end the evaluation was done on a combination of them. Accuracy is evaluated not from the first epoch but from the 50th, while the loss is shown from the beginning. It is important to mention that the approaches were tested on 160x160 images but also on different dimensions; increasing them, the results improved even more with the DWT encoder structure than with a simple CNN, with the loss over 100 epochs going from a value of 110.07 to 120.3. This is not a big increase, but considering that the PCP went from 65% to 71%, and that the MSE loss of an interpolation problem usually starts above 1000 for 160x160 images and around 300 for 80x80 images, we can deduce that the increased complexity of the interpolation problem is compensated by the information provided by the DWT, confirming its utility for the analysis of complex data. It could be interesting to try 1280x720 images as a future improvement. Some of the most challenging aspects of this kind of problem are reported in the 4th image at the top, evaluated with our method and compared, at equal training epochs and parameters, with a simple CNN with WASP and without the DWT. Better results can be obtained, without changing the parameters, by eliminating the down-sampling and using images with a better resolution (instead of a 160x160 image), as in [15], where image dimensions are around 1280x720, even with denoised images.

Figure 1: Behaviour of the loss over 80 epochs with the WASP bottleneck on the LSPII dataset, computed on smaller 80x80 images, and on the COCO dataset. Beyond the lower loss, what is clearly visible is the irregularity of the function, in this case due to the noisy data, which is the main difference between LSPII and the COCO subset we selected, and to the lower resolution of the images.

Figure 2: Test with Haar DWT used with a pretrained CNN and Daubechies.

The images were chosen to give a basic representation of the main problems of human body pose estimation, due to the complexity of the pose and the occlusion of the subject's limbs. Looking at the first row, the first picture, apart from a small negligible error, shows a very good result in the pose estimation of the subject.
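For reference, the PCP values discussed in this section can be related to a minimal sketch of the metric. It assumes the common limb-based formulation, in which a part counts as correct when both predicted endpoints lie within half of the ground-truth limb length from the true endpoints (threshold = 0.5); the function name `pcp`, the `limbs` index pairs and the toy coordinates are hypothetical and only illustrate the criterion.

```python
import numpy as np

def pcp(pred, gt, limbs, threshold=0.5):
    """Fraction of limbs whose two predicted endpoints both lie within
    threshold * (ground-truth limb length) of the true endpoints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    correct = 0
    for a, b in limbs:
        limb_len = np.linalg.norm(gt[a] - gt[b])  # ground-truth limb length
        err_a = np.linalg.norm(pred[a] - gt[a])   # error on first endpoint
        err_b = np.linalg.norm(pred[b] - gt[b])   # error on second endpoint
        if err_a <= threshold * limb_len and err_b <= threshold * limb_len:
            correct += 1
    return correct / len(limbs)

# Toy example: three joints, two limbs; the third joint is badly predicted.
gt = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
pred = np.array([[1.0, 0.0], [0.0, 12.0], [10.0, 30.0]])
limbs = [(0, 1), (1, 2)]
score = pcp(pred, gt, limbs)
```

In this toy case only the first limb satisfies the criterion, so an occluded or badly localized joint penalizes every limb that ends on it.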
The third image, by contrast, is more complex to evaluate because of the occlusion of the limbs and the complexity of the pose. A different kind of problem is represented by the pose in the second and third images, where down-sampling the entire image by a very high factor leads to low border resolution of the subject, giving imprecision in the evaluation of the key-point positions. Note carefully that the presence of padding in the CNN leads to the shifted results in the figure, as in object detection [28]. In another important test we compared the DWT-based method between COCO and the LSPII dataset, where the data are more problematic and many subjects assume complex poses. These tests were also evaluated with respect to a PCK value, but in the end we used the PCP because it is more appropriate for a bottom-up approach and easier to implement and evaluate; similar results were initially obtained with respect to the PCK.

Figure 3: Examples of qualitative results of our model, including some incorrect classifications in image 3, where the PCP is very low. Note that the most problematic ones are the first and second, in the parts where limbs are occluded.

Figure 4: Another test with different kernel sizes in the WASP, which obtained better results but with padding that decreased the PCP of the result (see the colab).

Figure 5: Final results over the epochs for DWT+WASP and Daubechies, compared between the LSPII+COCO and COCO-only datasets.

| Method | COCO | LSPII |
|---|---|---|
| $CNN_{wasp+DWT,haar}$ | 76.45% | 71% |
| $CNN_{DWT,haar}$ | 75.02% | - |
| $CNN_{wasp}$ | 77% | 63% |
| $CNN_{wasp+DWT,Daubechies}$ | 74.1% | 71% |
| U-net | 73.7% | 69.3% |
| $CNN_{wasp,SRM}$ | 55.33% | 60.21% |
| $CNN_{DWT,Daubechies}$ | 61% | - |

Table 2: Results with respect to the PCP metric, evaluated over 100 epochs with batches of 40 elements, each tested on the LSPII dataset and on COCO, for U-net and for the CNN with different structures. The "-" values are not interesting with respect to the previous results in the table (e.g., same % for U-net and $CNN_{DWT,haar}$, or for $CNN_{DWT,haar}$ and $CNN_{DWT,Daubechies}$ in COCO).

7. Conclusion

The wavelet procedure does increase the capabilities of the network with the DWT encoder, and this result shows how it can be extended to other fields that include this kind of structure. Different results are obtained using more epochs or layers in the convolution structure: going from 4 to 3 layers gave lower results in terms of PCP, so more complicated CNNs are not able to compete with the wavelet encoder and show clear signs of overfitting; we therefore preferred not to train for more than 100 epochs and to use just 4 convolution layers. We extended these results to a different topic, human pose estimation. The initial objective was to establish results with an object-detection approach to pose estimation, but it was discarded because, as already said, the top-down approach proved to be superior. In addition, we addressed the faulty encoder-decoder structure common to almost all CNNs used in this field, such as VGG-16 and ResNet, showing a lower loss of information during encoding when using the DWT and the analysis of the wavelet information of the images.

References

[1] M. Eichner, M. Marin-Jimenez, A. Zisserman, V. Ferrari, 2D articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision 99 (2012) 190–214.
[2] J. C. Isaacs, S. Y. Foo, Hand pose estimation for american sign language recognition, Thirty-Sixth Southeastern Symposium on System Theory, 2004. Proceedings of the (2004) 132–136.
[3] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[4] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51–63.
[5] F. Bonanno, G. Capizzi, G. Sciuto, C. Napoli, G. Pappalardo, E. Tramontana, A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in igss by using wrnn predictors and gpu parallel solutions, in: 2014 International Symposium on Power Electronics, Electrical Drives, Automation and Motion, SPEEDAM 2014, IEEE Computer Society, 2014, pp. 1077–1084. doi:10.1109/SPEEDAM.2014.6872127.
[6] C. Napoli, G. Pappalardo, E. Tramontana, R. Nowicki, J. Starczewski, M. Woźniak, Toward work groups classification based on probabilistic neural network approach, in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), volume 9119, Springer Verlag, 2015, pp. 79–89. doi:10.1007/978-3-319-19324-3_8.
[7] Q. Li, L. Shen, S. Guo, Z. Lai, Wavelet integrated cnns for noise-robust image classification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. Nowicki, J. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, in: Proceedings of the International Joint Conference on Neural Networks, volume 2015-September, Institute of Electrical and Electronics Engineers Inc., 2015. doi:10.1109/IJCNN.2015.7280461.
[9] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 66–74.
[10] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, in: Proceedings - 2013 7th International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2013, 2013, pp. 529–534. doi:10.1109/CISIS.2013.96.
[11] G. Capizzi, G. Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.
[12] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction 13196 LNAI (2022) 310–325. doi:10.1007/978-3-031-08421-8_21.
[13] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89–94.
[14] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, volume 2768, CEUR-WS, 2020, pp. 41–45.
[15] B. Artacho, A. Savakis, Unipose: Unified human pose estimation in single images and videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] J. Dong, W. Jiang, Q. Huang, H. Bao, X. Zhou, Fast and robust multi-person 3d pose estimation from multiple views, 2019. arXiv:1901.04111.
[17] G. Capizzi, C. Napoli, F. Bonanno, Innovative second-generation wavelets construction with recurrent neural networks for solar radiation forecasting, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 1805–1815. doi:10.1109/TNNLS.2012.2216546.
[18] B. Artacho, A. Savakis, Omnipose: A multi-scale framework for multi-person pose estimation, 2021. arXiv:2103.10180.
[19] Y. Zhou, W. Huang, P. Dong, Y. Xia, S. Wang, D-unet: A dimension-fusion u shape network for chronic stroke lesion segmentation, IEEE/ACM Transactions on Computational Biology and Bioinformatics 18 (2021) 940–950. doi:10.1109/TCBB.2019.2939522.
[20] T. Williams, R. Li, Wavelet pooling for convolutional neural networks, 2018.
[21] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, 2015. arXiv:1411.4038.
[22] S. Kang, H. Park, J.-I. Park, Cnn-based ternary classification for image steganalysis, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/11/1225. doi:10.3390/electronics8111225.
[23] H. Liu, N. Chen, J. Huang, X. Huang, Y. Q. Shi, A robust dwt-based video watermarking algorithm, 2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No.02CH37353) 3 (2002) III–III.
[24] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Transactions on Information Theory 36 (1990) 961–1005. doi:10.1109/18.57199.
[25] B. Sturm, Stéphane mallat: A wavelet tour of signal processing, 2nd edition, Computer Music Journal - COMPUT MUSIC J 31 (2007) 83–85. doi:10.1162/comj.2007.31.3.83.
[26] I. Daubechies, T. Paul, Time-frequency localisation operators-a geometric phase space approach: Ii. the use of dilations, Inverse Problems 4 (1988) 661–680.
[27] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 674–693. doi:10.1109/34.192463.
[28] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, Lecture Notes in Computer Science (2014) 346–361. URL: http://dx.doi.org/10.1007/978-3-319-10578-9_23. doi:10.1007/978-3-319-10578-9_23.