
A Novel DWT-based Encoder for Human Pose Estimation

Giorgio De Magistris

Matteo Romano

Janusz Starczewski

Christian Napoli

0 Department of Computational Intelligence, Częstochowa University of Technology, al. Armii Krajowej 36, Częstochowa, 42-200, Poland
1 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
2 Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy


The proposed approach to pose estimation is based on a Convolutional Neural Network with an encoding-decoding structure, a spatial pyramid based on the WASP structure in its bottleneck, and a Discrete Wavelet Transform (DWT) encoder. These techniques have already shown their ability to address one of the main problems in the state of the art: the different Fields of View (FoV) required to analyze the different possible sizes of a specific subject. We want to address the faulty structure of the encoding part of modern CNN-based networks by using a DWT encoder and WASP. This work also aims to demonstrate, from a more general point of view, the advantages of a DWT encoder in any CNN-based approach to pose estimation and object detection, for example with several subjects in the same image or in video, given the almost redundant use of the most common CNN encoding structures such as ResNet-101, U-Net or VGG16-19. We run our tests on a U-Net-based CNN in order to evaluate the importance of the DWT encoder's results in the decoding part as well, through their cropping at the last layers of the network. This is necessary because border pixels that could be useful for evaluating the result are lost during encoding.

Keywords: Discrete Wavelet Transform, Convolutional Neural Network, WASP, Atrous Convolution

1. Introduction

by different works in image classification such as [7, 8], object detection approaches such as [1, 9], or other solutions [10, 11]. On these premises, our work focuses on the application of a U-Net-based model with a multi-level decomposition (MLD) of the image used as an encoder parallel to U-Net: the information gathered at each layer of the DWT encoder is concatenated to the corresponding layer of the U-Net encoder and propagated to the decoder. For feature extraction at the U-Net bottleneck we instead use a WASP structure, which produces the feature map of the image from different fields of view after the encoding part and passes it to the decoder. In this way the encoder does not lose the information eliminated by the down-sampling operation, because it is reconstructed from the wavelet information passed forward to the decoder.

2. Related Works

Two main aspects are considered: first, works that apply wavelets to fields such as image segmentation or object detection; second, the common structures and modern approaches currently used in pose estimation. This work analyses approaches from both aspects, used in combination with common pre-processing practice for the task of human pose estimation.

There are many pose estimation systems [12, 13, 14]. Among them, UniPose [15] is a pose estimation method built on a so-called WASP structure as the central part of the CNN bottleneck. It relies on layers with different dilation, controlled by a rate parameter according to the formula

\[ y[i, j] = \sum_{k_1=0}^{K_1} \sum_{k_2=0}^{K_2} I[i - r k_1,\; j - r k_2] \cdot k[k_1, k_2] \tag{1} \]

with $r$ the dilation rate, $I$ the image, $k$ the kernel over the x/y axes, and the subscripts 1 and 2 denoting the pixel position. In this way it is possible to obtain a larger FoV for the image and to connect it to a depth-wise convolution in order to obtain higher-level features for the latent representation. UniPose reached a Percentage of Correct Parts (PCP) that is considerably high for the current state of the art, but it is still far from being as robust as the more complex methods based on 3D models of poses. One of the most important of these, based on a Cascaded Pyramid Network and aimed at a 3D representation of the scene, is [16]; we do not consider it further because it is too computationally expensive and applicable only in specific environments, such as multi-camera detection of the scene.

A detailed explanation of the problems that can be encountered in this kind of task is given by PersonLab [17], where the atrous approach is also analysed for multiple subjects in the image. In that case one not only establishes a heat map (a Gaussian map) for the evaluation of the key points, but also has to consider a segmentation of the image and the evaluation of long-, medium- and short-range offsets for each key point of the subject, in order to establish the relationship between key points belonging to different subjects and those in the same segmented part of the image. In this context it is also important to mention [18], where the WASPv2 version is used in combination with HRNet structures without a following decoding structure, performing a cascade of atrous convolutions at increasing rates to gain efficiency. The remainder of this section reviews past work on the application of the DWT to a variety of state-of-the-art methods; Section 3 explains how our approach differs, and the conclusion discusses how the results are improved by it.

Regarding the applications of the Discrete Wavelet Transform (DWT), a fundamental contribution is D-UNet, a dual encoder used for different purposes such as image segmentation and object detection [19]. A DWT-based encoder is an important addition to the state of the art because it demonstrated its superiority in reconstructing information lost in the encoding part of the neural network, and also a superior capability with respect to other methods such as the steganalysis rich model (SRM), in which a layer extracts the image noise, providing additional evidence for the classification of multiple types of image key points such as shoulders, elbows and faces. In [20], instead, the problems related to down-sampling and up-sampling operations were demonstrated, together with the corresponding interpolation needed to reconstruct the original image from the global feature map at the start of the decoding part.

As a last consideration, it is important to cite the capabilities of spatial pyramids in obtaining low-resolution feature maps with global features when used with different FoVs and dilation parameters, the so-called Atrous Spatial Pyramid Pooling (ASPP), compared to simpler architectures, in handling different scales, sizes and aspect ratios of the subjects.

3. Model Architecture

The complete structure of the model is separated into three parts: the encoder and decoder structure typical of U-Net, with propagation of the information between the encoding and decoding layers; the bottleneck, where the waterfall atrous spatial pyramid (WASP) is used to obtain feature maps at different fields of view; and the parallel encoder used for the multi-level decomposition of each image channel, with the corresponding concatenation of the low-pass representation in the U-Net layers delegated to the forward propagation of the information to the decoder. The complete architecture is shown in the architecture figure, where each block represents a layer of U-Net with down-sampling, dropout and batch normalization.

We selected two datasets for training and validation. The first is COCO, containing 40000 images; the second, LSP, is more specific and generally more challenging for this kind of task. LSP was used in combination with a small part of COCO to obtain a single dataset of 3600 samples, with the remaining data used for testing and validation. The LSP dataset includes data modified with noise addition, which gives a good assessment of the network performance for single-person pose estimation; even in this case, one of the main problems is occluded limbs.

As mentioned in [21], Fully Convolutional Networks (FCN) are the most used kind of CNN in this field, and all are structured as encoder-decoder with an up-sampling procedure that reconstructs the resolution and restores the data lost in the encoding part. In this section we assume that the structure is similar to U-Net. It is important to remark, however, that these structures already establish state-of-the-art results without thorough consideration of other methods of image feature extraction. As a first approach we considered a solution based on the generation of a heat map, but its construction is very similar to the binary segmentation of [19], which leads to problems such as the need for a separate heat map for each key point in order to connect them correctly even when they overlap. A simple binary segmentation is therefore not useful when limbs and joints overlap, and working separately on each key point increases the computational complexity. For these reasons we chose a regression layer whose output is a vector of the scaled coordinates of the points in the image that the model has to interpolate.

The reason for choosing U-Net instead of the more common VGG or ResNet is the possibility of propagating the information produced by the wavelet coefficients through the layers between encoding and decoding. U-Net propagates context information to higher-resolution layers; exploiting this capability, the information propagated in our network is the information gathered from the wavelet encoder layers, concatenated layer by layer with the information obtained from the network's encoder. This leads to a distance function, weighted with respect to the subject's dimension in the image, that we have to minimize (e.g. a simple MSE on the positions) in order to find the best function that interpolates the key-point positions from the image information. In this way we want to build augmented information for an image of very low dimension and use it to infer information that is invisible to a simple U-Net encoder-decoder.

WASP is designed with the goal of reducing the number of parameters, in order to deal with memory constraints, and of solving the main issue of atrous convolutions by using different FoVs for the global feature representation of the image. In this part we deal with the manipulation of the latent representation and with how it influences the decoding part. Varying the parameters of this part leads to different results from a PCP viewpoint and to more robustness to distant subjects. In one of our tests in particular, the subject of the image was varied from subjects very near to the camera to more distant ones, taken from a different dataset used for human pose estimation from UAVs. The results for a single person in front of the camera are superior, but when the limbs of the subject are composed of very small groups of pixels a more pixel-by-pixel analysis is needed. In terms of results, the level of PCP obtained after 50 training epochs shows the challenging properties of the images. The parameters modified are not only the kernel sizes but also the dilation (rate) parameters, obtaining in this way a general FoV of the image. We also tried different sizes for the latent representation, which are needed to use a high dilation, a bigger FoV and more parallel FoV levels concatenated for the decoding part. The rate parameters of the layers vary among 1, 2 and 6, while the kernel dimensions vary among 3, 5 and 7. Another important fact is the presence of 1×1 convolutions before the concatenation, which are useful to manipulate the dimension of the data without loss of features from the local to the global feature map.

4. The DWT-based Encoder

The search for a way to build information from the analysis of the image in frequency at different scales led to many different choices, from the application of Gaussian or Sobel filters to the use of an SRM structure as in [22]. In the end, the most significant result was established by the DWT encoder with the multi-scale decomposition of the image. To build it, we used a sequence of layers that apply low-pass and high-pass filters to an image and generate Haar features relevant for the localization of the key points. The multi-level decomposition method, called in this work the DWT-based encoder, generates the detail coefficients over all three directions of the image (vertical, diagonal and horizontal) together with the approximation, with different thresholds, to pass to the next layer as explained in [23], [24] or [25]. From a more mathematical viewpoint, each frequency component can be defined in matrix form for a 2D input as

\[ X_{ll} = \Gamma X \Gamma^{T} \tag{2} \]

where $X$ is the input, $X_{ll}$ is the low-frequency component, and the high-frequency components are denoted $X_{lh}$, $X_{hl}$ and $X_{hh}$. The low-pass and high-pass filters $\Gamma$ and $K$ are banded matrices whose rows contain the filter coefficients, each row shifted by two positions with respect to the previous one:

\[
\Gamma = \begin{pmatrix}
\cdots & \cdots & \cdots & & & \\
\cdots & \gamma_{-1} & \gamma_{0} & \gamma_{1} & \cdots & \\
 & & \cdots & \gamma_{-1} & \gamma_{0} & \gamma_{1} & \cdots \\
 & & & & \cdots & \cdots & \cdots
\end{pmatrix},
\qquad
K = \begin{pmatrix}
\cdots & \cdots & \cdots & & & \\
\cdots & \kappa_{-1} & \kappa_{0} & \kappa_{1} & \cdots & \\
 & & \cdots & \kappa_{-1} & \kappa_{0} & \kappa_{1} & \cdots \\
 & & & & \cdots & \cdots & \cdots
\end{pmatrix}
\]
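As an illustration of the matrix form in (2), the following is a minimal sketch that builds Haar low-pass and high-pass matrices and applies them to a small 2D input. It is not the implementation used in the paper: the Haar filter choice, the function names `haar_filter_matrices` and `dwt2_matrix`, and the convention used to label the three high-frequency sub-bands are all assumptions made for the example.

```python
import numpy as np


def haar_filter_matrices(n):
    """Return (low, high) Haar analysis matrices of shape (n // 2, n).

    Each row holds the two Haar coefficients, shifted by two columns
    with respect to the previous row (filtering plus down-sampling by 2).
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    s = 1.0 / np.sqrt(2.0)
    low = np.zeros((n // 2, n))
    high = np.zeros((n // 2, n))
    for i in range(n // 2):
        low[i, 2 * i], low[i, 2 * i + 1] = s, s
        high[i, 2 * i], high[i, 2 * i + 1] = s, -s
    return low, high


def dwt2_matrix(x):
    """One-level 2D Haar DWT of an (H, W) array in matrix form.

    Assumed labelling: the first index refers to the filter applied
    on the left (rows), the second to the filter applied on the right.
    """
    rows, cols = x.shape
    g_r, k_r = haar_filter_matrices(rows)  # filters acting on rows
    g_c, k_c = haar_filter_matrices(cols)  # filters acting on columns
    x_ll = g_r @ x @ g_c.T  # approximation, as in equation (2)
    x_lh = k_r @ x @ g_c.T  # high-pass on the left, low-pass on the right
    x_hl = g_r @ x @ k_c.T  # low-pass on the left, high-pass on the right
    x_hh = k_r @ x @ k_c.T  # high-pass on both sides
    return x_ll, x_lh, x_hl, x_hh


if __name__ == "__main__":
    image = np.arange(16, dtype=float).reshape(4, 4)
    for name, band in zip(("ll", "lh", "hl", "hh"), dwt2_matrix(image)):
        print(name, band.shape)
```

Because the stacked Haar matrices are orthonormal, the four sub-bands together carry the same energy as the input, which is one way to see that the down-sampled representation does not discard information.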

The information produced for each layer will be concatenated. The DWT is used in the encoder, but not as a down-sampling operation that replaces the one in the encoder; instead, its down-sampled version is added hierarchically to the layers as a parallel encoder. In this way the three encoder layers, and also the decoder given that the structure is U-Net, receive the features stressed by each layer of the multi-level decomposition. As can be seen in the decomposition images, low-pass filters are applied for the first, fifth and ninth image of each layer and high-pass filters for the rest. This concatenated information is added hierarchically to the encoder layers, in particular to the 2nd, 3rd and 4th; each level divides the dimension of the image by $2^j$, with $j$ the number of the layer. To recap how wavelets work we refer to Haar wavelets as the mother wavelet, but we also analyse the performance of different filters in common wavelets such as Daubechies, which already proved their capabilities in isolating high- from low-frequency components in images and in isolating edges in all directions at different scales and resolutions, as in [2] and [24].

Another remarkable fact is that we do not need to use the IDWT in decoding, since we have a simple multi-level decomposition without variation of the filters. We therefore have just one gradient to consider for the loss minimization, evaluating the additional DWT features directly in the same CNN loss, as in the case of [15] with a different loss for each heat map, to find for each key point a unique connection to the others for the skeleton construction. Our approach is based on the analysis of the information produced by these filters and on the improvement given by analysing the image with the DWT encoder as a substitute for the one based on SRM.

To better understand the use of the DWT, we give a recap of the basic concepts. Consider a window function $g$ such as the one used for the windowed Fourier transform, usually found in the common form

\[ (Gf)(\omega, t) = \int_{-\infty}^{+\infty} f(s)\, g(s - t)\, e^{-i\omega s}\, ds \tag{11} \]

It can be interpreted as a Fourier transform of $f$ at the frequency $\omega$, localized by the window $g$ in the neighbourhood of $t$. Multiplying the signal $f$ by $g$ and computing the Fourier coefficients gives an indication of the frequency content of $f$ in a neighbourhood of $t$; shifting the window from 0 yields a sequence of coefficients that gives a representation of the image sensitive to certain frequencies. Now consider $g^{(\omega, t)}$ as the family of functions generated from a single $L^2(\mathbb{R})$ function by phase-space translations $(\omega, t)$, where $\omega = 1/s$ ("coherent states"). An important property of this family is the capability to completely reconstruct $f$ from the phase-space projections $\langle g^{(\omega, t)}, f \rangle$. This is due to the mapping being an isometry which, as mentioned in [24], [26] or [27], is given by the so-called resolution of the identity, implying that $f$ can be written as

\[ f = \frac{1}{2\pi} \int\!\!\int d\omega\, dt\; \langle g^{(\omega, t)}, f \rangle\, g^{(\omega, t)} \tag{12} \]

In a similar way, wavelets are families of functions $h^{(a, b)}$ derived from a single function but indexed by two labels, one for position and one for frequency, with $s = 1/a$ as scale factor and $b$ as translation, for which the resolution of the identity is written as

\[ f = C_h^{-1} \int\!\!\int \frac{da\, db}{a^2}\; \langle h^{(a, b)}, f \rangle\, h^{(a, b)} \tag{13} \]

Taking (14) into account, $f$ can be completely redefined by a set of coefficients over a direction, generated by simple filter application, redefining the transform as

\[ W(a, b) = \frac{1}{\sqrt{|a|}} \sum_{n=0}^{N-1} f(n)\, h\!\left[\frac{n - b}{a}\right] \tag{14} \]

Usually the parameters are chosen as $a = 2^j$ for the dilation, in order to have a discrete dilation by taking powers of a fixed base, and $b = k\,2^j$ for the translation of the wavelet, with $k$ as direction, rewriting $h^{(a, b)}$ as $h_{j,k}$. Given a function $f(s)$ as input signal, which will be our image, the transform has a large amplitude near sharp transitions of pixels such as edges; we obtain the coefficients with $\langle h_{j,k}, f \rangle \geq \text{threshold}$, varying the threshold over the frequency. From the corresponding high-pass and low-pass filters we obtain four results, the approximation and the three detail coefficients, for each of the three layers, to pass to the next one in the multi-level decomposition, each defined as:

\[
\begin{cases}
A_{2^j} f = \big(f(x, y) * \phi_{2^j}(-x)\, \phi_{2^j}(-y)\big)(2^{-j} n,\; 2^{-j} m) \\
D^{1}_{2^j} f = \big(f(x, y) * \phi_{2^j}(-x)\, \psi_{2^j}(-y)\big)(2^{-j} n,\; 2^{-j} m) \\
D^{2}_{2^j} f = \big(f(x, y) * \psi_{2^j}(-x)\, \phi_{2^j}(-y)\big)(2^{-j} n,\; 2^{-j} m) \\
D^{3}_{2^j} f = \big(f(x, y) * \psi_{2^j}(-x)\, \psi_{2^j}(-y)\big)(2^{-j} n,\; 2^{-j} m)
\end{cases} \tag{15}
\]

Another important remark concerns how the output is associated with and added to the neural network, which relates to the method of concatenation (fusion) and the corresponding result. With hierarchical fusion, an increase was observed both in the speed of loss convergence and in the PCP metric.

Table 1: Parameters used in the tests.
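The hierarchical fusion described above can be sketched as follows: a multi-level Haar decomposition produces one stack of sub-bands per level, each at half the resolution of the previous one, and each stack is concatenated channel-wise with an encoder feature map of the same spatial size. This is only an illustrative sketch under stated assumptions: the function names `haar_level`, `multilevel_decomposition` and `fuse`, the use of all four sub-bands at every level, and the channel count of the placeholder encoder features are hypothetical and not taken from the paper.

```python
import numpy as np


def haar_level(x):
    """One 2D Haar analysis step on an (H, W) array with even H and W."""
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0  # low-pass in both directions
    d1 = (a - b + c - d) / 2.0      # left column minus right column
    d2 = (a + b - c - d) / 2.0      # top row minus bottom row
    d3 = (a - b - c + d) / 2.0      # diagonal difference
    return approx, (d1, d2, d3)


def multilevel_decomposition(image, levels=3):
    """Return one (4, H / 2**j, W / 2**j) sub-band stack per level j = 1..levels."""
    stacks = []
    approx = image
    for _ in range(levels):
        approx, details = haar_level(approx)
        stacks.append(np.stack((approx,) + details, axis=0))
    return stacks


def fuse(encoder_features, wavelet_features):
    """Channel-wise concatenation of two (C, H, W) arrays with equal H and W."""
    if encoder_features.shape[1:] != wavelet_features.shape[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return np.concatenate([encoder_features, wavelet_features], axis=0)


if __name__ == "__main__":
    image = np.random.rand(160, 160)  # one channel of a 160x160 input
    pyramid = multilevel_decomposition(image, levels=3)
    for j, stack in enumerate(pyramid, start=1):
        # Placeholder encoder features with an arbitrary channel count (16).
        encoder = np.zeros((16,) + stack.shape[1:])
        print(j, stack.shape, fuse(encoder, stack).shape)
```

For a 160x160 input the three levels have spatial sizes 80x80, 40x40 and 20x20, so each stack can only be fused with an encoder layer that has undergone the same number of down-sampling steps.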

The loss is based on the squared distance between the ground-truth coordinates $(x, y)$ and the predicted coordinates $(\hat{x}, \hat{y})$ of a key point:

\[ d^2 = (x - \hat{x})^2 + (y - \hat{y})^2 \tag{16} \]

With this kind of loss we can define a suitable metric that depends on these coordinates, with the evaluation based on a threshold applied to understand whether the prediction is approaching the desired result. The PCK evaluates the joints by a formula of the form

\[ d^2(\hat{p}, p) \leq (0.5\,\ell)^2 \tag{17} \]

where $\hat{p}$ and $p$ are the predicted and ground-truth positions and $\ell$ is the length of the ground-truth segment. In other words, if the segment given by the predicted endpoints lies within a fraction of the length of the ground-truth segment, the distance of the prediction has to be smaller than half of the effective length (threshold = 0.5), as mentioned in [1]. Alternatively, the Object Keypoint Similarity (OKS) can be used as a metric, in the form

\[ \mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / 2 s^2 k_i^2\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} \tag{18} \]

Another possible metric, the one adopted in this work, is the PCP, in which a detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold. An example of the results in terms of accuracy is reported in this work.

[Table 1: rows for the CNN-based and U-net-based methods; the parameter entries are not recoverable.]

The tests are evaluated on both the COCO and LSP datasets but, in the end, the evaluation was done on a combination of the two. The accuracy results are evaluated not from the first epoch but from the 50th epoch, while the loss is shown from the beginning. It is important to mention that the approaches were tested on 160x160 images but also on different dimensions; increasing them, the results improved further with the DWT encoder structure with respect to a simple CNN, with the loss over 100 epochs going from a value of 110.07 to 120.3. This is not a big increase, but the PCP went from 65% to 71%, and the loss of an interpolation problem usually starts above 1000 as initial MSE value for 160x160 images and at 300 for 80x80 images. We can therefore deduce that the increased complexity of the interpolation problem is compensated by the information provided by the DWT, confirming its usefulness for the analysis of complex data. It could be interesting to try 1280x720 images as a future improvement. Some of the most challenging aspects of this kind of problem are reported in the 4th image at the top, evaluated with our method compared, at parity of training epochs and parameters, to a simple CNN with WASP and without the addition of DWT. Better results can be obtained, without changing the parameters, by eliminating the down-sampling and using images with a better resolution (instead of a 160x160 image), as in [15], where image dimensions are around 1280x720, even with denoised images.

Figure 2: Test with Haar DWT used with pretrained CNN Daubechies.

The images have been chosen to give a basic representation of the main problems of human body pose estimation, due to the complexity of the pose and the occlusion of the limbs of the subject. Looking at the first line, the first picture, beyond a small negligible error, shows a very good result in the pose estimation of the subject. The third image, instead, is more complex to evaluate due to the occlusion of the limbs and the complexity of the pose. A different kind of problem is represented by the pose of the second and third images, where the entire image, down-sampled by a very high factor, leads to a low border resolution of the subject and therefore to imprecision in the evaluation of the key-point positions. Note that the presence of padding in the CNN leads to the shifted results in the figure, as in object detection [28]. In another important test we compared the DWT-based method between COCO and the LSPII dataset, where the data are more problematic and many subjects assume complex poses. These tests were also evaluated with respect to a PCK value, but in the end we used the PCP because it is more appropriate for a bottom-up approach and easier to implement and evaluate; similar results were initially obtained with respect to the PCK.

7. Conclusion

The wavelet procedure actually increases the capabilities of the network with the DWT encoder, and this result shows how it is possible to extend it to different fields that include this kind of structure. Different results are obtained using more epochs or layers in the convolution structure: going from 4 to 3 layers gave lower results in PCP terms, and more complicated CNNs are not able to compete with the wavelet encoder, showing clear signs of overfitting; we therefore preferred not to train for more than 100 epochs and to use just 4 convolution layers. We extended these results to a different topic, human pose estimation; the initial objective was to establish results with an object detection approach for pose estimation, but it was discarded because, as already said, the top-down approach proved to be superior. In addition, we addressed the faulty general encoder-decoder structure common to almost all the CNNs used in this field, such as VGG-16 and ResNet, showing a lower loss of information during encoding by using the DWT and the analysis of the wavelet information of the images.

References

[1] Eichner, Marin-Jimenez, Zisserman, Ferrari, 2D articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision 99 (2012) 190-214.

[2] J. C. Isaacs, S. Y. Foo, Hand pose estimation for american sign language recognition, in: Proceedings of the Thirty-Sixth Southeastern Symposium on System Theory, 2004, pp. 132-136.

[3] Pepe, Tedeschi, Brandizzi, Russo, Iocchi, Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.

[4] Dat, Ponzi, Russo, Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51-63.

[5] Bonanno, G. Capizzi, G. Sciuto, Napoli, Pappalardo, Tramontana, A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in igss by using wrnn predictors and gpu parallel solutions, in: 2014 International Symposium on Power Electronics, Electrical Drives, Automation and Motion, SPEEDAM 2014, IEEE Computer Society, 2014, pp. 1077-1084. doi:10.1109/SPEEDAM.2014.6872127.

[6] Napoli, Pappalardo, E. Tramontana, Nowicki, Starczewski, Woźniak, Toward work groups classification based on probabilistic neural network approach, in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), volume 9119, Springer Verlag, 2015, pp. 79-89. doi:10.1007/978-3-319-19324-3_8.

[7] Li, Shen, Guo, Lai, Wavelet integrated cnns for noise-robust image classification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[8] Wozniak, Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, Nowicki, Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, in: Proceedings of the International Joint Conference on Neural Networks, volume 2015-September, Institute of Electrical and Electronics Engineers Inc., 2015. doi:10.1109/IJCNN.2015.7280461.

[9] Brandizzi, Bianco, G. Castro, Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 66-74.

[10] Napoli, Pappalardo, E. Tramontana, Using modularity metrics to assist move method refactoring of large systems, in: Proceedings - 2013 7th International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2013, 2013, pp. 529-534. doi:10.1109/CISIS.2013.96.

[11] Capizzi, G. Sciuto, Napoli, Tramontana, A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016). doi:10.3390/mi7070110.

[12] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction, 13196 LNAI (2022) 310-325. doi:10.1007/978-3-031-08421-8_21.

[13] Brociek, Magistris, Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89-94.

[14] Avanzato, Beritelli, Russo, Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, volume 2768, CEUR-WS, 2020, pp. 41-45.

[15] Artacho, Savakis, Unipose: Unified human pose estimation in single images and videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[16] Dong, Jiang, Huang, Bao, Zhou, Fast and robust multi-person 3d pose estimation from multiple views, 2019. arXiv:1901.04111.

[17] Capizzi, Napoli, Bonanno, Innovative second-generation wavelets construction with recurrent neural networks for solar radiation forecasting, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 1805-1815. doi:10.1109/TNNLS.2012.2216546.

[18] Artacho, Savakis, Omnipose: A multi-scale framework for multi-person pose estimation, 2021. arXiv:2103.10180.

[19] Zhou, Huang, Dong, Xia, Wang, D-unet: A dimension-fusion u shape network for chronic stroke lesion segmentation, IEEE/ACM Transactions on Computational Biology and Bioinformatics 18 (2021) 940-950. doi:10.1109/TCBB.2019.2939522.

[20] Williams, Li, Wavelet pooling for convolutional neural networks, 2018.

[21] Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, 2015. arXiv:1411.4038.

[22] Kang, Park, J.-I. Park, Cnn-based ternary classification for image steganalysis, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/11/1225. doi:10.3390/electronics8111225.

[23] Liu, Chen, Huang, Y. Q. Shi, A robust dwt-based video watermarking algorithm, 2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No.02CH37353) 3 (2002) III-III.

[24] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Transactions on Information Theory 36 (1990) 961-1005. doi:10.1109/18.57199.

[25] B. Sturm, Stéphane Mallat: A wavelet tour of signal processing, 2nd edition, Computer Music Journal 31 (2007) 83-85. doi:10.1162/comj.2007.31.3.83.

[26] Daubechies, T. Paul, Time-frequency localisation operators - a geometric phase space approach: II. The use of dilations, Inverse Problems 4 (1988) 661-680.

[27] Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 674-693. doi:10.1109/34.192463.

[28] He, Zhang, S. Ren, Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, Lecture Notes in Computer Science (2014) 346-361. URL: http://dx.doi.org/10.1007/978-3-319-10578-9_23. doi:10.1007/978-3-319-10578-9_23.