Leveraging Human Pose Estimation to Improve Health Recommender Systems

Gaetano Dibenedetto
University of Bari Aldo Moro - Department of Computer Science, Via Orabona 4, Bari, 70125, Italy

Abstract
In recent years, the integration of multimodal and multi-source data has gained attention for its potential to enhance the accuracy and relevance of recommender systems. While Health Recommender Systems (HRS) predominantly rely on patient-specific data, the inclusion of Pose Estimation (PE) data remains unexplored. My Ph.D. research aims to bridge this gap by investigating and incorporating PE as a data source within HRS. It will also focus on addressing critical challenges such as ensuring user privacy and optimizing the trade-off between system performance and real-time responsiveness.

Keywords
Health Recommender Systems, Human Pose Estimation, Privacy, Explainability

1. Introduction

The widespread use of wearable health devices and online health information systems has generated a growing need for more personalized health advice, a challenge that Health Recommender Systems (HRS) aim to address. Despite their potential, current HRSs face the challenge of aligning their recommendations with users' expectations, a key factor in building trust in such systems.

HRS find a strong synergy with Human Pose Estimation (HPE). Indeed, observing poses that are risky to the user's health can be crucial for providing effective recommendations and support, as shown by studies applied to the healthcare domain. For example, HPE is adopted in the field of occupational medicine for conducting ergonomic postural assessments. One of the primary reasons for absence from work is health problems stemming from repeated improper postures and movements [1]. To address these issues, ergonomists assess posture through direct on-site observation or by analyzing video recordings of workers performing their routine job tasks. Traditional postural assessment methods often rely on standardized indices that provide scores based on evaluations of various aspects, such as physiological body angles, load weight, and number of repetitions.

In HRS, a possible innovative direction is to leverage data gathered from HPE techniques not only to enhance performance but also to provide more personalized explanations based on both the user's characteristics and actions. Building on these ideas, I proposed a preliminary work [2] focused on posture correction for office workers. In the literature, many studies share our goal of posture classification [3, 4, 5], but their approaches rely on data collected under strict constraints, such as the use of specialized cameras, sensors, or other devices embedded into chairs. In contrast, I have proposed a simple approach based on data collected from conventional cameras and lightweight, fast AI-based classification models. By analyzing the results of the classification model, we are then able to suggest corrections to the pose to improve worker well-being. The goal I am therefore committed to pursuing during my Ph.D. is to introduce a novel approach that integrates data from HPE into HRS, paving the way for more precise and personalized recommendations.

Doctoral Consortium at the 23rd International Conference of the Italian Association for Artificial Intelligence, Bolzano, Italy, November 25-28, 2024
gaetano.dibenedetto@uniba.it (G. Dibenedetto)
https://linkedin.com/in/gaetano-dibenedetto/ (G. Dibenedetto)
ORCID: 0000-0001-6083-3600 (G. Dibenedetto)
© 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. State of the Art

2.1. Health Recommender Systems

HRS suggest personalized health information to individuals based on their specific needs and preferences. The goal of these systems is to improve health outcomes by delivering relevant, accurate, and up-to-date health information, while reducing the time and cost associated with the decision-making process. Given the interplay between Recommender Systems (RS) and the exploration of HPE, our focus will be on Health Status Prediction Systems and Physical Activity Recommender Systems.

Health Status Prediction Systems (HSPS) adopt machine learning algorithms to understand complex relationships between self-reported health concerns and their outcomes, allowing them to predict users' current health status. These systems are particularly beneficial for older patients and individuals with pre-existing conditions, as they integrate data from wearable sensors to provide real-time monitoring and alert users to potential health issues. As an example, in [6], a smart HSPS for predicting hypertension and type 2 diabetes is presented. This system collects health-related data, such as blood pressure, weight, and physical activity, via wearable devices and home-based sensors. The data is then processed and analyzed using several machine learning algorithms (e.g., decision trees, random forests, neural networks) to predict the probability of developing these conditions.

Physical Activity Recommender Systems (PARS) focus on the user's current health status and demographic factors like age and gender to suggest personalized daily exercise routines. These systems, often integrated into wearable devices, continuously collect user data such as calories burned, daily step count, and heart rate. In [7], a PARS designed for patients with arterial hypertension, a condition characterized by high blood pressure, is presented. The system uses data from wearable devices, such as heart rate monitors, to track physical activity levels and provide recommendations based on the patient's age, gender, and health status.

Although the literature on HSPS and PARS is extensive and relevant to understanding how the scientific community is moving on the topic, to the best of my knowledge, no existing work incorporates Human Pose Estimation into HRS, which is the primary focus of my Ph.D. research.
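To make the kind of HSPS described in [6] more concrete, the following is a minimal illustrative sketch, not the pipeline of the cited system: it trains a random-forest classifier on synthetic wearable-style features to flag a toy hypertension-risk label. All feature names, value ranges, and the labelling rule are assumptions made purely for demonstration.

```python
# Illustrative HSPS-style sketch (NOT the system of [6]): a random forest
# predicting a binary hypertension-risk label from synthetic wearable-style
# features. Features and the labelling rule are toy assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: systolic blood pressure, weight (kg), daily step count.
systolic = rng.normal(125, 15, n)
weight = rng.normal(75, 12, n)
steps = rng.normal(7000, 2500, n).clip(min=0)
X = np.column_stack([systolic, weight, steps])

# Toy label: "at risk" when blood pressure is high and activity is low.
y = ((systolic > 135) & (steps < 6000)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"toy accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

In a real HSPS the label would come from clinical records rather than a hand-written rule, and the feature set would be far richer, but the structure (wearable features in, risk prediction out) is the one described above.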
2.2. Human Pose Estimation

A key aspect of my Ph.D. studies is how to estimate human pose efficiently. Human Pose Estimation (HPE) is a well-established field in computer vision that focuses on predicting the positions of human body parts from images and videos. The rapid advancement of deep learning has demonstrated its superiority over traditional computer vision techniques in tasks such as image classification [8], semantic segmentation [9], and object detection [10].

In the literature there are both 2D and 3D approaches to HPE. 2D systems focus on identifying and tracking body keypoints in two-dimensional images or videos. These methods are computationally efficient and capable of providing real-time results; however, they may be less accurate than 3D methods, especially for complex poses or cases involving occlusion. For 2D HPE, the best results today are achieved by transformer-based models, which have recently gained prominence and proven effective in this task. According to benchmark results on the COCO dataset [11], the leading model is ViTPose [12]. It employs a plain, non-hierarchical vision transformer as a backbone to extract features for a given person instance, together with a lightweight decoder for pose estimation. It is highly scalable, with model sizes ranging from 100M to 1B parameters, taking advantage of the transformer's capacity for scalability and parallelism.

For 3D HPE, 2D-to-3D lifting approaches, inspired by recent advances in 2D HPE, have gained popularity. By leveraging the strong performance of 2D pose detectors, 2D-to-3D lifting approaches generally outperform direct 3D estimation methods [13]. A notable transformer-based architecture in this domain is MotionBERT, which achieves state-of-the-art results on the Human3.6M dataset [14]. MotionBERT incorporates a motion encoder during pretraining to reconstruct 3D motion from incomplete 2D observations, integrating geometric, kinematic, and physical insights into human movement. Following these research directions, I will continue to investigate transformer-based approaches.

3. Research Approach, Methods, and Rationale

The main objective of my Ph.D. project is to design and develop an HRS that includes HPE data. In the following, I present a possible pipeline, starting from how this data can be gathered, then possible strategies for integrating HPE into an HRS, and finally how to evaluate them.

Recording Data. The scarcity of visual data in the medical sector, when linked to health data such as medical records, is largely due to privacy restrictions. One potential solution is to implement real-time adjustments at the recording source, making it easier to obtain new data within the healthcare environment. Such adjustments could include:

• Real-time preprocessing of the videos or images. As demonstrated in [15], facial blurring effectively safeguards privacy without causing a statistically significant difference in kinematic calculations. In the context of HRS, this approach could allow systems to gather more informative data for performance enhancement or for detailed explanations based on specific actions. Incorporating activity data from HPE can provide a more comprehensive and understandable experience, also for non-expert users in the domain.

• Storing only data derived from HPE models instead of the original image or video sources. Recent advancements in HPE models have shown remarkable results, with scalable architectures that offer flexibility in selecting the most suitable option for real-time inference. This enables camera-based systems to achieve high accuracy without the need for powerful servers. Studies such as [16] have demonstrated that deep neural networks processing only keypoints can perform comparably, or sometimes even better, than models relying on the original sources alone. Moreover, training a model on a limited number of keypoints is more efficient than training one on the large volume of pixels in each frame.

Strategies to include HPE data into HRS. A first approach for integrating HPE into HRS is to treat the pose as a categorical descriptive feature: each pose is classified into predefined categories, enabling the RS to analyze patterns and generate recommendations based on these categories, much like a human action recognition task [17]. For instance, poses indicating different levels of physical activity can be classified as sedentary, moderate, or vigorous, helping to provide personalized health recommendations, as in the sketch below.
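The following is a minimal sketch of this categorical strategy, an illustration of the idea rather than the method of [2] or [17]. It assumes 2D keypoints are already available from an HPE model, derives a simple geometric feature (trunk inclination), and maps it onto coarse posture categories that a downstream recommender could consume as context. The keypoint names, thresholds, and category labels are assumptions for illustration.

```python
# Illustrative sketch: turning 2D keypoints into a categorical posture
# feature for a recommender. Keypoint layout, thresholds, and labels are
# assumptions for demonstration, not values from the cited works.
import math

# Assumed keypoint format: {"name": (x, y)} in image coordinates,
# as produced by a generic 2D HPE model (COCO-style keypoint names).
def trunk_inclination_deg(kps: dict) -> float:
    """Angle of the shoulder-midpoint -> hip-midpoint segment w.r.t. vertical."""
    sx = (kps["left_shoulder"][0] + kps["right_shoulder"][0]) / 2
    sy = (kps["left_shoulder"][1] + kps["right_shoulder"][1]) / 2
    hx = (kps["left_hip"][0] + kps["right_hip"][0]) / 2
    hy = (kps["left_hip"][1] + kps["right_hip"][1]) / 2
    return abs(math.degrees(math.atan2(sx - hx, hy - sy)))

def posture_category(kps: dict) -> str:
    """Map the continuous angle onto coarse, recommender-friendly categories."""
    angle = trunk_inclination_deg(kps)
    if angle < 10:
        return "upright"
    if angle < 25:
        return "slightly_slouched"
    return "slouched"

# Toy usage: the category becomes one contextual feature of the user profile.
example_kps = {
    "left_shoulder": (310, 200), "right_shoulder": (390, 205),
    "left_hip": (330, 400), "right_hip": (380, 402),
}
user_context = {"age": 45, "posture": posture_category(example_kps)}
print(user_context)  # e.g. {'age': 45, 'posture': 'upright'}
```

A real system would learn such categories from data rather than hand-coded thresholds, but the output has the same shape: a discrete contextual feature the recommender can condition on.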
Another approach involves representing pose data as a vector of keypoints or as a graph embedding. Each detected keypoint or graph node corresponds to a specific body part or joint, and their spatial relationships are captured to create a comprehensive pose representation. This representation can then be integrated with other data sources in a multi-source setting, as seen in [18]. With this method, the RS can better understand body posture and movement dynamics, enabling more personalized and context-aware recommendations.

Performance Evaluation and Considerations. A key aspect of this project is the evaluation of the trade-off between pose detection accuracy and the responsiveness of the recommender system. High-quality models can yield more precise recommendations but may compromise computational efficiency, making them unsuitable for real-time environments. Therefore, it is crucial to find the right balance between accuracy and responsiveness to meet the specific requirements of the application (a simple way to measure the responsiveness side of this trade-off is sketched below). This balance will be explored in the context of a real-world scenario: since my scholarship is supported by a healthcare-affiliated company, there is a promising opportunity to develop and test a prototype of this system in a practical setting. Additionally, the system will include an explanatory module that allows users to understand the reasoning behind specific recommendations, enhancing both user engagement and comprehension. In the context of active aging, pose detection and wearable devices can play a vital role in combating sedentary lifestyles. By continuously monitoring an individual's routine and posture, the system can identify inactivity patterns and offer personalized interventions to encourage physical activity and improve overall health.
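Returning to the accuracy/responsiveness trade-off, the following sketch shows one way the responsiveness side could be quantified: it times any pose-estimation callable over a batch of frames and reports mean latency and throughput. The "models" here are placeholder NumPy workloads, not real HPE networks; in practice they would be replaced by, for example, differently sized ViTPose variants.

```python
# Minimal latency-benchmark sketch for comparing pose-estimation models of
# different sizes. The "models" below are placeholder workloads standing in
# for real HPE networks (e.g., small vs. large ViTPose variants).
import time
import numpy as np

def benchmark(model, frames, warmup=3):
    """Return (mean latency in ms, frames per second) of `model` over `frames`."""
    for f in frames[:warmup]:          # warm-up runs are not timed
        model(f)
    start = time.perf_counter()
    for f in frames:
        model(f)
    elapsed = time.perf_counter() - start
    mean_ms = 1000 * elapsed / len(frames)
    return mean_ms, 1000 / mean_ms

# Placeholder "models": heavier matrix products simulate heavier backbones.
def tiny_model(frame):
    return frame[:64, :64] @ np.ones((64, 64))

def large_model(frame):
    return np.tile(frame, (3, 3))[:640, :640] @ np.ones((640, 640))

frames = [np.random.rand(256, 256).astype(np.float32) for _ in range(50)]
for name, m in [("tiny", tiny_model), ("large", large_model)]:
    ms, fps = benchmark(m, frames)
    print(f"{name:>5}: {ms:6.2f} ms/frame ({fps:5.1f} FPS)")
```

In the real system, accuracy on a pose benchmark (e.g., COCO keypoint AP) would be tracked alongside these latency figures to select the configuration that still meets the real-time budget of the deployment environment.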
4. Research Questions and Ongoing Works

To assess the effectiveness of the proposed methodology, the following research questions are posed:

RQ1 - Does a specific HPE method yield better results than others?
Several state-of-the-art HPE techniques exist (Section 2.2), ranging from 2D to 3D PE, with sub-categories such as Single-Person, Multi-Person, Top-Down, and Bottom-Up approaches. Choosing the best HPE method to integrate into HRS depends on various factors, such as the environment, data availability, and contextual requirements. To better understand the strengths and limitations of different HPE methods, I conducted a comprehensive literature review on pose estimation techniques, comparing methods for 2D/3D pose detection from images and videos across different scenarios (e.g., single/multi-person, monocular/multi-view input). This review, summarized in the paper "Comparing Human Pose Estimation through Deep Learning Approaches: An Overview", currently under submission to Elsevier Computer Vision and Image Understanding, will guide the selection of the most appropriate technique for the real-time environment I am working on. I also gained practical experience with HPE methods through the development of a posture correction system for office workers [2].

RQ2 - Are HPE data available for developing my work?
Section 2.1 highlights a lack of existing work incorporating HPE data. My scholarship is funded by Naps Lab S.r.l.s., a specialized company working on healthcare applications and HPE, which eases the process of defining an environment for working on real-world HRS where pose estimation data can be collected, thus addressing the issue of data scarcity. Nowadays, healthcare facilities, even those for elderly people, are technologically well equipped, making it easier to obtain this type of data from a hardware perspective. To start working on this, while developing [2] I have already created a new dataset captured from a webcam, which is currently available on Zenodo.

RQ3 - How can HPE data be incorporated into a recommender system?
At present, we do not yet have a clear idea of how HPE data can be integrated into a successful HRS. We are investigating the idea of using context-aware recommendation models based on Transformer architectures, where the visual data regarding the subject is provided directly as input to the model. In this way, the output of the recommender system can be an item from any domain where a correlation with human posture can be observed, e.g., physical exercises. This point, however, is the one on which a discussion with the AIxIA community may be most helpful in successfully shaping my future studies.

RQ4 - How can I measure the performance of the model?
With limited directly comparable prior research, it is critical to define appropriate evaluation methodologies. I will evaluate single modules in offline mode by using specialized benchmarks and datasets. An optimal solution to evaluate the whole architecture will be to obtain feedback from experts in the healthcare field. Thanks to Naps Lab S.r.l.s.'s expertise, I will have the opportunity to evaluate the system not only in-vitro but also in real-world (in-vivo) environments. This will involve comparing HRS performance using only medical data against systems that also integrate HPE data. Feedback from users and medical professionals will be instrumental in refining and improving the research.

RQ5 - How can I deal with privacy and trustworthiness?
Privacy management is extremely important, particularly in the healthcare sector. The collaboration with Naps Lab S.r.l.s. will enable the collection of real data, but at the same time it makes it mandatory to implement techniques to protect personal and medical data. This is a very critical aspect of my work, which cannot be overlooked, and other researchers' experiences and suggestions on the topic will be strongly appreciated. As stated in [19], the necessity of being able to explain the technology used for medical decision-making processes represents a normative standard unquestioned in its principal relevance. In my work on a posture correction system for offices [2], the dataset has been published on Zenodo1, without including any original video or image data that might raise privacy concerns, although this can strongly affect the replicability of the proposed approaches.

1 https://zenodo.org/records/11075018

5. Long-Term Goals

As long-term goals, I aim to advance state-of-the-art HPE models by exploring novel architectures based on foundational large multimodal models such as Meta's Sapiens [20]. While I have already explored the use of ViTPose [12], a state-of-the-art 2D HPE model known for its high accuracy and scalability, with configurations ranging from 100 million to 1 billion parameters, a further solution could be the use of 3D HPE, e.g., MotionBERT [21], providing a more comprehensive measure for posture comparison.
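As an illustration of what a basic posture-comparison measure looks like in 2D (the simpler baseline that richer 3D representations aim to improve upon), the following sketch compares two poses by a handful of joint angles, which makes the measure invariant to position and scale. The joint triplets, keypoint names, and the averaging choice are assumptions for illustration only.

```python
# Illustrative 2D posture-comparison sketch: describe each pose by a few
# joint angles (invariant to translation and scale) and compare two poses
# by the mean absolute difference of those angles. Joint triplets and
# keypoint names (COCO-style) are chosen only for this example.
import math

def angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c, each (x, y)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

# Example joint triplets used as angle features.
TRIPLETS = [
    ("left_shoulder", "left_elbow", "left_wrist"),
    ("left_shoulder", "left_hip", "left_knee"),
    ("left_hip", "left_knee", "left_ankle"),
]

def pose_distance(kps_a, kps_b):
    """Mean absolute difference of the selected joint angles between two poses."""
    diffs = []
    for i, j, k in TRIPLETS:
        ang_a = angle(kps_a[i], kps_a[j], kps_a[k])
        ang_b = angle(kps_b[i], kps_b[j], kps_b[k])
        diffs.append(abs(ang_a - ang_b))
    return sum(diffs) / len(diffs)
```

Such angle-based measures remain viewpoint dependent, which is one of the limitations that motivates moving to epipolar-geometry comparisons and, eventually, full 3D estimation.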
In the current work, a simulation of the epipolar geometry computation has been carried out to compare different postures captured from different viewpoints. While effective, it may not be as precise as a true 3D rotation-based approach. Additionally, I plan to expand the Health Recommender System [2], which is currently able to rank poses and suggest corrections to users. My goal is to extend this functionality to detect potential long-term postural deviations such as scoliosis, kyphosis, and lordosis. Based on these detections, the system could recommend a nearby specialist for targeted correction. A possible recommendation could result from detecting categorical anomalies, such as imbalances in the shoulders, hips, or head. If these posture misalignments are identified as long-term issues, they may affect physical appearance. Preventative measures, like targeted physical exercises designed to strengthen specific muscles, could help address these imbalances. Recommendations for such exercises could be presented as a list of items within a recommender system, with detailed descriptions generated by a large language model, making the exercises accessible to non-expert users and promoting regular engagement. Additionally, if the user already performs some of these exercises during personal training, the system could incorporate contextual information, such as daily workout routines, to ensure personalized recommendations that avoid redundancy.

6. Conclusion

I started my Ph.D. in October 2023 at the University of Bari Aldo Moro, under the supervision of Prof. Pasquale Lops and Dr. Marco Polignano, belonging to the SWAP research group2, as well as Dr. Giuseppe Cavallo from Naps Lab S.r.l.s.3. I expect to complete my Ph.D. at the end of 2026. After almost a year of working on the topic, I am at the stage where I am beginning to approach operational techniques and strategies for achieving a successful HRS. This stage is crucial for the success of my journey, and therefore I believe that I can benefit greatly from discussion and valuable suggestions from experts in Computer Vision, Recommender Systems, and eHealth and, in particular, from the vibrant AIxIA community. I would, therefore, be delighted to attend the conference and the Doctoral Consortium.

Acknowledgments

This research is partially funded by PNRR - Mission 4 ("Education and research") - Component 2 ("From research to business"), Investment 3.3 ("Introduction of innovative doctorates that respond to the innovation needs of companies and promote the hiring of researchers by companies") D.M. n. 117/2023 - H91I23000170007. I extend my sincere gratitude to Naps Lab S.r.l.s.3 for their support and collaboration in the realisation of this research.

2 https://swap.di.uniba.it/
3 https://www.napslab.it

References

[1] L. Punnett, D. H. Wegman, Work-related musculoskeletal disorders: the epidemiologic evidence and the debate, Journal of Electromyography and Kinesiology 14 (2004) 13-23.
[2] G. Dibenedetto, M. Polignano, P. Lops, G. Semeraro, Human pose estimation for explainable corrective feedbacks in office spaces, in: Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 2024, pp. 264-275.
[3] S.-H. Han, H.-G. Kim, H.-J. Choi, Rehabilitation posture correction using deep neural network, in: 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2017, pp. 400-402.
[4] J. C. T. Mallare, D. F. G. Pineda, G. M. Trinidad, R. D. Serafica, J. B. K. Villanueva, A. R. D. Cruz, R. R. P. Vicerra, K. K. D. Serrano, E. A. Roxas, Sitting posture assessment using computer vision, in: 2017 IEEE 9th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), IEEE, 2017, pp. 1-5.
[5] Y. M. Kim, Y. Son, W. Kim, B. Jin, M. H. Yun, Classification of children's sitting postures using machine learning algorithms, Applied Sciences 8 (2018) 1280.
[6] S. P. Chatrati, G. Hossain, A. Goyal, A. Bhan, S. Bhattacharya, D. Gaurav, S. M. Tiwari, Smart home health monitoring system for predicting type 2 diabetes and hypertension, Journal of King Saud University-Computer and Information Sciences 34 (2022) 862-870.
[7] L. R. Ferretto, E. A. Bellei, D. Biduski, L. C. P. Bin, M. M. Moro, C. R. Cervi, A. C. B. De Marchi, A physical activity recommender system for patients with arterial hypertension, IEEE Access 8 (2020) 61656-61664.
[8] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012).
[9] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[10] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Springer, 2014, pp. 740-755.
[12] Y. Xu, J. Zhang, Q. Zhang, D. Tao, ViTPose: Simple vision transformer baselines for human pose estimation, Advances in Neural Information Processing Systems 35 (2022) 38571-38584.
[13] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, M. Shah, Deep learning-based human pose estimation: A survey, ACM Computing Surveys 56 (2023) 1-37.
[14] C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2013) 1325-1339.
[15] J. Jiang, W. Skalli, A. Siadat, L. Gajny, Effect of face blurring on human pose estimation: Ensuring subject privacy for medical and occupational health applications, Sensors 22 (2022) 9376.
[16] R. Hachiuma, F. Sato, T. Sekii, Unified keypoint-based action recognition framework via structured keypoint pooling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22962-22971.
[17] H. Mollaei, M. M. Sepehri, T. Khatibi, Patient's actions recognition in hospital's recovery department based on RGB-D dataset, Multimedia Tools and Applications 82 (2023) 24127-24154.
[18] G. Spillo, C. Musto, M. Polignano, P. Lops, M. de Gemmis, G. Semeraro, Combining graph neural networks and sentence encoders for knowledge-aware recommendations, in: Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization (UMAP), 2023, pp. 1-12.
[19] H. Kempt, N. Freyer, S. K. Nagel, Justice and the normative standards of explainability in healthcare, Philosophy & Technology 35 (2022) 100.
[20] R. Khirodkar, T. Bagautdinov, J. Martinez, S. Zhaoen, A. James, P. Selednik, S. Anderson, S. Saito, Sapiens: Foundation for human vision models, 2024.
URL: https://arxiv.org/abs/2408.12569. arXiv:2408.12569.
[21] W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, Y. Wang, MotionBERT: A unified perspective on learning human motion representations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15085-15099.