<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparison of Frame-by-Frame and Aggregation Approaches for Gesture Classification Using Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michał Wierzbicki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Osuchowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Opole University of Technology</institution>
          ,
          <addr-line>Prószkowska 76 St., Opole, 45-758</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In medicine, gesture and body pose analysis, especially in the context of telemedicine and rehabilitation, has gained importance after the COVID-19 pandemic. Gesture recognition and body pose estimation demand high computational power to process large and complex data. To address these problems, the Authors examined machine learning methods and aggregation techniques for sequential data. This paper compares two gesture analysis methods: frame-by-frame and gesture sequence analysis. The iMiGUE dataset, which contains skeleton data obtained using the OpenPose tool, is used. The gesture classification results obtained using the RandomForestClassifier model with default and optimized parameters are evaluated in detail. Sequential gesture analysis methods outperformed classical frame-by-frame analysis in terms of precision and computational efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>sequential analysis</kwd>
        <kwd>machine learning</kwd>
        <kwd>frame-by-frame analysis</kwd>
        <kwd>gesture recognition</kwd>
        <kwd>skeleton data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent years have yielded the most advanced solutions in the domain of artificial
intelligence (AI) to date, such as the transformer architecture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and a plethora of large
language models (LLMs) based on that idea, namely ChatGPT (GPT stands for Generative
Pretrained Transformer) or Gemini [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. While the results obtained by those models are
marvelous, they have incurred significant costs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], impossible to bear for many institutions.
The costs are mainly related to the number of parameters used in the training process - in some
cases reaching billions - as well as the length of the training itself [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. LLMs have found applications in
a variety of domains, for example in healthcare [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Yet analyzing language alone does not exhaust the possibilities of advanced AI
solutions for healthcare. One area that has benefited from AI development is gesture and
body position recognition. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] distinguishes three groups of gestures that are of
interest to us: head, hand and body.
      </p>
      <p>
        Gestures convey more information than can be inferred from speech alone [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Significant effort has been put towards the development of more robust and precise techniques for
gesture recognition, whether for the hand specifically [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or the body and head [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
These techniques proved to be crucial for the rehabilitation of post-stroke patients [16] and
people with cerebral palsy [17], allowing medical practitioners to remotely assess a patient's
condition and state. In a more general sense, gesture and body recognition yields new
opportunities for rehabilitation processes [18],[19], providing more approachable ways of
monitoring the progress and overall condition of the patient. As demonstrated by the authors in
[20], virtual reality (VR) offers a broad spectrum of applications in neurological rehabilitation.
This is largely attributed to its ability to easily replicate natural environments, design specific
movement patterns, and create engaging exercises in which patients can actively participate.
During these exercises, it is crucial for the patient to be monitored through gesture recognition,
allowing for effective tracking of their progress and identification of any obstacles.
      </p>
      <p>The COVID-19 pandemic presented a plethora of challenges and obstacles for the healthcare
workforce [21]. Unexpected circumstances forced the healthcare sector to adjust to an
unfamiliar environment, rendering then-present methods of rehabilitation impossible to
execute in the new context. Post-COVID-19 rehabilitation has been deemed “an effective
therapeutic strategy to improve the functional capacity and quality of life of patients” [22],
yielding improvements in both the physical and psychological aspects of quality of life [23].
Pandemic circumstances put emphasis on the development of telemedicine and
remote healthcare [24], with rehabilitation being one of the most crucial aspects. Remote
rehabilitation was implemented during the pandemic and to this day is relied upon by
medical practitioners as a means that “was safe, feasible, and acceptable for those who accessed
it” [25],[26]. In the context of remote rehabilitation, gesture recognition and body pose
estimation can be an important element of monitoring a patient's wellbeing and recovery [27],
[28]. At the same time, in the context of the pandemic, the proposed methods may not be accessible
for some patients, whose hardware may be unable to perform the work required for
proper assessment based on gestures or body pose. Therefore, methods that decrease the
computational load are necessary to ensure the availability and quality of the service.</p>
      <p>In this paper we examine simple methods based on aggregation of gestures that
prove to be beneficial both in terms of required computational power and storage as well as
performance. Given a dataset [29] consisting of skeleton data, we aggregate the data describing
each gesture in the dataset using five methods: minimum, maximum, integral, average
and regression, and we compare the obtained results with results obtained on the original data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Significant effort has been put towards examining techniques of processing data for
gesture recognition or body pose estimation. The authors in [30] state that most research
focuses on “RGB data, depth data, or skeleton data”. In this paper we focus solely on
skeleton data. Ionescu et al. [31] proposed a strategy for image segmentation for body pose
estimation that relies on regression to obtain joint coordinates. Wang et al. [32] distinguish two
regression approaches for single-person body pose estimation in 2D: a “direct
regression-based approach, which involves regressing key points directly from features” and a heatmap
approach that infers joint positions from the heatmap. For the 3D case the two mentioned
approaches have found applications, as well as a third approach that combines 2D and 3D
approaches into one complex framework. This work focuses on 3D skeleton data.</p>
      <p>While much effort is put into the discovery of new techniques and methods of
processing the data, significantly less effort goes towards easing the computational
load of gesture recognition or body pose estimation. In the same vein, the
aspect of compressing skeleton data, as in our case, or of aggregating such data still
leaves much to be desired. While new techniques provide impressive results (e.g.
variations on the Spatio-Temporal Graph Convolutional Network [33]), they offer very little in
terms of improving the general methodology of data processing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>In this study, the iMiGUE dataset [29] was utilized. This dataset was specifically created
to analyze micro-gestures in the context of emotional AI. It contains videos of press conferences
following tennis Grand Slam matches, where players respond to questions from journalists.
iMiGUE was designed to investigate hidden emotions by analyzing micro-gestures, which are
small, often unconscious movements that reflect internal emotional states. The videos were
collected from various open video platforms, such as YouTube; the dataset includes 359 videos,
covering 258 winning and 101 losing matches, for a total of 2092 minutes of footage. All videos
have a resolution of 1280x720 pixels and were recorded at a rate of 25 frames per second. The
data is labeled at two levels: micro-gesture categories at the video clip level and emotion
categories at the entire video level. A total of 18,499 micro-gesture samples were labeled and
assigned to 32 different categories. It’s worth noting that iMiGUE is a dataset that protects
individuals’ privacy by removing biometric data such as face and voice. It contains data from 72
athletes from 28 countries, allowing for analysis of micro-gestures in the context of diverse
cultures and genders. Additionally, the dataset was notably imbalanced, which led to
significantly low performance in both detection and classification tasks. Nonetheless, the
application of class balancing techniques, as discussed in [34],[35], could potentially enhance
the performance in these areas.</p>
        <p>
          In the research, the RGB material contained in the iMiGUE dataset was not used.
Instead, the focus was solely on the skeleton data. The skeleton data in the iMiGUE dataset is
constructed to facilitate the recognition and understanding of micro-movements. This data is
obtained using the OpenPose tool [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which extracts pose data for each frame of a video
sequence. The pose data includes key points corresponding to different body parts, creating a
skeletal representation of a person’s posture and movements over time. The dataset uses a
sequence of key body points (or pose data) for each micro-movement instance, where each
frame in the sequence contains the coordinates of key joints. These key points capture the
spatial configuration of the body, allowing for the analysis of subtle movements that
characterize micro-movements. Skeletal data is advantageous because it is insusceptible to
dynamic background changes, making it more suitable for gesture recognition tasks in different
environments. By focusing on skeleton data, the iMiGUE dataset provides a detailed and private
way to analyze micro-movements, which are crucial for understanding hidden or suppressed
emotions.
        </p>
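        <p>As a rough sketch of what such a sequence looks like, the per-frame key points can be
held in a NumPy array; the joint count and layout below are illustrative assumptions, not the
exact iMiGUE file format:</p>
        <preformat>
import numpy as np

n_frames, n_joints = 120, 25     # OpenPose BODY_25-style joint count (assumed)
rng = np.random.default_rng(0)

# one gesture sequence: (frames, joints, xy coordinates)
sequence = rng.random((n_frames, n_joints, 2))

# flatten each frame into one feature vector for frame-by-frame classification
features = sequence.reshape(n_frames, -1)    # shape: (120, 50)
        </preformat>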
        <p>In this study, the Authors used the training and validation data split proposed in the MiGA
(Micro-gesture Analysis for Hidden Emotion Understanding) challenge [36]. The data were
restricted to a few selected micro-gestures that the Authors believed had sufficient support in the
validation part of the dataset to perform correct inference. We selected the following
micro-gestures: ear touching (1720 samples, denoted as gesture 8), torso touching (3329 samples,
denoted as gesture 20), finger crossing (184 samples, denoted as gesture 24), lip pressing (2746
samples, denoted as gesture 29), shoulder shaking (4261 samples, denoted as gesture 31), and
unspecified gestures (9670 samples, denoted as gesture 99). These classes were selected because
they had enough samples, which is crucial for performing correct analysis and inference.</p>
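        <p>For illustration, restricting the data to these classes is a simple filter; the column
name and toy values below are hypothetical, not the dataset's actual schema:</p>
        <preformat>
import pandas as pd

SELECTED_GESTURES = {8, 20, 24, 29, 31, 99}   # gesture labels used in this study

# toy frame; "label" is a hypothetical column holding the micro-gesture id
frames = pd.DataFrame({"label": [8, 3, 99, 31], "x0": [0.1, 0.2, 0.3, 0.4]})
selected = frames[frames["label"].isin(SELECTED_GESTURES)]
        </preformat>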
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sequence data processing</title>
        <p>The data in this dataset was originally divided into sequences of gestures (e.g., touching
ears sequence, touching torso sequence, crossing fingers sequence, etc.). In the study, the
Authors investigated whether frame-by-frame inference (denoted as Base) would yield worse
results than using simple methods that allow for the analysis of entire sequences. The simple
methods proposed by the Authors included calculating the mean value for the entire sequence
(denoted as Avg), determining the minimum (denoted as Min) and maximum (denoted as Max)
values for the entire sequence, and performing linear regression on each sequence (denoted as
Reg) and taking the slope value (a).</p>
        <sec id="sec-3-2-1">
          <title>The formula for linear regression in this context can be represented as:</title>
          <p>y =ax +b</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Where, a is the slope and b is the intercept.</title>
        <p>
          Additionally, the Authors used the integral (trapezoidal rule) for calculating the sequence
value (denoted as Int). In this case the vector of sequence values y = [y_0, y_1, ..., y_n] is
uniformly distributed over the interval [0, 1] and the integral value is calculated using the formula
        </p>
        <disp-formula id="eq2">
          <tex-math>\mathrm{Integral} \approx \int_0^1 f(x)\,dx \approx \frac{1}{n} \sum_{i=0}^{n-1} \frac{y_i + y_{i+1}}{2} \quad (2)</tex-math>
        </disp-formula>
        <p>where:
• y_i and y_{i+1} are successive values in the vector y,
• n is the number of intervals.</p>
          <p>When sequence-based inference was used, the support for the data changed. The new
support values were touching ears (34 samples), touching torso (55 samples), crossing fingers (10
samples), pressing lips (82 samples), shaking shoulders (193 samples), and undefined gestures
(258 samples). The application of these methods for aggregating gestures belonging to the same
sequence allowed for the analysis of entire gesture sequences, rather than analyzing each frame
individually.</p>
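        <p>A minimal sketch of the five aggregations in NumPy follows; the (frames, features)
array layout and function name are assumptions for illustration, not the Authors' exact code:</p>
        <preformat>
import numpy as np

def aggregate_sequence(seq):
    """Collapse one gesture sequence of shape (n_frames, n_features)
    into a single feature vector per aggregation method."""
    n = seq.shape[0] - 1                      # number of intervals over [0, 1]
    t = np.linspace(0.0, 1.0, seq.shape[0])   # uniform grid over [0, 1]
    return {
        "Min": seq.min(axis=0),
        "Max": seq.max(axis=0),
        "Avg": seq.mean(axis=0),
        # trapezoidal rule, Eq. (2)
        "Int": ((seq[:-1] + seq[1:]) / 2.0).sum(axis=0) / n,
        # slope a of y = ax + b fitted per feature, Eq. (1)
        "Reg": np.polyfit(t, seq, 1)[0],
    }

# usage: five aggregated feature vectors from a toy 120-frame sequence
agg = aggregate_sequence(np.random.default_rng(0).random((120, 50)))
        </preformat>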
      </sec>
      <sec id="sec-3-3">
        <title>3.3. First Phase - Used model</title>
        <p>In the first part of the study, the model RandomForestClassifier [37] with parameters
n_estimators=100 and random_state=42 was utilized. This model is an ensemble learning method
that constructs multiple decision trees during training and outputs the mode of the classes
(classification) or the mean prediction (regression) of the individual trees. The parameter
n_estimators specifies the number of trees in the forest, while random_state ensures
reproducibility of the results. This approach was considered the initial step in the study,
allowing for a preliminary verification of the research hypothesis.</p>
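        <p>A minimal sketch of this Phase 1 setup follows; the random placeholder data stands in
for the per-frame or aggregated feature vectors:</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 50))      # placeholder feature vectors
y_train = rng.integers(0, 6, size=200)    # placeholder labels for six gestures

# Phase 1: the two fixed parameters stated above
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
        </preformat>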
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Second Phase - Grid Search</title>
        <p>
          In the second part of the study, a Grid Search method was applied to find the best
parameters. The parameter grid included the following:
• n_estimators: number of trees in the forest, with values [100, 200, 300, 400, 500],
• max_features: number of features to consider when looking for the best split, with options
[None, 'sqrt', 'log2'],
• max_depth: maximum depth of the tree, with values [None, 10, 20, 30, 40, 50],
• min_samples_split: minimum number of samples required to split an internal node, with
values [2, 5, 10],
• min_samples_leaf: minimum number of samples required to be at a leaf node, with values
[1, 2, 4],
• bootstrap: whether bootstrap samples are used when building trees, with options [True,
False],
• criterion: function to measure the quality of a split, with options ['gini', 'entropy',
'log_loss'],
• oob_score: whether to use out-of-bag samples to estimate the generalization accuracy, with
options [True, False],
• class_weight: weights associated with classes, with options [None, 'balanced',
'balanced_subsample'].
        </p>
        <p>The parameters were not searched in a brute-force manner (every combination of
parameters); instead, to save time (as the calculation for the Base model was time consuming), it
was decided to tune each parameter sequentially, as sketched below. First, the best value for the
first parameter was selected; then, using this value, the best value for the second parameter was
determined, and so on. This strategy, which can be used in model optimization [38], allowed each
of the proposed models to select the parameters that best fit its structure.</p>
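        <p>A sketch of this sequential strategy follows, assuming macro-averaged F1 as the
selection criterion (the scoring function is an assumption) and reusing the placeholder
X_train and y_train from the previous sketch:</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_features": [None, "sqrt", "log2"],
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy", "log_loss"],
    "oob_score": [True, False],
    "class_weight": [None, "balanced", "balanced_subsample"],
}

best = {}  # parameters fixed so far
for name, values in param_grid.items():
    # one-dimensional search over the current parameter only; invalid
    # combinations (e.g. oob_score=True with bootstrap=False) score as
    # NaN and are never selected
    search = GridSearchCV(
        RandomForestClassifier(random_state=42, **best),
        {name: values}, cv=3, scoring="f1_macro", error_score=float("nan"),
    )
    search.fit(X_train, y_train)
    best[name] = search.best_params_[name]
        </preformat>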
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Used metrics</title>
        <p>In the study, the following metrics were used to evaluate the algorithms:
Precision, Recall, F1 Score and Accuracy. Precision measures the exactness of the positive
predictions made by the model. It represents the proportion of correctly predicted positive
outcomes (true positives) to all outcomes that the model predicted as positive (true positives +
false positives) [39]. The formula for precision is as follows:</p>
        <disp-formula id="eq3">
          <tex-math>\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} \quad (3)</tex-math>
        </disp-formula>
        <p>where:
• True Positives are the number of correctly predicted positive cases,
• False Positives are the number of incorrectly predicted positive cases.</p>
        <p>Recall measures the ability of a model to correctly identify all instances of an object in
a dataset. It represents the ratio of the number of correctly detected instances of an object (true
positives) to the total number of actual instances of the object in the dataset (true positives +
false negatives) [39]. The formula for recall is as follows:</p>
        <disp-formula id="eq4">
          <tex-math>\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}} \quad (4)</tex-math>
        </disp-formula>
        <p>where:
• True Positives are the number of correctly predicted positive cases,
• False Negatives are the number of actual positive cases that were incorrectly predicted as
negative.</p>
        <p>The F1 Score is the harmonic mean of Precision and Recall. It strikes a balance
between Precision and Recall, which is especially important when we want to account for
both false positives and false negatives in our model evaluation. The F1 Score ranges from 0 to
1, with higher values indicating better performance. An ideal F1 Score of 1 means that the
model has achieved both perfect Precision and perfect Recall, suggesting that it is able to
correctly detect all instances of an object without generating false positives [39],[40]. The
formula for F1 Score is as follows:</p>
        <disp-formula id="eq5">
          <tex-math>F1\ \mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5)</tex-math>
        </disp-formula>
        <p>where:
• Precision is the ratio of true positive predictions to the total number of positive predictions
made (both true and false positives),
• Recall is the ratio of true positive predictions to the total number of actual positive cases
(both true positives and false negatives).</p>
        <p>The Accuracy metric is one of the simplest and most commonly used metrics for
evaluating a classification model. It measures the percentage of correctly predicted results over
the total number of cases in the data set. Accuracy is particularly useful when the data is
balanced, i.e. when the number of examples of each class is similar [40]. The formula for
Accuracy is as follows:</p>
        <disp-formula id="eq6">
          <tex-math>\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \quad (6)</tex-math>
        </disp-formula>
        <p>where:
• True Positives (TP) are the number of correctly predicted positive cases,
• True Negatives (TN) are the number of correctly predicted negative cases,
• False Positives (FP) are the number of incorrectly predicted positive cases,
• False Negatives (FN) are the number of incorrectly predicted negative cases.</p>
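        <p>For reference, all four metrics can be computed with scikit-learn; macro averaging
over the gesture classes is an assumption, as the averaging mode is not stated above:</p>
        <preformat>
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [8, 20, 24, 29, 31, 99, 8, 20]    # toy ground-truth labels
y_pred = [8, 20, 29, 29, 31, 99, 99, 20]   # toy predictions

precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
        </preformat>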
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The Authors conducted examinations aimed at testing how the results obtained by the
proposed aggregations compare to the Baseline results. By the Baseline results we understand the
results obtained by fitting the model on the Base dataset, whereas the aggregated results refer to
results obtained by fitting models on each of the aggregated datasets. For each Phase and for
each dataset one fitting was performed accordingly.</p>
      <p>In the tables below the results of each Phase are presented for each chosen metric. For
each metric the best score was marked in red, while the worst score was marked in green.</p>
      <sec id="sec-4-1">
        <title>4.1. First Phase</title>
        <p>In the first Phase a RandomForestClassifier was fitted on every aggregated dataset as
well as on the Base dataset. Table 1 presents the obtained results:
• for F1 Score the best result was obtained by the Regression method (0.55), while the
worst result was produced by the Baseline model (0.43),
• for Precision, surprisingly, three methods obtained the same best result
(0.57), namely the Regression, Integral and Maximum methods, while the Minimum
method produced the worst result (0.47),
• for Recall the best result was produced by the Regression and Integral methods
(0.57) and the worst one was produced by the Baseline method (0.44),
• for Accuracy the best score was obtained by the Regression and Integral methods
(0.57) and the worst one was obtained by the Baseline method (0.44).</p>
        <p>In the first Phase of the examination the most consistent and best-scoring
method turned out to be Regression, closely followed by the Integral method, which scored
slightly worse in terms of F1 Score and was equal on all other metrics. The Maximum method
scored almost as well as the two mentioned methods, placing third. The Minimum and
Average methods scored noticeably worse than the former methods, yet better than the
Baseline. The Baseline produced the worst scores in three out of four metrics, making it the
worst performing approach.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Second Phase</title>
        <p>In the second Phase a RandomForestClassifier with Grid Search was fitted on every
aggregated dataset as well as on the Base dataset. Table 2 presents the obtained results:
• for F1 Score the best result was obtained by the Maximum method (0.56), the second
best score by Regression (0.55), while the Baseline model was the worst
scoring one (0.45),
• for Precision two methods produced the best result, namely Integral
and Maximum (0.60), Regression produced the second best result (0.58), and the
lowest scoring was the Baseline (0.47),
• for Recall the best score was again obtained by the Maximum method (0.59), with
Integral the second best scoring method (0.58), while the lowest score was obtained by
the Baseline (0.46),
• in terms of Accuracy the best scoring method was Maximum (0.59), the second best was
Integral (0.58) and the lowest scoring was the Baseline (0.46).</p>
        <p>In Phase II of the examination the best scoring and most consistent method across all
metrics was Maximum, which obtained the best results in all four metrics. It was followed by
the Regression and Integral methods, which produced overall good results, slightly worse than
the former method. The overall worst performance in this Phase came from the Baseline
method, which placed last in all considered metrics. The Minimum and Average methods both
performed better than the Baseline model, but also noticeably worse than the first two
mentioned.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparison</title>
        <p>In the final part of the study, the F1 Scores from Phases I and II were compared
separately for each gesture, taking into account each presented method of sequential data
processing. Figure 1 illustrates the F1 Scores for the ear-touching gesture.</p>
        <p>After applying Grid Search, better results were obtained for the methods: Base, Avg, Int,
and Min. Worse results were observed for the Max method, while the Reg method yielded the
same results. The results for gesture 20 (torso touching) are presented in Figure 2.</p>
        <p>For gesture 20 - torso touching, all results improved after applying Grid Search. The best
result for this gesture was obtained with the Max method, while the worst was with the Avg
method. Notably, before applying Grid Search, the best result was also with Max, and the worst
with Avg. Figure 3 presents the results for gesture 24 - finger crossing.</p>
        <p>A significant improvement after applying Grid Search can be observed for the Max
method, while a substantial decline is seen for the Min method. The best result was achieved
with the Max method, and the worst with the Reg method. Notably, before applying Grid
Search, the best result was with Min, and the worst with Int. Figure 4 presents the results for
gesture 29, which represents lip pressing.</p>
        <p>Also, a significant improvement after applying Grid Search can be observed for the Base,
Int, and Min methods, while a decline is seen for the Reg method, and comparable results for the
Avg and Max methods. The best result both before and after applying Grid Search was achieved
with the Reg method, while the worst result in both cases was with the Base method. Figure 5
shows the results for gesture 31 - shoulder shaking.</p>
        <p>All models with the applied sequential analysis method performed better than the Base
model for gesture 31, both before and after Grid Search. The best performance, both before and
after Grid Search, was achieved by the Reg method, while the worst was by the Base method.
After Grid Search, the models that showed improvement were Avg, Reg, Max, and Min. The Int
model remained the same, while Base performed worse. Figure 6 presents the results for gesture
99 (unspecified gestures).</p>
        <p>For gesture 99 - unspecified gestures, Grid Search led to overall improvements. Most
models showed slight score increases, with the model using Maximum value performing best.
The Base model also improved significantly. The Regression model experienced a slight decline,
and the model with Integration remained stable. In this case the worst results were achieved by
the Min model.</p>
        <p>Overall, the application of Grid Search generally improved model performance across
various gestures. Most models showed enhancements in their scores, with the models using
Maximum value and Integration methods consistently performing well. The Averaging model
also demonstrated notable improvement. However, the Regression model occasionally
experienced slight declines, and while the Minimum value model showed some improvement, it
often remained the lowest performer.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The conducted examination showed that using simple aggregation methods for sequential
data can be beneficial. The best scoring methods were consistently better than the Baseline
results, and using a more robust model with Grid Search improved them further. Furthermore,
these methods bring additional benefits. One of them is the space required to store the data. The
original dataset (restricted to just 6 gestures) takes around 400 MB of memory. The aggregated
datasets take from around 6 MB up to 12 MB. Even in the worst case this is a more than 33-fold
reduction in size. Another aspect is the time it takes to compute a model. For the Baseline
method with Grid Search it took over 71 hours to compute. At the same time, the aggregation
methods took on average 1 hour and 28 minutes to compute a model, an over 48-fold
improvement. Computing all five aggregated models took over nine times less time than the
single Baseline model.</p>
      <p>In both Phases the Regression, Integral and Maximum aggregation methods proved to be
worth consideration for further research. All three outperformed the remaining two
aggregation methods - Minimum and Average. Nonetheless, all aggregation methods
performed better than the Baseline results.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future works</title>
      <p>This paper serves as an introduction to further analysis examining more complex
approaches to aggregating data. While only simple methods are presented here, this publication
serves as a basis for future work in this direction. An example of such a complex method is the
TCIP method, which is described in more detail in [41]. We aim to examine methods that would
allow us to significantly reduce the size of the dataset while maintaining, or in some cases even
improving, the level of obtained results.</p>
      <p>Although the obtained results are promising, further work is necessary to establish the
scale of application of those solutions. The presented methods can be further tested against the
Baseline dataset to establish a comparison solely on the full data instead of the aggregated data.
The Authors would also like to acknowledge that the examination was performed on one dataset;
further examination on other datasets may be beneficial for estimating the usefulness of the
aggregation methods. Moreover, the conducted examination focuses on readily available data; we
do not process data on the fly. While this might be an interesting approach, it is outside of the
scope of this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          et al., “Attention Is All You Need,” arXiv.org, Jun.
          <volume>12</volume>
          ,
          <year>2017</year>
          . https://arxiv.org/abs/1706.03762
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , “BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,” arXiv.org, Oct.
          <volume>11</volume>
          ,
          <year>2018</year>
          . https://arxiv.org/abs/
          <year>1810</year>
          .04805
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>T. B. Brown</surname>
          </string-name>
          et al.,
          <string-name>
            <surname>“Language Models Are Few-Shot</surname>
            <given-names>Learners</given-names>
          </string-name>
          ,” arxiv.org, vol.
          <volume>4</volume>
          ,
          <string-name>
            <surname>May</surname>
            <given-names>2020</given-names>
          </string-name>
          , Available: https://arxiv.org/abs/
          <year>2005</year>
          .14165
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gill</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaur</surname>
          </string-name>
          , “
          <article-title>ChatGPT: Vision</article-title>
          and Challenges,”
          <source>Internet of Things and CyberPhysical Systems</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2023</year>
          , doi: https://doi.org/10.1016/j.iotcps.
          <year>2023</year>
          .
          <volume>05</volume>
          .004.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sharir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peleg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shoham</surname>
          </string-name>
          , “
          <article-title>The Cost of Training NLP Models: A Concise Overview</article-title>
          ,” arXiv:
          <year>2004</year>
          .08900 [cs],
          <source>Apr</source>
          .
          <year>2020</year>
          , Available: https://arxiv.org/abs/
          <year>2004</year>
          .08900
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Samsi</surname>
          </string-name>
          et al.,
          <article-title>“From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference</article-title>
          .” Available: https://arxiv.org/pdf/2310.03003
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sallam</surname>
          </string-name>
          , “ChatGPT Utility in Healthcare Education, Research, and
          <article-title>Practice: Systematic Review on the Promising Perspectives</article-title>
          and Valid Concerns,” Healthcare, vol.
          <volume>11</volume>
          , no.
          <issue>6</issue>
          , p.
          <fpage>887</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2023</year>
          , doi: https://doi.org/10.3390/healthcare11060887.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Ray</surname>
          </string-name>
          , “
          <article-title>Timely need for navigating the potential and downsides of LLMs in healthcare and biomedicine,” Briefings in bioinformatics</article-title>
          , vol.
          <volume>25</volume>
          , no.
          <issue>3</issue>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2024</year>
          , doi: https://doi.org/10.1093/bib/bbae214.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Acharya</surname>
          </string-name>
          , “
          <article-title>Gesture Recognition: A Survey,”</article-title>
          <source>IEEE Transactions on Systems, Man and Cybernetics</source>
          , Part C (
          <article-title>Applications</article-title>
          and Reviews),
          <source>Jan</source>
          .
          <year>2000</year>
          , Accessed: Apr.
          <volume>26</volume>
          ,
          <year>2024</year>
          . [Online]. Available: https://www.academia.edu/32594485/Gesture_Recognition_
          <string-name>
            <surname>A</surname>
          </string-name>
          _Survey
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldin-Meadow</surname>
          </string-name>
          , “
          <article-title>The role of gesture in communication and thinking,” Trends in Cognitive Sciences</article-title>
          , vol.
          <volume>3</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>429</lpage>
          , Nov.
          <year>1999</year>
          , doi: https://doi.org/10.1016/s1364-
          <volume>6613</volume>
          (
          <issue>99</issue>
          )
          <fpage>01397</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>H.-J. Kim</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>and J.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Park</surname>
          </string-name>
          , “
          <article-title>Dynamic hand gesture recognition using a CNN model with 3D receptive fields</article-title>
          ,
          <source>” Workshop on Neural Networks for Signal Processing</source>
          ,
          <year>2008</year>
          , Accessed: Jun.
          <volume>24</volume>
          ,
          <year>2024</year>
          . [Online]. Available: https://www.semanticscholar.org/paper/Dynamic-hand
          <article-title>-gesture-recognition-using-aCNN-model-Kim-Lee/1ae971c25646d003e15dd9eb706650e58c21d900</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Devineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moutarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J</given-names>
            .
            <surname>Yang</surname>
          </string-name>
          , “
          <article-title>Deep Learning for Hand Gesture Recognition on Skeletal Data,”</article-title>
          <source>IEEE Xplore, May</source>
          <volume>01</volume>
          ,
          <year>2018</year>
          . https://ieeexplore.ieee.org/document/8373818 (accessed Nov.
          <volume>26</volume>
          ,
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Samkari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alghamdi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Al</surname>
          </string-name>
          <string-name>
            <surname>Ghamdi</surname>
          </string-name>
          ,
          <article-title>“Human Pose Estimation Using Deep Learning: A Systematic Literature Review,”</article-title>
          <source>Machine Learning and Knowledge Extraction</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1612</fpage>
          -
          <lpage>1659</lpage>
          , Dec.
          <year>2023</year>
          , doi: https://doi.org/10.3390/make5040081.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          , G. Hidalgo,
          <string-name>
            <given-names>T.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-E.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          , “
          <article-title>OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields</article-title>
          ,” arXiv:
          <year>1812</year>
          .08008 [cs],
          <source>May</source>
          <year>2019</year>
          , Available: https://arxiv.org/abs/
          <year>1812</year>
          .08008
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pishchulin</surname>
          </string-name>
          et al.,
          <article-title>“DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation</article-title>
          ,” arXiv:
          <fpage>1511</fpage>
          .06645 [cs],
          <source>Apr</source>
          .
          <year>2016</year>
          , Available: https://arxiv.org/abs/1511.06645
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>