<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hand-Guided Object Tracking Using Hand-Object Consistency</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiwon Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taewook Ha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Woontack Woo</string-name>
          <email>wwoo@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KAIST KI-ITC Augmented Reality Research Center</institution>
          ,
          <addr-line>291 Daehak-ro, Yuseong-gu, 34141, Daejeon</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KAIST UVR Lab</institution>
          ,
          <addr-line>291 Daehak-ro, Yuseong-gu, 34141, Daejeon</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a robust method for estimating the pose of objects occluded by the hand during user interaction in a Head-Mounted Display (HMD) environment. Existing approaches to the occlusion problem often predict the hand and object jointly to improve efficiency, but their applicability in HMD environments is limited by high computational cost and poor generalization to occluded objects. Our approach transfers hand pose changes to the object pose based on the confidence levels of both the hand and the object. Evaluation conducted on 20 distinct grasping pose types demonstrated a lower Mean Per-Vertex Position Error (MPVPE) compared to conventional interpolation methods. Consequently, the proposed method enables effective pose estimation of occluded objects using fewer computational resources.</p>
      </abstract>
      <kwd-group>
        <kwd>Object pose estimation</kwd>
        <kwd>Hand pose estimation</kwd>
        <kwd>AR/VR</kwd>
        <kwd>HMD</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Object recognition plays a critical role in scenarios where users interact with objects via
Head-Mounted Display (HMD) devices. When real-world objects held by users are not accurately tracked
and this information is not transmitted to the device, natural interaction becomes impaired [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Although existing object detection models perform well under conditions where objects are clearly
visible, recognition accuracy tends to degrade significantly when objects are occluded by the user’s hand. Hence,
addressing occlusion problems in computer vision is essential for providing realistic immersion in
HMD environments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Current approaches to occlusion mitigation predominantly employ deep learning and generative
models to simultaneously predict hand and object states, often improving temporal efficiency
compared to sequential prediction methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, these methods exhibit three main limitations.
First, as demonstrated in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while robust occlusion-resistant prediction is feasible, the
computational load is high, making these methods resource-inefficient for HMD devices, where
processing power is limited. Second, as shown in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], limitations arise with previously unseen objects, leading to poor
generalization and an inability to cover the diverse hand-object interaction patterns typical of real-world
HMD usage. Third, these methods typically do not consider fail-safe strategies for pose estimation or
object tracking failures, which are crucial for practical deployment in HMD scenarios.
      </p>
      <p>We propose a method that estimates the current pose and position of an object occluded beyond a
certain threshold by the user’s hand, leveraging hand motion information. Our approach enables
robust inference of occluded object movement with low computational overhead by applying hand
pose changes to object pose estimation. The method integrates object recognition results from both
the current and previous frames and assesses the sufficiency of the available information. When data
confidence is adequate, the method relies primarily on current-frame information for pose estimation;
otherwise, it references prior frames’ data to compensate for missing or unreliable inputs. This use of
temporal data ensures applicability to time-series input and supports real-time object pose estimation on
resource-constrained HMD platforms.</p>
      <p>
        Quantitative evaluation was performed using multiple hand-object interaction scenarios [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
representing various object grasping and rotation patterns. Frames were segmented based on object rotation
direction, and the Mean Per-Vertex Position Error (MPVPE) between predicted and ground truth poses
was computed. Experimental results show that applying hand pose variations to object pose estimation
significantly outperforms conventional interpolation techniques in tracking occluded objects during
hand-object interactions in HMD settings.
      </p>
      <p>The contributions of this study are threefold. First, it presents an efficient pose estimation method
tailored for real-time HMD interaction environments under limited computational resources, specifically
addressing occlusion caused by the hand. Second, it enhances robustness by leveraging both
current and past frame recognition data, enabling compensatory estimation when immediate information
is insufficient or unreliable. Third, it validates improved tracking performance and generalization
through comprehensive quantitative experiments involving realistic hand-object grasping and pose
scenarios. Consequently, this work demonstrates the feasibility of reliable, real-time occluded-object
pose estimation for collaborative and interactive applications utilizing HMD devices. </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. 3D Hand-Object Pose Estimation</title>
        <p>
          Research on pose estimation in hand-object interaction scenarios from images or videos continues
to advance. The H+O framework [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposed a method that simultaneously performs 3D hand-object
pose estimation, object recognition, and action classification using a single RGB image, rather than
separately estimating 3D poses of the person or objects. However, since it relies solely on a single
RGB input, the lack of depth-related information poses inherent limitations when the hand and object
occlude each other, adversely affecting prediction accuracy. More recently, HOISDF [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] employed
Global Signed Distance Fields (SDF) to jointly estimate 3D hand-object poses even under occlusion.
While this approach benefits from modeling contact and proximity between the hand and object
simultaneously, it suffers from the high computational and memory demands of SDF processing.
Additionally, severe occlusions necessitate further refinement to accurately capture fine details at
contact regions. Similarly, Lin et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed a method that selectively shares or separates features
at the backbone network level to improve simultaneous pose estimation from a single RGB image
under occlusion conditions. Despite its effectiveness, this approach lacks generalization to unseen
objects and does not sufficiently address the challenges posed by invisible hand-object contact areas.
        </p>
        <p>
          Other research efforts have sought to apply alternative AI models to hand-object pose estimation.
Semi-supervised frameworks have been proposed to enhance pose estimation performance under
occlusion and limited 3D labeled data from single images [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, the absence of 3D object
models meant that no pseudo-labels could be generated for objects; because the quality of such labels
critically influences performance, this limits a comprehensive solution for complex multi-object and
hand interaction scenarios. Additionally, some studies employ deep learning-based feedback loop frameworks to
simultaneously estimate 3D hand and object poses purely via deep neural networks [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Nonetheless,
these deep learning-based object detection approaches generally entail substantial computational
overhead, rendering them inefficient for deployment on resource-constrained HMD devices [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Therefore, our approach aims to provide a computationally lighter alternative to these existing
high-cost methods.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Human-Object Interaction (HOI) Detection</title>
        <p>
          Research on detecting how human-object interactions (HOI) occur continues to advance. The
UnionDet framework [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposed a single-stage prediction approach that directly infers the
interaction regions of human-object pairs, aiming to overcome the speed limitations of conventional
multi-stage HOI detection methods. However, it exhibited limitations in handling overlapping
instances and multiple simultaneous interactions. Another approach utilized a transformer-based model
to predict sets of humans, objects, and interactions without requiring explicit human-object matching
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This method benefits from eliminating computationally expensive post-processing, resulting in
significantly faster inference speeds. Nevertheless, it faces increased computational costs when dealing
with complex images containing numerous human-object interaction instances. More recently, attempts
have been made to combine Convolutional Neural Networks (CNNs) with multi-resolution wavelet
analysis to address the trade-off between computational speed and detection accuracy [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. However,
this approach infers interactions solely from 2D images without incorporating full 3D information,
limiting its applicability in scenarios where depth information is essential. Therefore, instead of jointly
predicting both the hand and the object within the interaction space, our approach applies the motion
of the hand to the target object, thereby reducing computational overhead and improving inference
speed.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. System Structure</title>
        <p>In real-world interactions, the hand and object typically move independently until contact occurs.
Therefore, temporal information is utilized to set the initial pose of the stationary object, and
subsequent hand pose variations are applied to update the object pose when it becomes occluded.
Because the occluded object’s pose is predicted using the remaining recognition results, this approach
operates efficiently without requiring additional computational resources.</p>
        <p>The system is broadly divided into two stages. The first stage recognizes the hand and object
separately based on 2D images from the camera viewpoint. We assume that 3D information of the
target object is provided beforehand, and an RGB-D image captured either immediately at contact
onset or while in contact is supplied as input. This enables pose estimation of the object in its initial
static state, as well as the detection of the hand pose at the instant of contact. From the moment
hand-object interaction begins, an object detection model is employed to evaluate the recognition confidence
of both hand and object. To improve reliability, cropped regions based on the estimated hand and object
locations are fed into the object detection model. These detection results subsequently inform the
application of hand pose changes to update the object pose.</p>
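        <p>As a minimal illustration of this first stage, the following Python sketch crops regions around the
previously estimated hand and object locations and passes each crop to a detector to obtain per-entity
confidence scores. The detector interface and the margin value are hypothetical placeholders, not the
exact models used in this work.</p>
        <preformat>
import numpy as np

def crop(image, box, margin=0.2):
    """Crop a region around a bounding box, expanded by a relative margin."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    mx, my = (x1 - x0) * margin, (y1 - y0) * margin
    x0, y0 = max(0, int(x0 - mx)), max(0, int(y0 - my))
    x1, y1 = min(w, int(x1 + mx)), min(h, int(y1 + my))
    return image[y0:y1, x0:x1]

def stage_one_confidences(image, hand_box, obj_box, detector):
    """Run a (hypothetical) detector on both crops and return its scores."""
    s_hand = detector(crop(image, hand_box), target="hand")
    s_obj = detector(crop(image, obj_box), target="object")
    return s_hand, s_obj
        </preformat>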
        <p>The second stage estimates the pose based on the detection confidence scores from the first stage.
When the object detection confidence is sufficiently high, indicating accurate recognition, the object
pose is updated using a dedicated pose estimation model. Conversely, if the object confidence is low,
the most recently recognized object pose is updated by applying hand pose variations. This update
process also considers the confidence of the hand pose estimation, as well as the temporal interval since
the last object pose update. The current object pose $V^O_t$ is obtained by applying the hand motion
$M^H_t$ to the object pose from the previous time step $V^O_{t-1}$, scaled by a weighting coefficient
$\alpha_t$. Formally:</p>
        <p>$V^O_t = \alpha_t \cdot M^H_t \cdot V^O_{t-1}$ (1)</p>
        <p>Here, $\alpha_t$ quantifies the degree of trust in the previous object pose estimate when computing the
current pose update.</p>
        <p>$\alpha_t = \begin{cases} \text{no update}, &amp; \text{if } S^{\mathrm{obj}}_t \geq \tau_{\mathrm{obj}}, \\ 1, &amp; \text{if } S^{\mathrm{obj}}_t &lt; \tau_{\mathrm{obj}} \text{ and } S^{\mathrm{hand}}_t &lt; \tau_{\mathrm{hand}}, \\ \lambda^{\Delta t} \cdot S^{\mathrm{hand}}_t, &amp; \text{otherwise} \end{cases}$ (2)</p>
        <p>Equation (2) defines the calculation of the weighting coefficient $\alpha_t$. Let $S$ denote the recognition
confidence score for the object or the hand at time $t$, and let $\tau$ be a predefined confidence threshold. If
the recognition confidence for either the object or the hand falls below $\tau$, the respective entity is
considered insufficiently visible, and the system either fully references or disregards the previous
frame’s information accordingly. In other cases, the hand confidence is used together with a decay factor
proportional to the number of frames elapsed since the last reliable object pose update to calculate $\alpha_t$.
When both the object confidence and the hand confidence drop below their respective minimum thresholds,
the current frame’s pose estimation becomes unreliable. Therefore, the system preserves the pose from
the most recent reliable frame to minimize estimation error.</p>
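        <p>To make the update rule concrete, the sketch below implements the gating of Equations (1) and (2) in
Python. The threshold values, the decay factor, the 4x4 homogeneous pose representation, and the blending
of the weighted hand motion toward identity are illustrative assumptions; only the confidence gating itself
follows the equations above.</p>
        <preformat>
import numpy as np

TAU_OBJ, TAU_HAND = 0.6, 0.5  # assumed confidence thresholds (tau)
LAM = 0.95                    # assumed per-frame decay factor (lambda)

def update_object_pose(V_prev, M_hand, s_obj, s_hand, dt, estimate_pose):
    """Confidence-gated object pose update following Eqs. (1)-(2).

    V_prev: 4x4 object pose from the last reliable frame.
    M_hand: 4x4 relative hand motion accumulated since that frame.
    s_obj, s_hand: detection confidence scores at time t.
    dt: frames elapsed since the last reliable object pose update.
    estimate_pose: callable invoking the dedicated pose estimation model.
    """
    if s_obj >= TAU_OBJ:
        # Object clearly visible: the "no update" branch of Eq. (2);
        # rely on the dedicated pose estimator instead.
        return estimate_pose()
    if s_hand >= TAU_HAND:
        # Hand reliable: weight the hand motion by its decayed confidence.
        alpha = (LAM ** dt) * s_hand
        # Blend the hand motion toward identity by alpha, an illustrative
        # reading of Eq. (1) that is adequate for small inter-frame motions.
        M_weighted = alpha * M_hand + (1.0 - alpha) * np.eye(4)
        return M_weighted @ V_prev
    # Both confidences low (the alpha = 1 branch of Eq. 2): preserve the
    # pose from the most recent reliable frame, as described above.
    return V_prev
        </preformat>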
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          To consider scenarios in which objects are partially occluded by the hand, the SHOWME dataset
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] was utilized. This dataset defines 20 grasp types derived from the comprehensive grasp taxonomy
of 33 types presented in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and it comprises a total of 96 scenarios based on variations in object
categories and hand movements. In this study, experiments were conducted using a subset of 20
scenarios, each corresponding to one of the 20 selected grasp types. The selection criteria focused on
scenarios where the modeling information of rendered results aligned well when projected onto the
RGB images. To account for fast-moving objects, not all data recorded at 30 frames per second (fps)
was used; instead, one frame was sampled every 10 frames, effectively yielding a 3 fps frame rate for
the experiments. Camera parameters, including the distance between the camera and the object, were
directly utilized as provided in the dataset.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Metric</title>
        <p>The evaluation metric employed in this study is the Mean Per-Vertex Position Error (MPVPE).
MPVPE quantifies the average positional discrepancy between the vertices of the ground truth (GT)
mesh and those of the estimated mesh. For both the proposed method and the interpolation baseline,
the predicted object meshes are saved separately as .obj files to compute this metric. Specifically, the
MPVPE is calculated by comparing the .obj files of the predicted object mesh against the corresponding
ground truth object mesh provided in the SHOWME dataset. Lower MPVPE values indicate smaller
deviations between the predicted and actual vertex positions, thus representing higher estimation
accuracy. The results are analyzed by plotting graphs for each grasp type and rotation direction to
provide detailed performance insights.</p>
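        <p>For reference, a minimal sketch of the MPVPE computation between two vertex-aligned meshes is shown
below. It assumes the predicted and ground-truth .obj files share the same vertex count and ordering, and it
uses a simple ad-hoc parser rather than the exact tooling employed in this study.</p>
        <preformat>
import numpy as np

def load_obj_vertices(path):
    """Read vertex positions (lines starting with 'v ') from an .obj file."""
    verts = []
    with open(path) as f:
        for line in f:
            if line.startswith("v "):
                verts.append([float(x) for x in line.split()[1:4]])
    return np.asarray(verts)

def mpvpe(pred_path, gt_path):
    """Mean Per-Vertex Position Error between two vertex-aligned meshes."""
    pred = load_obj_vertices(pred_path)
    gt = load_obj_vertices(gt_path)
    assert pred.shape == gt.shape, "meshes must share vertex count and order"
    return np.linalg.norm(pred - gt, axis=1).mean()
        </preformat>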
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Method</title>
        <p>Prior to deployment on HMD devices, the original dataset values are treated as ground truth and
used to evaluate the prediction accuracy. The method follows the previously described system
workflow. The model is applied to cropped images focusing exclusively on the hand and object regions
within each frame of the dataset. This cropping aims to isolate the hand-object interaction, preventing
interference from other objects in the scene and ensuring that confidence scores reflect only the
scenario-specific hand and object. The crops were generated using the rendered results provided by the
SHOWME dataset.</p>
        <p>As a baseline, an interpolation method was considered. When the confidence scores for the hand and
object in the current frame fall below the threshold used in the proposed method, the object pose is
estimated as the midpoint between the previous and subsequent frames. This interpolation approach is
analogous to the proposed idea of predicting object pose based on the hand pose change between
consecutive frames. The interpolation is applied specifically at the point where the weighting
coefficient is calculated during object pose estimation at time t in the system. Otherwise, all other
processing steps remain identical. This setup enables direct comparison of evaluation metrics between
the proposed method and the interpolation baseline.</p>
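        <p>As an illustration of this baseline, the sketch below computes such a midpoint pose from the two
neighboring frames, interpolating the rotation with SLERP and the translation linearly; representing poses
as 4x4 homogeneous transforms is an assumption made for illustration.</p>
        <preformat>
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def midpoint_pose(T_prev, T_next):
    """Midpoint of two 4x4 poses: SLERP for rotation, linear for translation."""
    rots = Rotation.from_matrix([T_prev[:3, :3], T_next[:3, :3]])
    r_mid = Slerp([0.0, 1.0], rots)(0.5)  # halfway rotation
    T_mid = np.eye(4)
    T_mid[:3, :3] = r_mid.as_matrix()
    T_mid[:3, 3] = 0.5 * (T_prev[:3, 3] + T_next[:3, 3])
    return T_mid
        </preformat>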
        <p>Significant variability exists in object detection rates across scenarios, heavily influenced by the
training quality of the detection model. Experiments were conducted not only using the raw detection
results but also by artificially adjusting detection rates per scenario. This allowed us to assess which
method performs better relative to the detection model’s effectiveness. When using unmodified
detection results, object pose prediction is triggered only if the object fails to be detected by the
detection model, which in this study is MediaPipe. When detection rates are artificially manipulated,
frames treated as undetected are selected at random and handled by the respective prediction methods.
Hand pose confidence for prediction always relies on MediaPipe model outputs.</p>
        <p>When utilizing the dataset, the grasping method, object type, and rotation sequence/direction are not
consistent. Therefore, we additionally grouped the data into rotation units of 10 sampled frames, which
served as the minimum rotation segment for classification. In other words, one unit corresponds to a
100-frame span of the original video (10 sampled data points). For all 20 scenarios, these units were
grouped, and the corresponding rotation vectors were classified into rotation types using the k-means
clustering algorithm. Based on these rotation types, we examined which rotation directions each
scenario is more specialized in, thereby enabling more accurate estimation. The number of clusters was
adjusted experimentally by varying k until clusters with identical directions and motion tendencies no
longer appeared. Consequently, the six clusters obtained represent distinct directions or tendencies
(i.e., consistency of rotation).</p>
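        <p>A minimal sketch of this grouping step is given below: per-frame rotation vectors are aggregated over
10-frame units and the units are clustered with k-means. Summing axis-angle vectors as the per-unit feature
is an illustrative approximation, not necessarily the exact aggregation used in the experiments.</p>
        <preformat>
import numpy as np
from sklearn.cluster import KMeans

def cluster_rotation_units(rot_vecs, unit=10, k=6):
    """Group per-frame rotation vectors into fixed-size units and cluster them.

    rot_vecs: (N, 3) axis-angle rotation vectors, one per sampled frame.
    Returns one cluster label per 10-frame unit.
    """
    n_units = len(rot_vecs) // unit
    # One feature vector per unit: the net rotation over the unit
    # (summing axis-angle vectors approximates this for small rotations).
    units = rot_vecs[: n_units * unit].reshape(n_units, unit, 3).sum(axis=1)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(units)
        </preformat>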
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Comparison of MPVPE with Interpolation Methods</title>
        <p>First, when specifying different detection rate ratios, we compared the Mean Per-Vertex Position
Error (MPVPE) results for each scenario and detection rate using both the proposed method and the
interpolation baseline. The results demonstrated that the proposed method consistently achieved lower
MPVPE values across all scenarios. Scenario 14 (Medium Wrap) exhibited a substantial performance
gap favoring the proposed method regardless of the object detection rate. In contrast, Scenarios
4 (Inferior Pincer), 8 (Tripod), and 13 (Quadpod) showed relatively minor differences between methods,
irrespective of detection rates. These scenarios involve grasps on smaller objects, which may contribute
to smaller absolute errors in both methods. Additionally, although these scenarios feature longer
durations with diverse rotations, the smaller radius of object rotation results in relatively low errors
even when using conventional interpolation.</p>
        <p>Across all detection rate variations, the proposed method consistently outperformed the
interpolation approach. Furthermore, as detection rates decreased, the performance gap widened,
indicating that the proposed method is particularly effective when recognition confidence is low.
Conversely, performance stabilized when detection rates surpassed a certain threshold. This plateau is
likely due to the decay factor in the weighting coefficient α t , which diminishes proportionally with
the number of frames elapsed since the last reliable object pose update. Higher detection rates reduce
the number of frames over which decay applies, leading to more stable pose estimations.</p>
        <p>To further analyze results by rotation type, scenarios were grouped into six rotation clusters.
Clusters 0 and 2 have opposite rotation directions, yet they revealed significant differences in MPVPE
despite the symmetrical rotation axes. For example, Scenarios 9 (Parallel Extension), 10 (Power
Sphere), and 11 (Precision Sphere) frequently exceeded an MPVPE of 0.002 in Cluster 0, whereas in
Cluster 2, most scenarios remained below this threshold. This discrepancy is hypothesized to result
from longer occlusion durations caused by the rotation direction. Longer occlusions increase the
number of frames over which decay in pose confidence is applied, thereby reducing prediction
reliability and increasing error. Thus, the proposed method demonstrates better performance when
occlusion occurs in shorter, repeated intervals rather than in prolonged continuous segments.</p>
        <p>Next, the comparison between Clusters 4 and 5 focused on whether rotation direction remained
consistent or changed midway. Cluster 4 generally exhibited higher MPVPE values, with Scenario 13
(Quadpod) showing a twofold increase compared to Cluster 5. This result suggests that predicting
object pose from hand pose changes is more straightforward when rotation direction remains constant.
When rotation direction changes, the hand’s rotational velocity typically decreases, resulting in longer
occlusion intervals and greater difficulty in accurate prediction.</p>
        <sec id="sec-5-1-1">
          <title>Average time per frame [ms] ↓</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Object detection</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Pose estimation</title>
          <p>11.88</p>
          <p>
            According to a recent study analyzing the impact of frame rate on user experience in virtual reality
environments [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], most users perceive a sense of real-time interaction at frame rates above 30 FPS,
while a frame rate of 60 FPS or higher is recommended to ensure full immersion and to mitigate
simulator sickness. Since the FPS value achieved by the system proposed in this study exceeds 60 FPS,
it can be considered sufficient for users to perceive real-time responsiveness. Furthermore, because the
proposed method requires only minimal computation time, it is expected that parallel utilization of
multiple models would not introduce significant performance issues. Given the computational speed
of the example object detection model employed, it can be inferred that the overall system’s FPS is
ultimately determined by the specific object detection and hand pose estimation models utilized.
Therefore, if real-time capable object detection and hand pose estimation models are employed, the
system architecture demonstrated in this study can be effectively applied to HMD devices in real-time
scenarios.
          </p>
          <p>
            In addition, the FLOPs value of the proposed method is relatively small when compared to the
computational capabilities of current HMD devices and smartphones, thereby confirming its feasibility
for deployment on such platforms. Finally, the maximum memory consumption of approximately 2
GB further indicates that the system is well within the RAM capacity of modern HMD devices,
ensuring its practical applicability [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        We proposed a method to estimate the pose of objects occluded by the hand through the utilization
of hand pose changes. To evaluate the performance advantage of our method compared to the
conventional interpolation approach, tests were conducted on 20 grasp types from the SHOWME
dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our method consistently outperformed the baseline regardless of the performance of the
object detection model. Furthermore, even when experiments were stratified by object rotation
directions, the proposed method demonstrated superior performance with a substantial margin.
      </p>
      <p>In addition, we measured the step-by-step processing time and computational resource usage to
examine whether the proposed method could run on real-time HMD devices. Since our method
requires very little time per frame, we showed that it is viable in terms of processing time,
provided that the object detection and hand pose estimation models used alongside it are
appropriately selected for real-time operation. Furthermore, in terms of FLOPs, we confirmed that the
method is applicable given the performance levels of current HMD devices and typical
smartphones.</p>
      <p>Although the SHOWME dataset used in this study contains certain instances of directional changes,
its frames exhibit predominantly linear tendencies. Hence, it remains necessary to evaluate whether the
proposed approach generalizes effectively to datasets characterized by more complex motion
patterns. Furthermore, since the SHOWME dataset is limited to single-hand manipulation of an object,
additional validation is required in scenarios that align more closely with the research objective,
namely multi-user interaction with objects in immersive HMD environments, where multiple users
may manipulate a single object simultaneously. In addition, to comprehensively evaluate different
grasping methods, we utilized a dataset that can be classified into 20 grasp types and performed
experiments under the assumption that the hand and object in the images are observed from the user’s
perspective. To further validate the user’s direct manipulation of objects, we plan to conduct additional
user studies.</p>
      <p>In this regard, future work may consider incorporating a weighting term that accounts for complex
movements. One example is determining which hand’s confidence level should be applied when
estimating object pose changes in multi-hand scenarios. Such an extension would enhance the
robustness of the proposed estimation method for interaction with virtual avatars, which constitutes the
final objective of this research. Additionally, as this study has prioritized performance validation of the
proposed method, relatively little attention has been devoted to the selection of the hand pose estimation
model. Further investigation into model selection could thus provide a more comprehensive guideline
for the effective application of the proposed framework.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This paper was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded
by the Korea Government (MOTIE) (RS-2025-02304167, HRD Program for Industrial Innovation).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-5 for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arshad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sulaiman</surname>
          </string-name>
          , “Occlusion in augmented reality,”
          <source>in Proc. 2012 8th Int. Conf. Information Science and Digital Content Technology (ICIDT)</source>
          , Jeju, Korea,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P. H.</given-names>
            <surname>Shum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Morishima</surname>
          </string-name>
          , “
          <article-title>Resolving hand-object occlusion for mixed reality with joint deep learning and model optimization</article-title>
          ,”
          <source>Comput. Animat. Virtual Worlds</source>
          , vol.
          <volume>31</volume>
          , no.
          <issue>4-5</issue>
          ,
          <issue>e1956</issue>
          ,
          <year>2020</year>
          . doi: 10.1002/cav.1956.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salzmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathis</surname>
          </string-name>
          , “HOISDF:
          <article-title>Constraining 3D hand-object pose estimation with global signed distance fields,”</article-title>
          <source>in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2024</year>
          . doi: 10.48550/arXiv.2402.17062.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          , “
          <article-title>Harmonious feature learning for interactive hand-object pose estimation,”</article-title>
          <source>in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>12989</fpage>
          -
          <lpage>12998</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mounika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Udayaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ch. V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Narayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jyothi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ch.</given-names>
            <surname>Devi</surname>
          </string-name>
          , “
          <article-title>Exploring spiking neural networks and deep learning techniques for occlusion detection in AR and VR images,”</article-title>
          <source>in Proc. 2024 Int. Conf. Advances in Computing, Communication and Applied Informatics (ACCAI)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi: 10.1109/ACCAI61061.2024.10601809.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Swamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weinzaepfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baradel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galaaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bregier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Franco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rogez</surname>
          </string-name>
          , “
          <article-title>SHOWMe: Benchmarking object-agnostic hand-object 3D reconstruction</article-title>
          ,” in ACVR Workshop at ICCV,
          <year>2023</year>
          . doi: 10.48550/arXiv.2309.10748.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tekin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bogo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollefeys</surname>
          </string-name>
          , “H+O:
          <article-title>Unified egocentric recognition of 3D hand-object poses and interactions</article-title>
          ,”
          <source>in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4511</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <article-title>Semi-supervised 3D hand-object poses estimation with interactions in time,”</article-title>
          <source>in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>14687</fpage>
          -
          <lpage>14697</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oberweger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wohlhart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Lepetit</surname>
          </string-name>
          , “
          <article-title>Generalized feedback loop for joint hand-object pose estimation,”</article-title>
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1898</fpage>
          -
          <lpage>1912</lpage>
          , Aug.
          <year>2020</year>
          . doi: 10.1109/TPAMI.2019.2907951.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , “
          <article-title>UnionDet: Union-level detector towards real-time human-object interaction detection,”</article-title>
          <source>in Proc. Eur. Conf. Computer Vision</source>
          (ECCV),
          <year>2020</year>
          . doi: 10.48550/arXiv.2312.12664.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-S.</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , “HOTR:
          <article-title>End-to-end human-object interaction detection with transformers,”</article-title>
          <source>in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q. B.</given-names>
            <surname>Pay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Baskaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Loo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>See</surname>
          </string-name>
          , “
          <article-title>Conceptualizing multi-scale wavelet attention and ray-based encoding for human-object interaction detection,”</article-title>
          <source>in Proc. Int. Joint Conf. Neural Networks (IJCNN)</source>
          ,
          <year>2025</year>
          . doi: 10.48550/arXiv.2507.10977.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pawlik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-B.</given-names>
            <surname>Schmiedmayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Romero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kragić</surname>
          </string-name>
          , “
          <article-title>A comprehensive grasp taxonomy,”</article-title>
          <source>in Proc. IEEE-RAS Int. Conf. Humanoid Robots (Humanoids)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>327</fpage>
          -
          <lpage>333</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.-N.</given-names>
            <surname>Liang</surname>
          </string-name>
          , “
          <article-title>Effect of frame rate on user experience, performance, and simulator sickness in virtual reality</article-title>
          ,”
          <source>IEEE Trans. Vis. Comput. Graphics</source>
          , vol.
          <volume>29</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>2478</fpage>
          -
          <lpage>2488</lpage>
          , May
          <year>2023</year>
          . doi: 10.1109/TVCG.2023.3247057.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Heaney</surname>
          </string-name>
          , “
          <article-title>Quest 3 full specs - Compared with Quest 2, Quest Pro, Pico 4 &amp; Apple Vision Pro</article-title>
          ,” UploadVR. Available: https://www.uploadvr.com/quest-3-specs/. [Accessed: Aug. 18,
          <year>2025</year>
          ].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>