<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>X (H. Lee);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>3D Trajectory Reconstruction of Dynamic Objects in Digital Twins from Monocular Video</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bogwan Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haeseong Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Myungho Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pusan National University</institution>
          ,
          <addr-line>2, Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The growing demand for remote monitoring through digital twins highlights the importance of integrating both structural accuracy and dynamic awareness of physical spaces. While 3D reconstruction technologies enable highly precise digital twin environments, they typically remain static, failing to reflect real-time changes. Conversely, CCTV systems provide live monitoring but only as separate 2D video streams, requiring users to mentally map them to the reconstructed 3D environment. To address this gap, we propose a 2D-3D projection-based pipeline that incorporates dynamic object trajectories from monocular video into a 3D reconstructed digital twin. Our method leverages widely available indoor CCTV feeds, combining them with reconstructed static scenes and camera pose information to back-project object masks and recover placement and orientation. A stabilization filter further ensures robustness against noise and mask deformation. This approach offers a practical foundation for integrating dynamic objects into digital twins, facilitating more consistent spatial perception and real-time monitoring of remote environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Twin</kwd>
        <kwd>Dynamic Object Trajectory Reconstruction</kwd>
        <kwd>Pose Estimation</kwd>
        <kwd>Video Surveillance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Digital Twin (DT) technology is gaining significant attention as an innovative paradigm that connects
the physical and digital worlds, enabling the continuous reflection of a real environment’s state, behavior,
and changes over time in a virtual environment [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Unlike traditional modeling approaches, which
were often limited to static representations or simplified simulations, DTs integrate heterogeneous data
sources—such as sensor data, image data, and simulation results—to provide a continuously updated
virtual environment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This characteristic is particularly crucial in various application domains such
as smart manufacturing, healthcare, urban infrastructure management, and autonomous driving, where
the demand for real-time monitoring, predictive analytics, and decision support is rapidly increasing
[
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. In this context, the usability and reliability of a DT are directly determined by the level of fidelity
with which the virtual model reflects the structural, spatial, and temporal characteristics of the physical
environment [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. Therefore, fidelity has become a core concept in DT research, extending beyond
mere geometric representation or physical model accuracy to a comprehensive discussion that includes
the realism of dynamic interactions and behavioral patterns [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ].
      </p>
      <p>
        While fidelity can be defined in various ways [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], it essentially refers to how accurately a DT captures
not only the static properties of a real environment but also its dynamic states and transitions over time.
For example, if a DT of a manufacturing site only reproduces the geometric shape of machinery and
fails to reflect dynamic elements such as trajectories, its utility for predictive maintenance is limited [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Similarly, if a smart city’s DT includes only static infrastructure like buildings and roads but fails to
track the movement of mobile objects such as vehicles and pedestrians, it cannot sufficiently contribute
to traffic flow analysis or safety decision support [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These examples illustrate that the value of a DT
lies not merely in creating a visually precise digital replica, but in ensuring a functionally equivalent
level to reality by securing spatiotemporal consistency between the physical and virtual environments
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, achieving such high fidelity entails several challenges. While low-fidelity models can
reduce computational resource consumption, discrepancies with reality may lead to degraded prediction
performance or erroneous judgments. Conversely, high-fidelity DTs require precise 3D reconstruction,
robust pose estimation, and reliable dynamic object tracking, thus demanding massive computational
loads and significant algorithmic complexity [
        <xref ref-type="bibr" rid="ref10">10, 11</xref>
        ]. Therefore, determining how to define and balance
the level of fidelity has emerged as a key challenge in DT research, especially in application contexts that
simultaneously require dynamic object recognition, real-time localization, and temporal consistency
[12].
      </p>
      <p>In this context, research aimed at virtually reproducing real-world scenes has continued steadily. 3D
reconstruction is a prime example. Traditional pipelines for reconstructing 3D scenes from multi-view
cameras have widely used Structure-from-Motion (SfM) [13] to estimate camera poses and sparse points,
followed by Multi-View Stereo (MVS) [14] to produce dense depth and meshes. More recently, rapid
advances in neural reconstruction methods—most notably Neural Radiance Fields (NeRF)—have made it
possible to create and update high-precision 3D models of large-scale scenes [15, 16]. In particular, as
the accuracy of visual localization and the conversion pipelines between mesh-based and
point-cloud-based representations have matured [17, 18], it has become feasible to stably perform reconstruction
and maintenance of a scene’s geometry and material properties at industrial scale. These technical
foundations provide the continuously high-quality updatable spatial models required by DTs.</p>
      <p>At the same time, progress in computer vision—object segmentation [19, 20], Multi-Object
Tracking (MOT) [21], and Human Activity Recognition (HAR) [22]—has made it possible to quantitatively
characterize and measure object states and scene events from images and video streams. In addition, the
advent of vision-language models (VLMs) supports query-centric recognition and relational/descriptive
reasoning even for object and behavior categories that are not predefined, enabling robust
integration of domain-specific knowledge into vision pipelines tailored to each use case [23]. These models
move beyond mere visualization of static scenes in Digital Twins to enable tracking, explanation, and
prediction of dynamic states.</p>
      <p>Together, advances in 3D reconstruction and object understanding now make it feasible to operate
CCTV-equipped indoor environments (e.g., manufacturing facilities) as digital twins synchronized
with their physical counterparts. By leveraging a predefined 3D scene model and visual localization,
segmented objects from video streams can be registered into the 3D scene, ensuring spatiotemporal
consistency. In this paper, we focus on synchronizing dynamic objects and propose a method to reconstruct
their motion frame by frame within a virtual environment. We assume that a reconstructed static scene,
3D mesh models of dynamic objects, and accurate camera poses are available—assumptions that align
with the current state of 3D reconstruction, modeling, and localization technologies. From the input
image sequences, we extract object masks and incorporate predefined object and spatial information to
reinforce consistency between physical and virtual spaces, enabling high-fidelity representations of
dynamic objects in DT environments. This approach provides foundational techniques for implementing
dynamic digital twins in domains with frequent motion, such as manufacturing facilities and urban
settings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>This section introduces a pipeline for high-fidelity DT representation of dynamic objects in indoor
scenes recorded by a static camera (e.g., CCTV). To this end, we assume the following are given: (i) a
3D reconstructed mesh of the static scene, (ii) a 3D mesh of the dynamic objects, and (iii) intrinsic and
extrinsic parameters of the camera. In particular, the camera pose estimated within the DT is assumed to
be aligned—via visual localization—with the coordinate frame of the physical camera used for capture.</p>
      <p>Since the dynamic objects are predefined, we prompt SAM2 once at initialization to obtain per-frame
masks. From the pixel distribution in each mask, we compute a principal ray, which is then projected
into the world coordinate system using the camera parameters. The intersection of this ray with the
object’s mid-height plane yields the per-frame position, while the displacement between successive
positions determines the rotation, primarily yaw. To reduce inter-frame rotational instability caused
by mask deformation and noise, we apply a stabilization filter. Finally, if the position and rotation
values were not calculated due to complete mask loss, we interpolate them to maintain consistency. An
overview of the pipeline is shown in Figure 1.</p>
      <sec id="sec-2-0">
        <title>2.1. Problem Statement</title>
        <p>We define the proposed algorithm as F, as shown in Eq. 1. Here, S denotes the 3D scene mesh and O represents the target object model. The camera is defined as C = (P, K), where P = [t, R] is the 3D pose of C—with t representing translation and R representing rotation—and K denotes the intrinsic parameters. ℐ denotes the monocular RGB image sequence (i.e., video) captured by C, and I_t refers to the image at frame t.</p>
        <p>F(S, O, C, ℐ) = Q = {q_t}_{t=1}^{T}   (1)</p>
        <p>The 3D pose of O at time t is represented as {x_t, y_t, z_t, R_t}, where R_t ∈ SO(3). The pose of O on the ground plane of the scene model S at time t, computed by F and denoted as q_t, is defined in Eq. 2, where θ_t denotes the yaw angle.</p>
        <p>q_t = {x_t, 0, z_t, θ_t}   (2)</p>
      <sec id="sec-2-1">
        <title>2.1.1. 3D Mask Projection</title>
        <p>To project O from a 2D image onto the 3D scene, we first generate the target object mask M_t on the image I_t using SAM2. M_t is represented as an array of 2D pixels p_i = (u_i, v_i). For each pixel in the mask, we define a ray r_i(s) using C, as shown in Eq. 3, where s denotes the depth along the ray. Finally, we compute the ray set r = {r_1, r_2, ...} for 3D projection.</p>
        <p>r_i(s) = −R^T t + s R^T K^{−1} [u_i, v_i, 1]^T   (3)</p>
        <p>However, these rays may be affected by mask noise or camera pose errors. To ensure robustness, we compute the unit vector d̂_i of each ray, as defined in Eq. 4.</p>
        <p>d_i = R^T K^{−1} [u_i, v_i, 1]^T,   d̂_i = d_i / ||d_i||   (4)</p>
        <p>The principal ray r̄(s) is then defined as the mean of the unit vectors in r, as given in Eq. 5, where |r| denotes the size of the ray set and o = −R^T t is the camera center.</p>
        <p>r̄(s) = ( Σ_{i ∈ r} d̂_i / |r| ) s + o   (5)</p>
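        <p>As an illustration of this back-projection step, the following minimal NumPy sketch computes the per-pixel ray directions and the principal ray from a binary mask; the helper name principal_ray and the variable layout are illustrative assumptions, not part of the original pipeline.</p>
        <preformat>
import numpy as np

def principal_ray(mask, K, R, t):
    """Back-project mask pixels (Eqs. 3-5): return camera center o and mean unit direction d_bar."""
    v, u = np.nonzero(mask)                      # pixel coordinates (rows, cols) inside the mask
    pix = np.stack([u, v, np.ones_like(u)], 0)   # homogeneous pixel coordinates, shape (3, N)
    dirs = R.T @ np.linalg.inv(K) @ pix          # per-pixel ray directions d_i
    dirs = dirs / np.linalg.norm(dirs, axis=0)   # unit vectors d_hat_i (Eq. 4)
    d_bar = dirs.mean(axis=1)                    # averaged direction of the principal ray (Eq. 5)
    o = -R.T @ t                                 # camera center, i.e. the ray origin
    return o, d_bar
        </preformat>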
      </sec>
      <sec id="sec-2-2">
        <title>2.1.2. Pose Calculation</title>
        <p>As previously mentioned, the 3D mesh model of the object is predefined. Consequently, we can obtain the bounding box of the dynamic object and determine its maximum height h. We then compute the 3D coordinates of the intersection point p_t between the object's principal ray r̄ and the horizontal plane at y = h/2. The corresponding ray parameter s* is determined by solving the ray–plane intersection in Eq. 6, and the full 3D position is then calculated as Eq. 7:</p>
        <p>s* = (h/2 − o_y) / d̄_y   (6)</p>
        <p>p_t = r̄(s*)   (7)</p>
        <p>Finally, by taking only the x and z values from this point p_t, we project it onto the ground plane (y = 0) to place the object.</p>
        <p>The position calculation allows us to determine the object's placement for each frame. The object's direction of rotation is determined from the displacement vector v_t, calculated as the difference between the current frame's position, p_t, and the previous frame's position, p_{t−1}.</p>
        <p>v_t = p_t − p_{t−1} = ⟨x_t − x_{t−1}, 0, z_t − z_{t−1}⟩   (8)</p>
        <p>Although the object’s position and rotation can be computed, significant inconsistencies may arise
between consecutive frames if the masks are deformed or noisy. Such abrupt variations reduce fidelity,
as the rotation calculation directly reflects them. To address this, we apply a stabilization filter composed
of three components:
• Motion gating / deadband: Suppresses micro-jitters by treating negligible rotational changes
as zero when motion is minimal.
• Rate limiting: Constrains the maximum rotation angle per frame, ensuring smooth and
consistent turns.
• Exponential moving average (EMA) smoothing: Reduces noise by blending the newly
computed orientation with the previously filtered orientation using spherical interpolation.</p>
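        <p>The three filter stages can be combined on the yaw angle as in the sketch below; the threshold values (min_speed, deadband, max_step) and the EMA weight alpha are illustrative assumptions, and for a single yaw angle the spherical interpolation reduces to blending along the shortest arc.</p>
        <preformat>
import math

def wrap(a):
    """Map an angle to (-pi, pi] for shortest-arc differences."""
    return math.atan2(math.sin(a), math.cos(a))

def stabilize_yaw(theta_new, theta_prev, speed,
                  min_speed=0.02, deadband=0.03, max_step=0.15, alpha=0.3):
    """Motion gating/deadband, rate limiting, and EMA smoothing of the yaw estimate."""
    delta = wrap(theta_new - theta_prev)
    if speed &lt; min_speed or abs(delta) &lt; deadband:   # gate/deadband: ignore negligible changes
        return theta_prev
    delta = max(-max_step, min(max_step, delta))        # rate limit: clamp the per-frame rotation
    return wrap(theta_prev + alpha * delta)             # EMA blend toward the new orientation
        </preformat>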
        <p>By applying this process to each frame, we obtain the object's position (x_t, z_t) and yaw rotation θ_t for the sequence. However, when the mask is completely missing, position and rotation cannot be computed for those frames. To maintain temporal consistency during such dropouts, we linearly interpolate both position and rotation across short gaps of up to G consecutive frames. Let t_0 &lt; t_1 be the valid keyframes that bracket a gap of length g = t_1 − t_0 − 1 ≤ G. For any missing frame t ∈ (t_0, t_1), set α = (t − t_0) / (t_1 − t_0) and compute using Eqs. 9 and 10.</p>
        <p>p_t = (1 − α) p_{t_0} + α p_{t_1}   (9)</p>
        <p>θ_t = θ_{t_0} + α · wrap(θ_{t_1} − θ_{t_0})   (10)</p>
        <p>In Eq. 10, wrap(·) maps angles to (−π, π] to ensure shortest-arc interpolation. Interpolation of sections whose length exceeds G may cause problems such as objects penetrating the scene, so they are not interpolated and are left as post-processing targets.</p>
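        <p>The gap interpolation of Eqs. 9 and 10 can be written directly from the definitions; the sketch below assumes keyframe poses stored as (x, z, theta) tuples, which is an implementation choice rather than part of the method description.</p>
        <preformat>
import math

def wrap(a):
    """Map an angle to (-pi, pi] so the yaw is interpolated along the shortest arc."""
    return math.atan2(math.sin(a), math.cos(a))

def interpolate_gap(pose0, pose1, t, t0, t1):
    """Linear position and shortest-arc yaw interpolation for a missing frame t in (t0, t1)."""
    alpha = (t - t0) / (t1 - t0)
    x = (1 - alpha) * pose0[0] + alpha * pose1[0]         # Eq. 9, x component
    z = (1 - alpha) * pose0[1] + alpha * pose1[1]         # Eq. 9, z component
    theta = pose0[2] + alpha * wrap(pose1[2] - pose0[2])  # Eq. 10
    return x, z, theta
        </preformat>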
        <p>The full procedure is summarized in Algorithm 1.</p>
        <p>Algorithm 1: Algorithm F for dynamic object pose estimation.</p>
        <preformat>
Require: image sequence ℐ, 3D scene mesh S, object mesh O, camera C = (P, K)
Ensure: object pose sequence Q
 1: for t ← 1 to T do
 2:   M_t ← SAM2(I_t)                 // mask image from Segment Anything Model 2
 3:   r̄(s) ← PrincipalRay(M_t, C)
 4:   h ← Height(O)
 5:   s* ← (h/2 − o_y) / d̄_y
 6:   p_t ← r̄(s*)
 7:   if t &gt; 1 then
 8:     v_t ← p_t − p_{t−1}
 9:     θ_t ← Filter(v_t, θ_{t−1})    // filtering with EMA, rate limit, deadband, motion gate
10:   end if
11:   q_t ← {x_t, 0, z_t, θ_t}
12: end for
        </preformat>
      </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>To evaluate the proposed methodology, we use a synthetic scene created in Unity. The object’s position
and rotation information is logged for each frame, and these data serve as the ground truth (GT). The methodology is
then assessed by comparing and analyzing two sets of data against the GT: the data obtained with the
stabilization filter applied and the data obtained without it.</p>
      <p>In Figure 2, during frames 0–20, the object moves short distances and performs specific actions while
largely stationary. From frames 21–35, it moves backward. Subsequently, the object moves straight,
turns to the right, and then to the left, before ending the sequence. The same positions are obtained
with both the filtered and unfiltered methods; however, the unfiltered method exhibits highly sporadic
rotational directions, whereas the filtered method maintains consistency. More detailed results are
provided in Figure 3. As illustrated in Figure 4, a masking error occurs between frames 30 and 40, leading
to a substantial position error in this interval. In addition, after frame 90, an occlusion is observed,
resulting in tracking failure and a further increase in position error.</p>
      <p>As shown in Figure 3, between frames 0 and 40—where the inter-frame trajectory distance is short
and both in-place rotations and masking errors occur—the unfiltered method exhibits large rotational
fluctuations, whereas the filtered method maintains narrower fluctuations, demonstrating robustness to
noise. However, compared to the unfiltered method, the filtered method cannot immediately capture
rapid directional changes, owing to the maximum rotation speed limit, as observed during the right/left
turning section (frames 70–90).</p>
      <p>Table 1 shows that applying the filter significantly reduces errors compared to the unfiltered method.
In particular, for the maximum angular error (MaxAE), the unfiltered method produced a large error of
approximately 179°, whereas the filtered method reduced this error to about 63°.</p>
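      <p>For reference, the maximum angular error over a sequence can be computed from wrapped per-frame yaw differences as in the short sketch below; the function name and the use of NumPy are illustrative and not taken from the evaluation code.</p>
      <preformat>
import numpy as np

def max_angular_error_deg(theta_est, theta_gt):
    """Maximum absolute yaw error in degrees, using shortest-arc differences per frame."""
    diff = np.arctan2(np.sin(theta_est - theta_gt), np.cos(theta_est - theta_gt))
    return np.degrees(np.abs(diff).max())
      </preformat>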
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This study proposes a lightweight pipeline that, after extracting masks using Segment Anything Model 2
(SAM2), performs mask projection, computes positions via the intersection between a principal ray and
a plane, and approximates rotation (yaw) using frame-to-frame motion vectors. In addition, to suppress
noise in the estimated rotation and ensure continuity along the time axis, we introduce a stabilization
scheme that combines gating, deadband, rate limiting, and an exponential moving average (EMA). By
incorporating this stabilization module, the system is designed to maintain spatiotemporal consistency
even in the presence of noise and occasional errors. This design is practically meaningful in that it
achieves computational efficiency suitable for real-time processing without complex optimization or
large-scale learning.</p>
      <p>Nevertheless, the proposed approach is structurally dependent on segmentation quality. Because
positions and rotations are determined from masks produced by SAM2, a baseline level of error is inherent,
and large errors may occur when occlusions are present or when SAM2 fails due to its performance
limits. Moreover, since rotation is determined by the motion vectors, it is difficult to correctly reflect
orientation in scenarios dominated by lateral or backward motion, in-place rotation, or in-place actions.
Our method also assumes that objects remain in contact with the ground and therefore estimates only
3DoF (planar position and yaw); accordingly, it is not applicable to aerial objects (e.g., drones) or to
objects exhibiting substantial pitch/roll variations. To address these structural issues, future work
should introduce more robust methods for position and rotation estimation and extend the framework
to full 6DoF pose estimation.</p>
      <p>Furthermore, it operates under the assumption that the camera extrinsics in the digital twin coordinate
system are estimated with very high accuracy through visual localization. However, even a small pose
error can bias the principal ray-plane intersection, inducing position and rotation drift. To mitigate
this, pose-stabilization strategies—such as drift compensation using semantic landmarks and sensor
fusion with additional modalities (e.g., IMU)—should be considered. For dynamic object models with
large intra-class shape variation, the fixed-height assumption may not hold, potentially distorting position and orientation
estimates. Future work should estimate object height online from frame-by-frame observations to
preserve robustness when object models are inaccurate.</p>
      <p>The proposed method was evaluated only in a synthetic virtual scene using quantitative metrics.
For future work, in-the-wild validation is needed by applying the method to real video within a
digital twin constructed from a 3D reconstruction of the physical environment. It is desirable to
conduct multi-site, multi-scenario experiments spanning diverse indoor locations, camera setups, and
object categories, and to complement them with user studies that qualitatively assess the temporal
consistency of dynamic-object trajectories. The qualitative evaluation can use panel-based Likert-scale
ratings or pairwise comparisons. Raters inspect side-by-side overlays on the source video and top-down
trajectory visualizations, and statistical significance is assessed using appropriate tests. Such a combined
quantitative–qualitative evaluation in real settings would allow a more rigorous demonstration of the
generalizability and robustness of the proposed method.</p>
      <p>In summary, the proposed method presents a concise and portable foundation that goes beyond
the visualization of static structures in Digital Twins and aims for high-fidelity dynamic reproduction
approaching functional equivalence for dynamic objects in scenes. Its significance lies in providing
a balanced trade-off among lightweight implementation, real-time performance, and consistency in
application domains dominated by dynamic factors, such as manufacturing, logistics, and smart cities.
By pursuing the aforementioned extensions, we expect to progressively resolve challenges such as
occlusion and in-place motion, thereby further improving the reliability and applicability of dynamic
Digital Twin implementations.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Institute of Information &amp; Communications Technology Planning
&amp; Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2024-00344883).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-5 in order to: Grammar and spelling check. The author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] ISO/IEC, Digital twin - concepts and terminology, International Standard ISO/IEC 30173:2023, 2023. URL: https://www.iso.org/standard/81442.html.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] National Academies of Sciences, Engineering, and Medicine, The digital twin landscape, in: Foundational Research Gaps and Future Directions for Digital Twins, National Academies Press (US), 2024. URL: https://www.ncbi.nlm.nih.gov/books/NBK605499/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Fuller, et al., Digital twin: Enabling technologies, challenges and open research, IEEE Access 8 (2020) 108952–108971. doi:10.1109/ACCESS.2020.2998358.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Jones, C. Snider, A. Nassehi, J. Yon, B. Hicks, Characterising the digital twin: A systematic literature review, CIRP Journal of Manufacturing Science and Technology 29 (2020) 36–52. URL: https://www.sciencedirect.com/science/article/pii/S1755581720300110. doi:10.1016/j.cirpj.2020.02.002.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. M. Botín-Sanabria, A.-S. Mihaita, R. E. Peimbert-García, M. A. Ramírez-Moreno, R. A. Ramírez-Mendoza, J. d. J. Lozoya-Santos, Digital twin technology challenges and applications: A comprehensive review, Remote Sensing 14 (2022). URL: https://www.mdpi.com/2072-4292/14/6/1335. doi:10.3390/rs14061335.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] P. Muñoz, Measuring the fidelity of digital twin systems, in: Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings (MODELS '22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 182–188. doi:10.1145/3550356.3558516.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Digital Twin Consortium, Digital Twin Consortium defines digital twin, 2020. URL: https://www.digitaltwinconsortium.org/2020/12/digital-twin-consortium-defines-digital-twin/, accessed: 2025-08-21.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] F. Tao, M. Zhang, Y. Liu, A. Y. C. Nee, Digital twin driven prognostics and health management for complex equipment, CIRP Annals 67 (2018) 169–172. doi:10.1016/j.cirp.2018.04.055.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. S. Irfan, S. Dasgupta, M. Rahman, Toward transportation digital twin systems for traffic safety and mobility: A review, IEEE Internet of Things Journal 11 (2024) 24581–24603. doi:10.1109/JIOT.2024.3395186.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Q. Picard, S. Chevobbe, M. Darouich, J.-Y. Didier, A survey on real-time 3D scene reconstruction with SLAM methods in embedded systems, arXiv preprint arXiv:2309.05349 (2023). arXiv:2309.05349.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Dai, Z. Hu, S. Zhang, L. Liu, A survey of detection-based video multi-object tracking, Displays 75 (2022) 102317. doi:10.1016/j.displa.2022.102317.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] C. Kober, M. Fette, J. P. Wulfsberg, A method for calculating optimum digital twin fidelity, Procedia CIRP 120 (2023) 1155–1160. URL: https://www.sciencedirect.com/science/article/pii/S2212827123008739. doi:10.1016/j.procir.2023.09.141. 56th CIRP International Conference on Manufacturing Systems 2023.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. L. Schönberger, J.-M. Frahm, Structure-from-motion revisited, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4104–4113. doi:10.1109/CVPR.2016.445.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Furukawa, J. Ponce, Accurate, dense, and robust multiview stereopsis, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 1362–1376. doi:10.1109/TPAMI.2009.161.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, NeRF: Representing scenes as neural radiance fields for view synthesis, Communications of the ACM 65 (2022) 99–106. doi:10.1145/3503250.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, 3D Gaussian splatting for real-time radiance field rendering, ACM Transactions on Graphics 42 (2023) 1–14. doi:10.1145/3592433.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] C. Chen, B. Wang, C. X. Lu, N. Trigoni, A. Markham, Deep learning for visual localization and mapping: A survey, IEEE Transactions on Neural Networks and Learning Systems 35 (2024) 17000–17020. doi:10.1109/TNNLS.2023.3309809.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W. Xiao, R. Chierchia, R. S. Cruz, X. Li, D. Ahmedt-Aristizabal, O. Salvado, C. Fookes, L. Lebrat, Neural radiance fields for the real world: A survey, 2025. URL: https://arxiv.org/abs/2501.13104. arXiv:2501.13104.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] N. Ravi, et al., SAM 2: Segment anything in images and videos, arXiv preprint arXiv:2408.00714 (2024). arXiv:2408.00714.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, T.-K. Kim, Multiple object tracking: A literature review, Artificial Intelligence 293 (2021) 103448. URL: https://www.sciencedirect.com/science/article/pii/S0004370220301958. doi:10.1016/j.artint.2020.103448.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] J. Shin, N. Hassan, A. S. M. Miah, S. Nishimura, A comprehensive methodological survey of human activity recognition across diverse data modalities, 2024. URL: https://arxiv.org/abs/2409.09678. arXiv:2409.09678.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, G. Shi, A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025. URL: https://arxiv.org/abs/2501.02189. arXiv:2501.02189.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>