1. Introduction

Nonparametric data recovery in IoT sensor networks based on a priori density⋆

Viacheslav Kovtun

kovtun_v_v@vntu.edu.ua vkovtun@iitis.pl 0 0 Vinnytsia National Technical University , Khmelnytske shose, 95, Vinnytsia, 21021 , Ukraine

Reliable recovery of sensor information is critically important for the continuous operation of distributed IoT systems under conditions of data loss, noise distortion, and structural uncertainty. This paper proposes a combined approach to sensor data reconstruction that integrates latent state smoothing via recursive filtering with local nonparametric posterior density estimation using kernel functions. In contrast to traditional filtering and forecasting schemes, the developed model performs full interpolation recovery by incorporating both past and future measurements in time. The mathematical structure of the method is theoretically justified, and its properties are examined under partial observability. Numerical modelling results confirm the high accuracy and robustness of the proposed approach under up to 40% observation loss and reduced signal-to-noise ratio, corresponding to typically adverse conditions in LPWAN environments, particularly LoRaWAN. Compared to Kalman filtering and baseline nonparametric regression methods, the proposed model achieved a 20 30% improvement in interpolation accuracy. The practical value of the study lies in the feasibility of implementing the approach on resource-constrained devices without prior training, which opens up prospects for its application in smart monitoring systems, energy-efficient management, agroecological surveillance, and environmental protection. sensor data recovery; Internet of Things systems; time series interpolation; posterior estimation; Bayesian filtering and smoothing; nonparametric density estimation; kernel regression; structural model uncertainty 1 ICyberPhyS 5: 2nd International Workshop on Intelligent & CyberPhysical Systems, July 04, 2025, Khmelnytskyi, Ukraine 1 Corresponding author.

1. Introduction

Over the past decade, the Internet of Things (IoT) has evolved into a fundamental technological platform enabling real-time data collection, processing, and analysis across a wide range of applications. For example, in smart cities, sensor networks are employed to monitor air quality, noise levels, traffic intensity, or to detect emergencies in water supply systems [1, 2]. In agriculture, they are used to measure soil moisture, air temperature, wind speed, and other parameters influencing agrotechnological decisions [3, 4]. In healthcare systems, sensor devices are applied for remote patient monitoring

for instance, tracking glucose levels, heart rate, or blood oxygen saturation [5, 6]. Industrial IoT systems use sensors to monitor the technical condition of equipment, detecting vibrations, overheating, or leaks [7, 8].

Despite their widespread application, sensor networks face significant technical limitations: data transmission may be disrupted due to unstable connectivity, packet loss, interference in radio channels, limited sensor energy capacity, or processing node overload. As a result, real-world data are often incomplete, exhibiting random losses, gaps in time series, or distortions caused by noise [9]. For instance, in smart lighting systems or emergency response frameworks, delays in transmitting sensor events may trigger incorrect actuator responses. In medical systems, inaccuracies or loss of vital sign readings may pose a risk to patients. In industrial contexts, misinterpretation of data due to signal distortion can lead to equipment downtime or failure.

Thus, the tasks of accurate and robust recovery of lost or noise-corrupted data are not only of theoretical importance but also critically necessary for the reliable operation of practical IoT systems. In this context, the need for adaptive mathematical models capable of operating with incomplete information while ensuring high reconstruction accuracy is particularly urgent.

In the current scientific discourse concerning the reconstruction of sensor information under conditions of partial availability and high noise levels, several conceptual approaches have emerged, each demonstrating a distinct domain of application, degree of mathematical justification, and practical effectiveness. Among the most extensively studied are analytical, nonparametric, and intelligent methods. However, in addition to these, hybrid, heuristic, sparse-reconstruction, multichannel statistical, and reinforcement-based methods are increasingly being employed and merit particular attention when analysing relevant analogues.

Analytical methods [10, 11], primarily represented by filtering and smoothing approaches (such as Kalman filtering, particle filters, and RTS smoothing), assume the presence of a well-defined system dynamics model and known statistical properties of the noise in advance. These methods are mathematically grounded and perform well in linear or quasi-linear systems with normally distributed disturbances. However, in IoT sensor networks, such assumptions are rarely fully satisfied. The presence of missing data, irregular sampling rates, correlated noise, and the lack of a complete a priori model render these methods sensitive to perturbations, ill-suited to complex temporal structures, and inadequate for interpolation-based recovery, which requires incorporating both past and future measurements [12].

An alternative is offered by nonparametric approaches [13, 14], which rely on local approximation or empirical distribution estimation without a fixed model structure. Kernel regression, locally weighted smoothing (LOESS), nearest neighbour methods, and empirical Bayesian procedures exhibit considerable flexibility and allow signal recovery even in the absence of information about the underlying generative mechanism. However, these methods are limited in terms of computational efficiency, do not scale well to high-dimensional settings, are sensitive to window width selection, and fail to capture the latent temporal structure of the data. Moreover, they lack a formal mechanism for incorporating the context of future observations, which is crucial in interpolation tasks rather than forecasting.

In recent years, intelligent approaches [15 17], particularly deep learning models, have gained significant traction. Recurrent neural networks (RNNs), long short-term memory architectures (LSTMs), autoencoders, transformers, and graph neural networks are employed to model complex nonlinear relationships between observations and their temporal context. Their key advantage lies in the ability to automatically learn data representations and reconstruct missing values without an explicitly defined model. At the same time, these methods exhibit notable limitations: they require large training datasets, which are often unavailable in sensor networks; they have high computational complexity, making them unsuitable for energy-constrained devices; and they lack formal interpretability, which complicates verification of reconstruction reliability. In most cases, such architectures are designed for extrapolation (i.e. forecasting future values) and do not support accurate reconstruction of missing segments within a trajectory, which requires bidirectional informational context.

In response to the limitations of each of the aforementioned classes of methods, the literature has seen the emergence of hybrid approaches [18, 19] that combine the formal structure of analytical models with the flexibility of nonparametric or intelligent components. Examples include the integration of forward filtering with local smoothing or the combination of Bayesian models with autoencoders. Such systems allow for the preservation of interpretability and a controlled structure while enabling flexible adaptation to changes in the statistical properties of the data. However, the complexity of implementation, the need to align heterogeneous components, and the lack of sufficient theoretical justification currently hinder the widespread adoption of such solutions.

Heuristic methods [20, 21] are also applied in practical systems, relying on domain-specific knowledge or sets of rules. For instance, in reconstructing a temperature profile, one may assume a limited rate of change in readings or approximate missing values using linear functions based on neighbouring values. Such methods are simple, energy-efficient, and scale well, but they lack generalisability and do not guarantee accuracy under complex or atypical system behaviour.

Special attention should be given to methods based on the concept of sparse representation the so-called sparse reconstruction models [22]. These assume that a signal can be expressed as a sparse combination of basis vectors in a specific space (for example, wavelet or Fourier). This enables effective recovery of missing values even under a high level of data loss. However, such models typically require a fixed transformation structure, which may not adequately capture the dynamics of real-world sensor processes.

In many real-world applications, multichannel statistical methods [11, 23] also prove effective, as they utilise correlation information between multiple sensor streams. Models of interdependent processes, multivariate Bayesian estimations, or multichannel PCA enable the reconstruction of data in one channel based on the values from others. This is particularly useful in the context of heterogeneous sensor networks; however, such models are sensitive to channel mismatches, signal synchronisation issues, and require full observability in at least some of the sources.

Finally, a promising direction is reconstruction methods based on reinforcement learning [24, 25], where an agent learns to develop an optimal reconstruction strategy by receiving rewards that depend on the accuracy of recovery or its impact on subsequent control. These methods are particularly effective in adaptive or dynamically changing environments; however, their training requires multi-episode simulation, and the interpretability of the results remains an open challenge.

In summary, despite methodological diversity and numerous attempts to adapt existing approaches to sensor environments, no universal solution currently exists that ensures a combination of accuracy, adaptability, computational efficiency, and analytical controllability under conditions of loss, noise, and limited informational completeness. Analytical methods are overly rigid in their assumptions, nonparametric methods are excessively local and resourceintensive, and intelligent methods are complex, opaque, and weakly interpretable. Even hybrid or heuristic approaches, while mitigating the imbalance between flexibility and formality, largely remain narrowly specialised or insufficiently theoretically substantiated. In this context, there is a clear need to develop a new methodology that combines the advantages of parametric models (such as stability, interpretability, and smoothing efficiency) with the flexibility and local adaptability of nonparametric estimations. This is precisely the approach implemented in the present study, which proposes a sensor data reconstruction model based on posterior interpolation density, combining the formal structure of the exponential family of distributions with the capacity for local kernel correction. This enables not only accurate recovery of missing information under partial availability but also stable system operation amid changing data structures and constrained sensor network resources. Accordingly, the proposed research fills a critical methodological gap in the field of sensor signal reconstruction, offering a formally grounded, practically feasible, and flexible model tailored to the specific requirements of modern IoT systems.

The object of this study is the process of sensor information reconstruction under conditions of partial data availability, noise distortion, and structural uncertainty in the dynamics of distributed IoT systems.

The subject of this study is the set of theoretical foundations, mathematical models, and methods for approximating the posterior density that enable continuous reconstruction of sensor data under uncertainty, based on the integration of filtering and smoothing procedures with nonparametric kernel estimation.

The aim of this study is to develop and justify a combined method for sensor data reconstruction in distributed IoT systems by integrating an analytical model of posterior density with nonparametric procedures of local kernel smoothing, in order to ensure accurate and robust information recovery under conditions of partial availability and structural uncertainty.

The scientific novelty of this study lies in the development of a formalised combined approach to sensor data reconstruction in distributed IoT systems, which for the first time integrates a recursive filtering and smoothing model for posterior density estimation with local nonparametric kernel smoothing procedures. The proposed methodology enables interpolation-based data recovery at points with partial or zero observability, does not require a complete description of the latent process dynamics, adapts to the noise level, and demonstrates improved accuracy compared to traditional filtering and empirical approaches.

The practical value of this study lies in the capability for continuous and resource-efficient recovery of sensor information in real time under conditions of partial loss, noise, and structural uncertainty features typical of modern IoT systems. The proposed combined method does not require a complete model of system dynamics, easily adapts to various sensor network topologies, can be implemented on energy-constrained devices, and ensures improved interpolation accuracy without the use of training data, making it suitable for deployment in smart monitoring, automated control, energy-efficient environments, and other critical applications.

2. Models and methods

2.1. Fundamental assumptions In the task of recovering observed data in sensor networks operating within a dynamic and probabilistically determined environment, it is essential to formalise interpolation as a procedure for estimating the values of system state vectors Yl  n , where n denotes the dimensionality of the hidden state space, based on the available sequence of sensor observations x1m = ( x1, , xm )T 

mk , in which m represents the number of discrete time steps and k is the dimensionality of each vector observation space. The positional index l indicates the moment 1  l  m at which recovery is performed. According to the Bayesian approach, optimal interpolation can be carried out by approximating the conditional expectation of the function R (Yl ) , which represents the recoverable characteristic of the system state, given the available information x1m . This is mathematically expressed by formula pl ( yl x1m ) =  ( x1l ) ( xlm ) l ( yl x1l ) l ( yl xlm )  ( x1m ) ( yl , xl ) where pl ( yl x1m ) denotes the conditional a priori probability density Yl given the known data from the sensor sample x1m , which, in the context of real-time recovery requirements, is naturally interpreted as the interpolated posterior density.

In the context of data recovery in IoT sensor networks, where time series measurements are partially lost or noise-corrupted, there arises a need to construct the interpolated posterior density pl ( yl x1m ) , which accounts for both historical data and future measurements relative to the recovery moment l . One approach to constructing this density is to represent it as a combination of observation densities and conditional posterior distributions. This density is defined as ( 1 ) ( 2 ) where  ( x1l ) denotes the density of sensor observations from the beginning up to moment l ,  ( xlm ) is the density of sensor observations from moment l to the end,  l ( yl x1l ) is the conditional density of the state yl given prior data,  l ( yl xlm ) is the backward density of the same state given subsequent observations, and  ( yl , xl ) is the joint density of the state and corresponding observation at moment l . The representation in ( 2 ) enables formalisation of the interpolation task under incomplete or noisy information, characteristic of sensor-based IoT systems.

In the context of defining the joint density  ( yl , xl ) in the denominator of expression ( 2 ), a key stage involves the analytical formalisation of the interpolation procedure. In sensor networks modelled as Markov dynamic systems with correlated measurements, this density can be represented as a convolution over the space of previous states and observations, accounting for the historical context of information transmission. This allows for the latent dependency structure within the time series of sensor data to be incorporated. The corresponding expression is given in ( 3 ) ( 4 )  ( yl , xl ) =  q ( yl yl−1 )   ( xl xl−1, yl ) ( yl−1, xl−1 ) dxl−1dyl−1 ,

Yl−1 Xl−1 where q ( yl yl−1 ) denotes the state transition density in the Markov model,  ( xl xl−1, yl ) is the conditional density of a sensor observation given the state yl and the previous measurement xl−1 , and  ( yl−1, xl−1 ) is the joint density at moment l − 1 . Accordingly, formula ( 3 ) enables the computation of the normalisation factor included in expression ( 2 ), thereby allowing accurate data recovery in sensor-based IoT systems.

In the process of formalising the task of recovering lost or corrupted data in IoT sensor networks, it is appropriate to define an analytical model that unifies computations related to the interpolated posterior density (see formula ( 2 )) and the normalisation factor (see formula ( 3 )). In this study, it is assumed that the conditional density of observations given the latent state belongs to the exponential family of distributions. This allows the density  ( yl , xl ) to be expressed in the following form

 ( yl , xl ) = K ( ym )b( xm )expT ( xm ) R ( ym ) , where K ( ym ) is the normalising coefficient, b( xm ) is the base function of the observation, T ( xm ) is the sufficient statistic, and R ( ym ) is a parametric function of the latent state. Assumption ( 4 ) is adopted as the foundation for the subsequent analytical analysis of sensor information reconstruction models. 2.2. Analytical interpolation model for the reconstruction of lost data in sensor-based IoT systems In addressing a range of applied problems arising during the operation of sensor networks (for example, reconstructing missing temperature or pressure readings, restoring the trajectory of a moving object from incomplete GPS samples, or adaptive signal smoothing in cases of packet loss), it is necessary not only to compute the conditional expectation (see formula ( 1 )) but also to construct the full interpolated posterior density of the state pl ( yl x1m ) . This density contains all the required information about the unknown state yl , and enables not only point estimation but also statistical decision-making based on confidence intervals, quantile assessments, or Bayesian risks. To simplify the analytical notation, probability density functions of the form  ( x) will henceforth be presented without explicitly specifying the argument, unless this leads to ambiguity. The argument will be indicated only where necessary for correctness.

To derive the equation describing the optimal interpolation estimate under conditions of incomplete sensor data, it is necessary to rely on the structure of the joint density for the entire trajectory of latent states y1m and corresponding observations x1m , which can be represented in a recursive form. If the sensor network model is described by conditional densities of the form  ( yl , xl yl−1, xl−1 ) , the full density  ( y1m , x1m ) is expressed as a product of the form ( 5 ) ( 6 ) ( 7 )  ( yl , x1m )  ( yl , x1l , xlm+1 ) =  ( x1l ) l ( yl x1l ) ( xlm+1 yl , xl ) .

To further simplify the computations, let us assume that the joint distribution density  ( x1l , yl ) can be factorised into the conditional density  ( xl xl−1, yl ) and the Markov transition density q ( yl yl−1 ) . In this case, taking into account the properties of conditional independence, we obtain m  ( y1m , x1m ) = ( y1, x1 ) ( yl , xl yl−1, xl−1 ) ,

l=2 corresponding to the extended structure of a generalised Markov model for the dynamics of latent states in a sensor environment. Based on the representation in ( 5 ), a formal derivation of the expression for the interpolated posterior density pl ( yl x1m ) can be carried out, defined as the ratio of the joint density to the marginal density of the sensor sample alone. By sequentially applying the chain rule decomposition with consideration of conditional independence, we obtain

In the subsequent analysis of equation ( 6 ), which describes the interpolated posterior density in a sensor network, it becomes necessary to detail the product  ( x1l ) l ( yl x1l ) . This product can be decomposed using the recursive structure of the dependency model between latent states and sensor observations. Given that previous realisations of observations and states are known, we have  ( x1l ) l ( yl x1l ) =   ( y1l−2 , x1l−2 , yl−1, xl−1 ) ( yl , xl yl−1, xl−1 )dyl−1 .

n  ( x1l ) l ( yl x1l ) = ( xl xl−1 yl )  q ( yl yl−1 ) ( yl−1, x1l−1 )dyl−1 =

n = ( x1l−1 ) ( yl x1l−1 ) ( xl xl−1, yl ), where the factor  ( yl x1l−1 ) represents the updated state estimate based on the observation history up to moment l , and the dependence on xl in the right-hand side of equation ( 7 ) reflects the contribution of the current measurement to the formation of the predicted density.

Let us consider the factor  ( xlm+1 yl , xl ) in the numerator of formula ( 6 ), which corresponds to l . This factor contains information about future measurements, conditional on the current system state and observation. It can be expressed as an integral over densities that characterise transition and observation probabilities at subsequent time steps. To achieve this, we employ the structure of the generalised model and obtain  ( xlm+1 yl , xl ) =   ( xlm+2 yl+1, xl+1 ) ( yl+1, xl+1 yl , xl ) dyl+1 =

n =   ( xlm+2 , yl+1, xl+1 ) n

 ( xl+1 yl+1 )  ( xl+1 xl , yl+1 ) q ( yl+1 yl ).

q ( yl+1 )

This approach enables the dependence on future measurements to be expressed as integration over possible states yl+1 , thereby formalising the mechanism of information reconstruction in a sensor network in cases where part of the observations is missing. Thus, the method generalised by expression ( 6 ) allows the construction of interpolation estimates even under incomplete sensor trajectories. However, if the sensor network model is described as one that permits factorisation of the full density into a product of conditional densities and prior distributions, then the interpolated posterior density pl ( yl x1m ) , previously defined in formula ( 6 ), can be rewritten in a generalised form using formulas ( 7 ) and ( 8 ):  ( x1l ) l ( yl x1l ) ( xlm+1 yl , xl )  ( yl ) ( yl x1l−1 ) ( yl xlm+1 ) q ( yl ) , ( 8 ) ( 9 ) ( 10 ) ( 11 ) which allows the influence of the observation xl to be separated from the generalised impact of the past and future within the trajectory. In this case, the ratio of the normalising factors in the numerator and denominator of formula ( 11 ) can also be interpreted through an integral representation where the construction x1m ! xl denotes the vector x1m excluding the element xl , and the factor l () generalises the normalisation component, which depends on the full sensor trajectory. From formula ( 9 ), it follows that the first factor on the right-hand side does not contain yl , while the second explicitly depends on xl , which is essential for implementing estimation procedures under conditions of incomplete data.

The analytical form of the interpolation equation, as presented in formula ( 9 ), enables further structuring of the posterior density in information recovery tasks within sensor-based IoT systems. To facilitate the construction of estimation expressions, we introduce an auxiliary function hl ( yl ) , which consolidates the components that depend exclusively on the latent state:

Using the definition in ( 10 ), formula ( 9 ) can be rewritten in a simplified form for the interpolated posterior density:

hl ( yl ) = ( yl x1l−1 ) ( yl xlm+1 ) q ( yl ) .

Formulas ( 11 ) and ( 12 ) enable the estimation of the interpolation distribution through integration of the likelihood function  ( xl yl ) , taking into account the entire available context of the sensor sample, which is particularly relevant in cases of partial data loss or when reconstruction is required based on asymmetric information from past and future values.

Assuming now that the conditional density of sensor observations relative to the latent state belongs to the exponential family (see formula ( 4 )), it is possible to derive a gradient form of the equation for the distribution density of the reconstructed signal in an IoT system. This enables a formalised description of the influence of the observation xl on the refinement of the state estimate yl during interpolation. Under these conditions, the following relation holds: ( 12 ) ( 13 ) ( 14 ) ( 15 ) xll ( xl x1m ! xl ) where  xl denotes the gradient operator with respect to the variable xl . Given that the conditional density  ( xl yl ) belongs to the exponential class (see ( 4 )), its gradient with respect to xl can be expressed as

xl ( xl yl ) = (xl ln g ( xl ) + xl ST ( xl )( yl )) ( xl yl ) , where g ( xl ) is the base function, S ( xl ) is the vector of sufficient statistics, and ( yl ) is the vector of parametric functions of the state.

As a result, substituting the state estimate yˆl into the functional of the exponential model allows the observed part of the logarithmic gradient of the conditional density to be expressed in the form

ST ( xl )( yˆl ) = xl where all terms are either estimated or computable from the sensor trajectory. Relation ( 15 ) enables parametric recovery based on the gradient of the logarithmic density, which constitutes an effective approach to estimation tasks under conditions of incompleteness or noise in sensor networks.

From expressions ( 9 ) ( 15 ), it follows that a central role in the reconstruction of sensor information is played by the a priori or posteriorly known dependency between the current observation xl and the hidden state yl . The construction of the likelihood function  ( xl yl ) , which defines the shape of the signal distribution given a specified state, critically influences estimation accuracy. In practical sensor environments, this function is often not analytically known, or its parametric specification is insufficiently accurate due to limited data or the presence of noise. In such cases, a natural alternative is the application of nonparametric approaches, particularly kernel density estimation methods, which allow the recovery of the distribution without assumptions about its parametric form. Thus, even in complex and heterogeneous IoT systems, effective interpolation-based estimation can be achieved using approximated densities  ( xl yl ) derived through nonparametric kernel procedures.

To address the interpolation problem in sensor networks based on a single realisation of the observed vector signal, it is appropriate to estimate the conditional density in the form of a partial approximation of the full posterior model. For this purpose, we define the conditional approximation as  ( xl xll−−c1, xll++1c ) = ( xll−+cc )  ( xll−−c1, xll++1c ) , where the parameter c specifies the number of observations symmetrically selected before and after moment l , which are taken into account for reconstructing the value of xl . This approach reduces model complexity and enables a transition to nonparametric estimation based on a finite sample.

Assuming that the full density  ( x1m ) is unknown, we apply the classical kernel method for its estimation, whereby the approximation is performed based on M realisations. Accordingly, the approach to constructing the kernel density is generalised by expression ( 16 ) ( 17 )  M ( x1m ) =

Mb1Mmk iM=1 lm=1 jk=1 K  xl j −bXM l j (i)  , where K is the smoothing kernel, bM is the window width (smoothing parameter), k denotes the dimensionality of the observation space, and X l j (i ) is the j -th component of the i -th realisation at time l .

Considering the exponential structure of the model and employing the kernel estimate ( 16 ), it is possible to approximate the gradient of the log-density estimate in the form

ST ( xl ) ( yˆl,M ) = xl M ( xll−+cc ) − xl b ( xl ) ,

 M ( xll−+cc ) b ( xl ) which enables the reconstruction of the exponential model structure without explicit parametrisation under conditions of limited statistical information, relying solely on empirical observations and the corresponding kernel.

The interpretation of equation ( 17 ) is quite intuitive in the context of data reconstruction in a sensor environment. In essence, expression ( 17 ) implies that the gradient of the logarithmic kernel density estimate should be compensated by the influence of the base function b ( xl ) , which may reflect the a priori structure or background properties of the sensor signal. The quantity ST ( xl )( yˆl,M ) in this context serves as an estimate of the generalised state parameters of the system, ensuring a balance between empirical statistics and the formal distribution model. Thus, equation ( 17 ) can be used as a heuristic update for constructing an interpolation estimate in cases where model parameters are not predefined a priori or when observations are limited to a single trajectory. In this setting, the role of the parameter c is to define the scope of the local context (in both directions from moment l ) considered in the estimation. The larger the value of c , the broader the temporal context used, which may enhance the stability of recovery but also increases the required number of observations and computational complexity. The final choice of parameter c is usually determined empirically and depends on the data structure, the nature of the sensor network, and the dimensionality of the observation vector.

3. Results and discussion

To demonstrate the effectiveness of the interpolation-based recovery method presented in Section 2 for reconstructing lost data in sensor-based IoT systems, a test sensor environment was simulated in which the observations xl  depend on a latent state yl  that is not directly accessible. The system dynamics are described by a first-order stochastic Markov process with additive noise, yl+1 = yl +  l+1 ,  2 = 2 (1 − 2 ) , xl = yl + l , yl , xl  , where  l and  l are independent random variables, distributed according to ( 0,1 ) , representing process disturbance and sensor measurement noise, respectively. The remaining fixed parameters introduced in expressions ( 18 ) and ( 19 ) are hereinafter set as = 0.7 ,  2 = 2 ,  =  = 1; the length of the realisation is m = 1000 . Data loss is simulated by randomly zeroing 10% of the observation values xl , which models interruptions in the flow of sensor information. All computations were performed in the Python 3.10 environment using the following libraries: NumPy (for trajectory generation and array processing), SciPy (for implementing the normal distribution and integration), scikit-learn (for nonparametric density estimation), Matplotlib (for plotting), and statsmodels (for auxiliary stochastic signal processing).

To implement sensor information recovery under conditions of partial data unavailability, a recursive state estimation procedure was applied, structurally corresponding to the classical filtering scheme for linear stochastic models with scalar observations. The forward phase of the recursion (i.e. forward-in-time estimation) generates current a priori estimates of the latent state yˆl based on the available values x1, , xl , while the backward phase (smoothing) refines these estimates using information from subsequent time steps. The recursive formulas for the forward phase are given as and the observations are generated as a linear transformation of the state with additional sensor noise. This model structure aligns with monitoring scenarios in IoT systems, where part of the data may be lost or corrupted due to interference, overload, or communication disruptions. The model of the test sensor environment is analytically formalised as ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 )  2 + 2vl yˆl+1 = yˆl + 2 + 2 2 + 2 2vl ( xl+1 −  yˆl ) , vl+1 = 2 ( 2vl +  2 ) 2 ( 2vl +  2 ) + 2 ,  2 2 2 with initial conditions yˆ1 = 2 2 + 2 x1 , v1 = 2 2 + 2 x1 ..

After the completion of the forward phase, smoothing is performed to refine the estimates yˆl based on information from subsequent observations. This is achieved using the relation Dl = 2 ( 2vl + 2 (1− 2 )) + 2 , where Dl is the normalisation factor representing the expected total variance of the sensor observation xl+1 , accounting for the uncertainty in the previous state yl , the dynamic disturbance  l , and the sensor noise  l . The factor Dl serves as a scaling coefficient in the smoothing formula and ensures numerical stability of the procedure ( 20 ).

Further smoothing is analytically formalised as

yl = yˆl +  vl ( Xl+1 −  yˆl ) Dl , vl = vl (2 2 (1− 2 ) + 2 ) Dl .

The recursive procedure is finalised through backward-time refinement: yl = yl + vl 2 (1− 2 )( yl+1 − yˆl ) (dlvl+1 ) , vl = vl + vl2vl+1 ( 2 (1 − 2 )) ( Dl2vl+1 ) , 2 where l = 2, n −1 , and the parameter dl = 2 ( 2vl+1 + 2 (1− 2 )) + 2 serves as the normalisation factor at the next smoothing step, acting as an analogue to Dl . The value dl accounts for the predicted variance of the sensor measurement xl+1 , as well as the uncertainty in the predicted state yl+1 . This ensures stable propagation of future information backward along the time axis, enabling refinement of the state estimate even in cases of data loss or channel instability in sensor-based IoT systems.

As part of the experimental analysis, an alternative nonparametric interpolation scheme was also implemented, which does not rely on recursive procedures but is instead based on local kernel approximation of the conditional density distribution. This approach enables the recovery of the value of a lost sensor observation xl or its corresponding latent state yl without relying on a parametric model or precise knowledge of the source dynamics. The corresponding interpolation formula is given as ( 23 ) ( 24 ) ( 25 ) where  M ( xll−+cc ) is the kernel density estimate of the joint observation vector within a window of width 2c + 1 , computed using the nonparametric method where bM is the bandwidth parameter for the smoothing kernel function K ( 1 ) , and the partial derivative in the numerator of formula ( 23 ) is analytically computed based on the Gaussian kernel yˆlc = 2 ( xl ) M ( xll−+cc ) + xl ,  

 M ( xll−+cc )  M( 1 ) ( x1m ) = i=1 l=1 K ( 1 )  xl −bXMl (i)  (MbMmk+1 ) ,

M 2c+1  M ( xll−+cc )  ( xi − xl )  exp (− ( xl+ j − xi+ j )2 2b2 ) m−c c ( xl ) M ( xll−+cc ) = 12 i=c+1 j=−c b   exp (− ( xl+ j − xi+ j )2 2b2 ) m−c c i=c+1 j=−c .

This approach allows the state estimate yˆlc to be computed directly via the gradient of the logarithmic density, focusing solely on the local context a symmetric window of width 2c + 1 centred around the moment l .

The method does not require knowledge of the model or its parameters, making it suitable for tasks in which the structure of the data source is unknown or dynamically changing, as is typical in many practical sensor-based IoT systems. It should be noted that the quality of approximation is critically dependent on the choice of the smoothing kernel K ( 1 ) and the window width parameter b . As with most kernel-based methods, these parameters are not explicitly defined, since their optimal values depend on the a priori unknown properties of the observation distribution and the latent process.

To evaluate the effectiveness of the proposed interpolation procedures, the reconstructed dynamics of the latent state were assessed over a segment of the sensor trajectory. The simulation environment was based on the models generalised by expressions ( 18 ) and ( 19 ), with the fixed parameter values  , , ,, 2 , m

specified at the beginning of the section. Based on the generated data, three independent recovery trajectories were constructed: •

Forward filtering according to formula ( 20 ), which provided a recursive estimate of the state yˆl based on the available information up to moment l ;

Optimal interpolation, approximating smoothing in accordance with ( 21 ) ( 22 ), and implemented as an averaging of adjacent filtered values; Nonparametric recovery, which applied a gradient-based estimate derived from local density in accordance with formulas ( 23 ) ( 25 ), and was approximately implemented through signal smoothing xl within a fixed window.

Fig. 1 presents a comparison of each of these trajectories with the true (unobserved) realisation of the latent process over the interval l 900,1000 . The resulting plots provide a visual assessment of how closely the reconstructed trajectories approximate the original signal under conditions of sensor noise and data loss.

Fig. 1 illustrates the comparison of the three implemented interpolation methods with the true trajectory of the latent process over the interval l 900,1000 . Forward filtering (formula ( 20 )) provides a satisfactory approximation, but in regions of rapid change (e.g. near l = 975 ), a characteristic lag in the estimate is observed. This confirms the well-known limitation of forward filters under conditions of partial data availability.

The implemented smoothing (formulas ( 21 ), ( 22 )) significantly improves accuracy, particularly in dynamic segments where the incorporation of future information helps to compensate for local bias. In contrast to smoothing ( 21 ), ( 22 ), the nonparametric interpolation (formulas ( 23 ) ( 25 )) tends to over-smooth, especially in unstable segments, indicating sensitivity to the choice of window width b .

Overall, Figure 1 confirms the correctness of the implemented formula adaptations and supports the relevance of parametric smoothing as the principal method for reconstruction in sensor networks with data loss.

From a performance perspective, the proposed interpolation model is computationally efficient and suitable for deployment on resource-constrained IoT devices. The recursive filtering and smoothing procedures (formulas ( 20 ) ( 22 )) exhibit linear time complexity O(T ) with constant memory usage per step, as each update relies only on the previous state and observation. In contrast, the nonparametric kernel estimator (formulas ( 23 ) ( 25 )) has a computational complexity of O (2md ) per reconstruction point, where mm is the half-window size and dd is the observation space dimensionality, rendering it more practical in small-scale, low-dimensional deployments. The method does not require pre-training or extensive memory resources, ensuring its applicability in real-time LPWAN environments (e.g., LoRaWAN).

To verify the robustness of the implemented methods against sensor observation noise, a series of experiments was conducted (see Fig. 2) using a fixed trajectory of the latent state yl , modelling the hidden process in typical IoT environments. In each trial, the latent sequence yl remained unchanged, while random disturbances  l were added to the observations xl = yl + l , following a normal distribution with zero mean and variance  2 0.5,5.0 . The proportion of missing values was fixed at 10%, simulating a typical data loss scenario in a sensor environment with an unstable communication channel. while noise in the signal xl was modelled with varying variance. According to the results, smoothing consistently outperforms filtering in terms of accuracy, especially under high-noise conditions, due to its use of both past and future information. The nonparametric method proved effective only at low noise levels; however, under high-noise conditions, its accuracy deteriorated sharply, indicating sensitivity to the signal-to-noise ratio and a lack of internal structural adaptation.

The latent trajectory yl remained fixed, while the signal noise xl was modelled with varying variance. According to the obtained results, smoothing consistently demonstrated higher accuracy than filtering, particularly under high-noise conditions, which is explained by its use of both past and future information.

The nonparametric method proved effective only at low noise levels; however, under high-noise conditions, its accuracy deteriorated sharply, indicating its sensitivity to the signal-to-noise ratio and the absence of internal structural adaptation.

To analyse the impact of data incompleteness on signal recovery accuracy, a scenario was simulated in which the latent state trajectory yl remained fixed across all trials, while random disturbances xl = yl + l with fixed variance  l were added to the observations  2 = 2 . The observation gaps were introduced according to a completely random masking (MCAR) scheme, i.e. independently of the value and position of xl , with a loss rate ranging from 0% to 50%. The results of the study are presented in Fig. 3.

The accuracy of nonparametric signal recovery was investigated separately by varying a key component the kernel function. Three methodologically relevant kernels were selected for analysis: Gaussian, Tricube, and Epanechnikov. Each represents a distinct smoothing strategy: the Gaussian kernel has infinite support and performs global smoothing;

Tricube provides local smoothing with strong attenuation at the edges; and Epanechnikov is optimal in terms of minimising the mean integrated squared error. The analysis was conducted using a fixed realisation of the latent signal yl of length m = 1000 , with data loss simulated under a completely random masking scheme at a 10% loss rate and a noise level  2 = 2 .

Missing observations were linearly interpolated prior to applying the convolution. To improve the stability of smoothing, a symmetric convolution with edge reflection was used.

The Gaussian convolution was implemented in a fixed form, without further adaptation to the loss structure. RMSE was chosen as the accuracy metric, as it is more sensitive to extreme deviations and better captures the effects of over- or under-smoothing.

The results of the study are presented in Fig. 4.

Fig. 4 illustrates the variation in RMSE with respect to kernel width c 1,6 for nonparametric smoothing methods based on Gaussian, Epanechnikov, and Tricube kernels. For comparison, horizontal benchmark lines are included: the blue dashed line represents the accuracy of filtering based on formula ( 20 ), and the green dashed line represents analytical smoothing according to formulas ( 21 ), ( 22 ).

All kernels exhibit the characteristic U-shaped RMSE curve, with optima within the range c 2.5,3.5 . The Gaussian kernel, due to its global nature, achieves the lowest RMSE across the entire range, reaching a minimum of 1.08 at c = 2.5 . The Tricube kernel shows similar behaviour, with a slightly higher minimum error of 1.12 at c = 3 , indicating its effectiveness in local smoothing. The Epanechnikov kernel yields higher RMSE throughout the range, with a minimum of 1.22 at c = 3 , attributed to its less aggressive smoothing compared to Tricube. None of the nonparametric methods surpass analytical smoothing in accuracy; however, Tricube approaches the filtering level ( 20 ) within its optimal range, making it a viable choice in scenarios where full model information is unavailable.

In this study, the kernel bandwidth hh was selected empirically by evaluating RMSE across a range of candidate values on a fixed realisation of the latent process.

As shown in Figure 4, smaller kernel widths were observed to produce lower RMSE values across all tested kernel types. Accordingly, the optimal value was selected as the smallest tested bandwidth ( h = 1 ), which provided the best interpolation accuracy among the considered settings. -in estimators) were not applied, since their underlying assumptions such as stationarity or observation independence are not met in the context of temporally correlated and nonstationary sensor signals. In future work, we plan to incorporate adaptive mechanisms for bandwidth selection based on local signal density or variability.

To qualitatively compare the nature of errors in the reconstruction of the latent process yl , error distributions  l = yl − yˆl were constructed for filtering (formula ( 20 )), analytical smoothing (formulas ( 21 ), ( 22 )), and the nonparametric method based on the Gaussian kernel with c = 2.5 , which was identified as the most accurate among nonparametric approaches according to the analysis summarised in Fig. 4. All methods were applied to the same signal realisation with 10% randomly missing values (MCAR scheme) and a noise level  2 = 2 . For the nonparametric method, a fixed convolution was used without adaptation to the structure of the gaps, and missing values were preliminarily restored using linear interpolation.

The resulting error histograms (see Fig. 5) were normalised to density (i.e. the area under each curve equals 1), ensuring a valid comparison of their shape regardless of the number of observations. Additionally, kernel density estimates (KDE) were superimposed using a Gaussian of histogram fluctuations and accurate assessment of symmetry, variability, and the presence of heavy tails in the error distributions.

Filtering (blue curve) exhibits a symmetric but relatively wide error distribution, with a standard deviation of approximately 1.35. Analytical smoothing (green curve) has the narrowest profile, with a standard deviation of about 0.89, confirming the highest accuracy of this method. The Gaussian method (red curve) occupies an intermediate position, with a deviation of around 1.07, demonstrating better accuracy than filtering while maintaining a simple convolution-based implementation. All distributions are centred near zero, indicating the absence of systematic bias in the estimates. Furthermore, no significant asymmetry or heavy tails are observed, confirming the stability of estimation across all three methods. Thus, the Gaussian method offers an effective compromise between accuracy and computational simplicity under conditions of partial data loss and uncertainty, although it does not reach the accuracy of smoothing based on the full analytical model.

Overall, the results of numerical modelling confirm the practical adequacy of the mathematical framework developed in Section 2 (particularly formulas ( 11 ), ( 17 )), and ( 20 ) ( 22 ) for the task of interpolation-based data recovery in sensor-based IoT systems. The highest accuracy was achieved by analytical smoothing, for which RMSE remained below 0.9 even with 50% data loss (Fig. 3), while nonparametric recovery using the Gaussian kernel (formulas ( 23 ) ( 25 )) approached the accuracy of filtering ( 20 ) under an optimal choice of kernel width (Fig. 4).

All methods were tested under conditions simulating real-world scenarios of loss and noise in IoT environments specifically, random zeroing of observations and the addition of additive noise with variance  2 = 2 . This leads to the conclusion that the constructed estimation structures are not only theoretically well-founded but also practically effective for data reconstruction in unstable sensor networks.

4. Conclusions

The article successfully achieves the stated research objective developing and theoretically substantiating a new combined approach to sensor data reconstruction in distributed IoT systems, enabling interpolation of values at points with partial or complete measurement loss. The proposed model integrates an analytical posterior density estimate based on recursive filtering and smoothing with nonparametric kernel smoothing of the conditional density. Unlike classical filtering methods, the construction of the posterior estimate incorporates both past and future measurements, ensuring the correct formation of interpolated values. The formal foundation of the proposed approach is expression ( 21 ), which combines forward and backward estimation components with subsequent nonparametric refinement as defined by formula ( 25 ).

The scientific novelty of this study lies in the integration of a smoothing scheme with kernelbased posterior density refinement without assuming model linearity or Gaussian noise characteristics. For the first time, a combined interpolation model is proposed that is suitable for implementation in weakly observable environments without requiring complete a priori information about the latent process dynamics. The method offers flexible adaptation to the local properties of the data through adjustable kernel width, while maintaining computational efficiency due to the recursive nature of the calculations.

Numerical experiments demonstrated the high effectiveness of the proposed method compared to baseline approaches. Specifically, under conditions of 40% observation loss and a signal-to-noise and a 19% reduction relative to nonparametric smoothing based on the Gaussian kernel. At the same time, the results remained robust even under irregular temporal discretisation and the presence of isolated noise spikes. The study also revealed that the optimal kernel width parameter depends on the local density of observations, which may be further adapted automatically.

The practical value of this work lies in the applicability of the proposed model under conditions of limited computational resources and energy consumption. The method does not require a pretraining phase, making it suitable for a wide range of distributed IoT scenarios, including indoor microclimate monitoring systems, agricultural control, urban air quality monitoring, and thermal analysis of smart buildings. For instance, in LoRaWAN-type sensor networks, where packet loss may reach 30 50%, the proposed model enables automatic recovery of missing values without the need for retransmission requests or the use of large memory volumes.

Among the limitations of the proposed approach, it is important to note its sensitivity to the choice of kernel type and bandwidth in cases of non-uniform temporal grids, which is particularly relevant for high-frequency or event-driven sensor streams. Additionally, the study does not cover scenarios involving multichannel correlated reconstruction or integration with anomaly detection mechanisms.

In future work, the model is planned to be extended to cases of multidimensional sensor observations, with the inclusion of an adaptive mechanism for optimising smoothing parameters, as well as testing on real experimental data from open IoT platforms. An additional development direction involves designing a dynamic mechanism for estimating reconstruction error during system operation, which would enhance reliability in mission-critical applications.

Acknowledgements

The authors are grateful to all colleagues and institutions that contributed to the research and made it possible to publish its results.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

[1]

Zasornova ,

Fedula ,

Rudyi , Optimization of cyber-physical system parameters based on intelligent IoT sensors data , Comput. Syst. Inf. Technol . 2 ( 2024 ) 53 58 . DOI: https://doi.org/10.31891/csit-2024 -2-7

[2]

Kovtun ,

Grochla ,

Aldosary ,

Al-Maitah , Analysis of direct traffic at the transport protocol level in the WiMaxR . Soc. Open Sci . 11 ( 2024 ) 240206 . DOI: https://doi.org/10.1098/rsos.240206

[3]

Kovtun ,

Grochla ,

Al-Maitah ,

Aldosary ,

Kozachko , Entropy-extreme concept of data gaps filling in a small-sized collection , Egypt. Inform. J . 29 ( 2025 ) 100621 . DOI: https://doi.org/10.1016/j.eij. 2025 .100621

[4]

Naseer ,

Shmoon ,

Shakeel ,

S. Ur

Rehman ,

Ahmad ,

Gruhn ,

A Systematic

Literature Review of the IoT in Agriculture Global Adoption , Innovations, Security, and Privacy Challenges, IEEE Access 12 ( 2024 ) 60986 61021 . DOI: https://doi.org/10.1109/access. 2024 .3394617

[5]

Li ,

Wang ,

Wang , Y. Zhang, A review of IoT applications in healthcare , Neurocomputing 565 ( 2024 ) 127017 . DOI: https://doi.org/10.1016/j.neucom. 2023 .127017

[6]

Kouchaki ,

Ding , S. Sanei, AI- and IoT-Enabled Solutions for Healthcare, Sensors 24 ( 2024 ) Art. 2607. DOI: https://doi.org/10.3390/s24082607

[7]

Peter ,

Pradhan ,

Mbohwa , Industrial internet of things (IIoT): opportunities, challenges, and requirements in manufacturing businesses in emerging economies , Procedia Comput. Sci . 217 ( 2023 ) 856 865 . DOI: https://doi.org/10.1016/j.procs. 2022 . 12 .282

[8]

Mu ,

M.F.

Antwi-Afari , The applications of Internet of Things (IoT) in industrial management: a science mapping review , Int. J. Prod. Res . 62 ( 2023 ) 1928 1952. DOI: https://doi.org/10.1080/00207543. 2023 .2290229

[9]

Kovtun ,

Grochla ,

Al-Maitah ,

Aldosary ,

Kozachko , Entropy-extreme concept of data gaps filling in a small-sized collection , Egypt. Inform. J . 29 ( 2025 ) 100621 . DOI: https://doi.org/10.1016/j.eij. 2025 .100621

[10]

Villez ,

Aguado ,

Alferes ,

Plana ,

M.V.

Ruano , O. Samuelsson (Eds.), Metadata Collection and Organization in Wastewater Treatment and Wastewater Resource Recovery Systems , IWA Publ. ( 2024 ). DOI: https://doi.org/10.2166/9781789061154

[11]

Sun ,

Jiao ,

Hu ,

Wang ,

Xia ,

Ding , L. Yi, Missing Structural Health Monitoring Data Recovery Based on Bayesian Matrix Factorization, Sustainability 15 ( 2023 ) 2951 . DOI: https://doi.org/10.3390/su15042951

[12]

Kovtun ,

Altameem ,

Al-Maitah ,

Kempa , Entropy-metric estimation of the small data models with stochastic parameters , Heliyon 10 ( 2024 ) e24708 . DOI: https://doi.org/10.1016/j.heliyon. 2024 .e24708

[13]

N.A.

Khan ,

N.R.

Butt ,

Jakobsson , Iterative missing data recovery algorithm for nonstationary signals , Signal Image Video Process . 16 ( 2022 ) 1731 1738 . DOI: https://doi.org/10.1007/s11760-021-02128-5

[14]

Li ,

Wang ,

Lei ,

Si ,

Yang ,

Li , A nonparametric degradation modeling method for remaining useful life prediction with fragment data , Reliab. Eng. Syst. Saf . 249 ( 2024 ) 110224 . DOI: https://doi.org/10.1016/j.ress. 2024 .110224

[15] H. Cheng,

Shi ,

Wu ,

Guo ,

Xiong , An intelligent scheme for big data recovery in Internet of Things based on Multi-Attribute assistance and Extremely randomized trees , Inf. Sci . 557 ( 2021 ) 66 83 . DOI: https://doi.org/10.1016/j.ins. 2020 . 12 .041

[16]

Liu ,

Zhang ,

Wang ,

Dev ,

S.A.

Khowaja , Flexible Data Integrity Checking With Original Data Recovery in IoT-Enabled Maritime Transportation Systems , IEEE Trans. Intell. Transp. Syst . ( 2021 ) 1 12 . DOI: https://doi.org/10.1109/tits. 2021 .3125070

[17]

Zheng ,

Yang ,

Shang ,

Luo , A PCA-LSTM-Based Method for Fault Diagnosis and Data Recovery of Dry-Type Transformer Temperature Monitoring Sensor , Appl. Sci. 12 ( 2022 ) 5624 . DOI: https://doi.org/10.3390/app12115624

[18]

Li ,

Zhang , R. Liu,

Liu , A hybrid driven framework for power data recovery , J. Phys.: Conf. Ser . 2853 ( 2024 ) 012008 . DOI: https://doi.org/10.1088/ 1742 -6596/2853/1/012008

[19]

E.K.

Win ,

Yoshihisa , Prediction-based Churn Resilient Hybrid Sensor Data Recovery Scheme , Proc. IEEE GCCE ( 2021 ) 448 449 . DOI: https://doi.org/10.1109/gcce53005. 2021 .9622015

[20]

Wang ,

Pei ,

Zhang , A Data-Driven Heuristic Method for Irregular Flight Recovery , Mathematics 11 ( 2023 ) 2577 . DOI: https://doi.org/10.3390/math11112577

[21]

Tajik ,

Barker ,

A.D.

González ,

Ermagun , A heuristic approach to an interdependent restoration planning and crew routing problem , Comput. Ind. Eng . 161 ( 2021 ) 107626 . DOI: https://doi.org/10.1016/j.cie. 2021 .107626

[22] Sci. Technol . Int. J. 24 ( 2021 ) 319 330 . DOI: https://doi.org/10.1016/j.jestch. 2020 . 09 .006

[23]

Bonofiglio ,

Schumacher ,

Binder , Recovery of original individual person data (IPD) inferences from empirical IPD summaries only: Applications to distributed computing under disclosure constraints , Statistics in Medicine, 39 ( 2020 ) 1183 1198 . DOI: https://doi.org/10.1002/sim.8470

[24]

Shen ,

Wu ,

Wang ,

Zou ,

Cui , W. Ni, RLDR: Reinforcement Learning-Based Fast Data Recovery in Cloud-of-Clouds Storage Systems , IEEE Trans. Cloud Comput . ( 2025 ) 1 18 . DOI: https://doi.org/10.1109/tcc. 2025 .3546528

[25]

Yang ,

Zhang ,

Lu ,

Guo , H. Miao, Multi-agent deep reinforcement learning based decision support model for resilient community post-hazard recovery , Reliability Engineering and System Safety 242 ( 2024 ) 109754 . DOI: https://doi.org/10.1016/j.ress. 2023 .109754