<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>refinement⋆</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>37, Prospect Beresteiskyi, Kyiv, 03056</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State University of Information and Communication Technologies</institution>
          ,
          <addr-line>7 Solom'yanska str., Kyiv, 03110</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Viktoriia Zhebka</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Volodymyr Vynnychenko Central Ukrainian State University</institution>
          ,
          <addr-line>Shevchenko Street, 1, Kropyvnytskyi, 25000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Real-time Internet of Things sensor data is frequently corrupted by mixed noise (outliers, drift, constant values), degrading data quality. Conventional cleaning methods often lack adaptivity for such heterogeneous noise. This paper introduces URTCA, a novel real-time cleaning architecture for univariate time series. URTCA employs a machine learning classifier (Random Forest) on sliding window features to identify distinct noise types. This classification, refined by a rule-based check for low-variance states, drives an adaptive strategy module that selects appropriate cleaning operators (e.g., imputation, smoothing, passthrough). Experiments using simulated noise on real temperature data demonstrated URTCA's superior cleaning accuracy, achieving ~14% and ~34% lower Root Mean Squared Error (RMSE) compared to Rolling Median and Kalman Filter baselines, respectively. URTCA offers a robust, classification-driven adaptive solution for enhancing the reliability of real-time IoT data streams.</p>
      </abstract>
      <kwd-group>
        <kwd>data cleaning</kwd>
        <kwd>real-time</kwd>
        <kwd>internet of things</kwd>
        <kwd>noise classification</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The proliferation of the Internet of Things (IoT) has led to an exponential increase in connected
devices, especially within smart home and industrial contexts, generating vast streams of sensor data
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This data, typically univariate time series like temperature or energy usage, holds immense
potential but is often plagued by quality issues [18]. Raw sensor data streams are frequently
compromised by errors stemming from sensor malfunctions, environmental factors, or transmission
glitches [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Common data quality problems include missing values, isolated point outliers,
continuous errors like sensor drift or constant segments, and general background noise [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These
issues can significantly degrade the performance of downstream applications, making data quality
management crucial, particularly in domains like industrial IoT and smart grids.
      </p>
      <p>
        Many IoT applications require real-time processing, rendering traditional batch cleaning methods
insufficient [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The high velocity and continuous nature of IoT data demand online approaches
where cleaning occurs as data arrives. Furthermore, merely detecting errors or discarding data is
often inadequate; applications like forecasting or control systems typically require a complete and
corrected time series [
        <xref ref-type="bibr" rid="ref8">8, 9</xref>
        ]. Therefore, real-time data correction, replacing erroneous values with
plausible estimates, is a critical necessity [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Failure to adequately clean and correct data can
propagate errors, leading to flawed analyses and unreliable system behavior.
      </p>
      <p>
        ORCID: 0000-0003-4051-1190 (V. Zhebka); 0000-0001-9893-5709 (S. Shlianchak);
0000-0002-0531-9809 (S. Popereshnyak); 0009-0006-1781-4109 (D. Nishchemenko)
      </p>
      <p>
        This paper focuses on addressing these challenges through a specific methodology:
classification-driven adaptive correction for real-time cleaning of univariate IoT time series data. This approach
involves classifying detected noise or anomalies into distinct types (e.g., outlier, drift, constant value)
using machine learning techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and then dynamically selecting or parameterizing a
correction strategy best suited to the identified noise type [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This contrasts with simpler methods
like generic smoothing filters (e.g., EWMA), which can distort valid data patterns, or non-adaptive
methods that apply a uniform repair irrespective of the error's nature [9]. The underlying principle
is that different error types necessitate different, tailored correction techniques for optimal results
[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The central challenge is thus to accurately diagnose the specific error type in real-time and
apply an appropriate correction that removes the artifact while preserving genuine data patterns,
navigating the trade-off between minimal data modification and necessary correction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        To address this challenge, we propose URTCA, a novel unified real-time cleaning architecture.
Our method utilizes a machine learning classifier [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to identify multiple noise types within sliding
windows of sensor data and employs an adaptive strategy module, refined by rule-based logic, to
apply targeted cleaning actions. This paper details the architecture, implementation, and evaluation
of the proposed method, demonstrating its effectiveness in handling mixed noise types compared to
existing approaches.
      </p>
      <p>The remainder of this paper is structured as follows: Section 2 reviews the relevant literature on
real-time IoT data cleaning, focusing on noise classification and adaptive correction. Section 3
presents the architecture and components of the proposed URTCA system. Section 4 describes the
experimental setup, including data generation, comparison methods, and evaluation metrics. Section
5 presents and discusses the experimental results, including cleaning performance and classification
accuracy. Finally, Section 6 concludes the paper and outlines directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Research addressing data quality issues in real-time IoT streams has yielded various techniques
for anomaly detection, data imputation, and stream processing [9, 18]. This literature review
concentrates on the specific method of classification-driven adaptive cleaning applied to univariate
IoT time series.</p>
      <p>
        The core idea behind classifying noise before correction is that different error types (e.g., spikes,
drift, constant segments) have distinct characteristics and causes, warranting tailored repair
strategies [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Applying a universal correction method risks either ineffective repair or distortion
of valid data [9]. Therefore, identifying the specific noise type allows for a more targeted and
potentially more accurate correction that better preserves signal integrity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Machine learning models like Random Forests (RF), Decision Trees (DT), and Support Vector
Machines (SVM) are potential candidates for automating this classification task, given their ability
to learn complex patterns from features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, a review of the literature suggests that the
explicit use of such standard classifiers for multi-class noise type identification (distinguishing
outlier vs. drift vs. constant segment etc.) as a preliminary step to adaptive correction is not widely
documented [9]. Much existing work focuses on binary anomaly detection (normal vs. anomalous)
using techniques like LSTM, Autoencoders, statistical tests on model residuals, or distance-based
methods [
        <xref ref-type="bibr" rid="ref3">3, 16</xref>
        ], rather than categorizing the anomaly's nature. Systems combining ML with data
cleaning exist [10, 11], but specific details on multi-type noise classification are often limited.
      </p>
      <p>
        A key challenge for supervised noise classification is effective feature engineering [20]. Potential
features derived from sliding windows include statistical moments, trend indicators, constancy
measures, frequency domain information, or model-based residuals [15]. While some methods
implicitly use such features (e.g., IQR statistics, model residuals, decomposition components),
designing a feature set to robustly distinguish diverse noise types in real-time remains difficult.
Furthermore, the requirement for accurately labeled training data, where each point or segment is
tagged with a specific noise type, presents a major bottleneck [19], explaining the prevalence of
unsupervised or semi-supervised approaches in anomaly detection research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Adaptive correction aims to dynamically select or parameterize the cleaning action based on the
diagnosed noise type, moving beyond one-size-fits-all approaches [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Hypothetically, point outliers
could be handled by local imputation, drift might require detrending, constant segments might need
model-based interpolation, and high background noise could be addressed by adaptive smoothing
filters [12].
      </p>
      <p>
        While the surveyed literature shows examples of adaptivity, it often differs from the target
paradigm. Methods like Iterative Minimum Repairing (IMR) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and ARX/ANN cleaning algorithms
[15] adapt by iteratively refining a single underlying model based on previous repairs. Other
approaches adapt based on statistical triggers like IQR or skewness thresholds, or contextual
information from neighboring sensors [18]. Tools like cleanTS [10] automate sequential steps but
rely on user selection rather than dynamic classification-based adaptation. Systems like Cleanits [11]
suggest integrated strategies but lack detail on classification-driven adaptation. Adaptive smoothing
methods like ASPA [12] adjust parameters based on signal characteristics but typically target
background noise without prior multi-class classification. Thus, the explicit linkage where a
classifier's output label (e.g., "Drift") directly selects a specific correction algorithm (e.g.,
"Detrending") appears uncommon [9]. Rule-based logic and thresholds, sometimes learned
dynamically (e.g., using ELM for speed constraints) [17], offer a complementary way to incorporate
adaptivity and domain knowledge.
      </p>
      <p>
        Implementing sophisticated cleaning in real-time IoT environments faces constraints like
sequential data arrival, low latency demands, limited computational resources (especially on edge
devices), and high data volume [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ]. Distributed architectures, potentially using edge computing
for initial processing and dynamic task scheduling, are proposed to manage these constraints [
        <xref ref-type="bibr" rid="ref4">4, 18</xref>
        ].
Several online algorithms exist, including streaming outlier detectors (TsOutlier) [16], adaptive
smoothers (ASPA) [12], iterative methods with efficient updates (IMR) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], stream constraint
checkers (SCREEN) [13], rolling window statistics, and dynamic constraint predictors [17]. However,
efficient integration of the entire pipeline (feature extraction, ML classification, adaptive correction
selection, execution) in real-time remains a major challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], requiring careful design and
potential algorithm simplification or hardware acceleration.
      </p>
      <p>
        Practical deployment often requires integrated systems combining detection, correction, and
other functionalities [20]. Existing frameworks like Cleanits [11], IMR [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], ARX/ANN algorithms
[15], and cleanTS [10] each have different focuses (industrial integration, iterative repair,
automation). While valuable, these systems generally do not appear to fully implement the specific
paradigm of explicit multi-class noise classification directly driving the selection among diverse,
tailored correction algorithms in real-time [9].
      </p>
      <p>
        Evaluating real-time cleaning involves assessing correction accuracy (RMSE, MAE against ground
truth) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], impact on downstream tasks (e.g., forecasting accuracy) [15], and computational efficiency
(latency, throughput). Obtaining ground truth often requires using real-world data with synthetically
injected, labeled errors [20]. While studies report performance improvements using various
advanced methods [
        <xref ref-type="bibr" rid="ref8">8, 11, 15</xref>
        ], a major challenge is the lack of standardized benchmarks specifically
designed to evaluate classification-driven adaptive cleaning, including metrics for the classification
step itself and the benefit of adaptation [9].
      </p>
      <p>
        In summary, while the field addresses real-time IoT data cleaning with various ML and statistical
techniques [18], a specific gap exists concerning methods that explicitly classify multiple distinct
noise types using ML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and use this classification to dynamically select tailored correction
strategies in real-time [9]. Key research gaps include the development and validation of such
multi-class noise classifiers for time series, designing architectures that effectively link classification to
adaptive correction under real-time constraints, creating appropriate benchmarks, and advancing
real-time feature engineering. Addressing these gaps, potentially through hybrid approaches
combining ML and rules and leveraging edge computing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is crucial for advancing the state-of-the-art
in robust and reliable IoT data cleaning. This research aims to contribute to filling these gaps
by proposing and evaluating URTCA, a unified real-time cleaning architecture.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Architecture and components</title>
      <p>To address the challenges of cleaning mixed noise types in real-time univariate IoT sensor data
streams, we propose URTCA. Its core principle is classification-driven adaptive cleaning,
refined by rule-based logic to handle low-noise states effectively. The method processes data
sequentially, making decisions based on information available within a sliding window of recent data
points.</p>
      <p>The architecture consists of the following key components, operating iteratively as new data
points arrive (see Figure 1):</p>
      <p>Sliding Window Buffer: A fixed-size buffer (implemented using a deque for efficiency)
maintains the most recent W data points (raw or imputed). In our experiments, a window
size W=10 was used, corresponding to 10 minutes of data. This buffer provides the necessary
historical context for feature extraction and cleaning actions.</p>
      <p>Feature Extraction Module: For each new data point that fills the buffer, a set of statistical
features is calculated from the current window's data. These features aim to capture
characteristics indicative of different noise types. The features calculated include:
a. Basic moments: Mean, Variance, Standard Deviation.
b. Order statistics: Minimum, Maximum, Range.
c. Robust statistics: Interquartile Range (IQR), Median Absolute Deviation (MAD).
d. Relative variation: Coefficient of Variation (CV = Std Dev / Mean).
e. Change indicators: Mean Absolute Difference between consecutive points.
f. Missing data indicators: Count and Fraction of NaNs within the window. NaN values
within the window are handled during feature calculation to ensure robust feature
generation.</p>
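      <p>The buffering and feature-extraction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and key names are our own, and NaNs are simply masked out before computing the statistics, as the text describes.</p>
      <preformat>
```python
from collections import deque
import numpy as np

def window_features(window):
    """Compute the sliding-window statistics listed in the text, ignoring NaNs."""
    w = np.asarray(window, dtype=float)
    valid = w[~np.isnan(w)]
    if valid.size == 0:
        return None  # nothing usable in this window
    mean = valid.mean()
    std = valid.std()
    q75, q25 = np.percentile(valid, [75, 25])
    med = np.median(valid)
    mad = np.median(np.abs(valid - med))
    diffs = np.abs(np.diff(valid))
    return {
        "mean": mean, "var": valid.var(), "std": std,            # basic moments
        "min": valid.min(), "max": valid.max(),                  # order statistics
        "range": valid.max() - valid.min(),
        "iqr": q75 - q25, "mad": mad,                            # robust statistics
        "cv": std / mean if mean != 0 else 0.0,                  # relative variation
        "mean_abs_diff": diffs.mean() if diffs.size else 0.0,    # change indicator
        "nan_count": int(np.isnan(w).sum()),                     # missing-data indicators
        "nan_frac": float(np.isnan(w).mean()),
    }

# Fixed-size buffer as in the architecture (W = 10, i.e. 10 minutes at 1-min sampling)
buf = deque(maxlen=10)
for x in [20.1, 20.2, np.nan, 20.0, 19.9, 20.3, 20.1, 20.2, 20.0, 20.1]:
    buf.append(x)
feats = window_features(buf)
```
      </preformat>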
      <p>Noise Classifier: A machine learning classifier, specifically a Random Forest model in our
implementation, is employed to predict the dominant noise type present within the current
window based on the extracted features. The classifier is trained offline using labeled data
from a separate training set containing synthetically generated noise examples. In our final
configuration, the classifier was trained to distinguish between four primary noise categories
relevant to the cleaning task: Outlier, ConstantValue, Drift, and GeneralNoise. It does not
explicitly predict a 'Clean' state.</p>
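      <p>Offline training of such a classifier can be sketched as below. The feature matrix here is random stand-in data (in the real pipeline each row would be a window-feature vector from the labeled training stream), and the hyperparameters are illustrative rather than the paper's configuration.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))        # stand-in: 12 window features per sample
labels = ["Outlier", "ConstantValue", "Drift", "GeneralNoise"]
y = rng.choice(labels, size=400)      # stand-in labels; note: no 'Clean' class

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)                         # offline training on the labeled set
pred = clf.predict(X[:5])             # one predicted noise label per window
```
      </preformat>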
      <p>Rule-Based Refinement Module: Recognizing that distinguishing true low-noise states from
low-amplitude 'GeneralNoise' can be challenging for the classifier, a rule-based check is
applied after the classification step. If the classifier predicts GeneralNoise, but the calculated
variance of the current window falls below a predefined threshold, the prediction is
overridden.</p>
      <p>Adaptive Strategy Selection &amp; Cleaning Operators: Based on the effective label (either the
classifier's prediction or the 'Clean' label assigned by the rule-based refinement), a specific
cleaning strategy is selected and applied to the current data point:
a. Clean (Effective Label): The original data point is passed through without
modification. This occurs if the rule-based refinement identifies a low-variance state.
b. Outlier: A robust Z-score (calculated using the window median and MAD) is checked for
the current point. If it exceeds a threshold (e.g., 3.5), the point is considered an outlier
and replaced by the window's median value; otherwise, it is passed through.
c. ConstantValue: The point is replaced by the median of the last few points (e.g., 3) in
the buffer, assuming recent valid points are more reliable than the potentially
anomalous constant value.
d. Drift: The point is replaced by the mean of a slightly larger number of recent points
(e.g., 5) in the buffer, providing smoothing to counteract the drift trend.
e. GeneralNoise: (Applied only if variance &gt; threshold.) The point is replaced by the
mean of the last few points (e.g., 3) in the buffer, providing gentle smoothing.
f. Missing Value (NaN Handling): If the incoming point itself is NaN, it is imputed using
the median of the recently cleaned values stored by the cleaner, ensuring that the
imputation relies on the best available estimates of the recent true signal level.</p>
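      <p>Putting the rule-based refinement and the per-label operators together, one possible sketch follows. The variance threshold, the MAD floor, and the 0.6745 factor (the usual consistency constant for MAD-based robust Z-scores) are our own hedged choices; only the 3.5 cutoff and the window sizes come from the text.</p>
      <preformat>
```python
import numpy as np

VAR_THRESHOLD = 1e-3   # illustrative low-variance cutoff for the rule-based override
Z_THRESHOLD = 3.5      # robust Z-score cutoff from the text

def clean_point(x, window, label, recent_cleaned):
    """Apply the cleaning strategy matching the effective label (sketch)."""
    w = np.asarray([v for v in window if not np.isnan(v)], dtype=float)
    if np.isnan(x):                        # missing value: impute from cleaned history
        return float(np.median(recent_cleaned))
    if label == "GeneralNoise" and VAR_THRESHOLD > w.var():
        label = "Clean"                    # rule-based refinement override
    if label == "Clean":
        return x                           # pass through unchanged
    if label == "Outlier":
        med = np.median(w)
        mad = np.median(np.abs(w - med)) or 1e-9
        z = 0.6745 * abs(x - med) / mad    # robust Z-score via median and MAD
        return float(med) if z > Z_THRESHOLD else x
    if label == "ConstantValue":
        return float(np.median(w[-3:]))    # median of the last few points
    if label == "Drift":
        return float(np.mean(w[-5:]))      # mean of a larger recent window
    if label == "GeneralNoise":
        return float(np.mean(w[-3:]))      # gentle smoothing
    return x
```
      </preformat>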
      <p>This sequence of buffering, feature extraction, classification, rule-based refinement, and adaptive
strategy application allows the proposed method to dynamically adjust its cleaning approach in
real-time based on the diagnosed characteristics of the incoming data stream.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental setup</title>
      <p>To evaluate the performance of the proposed URTCA architecture, we conducted experiments using
a real-world dataset with synthetically injected noise, comparing our method against relevant
baselines under simulated real-time conditions.</p>
      <sec id="sec-4-1">
        <title>4.1. Data and preprocessing</title>
        <p>The base dataset originates from sensors deployed in a single home located in northeastern Mexico,
collected over 14 months (November 5, 2022, to January 5, 2024) at one-minute intervals. The full
dataset comprises 605,260 samples across 19 variables related to energy consumption and weather
conditions. For this study, we focused on a single representative univariate time series: indoor
temperature (temp). A continuous one-week segment (10,080 data points) from January 1, 2023,
00:00:00 to January 7, 2023, 23:59:00 was extracted from the original dataset to serve as the clean
"ground truth" data for our experiments. This segment exhibited typical diurnal temperature
variations.</p>
        <p>To create a realistic test set with known error types and ground truth, we injected a controlled
mixture of synthetic noise onto the clean ground truth data using the following procedure (with a
fixed random seed for reproducibility):
1. Point Outliers: Approximately 3% of randomly selected points were designated as outliers.</p>
        <p>Noise drawn from a normal distribution scaled by 5 times the ground truth standard
deviation was added to these points.
2. Constant Value Segments: 25 segments with random durations between 20 and 50 minutes
were selected, and the values within these segments were replaced by the constant value
observed at the start of each segment.
3. Drift Segments: 20 segments with random durations between 90 and 200 minutes were
selected. A linear drift (with a maximum slope factor relative to the overall standard
deviation) was added to the original values within these segments.
4. General Noise: Gaussian noise with a standard deviation equal to 5% of the ground truth
standard deviation was added to a fraction (60%) of the remaining clean points. Points not
affected by specific noise types or this step remained labeled 'Clean'.
5. Missing Values: Approximately 5% of the data points across the entire series were randomly
selected and replaced with NaN values.</p>
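        <p>The injection procedure above can be reproduced in outline as follows. A sinusoidal stand-in replaces the temperature segment, segment placements may overlap (later steps overwrite earlier labels, as any random scheme allows), and the drift-slope scaling is our reading of "maximum slope factor relative to the overall standard deviation".</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(42)        # fixed seed for reproducibility
clean = 20 + 2 * np.sin(np.linspace(0, 14 * np.pi, 10080))  # stand-in signal
noisy = clean.copy()
labels = np.array(["Clean"] * clean.size, dtype=object)
sd = clean.std()

# 1. Point outliers on ~3% of points: additive noise scaled by 5*sigma
idx = rng.choice(clean.size, size=int(0.03 * clean.size), replace=False)
noisy[idx] += rng.normal(scale=5 * sd, size=idx.size)
labels[idx] = "Outlier"

# 2. Constant-value segments: 25 segments of 20-50 minutes
for _ in range(25):
    start = rng.integers(0, clean.size - 50)
    length = rng.integers(20, 51)
    noisy[start:start + length] = noisy[start]
    labels[start:start + length] = "ConstantValue"

# 3. Drift segments: 20 segments of 90-200 minutes with a linear ramp
for _ in range(20):
    start = rng.integers(0, clean.size - 200)
    length = rng.integers(90, 201)
    slope = rng.uniform(-0.5, 0.5) * sd / length   # slope capped relative to sigma
    noisy[start:start + length] += slope * np.arange(length)
    labels[start:start + length] = "Drift"

# 4. General noise on 60% of remaining clean points (sigma = 5% of ground truth)
clean_idx = np.where(labels == "Clean")[0]
pick = rng.choice(clean_idx, size=int(0.6 * clean_idx.size), replace=False)
noisy[pick] += rng.normal(scale=0.05 * sd, size=pick.size)
labels[pick] = "GeneralNoise"

# 5. ~5% missing values overlaid across the whole series
nan_idx = rng.choice(clean.size, size=int(0.05 * clean.size), replace=False)
noisy[nan_idx] = np.nan
```
        </preformat>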
        <p>The approximate distribution of underlying noise types (before NaN overlay) in the final noisy
dataset was: GeneralNoise (~36%), Drift (~29%), Clean (~24%), ConstantValue (~8%), Outlier (~2%).</p>
        <p>The ground truth and corresponding noisy datasets were split chronologically into training (first
70%, 7056 points) and testing (last 30%, 3024 points) sets. The Random Forest classifier for the
proposed method was trained only on features extracted from the training set. All performance
evaluations were conducted only on the unseen testing set.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparison methods</title>
        <p>The performance of the proposed URTCA method was compared against the following baselines, all
operating under the same real-time simulation constraints.</p>
        <p>The raw noisy test dataset serves as a baseline that quantifies the overall noise level, i.e., the
error incurred when no cleaning is applied.</p>
        <p>Rolling Median is a common non-adaptive filtering technique. It replaces each point with the
median value calculated over a sliding window of the preceding W noisy data points. The window
size W was set to 10, matching the window size used by the proposed method.</p>
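        <p>A causal rolling-median baseline of this kind is only a few lines. Note that this sketch includes the current point in the window, which is one common reading of "the preceding W points"; the original implementation may have used a strictly past-only window.</p>
        <preformat>
```python
import numpy as np

def rolling_median(stream, W=10):
    """Causal rolling-median filter: each output uses only past/present points."""
    out = []
    for i in range(len(stream)):
        window = [v for v in stream[max(0, i - W + 1): i + 1] if not np.isnan(v)]
        out.append(np.median(window) if window else np.nan)
    return np.array(out)

x = [1.0, 1.1, 50.0, 1.2, 1.0, 1.1]    # isolated spike at index 2
y = rolling_median(x, W=3)             # the spike is suppressed by the median
```
        </preformat>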
        <p>Kalman Filter (Local Level) is a standard model-based filtering technique suitable for real-time
processing. We implemented a simple local level model assuming the true temperature follows a
random walk with Gaussian process noise (variance Q) and is observed with Gaussian measurement
noise (variance R). The observation noise variance R was estimated from the standard deviation of
the injected Gaussian noise (R ≈ 0.025). The process noise variance Q was set heuristically to a small
value (Q = 1e-4) assuming relatively smooth temperature changes minute-to-minute. The filter
recursively predicts the state and updates it based on the current observation.</p>
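        <p>A minimal local-level Kalman filter matching the described configuration (Q = 1e-4, R &#8776; 0.025) can be sketched as follows; initializing the state from the first observation and skipping updates on missing observations are our assumptions.</p>
        <preformat>
```python
import math

def kalman_local_level(stream, Q=1e-4, R=0.025):
    """Local-level Kalman filter: random-walk state observed with Gaussian noise."""
    x, P = None, 1.0
    out = []
    for z in stream:
        if x is None:
            x = z                      # initialize state from the first observation
        P = P + Q                      # predict: uncertainty grows by process noise Q
        if not math.isnan(z):          # update only on non-missing observations
            K = P / (P + R)            # Kalman gain
            x = x + K * (z - x)        # move the state estimate toward the observation
            P = (1 - K) * P
        out.append(x)
    return out

est = kalman_local_level([20.0, 20.1, 25.0, 20.2, 20.1])  # spike at index 2
```
        </preformat>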
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation metrics</title>
        <p>The performance of all cleaning methods was evaluated using the following metrics on the test set.</p>
        <p>Root Mean Squared Error (RMSE) measures the square root of the average squared difference
between the cleaned/estimated values and the ground truth values. This metric is sensitive to large
errors.</p>
        <p>Mean Absolute Error (MAE) measures the average absolute difference between the
cleaned/estimated values and the ground truth values. Unlike RMSE, it is less sensitive to outliers.</p>
        <p>The percentage reduction in RMSE and MAE achieved by the proposed method relative to each
baseline was calculated as (baseline metric &#8722; URTCA metric) / baseline metric &#215; 100.</p>
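        <p>The metrics and the improvement formula in this subsection are straightforward to express; the example below plugs in the reported Rolling Median and URTCA RMSE values, so small rounding differences against Table 2 are expected.</p>
        <preformat>
```python
import numpy as np

def rmse(est, truth):
    e = np.asarray(est) - np.asarray(truth)
    return float(np.sqrt(np.mean(e ** 2)))    # sensitive to large errors

def mae(est, truth):
    e = np.asarray(est) - np.asarray(truth)
    return float(np.mean(np.abs(e)))          # less sensitive to outliers

def pct_improvement(baseline, urtca):
    """Percentage reduction relative to the baseline, as defined in the text."""
    return 100.0 * (baseline - urtca) / baseline

# Reported RMSE values: Rolling Median 1.003 vs URTCA 0.862
gain = pct_improvement(1.003, 0.862)          # ~14%, consistent with Table 2
```
        </preformat>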
        <p>To understand the behavior of the internal classification mechanism, we compared the effective
label assigned by the URTCA method at each step (e.g., 'Outlier', 'Drift', 'Clean_Rule' mapped to
'Clean', etc.) against the true underlying noise type injected into the test data. Metrics included:
Overall Accuracy, Precision, Recall, F1-score (per class, weighted, and macro averages), Confusion
Matrix.</p>
        <p>The average processing time per data point (in milliseconds) was measured for each method
during the simulation run to compare their computational efficiency.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental results</title>
      <p>This section presents the results of the real-time cleaning simulation conducted on the synthetic test
dataset, comparing the proposed URTCA against the baseline methods: Rolling Median, Kalman
Filter, and the original noisy data (Noisy).</p>
      <sec id="sec-5-1">
        <title>5.1. Cleaning Performance Comparison</title>
        <p>The primary evaluation focused on the accuracy of data cleaning, measured by Root Mean Squared
Error (RMSE) and Mean Absolute Error (MAE) against the ground truth test data. Table 1 summarizes
the performance metrics for all evaluated methods.</p>
        <p>As shown in Table 1 and visualized in Figure 3, the proposed URTCA method achieved the lowest
RMSE (0.862) and MAE (0.280), indicating the highest overall cleaning accuracy among the tested
methods.</p>
        <p>It significantly outperformed the simple Rolling Median filter (RMSE 1.003, MAE 0.365) and the
model-based Kalman Filter (RMSE 1.302, MAE 0.842). Both URTCA and Rolling Median provided
substantial improvements over the Noisy baseline (RMSE 3.209, MAE 0.575), while the implemented
Kalman Filter, although better than Noisy in terms of RMSE, struggled more with the mixed noise
types present in the data.</p>
        <p>To quantify the advantage of the proposed method, the percentage improvement in RMSE and
MAE was calculated relative to the baseline methods. Table 2 presents these improvements.</p>
        <sec id="sec-5-1-1">
          <title>Improvement of URTCA over baseline methods</title>
          <table-wrap id="tbl2">
            <table>
              <thead>
                <tr><th>URTCA over</th><th>RMSE</th><th>MAE</th></tr>
              </thead>
              <tbody>
                <tr><td>Rolling Median</td><td>14.04%</td><td>23.44%</td></tr>
                <tr><td>Kalman Filter</td><td>33.74%</td><td>66.79%</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>URTCA demonstrated a notable ~14% reduction in RMSE and ~23% reduction in MAE compared
to the Rolling Median. The improvement over the Kalman Filter was even more pronounced, with
reductions of ~34% in RMSE and ~67% in MAE. These results highlight the benefit of the adaptive,
classification-driven approach over both simple filtering and a standard model-based filter when
dealing with heterogeneous noise types.</p>
          <p>Figure 4 provides a qualitative comparison, illustrating how the different methods handled a
segment of the noisy test data compared to the ground truth. Visual inspection suggests that URTCA
effectively smooths general noise, corrects outliers, and adapts to shifts or constant segments more
accurately than the Rolling Median or Kalman Filter, which tend to either oversmooth or fail to
adequately correct certain anomaly types.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Internal Classification Performance</title>
        <p>While the primary goal is cleaning accuracy, understanding the performance of the internal noise
classification mechanism within URTCA provides valuable insights. We evaluated the accuracy of
the effective label assigned by the method (after the rule-based refinement) compared to the true
underlying noise type injected into the test data.</p>
        <p>The overall accuracy of assigning the correct effective label was approximately 59.7%. Figure 4
visualizes the confusion matrix.</p>
        <p>The classification results indicate effective identification of Drift (F1=0.880) and ConstantValue
(F1=0.780) segments. The method also demonstrates high recall (0.838) for Outlier detection, meaning
it successfully flags most true outliers. However, the precision for Outlier (0.100) and Clean (0.204)
is low, indicating that many segments identified as these types were actually instances of other noise,
primarily GeneralNoise. GeneralNoise itself had relatively low recall (0.532), often being misclassified
as Outlier or Clean (via the rule-based override).</p>
        <p>Despite these imperfections in the underlying classification, particularly the difficulty in
distinguishing low-level GeneralNoise from truly Clean segments or certain Outliers, the overall
URTCA architecture achieved superior cleaning performance (Table 1, Figure 3). This suggests that
the combination of the adaptive strategy selection (applying tailored cleaning for correctly identified
Drift, ConstantValue, and high-confidence Outliers) and the rule-based handling of low-variance
states effectively compensates for the classification ambiguities, leading to robust and accurate data
correction.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Computational Cost</title>
        <p>The average processing time per data point was measured during the simulation. URTCA required
approximately 32.6 ms per point, primarily due to feature extraction and classifier inference. This is
significantly higher than the Rolling Median (~0.06 ms) and the Kalman Filter implementation (~1.3
ms in our test, though this can vary), but remains feasible for many real-time smart home
applications involving data sampled at intervals of seconds or minutes, such as the 1-minute data
used here.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This paper addressed the critical challenge of cleaning univariate time series data from IoT sensors
in real-time, particularly when faced with a mixture of different noise and anomaly types. Traditional
methods often struggle in such scenarios, either by applying overly simplistic filtering or relying on
model assumptions violated by diverse error patterns. We proposed URTCA, a novel architecture
based on classification-driven adaptive cleaning, refined by rule-based logic.</p>
      <p>Our experimental results, based on simulations using real-world temperature data corrupted with
synthetic mixed noise, demonstrate the effectiveness of the proposed approach. URTCA significantly
outperformed both a standard filtering technique (Rolling Median) and a common model-based
approach (Kalman Filter) in terms of cleaning accuracy, achieving approximately 14% and 34% lower
RMSE, respectively. This result underscores the primary contribution of this work: an adaptive
strategy, guided by machine-learning classification of noise types and augmented with rules for
handling low-noise states, cleans data more accurately and robustly in real time than non-adaptive or
simpler model-based methods when faced with the complex, heterogeneous noise profiles typical of
IoT sensor data. The analysis of the internal classification mechanism revealed that, while
distinguishing certain noise types remains challenging, the overall adaptive architecture compensates
effectively, yielding high-quality cleaned data suitable for reliable downstream applications.</p>
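        <p>For clarity, the relative improvements quoted above follow the standard formulas below; the RMSE values in the example are made-up placeholders, not the paper's measurements:</p>

```python
# How a relative RMSE reduction is computed (illustrative placeholder values).
import math

def rmse(y_true, y_pred):
    """Root mean squared error between a ground-truth and a cleaned series."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_reduction(rmse_ours, rmse_baseline):
    """Fractional RMSE reduction of one method versus a baseline."""
    return 1.0 - rmse_ours / rmse_baseline

# Hypothetical example: 0.43 vs. 0.50 corresponds to a 14% lower RMSE.
print(f"{relative_reduction(0.43, 0.50):.0%}")
```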
      <p>The significance of this work lies in providing a practical and demonstrably effective real-time
cleaning methodology tailored to the complexities of IoT sensor data. By moving beyond generic
filtering or detection, URTCA enables more nuanced data correction, which is crucial for improving
the reliability of analytics, control systems, and decision-making processes built upon sensor streams.
This contributes to unlocking the full potential of IoT data in various domains.</p>
      <p>However, this study has limitations. The evaluation relied on synthetically generated noise;
further validation on diverse real-world datasets with ground truth or well-characterized errors is
necessary. The internal noise classifier, while effective within the overall system, showed limitations
in discriminating certain noise types, suggesting room for improvement through advanced feature
engineering or alternative classification models. The current implementation is univariate, and
extending the architecture to handle multivariate time series, considering inter-channel correlations,
remains an important future step. Furthermore, the method's parameters (window size, thresholds)
were set heuristically and may require tuning for different datasets or noise characteristics. Lastly,
while the computational cost is feasible for minute-level data, it needs consideration for higher-
frequency streams or severely resource-constrained edge devices.</p>
      <p>Future work should focus on enhancing the noise classification module, potentially exploring
semi-supervised learning or more sophisticated feature extraction techniques. Extending the
framework to multivariate data cleaning is a key direction. Investigating methods for automatic
parameter optimization and exploring model compression or hardware acceleration techniques to
reduce computational overhead for edge deployment would also be valuable. Finally, applying and
evaluating the method in specific real-world IoT application domains (e.g., energy forecasting,
industrial predictive maintenance) will further validate its practical utility.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[9] X. Wang, C. Wang, Time Series Data Cleaning: A Survey, IEEE Access 8 (2020) 1866–1881.</p>
        <p>doi:10.1109/access.2019.2962152.
[10] M. K. Shende, A. E. Feijóo-Lorenzo, N. D. Bokde, cleanTS: Automated (AutoML) Tool to Clean
Univariate Time Series at Microscales, Neurocomputing (2022).
doi:10.1016/j.neucom.2022.05.057.
[11] X. Ding, H. Wang, J. Su, Z. Li, J. Li, H. Gao, Cleanits: a data cleaning system for industrial time
series, Proc. VLDB Endow. 12.12 (2019) 1786–1789. doi:10.14778/3352063.3352066.
[12] Rong K., Bailis P. ASPA: Adaptive Smoothing for Streaming Time Series, Proc. VLDB Endow.</p>
        <p>10.11 (2017) 1358–1369. doi:10.14778/3137628.3137645.
[13] Song, S., Zhang, A., Wang, J., &amp; Yu, P.S., SCREEN: Stream Data Cleaning under Speed
Constraints, Proceedings of the 2015 ACM SIGMOD International Conference on Management
of Data.
[14] R. Ahmad, E. H. Alkhammash, Online Adaptive Kalman Filtering for Real-Time Anomaly</p>
        <p>Detection in Wireless Sensor Networks, Sensors 24.15 (2024) 5046. doi:10.3390/s24155046.
[15] H. N. Akouemo, R. J. Povinelli, Data Improving in Time Series Using ARX and ANN Models,</p>
        <p>IEEE Trans. Power Syst. 32.5 (2017) 3352–3359. doi:10.1109/tpwrs.2017.2656939.
[16] Yogita, D. Toshniwal, A Framework for Outlier Detection in Evolving Data Streams by
Weighting Attributes in Clustering, Procedia Technol. 6 (2012) 214–222.
doi:10.1016/j.protcy.2012.10.026.
[17] Yin, M., and Yue, K. Time Series Data Cleaning Method Based on Optimized ELM Prediction.</p>
        <p>JIPS, vol. 13, no. 2, 2017, pp. 432-445. doi:10.3745/JIPS.04.0268.
[18] A. Karkouch, H. Mousannif, H. Al Moatassime, T. Noel, Data quality in internet of things: A
state-of-the-art survey, J. Netw. Comput. Appl. 73 (2016) 57–81. doi:10.1016/j.jnca.2016.08.002.
[19] B. Zhu, C. He, P. Liatsis, A robust missing value imputation method for noisy data, Appl. Intell.</p>
        <p>36.1 (2010) 61–74. doi:10.1007/s10489-010-0244-1.
[20] An, N., Ding, Y., and Zhao, H. Statistical feature analysis and preprocessing assisted artificial
neural network for cleaning multi-type concurrent anomalies in time series data. Proc. ISCAS,
2024. doi: 10.1109/ISAS61044.2024.10552599</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>