<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Assurance of Predictive Models in Intelligent Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grega Vrbančič</string-name>
          <email>grega.vrbancic@um.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zala Lahovnik</string-name>
          <email>zala.lahovnik1@um.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vili Podgorelec</string-name>
          <email>vili.podgorelec@um.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Maribor, Slovenia</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Electrical Engineering and Computer Science, University of Maribor</institution>
          ,
          <addr-line>Koroška cesta 46, 2000 Maribor</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predictive models are becoming an integral part of modern information systems, transforming them into intelligent systems, featuring automated, data-driven decision-making across various domains. However, the reliance of intelligent systems on the evolving data streams and advanced machine learning algorithms introduces unique quality assurance (QA) challenges. This paper provides insight into systematic approaches for ensuring the reliability and robustness of predictive models, emphasizing the role of Machine Learning Operations practices. It addresses critical QA aspects such as data validation, drift detection, and model testing methodologies. A detailed case study of the Airly platform, a solution for air quality prediction, demonstrates practical implementations of these principles, including automated data pipelines, rigorous statistical validation, data version control, and continuous model monitoring. The study demonstrates the necessity of structured, automated, and integrated QA pipelines to maintain model performance throughout the intelligent system life cycle.</p>
      </abstract>
      <kwd-group>
        <kwd>Quality assurance</kwd>
        <kwd>Predictive models</kwd>
        <kwd>Intelligent systems</kwd>
        <kwd>Machine Learning Operations (MLOps)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Predictive models are becoming fundamental components of intelligent systems across diverse domains,
including healthcare, finance, environmental monitoring, and industrial automation. These models
facilitate automated, data-driven decision-making, thereby significantly enhancing the capabilities
and efficiencies of modern systems. Nevertheless, integrating predictive models within intelligent
systems brings forth distinct quality assurance challenges due to their inherent reliance on data-driven
processes rather than traditional software engineering practices. Specifically, any modification within
these systems, whether to the data, pre-processing methods, or model parameters, has the potential to
affect overall predictive performance and system stability, captured succinctly by the phrase "changing
anything changes everything" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Traditional software quality assurance methods, centered predominantly around static code testing
and deterministic outcomes, fall short in addressing the nuanced demands posed by predictive models.
Unlike conventional software components, predictive models continuously evolve as they consume new
and potentially shifting data distributions. Consequently, model performance can deteriorate silently
due to unobserved data drift, concept drift, or changes in underlying relationships between inputs
and outputs. Such subtle yet critical changes can lead to erroneous predictions without rigorous and
systematic quality assurance practices, undermining user trust and compromising system integrity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Addressing these challenges necessitates a systematic approach to managing the life cycle of predictive
models. Recent advancements in Machine Learning Operations (MLOps) propose comprehensive
strategies to ensure the quality, reproducibility, and robustness [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Central to MLOps is the use of
structured and automated pipelines that facilitate continuous data validation, systematic testing, and
rigorous performance monitoring. These pipelines are essential to detect and prevent model degradation
and manage changes in data, model architectures, and environmental configurations methodically.
      </p>
      <p>Within this context, data validation and testing play crucial roles. Effective validation pipelines
help ensure the integrity, consistency, and relevance of training and inference data, reducing risks
associated with data drift and anomalies. Testing strategies, including schema validation, statistical
distribution checks, and performance evaluations, provide early warnings of potential issues, thus
enabling proactive maintenance of predictive models.</p>
      <p>To illustrate these principles and methodologies in practice, this paper presents a detailed case study
based on the Airly platform, an MLOps-based solution developed for air quality prediction in Slovenia.
The Airly platform demonstrates a comprehensive application of systematic quality assurance practices,
including automated data pipelines, rigorous validation and statistical testing of datasets, structured
version control using DVC (Data Version Control), and continuous monitoring through model registries
and dashboards integrated via MLflow. This case study underscores the critical importance and practical
benefits of systematic, pipeline-oriented quality assurance approaches for predictive models embedded
within intelligent systems.</p>
      <p>The remainder of the paper is structured as follows. Section 2 presents the most common existing
methodologies and frameworks for quality assurance of predictive models, together with data validation
and data testing. In Section 3, the Airly platform is presented, outlining its features and incorporated
quality assurance practices. Finally, the conclusion synthesizes key findings and discusses implications for
broader practice.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Quality assurance of predictive models</title>
      <p>Quality assurance for machine learning predictive models spans the entire life cycle of intelligent
system development, from acquiring the data to model deployment. Ensuring high-quality data and
rigorously testing models are critical, as poor data or unexpected model behavior can nullify the benefits
of machine learning algorithms [4]. This section reviews four key components of QA in ML systems:
data validation, data testing (drift detection), testing of predictive models, and their integration into
MLOps pipelines.</p>
      <sec id="sec-2-1">
        <title>2.1. Data validation</title>
        <p>Reliable predictive modeling requires robust data validation processes. As emphasized by Polyzotis
et al. [4], data validation is crucial to detect schema mismatches, outliers, and anomalies before they
affect model training or predictions. Errors or inconsistencies in input data can compromise the
predictive performance of the model [5, 4]. Additionally, feedback loops in intelligent systems, where
model predictions are fed back as training data, can amplify even small data errors, causing gradual
model performance degradation over time. Therefore, catching data issues early, before they propagate
through training or inference, is crucial for reliable intelligent systems.</p>
        <p>In general, there are two approaches to data validation. The first relies on manually defined validation
rules or constraints – essentially “unit tests” for data. Researchers have proposed declarative
frameworks that allow engineers to write assertions about data (e.g., value ranges, missing value thresholds,
uniqueness), which are then automatically checked on datasets updated with new data. The second
approach is schema-driven validation. Here, a data schema defines the expected structure, types, and
sometimes statistical properties of the data. The schema acts as a contract for what “correct” data
should look like in terms of features, types, allowed domains, ranges, etc. Frameworks like TensorFlow
Data Validation (TFDV) and Great Expectations offer automated validation of data against inferred
or user-defined schemas [6]. These tools validate feature distributions, missing value rates, and type
consistency, and flag issues through validation reports. Such validations are typically performed before
training and prior to model inference. With the utilization of such tools, organizations can continuously
enforce data quality constraints, preventing bad data from ever reaching the model training phase. This
not only catches catastrophic errors (such as scrambled feature values or corrupted files) but also subtle
issues that could bias or degrade the model if left unchecked. In summary, rigorous data validation –
whether via manually crafted rules or schema-driven automation – is the first line of defense in ML
quality assurance, ensuring that models train on clean, trustworthy data.</p>
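        <p>As a minimal illustration of the first, rule-based approach, the following sketch expresses a few such "unit tests" for a tabular dataset in plain Python; the column names and thresholds are hypothetical and stand in for project-specific rules, while the cited frameworks provide richer, declarative versions of the same idea:</p>
        <preformat>
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of violated data-quality rules (an empty list means the data passes)."""
    errors = []
    # schema check: required columns must be present
    for col in ("timestamp", "pm10", "pm25", "temperature"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            return errors
    # range check: particulate concentrations must be non-negative and plausible
    if not df["pm10"].between(0, 1000).all():
        errors.append("pm10 outside [0, 1000]")
    # missing-value threshold: at most 5% of temperature readings may be null
    if df["temperature"].isna().mean() > 0.05:
        errors.append("too many missing temperature values")
    # uniqueness: one record per timestamp
    if df["timestamp"].duplicated().any():
        errors.append("duplicate timestamps")
    return errors
        </preformat>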
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data testing</title>
        <p>Regardless of data validation, the data can and most probably will change over time, leading to the effect
of data drift or concept drift. Such drift can occur in different ways; the most common forms are covariate drift,
prior probability drift, and concept drift. Covariate drift (often simply called data drift) refers to shifts in the input feature distributions
over time, without changes in the underlying relationship between inputs and outputs [7]. Typical
examples include shifts in demographic data, sensor calibration changes, or varying data collection
methods. Data drift can lead models trained on historical data to perform poorly due to the mismatch
between training and production environments [8]. Statistical tests like the Kolmogorov–Smirnov (K–S)
test [9] or Population Stability Index (PSI) [10] are commonly employed to detect such shifts, comparing
the empirical distributions of new and historical data [11]. Additionally, machine-learning-based
detectors, such as classifier-based two-sample tests and autoencoders, have demonstrated effectiveness
in identifying complex, high-dimensional distributional changes [12].</p>
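        <p>A small, self-contained sketch of such checks is given below; the feature values are synthetic, and the bin count and the PSI alert threshold of 0.2 are only commonly quoted rules of thumb rather than prescriptions from the cited works:</p>
        <preformat>
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # avoid division by zero / log(0) in sparsely populated bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

reference = np.random.normal(25.0, 8.0, size=5000)   # e.g. historical pm10 values
current = np.random.normal(32.0, 9.0, size=1000)     # e.g. last week's pm10 values

stat, p_value = ks_2samp(reference, current)          # Kolmogorov-Smirnov two-sample test
drift_suspected = (p_value < 0.01) or (psi(reference, current) > 0.2)
        </preformat>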
        <p>In contrast to covariate drift, concept drift occurs when there are changes in the conditional
distribution, reflecting alterations in the underlying relationships or the decision boundary that the
predictive model attempts to learn [13]. Real-world examples include evolving customer preferences,
changes in fraud patterns, or medical diagnosis criteria evolving over time. Concept drift basically
implies that the previously learned model concept becomes obsolete, thus requiring model adaptation
or retraining. Common detection techniques include monitoring prediction error rates with methods
such as ADaptive WINdowing (ADWIN) [14] and the Early Drift Detection Method (EDDM) [15].</p>
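        <p>The following deliberately simplified sketch conveys the idea behind such error-rate monitors; it is not the ADWIN or EDDM procedure itself, and the window sizes and tolerance are arbitrary illustrative choices:</p>
        <preformat>
from collections import deque

class ErrorRateMonitor:
    """Compare the recent prediction error rate against a longer reference window."""

    def __init__(self, reference_size=1000, recent_size=100, tolerance=0.10):
        self.reference = deque(maxlen=reference_size)
        self.recent = deque(maxlen=recent_size)
        self.tolerance = tolerance

    def update(self, prediction_was_wrong: bool) -> bool:
        """Record one labelled outcome; return True when drift is suspected."""
        self.reference.append(prediction_was_wrong)
        self.recent.append(prediction_was_wrong)
        if len(self.recent) < self.recent.maxlen:
            return False
        reference_rate = sum(self.reference) / len(self.reference)
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - reference_rate > self.tolerance
        </preformat>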
        <p>Prior probability drift refers to shifts in the distribution of the output variable, without affecting the
relationship between features and targets directly [7]. For example, in medical diagnosis tasks, the
prevalence of disease may shift, affecting the baseline probabilities. Prior probability drift primarily
impacts the calibration and decision thresholds of the predictive models [16]. Detection of prior
probability drift typically involves tracking class distribution changes over time through straightforward
statistical measures such as chi-squared tests or Fisher’s exact tests. Recalibration techniques like Platt
scaling, isotonic regression, or threshold adjustment are also commonly used to mitigate the effect of
prior probability drift [17].</p>
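        <p>A minimal sketch of such a check, using made-up class counts, compares the current label distribution against the training-time distribution with a chi-squared goodness-of-fit test:</p>
        <preformat>
import numpy as np
from scipy.stats import chisquare

# class frequencies observed at training time (e.g. negative vs. positive diagnoses)
train_counts = np.array([9000, 1000])
# class frequencies observed in a recent window of production labels
recent_counts = np.array([850, 150])

# expected counts if the training-time class proportions still held
expected = train_counts / train_counts.sum() * recent_counts.sum()
stat, p_value = chisquare(f_obs=recent_counts, f_exp=expected)
prior_drift_suspected = p_value < 0.01
        </preformat>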
        <p>Identifying drift is critical, but it is equally important to establish strategies to mitigate its efects.
For covariate and concept drift, retraining or incremental learning algorithms are essential [ 8]. Prior
probability drift typically requires recalibration of model probabilities rather than retraining the entire
model from scratch [17, 16]. All these drift types should be integrated into automated pipelines within
MLOps frameworks, which continuously check for distributional changes, performance degradation,
and calibration drift.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Testing the predictive models</title>
        <p>Aside from data validation and testing, it is also important to focus on the testing of the predictive
models. Machine learning models pose specific challenges to traditional software testing practices, as
they are derived from data rather than being programmed in the classical sense. Therefore, it is difficult
to construct traditional test oracles [18]. ML systems are also prone to silent failures, influenced not just
by the model but by dependencies across the data pipeline, pre-processing logic, and external libraries.</p>
        <p>Despite these challenges, practitioners have adapted software testing best practices to ML systems.
At a high level, the most common are unit testing, integration testing, and system testing. Unit testing
involves testing the smallest components of the ML pipeline in isolation. For an intelligent system, that
would be testing the data pre-processing functions, feature extraction logic, or model interface code.
Polyzotis et al. [4] also advocated for unit tests on the model’s learned parameters or predictions for
synthetic inputs to verify the training code isn’t flawed.</p>
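        <p>For illustration, a unit test for a hypothetical lag-feature helper (the function and column names are invented for this sketch and are not part of any cited framework) could look as follows:</p>
        <preformat>
import numpy as np
import pandas as pd

def add_lag_features(df: pd.DataFrame, column: str, lags=(1, 2)) -> pd.DataFrame:
    """Append lagged copies of a column as additional features."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

def test_add_lag_features():
    df = pd.DataFrame({"pm10": [10.0, 20.0, 30.0]})
    result = add_lag_features(df, "pm10", lags=(1,))
    # the lagged column exists, is shifted by one step, and the input frame is untouched
    assert list(result.columns) == ["pm10", "pm10_lag1"]
    assert np.isnan(result["pm10_lag1"].iloc[0])
    assert result["pm10_lag1"].iloc[1] == 10.0
    assert "pm10_lag1" not in df.columns
        </preformat>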
        <p>Since the intelligent systems involve multiple pipelines, each with numerous components, the
integration tests ensure that these components work together as intended. A common practice is
to run a small end-to-end pipeline on a smaller sub-sample of a dataset and verify that the pipeline
produces outputs of the correct format and within the expected ranges. From the academic standpoint,
researchers have proposed frameworks for metamorphic integration testing, where the entire ML
pipeline is exercised under transformations of input data to see if the end-to-end output changes in
expected ways.</p>
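        <p>A minimal sketch of such an integration test, here using a synthetic sub-sample and a generic scikit-learn pipeline rather than any real project pipeline, checks only the output format and a plausible value range:</p>
        <preformat>
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def test_pipeline_end_to_end_on_subsample():
    # tiny synthetic stand-in for a sub-sample of the real dataset
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

    pipeline = make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression())
    predictions = pipeline.fit(X, y).predict(X)

    # the pipeline must produce one finite prediction per row, within a plausible range
    assert predictions.shape == (50,)
    assert np.all(np.isfinite(predictions))
    assert predictions.min() > y.min() - 10 and predictions.max() < y.max() + 10
        </preformat>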
        <p>Testing the models’ prediction logic is probably the most challenging task. Addressing this challenge,
metamorphic testing is a technique in which one leverages known relationships (metamorphic relations)
between different inputs and outputs to test the program without a priori oracles. In metamorphic
testing, we generate new test cases from existing ones by applying transformations, such as adding noise,
scaling features, or shuffling input order, and checking for consistency or predefined changes in the model
output [19].</p>
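        <p>As a hedged illustration, the sketch below tests one simple metamorphic relation of ordinary least squares regression: scaling all training targets by a constant should scale the resulting predictions by the same constant, so no per-prediction oracle is required:</p>
        <preformat>
import numpy as np
from sklearn.linear_model import LinearRegression

def test_metamorphic_target_scaling():
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 3))
    y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.2, size=200)
    X_new = rng.normal(size=(20, 3))

    base_predictions = LinearRegression().fit(X, y).predict(X_new)
    # metamorphic transformation: scale every training target by a constant factor
    scaled_predictions = LinearRegression().fit(X, 10.0 * y).predict(X_new)

    # metamorphic relation: predictions should scale by exactly the same factor
    np.testing.assert_allclose(scaled_predictions, 10.0 * base_predictions,
                               rtol=1e-6, atol=1e-8)
        </preformat>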
        <p>Another important aspect of model testing is robustness testing, which overlaps with metamorphic
and adversarial testing. Techniques like adversarial example generation slightly perturb inputs to
see if the model output changes dramatically, and can be seen as testing the model’s stability [20].
Additionally, mutation testing has also been adapted, introducing small modifications (mutations) to a
trained model (or to its inputs) to see if the model’s outputs change, which can help identify brittle
logic [21].</p>
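        <p>A basic robustness check of this kind can be phrased as a test that small input perturbations do not move predictions much; the model, noise scale, and stability bound below are illustrative assumptions rather than recommended values:</p>
        <preformat>
import numpy as np
from sklearn.linear_model import Ridge

def test_prediction_stability_under_small_perturbations():
    rng = np.random.default_rng(7)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
    model = Ridge(alpha=1.0).fit(X, y)   # smooth model chosen for a deterministic example

    X_test = rng.normal(size=(30, 5))
    X_perturbed = X_test + rng.normal(scale=0.01, size=X_test.shape)  # tiny noise

    baseline = model.predict(X_test)
    perturbed = model.predict(X_perturbed)
    # illustrative stability bound: tiny perturbations should not move predictions much
    assert np.max(np.abs(perturbed - baseline)) < 0.5
        </preformat>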
        <p>Testing ML systems is not only technically challenging but also not yet fully standardized in industry.
Empirical studies highlight that software teams often struggle to apply traditional testing to ML projects
and have had to develop new practices. For example, a recent industry survey by Li et al. [22] found
that testing activities for ML systems focus on three major areas: test data collection, test execution,
and monitoring model performance. In particular, they identified entanglement among components as
a key problem in test execution. Another challenge is that after conducting tests, the interpretation of
the results can be non-trivial since metric thresholds (accuracy, F1 score, etc.) are commonly used as
proxies for correctness, and when these metrics drop, it’s akin to a test failure. However, understanding
why the metric dropped requires human insight and debugging, often beyond the scope of automated
testing.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Integration into MLOps workflows</title>
        <p>Modern practices in machine learning operations emphasize that the QA techniques should be automated
and integrated as part of the ML pipeline. In a robust MLOps pipeline, data validation, model testing, and
monitoring are incorporated into the continuous integration and deployment process for ML models.
This ensures that quality checks are consistently applied every time data or code changes, much like
continuous testing in DevOps.</p>
        <p>In production workflows, data validation is executed prior to model retraining, ensuring that only
validated data propagates through the pipeline. Therefore, potential problems can be caught early, and
the pipeline can halt or alert when problems arise, similar to failing a unit test in continuous integration.
In the same manner, schema validation can also be set up to detect distribution shifts relative to training
data, feeding into data drift detection mechanisms [23]. Furthermore, after the data validation and data
testing, it is common to automate training or retraining of the model and automatically evaluate it on
the test set. If the evaluated model meets defined performance criteria, the pipeline proceeds to push
the model to the production stage.</p>
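        <p>A skeletal version of such a gated pipeline is sketched below; the stage functions and the performance threshold are placeholders for project-specific implementations:</p>
        <preformat>
def run_pipeline(raw_data, validate, train, evaluate, promote, mae_threshold=12.0):
    """Skeleton of a gated training pipeline: each stage runs only if the previous one passes.

    validate/train/evaluate/promote are stand-ins for the real pipeline stages.
    """
    errors = validate(raw_data)
    if errors:
        raise RuntimeError(f"data validation failed, halting pipeline: {errors}")

    model = train(raw_data)
    metrics = evaluate(model)            # e.g. {"mae": ..., "mse": ..., "evs": ...}

    if metrics["mae"] > mae_threshold:   # quality gate: keep the current production model
        return {"promoted": False, "metrics": metrics}

    promote(model)                       # e.g. register the model and move the production alias
    return {"promoted": True, "metrics": metrics}
        </preformat>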
        <p>Another aspect of integration is also continuous monitoring in production. After the deployment of
the predictive model, an MLOps pipeline does not end. It enters the monitoring phase, where data and
predictions are continually logged and analyzed. Here, the earlier-mentioned drift detection methods
run on production data streams. When significant issues like data drift are detected, it is common to
trigger automated workflows, raising an alert, rolling back to a previous model, or scheduling a new
training job with fresh data. When such an MLOps loop is closed, the pipeline ideally becomes
self-correcting: no new model is deployed and no new data influences a model without first
passing quality assurance checks.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Case Study: Quality Assurance in the Airly application</title>
      <p>The Airly platform, presented in Figure 1, is an MLOps solution developed for monitoring and predicting
air pollution levels (specifically PM10 and PM2.5 particulate matter) across 21 air measurement stations
in Slovenia. The platform implements comprehensive quality assurance mechanisms throughout
the intelligent system life cycle, from data acquisition, validation, and testing to model training and
deployment.</p>
      <sec id="sec-3-1">
        <title>3.1. Technology stack</title>
        <p>The project utilizes a technology stack designed for scalability, maintainability, and efficient development.
The user interface is developed using React.js, a declarative JavaScript library for building interactive web
applications. The backend is implemented in Python, using the Flask web framework to provide RESTful
APIs for data access and prediction services (together with ONNX runtime library). Data acquisition is
implemented using BeautifulSoup and requests libraries for scraping and parsing the pollution data from
external web sources, while the Pandas and NumPy libraries are utilized for data manipulation and numerical
operations. MongoDB is used for persistent storage of predictions accessed via the pymongo library.
For the machine learning tasks, scikit-learn and TensorFlow libraries are used, while for the experiment
tracking and model management, the MLFlow and ONNX libraries are utilized.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Pipeline</title>
        <p>The data pipeline, presented in Figure 2, was developed with a strong emphasis on data quality assurance,
reproducibility, and robust automated operation.</p>
        <p>Raw air-quality sensor data, together with weather data, are obtained and transformed using established data
processing approaches and techniques. After the initial pre-processing phase, the pipeline merges the
data and enforces automated validation checks at multiple stages to ensure the integrity and consistency
of the input data. Data quality is of great importance in such pipelines, as inconsistencies or anomalies
in the input can significantly degrade model performance [4]. Therefore, we adopt a schema-driven
validation approach using the Great Expectations library, which can be observed in Figure 3. The
data must satisfy a suite of predefined quality criteria (schema compliance, valid ranges, distributional
properties) before it is passed to the next pipeline steps. Such an approach enables the automatic
detection of aberrant or out-of-bound data (for example, sensor malfunctions or extreme outliers) early
in the process, preventing corrupted data from propagating downstream and thereby protecting the
process of training the model.</p>
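        <p>The concrete expectation suite used in Airly is not reproduced here; the sketch below, written against the classic pandas-backed Great Expectations interface (the exact API differs between library versions, and the file name, column names, and bounds are illustrative), conveys the style of such schema and range checks:</p>
        <preformat>
import great_expectations as ge
import pandas as pd

# hypothetical file name for the merged sensor and weather data
df = ge.from_pandas(pd.read_csv("merged_sensor_weather.csv"))

# schema compliance and valid ranges, in the spirit of the checks described above
df.expect_column_to_exist("pm10")
df.expect_column_values_to_not_be_null("timestamp")
df.expect_column_values_to_be_between("pm10", min_value=0, max_value=1000)
df.expect_column_values_to_be_between("humidity", min_value=0, max_value=100)

results = df.validate()                  # aggregate report over all expectations
if not results["success"]:
    raise RuntimeError("data validation failed; stopping the pipeline")
        </preformat>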
        <p>To support experimental reproducibility and traceability, the pipeline incorporates strict data
versioning. Every dataset snapshot is tracked under data version control alongside the code, producing a
detailed record of how data and code evolve over time and which versions are used for each training
cycle. This practice follows recommended MLOps principles to apply version control to ML assets
(datasets, features, trained models) in addition to source code [24]. In our implementation, we leverage
a Git-integrated data versioning tool (Data Version Control, DVC) to achieve this linkage. Such an
approach has become a popular standard in MLOps practice, as DVC and similar tools allow the seamless
integration of data versioning into ML workflows [24]. By using such a versioned data repository, any
result in the Airly case study can be reproduced exactly by fetching the corresponding dataset version
and model state, which strengthens the scientific rigor and auditability of the pipeline.</p>
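        <p>For illustration, a specific dataset snapshot can be retrieved programmatically through the DVC Python API; the path, repository URL, and revision below are hypothetical:</p>
        <preformat>
import io
import pandas as pd
import dvc.api

# fetch the exact dataset snapshot that a given training run used;
# the path, repository URL and Git revision below are hypothetical
data_bytes = dvc.api.read(
    "data/processed/airly_dataset.csv",
    repo="https://github.com/example/airly",
    rev="train-2024-05-01",
    mode="rb",
)
df = pd.read_csv(io.BytesIO(data_bytes))
        </preformat>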
        <p>In addition to ensuring data integrity and reproducibility, the pipeline includes mechanisms for
continuous monitoring of data. Over time, the statistical properties of environmental sensor data may
shift due to seasonality or changes in sensor behavior, and the model’s real-world performance can
be affected by these shifts. Such data drift or concept drift is a well-known challenge in the machine
learning life cycle, as it can quietly erode a model’s accuracy and reliability if left unchecked [25].
To address this, our pipeline automatically tracks the distribution of incoming data and the model’s
predictions, and it triggers alerts or mitigation steps when significant deviations from the training
baseline are detected. We implement this stability testing via an open-source monitoring suite (e.g.,
Evidently), shown in Figure 4, which continuously evaluates whether the live data falls
within expected bounds. By integrating such automated drift detection into the pipeline, any emerging
anomaly or degradation in model behavior can be identified promptly. This enables a proactive response
(such as triggering data quality investigations or model retraining) before the issue materially impacts
decision-making, thereby maintaining the overall robustness of the Airly platform.</p>
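        <p>A minimal sketch of such a drift report, assuming the Evidently report API available in recent library versions and hypothetical file names for the reference and current data windows, could look as follows:</p>
        <preformat>
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference: data the current model was trained on; current: latest production window
reference = pd.read_csv("data/train_window.csv")   # hypothetical file names
current = pd.read_csv("data/last_7_days.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/data_drift.html")        # dashboard-style report for review
# report.as_dict() exposes the same results programmatically, e.g. to drive alerting
        </preformat>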
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Training Pipeline</title>
        <p>The model training and model registration/versioning pipeline in the Airly platform is designed to
ensure that predictive models are developed, evaluated, and deployed in a manner consistent with the
principles of MLOps: reproducibility, automation, and traceability. This pipeline is triggered upon
changes in code or data. The use of automated orchestration tools for machine learning life cycle
tasks has been widely advocated in recent MLOps literature to minimize manual errors and enforce
reproducible workflows [24, 23].</p>
        <p>The pipeline begins by setting up a consistent Python environment using Poetry, ensuring
reproducibility in dependency resolution and package management. The most recent validated datasets are
retrieved using DVC, linking model training directly to the corresponding versioned data. This practice
enables rigorous traceability and reproducibility, aligning with best practices in version-controlled ML
development.</p>
        <p>Model training is conducted using the TensorFlow and scikit-learn frameworks, both of which support
modular development and integration with pipeline components. Scikit-learn pipelines are utilized
to manage feature scaling, encoding, and imputation, while the training routine, which exploits the
knowledge distillation techniques, is implemented using TensorFlow. To evaluate model performance,
the pipeline calculates standard regression metrics such as mean absolute error (MAE), mean squared
error (MSE), and explained variance score (EVS). These metrics are logged into MLFlow, an open-source
platform for managing the end-to-end machine learning life cycle. MLFlow’s experiment tracking
and model registry capabilities provide centralized visibility into model performance and facilitate
comparison across different runs and versions. An example of executed Airly experiments, logged to
MLFlow, can be observed in Figure 6.</p>
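        <p>A condensed sketch of this logging step, with a hypothetical experiment name and parameter dictionary, might look as follows:</p>
        <preformat>
import mlflow
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error)

def log_evaluation(y_true, y_pred, params):
    """Log evaluation metrics and run parameters for one training run."""
    mlflow.set_experiment("airly-pm10")           # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_params(params)                  # e.g. model and feature settings
        mlflow.log_metric("mae", mean_absolute_error(y_true, y_pred))
        mlflow.log_metric("mse", mean_squared_error(y_true, y_pred))
        mlflow.log_metric("evs", explained_variance_score(y_true, y_pred))
        </preformat>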
        <p>Following evaluation, trained models are converted to the ONNX format, which provides a
platform-agnostic representation that facilitates efficient model inference and integration into diverse runtime
environments. Both the pre-processing pipeline and the trained models are registered/versioned in
MLFlow, which allows for storage of metadata, tagging of model versions, and management of aliases
(e.g., “production”, “champion”) to support controlled deployment.</p>
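        <p>The following sketch illustrates the conversion and registration step for the scikit-learn part of the pipeline; the model name and alias handling are assumptions (the alias API requires a recent MLflow version), and skl2onnx is used here purely for illustration of the conversion:</p>
        <preformat>
import mlflow
from mlflow import MlflowClient
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

def register_model(sklearn_pipeline, n_features, model_name="airly-pm10"):
    """Convert a fitted scikit-learn pipeline to ONNX and register it in MLflow."""
    onnx_model = convert_sklearn(
        sklearn_pipeline,
        initial_types=[("input", FloatTensorType([None, n_features]))],
    )
    with mlflow.start_run():
        info = mlflow.onnx.log_model(
            onnx_model, artifact_path="model", registered_model_name=model_name
        )
    # move the "production" alias to the newly registered version
    client = MlflowClient()
    client.set_registered_model_alias(model_name, "production",
                                      info.registered_model_version)
        </preformat>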
        <p>The pipeline also includes rigorous testing infrastructure. Unit tests validate individual components,
such as data transformation routines and model structure, while integration tests verify the full execution
path from data loading to model registration. By incorporating automated environment setup, data
versioning, robust model training, reproducibility-focused evaluation, and model life cycle management,
the Airly training pipeline adheres to modern MLOps practices and provides a scalable framework for
continuous model improvement.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Production Monitoring Pipeline</title>
        <p>To maintain the reliability of deployed predictive models over time, the Airly platform integrates a
dedicated model monitoring pipeline presented in Figure 7. This component executes on a scheduled
basis (daily) to assess both prediction performance and data quality. Continuous monitoring of
deployed models is a critical aspect of operational machine learning systems, allowing early detection of
performance degradation or data drift.</p>
        <p>The monitoring process begins by establishing a consistent Python environment via Poetry, followed
by automated retrieval of recent predictions and ground truth observations from MongoDB and external
APIs. The monitoring script pre-processes and aligns the data to enable model evaluation. Performance
metrics, including MAE, MSE, and EVS, are computed using scikit-learn and are evaluated separately for
each deployed model version. The results of these evaluations are stored in MongoDB and optionally
logged into MLflow. This integration provides long-term observability and supports centralized analysis
of model behavior over time. Such historical tracking, presented in Figure 8, is essential for detecting
trends, diagnosing issues, and managing model life cycle decisions in an evidence-based manner.
Additionally, predictive model aliases and active model versions can be managed directly through
the platform interface. If newer models demonstrate better performance, they can be promoted to
production by reassigning the alias to the corresponding model version.</p>
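        <p>A condensed sketch of the storage step, with hypothetical connection details and database names, could look as follows:</p>
        <preformat>
from datetime import datetime, timezone

from pymongo import MongoClient
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error)

def store_daily_metrics(y_true, y_pred, model_version,
                        mongo_uri="mongodb://localhost:27017"):
    """Evaluate one model version on the latest aligned data and persist the result."""
    record = {
        "model_version": model_version,
        "evaluated_at": datetime.now(timezone.utc),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
        "evs": explained_variance_score(y_true, y_pred),
    }
    client = MongoClient(mongo_uri)                      # hypothetical connection string
    client["airly"]["monitoring_metrics"].insert_one(record)
    return record
        </preformat>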
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Automation and Infrastructure Support</title>
        <p>The quality assurance infrastructure is supported by automated workflows, executed via GitHub
Workflows. Each pipeline stage is implemented as a separate job within the CI/CD framework, as can
be observed in Figure 9, ensuring that any modification to data, code, or configuration is rigorously
validated before integration. The pipeline is defined as a sequence of interdependent stages managed
by an orchestration framework, ensuring that each stage executes in the correct order under consistent
conditions. Automating the end-to-end workflow in this manner aligns with best-practice methodologies
in MLOps, which call for the use of automated pipelines to reliably manage ML life cycle activities
from data processing to model deployment [23]. In our case study, this means that new data can be
ingested and processed on a defined schedule (or continuously), validated against quality expectations,
and fed through to model retraining or updates with minimal human intervention. The combination of
pipeline automation with rigorous data validation, version control, and drift monitoring yields a robust
data infrastructure. This design not only ensures that the Airly models are trained and updated on
high-quality, reproducible data but also that the system can automatically detect and adapt to changes
in data or performance over time. Such a formalized pipeline approach is well-suited for any intelligent
system, as it provides the reliability, transparency, and scientific rigor required for deployment in a
real-world scenario.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The integration of predictive models into intelligent systems necessitates comprehensive quality
assurance practices to manage the inherent complexities and dynamics associated with such systems. This
paper emphasizes that effective QA must include rigorous and automated data validation procedures,
robust drift detection mechanisms for identification of data and concept shifts, and systematic testing
methodologies specifically tailored for predictive models. The case study of the Airly platform highlights
the practical benefits and effectiveness of embedding these practices into automated MLOps pipelines,
demonstrating improved reproducibility, reliability, and maintainability. Key findings support a life
cycle-centric view of QA, where continuous monitoring and automated interventions, such as retraining
or recalibration triggered by drift detection, are crucial. By systematically integrating validation, testing,
and monitoring practices, organizations can ensure sustained predictive accuracy, mitigate the impact
of data and concept drift, and maintain high-quality intelligent systems over extended operational
periods.</p>
      <p>In future work, we aim to expand our investigation by systematically evaluating the effectiveness of
the proposed quality assurance approach for predictive models. This will involve simulating a range
of data-related scenarios, such as shifts in data distribution, introduction of noisy inputs, or missing
values, to assess the system’s resilience. We also plan to perform a comparative analysis of model
performance under QA-enabled versus non-QA (baseline) conditions. Such an experimental setup
will help quantify the concrete benefits of integrated QA mechanisms in terms of predictive accuracy,
robustness, and reliability over time.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the financial support from the Slovenian Research and Innovation Agency
(Research Core Funding No. P2-0057).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly in order to: Grammar and spelling
check. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>I. Bojinov</surname>
          </string-name>
          , Keep your ai projects on track,
          <year>2023</year>
          . URL: https://hbr.org/
          <year>2023</year>
          /11/ keep-your
          <article-title>-ai-projects-on-track.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Turhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Beyond accuracy: An empirical study on unit testing in open-source deep learning projects</article-title>
          ,
          <source>ACM Trans. Softw. Eng. Methodol</source>
          .
          <volume>33</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3638245. doi:
          <volume>10</volume>
          .1145/3638245.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vrbančič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Podgorelec</surname>
          </string-name>
          ,
          <string-name>
            <surname>OTS</surname>
          </string-name>
          <year>2019</year>
           
          <article-title>Sodobne informacijske tehnologije in storitve: Zbornik štiriindvajsete konference</article-title>
          ,
          <source>Maribor</source>
          ,
          <volume>18</volume>
          . in 19. junij 2019, Maribor: Univerzitetna založba Univerze,
          <year>2019</year>
          . URL: http://press.um.si/index.php/ump/catalog/book/420. doi:
          <volume>10</volume>
          .18690/
          <fpage>978</fpage>
          - 961- 286- 282- 4.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>