<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards machine learning-aware data validation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Strasser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Regensburg</institution>
          ,
          <addr-line>Bajuwarenstraße 4, 93053 Regensburg</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>An important task when operating machine learning applications is model monitoring. Teams operating machine learning pipelines monitor the model performance based on common machine learning metrics like accuracy. However, in many real-world applications, monitoring model performance is difficult as ground truth is required to evaluate the performance. One possible way to draw conclusions about the current state of the pipeline is to observe drifting data, i.e., serving data deviating from the training data. However, approaches which give alerts on changing data are often too sensitive, leading to many false alarms. We propose an approach which provides more actionable data validation for machine learning monitoring. It is based on building so-called data assertions from initial training data. These assertions are then used as constraints to detect unexpected changes and data errors.</p>
      </abstract>
      <kwd-group>
        <kwd>data monitoring</kwd>
        <kwd>machine learning</kwd>
        <kwd>data validation</kwd>
        <kwd>data assertions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decades, the field of machine learning has gone through tremendous progress, which has led to a wide adoption in numerous applications in academia and industry. Significant effort is spent on developing and optimizing algorithms. Another important factor in machine learning applications is the utilized data. The research field of data-centric AI – which can be seen as complementary to model-centric AI – thus addresses data aspects in machine learning applications [<xref ref-type="bibr" rid="ref1">1</xref>]. One important factor is data maintenance for machine learning pipelines, including both training and serving data. Figure 1 shows a high-level representation of a machine learning pipeline: Training data is preprocessed and used for training. The output of this process is a model. In the serving environment, predictions are made for unseen data after a data preprocessing step. When operating such machine learning pipelines, a major challenge is training-serving skew [<xref ref-type="bibr" rid="ref2">2</xref>], i.e., a deviation of the training from the serving environment. In particular, serving data which differs significantly from training data can cause numerous problems. When considering learning on tabular data, structural changes can cause errors in the serving pipeline, hence heavily influencing the model outcome or even breaking pipelines. But differences in data characteristics can also have a significant impact on the downstream model. One frequently researched challenge in this context is concept drift. This term refers to a change of the relationship between input and output data over time. Concept drift can have a massive impact on model performance. Detecting such changes is therefore an important task which we plan to address in our approach.</p>
      <p>Figure 1: Schematic illustration of a machine learning pipeline (straight arrows indicate data flow, dashed arrows indicate a deployment of artifacts)</p>
      <p>Comprehensive data monitoring is needed to detect the aforementioned deviations in serving data. This process can be seen as a special type of data validation, which is an important topic not only in machine learning, but in all applications where data is processed. Data science teams can choose from a vast amount of different tools for this task, e.g., TFX Data Validation [<xref ref-type="bibr" rid="ref3">3</xref>], Deequ [<xref ref-type="bibr" rid="ref4">4</xref>], or great expectations (https://greatexpectations.io/). These tools can be effectively used to validate incoming data against user-defined constraints or a baseline dataset. However, as models and processed data in production machine learning applications are often updated continuously [<xref ref-type="bibr" rid="ref5">5</xref>], a comparison of serving data with a static baseline is not sufficient. There is a need for data validation tools for machine learning applications where this dynamic nature is considered. Another limitation of many data validation tools used for machine learning is that their output does not lead to a quick and easy diagnosis of the underlying problem. Thus, Polyzotis and Zaharia [<xref ref-type="bibr" rid="ref5">5</xref>] identify actionable outputs as one core requirement of monitoring tools for data-centric AI. False alarms can be a problem, too. When a system outputs too many alerts, users tend to ignore or silence them entirely. One way to implement a more helpful diagnosis and to prevent alert fatigue is to alert only on changes which are likely to have an impact on the downstream model. Here, the challenge lies in finding types of deviations which are likely to impact the model outcome.</p>
      <p>Thus, we propose an approach for a system which monitors data throughout the machine learning lifecycle. It aims at validating new incoming data which is used as serving and/or training data. We focus especially on actionable outputs. This requires a deep analysis of the data and of the impact of specific data characteristics on the pipeline. As successful machine learning deployments are operated for long time periods and are updated continuously, information used for data validation is also updated continuously. Metadata necessary for building useful constraints and actionable alerts is collected during the training phase.</p>
      <p>This paper first reviews literature and tools which are aimed at monitoring and validating data for machine learning pipelines. Subsequently, we present our approach in more detail. We also exemplify identified research challenges and ideas on how to solve these challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.2. Data validation</title>
        <sec id="sec-2-1-1">
          <title>The main goal of our approach is to ensure that incom</title>
          <p>
            Multiple fields of research are relevant for our envisioned ing serving data does not break the prediction pipeline
machine learning monitoring system focusing on data or induces other problems like concept drift. Thus, we
aspects. Firstly, we take a look at tools for metadata efectively validate serving data, i.e., check if it fits
premanagement in machine learning experiments. Secondly, defined criteria. Obviously, data validation is required
approaches for validating data in general are discussed. in countless applications over multiple fields. Therefore,
Afterwards, existing ML monitoring tools are presented. lots of development and research went into designing
systems which validate data.
2.1. Metadata management Deequ [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is an example for such a system. It enables
the validation of large-scale data in respect to the data
In our approach, mining metadata from the training pro- quality. Users can set constraints or choose from
suggescess is an important first step. Numerous tools enabling tions generated by the system. The main focus of this
metadata tracking from training pipelines exist. The system is to ensure a performant processing even for
goal of such tools is to guarantee the reproducibility of very large datasets. Redyuk et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] suggest a system
machine learning experiments. They also support com- where data quality is monitored by computing descriptive
parison between diferent machine learning pipelines in statistics and detecting deviations with novelty detection
regards to model performance or other metrics. methods. This is a contrast to other approaches where
          </p>
          <p>
            Schelter et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] suggest a declarative approach to constraints are set manually or semi-automatically by
metadata tracking for machine learning pipelines. This the user.
refers to a decoupling of the actual artifacts produced in There are also data validation systems designed
specifthe machine learning process (i.e., code, models, datasets, ically for machine learning applications. TFX Data
Valetc.) with metadata describing these artifacts. Metadata idation [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] is a part of the platform TFX implemented
is extracted from internal data structures of machine by Google. It enables the validation of both training and
learning frameworks, e.g., from Spark DataFrames or
serving data. The authors make a diferentation between ware assertions to machine learning models. They allow
single-batch and inter-batch validation. Single-batch vali- domain experts to specify constraints over model input
dation is supposed to detect anomalies in a single batch and output. This enables the detection of wrong model
of data while inter-batch validation is targeted at finding outputs in cases where confidence is high. Our approach
significant changes between training and serving data also includes assertions, but focuses on the validation of
or batches of training data. mlinspect [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] is a tool which data, not model optimization.
enables users to inspect training pipelines. It mainly fo- Shankar and Parameswaran [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] present a vision of
cuses on debugging data distribution changes induced by a so-called observability system for machine learning
pipeline steps. This can be used to detect technical bias, pipelines. One primary goal is to detect and diagnose
i.e., bias which is introduced by data preprocessing or bugs in machine learning pipelines. The authors also
other automated tasks. emphasize the importance of measuring model
perfor
          </p>
          <p>
            Static data validation, as provided by most presented mance with incomplete information, i.e., when no ground
applications, is not suficient for the problem of contin- truth is available to evaluate model outputs. A prototype
uously monitoring input data for machine learning ap- for such an observability system was also presented7.
plications. Also, to the best of our knowledge, there are Our proposed system also tackles challenges that
apno systems which validate data based on the impact it is pear in machine learning applications where no ground
expected to have on a downstream model which is one truth is available. However, while the authors of [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
of the core ideas of our approach. presented concepts of using partial or delayed labels for
performance measurement, we keep the labels out of the
2.3. Monitoring machine learning equation and focus on detecting data with unexpected
characteristics. Thus, our approach can be seen as
comapplications plementary to the envisioned system presented in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
Kreuzberger et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] identified continuous monitoring as
one of the main principles of MLOps which is a paradigm
describing the efective development and operation of
production machine learning applications. The
monitoring component is needed to detect errors or changes
influencing model quality. Various artifacts like data,
model outputs, serving infrastructure, etc. are observed.
          </p>
          <p>Machine learning teams use various tools for this task. A
popular choice are general-purpose monitoring systems
like Prometheus3 or the ELK stack4.</p>
          <p>There are also tools specifically designed for the
monitoring of machine learning applications. EvidentlyAI 5
provides multiple modules to monitor data quality, data
drift, and model performance. Test suites perform data
and model quality checks based on conditions that are
either manually set or generated from a reference dataset.</p>
          <p>Reports provide general metrics and interactive
visualizations for analysis and debugging purposes. A continuous
monitoring can be achieved by storing snapshots of test
suites and reports and displaying it in a dashboard.
Similar tools exist for various cloud services which provide
tooling for machine learning, e.g., Amazon SageMaker
Model Monitor6. In constrast to our approach, these tools
incorporate ground truth into the model performance
evaluation. We plan to evaluate models under the
assumption that ground truth is not available.</p>
          <p>
            A concept which can used for monitoring models is
the abstraction of model assertions [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] which adapt
soft
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3https://www.prometheus.io/</title>
          <p>4https://www.elastic.co/elastic-stack
5https://www.evidentlyai.com/
6https://docs.aws.amazon.com/sagemaker/latest/dg/modelmonitor.html</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Concept</title>
      <sec id="sec-3-1">
        <title>An architecture of our proposed system is depicted in</title>
        <p>Figure 2. It consists of two main components: (i) metadata
collection and (ii) data assertion generation. The first
component collects metadata from the initial training
pipeline. This includes machine learning metrics and
the training data. We use this metadata as a baseline.
The next step is to infer expected data characteristics
from the metadata. We refer to these expectations as
data assertions. By collecting machine learning metrics
additionally to the data, it is possible to measure how
specific data characteristics afect the outcome of the
model. By incorporating the efect on model outcome,
more actionable alerts are possible. Teams operating
machine learning pipelines are interested in data errors
which are likely to impact model results. In the following,
we present a general outline for those two components.
We also exemplify research challenges and ideas on how
to solve these.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Collection and storage of metadata in machine learning pipelines</title>
          <p>Firstly, we identify a set of metadata to collect from
the machine learning process. We diferentiate between
two types of metadata: (i) experimentation metadata and
(ii) metadata about the training data. To track metadata
about the experimentation, we intend to use a tracking
tool like mlflow . Here, we are mostly interested in how</p>
        </sec>
      </sec>
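        <p>As a minimal sketch of this experimentation metadata tracking, the following Python snippet logs evaluation metrics together with a pointer to the training dataset via the mlflow tracking API. The dataset path, metric values, and the fingerprinting helper are illustrative assumptions rather than a fixed design.</p>
        <p>import hashlib

import mlflow

TRAIN_DATA_PATH = "data/train_2024_05.parquet"  # hypothetical dataset location

def dataset_fingerprint(path: str) -> str:
    """Hash the dataset file so a run can be linked to the exact training data."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

with mlflow.start_run(run_name="storm-prediction-baseline"):
    # Pointer to the training data: path plus content hash.
    mlflow.log_param("train_data_path", TRAIN_DATA_PATH)
    mlflow.log_param("train_data_sha256", dataset_fingerprint(TRAIN_DATA_PATH))

    # ... model training happens here ...

    # ML metrics collected as a baseline for later data assertions (placeholder values).
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_score", 0.87)</p>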
      <sec id="sec-3-2">
        <title>7https://github.com/loglabs/mltrace</title>
        <p>Training pipeline
Training
data</p>
        <p>ML
metrics
Metadata Collection</p>
        <p>Data
profiles</p>
        <p>ML
metrics
Data profiles</p>
        <p>Data Assertion
Generation</p>
        <p>Data
assertions</p>
        <p>Metadata Storage
Monitoring system</p>
        <p>
          Data
assertions
Serving data
well a model performed on which datasets. Thus, we An idea here is to save the data profiles in data structures
track ML metrics in combination with pointers to the that allow incremental updates. A similar approach is
training datasets. implemented in Deequ [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] where data quality metrics are
        </p>
        <p>
          More importantly, we collect comprehensive metadata incrementally updated for large-scale datasets.
about the actual data a model is trained with. For this, we
create a data profile for each training dataset. For now, 3.2. Data assertions
we focus on profiling tabular data. When considering
tabular data, one could utilize comprehensive work on In model serving, we diferentiate between two cases:
the profiling of relational data [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Various tools were online and batch inference. Online inference means that
developed specifically to profile data used in machine the input is immediately passed to the model which then
learning applications [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]. However, an extensive outputs a prediction. Thus, the serving data can be seen
profiling is not always feasible for the application of as unbounded. In batch inference, on the other hand,
creating data profiles for machine learning datasets, as it input data is accumulated into batches. Those batches
would be too expensive for many use cases. Thus, a trade- are then passed to a model and inference is performed
of between getting insightful information about the data on all the collected data. This diferentiation is important
and performance has to be made. A first design of a data as these types can require highly diferent processing
profile which considers these requirements is shown in techniques.
        </p>
        <p>Figure 3. The data profile contains general information We tackle challenges regarding the
training-servingand descriptive statistics describing the dataset. We track skew which is an issue in production ML. Thus, detecting
statistics about numeric and categorical features, as well
as relationships between columns. The design of this
data profile is not meant to be final, but rather serves as a Data Profile
suggestion which can be adjusted based on the use case.</p>
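        <p>The following Python sketch illustrates how such a data profile could be computed for a tabular dataset with pandas. The selected statistics follow the components listed in Figure 3, while the function name and the dictionary layout are illustrative assumptions.</p>
        <p>import pandas as pd

def build_data_profile(df: pd.DataFrame) -> dict:
    """Compute a simple data profile: general info, per-column statistics, correlations."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(exclude="number").columns

    return {
        "general": {"n_rows": len(df), "n_columns": df.shape[1]},
        "numeric": {
            col: {
                "min": float(df[col].min()),
                "max": float(df[col].max()),
                "median": float(df[col].median()),
                "mean": float(df[col].mean()),
                "std": float(df[col].std()),
                "skewness": float(df[col].skew()),
                "kurtosis": float(df[col].kurtosis()),
            }
            for col in numeric_cols
        },
        "categorical": {
            col: {
                "n_categories": int(df[col].nunique()),
                "mode": df[col].mode().iloc[0] if not df[col].mode().empty else None,
                "frequencies": df[col].value_counts(normalize=True).to_dict(),
            }
            for col in categorical_cols
        },
        # Relationships between numeric columns.
        "correlations": df[numeric_cols].corr().to_dict(),
    }</p>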
        <p>There are also different possibilities to collect data from training scripts and other machine learning code. One possibility is to provide an API which enables the user to log data from variables or files, similar to the implementations in mlflow [<xref ref-type="bibr" rid="ref7">7</xref>] or other experiment tracking tools. Other approaches, where data is captured directly from Python scripts [<xref ref-type="bibr" rid="ref16 ref9">9, 16</xref>], could also be used.</p>
        <p>An additional research challenge we identified is to figure out an efficient storage method for the metadata. This is especially important for machine learning applications which are operated over a long period of time, as the training data is updated every time models are retrained. An idea here is to save the data profiles in data structures that allow incremental updates. A similar approach is implemented in Deequ [<xref ref-type="bibr" rid="ref4">4</xref>], where data quality metrics are incrementally updated for large-scale datasets.</p>
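        <p>The sketch below illustrates this idea for a single numeric column: simple running aggregates (count, sum, min, max) can be merged batch by batch without reprocessing old training data. The class and field names are illustrative assumptions, not an implemented component.</p>
        <p>from dataclasses import dataclass

@dataclass
class RunningNumericStats:
    """Incrementally maintained statistics for one numeric column."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, values: list[float]) -> None:
        # Merge a new batch of values into the running aggregates.
        for v in values:
            self.count += 1
            self.total += v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

# Usage: update the stored profile whenever the training data grows.
temperature_stats = RunningNumericStats()
temperature_stats.update([12.3, 18.9, -1.4])
temperature_stats.update([25.0, 30.2])
print(temperature_stats.mean, temperature_stats.minimum, temperature_stats.maximum)</p>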
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data assertions</title>
        <p>In model serving, we differentiate between two cases: online and batch inference. Online inference means that the input is immediately passed to the model, which then outputs a prediction. Thus, the serving data can be seen as unbounded. In batch inference, on the other hand, input data is accumulated into batches. Those batches are then passed to a model and inference is performed on all the collected data. This differentiation is important as these types can require highly different processing techniques.</p>
        <p>We tackle challenges regarding the training-serving skew, which is an issue in production ML. Detecting such differences between training and serving data is crucial to avoid performance decreases. In batch settings, one way to address this problem could be to compare data from the training phase with the serving dataset. Assuming data profiles as described in Section 3.1 are available, this comparison can be done by finding differences between the baseline data profile (i.e., the data profile of the training data) and the serving data profile.</p>
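        <p>For the batch case, such a comparison could look like the following sketch, which flags numeric columns whose serving statistics deviate from the baseline by more than a relative tolerance. The profile layout matches the profiling sketch in Section 3.1, and the tolerance value is an arbitrary placeholder.</p>
        <p>def compare_profiles(baseline: dict, serving: dict, tolerance: float = 0.2) -> list[str]:
    """Return warnings for numeric columns whose serving statistics deviate from the baseline."""
    warnings = []
    for col, base_stats in baseline["numeric"].items():
        if col not in serving["numeric"]:
            warnings.append(f"column '{col}' missing in serving data")
            continue
        serve_stats = serving["numeric"][col]
        for stat in ("mean", "std"):
            base_val, serve_val = base_stats[stat], serve_stats[stat]
            # Relative deviation of the serving statistic from the training baseline.
            denom = abs(base_val) if base_val != 0 else 1.0
            if abs(serve_val - base_val) / denom > tolerance:
                warnings.append(
                    f"column '{col}': {stat} changed from {base_val:.2f} to {serve_val:.2f}"
                )
    return warnings</p>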
        <p>For unbounded data, the approach of building data profiles for input data is not feasible. As data profiles are meant to be summaries of datasets, they do not serve as a good means to describe unbounded data. Therefore, for this case, our approach to detect changes is to derive so-called data assertions from the data profiles that were collected in the training process. These data assertions are intended to ensure that incoming serving data does not lead to errors in the inference pipeline. Thus, they are meant to identify data which is significantly different from training data. If an assertion is violated, the monitoring system gives a warning.</p>
        <p>Consider an application where the goal is to predict whether there will be a storm on the next day based on weather data from the current day. One column in this example would be the temperature. A data assertion for this column could look like the one depicted in Listing 1. In this example, the temperature is guaranteed to be a float and to have a value between -5.2 and 36.7.</p>
        <p>{
  "column_assertion": {
    "name": "temperature",
    "dtype": "float",
    "lower_bound": -5.2,
    "upper_bound": 36.7
  }
}</p>
        <p>Listing 1: Example for a data assertion (planned)</p>
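        <p>The following sketch shows how such an assertion could be derived from a stored data profile and checked against an incoming serving record. The dictionary layout mirrors Listing 1 and the profiling sketch in Section 3.1, and returning plain warning strings is a stand-in for the alerting of the monitoring system.</p>
        <p>def derive_column_assertion(profile: dict, column: str) -> dict:
    """Turn the profiled min/max of a numeric column into a simple range assertion."""
    stats = profile["numeric"][column]
    return {
        "column_assertion": {
            "name": column,
            "dtype": "float",
            "lower_bound": stats["min"],
            "upper_bound": stats["max"],
        }
    }

def check_record(record: dict, assertion: dict) -> list[str]:
    """Validate one serving record against a column assertion and report violations."""
    spec = assertion["column_assertion"]
    value = record.get(spec["name"])
    violations = []
    if value is None:
        violations.append(f"missing column '{spec['name']}'")
    elif not isinstance(value, (int, float)):
        violations.append(f"'{spec['name']}' is not numeric: {value!r}")
    elif spec["lower_bound"] > value or value > spec["upper_bound"]:
        violations.append(
            f"'{spec['name']}' = {value} outside [{spec['lower_bound']}, {spec['upper_bound']}]"
        )
    return violations

# Example: a serving record with an unusually high temperature triggers a warning.
assertion = {"column_assertion": {"name": "temperature", "dtype": "float",
                                  "lower_bound": -5.2, "upper_bound": 36.7}}
print(check_record({"temperature": 41.3}, assertion))</p>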
        <p>In the context of data assertions, we identified two research challenges. The first one is the design of the data assertions. Listing 1 only shows a rough idea for the design of one specific example. Data assertions can be thought of as a kind of constraint. Our goal is to define these constraints in a way that data violating them is likely to cause problems in the pipeline, ranging from pipeline-breaking errors to deviations in data distribution which often lead to worse model outputs. For this, we first have to define which errors can occur in the pipeline. Then we can classify which errors are affecting the pipeline and which are not.</p>
        <p>We start with “simple” assertions, i.e., data assertions based on simple data statistics like the min and the max. An example for such a data assertion was introduced prior in Listing 1. These data assertions are similar to constraints as implemented in Deequ [<xref ref-type="bibr" rid="ref4">4</xref>] or other data validation tools. We then create a taxonomy of those data assertions, i.e., find suitable assertion types. For instance, on a high level, assertion types could be divided into structural and semantic. Also, assertions can be on feature- or table-level. Table-level assertions include constraints which model relationships between features. In the next step, we plan to cluster those constraints according to what error types are produced when data violates them. Here, we differentiate between changes breaking the preprocessing pipeline and changes influencing the model outcome (also over time). Therefore, we also incorporate the model into the evaluation. This way we can also measure whether specific data assertion violations tend to have a similar impact on the downstream model. A research question in this context is to find approaches on how to evaluate the effects of assertion violations on the model. An idea is to use metrics which measure the feature importance [<xref ref-type="bibr" rid="ref17">17</xref>].</p>
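        <p>To make the planned taxonomy more concrete, the following sketch encodes the two dimensions mentioned above (structural vs. semantic, feature- vs. table-level) as metadata attached to each assertion. The class and field names are illustrative assumptions.</p>
        <p>from dataclasses import dataclass
from enum import Enum

class AssertionKind(Enum):
    STRUCTURAL = "structural"  # e.g., column exists, dtype matches
    SEMANTIC = "semantic"      # e.g., value range, frequency distribution

class AssertionLevel(Enum):
    FEATURE = "feature"  # constraint on a single column
    TABLE = "table"      # constraint over relationships between columns

@dataclass
class DataAssertion:
    name: str
    kind: AssertionKind
    level: AssertionLevel
    spec: dict  # concrete parameters of the constraint

temperature_range = DataAssertion(
    name="temperature_range",
    kind=AssertionKind.SEMANTIC,
    level=AssertionLevel.FEATURE,
    spec={"column": "temperature", "lower_bound": -5.2, "upper_bound": 36.7},
)</p>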
        <p>The outcome of this process is a set of rules which model the relationship between a data assertion violation and its impact on the pipeline. These can be used to build a classification model. The independent variables are attributes describing the data assertion and the dependent variable is the impact on the pipeline. This model in turn can be used to improve the assertions by prioritizing assertions with a higher predicted impact and neglecting constraints with low estimated impact.</p>
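        <p>Such a classification model could be set up as in the following sketch, where each observed assertion violation is described by a few attributes and labelled with the impact it had on the pipeline. The chosen features, labels, and the decision tree classifier are illustrative assumptions, not evaluated results.</p>
        <p>from sklearn.tree import DecisionTreeClassifier

# Each row describes one observed assertion violation:
# [is_structural, is_table_level, relative_deviation, feature_importance_of_column]
X = [
    [1, 0, 0.00, 0.35],
    [0, 0, 0.15, 0.05],
    [0, 1, 0.60, 0.40],
    [0, 0, 0.05, 0.02],
]
# Label: observed impact class (0 = none, 1 = degraded model output, 2 = pipeline failure).
y = [2, 0, 1, 0]

impact_model = DecisionTreeClassifier(max_depth=3, random_state=0)
impact_model.fit(X, y)

# Predict the impact of a new violation to decide whether an alert should be raised.
new_violation = [[0, 0, 0.45, 0.30]]
print(impact_model.predict(new_violation))</p>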
        <p>An important part of the concept of data assertions is that users do not have to define them all by hand. Rather, we plan to use a semi-automatic approach which is illustrated in Figure 4. The monitoring system suggests assertions based on the data profiles of the training process to the data scientist. The data scientist can then accept or reject the assertion. They can also edit the assertion. For the example in which a constraint is generated for the column temperature, they could change the upper bound to 40 – if they know such temperatures are realistic in the observed area. This user feedback can also be incorporated into the data assertion generation process, enabling better constraints which produce fewer false alarms.</p>
        <p>Figure 4: Semi-automatic data assertion generation (metadata collection, data assertion generation, refinement, and deployment, with user feedback flowing back into the generation step)</p>
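        <p>The suggestion and feedback step could, for instance, look like the following sketch, in which suggested assertions are presented to the data scientist, who accepts, edits, or rejects them. The command-line input handling is a stand-in for a real review interface.</p>
        <p>def review_assertions(suggested: list[dict]) -> list[dict]:
    """Let a data scientist accept, edit, or reject suggested assertions."""
    accepted = []
    for assertion in suggested:
        spec = assertion["column_assertion"]
        answer = input(f"Accept assertion for '{spec['name']}' "
                       f"[{spec['lower_bound']}, {spec['upper_bound']}]? (y/n/edit) ")
        if answer == "y":
            accepted.append(assertion)
        elif answer == "edit":
            # e.g., the user raises the upper bound to 40 because such temperatures are realistic.
            spec["upper_bound"] = float(input("new upper bound: "))
            accepted.append(assertion)
        # Rejected assertions are dropped; the decision can be fed back into generation.
    return accepted</p>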
        <p>In a last step, the data assertions created before have to be evaluated. Several datasets used for benchmarking in machine learning research can be used. However, we will also allow a parametrization of various attributes of the data, e.g., schema or column contents. Variance in data can therefore be controlled and data with different properties can be tested. In the evaluation, the input data is first split into “training” data, with which the data assertion generation is executed, and “test” data, with which the accuracy of these data assertions is then evaluated. The evaluation is separated into two steps: (i) verification of the semantic correctness, i.e., does the generated data assertion hold for the test data, and (ii) comparison of the predicted impact with the actual impact on the model.</p>
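        <p>A first sketch of the verification step (i) is given below: a range assertion is generated from the training split and its violation rate is measured on the held-out test split. The synthetic temperature column and the simple min/max assertion are illustrative assumptions; the comparison of predicted and actual model impact is omitted here.</p>
        <p>import pandas as pd

def evaluate_range_assertion(df: pd.DataFrame, column: str, train_fraction: float = 0.8) -> float:
    """Generate a min/max assertion on a training split and measure its violation rate on a test split."""
    split = int(len(df) * train_fraction)
    train, test = df.iloc[:split], df.iloc[split:]

    lower, upper = train[column].min(), train[column].max()  # assertion generation
    violated = ((lower > test[column]) | (test[column] > upper)).sum()  # semantic check
    return violated / len(test) if len(test) else 0.0

# Example with a synthetic temperature column.
data = pd.DataFrame({"temperature": [12.1, 18.4, -3.0, 25.7, 31.2, 36.9, 40.5, 22.0, 15.5, 9.8]})
print(evaluate_range_assertion(data, "temperature"))</p>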
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Validating data for machine learning applications is a task with many challenges yet to solve. We focus on the validation of data which is updated continuously and on the minimization of false positives. Therefore, this paper proposes a data monitoring system which incorporates a comprehensive collection of metadata in machine learning pipelines and the creation of data constraints we call data assertions. The assertion building process is aware of the downstream model and measures the influence of data variance on the model outcome. In our next steps, we will implement the metadata collection component to collect comprehensive metadata not only on the data, but also on the model. This metadata serves as input for the data assertion generation process. We presented research challenges we identified both in metadata collection and in data assertion generation. We also discussed how the generated data assertions can be evaluated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zha</surname>
          </string-name>
          , et al.,
          <source>Data-centric artificial intelligence: A survey</source>
          ,
          <year>2023</year>
          . arXiv:2303.10158.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          , et al.,
          <article-title>Data management challenges in production machine learning</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM International Conference on Management of Data, Association for Computing Machinery</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          , et al.,
          <article-title>Data validation for machine learning</article-title>
          ,
          <source>in: Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          , et al.,
          <article-title>Automating large-scale data quality verification</article-title>
          ,
          <source>in: Proc. VLDB Endow</source>
          ., volume
          <volume>11</volume>
          ,
          VLDB Endowment
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>What can data-centric AI learn from data and ML engineering?</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.06439.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          , et al.,
          <article-title>Declarative metadata management: A missing piece in end-to-end machine learning</article-title>
          ,
          <source>in: Proceedings of SysML</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , et al.,
          <article-title>Accelerating the machine learning lifecycle with mlflow</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>41</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Redyuk</surname>
          </string-name>
          , et al.,
          <article-title>Automating data quality validation for dynamic data ingestion</article-title>
          ,
          <source>in: International Conference on Extending Database Technology</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          , et al.,
          <article-title>Data distribution debugging in machine learning pipelines</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>31</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kreuzberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hirschl</surname>
          </string-name>
          ,
          <article-title>Machine learning operations (mlops): Overview, definition, and architecture</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          , et al.,
          <article-title>Model assertions for monitoring and improving ml models</article-title>
          ,
          <source>in: Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>2</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <article-title>Towards observability for production machine learning pipelines</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>15</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Profiling relational data: a survey</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>24</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Epperson</surname>
          </string-name>
          , et al.,
          <article-title>Dead or alive: Continuous data profiling for interactive data science</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>30</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Clemente</surname>
          </string-name>
          , et al.,
          <article-title>ydata-profiling: Accelerating data-centric ai with high-quality data</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>554</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Murta</surname>
          </string-name>
          , et al.,
          <article-title>noworkflow: Capturing and analyzing provenance of scripts</article-title>
          ,
          <source>in: Provenance and Annotation of Data and Processes</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saarela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <article-title>Comparison of feature importance measures as explanations for classification models</article-title>
          ,
          <source>SN Applied Sciences</source>
          <volume>3</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>