<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards machine learning-aware data validation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Strasser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Regensburg</institution>
          ,
          <addr-line>Bajuwarenstraße 4, 93053 Regensburg</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>An important task when operating machine learning applications is model monitoring. Teams operating machine learning pipelines monitor the model performance based on common machine learning metrics like accuracy. However, in many real-world applications, monitoring model performance is difficult as ground truth is required to evaluate the performance. One possible way to draw conclusions about the current state of the pipeline is to observe drifting data, i.e., serving data deviating from the training data. However, approaches which give alerts on changing data are often too sensitive, leading to many false alarms. We propose an approach which provides more actionable data validation for machine learning monitoring. It is based on building so-called data assertions from initial training data. These assertions are then used as constraints to detect unexpected changes and data errors.</p>
      </abstract>
      <kwd-group>
        <kwd>data monitoring</kwd>
        <kwd>machine learning</kwd>
        <kwd>data validation</kwd>
        <kwd>data assertions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decades, the field of machine learning has gone through tremendous progress, which has led to a wide adoption in numerous applications in academia and industry. Significant effort is spent on developing and optimizing algorithms. Another important factor in machine learning applications is the utilized data. The research field of data-centric AI – which can be seen as complementary to model-centric AI – thus addresses data aspects in machine learning applications [<xref ref-type="bibr" rid="ref1">1</xref>]. One important factor is data maintenance for machine learning pipelines, including both training and serving data. Figure 1 shows a high-level representation of a machine learning pipeline: Training data is preprocessed and used for training. The output of this process is a model. In the serving environment, predictions are made for unseen data after a data preprocessing step. When operating such machine learning pipelines, a major challenge is training-serving skew [<xref ref-type="bibr" rid="ref2">2</xref>], i.e., a deviation of the training from the serving environment. In particular, serving data which differs significantly from training data can cause numerous problems. When considering learning on tabular data, structural changes can cause errors in the serving pipeline, hence heavily influencing the model outcome or even breaking pipelines. But differences in data characteristics can also have a significant impact on the downstream model. One frequently researched challenge in this context is concept drift. This term refers to a change of the relationship between input and output data over time. Concept drift can have a massive impact on model performance. Detecting such changes is therefore an important task which we plan to address in our approach.</p>
      <p>Figure 1: Schematic illustration of a machine learning pipeline (straight arrows indicate data flow, dashed arrows indicate a deployment of artifacts)</p>
      <p>Comprehensive data monitoring is needed to detect the aforementioned deviations in serving data. This process can be seen as a special type of data validation, which is an important topic not only in machine learning, but in all applications where data is processed. Data science teams can choose from a vast amount of different tools for this task, e.g., TFX Data Validation [<xref ref-type="bibr" rid="ref3">3</xref>], Deequ [<xref ref-type="bibr" rid="ref4">4</xref>], or great expectations (https://greatexpectations.io/). These tools can be effectively used to validate incoming data against user-defined constraints or a baseline dataset. However, as models and processed data in production machine learning applications are often updated continuously [<xref ref-type="bibr" rid="ref5">5</xref>], a comparison of serving data with a static baseline is not sufficient. There is a need for data validation tools for machine learning applications where this dynamic nature is considered. Another limitation of many data validation tools used for machine learning is that their output does not lead to a quick and easy diagnosis of the underlying problem. Thus, Polyzotis and Zaharia [<xref ref-type="bibr" rid="ref5">5</xref>] identify actionable outputs as one core requirement of monitoring tools for data-centric AI. False alarms can be a problem, too. When a system outputs too many alerts, users tend to ignore or silence them entirely. One way to implement a more helpful diagnosis and to prevent alert fatigue is to alert only on changes which are likely to have an impact on the downstream model. Here, the challenge lies in finding types of deviations which are likely to impact the model outcome.</p>
      <p>Thus, we propose an approach for a system which monitors data throughout the machine learning lifecycle. It aims at validating new incoming data which is used as serving and/or training data. We focus especially on actionable outputs. This requires a deep analysis of the data and of the impact of specific data characteristics on the pipeline. As successful machine learning deployments are operated for long time periods and are updated continuously, information used for data validation is also updated continuously. Metadata necessary for building useful constraints and actionable alerts is collected during the training phase.</p>
      <p>This paper first reviews literature and tools which are aimed at monitoring and validating data for machine learning pipelines. Subsequently, we present our approach in more detail. We also exemplify identified research challenges and ideas on how to solve these challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.2. Data validation</title>
        <sec id="sec-2-1-1">
          <title>The main goal of our approach is to ensure that incom</title>
          <p>
            Multiple fields of research are relevant for our envisioned ing serving data does not break the prediction pipeline
machine learning monitoring system focusing on data or induces other problems like concept drift. Thus, we
aspects. Firstly, we take a look at tools for metadata efectively validate serving data, i.e., check if it fits
premanagement in machine learning experiments. Secondly, defined criteria. Obviously, data validation is required
approaches for validating data in general are discussed. in countless applications over multiple fields. Therefore,
Afterwards, existing ML monitoring tools are presented. lots of development and research went into designing
systems which validate data.
2.1. Metadata management Deequ [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is an example for such a system. It enables
the validation of large-scale data in respect to the data
In our approach, mining metadata from the training pro- quality. Users can set constraints or choose from
suggescess is an important first step. Numerous tools enabling tions generated by the system. The main focus of this
metadata tracking from training pipelines exist. The system is to ensure a performant processing even for
goal of such tools is to guarantee the reproducibility of very large datasets. Redyuk et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] suggest a system
machine learning experiments. They also support com- where data quality is monitored by computing descriptive
parison between diferent machine learning pipelines in statistics and detecting deviations with novelty detection
regards to model performance or other metrics. methods. This is a contrast to other approaches where
          </p>
          <p>
            Schelter et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] suggest a declarative approach to constraints are set manually or semi-automatically by
metadata tracking for machine learning pipelines. This the user.
refers to a decoupling of the actual artifacts produced in There are also data validation systems designed
specifthe machine learning process (i.e., code, models, datasets, ically for machine learning applications. TFX Data
Valetc.) with metadata describing these artifacts. Metadata idation [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] is a part of the platform TFX implemented
is extracted from internal data structures of machine by Google. It enables the validation of both training and
learning frameworks, e.g., from Spark DataFrames or
serving data. The authors make a diferentation between ware assertions to machine learning models. They allow
single-batch and inter-batch validation. Single-batch vali- domain experts to specify constraints over model input
dation is supposed to detect anomalies in a single batch and output. This enables the detection of wrong model
of data while inter-batch validation is targeted at finding outputs in cases where confidence is high. Our approach
significant changes between training and serving data also includes assertions, but focuses on the validation of
or batches of training data. mlinspect [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] is a tool which data, not model optimization.
enables users to inspect training pipelines. It mainly fo- Shankar and Parameswaran [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] present a vision of
cuses on debugging data distribution changes induced by a so-called observability system for machine learning
pipeline steps. This can be used to detect technical bias, pipelines. One primary goal is to detect and diagnose
i.e., bias which is introduced by data preprocessing or bugs in machine learning pipelines. The authors also
other automated tasks. emphasize the importance of measuring model
perfor
          </p>
          <p>
            Static data validation, as provided by most presented mance with incomplete information, i.e., when no ground
applications, is not suficient for the problem of contin- truth is available to evaluate model outputs. A prototype
uously monitoring input data for machine learning ap- for such an observability system was also presented7.
plications. Also, to the best of our knowledge, there are Our proposed system also tackles challenges that
apno systems which validate data based on the impact it is pear in machine learning applications where no ground
expected to have on a downstream model which is one truth is available. However, while the authors of [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
of the core ideas of our approach. presented concepts of using partial or delayed labels for
performance measurement, we keep the labels out of the
2.3. Monitoring machine learning equation and focus on detecting data with unexpected
characteristics. Thus, our approach can be seen as
comapplications plementary to the envisioned system presented in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
Kreuzberger et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] identified continuous monitoring as
one of the main principles of MLOps which is a paradigm
describing the efective development and operation of
production machine learning applications. The
monitoring component is needed to detect errors or changes
influencing model quality. Various artifacts like data,
model outputs, serving infrastructure, etc. are observed.
          </p>
          <p>Machine learning teams use various tools for this task. A
popular choice are general-purpose monitoring systems
like Prometheus3 or the ELK stack4.</p>
          <p>There are also tools specifically designed for the
monitoring of machine learning applications. EvidentlyAI 5
provides multiple modules to monitor data quality, data
drift, and model performance. Test suites perform data
and model quality checks based on conditions that are
either manually set or generated from a reference dataset.</p>
          <p>Reports provide general metrics and interactive
visualizations for analysis and debugging purposes. A continuous
monitoring can be achieved by storing snapshots of test
suites and reports and displaying it in a dashboard.
Similar tools exist for various cloud services which provide
tooling for machine learning, e.g., Amazon SageMaker
Model Monitor6. In constrast to our approach, these tools
incorporate ground truth into the model performance
evaluation. We plan to evaluate models under the
assumption that ground truth is not available.</p>
          <p>
            A concept which can used for monitoring models is
the abstraction of model assertions [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] which adapt
soft
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3https://www.prometheus.io/</title>
          <p>4https://www.elastic.co/elastic-stack
5https://www.evidentlyai.com/
6https://docs.aws.amazon.com/sagemaker/latest/dg/modelmonitor.html</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Concept</title>
      <sec id="sec-3-1">
        <title>An architecture of our proposed system is depicted in</title>
        <p>Figure 2. It consists of two main components: (i) metadata
collection and (ii) data assertion generation. The first
component collects metadata from the initial training
pipeline. This includes machine learning metrics and
the training data. We use this metadata as a baseline.
The next step is to infer expected data characteristics
from the metadata. We refer to these expectations as
data assertions. By collecting machine learning metrics
additionally to the data, it is possible to measure how
specific data characteristics afect the outcome of the
model. By incorporating the efect on model outcome,
more actionable alerts are possible. Teams operating
machine learning pipelines are interested in data errors
which are likely to impact model results. In the following,
we present a general outline for those two components.
We also exemplify research challenges and ideas on how
to solve these.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Collection and storage of metadata in machine learning pipelines</title>
          <p>Firstly, we identify a set of metadata to collect from
the machine learning process. We diferentiate between
two types of metadata: (i) experimentation metadata and
(ii) metadata about the training data. To track metadata
about the experimentation, we intend to use a tracking
tool like mlflow . Here, we are mostly interested in how</p>
        </sec>
      </sec>
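        <p>As a minimal sketch of this experimentation metadata tracking, the following Python snippet logs evaluation metrics together with a pointer to the training dataset via the mlflow tracking API. The dataset path, metric values, and the fingerprinting helper are illustrative assumptions rather than a fixed design.</p>
        <p>import hashlib

import mlflow

TRAIN_DATA_PATH = "data/train_2024_05.parquet"  # hypothetical dataset location

def dataset_fingerprint(path: str) -> str:
    """Hash the dataset file so a run can be linked to the exact training data."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

with mlflow.start_run(run_name="storm-prediction-baseline"):
    # Pointer to the training data: path plus content hash.
    mlflow.log_param("train_data_path", TRAIN_DATA_PATH)
    mlflow.log_param("train_data_sha256", dataset_fingerprint(TRAIN_DATA_PATH))

    # ... model training happens here ...

    # ML metrics collected as a baseline for later data assertions (placeholder values).
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_score", 0.87)</p>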
      <sec id="sec-3-2">
        <title>7https://github.com/loglabs/mltrace</title>
        <p>Training pipeline
Training
data</p>
        <p>ML
metrics
Metadata Collection</p>
        <p>Data
profiles</p>
        <p>ML
metrics
Data profiles</p>
        <p>Data Assertion
Generation</p>
        <p>Data
assertions</p>
        <p>Metadata Storage
Monitoring system</p>
        <p>
          Data
assertions
Serving data
well a model performed on which datasets. Thus, we An idea here is to save the data profiles in data structures
track ML metrics in combination with pointers to the that allow incremental updates. A similar approach is
training datasets. implemented in Deequ [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] where data quality metrics are
        </p>
        <p>
          More importantly, we collect comprehensive metadata incrementally updated for large-scale datasets.
about the actual data a model is trained with. For this, we
create a data profile for each training dataset. For now, 3.2. Data assertions
we focus on profiling tabular data. When considering
tabular data, one could utilize comprehensive work on In model serving, we diferentiate between two cases:
the profiling of relational data [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Various tools were online and batch inference. Online inference means that
developed specifically to profile data used in machine the input is immediately passed to the model which then
learning applications [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]. However, an extensive outputs a prediction. Thus, the serving data can be seen
profiling is not always feasible for the application of as unbounded. In batch inference, on the other hand,
creating data profiles for machine learning datasets, as it input data is accumulated into batches. Those batches
would be too expensive for many use cases. Thus, a trade- are then passed to a model and inference is performed
of between getting insightful information about the data on all the collected data. This diferentiation is important
and performance has to be made. A first design of a data as these types can require highly diferent processing
profile which considers these requirements is shown in techniques.
        </p>
        <p>Figure 3. The data profile contains general information We tackle challenges regarding the
training-servingand descriptive statistics describing the dataset. We track skew which is an issue in production ML. Thus, detecting
statistics about numeric and categorical features, as well
as relationships between columns. The design of this
data profile is not meant to be final, but rather serves as a Data Profile
suggestion which can be adjusted based on the use case.</p>
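        <p>The following Python sketch illustrates how such a data profile could be computed for a tabular dataset with pandas. The selected statistics follow the components listed in Figure 3, while the function name and the dictionary layout are illustrative assumptions.</p>
        <p>import pandas as pd

def build_data_profile(df: pd.DataFrame) -> dict:
    """Compute a simple data profile: general info, per-column statistics, correlations."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(exclude="number").columns

    return {
        "general": {"n_rows": len(df), "n_columns": df.shape[1]},
        "numeric": {
            col: {
                "min": float(df[col].min()),
                "max": float(df[col].max()),
                "median": float(df[col].median()),
                "mean": float(df[col].mean()),
                "std": float(df[col].std()),
                "skewness": float(df[col].skew()),
                "kurtosis": float(df[col].kurtosis()),
            }
            for col in numeric_cols
        },
        "categorical": {
            col: {
                "n_categories": int(df[col].nunique()),
                "mode": df[col].mode().iloc[0] if not df[col].mode().empty else None,
                "frequencies": df[col].value_counts(normalize=True).to_dict(),
            }
            for col in categorical_cols
        },
        # Relationships between numeric columns.
        "correlations": df[numeric_cols].corr().to_dict(),
    }</p>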
        <p>There are also different possibilities to collect data from training scripts and other machine learning code. One possibility is to provide an API which enables the user to log data from variables or files, similar to the implementations in mlflow [<xref ref-type="bibr" rid="ref7">7</xref>] or other experiment tracking tools. Other approaches, where data is captured directly from Python scripts [<xref ref-type="bibr" rid="ref16 ref9">9, 16</xref>], could also be used.</p>
        <p>An additional research challenge we identified is to figure out an efficient storage method for the metadata. This is especially important for machine learning applications which are operated over a long period of time, as the training data is updated every time models are retrained. An idea here is to save the data profiles in data structures that allow incremental updates. A similar approach is implemented in Deequ [<xref ref-type="bibr" rid="ref4">4</xref>], where data quality metrics are incrementally updated for large-scale datasets.</p>
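        <p>The sketch below illustrates this idea for a single numeric column: simple running aggregates (count, sum, min, max) can be merged batch by batch without reprocessing old training data. The class and field names are illustrative assumptions, not an implemented component.</p>
        <p>from dataclasses import dataclass

@dataclass
class RunningNumericStats:
    """Incrementally maintained statistics for one numeric column."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, values: list[float]) -> None:
        # Merge a new batch of values into the running aggregates.
        for v in values:
            self.count += 1
            self.total += v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

# Usage: update the stored profile whenever the training data grows.
temperature_stats = RunningNumericStats()
temperature_stats.update([12.3, 18.9, -1.4])
temperature_stats.update([25.0, 30.2])
print(temperature_stats.mean, temperature_stats.minimum, temperature_stats.maximum)</p>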
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data assertions</title>
        <p>In model serving, we differentiate between two cases: online and batch inference. Online inference means that the input is immediately passed to the model, which then outputs a prediction. Thus, the serving data can be seen as unbounded. In batch inference, on the other hand, input data is accumulated into batches. Those batches are then passed to a model and inference is performed on all the collected data. This differentiation is important as these types can require highly different processing techniques.</p>
        <p>We tackle challenges regarding the training-serving skew, which is an issue in production ML. Detecting such differences between training and serving data is crucial to avoid performance decreases. In batch settings, one way to address this problem could be to compare data from the training phase with the serving dataset. Assuming data profiles as described in Section 3.1 are available, this comparison can be done by finding differences between the baseline data profile (i.e., the data profile of the training data) and the serving data profile.</p>
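        <p>For the batch case, such a comparison could look like the following sketch, which flags numeric columns whose serving statistics deviate from the baseline by more than a relative tolerance. The profile layout matches the profiling sketch in Section 3.1, and the tolerance value is an arbitrary placeholder.</p>
        <p>def compare_profiles(baseline: dict, serving: dict, tolerance: float = 0.2) -> list[str]:
    """Return warnings for numeric columns whose serving statistics deviate from the baseline."""
    warnings = []
    for col, base_stats in baseline["numeric"].items():
        if col not in serving["numeric"]:
            warnings.append(f"column '{col}' missing in serving data")
            continue
        serve_stats = serving["numeric"][col]
        for stat in ("mean", "std"):
            base_val, serve_val = base_stats[stat], serve_stats[stat]
            # Relative deviation of the serving statistic from the training baseline.
            denom = abs(base_val) if base_val != 0 else 1.0
            if abs(serve_val - base_val) / denom > tolerance:
                warnings.append(
                    f"column '{col}': {stat} changed from {base_val:.2f} to {serve_val:.2f}"
                )
    return warnings</p>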
        <p>For unbounded data, the approach of building data profiles for input data is not feasible. As data profiles are meant to be summaries of datasets, they do not serve as a good means to describe unbounded data. Therefore, for this case, our approach to detect changes is to derive so-called data assertions from the data profiles that were collected in the training process. These data assertions are intended to ensure that incoming serving data does not lead to errors in the inference pipeline. Thus, they are meant to identify data which is significantly different from training data. If an assertion is violated, the monitoring system gives a warning.</p>
        <p>Consider an application where the goal is to predict whether there will be a storm on the next day based on weather data from the current day. One column in this example would be the temperature. A data assertion for this column could look like the one depicted in Listing 1. In this example, the temperature is guaranteed to be a float and to have a value between -5.2 and 36.7.</p>
        <p>{
  "column_assertion": {
    "name": "temperature",
    "dtype": "float",
    "lower_bound": -5.2,
    "upper_bound": 36.7
  }
}</p>
        <p>Listing 1: Example for a data assertion (planned)</p>
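        <p>The following sketch shows how such an assertion could be derived from a stored data profile and checked against an incoming serving record. The dictionary layout mirrors Listing 1 and the profiling sketch in Section 3.1, and returning plain warning strings is a stand-in for the alerting of the monitoring system.</p>
        <p>def derive_column_assertion(profile: dict, column: str) -> dict:
    """Turn the profiled min/max of a numeric column into a simple range assertion."""
    stats = profile["numeric"][column]
    return {
        "column_assertion": {
            "name": column,
            "dtype": "float",
            "lower_bound": stats["min"],
            "upper_bound": stats["max"],
        }
    }

def check_record(record: dict, assertion: dict) -> list[str]:
    """Validate one serving record against a column assertion and report violations."""
    spec = assertion["column_assertion"]
    value = record.get(spec["name"])
    violations = []
    if value is None:
        violations.append(f"missing column '{spec['name']}'")
    elif not isinstance(value, (int, float)):
        violations.append(f"'{spec['name']}' is not numeric: {value!r}")
    elif spec["lower_bound"] > value or value > spec["upper_bound"]:
        violations.append(
            f"'{spec['name']}' = {value} outside [{spec['lower_bound']}, {spec['upper_bound']}]"
        )
    return violations

# Example: a serving record with an unusually high temperature triggers a warning.
assertion = {"column_assertion": {"name": "temperature", "dtype": "float",
                                  "lower_bound": -5.2, "upper_bound": 36.7}}
print(check_record({"temperature": 41.3}, assertion))</p>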
        <p>In the context of data assertions, we identified two research challenges. The first one is the design of the data assertions. Listing 1 only shows a rough idea for the design of one specific example. Data assertions can be thought of as a kind of constraint. Our goal is to define these constraints in a way that data violating them is likely to cause problems in the pipeline, ranging from pipeline-breaking errors to deviations in data distribution which often lead to worse model outputs. For this, we first have to define which errors can occur in the pipeline. Then we can classify which errors are affecting the pipeline and which are not.</p>
        <p>We start with “simple” assertions, i.e., data assertions based on simple data statistics like the min and the max. An example for such a data assertion was introduced prior in Listing 1. These data assertions are similar to constraints as implemented in Deequ [<xref ref-type="bibr" rid="ref4">4</xref>] or other data validation tools. We then create a taxonomy of those data assertions, i.e., find suitable assertion types. For instance, on a high level, assertion types could be divided into structural and semantic. Also, assertions can be on feature- or table-level. Table-level assertions include constraints which model relationships between features. In the next step, we plan to cluster those constraints according to what error types are produced when data violates them. Here, we differentiate between changes breaking the preprocessing pipeline and changes influencing the model outcome (also over time). Therefore, we also incorporate the model into the evaluation. This way we can also measure whether specific data assertion violations tend to have a similar impact on the downstream model. A research question in this context is to find approaches on how to evaluate the effects of assertion violations on the model. An idea is to use metrics which measure the feature importance [<xref ref-type="bibr" rid="ref17">17</xref>].</p>
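        <p>To make the planned taxonomy more concrete, the following sketch encodes the two dimensions mentioned above (structural vs. semantic, feature- vs. table-level) as metadata attached to each assertion. The class and field names are illustrative assumptions.</p>
        <p>from dataclasses import dataclass
from enum import Enum

class AssertionKind(Enum):
    STRUCTURAL = "structural"  # e.g., column exists, dtype matches
    SEMANTIC = "semantic"      # e.g., value range, frequency distribution

class AssertionLevel(Enum):
    FEATURE = "feature"  # constraint on a single column
    TABLE = "table"      # constraint over relationships between columns

@dataclass
class DataAssertion:
    name: str
    kind: AssertionKind
    level: AssertionLevel
    spec: dict  # concrete parameters of the constraint

temperature_range = DataAssertion(
    name="temperature_range",
    kind=AssertionKind.SEMANTIC,
    level=AssertionLevel.FEATURE,
    spec={"column": "temperature", "lower_bound": -5.2, "upper_bound": 36.7},
)</p>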
        <p>The outcome of this process is a set of rules which model the relationship between a data assertion violation and its impact on the pipeline. These can be used to build a classification model. The independent variables are attributes describing the data assertion and the dependent variable is the impact on the pipeline. This model in turn can be used to improve the assertions by prioritizing assertions with a higher predicted impact and neglecting constraints with low estimated impact.</p>
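        <p>Such a classification model could be set up as in the following sketch, where each observed assertion violation is described by a few attributes and labelled with the impact it had on the pipeline. The chosen features, labels, and the decision tree classifier are illustrative assumptions, not evaluated results.</p>
        <p>from sklearn.tree import DecisionTreeClassifier

# Each row describes one observed assertion violation:
# [is_structural, is_table_level, relative_deviation, feature_importance_of_column]
X = [
    [1, 0, 0.00, 0.35],
    [0, 0, 0.15, 0.05],
    [0, 1, 0.60, 0.40],
    [0, 0, 0.05, 0.02],
]
# Label: observed impact class (0 = none, 1 = degraded model output, 2 = pipeline failure).
y = [2, 0, 1, 0]

impact_model = DecisionTreeClassifier(max_depth=3, random_state=0)
impact_model.fit(X, y)

# Predict the impact of a new violation to decide whether an alert should be raised.
new_violation = [[0, 0, 0.45, 0.30]]
print(impact_model.predict(new_violation))</p>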
        <p>An important part of the concept of data assertions is that users do not have to define them all by hand. Rather, we plan to use a semi-automatic approach which is illustrated in Figure 4. The monitoring system suggests assertions based on the data profiles of the training process to the data scientist. The data scientist can then accept or reject the assertion. They can also edit the assertion. For the example in which a constraint is generated for the column temperature, they could change the upper bound to 40 – if they know such temperatures are realistic in the observed area. This user feedback can also be incorporated into the data assertion generation process, enabling better constraints which produce fewer false alarms.</p>
        <p>Figure 4: Semi-automatic data assertion generation (metadata collection, data assertion generation, refinement, and deployment, with user feedback flowing back into the generation step)</p>
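        <p>The suggestion and feedback step could, for instance, look like the following sketch, in which suggested assertions are presented to the data scientist, who accepts, edits, or rejects them. The command-line input handling is a stand-in for a real review interface.</p>
        <p>def review_assertions(suggested: list[dict]) -> list[dict]:
    """Let a data scientist accept, edit, or reject suggested assertions."""
    accepted = []
    for assertion in suggested:
        spec = assertion["column_assertion"]
        answer = input(f"Accept assertion for '{spec['name']}' "
                       f"[{spec['lower_bound']}, {spec['upper_bound']}]? (y/n/edit) ")
        if answer == "y":
            accepted.append(assertion)
        elif answer == "edit":
            # e.g., the user raises the upper bound to 40 because such temperatures are realistic.
            spec["upper_bound"] = float(input("new upper bound: "))
            accepted.append(assertion)
        # Rejected assertions are dropped; the decision can be fed back into generation.
    return accepted</p>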
        <p>In a last step, the data assertions created before have to be evaluated. Several datasets used for benchmarking in machine learning research can be used. However, we will also allow a parametrization of various attributes of the data, e.g., schema or column contents. Variance in data can therefore be controlled and data with different properties can be tested. In the evaluation, the input data is first split into “training” data, with which the data assertion generation is executed, and “test” data, with which the accuracy of these data assertions is then evaluated. The evaluation is separated into two steps: (i) verification of the semantic correctness, i.e., does the generated data assertion hold for the test data, and (ii) comparison of the predicted impact with the actual impact on the model.</p>
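        <p>A first sketch of the verification step (i) is given below: a range assertion is generated from the training split and its violation rate is measured on the held-out test split. The synthetic temperature column and the simple min/max assertion are illustrative assumptions; the comparison of predicted and actual model impact is omitted here.</p>
        <p>import pandas as pd

def evaluate_range_assertion(df: pd.DataFrame, column: str, train_fraction: float = 0.8) -> float:
    """Generate a min/max assertion on a training split and measure its violation rate on a test split."""
    split = int(len(df) * train_fraction)
    train, test = df.iloc[:split], df.iloc[split:]

    lower, upper = train[column].min(), train[column].max()  # assertion generation
    violated = ((lower > test[column]) | (test[column] > upper)).sum()  # semantic check
    return violated / len(test) if len(test) else 0.0

# Example with a synthetic temperature column.
data = pd.DataFrame({"temperature": [12.1, 18.4, -3.0, 25.7, 31.2, 36.9, 40.5, 22.0, 15.5, 9.8]})
print(evaluate_range_assertion(data, "temperature"))</p>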
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Validating data for machine learning applications is a task with many challenges yet to solve. We focus on the validation of data which is updated continuously and on the minimization of false positives. Therefore, this paper proposes a data monitoring system which incorporates a comprehensive collection of metadata in machine learning pipelines and the creation of data constraints we call data assertions. The assertion building process is aware of the downstream model and measures the influence of data variance on the model outcome. In our next steps, we will implement the metadata collection component to collect comprehensive metadata not only on the data, but also on the model. This metadata serves as input for the data assertion generation process. We presented research challenges we identified both in metadata collection and in data assertion generation. We also discussed how the generated data assertions can be evaluated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zha</surname>
          </string-name>
          , et al.,
          <source>Data-centric artificial intelligence: A survey</source>
          ,
          <year>2023</year>
          . arXiv:2303.10158.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          , et al.,
          <article-title>Data management challenges in production machine learning</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM International Conference on Management of Data, Association for Computing Machinery</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          , et al.,
          <article-title>Data validation for machine learning</article-title>
          ,
          <source>in: Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          , et al.,
          <article-title>Automating large-scale data quality verification</article-title>
          ,
          <source>in: Proc. VLDB Endow</source>
          ., volume
          <volume>11</volume>
          ,
          VLDB Endowment
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>What can data-centric AI learn from data and ML engineering?</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.06439.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          , et al.,
          <article-title>Declarative metadata management: A missing piece in end-to-end machine learning</article-title>
          ,
          <source>in: Proceedings of SysML</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , et al.,
          <article-title>Accelerating the machine learning lifecycle with mlflow</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>41</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Redyuk</surname>
          </string-name>
          , et al.,
          <article-title>Automating data quality validation for dynamic data ingestion</article-title>
          ,
          <source>in: International Conference on Extending Database Technology</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          , et al.,
          <article-title>Data distribution debugging in machine learning pipelines</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>31</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kreuzberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hirschl</surname>
          </string-name>
          ,
          <article-title>Machine learning operations (mlops): Overview, definition, and architecture</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          , et al.,
          <article-title>Model assertions for monitoring and improving ml models</article-title>
          ,
          <source>in: Proceedings of Machine Learning and Systems</source>
          , volume
          <volume>2</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <article-title>Towards observability for production machine learning pipelines</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>15</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Profiling relational data: a survey</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>24</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Epperson</surname>
          </string-name>
          , et al.,
          <article-title>Dead or alive: Continuous data profiling for interactive data science</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>30</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Clemente</surname>
          </string-name>
          , et al.,
          <article-title>ydata-profiling: Accelerating data-centric ai with high-quality data</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>554</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Murta</surname>
          </string-name>
          , et al.,
          <article-title>noworkflow: Capturing and analyzing provenance of scripts</article-title>
          ,
          <source>in: Provenance and Annotation of Data and Processes</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saarela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <article-title>Comparison of feature importance measures as explanations for classification models</article-title>
          ,
          <source>SN Applied Sciences</source>
          <volume>3</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>