<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Concept Drift Detection in Machine Learning Systems by Exploiting Relaxed Functional Dependencies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Caruccio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Cirillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Polese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Stanzione</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II, 132, Fisciano (SA), 84084</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Although Machine Learning approaches rely entirely on data to train predictive models, such data can dynamically evolve over time. This evolution can make predictive models outdated due to possible data shifts, with a consequent decrease in prediction accuracy. Concept drift detection techniques aim to detect such shifts in order to adopt countermeasures and maintain predictive performance over time. To this end, drift detection methods monitor data distribution shifts, trying to identify changes without evaluating model predictions. In this discussion paper, we present a profiling metadata-driven approach for quantifying concept drift. Specifically, we focus on Relaxed Functional Dependencies (rfds) and formalize the relationship between changes in metadata and performance trends of the predictive models over time. Moreover, we define a suite of rfd-based metrics measuring the distance between two sets of data. To evaluate the proposed approach, we compared it with other distribution-based metrics on datasets with both known and unknown drift. Results proved that the proposed metrics are strongly correlated with the model's performance according to their trends. Moreover, the defined suite of metrics is also able to capture concept drift more effectively than traditional distribution-based approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Profiling</kwd>
        <kwd>Relaxed Functional Dependencies</kwd>
        <kwd>Concept Drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine Learning (ML) models are increasingly relied upon for a multitude of tasks, including
critical ones such as anomaly detection [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], where inaccurate predictions can lead to
potentially severe consequences. After deployment, ML models may initially exhibit robust
performance, but, as time progresses, the underlying assumptions may no longer hold, leading to
wrong predictions. The main reason for model degradation is concept drift, a phenomenon that
refers to changes in the underlying function that generates data. Aiming at detecting such shifts,
several methods monitor the model’s prediction performance, while others analyze how data
distribution changes. Some of them rely on qualitative descriptors like “abrupt” and “gradual”,
which have been shown to have limitations due to their dependence on arbitrary boundaries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
leading to the necessity of estimating the drift magnitude by means of quantitative measures.
However, while data distribution-based approaches have the advantage of not requiring an
analysis of model predictions, they are more prone to false positives [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, existing
approaches can only capture changes in the individual attribute distributions. Thus, new
strategies leveraging new types of properties in the data should be investigated. To this end,
valuable properties could be extracted through Data Profiling techniques, which enable the
discovery of a wide variety of metadata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], including Relaxed Functional Dependencies (rfds).
This discussion paper presents the concept drift detection approach proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which
analyzes the change of rfds to quantify data shifts in supervised ML settings. Specifically, we
defined a suite of rfd-based metrics to quantify the divergence between the training data and
a set of new samples that the model has to process. Moreover, we provided other rfd-based
metrics inspired by ML measures, with the aim of capturing the performance trend of the
monitored model. We evaluated the proposed metrics on datasets with Known and Unknown
drift, studying how their trend is correlated to the performance of the model over time. A strong
correlation would prove that analyzing rfd evolution can provide meaningful insights about
concept drift without evaluating the model predictions. We also compared the proposed metrics
with existing distribution-based measures.
      </p>
      <sec id="sec-1-1">
        <title>2. RFDs and Concept Drift</title>
        <p>Profiling Metadata. Functional Dependencies (fds) describe relationships among two sets
of attributes X and Y. Formally, an fd X → Y (X implies Y) is satisfied if and only if, for
every pair of tuples (t1, t2), whenever t1[X] = t2[X], then t1[Y] = t2[Y]. The attribute
set X = A1, A2, . . . , Ah represents the Left Hand Side (LHS) of the fd, whereas the set
Y = B1, B2, . . . , Bk is the Right Hand Side (RHS).</p>
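        <p>To make the definition concrete, the fd check can be sketched as follows (a minimal illustration with hypothetical helper and attribute names, not code from the paper):</p>
        <preformat>
```python
from itertools import combinations

def holds_fd(tuples, lhs, rhs):
    """Return True if the fd lhs -> rhs holds on a list of dict-shaped tuples:
    whenever two tuples agree on every lhs attribute, they agree on rhs too."""
    for t1, t2 in combinations(tuples, 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[b] == t2[b] for b in rhs)
        if same_lhs and not same_rhs:
            return False
    return True

rows = [
    {"Model": "Clio", "Year": 2019, "Price": 9000},
    {"Model": "Clio", "Year": 2019, "Price": 9000},
    {"Model": "Panda", "Year": 2015, "Price": 5000},
]
print(holds_fd(rows, ["Model", "Year"], ["Price"]))  # True
```
        </preformat>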
        <p>The definition of fd has recently been extended to address challenges associated with
inaccurate real-world data, leading to Relaxed Functional Dependencies (rfds). The latter admit a
limited number of violations (rfds relaxing on the extent) and/or the usage of
similarity/distance functions as matching operators (rfds relaxing on the attribute comparison).
In this paper, we leverage only the latter. Formally, given an instance r of a relation
schema R, a constraint c, over an attribute A ∈ attr(R), is a predicate f(t1[A], t2[A]) ⊗ α, where
f is a similarity (or distance) function, ⊗ a comparison operator, and α a threshold. A specific
similarity/distance function is applied according to the nature of the attributes.
Definition 1 (rfd). Given a relation schema R, an rfd σ is denoted as Φ_X → Φ_Y,
where X = A1, A2, ..., Ah and Y = B1, B2, ..., Bk, with X, Y ⊆ attr(R) and X ∩ Y = ∅;
and Φ_X = ⋀_{A∈X} c[A] (Φ_Y = ⋀_{B∈Y} c[B], resp.), with c[A] (c[B], resp.) a
similarity/distance constraint on A (B, resp.). Thus,
given an instance r of R, we can state that r satisfies the rfd σ (i.e., r ⊨ σ) if and only if, for
every pair of tuples (t1, t2) ∈ r, if Φ_X is true then Φ_Y also returns true.</p>
        <p>For the sake of simplicity and without loss of generality, in what follows we consider rfds
with a single attribute on the RHS, i.e., Φ_X → B_β, with a single constraint on the RHS attribute B.
Moreover, we will refer to constraints defined through distance functions, with the comparison
operator ≤ and the associated threshold.</p>
        <p>For example, consider the tuples ranging from t0 to t6 of the dataset snippet shown in
Figure 1a; then, an example of a holding rfd is σ: Model≤4, Year≤1 → Price≤300, denoting that
whenever two tuples have similar values on Model and Year, then they also have a similar Price.</p>
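        <p>The relaxed check differs from the exact one only in that equality is replaced by thresholded distances. A minimal sketch, assuming numeric attributes compared via absolute difference (an illustrative choice; the paper applies a function suited to each attribute's nature):</p>
        <preformat>
```python
from itertools import combinations

def satisfies_rfd(tuples, lhs, rhs_attr, rhs_thr):
    """lhs maps each attribute to its distance threshold; the single RHS
    constraint is (rhs_attr, rhs_thr). Distances are absolute differences."""
    for t1, t2 in combinations(tuples, 2):
        # the LHS matches when no attribute exceeds its threshold
        lhs_match = not any(abs(t1[a] - t2[a]) > thr for a, thr in lhs.items())
        if lhs_match and abs(t1[rhs_attr] - t2[rhs_attr]) > rhs_thr:
            return False
    return True

cars = [
    {"Year": 2019, "Price": 9000},
    {"Year": 2020, "Price": 9200},  # similar Year, similar Price
    {"Year": 2010, "Price": 3000},
]
print(satisfies_rfd(cars, {"Year": 1}, "Price", 300))  # True
```
        </preformat>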
        <p>[Figure 1: (a) a snippet of a used-car dataset with attributes Model, Year, #Owners, and Price (tuples t0-t7); (b) two example sets of rfds, Σ = {σ0, ..., σ6} holding before the data change and Σ′ = {σ′0, ..., σ′4} holding afterwards.]</p>
        <p>The minimality property ensures that an rfd no longer holds after either (i) increasing one
or more thresholds on the LHS, (ii) reducing the LHS, or (iii) decreasing the RHS threshold.</p>
        <p>
          To leverage rfds in real-world scenarios, it is necessary to exploit discovery algorithms to
automatically infer the set of minimal rfds holding on a given dataset [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ].
        </p>
        <p>Updating rfds over time. Real-world data evolve over time following insertions, deletions,
and updates, and rfds evolve accordingly. Specifically, after a tuple deletion, a given
rfd σ may be generalized by an rfd σ′ that has either (i) larger thresholds on the LHS and/or
a smaller threshold on the RHS, (ii) fewer attributes on the LHS, or (iii) both. Instead, after a
tuple insertion, a given rfd σ can be specialized by one or more rfds σ″ that have either (i)
smaller thresholds on the LHS and/or higher thresholds on the RHS, (ii) additional attributes on
the LHS, or (iii) both. Notice that an update can be seen as a deletion followed by an insertion.
Consider the tuples ranging from t0 to t6 shown in Figure 1a, and suppose that t1 gets
deleted. The rfd σ: Model≤4, Year≤1 → Price≤300 still holds, but it is no longer minimal, since
σ′: Year≤1 → Price≤300 is also valid. On the other hand, suppose that tuple t7 is inserted;
then the rfd σ: Model≤4, Year≤1 → Price≤300 is no longer valid, since the pairs (t2, t7) and
(t4, t7) violate σ. However, σ″: Model≤4, Year≤1, #Owners≤1 → Price≤300 holds on the updated
dataset. In this study, we formalize how to exploit rfd evolution to quantify concept drift.</p>
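        <p>The specialization relation used above can be tested mechanically. A sketch, assuming an rfd is encoded as a (LHS-thresholds, RHS-attribute, RHS-threshold) triple (a hypothetical encoding, not the paper's):</p>
        <preformat>
```python
def specializes(gen, spec):
    """True if rfd `spec` specializes rfd `gen`: same RHS attribute, every LHS
    attribute of `gen` kept with a threshold not increased, possibly extra LHS
    attributes, and an RHS threshold not decreased."""
    g_lhs, g_rhs_attr, g_rhs_thr = gen
    s_lhs, s_rhs_attr, s_rhs_thr = spec
    if g_rhs_attr != s_rhs_attr:
        return False
    if g_rhs_thr > s_rhs_thr:  # RHS threshold must not decrease
        return False
    for attr, thr in g_lhs.items():
        if attr not in s_lhs or s_lhs[attr] > thr:  # keep attr, do not loosen
            return False
    return True

# the worked example: adding #Owners to the LHS specializes the rfd
phi  = ({"Model": 4, "Year": 1}, "Price", 300)
phi2 = ({"Model": 4, "Year": 1, "#Owners": 1}, "Price", 300)
print(specializes(phi, phi2))  # True
```
        </preformat>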
        <p>
          Relating rfds to Concept Drift. Consider a supervised ML setting, in which each instance
is composed of a feature vector X and a target y. Concept drift is a change in the joint distribution
P(X, y) between two time instants t and t + Δ, i.e., Pₜ(X, y) ≠ Pₜ₊Δ(X, y) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Drifts can be
categorized according to the type of shift, e.g., Gradual drift represents a progressive evolution
from one concept to another, while Abrupt drift occurs when the transition is immediate.
Considering changes in data distribution is one of the most popular concept drift detection
techniques, but drift can also affect underlying relationships. To this end, we investigate whether
it is possible to quantify concept drift in terms of the divergence between two sets of rfds.
Let us consider the sets Σ and Σ′ of rfds holding on a relation instance r of R at two given
time instants t and t + 1, respectively. The analysis of how rfds change can be carried out from
two different perspectives: evaluating how much Σ has changed w.r.t. Σ′, and vice versa. In what
follows, we characterize all possible rfd evolutions according to these two scenarios.
Definition 2 (Shift from Σ to Σ′). To quantify the divergence between Σ and Σ′, it is necessary
to evaluate each rfd σ ∈ Σ to verify whether σ is somehow related to any rfd σ′ ∈ Σ′. Specifically,
∀σ ∈ Σ: (i) σ can also belong to Σ′; (ii) σ can be specialized by at least one σ′ ∈ Σ′; or (iii) σ can
neither belong to Σ′ nor be specialized by any σ′ ∈ Σ′, meaning that σ has been invalidated.</p>
        <p>[Figure 2: overview of the proposed approach. A training set undergoes Preprocessing and model training; rfd Discovery and rfd Filtering produce the rfd sets, which feed rfd Comparison and Drift Measuring on the incoming data processed by the ML model.]</p>
        <p>As an example, consider the sets of rfds Σ and Σ′ provided in Figure 1b. Considering Σ, we
can quantify its shift as follows: (i) σ2 does not change, since it also belongs to Σ′; (ii) 5 rfds
are specialized in Σ′ (σ0 and σ1 are specialized by σ′0, σ3 and σ4 by σ′2, and σ5 by σ′3); and (iii)
σ6 is invalidated, since it does not belong to Σ′ and there is no rfd in Σ′ specializing it.
Definition 3 (Shift from Σ′ to Σ). To quantify the divergence between Σ′ and Σ, it is necessary
to evaluate each rfd σ′ ∈ Σ′ to verify whether σ′ is somehow related to any rfd σ ∈ Σ. Specifically,
∀σ′ ∈ Σ′: (i) σ′ can also belong to Σ; (ii) σ′ can be generalized by at least one σ ∈ Σ; or (iii) σ′
can neither belong to Σ nor be generalized by any σ ∈ Σ, meaning that σ′ is a new rfd.
As an example, consider the sets of rfds Σ and Σ′ in Figure 1b. Considering Σ′, we can quantify
its shift as follows: (i) σ′1 does not change, since it also belongs to Σ; (ii) 3 rfds are generalized
in Σ (σ′0 is generalized by σ0 and σ1; σ′2 by σ3 and σ4; and σ′3 by σ5); and (iii) σ′4 is a new
rfd, since it does not belong to Σ and there is no rfd in Σ generalizing it.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Exploiting rfds for concept drift detection</title>
        <p>We introduce an approach to quantify concept drift in supervised ML settings. As shown in
Figure 2, it involves two main steps, denoted with black and red arrows. In the Initial step, the
ML model to be monitored is trained on the available data, while an rfd discovery
process extracts the set of holding rfds for each target label. In the Updating step, the deployed
ML model makes predictions on incoming data. Our approach entails periodic checks for
drift. This involves a new rfd discovery on an updated dataset combining the training data with
the predicted instances. For each class, the original rfd set is compared with the updated one
using a suite of rfd-based metrics to detect significant shifts that may require model retraining.
3.1. Collecting Meaningful rfds
Our approach comprises three main phases for collecting meaningful rfds: i) Preprocessing, ii)
rfd Discovery, and iii) rfd Filtering; as highlighted by the yellow box in Figure 2.</p>
        <p>
          The preprocessing prepares the dataset for rfd discovery through three main operations.
First, we leverage mutual information-based feature selection [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to retain the most relevant
attributes. Then, we arrange attributes with high variability into equivalence classes to group
similar values [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Finally, the dataset is partitioned into k subsets based on target labels.
        </p>
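        <p>As an illustration of the first preprocessing operation, the mutual information between a discrete feature and the target can be estimated from co-occurrence counts (a self-contained sketch; the feature-selection procedure of [12] may differ in detail):</p>
        <preformat>
```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between a discrete feature xs
    and the target ys, computed from joint and marginal counts."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * log((c * n) / (px[x] * py[y]))
    return mi

# toy example: the feature perfectly predicts the target -> MI = log(2)
print(round(mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"]), 3))  # 0.693
```
        </preformat>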
        <p>
          After preprocessing, each of the k subsets is given as input to an rfd discovery algorithm.
We leverage domino [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which also infers the distance constraints. The output of this phase
consists of k sets of rfds, i.e., Σᵢ, i = 1, 2, . . . , k, where k is the number of target labels.
        </p>
        <p>The third phase filters discovered rfds, aiming at retaining, for each target label, only its
most representative rfds, i.e., those that distinguish it most from the others. Specifically, we
remove from each set Σᵢ all rfds that also belong to other sets Σⱼ (j ≠ i). Then, we further filter
rfds by retaining, for each label, only rfds that are minimal w.r.t. all the rfds discovered on
the other labels. This ensures that the rfds of each class are unique and not related to those
discovered for the other classes. The output of this phase consists of the updated k rfd sets.</p>
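        <p>The first filtering rule (discarding rfds shared across labels) can be sketched as plain set operations, assuming rfds are represented by hashable identifiers (the further cross-label minimality filter is omitted here):</p>
        <preformat>
```python
def drop_shared(rfd_sets):
    """Remove from each label's rfd set those rfds that also appear under
    any other label, keeping only label-distinctive rfds."""
    filtered = {}
    for label, rfds in rfd_sets.items():
        others = set().union(*(s for l, s in rfd_sets.items() if l != label))
        filtered[label] = rfds - others
    return filtered

sets = {"A": {"s0", "s1", "s7"}, "B": {"s7", "s8", "s9"}}
print({label: sorted(rfds) for label, rfds in drop_shared(sets).items()})
```
        </preformat>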
        <p>For example, consider a scenario with two target labels A and B, and suppose that the
discovery process provides the following rfds:
• Σ_A = Σ ∪ {σ7}, where Σ is shown in Figure 1b and σ7: Year≤2, #Owners≤1 → Model≤3;
• Σ_B = {σ7, σ8, σ9}, where σ8: Model≤2, Price≤300 → #Owners≤0 and σ9: Year≤3 → Model≤2.
Notice that the rfd σ7 is shared between the two sets Σ_A and Σ_B, and that σ8 ∈ Σ_B is not
minimal with respect to σ5 ∈ Σ_A. Consequently, after the application of the filtering strategy,
the sets become Σ_A = {σ0, σ1, σ2, σ3, σ4, σ5, σ6} and Σ_B = {σ9}.
3.2. Evaluating Drift through rfds</p>
        <p>Starting from the original rfd sets Σᵢ and the updated ones Σ′ᵢ (with i = 1, 2, . . . , k), the
drift evaluation can be performed. The latter consists of two main phases: i) rfd Comparison
and ii) Measuring; as highlighted by the green box in Figure 2.</p>
        <p>rfd Comparison. This phase compares the original rfd set Σᵢ and the updated one
Σ′ᵢ. This comparison admits different interpretations, i.e., from Σᵢ to Σ′ᵢ and vice versa,
yielding different types of changes. Independently from the direction, there can be a certain
number of rfds that appear in both sets, namely Imm. Instead, if the comparison is from Σᵢ
to Σ′ᵢ, then it is possible to quantify the number of rfds in Σᵢ that are: (i) specialized in Σ′ᵢ,
namely Spec; (ii) specialized in Σ′ᵢ by adding LHS attributes; (iii) specialized in Σ′ᵢ by varying
thresholds only; or (iv) invalidated, i.e., neither present nor specialized in Σ′ᵢ, namely Inv.
As an example, consider the two sets Σ and Σ′ shown in Figure 1b, which can be denoted as Σᵢ
and Σ′ᵢ since they are associated with a single label. Thus, there is just
one Imm rfd, i.e., σ2, and one Inv rfd, i.e., σ6. Among the five Spec rfds σ0, σ1, σ3, σ4, and
σ5, only the latter, compared with σ′3, is driven by a simple variation of the RHS threshold.
If the comparison is from Σ′ᵢ to Σᵢ, it is possible to quantify the rfds in Σ′ᵢ that are:
(i) generalized in Σᵢ, namely Gen; (ii) generalized in Σᵢ by removing LHS attributes; (iii)
generalized in Σᵢ by varying thresholds only; or (iv) New.</p>
        <p>As an example, consider the sets of rfds in Figure 1b. There is
just one Imm rfd, i.e., σ′1, and one New rfd, i.e., σ′4. Moreover, among the three Gen rfds
σ′0, σ′2, and σ′3, only the latter, compared with σ5, is driven by a variation of the RHS threshold.
Overall, the quantitative information provided by the several comparison criteria can be used
to define different metrics to measure a possible drift in the data.</p>
        <p>Measuring. After the comparison phase, the proposed approach can measure the shift in
terms of rfds according to a suite of rfd-based metrics. Among them, some evaluate the
magnitude of the change of Σᵢ with respect to Σ′ᵢ, while others consider the opposite direction.
Specifically, we defined two categories of metrics: 12 metrics to quantify the divergence between
two sets of rfds, and 7 metrics inspired by ML, following a confusion matrix-based evaluation.
To accurately estimate the severity of changes, the former metrics use coefficients to weight
different types of rfd evolution, assigning greater importance to more substantial changes (e.g.,
invalidations and new rfds) w.r.t. moderate ones (e.g., specializations and generalizations).
As an example, consider the sets of rfds Σ and Σ′ in Figure 1b, which can also be denoted
as Σᵢ and Σ′ᵢ since they are associated with a single label. By considering the definition of a
divergence metric, namely D5, we can quantify drift from Σᵢ to Σ′ᵢ as follows:
D5 = (Inv + ((Spec − S) × 0.5) + (S × 0.05)) / |Σᵢ| = (1 + ((5 − 1) × 0.5) + (1 × 0.05)) / 7 = 0.44
where S denotes the number of rfds specialized by others with the same LHS.
The second category of metrics is inspired by the confusion matrix commonly employed for ML
evaluation. In particular, we adapted the concepts of True∖False Positives and True∖False
Negatives to investigate whether this (re-)interpretation aligns with the model's actual performance
trend. For instance, consider one of these metrics (i.e., C1), through which we identify True
Positives as the number of rfds that were in Σᵢ and are still in Σ′ᵢ, and False Negatives as the
rfds that were in Σᵢ but not in Σ′ᵢ. Instead, False Positives represent rfds that were not in
Σᵢ but are in Σ′ᵢ (i.e., new rfds). From this interpretation, the F1-Measure can be computed. In
general, lower values indicate a larger change between the two sets of rfds.</p>
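        <p>Both kinds of metrics reduce to a few arithmetic operations on the comparison counts. A sketch, with the names D5 and the F1-style metric used as placeholders for the paper's identifiers:</p>
        <preformat>
```python
def d5(inv, spec, same_lhs_spec, total):
    """Divergence-style metric (reconstructed from the worked example):
    invalidations weigh 1.0, specializations 0.5, except specializations
    that only vary thresholds on the same LHS, which weigh 0.05."""
    return (inv + (spec - same_lhs_spec) * 0.5 + same_lhs_spec * 0.05) / total

def f1_drift(old_rfds, new_rfds):
    """Confusion-matrix view: TP = rfds surviving unchanged, FN = rfds that
    disappeared, FP = rfds appearing only in the new set. Lower F1 = larger shift."""
    tp = len(old_rfds.intersection(new_rfds))
    fn = len(old_rfds.difference(new_rfds))
    fp = len(new_rfds.difference(old_rfds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Figure 1b example: 1 invalidated, 5 specialized (1 keeping the same LHS), out of 7
print(round(d5(1, 5, 1, 7), 2))  # 0.44
print(round(f1_drift({"s0", "s1", "s2"}, {"s1", "s2", "s3"}), 3))  # 0.667
```
        </preformat>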
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Evaluation</title>
      <p>This section evaluates whether the proposed metrics exhibit trends more strongly correlated with the
model's performance than baseline methods. A higher correlation would imply that rfd-based
metrics more accurately capture drifts, providing reliable insights into the model's behavior.</p>
      <p>
        Baseline approaches. As compared techniques, we considered the Hellinger distance [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
recommended by [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ], and HiNormalizedComplement [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Since these measures quantify
drift for only a single attribute, we employed two aggregation strategies to obtain a single
distance: (i) the average of all per-attribute distances [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and (ii) the maximum
among all of them [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], yielding four baseline metrics in total.
      </p>
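      <p>For reference, the Hellinger distance on discrete distributions and the two aggregation strategies can be sketched as follows (an illustrative implementation, not the baselines' original code):</p>
      <preformat>
```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)

def aggregate(per_attribute_distances, how="avg"):
    """Collapse per-attribute drift scores into a single value, as in the baselines."""
    if how == "avg":
        return sum(per_attribute_distances) / len(per_attribute_distances)
    return max(per_attribute_distances)

d1 = hellinger([0.5, 0.5], [0.5, 0.5])  # identical distributions -> 0.0
d2 = hellinger([1.0, 0.0], [0.0, 1.0])  # disjoint distributions  -> 1.0
print(aggregate([d1, d2], "avg"), aggregate([d1, d2], "max"))  # 0.5 1.0
```
      </preformat>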
      <p>
        Experimental settings. The evaluation has been performed in two phases: the first considers
datasets with Known Drift, while the second considers datasets with Unknown Drift to simulate
real-world scenarios (see Figure 3a). The datasets with Known Drift can be divided into two
groups: IDs 1-9, namely Statistical Drift, which contains 9 configurations with drift affecting
the data distribution; and IDs 10-18, namely Attribute-relationship Drift, which considers 3
datasets and their synthetically generated drifted versions, obtained by independently shuffling
the values of the number of columns reported in Figure 3a. The full set of comparison criteria,
metric definitions, and experimental results is provided in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>For the scenario with Unknown Drift (IDs 19-24), we considered 5 classification datasets. As
shown in Figure 3a, for all dataset configurations, we randomly sampled a certain number of
rows to vary the data within each configuration; instead, we used all samples for the smaller
datasets (IDs 21-24). The experimental sessions required splitting each dataset into four batches,
corresponding to 25%, 45%, 70%, and 100% of its size, respectively. The first batch is used for
training a Random Forest model, which is then deployed for making predictions over the
remaining ones. To evaluate the proposed metrics and the baselines, we compared, for each class,
the correlation of their trend with the actual F1-Measure trend of the model. For rfd-based
divergences and the baselines, we expect a negative correlation, while, for the confusion
matrix-based metrics, we expect a positive one.</p>
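      <p>The batch construction can be sketched as cumulative slices of the dataset (hypothetical helper, assuming rows are already ordered by arrival time):</p>
      <preformat>
```python
def make_batches(rows, fractions=(0.25, 0.45, 0.70, 1.00)):
    """Cumulative batches at 25%, 45%, 70%, and 100% of the dataset size;
    the first batch is the training split, the rest are monitoring checkpoints."""
    n = len(rows)
    return [rows[: int(n * f)] for f in fractions]

batches = make_batches(list(range(100)))
print([len(b) for b in batches])  # [25, 45, 70, 100]
```
      </preformat>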
      <p>Experimental Results. Figure 3b shows the correlations obtained by the baseline approaches
and the top-4 rfd-based metrics. The latter include two divergence metrics, i.e., D5 and D7,
and two confusion matrix-based metrics, i.e., C2 and C3. For the latter, we consider the
inverted correlations to include all metrics in a single plot, as they showed positive values as
expected. In general, we can observe that rfd-based metrics achieved stronger correlations
than baseline approaches. More specifically, D5 and D7 recorded an average correlation of
−0.94 and −0.93, respectively, while C3 and C2 have an average correlation of 0.87 and
0.86, respectively. Concerning the baseline approaches, the best one achieved an average
correlation of −0.75, while the second-best achieved a slightly lower correlation (i.e., −0.73).
The remaining two performed significantly worse, with average correlations of −0.62 and
−0.59, respectively. By considering the configurations from ID 1 to 9, we expected the baseline
approaches to perform well, as these datasets contain changes in the statistical properties. In fact,
the Hellinger distance performed reasonably well: its two aggregations achieved average
correlations of −0.86 and −0.85. Despite these good results, the rfd-based metrics
outperformed them. In fact, D5 was the best metric, with an average correlation of −0.927,
followed by D7 (i.e., −0.926), C3 (i.e., 0.91), and C2 (i.e., 0.90). Instead, the
HiNormalizedComplement-based aggregations recorded the worst results, with correlations of
−0.71 and −0.64, respectively. For experiments with IDs 10-18, we artificially introduced drift
by altering multi-column relationships. This type of drift significantly affected the performance
of the baselines: the best of them achieved an average correlation of −0.65, two others recorded
a correlation of −0.61, and the worst obtained an average correlation of −0.48. Thus, their
trend was not aligned with the F1-Measure of the model. Instead, the rfd-based metrics showed
the best results: D5 and D7 achieved an average correlation of −0.95 and −0.94, respectively,
whereas C3 and C2 performed slightly worse (i.e., 0.82 and 0.81, resp.). Thus, we can conclude
that although confusion matrix-based metrics are able to obtain significant results, they are less
reliable than rfd-based divergences. Among the latter, D5 and D7 proved to be the most
effective, consistently maintaining strong correlations in all experiments. For experiments with
Unknown Drift (IDs 19-24), the rfd-based divergences D5 and D7 achieved the strongest
correlations in almost all experiments, confirming themselves as the best metrics we proposed,
with an average correlation of −0.946 and −0.948, respectively. As for confusion matrix-based
metrics, we observed a similar behavior w.r.t. previous scenarios: although often achieving
good correlations, they were less consistent, with a lower average outcome (i.e., 0.85 for C3
and 0.70 for C2). As for the baseline approaches, the two strongest recorded average correlations
of −0.79 and −0.74, while the other two recorded correlations of −0.70 and −0.65. However,
also in this case, D5, D7, and C3 were more accurate in quantifying drifts with respect to all
baseline approaches.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Works</title>
      <p>We investigated the potential of profiling metadata to quantify drifts within data evolving over
time. We introduced two categories of rfd-based metrics to measure the shift within data,
proposing rfd-based divergences and rfd confusion matrix-based metrics. We evaluated our
approach on datasets with both known and unknown drift, also comparing it with other
distribution-based measures. Results showed that the trend of rfd-based metrics is strongly
correlated with the F1-Measure of the model, and that they provide more reliable insights than
the compared baselines, especially when drift affects attribute relationships. In fact, one of the
strengths of this method lies in helping profile the drift and understand which data relationships
are changing. Moreover, the proposed approach does not require ground-truth labels for incoming
data during the monitoring process. In the future, we want to investigate other types of profiling
metadata to define a complete drift framework. A limitation of this work is that it
leverages static discovery algorithms, which also deal with a problem that is exponential in the
number of attributes. Future work could employ incremental discovery strategies [20, 21, 22]
to update rfds over time without reconsidering already processed data. This will require new
incremental discovery algorithms capable of inferring similarity/distance thresholds.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the PNRR MUR project PE0000013-FAIR (Future Artificial
Intelligence Research).</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[20] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, et al., Incremental discovery of functional
dependencies with a bit-vector algorithm, in: Proceedings of the 27th Italian Symposium
on Advanced Database Systems, 2019.
[21] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, Journal
of Data and Information Quality (JDIQ) 12 (2020) 1–25.
[22] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, IndiBits: Incremental discovery of
relaxed functional dependencies using bitwise similarity, in: Proceedings of the 2023 IEEE
39th International Conference on Data Engineering (ICDE), IEEE, 2023, pp. 1393–1405.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          , Codetect:
          <article-title>Financial fraud detection with anomaly feature detection</article-title>
          ,
          <source>IEEE Access 6</source>
          (
          <year>2018</year>
          )
          <fpage>19161</fpage>
          -
          <lpage>19174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sinaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Sungkono</surname>
          </string-name>
          ,
          <article-title>Anomaly detection in business processes using process mining and fuzzy association rule learning</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Hooshmand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hosahalli</surname>
          </string-name>
          ,
          <article-title>Network anomaly detection using deep learning techniques</article-title>
          ,
          <source>CAAI Transactions on Intelligence Technology</source>
          <volume>7</volume>
          (
          <year>2022</year>
          )
          <fpage>228</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Šabić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keeley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nannemann</surname>
          </string-name>
          ,
          <article-title>Healthcare and anomaly detection: using machine learning to predict anomalies in heart rate data</article-title>
          ,
          <source>AI &amp; SOCIETY</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hyde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petitjean</surname>
          </string-name>
          ,
          <article-title>Characterizing concept drift</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          <fpage>964</fpage>
          -
          <lpage>994</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bayram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kassler</surname>
          </string-name>
          ,
          <article-title>From concept drift to model degradation: An overview on performance-aware drift detectors</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>245</volume>
          (
          <year>2022</year>
          )
          <fpage>108632</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data profiling revisited</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>40</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caruccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cirillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Polese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stanzione</surname>
          </string-name>
          ,
          <article-title>An RFD-based approach for concept drift detection in machine learning systems</article-title>
          , to appear in:
          <source>Proceedings of the 25th International Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Karloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>On generating near-optimal tableaux for conditional functional dependencies</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <year>2008</year>
          )
          <fpage>376</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Efficient discovery of similarity constraints for matching dependencies</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>87</volume>
          (
          <year>2013</year>
          )
          <fpage>146</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caruccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Deufemia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Polese</surname>
          </string-name>
          ,
          <article-title>Mining relaxed functional dependencies from data</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>443</fpage>
          -
          <lpage>477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Kozachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Leonenko</surname>
          </string-name>
          ,
          <article-title>Sample estimate of the entropy of a random vector</article-title>
          ,
          <source>Problemy Peredachi Informatsii</source>
          <volume>23</volume>
          (
          <year>1987</year>
          )
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caruccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Deufemia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Polese</surname>
          </string-name>
          ,
          <article-title>Discovering relaxed functional dependencies based on multi-attribute dominance</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>3212</fpage>
          -
          <lpage>3228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hellinger</surname>
          </string-name>
          ,
          <article-title>Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen</article-title>
          ,
          <source>Journal für die reine und angewandte Mathematik</source>
          <volume>1909</volume>
          (
          <year>1909</year>
          )
          <fpage>210</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goldenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <article-title>Survey of distance measures for quantifying concept drift and shift in numeric data</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>60</volume>
          (
          <year>2019</year>
          )
          <fpage>591</fpage>
          -
          <lpage>615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ditzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Polikar</surname>
          </string-name>
          ,
          <article-title>Hellinger distance based drift detection for nonstationary environments</article-title>
          , in:
          <source>2011 IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE)</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goldenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <article-title>PCA-based drift and shift quantification framework for multidimensional data</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>62</volume>
          (
          <year>2020</year>
          )
          <fpage>2835</fpage>
          -
          <lpage>2854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Swain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Ballard</surname>
          </string-name>
          ,
          <article-title>Color indexing</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>7</volume>
          (
          <year>1991</year>
          )
          <fpage>11</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Qahtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A PCA-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams</article-title>
          , in:
          <source>Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>