Towards New Data Quality Rules for Modeling Data Change

Nishtha Sharma
Supervised by: Dr. Fei Chiang
McMaster University, Hamilton ON, Canada
sharmn99@mcmaster.ca (N. Sharma)

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Data is not static, and attribute value changes often trigger changes in another set of attributes. Traditional methods for analyzing data changes often treat these changes in isolation, failing to consider the broader context in which they occur. This lack of contextual awareness limits the ability to capture relationships between attributes or interpret their significance, especially when distinguishing between normal variations and potential anomalies. In this paper, we discuss the importance of context-awareness and the need to identify normal change behaviour. To achieve this, we introduce a new data quality rule, called a change rule, capable of capturing changes in both antecedent and consequent attributes within ordered tuples of a relational instance.

Keywords
Data Dependencies, Dynamic Data Dependencies, Change Exploration, Change Dependency

1. Introduction

Table 1
Example employee changes in position, salary.

tID  Year  Emp  Position          Salary   EmpMng
t1   2012  E1   Software Dev       65,000    0
t2   2013  E1   Software Dev       68,400    0
t3   2014  E1   Sr. Software Dev   82,000    4
t4   2015  E1   Sr. Software Dev   84,100    5
t5   2016  E1   Sr. Software Dev   86,700    5
t6   2017  E1   Lead Dev           96,700   25
t7   2018  E1   Lead Dev          105,000   25
t8   2019  E1   Manager           130,000  140
t9   2015  E2   Software Dev       64,500    0
t10  2016  E2   Software Dev       67,000    0
t11  2017  E2   Software Dev       69,200    0
t12  2018  E2   Software Dev       71,500    0
t13  2019  E2   Sr. Software Dev   80,000    2
t14  2020  E2   Sr. Software Dev   82,100    3
t15  2021  E2   Sr. Software Dev   84,000    3
t16  2022  E2   Sr. Software Dev   88,400    4
t17  2023  E2   Lead Dev           96,100   28

In real-world datasets, values rarely remain static as data continuously changes over time. These changes often carry critical information, revealing patterns, trends, and triggers that are essential for understanding environmental conditions, system and user behaviour, and trends. Existing database systems have limited functionality to manage changes and to identify abnormal changes, often relying on triggers to recognize out-of-bound changes. In this work, we consider changes to relational attributes for an entity.
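As a concrete illustration of this setting (our own sketch, not part of the paper's formalism), an entity's attribute history can be held as time-ordered tuples, with changes computed between consecutive tuples. The values below are E1's first three tuples from Table 1:

```python
# Hypothetical sketch: Table 1 as time-ordered tuples for employee E1.
# Each tuple holds the attribute values at one point in time.
e1 = [
    (2012, "Software Dev", 65_000),
    (2013, "Software Dev", 68_400),
    (2014, "Sr. Software Dev", 82_000),
]

# Consecutive-tuple changes: absolute and percentage salary change,
# plus whether the Position attribute changed at the same time.
for (y0, p0, s0), (y1, p1, s1) in zip(e1, e1[1:]):
    pct = 100 * (s1 - s0) / s0
    print(f"{y0}->{y1}: salary {s1 - s0:+,} ({pct:+.1f}%), position change: {p0 != p1}")
# 2012->2013: salary +3,400 (+5.2%), position change: False
# 2013->2014: salary +13,600 (+19.9%), position change: True
```

The +19.9% computed for E1's first promotion is the same percentage change discussed in Example 1; the point made below is that this number alone, stripped of context such as tenure or EmpMng, is not enough to judge whether the change is normal.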
To simplify our setting, attribute changes are modeled as a sequence of ordered tuples, implicitly with respect to time. Hence, a tuple represents the value each attribute holds for an entity at a specific point in time.

Data changes occur in numeric and non-numeric attributes. Changes to numeric attributes are often measured using absolute difference, percentage change, rate of change, or a rolling average [1]. While these metrics are easy to compute, they fail to capture the broader context of the change, such as the influence of related attributes or the significance of the change. For non-numeric attributes, changes are often measured using edit distances (Levenshtein, Jaro-Winkler, Hamming) [2] or set-based coefficients (Overlap, Jaccard, Dice) [3]. However, these metrics are insufficient because they ignore the semantic meaning of the changes and the context in which they occur. Context is critical because it provides the necessary information to interpret the significance of a change. Without context, changes are reduced to isolated events, which can lead to misleading interpretations of the data change.

Example 1. Table 1 shows two employees (Emp) E1 and E2 and their Position, Salary and number of employees managed (EmpMng) as of a specific Year. Consider the following changes and the need for greater context:

Numeric attribute value changes: As observed in tuples t1-t3 of Table 1, after only two years as a Software Developer, E1 was promoted to the position of Senior Software Developer, accompanied by a significant increase in salary ($68,400 to $82,000). In contrast, tuples t9-t13 show that E2 spent four years as a Software Developer before being promoted to Senior Software Developer with a similar salary increase as E1's (from $71,500 to $80,000). Changes in salary are typically quantified using percentage change (+19.9% for E1 and +11.9% for E2). While this provides a numerical summary of the change, it fails to account for the broader context. For instance, E1 received a larger raise after a shorter tenure and took on the responsibility of managing four employees, whereas E2 had to wait twice as long for a similar promotion and gained the responsibility of managing two fewer employees compared to E1.

Non-numeric changes within and between classes: Traditional edit distance metrics such as Levenshtein distance (LD) quantify changes based on character modifications. The transition from Software Developer to Senior Software Developer has an LD = 7, whereas for Senior Software Developer to Lead Developer, LD = 13. These values suggest that the latter change is almost twice as significant as the former, despite both changes being promotions to the next position within the same class (development roles), as shown in Figure 1.

The implications of a change can be much greater between different classes. For instance, the LD between Lead Developer and Manager is 11, which suggests that this transition is smaller than the transition from Senior Software Developer to Lead Developer (LD = 13). However, this interpretation is misleading. The change from Lead Developer to Manager represents a more significant career shift compared to the change from Senior Software Developer to Lead Developer (where both positions are in the same class), as it involves moving from a development role to a managerial position (2 levels up as per Figure 1), which is accompanied by a significant increase in the number of people managed. Existing distance measures fail to capture semantic interpretations of the data.

[Figure 1: Hierarchy of the company in Table 1]

Problem 1: The need for context. The example highlights that not all changes are equally significant. Context is often needed to interpret data change, and there is a need to augment existing distance metrics with context.

While identifying (significant) changes is important, it is equally critical to differentiate between normal changes vs. abnormal changes. Traditional methods have used declarative methods such as data dependencies of the form X → Y, where X, Y are attribute sets representing antecedent and consequent attributes. Order Dependencies (ODs) [4], Sequential Dependencies (SDs) [5], and Differential Dependencies (DDs) [6] specify expected relationships between attribute sets. ODs introduce ordering relationships but do not explicitly quantify changes in attribute values. SDs model consequent attribute changes but do not account for variations in the antecedent attributes. DDs, while addressing changes in both antecedent and consequent attributes, apply to unordered data.

Example 2. Consider a sequential dependency (SD) stating that when ordered by Position, the change in Salary between consecutive tuples should be between 5% and 20%. For E1, the SD is violated between t3 and t4, with a salary change of 2.6% falling below the range. It is also violated between t7 and t8, where the salary change (23.8%) exceeds the upper bound. These violations help in identifying abnormal changes. However, we also want to identify patterns where different changes in the antecedent attributes, such as changes within Position, will elicit different changes in the consequent (Salary). For instance, with no change in position, salary still changes annually by 2% to 10%. Whenever there is a promotion (change in position > 0), the salary always changes by 10% to 25%. The existing dependencies do not capture relationships of this form.

To address this, we define a data quality rule called a change rule. The change rule captures relationships between changes in attribute values of an ordered relational instance.

Problem 2: Differentiating normal vs. abnormal data change. Existing dependencies do not capture the dependence between changes from antecedent attributes to consequent attributes on ordered tuples. A declarative specification is needed that models the expected range of value change between attribute sets. We propose change rules to address this problem.

1.1. Challenges

• Context representation: Context helps to interpret the significance of data changes. Which attributes, and which subset of values, are used to provide this context? Is this context time-dependent? How are existing distance measures augmented to consider this context?

• Efficient rule mining: Manual specification of change rules is not practically feasible, and automated solutions are needed. Determining dependent sets of attributes is important towards identifying meaningful data change. Exhaustive enumeration of all attribute sets and their values is not feasible, and efficient methods to evaluate the large space of attribute sets are needed.

• Filtering spurious changes: Rule mining is known to produce spurious rules. Determining which changes are most relevant and defining (support) measures that filter less meaningful changes is necessary.

1.2. Contributions

We expect to make the following contributions.

• Context-aware change metric: A metric for quantifying changes in both numeric and non-numeric attributes, augmenting them with contextual information from related attributes.

• Change rules: A new rule that captures the relationship of changes from one attribute set X to another attribute set Y across ordered tuples.

• Change rule discovery algorithm: An efficient discovery algorithm for change rules over ordered datasets. The algorithm adapts the FastDD method to handle ordered data [7], and identifies changes in sequential attribute values using context-aware metrics.

2. Related Work

We discuss the relationship of our work to existing metrics, data dependencies, association rules, and statistical/ML approaches.

2.1. Similarity, Distance Metrics

Traditional numeric metrics analyze individual attributes in isolation, missing contextual relationships between changes in different attributes. Measures of central tendency (mean, median, mode) summarize values but can be skewed by outliers. Dispersion metrics (variance, standard deviation, IQR) capture data spread, but are also prone to outlier sensitivity. Shape distribution measures (skewness, kurtosis, CV) describe asymmetry and variability but can be biased when data is highly skewed or sparse [1].

For non-numeric (categorical, text) data, cosine similarity is commonly used. Cosine similarity measures the cosine of the angle between vectors [8] and is often used with embeddings to capture semantic similarity. Overlap, Jaccard, and Dice coefficients [3] are used to quantify the similarity and diversity of sets. Edit distances such as Levenshtein, Jaro-Winkler, and Hamming quantify the number of operations needed to transform one string into another [2].

While these metrics are widely used, they do not capture semantic distances. For example, "Software Developer" and "Senior Software Developer" have a high edit distance despite being closely related in meaning. Embedding-based approaches (e.g., BERT) address this by capturing contextual meaning but require pre-trained models and domain-specific tuning. An effective approach for measuring semantic similarity between non-numeric values is to compute cosine similarity on BERT embeddings, which allows for a context-aware representation of the data.

2.2. Data Dependencies

Order Dependencies (ODs) extend functional dependencies by enforcing ordering relationships [4]. They ensure that a positive change in the antecedent corresponds to a positive change in the consequent. However, the semantics of ODs do not declaratively capture the change in any attribute values.

Sequential Dependencies (SDs) declaratively specify the change in consequent attributes [5]. They enforce constraints on how the consequent changes in response to an instance ordered on the antecedent, i.e., when the instance is ordered on X, the changes in the consecutive Y-values will be within a range g. However, they fail to capture the change in the antecedent. Conditional SDs (CSDs) focus on identifying intervals within ordered data that satisfy a given SD. They prefer larger, contiguous intervals that capture a substantial portion of the data satisfying the embedded SD. However, the continuity of these intervals requires a trade-off with the specificity of the bound g, which is not addressed in the paper.

Differential Dependencies (DDs) model differences between any two tuples in a relation independent of the tuple ordering, i.e., if the antecedent attribute differences lie within a range g_x, then the consequent attribute value differences must lie within a range g_y [6]. By not capturing order, DDs miss critical contextual information like trends or patterns across consecutive tuples.

TSDDs [9], designed for time-series data, capture temporal relationships by treating data within a given time window as an ordered set and supporting real-valued function operations. However, similar to SDs, they do not account for changes in the antecedent attributes over time. Additionally, selecting an optimal time window remains a challenge, as an overly narrow window may overlook significant trends, while a broader one risks diluting the relevance of dependencies.

2.3. Association Rules

Association rules identify co-occurrences of items within a dataset, typically expressed in the form {A, B} → C, stating that if items A and B appear together, then C is likely to appear as well [10]. Unlike data dependencies, which enforce constraints that all instances must satisfy, association rules identify probabilistic relationships without guaranteeing consistency. Dependencies ensure structural integrity, while association rules uncover patterns that may not hold universally.

2.4. Statistical and Machine Learning Approaches

Statistical and machine learning approaches leverage patterns in historical data to identify deviations that fall outside expected behavior. Statistical methods rely on predefined thresholds and assumptions about data distribution, while machine learning approaches adapt to complex, high-dimensional datasets. Together, they offer complementary techniques for identifying and differentiating normal and abnormal changes in data.

Statistical techniques include rule-based thresholds and hypothesis testing. For example, Z-scores and modified Z-scores are commonly used to detect anomalies by measuring how far a data point deviates from the mean, relative to the standard deviation [1]. If a data point's z-score exceeds a certain threshold (e.g., 3), it may be flagged as abnormal. Similarly, control charts and statistical process control (SPC) methods monitor data streams over time, flagging points that fall outside control limits as potential anomalies [11].

Machine learning provides various techniques for distinguishing normal from abnormal changes in data, particularly through anomaly detection algorithms. Isolation Forest [12] isolates anomalies by partitioning the dataset into smaller subsets. Points that require fewer partitions to be isolated are identified as anomalies. This method works well in high-dimensional data but may struggle with datasets containing overlapping clusters or anomalies that are close to the decision boundary.

While numerous anomaly detection methods exist, our approach specifically targets anomalies in the change of attribute values. We achieve this by defining a change rule that not only identifies abnormal behavior but also captures the relationships between changes across multiple attributes.

3. Preliminaries

Let R be a relational schema on attributes A_1, A_2, ..., A_N, and let X and Y be sets of attributes such that X ⊆ R and Y ⊆ R. Let I = {t_1, t_2, ..., t_N} be a relational instance of R with N tuples, ordered on X (implicitly ordered on time). The distance between consecutive tuples in I for an attribute A is given via a context-aware distance measure dist(t_i[A], t_{i+1}[A]). We define a permissible range for dist as g_A = (p, q), where p, q are real values, i.e., if dist(t_i[A], t_{i+1}[A]) ∈ g_A, then p ≤ dist(t_i[A], t_{i+1}[A]) ≤ q.

We define a support function support(σ, I) that measures the relative strength of a change rule σ in I. Naturally, we seek high-support rules to ensure that they have sufficient evidence in the instance. We introduce change rules in the next section, and focus on their discovery (as part of Problem 2).

Problem Definition: Given a minimum support threshold θ, find all change rules Σ such that I satisfies Σ (I |= Σ), and for all σ ∈ Σ, support(σ, I) ≥ θ.
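To make these definitions concrete, the sketch below is our own illustration, not the paper's implementation: the function names are hypothetical, and support is taken, as a simplifying assumption, to be the fraction of consecutive-tuple pairs that satisfy the rule. It checks a single change rule over an instance ordered on time, using Levenshtein distance for the non-numeric antecedent and percentage change for the numeric consequent:

```python
# Sketch (our illustration): check one change rule sigma: X_gx -> Y_gy
# over an instance ordered on X, and compute a simple support score.

def lev(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def in_range(v, g):
    return g[0] <= v <= g[1]

def rule_support(instance, x, y, gx, gy, dist_x, dist_y):
    """instance: list of dicts, already ordered on X (here: by time).
    A consecutive pair satisfies the rule if the antecedent change falls
    outside g_x (rule not triggered) or the consequent change is in g_y."""
    pairs = list(zip(instance, instance[1:]))
    sat = sum(1 for t0, t1 in pairs
              if not in_range(dist_x(t0[x], t1[x]), gx)
              or in_range(dist_y(t0[y], t1[y]), gy))
    return sat / len(pairs)

# E2's tuples t13-t17 from Table 1, with the full position titles used
# in the edit-distance discussion of Example 1.
e2 = [
    {"Position": "Senior Software Developer", "Salary": 80_000},  # t13
    {"Position": "Senior Software Developer", "Salary": 82_100},  # t14
    {"Position": "Senior Software Developer", "Salary": 84_000},  # t15
    {"Position": "Senior Software Developer", "Salary": 88_400},  # t16
    {"Position": "Lead Developer", "Salary": 96_100},             # t17
]

pct = lambda a, b: (b - a) / a
support = rule_support(e2, "Position", "Salary", (5, 15), (0.10, 0.25), lev, pct)
# 3 of the 4 consecutive pairs satisfy the rule; t16 -> t17 violates it:
# the position change has edit distance 13 (inside (5, 15)), but the
# salary change of about 8.7% falls below the 10%-25% consequent range.
```

On E2's tuples, the rule Position_(5,15) → Salary_(0.1,0.25) discussed in Example 3 obtains a support of 0.75: the three no-promotion pairs do not trigger the rule, and the final promotion pair violates the consequent range, matching the deviation that Example 3 highlights.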
4. Current Work: Change Rules

A change rule is a novel data quality rule which describes a relationship between the changes in attributes within I. It states that when the change in the antecedent is within some range g_x = (g_x^l, g_x^u), then the corresponding change in the consequent will also be within a defined range g_y = (g_y^l, g_y^u).

DEFINITION 1. Let π be the permutation of tuples of I increasing on X (that is, t_π(1)[X] < t_π(2)[X] < ... < t_π(N)[X]). Change rule σ: X_{g_x} → Y_{g_y} holds over I if for all i such that 1 ≤ i ≤ N-1, when dist(t_π(i)[X], t_π(i+1)[X]) ∈ g_x then dist(t_π(i)[Y], t_π(i+1)[Y]) ∈ g_y.

When ordered on X, if the dist between any two consecutive X-values is within the range g_x, then the dist between the corresponding Y-values must be within g_y. A change rule with a minimum support threshold θ holds when at least θ% of pairs of consecutive tuples in the instance satisfy the conditions of the change rule.

Example 3. Consider the change rule over Table 1:

σ: Position_(5,15) → Salary_(0.1,0.25)

This rule states that if the change in Position is between 5 and 15, then the change in Salary will be between 10% and 25%. This holds true for most of the table, except when E2 is promoted from Senior Software Developer to Lead Developer. In this case, the salary increase is only 8.7%, which is below the expected 10% to 25% increase. This deviation from the rule highlights that the employee received a smaller-than-normal raise with their promotion.

4.1. Discovery of Change Rules

We build upon the Differential Dependency discovery algorithm, FastDD [7], over unordered data.

• Diff-Set Construction: Encodes pairwise differences between all tuples into a diff-set, where each element represents a differential constraint violation (e.g., t_i[A] - t_j[A] > φ), where t_i and t_j are any two tuples in a relational instance and φ is a numerical value. For change rules, we modify this step by using an instance I sorted on the antecedent attributes X to compute the dist between consecutive pairs of tuples. This eliminates redundant comparisons by restricting diff-set construction to adjacent tuple pairs in the sorted instance I.

• Set Cover Enumeration: Finds minimal subsets of differential functions (antecedent) that cover all violations of the consequent. For change rules, instead of fixed thresholds, we use intervals g_x and g_y for the antecedent and consequent gaps. That is, we find the minimal subsets of dist(t_π(i)[X], t_π(i+1)[X]) ∈ g_x that cover all violations of dist(t_π(i)[Y], t_π(i+1)[Y]) ∈ g_y.

5. Conclusion and Next Steps

Data changes over time; however, we want to capture relationships between these changes. In this paper, we discussed the importance of context-awareness when capturing these changes and the relevance of identifying normal change behaviour. We introduced a new data quality rule, called a change rule, that captures the relationship between the changes in the antecedent and the changes in the consequent.

As next steps, we plan to address the aforementioned problems and challenges:

• Exploring transformer-based embeddings (e.g., BERT) to quantify and accurately capture context-aware changes in numeric and non-numeric data without compromising semantic information.

• Optimizing Set Cover Enumeration by developing an efficient method to minimize the search space when identifying minimal subsets of antecedent changes that explain consequent violations.

• Considering the lagged effects of earlier changes on subsequent changes, i.e., the change in an attribute at one time step influences changes at a later time step.

References

[1] N. A. Heckert, J. J. Filliben, C. M. Croarkin, B. Hembree, W. F. Guthrie, P. Tobias, J. Prinz, Handbook 151: NIST/SEMATECH e-Handbook of Statistical Methods (2002).
[2] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR) 33 (2001) 31-88.
[3] J. S. Cardinal, Similarity measures and graph adjacency with sets (2022). URL: towardsdatascience.com/similarity-measures-and-graph-adjacency-with-sets. [Online; posted 28-Oct-2022].
[4] J. Szlichta, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, Effective and complete discovery of order dependencies via set-based axiomatization, arXiv preprint arXiv:1608.06169 (2016).
[5] L. Golab, H. Karloff, F. Korn, A. Saha, D. Srivastava, Sequential dependencies, Proceedings of the VLDB Endowment 2 (2009) 574-585.
[6] S. Song, L. Chen, Differential dependencies: Reasoning and discovery, ACM Transactions on Database Systems (TODS) 36 (2011) 1-41.
[7] S. Kuang, H. Yang, Z. Tan, S. Ma, Efficient differential dependency discovery, Proceedings of the VLDB Endowment 17 (2024) 1552-1564.
[8] W. H. Gomaa, A. A. Fahmy, A survey of text similarity approaches, International Journal of Computer Applications 68 (2013).
[9] X. Ding, Y. Li, H. Wang, C. Wang, Y. Liu, J. Wang, TSDDiscover: Discovering data dependency for time series data, in: 2024 IEEE 40th International Conference on Data Engineering (ICDE), IEEE, 2024, pp. 3668-3681.
[10] R. Agrawal, R. Srikant, et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases (VLDB), volume 1215, Santiago, 1994, pp. 487-499.
[11] P. Qiu, Statistical process control charts as a tool for analyzing big data, Big and Complex Data Analysis: Methodologies and Applications (2017) 123-138.
[12] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413-422.