Towards New Data Quality Rules for Modeling Data Change

Nishtha Sharma
Supervised by: Dr. Fei Chiang
McMaster University, Hamilton ON, Canada
sharmn99@mcmaster.ca (N. Sharma)

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Data is not static, and attribute value changes often trigger changes in another set of attributes. Traditional methods for analyzing data changes often treat these changes in isolation, failing to consider the broader context in which they occur. This lack of contextual awareness limits the ability to capture relationships between attributes or interpret their significance, especially when distinguishing between normal variations and potential anomalies. In this paper, we discuss the importance of context-awareness and the need to identify normal change behaviour. To achieve this, we introduce a new data quality rule, called a change rule, capable of capturing changes in both antecedent and consequent attributes within ordered tuples of a relational instance.

Keywords
Data Dependencies, Dynamic Data Dependencies, Change Exploration, Change Dependency

1. Introduction

Table 1
Example employee changes in position, salary.

tID  Year  Emp  Position          Salary   EmpMng
t1   2012  E1   Software Dev       65,000    0
t2   2013  E1   Software Dev       68,400    0
t3   2014  E1   Sr. Software Dev   82,000    4
t4   2015  E1   Sr. Software Dev   84,100    5
t5   2016  E1   Sr. Software Dev   86,700    5
t6   2017  E1   Lead Dev           96,700   25
t7   2018  E1   Lead Dev          105,000   25
t8   2019  E1   Manager           130,000  140
t9   2015  E2   Software Dev       64,500    0
t10  2016  E2   Software Dev       67,000    0
t11  2017  E2   Software Dev       69,200    0
t12  2018  E2   Software Dev       71,500    0
t13  2019  E2   Sr. Software Dev   80,000    2
t14  2020  E2   Sr. Software Dev   82,100    3
t15  2021  E2   Sr. Software Dev   84,000    3
t16  2022  E2   Sr. Software Dev   88,400    4
t17  2023  E2   Lead Dev           96,100   28

In real-world datasets, values rarely remain static as data continuously changes over time. These changes often carry critical information, revealing patterns, trends, and triggers that are essential for understanding environmental conditions, system and user behaviour, and trends. Existing database systems have limited functionality to manage changes and to identify abnormal changes, often relying on triggers to recognize out-of-bound changes. In this work, we consider changes to relational attributes for an entity.
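As a concrete illustration of this setting (our own sketch, not part of the paper's formalism), an entity's attribute history can be held as time-ordered tuples, with changes computed between consecutive tuples. The values below are E1's first three tuples from Table 1:

```python
# Hypothetical sketch: Table 1 as time-ordered tuples for employee E1.
# Each tuple holds the attribute values at one point in time.
e1 = [
    (2012, "Software Dev", 65_000),
    (2013, "Software Dev", 68_400),
    (2014, "Sr. Software Dev", 82_000),
]

# Consecutive-tuple changes: absolute and percentage salary change,
# plus whether the Position attribute changed at the same time.
for (y0, p0, s0), (y1, p1, s1) in zip(e1, e1[1:]):
    pct = 100 * (s1 - s0) / s0
    print(f"{y0}->{y1}: salary {s1 - s0:+,} ({pct:+.1f}%), position change: {p0 != p1}")
# 2012->2013: salary +3,400 (+5.2%), position change: False
# 2013->2014: salary +13,600 (+19.9%), position change: True
```

The +19.9% computed for E1's first promotion is the same percentage change discussed in Example 1; the point made below is that this number alone, stripped of context such as tenure or EmpMng, is not enough to judge whether the change is normal.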
To simplify our setting, attribute changes are modeled as a sequence of ordered tuples, implicitly with respect to time. Hence, a tuple represents the value each attribute holds for an entity at a specific point in time.

Data changes occur in numeric and non-numeric attributes. Changes to numeric attributes are often measured using absolute difference, percentage change, rate of change, or a rolling average [1]. While these metrics are easy to compute, they fail to capture the broader context of the change, such as the influence of related attributes or the significance of the change. For non-numeric attributes, changes are often measured using edit distances (Levenshtein, Jaro-Winkler, Hamming) [2] or set-based coefficients (Overlap, Jaccard, Dice) [3]. However, these metrics are insufficient because they ignore the semantic meaning of the changes and the context in which they occur. Context is critical because it provides the necessary information to interpret the significance of a change. Without context, changes are reduced to isolated events, which can lead to misleading interpretations of the data change.

Example 1. Table 1 shows two employees (Emp) E1 and E2 and their Position, Salary and number of employees managed (EmpMng) as of a specific Year. Consider the following changes and the need for greater context:

Numeric attribute value changes: As observed in tuples t1-t3 of Table 1, after only two years as a Software Developer, E1 was promoted to the position of Senior Software Developer, accompanied by a significant increase in salary ($68,400 to $82,000). In contrast, tuples t9-t13 show that E2 spent four years as a Software Developer before being promoted to Senior Software Developer with a similar salary increase as E1's (from $71,500 to $80,000). Changes in salary are typically quantified using percentage change (+19.9% for E1 and +11.9% for E2). While this provides a numerical summary of the change, it fails to account for the broader context. For instance, E1 received a larger raise after a shorter tenure and took on the responsibility of managing four employees, whereas E2 had to wait twice as long for a similar promotion and gained the responsibility of managing two fewer employees compared to E1.

Non-numeric changes within and between classes: Traditional edit distance metrics such as Levenshtein distance (LD) quantify changes based on character modifications. The transition from Software Developer to Senior Software Developer has an LD = 7, whereas for Senior Software Developer to Lead Developer, LD = 13. These values suggest that the latter change is almost twice as significant as the former, despite both changes being promotions to the next position within the same class (development roles), as shown in Figure 1.

The implications of a change can be much greater between different classes. For instance, the LD between Lead Developer and Manager is 11, which suggests that this transition is smaller than the transition from Senior Software Developer to Lead Developer (LD = 13). However, this interpretation is misleading. The change from Lead Developer to Manager represents a more significant career shift compared to the change from Senior Software Developer to Lead Developer (where both positions are in the same class), as it involves moving from a development role to a managerial position (2 levels up as per Figure 1), which is accompanied by a significant increase in the number of people managed. Existing distance measures fail to capture semantic interpretations of the data.

[Figure 1: Hierarchy of the company in Table 1]

Problem 1: The need for context. The example highlights that not all changes are equally significant. Context is often needed to interpret data change, and there is a need to augment existing distance metrics with context.

While identifying (significant) changes is important, it is equally critical to differentiate between normal changes vs. abnormal changes. Traditional methods have used declarative methods such as data dependencies of the form X → Y, where X, Y are attribute sets representing antecedent and consequent attributes. Order Dependencies (ODs) [4], Sequential Dependencies (SDs) [5], and Differential Dependencies (DDs) [6] specify expected relationships between attribute sets. ODs introduce ordering relationships but do not explicitly quantify changes in attribute values. SDs model consequent attribute changes but do not account for variations in the antecedent attributes. DDs, while addressing changes in both antecedent and consequent attributes, apply to unordered data.

Example 2. Consider a sequential dependency (SD) stating that when ordered by Position, the change in Salary between consecutive tuples should be between 5% and 20%. For E1, the SD is violated between t3 and t4, with a salary change of 2.6% falling below the range. It is also violated between t7 and t8, where the salary change (23.8%) exceeds the upper bound. These violations help in identifying abnormal changes. However, we also want to identify patterns where different changes in the antecedent attributes, such as changes within Position, will elicit different changes in the consequent (Salary). For instance, with no change in position, salary still changes annually by 2% to 10%. Whenever there is a promotion (change in position > 0), the salary always changes by 10% to 25%. The existing dependencies do not capture relationships of this form.

To address this, we define a data quality rule called a change rule. The change rule captures relationships between changes in attribute values of an ordered relational instance.

Problem 2: Differentiating normal vs. abnormal data change. Existing dependencies do not capture the dependence between changes from antecedent attributes to consequent attributes on ordered tuples. A declarative specification is needed that models the expected range of value change between attribute sets. We propose change rules to address this problem.

1.1. Challenges

• Context representation: Context helps to interpret the significance of data changes. Which attributes, and which subset of values, are used to provide this context? Is this context time-dependent? How are existing distance measures augmented to consider this context?

• Efficient rule mining: Manual specification of change rules is not practically feasible, and automated solutions are needed. Determining dependent sets of attributes is important towards identifying meaningful data change. Exhaustive enumeration of all attribute sets and their values is not feasible, and efficient methods to evaluate the large space of attribute sets are needed.

• Filtering spurious changes: Rule mining is known to produce spurious rules. Determining which changes are most relevant and defining (support) measures that filter less meaningful changes is necessary.

1.2. Contributions

We expect to make the following contributions.

• Context-aware change metric: A metric for quantifying changes in both numeric and non-numeric attributes, augmenting them with contextual information from related attributes.

• Change rules: A new rule that captures the relationship of changes from one attribute set X to another attribute set Y across ordered tuples.

• Change rule discovery algorithm: An efficient discovery algorithm for change rules over ordered datasets. The algorithm adapts the FastDD method to handle ordered data [7], and identifies changes in sequential attribute values using context-aware metrics.

2. Related Work

We discuss the relationship of our work to existing metrics, data dependencies, association rules, and statistical/ML approaches.

2.1. Similarity, Distance Metrics

Traditional numeric metrics analyze individual attributes in isolation, missing contextual relationships between changes in different attributes. Measures of central tendency (mean, median, mode) summarize values but can be skewed by outliers. Dispersion metrics (variance, standard deviation, IQR) capture data spread, but are also prone to outlier sensitivity. Shape distribution measures (skewness, kurtosis, CV) describe asymmetry and variability but can be biased when data is highly skewed or sparse [1].

For non-numeric (categorical, text) data, cosine similarity is commonly used. Cosine similarity measures the cosine of the angle between vectors [8] and is often used with embeddings to capture semantic similarity. Overlap, Jaccard, and Dice coefficients [3] are used to quantify the similarity and diversity of sets. Edit distances such as Levenshtein, Jaro-Winkler, and Hamming quantify the number of operations needed to transform one string into another [2].

While these metrics are widely used, they do not capture semantic distances. For example, "Software Developer" and "Senior Software Developer" have a high edit distance despite being closely related in meaning. Embedding-based approaches (e.g., BERT) address this by capturing contextual meaning but require pre-trained models and domain-specific tuning. An effective approach for measuring semantic similarity between non-numeric values is to compute cosine similarity on BERT embeddings, which allows for a context-aware representation of the data.

2.2. Data Dependencies

Order Dependencies (ODs) extend functional dependencies by enforcing ordering relationships [4]. They ensure that a positive change in the antecedent corresponds to a positive change in the consequent. However, the semantics of ODs do not declaratively capture the change in any attribute values.

Sequential Dependencies (SDs) declaratively specify the change in consequent attributes [5]. They enforce constraints on how the consequent changes in response to an instance ordered on the antecedent, i.e., when the instance is ordered on X, the changes in the consecutive Y-values will be within a range g. However, they fail to capture the change in the antecedent. Conditional SDs (CSDs) focus on identifying intervals within ordered data that satisfy a given SD. They prefer larger, contiguous intervals that capture a substantial portion of the data satisfying the embedded SD. However, the continuity of these intervals requires a trade-off with the specificity of the bound g, which is not addressed in the paper.

Differential Dependencies (DDs) model differences between any two tuples in a relation independent of the tuple ordering, i.e., if the antecedent attribute differences lie within a range g_x, then the consequent attribute value differences must lie within a range g_y [6]. By not capturing order, DDs miss critical contextual information like trends or patterns across consecutive tuples.

TSDDs [9], designed for time-series data, capture temporal relationships by treating data within a given time window as an ordered set and supporting real-valued function operations. However, similar to SDs, they do not account for changes in the antecedent attributes over time. Additionally, selecting an optimal time window remains a challenge, as an overly narrow window may overlook significant trends, while a broader one risks diluting the relevance of dependencies.

2.3. Association Rules

Association rules identify co-occurrences of items within a dataset, typically expressed in the form {A, B} → C, stating that if items A and B appear together, then C is likely to appear as well [10]. Unlike data dependencies, which enforce constraints that all instances must satisfy, association rules identify probabilistic relationships without guaranteeing consistency. Dependencies ensure structural integrity, while association rules uncover patterns that may not hold universally.

2.4. Statistical and Machine Learning Approaches

Statistical and machine learning approaches leverage patterns in historical data to identify deviations that fall outside expected behavior. Statistical methods rely on predefined thresholds and assumptions about data distribution, while machine learning approaches adapt to complex, high-dimensional datasets. Together, they offer complementary techniques for identifying and differentiating normal and abnormal changes in data.

Statistical techniques include rule-based thresholds and hypothesis testing. For example, Z-scores and modified Z-scores are commonly used to detect anomalies by measuring how far a data point deviates from the mean, relative to the standard deviation [1]. If a data point's z-score exceeds a certain threshold (e.g., 3), it may be flagged as abnormal. Similarly, control charts and statistical process control (SPC) methods monitor data streams over time, flagging points that fall outside control limits as potential anomalies [11].

Machine learning provides various techniques for distinguishing normal from abnormal changes in data, particularly through anomaly detection algorithms. Isolation Forest [12] isolates anomalies by partitioning the dataset into smaller subsets. Points that require fewer partitions to be isolated are identified as anomalies. This method works well in high-dimensional data but may struggle with datasets containing overlapping clusters or anomalies that are close to the decision boundary.

While numerous anomaly detection methods exist, our approach specifically targets anomalies in the change of attribute values. We achieve this by defining a change rule that not only identifies abnormal behavior but also captures the relationships between changes across multiple attributes.

3. Preliminaries

Let R be a relational schema on attributes A_1, A_2, ..., A_N, and let X and Y be sets of attributes such that X ⊆ R and Y ⊆ R. Let I = {t_1, t_2, ..., t_N} be a relational instance of R with N tuples, ordered on X (implicitly ordered on time). The distance between consecutive tuples in I for an attribute A is given via a context-aware distance measure dist(t_i[A], t_{i+1}[A]). We define a permissible range for dist as g_A = (p, q), where p, q are real values, i.e., if dist(t_i[A], t_{i+1}[A]) ∈ g_A, then p ≤ dist(t_i[A], t_{i+1}[A]) ≤ q.

We define a support function support(σ, I) that measures the relative strength of a change rule σ in I. Naturally, we seek high-support rules to ensure that they have sufficient evidence in the instance. We introduce change rules in the next section, and focus on their discovery (as part of Problem 2).

Problem Definition: Given a minimum support threshold θ, find all change rules Σ such that I satisfies Σ (I |= Σ), and for all σ ∈ Σ, support(σ, I) ≥ θ.
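To make these definitions concrete, the sketch below is our own illustration, not the paper's implementation: the function names are hypothetical, and support is taken, as a simplifying assumption, to be the fraction of consecutive-tuple pairs that satisfy the rule. It checks a single change rule over an instance ordered on time, using Levenshtein distance for the non-numeric antecedent and percentage change for the numeric consequent:

```python
# Sketch (our illustration): check one change rule sigma: X_gx -> Y_gy
# over an instance ordered on X, and compute a simple support score.

def lev(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def in_range(v, g):
    return g[0] <= v <= g[1]

def rule_support(instance, x, y, gx, gy, dist_x, dist_y):
    """instance: list of dicts, already ordered on X (here: by time).
    A consecutive pair satisfies the rule if the antecedent change falls
    outside g_x (rule not triggered) or the consequent change is in g_y."""
    pairs = list(zip(instance, instance[1:]))
    sat = sum(1 for t0, t1 in pairs
              if not in_range(dist_x(t0[x], t1[x]), gx)
              or in_range(dist_y(t0[y], t1[y]), gy))
    return sat / len(pairs)

# E2's tuples t13-t17 from Table 1, with the full position titles used
# in the edit-distance discussion of Example 1.
e2 = [
    {"Position": "Senior Software Developer", "Salary": 80_000},  # t13
    {"Position": "Senior Software Developer", "Salary": 82_100},  # t14
    {"Position": "Senior Software Developer", "Salary": 84_000},  # t15
    {"Position": "Senior Software Developer", "Salary": 88_400},  # t16
    {"Position": "Lead Developer", "Salary": 96_100},             # t17
]

pct = lambda a, b: (b - a) / a
support = rule_support(e2, "Position", "Salary", (5, 15), (0.10, 0.25), lev, pct)
# 3 of the 4 consecutive pairs satisfy the rule; t16 -> t17 violates it:
# the position change has edit distance 13 (inside (5, 15)), but the
# salary change of about 8.7% falls below the 10%-25% consequent range.
```

On E2's tuples, the rule Position_(5,15) → Salary_(0.1,0.25) discussed in Example 3 obtains a support of 0.75: the three no-promotion pairs do not trigger the rule, and the final promotion pair violates the consequent range, matching the deviation that Example 3 highlights.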
4. Current Work: Change Rules

A change rule is a novel data quality rule which describes a relationship between the changes in attributes within I. It states that when the change in the antecedent is within some range g_x = (g_x^l, g_x^u), then the corresponding change in the consequent will also be within a defined range g_y = (g_y^l, g_y^u).

DEFINITION 1. Let π be the permutation of tuples of I increasing on X (that is, t_π(1)[X] < t_π(2)[X] < ... < t_π(N)[X]). Change rule σ: X_{g_x} → Y_{g_y} holds over I if for all i such that 1 ≤ i ≤ N-1, when dist(t_π(i)[X], t_π(i+1)[X]) ∈ g_x then dist(t_π(i)[Y], t_π(i+1)[Y]) ∈ g_y.

When ordered on X, if the dist between any two consecutive X-values is within the range g_x, then the dist between the corresponding Y-values must be within g_y. A change rule with a minimum support threshold θ holds when at least θ% of pairs of consecutive tuples in the instance satisfy the conditions of the change rule.

Example 3. Consider the change rule over Table 1:

σ: Position_(5,15) → Salary_(0.1,0.25)

This rule states that if the change in Position is between 5 and 15, then the change in Salary will be between 10% and 25%. This holds true for most of the table, except when E2 is promoted from Senior Software Developer to Lead Developer. In this case, the salary increase is only 8.7%, which is below the expected 10% to 25% increase. This deviation from the rule highlights that the employee received a smaller-than-normal raise with their promotion.

4.1. Discovery of Change Rules

We build upon the Differential Dependency discovery algorithm, FastDD [7], over unordered data.

• Diff-Set Construction: Encodes pairwise differences between all tuples into a diff-set, where each element represents a differential constraint violation (e.g., t_i[A] - t_j[A] > φ), where t_i and t_j are any two tuples in a relational instance and φ is a numerical value. For change rules, we modify this step by using an instance I sorted on the antecedent attributes X to compute the dist between consecutive pairs of tuples. This eliminates redundant comparisons by restricting diff-set construction to adjacent tuple pairs in the sorted instance I.

• Set Cover Enumeration: Finds minimal subsets of differential functions (antecedent) that cover all violations of the consequent. For change rules, instead of fixed thresholds, we use intervals g_x and g_y for the antecedent and consequent gaps. That is, we find the minimal subsets of dist(t_π(i)[X], t_π(i+1)[X]) ∈ g_x that cover all violations of dist(t_π(i)[Y], t_π(i+1)[Y]) ∈ g_y.

5. Conclusion and Next Steps

Data changes over time; however, we want to capture relationships between these changes. In this paper, we discussed the importance of context-awareness when capturing these changes and the relevance of identifying normal change behaviour. We introduced a new data quality rule, called a change rule, that captures the relationship between the changes in the antecedent and the changes in the consequent.

As next steps, we plan to address the aforementioned problems and challenges:

• Exploring transformer-based embeddings (e.g., BERT) to quantify and accurately capture context-aware changes in numeric and non-numeric data without compromising semantic information.

• Optimizing Set Cover Enumeration by developing an efficient method to minimize the search space when identifying minimal subsets of antecedent changes that explain consequent violations.

• Considering the lagged effects of earlier changes on subsequent changes, i.e., the change in an attribute at one time step influences changes at a later time step.

References

[1] N. A. Heckert, J. J. Filliben, C. M. Croarkin, B. Hembree, W. F. Guthrie, P. Tobias, J. Prinz, Handbook 151: NIST/SEMATECH e-Handbook of Statistical Methods (2002).
[2] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR) 33 (2001) 31-88.
[3] J. S. Cardinal, Similarity measures and graph adjacency with sets (2022). URL: towardsdatascience.com/similarity-measures-and-graph-adjacency-with-sets. [Online; posted 28-Oct-2022].
[4] J. Szlichta, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, Effective and complete discovery of order dependencies via set-based axiomatization, arXiv preprint arXiv:1608.06169 (2016).
[5] L. Golab, H. Karloff, F. Korn, A. Saha, D. Srivastava, Sequential dependencies, Proceedings of the VLDB Endowment 2 (2009) 574-585.
[6] S. Song, L. Chen, Differential dependencies: Reasoning and discovery, ACM Transactions on Database Systems (TODS) 36 (2011) 1-41.
[7] S. Kuang, H. Yang, Z. Tan, S. Ma, Efficient differential dependency discovery, Proceedings of the VLDB Endowment 17 (2024) 1552-1564.
[8] W. H. Gomaa, A. A. Fahmy, A survey of text similarity approaches, International Journal of Computer Applications 68 (2013).
[9] X. Ding, Y. Li, H. Wang, C. Wang, Y. Liu, J. Wang, TSDDiscover: Discovering data dependency for time series data, in: 2024 IEEE 40th International Conference on Data Engineering (ICDE), IEEE, 2024, pp. 3668-3681.
[10] R. Agrawal, R. Srikant, et al., Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases (VLDB), volume 1215, Santiago, 1994, pp. 487-499.
[11] P. Qiu, Statistical process control charts as a tool for analyzing big data, Big and Complex Data Analysis: Methodologies and Applications (2017) 123-138.
[12] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413-422.