<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Discovering Provenance of Data Warehouse Objects Created by Select-Project Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Witold Andrzejewski</string-name>
          <email>witold.andrzejewski@cs.put.poznan.pl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pawel Boinski</string-name>
          <email>pawel.boinski@cs.put.poznan.pl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Wrembel</string-name>
          <email>robert.wrembel@cs.put.poznan.pl</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Data lineage is the set of techniques that allow tracking data transformations along a data integration (DI) workflow. Despite substantial research that has already been done on this topic, discovering missing or broken links between database objects in large data repositories still remains challenging. In a data warehouse system, objects like materialized views are created from other objects by means of SQL select-project-join queries. As a result, dependencies between objects are complex, thus dificult to discover and maintain. Moreover, DI workflows often create auxiliary temporary objects (e.g., temporary tables, global variables in a database programming language) and they invoke codes (e.g., table functions, procedures with cursors) that produce volatile data. These volatile data and data structures are used to build permanent data sets and structures, but they cease to exist after terminating a given DI task. As a consequence, relationships between such objects cannot be automatically recorded. This paper provides algorithms that discover relationships between data warehouse objects. We address objects that were created by means of SQL projection-selection queries, where projection of at least one attribute is performed via a linear function. In particular we contribute: (1) an algorithm that discovers the slope and intercept of a linear function that was used in a project query and (2) an algorithm that recovers the selection condition used in a select query.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;data lineage</kwd>
        <kwd>data provenance</kwd>
        <kwd>broken lineage</kwd>
        <kwd>data integration workflow</kwd>
        <kwd>data warehouse</kwd>
        <kwd>select-project query</kwd>
        <kwd>linear regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data lineage (also referred to as data provenance [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]) encompasses the systematic documentation
of the lifecycle of data. This documentation spans the data lifecycle from its origin in a data source,
through various transformations applied within data integration (DI) workflows, to its final destination,
which is typically a data warehouse (DW), data lake, or data lakehouse [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ]. Data lineage
techniques are fundamental to data management, governance, and compliance, as they provide insights
into data origins and processing logic. Consequently, data lineage is an indispensable instrument for
validating data quality, diagnosing processing anomalies, and providing regulatory bodies (especially in
the financial sector) with auditable evidence of data correctness and integrity.
      </p>
      <p>While existing data lineage research primarily focuses on identifying and managing relationships
between individual data records, these relationships are fundamentally determined by the underlying
dependencies between database objects, such as tables, views, and materialized views. To the best
of our knowledge, the systematic discovery of these structural, object-level relationships remains an
underexplored area in the literature.</p>
      <p>
        DI workflows often create auxiliary temporary objects (e.g., temporary tables in a staging area) and
they call predefined functions or user-defined functions (UDFs) that produce volatile tabular data (e.g.,
by table functions). Often, these volatile data and data structures are used to build permanent data sets
and structures. Typically, auxiliary temporary objects (temporary tables, table functions, procedures
with cursors and global variables) cease to exist after terminating a given DI task. As a consequence,
data lineage gets broken. To the best of our knowledge, the state-of-the-art research solutions for data
lineage have not addressed the problem of a broken lineage [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. What is more, none of the commercial
or open-source DI engines allows for the discovery of the broken lineage. The broken lineage prevents
users from checking data quality and from managing dependencies between database objects, which in
turn increases system development costs.
      </p>
      <p>This research work was motivated by a real problem of the broken lineage, faced by a company
in the financial sector. The company uses a DW that integrates data from multiple external systems
(mainly relational databases). New DW objects are created to meet the demand for new analytical
reports. These new objects are based on existing DW and source database objects. For instance, a DW
in the company IT architecture encompasses over 5,000 views and materialized views as well as over
11,000 stored procedures and functions, totaling more than 2 million lines of code. Remarkably, a DW
of this size is still classified as a small DW.</p>
      <p>The problem addressed in this paper is aggravated by the fact that institutions typically maintain
multiple subject-oriented DWs. The vast number of DW objects and source objects used leads to
an overwhelming complexity of dependencies between them. This creates a significant challenge in
managing source-to-target relationships, as a single source object often feeds multiple targets across
multiple warehouses. Furthermore, objects in one DW may be derived from multiple source objects
and/or objects from other DWs.</p>
      <p>In this paper, we introduce a method for discovering missing and broken lineage links between
database objects (Section 3). We focus on a frequent real-world scenario: a new relation is created to
store a specific subset (a selection) of data from a source table, but the values are transformed rather than
just copied. These transformations typically involve linear unit changes, such as converting currencies,
temperature scales, or time zones. For example, a database might have a raw_trades table recording
all stock exchange activity. To help analysts, a developer might create a large_trades_PLN table
that only includes transactions over 10,000 shares and converts prices from USD to PLN. Often, the
metadata or "context" columns are dropped during this process, making it dificult to prove where the
data actually came from. This creates the broken lineage problem.</p>
      <p>The method presented in this paper is capable of solving the problem by discovering the slope and
intercept of the linear function used in the projection as well as recovering the selection criteria used.
In the stock market example, our approach would successfully identify the USD-to-PLN conversion
rate and the 10,000-share filter used to create the new table.</p>
      <p>The experimental evaluation of the method is reported in Section 4. The results show the algorithms
performance w.r.t. to several distinct parameters and demonstrate that the selection-projection query is
reconstructed correctly. We conclude the paper with a short summary in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Three fundamental solutions for managing data provenance in databases have been proposed, namely:
annotation-based, inversion-based, and lineage graphs. The annotation-based solutions annotate input
data with information about their origin. Then, these annotations are propagated by a database engine
through the entire data processing pipeline. Historically, first annotation-based solutions include
Polygen [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DBNotes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. More recent works using the same concept include [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. A
common approach to annotation-based data provenance representation is the mathematical model
known as semiring [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. The inversion-based solutions invert queries trying to retrospectively infer
the original data that produced a given result, e.g., [
        <xref ref-type="bibr" rid="ref15 ref6">15, 6</xref>
        ]. Finally, in the lineage graph approach,
provenance is registered during query processing and is saved in the form of a graph for further analysis
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] systematizes data lineage characteristics, such as (1) lineage type, (2) lineage granularity, (3)
types of tracked data transformations, and (4) the point in time where the lineage information is stored.
Paper [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] summarizes several most relevant works on data provenance.
      </p>
      <p>
        Data science pipelines typically process data using scripting languages (e.g., Python, R) where data
provenance also needs to be maintained, e.g., [
        <xref ref-type="bibr" rid="ref19 ref20 ref21">19, 20, 21</xref>
        ]. Paper [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposes that a user expresses
provenance information need, independent of the instrumentation and tracing of scripts.
      </p>
      <p>
        The standardization of provenance information was initiated in the Open Provenance Model (OPM)
was created [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. OPM uses a directed acyclic graph to represent provenance information. The OPM
was replaced by the W3C standard PROV [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ], to facilitate provenance data exchange over the Web.
      </p>
      <p>
        Query Reverse Engineering (QRE) is a related research area. State-of-the-art QRE techniques (e.g.,
[
        <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
        ]) primarily focus on synthesizing queries involving complex structural operations like joins
and aggregations. However, they lack the capability to discover algebraic data transformations that we
address in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Contribution</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we presented an approach to detect whether there is a possibility that a column from one table
was derived from another table column via selection and linear transformation. Moreover, we provided
a method for estimating slope and intercept for the linear transformation under the assumption that
the slope is positive.
      </p>
      <p>In this paper, we build upon the previous solution to provide the following improvements:
• significantly more accurate detection of slope and intercept w.r.t. the previous approach;
• the detection of negative slopes (the previous method allowed only for positive slopes);
• the reconstruction of selection conditions.</p>
      <p>In Section 3.1, we establish the notation used in the subsequent discussion and formally define the
problem addressed in this paper. In Section 3.2, we provide an algorithm for determining the slope and
intercept. In Section 3.3, we provide the important component of this algorithm (a verification step). In
Section 3.4, we present an algorithm for reconstructing the selection conditions. Finally, in Section 3.5
we discuss the limitations of the current solution and provide topics for future work.</p>
      <p>
        To the best of our knowledge, the only paper directly related to this problem is [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Notations and problem definition</title>
        <p>Assume two relations (1, 2, . . . , , ) and  (1, 2, . . . , , ). For conciseness, we denote the set
of all  attributes as  and the set of all  attributes as  . We say that there is a relationship between
 and  if || ≥ | | and  = Π1() as 1,...,() as ,+ as ( ()), where the functions   are
arbitrary and unknown, and the coeficient  is non-zero. In other words, attributes  may, or may not,
be computed from  in an arbitrary manner. We assume that the selection condition  is represented in
disjunctive normal form, where each clause is a conjunction of literals, and each literal is an inequality
of the form {≤, &gt;}, where  belongs to the domain of  .</p>
        <p>In this paper, we present algorithms to solve the following problem: given relations  and  as
defined above, find the slope , the intercept  and the selection condition  if a relationship exists, or
return an indication that  was not derived from  otherwise.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Finding slope and intercept</title>
        <p>Basic algebra yields, that given any two pairs of  and  values, e.g., (1, 1) and (2, 2), one can
easily derive the slope  and intercept  as:
 = 1 −  2</p>
        <p>1 −  2
 = 1 −  1 = 2 −  2
(1)
(2)
Under the assumption that the relationship exists, every row in  corresponds to some row in . Thus,
in order to find the slope and intercept, it is suficient to choose any two rows in  , and treat attribute
the  attribute values from these rows as 1 and 2. Finding 1 and 2 from the relation  is more
challenging, since we do not know which rows were chosen by the selection condition . Thus, we
need to map every 2-permutation (1, 2) of values in the  attribute to the 1 and 2 values obtained
previously, compute  and  and verify whether, for each row in  , some  value in  allows us to
compute . If the verification is successful, the linear function parameters are found.</p>
        <p>Unfortunately, the proposed solution, while accurate, is highly complex. There are (||2) pairs of
1 and 2. Moreover, for each such pair, a verification must be performed, which requires checking if
all  values in  are properly explained. Thus, the verification requires at least | | steps. In fact, the
actual complexity of this verification step is even higher, as we show in Section 3.3. In order to mitigate
this problem, we propose the following solutions.</p>
        <p>
          First, in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] we have proposed to use two-sample Kolmogorov-Smirnov test to detect whether there
is a possibility that a relationship exists between the two tables. The complex approach should be used
only if such a possibility exists.
        </p>
        <p>Second, the number of checked 2-permutations can be limited using the following approach. Choose
the extreme  values as 1 and 2, i.e.:
1
2
=
=
()( )
()( )
(3)
(4)</p>
        <p>Since a linear function is monotonic, the corresponding 1 and 2 should be the closest to the extreme
values in  attribute as well (though not necessarily equal to them). Assuming the slope  is positive, the
correct 1 corresponding to 1 will be the first row in this order that adheres to the selection condition
. Similarly, the correct 2 corresponding to 2 will be the last row in this order that adheres to the
selection condition . Thus, searching for 1 and 2 should begin either at the start, or the end of the
sorted  attribute. In case the slope  is negative, this is reversed. The correct 1 corresponding to 1
will be found in the last row selected via the condition  and the correct 2 corresponding to 2 will
be in the first row selected via the condition . Since we do not know the actual sign of the slope, we
propose to choose consecutive 1 values in the outside-in order.</p>
        <p>The above discussion is summarized in Algorithm 1. In line 1, the attributes  and  are extracted and
sorted into arrays  and  . We also assume that only unique values are retained. In line 2, the extreme
values of the  attribute are retrieved as 1 (the smallest) and 2 (the biggest). In line 3, two pointers 
and  are initialized. The pointer  points to the start of the  array, while the pointer  points to the
end. The pointers are used to implement the outside-in order. Each iteration of the while loop (lines
4-19), corresponds to two 1 values stored in X at pointers  and . At the end of each iteration, the
pointer  is incremented while the pointer  is decremented (line 18). The while loop ends when  &gt; .
A for loop, defined in lines 5-17, is responsible for iterating over these two pointers. In line 6, the value
1 is retrieved. Now, it must be paired with every other value in the  array. This is performed by the
for loop in lines 7-16. To increase the chance of finding the correct 2 value as soon as possible, we start
from the end of the  array if the  pointer is used, or from the start of the array  if the  pointer
is used. The value 2 is retrieved in line 8. If 1 ̸= 2, the slope  and the intercept  are computed
(lines 10 and 11). Next, we verify if every value in  can be computed based on some value in .</p>
        <p>The verification algorithm is described in Section 3.3. If the verification is positive, then the computed
 and  values are returned as a result (line 13). If no result was found and the while loop ends, then
there is no relationship between the tables  and  via attributes  and . The slope  and intercept 
are returned as null values (line 20).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Verification</title>
        <p>The verification step determines whether every value in attribute  can be derived from values in
attribute  using the current slope  and intercept . In general, this is a problem of verifying whether
one set is a subset of the other, complicated by floating point rounding errors.</p>
        <p>Given that the arrays  and  are sorted, three algorithmic approaches are feasible: (1) a sorted
join-like approach ((|| + | |)), (2) iterative binary search ((| | log ||)), and (3) hashing ((||)
Algorithm 1 Full search for slope  and intercept .</p>
        <p>1:  ←  (Π()),  ←  (Π( ))
2: 1 ←  [0],  2 ←  [| | − 1]
3:  ← 0,  ← || − 1
4: while  ≤  do
5: for 1 ∈ {, } do</p>
        <p>for 1 ∈ {} do
6: 1 ← [ 1]
7: for 2 = || − 1, || − 2, . . . , 0 do</p>
        <p>for 2 = 0, 1, . . . , || − 1 do
8: 2 ← [ 2]
9: if 1 ̸= 2 then
10:  ← ( 1 −  2)/(1 −  2)
11:  ←  1 −  1
12: if  (, , , , ) then
13: return a,b
14: end if
15: end if
16: end for
17: end for
18:  ←  + 1,  ←  − 1
19: end while
20: return , 
◁ if  ̸= 
◁ if  = 
◁ if  1 = 
◁ if  1 = 
◁ Alg. 2
initialization plus (| |) verification). Regarding the complexities, it is also worth noting that they
represent a situation in which all values in the  attribute find a match in the  attribute. Since a single
 value without a match is suficient to abort the verification, in most cases the cost of this step will
be much lower. We have experimentally evaluated the first two methods. Preliminary experiments
indicated that the sorted-join like approach was significantly slower than the binary search alternative.
Consequently, the experimental section reports results only for the binary search variant.</p>
        <p>For clarity and reproducibility, we detail the binary search variant in Algorithm 2. The algorithm
requires the sorted arrays  and  (from Algorithm 1), the linear parameters  and , and a tolerance
threshold  . The tolerance parameter is crucial for handling floating-point inaccuracies. Two values are
considered equal if their absolute diference is less than or equal to  . The algorithm starts with the
initialization of the _ variable, which represents the lower bound of the search range within
array . The main loop (lines 3-20) iterates through  . In each iteration, we calculate the theoretical
value  (line 4) and determine if a corresponding value exists in . Line 5, performs a binary search
within the range [_, || − 1] to find the first insertion point for  that maintains the sorted
order of . Lines 7-12 and 13-18 check if the values in  at indices  or  − 1 fall within  ±  . If
neither satisfies the condition, the parameters  and  fail to explain the dataset, and the function returns
  (line 19). Conversely, upon a successful match, _ is updated to the current position to
narrow the search space for subsequent iterations (lines 9 and 15). If all corresponding  values are
found, the function returns  (line 21).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Reconstructing selection conditions</title>
        <p>For the purpose of the discussion, let us briefly assume that the  attribute in relation  stores unique
values. Let us also assume that the algorithm described in Section 3.2 has found the slope  and the
intercept .</p>
        <p>In order to determine the selection condition , we must first mark the selected rows. We need
to compute an additional attribute in  such, that for each row it stores either 1 or 0, depending on
whether the row was selected via the condition  or not. Let us denote this attribute as . Since we
assume that the  attribute stores unique values, the rows can be selected via a simple join with the
condition  +  = . Thus, the relation with the attribute  can be computed as follows:
Π1,...,,1 as ( +=( ×  ))
∪
Π 1,...,,0 as (
∖
Π1,...,,( +=( ×  ))) (5)</p>
        <sec id="sec-3-4-1">
          <title>Note that the attribute  is replaced by the new attribute .</title>
          <p>The subsequent step involves identifying the selection condition . By framing this as a classification
problem, where 1, . . . ,  are the features and  is the target label, we can employ a decision rule model.
Because this model’s representation closely resembles relational selection conditions, the disjunction of
the conditions of the rules yielding a positive classification (label 1) defines the selection condition .</p>
          <p>There are, however, several key diferences that may cause standard classification algorithms to fail
to retrieve the correct selection condition. In standard methodology, a dataset is divided into at least
training and testing datasets. In our case, a testing dataset is unnecessary, as we only want to determine
how rows were selected when relation  was created. We are not creating a model for future prediction.
Moreover, traditional classification models avoid overfitting by creating simple rules that are allowed
to cover a small number of examples contradicting the rule’s label. While we also seek simple and
non-redundant rules, they must exactly describe the training dataset. The rules must be pure.</p>
          <p>
            In our initial experiments, the best results were achieved using a sequential covering algorithm
(e.g., [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ]) featuring two additional rule-set post-processing phases: rule merging and redundant rule
subsumption. We do not claim novelty for these specific components, as the methods employed are
well-established in the literature: (1) rule aggregation is similar to the RISE algorithm [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], while (2)
rule subsumption is very similar to the last step of LEM2 procedure [31]. Nevertheless, we provide a
brief description of the rule induction algorithm here to ensure clarity and enable the reproducibility of
our results.
          </p>
          <p>
            The main loop of rule induction is presented in Algorithm 3. It starts with the initialization of two sets:
, which is the set of generated rules, and , which represents the set of examples not yet
covered by generated rules (lines 1,2). The algorithm generates all rules for a single label, and only after
all examples with that label are covered, it continues on to the next label. This order is represented by
the for loop (lines 3-9) iterating over labels and the while loop (lines 4-8) iterating until every example
with the current label is covered. In the inner loop, the rules for the current label are generated via the
 function (line 5, see Algorithm 4) and appended to the  set (line 6). Finally, the
 set is updated by removing all the examples covered by the newly created rule (line 7). This
initial step generates a set of rules that needs to be optimized. This optimization is performed via the
 (line 10, see Algorithm 5) and   (line 11, see Algorithm 6) procedures.
Algorithm 3 Rule induction
1:  ← ∅
2:  ← all examples
3: for  ∈ [
            <xref ref-type="bibr" rid="ref1">1, 0</xref>
            ] do
4: while there is example with  in  do
5:  ← (, )
6:  ←  ∪ 
7: Remove examples covered by  from 
8: end while
9: end for
10: ()
11:  ()
◁ Alg. 4
◁ Alg. 5
◁ Alg. 6
          </p>
          <p>Let us now describe the  function presented in Algorithm 4. The function starts with
initializing the new rule with an always-true condition (line 2). This rule’s condition is next updated
with additional inequalities (conjunctions) until it covers only examples with the current  (while
loop in lines 3-6). The new inequalities are created in line 4. For simplicity, we do not provide the
pseudocode, as it would take a lot of space, however, we describe this process in the following. At this
step, multiple cuts (inequalities) are generated and the one with the best score is chosen. The score is
computed as purity (fraction of covered examples with the correct ) multiplied by a large weight
(109 in our implementation) plus the number of the covered examples. This means that in case two
of the cuts have the same purity, we prefer the cut covering more examples. The possible cuts are
generated for every feature, for every inequality operand (either ≤ or &gt;) and for a subset of distinct
values of the current feature (see [32]).</p>
          <p>The best cut algorithm returns the feature, operand, and threshold combination with the highest
score (which are appended to the rule in line 5) or indicates that no such result can be generated. In the
latter case, the rule creation is finalized and the resulting rule is returned (line 7).</p>
          <p>Algorithm 4 Rule creation
1: function (, )
2:  ← "if true then "
3: while  is not pure in  do
4: , , ℎ ← best cut; break if not found
5: add condition (, , ℎ) to 
6: end while
7: return 
8: end function</p>
          <p>After rules covering every example are generated, they must be optimized. The first optimization step
is rule aggregation. We greedily aggregate rules while they still remain pure. This optimization step is
presented in Algorithm 5. The algorithm starts with the initialization of the boolean  variable
which indicates whether the set of rules changed in the last iteration of the while loop (lines 3-15). This
loop iterates until no change is introduced to the rule set. While in this loop, for each combination of
two rules 1 and 2 with the same label (lines 5-14), we compute their union (line 7) and verify whether
the union remains pure (line 8). If it does, than the rules 1 and 2 are replaced by their union in the
rule set (line 9). The rule union is performed by taking the minimum of the two lower bounds and the
maximum of the two upper bounds for every feature. In case some of the bounds are not defined, they
are treated as −∞ or ∞.</p>
          <p>The final rule optimization step is the removal of redundant rules, presented in Algorithm 6. The
algorithm starts with the initialization of the new rule set (line 2). Next, each rule, generated so far, is
checked for redundancy (for loop in lines 3-10). This is done by finding the set of examples covered
by the rule and the set of examples covered by all other rules with the same label (lines 4-6). If the set
of examples covered by the rule is not a subset of the examples covered by other rules (line 7) then it
is not redundant and is stored in the new, final, rule set (line 8). The set with non-redundant rules is
returned in line 11.</p>
          <p>Algorithm 6 Rule pruning
1: procedure  ()
2:  ← ∅
3: for  ∈  do
4:  ← rules ̸=  with  of 
5:  ← examples covered by 
6:  ← examples covered by 
7: if  ̸⊆   then
8:  ←  ∪ {}
9: end if
10: end for
11:  ← 
12: end procedure</p>
          <p>We now return to the assumption regarding the uniqueness of values in attribute  of relation .
In general, this assumption does not hold. Consequently, multiple rows in  may contain values in
 that correspond to a single value in attribute . If only a subset of these rows was selected when
the relation  was created, it may result in the inclusion of incorrect rows in the dataset used for rule
generation. In order to avoid this problem, examples corresponding to non-unique  values should be
pruned beforehand.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Limitations and future work</title>
        <sec id="sec-3-5-1">
          <title>In this section, we discuss the limitations of the proposed solution.</title>
          <p>The first limitation is related to the possible ambiguity when finding the linear parameters  and
. Consider a situation where relation  has 5 rows with  values in the set {−4, −1, 0, 1, 4} , while
relation  has 3 rows with  values in the set {1, 2, 3}. In such a case, the following functions would
pass verification:
•  =  + 2
•  = − + 2
•  = 0.25 + 2
•  = −0.25 + 2
The problem is that there are several mappings of  values to  values such that the relative distances
between consecutive  and consecutive  values are proportional. Algorithm 1 aborts the search
after the first ,  pair passes verification successfully. However, it can be easily modified to continue
computations until all possibilities are exhausted. In order to determine which function is the original
one (in case many are found), expert analysis of the solution, as well as the reconstructed selection
conditions, might be needed. In the future, we plan on devising automatic methods that solve this
problem.</p>
          <p>The second limitation stems from the problem already mentioned in Section 3.4. If values in attribute
 are not unique, then there might be a problem with choosing exemplary rows for rule induction. We
suggested to remove all rows corresponding to duplicate  values from the training dataset. However,
if too many examples are removed, the reconstruction of selection conditions might become impossible.
In the future, we plan to investigate methods for preserving some of the examples that would otherwise
be removed.</p>
          <p>The third limitation stems from the complexity of Algorithm 1, which requires (||2| | log(||) in
the worst case, i.e., in case where most verifications fail after checking a large portion of the  array
and there is no solution, so all possibilities need to be exhausted. However, in the experiments we
noticed that verifications abort quickly, so the complexity is actually lower. Moreover, as mentioned
in Section 3.2, it is suggested to first perform the Kolmogorov-Smirnov test to determine whether
the solution can possibly exist before actually executing this algorithm. Finally, since many of the
computations are independent, it is possible to parallelize Algorithm 1. This will be the subject of our
work in the near future.</p>
          <p>
            Unfortunately, even if a solution exists, numerous iterations of the outer loop in Algorithm 1 might be
needed before it is found. To solve this problem, in our future work we plan on adapting the approach
we presented in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] to roughly estimate  and . Based on these estimations, it is possible to pinpoint the
corresponding indices 1 and 2 used in Algorithm 1, and start the search in their neighborhood. This
will allow us to find a solution faster if it exists, with dismissible degradation of performance otherwise.
          </p>
          <p>The final limitation of our approach stems from the assumed selection condition . As mentioned in
Section 3.2, condition  is represented in disjunctive normal form, where each clause is a conjunction
of literals, and each literal is an inequality of the form {≤, &gt;} , where  belongs to the domain of
. However, it is easy to imagine conditions that involve multiple attributes in a comparison, e.g.,
1 + 2 &lt; 3. Unfortunately, the adopted rule induction algorithm cannot discover such conditions.
The final topic of our future work is therefore designing a new algorithm for selection condition
reconstruction, capable of finding conditions that compare linear combinations of attribute values with
an arbitrary threshold.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To validate the proposed algorithms for slope and intercept estimation (Algorithm 1) and the
reconstruction of selection conditions (Algorithm 3), we performed a series of experiments using synthetic
data to simulate various transformation scenarios and logical conditions.</p>
      <p>We begin with the slope and intercept estimation. As Algorithm 1 requires only attributes  and
, only these were generated. The data generation process begins by initializing a source set  with
the given number of values drawn from a uniform distribution, which are then sorted in ascending
order. A derived subset,  , of cardinality  is then constructed through a sampling process that strictly
enforces the inclusion of elements at both the lower bound index 1 and the upper bound index 2. The
remaining  − 2 values are drawn randomly without replacement from the index interval (1, 2).
Following the selection, a linear transformation defined by the slope and intercept is applied to the
elements of  . Subsequently, both datasets are independently shufled to eliminate the ordinal structure
introduced during the initial phase.</p>
      <p>As mentioned in Section 3.5, the complexity of Algorithm 1 with verification using Algorithm 2 is
(||2| | log(||)). The ||2 term stems from the nested loops in Algorithm 1, while the | | log(||)
term represents the complexity of verification. Regarding the nested loops in Algorithm 1, the inner for
loop in almost every iteration of the outer while loop is executed completely. Thus, its processing time
depends primarily on || and the verification cost. Total processing time depends on the number of
iterations of the outer loop. In the worst case, this number is equal to ||, which yields the
aforementioned worst-case theoretical complexity. However, if a result is found earlier, the loop terminates. In
fact, the number of iterations of the outer loop depends on the position in the sorted  attribute (array
) of the match for the first value in  attribute (1 variable in Algorithm 1). This position corresponds
to the 1 parameter mentioned in the previous paragraph.</p>
      <p>In order to verify the above analysis, we performed two experiments. In the first experiment we
tested performance of the inner loop. We varied || in range 10000, 15000, . . . , 50000 and | | was set
to 25%, 50%, and 75% of ||. The measured average processing times of the inner loop are shown in
Fig. 1. As can be observed, the processing time of the inner loop is linear w.r.t. the size of , but the
size of  does not influence it significantly. This is mainly due to the early exit mechanism during
verification. In our experiments, the verification for incorrect  and  failed almost immediately (after
ifrst or second  value to verify). This is confirmed by the results in Fig. 2, where we varied | | for
several distinct values of || and 1, observing that the total processing time (not just the average inner
loop time) remained constant.</p>
      <p>The second experiment tested how the position 1 influences the total processing time for constant ||.
We performed the experiment for three distinct sizes of  equal to 10000, 30000, and 50000 respectively.
Regardless of the size of , | | was kept constant and equal to 1000. For each series, we varied 1 in the
range [0; || − | |] with a step size of 2500 (adjusting the last step to account for | |). The results are
presented in Fig. 3. Since 1 essentially determines the number of iterations of the outer loop, and the
inner loop processing time is constant for constant ||, the dependency is essentially linear. However,
one can easily see the pyramid-like shape of the curves in the figure. This shape can be easily explained
with the outside-in traversal order of the outer loop. Given this order, extreme values of 1 are processed
ifrst (the least number of iterations, the shortest processing times), while values near the middle are
processed last (the highest number of iterations, the longest processing times).
1  &lt; 25 ∨  &gt; 58  ≤ 25.04 ∨  &gt; 57.95
2  &gt; 16 ∧  &gt; 70  &gt; 16 ∧  &gt; 70.01
3 ( &lt; 26 ∧  &gt; 80) ∨  &gt; 50 ( ≤ 26.03 ∧  &gt; 80) ∨  &gt; 50.01
4 ( &lt; 26 ∧  &gt; 77) ∨ ( &gt; 50 ∧  &lt; 15) ( ≤ 25.95 ∧  &gt; 77.02) ∨ ( &gt; 49.96 ∧  ≤ 15)
5  &lt; 26 ∨  &gt; 77 ∨  &gt; 50 ∨  &lt; 15  ≤ 26 ∨  &gt; 77.38 ∨  &gt; 49.57 ∨  ≤ 15
6 ( &lt; 26 ∨  &gt; 77) ∧ ( &gt; 50 ∨  &lt; 15)
7  +  &gt; 70
8  *  &lt; 1000
( ≤ 25.92 ∧  &gt; 49.99) ∨ ( &gt; 77.03 ∧  ≤ 15.04)
∨ ( ≤ 26 ∧  ≤ 15.04) ∨ ( &gt; 77.03 ∧  &gt; 51.27)
A set of 38 disjuncts with a total of 77 inequalities
A set of 306 disjuncts, two inequalities each
Decision
✓
✓
✓
✓
✓
✓
✗
✗</p>
      <p>We now discuss the evaluation of the selection condition reconstruction algorithm (Section 3.4). The
data generation procedure was designed to simulate a real-world scenario where a target relation  is
derived from a source relation  via the selection of tuples satisfying a specific condition, followed by
the projection of a chosen attribute using a linear function. For each test, a source relation  consisting
of  = 10000 tuples was generated. The number of attributes  in relation  varied from 3 to 5,
depending on the complexity of the tested condition. The structure of relation  was defined as a set of
attributes {1, 2, . . . , −1 , }, where  act as conditional attributes, and  is the attribute subject to
linear transformation. Attribute values were generated randomly according to a uniform distribution.
For conditional attributes , the value range was set to [−100, 100) , while for the attribute , the range
was [−1, 1) . The target relation  was created in two steps: (1) selection of tuples from relation 
satisfying a defined condition, and (2) projection and transformation - for each selected tuple, the value
of attribute  was transformed into value  using a linear function.</p>
      <p>To validate the efectiveness of the discussed method, in Table 1 we provide example outcomes by
contrasting the actual logical conditions with those recovered by the proposed approach. For clarity,
attributes in the table are denoted by letters , , , . . . rather than the  notation.</p>
      <p>For selection conditions that adhere to the disjunctive normal form, the algorithm demonstrates very
high efectiveness. The algorithm correctly identifies the logical structure of the query, including nested
conjunctions (∧) and disjunctions (∨). This is evident in row 6, where the complex logical structure
was preserved in the discovered condition. Slight deviations in numerical values (e.g., 25 in the original
vs. 25.04 in the discovered condition in row 1) result from the nature of the algorithm, which searches
for "best cuts" between existing values in the dataset, as well as from floating-point precision issues.
These diferences are negligible and do not alter the semantics of the selection. Finally, the correctness
of the recovered logical conditions serves as a validation for the first phase of our method, proving that
the slope and intercept were successfully identified.</p>
      <p>The last two cases (rows 7 and 8) illustrate the algorithm’s behavior when encountering conditions that
involve multiple attributes simultaneously. In these scenarios, the algorithm attempts to approximate the
complex decision boundaries using a large number of disjuncts (and inequalities). While these selection
conditions may be technically correct, they are rather useless for human interpretation or meaningful
lineage tracking. However, it is crucial to emphasize, that these scenarios fall outside the scope of
the problem definition established in Section 3.1. The initial assumptions explicitly constrained the
selection conditions to disjunctive normal form. Therefore, rows 7 and 8 serve primarily to demonstrate
the necessity of these assumptions rather than a failure of the algorithm within its intended domain.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Summary</title>
      <p>In this paper, we present an algorithm capable of discovering relations that were created via selection
and projection from source relations. The algorithm determines the linear function used to transform
an attribute as well as reconstructs the selection condition used.</p>
      <p>1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
|T|
5000 10000 15000 20000 25000 30000 35000 40000 45000 50000
i1 - a position of the first match from T in sorted S</p>
      <p>Experimental results confirm the high accuracy of the reconstructed selection conditions (see Table 1),
and consequently the precise recovery of the linear function’s slope and intercept.</p>
      <p>In future work, we plan to advance our research in two directions: (1) improving the performance
of the linear function discovery algorithm, and (2) expanding the complexity of recoverable selection
conditions. For details on future work, please refer to Section 3.5.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors used Gemini 3 solely for grammar and spelling checks; promising content returned by the
LLM was reviewed and rewritten as needed.
[31] J. W. Grzymala-Busse, LERS-A System for Learning from Examples Based on Rough Sets, Springer,
1992.
[32] U. M. Fayyad, K. B. Irani, Multi-interval discretization of continuous-valued attributes for
classification learning, in: Int. Joint Conf. on Artificial Intelligence, 1993.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mucci</surname>
          </string-name>
          ,
          <article-title>What is data provenance?</article-title>
          , https://www.ibm.com/topics/data-lineage,
          <source>n.d. IBM Think documentation; accessed Jan</source>
          ,
          <year>2026</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Lanter</surname>
          </string-name>
          ,
          <article-title>Design of a lineage-based meta-data base for gis</article-title>
          ,
          <source>Cartography and Geographic Information Systems</source>
          <volume>18</volume>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] What is data lineage?</article-title>
          , https://www.ibm.com/topics/data-lineage,
          <source>n.d. IBM documentation; accessed Jan</source>
          ,
          <year>2026</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <article-title>Lineage tracing for general data warehouse transformations</article-title>
          ,
          <source>VLDB Journal 12</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y. R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Madnick</surname>
          </string-name>
          ,
          <article-title>A polygen model for heterogeneous database systems: The source tagging perspective</article-title>
          ,
          <source>in: Int. Conf. on Very Large Data Bases (VLDB)</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kitagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Amagasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matono</surname>
          </string-name>
          ,
          <article-title>Augmented lineage: traceability of data analysis including complex UDF processing</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>32</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Andrzejewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Boiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wrembel</surname>
          </string-name>
          ,
          <article-title>On fixing broken lineage</article-title>
          , in: ProvenanceWeek @SIGMOD,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Andrzejewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grocholewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gruszczyński</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. Wrembel,</surname>
          </string-name>
          <article-title>Leveraging machine learning techniques for discovering broken lineage links between database objects</article-title>
          ,
          <source>in: Information Systems Development (ISD)</source>
          ,
          <year>2025</year>
          . URL: https://doi.org/10.62036/ISD.
          <year>2025</year>
          .
          <volume>109</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          , W.-C. Tan,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Vijayvargiya, DBNotes: a post-it system for relational databases based on provenance</article-title>
          ,
          <source>in: Int. Conf. on Management of Data (SIGMOD)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Amsterdamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance for aggregate queries</article-title>
          ,
          <source>in: ACM SIGMODSIGACT-SIGART Symposium on Principles of Database Systems (PODS)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Silvello, Data provenance for attributes: attribute lineage</article-title>
          ,
          <source>in: USENIX Conf. on Theory and Practice of Provenance</source>
          , TAPP, USENIX Association,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Annotated</surname>
            <given-names>XML</given-names>
          </string-name>
          :
          <article-title>queries and provenance</article-title>
          ,
          <source>in: ACM SIGMODSIGACT-SIGART symposium on Principles of Database Systems (PODS)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karvounarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>Provenance semirings</article-title>
          ,
          <source>in: ACM SIGACT-SIGMODSIGART Symposium on Principles of Database Systems (PODS)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tannen</surname>
          </string-name>
          ,
          <article-title>The semiring framework for database provenance</article-title>
          ,
          <source>in: ACM SIGMODSIGACT-SIGAI Symposium on Principles of Database Systems (PODS)</source>
          , ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          ,
          <article-title>Tracing the lineage of view data in a warehousing environment</article-title>
          ,
          <source>ACM Transactions on Database Systems</source>
          <volume>25</volume>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kashliev</surname>
          </string-name>
          ,
          <article-title>Storage and querying of large provenance graphs using NoSQL DSE</article-title>
          ,
          <source>in: Int. Conf. on Big Data Security on Cloud, IEEE Int. Conf. on High Performance and Smart Computing, and IEEE Int. Conf. on Intelligent Data and Security</source>
          (BigDataSecurity/HPSC/IDS), IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ikeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          ,
          <article-title>Data lineage: A survey</article-title>
          , Stanford University Publications (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          , Provenance and Probabilities in Relational Databases,
          <source>SIGMOD Record 46</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stamatogiannakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bos</surname>
          </string-name>
          ,
          <article-title>Looking inside the black-box: Capturing data provenance using dynamic instrumentation</article-title>
          ,
          <source>in: Int. Provenance and Annotation Workshop on Provenance and Annotation of Data and Processes</source>
          , Springer-Verlag,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. a. F.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Murta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Braganholo</surname>
          </string-name>
          ,
          <article-title>A survey on collecting, managing, and analyzing provenance from scripts</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>52</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          , G. Simonelli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Capturing and querying fine-grained provenance of preprocessing pipelines in data science</article-title>
          ,
          <source>VLDB Endowment 14</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <article-title>Enabling useful provenance in scripting languages with a human-in-the-loop</article-title>
          , in: Workshop on
          <article-title>Human-In-the-Loop Data Analytics (HILDA) @</article-title>
          SIGMOD,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Futrelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>McGrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <article-title>The Open Provenance Model: An Overview, in: Provenance and Annotation of Data and Processes (IPAW)</article-title>
          , volume
          <volume>5272</volume>
          <source>of LNISA</source>
          , Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <article-title>The W3C PROV family of specifications for modelling provenance metadata</article-title>
          ,
          <source>in: Int. Conf. on Extending Database Technology (EDBT)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>PROV-Overview</surname>
          </string-name>
          , https://www.w3.org/TR/prov-overview/,
          <year>2013</year>
          . W3C Working Group documentation;
          <source>accessed Jan</source>
          ,
          <year>2026</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Q. T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          , Query reverse engineering,
          <source>The VLDB Journal</source>
          <volume>23</volume>
          (
          <year>2014</year>
          )
          <fpage>721</fpage>
          -
          <lpage>746</lpage>
          . doi:
          <volume>10</volume>
          .1007/s00778-013-0349-3.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bodik</surname>
          </string-name>
          ,
          <article-title>Interactive query synthesis from input-output examples</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>1631</fpage>
          -
          <lpage>1634</lpage>
          . doi:
          <volume>10</volume>
          .1145/ 3035918.3058738.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Orvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Terra-Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Manquinho</surname>
          </string-name>
          ,
          <article-title>SQUARES: a SQL synthesizer using query reverse engineering</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>2853</fpage>
          -
          <lpage>2856</lpage>
          . doi:
          <volume>10</volume>
          .14778/ 3415478.3415492.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Niblett</surname>
          </string-name>
          ,
          <article-title>The cn2 induction algorithm</article-title>
          ,
          <source>Machine Learning</source>
          <volume>3</volume>
          (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <article-title>Rule induction and instance-based learning a unified approach</article-title>
          ,
          <source>in: Int. Joint Conf. on Artificial Intelligence -</source>
          Volume
          <volume>2</volume>
          , Morgan Kaufmann,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>