<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>BUNNI: Discover Repair Actions for Data Cleaning (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giansalvatore Mecca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Papotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donatello Santoro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enzo Veltri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Biot</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi della Basilicata (UNIBAS)</institution>
          ,
          <addr-line>Potenza</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>33</volume>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This work tackles the challenging and open problem of involving non-expert users as first-class participants in the data-repairing process. While numerous approaches have been proposed to clean data from the perspective of expert users, such as IT professionals and data scientists, there is a significant lack of studies focused on non-expert users. Given a set of predefined data quality rules, we employ machine learning techniques to guide users in identifying dirty values for each violation and repairing them. Our approach minimizes user effort while effectively producing trustworthy data repairs. Through experimental evaluation, we demonstrate that this machine-learning-driven method consistently produces a unique clean solution with high quality in scenarios where other traditional approaches fail.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Cleaning</kwd>
        <kwd>Repair Discovery</kwd>
        <kwd>Human In The Loop</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data cleaning, or data repairing, is the process of detecting and removing errors from data. It is a
crucial preprocessing step in many ML pipelines for different tasks, such as medical predictions [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ],
Text-To-SQL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or TNLI [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applications. Over the past two decades, significant research has
been focused on the data cleaning problem using rule-based approaches [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15</xref>
        ] and data imputation [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. In a rule-based approach, users define upfront data-quality
rules, typically expressed as integrity constraints, and the system enforces these rules over the
data.
      </p>
      <p>
        Rule-based approaches usually rely on machine learning [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] or on heuristics to generate
solutions [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref19">10, 11, 12, 13, 19, 14</xref>
        ]. These heuristics may embed quite strong assumptions (for example,
always repairing the input instance according to the right-hand side of the rules), which may lead to
unsatisfactory results, especially since it is infeasible for the user to manually check and fix
all the generated solutions.
      </p>
      <p>
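        Bunni [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] aims at solving the problem of semi-automatic repair discovery by involving the
user in the generation of a unique solution for a given set of rules. It uses a machine-learning
interactive framework for repair discovery. Starting from a constant Conditional Functional
Dependency (CFD) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the system first collects some training data by asking the user to solve
a few violations. After a training step, the system automatically infers whether a rule should be
enforced over a tuple in violation with a repair over the right-hand side (RHS), or conclusion,
of the rule, or with a repair over the left-hand side (LHS), or premise, of the same rule.
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>Dirty instance with publications scraped from the web.</p></caption>
        <table>
          <thead>
            <tr><th/><th>Title</th><th>Author</th><th>Journal</th><th>Vol</th><th>Year</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Possible . . . SQL</td><td>H. Kohler</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>2</td><td>Possible . . . SQL</td><td>U. Leck</td><td>VLDB J.</td><td>25</td><td>2015</td></tr>
            <tr><td>3</td><td>Possible . . . SQL</td><td>S. Link</td><td>VLDB J.</td><td>25</td><td>2015</td></tr>
            <tr><td>4</td><td>Possible . . . SQL</td><td>X. Zhou</td><td>VLDB J.</td><td>25</td><td>Aug.</td></tr>
            <tr><td>5</td><td>RDF . . . survey</td><td>Z. Kaoudi</td><td>VLDB J.</td><td>25</td><td>2015</td></tr>
            <tr><td>6</td><td>RDF . . . survey</td><td>I. Manolescu</td><td>VLDB J.</td><td>24</td><td>2015</td></tr>
            <tr><td>7</td><td>A . . . Fragmentation</td><td>V. Braganholo</td><td>VLDB J.</td><td>25</td><td>2014</td></tr>
            <tr><td>8</td><td>A . . . Fragmentation</td><td>M. Mattoso</td><td>SIGMOD Rec.</td><td>43</td><td>2014</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>The instance of Table 1 after the repairs made by the user.</p></caption>
        <table>
          <thead>
            <tr><th/><th>Title</th><th>Author</th><th>Journal</th><th>Vol</th><th>Year</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Possible . . . SQL</td><td>H. Kohler</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>2</td><td>Possible . . . SQL</td><td>U. Leck</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>3</td><td>Possible . . . SQL</td><td>S. Link</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>4</td><td>Possible . . . SQL</td><td>X. Zhou</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>5</td><td>RDF . . . survey</td><td>Z. Kaoudi</td><td>VLDB J.</td><td>24</td><td>2015</td></tr>
            <tr><td>6</td><td>RDF . . . survey</td><td>I. Manolescu</td><td>VLDB J.</td><td>24</td><td>2015</td></tr>
            <tr><td>7</td><td>A . . . Fragmentation</td><td>V. Braganholo</td><td>SIGMOD Rec.</td><td>43</td><td>2014</td></tr>
            <tr><td>8</td><td>A . . . Fragmentation</td><td>M. Mattoso</td><td>SIGMOD Rec.</td><td>43</td><td>2014</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Example 1: Consider the instance in Table 1 with publications scraped from the web. Consider
CFD c1, which states that whenever the Journal is VLDB J. and the Volume is 25, the Year must be
2016. Errors w.r.t. c1 are highlighted in red.
      </p>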
      <p>c1: t[Journal]=VLDB J., t[Vol]=25 → t[Year]=2016</p>
      <p>Assume a user needs to manually enforce c1 over the data. She inspects a few of the tuples in
violation (t2, t3, t4, t5 and t7 in our example) and first updates t2[Year] to 2016 because, to her
knowledge (or after some manual search), the paper was published in the VLDB J., volume 25,
in the year 2016.</p>
      <p>Then, she turns to tuple t3. The user already knows how to repair this violation: by
updating t3[Year] to 2016. The reason lies in the data she is seeing: Title, Journal
and Volume have the same values as in the previously solved error, and moreover the Year has the
same (wrong) value. In her mind, it is more likely that in tuple t3 the Year is erroneous rather than
the Journal or the Volume. The same holds for tuple t4.</p>
      <p>Now the user inspects t5. In this case, she is uncertain about how to solve the violation because, even if
Journal and Volume are the same as in the previous repairs, the Title is different, so she again relies
on her knowledge to decide how to repair. She decides to update the Volume
to 24 instead of changing the Year. Finally, the user inspects tuple t7. Again, she is
uncertain about how to repair the data and needs her knowledge to remove the violation.
This time she decides to update the Journal and the Volume because the paper was
published in SIGMOD Rec., volume 43, in the year 2014.</p>
      <p>This leads to the cleaned instance in Table 2, where changes by the user are highlighted in
green.</p>
      <p>From Example 1, one may observe that there might exist different reasons for removing the
errors even for the same rule. In some cases, the values within the tuple give us a clear idea of
how to repair the violation, as in t4 above. This means that the user “trusts” some of the values
and “distrusts” others; the data-quality rule, along with the trusted values, provides evidence on
how to correct the error. In other cases, it is less clear which values to trust within
the tuple, and we need to rely on the user to investigate the error further and ultimately solve
it, as in tuples t2 and t7.</p>
      <p>Bunni mimics this form of “human thinking”, trying to learn which values of the data can be
trusted and which cannot, and involving humans only in the “hard decisions”. Bunni minimizes
the number of user interactions, asking only critical questions. The workflow of the system is
depicted in Figure 1.</p>
      <p>The user submits a CFD cfd over instance ℐ (1). Bunni finds the violations w.r.t. cfd, selects one of
them, i.e., a tuple t, and tries to solve the violation (2). If Bunni finds out that the cells of t matching
the LHS of the rule can be trusted with a confidence above a fixed threshold, it applies the rule
to repair the tuple with respect to the RHS (3a, the certain case). Otherwise, if Bunni is uncertain, i.e., if the rule
cannot be trusted, it asks the user to manually solve the violation (3b, the uncertain case):
the user manually updates the tuple, and Bunni uses this new evidence to refine its learning
model. The loop over steps 2 and 3 terminates when no more tuples are in violation. Then
the user goes back to step 1 if she has other rules to submit.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Constraints and Repairs</title>
      <p>
        We use the semantics of Conditional Functional Dependencies (CFDs) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Among these, we
concentrate on constant CFDs [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in normal form, i.e., such that the conclusion contains only
one atom, as follows:
      </p>
      <p>cfd: A1=c1, A2=c2, . . . , An=cn → Am=cm</p>
      <p>
        where each Ai is an attribute of a given relation in the schema and ci is a constant
value for Ai. CFDs with multiple atoms in the conclusion can be rewritten as a set of CFDs in
normal form, one for each atom in the conclusion [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
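      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption><p>An RHS repair for tuples t2 and t7 of Table 1.</p></caption>
        <table>
          <thead>
            <tr><th/><th>Title</th><th>Author</th><th>Journal</th><th>Vol</th><th>Year</th></tr>
          </thead>
          <tbody>
            <tr><td>2</td><td>Possible . . . SQL</td><td>U. Leck</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
            <tr><td>7</td><td>A . . . Fragmentation</td><td>V. Braganholo</td><td>VLDB J.</td><td>25</td><td>2016</td></tr>
          </tbody>
        </table>
      </table-wrap>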
      <p>Example 2: Consider again Table 1 and the CFD c1: t[Journal]=VLDB J., t[Vol]=25 →
t[Year]=2016. The cells for Journal, Volume and Year in tuple t1 are not in violation with
c1, while the cells for Journal, Volume and Year in tuple t7 are in violation. We can say that t1
satisfies c1, but t7 does not, so the instance ℐ in Table 1 does not satisfy c1.</p>
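      <p>As a rough illustration of the violation check in Example 2, the following minimal Python sketch tests a few rows of Table 1 against c1. The dictionary layout, the tuple ids used as keys, and the function names are assumptions made for this sketch and are not part of Bunni.</p>
      <preformat>
```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# A few rows of Table 1 (only the attributes used by c1), keyed by tuple id.
TABLE_1: Dict[str, Row] = {
    "t1": {"Journal": "VLDB J.", "Vol": "25", "Year": "2016"},
    "t2": {"Journal": "VLDB J.", "Vol": "25", "Year": "2015"},
    "t7": {"Journal": "VLDB J.", "Vol": "25", "Year": "2014"},
    "t8": {"Journal": "SIGMOD Rec.", "Vol": "43", "Year": "2014"},
}

# c1: t[Journal]=VLDB J., t[Vol]=25 → t[Year]=2016
C1_PREMISE: Dict[str, str] = {"Journal": "VLDB J.", "Vol": "25"}
C1_CONCLUSION: Tuple[str, str] = ("Year", "2016")


def violates(row: Row, premise: Dict[str, str], conclusion: Tuple[str, str]) -> bool:
    """True when the row matches every premise atom but not the conclusion."""
    attr, const = conclusion
    matches_premise = all(row[a] == c for a, c in premise.items())
    return matches_premise and row[attr] != const


def find_violations(instance: Dict[str, Row], premise: Dict[str, str],
                    conclusion: Tuple[str, str]) -> List[str]:
    """Ids of the tuples in violation, in the order they appear in the instance."""
    return [tid for tid, row in instance.items() if violates(row, premise, conclusion)]


violating = find_violations(TABLE_1, C1_PREMISE, C1_CONCLUSION)
```
      </preformat>
      <p>Tuple t8 does not match the premise of c1, so it is never reported, whatever its Year.</p>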
      <p>Given an instance ℐ and a set of CFDs Σ, we say that ℐ is clean if it satisfies all the CFDs in Σ; otherwise we
say that it is dirty. It is easy to see that ℐ is dirty whenever there exists some violation, i.e., a tuple
t (or cell values in t) such that t matches the premise of a CFD but does not match
its conclusion. Of course, a tuple t may be in violation with one or more CFDs.</p>
      <p>Intuitively we can use one or more CFDs to make sure that our data is clean or to find
violations.</p>
      <p>To clean a dirty instance ℐ we need to define a repair strategy. Given a set of rules Σ defined
over the instance ℐ, a repair strategy is the process of removing the violations w.r.t. Σ in ℐ such that
at the end of the process ℐ satisfies Σ.</p>
      <p>
        We repair cells with cell updates, i.e., we change the values of the cells in violation to clean
them [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Even with this intuitive approach, there are multiple ways to remove a violation for a
CFD cfd: 1) RHS Repair: we may change the values of cells corresponding to attributes on the
RHS of the cfd, i.e., attributes appearing in the conclusion; 2) LHS Repair: we may change the
values of cells corresponding to attributes in the LHS, i.e., in the premise.
      </p>
      <p>In the RHS Repair, we trust the values in the premise of the cfd and use the constant in the
conclusion to change the corresponding attribute and remove the violation. In the LHS Repair,
we trust the value in the conclusion of the cfd, and therefore one or more of the values appearing
within the premise needs to be changed.</p>
      <p>Example 3: Consider the instance in Table 1 and the CFD c1 from Example 2. The cells
t2[Journal], t2[Vol], t2[Year], t7[Journal], t7[Vol] and t7[Year] are in violation w.r.t. c1. Table 3 is
an example of an RHS repair: we changed the values of
Year since we trusted the values in the premise of c1. Table 4 combines the RHS and LHS repair
strategies. For tuple t2 we opted for an RHS repair; for tuple t7, on the contrary, we
chose an LHS repair, changing t7[Journal] to SIGMOD Rec. and t7[Vol] to 43.
Tables 5 and 6 are examples of LHS repairs only. In the former, we used values from the active
domain of the Journal and Volume attributes, while in the latter we used values outside of the
active domain.</p>
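      <p>The two repair strategies can be sketched in a few lines of Python. This is only an illustration of the cell updates described in Example 3 (an RHS repair for t2, an LHS repair for t7); the function names and the dictionary representation of a tuple are hypothetical and do not come from Bunni.</p>
      <preformat>
```python
from typing import Dict, Tuple

Row = Dict[str, str]


def rhs_repair(row: Row, conclusion: Tuple[str, str]) -> Row:
    """Trust the premise: overwrite the conclusion attribute with the CFD constant."""
    attr, const = conclusion
    repaired = dict(row)  # copy, so the dirty tuple is left untouched
    repaired[attr] = const
    return repaired


def lhs_repair(row: Row, new_premise_values: Dict[str, str]) -> Row:
    """Trust the conclusion: overwrite one or more premise attributes."""
    repaired = dict(row)
    repaired.update(new_premise_values)
    return repaired


t2 = {"Journal": "VLDB J.", "Vol": "25", "Year": "2015"}
t7 = {"Journal": "VLDB J.", "Vol": "25", "Year": "2014"}

# RHS repair of t2 with the constant in the conclusion of c1.
t2_fixed = rhs_repair(t2, ("Year", "2016"))
# LHS repair of t7 with the values chosen by the user in the example.
t7_fixed = lhs_repair(t7, {"Journal": "SIGMOD Rec.", "Vol": "43"})
```
      </preformat>
      <p>After the LHS repair, t7 no longer matches the premise of c1, so the rule does not apply to it and its Year is left unchanged.</p>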
      <p>
        Choosing the repair strategy ℛ and the clean values used to repair violations
in a complex data-repairing task is a crucial problem, because multiple repaired instances can
be generated, as shown by the previous example. Making decisions about the repair process is
usually difficult, both for humans and for machines. However, this problem can be alleviated.
      </p>
      <p>
        User Involvement: from Naïve to Interactive. Several studies have considered
the presence of humans in the repairing process [
        <xref ref-type="bibr" rid="ref23 ref24 ref25 ref26 ref27 ref28">23, 24, 25, 26, 27, 28</xref>
        ]. Early proposals have
mainly relied on what we might call a “naïve” way to involve users in the rule-based repairing
process, as follows.
      </p>
      <p>
        The user first uses her domain knowledge to write the constraints. Then she defines a strategy
for repair discovery ℛ, i.e., an algorithm to decide which cells are dirty for each tuple in
violation with Σ, and a strategy for value discovery, i.e., an algorithm to replace cells that
are deemed dirty with clean values; in some cases, multiple choices are available. At the end of
the cleaning process, one or multiple solutions are generated, depending on ℛ and on the
value-discovery strategy, and the user manually inspects all of them. If the value-discovery strategy uses variables [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], i.e., placeholders,
to remove violations, then the user needs to change each placeholder into the correct constant
value.
      </p>
      <p>In the end, the user selects the best solution among those generated. By examining the
solutions, she might find out that some of the errors in the data were not detected or repaired. In
this case, she needs to write additional rules that capture these errors and execute the process
again.</p>
      <p>In this “naïve” form, the user is involved exclusively at the beginning of the process, to
identify the constraints, and at the end, to inspect the generated solutions.</p>
      <p>Instead, Bunni uses a different protocol to involve the user in the repair process. It requires
the user to (i) inspect the cells that are wrong w.r.t. a constraint and (ii) resolve the violations by submitting
a value update.</p>
      <p>These value updates are then used by our learning algorithms to infer the intended repair
strategy, and the strategy to select clean values. In fact, by providing an update, the user 1)
indicates which cells need to be repaired, and indirectly indicates which cells should be trusted,
and 2) indicates the values to use to fix the errors. As a result, the system and the user generate
a single trusted repair, in which all choices taken during the repair process are validated directly
or indirectly by user actions.</p>
      <p>Framework for Repair Discovery. Given a constant CFD cfd and a violated tuple t, we say
that the set of cells in t corresponding to the attributes in the premise of the cfd can be trusted if
the probability that each cell is clean is above a threshold. More specifically, we fix two user-defined
thresholds, an upper threshold τ<sub>high</sub> and a lower threshold τ<sub>low</sub>. Then, for each cell involved in a violation, we estimate the
probability p that it is clean. We rule that the cell can be trusted if p &gt; τ<sub>high</sub>; we rule that the cell is
dirty if p &lt; τ<sub>low</sub>. We ask the user otherwise.</p>
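      <p>The three-way decision on a single cell can be sketched as follows. The names t_high and t_low, their default values 0.8 and 0.2, and the returned labels are assumptions chosen for this illustration; the text only states that the two thresholds are user-defined.</p>
      <preformat>
```python
def classify_cell(p_clean: float, t_high: float = 0.8, t_low: float = 0.2) -> str:
    """Three-way decision on one cell, given the estimated probability that it is clean.

    t_high and t_low are hypothetical names and values for the two thresholds.
    """
    if t_low > t_high:
        raise ValueError("the lower threshold must not exceed the upper one")
    if p_clean > t_high:
        return "trusted"
    if t_low > p_clean:
        return "dirty"
    # The classifier is undecided: the violation is handed to the user.
    return "ask_user"


decisions = [classify_cell(p) for p in (0.95, 0.50, 0.05)]
```
      </preformat>
      <p>Widening the gap between the two thresholds sends more cells to the user, which trades user effort for fewer automatic decisions.</p>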
      <p>In light of this, solving the Repair Discovery problem amounts to calculating the probability
that a cell is clean. More specifically, it consists of solving a classification problem where the
classifier outputs both the prediction of whether a cell is clean or dirty and the probability
of that prediction, which is then compared against the two user-defined thresholds.</p>
      <p>To handle the case in which more than one cell of a tuple t may be dirty, we extend the binary
classification model to a multilabel classification problem. In practice, a multilabel classifier is
based on a set of n binary classifiers, one for each outcome, where n is the number of
possible values for the outcome. Then, given a threshold τ and a tuple t to classify, the response
consists of the predicted outcomes of all the binary classifiers for which the probability that
the hypothesis is true is above τ. In our case, we generate a classifier for each
attribute in the premise of the CFD, and we compare its output probability against the user-defined thresholds.</p>
      <p>Figure 2 illustrates the detailed workflow of Bunni. The process begins with a CFD cfd: the
system identifies all the tuples in violation w.r.t. cfd. If no violations are found, it moves on to the next CFD.
Otherwise, it selects a number of tuples to be manually labeled by the user. These manually annotated
tuples are then used to train a classifier, which helps repair the other tuples that violate cfd.</p>
      <p>Once the classifier is initialized, if there are still tuples in violation, Bunni selects a tuple t using
the pickOne function, which we discuss next. If the classifier determines that the cells in
t related to the premise of cfd can be trusted, i.e., their probability exceeds the upper threshold, it applies the
RHS update using the value in the conclusion of cfd. If the classifier detects that some
cells in t related to the premise of cfd cannot be trusted, i.e., their probabilities fall below the lower threshold,
Bunni applies an LHS update. If neither an RHS nor an LHS repair is
feasible, meaning the classifier is in an indecision area, Bunni asks the user to manually clean
the affected cells. These manual updates are then used to retrain the classifier. This process
repeats until no tuples in violation remain; then the system moves on to the next CFD.</p>
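      <p>The control flow of this loop can be sketched as below. This is a simplified illustration under several assumptions, not the authors' implementation: the classifier is replaced by a lookup of made-up probabilities, tuples are taken in list order instead of being chosen by pickOne, the retraining step is reduced to a hypothetical callback, and the threshold names and values are invented for the example.</p>
      <preformat>
```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, List[str]]


def plan_action(premise_probs: Dict[str, float], t_high: float, t_low: float) -> Action:
    """Map the clean-probabilities of the premise cells to a repair action."""
    if all(p > t_high for p in premise_probs.values()):
        return "rhs_repair", []           # premise trusted: fix the conclusion
    suspects = [attr for attr, p in premise_probs.items() if t_low > p]
    if suspects:
        return "lhs_repair", suspects     # some premise cells are deemed dirty
    return "ask_user", []                 # indecision area


def process_violations(violations: List[str],
                       predict: Callable[[str], Dict[str, float]],
                       on_user_update: Callable[[str], None],
                       t_high: float = 0.8,
                       t_low: float = 0.2) -> Dict[str, Action]:
    """Process tuples until none is pending; return the action taken for each id."""
    pending = list(violations)
    actions: Dict[str, Action] = {}
    while pending:
        tid = pending.pop(0)              # stand-in for the pickOne function
        action = plan_action(predict(tid), t_high, t_low)
        if action[0] == "ask_user":
            on_user_update(tid)           # hypothetical hook: user fix + retraining
        actions[tid] = action
    return actions


# Made-up probabilities standing in for the output of a trained classifier.
FAKE_PROBS = {
    "t3": {"Journal": 0.95, "Vol": 0.90},
    "t5": {"Journal": 0.90, "Vol": 0.50},
    "t7": {"Journal": 0.10, "Vol": 0.15},
}
asked: List[str] = []
result = process_violations(["t3", "t5", "t7"], FAKE_PROBS.get, asked.append)
```
      </preformat>
      <p>With these invented probabilities only t5 is sent to the user; t3 receives an RHS repair and t7 an LHS repair on both premise attributes.</p>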
      <p>
        The pickOne function uses Active Learning [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] to minimize user interactions. For each tuple
in violation, we calculate a score that measures the uncertainty of the classifier. Intuitively,
submitting to the user the tuple with the highest score, i.e., with the highest uncertainty,
improves the quality of the model faster. The score for each cell is s =
2 · p · q, where p and q are the probabilities estimated by the classifier that the cell is clean and dirty, respectively. The score for a tuple is the sum of the scores of the cells in the premise of the
CFD. This strategy is also used to select the tuples that should be manually labeled by the user.
      </p>
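      <p>A possible reading of this selection step is sketched below. It assumes that the cell score has the form 2 · p · (1 − p), where p is the estimated probability that the cell is clean; this form is an assumption of the sketch, chosen because it is largest when the classifier is most uncertain (p = 0.5). The function names and the probabilities are invented for the example.</p>
      <preformat>
```python
from typing import Dict


def cell_score(p_clean: float) -> float:
    """Uncertainty of one cell: 0 when the classifier is sure, 0.5 when p_clean is 0.5."""
    return 2.0 * p_clean * (1.0 - p_clean)


def tuple_score(premise_probs: Dict[str, float]) -> float:
    """Sum of the cell scores over the premise attributes of the CFD."""
    return sum(cell_score(p) for p in premise_probs.values())


def pick_one(candidates: Dict[str, Dict[str, float]]) -> str:
    """Id of the tuple in violation on which the classifier is most uncertain."""
    return max(candidates, key=lambda tid: tuple_score(candidates[tid]))


# Made-up clean-probabilities for the premise cells of three tuples in violation.
candidates = {
    "t3": {"Journal": 0.95, "Vol": 0.90},
    "t5": {"Journal": 0.90, "Vol": 0.50},
    "t7": {"Journal": 0.10, "Vol": 0.15},
}
chosen = pick_one(candidates)
```
      </preformat>
      <p>Here t5 is selected, because its Vol cell sits exactly at the point of maximum uncertainty.</p>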
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        We use three real-world datasets and a synthetic dataset. BUS is a UK government public dataset
with 15 attributes and 250K tuples. DBLP is based on the collection of authors, publications
and venues, with 15 attributes and 1M tuples. FLIGHT [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] is a real-world dataset related to
flight departures and arrivals. It contains real errors and is provided in both clean and dirty
versions. Synth is a dataset we designed starting from the original Soccer dataset, with 10
attributes and 100K tuples.
      </p>
      <p>
        Errors and Metrics. To compute the quality of a data repair algorithm we need two versions
of the same dataset: a dirty version with the errors and its clean version to compare with the
repaired generated solution. We use the BART error generation tool [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], which allows us to
introduce detectable errors w.r.t. a set of input CFDs by changing the cell values. BART allows
configuring how errors are injected, i.e., how many errors are introduced in the LHS/RHS of
the CFDs, and thus we also control how many errors are introduced for each tuple. We manually
define six normalized constant CFDs for each dataset, and for each dataset we generate two
different versions of dirty instances. We generate an easy (hard) scenario, where we inject
approximately 75% (50%) of the errors on the RHS of every CFD using random values (typo,
random value, value from the active domain) and the remaining errors on the LHS using values from the active
domain. As metrics we use: 1) the Quality of the repaired instances, measured with the F-Measure.
We do not measure the similarity [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] of the repaired instance w.r.t. the clean version since we
do not introduce Nulls or LLUNS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. 2) The Benefit w.r.t. manually solving all the errors, i.e.,
the percentage of saved user interactions. 3) The Effectiveness, i.e., the F-Measure between the
Quality and the Benefit. Using the effectiveness, we can compare different strategies considering
both Quality and Benefit. And 4) the Time spent to clean the whole instance, excluding the think
time of the user.
      </p>
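      <p>The relation between these metrics can be sketched as follows, reading the F-Measure between two values as their harmonic mean. The benefit formula (one minus the fraction of violations solved by hand) is an assumption of this sketch, as are the function names; the text only defines the benefit as the percentage of saved user interactions.</p>
      <preformat>
```python
def f_measure(a: float, b: float) -> float:
    """Harmonic mean of two values in [0, 1]; 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2.0 * a * b / (a + b)


def benefit(total_violations: int, user_updates: int) -> float:
    """Assumed definition: share of violations the user did not solve by hand."""
    if not total_violations > 0:
        raise ValueError("total_violations must be positive")
    return 1.0 - user_updates / total_violations


def effectiveness(quality: float, benefit_value: float) -> float:
    """F-Measure between Quality and Benefit."""
    return f_measure(quality, benefit_value)


# A fully automatic system has benefit 1.0, so its effectiveness
# is driven by its quality alone.
automatic = effectiveness(0.5, benefit(100, 0))
```
      </preformat>
      <p>Because the harmonic mean is dominated by the smaller of its two arguments, a system cannot reach a high effectiveness by excelling in only one of Quality and Benefit.</p>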
      <p>
        Interactive System Evaluation. We evaluate Bunni against HoloClean (HC) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], Raha and
Baran, and LLunatic [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Raha [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] is an automatic error detection system that uses user
updates to find cells with errors, while Baran [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] is an automatic data repair system that uses
user updates to infer cell repairs. We combine both systems (Raha + Baran) to infer errors in
the cells and repair such cells using user updates. LLunatic is a rule-based data-cleaning system
that can be configured with different strategies. To reduce the number of user updates, we
configured LLunatic with a similarity-based strategy to repair the data. It applies an RHS repair
if the value in the conclusion is similar to the one constrained by the CFD (using the Levenshtein
distance); otherwise, it applies an LHS repair. In case of multiple possible values in the LHS, it
makes use of placeholders (LLUNS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) that are interactively updated by the user.
For HoloClean, LLunatic and Bunni, we use the same CFDs to detect errors. We mine CFDs for
Flight using Metanome [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] with the CFDFinder discovering algorithm [
        <xref ref-type="bibr" rid="ref22 ref7">22, 7</xref>
        ]. We measure the
quality (Q), benefit (B), effectiveness (E), and total execution time (T) in seconds. For Raha and
Baran we use different budgets, i.e., numbers of user updates, and we report the best results in
terms of Effectiveness. We report the results in Table 7. For Raha and Baran, we ran into memory
errors on BUS and DBLP, so we do not report those results. We also report the average of
all the metrics over all datasets. The best results for each metric and the lowest total time are
shown in bold.
      </p>
      <p>For Raha and Baran, datasets with a large number of columns caused out-of-memory errors
on our machine. This is due to the number of clustering algorithms trained for each column.
Another issue is the initialization of the strategies for error detection, which takes considerable
time for an interactive system; however, such initialization is executed once and is skipped in
the other executions. HoloClean does not require any user update, so the benefit is always 1.00,
but the quality for DBLP and Synth is low, and thus the effectiveness is also low. Moreover,
like Raha and Baran, HoloClean takes considerable time to repair the dataset. LLunatic can
reach a good quality without requiring too much user effort, and it also requires a reasonable
time for an interactive system (ignoring the FLIGHT dataset). Finally, Bunni offers a good
trade-off when we require high-quality results (0.99 on average), while it also reduces the
user effort (benefit of 0.63 on average). The effectiveness of Bunni is the highest in this experiment,
while its average time is the lowest, showing that it is suitable for an interactive system.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>
        We presented Bunni, an interactive system for repair discovery. We demonstrated how the
system is able to interact with users in real-time to generate a solution with a user-defined
quality. We used a well-known machine learning technique, namely Naïve Bayes, to solve the
repair-discovery problem while minimizing the number of user interactions and computing time.
We showed how the Naïve Bayes method outperforms other approaches in terms of quality,
benefit, effectiveness, and execution time. A promising research direction in this respect
consists of using other Bayesian approaches with graphical models, like Bayesian Networks.
This would allow us to consider other factors like causality in the attributes, or semantic
correlation. Indeed, our Naïve Bayes classifier is a special case of a Bayesian Network. Another
aspect to investigate is the integration of Bunni with Language Models (LMs) such as BERT [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]
and T5 [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], or even tabular LMs [37], to automatically suggest the values for the LHS repair.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach.
Learn. Res. 21 (2020) 140:1–140:67.</p>
      <p>[37] G. Badaro, M. Saeed, P. Papotti, Transformers for Tabular Data Representation: A Survey of
Models and Applications, Transactions of the Association for Computational Linguistics 11
(2023) 227–249. URL: https://doi.org/10.1162/tacl_a_00544. doi:10.1162/tacl_a_00544.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lapadula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Solimando</surname>
          </string-name>
          , E. Veltri,
          <article-title>Humanity is overrated. or not. automatic diagnostic suggestions by greg, ml (extended abstract)</article-title>
          ,
          <source>Communications in Computer and Information Science</source>
          <volume>909</volume>
          (
          <year>2018</year>
          )
          <fpage>305</fpage>
          -
          <lpage>313</lpage>
          . doi:10.1007/978-3-030-00063-9_29.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lapadula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Solimando</surname>
          </string-name>
          , E. Veltri,
          <article-title>Greg, ml - machine learning for healthcare at a scale</article-title>
          ,
          <source>Health and Technology</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>1485</fpage>
          -
          <lpage>1495</lpage>
          . doi:10.1007/s12553-020-00468-9.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          , G. Badaro,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Data ambiguity profiling for the generation of training examples</article-title>
          ,
          in:
          <source>39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>450</fpage>
          -
          <lpage>463</lpage>
          . URL: https://doi.org/10.1109/ICDE55515.2023.00041. doi:10.1109/ICDE55515.2023.00041.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Bussotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Generation of training examples for tabular natural language inference</article-title>
          ,
          <source>Proc. ACM Manag. Data</source>
          <volume>1</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3626730. doi:10.1145/3626730.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Discovering denial constraints</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>6</volume>
          (
          <year>2013</year>
          )
          <fpage>1498</fpage>
          -
          <lpage>1509</lpage>
          . URL: http://www.vldb.org/pvldb/vol6/p1498-papotti.pdf. doi:10.14778/2536258.2536262.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Efficient discovery of similarity constraints for matching dependencies</article-title>
          ,
          <source>Data Knowl. Eng</source>
          .
          <volume>87</volume>
          (
          <year>2013</year>
          )
          <fpage>146</fpage>
          -
          <lpage>166</lpage>
          . URL: https://doi.org/10.1016/j.datak.2013.06.003. doi:10.1016/j.datak.2013.06.003.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Discovering conditional functional dependencies</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>23</volume>
          (
          <year>2011</year>
          )
          <fpage>683</fpage>
          -
          <lpage>698</lpage>
          . URL: https://doi.org/10.1109/TKDE.2010.154. doi:10.1109/TKDE.2010.154.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Discovering data quality rules</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>1</volume>
          (
          <year>2008</year>
          )
          <fpage>1166</fpage>
          -
          <lpage>1177</lpage>
          . URL: http://www.vldb.org/pvldb/vol1/1453980.pdf. doi:10.14778/1453856.1453980.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Karloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Discovering conservation rules</article-title>
          ,
          <source>in: IEEE 28th International Conference on Data Engineering (ICDE 2012)</source>
          , Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, IEEE Computer Society,
          <year>2012</year>
          , pp.
          <fpage>738</fpage>
          -
          <lpage>749</lpage>
          . URL: https://doi.org/10.1109/ICDE.2012.105. doi:10.1109/ICDE.2012.105.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <article-title>Cleaning data with Llunatic</article-title>
          ,
          <source>VLDB Journal</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>867</fpage>
          -
          <lpage>892</lpage>
          . doi:10.1007/s00778-019-00586-5.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Khayyat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiané-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Bigdansing: A system for big data cleansing</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          , Melbourne, Victoria, Australia, May 31 - June 4,
          <year>2015</year>
          , ACM,
          <year>2015</year>
          , pp.
          <fpage>1215</fpage>
          -
          <lpage>1230</lpage>
          . URL: https://doi.org/10.1145/2723372.2747646. doi:10.1145/2723372.2747646.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Towards dependable data repairing with fixing rules</article-title>
          ,
          <source>in: International Conference on Management of Data, SIGMOD 2014</source>
          , Snowbird, UT, USA, June 22-27, 2014, ACM,
          <year>2014</year>
          , pp.
          <fpage>457</fpage>
          -
          <lpage>468</lpage>
          . URL: https://doi.org/10.1145/2588555.2610494. doi:10.1145/2588555.2610494.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bohannon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Flaster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <article-title>A cost-based model and effective heuristic for repairing constraints by value modification</article-title>
          ,
          <source>in: Proceedings of the ACM SIGMOD International Conference on Management of Data</source>
          , Baltimore, Maryland, USA, June 14-16, 2005, ACM,
          <year>2005</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>154</lpage>
          . URL: https://doi.org/10.1145/1066157.1066175. doi:10.1145/1066157.1066175.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Towards certain fixes with editing rules and master data</article-title>
          ,
          <source>VLDB J</source>
          .
          <volume>21</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buoncristiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <article-title>Detective gadget: Generic iterative entity resolution over dirty data</article-title>
          ,
          <source>Data</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ). doi:10.3390/data9120139.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Breve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Caruccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Deufemia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Polese</surname>
          </string-name>
          ,
          <article-title>RENUVER: A missing value imputation algorithm based on relaxed functional dependencies</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Extending Database Technology, EDBT 2022</source>
          , Edinburgh, UK, March 29 - April 1, 2022, OpenProceedings.org,
          <year>2022</year>
          . doi:10.5441/002/EDBT.2022.05.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Enriching data imputation under similarity rule constraints</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>275</fpage>
          -
          <lpage>287</lpage>
          . doi:10.1109/TKDE.2018.2883103.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Holoclean: Holistic data repairs with probabilistic inference</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1190</fpage>
          -
          <lpage>1201</lpage>
          . URL: http://www.vldb.org/pvldb/vol10/p1190-rekatsinas.pdf. doi:10.14778/3137628.3137631.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Interactive and deterministic data cleaning</article-title>
          ,
          <source>in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016</source>
          , San Francisco, CA, USA, June 26 - July 01, 2016, ACM,
          <year>2016</year>
          , pp.
          <fpage>893</fpage>
          -
          <lpage>907</lpage>
          . URL: https://doi.org/10.1145/2882903.2915242. doi:10.1145/2882903.2915242.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <article-title>BUNNI: learning repair actions in rule-driven data cleaning</article-title>
          ,
          <source>ACM J. Data Inf. Qual</source>
          .
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>12:1</fpage>
          -
          <lpage>12:31</lpage>
          . URL: https://doi.org/10.1145/3665930. doi:10.1145/3665930.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <article-title>Foundations of Data Quality Management</article-title>
          , Synthesis Lectures on Data Management, Morgan &amp; Claypool Publishers,
          <year>2012</year>
          . URL: http://dx.doi.org/10.2200/S00439ED1V01Y201207DTM030. doi:10.2200/S00439ED1V01Y201207DTM030.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <article-title>Conditional functional dependencies for capturing data inconsistencies</article-title>
          ,
          <source>ACM Trans. Database Syst</source>
          .
          <volume>33</volume>
          (
          <year>2008</year>
          )
          <fpage>6:1</fpage>
          -
          <lpage>6:48</lpage>
          . URL: https://doi.org/10.1145/1366102.1366103. doi:10.1145/1366102.1366103.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <article-title>Potter's wheel: An interactive data cleaning system</article-title>
          ,
          <source>in: VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases</source>
          , September 11-14, 2001, Roma, Italy, Morgan Kaufmann,
          <year>2001</year>
          , pp.
          <fpage>381</fpage>
          -
          <lpage>390</lpage>
          . URL: http://www.vldb.org/conf/2001/P381.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <article-title>Predictive interaction for data transformation</article-title>
          ,
          <source>in: Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015</source>
          , Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, www.cidrdb.org,
          <year>2015</year>
          . URL: http://cidrdb.org/cidr2015/Papers/CIDR15_Paper27.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <article-title>An iterative approach to synthesize data transformation programs</article-title>
          ,
          <source>in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015</source>
          , Buenos Aires, Argentina, July 25-31, 2015, AAAI Press,
          <year>2015</year>
          , pp.
          <fpage>1726</fpage>
          -
          <lpage>1732</lpage>
          . URL: http://ijcai.org/Abstract/15/246.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <article-title>Wrangler: interactive visual specification of data transformation scripts</article-title>
          ,
          <source>in: Proceedings of the International Conference on Human Factors in Computing Systems, CHI 2011</source>
          , Vancouver, BC, Canada, May 7-12, 2011, ACM,
          <year>2011</year>
          , pp.
          <fpage>3363</fpage>
          -
          <lpage>3372</lpage>
          . URL: https://doi.org/10.1145/1978942.1979444. doi:10.1145/1978942.1979444.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhamidipaty</surname>
          </string-name>
          ,
          <article-title>Interactive deduplication using active learning</article-title>
          ,
          <source>in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , July 23-26, 2002, Edmonton, Alberta, Canada, ACM,
          <year>2002</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>278</lpage>
          . URL: https://doi.org/10.1145/775047.775087. doi:10.1145/775047.775087.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yakout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <article-title>Guided data repair</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <year>2011</year>
          )
          <fpage>279</fpage>
          -
          <lpage>289</lpage>
          . URL: https://doi.org/10.14778/1952376.1952378. doi:10.14778/1952376.1952378.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Glavic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Veltri</surname>
          </string-name>
          ,
          <article-title>Similarity measures for incomplete database instances</article-title>
          ,
          <source>in: Proceedings 27th International Conference on Extending Database Technology, EDBT 2024</source>
          , Paestum, Italy, March 25 - March 28, OpenProceedings.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <article-title>Active Learning</article-title>
          ,
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          , Morgan &amp; Claypool Publishers,
          <year>2012</year>
          . URL: https://doi.org/10.2200/S00429ED1V01Y201207AIM018. doi:10.2200/S00429ED1V01Y201207AIM018.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <article-title>Baran: Effective error correction via a unified context representation and transfer learning</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>1948</fpage>
          -
          <lpage>1961</lpage>
          . URL: http://www.vldb.org/pvldb/vol13/p1948-mahdavi.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Arocena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glavic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <article-title>Messing up with BART: error generation for evaluating data-cleaning algorithms</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2015</year>
          )
          <fpage>36</fpage>
          -
          <lpage>47</lpage>
          . URL: http://www.vldb.org/pvldb/vol9/p36-arocena.pdf. doi:10.14778/2850578.2850579.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Raha: A configuration-free error detection system</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2019</year>
          , Amsterdam, The Netherlands, June 30 - July 5,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>865</fpage>
          -
          <lpage>882</lpage>
          . URL: https://doi.org/10.1145/3299869.3324956. doi:10.1145/3299869.3324956.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Papenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zwiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data profiling with Metanome</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>1860</fpage>
          -
          <lpage>1863</lpage>
          . URL: http://dx.doi.org/10.14778/2824032.2824086. doi:10.14778/2824032.2824086.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>