<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automated Data Cleaning Workflows</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Mahdavi</string-name>
          <email>mahdavilahijani@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Neutatz</string-name>
          <email>felix.neutatz@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Larysa Visengeriyeva</string-name>
          <email>larysa.visengeriyeva@campus.tu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziawasch Abedjan</string-name>
          <email>abedjan@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI GmbH</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The success of AI-based technologies depends crucially on trustworthy and clean data. Research in data cleaning has provided a variety of approaches to address different data quality problems. Most of them require some prior knowledge about the dataset in order to select and configure the approach correctly. We argue that for unknown datasets, it is unrealistic to know the data quality problems upfront and to formulate all necessary quality constraints in one shot. Pragmatically, the user solves data quality problems by implementing an iterative cleaning process. This incremental approach poses the challenge of identifying the right sequence of cleaning routines and their configurations. In this paper, we highlight our work in progress towards building a cleaning workflow orchestrator that learns from past cleaning tasks and proposes promising cleaning workflows for a new dataset. To this end, we highlight new approaches for selecting the most promising error detection routines, aggregating their outputs, and explaining the final results.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Cleaning Workflows</kwd>
        <kwd>Data Profiling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deriving value from AI- and machine learning-based technologies crucially
depends on the quality of the underlying data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Research in data cleaning
has provided a variety of tools and approaches to address different data
quality problems [
        <xref ref-type="bibr" rid="ref11 ref18 ref22 ref32 ref6">6, 22, 11, 32, 18</xref>
        ]. Nevertheless, in real-world applications, human
agents utilize handcrafted scripts to curate their datasets [
        <xref ref-type="bibr" rid="ref11 ref28">28, 11</xref>
        ]. Underlying
problems that impede the application of thoroughly researched cleaning
algorithms are as follows:
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>No one-size-fits-all solution. Research on data cleaning solves well-defined
data quality problems that often do not generalize to all problems of a
real-world dataset. In particular, data quality problems are exposed with regard
to a specific context, such as rules, dictionaries, patterns, and distributions.</p>
      <p>
        Current solutions focus on only one of the contexts above [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Iterative data cleaning. Oftentimes, one has to perform multiple rounds of
cleaning and wrangling until the data reaches a satisfactory state [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
Moreover, some data quality problems are hidden in a way that they can only
be exposed after some iterations of certain cleaning or transformation
procedures. For example, missing value imputation facilitates the discovery of
outliers in a dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Trial-and-error parametrization. Current techniques require user-de ned
algorithm parameters, such as rules or thresholds, which are not
straightforward to select by a data practitioner [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Often, the user has to gure these
parameters out during a trial-and-error process that adds more cycles to the
iterative process of data cleaning.
      </p>
      <p>In this paper, we report on the research we have conducted and the ongoing work
towards a framework that leverages machine learning and data profiling
techniques to build a cleaning workflow orchestrator for a dataset. In particular, we
are working towards a solution that
– uses similarities of the current cleaning task with previous cleaning tasks to
assess the possible gain of a certain tool on a new dataset (Section 2.1),
– enables users to aggregate the results of stand-alone cleaning strategies in a
holistic manner (Section 2.2), and
– featurizes data values to better explain the context of a data error and to
enable an active learning approach to sample more promising data values for
labeling (Section 2.3).</p>
      <p>
        Next, we will describe the overall architecture of our vision and shed light on
some of our intermediate results, some of which have been already published [
        <xref ref-type="bibr" rid="ref15 ref16 ref29">29,
15, 16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Machine Learning-Driven Cleaning Pipelines</title>
      <p>
        We consider a data science use case where data analytics and data preparation
are carried out on a frequent basis, accumulating a history of data cleaning tasks
from the past that can be logged for later analysis [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Furthermore, we assume
that the data scientists are in possession of multiple cleaning algorithms or
routines. While in our experiments we consider off-the-shelf data cleaning
prototypes from research, any sort of custom cleaning script can be considered
as an individual cleaning solution or algorithm.
      </p>
      <p>[Figure 1. Overall architecture of the proposed system: a Dataset Profiler (Step 1), an Error Detection Engine with Tool Selector and Tool Aggregator (Step 2), and an Orchestrator with Feature Generator, Estimator, Recommender, and Workflow Repository (Step 3) turn a dirty dataset into a cleaned dataset by composing a data cleaning workflow of cleaning routines 1 to N, using past workflows and user annotations; the legend distinguishes the current status from future work.]</p>
      <p>Figure 1 illustrates the overall architecture of the proposed system. The first
task is to identify metadata that describes the quality problems of a dataset.
Thus, given a new dataset, the Dataset Profiler component creates a profile
by extracting relevant metadata (Step 1). This profile summarizes the content,
structure, and the dirtiness of the dataset into statistics and distributions. The
Error Detection Engine leverages the metadata to compare the similarity of the
new dataset to the previously cleaned datasets in the Workflow Repository (Step
2). The Tool Selector uses this metadata to identify the most promising error
detection strategies, whose estimated performance is high enough. We will detail
this step in Section 2.1. The Error Detection Engine then runs the promising
error detection strategies on the new dataset to identify potential data quality
problems. The profile of the new dataset is then enriched by adding information
related to the strategies' output, such as the output size. Based on the enriched
profile of the data, the set of potential cleaning algorithms can be refined.
Furthermore, the Tool Aggregator uses the enriched dataset profile to aggregate the
output of the promising error detection strategies into one final output. We will
detail this step in Section 2.2. The user is involved in the process once the initial
profiling and detection phase is over. The first task of the user is to annotate a
sample of the detected errors. Leveraging a feature representation that describes
each data cell, our machine learning approach propagates the user labels to other
data values with a similar set of feature values. The generated metadata,
the error detection results, and the annotations are used by the Orchestrator
to generate a dataset-specific cleaning workflow (Step 3). Currently, we focus on
workflows as sequences of cleaning routines. More complex control flow elements,
such as branches, are future work. Finally, the executed cleaning workflow can
be stored in the workflow repository. In the following, we discuss insights that
we have gained so far working on this project.</p>
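      <p>To make the interplay of these components concrete, the following Python sketch walks through the three steps on a toy table. All function names and the two toy detection strategies are hypothetical placeholders for illustration; they are not part of an existing implementation.</p>
      <preformat>
# A minimal, hypothetical end-to-end sketch of the pipeline in Figure 1.
import pandas as pd

def extract_profile(df):
    # Step 1: content- and structure-describing metadata (heavily simplified).
    return {"null_ratio": float(df.isna().mean().mean()), "n_columns": df.shape[1]}

def run_strategy(strategy, df):
    # Step 2: each strategy marks cells as potentially erroneous.
    if strategy == "missing_values":
        return df.isna()
    # Toy pattern check: zip codes must consist of exactly five digits.
    flags = pd.DataFrame(False, index=df.index, columns=df.columns)
    flags["zip"] = ~df["zip"].astype(str).str.fullmatch(r"\d{5}")
    return flags

def aggregate(detections):
    # Naive aggregation: union of all strategy outputs (Section 2.2 does better).
    combined = detections[0]
    for d in detections[1:]:
        combined = combined | d
    return combined

dirty = pd.DataFrame({"zip": ["10115", "1011x", None],
                      "city": ["Berlin", "Berlin", "Potsdam"]})
profile = extract_profile(dirty)                                   # Step 1
detections = [run_strategy(s, dirty) for s in ["missing_values", "zip_pattern"]]
errors = aggregate(detections)                                     # Step 2
print(profile)
print(errors)   # Step 3 (composing and running cleaning routines) would follow
      </preformat>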
      <p>
        <bold>Configuration-Free Tool Selection.</bold>
Existing data cleaning solutions are usually tailored towards one specific type
of data error, such as outliers, syntactic pattern violations, or missing values.
However, cleaning the dataset might require a combination of such solutions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Although the number of available data cleaning routines is limited, there is a vast
space of possible configurations for each algorithm. To address this challenge, we
propose an automated approach for configuring the error detection algorithms
and estimating their F1 score on a new dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>To select the proper set of cleaning routines, we use the similarity between
the current task and previous data cleaning tasks. For a dataset at hand, we
need to select cleaning routines that have successfully cleaned similar datasets
in the past. The key challenge here is to define a similarity metric that encodes
the data quality of datasets.</p>
      <p>
        We have created a dirtiness profile based on data profiling features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
These features cover content-describing metadata, such as value distribution,
and structure-describing metadata, such as character distribution [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We
compare the similarity of datasets through these metadata to filter out irrelevant
error detection algorithms and configurations that had poor accuracy on the
previous similar datasets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Next, we run the selected error detection routines
on the new dataset to compute the second group of metadata that are based on
the output of the error detection routines. The raw output of a tool on a dataset
harbors relevant information, such as the output size and its overlap with the
output of other tools. The dirtiness profile of the dataset will be enriched with
these metadata as well. Finally, the regression models estimate the F1 score of
the selected error detection routines based on the similarity of the final dirtiness
profile of the dataset to the previous datasets.
      </p>
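      <p>The estimation step can be illustrated with a small sketch: one regression model per error detection strategy is fitted on dirtiness-profile features of previously cleaned datasets and then predicts the F1 score on the new dataset. The chosen features, the example numbers, and the gradient boosting regressor are simplifying assumptions for illustration, not the exact design of our estimator.</p>
      <preformat>
# Hedged sketch: estimate the F1 score of each error detection strategy on a new
# dataset from dirtiness-profile features (features and numbers are made up).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One row per previously cleaned dataset:
# [null ratio, distinct-value ratio, non-alphanumeric character ratio]
historical_profiles = np.array([[0.05, 0.30, 0.01],
                                [0.20, 0.55, 0.10],
                                [0.01, 0.80, 0.02],
                                [0.12, 0.40, 0.07]])
# Observed F1 score of each strategy on those datasets.
historical_f1 = {"outlier_detection": [0.35, 0.60, 0.20, 0.50],
                 "pattern_violations": [0.40, 0.70, 0.30, 0.65]}

new_profile = np.array([[0.10, 0.50, 0.05]])   # profile of the new dataset

estimates = {}
for strategy, scores in historical_f1.items():
    model = GradientBoostingRegressor(random_state=0)
    model.fit(historical_profiles, scores)          # learn profile-to-F1 relationship
    estimates[strategy] = float(model.predict(new_profile)[0])

# Keep only strategies whose estimated F1 score is high enough (arbitrary threshold).
selected = [name for name, f1 in estimates.items() if f1 >= 0.4]
print(estimates, selected)
      </preformat>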
      <p>
        To anecdotally show that the approach is promising, we evaluate our
performance estimator on 11 diverse datasets: Hospital [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], Flights [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], Rayyan [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
IT [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ], and Beers [
        <xref ref-type="bibr" rid="ref10">10</xref>
          ] are real-world datasets that have been cleaned
manually, while Salaries [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], Address [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Movies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Restaurants [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Soccer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and
Tax [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are our synthetic datasets. We also have 15 error detection strategies
generated by configuring 7 entirely different error detection tools: NADEEF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
OpenRefine [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], and KATARA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are our rule, pattern, and knowledge base
violation detection systems, respectively, and Histogram, Gaussian, Gaussian
Mixture, and Partitioned Histogram Modeling are our outlier detection
strategies [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We apply the leave-one-out methodology to evaluate our approach.
Each time, we consider one of the datasets as the new arriving dataset and the
rest of the datasets as the historical training datasets. Our performance estimator then
trains regression models to learn the relationships between profile components
and F1 scores of all 15 error detection strategies and estimates the corresponding
F1 score for the new dataset. Our prototype is available online (https://github.com/bigdama/reds).
      </p>
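      <p>The leave-one-out protocol itself is compact: each dataset in turn plays the role of the new dataset while the remaining datasets act as the workflow repository. The sketch below, with made-up profile values and F1 scores for a single strategy, computes the resulting mean squared error.</p>
      <preformat>
# Hedged sketch of the leave-one-out evaluation (all values are made up).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

profiles = np.array([[0.05, 0.30, 0.01],      # dirtiness profiles of the datasets
                     [0.20, 0.55, 0.10],
                     [0.01, 0.80, 0.02],
                     [0.12, 0.40, 0.07]])
true_f1 = np.array([0.35, 0.60, 0.20, 0.50])  # observed F1 of one strategy

predictions = []
for i in range(len(profiles)):
    train_idx = [j for j in range(len(profiles)) if j != i]   # leave dataset i out
    model = GradientBoostingRegressor(random_state=0)
    model.fit(profiles[train_idx], true_f1[train_idx])
    predictions.append(model.predict(profiles[i:i + 1])[0])

print("MSE:", mean_squared_error(true_f1, predictions))
      </preformat>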
      <p>Figure 2 shows the results of our experiments. We use mean squared error
(MSE) to evaluate the quality of the estimated performance of these strategies.</p>
      <p>[Figure 2. Mean squared error of the estimated F1 scores: (a) depending on the number of training datasets in the repository, and (b) for our unsupervised estimator compared to precision-based ordering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].]</p>
        <p>
          The first experiment (a) shows how the number of existing training datasets
within the repository influences the estimation accuracy of our proposed
solution. Each point in the graph reports the mean and standard deviation of
5 independent runs on estimating the performance of each of the 15 tools. As
depicted, the MSE decreases significantly with the size of the workflow repository.
The second experiment (b) shows that our unsupervised performance estimator
provides more accurate estimations than the precision-based ordering
approach [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that requires additional user labels.
        </p>
        <p>
          The described approach required manual configuration of each tool per
dataset. In fact, it is possible to also relieve the user from the configuration
task using our dirtiness profile-based approach. Our novel system Raha [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] first
generates a range of possible configurations for each tool independently of the
dataset. Based on the similarity of the new dataset to historical datasets, Raha
filters out irrelevant error detection strategies for each column of the new dataset
at hand.
        </p>
        <p>
          Although the prevalence of specific error types suggests a ranking of error
detection algorithms for a dataset at hand, we cannot limit the error detection effort
to running one single approach [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Having access to various data
cleaning solutions calls for an effective aggregation of the error detection results,
which we consider a classification problem [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. To holistically combine
error detection methods, we use state-of-the-art ensemble learning algorithms [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]
while leveraging the dataset's metadata. We train an error detection classifier
by creating a feature representation based on the error detection results from
the different data cleaning systems and additional metadata [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
        </p>
        <p>[Table 1. Precision (P), recall (R), and F1 score of individual error detection algorithms, such as Wrangler [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and Nadeef (Dedup) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], of standard aggregators, and of our stacking-based aggregation on the Address dataset.]</p>
        <p>
          We use the ensemble learning method stacking [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] to learn the error classification model. Stacking is an approach for training a meta-learner by combining
multiple first-level models for error detection. The main idea is to train the
first-level learners on the same training dataset, and then to generate a new training
dataset for the meta-learner. The outputs of the first-level learners constitute
the input for training the meta-learning model. In our initial prototype, we train
three different first-level classifiers on the same feature vector: a neural network
with one hidden layer, a decision tree, and a naive Bayes classifier. Each of these
models classifies dataset cells as erroneous or correct. The meta-classifier, logistic
regression, is then trained on the produced output of the first-level classifiers.</p>
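        <p>A minimal version of this stacking setup can be expressed with scikit-learn. In the sketch below, the feature matrix stands in for the per-cell feature vectors (tool outputs plus metadata), and the tiny synthetic sample only illustrates the wiring of the first-level learners and the logistic regression meta-learner; it is not our actual training setup.</p>
        <preformat>
# Hedged sketch of the stacking aggregator: three first-level classifiers and a
# logistic regression meta-classifier decide whether a data cell is erroneous.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# One row per data cell: 0/1 flags of three error detection tools plus a
# metadata feature such as the value frequency (all values are synthetic).
X = np.array([[1, 0, 1, 0.02],
              [0, 0, 0, 0.90],
              [1, 1, 0, 0.05],
              [0, 1, 1, 0.10],
              [0, 0, 0, 0.80],
              [1, 1, 1, 0.01]])
y = np.array([1, 0, 1, 0, 0, 1])   # 1 = erroneous cell (from user annotations)

stack = StackingClassifier(
    estimators=[("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=2)   # small cv only because the toy sample is tiny
stack.fit(X, y)
print(stack.predict(X))   # predicted error labels per cell
        </preformat>
        <p>On real data, the feature vectors are computed per dataset cell and the labels come from the user annotations described above.</p>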
        <p>
          Table 1 shows the performance scores of our ensemble learning aggregation
based on Stacking in comparison to individual error detection strategies from
the literature as well as standard aggregators, such as Majority Wins, Union
All, and Precision Based Ordering [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The experiments were performed on the
above-mentioned Address dataset, which contains misspelled or missing values
and field separation violations, amounting to 36% erroneous cells in the whole dataset.
Our prototype is available online (http://bit.ly/systems-aggregation).
        </p>
        <p>
          As metadata provides signals on individual values and inter-column
relationships, it supports the four major types of error detection strategies: pattern
violation detection, rule violation detection, outlier detection, and duplicate-based
error detection [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. To improve the performance of error classification, we
incorporate metadata, such as data type affiliation, attribute domain, frequency, null
values, and multi-column dependencies, into our learning algorithms. Initially,
the classifiers are trained on the feature vectors that comprise the output of
individual error detection algorithms and the metadata information. Operating
on the augmented feature vector explains why the stacking approach
results in a higher recall, i.e., 0.91, compared to the sum of the recalls of the
individual error detection methods. Using the metadata, the classifier leverages
more information to classify the dataset cells. This experiment demonstrates that
metadata-driven holistic aggregation of error detection results captures more
errors in the dataset than off-the-shelf error detection systems and aggregation
methods from the literature [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>[Figure 3. Flights example used to explain detected errors: source A reports flight LH-100 departing 10:00 a.m. and arriving "2:00 p.m. Dec 02", source B reports arrival "2:00 p.m.", and source C reports arrival "3:00 p.m.".]</p>
        <p>
          Our stacking method requires labeled data amounting to at most 1% of the dataset size.
Currently, we are working towards a new active learning strategy to further
reduce the labeling effort [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          State-of-the-art machine learning-based error detection methods, as described in
the previous sections, achieve very high detection accuracy. However, the user might
not only be interested in high accuracy but also in the
underlying cause of the corresponding errors and their context. For instance, in
the case of a field separation issue, outlier detection methods, syntax checkers, and
functional dependency violation detection methods might detect that there is
indeed an error. Nonetheless, none of these methods tells the user that the error
is related to a field separation issue that can be resolved by a specific strategy.
As a first step to address this problem, we propose to leverage the extensive
work on feature engineering for error detection, where features cover information
on the attribute, tuple, and dataset level for each data cell, as discussed in
the aforementioned sections. This way, we train a classification model, such as
a decision tree, to fit the error detection result that the user is interested in
exploring. This classification model provides the user with those features that
correlate with the corresponding error and therefore gives the user an idea of the
context that this error occurs in. Figure 3 shows an example of this approach on
the Flights data. The trained model has learned that the underlying syntax pattern
for Arrival requires fewer than 11 characters. This insight hints that Arrival has
a formatting issue. Furthermore, it found that source C is unreliable with
respect to the Arrival entries; in fact, source C is the actual error cause.
        </p>
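        <p>The following sketch illustrates this idea on the Flights example: a decision tree is fitted on two simple per-cell features of the Arrival column (value length and reporting source) and its learned rules are printed. The features, labels, and extra rows are illustrative assumptions rather than our exact featurization.</p>
        <preformat>
# Hedged sketch: explaining detected errors with a decision tree over per-cell
# features of the Arrival column (feature choice is an illustrative assumption).
from sklearn.tree import DecisionTreeClassifier, export_text

arrival_values = ["2:00 p.m. Dec 02", "2:00 p.m.", "3:00 p.m.",
                  "1:30 p.m. Dec 02", "1:30 p.m.", "4:45 p.m."]
sources = ["A", "B", "C", "A", "B", "C"]
is_error = [1, 0, 1, 1, 0, 1]   # labels produced by the error detection step

# Features: length of the Arrival value and whether the tuple comes from source C.
X = [[len(v), 1 if s == "C" else 0] for v, s in zip(arrival_values, sources)]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, is_error)
print(export_text(tree, feature_names=["value_length", "source_is_C"]))
        </preformat>
        <p>On this toy sample, the printed rules recover both hints discussed above: a length threshold on the Arrival values and the special role of source C.</p>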
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        There has already been a large body of work on data cleaning [
        <xref ref-type="bibr" rid="ref23 ref4 ref5 ref6">4, 6, 23, 5</xref>
        ].
Individual systems can be plugged into our estimation and aggregation framework
as potential subsystems.
      </p>
      <p>
        Rule-based data cleaning. There is already extensive research conducted on
rule-based data cleaning [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The Holistic data cleaning method is developed
based on denial constraints [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This approach considers the generalization of
functional dependencies by translating them into denial constraints. A
generalization of functional dependencies was also proposed by the Llunatic system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The Nadeef system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] treats data quality rules holistically by providing an
interface for implementing denial constraints and other user-defined functions.
Another line of research employs metric functional dependencies and proposes
a strategy to choose high-quality minimal repairs [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The Katara system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
is a knowledge base and crowd-driven data cleaning system. It aligns a dataset
with available knowledge bases to identify correct and incorrect data values
and to suggest top-k possible repairs for incorrect data. Our approach
considers variations of the aforementioned traditional cleaning strategies as potential
subsystems that can be aggregated inside a generated workflow.
Machine learning-based data cleaning. There is an increasing trend of
utilizing machine learning approaches for data integration and curation tasks.
The HoloClean system combines qualitative and quantitative repair signals
in a statistical model that allows it to repair erroneous data values [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
However, HoloClean requires manual hyperparameter optimization and follows a
one-shot aggregation strategy. GDR uses active learning to choose the correct
update suggested by user-defined functional dependencies [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Continuous data
cleaning leverages classification to trade off repairing constraints against
repairing the data [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Ideally, it is desirable to adopt these approaches for a more
general cleaning pipeline beyond rule-based data cleaning. Other systems, such
as SCARE [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and ERACER [
        <xref ref-type="bibr" rid="ref18">18</xref>
          ], leverage probabilistic models to repair data
and assume that all errors can be corrected without human involvement. Finally,
ActiveClean cleans the training data for a machine learning application and
requires the user to specify how to clean and how to featurize the dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
In our work, we apply machine learning techniques to build a cleaning workflow
orchestrator that learns from cleaning tasks in the past and proposes effective
cleaning workflows for a new dataset.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Directions</title>
      <p>
        We presented our vision and initial steps for supporting the user in building
complex pipelines of automated data cleaning tools. Using various machine learning
techniques, we aim at leveraging knowledge about past cleaning tasks
and data profiling to propose cleaning workflows for a new dataset. So far, we
are able to estimate the effectiveness of error detection workflows on a dataset
and to aggregate error detection results effectively. Also, we have developed a
feature representation that enables effective active learning for error detection.
Yet, there are some challenging research directions ahead of us:
Understanding metadata. Our experiments show the benefits of
incorporating metadata for various tasks. A principled connection between instances of
both concepts, metadata and data quality, is yet to be established. For
example, the profiling result about null values is an indicator for the completeness
of a dataset. However, to detect disguised missing values [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we would need
different metadata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is thus essential to establish relationships between the
metadata and data quality problems, and use them for data cleaning routines.
Learning to transform data values. We plan to extend our active
learning-based example-driven approach from error detection to correction. For instance,
we can treat error correction as a translation task that translates erroneous
cells to correct cells. Following this idea, we can leverage current advances in
statistical machine translation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been funded by the following three grants: the German Research
Foundation (DFG) under grant agreement 387872445, the German Federal
Ministry of Education and Research as BBDC II (01IS18025A), and the German
Federal Ministry of Transport and Digital Infrastructure as Daystream (19F2013).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papotti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
          </string-name>
          , N.:
          <article-title>Detecting data errors: Where are we and what needs to be done?</article-title>
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>12</issue>
          ),
          <volume>993</volume>
          –1004 (Aug
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <volume>557</volume>
          –
          <fpage>581</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Arocena</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glavic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mecca</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papotti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santoro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Messing up with bart: error generation for evaluating data-cleaning algorithms</article-title>
          .
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>2</issue>
          ),
          <volume>36</volume>
          –
          <fpage>47</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papotti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Holistic data cleaning: Putting violations into context</article-title>
          .
          <source>In: ICDE</source>
          . pp.
          <volume>458</volume>
          –
          <issue>469</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morcos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papotti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Katara: A data cleaning system powered by knowledge bases and crowdsourcing</article-title>
          .
          <source>In: SIGMOD</source>
          . pp.
          <volume>1247</volume>
          –
          <issue>1261</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dallachiesa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebaid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eldawy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
          </string-name>
          , N.:
          <article-title>Nadeef: A commodity data cleaning system</article-title>
          .
          <source>In: SIGMOD</source>
          . pp.
          <volume>541</volume>
          –
          <issue>552</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>G. C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.S.</given-names>
            ,
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.:</surname>
          </string-name>
          <article-title>The magellan data repository</article-title>
          . https://sites.google.com/site/anhaidgroup/useful-stuff/data
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geerts</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Foundations of data quality management</article-title>
          , vol.
          <volume>4</volume>
          . Morgan &amp; Claypool Publishers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Geerts</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mecca</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papotti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santoro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The llunatic data-cleaning framework</article-title>
          .
          <source>PVLDB</source>
          <volume>6</volume>
          (
          <issue>9</issue>
          ),
          <volume>625</volume>
          –
          <fpage>636</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hould</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          :
          <article-title>Craft beers dataset</article-title>
          . https://www.kaggle.com/nickhould/craft-cans (
          <year>2017</year>
          ), version
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kandel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paepcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellerstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heer</surname>
          </string-name>
          , J.: Wrangler:
          <article-title>Interactive visual specification of data transformation scripts</article-title>
          .
          <source>In: SIGCHI</source>
          . pp.
          <volume>3363</volume>
          –
          <issue>3372</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>Statistical Machine Translation</article-title>
          . Cambridge University Press, New York, NY, USA, 1st edn. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , E.:
          <article-title>Towards reliable interactive data cleaning: a user survey and recommendations</article-title>
          .
          <source>In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics</source>
          . p.
          <fpage>9</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>ActiveClean: interactive data cleaning for statistical modeling</article-title>
          .
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>12</issue>
          ),
          <volume>948</volume>
          –
          <fpage>959</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mahdavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Reds: Estimating the performance of error detection strategies based on dirtiness profiles</article-title>
          .
          <source>In: SSDBM</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mahdavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            <given-names>Fernandez</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tang</surname>
          </string-name>
          , N.:
          <article-title>Raha: A configuration-free error detection system</article-title>
          .
          <source>In: SIGMOD</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Management</surname>
          </string-name>
          , M.S.:
          <article-title>Data, analytics, and AI: How trust delivers value</article-title>
          . MIT Sloan Management Review (
          <year>2019</year>
          ), http://bit.ly/mit-data-quality,
          accessed 20.03.2019
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Mayfield, C.,
          <string-name>
            <surname>Neville</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhakar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Eracer: a database approach for statistical inference and data cleaning</article-title>
          .
          <source>In: SIGMOD</source>
          . pp.
          <volume>75</volume>
          –
          <issue>86</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Neutatz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahdavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>ED2: A Case for Active Learning in Error Detection</article-title>
          . In: CIKM (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammady</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedorowicz</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Rayyan -a web and mobile app for systematic reviews</article-title>
          . vol.
          <volume>5</volume>
          , p.
          <volume>210</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          :
          <article-title>The problem of disguised missing data</article-title>
          .
          <source>Acm Sigkdd Explorations Newsletter</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ),
          <volume>83</volume>
          –
          <fpage>92</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Pit</given-names>
            <surname>Claudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Mariet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Harding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Outlier detection in heterogeneous datasets using automatic tuple expansion</article-title>
          .
          <source>Technical Report</source>
          , MIT (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Prokoshyna</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlichta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Combining quantitative and logical data cleaning</article-title>
          .
          <source>PVLDB</source>
          <volume>9</volume>
          (
          <issue>4</issue>
          ),
          <volume>300</volume>
          –
          <fpage>311</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          :
          <article-title>Data cleaning: Problems and current approaches</article-title>
          .
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>23</volume>
          (
          <issue>4</issue>
          ),
          <volume>3</volume>
          –
          <fpage>13</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Re</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>HoloClean: Holistic data repairs with probabilistic inference</article-title>
          .
          <source>PVLDB</source>
          <volume>10</volume>
          (
          <issue>11</issue>
          ),
          <volume>1190</volume>
          –
          <fpage>1201</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Schelter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celikel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biessmann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grafberger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automating large-scale data quality veri cation</article-title>
          .
          <source>PVLDB</source>
          <volume>11</volume>
          (
          <issue>12</issue>
          ) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          :
          <article-title>Data integration: The current status and the way forward</article-title>
          .
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>41</volume>
          (
          <issue>2</issue>
          ) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , De Wilde,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Using OpenRefine</article-title>
          .
          <source>Packt Publishing Ltd</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Visengeriyeva</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Metadata-driven error detection</article-title>
          .
          <source>In: SSDBM</source>
          . pp.
          <volume>1</volume>
          –
          <issue>12</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Volkovs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlichta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Continuous data cleaning</article-title>
          .
          <source>In: ICDE</source>
          . pp.
          <volume>244</volume>
          –
          <issue>255</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Wolpert</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          :
          <article-title>Stacked generalization</article-title>
          .
          <source>Neural Networks</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <volume>241</volume>
          –
          <fpage>259</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Yakout</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berti-Equille</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes</article-title>
          .
          <source>In: SIGMOD</source>
          . pp.
          <volume>553</volume>
          –
          <issue>564</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Yakout</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neville</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ouzzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          :
          <article-title>Guided data repair</article-title>
          .
          <source>PVLDB</source>
          <volume>4</volume>
          (
          <issue>5</issue>
          ),
          <volume>279</volume>
          –
          <fpage>289</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.H.</given-names>
          </string-name>
          :
          <article-title>Ensemble methods: foundations and algorithms</article-title>
          . Chapman &amp; Hall/CRC, 1st edn. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>