<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIANA: a Knowledge-driven Framework for Data-centric AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Camilla Sancricca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Supervised by: Cinzia Cappiello Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)</institution>
          ,
          <addr-line>Politecnico di Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data analysis plays a key role in companies that adopt machine learning models to support their decision-making processes. Among the phases of a machine learning pipeline, data preparation is essential to obtain high-quality data. Data-centric AI has shifted the focus of such processes to the quality of the data rather than the performance of the machine learning model. Users from different application fields face data preparation, and they frequently encounter difficulties in designing effective data preparation pipelines when dealing with a multitude of data quality errors and data quality improvement techniques; this highlights the need for approaches that simplify the definition of an effective data preparation pipeline. The main goal of my Ph.D. project is to design a framework to support users in selecting the data preparation tasks to perform in a machine learning pipeline. Using a knowledge-driven approach, we aim to guide (more and less experienced) users through an interactive process in which recommendations, explanations, and different levels of autonomy can simplify the design of an effective data preparation pipeline.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Data Preparation</kwd>
        <kwd>Knowledge-driven Approach</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the widespread diffusion of a data-driven culture, data analysis is becoming crucial for organizations to gain competitive advantages. The volume and variety of the available data have enabled enterprises to run data analysis pipelines and to employ their results to support their decision-making processes.</p>
      <p>Data analysis pipelines include multiple stages: data acquisition, preparation, modeling and analysis, and evaluation. Among them, the most challenging phase is data preparation, which is essential to obtain good pipeline outputs. Indeed, the goal of data preparation is to ensure that the data have a good level of quality, guaranteeing the dependability of the analysis results.</p>
      <p>
        Designing an effective data preparation pipeline has become extremely difficult for users due to the variety of errors and the plethora of available data preparation techniques. For a data scientist, it has been demonstrated that data preparation could take up to 80% of the total data analysis time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, the majority of the data preparation actions are based on approximate methods; if not performed well, data preparation can introduce uncertainty in the data. In turn, this uncertainty can also propagate to the final results of the analysis.
      </p>
      <p>Currently, some approaches exist to assist users in designing these pipelines. However, they aim to completely automate the data preparation process without considering (i) the benefits of the interaction with users and (ii) that the optimal pipeline may change according to different analysis goals, types of data, and user needs.</p>
      <p>
        An emerging concept in this domain is Data-centric AI [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is based on the idea that data and their quality are the most important aspects to consider in defining new AI systems. Data-centric AI shifted the primary focus of these systems from the goodness of the model to the quality of the data.
      </p>
      <p>Moreover, data quality (DQ) is not the only facet to be considered when decision-making processes are employed in contexts that use sensitive data. The outputs of data analysis, in that case, will support decisions that could impact people’s lives. The ethical aspect also comes into play: we must ensure that data do not contain biases and that the models we are using are fair.</p>
      <p>
        Another important aspect arising from the adoption of data analysis in many domains is that even less experienced users have started to face problems similar to the ones mentioned before. We also conducted a study interviewing users from different backgrounds and found that the less experienced ones have no idea of the type of analysis to perform once they have data. The need has emerged to enable even non-expert end-users to perform effective data analysis processes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The use of explainability techniques is currently demonstrating its effectiveness in helping less experienced users to have a better understanding of these systems. However, more experienced users probably need to use them in a way that is more self-service than non-experts. For this reason, the concept of sliding or adjustable autonomy is emerging in many research areas, i.e., the ability of a system to involve humans when needed or to proceed autonomously. To achieve these goals, new systems need to be designed with human-centered approaches. My research aims to cover these open issues.
      </p>
      <p>The challenges of my Ph.D. research focus on developing a self-service environment to help, support, and guide users in designing data preparation pipelines. Within this environment, our vision is to provide users with: (a) recommendations on the sequence of data preparation actions that best fits their analysis purposes, simplifying the design of a data preparation pipeline; (b) the detection/mitigation of potential biases; (c) explainability, with a human-centered design; (d) more or less support/autonomy w.r.t. their needs.</p>
      <p>This paper presents the work done in my first Ph.D. year. It is structured as follows. Section 2 describes related work and challenges in data preparation; Section 3 describes the main contributions and presents the high-level architecture of a framework to address the described open challenges; finally, Section 4 depicts conclusions and future work.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy. Contact: camilla.sancricca@polimi.it (C. Sancricca). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>2. Related Work</p>
      <p>
        Several literature contributions focused on data preparation and related challenges in the last few years. As a starting point of my work, I analyzed different frameworks for supporting the data preparation process. Methods that focus on supporting users in designing effective data analysis pipelines are proposed in [
        <xref ref-type="bibr" rid="ref4">4, 5</xref>
        ]. However, the type of Machine Learning (ML) algorithm to be run on the data is rarely considered in the recent literature [6]. Instead, Section 3 will show that the type of analysis has a crucial role in the selection of the most suitable pipeline. Some recent tools [6, 7, 8] are based on optimization algorithms that build the best pipeline from scratch, trying several combinations of data preparation tasks; however, extracting the recommendations in this manner can often be very time-consuming. Other contributions [9, 10] use knowledge-based approaches exploiting past users’ actions to make recommendations. In my project, we aim to use a knowledge-driven approach, but the recommendations will be based on empirical evidence. Moreover, almost all the above-mentioned methods, like most of the AutoML approaches, aim to automate data preparation, keeping the user unaware of how the data was prepared. However, it has been shown that the integration of human factors in the data science process achieves more effective and trustworthy systems [11]. For these reasons, there is a need for more human-centered and transparent approaches.
      </p>
      <p>3. Contributions</p>
      <p>This section presents my Ph.D. contribution. After analyzing the research gaps related to data preparation and the existing tools in this domain, my research mainly focused on designing a system to address all the open challenges mentioned before. This section aims to present the approach and, for each component, highlight the preliminary results obtained and the ongoing work.</p>
      <p>DIANA is a self-service environment for ADaptIve DAta-ceNtric AI. The main objective of DIANA is to facilitate users in designing a data preparation pipeline by recommending the list of the most appropriate activities to obtain reliable analysis results (a). The suggestions are based on the user’s analysis context and are extracted from a Knowledge Base (KB). The analysis context is defined as the combination of (i) the data source profile, i.e., all the pieces of information that we can extract through data profiling operations, and (ii) the type of analysis, i.e., the ML algorithm that the user intends to run on the data. The system warns users of potential biases, suggesting how to mitigate them (b). Moreover, it can adapt to expert and less-experienced users, providing more or less support according to their preferences (d). Human-in-the-loop techniques engage users through the process, and explanations are provided to support those in need (c). Figure 1 depicts the high-level architecture of the proposed approach. The figure is divided into the two main phases of a data analysis pipeline: Data Preprocessing and Data Analysis. My Ph.D. research is focused on the Data Preprocessing phase.</p>
      <p>[Figure 1: high-level architecture of the proposed approach. Labels in the figure: Data Sources, Data Exploration, User Preferences, Knowledge Base, Data Profiling, DQ Assessment, Bias Awareness, Pipeline Suggestions, Explanations, Data Preparation &amp; Bias Mitigation, User, Human-in-the-loop, Model Selection, Training, Analysis, Results Evaluation.]</p>
      <p>[Figure 2: initial setup planned to enrich the KB. DQ dimensions and data preparation activities: Completeness (Remove Rows/Columns with Missing Values; Data Imputation, Standard &amp; ML-based); Accuracy (Outlier Detection, Standard &amp; ML-based; Remove Rows/Columns with Outliers; Outlier Correction using Data Imputation); Consistency (Remove Inconsistencies; Correction with Functional Dependencies). ML tasks: Classification, Clustering, Regression.]</p>
      <p>Data Collection and Exploration. The targeted user
selects and loads into the system the Data Sources. First,
the system provides a Data Exploration engine with
interactive visualizations and allows users to enter some User
Preferences, e.g., the subset of the most relevant features,
the type of analysis to be performed on the data (if the
user already knows it), and the needed level of support in
the next phases. Depending on the users’ level of
expertise, they can specify if they want support in the choice
of (1) the possible analyses to perform or (2) the data
preparation tasks to apply. An expert user could also prefer to
perform all the steps autonomously. In the former case,
a list of possible analyses will be generated.</p>
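      <p>The preferences described above can be pictured as a small record that decides which suggestion steps are activated. The following is only an illustrative sketch: the names UserPreferences and suggestion_steps, as well as the step labels, are hypothetical and are not taken from the DIANA implementation.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserPreferences:
    """Hypothetical record of what the user enters during Data Exploration."""
    relevant_features: List[str] = field(default_factory=list)
    analysis_type: Optional[str] = None      # None: the user does not know it yet
    support_analysis_choice: bool = False    # support (1): which analysis to run
    support_preparation: bool = False        # support (2): which tasks to apply


def suggestion_steps(prefs: UserPreferences) -> List[str]:
    """Return the suggestion steps to activate for the given preferences."""
    steps: List[str] = []
    # Propose analyses only if support was requested and none is chosen yet.
    if prefs.support_analysis_choice and prefs.analysis_type is None:
        steps.append("suggest_analyses")
    if prefs.support_preparation:
        steps.append("suggest_preparation_pipeline")
    return steps


novice = UserPreferences(support_analysis_choice=True, support_preparation=True)
expert = UserPreferences(analysis_type="classification")
print(suggestion_steps(novice))  # both suggestion steps are active
print(suggestion_steps(expert))  # empty list: the expert works autonomously
```
]]></preformat>
      <p>Modeling the two kinds of support as independent flags is one possible design choice; it lets a user who already knows the analysis still ask for help with the preparation tasks.</p>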
    </sec>
    <sec id="sec-2">
      <title>Ongoing</title>
      <p>We are currently working on a methodology that, given a dataset, provides suggestions on suitable
analyses by exploiting Large Language Models (LLMs).</p>
    </sec>
    <sec id="sec-3">
      <p>In the latter case, a suggested pipeline of data preparation actions that best fits the selected analysis will be
displayed during the Data Preparation phase.</p>
      <p>Advanced Data Profiling. Once all the data have been
collected, they are inspected via the Data Profiling and
the DQ Assessment engines. The former extracts metadata and
visualizations, while the latter assesses the level of DQ.
Finally, the Bias Awareness phase provides insights into
possible biases that could affect the data. The results of
these phases should help users understand the datasets’
content and their initial suitability for the task at hand.</p>
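      <p>As a rough illustration of what a DQ assessment step may compute, the sketch below measures the completeness of a column and uses the share of outlying values as a simple proxy related to accuracy. The function names, the z-score rule, and the threshold are assumptions made for this example; the paper does not specify which metrics DIANA uses.</p>
      <preformat><![CDATA[
```python
from statistics import mean, stdev
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value for `column` is not missing (None)."""
    if not rows:
        return 1.0
    present = sum(1 for r in rows if r.get(column) is not None)
    return present / len(rows)


def outlier_ratio(rows: List[Row], column: str, z: float = 3.0) -> float:
    """Fraction of non-missing values farther than `z` standard deviations from the mean."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if len(values) < 2:
        return 0.0
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return 0.0
    return sum(1 for v in values if abs(v - mu) > z * sigma) / len(values)


# Toy column: eight regular values, one missing value, one extreme value.
rows = [{"income": v} for v in [10.0] * 8 + [None, 100.0]]
print(completeness(rows, "income"))          # 9 of 10 values are present
print(outlier_ratio(rows, "income", z=2.0))  # 1 of 9 present values is flagged
```
]]></preformat>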
      <p>The Knowledge Base. The recommendations about
the optimal data preparation pipeline in our framework
will be extracted from a KB. We start from the intuition
that different error types could impact the final results
differently depending on the selected analysis context.
We envisioned that the best sequence of data preparation
actions to apply should depend on such an impact.
Results. To verify this approach, we investigated the
impact of data errors (related to different DQ dimensions)
on the quality of the results of different ML models. We found
that issues related to DQ dimensions can have a different
impact on the outcome performance of an ML analysis
depending on both the ML algorithm used and the
characteristics of the data, which together define our analysis context.
Thus, we performed experiments creating rankings of DQ
dimensions for different combinations of datasets and
ML models, showing that improving the DQ dimensions
in order of importance for that specific analysis context
gives better final results [12]. Once the DQ dimensions
that need to be prioritized have been identified, we
focused on extracting, for each combination of dataset
profile, ML method, and DQ dimension to improve,
the corresponding top-k data preparation actions. We
demonstrated that, again, the goodness of the preparation
techniques depends on the specific context.</p>
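      <p>One simple way to picture such a ranking is to order the DQ dimensions by the performance drop observed when errors of each kind are injected into the data. The sketch below assumes this reading; the function name rank_dimensions and all scores are invented for illustration and are not results from [12], whose methodology may differ.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List


def rank_dimensions(baseline: float, degraded: Dict[str, float]) -> List[str]:
    """Order DQ dimensions by the performance drop their errors cause (largest first)."""
    drops = {dim: baseline - score for dim, score in degraded.items()}
    return sorted(drops, key=lambda dim: drops[dim], reverse=True)


# Made-up scores for one hypothetical analysis context
# (one dataset profile combined with one ML model).
baseline_accuracy = 0.90
accuracy_with_injected_errors = {
    "completeness": 0.78,
    "accuracy": 0.85,
    "consistency": 0.88,
}
ranking = rank_dimensions(baseline_accuracy, accuracy_with_injected_errors)
print(ranking)  # dimensions whose errors hurt the model most come first
```
]]></preformat>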
    </sec>
    <sec id="sec-4">
      <p>Given the evidence that we can extract suggestions
based on such an impact, we defined the final structure
of the KB, which contains a static and a dynamic module.
The static KB mainly contains descriptive and
experimental data. Descriptive data concern: (i) the DQ dimensions,
associated with (ii) the data preparation methods which
improve them, (iii) the list of considered ML models, and
(iv) profiles of the data sources analyzed in previous
experiments. Experimental data are the results of such
experiments, which are fundamental to support the generation of
suggestions; they contain: (i) the sequence of the most
impacting DQ dimensions and (ii) the most suitable data
preparation tasks to apply in a multitude of different
analysis contexts. The above-mentioned results have been
added to the experimental data and will have a
fundamental role during the extraction of the suggestions.</p>
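      <p>To make the distinction between descriptive and experimental data more concrete, the static KB can be sketched as a small in-memory structure. This is a simplification under stated assumptions: the class StaticKB, its field names, and the fallback to generic methods for unseen contexts are hypothetical, and the actual KB is being developed as a graph database.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ContextKey = Tuple[str, str]  # (dataset profile id, ML model name)


@dataclass
class StaticKB:
    # Descriptive data: DQ dimension -> preparation methods that improve it.
    methods_by_dimension: Dict[str, List[str]] = field(default_factory=dict)
    # Experimental data (i): context -> DQ dimensions ordered by impact.
    dimension_ranking: Dict[ContextKey, List[str]] = field(default_factory=dict)
    # Experimental data (ii): (context, dimension) -> methods ordered by observed benefit.
    best_methods: Dict[Tuple[ContextKey, str], List[str]] = field(default_factory=dict)

    def top_k_methods(self, context: ContextKey, dimension: str, k: int) -> List[str]:
        """Top-k methods observed for this context, else the generic ones for the dimension."""
        ranked = self.best_methods.get((context, dimension))
        if ranked is None:
            # Assumption of this sketch: fall back to the descriptive data.
            ranked = self.methods_by_dimension.get(dimension, [])
        return ranked[:k]


kb = StaticKB(
    methods_by_dimension={
        "completeness": ["Remove Rows/Columns with Missing Values", "Data Imputation"],
        "accuracy": ["Outlier Detection", "Remove Rows/Columns with Outliers"],
    },
    dimension_ranking={("profile_A", "classification"): ["completeness", "accuracy"]},
    best_methods={(("profile_A", "classification"), "completeness"): ["Data Imputation"]},
)
print(kb.top_k_methods(("profile_A", "classification"), "completeness", k=1))
print(kb.top_k_methods(("profile_B", "clustering"), "accuracy", k=1))
```
]]></preformat>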
    </sec>
    <sec id="sec-5">
      <title>Ongoing</title>
      <p>We are developing the KB conceptual model as
a graph database. In addition, we are currently feeding
the KB with the empirical knowledge acquired from
the experiments, considering a heterogeneous set of
datasets taken from several open repositories. Figure 2
shows the initial setup we planned to enrich the KB.</p>
      <p>Generating Suggestions. The dynamic KB takes an
unexplored analysis context as input and considers all the
previously analyzed contexts to extract the suggestions.
Our final goal is to identify the previously analyzed
context in the KB closest to the one chosen by the user and,
accordingly, to extract the ranking of the most impacting
DQ dimensions in such a context. This ranking and the
data provided by the Data Profiling and the DQ
Assessment engines are the input for identifying the suggested
ranking of DQ dimensions. For each DQ dimension, we
make use of a classifier that takes as input the results
stored in the KB and extracts the best data preparation
actions, building the suggested pipeline.</p>
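      <p>The lookup of the closest previously analyzed context can be sketched as a nearest-neighbor search over profile features. The paper does not state which similarity measure is used, so the Euclidean distance, the feature names, and the function closest_ranking below are assumptions made only for illustration; the classifier that selects the preparation actions is not covered here.</p>
      <preformat><![CDATA[
```python
from math import sqrt
from typing import Dict, List, Tuple

Profile = Dict[str, float]
KnownContext = Tuple[Profile, str, List[str]]  # (profile, ML task, DQ ranking)


def distance(a: Profile, b: Profile) -> float:
    """Euclidean distance over the profile features the two contexts share."""
    shared = sorted(set(a) & set(b))
    return sqrt(sum((a[f] - b[f]) ** 2 for f in shared))


def closest_ranking(new_profile: Profile, ml_task: str,
                    known: List[KnownContext]) -> List[str]:
    """Return the DQ ranking stored for the closest known context with the same ML task."""
    candidates = [(p, r) for p, task, r in known if task == ml_task]
    if not candidates:
        return []
    _, ranking = min(candidates, key=lambda c: distance(new_profile, c[0]))
    return ranking


# Invented contexts; the feature values are assumed to be normalized to [0, 1].
known_contexts = [
    ({"missing_ratio": 0.30, "outlier_ratio": 0.02}, "classification",
     ["completeness", "accuracy", "consistency"]),
    ({"missing_ratio": 0.01, "outlier_ratio": 0.20}, "classification",
     ["accuracy", "consistency", "completeness"]),
    ({"missing_ratio": 0.28, "outlier_ratio": 0.03}, "clustering",
     ["consistency", "completeness", "accuracy"]),
]
new_context = {"missing_ratio": 0.28, "outlier_ratio": 0.03}
suggested = closest_ranking(new_context, "classification", known_contexts)
print(suggested)
```
]]></preformat>
      <p>Restricting the candidates to contexts with the same ML task reflects the definition of the analysis context as the combination of data profile and type of analysis.</p>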
    </sec>
    <sec id="sec-6">
      <title>Ongoing</title>
      <p>We trained a classifier to extract the best data
imputation method for improving the Completeness
dimension, and we are currently validating it.</p>
      <p>Data Preparation. The results of the last phase are sent to the Data Preparation engine, which has two main goals: showing the most appropriate task to perform and executing the DQ improvement methods. During this phase, depending on what the user has specified at the beginning, a suggested pipeline of preparation actions extracted from the KB could be shown or not to the user. In the envisioned approach, we assume that the users are free to follow the suggestions or not, letting them change the order of the suggested actions or substitute them by selecting from all the available ones.</p>
      <p>Results. We conducted experiments to understand the effect of data preparation techniques on the uncertainty of ML models. We found that the amount of uncertainty introduced is again context-dependent [13].</p>
      <p>Bias Mitigation. In addition to traditional DQ dimensions and data preparation techniques, we aim to offer a set of bias mitigation techniques to be applied within the Data Preparation engine.</p>
      <p>Results. We had the intuition that a trade-off could exist between the concept of DQ and data ethics. We demonstrated the existence of such a trade-off [14], and we defined a preliminary set of guidelines to balance it [15].</p>
      <p>Explainability and Human-in-the-loop. Our framework is enriched with explanations tailored to the users’ expertise that help them understand the data, the reason behind the suggestions, and the results of the data preparation actions. Moreover, human-in-the-loop techniques involve users in various steps of the process.</p>
      <p>Ongoing. We are working on enriching the environment to guarantee explainability by extracting and formulating explanations through the support of LLM tools.</p>
      <p>4. Conclusion and Future Developments</p>
      <p>This paper aims to present the main objectives of my Ph.D. research project. I described the work related to my project, its main challenges, and the high-level architecture of the designed approach. As soon as a preliminary version of the system is finalized, we plan to evaluate it in comparison with similar tools such as [6] and with real users. Future work will focus on exploiting past users’ experiences and feedback to improve the recommendations, letting the system evolve and learn. We aim to extend the KB model to include users’ profiles, goals, and past actions (i.e., provenance).</p>
      <p>References</p>
      <p>[4] S. Shrivastava, D. Patel, A. Bhamidipaty, W. M. Gifford, S. A. Siegel, V. S. Ganapavarapu, J. R. Kalagnanam, DQA: scalable, automated and interactive data quality advisor, in: IEEE International Conference on Big Data, Los Angeles, USA, 2019.</p>
      <p>[5] L. A. Melgar, D. Dao, S. Gan, N. M. Gürel, N. Hollenstein, J. Jiang, B. Karlas, T. Lemmin, T. Li, Y. Li, S. X. Rao, J. Rausch, C. Renggli, L. Rimanic, M. Weber, S. Zhang, Z. Zhao, K. Schawinski, W. Wu, C. Zhang, Ease.ml: A lifecycle management system for machine learning, in: 11th Conference on Innovative Data Systems Research, CIDR, 2021.</p>
      <p>[6] L. Berti-Équille, Learn2clean: Optimizing the sequence of tasks for web data preparation, in: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 2019.</p>
      <p>[7] F. Neutatz, B. Chen, Y. Alkhatib, J. Ye, Z. Abedjan, Data cleaning and automl: Would an optimizer choose to clean?, Datenbank-Spektrum (2022).</p>
      <p>[8] Q. Cui, W. Zheng, W. Hou, M. Sheng, P. Ren, W. Chang, X. Li, Holocleanx: A multi-source heterogeneous data cleaning solution based on lakehouse, in: Health Information Science - 11th International Conference, HIS, 2022.</p>
      <p>[9] M. Mahdavi, Z. Abedjan, Semi-supervised data cleaning with raha and baran, in: 11th Conference on Innovative Data Systems Research, CIDR, 2021.</p>
      <p>[10] C. Yan, Y. He, Auto-suggest: Learning-to-recommend data preparation steps using data science notebooks, in: Proceedings of the 2020 International Conference on Management of Data, SIGMOD, Portland, OR, USA, 2020.</p>
      <p>[11] Ö. Ö. Garibay, B. Winslow, S. Andolina, et al., Six human-centered artificial intelligence grand challenges, Int. J. Hum. Comput. Interact. 39 (2023) 391–437.</p>
      <p>[12] C. Sancricca, C. Cappiello, Supporting the design of data preparation pipelines, in: Proceedings of the 30th Italian Symposium on Advanced Database Systems, SEBD 2022, Tirrenia (PI), Italy, 2022.</p>
      <p>[13] C. Cappiello, F. Cerutti, C. Sancricca, R. Zanelli, About the effects of data imputation techniques on ML uncertainty, in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, 2023.</p>
      <p>[14] F. Azzalini, C. Cappiello, C. Criscuolo, S. Cuzzucoli, A. Dangelo, C. Sancricca, L. Tanca, Data quality and fairness: Rivals or friends?, in: Proceedings of the 31st Symposium of Advanced Database Systems, Galzignano Terme, Italy, 2023.</p>
      <p>[15] F. Azzalini, C. Cappiello, C. Criscuolo, C. Sancricca, L. Tanca, Data quality and data ethics: Towards a trade-off evaluation, in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Data preparation: A survey of commercial tools</article-title>
          ,
          <source>SIGMOD Rec.</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Jarrahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Memariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <article-title>The principles of data-centric AI</article-title>
          ,
          <source>Commun. ACM</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <article-title>Self-service data preparation: Research to practice</article-title>
          ,
          <source>IEEE Data Eng. Bull.</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhamidipaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Gifford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Ganapavarapu</surname>
          </string-name>
          ,
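          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kalagnanam</surname>
          </string-name>
          ,
          <article-title>DQA: scalable, automated and interactive data quality advisor</article-title>
          , in:
          <source>IEEE International Conference on Big Data</source>
          , Los Angeles, USA,
          <year>2019</year>
          .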
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>