The Future of Sustainable Data Preparation

Barbara Pernici1,*, Cinzia Cappiello1, Edoardo Ramalli1, Matteo Palmonari2,
Federico Belotti2, Flavio De Paoli2, Angelo Mozzillo3, Luca Zecchini3,
Giovanni Simonini3, Sonia Bergamaschi3, Tiziana Catarci4, Matteo Filosa4,
Marco Angelini4 and Dario Benvenuti4

1 Politecnico di Milano - DEIB, Milano, Italy
2 Università di Milano-Bicocca, Milano, Italy
3 Università degli Studi di Modena e Reggio Emilia, Modena, Italy
4 Sapienza Università di Roma, Roma, Italy


Abstract
Data preparation plays an important role in data analysis, and it is time- and resource-consuming, in terms of both human and computational resources. The "Discount quality for responsible data science" project focuses on data-quality-based data preparation, analyzing the main characteristics of the related tasks and proposing methods to improve the sustainability of data preparation, also considering emerging techniques based on generative AI. This paper discusses the main challenges that emerged in the initial research work of the project, as well as possible strategies for developing more sustainable data preparation frameworks.




                                1. Introduction
The technological boost in the capability of analyzing and reusing data is enormous. The attempt to build data spaces, or data ecosystems, that support the publication and reuse of data for feeding data science pipelines has inspired several initiatives worldwide and in Europe, in several application domains. Data scientists specify and then execute pipelines to transform, enrich, and analyze data, passing through exploratory analyses and refinement cycles to control the quality of the data and improve the final model. Completely automated pipelines, e.g., AutoML, have shown significant weaknesses in data science life cycles and are often not appreciated by data scientists because of the difficulty of controlling the results in terms of quality, uncertainty, and explainability. On the other hand, assessing the quality of data and results can be very expensive in terms of computational and human costs. Recently emerging technologies, such as Large Language Models (LLMs), are starting to show promising directions to support data analysis and manipulation operations, often triggering a demand driven by a "wow effect"; however, applications of LLMs for processing data at moderate or large scales are associated with costs that make their fitness for use still unclear.
                                In this scenario, the PRIN 2022 project "Discount quality for responsible data science: Human-
                                in-the-Loop (HITL) for quality data" focuses on making the whole process sustainable, both

                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
barbara.pernici@polimi.it (B. Pernici)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




computationally and in terms of human effort, in all the different phases of data analysis, from
data preparation to data analysis and model building, and data exploitation.
   Inspired by [1], the project challenges and approaches focus on sustainability aspects concerning both human effort in HITL approaches and computational aspects of task automation, in the direction of effectively using limited resources in the process. Taking inspiration from the successful "discount" evaluation proposals for usability assessment1, we propose a "discount" quality evaluation and data preparation approach, based on methods and theories to reduce the annotation and assessment space and to control and decrease both the human effort and the use of computational resources. Two
main goals will be pursued towards sustainability: i) reducing the computational effort needed
to analyze tabular data and knowledge graphs; ii) introducing HITL in a sustainable way, to
make human contributions effective, keeping them limited in time and size.
   The paper is structured as follows. In Section 2, we discuss the state of the art. Data
preparation pipelines are discussed in Section 3, illustrating the main challenges in making them sustainable, while in Section 4 we examine the principal sustainability strategies proposed in
the project.

2. Related work
In [2], a systematic approach to developing data science projects is advocated. Several proposals
for scientific data ecosystems are emerging, including the European Open Science Cloud EOSC2 ,
and the scientific debate focuses on the need to include humans in the loop in scientific data
analysis while balancing the effort needed to achieve good-quality results.
   The project research aims at improving the state of the art in different directions, as follows.
   Data ecosystems and data spaces are widely used infrastructures enabling different stake-
holders to interact and resolve interoperability issues among shared data [3]. The design of
these data collaboratives has posed many sustainability challenges investigated first at the
business and organizational level [4], and then translated into technological practices [5]. In this
context, prior work has discussed the role of knowledge-driven approaches and related research
challenges [6], metadata representation for data science pipelines [7], and requirements to make
data FAIR (Findable, Accessible, Interoperable, Reusable) [8]. However, as of today, limited support is provided to help users develop data preparation and analysis pipelines to be integrated into data collaboratives.
   As effective data preparation depends on the users' goals, some interesting, although task-specific, proposals have recently been developed to support HITL for data preparation pipelines and data-centric AI3, and addressing data quality has become a prerequisite for data analysis, machine learning, and crowdsourcing techniques. As discussed in [9, 10], data quality in Big Data presents additional issues for assessing the quality and privacy aspects of data, considering multiple data sources. Ontology-based data management (OBDM) paradigms and automatic annotation evaluation of uncertainty (e.g., [11]) are still open problems, in particular for

1 https://www.nngroup.com/articles/discount-usability-20-years/
2 https://ec.europa.eu/info/research-and-innovation/strategy/strategy-2020-2024/our-digital-future/open-science_en
3 https://datacentricai.org/
understanding unstructured data ([12, 13]) and generating latent representations and comparing values in different datasets [14]. In [15], the issue of deciding the type and needed amount of cleaning has been raised, focusing on textual documents for information retrieval, an issue that can be generalized to different types of textual data, including tabular data. In [16], the authors
proposed a hybrid human-machine data integration framework for the entity-matching problem.
JedAI [17] provides semi-automated pipelines involving data integration and data cleaning,
where each component obtains feedback to refine the results of automated analyses. Yet, JedAI
focuses on the narrow problem of deduplicating records in databases. BrewER [18, 19] has been
proposed to clean only the portion of data useful to satisfy a user’s need expressed through a SQL
query. Data Civilizer [20] provides an end-to-end big data management system to support data
discovery and preparation considering the user’s end goal, providing primitives for performing
data debugging and workflow visualization. In addition, the need to reduce the amount of
computational resources is being emphasized (e.g., Green AI [21]). The emergence of the term
“crowd science” [22] shows the need to study all aspects related to human intervention, including
the management of scarce resources and user motivation in repetitive tasks such as labeling
and manual data quality evaluation. In such activities, a way to estimate and assess human
efforts is needed as a basis for reducing them in the development of datasets or during the
analysis, still retaining meaningful results. In [23], the challenges of exploratory data analysis
and data quality for AI steps are discussed, and a framework for selecting rows and columns,
and identifying overlaps is proposed.
   Information Visualization and Visual Analytics [24] support the real-time data analysis
process, enabling a user to explore the data, parametrize models, investigate results, and hypoth-
esize conclusions [25]. Concerning data preparation, few works have addressed it through visual means and human intervention, focusing specifically on data quality aspects. DataPilot [26]
is a recent contribution that focuses on visually supporting data preparation activities. The
analysis of data quality and performance models has been covered by several works [27, 28]
with proposals to allow human intervention through steering by Liu et al. [29]. To make this
effort effective, fluency in data analysis and data quality visual exploration by a human user
is a must [30]. For this reason, several works have explored how to keep human interaction
fluid using visualizations [31], while some others focused on analyzing user traces to inform
the system behavior on user intent, using different techniques to model it [32, 33]. The area of
data preparation has been less subject to such studies, leaving a gap in the literature.


3. Data preparation pipelines
3.1. General concepts
In the project, we focus on data preparation pipelines, defined as sets of tasks that are applied
to a dataset or a data stream to explore the data’s potential or improve its quality in the data
preparation phase. As the main objective of the project is to improve the sustainability of the
data preparation process, in this section, we focus first on the main relevant aspects we plan to
consider in the project; then we examine some relevant challenges we are planning to address
to provide support to data scientists developing pipelines in a sustainable way.
   In data preparation pipelines, starting from the classical approach of achieving a data quality
that is “fit for use”, we need to address two main important aspects: the goal of the data
preparation and the tasks that can be performed. The preparation can be performed on several
types of data sources: in the following, we consider data originating from one or more data
sources and structured as textual content either as tabular data or as a knowledge graph.
   Concerning the goal, several targets can be considered: i) preparation of a dataset for further
reuse, where the goal is to improve the quality of the dataset in general, considering usual data
quality dimensions (e.g., as described in [34]); ii) preparation of datasets for data analysis, with
either an exploratory data analysis or a well-defined analysis goal; iii) preparation of datasets
for machine learning, for training, validation, fine-tuning, and testing phases, to improve the
quality of the learned model and/or its results. We advocate goal-oriented quality improvement
to make data preparation activities more sustainable, i.e., consider the final goal when tailoring
data preparation activities.
   When pre-processing input data to improve data quality, several data preparation tasks are considered in this research. We distinguish among: i) data profiling tasks, to analyze the
characteristics of the data; ii) data transformation tasks (including data cleaning, normalization
and standardization, merging, splitting, dropping data, data imputation); iii) data matching
tasks at the instance and schema level (including deduplication, entity matching and/or linking,
annotations of columns and column pairs), and data augmentation tasks to extend data with
data from third-party sources.
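   To make these task categories concrete, the following minimal Python sketch (the names and structure are illustrative, not the project's actual framework) represents a preparation pipeline as an ordered list of typed tasks operating on a pandas DataFrame:

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Callable

    import pandas as pd

    class TaskKind(Enum):
        PROFILING = auto()       # analyze the characteristics of the data
        TRANSFORMATION = auto()  # cleaning, normalization, imputation, ...
        MATCHING = auto()        # deduplication, entity matching/linking, annotation
        AUGMENTATION = auto()    # extend data with third-party sources

    @dataclass
    class Task:
        name: str
        kind: TaskKind
        run: Callable[[pd.DataFrame], pd.DataFrame]

    def run_pipeline(df: pd.DataFrame, tasks: list[Task]) -> pd.DataFrame:
        """Apply each preparation task in order and return the prepared data."""
        for task in tasks:
            df = task.run(df)
        return df

    # A tiny example pipeline: one transformation task and one matching task.
    pipeline = [
        Task("drop_empty_rows", TaskKind.TRANSFORMATION, lambda df: df.dropna(how="all")),
        Task("deduplicate", TaskKind.MATCHING, lambda df: df.drop_duplicates()),
    ]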
   A pipeline can be interpreted as a sequence or a more complex workflow of operations to be
executed on the data. While in many application scenarios pipelines must be executed by data
engineers on large data sets, this large-scale execution is just a final step of a more complex
task, which, inspired by requirements collected for data enrichment pipelines, we conceptualize
as composed of three phases (see Figure 1).
   In the first exploration phase, the goal is to 1) understand the data's fitness for use in downstream tasks and 2) identify operations that can improve its quality based on the intended usage.
Typical actors involved in this phase are data scientists, or other professional figures with
domain expertise. The second phase consists of the design of the engineered pipeline, typically
performed by data engineers, while the third phase consists of the actual execution, which also
involves monitoring by operators.
   Concerning the tasks in the exploratory phase, examples of questions to be answered are:
what are the characteristics of the data? How can the data be transformed? Is it possible to
enrich the data via integration with other data (matching and augmentation)? In this phase,
users typically consider data samples and need some direct feedback on the results of the
operations they explore to understand their effect, making the usage of proper interfaces very
valuable. The output of this phase is the definition of a preparation pipeline and the specification
of key elements, such as configurations for specific algorithms used within it. If the pipeline needs to be applied to a large amount of data and/or replicated recurrently, it must usually be engineered to be efficient and, therefore, compliant with big data processing platforms (e.g., using distributed computation), which is the objective of the design phase. While
the output of the design phase is the specification of the engineered pipeline, the output of the
execution phase is the enriched data.
Figure 1: Exploration and data preparation pipelines


3.2. Challenges
Data exploration and the design and execution of data preparation pipelines present several
challenges related to their sustainability. Understanding data preparation tasks through the lens of sustainability is particularly interesting today, considering the role of ML in downstream
analytical modeling and the impact of LLMs (and similar models) on task automation. In the
project, we have identified the following challenges for driving further research questions:

Preparation gain: estimating the likelihood that data improvement actions performed with a given approach will improve the quality of the result with respect to the goal.
   For instance, systematically applying data cleaning on a dataset is likely to improve the quality of a machine learning model; however, the improvement ratio is difficult to assess and may depend on the selected features or parameters.
A clear understanding of the impact of a data preparation action (possibly on a selected portion
of the data) can improve the sustainability of the result, both on the computational side and on
the side of human involvement in the process, as some tasks may require human intervention.
This understanding also presents other problems to be addressed, such as clearly defining goals
and assessing the context.
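As one possible way to estimate such a gain empirically, the following sketch compares the cross-validated quality of a downstream model on a data sample before and after a cleaning action; the function, the use of scikit-learn, and the assumption of a classification task with numeric features are our illustrative choices, not the project's method:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def preparation_gain(df: pd.DataFrame, target: str, clean, sample_frac=0.2, seed=0):
        """Estimate, on a sample, how much a cleaning action improves a downstream
        model; `clean` is any function mapping a DataFrame to a cleaned DataFrame.
        Assumes numeric features and a categorical target column."""
        sample = df.sample(frac=sample_frac, random_state=seed)

        def score(data: pd.DataFrame) -> float:
            X = data.drop(columns=[target]).fillna(0)
            y = data[target]
            model = RandomForestClassifier(random_state=seed)
            return cross_val_score(model, X, y, cv=3).mean()

        return score(clean(sample)) - score(sample)  # positive = cleaning helped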

Sustainable LLMs: leveraging LLMs in data preparation, combining the power of the latest-generation models with efficiency, scalability, and environmental awareness.
LLMs or similar models targeting structured and semi-structured data are showing promising
performance on several data preparation tasks (e.g., [35, 36, 37, 38]). Careful prompt engineering
[39], larger context sizes [40] and orchestration strategies seem to deliver interesting capabili-
ties related to tasks that can be mapped to language generation (e.g., code generation for data
transformation, query generation and classification for data augmentation) or even decision/classification (e.g., deduplication and disambiguation), yet they are extremely expensive and hard to scale. It is still unclear whether these recently proposed solutions are advantageous for, or even compatible with, large-scale processing when we consider speed (execution times), costs (infrastructure), and environmental sustainability (carbon emissions) at training and inference time. In general, enhancing the quality of the data consumed by LLMs improves the performance of the models trained or fine-tuned on them [41]. With structured and semi-structured data coming from the large corpora employed to train LLMs, it is impractical to clean/prepare the entire data; thus, the efforts could focus on the portions/tasks that yield the highest benefit for the LLMs. A
first challenge is better characterizing the trade-offs between exploiting the power of LLMs’
implicit knowledge and preserving efficiency, scalability, and environmental sustainability. A
second challenge is finding sweet spots that make LLM usage valuable considering benefit-cost trade-offs, e.g., application to specific data samples.
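One way to look for such sweet spots is to reserve the LLM for the uncertain cases that a cheap matcher cannot decide. The sketch below illustrates the idea for deduplication; `cheap_similarity`, `llm_judge`, and the thresholds are hypothetical placeholders for an inexpensive matcher and an LLM call:

    def sustainable_dedup(pairs, cheap_similarity, llm_judge, low=0.3, high=0.9):
        """Decide duplicate/non-duplicate for record pairs, invoking the expensive
        LLM only in the uncertain band between `low` and `high` similarity."""
        decisions, llm_calls = [], 0
        for a, b in pairs:
            s = cheap_similarity(a, b)            # cheap signal, e.g., Jaccard similarity
            if s >= high:
                decisions.append((a, b, True))    # confident match: no LLM needed
            elif s <= low:
                decisions.append((a, b, False))   # confident non-match: no LLM needed
            else:
                llm_calls += 1                    # uncertain: spend an LLM call
                decisions.append((a, b, llm_judge(a, b)))
        return decisions, llm_calls

Tightening or widening the uncertainty band gives a direct knob on the benefit-cost trade-off: more LLM calls buy accuracy at the price of speed, money, and emissions.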

User understanding: providing the users with the capability to understand, control, and improve
the outcomes of algorithmic decisions in a human-in-the-loop fashion, even when using the latest
generation models.
For example, in entity linking, users aware of the uncertainty associated with links selected
by the algorithms can get insights into the quality of the results and revise these results
faster [42]. However, solutions to learn from users’ actions are still under-explored in several
data preparation tasks. Latest models, e.g., those based on LLMs proposed for matching-related
tasks [35, 36, 37, 38], do not natively support the interpretation of their decisions in terms of
confidence and are very difficult to adapt with a limited amount of user feedback. Adequate
management of the human-in-the-loop approach represents a challenge, efficiently involving the
human user without generating cognitive overload due to too much data to analyze, too broad
and unfocused areas of intervention, or not well-supported decisions to make (human-driven
versus human-as-reviewer). A challenge also arises in identifying the correct degree of control to provide to the human user, efficiently exploiting the different capabilities of humans and machines.
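A simple policy consistent with this goal is to route only the least-confident decisions to the reviewer, within a fixed effort budget. The sketch below (the triple format and budget policy are illustrative assumptions) builds such a review queue for entity-linking results:

    import heapq

    def review_queue(links, budget: int):
        """Select the `budget` least-confident links for human review, so limited
        reviewer time is spent where algorithmic decisions are most uncertain.
        `links` is an iterable of (record_id, candidate, confidence) triples."""
        return heapq.nsmallest(budget, links, key=lambda link: link[2])

    links = [("r1", "Q42", 0.97), ("r2", "Q5", 0.41), ("r3", "Q7", 0.65)]
    print(review_queue(links, budget=2))  # r2 and r3 go to the human reviewer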

User Experience Interaction-Driven Optimization: while users interact with systems
involving big data, such as big data visualization systems, their interactions can be recorded and
stored in the form of interaction logs; such logs can then be used to capture characteristics of the
user intent and to optimize the visualization systems used during the data preparation pipeline.
Frequently, such logs work like black boxes due to their low-level nature (i.e., the log is explorable, but it contains just low-level atomic interactions such as a mouse click or a mouse move, and relating it to high-level user actions with reasonable accuracy is not straightforward with state-of-the-art techniques) and do not provide the data preparation expert with information on the decision process behind the user's interactions. When analyzing them, the data preparation expert may be overwhelmed and misled, since the information grasped from the logs is too low-level and does not give any insight into the user's intentions. By providing techniques for the
extraction of the user’s intent, it will be possible to know in advance in which portion of the
interaction space and in which phase of the data preparation pipeline it is appropriate to apply
optimizations. Finally, such logs can be exploited to understand at which layer (e.g., data,
rendering, or interaction) the visualization systems used during the data preparation pipeline fail to maintain response times low enough to keep the user experience optimal [43].
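As a minimal illustration of lifting low-level traces toward user intent, the sketch below collapses atomic interaction events into a run-length-encoded sequence of coarser actions; the event-to-action mapping is a hypothetical stand-in for the modeling techniques cited above:

    from itertools import groupby

    # Hypothetical mapping from low-level log events to coarser action labels.
    EVENT_TO_ACTION = {
        "mouse_move": "navigate", "mouse_click": "select",
        "zoom": "explore", "filter_change": "refine",
    }

    def abstract_trace(events):
        """Collapse a low-level interaction log into a run-length-encoded sequence
        of higher-level actions, a first step toward modeling user intent."""
        actions = (EVENT_TO_ACTION.get(e, "other") for e in events)
        return [(action, len(list(run))) for action, run in groupby(actions)]

    log = ["mouse_move", "mouse_move", "mouse_click", "zoom", "zoom", "filter_change"]
    print(abstract_trace(log))  # [('navigate', 2), ('select', 1), ('explore', 2), ('refine', 1)]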


4. Sustainable strategies
Sustainability can be achieved by reducing time and resources, either by reducing the computational complexity of task execution (e.g., reducing the volume of the input dataset) or by avoiding redundant actions (e.g., reusing components/information). In the project, we are examining several possible strategies, as follows.

Sustainable Data Preparation Components. Sustainable data preparation ensures that
data-driven processes are efficient, effective, and environmentally friendly. Implementing
such a strategy involves processing data efficiently, from data collection to analysis. This ranges from minimizing the acquisition of unnecessary data to properly selecting the data preparation components. Improving data quality can already be considered a sustainable
action since poor data quality can lead to inefficiencies and resource waste. However, the data
preparation components have different characteristics and, therefore, different impacts on the
efficiency/effectiveness of the process. The volume and variety of components to consider are
high. In [44], it is possible to find a classification of the tasks included in the data preparation
pipeline: data discovery, data validation, data structuring, data enrichment, data filtering, and
data cleaning. Each category contains a plethora of different functionalities and techniques,
and their selection is not easy. The components differ in several respects: scope, execution time, complexity, energy consumption, autonomy level, and effectiveness. The idea is to consider such properties to find combinations of components able to guarantee
the right balance between sustainability and quality of the results.
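The sketch below illustrates one possible selection policy over such properties; the fields, the quality floor, and the linear scoring function are illustrative assumptions, not the project's model:

    def select_components(candidates, quality_floor: float, weight_energy: float = 0.5):
        """Rank candidate components that meet a minimum effectiveness, trading
        effectiveness against (normalized) energy consumption."""
        eligible = [c for c in candidates if c["effectiveness"] >= quality_floor]
        return sorted(
            eligible,
            key=lambda c: (1 - weight_energy) * c["effectiveness"]
                          - weight_energy * c["energy"],
            reverse=True,
        )

    candidates = [
        {"name": "rule_based_cleaner", "effectiveness": 0.78, "energy": 0.10},
        {"name": "ml_cleaner", "effectiveness": 0.90, "energy": 0.55},
        {"name": "llm_cleaner", "effectiveness": 0.93, "energy": 0.95},
    ]
    print(select_components(candidates, quality_floor=0.75))  # rule-based ranks first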

Pipeline configurations for sustainable data quality. The design of data preparation
pipelines is challenging: the data analyst must choose the appropriate operations accounting for
several factors. Trial-and-error approaches do not always lead to the most effective solution. Instead, a systematic and automatic strategy supported by provenance information can optimize this procedure and lead faster to the desired solution while constantly receiving feedback from the
user [45]. However, while the methodology to build a cost-effective data preparation pipeline is
clear, a sustainable method to reuse these pipelines is still missing. This research work aims to
propose a strategy that defines pipeline embeddings based on the components’ characteristics
that add context-aware capabilities. Such an approach can be used to reuse data profiling activities, which is fundamental when applied to LLMs and knowledge graphs.
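A minimal sketch of the idea, using a bag-of-components vector as a deliberately simple stand-in for the richer, context-aware pipeline embeddings envisioned here:

    import numpy as np

    # Illustrative component vocabulary; a real embedding would also encode
    # execution time, energy consumption, and context features.
    COMPONENT_VOCAB = ["profiling", "cleaning", "normalization", "imputation",
                       "deduplication", "enrichment"]

    def embed_pipeline(components) -> np.ndarray:
        """Represent a pipeline as a bag-of-components vector."""
        vec = np.zeros(len(COMPONENT_VOCAB))
        for c in components:
            if c in COMPONENT_VOCAB:
                vec[COMPONENT_VOCAB.index(c)] += 1
        return vec

    def most_similar(query, catalog):
        """Return the stored pipeline most similar to `query` (cosine similarity),
        as a basis for reusing its profiling results and configuration."""
        q = embed_pipeline(query)

        def cos(v: np.ndarray) -> float:
            denom = np.linalg.norm(q) * np.linalg.norm(v)
            return float(q @ v / denom) if denom else 0.0

        return max(catalog.items(), key=lambda kv: cos(embed_pipeline(kv[1])))

    catalog = {"p1": ["profiling", "cleaning"], "p2": ["deduplication", "enrichment"]}
    print(most_similar(["profiling", "cleaning", "imputation"], catalog))  # -> "p1"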

Data Preparation On-Demand. Since the paradigm for data integration is increasingly
moving from ETL to ELT, novel on-demand solutions are required to efficiently perform data
preparation and integration on large amounts of raw data (usually stored in data lakes), cleaning
only the portion of data relevant to the downstream task at hand. We therefore aim to work in this direction to provide practitioners with novel tools, as previously done with BrewER [18, 19], which runs SQL SP (select-project) queries directly on dirty data through entity resolution on-demand, and
Sloth [46], designed to detect duplicate and possibly inconsistent versions of the same table on
the Web or in data lakes.
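The sketch below conveys the on-demand principle by cleaning only the query-relevant rows; it is a simplification, not the BrewER algorithm, and `predicate` and `resolve` are illustrative placeholders for the query filter and the entity resolution step:

    import pandas as pd

    def clean_on_demand(df: pd.DataFrame, predicate, resolve) -> pd.DataFrame:
        """Apply expensive entity resolution only to the records the query
        actually touches; the rest of the data lake stays untouched."""
        relevant = df[df.apply(predicate, axis=1)]
        return resolve(relevant)

    df = pd.DataFrame({"name": ["acme", "ACME Inc.", "Foo"], "price": [10, 12, 99]})
    result = clean_on_demand(
        df,
        predicate=lambda row: row["price"] < 50,                       # rows the query selects
        resolve=lambda d: d.assign(key=d["name"].str.lower().str[:4])  # naive blocking key
                           .groupby("key", as_index=False)
                           .agg({"name": "first", "price": "mean"}),
    )
    print(result)  # the two "acme" records are merged; "Foo" is never processed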

User Experience Sustainable Analysis. Logs collected during the usage of big data visualization systems can be exploited, leveraging generative AI-based techniques, to extract the user intent. By relating low-level traces to high-level visualization task taxonomies [47], it will be possible to capture characteristics of the users' intent during the data preparation pipeline, optimizing the steps requiring human intervention thanks to improved and more efficient interaction. In this way, such a pipeline could be refined and
optimized, considering the users’ intent, allowing the creation of a broader design space for
optimization strategies in the other layers (due to the more semantic nature of the user intent
with respect to low-level traces). By leveraging these data, it is possible to fine-tune LLMs
to support optimizations of the data preparation pipelines the user can select during her
work. This strategy, which is linked to the User Understanding challenge, requires providing explainability for each choice offered to the user, which will be investigated [48] to enable the inspection and understanding of the process behind their derivation and of their expected costs and outcomes.
Finally, by optimizing the user experience in the visualization systems used during the data preparation pipeline, we can trigger a cascade of more favorable outcomes in each of its phases.
State-of-the-art approaches [49, 50] tend to mitigate the factor that most negatively impacts the user experience in such systems, namely high response time [30], by looking only at the database level. We can exploit log analysis to pinpoint which layer of the visualization system (e.g., data, rendering, interaction) is causing the failure, to highlight which portion of the interaction space tends to lead the system into trouble, and to suggest appropriate optimization techniques.
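As a minimal illustration, the following sketch attributes logged response times to the layers of a visualization system and flags the layers breaking a latency budget; the log format and the 500 ms threshold are assumptions made for the example:

    from collections import defaultdict

    LATENCY_BUDGET_MS = 500  # responses above this hurt the user experience [30]

    def slow_layers(log_entries):
        """Average the logged response time per layer (data, rendering,
        interaction) and return the layers exceeding the latency budget.
        `log_entries` is an iterable of (layer, duration_ms) pairs."""
        totals, counts = defaultdict(float), defaultdict(int)
        for layer, duration_ms in log_entries:
            totals[layer] += duration_ms
            counts[layer] += 1
        return {layer: totals[layer] / counts[layer]
                for layer in totals
                if totals[layer] / counts[layer] > LATENCY_BUDGET_MS}

    log = [("data", 620.0), ("rendering", 120.0), ("data", 710.0), ("interaction", 40.0)]
    print(slow_layers(log))  # {'data': 665.0} -> optimize the data layer first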

   To conclude, in Table 1 we summarize the main directions that are being explored in the project, individually or in combination, indicating which of the named challenges each proposed strategy addresses.
                                                       Preparation   Sustainable   User            User Experience
                                                       Gain          LLMs          Understanding   Interaction-Driven
                                                                                                   Optimization
Sustainable Data Preparation Components                    X             X
Pipeline configurations for sustainable data quality       X             X
Data Preparation On-Demand                                 X             X                              X
User Experience Sustainable Analysis                       X                           X                X

Table 1
Challenges addressed by each of the proposed strategies




5. Concluding remarks
Data preparation pipelines make an important contribution both to the required quality of data
in different contexts and to the reduction of the amount of resources needed for their use. In
the project, we are studying the main challenges to be addressed to achieve the proposed set of
strategies to increase the sustainability of data preparation tasks. In particular, we are exploring
several research directions, namely: exploiting the reuse of sustainable pipelines; concentrating on LLMs, both as a case study of resource-hungry applications and as a tool for data preparation; and increasing the user-in-the-loop role by leveraging usable visual information exploration approaches.


Acknowledgments
This work has been supported by the PRIN 2022 Project “Discount quality for responsible data
science: Human-in-the-Loop for quality data” and by the PNRR-PE-AI “FAIR” project funded by
the NextGenerationEU program.


References
 [1] J. Nielsen, Applying discount usability engineering, IEEE Software 12 (1995) 98–100.
 [2] V. Stodden, The data science life cycle: a disciplined approach to advancing data science
     as a science, Communications of the ACM 63 (2020) 58–66.
 [3] M. I. S. Oliveira, B. F. Lóscio, What is a data ecosystem?, in: Proceedings of the 19th
     Annual International Conference on Digital Government Research: Governance in the
     Data Age, 2018, pp. 1–9.
 [4] E. Ruijer, Designing and implementing data collaboratives: A governance perspective,
     Government Information Quarterly 38 (2021) 101612.
 [5] E. Ramalli, B. Pernici, Sustainability and governance of data ecosystems, in: 2023 IEEE
     International Conference on Web Services (ICWS), IEEE, 2023, pp. 740–745.
 [6] S. Geisler, M. Vidal, C. Cappiello, B. F. Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier,
     B. Otto, E. Paja, B. Pernici, J. Rehof, Knowledge-driven data ecosystems toward data
     transparency, ACM Journal of Data and Information Quality 14 (2022) 3:1–3:12.
 [7] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. M. Wallach, H. Daumé III, K. Crawford,
     Datasheets for datasets, Commun. ACM 64 (2021) 86–92. URL: https://doi.org/10.1145/
     3458723. doi:10.1145/3458723.
 [8] A. Jacobsen, R. de Miranda Azevedo, N. S. Juty, D. Batista, S. J. Coles, R. Cornet, M. Courtot,
     M. Crosas, M. Dumontier, C. T. A. Evelo, C. A. Goble, G. Guizzardi, K. K. Hansen, A. Hasnain,
     K. M. Hettne, J. Heringa, R. W. W. Hooft, M. Imming, K. G. Jeffery, R. Kaliyaperumal, M. G.
     Kersloot, C. R. Kirkpatrick, T. Kuhn, I. Labastida, B. Magagna, P. McQuilton, N. Meyers,
     A. Montesanti, M. van Reisen, P. Rocca-Serra, R. Pergl, S. Sansone, L. O. B. da Silva Santos,
     J. Schneider, G. O. Strawn, M. Thompson, A. Waagmeester, T. Weigel, M. D. Wilkinson, E. L.
     Willighagen, P. Wittenburg, M. Roos, B. Mons, E. Schultes, FAIR principles: Interpretations
     and implementation considerations, Data Intell. 2 (2020) 10–29. URL: https://doi.org/10.
     1162/dint_r_00024. doi:10.1162/dint_r_00024.
 [9] T. Catarci, M. Scannapieco, M. Console, C. Demetrescu, My (fair) big data, in: J. Nie,
     Z. Obradovic, T. Suzumura, R. Ghosh, R. Nambiar, C. Wang, H. Zang, R. Baeza-Yates, X. Hu,
     J. Kepner, A. Cuzzocrea, J. Tang, M. Toyoda (Eds.), 2017 IEEE International Conference on
     Big Data (IEEE BigData 2017), Boston, MA, USA, December 11-14, 2017, IEEE Computer
     Society, 2017, pp. 2974–2979. URL: https://doi.org/10.1109/BigData.2017.8258267. doi:10.
     1109/BIGDATA.2017.8258267.
[10] D. Ardagna, C. Cappiello, W. Samá, M. Vitali, Context-aware data quality assessment for
     big data, Future Generation Computer Systems 89 (2018) 548–562. URL: https://doi.org/10.
     1016/j.future.2018.07.014. doi:10.1016/J.FUTURE.2018.07.014.
[11] G. Scalia, C. A. Grambow, B. Pernici, Y. Li, W. H. Green Jr., Evaluating scalable uncertainty
     estimation methods for deep learning-based molecular property prediction, Journal of
     Chemical Information and Modeling 60 (2020) 2697–2717. URL: https://doi.org/10.1021/acs.
     jcim.9b00975. doi:10.1021/ACS.JCIM.9B00975.
[12] D. Ritze, O. Lehmberg, C. Bizer, Matching HTML tables to DBpedia, in: Proceedings of the
     5th International Conference on Web Intelligence, Mining and Semantics, 2015, pp. 1–6.
[13] M. Cremaschi, F. De Paoli, A. Rula, B. Spahiu, A fully automated approach to a complete
     semantic table interpretation, Future Generation Computer Systems 112 (2020) 478–500.
[14] V. Cutrona, M. Ciavotta, F. De Paoli, M. Palmonari, et al., ASIA: A tool for assisted semantic
     interpretation and annotation of tabular data, in: CEUR WORKSHOP PROCEEDINGS,
     volume 2456, CEUR-WS, 2019, pp. 209–212.
[15] D. Roy, M. Mitra, D. Ganguly, To clean or not to clean: Document preprocessing and
     reproducibility, Journal of Data and Information Quality (JDIQ) 10 (2018) 1–25.
[16] G. Li, Human-in-the-loop data integration, Proceedings of the VLDB Endowment 10 (2017)
     2006–2017.
[17] G. Papadakis, G. Mandilaras, L. Gagliardelli, G. Simonini, E. Thanos, G. Giannakopoulos,
     S. Bergamaschi, T. Palpanas, M. Koubarakis, Three-dimensional entity resolution with
     JedAI, Information Systems 93 (2020) 101565.
[18] G. Simonini, L. Zecchini, S. Bergamaschi, F. Naumann, Entity Resolution On-Demand,
     Proceedings of the VLDB Endowment (PVLDB) 15 (2022) 1506–1518. doi:10.14778/
     3523210.3523226.
[19] L. Zecchini, G. Simonini, S. Bergamaschi, F. Naumann, BrewER: Entity Resolution On-
     Demand, Proceedings of the VLDB Endowment (PVLDB) 16 (2023) 4026–4029. doi:10.
     14778/3611540.3611612.
[20] E. K. Rezig, L. Cao, M. Stonebraker, G. Simonini, W. Tao, S. Madden, M. Ouzzani, N. Tang,
     A. K. Elmagarmid, Data civilizer 2.0: A holistic framework for data preparation and
     analytics, Proc. VLDB Endow. 12 (2019) 1954–1957. URL: http://www.vldb.org/pvldb/vol12/
     p1954-rezig.pdf. doi:10.14778/3352063.3352108.
[21] R. Schwartz, J. Dodge, N. A. Smith, O. Etzioni, Green AI, Communications of the ACM 63
     (2020) 54–63.
[22] D. Ustalov, F. Casati, A. Drutsa, D. Baidakova (Eds.), Proceedings of the Crowd Science
     Workshop: Remoteness, Fairness, and Mechanisms as Challenges of Data Supply by
     Humans for Automation co-located with 34th Conference on Neural Information Pro-
     cessing Systems, CSW@NeurIPS 2020, Vancouver, BC, Canada / Online Event, Decem-
     ber 11, 2020, volume 2736 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL:
     https://ceur-ws.org/Vol-2736.
[23] H. Patel, S. Guttula, N. Gupta, S. Hans, R. S. Mittal, L. N, A data-centric AI framework for
     automating exploratory data analysis and data quality tasks, ACM Journal of Data and
     Information Quality (2023).
[24] D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, H. Ziegler, Visual analytics: Scope
     and challenges, Springer, 2008.
[25] L. Battle, P. Eichmann, M. Angelini, T. Catarci, G. Santucci, Y. Zheng, C. Binnig, J.-D.
     Fekete, D. Moritz, Database benchmarking for supporting real-time interactive querying
     of large data, in: Proceedings of the 2020 ACM SIGMOD International Conference on
     Management of Data, 2020, pp. 1571–1587.
[26] A. Narechania, F. Du, A. R. Sinha, R. Rossi, J. Hoffswell, S. Guo, E. Koh, S. B. Navathe,
     A. Endert, Datapilot: Utilizing quality and usage information for subset selection during
     visual data preparation, in: Proceedings of the 2023 CHI Conference on Human Factors in
     Computing Systems, CHI ’23, Association for Computing Machinery, New York, NY, USA,
     2023. URL: https://doi.org/10.1145/3544548.3581509. doi:10.1145/3544548.3581509.
[27] M. Angelini, C. Daraio, M. Lenzerini, F. Leotta, G. Santucci, Performance model’s develop-
     ment: A novel approach encompassing ontology-based data access and visual analytics,
     Scientometrics 125 (2020) 865–892.
[28] T. Gschwandtner, W. Aigner, S. Miksch, J. Gärtner, S. Kriglstein, M. Pohl, N. Suchy, Time-
     cleanser: a visual analytics approach for data cleansing of time-oriented data, in: Proceed-
     ings of the 14th International Conference on Knowledge Technologies and Data-Driven
     Business, i-KNOW ’14, Association for Computing Machinery, New York, NY, USA, 2014.
     URL: https://doi.org/10.1145/2637748.2638423. doi:10.1145/2637748.2638423.
[29] S. Liu, G. Andrienko, Y. Wu, N. Cao, L. Jiang, C. Shi, Y.-S. Wang, S. Hong, Steering
     data quality with visual analytics: The complexity challenge, Visual Informatics 2 (2018)
     191–197.
[30] Z. Liu, J. Heer, The effects of interactive latency on exploratory visual analysis, IEEE
     Transactions on Visualization and Computer Graphics 20 (2014) 2122–2131.
[31] A. Ulmer, M. Angelini, J.-D. Fekete, J. Kohlhammer, T. May, A survey on progressive
     visualization, IEEE Transactions on Visualization and Computer Graphics (2023) 1–18.
     doi:10.1109/TVCG.2023.3346641.
[32] J. S. Yi, Y. a. Kang, J. Stasko, J. Jacko, Toward a deeper understanding of the role of
     interaction in information visualization, IEEE Transactions on Visualization and Computer
     Graphics 13 (2007) 1224–1231. doi:10.1109/TVCG.2007.70515.
[33] D. Benvenuti, M. Filosa, T. Catarci, M. Angelini, Modeling and assessing user interaction
     in big data visualization systems, in: J. Abdelnour Nocera, M. Kristín Lárusdóttir, H. Petrie,
     A. Piccinno, M. Winckler (Eds.), Human-Computer Interaction – INTERACT 2023, Springer
     Nature Switzerland, Cham, 2023, pp. 86–109.
[34] C. Batini, M. Scannapieco, Data and Information Quality - Dimensions, Principles and
     Techniques, Data-Centric Systems and Applications, Springer, 2016. URL: https://doi.org/
     10.1007/978-3-319-24106-7. doi:10.1007/978-3-319-24106-7.
[35] X. Deng, H. Sun, A. Lees, Y. Wu, C. Yu, TURL: Table understanding through representation
     learning, ACM SIGMOD Record 51 (2022) 33–40.
[36] J. Tu, J. Fan, N. Tang, P. Wang, G. Li, X. Du, X. Jia, S. Gao, Unicorn: A unified multi-tasking
     model for supporting matching tasks in data integration, Proceedings of the ACM on
     Management of Data 1 (2023) 1–26.
[37] T. Zhang, X. Yue, Y. Li, H. Sun, TableLlama: Towards open large generalist models for
     tables, arXiv preprint arXiv:2311.09206 (2023).
[38] M. Trabelsi, Z. Chen, S. Zhang, B. D. Davison, J. Heflin, StruBERT: Structure-aware BERT
     for table search and matching, in: Proceedings of the ACM Web Conference, WWW ’22,
     2022.
[39] M. Mosbach, T. Pimentel, S. Ravfogel, D. Klakow, Y. Elazar, Few-shot fine-tuning
     vs. in-context learning: A fair comparison and evaluation, in: A. Rogers, J. Boyd-
     Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguis-
     tics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp.
     12284–12314. URL: https://aclanthology.org/2023.findings-acl.779. doi:10.18653/v1/
     2023.findings-acl.779.
[40] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, LongLoRA: Efficient fine-tuning of
     long-context large language models, in: The Twelfth International Conference on Learning
     Representations, 2024. URL: https://openreview.net/forum?id=6PmJoRfdaK.
[41] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi,
     P. Kauffmann, G. de Rosa, O. Saarikivi, et al., Textbooks are all you need, arXiv preprint
     arXiv:2306.11644 (2023).
[42] R. Avogadro, M. Ciavotta, F. De Paoli, M. Palmonari, D. Roman, Estimating link confidence
     for human-in-the-loop table annotation, in: 2023 IEEE/WIC International Conference on
     Web Intelligence and Intelligent Agent Technology (WI-IAT), 2023, pp. 142–149. doi:10.
     1109/WI-IAT59888.2023.00025.
[43] Z. Liu, J. Heer, The effects of interactive latency on Exploratory Visual Analysis, IEEE
     Transactions on Visualization and Computer Graphics 20 (2014) 2122–2131. doi:10.1109/
     TVCG.2014.2346452.
[44] M. Hameed, F. Naumann, Data preparation: A survey of commercial tools, ACM SIGMOD
     Record 49 (2020) 18–29.
[45] C. A. Bono, C. Cappiello, B. Pernici, E. Ramalli, M. Vitali, Pipeline design for data prepara-
     tion for social media analysis, ACM Journal of Data and Information Quality 15 (2023)
     1–25.
[46] L. Zecchini, T. Bleifuß, G. Simonini, S. Bergamaschi, F. Naumann, Determining the Largest
     Overlap between Tables, Proceedings of the ACM on Management of Data (PACMMOD) 2
     (2024) 48:1–48:26. doi:10.1145/3639303.
[47] M. Brehmer, T. Munzner, A multi-level typology of abstract visualization tasks, IEEE
     Transactions on Visualization and Computer Graphics 19 (2013) 2376–2385.
[48] B. La Rosa, G. Blasilli, R. Bourqui, D. Auber, G. Santucci, R. Capobianco, E. Bertini, R. Giot,
     M. Angelini, State of the art of visual analytics for explainable deep learning, in: Computer
     Graphics Forum, volume 42, Wiley Online Library, 2023, pp. 319–355.
[49] M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki,
     K. Wenger, DEVise: Integrated querying and visual exploration of large datasets, SIGMOD
     ’97, Association for Computing Machinery, New York, NY, USA, 1997, p. 301–312.
[50] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: An efficient data clustering method for very
     large databases, SIGMOD ’96, Association for Computing Machinery, New York, NY, USA,
     1996, p. 103–114.