=Paper= {{Paper |id=Vol-3135/dataplat_short2 |storemode=property |title=Towards Human-centric AutoML via Logic and Argumentation |pdfUrl=https://ceur-ws.org/Vol-3135/dataplat_short2.pdf |volume=Vol-3135 |authors=Joseph Giovanelli,Giuseppe Pisano |dblpUrl=https://dblp.org/rec/conf/edbt/GiovanelliP22 }} ==Towards Human-centric AutoML via Logic and Argumentation== https://ceur-ws.org/Vol-3135/dataplat_short2.pdf
Towards Human-centric AutoML via Logic and Argumentation
Joseph Giovanelli¹, Giuseppe Pisano¹
¹ ALMA MATER STUDIORUM - Università di Bologna


Abstract
In the last decade, we have witnessed exponential growth in both the complexity and the number of Machine Learning (ML) techniques. As a consequence, leveraging such methods to solve real-case problems has become difficult for a Data Scientist (DS). Automated Machine Learning (AutoML) tools were devised to alleviate that task, but soon became as complex as the ML techniques themselves. The DS has started to rely on this kind of tool without understanding its functioning, thus losing control over the process.
   In this vision paper, we propose HAMLET (Human-centric AutoMl via Logic and Argumentation), a framework that would help the DS to regain her centrality. HAMLET is inspired by the well-known standard process model CRISP-DM. Iteration after iteration, the knowledge is augmented by acquiring more constraints about the problem until a suitable solution is found. HAMLET leverages Logic and Argumentation to merge both constraints and solutions in a uniform human- and machine-readable medium. Not only does it allow an easy exploration of the new knowledge at each iteration, but it also enforces a continuous revision via the AutoML tool and the confrontation between the DS and Domain Experts.

                                             Keywords
                                             AutoML, Logic, Argumentation, CRISP-DM, Data Scientist



Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
Email: j.giovanelli@unibo.it (J. Giovanelli); g.pisano@unibo.it (G. Pisano)
ORCID: 0000-0002-0990-3893 (J. Giovanelli); 0000-0003-0230-8212 (G. Pisano)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

In relation to data platforms, it is well known that Machine Learning (ML) plays a key role in the process of data analysis. As a matter of fact, it has been pervasively employed to cope with every type of real-case problem [1, 2, 3, 4]. The Data Scientist (DS) (i.e., a specialist of data analysis) starts by collecting raw data in an arbitrary format. Then she typically leverages a process model that will help her to translate the knowledge about the problem into ML constraints, and deploy the solution. CRISP-DM [5] is the most acknowledged standard process model and we will take it as a reference throughout the paper. A solution consists of a ML pipeline: a series of Data Pre-processing transformations and a ML algorithm. The DS can instantiate both with a large set of techniques, which have their own tunable hyper-parameters. These choices highly affect the performance of a solution.

Automated Machine Learning (AutoML) tools have been devised with the aim of assisting the DS during the ML pipeline instantiation. They leverage state-of-the-art optimisation approaches to smartly explore huge search spaces of solutions. AutoML has been demonstrated to provide accurate performance, even within a limited time budget. During the setting up of the search space, it is highly important for the DS to leverage the knowledge about the problem, considering all the ML constraints. Otherwise, the AutoML tool might retrieve invalid solutions (i.e., solutions whose results cannot be deemed correct). Besides, AutoML tools have become so complex that it is difficult for the DS to understand their functioning, and she hence loses control over the process. Researchers are aware of these problems [6]. Some works have prescribed the use of a human-centric framework for AutoML [7, 8, 9], yet they suggest only design requirements. Alternatively, the authors in [10] have proposed a tool that visualises the best and the worst solutions retrieved by an AutoML tool.

We claim that the need for a human-centric framework for AutoML is real, and that it is crucial for the DS to augment her knowledge via the retrieved solutions. To this end, we propose HAMLET (Human-centric AutoMl via Logic and Argumentation), which leverages Logic and Argumentation to:

   • structure the ML constraints and the AutoML solutions in a Logical Knowledge Base (LogicalKB);
   • parse the structured LogicalKB into a human- and machine-readable medium called Problem Graph;
   • leverage the Problem Graph to set up an AutoML search space;
   • leverage the Problem Graph to allow both the DS and an AutoML tool to revise the current knowledge.

Figure 1 illustrates how CRISP-DM, AutoML, and HAMLET interact with each other. We remark that our framework allows the DS to never lose control over the
process, and hence her centrality. Besides, HAMLET allows the knowledge to be visualised in a human- and machine-readable format. As advocated in [11], the DS needs to understand the AutoML process in order to trust the proposed solutions.

The remainder of the paper is structured as follows. Section 2 and Section 3 introduce the main notions of AutoML and Argumentation, respectively. Section 4 illustrates our framework. Finally, Section 5 draws the conclusions and discusses potential leveraging.

2. AutoML

AutoML tools have been conceived with the aim of relieving the DS of the overwhelming practice of finding a suitable solution for the case at hand. We recall that in the context of data platforms, a solution is a ML pipeline, defined as a series of Data Pre-processing transformations followed by a ML algorithm. In the early days of AutoML, only the instantiation of the latter (the ML algorithm) was addressed. Auto-Weka [12] formalised the problem as Combined Algorithm Selection and Hyper-parameter Optimisation (CASH). In a nutshell, in order to find the best performing configuration, various ML algorithms, together with their hyper-parameters, have to be tested over a dataset. Such a problem was successfully tackled by leveraging Bayesian Optimisation (BO) [13], a sequential design strategy for global optimisation. The process involves several iterations, through which different configurations are explored. As the iterations advance, an increasingly accurate model is built on top of the previously explored configurations, with the aim of suggesting the most promising ones. Configurations keep being explored, and the model keeps being updated, until a budget in terms of either iterations or time is reached.

More recently, AutoML is no longer limited to optimising just the ML algorithm phase, but includes Data Pre-processing as well. Indeed, with the aid of a series of transformations, it is possible to achieve better performance, unattainable with the best performing ML algorithm configuration alone [14]. In [15], the author formalised the problem as Data Pipeline Selection and Optimisation (DPSO). Each of the transformations can be instantiated with different techniques, which, analogously to the ML algorithms, have their own hyper-parameters. Auto-sklearn [16] has included Data Pre-processing since its first versions. Yet, it fixes the arrangement of the transformations a priori, without considering that the best performing arrangement changes according to the case and data at hand. Considering several arrangements translates into larger search spaces, which are not easy to explore.

In order to cope with ever larger search spaces, various expedients have been employed. Meta-learning (i.e., learning on top of learning) has been used to warm-start the Bayesian Optimisation (i.e., to boost the convergence process) by suggesting promising configurations (i.e., those that worked well in previous similar real-case problems) [17]. Ensembling (i.e., construction of a high-performing solution combining several low-performing solutions; e.g., bagging, boosting, stacking) has been leveraged to enable AutoML tools to retrieve a solution that combines the best performing configurations, instead of retrieving just the best performing one [16]. Moreover, multi-fidelity methods (i.e., the use of several partial estimations to speed up the time-consuming evaluation process) have been exploited to let AutoML tools explore as many configurations as possible.

All in all, the improvements made over the last years have been so substantial that AutoML is nowadays able to handle the entire ML pipeline instantiation. Yet, the stacking of complex mechanisms on top of each other has unavoidably led to a poorer understanding of the process by the DS. We believe that the DS has the duty to revise and supervise the suggested solutions. Unfortunately, state-of-the-art AutoML tools overlook her role, and do not make that possible.

3. Logic & Argumentation

Logic can be defined as the abstract study of statements, sentences and deductive arguments [18]. Since its birth, it has been widely developed and improved, and it now includes a variety of formalisms and technologies. Among them, Argumentation has proved itself an important tool for handling conflicting information (e.g., opinions, empirical data). This has led to a large body of research trying to establish a computational model of logical arguments.

In Abstract Argumentation [19], a scenario can be represented by a directed graph. Each node represents an argument, and each edge denotes an attack by one argument on another. Each argument is regarded as atomic: there is no internal structure to an argument. Also, there is no specification of what an argument or an attack is. A graph can then be analysed to determine which arguments are acceptable according to some general criteria (i.e., semantics) [20].

A way to link Abstract Argumentation and logical formalisms has been advanced in the field of Structured Argumentation [21], where we assume a formal logical language for representing knowledge (i.e., a Logical Knowledge Base), and for specifying how arguments and conflicts (i.e., attacks) can be derived from that knowledge. In the structured approach, premises and claims of the argument are made explicit, and the relationship between them is formally defined through rules internal to the formalism. We can build the notion of attack as a binary relation over structured arguments that denotes when one argument is in conflict with another (e.g., contradictory claims or premises). One of the main frameworks for Structured Argumentation is ASPIC+ [22]. In this formalism, arguments are built with two kinds of inference rules: strict rules, whose premises guarantee their conclusion, and defeasible rules, whose premises only create a presumption in favour of their conclusion. Conflicts between arguments can then arise from both inconsistencies in the Logical Knowledge Base and the defeasibility of the reasoning steps in an argument (i.e., a defeasible rule used in reaching a certain conclusion from a set of premises can also be attacked).

In our view, once the right logical language for encoding the DS and AutoML knowledge has been defined, a Structured Argumentation model (e.g., an ASPIC+ instance [23]) would provide us with the formal machinery to build an Argumentation framework upon the data, while Abstract Argumentation would provide the evaluation tools.

Figure 1: Integration of the HAMLET framework with the CRISP-DM standard process model and AutoML.

4. Towards a human-centric approach

Addressing ML problems entails the DS seeking a solution, considering all the constraints of the case. She usually leverages a process model such as CRISP-DM. The DS starts by collecting raw data in an arbitrary format. Then, in the first stage, Domain Understanding is conducted. The DS works in close cooperation with Domain Experts, and lists domain-related constraints (i.e., intrinsic to the problem). Data Understanding follows, devoted to data analysis and aimed at extracting data-related constraints (i.e., defined by the data format). Domain and Data Understanding might be repeated many times, until the DS is satisfied with the acquired knowledge. Once she feels confident, she begins to investigate different solutions throughout the next stages: Data Pre-processing, Modelling, and Evaluation. Data Pre-processing and Modelling are conducted to effectively build the solution, while Evaluation offers a way to measure its performance. Finally, the process concludes with the Deployment stage (i.e., the actual implementation of the solution).

We recall that building a solution consists of instantiating a ML pipeline: a series of transformations, defined in the Data Pre-processing stage, and a ML algorithm, defined in the Modelling stage. Seeking the most correct and best performing solution, the DS should consider the already known constraints (domain- and data-related) and the new ones she discovers in the Data Pre-processing and Modelling stages, namely transformation- and algorithm-related constraints, respectively (i.e., due to the intrinsic semantics of the transformations and algorithms at hand).

Throughout the different stages, the DS acquires knowledge from different points of view (i.e., domain-, data-, transformation-, and algorithm-related). Besides, as illustrated in Figure 1, CRISP-DM might be iterated many times. The several iterations of the process aim at augmenting such knowledge about the problem. Finally, the process is driven by interactions between the DS and Domain Experts, who discuss and argue about both constraints and solutions.

4.1. AutoML and CRISP-DM

As described in Section 2, AutoML helps in finding a suitable ML pipeline instantiation (i.e., automatisation of
Data Pre-processing, Modelling, and Evaluation stages). However, such an automatisation unavoidably leads to a poorer overall understanding (i.e., the knowledge about the problem cannot be properly augmented throughout the process).

The definition of the search space has a huge impact on the correctness and performance of the solutions. The DS collects constraints to guarantee the correctness of the solution, anticipating the effect of each of them, and finally defining the search space.

EXAMPLE 1. Let us consider two transformations, namely Discretisation (𝒟) and Normalisation (𝒩), and a ML algorithm such as Decision Tree (𝒟𝒯). Based on the implementation, a possible algorithm-related constraint may be “require 𝒟 when applying 𝒟𝒯”. Accordingly, we consider a transformation-related constraint “no 𝒩 in pipelines with 𝒟”. This leads to discarding ML pipelines that contain 𝒟, 𝒩, and 𝒟𝒯:

          ... → 𝒩 → ... → 𝒟 → ... → 𝒟𝒯
          ... → 𝒟 → ... → 𝒩 → ... → 𝒟𝒯

In real-case problems, considering all the possible effects is overwhelming, and inconsistencies might occur. The problem is exacerbated when it comes to cross-cutting issues, such as those related to ethical and legal fields. For instance, topics like racism and gender equality have to be treated separately, otherwise they could lead to social repercussions. As is well known, the authors of the boston-house dataset [24] engineered a feature assuming that racial self-segregation had a positive impact on house prices. A way of addressing such an issue is to encode some kind of ethical constraint (e.g., dropping that particular feature from the data). Furthermore, the ML result is expected to be compliant with the laws of the involved countries. To the best of our knowledge, there is no attempt to properly treat such ML constraints, and hence to ease the search space definition. Most of the tools are not customisable (i.e., weakly constrained search spaces; e.g., Auto-Weka [12], Auto-Sklearn [16]), and others are far too permissive (i.e., no assistance at all; e.g., Hyper-Opt [25]). AutoML is not clear enough to provide the DS with feedback that would help to augment her knowledge about the problem. We claim that a human-centric framework should provide the mechanisms to: i) help the DS to structure her knowledge about the problem in an effective search space; ii) augment the knowledge initially possessed by the DS with the one produced by the AutoML optimisation process.

4.2. The role of logic

The two identified requisites share a common need: encoding both the DS knowledge about the problem and the outcome of the AutoML tool in a uniform format. As a result, it would be possible to use the DS knowledge as an input for the optimisation process (i.e., for the search space definition). Then, this initial knowledge can be augmented with the possible solutions provided by an AutoML tool. These possible solutions can be exploited to derive new constraints (i.e., the awareness about the problem increases). We see the augmented knowledge as an awareness determined by an increased expertise on the correct constraints. Finding such correct constraints leads to finding the correct solution, if it exists. In other words, at each CRISP-DM iteration, the knowledge is encoded into the AutoML tool, which provides feedback (i.e., augmented knowledge) in the same format.

Logic could be the key element in defining a common structure (i.e., a uniform human- and machine-readable medium) on which the knowledge of both the DS and the AutoML tool can be combined fruitfully. In a way, our approach follows in the steps of the well-known logic-based expert systems, of which it is possible to find a great number of successful examples [26]. In the literature, it is also possible to find two well-known issues [27]: lack of scalability, and difficulties in the definition of a sound knowledge base that encodes all the required pieces of information. Yet, we believe they do not affect our model. As to the former, the amount of knowledge acquired through CRISP-DM iterations (i.e., the problem constraints) is not enough to label such a problem as a big data problem, and hence scalability should not be an issue. As to the latter, we believe that the analysis process would only benefit from the clarity given by such a structured investigation.

Logic would also provide the tools to cope with one of the distinctive features of the knowledge we want to deal with: its possible inconsistency. Indeed, the ML process is the product of possible attempts, validated or refuted by a consequent evaluation. Hence, the mechanism used to encode the knowledge is required to manage this constant revision process. This is the role of Argumentation, one of the main approaches for dealing with inconsistent knowledge and defeasible reasoning.

4.3. HAMLET

In the last paragraphs we identified two main requirements for a human-centric framework (i.e., structure the DS knowledge in a well-defined AutoML search space, and provide the solutions in accordance with the input knowledge). We also introduced Computational Logic, and Argumentation in particular, as the main tool in our investigation. Let us now delve into the details of how these pieces converge in our framework.

Figure 1 illustrates a scheme of HAMLET. The DS conducts the stages from Domain & Data Understanding to
Listing 1: Example of a LogicalKB using a logical formalism.
t1 :=> transformation(discretisation).
t2 :=> transformation(normalisation).
a1 :=> algorithm(decision_tree).

c1 :=> mandatory_transformation_for_algorithm([discretisation], decision_tree).
c2 :=> invalid_transformation_set([normalisation, discretisation]).
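
The following Python sketch is an illustration only, not the argumentation engine used by the authors. It mirrors the constraints c1 and c2 of Listing 1 in plain Python data structures, enumerates the ML pipelines that can be built from the two transformations and the algorithm, turns each violated constraint into an attack on the offending pipeline, and evaluates the resulting graph under grounded semantics. All names (MANDATORY, INVALID_SETS, enumerate_pipelines, build_attacks, grounded_extension) are hypothetical, and the attacks come from hand-written checks rather than from an ASPIC+ derivation.

```python
from itertools import permutations

# Hypothetical Python mirror of Listing 1 (all names are illustrative).
TRANSFORMATIONS = ["discretisation", "normalisation"]   # t1, t2
ALGORITHM = "decision_tree"                              # a1
MANDATORY = {"decision_tree": ["discretisation"]}        # c1
INVALID_SETS = [{"normalisation", "discretisation"}]     # c2


def enumerate_pipelines(transformations, algorithm):
    """All ordered selections of transformations, each followed by the algorithm."""
    pipelines = []
    for size in range(len(transformations) + 1):
        for steps in permutations(transformations, size):
            pipelines.append(steps + (algorithm,))
    return pipelines


def build_attacks(pipelines):
    """Return (arguments, attacks); an attack is a pair (attacker, target)."""
    arguments = {"c1", "c2"}
    attacks = set()
    for index, pipeline in enumerate(pipelines, start=1):
        name = f"p{index}"
        arguments.add(name)
        steps, algorithm = set(pipeline[:-1]), pipeline[-1]
        # c1: the algorithm requires some transformations to be present.
        if not set(MANDATORY.get(algorithm, [])) <= steps:
            attacks.add(("c1", name))
        # c2: some transformations must not appear in the same pipeline.
        if any(invalid <= steps for invalid in INVALID_SETS):
            attacks.add(("c2", name))
    return arguments, attacks


def grounded_extension(arguments, attacks):
    """Least fixed point of the characteristic function, iterated from the empty set."""
    attackers = {a: {src for src, dst in attacks if dst == a} for a in arguments}
    accepted = set()
    while True:
        # An argument is defended when every attacker is itself attacked
        # by an already accepted argument (trivially true if unattacked).
        defended = {
            a for a in arguments
            if all(attackers[b] & accepted for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended


pipelines = enumerate_pipelines(TRANSFORMATIONS, ALGORITHM)
arguments, attacks = build_attacks(pipelines)
accepted = grounded_extension(arguments, attacks)
valid = [p for i, p in enumerate(pipelines, start=1) if f"p{i}" in accepted]
print(valid)
```

Under these assumptions, the enumeration yields five pipelines, c1 attacks the two that use the Decision Tree without Discretisation, c2 attacks the two that combine Normalisation and Discretisation, and only the pipeline discretisation → decision_tree remains in the grounded extension. A real ASPIC+ instance would derive the same attacks from the rules of the LogicalKB instead of from Python conditionals.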



Data Pre-processing & Modelling, and thus gathers all the constraints that represent the knowledge discovered so far. The Logical Knowledge Base (LogicalKB) provides a vehicle to encode such constraints. In particular, the DS leverages an intuitive logical language, and lists the constraints one by one. In Section 3 we introduced the notion of Structured Argumentation as a formal tool to convert elements from a logical language into an Argumentation graph. Implementing and exploiting such a Structured Argumentation tool, HAMLET proceeds to resolve conflicts in the LogicalKB: the logically encoded knowledge is transformed into a Problem Graph.

The benefit of the Problem Graph is two-fold. First of all, it can be leveraged by both the DS and Domain Experts to understand and summarise the current knowledge. Second of all, thanks to its nature, it is straightforward to convert such a graph of constraints into a space of possible solutions (i.e., by exploiting Argumentation semantics, it is easy to obtain all the sets of arguments, that is constraints, which hold together). As a matter of fact, this feature would relieve the DS of the burden of manually considering all the effects of the possible constraints. It is important to notice that, despite the increased degree of automatisation, the Problem Graph allows the DS and Domain Experts to correct, revise, and supervise the process. Accordingly, possible inconsistencies due to diverging constraints can be verified by the DS using her knowledge.

Once the knowledge has been accurately revised, an AutoML tool is leveraged to automatise the ML pipeline instantiation. Throughout the exploration, different solutions are tested, which contribute to augmenting the global knowledge about the problem. Accordingly, some of the knowledge originally encoded by the DS and Domain Experts might be refuted or found inconsistent. HAMLET is designed to enable a transparent augmentation of the knowledge in the Problem Graph according to the newfound solutions. The updating procedure is the same as the one employed by the DS during the constraint encoding phase. Specifically, the AutoML solutions are automatically transposed to our logical language in the form of new constraints, and then added to the LogicalKB. Of course, a change in the LogicalKB translates into a change in the Problem Graph, allowing the DS and Domain Experts to visualise and argue about it. The revision of the Graph is the key element in the process of augmenting the knowledge: the DS and Domain Experts can consult each other and discuss how the new insights relate to their initial knowledge. Indeed, thanks to the nature of the Problem Graph, it would be extremely easy to identify new possible conflicts and supporting arguments. Consequently, new constraints can be derived.

Figure 2: Example of a Problem Graph. Green nodes are valid arguments, red ones are refuted.

EXAMPLE 2. In Example 1 we introduced two possible ML constraints. We now provide their encoding in the LogicalKB, and the resulting Problem Graph. For the sake of clarity, we focus only on Discretisation (𝒟) and Normalisation (𝒩) as transformations, and Decision Tree (𝒟𝒯) as the ML algorithm. Listing 1 contains the LogicalKB expressed in a logic language: t1 and t2 represent 𝒟 and 𝒩 respectively, and a1 represents 𝒟𝒯. We consider the algorithm-related constraint c1, namely “require 𝒟 when applying 𝒟𝒯”, and the transformation-related constraint c2, that is “no 𝒩 in pipelines with 𝒟”. This LogicalKB is used to generate the Problem Graph shown in Figure 2, where nodes represent arguments and edges represent attacks among them. There are five possible ML pipelines: 𝒟𝒯 (p1), 𝒟 → 𝒟𝒯 (p2), 𝒩 → 𝒟𝒯 (p3), 𝒟 → 𝒩 → 𝒟𝒯
(p4), 𝒩 → 𝒟 → 𝒟𝒯 (p5). With no constraints, we cannot discard any ML pipeline (i.e., there are no incompatibilities between the arguments). By introducing c1, attacks against p1 and p3 are generated (both pipelines contain 𝒟𝒯 but not 𝒟). By introducing c2, attacks against p4 and p5 are generated (both pipelines contain 𝒟 and 𝒩). We can leverage a standard argumentation semantics (e.g., Dung’s grounded semantics [19]) to evaluate the graph. In our case, all the arguments with no attacks are admissible. Among them, we retrieve the ones representing pipelines. p2 is the only valid pipeline, and it will be used to generate the AutoML search space.

   Example 2 illustrates how HAMLET leverages Logic and Argumentation to handle the DS knowledge. The proposed logic formalism makes it easy to encode the different ML constraints into a LogicalKB. We highlight that the Problem Graph generation is handled by an argumentation engine, which is available in the Supplementary Material¹. The Problem Graph makes it possible to prune the ML pipelines considered for the AutoML search space. AutoML could update the Problem Graph by extracting constraints from the performed exploration and transposing them into the LogicalKB. For instance, the DS may not have considered that the data at hand contain missing values. AutoML could help in identifying transformation-related constraints such as: “require Imputation (ℐ) in all the pipelines”. The resulting constraints might be in conflict with the previous knowledge. In our vision, the DS is able to visualise such inconsistencies through the Problem Graph, and resolve them.

   We remark how our framework is compliant with the iterative nature of the CRISP-DM standard process model. This aspect is crucial when trying to solve real-case problems through the use of modern data platforms. Indeed, not only can the different CRISP-DM stages be executed several times, but the whole process can be iterated, bringing new information about the problem. We claim that our framework supports and eases the adoption of the described resolution process model, by providing a tool that is both human- and machine-readable. The knowledge can be automatically handled throughout iterations, supporting the DS in the whole analysis, in a continuous revision of the problem constraints. At each iteration, a portion of the knowledge is known and another is discovered. Its integration into a unified augmented knowledge graph makes it possible to: i) derive new constraints from the discovered knowledge, and ii) seamlessly visualise possible inconsistencies and conflicts. This naturally leads to a new iteration based on the new augmented knowledge. Besides, the entire process might be boosted with the aid of external knowledge. In our vision, the DS community could create a shared LogicalKB derived from the available literature and similar real-case problems.

¹ https://queueinc.github.io/HAMLET-DATAPLAT2022/

5. Conclusions and potential leveraging

The increasing complexity in the state-of-the-art AutoML tools has led the DS to lose control over the resolution process. We believe that human awareness about all the constraints and possible solutions of an ML problem is a fundamental aspect to consider, and consequently should play a key role in the design of next-generation data platforms. Accordingly, in this vision paper we present HAMLET, a human-centric AutoML framework based on Logic and Structured Argumentation. Logic is exploited to give a structure to the knowledge that the DS has to consider while deploying a solution. The advantage of such a choice is twofold. First, the logical encoding of the knowledge allows an easy exploration and verification of all the constraints that may apply to the case at hand, whose sheer number is overwhelming for the DS to handle correctly. Second, it provides a medium that is both human- and machine-readable. The DS and domain experts can revise the knowledge, as can the AutoML tool, thus creating a constant feedback cycle. We further remark that our framework could address a wide range of AutoML-related challenges. We already highlighted a few of them: the embodiment of both ethical and legal constraints, and the construction of a shared knowledge among the DS community.

   The road for future expansions is straightforward: we plan to extend this work by providing a sound formalisation of HAMLET, and then a working implementation. It will then be possible to effectively quantify the benefits of our framework and test its efficacy on real-case problems.

References

 [1] L. Zhou, S. Pan, J. Wang, A. V. Vasilakos, Machine learning on big data: Opportunities and challenges, Neurocomputing 237 (2017) 350–361. doi:10.1016/j.neucom.2017.01.026.
 [2] P. Agrawal, R. Arya, A. Bindal, S. Bhatia, A. Gagneja, J. Godlewski, Y. Low, T. Muss, M. M. Paliwal, S. Raman, V. Shah, B. Shen, L. Sugden, K. Zhao, M.-C. Wu, Data platform for machine learning, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1803–1816. URL: https://doi.org/10.1145/3299869.3314050. doi:10.1145/3299869.3314050.
 [3] M. Francia, E. Gallinucci, M. Golfarelli, A. G. Leoni, S. Rizzi, N. Santolini, Making data platforms smarter with MOSES, Future Gener. Comput.
     Syst. 125 (2021) 299–313. doi:10.1016/j.future.2021.06.031.
 [4] C. Forresi, M. Francia, E. Gallinucci, M. Golfarelli, Optimizing execution plans in a multistore, in: L. Bellatreche, M. Dumas, P. Karras, R. Matulevičius (Eds.), Advances in Databases and Information Systems, Springer International Publishing, Cham, 2021, pp. 136–151.
 [5] R. Wirth, J. Hipp, Crisp-dm: Towards a standard process model for data mining, in: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, volume 1, Springer-Verlag London, UK, 2000.
 [6] D. Xin, E. Y. Wu, D. J. L. Lee, N. Salehi, A. G. Parameswaran, Whither automl? understanding the role of automation in machine learning workflows, in: CHI ’21: CHI Conference on Human Factors in Computing Systems, ACM, 2021, pp. 83:1–83:16. doi:10.1145/3411764.3445306.
 [7] Y. Gil, J. Honaker, S. Gupta, Y. Ma, V. D’Orazio, D. Garijo, S. Gadewar, Q. Yang, N. Jahanshad, Towards human-guided machine learning, in: Proceedings of the 24th International Conference on Intelligent User Interfaces, 2019, pp. 614–624.
 [8] D. J.-L. Lee, S. Macke, A human-in-the-loop perspective on automl: Milestones and the road ahead, IEEE Data Engineering Bulletin (2020).
 [9] D. Wang, J. D. Weisz, M. Muller, P. Ram, W. Geyer, C. Dugan, Y. Tausczik, H. Samulowitz, A. Gray, Human-ai collaboration in data science: Exploring data scientists’ perceptions of automated ai, Proceedings of the ACM on Human-Computer Interaction 3 (2019) 1–24.
[10] J. P. Ono, S. Castelo, R. Lopez, E. Bertini, J. Freire, C. T. Silva, Pipelineprofiler: A visual analytics tool for the exploration of automl pipelines, IEEE Transactions on Visualization and Computer Graphics 27 (2021) 390–400.
[11] J. Drozdal, J. Weisz, D. Wang, G. Dass, B. Yao, C. Zhao, M. Muller, L. Ju, H. Su, Trust in automl: exploring information needs for establishing trust in automated machine learning systems, in: Proceedings of the 25th International Conference on Intelligent User Interfaces, 2020, pp. 297–307.
[12] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, K. Leyton-Brown, Auto-weka: Automatic model selection and hyperparameter optimization in weka, in: Automated Machine Learning, Springer, Cham, 2019, pp. 81–95.
[13] P. I. Frazier, A tutorial on bayesian optimization, CoRR abs/1807.02811 (2018). URL: http://arxiv.org/abs/1807.02811. arXiv:1807.02811.
[14] J. Giovanelli, B. Bilalli, A. Abelló, Effective data pre-processing for automl, in: K. Stefanidis, P. Marcel (Eds.), Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), volume 2840 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1–10.
[15] A. Quemy, Data pipeline selection and optimization, in: DOLAP, 2019.
[16] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, F. Hutter, Auto-sklearn: efficient and robust automated machine learning, in: Automated Machine Learning, Springer, Cham, 2019, pp. 113–134.
[17] J. Giovanelli, B. Bilalli, A. Abelló, Data pre-processing pipeline generation for autoetl, Information Systems (2021) 101957.
[18] L. C. Paulson, Computational logic: its origins and applications, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 474 (2018). doi:10.1098/rspa.2017.0872.
[19] P. M. Dung, On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games, Artificial Intelligence 77 (1995) 321–358. doi:10.1016/0004-3702(94)00041-X.
[20] P. Baroni, M. Caminada, M. Giacomin, An introduction to argumentation semantics, Knowledge Engineering Review 26 (2011) 365–410. doi:10.1017/S0269888911000166.
[21] P. Besnard, A. J. García, A. Hunter, S. Modgil, H. Prakken, G. R. Simari, F. Toni, Introduction to structured argumentation, Argument & Computation 5 (2014) 1–4. doi:10.1080/19462166.2013.869764.
[22] S. Modgil, H. Prakken, The ASPIC+ framework for structured argumentation: a tutorial, Argument & Computation 5 (2014) 31–62. doi:10.1080/19462166.2013.869766.
[23] R. Calegari, G. Pisano, A. Omicini, G. Sartor, Arg2P: An argumentation framework for explainable intelligent systems, Journal of Logic and Computation (2021). doi:10.1093/logcom/exab089.
[24] D. Harrison, D. Rubinfeld, Hedonic housing prices and the demand for clean air, Journal of Environmental Economics and Management 5 (1978) 81–102. doi:10.1016/0095-0696(78)90006-2.
[25] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, Hyperopt: a python library for model selection and hyperparameter optimization, Computational Science & Discovery 8 (2015) 014008.
[26] H. Tan, A brief history and technical review of the expert system research, IOP Conference Series: Materials Science and Engineering 242 (2017). doi:10.1088/1757-899X/242/1/012111.
[27] P. K. Coats, Why expert systems fail, Financial Management 17 (1988) 77–86. URL: http://www.jstor.org/stable/3666074.
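
As an illustrative companion to Example 2, the pruning of the five candidate pipelines can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the argumentation engine referenced in the Supplementary Material: the encodings of c1 and c2 are inferred from the attacks the example describes (c1 attacks pipelines that contain 𝒟𝒯 but not 𝒟; c2 attacks pipelines that contain both 𝒟 and 𝒩), the labels "D", "N" and "DT" are treated as opaque symbols, and all function names are hypothetical. The graph is evaluated with a plain fixpoint computation of Dung's grounded semantics on an abstract attack graph.

```python
from itertools import permutations

# Symbols follow Example 2; "D", "N" and "DT" are kept as opaque labels.
TRANSFORMATIONS = ("D", "N")
ALGORITHM = "DT"

# Assumed encoding (inferred from the example): each constraint is a
# predicate that returns True when a pipeline violates it.
CONSTRAINTS = {
    "c1": lambda p: "DT" in p and "D" not in p,  # DT without D
    "c2": lambda p: "D" in p and "N" in p,       # D together with N
}


def enumerate_pipelines():
    """Every ordering of every subset of transformations, ending with DT."""
    pipelines = {}
    for k in range(len(TRANSFORMATIONS) + 1):
        for steps in permutations(TRANSFORMATIONS, k):
            pipelines[f"p{len(pipelines) + 1}"] = steps + (ALGORITHM,)
    return pipelines


def build_problem_graph(pipelines, active):
    """Arguments are constraints and pipelines; a constraint attacks
    every pipeline that violates it."""
    arguments = list(active) + list(pipelines)
    attacks = {
        (c, name)
        for c in active
        for name, p in pipelines.items()
        if CONSTRAINTS[c](p)
    }
    return arguments, attacks


def grounded_extension(arguments, attacks):
    """Least fixpoint: accept arguments whose attackers are all defeated,
    defeat arguments attacked by an accepted one, repeat until stable."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a not in accepted and a not in defeated and attackers[a] <= defeated:
                accepted.add(a)
                changed = True
        for a in arguments:
            if a not in defeated and attackers[a] & accepted:
                defeated.add(a)
                changed = True
    return accepted


def valid_pipelines(active):
    """Pipelines that survive in the grounded extension."""
    pipelines = enumerate_pipelines()
    arguments, attacks = build_problem_graph(pipelines, active)
    accepted = grounded_extension(arguments, attacks)
    return {name: p for name, p in pipelines.items() if name in accepted}


print(valid_pipelines(["c1", "c2"]))  # {'p2': ('D', 'DT')}
```

With no active constraints the sketch keeps all five pipelines; with only c1 it keeps p2, p4 and p5; with both constraints only p2 remains, which matches the outcome described in Example 2.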