<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Syst.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.future</article-id>
      <title-group>
        <article-title>Towards Human-centric AutoML via Logic and Argumentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joseph Giovanelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pisano</string-name>
        </contrib>
        <aff>ALMA MATER STUDIORUM - Università di Bologna</aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>125</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In the last decade, we have witnessed an exponential growth in both the complexity and the number of Machine Learning (ML) techniques. As a consequence, leveraging such methods to solve real-case problems has become difficult for a Data Scientist (DS). Automated Machine Learning (AutoML) tools were devised to alleviate that task, but they easily became as complex as the ML techniques themselves. The DS has started to rely on these tools without understanding their functioning, thus losing control over the process. In this vision paper, we propose HAMLET (Human-centric AutoML via Logic and Argumentation), a framework that would help the DS to redeem her centrality. HAMLET is inspired by the well-known standard process model CRISP-DM. Iteration after iteration, the knowledge is augmented by acquiring more constraints about the problem until a suitable solution is found. HAMLET leverages Logic and Argumentation to merge both constraints and solutions in a uniform human- and machine-readable medium. Not only does it allow an easy exploration of the new knowledge at each iteration, it also enforces a continuous revision via the AutoML tool and the discussion between the DS and Domain Experts.</p>
      </abstract>
      <kwd-group>
        <kwd>AutoML</kwd>
        <kwd>Logic</kwd>
        <kwd>Argumentation</kwd>
        <kwd>CRISP-DM</kwd>
        <kwd>Data Scientist</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In relation to data platforms, it is well known that Machine Learning (ML) plays a key role in the process of data analysis. As a matter of fact, it has been pervasively employed to cope with every type of real-case problem [<xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 4</xref>]. The Data Scientist (DS) (i.e., a specialist of data analysis) starts by collecting raw data in an arbitrary format. Then she typically leverages a process model that helps her translate the knowledge about the problem into ML constraints, and deploy the solution. CRISP-DM [5] is the most acknowledged standard process model and we take it as a reference throughout the paper. A solution consists of a ML pipeline: a series of Data Pre-processing transformations and a ML algorithm. The DS can instantiate both with a large set of techniques, which have their own tunable hyper-parameters. These choices highly affect the performance of a solution.</p>
      <p>Automated Machine Learning (AutoML) tools have been devised with the aim of assisting the DS during the ML pipeline instantiation. They leverage state-of-the-art optimisation approaches to smartly explore huge search spaces of solutions. AutoML has been demonstrated to provide accurate performance, even within a limited time budget. When setting up the search space, it is highly important for the DS to leverage the knowledge about the problem, considering all the ML constraints. Otherwise, the AutoML tool might retrieve invalid solutions (i.e., solutions whose results cannot be deemed correct). Besides, AutoML tools have become so complex that it is difficult for the DS to understand their functioning, hence losing control over the process. Researchers are aware of these problems [6]. Some works have prescribed the use of a human-centric framework for AutoML [7, 8, 9], yet they suggest only design requirements. Alternatively, the authors in [10] have proposed a tool that visualises the best and the worst solutions retrieved by an AutoML tool.</p>
      <p>We claim that the need for a human-centric framework for AutoML is real, and that it is crucial for the DS to augment her knowledge via the retrieved solutions. To this purpose we propose HAMLET (Human-centric AutoML via Logic and Argumentation), which leverages Logic and Argumentation to:
• structure the ML constraints and the AutoML solutions in a Logical Knowledge Base (LogicalKB);
• parse the structured LogicalKB into a human- and machine-readable medium called Problem Graph;
• leverage the Problem Graph to set up an AutoML search space;
• leverage the Problem Graph to allow both the DS and an AutoML tool to revise the current knowledge.</p>
      <p>Figure 1 illustrates how CRISP-DM, AutoML, and HAMLET interact with each other. We remark that our framework allows the DS to never lose control over the process, and hence her centrality. Besides, HAMLET allows the knowledge to be visualised in a human- and machine-readable format. As advocated in [11], the DS needs to understand the AutoML process in order to trust the proposed solutions.</p>
      <p>The remainder of the paper is structured as follows. Section 2 and Section 3 introduce the main notions of AutoML and Argumentation, respectively. Section 4 illustrates our framework. Finally, Section 5 draws the conclusions and outlines potential developments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. AutoML</title>
      <p>AutoML tools have been conceived with the aim of relieving the DS of the overwhelming practice of finding a suitable solution for the case at hand. We recall that, in the context of data platforms, a solution is a ML pipeline, defined as a series of Data Pre-processing transformations followed by a ML algorithm. In its early days, AutoML addressed only the instantiation of the latter – the ML algorithm. Auto-Weka [12] formalised the problem as Combined Algorithm Selection and Hyper-parameter Optimisation (CASH). In a nutshell, in order to find the most performing configuration, various ML algorithms – and related hyper-parameters – have to be tested over a dataset. Such a problem was successfully coped with by leveraging Bayesian Optimisation (BO) [13], a sequential design strategy for global optimisation. The process involves several iterations, through which different configurations are explored. As the iterations advance, an increasingly accurate model is built on top of the previously explored configurations, with the aim of suggesting the most promising ones. The configurations keep being explored, and the model updated, until a budget in terms of either iterations or time is reached.</p>
      <p>Recently, AutoML is no longer limited to optimising just the ML algorithm phase, but includes Data Pre-processing as well. Indeed, with the aid of a series of transformations, it is possible to achieve performance unattainable with the most performing ML algorithm configuration alone [14]. In [15], the author formalised the problem as Data Pipeline Selection and Optimisation (DPSO). Each of the transformations can be instantiated with different techniques, which – analogously to the ML algorithms – have their own hyper-parameters. Auto-sklearn [16] includes Data Pre-processing already in its first versions. Yet, it fixes the arrangement of the transformations a priori, without considering that the most performing arrangement changes according to the case and data at hand. Considering several arrangements translates into larger search spaces, which are not easy to explore.</p>
      <p>In order to cope with ever larger search spaces, various expedients have been employed. Meta-learning (i.e., learning on top of learning) has been used to warm-start the Bayesian Optimisation (i.e., to boost the convergence process) by suggesting promising configurations (i.e., configurations that worked well in previous, similar real-case problems) [17]. Ensembling (i.e., the construction of a high-performing solution by combining several low-performing solutions; e.g., bagging, boosting, stacking) has been leveraged to enable AutoML tools to retrieve a solution that combines the best performing configurations, instead of retrieving just the best performing one [16]. Moreover, multi-fidelity methods (i.e., the use of several partial estimations to speed up the time-consuming evaluation process) have been exploited to let AutoML tools explore as many configurations as possible.</p>
      <p>All in all, the improvements made over the last years have been so substantial that AutoML is nowadays able to handle the entire ML pipeline instantiation. Yet, the stacking of complex mechanisms on top of each other unavoidably led to a lesser understanding of the process by the DS. We believe that the DS has the duty to revise and supervise the suggested solutions. Unfortunately, state-of-the-art AutoML tools overlook her role, and do not make that possible.</p>
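      <p>To make the CASH and DPSO formulations concrete, the following is a minimal, illustrative sketch of a tiny pipeline search space. It is not part of HAMLET: the scikit-learn components, the toy dataset, and the budget are our own assumptions, and plain random sampling stands in for Bayesian Optimisation.</p>
      <preformat>
# Illustrative only: a DPSO-style search space (pre-processing + algorithm),
# explored by random sampling rather than Bayesian Optimisation.
import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search space: an optional transformation followed by a ML algorithm.
transformations = {
    "none": None,
    "discretisation": KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    "normalisation": MinMaxScaler(),
}
max_depths = [2, 4, 8, None]  # hyper-parameter of the algorithm

random.seed(0)
best_score, best_config = -1.0, None
for _ in range(10):  # budget: 10 configurations
    t_name = random.choice(list(transformations))
    depth = random.choice(max_depths)
    steps = []
    if transformations[t_name] is not None:
        steps.append(("transform", transformations[t_name]))
    steps.append(("algorithm", DecisionTreeClassifier(max_depth=depth)))
    score = cross_val_score(Pipeline(steps), X, y, cv=5).mean()
    if score &gt; best_score:
        best_score, best_config = score, (t_name, depth)

print("best configuration:", best_config, "accuracy:", round(best_score, 3))
</preformat>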
    </sec>
    <sec id="sec-3">
      <title>3. Logic &amp; Argumentation</title>
      <p>Logic can be defined as the abstract study of statements, sentences, and deductive arguments [18]. From its birth, it has been developed and improved widely, and it now includes a variety of formalisms and technologies. Among them, Argumentation has proved itself an important tool for handling conflicting information (e.g., opinions, empirical data). This has led to a great number of research efforts trying to establish a computational model of logical arguments.</p>
      <p>In Abstract Argumentation [19], a scenario can be represented by a directed graph. Each node represents an argument, and each edge denotes an attack by one argument on another. Each argument is regarded as atomic: there is no internal structure to an argument, nor any specification of what an argument or an attack is. A graph can then be analysed to determine which arguments are acceptable according to some general criteria (i.e., semantics) [20].</p>
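      <p>As an illustration of how such a graph can be evaluated, the sketch below (our own, not taken from the paper) computes a grounded-style labelling of a small attack graph: unattacked arguments are accepted, and arguments attacked by accepted ones are rejected, until a fixpoint is reached. The argument names are purely hypothetical.</p>
      <preformat>
# Illustrative only: grounded evaluation of an abstract argumentation graph.
# Arguments are atomic labels; attacks is a set of (attacker, attacked) pairs.
def grounded_labelling(arguments, attacks):
    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            live_attackers = {x for (x, y) in attacks if y == a and x not in rejected}
            if not live_attackers:       # every attacker is already rejected
                accepted.add(a)
                changed = True
        for a in arguments:
            if a in rejected:
                continue
            if any(x in accepted for (x, y) in attacks if y == a):
                rejected.add(a)          # attacked by an accepted argument
                changed = True
    return accepted, rejected

args = {"a", "b", "c"}
atts = {("a", "b"), ("b", "c")}          # a attacks b, b attacks c
print(grounded_labelling(args, atts))    # a and c accepted, b rejected
</preformat>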
      <p>A way to link Abstract Argumentation and logical formalisms has been advanced in the field of Structured Argumentation [21], where we assume a formal logical language for representing knowledge (i.e., a Logical Knowledge Base), and we specify how arguments and conflicts (i.e., attacks) can be derived from that knowledge. In the structured approach, the premises and claims of an argument are made explicit, and the relationship between them is formally defined through rules internal to the formalism. We can then build the notion of attack as a binary relation over structured arguments that denotes when one argument is in conflict with another (e.g., contradictory claims or premises). One of the main frameworks for Structured Argumentation is ASPIC+ [22]. In this formalism arguments are built with two kinds of inference rules: strict rules, whose premises guarantee their conclusion, and defeasible rules, whose premises only create a presumption in favour of their conclusion. Conflicts between arguments can then arise both from inconsistencies in the Logical Knowledge Base and from the defeasibility of the reasoning steps in an argument (i.e., a defeasible rule used in reaching a certain conclusion from a set of premises can also be attacked).</p>
      <p>In our view, once the right logical language for encoding the DS and AutoML knowledge is defined, a Structured Argumentation model (e.g., an ASPIC+ instance [23]) would provide us with the formal machinery to build an Argumentation framework upon the data, while Abstract Argumentation would dispense the evaluation tools.</p>
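      <p>The following toy sketch hints at how structured arguments with explicit premises and claims can give rise to attacks: two defeasible rules reach contradictory claims, so the resulting arguments attack each other. It is our own simplification, not an ASPIC+ implementation, and the rule and fact names are hypothetical.</p>
      <preformat>
# Illustrative only: structured arguments as rules with explicit premises and claims.
# Defeasible rules create a presumption; contradictory claims yield mutual attacks.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    premises: frozenset
    claim: str
    defeasible: bool

facts = {"discretisation_in_pipeline", "normalisation_in_pipeline"}
rules = [
    Rule(frozenset({"discretisation_in_pipeline"}), "pipeline_valid", True),
    Rule(frozenset({"normalisation_in_pipeline"}), "not pipeline_valid", True),
]

# Build arguments whose premises are satisfied by the facts.
arguments = [r for r in rules if r.premises.issubset(facts)]

def conflicting(c1, c2):
    return c1 == "not " + c2 or c2 == "not " + c1

# An argument attacks another defeasible argument with a conflicting claim.
attacks = [(a, b) for a in arguments for b in arguments
           if a is not b and b.defeasible and conflicting(a.claim, b.claim)]
print(len(arguments), "arguments,", len(attacks), "attacks")  # 2 arguments, 2 attacks
</preformat>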
    </sec>
    <sec id="sec-4">
      <title>4. Towards a human-centric approach</title>
      <p>Addressing ML problems encompasses the DS seeking a solution, considering all the constraints of the case. She usually leverages a process model such as CRISP-DM. The DS starts by collecting raw data in an arbitrary format. Then, in the first stage, Domain Understanding is conducted: the DS works in close cooperation with Domain Experts, and enlists domain-related constraints (i.e., intrinsic to the problem). Data Understanding follows, devoted to data analysis and aimed at extracting data-related constraints (i.e., defined by the data format). Domain and Data Understanding might be repeated many times, until the DS is satisfied with the acquired knowledge. Once she feels confident, she begins to investigate different solutions throughout the next stages: Data Pre-processing, Modelling, and Evaluation. Data Pre-processing and Modelling are conducted to effectively build the solution, while Evaluation offers a way to measure its performance. Finally, the process concludes with the Deployment stage (i.e., the actual implementation of the solution).</p>
      <p>We recall that building a solution consists of instantiating a ML pipeline: a series of transformations – defined in the Data Pre-processing stage – and a ML algorithm – defined in the Modelling stage. Seeking the most correct and performing solution, the DS should consider the already known constraints – domain- and data-related – and the new ones she discovers in Data Pre-processing and Modelling, respectively: transformation- and algorithm-related constraints (i.e., due to the intrinsic semantics of the transformations and algorithms at hand).</p>
      <p>Throughout the different stages, the DS acquires knowledge from different points of view (i.e., domain-, data-, transformation-, and algorithm-related). Besides, as illustrated in Figure 1, CRISP-DM might be iterated many times. The several iterations of the process aim at augmenting such knowledge about the problem. Finally, the process is ruled by interactions between the DS and Domain Experts, who discuss and argue on both constraints and solutions.</p>
      <sec id="sec-4-1">
        <title>4.1. AutoML and CRISP-DM</title>
      <sec id="sec-2-2">
        <title>As described in Section 2, AutoML helps in finding a</title>
        <p>suitable ML pipeline instantiation (i.e., automatisation of
Data Pre-processing, Modelling, and Evaluation stages). the outcome of the AutoML tool in a uniform format.
However, such an automatisation unavoidably leads to a As a result, it would be possible to use the DS
knowlless overall understanding (i.e., the knowledge about the edge as an input for the optimisation process—search
problem cannot be properly augmented throughout the space definition. Then, this initial knowledge can be
process). augmented with the possible solutions provided by an</p>
        <p>The definition of the search space has a huge impact AutoML tool. These possible solutions can be exploited
on the correctness and performance of the solutions. The to derive new constraints (i.e., the awareness about the
DS collects constraints to guarantee the correctness of problem increases). We see the augmented knowledge
the solution, anticipating the efect of each of them, and as an awareness determined by an increased expertise
ifnally defining the search space. on the correct constraints. The finding of such correct
EXAMPLE 1. Let us consider two transformations, ceoxnissttsr.aIinntostlheeardswtoortdhse, afintdeinagchofCtRhIeScPo-rDreMctisteorluattiioonn—,tihf e
namely Discretisation () and Normalisation ( ), knowledge is encoded into the AutoML tool, which
protahned iamMplLemalegnotraitthiomn,aas Dpoescsiisbi olen aTlgreoeri(thm-)r.elaBtaesdedcoonn- vfoidrmesaat.feedback (i.e., augmented knowledge) in the same
sctorradiinntglmy,aywebeco“nresiqdueirreatrawnhsefonr mapaptiloyinn-grelated”. cAonc-- strLuoctguicrec(oi.uel.d,abuentihfoerkmeeydehleummeannt-iannddemfiniancghianec-ormeamdaobnle
straint “no  in pipelines with ”. This leads to medium) on which the knowledge of both the DS and
discard ML pipelines that contain ,  , and  : the AutoML tool can be combined fruitfully. In a way,
our approach follows the steps of the well known logical
· · · →  → · · · →  → · · · →  based expert systems, of which it is possible to find a
· · · →  → · · · →  → · · · →  great number of successful examples [26]. In literature, it
is also possible to find two well-known issues [ 27]: lack</p>
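        <p>A direct, non-argumentative way to read Example 1 is as a filter over candidate pipelines. The sketch below is purely illustrative, with the two constraints hard-coded and the identifiers chosen by us; Section 4.3 shows how HAMLET derives the same pruning through the Problem Graph instead.</p>
        <preformat>
# Illustrative only: the constraints of Example 1 applied as a plain filter.
# Each pipeline is a tuple of steps ending with the ML algorithm.
from itertools import permutations

D, N, DT = "discretisation", "normalisation", "decision_tree"

# Candidates: the algorithm alone, plus every ordering of the transformations.
candidates = [(DT,)]
for k in (1, 2):
    for ts in permutations((D, N), k):
        candidates.append(ts + (DT,))

def satisfies_constraints(pipeline):
    has_d, has_n, has_dt = D in pipeline, N in pipeline, DT in pipeline
    c1 = (not has_dt) or has_d   # "require Discretisation when applying Decision Tree"
    c2 = not (has_d and has_n)   # "no Normalisation in pipelines with Discretisation"
    return c1 and c2

valid = [p for p in candidates if satisfies_constraints(p)]
print(valid)   # only ('discretisation', 'decision_tree') survives
</preformat>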
        <p>In real-case problems, considering all the possible effects is overwhelming, and inconsistencies might occur. The problem is exacerbated when it comes to cross-cutting issues, such as those related to the ethical and legal fields. For instance, topics like racism and gender equality have to be treated separately, otherwise they could lead to social repercussions. As is well known, the authors of the boston-house dataset [24] engineered a feature assuming that racial self-segregation had a positive impact on house prices. A way of addressing such an issue is to encode some kind of ethical constraint (e.g., dropping that particular feature from the data). Furthermore, the ML result is expected to be compliant with the laws of the involved countries. To the best of our knowledge there is no attempt to properly treat such ML constraints, and hence ease the search space definition. Most of the tools are not customisable (i.e., weakly-constrained search spaces; e.g., Auto-Weka [12], Auto-Sklearn [16]), and others are far too permissive (i.e., no assistance at all; e.g., HyperOpt [25]). AutoML is not clear enough to provide the DS with a feedback that would help to augment her knowledge about the problem. We claim that a human-centric framework should provide the mechanisms to: i) help the DS to structure her knowledge about the problem in an effective search space; ii) augment the knowledge initially possessed by the DS with the one produced by the AutoML optimisation process.</p>
      </sec>
      <sec id="sec-2-3">
        <title>In the last paragraphs we identified two main require</title>
        <p>ments for a human-centric framework (i.e., structure the
DS knowledge in a well-defined AutoML search space,
and provide the solutions in accordance with the input
knowledge). We also introduced Computational Logic
– Argumentation in particular – as the main tool in our
investigation. Let us now delve into details of how these
pieces converge in our framework.</p>
        <p>Figure 1 illustrates a scheme of HAMLET. The DS
conducts the stages from Domain &amp; Data Understanding to</p>
        <sec id="sec-2-3-1">
          <title>4.2. The role of logic</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>The two identified requisites share a common need: encoding both the DS knowledge about the problem and</title>
        <p>Listing 1: Example of a LogicalKB using a logical formalism.
t 1 : = &gt; t r a n s f o r m a t i o n ( d i s c r e t i s a t i o n ) .
t 2 : = &gt; t r a n s f o r m a t i o n ( n o r m a l i s a t i o n ) .
a1 : = &gt; a l g o r i t h m ( d e c i s i o n _ t r e e ) .
c1 : = &gt; m a n d a t o r y _ t r a n s f o r m a t i o n _ f o r _ a l g o r i t h m ( [ d i s c r e t i s a t i o n ] , d e c i s i o n _ t r e e ) .
c2 : = &gt; i n v a l i d _ t r a n s f o r m a t i o n _ s e t ( [ n o r m a l i s a t i o n , d i s c r e t i s a t i o n ] ) .
and Domain Experts to correct, revise, and supervise the
process. Accordingly, possible inconsistencies – due to
diverging constraints – can be verified by the DS using
her knowledge.</p>
        <p>Once the knowledge has been accurately revised, an
AutoML tool is leveraged to automatise the ML pipeline
instantiation. Throughout the exploration, diferent
solutions are tested, which contribute to augment the global
knowledge about the problem. Accordingly, some of the
originally encoded knowledge by the DS and Domain
Experts might be refuted or found inconsistent.
HAMLET is designed to enable a transparent augmentation
of the knowledge in the Problem Graph according to the
newfound solutions. The updating procedure is the same
as the one employed by the DS during the constraint
encoding phase. Specifically, the AutoML solutions are
Figure 2: Example of a Problem Graph. Green nodes are valid automatically transposed to our logical language in the
arguments, red ones are refuted. form of new constraints, and then added to the
LogicalKB. Of course, a change in the LogicalKB translates
in a change in the Problem Graph, allowing the DS and
Data Pre-processing &amp; Modelling, and thus gathers all Domain Experts to visualise and argue about it. The
rethe constraints that represent the knowledge discovered vision of the Graph is the key element in the process of
so far. The Logical Knowledge Base (LogicalKB) provides augmenting the knowledge: the DS and Domain Experts
a vehicle to encode such constraints. In particular, the can consult each other and discuss how the new insights
DS leverages an intuitive logical language, and enlists relate with their initial knowledge. Indeed, thanks to the
the constraints one-by-one. In Section 3 we introduced nature of the Problem Graph, it would be extremely easy
the notion of Structured Argumentation as a formal tool to identify new possible conflicts and supporting
arguto convert elements from a logical language into an Ar- ments. Consequently, new constraints can be derived.
gumentation graph. Implementing and exploiting such EXAMPLE 2. In Example 1 we introduce two
possia Structured Argumentation tool, HAMLET proceeds to ble ML constraints. We now provide their encoding in
resolve conflicts in the LogicalKB: the logical-encoded the LogicalKB, and the resulting Problem Graph. For
knowledge is transformed in a Problem Graph. the sake of clarity, we focus only on Discretisation ()</p>
        <p>The benefit of the Problem Graph is two-fold. First and Normalisation ( ) as transformations, and
Deciof all, it can be leveraged by both the DS and Domain sion Tree ( ) as the ML algorithm. Listing 1
conExperts to understand and summarise the current knowl- tains the LogicalKB expressed in a logic language: t1
edge. Second of all, thanks to its nature, it is straightfor- and t2 represent  and  respectively, a1 represents
wofaprdostsoibcloensvoelruttsiounchs (ai.eg.r,aepxhploofitcionngstArarginutmsiennttoataiospnascee- c1, .naWmeelyco“nresqidueirrethe walhgeonritahpmplsy-irneglated ”c,onanstdratihnet
mantics, it is easy to obtain all the sets of arguments – trnasformation-related constraint c2, that is “no  in
constraints – which hold together). As a matter of fact, pipelines with ”. This LogicalKB is used to
generthis feature would relieve the DS of the burden of manu- ate the Problem Graph shown in Figure 2, nodes
repally considering all the efects of the possible constraints. resent arguments and edges represent attacks among
It is important to notice that, although the increased de- them. There are five possible ML pipelines:  (p1),
gree of automatisation, the Problem Graph allows the DS
 →  (p2),  →  (p3),  →  → 
(p4),  →  →  (p5). With no constraints, available literature and similar real-case problems.
we cannot discard any ML pipeline (i.e., there are no
incompatibilities between the arguments). By
introducing c1, attacks against p1 and p3 are generated 5. Conclusions and potential
(both pipelines contain  but not ). By introduc- leveraging
ing c2, attacks against p4 and p5 are generated (both
pipelines contain  and  ). We can leverage a stan- The increasing complexity in the state-of-the-art AutoML
dard argumentation semantics (e.g., Dung’s grounded tools has led the DS to lose the control over the resolution
semantics [19]) to evaluate the graph. In our case, all process. We believe that human awareness about all the
the arguments with no attacks are admissible. Among constraints and possible solutions of a ML problem is a
them, we retrieve the ones representing pipelines. p2 fundamental aspect to consider, and consequently should
is the only valid pipeline, and it will be used to gener- play a key role in the design of next-generation data
ate the AutoML search space. platforms. Accordingly, in this vision paper we present</p>
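        <p>The sketch below mirrors Example 2: it derives the attacks of c1 and c2 against the five pipeline arguments and keeps the unattacked ones, as a grounded-style evaluation would. It is our own illustration; in HAMLET the Problem Graph is produced by the argumentation engine referenced in the Supplementary Material.</p>
        <preformat>
# Illustrative only: deriving the Problem Graph attacks of Example 2.
pipelines = {
    "p1": ("decision_tree",),
    "p2": ("discretisation", "decision_tree"),
    "p3": ("normalisation", "decision_tree"),
    "p4": ("discretisation", "normalisation", "decision_tree"),
    "p5": ("normalisation", "discretisation", "decision_tree"),
}

attacks = set()
for name, steps in pipelines.items():
    # c1: mandatory_transformation_for_algorithm([discretisation], decision_tree)
    if "decision_tree" in steps and "discretisation" not in steps:
        attacks.add(("c1", name))
    # c2: invalid_transformation_set([normalisation, discretisation])
    if "normalisation" in steps and "discretisation" in steps:
        attacks.add(("c2", name))

attacked = {target for (_, target) in attacks}
valid = [p for p in pipelines if p not in attacked]
print(sorted(attacks))   # c1 attacks p1 and p3; c2 attacks p4 and p5
print(valid)             # ['p2'] is the only valid pipeline
</preformat>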
        <p>Example 2 illustrates how HAMLET leverages Logic and Argumentation to handle the DS knowledge. The proposed logic formalism allows the different ML constraints to be easily encoded into a LogicalKB. We highlight that the Problem Graph generation is handled by an argumentation engine, which is available in the Supplementary Material (https://queueinc.github.io/HAMLET-DATAPLAT2022/). The use of the Problem Graph allows pruning the ML pipelines considered for the AutoML search space. AutoML could update the Problem Graph by extracting constraints from the performed exploration, and transposing them into the LogicalKB. For instance, the DS may not have considered that the data at hand contain missing values. AutoML could help in identifying transformation-related constraints such as: "require Imputation (ℐ) in all the pipelines". The resulting constraints might be in conflict with the previous knowledge. In our vision, the DS is able to visualise such inconsistencies through the Problem Graph, and resolve them.</p>
        <p>We remark that our framework is compliant with the iterative nature of the CRISP-DM standard process model. This aspect is crucial when trying to solve real-case problems through the use of modern data platforms. Indeed, not only can the different CRISP-DM stages be executed several times, but the whole process can be iterated, bringing new information about the problem. We claim that our framework supports and eases the adoption of the described resolution process model, by providing a tool that is both human- and machine-readable. The knowledge can be automatically handled throughout iterations, supporting the DS in the whole analysis, in a continuous revision of the problem constraints. At each iteration, a portion of the knowledge is known and another is discovered. Its integration into a unified augmented knowledge graph allows to: i) derive new constraints from the discovered knowledge; ii) seamlessly visualise possible inconsistencies and conflicts. This naturally leads to a new iteration based on the new augmented knowledge.</p>
        <p>Besides, the entire process might be boosted with the aid of external knowledge. In our vision, the DS community could create a shared LogicalKB derived from the available literature and similar real-case problems.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and potential leveraging</title>
      <p>The increasing complexity of state-of-the-art AutoML tools has led the DS to lose control over the resolution process. We believe that human awareness of all the constraints and possible solutions of a ML problem is a fundamental aspect to consider, and consequently should play a key role in the design of next-generation data platforms. Accordingly, in this vision paper we present HAMLET, a human-centric AutoML framework based on Logic and Structured Argumentation. Logic is exploited to give a structure to the knowledge that the DS has to consider while deploying a solution. The advantage of such a choice is twofold. First of all, the logical encoding of the knowledge allows an easy exploration and verification of all the constraints that may apply to the case at hand—it is overwhelming for the DS to correctly handle the vast amount of them. Second of all, it provides a medium that is both human- and machine-readable. The DS and Domain Experts can revise the knowledge, as well as the AutoML tool, thus creating a constant feedback cycle. We further remark that our framework could be able to address a wide range of AutoML-related challenges. We already highlighted a few of them: the embodiment of both ethical and legal constraints, and the construction of a shared knowledge among the DS community.</p>
      <p>The road for future expansions is straightforward: we plan to extend this work by providing a sound formalisation of HAMLET, and then a working implementation. It will then be possible to effectively quantify the benefits of our framework and test its efficacy on real-case problems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Vasilakos</surname>
          </string-name>
          ,
          <article-title>Machine learning on big data: Opportunities and challenges</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>237</volume>
          (
          <year>2017</year>
          )
          <fpage>350</fpage>
          -
          <lpage>361</lpage>
          . doi:10.1016/j.neucom.2017.01.026.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gagneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Godlewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Muss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sugden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Data platform for machine learning</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>1803</fpage>
          -
          <lpage>1816</lpage>
          . URL: https://doi.org/10.1145/3299869.3314050. doi:10.1145/3299869.3314050.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Francia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Leoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Santolini</surname>
          </string-name>
          ,
          <article-title>Making data platforms smarter with MOSES</article-title>
          ,
          <source>Future Gener. Comput. Syst.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>