=Paper= {{Paper |id=Vol-2578/ETMLP5 |storemode=property |title=Putting the Human Back in the AutoML Loop |pdfUrl=https://ceur-ws.org/Vol-2578/ETMLP5.pdf |volume=Vol-2578 |authors=Iordanis Xanthopoulos,Ioannis Tsamardinos,Vassilis Christophides,Eric Simon,Alejandro Salinger |dblpUrl=https://dblp.org/rec/conf/edbt/XanthopoulosTCS20 }} ==Putting the Human Back in the AutoML Loop== https://ceur-ws.org/Vol-2578/ETMLP5.pdf
                        Putting the Human Back in the AutoML Loop

          Iordanis Xanthopoulos∗                                   Ioannis Tsamardinos†‡                         Vassilis Christophides§
             University of Crete                                         University of Crete                          University of Crete
                  Greece                                                     Gnosis DA                                      Greece
      jordan.xanthopoulos@gmail.com                                           Greece                                 christop@csd.uoc.gr
                                                                       tsamard.it@gmail.com

                                                 Eric Simon                                  Alejandro Salinger
                                                 SAP France                                          SAP SE
                                                    France                                         Germany
                                            eric.simon@sap.com                            alejandro.salinger@sap.com

ABSTRACT                                                                               Finally, AutoML could improve replicability of analyses and sharing
Automated Machine Learning (AutoML) is a rapidly rising sub-                           of results, and facilitate collaborative analyses.
field of Machine Learning. AutoML aims to fully automate the                              To clarify the term AutoML, we consider the minimal require-
machine learning process end-to-end, democratizing Machine                             ments to be the ability to return (a) a predictive model that can
Learning to non-experts and drastically increasing the produc-                         be applied to new data, and (b) an estimate of predictive perfor-
tivity of expert analysts. So far, most comparisons of AutoML                          mance of that model, given a data source, e.g., a 2-dimensional
systems focus on quantitative criteria such as predictive perfor-                      matrix (tabular data). Thus, do-it-yourself tools that allow you
mance and execution time. In this paper, we examine AutoML                             to graphically construct the analysis pipeline (e.g. Microsoft’s
services for predictive modeling tasks from a user’s perspective,                      Azure ML [31]) are excluded. In addition, we distinguish between
going beyond predictive performance. We present a wide palette                         libraries and services. The former require coding and typically
of criteria and dimensions on which to evaluate and compare                            offer just the minimal requirements, namely return a model and
these services as a user. This qualitative comparative method-                         a performance estimation. AutoML services, on the other hand,
ology is applied on seven AutoML systems, namely Auger.AI,                             include a user interface and strive to democratize ML not only to
BigML, H2O’s Driverless AI, Darwin, Just Add Data Bio, Rapid-                          coders, but to anybody with a computer; they typically offer a
Miner, and Watson. The comparison indicates the strengths and                          much wider range of functionalities.
weaknesses of each service, the needs that it covers, the segment                         Algorithmically, AutoML encompasses techniques regarding
of users that it is most appropriate for, and the possibilities for                    hyper-parameter optimization (HPO, [3, 48]), algorithm selection
improvements.                                                                          (CASH, [22]), automatic synthesis of analysis pipelines [36], per-
                                                                                       formance estimation [53], and meta-level learning [54], to name
KEYWORDS                                                                               a few. In addition, an AutoML system could not only automate
                                                                                       the modeling process, but also the steps that come before and
AutoML, machine learning services, qualitative evaluation
                                                                                       after. Pre-analysis steps include data integration, data preprocess-
                                                                                       ing, data cleaning, and data engineering (feature construction).
1    INTRODUCTION                                                                      Post-analysis steps include interpretation, explanation, and vi-
Automated Machine Learning (AutoML) is becoming a separate,                            sualization of the analysis process and the output model, model
independent sub-field of Machine Learning, that is rapidly rising                      production, model monitoring, and model updating. The ideal
in attention, importance, and number of applications [23, 35]. Au-                     AutoML system should only require the human to specify the
toML goals are to completely automate the application of machine                       data source(s), their semantics, and the goal of the analysis to
learning, statistical modeling, data mining, pattern recognition,                      create and maintain a model into production indefinitely.
and all advanced data analytics techniques. As an end result, Au-                         Given the importance and potential of AutoML, several aca-
toML could potentially democratize ML to non-experts (Citizen                          demic and commercial libraries, as well as services have appeared.
Data Scientists), boost the productivity of experts, shield against                    The first AutoML system was the academic Gene Expression
statistical methodological errors, and even surpass manual expert                      Model Selector (GEMS) [46]. Recent works formulate the AutoML
analysis performance (e.g., by using meta-level learning [11]).                        problem [56, 57], introduce techniques and frameworks for cre-
                                                                                       ating new AutoML tools [6, 45], survey the existing ones [43, 57]
∗ This work was done while the author was working at SAP SE
                                                                                       and comparatively evaluate them [42, 51, 55]. This is a
† Ioannis Tsamardinos is CEO of Gnosis Data Analysis, which created JAD Bio
                                                                                       technically challenging task requiring the availability of a plethora of
(JAD)
‡ The research leading to these results has received funding from the European         datasets with different characteristics [14], extensive computa-
Research Council under the European Union’s Seventh Framework Programme                tional time, ability to set time-limits to all software and many
(FP/2007-2013) / ERC Grant Agreement n. 617393                                         others (see [18] for a discussion on the set up and results of the
§ Work of the author was supported by the Institute of Advanced Studies of the
University Cergy-Pontoise under the Paris Seine Initiative for Excellence ("In-        AutoML Challenge Series).
vestissements d’Avenir" ANR-16-IDEX-0008).                                                AutoML strives to take the human expert out of the ML loop;
                                                                                       but, unfortunately, it seems the majority of AutoML surveys and
© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed-   evaluations also take the human user out of the loop, focusing
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-              solely on predictive performance and ignoring the user experi-
tribution 4.0 International (CC BY 4.0)                                                ence for the most part. Still, some exceptions can be found. In
[26], an interactive environment is proposed, emphasizing             Data Robot3 , we were not able to obtain the free trial licence
user-centric aspects of AutoML. Moreover, in [49] a brief quali-      advertised on their website.
tative evaluation on AutoML services and libraries is presented,
mainly regarding their ML capabilities.                               3     QUALITATIVE CRITERIA
   The contribution of this paper is to provide a user-centric        To qualitatively evaluate the seven AutoML services, we present
framework for comparing AutoML services. We define a set of           32 user-centric qualitative criteria spanning across six different
qualitative criteria, spanning across six categories (Estimates,      categories. The criteria are partitioned in the following categories.
Scope, Productivity, Interpretability, Customizability, and Con-      The Estimates category is concerned with metrics and estimates’
nectivity) that highlight user-experience beyond predictive per-      properties about the predictive power of the final model. The
formance when selecting or evaluating AutoML services. Us-            Scope criteria describe the applicability scope of a service mainly
ing this framework we evaluated seven such services, namely           in terms of data types and ML predictive tasks. The Productivity
Auger.AI [2], BigML [4], H2O’s Driverless AI [19], Darwin [7],        category is concerned with the ease of use, while Interpretability is
Just Add Data Bio [1], RapidMiner [39], and Watson [24]. The          concerned with the ability to interpret the results of the analysis.
comparison is meant to indicate the strengths, weaknesses, scope,     The last two categories are Customizability of the analysis and
and usability of each service, indicating the needs it covers, the    Connectivity of the service. The criteria are graded on a 4-level
tasks it is most appropriate for, and the opportunities for           scale: F(ail) (✗), C for fulfilling the basic requirements of the
improvement. To the best of our knowledge, no other survey or         criterion, B for providing additional functionalities, and A for
benchmarking paper proposed the aforementioned qualitative            achieving a level that should satisfy most users in our opinion.
criteria and methodology for evaluating AutoML services and
libraries.                                                            3.1     Estimates
                                                                      Criteria for Estimates (Table 1) concern the wealth and depth of
2    AUTOML SERVICES CONSIDERED                                       estimated quantities regarding the predictive model. ROC curves
In the present evaluation study we consider seven current Au-         are a useful visualization for interpreting the performance of a
toML service platforms that offer a free trial version, so we could   classification model and are widely used by the ML community.
base it on first-hand experience. All of these services specialize    We grade with B the services that output ROC curves (Auger.AI,
in tabular data, helping us apply the qualitative criteria to all     BigML and RM) and with A the ones which also output
of them. The study ran from 01/12/2019 until 07/12/2019               performance metrics for different points on the curve (DAI, JAD and
and we used the live versions of the services at the time. In         Watson). In addition to the out-of-sample estimate of predictive
alphabetical order, the services are:                                 performance, a service should be able to report the uncertainty
                                                                      of this estimation (criterion STD/CI calculation in Table 1 stand-
     • Auger.AI [2]: A new service, going live in 2019, Auger.AI
                                                                      ing for standard deviation and confidence interval respectively).
       claims high accuracy and a well-implemented API
                                                                      With B, we grade the services that only calculate the STD (BigML,
       to help users run experiments with ease.
                                                                      DAI and RM) and with A the ones calculating the whole prob-
     • BigML [4]: One of the oldest ML services, BigML sup-
                                                                      ability distribution of performance and its confidence intervals,
       ports AutoML tasks and offers extended support, a custom
                                                                      a richer piece of information (JAD). Regarding Label Predictions
       programming language and a cloud infrastructure for the
                                                                      on new data, the services that support either individual samples
       user.
                                                                      predictions or batch predictions are graded with B (Darwin), and
     • Darwin [7]: SparkCognition’s new AutoML service,
                                                                      the ones supporting both with A (the rest of the services). For
       providing users with convenient tools to speed up their
                                                                      binary classification tasks, the services able to generate Label
       ML tasks.
                                                                      probability estimations get an A (all services except Auger.AI).
     • Driverless AI (DAI) [19]: One of the most well-known
                                                                      Overall, JAD has a full score on all the criteria, followed by DAI
       AutoML services, DAI supports various ML tasks and also
                                                                      and RM.
       has advanced interpretability mechanisms.
     • Just Add Data Bio (JAD) [1]: JAD was launched in No-
                                                                      3.2     Scope
       vember 2019 focusing on the analysis of molecular biolog-
       ical data (small-sample, high-dimensional) with emphasis       Scope criteria (Table 1) cover the range of input data that can
       on feature selection.                                          be analyzed. When it comes to Outcome types, services able to
     • RapidMiner Studio (RM) [39]: The oldest AutoML ser-            handle binary (classification), multi-class (classification), contin-
       vice used in our evaluation, RM provides multiple tools        uous (regression) and censored time-to-event outcomes (survival
       to its users and supports user-created components. We          analysis) score A (JAD), while the ones not handling survival
       consider the standard version, not including the               analyses score B (the rest of the services). Regarding Predictor
       available user-created add-ons.                                types, the services which support all the standard tabular data and
     • IBM’s Watson (Watson) [24]: Watson contains multiple           also text or time-series data are graded with A (all services except
       components, but here we focus on the AutoAI experiment         for JAD), while the ones only supporting the former with B (JAD).
       toolkit1 , being closer to what we define as an AutoML         The term Clustered data (not to be confused with clustering of
       service for tabular data.                                      data) in statistics refers to samples that are naturally grouped
                                                                      in clusters (or groups) of samples that may be correlated given
Due to registration fees, we were not able to include in our bench-   the predictors. Examples include matched case-control data in
mark recent services such as Google AutoML Tables2 . Regarding        medicine and repeated measurements taken on the same subject
                                                                      or client. With A, we grade the services able to handle clustered
1 https://www.ibm.com/cloud/watson-studio/autoai
                                                                      data (DAI and JAD). It is important to mention the absence of
2 https://cloud.google.com/automl-tables/                             3 https://www.datarobot.com/
                                                     Table 1: Estimates and Scope criteria.

                   Category          Criteria                           Auger.AI  BigML  DAI  Darwin  JAD  RM  Watson
                   Estimates         ROC curves                         B         B      A    ✗       A    B   A
                                     STD/CI calculation                 ✗         B      B    ✗       A    B   ✗
                                     Label predictions                  A         A      A    B       A    A   A
                                     Label probability estimations      ✗         A      A    A       A    A   A
                   Scope             Outcome types                      B         B      B    B       A    B   B
                                     Predictor types                    A         A      A    A       B    A   A
                                     Clustered data handling            ✗         ✗      A    ✗       A    ✗   ✗
                                     Missing values handling            A         A      A    A       A    A   A

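As an illustration of what handling clustered data can involve, the following minimal Python sketch assigns whole clusters (e.g., all repeated measurements of one subject) to the same cross-validation fold, so that correlated samples never appear on both sides of a train/test split. The function name group_k_fold, the greedy balancing heuristic, and the toy subject identifiers are assumptions made for this example; none of the evaluated services is claimed to work this way.

```python
from collections import defaultdict


def group_k_fold(groups, n_folds):
    """Assign whole groups to folds so no group is split across train and test.

    groups[i] is the cluster id (e.g. a subject id) of sample i.
    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    # Collect the sample indices that belong to each group.
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    if n_folds > len(members):
        raise ValueError("more folds than groups")

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[group])

    splits = []
    for f in range(n_folds):
        test = sorted(folds[f])
        train = sorted(i for other in range(n_folds) if other != f
                       for i in folds[other])
        splits.append((train, test))
    return splits


# Hypothetical example: 8 measurements taken on 4 subjects.
subject_ids = ["s1", "s1", "s2", "s2", "s2", "s3", "s4", "s4"]
splits = group_k_fold(subject_ids, n_folds=2)
for train, test in splits:
    print("train:", train, "test:", test)
```

A plain random split of the same eight samples could place one measurement of subject s2 in the training set and another in the test set, which is the situation an i.i.d. assumption silently permits.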

clustered data and repeated measurements handling from most of             interpreting how the final model functions (Final model
the services. Essentially, most services assume independent and            interpretation). A particular means of understanding the results is through
identically distributed (i.i.d.) data, reducing their scope. Finally,      Feature selection, which deserves its own criterion, along with the
we grade a service’s ability to handle missing data with A (all            available mechanisms for the Final feature set interpretation. (d)
services). In this category, DAI and JAD lead with the highest             Understanding and validating the process that took place during
score.                                                                     the analysis (Analysis exploration). Regarding Data visualizations
                                                                           prior to the analysis, a service which only provides histograms,
3.3    Productivity                                                        scores C (JAD). If it also implements correlation plots and data
The Productivity criteria (Table 2) concern the ease of use and            heatmaps, its score is B (BigML). The services with more options
boost of user productivity. We start off with Data manipulation            get A (DAI, RM and Watson). During the analysis (Progress report),
functionalities available to prepare and manipulate the input              if a service only reports the completion percentage, it gets the
data before analysis. Grade B goes to the services providing the           grade C (Darwin). When it additionally shows a performance
user with custom data partitioning and preprocessing recom-                estimation of the best model and keeps track of the analysis pro-
mendations (DAI and Darwin) and grade A to the services that               cedure, its grade is B (BigML and JAD). The highest grade (A)
additionally provide data merging, filtering and sub-sampling              goes to the services that also show variable importance rankings,
(BigML, JAD, RM, Watson). About Pipeline automation, the ser-              generated models ranking and hardware usage (Auger.AI, DAI,
vices where the best model is automatically selected according to          RM and Watson).
pre-specified user preferences (e.g., maximize AUC) score A (DAI,              Once the analysis is complete, the AutoML service should be
Darwin, JAD and Watson). The services that instead produce a ranking       able to explain how the final model works. This adds transparency
of all tried models and require the user to select the one that best       to the model and pinpoints possible flaws or bias in its decision
satisfies their criteria score B (Auger.AI, BigML and RM).                 making, making it more trustworthy. The interpretability of the
On the one hand, ranking all the models arguably provides richer           results is a subdomain of ML with increasing popularity and
information to the user; on the other, it reduces automation               every year multiple new mechanisms are introduced [9, 33]. We
and could confuse the non-expert. So, our grading in this crite-           have selected a set of such mechanisms and grade the AutoML
rion is admittedly subjective. We next grade the ability to Early          services based on how many of them they have implemented.
stop or pause an analysis. The services able to do both score A            The mechanisms are: a) the confusion matrix, which is created
(RM) and in case they have implemented either one but not the              based on the predictions made during the training phase, to
other, they score B (the rest of the services). When it comes to           help the user understand what type of errors are produced by
Collaboration features, we grade a service with A if it has imple-         the final model; b) report of the performance of the final model
mented mechanisms to create custom organizations and teams                 using multiple performance metrics; c) residuals visualization,
to allow sharing of resources, such as data and analyses (all ser-         i.e. the difference between observed and predicted values of the
vices except DAI and Darwin). Lastly, about Documentation and              data; d) PCA procedure [44] to highlight strong patterns of the
support, the services providing e-mail support score C (JAD). If           data and visualize them on a 2-D space; e) visualization of the
they also deliver extensive documentation to the user, they score          final model, when this is possible; f) techniques to explain the
B (Auger.AI and Darwin) and when they additionally have direct             predictions in case of a complex final model (e.g. LIME-SUP [21],
technical support and user forums, their score is A (BigML, DAI,           K-LIME, a variant of LIME [40], decision tree surrogate models
RM and Watson). In general, Productivity is a category emphasized          [8], etc.). When the service has implemented at least 2 of the above
by all services, making it relatively straightforward for any              mechanisms, its corresponding grade is C (Darwin and Watson),
user to complete an ML analysis.                                           while for a service with more than 2 available mechanisms, its
                                                                           grade is B (Auger.AI, BigML, RM). The grade A is reserved for the
3.4    Interpretability                                                    services with more than 4 of the aforementioned mechanisms
Interpretability (Table 2) is arguably one of the most important           implemented (DAI and JAD).
categories for selecting an AutoML service [32]. The criteria                  Feature selection is often the primary goal of an analysis. It
concern (a) Exploring and visualizing the data (Data visualiza-            leads to simpler models that require fewer measurements to pro-
tion) before conducting the analysis. (b) Monitoring the execution         vide a prediction, which may be important in several applications.
of the analysis progress (Progress report). (c) Understanding and          Most importantly however, feature selection is used as a tool for
                                                                           knowledge discovery [28] to gain intuition and insight into the
                                      Table 2: Productivity and Interpretability criteria. ✜: only for certain models

                   Category          Criteria                           Auger.AI  BigML  DAI  Darwin  JAD  RM  Watson
                   Productivity      Data manipulation                  ✗         A      B    B       A    A   A
                                     Pipeline automation                B         B      A    A       A    B   A
                                     Early stop or pause                B         ✗      B    B       B    A   B
                                     Collaboration features             A         A      ✗    ✗       A    A   A
                                     Documentation and support          B         A      A    B       C    A   A
                   Interpretability  Data visualization                 ✗         B      A    ✗       C    A   A
                                     Progress report                    A         B      A    C       B    A   A
                                     Final model interpretation         B         B      A    C       A    B   C
                                     Feature selection                  ✗         C      C    ✗       A    B   ✗
                                     Final feature set interpretation   C         B      A    C       A    B   C
                                     Analysis exploration               A✜        B      B    ✗       B    A   A

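The count-based grading rules of Section 3.4 can be restated as inclusive thresholds. The sketch below is only a reading aid: the function and constant names are hypothetical, and mapping a count below the C threshold to "F" is our assumption, since the text does not state that case explicitly.

```python
def grade(n_mechanisms, c_min, b_min, a_min):
    """Return the grade earned by a service implementing n_mechanisms."""
    if n_mechanisms >= a_min:
        return "A"
    if n_mechanisms >= b_min:
        return "B"
    if n_mechanisms >= c_min:
        return "C"
    return "F"  # assumption: below the C threshold counts as F(ail)


# Thresholds as worded in the text, rewritten as inclusive minimum counts:
# "at least 2" / "more than 2" / "more than 4" (final model interpretation)
FINAL_MODEL_INTERPRETATION = {"c_min": 2, "b_min": 3, "a_min": 5}
# "at least 1" / "more than 2" / "4 or more" (final feature set interpretation)
FINAL_FEATURE_SET_INTERPRETATION = {"c_min": 1, "b_min": 3, "a_min": 4}

for n in range(7):
    print(n,
          grade(n, **FINAL_MODEL_INTERPRETATION),
          grade(n, **FINAL_FEATURE_SET_INTERPRETATION))
```

Under this reading, a service earns C for final model interpretation only when it implements exactly 2 of the six listed mechanisms.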

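To illustrate why selecting features before the performance estimation protocol inflates the estimate, the following minimal sketch compares two leave-one-out protocols on pure-noise data whose labels are unrelated to the predictors. The nearest-centroid classifier, the mean-difference filter, the data sizes, and all function names are assumptions chosen for this example; they are not the methods of BigML or of any other service evaluated here.

```python
import random


def top_features(X, y, rows, k):
    """Rank features by absolute difference of class means, using only `rows`."""
    def score(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(X, y, train, features, i):
    """Nearest-centroid prediction for sample i on the chosen features."""
    def sq_dist(label):
        rows = [r for r in train if y[r] == label]
        return sum((X[i][j] - sum(X[r][j] for r in rows) / len(rows)) ** 2
                   for j in features)
    return 1 if sq_dist(1) < sq_dist(0) else 0


def loo_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy with selection inside or outside the loop."""
    all_rows = list(range(len(X)))
    # Wrong protocol: features are chosen once, with every test label visible.
    leaked = top_features(X, y, all_rows, k)
    correct = 0
    for i in all_rows:
        train = [r for r in all_rows if r != i]
        # Right protocol: repeat the selection on the training rows only.
        features = top_features(X, y, train, k) if select_inside else leaked
        correct += predict(X, y, train, features, i) == y[i]
    return correct / len(X)


rng = random.Random(0)
n_samples, n_features, k = 40, 500, 10
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels are independent of X

biased = loo_accuracy(X, y, k, select_inside=False)
honest = loo_accuracy(X, y, k, select_inside=True)
print(f"selection outside CV: {biased:.2f}, selection inside CV: {honest:.2f}")
```

Because the labels carry no signal, an unbiased estimate should sit near chance level; the protocol that selects features on all samples typically reports a much higher accuracy, which is the overestimation referred to in the text.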
problem (hence, its inclusion in the interpretability category).                    case of multiple feature selection. A service that has implemented
A pharmacologist is not only interested in predicting cancer                        at least 1 of these mechanisms is graded with C (Auger.AI,
metastasis but also in the molecules involved in the prediction to                  Darwin and Watson). If more than 2 mechanisms are available, the
identify drug targets; a business person is interested in the quan-                 service’s grade is B (BigML, RM) and the grade A is reserved for
tities that affect customer attrition to devise new promotions and                  the services with 4 or more mechanisms (DAI, JAD).
advertisements. Such reasoning is theoretically supported by the                        Expert analysts would often like to verify the correctness and
fact that feature selection has been connected to the causal mech-                  completeness of the analysis that took place. It is not only the
anisms that generate the data [50]. It is defined as the problem of                 results (model) that should not be treated as a black-box, but
identifying a minimal-size feature subset that jointly                              also how these results were obtained. A service which displays
(multivariately) leads to an optimal prediction model (see [17] for a formal        an Analysis exploration graph to help the users understand the
definition). Thus, feature selection removes not only irrelevant,                   methods used in each step scores A (Auger.AI, RM and Watson).
but also redundant features. In some data distributions, there may                  If the service displays all pipelines that were tried in the form
be multiple solutions to the feature selection problem. For example, due to         of a list instead of a graph, its score is B (BigML, DAI and JAD).
low sample size the truly best feature subset may be statistically                  When it comes to analysis interpretation, DAI and JAD seem to
indistinguishable from slightly sub-optimal feature subsets. Or,                    be the best choice, providing the user with advanced mechanisms
it could be the case that there is informational redundancy that leads             for understanding the final results. Some services do not provide
to feature subsets that are equally predictive. While all solutions                 any information about which analysis pipelines they tried; the
are equivalent in terms of predictive performance, returning all                    analysis process is essentially a black box to the user. We note
solutions is important when feature selection is used as a tool for                 that in our opinion, there is room for improvement regarding
knowledge discovery.                                                                interpretability for most of the services.
    The services which offer single feature selection functionality
score C (BigML and DAI). BigML treats feature selection as a                        3.5    Customizability
preprocessing step, before the modeling process and the performance                 The Customizability category (Table 3) grades the ability of the
estimation protocol. This approach is methodologically                              services to customize analysis according to user choices and
wrong and leads to overestimating performance (see [20], page                       preferences. About Time budget, we grade with B the services
245). There are different notions of multiple feature selection.                    giving the ability to impose a non-strict time limit on an analysis
When a service returns several feature subsets as options, but                      (Auger.AI, BigML and JAD) and with A the ones which allow
does not provide any theoretical guarantees of statistical equiv-                   setting a strict time limit (DAI and Darwin). Our take on this
alence, its grade is B (RM). On the other hand, when a service                      subject is that every service should give the ability to pose a
returns several feature subsets that lead to models with statisti-                  strict time budget, as an analysis can be part of a bigger project,
cally indistinguishable performance from the optimal, its grade                     running under specific time restrictions. Moving to the hardware
is A (JAD). Feature selection by itself is not enough. The services                 Resources budget, if a service allows the user to select a preset
should also provide users with mechanisms for interpreting and                      hardware configuration, it scores B (Watson) and if it allows
understanding how each feature in the final set affects and con-                    setting up the exact hardware specifications, A (DAI and JAD).
tributes to the decision making of the final model. We base our                     Next, we consider the Customization of analysis components, i.e.
grading on a set of Final feature set interpretation mechanisms                     the ability to choose the methods and algorithms to try, along
and how many of them each AutoML service has implemented.                           with their hyperparameters, in each step of the ML pipeline. If
The mechanisms are: a) random forest feature importance rank-                       the user is able to fully customize the included components, the
ing of the participating features [5]; b) LOCO feature importance                   service gets A (Auger.AI, BigML, DAI and RM). If the service
[27]; c) partial dependence plots (PDPs) [12]; d) SHAP plots [29];                  provides the user with a set of limited settings, it gets B (Darwin,
e) ICE plots [15]; f) a report of the standardized individual and                   JAD and Watson).
cumulative importance of the participating features; g) the actual                     A service that allows the user to Enforce final model inter-
standardized coefficient for each feature, in the case of a linear                  pretability, is graded with B (JAD) and if it provides additional
final model; h) information about the resulted feature sets, in the                 interpretability settings, with A (DAI). Another customization
Table 3: Customizability and Connectivity criteria. ✧: for RM server, not RM studio

Category         Criteria                           Auger.AI  BigML  DAI  Darwin  JAD  RM   Watson
Customizability  Time budget                        B         B      A    A       B    ✗    ✗
Customizability  Resources budget                   ✗         ✗      A    ✗       A    ✗    B
Customizability  Analysis components customization  A         A      A    B       B    A    B
Customizability  Enforce Model Interpretability     ✗         ✗      A    ✗       B    ✗    ✗
Customizability  Feature selection options          ✗         A      A    ✗       A    B    ✗
Customizability  Visualizations customization       ✗         A      B    ✗       ✗    A    A
Connectivity     Service deployment                 ✗         A      A    ✗       ✗    A✧   ✗
Connectivity     3rd party storage connection       A         A      A    ✗       ✗    A    A
Connectivity     API access                         A         A      A    A       A    A    A
Connectivity     Downloadable results               A         A      A    ✗       B    A    B
Connectivity     Analysis components contribution   B         A      A    ✗       ✗    A    B
Connectivity     Model deployment                   A         A      A    A       ✗    A    A
Connectivity     Visualizations exportability       ✗         B      B    ✗       B    A    A
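
As an illustration only, the Customizability grades of Table 3 can be tallied per service. The paper assigns no numeric scores, so the weighting used here (A = 2, B = 1, ✗ = 0) and all identifiers are assumptions made for this sketch, not part of the proposed methodology.

```python
# Grades transcribed from the Customizability rows of Table 3.
# ASSUMPTION: the numeric weights below are hypothetical; the paper only
# reports letter grades and does not aggregate them.
SERVICES = ["Auger.AI", "BigML", "DAI", "Darwin", "JAD", "RM", "Watson"]
GRADE_POINTS = {"A": 2, "B": 1, "✗": 0}

CUSTOMIZABILITY = {
    "Time budget":                       ["B", "B", "A", "A", "B", "✗", "✗"],
    "Resources budget":                  ["✗", "✗", "A", "✗", "A", "✗", "B"],
    "Analysis components customization": ["A", "A", "A", "B", "B", "A", "B"],
    "Enforce model interpretability":    ["✗", "✗", "A", "✗", "B", "✗", "✗"],
    "Feature selection options":         ["✗", "A", "A", "✗", "A", "B", "✗"],
    "Visualizations customization":      ["✗", "A", "B", "✗", "✗", "A", "A"],
}


def tally(table):
    """Sum the (assumed) points of every criterion for each service."""
    totals = dict.fromkeys(SERVICES, 0)
    for grades in table.values():
        for service, grade in zip(SERVICES, grades):
            totals[service] += GRADE_POINTS[grade]
    return totals


totals = tally(CUSTOMIZABILITY)
# Print services from highest to lowest tally.
for service, points in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{service:9s} {points}")
```

Under this assumed weighting DAI obtains 11 of 12 possible points, followed by BigML and JAD with 7 each, which agrees with the observation in the text that DAI has a clear edge in customizability. A different weighting could change the ordering of the remaining services.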


criterion is about the available Feature selection options. If the AutoML service allows the user
to select the exact number of selected features, it is graded with A (BigML, DAI and JAD) and if it
allows the user to set certain parameters, such as the effort put in feature selection, with B
(RM). Finally, we also consider the Visualizations customization options. When a service gives the
user the ability to set user-specific thresholds on certain visualizations, its grade is B (DAI).
If the user can fully customize the resulting visualizations (e.g. changing the axes, titles,
legend, colors), the service's grade is A (BigML, RM and Watson). In general, when it comes to
customizability, DAI has a clear edge over the competition, giving the users options to fine-tune
and set up an analysis according to their needs. We distinguish two different schools of thought on
this category. On one hand, services such as DAI let the user fully customize the algorithms and
hyperparameter values to search during an analysis. On the other hand, services like JAD provide
the user with a few preference choices that do not require expert knowledge of ML. The first
approach empowers an expert analyst but it may be intimidating to the non-expert user. There is a
fine line between providing enough choices to an expert to fully customize an analysis and achieve
better results and providing too many choices that make the process complex and easy to break. For
this reason, we recommend equipping AutoML services with some kind of warning system that can
detect when the selected setup might create problems and notify the user accordingly.

3.6    Connectivity
The Connectivity criteria (Table 3) grade the options offered to connect a service with external
tools and resources. First, regarding the Service's deployment at an external infrastructure, the
services supporting it score A (BigML, DAI and RM). The ones able to Connect to 3rd party storage
providers also get an A (all except Darwin and JAD). Furthermore, all services have implemented
their own API (grade A). We also look into the Downloadable results options. In the case where only
part of the results are downloadable, the services are graded with B (JAD and Watson) while the
ones allowing the user to download all the results and also generate a summary report, with A (all
services except JAD and Watson). A user might be interested in Adding custom components to the
AutoML service. If the user is allowed to add components through a service's API, the service is
graded with B (Auger.AI and Watson). If the service has moreover implemented a complete system for
user-defined components, by creating their own marketplace or extensions library, its grade is A
(BigML, DAI and RM). Creating the best final model does not always suffice, as the user will
probably want to deploy it in an external service and use it for new data predictions. Most of the
participating services have added various model deployment options (grade A) (all except JAD). The
currently implemented ideas are to use data transfer libraries, e.g. cURL (Auger.AI, Watson),
create actionable models (BigML, Darwin, RM) or scoring pipelines (DAI). All of the above provide
the same functionality: predicting labels on new unseen data. Finally, when writing reports or
papers with the results, the visualizations need to be exported. The services which provide fewer
than three export options score B (BigML, DAI and JAD) and those with more score A (RM and Watson).
Taking a look at the participating services, most of them cover the majority of the proposed
criteria. The export formats available for data visualizations are static in all tools, an area
that could be greatly improved. Additionally, we consider the lack of connections to public
repositories, such as OpenML [52], important, as they can be useful to a user who is interested in
conducting ML analyses for academic reasons.

4   LIMITATIONS AND DISCUSSION
Admittedly, the current study has several limitations. We take the opportunity to discuss some in
depth, pointing to important open issues and future work. First of all, we were not able to
evaluate every known AutoML service.
Estimates: While all services provide estimated quantities from the data, the major question
remains: are the estimates returned correct and reliable? Statistical estimations are particularly
challenging with small sample sizes; even more so with high dimensional data. Is performance
overestimated, standard deviations underestimated, probabilities of individual predictions
uncalibrated, feature importances inaccurate, or multiple feature subsets returned not
statistically equivalent? Which AutoML services return reliable results one can trust, and which
ones are actually misleading the user and potentially harmful? In medical applications,
overestimating performance or confidence in a prediction (uncalibrated predicted probabilities) is
dangerous and could impact human health, while in business applications it may have significant
monetary costs. Such questions require
significant experimentation with all services to answer. Experimentation should be performed on
datasets with a wide range of characteristics, e.g., sample size, number of features, percentage of
missing values, mixture of types of predictors (continuous, discrete, ordinal, zero-inflated,
etc.), outcomes, etc. to provide a full quantitative picture of the pros and cons of each service
and its correctness properties. Unfortunately, most quantitative evaluations are currently
performed on datasets with a limited range of such characteristics or are restricted by time
limitations.
Scope: In this paper, we are only concerned with predictive modeling (supervised learning) tasks
and not other ML categories. Each different task would require a separate set of criteria that
applies to it. We do note, however, that BigML, DAI, RM, and Watson also support clustering,
anomaly detection, and some NLP tasks which are useful to numerous users. A major limitation of our
scope grading is that it misses important criteria concerning the maximum volume of data a service
can handle in reasonable time or memory resources, in terms of number of features, number of
samples, or their combination (total volume). Unfortunately, we are not able to test the limits of
each service as we are confined to analyses that run on the free trial versions. However, regarding
the scalability with respect to feature size, we note that almost all services have difficulty
scaling to thousands of features. JAD, on the other hand, was created to scale up to the feature
size of typical multi-omics datasets that can reach up to hundreds of thousands of features.
Productivity/Interpretability: Although we presented a first qualitative assessment, a true measure
of productivity increase requires an extensive user study with representative datasets spanning a
wide range of characteristics (in terms of the number of features and samples). In such a user
study, one should measure how much productivity has improved over manual scripting, possibly by
trading off learning performance, and how much insight has been gained by the interpretation tools
offered by each service. To assess how an AutoML tool performs against human experts, Kaggle⁴ and
other ML competitions could be exploited. As data and tasks are specific for a competition problem,
solutions by human experts usually take the top positions as they apply domain-specific knowledge
and sometimes create custom methods and mechanisms to help them win these competitions. Still,
AutoML tools that have been tested on such tasks achieve comparable performance. AutoML tools are
becoming more and more sophisticated, by automating an increasing number of tasks in ML pipelines
(e.g., feature engineering), while supporting meta-level learning techniques. This can lead to
minimizing the gap between human experts and AutoML in competitive environments [45] and aid in
producing high quality ML models for both commercial and academic purposes.
    There are several other criteria categories that are missing from the present methodology, due
to space limitations. These include model monitoring and maintenance, which concerns functionality
for maintaining a model in production [30], such as monitoring the health of the production model,
raising alarms when there is a drift in the data distribution, and automatically retraining and
updating the model, among others. As ML systems move from computer-science laboratories into the
open world, their accountability [13] and auditing [10] become a high-priority problem. In this
respect, we need a deep understanding of the ML system behavior and its failures. Current
evaluation methods such as single-score error metrics and confusion matrices provide aggregate
views of system performance that hide important shortcomings. Understanding details about failures
is important for finding ways for improvement, communicating the reliability of systems in
different settings and for specifying appropriate human oversight and engagement [34].
   Finally, we would like to mention that each category could be expanded with many more criteria.
Only the criteria that were addressed by at least one of the services were included.
Functionalities that were not addressed by any of the services examined are missing. One example is
the ability to handle continuous signals and streaming data [38].

5    CONCLUSION
AutoML has made tremendous progress since its first embodiment in the GEMS system. Several AutoML
services are already available, routinely analyzing business and scientific data for thousands of
users. They do increase productivity and allow non-experts to perform sophisticated ML analyses.
Our prediction is that within a few years, most data analysis will involve the use of an AutoML
service or library; scripting as a means to manual ML analysis will gradually become obsolete or
pass to the next level, where it is customizing and invoking AutoML functionalities.
   The proposed criteria intend to turn the spotlight back onto the human user. Users do not only
consider learning performance when choosing a service. They also consider a plethora of other
criteria such as the ones presented. One of the most important ones is interpretability of results.
Users are rarely satisfied with just a predictive model; they also seek to understand the patterns
in their data. Thus, results should not be a black-box, but explained, visualized, and interpreted.
Users need to examine the analysis process and ensure its correctness or optimality: AutoML should
automate, not obfuscate. The analysis process should be transparent, verifiable, and customizable
by the user. Some of the AutoML services examined clearly abide by these principles but some fail
in this set of criteria. Arguably, it is perhaps interpretation of results and ease-of-use that
will determine the success of an AutoML service, and not necessarily predictive performance.
   Current AutoML systems mostly focus on tabular, iid-sampled data. Obviously, however, most of
the world's data is not in this format or sampled as iid. Ultimately, AutoML competes with the
human expert not only in learning performance but also in scope and the range of problems it can
handle. There are ongoing efforts to develop AutoML solutions for regression or anomaly detection
tasks in time-series, time-course data, and streaming data (e.g., Microsoft Azure [31], Yahoo EGADS
[25], Facebook Prophet [47]), or to generate features from relational tables or CSV/JSON files
[16]. Future AutoML systems should also automate more data preparation tasks including data
cleaning (e.g. error correction and deduplication) [41] and support ML tasks such as reinforcement,
transfer and federated learning, or causal modeling [37] to name a few. Still, interpreting the
results of the analysis in each category is quite challenging and probably requires a different,
specialized set of methods. There is a long road ahead, where ML is entering a new generation of
systems and algorithms, but an exciting road indeed.

⁴ https://kaggle.com

REFERENCES
 [1] Gnosis Data Analysis. 2019. Just Add Data Bio. https://www.jadbio.com/.
 [2] Auger.AI. 2019. Auger.AI. https://auger.ai/.
 [3] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for
     Hyper-Parameter Optimization. In Advances in Neural Information Processing Systems 24,
     J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran
     Associates, Inc., 2546–2554.
     http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
 [4] BigML. 2012. BigML. https://bigml.com/.
 [5] Leo Breiman. 2017. Classification and regression trees. Routledge.
 [6] Yi-Wei Chen, Qingquan Song, and Xia Hu. 2019. Techniques for Automated Machine Learning. CoRR
     abs/1907.08908 (2019). arXiv:1907.08908 http://arxiv.org/abs/1907.08908
 [7] Spark Cognition. 2019. Darwin. https://www.sparkcognition.com/product/darwin/.
 [8] Mark W. Craven and Jude W. Shavlik. 1995. Extracting Tree-Structured Representations of
     Trained Networks. In Proceedings of the 8th International Conference on Neural Information
     Processing Systems (NIPS’95). MIT Press, Cambridge, MA, USA, 24–30.
 [9] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning.
     Commun. ACM 63, 1 (2019), 68–77.
[10] Amitai Etzioni and Oren Etzioni. 2016. Designing AI Systems That Obey Our Laws and Values.
     Commun. ACM 59, 9 (Aug. 2016), 29–31.
[11] Matthias Feurer and Frank Hutter. 2018. Towards Further Automation in AutoML. In ICML 2018
     AutoML Workshop.
[12] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of
     statistics (2001), 1189–1232.
[13] Krishna Gade, Sahin Cem Geyik, Krishnaram Kenthapadi, Varun Mithal, and Ankur Taly. 2019.
     Explainable AI in Industry. In Proceedings of the 25th ACM SIGKDD International Conference on
     Knowledge Discovery and Data Mining (KDD ’19). Association for Computing Machinery, New York,
     NY, USA, 3203–3204.
[14] Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin
     Vanschoren. 2019. An Open Source AutoML Benchmark. CoRR abs/1907.00909 (2019).
     arXiv:1907.00909 http://arxiv.org/abs/1907.00909
[15] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. Peeking inside the black
     box: Visualizing statistical learning with plots of individual conditional expectation.
     Journal of Computational and Graphical Statistics 24, 1 (2015), 44–65.
[16] Google. 2019. AutoML Tables. https://cloud.google.com/automl-tables/.
[17] Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection.
     Journal of machine learning research 3, Mar (2003), 1157–1182.
[18] Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera,
     Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michèle Sebag, Alexander Statnikov,
     Wei-Wei Tu, and Evelyne Viegas. 2019. Analysis of the AutoML Challenge Series 2015–2018.
     Springer International Publishing, Cham, 177–219.
     https://doi.org/10.1007/978-3-030-05318-5_10
[19] H2O. 2017. Driverless AI. https://www.h2o.ai/products/h2o-driverless-ai/.
[20] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The elements of statistical
     learning: data mining, inference, and prediction. Springer Science & Business Media.
[21] Linwei Hu, Jie Chen, Vijayan Nair, and Agus Sudjianto. 2018. Locally Interpretable Models and
     Effects based on Supervised Partitioning (LIME-SUP). (06 2018).
[22] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential Model-Based
     Optimization for General Algorithm Configuration. In Proceedings of the conference on Learning
     and Intelligent OptimizatioN (LION 5). 507–523.
[23] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren (Eds.). 2018. Automated Machine Learning:
     Methods, Systems, Challenges. Springer. In press, available at http://automl.org/book.
[24] IBM. 2015. IBM Watson Studio. https://www.ibm.com/watson.
[25] Nikolay Laptev, Saeed Amizadeh, and Ian Flint. 2015. Generic and Scalable Framework for
     Automated Time-Series Anomaly Detection. In Proceedings of the 21st ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining (KDD ’15). Association for Computing
     Machinery, New York, NY, USA, 1939–1947. https://doi.org/10.1145/2783258.2788611
[26] Doris Jung-Lin Lee, Stephen Macke, Doris Xin, Angela Lee, Silu Huang, and Aditya
     Parameswaran. 2019. A Human-in-the-loop Perspective on AutoML: Milestones and the Road Ahead.
     Data Engineering (2019), 58.
[27] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. 2018.
     Distribution-free predictive inference for regression. J. Amer. Statist. Assoc. 113, 523
     (2018), 1094–1111.
[28] Huan Liu and Hiroshi Motoda. 1998. Feature Selection for Knowledge Discovery and Data Mining.
     Kluwer Academic Publishers, Norwell, MA, USA.
[29] Scott Lundberg, Gabriel Erion, and Su-In Lee. 2018. Consistent Individualized Feature
     Attribution for Tree Ensembles. (02 2018).
[30] Jorge G Madrid, Hugo Jair Escalante, Eduardo F Morales, Wei-Wei Tu, Yang Yu, Lisheng
     Sun-Hosoya, Isabelle Guyon, and Michèle Sebag. 2018. Towards AutoML in the presence of Drift:
     first results. In Workshop AutoML 2018 @ ICML/IJCAI-ECAI. Pavel Brazdil, Christophe
     Giraud-Carrier, and Isabelle Guyon, Stockholm, Sweden. https://hal.inria.fr/hal-01966962
[31] Microsoft. 2015. Azure Machine Learning Studio. https://studio.azureml.net/.
[32] Christoph Molnar. 2019. Interpretable Machine Learning.
     https://christophm.github.io/interpretable-ml-book/.
[33] Christoph Molnar. 2019. Interpretable machine learning. Lulu.com.
[34] Besmira Nushi, Ece Kamar, and Eric Horvitz. 2018. Towards Accountable AI: Hybrid Human-Machine
     Analyses for Characterizing System Failure. In Proceedings of the Sixth AAAI Conference on
     Human Computation and Crowdsourcing, HCOMP 2018, Zürich, Switzerland, July 5-8, 2018. 126–135.
[35] Meghana Padmanabhan, Pengyu Yuan, Govind Chada, and Hien Van Nguyen. 2019. Physician-Friendly
     Machine Learning: A Case Study with Cardiovascular Disease Risk Prediction. In Journal of
     clinical medicine.
[36] Magnus Palmblad, Anna-Lena Lamprecht, Jon Ison, and Veit Schwämmle. 2018. Automated workflow
     composition in mass spectrometry-based proteomics. Bioinformatics 35, 4 (2018), 656–664.
[37] J. Peters, D. Janzing, and B. Schölkopf. 2017. Elements of Causal Inference: Foundations and
     Learning Algorithms. MIT Press, Cambridge, MA, USA.
[38] Fábio Pinto, Marco O. P. Sampaio, and Pedro Bizarro. 2019. Automatic Model Monitoring for Data
     Streams. CoRR abs/1908.04240 (2019). http://arxiv.org/abs/1908.04240
[39] RapidMiner. 2006. RapidMiner. https://rapidminer.com/.
[40] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?":
     Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD
     International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY,
     USA, 1135–1144. https://doi.org/10.1145/2939672.2939778
[41] Vraj Shah and Arun Kumar. 2019. The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation
     for ML. In Proceedings of the 3rd International Workshop on Data Management for End-to-End
     Machine Learning (DEEM’19). Association for Computing Machinery, New York, NY, USA, Article
     11, 4 pages. https://doi.org/10.1145/3329486.3329499
[42] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Philipp Eichmann,
     Yeounoh Chung, Carsten Binnig, Eli Upfal, and Tim Kraska. 2019. Democratizing Data Science
     Through Interactive Curation of ML Pipelines. In Proceedings of the 2019 International
     Conference on Management of Data (SIGMOD ’19). ACM, New York, NY, USA, 1171–1188.
     https://doi.org/10.1145/3299869.3319863
[43] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. 2019. Automated Machine Learning:
     State-of-The-Art and Open Challenges. CoRR abs/1906.02287 (2019). arXiv:1906.02287
     http://arxiv.org/abs/1906.02287
[44] Jonathon Shlens. 2014. A tutorial on principal component analysis. arXiv preprint
     arXiv:1404.1100 (2014).
[45] Micah J. Smith, Carles Sala, James Max Kanter, and Kalyan Veeramachaneni. 2019. The Machine
     Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development. CoRR
     abs/1905.08942 (2019). arXiv:1905.08942 http://arxiv.org/abs/1905.08942
[46] Alexander Statnikov, Ioannis Tsamardinos, Yerbolat Dosbayev, and Constantin F Aliferis. 2005.
     GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene
     expression data. International journal of medical informatics 74, 7-8 (2005), 491–503.
[47] Sean Taylor and Benjamin Letham. 2017. Forecasting at Scale. The American Statistician 72
     (09 2017). https://doi.org/10.1080/00031305.2017.1380080
[48] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2012. Auto-WEKA:
     Automated selection and hyper-parameter optimization of classification algorithms. CoRR,
     abs/1208.3719 (2012).
[49] Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, Bayan Bruss, and Reza Farivar.
     2019. Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and
     Tools. (2019). arXiv:1908.05557 http://arxiv.org/abs/1908.05557
[50] Ioannis Tsamardinos and Constantin F Aliferis. 2003. Towards principled feature selection:
     relevancy, filters and wrappers. In AISTATS.
[51] Lukas Tuggener, Mohammadreza Amirian, Katharina Rombach, Stefan Lörwald, Anastasia Varlet,
     Christian Westermann, and Thilo Stadelmann. 2019. Automated Machine Learning in Practice:
     State of the Art and Recent Results. CoRR abs/1907.08392 (2019). arXiv:1907.08392
     http://arxiv.org/abs/1907.08392
[52] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked
     Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60.
     https://doi.org/10.1145/2641190.2641198
[53] Sudhir Varma and Richard Simon. 2006. Bias in error estimation when using cross-validation for
     model selection. BMC bioinformatics 7, 1 (2006), 91.
[54] Ricardo Vilalta and Youssef Drissi. 2002. A Perspective View and Survey of Meta-Learning.
     Artificial Intelligence Review 18, 2 (01 Jun 2002), 77–95.
     https://doi.org/10.1023/A:1019956318069
[55] Ziqiao Weng. 2019. From Conventional Machine Learning to AutoML. Journal of Physics:
     Conference Series 1207 (apr 2019), 012015. https://doi.org/10.1088/1742-6596/1207/1/012015
[56] Quanming Yao, Mengshuo Wang, Hugo Jair Escalante, Isabelle Guyon, Yi-Qi Hu, Yu-Feng Li,
     Wei-Wei Tu, Qiang Yang, and Yang Yu. 2018. Taking Human out of Learning Applications: A Survey
     on Automated Machine Learning. CoRR abs/1810.13306 (2018). arXiv:1810.13306
     http://arxiv.org/abs/1810.13306
[57] Marc-André Zöller and Marco F. Huber. 2019. Survey on Automated Machine Learning. (2019).
     arXiv:1904.12054 http://arxiv.org/abs/1904.12054