                        Putting the Human Back in the AutoML Loop

Iordanis Xanthopoulos∗, University of Crete, Greece, jordan.xanthopoulos@gmail.com
Ioannis Tsamardinos†‡, University of Crete; Gnosis DA; Greece
Vassilis Christophides§, University of Crete, Greece, christop@csd.uoc.gr
Eric Simon, SAP France, France, eric.simon@sap.com
Alejandro Salinger, SAP SE, Germany, alejandro.salinger@sap.com

ABSTRACT
Automated Machine Learning (AutoML) is a rapidly rising sub-field of Machine Learning. AutoML
aims to fully automate the machine learning process end-to-end, democratizing Machine Learning
to non-experts and drastically increasing the productivity of expert analysts. So far, most
comparisons of AutoML systems focus on quantitative criteria such as predictive performance and
execution time. In this paper, we examine AutoML services for predictive modeling tasks from a
user’s perspective, going beyond predictive performance. We present a wide palette of criteria
and dimensions on which to evaluate and compare these services as a user. This qualitative
comparative methodology is applied to seven AutoML systems, namely Auger.AI, BigML, H2O’s
Driverless AI, Darwin, Just Add Data Bio, RapidMiner, and Watson. The comparison indicates the
strengths and weaknesses of each service, the needs that it covers, the segment of users that it
is most appropriate for, and the possibilities for improvement.

KEYWORDS
AutoML, machine learning services, qualitative evaluation

1    INTRODUCTION
Automated Machine Learning (AutoML) is becoming a separate, independent sub-field of Machine
Learning that is rapidly rising in attention, importance, and number of applications [23, 35].
The goal of AutoML is to completely automate the application of machine learning, statistical
modeling, data mining, pattern recognition, and all advanced data analytics techniques. As an
end result, AutoML could potentially democratize ML to non-experts (Citizen Data Scientists),
boost the productivity of experts, shield against statistical methodological errors, and even
surpass manual expert analysis performance (e.g., by using meta-level learning [11]). Finally,
AutoML could improve the replicability of analyses and the sharing of results, and facilitate
collaborative analyses.
   To clarify the term AutoML, we consider the minimal requirements to be the ability to return
(a) a predictive model that can be applied to new data, and (b) an estimate of the predictive
performance of that model, given a data source, e.g., a 2-dimensional matrix (tabular data).
Thus, do-it-yourself tools that allow you to graphically construct the analysis pipeline (e.g.
Microsoft’s Azure ML [31]) are excluded. In addition, we distinguish between libraries and
services. The former require coding and typically offer just the minimal requirements, namely
returning a model and a performance estimate. AutoML services, on the other hand, include a user
interface and strive to democratize ML not only to coders, but to anybody with a computer; they
typically offer a much wider range of functionalities.
   Algorithmically, AutoML encompasses techniques regarding hyper-parameter optimization (HPO,
[3, 48]), algorithm selection (CASH, [22]), automatic synthesis of analysis pipelines [36],
performance estimation [53], and meta-level learning [54], to name a few. In addition, an AutoML
system could automate not only the modeling process, but also the steps that come before and
after. Pre-analysis steps include data integration, data preprocessing, data cleaning, and data
engineering (feature construction). Post-analysis steps include interpretation, explanation, and
visualization of the analysis process and the output model, model production, model monitoring,
and model updating. The ideal AutoML system should only require the human to specify the data
source(s), their semantics, and the goal of the analysis to create and maintain a model in
production indefinitely.
   Given the importance and potential of AutoML, several academic and commercial libraries and
services have appeared. The first AutoML system was the academic Gene Expression Model Selector
(GEMS) [46]. Recent works formulate the AutoML problem [56, 57], introduce techniques and
frameworks for creating new AutoML tools [6, 45], survey the existing ones [43, 57]
[26], an interactive environment is proposed, emphasizing user-centric aspects of AutoML.
Moreover, in [49] a brief qualitative evaluation of AutoML services and libraries is presented,
mainly regarding their ML capabilities.
   The contribution of this paper is to provide a user-centric framework for comparing AutoML
services. We define a set of qualitative criteria, spanning six categories (Estimates, Scope,
Productivity, Interpretability, Customizability, and Connectivity), that highlight user
experience beyond predictive performance when selecting or evaluating AutoML services. Using
this framework we evaluated seven such services, namely Auger.AI [2], BigML [4], H2O’s
Driverless AI [19], Darwin [7], Just Add Data Bio [1], RapidMiner [39], and Watson [24]. The
comparison is meant to indicate the strengths, weaknesses, scope, and usability of the services,
indicating the needs each one covers, the tasks it is most appropriate for, and the
opportunities for improvement. To the best of our knowledge, no other survey or benchmarking
paper has proposed the aforementioned qualitative criteria and methodology for evaluating AutoML
services and libraries.

2    AUTOML SERVICES CONSIDERED
In the present evaluation study we consider seven current AutoML service platforms that offer a
free trial version, so that we could base the study on first-hand experience. All of these
services specialize in tabular data, which allowed us to apply the qualitative criteria to all
of them. The evaluation was conducted from 01/12/2019 until 07/12/2019, using the live versions
of the services at the time. In alphabetical order, the services are:

     • Auger.AI [2]: A new service that went live in 2019, Auger.AI claims high accuracy and a
       well-implemented API to help users run experiments with ease.
     • BigML [4]: One of the oldest ML services, BigML supports AutoML tasks and offers extended
       support, a custom programming language and a cloud infrastructure for the
     • Darwin [7]: SparkCognition’s new AutoML service, providing users with convenient tools to
       speed up their ML tasks.
     • Driverless AI (DAI) [19]: One of the most well-known AutoML services, DAI supports
       various ML tasks and also has advanced interpretability mechanisms.
     • Just Add Data Bio (JAD) [1]: JAD was launched in November 2019, focusing on the analysis
       of molecular biological data (small-sample, high-dimensional) with emphasis on feature
       selection.
     • RapidMiner Studio (RM) [39]: The oldest AutoML service used in our evaluation, RM
       provides multiple tools to its users and supports user-created components. We evaluate
       the standard version, without the available user-created add-ons.
     • IBM’s Watson (Watson) [24]: Watson contains multiple components, but here we focus on the
       AutoAI experiment toolkit¹, which is closest to what we define as an AutoML service for
       tabular data.

Due to registration fees, we were not able to include in our benchmark recent services such as
Google AutoML Tables². Regarding Data Robot³, we were not able to obtain the free trial licence
advertised on their website.

¹ https://www.ibm.com/cloud/watson-studio/autoai
² https://cloud.google.com/automl-tables/
³ https://www.datarobot.com/

3    QUALITATIVE CRITERIA
To qualitatively evaluate the seven AutoML services, we present 32 user-centric qualitative
criteria, partitioned into six categories. The Estimates category is concerned with metrics and
estimates’ properties about the predictive power of the final model. The Scope criteria describe
the applicability scope of a service, mainly in terms of data types and ML predictive tasks. The
Productivity category is concerned with the ease of use, while Interpretability is concerned
with the ability to interpret the results of the analysis. The last two categories are
Customizability of the analysis and Connectivity of the service. The criteria are graded on a
4-level scale: F(ail) (✗), C for fulfilling the basic requirements of the criterion, B for
providing additional functionalities, and A for achieving a level that, in our opinion, should
satisfy most users.

3.1     Estimates
Criteria for Estimates (Table 1) concern the wealth and depth of estimated quantities regarding
the predictive model. ROC curves are a useful visualization for interpreting the performance of
a classification model and are widely used by the ML community. We grade with B the services
that output ROC curves (Auger.AI, BigML and RM) and with A the ones which also output
performance metrics for different points on the curve (DAI, JAD and Watson). In addition to the
out-of-sample estimate of predictive performance, a service should be able to report the
uncertainty of this estimate (criterion STD/CI calculation in Table 1, standing for standard
deviation and confidence interval respectively). With B, we grade the services that only
calculate the STD (BigML, DAI and RM) and with A the ones calculating the whole probability
distribution of performance and its confidence intervals, a richer piece of information (JAD).
Regarding Label predictions on new data, the services that support either individual sample
predictions or batch predictions are graded with B (Darwin), and the ones supporting both with A
(the rest of the services). For binary classification tasks, the services able to generate Label
probability estimations get an A (all services except Auger.AI). Overall, JAD has a full score
on all the criteria, followed by DAI and RM.

3.2     Scope
Scope criteria (Table 1) cover the range of input data that can be analyzed. When it comes to
Outcome types, services able to handle binary (classification), multi-class (classification),
continuous (regression) and censored time-to-event outcomes (survival analysis) score A (JAD),
while the ones not handling survival analyses score B (the rest of the services). Regarding
Predictor types, the services which support all the standard tabular data and also text or
time-series data are graded with A (all services except for JAD), while the ones only supporting
the former with B (JAD). The term Clustered data (not to be confused with clustering of data) in
statistics refers to samples that are naturally grouped in clusters (or groups) of samples that
may be correlated given the predictors. Examples include matched case-control data in medicine
and repeated measurements taken on the same subject or client. With A, we grade the services
able to handle clustered data (DAI and JAD). It is important to mention the absence of
                                                     Table 1: Estimates and Scope criteria.

                                       Criteria                Auger.AI   BigML   DAI    Darwin     JAD    RM     Watson
                                       ROC curves                 B        B       A        ✗        A      B        A
                                   STD/CI calculation             ✗        B       B        ✗        A      B        ✗
                                     Label predictions            A        A       A        B        A      A        A
                               Label probability estimations      ✗        A       A        A        A      A        A
                                     Outcome types                B        B       B        B        A      B        B
                                     Predictor types              A        A       A        A        B      A        A

                                 Clustered data handling          ✗        ✗       A        ✗        A      ✗        ✗
                                 Missing values handling          A        A       A        A        A      A        A
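
The STD/CI calculation criterion of Table 1 asks a service to report the uncertainty of its performance estimate. As an illustration only, the sketch below computes a standard deviation and a percentile confidence interval for accuracy by bootstrapping out-of-sample predictions. It is not the method of any evaluated service; the function name, its parameters and the toy labels are assumptions made for this example. The resampling also assumes i.i.d. samples, so it would not be appropriate for clustered data as defined in Section 3.2.

```python
import random
import statistics


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (accuracy, std, (lo, hi)) using a percentile bootstrap.

    Hypothetical helper for illustration; assumes i.i.d. samples.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(y_true)
    # 1 where the out-of-sample prediction is correct, 0 otherwise.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point = sum(correct) / n
    # Resample the correctness indicators with replacement and
    # recompute accuracy on every bootstrap replicate.
    boot = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    std = statistics.stdev(boot)
    lo = boot[int((alpha / 2) * (n_boot - 1))]
    hi = boot[int((1 - alpha / 2) * (n_boot - 1))]
    return point, std, (lo, hi)


if __name__ == "__main__":
    # Toy labels, invented for the example.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 5
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1] * 5
    acc, std, (lo, hi) = bootstrap_accuracy_ci(y_true, y_pred)
    print(f"accuracy={acc:.2f} std={std:.3f} 95% CI=[{lo:.2f}, {hi:.2f}]")
```

In the terms of Table 1, a service that reports only `std` would correspond to grade B, while one that exposes the whole bootstrap distribution and its interval would correspond to grade A.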

clustered data and repeated measurements handling from most of the services. Essentially, most
services assume independently and identically distributed (i.i.d.) data, which reduces their
scope. Finally, we grade a service’s ability to handle missing data with A (all services). In
this category, DAI and JAD lead with the highest score.

3.3    Productivity
The Productivity criteria (Table 2) concern the ease of use and the boost of user productivity.
We start off with the Data manipulation functionalities available to prepare and manipulate the
input data before analysis. Grade B goes to the services providing the user with custom data
partitioning and preprocessing recommendations (DAI and Darwin) and grade A to the services that
additionally provide data merging, filtering and sub-sampling (BigML, JAD, RM, Watson). About
Pipeline automation, the services where the best model is automatically selected according to
pre-specified user preferences (e.g., maximize AUC) score A (DAI, Darwin, JAD and Watson). The
services that instead produce a ranking of all tried models and require the user to select the
one that best satisfies their criteria score B (Auger.AI, BigML and RM). On the one hand,
ranking all the models arguably provides richer information to the user; on the other, it
reduces automation and could confuse the non-expert. So, our grading in this criterion is
admittedly subjective. We next grade the ability to Early stop or pause an analysis. The
services able to do both score A (RM) and, in case they have implemented one but not the other,
they score B (the rest of the services, except BigML, which Table 2 marks as failing this
criterion). When it comes to Collaboration features, we grade a service with A if it has
implemented mechanisms to create custom organizations and teams to allow sharing of resources,
such as data and analyses (all services except DAI and Darwin). Lastly, about Documentation and
support, the services providing e-mail support score C (JAD). If they also deliver extensive
documentation to the user, they score B (Auger.AI and Darwin) and when they additionally have
direct technical support and user forums, their score is A (BigML, DAI, RM and Watson). In
general, Productivity is a category emphasized by all services, making it relatively
straightforward for any user to complete an ML analysis.

3.4    Interpretability
Interpretability (Table 2) is arguably one of the most important categories for selecting an
AutoML service [32]. The criteria concern (a) exploring and visualizing the data (Data
visualization) before conducting the analysis; (b) monitoring the execution of the analysis
(Progress report); (c) understanding and interpreting how the final model functions (Final model
interpretation), where a particular means of understanding the results is Feature selection,
which deserves its own criterion, along with the available mechanisms for the Final feature set
interpretation; and (d) understanding and validating the process that took place during the
analysis (Analysis exploration). Regarding Data visualizations prior to the analysis, a service
which only provides histograms scores C (JAD). If it also implements correlation plots and data
heatmaps, its score is B (BigML). The services with more options get A (DAI, RM and Watson).
During the analysis (Progress report), if a service only reports the completion percentage, it
gets the grade C (Darwin). When it additionally shows a performance estimate of the best model
and keeps track of the analysis procedure, its grade is B (BigML and JAD). The highest grade (A)
goes to the services that also show variable importance rankings, a ranking of the generated
models and hardware usage (Auger.AI, DAI, RM and Watson).
    Once the analysis is complete, the AutoML service should be able to explain how the final
model works. This adds transparency to the model and pinpoints possible flaws or bias in its
decision making, making it more trustworthy. The interpretability of results is a subdomain of
ML with increasing popularity, and every year multiple new mechanisms are introduced [9, 33]. We
have selected a set of such mechanisms and grade the AutoML services based on how many of them
they have implemented. The mechanisms are: a) the confusion matrix, which is created based on
the predictions made during the training phase, to help the user understand what type of errors
are produced by the final model; b) a report of the performance of the final model using
multiple performance metrics; c) residuals visualization, i.e. the difference between observed
and predicted values of the data; d) a PCA procedure [44] to highlight strong patterns of the
data and visualize them in a 2-D space; e) visualization of the final model, when this is
possible; f) techniques to explain the predictions in case of a complex final model (e.g.
LIME-SUP [21], K-LIME, a variant of LIME [40], decision tree surrogate models [8], etc.). When
the service has implemented at least 2 of the above mechanisms, its corresponding grade is C
(Darwin and Watson), while for a service with more than 2 available mechanisms, its grade is B
(Auger.AI, BigML, RM). The grade A is reserved for the services with more than 4 of the
aforementioned mechanisms implemented (DAI and JAD).
    Feature selection is often the primary goal of an analysis. It leads to simpler models that
require fewer measurements to provide a prediction, which may be important in several
applications. Most importantly, however, feature selection is used as a tool for knowledge
discovery [28] to gain intuition and insight into the
                                      Table 2: Productivity and Interpretability criteria. ✜: only for certain models

                                              Criteria                  Auger.AI   BigML    DAI    Darwin     JAD    RM     Watson
                                          Data manipulation                ✗        A        B        B        A      A        A

                                         Pipeline automation               B        B        A        A        A      B        A
                                         Early stop or pause               B        ✗        B        B        B      A        B
                                        Collaboration features             A        A        ✗        ✗        A      A        A
                                      Documentation and support            B        A        A        B        C      A        A
                                            Data visualization             ✗        B        A        ✗        C      A        A

                                             Progress report              A         B        A        C        B      A        A
                                       Final model interpretation          B        B        A        C        A      B        C
                                            Feature selection              ✗        C        C        ✗        A      B        ✗
                                     Final feature set interpretation      C        B        A        C        A      B        C
                                          Analysis exploration            A✜        B        B        ✗        B      A        A
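
The Final model interpretation grade in Table 2 follows a counting rule stated in Section 3.4: at least 2 of the six listed mechanisms give C, more than 2 give B, and more than 4 give A. A minimal sketch of that rule is given below. The mechanism identifiers and function name are hypothetical labels chosen for this example, and mapping fewer than 2 mechanisms to a fail (✗) is our reading of the text, which does not state that case explicitly.

```python
# Hypothetical identifiers for the six mechanisms a) to f) of Section 3.4.
MECHANISMS = frozenset({
    "confusion_matrix",
    "multiple_performance_metrics",
    "residuals_visualization",
    "pca_projection",
    "final_model_visualization",
    "prediction_explanations",
})


def final_model_interpretation_grade(implemented):
    """Map the set of implemented mechanisms to a grade (A, B, C or ✗)."""
    unknown = set(implemented) - MECHANISMS
    if unknown:
        raise ValueError(f"unknown mechanisms: {sorted(unknown)}")
    n = len(set(implemented))
    if n > 4:       # more than 4 mechanisms
        return "A"
    if n > 2:       # more than 2 mechanisms (3 or 4)
        return "B"
    if n >= 2:      # at least 2 mechanisms (exactly 2)
        return "C"
    return "✗"      # assumption: fewer than 2 fails the criterion


if __name__ == "__main__":
    # An invented service, not one of the seven evaluated ones.
    example = {"confusion_matrix", "multiple_performance_metrics",
               "residuals_visualization"}
    print(final_model_interpretation_grade(example))
```

Written this way, the boundaries become explicit: a service moves from C to B with its third mechanism and from B to A with its fifth.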

problem (hence, its inclusion in the interpretability category). A pharmacologist is not only
interested in predicting cancer metastasis but also in the molecules involved in the prediction,
in order to identify drug targets; a business person is interested in the quantities that affect
customer attrition, in order to devise new promotions and advertisements. Such reasoning is
theoretically supported by the fact that feature selection has been connected to the causal
mechanisms that generate the data [50]. It is defined as the problem of identifying a
minimal-size feature subset that jointly (multivariately) leads to an optimal prediction model
(see [17] for a formal definition). Thus, feature selection removes not only irrelevant, but
also redundant features. In some data distributions, there may be multiple solutions to the
feature selection problem. For example, due to low sample size the truly best feature subset may
be statistically indistinguishable from slightly sub-optimal feature subsets. Or, it could be
the case that there is informational redundancy that leads to feature subsets that are equally
predictive. While all solutions are equivalent in terms of predictive performance, returning all
solutions is important when feature selection is used as a tool for knowledge discovery.
    The services which offer single feature selection functionality score C (BigML and DAI).
BigML treats feature selection as a preprocessing step, before the modeling process and the
performance estimation protocol. This approach is methodologically wrong and leads to
overestimating performance (see [20], page 245). There are different notions of multiple feature
selection. When a service returns several feature subsets as options, but does not provide any
theoretical guarantees of statistical equivalence, its grade is B (RM). On the other hand, when
a service returns several feature subsets that lead to models with performance statistically
indistinguishable from the optimal, its grade is A (JAD). Feature selection by itself is not
enough. The services should also provide users with mechanisms for interpreting and
understanding how each feature in the final set affects and contributes to the decision making
of the final model. We base our grading on a set of Final feature set interpretation mechanisms
and how many of them each AutoML service has implemented. The mechanisms are: a) random forest
feature importance ranking of the participating features [5]; b) LOCO feature importance [27];
c) partial dependence plots (PDPs) [12]; d) SHAP plots [29]; e) ICE plots [15]; f) a report of
the standardized individual and cumulative importance of the participating features; g) the
actual standardized coefficient for each feature, in the case of a linear final model; h)
information about the resulting feature sets, in the case of multiple feature selection. A
service that has implemented at least 1 of these mechanisms is graded with C (Auger.AI, Darwin
and Watson). If more than 2 mechanisms are available, the service’s grade is B (BigML, RM) and
the grade A is reserved for the services with 4 or more mechanisms (DAI, JAD).
    Expert analysts would often like to verify the correctness and completeness of the analysis
that took place. It is not only the results (model) that should not be treated as a black box,
but also how these results were obtained. A service which displays an Analysis exploration graph
to help the users understand the methods used in each step scores A (Auger.AI, RM and Watson).
If the service displays all pipelines that were tried in the form of a list instead of a graph,
its score is B (BigML, DAI and JAD). When it comes to analysis interpretation, DAI and JAD seem
to be the best choice, providing the user with advanced mechanisms for understanding the final
results. Some services do not provide any information about which analysis pipelines they tried;
the analysis process is essentially a black box to the user. We note that, in our opinion, there
is room for improvement regarding interpretability for most of the services.

3.5    Customizability
The Customizability category (Table 3) grades the ability of the services to customize the
analysis according to user choices and preferences. About Time budget, we grade with B the
services giving the ability to impose a non-strict time limit on an analysis (Auger.AI, BigML
and JAD) and with A the ones which allow setting a strict time limit (DAI and Darwin). Our take
on this subject is that every service should give the ability to set a strict time budget, as an
analysis can be part of a bigger project running under specific time restrictions. Moving to the
hardware Resources budget, if a service allows the user to select a preset hardware
configuration, it scores B (Watson) and if it allows setting up the exact hardware
specifications, A (DAI and JAD). Next, we consider the Customization of analysis components,
i.e. the ability to choose the methods and algorithms to try, along with their hyperparameters,
in each step of the ML pipeline. If the user is able to fully customize the included components,
the service gets A (Auger.AI, BigML, DAI and RM). If the service provides the user with a
limited set of settings, it gets B (Darwin, JAD and Watson).
   A service that allows the user to Enforce final model interpretability is graded with B (JAD)
and, if it provides additional interpretability settings, with A (DAI). Another customization
                                 Table 3: Customizability and Connectivity criteria. ✧: for RM server, not RM studio

                                           Criteria                  Auger.AI    BigML     DAI    Darwin     JAD     RM    Watson
                                            Time budget                 B           B       A        A         B      ✗        ✗

                                          Resources budget              ✗           ✗       A        ✗         A      ✗        B
                                 Analysis components customization      A           A       A        B         B      A        B
                                   Enforce Model Interpretability       ✗           ✗       A        ✗         B      ✗        ✗
                                      Feature selection options         ✗           A       A        ✗         A      B        ✗
                                    Visualizations customization        ✗           A       B        ✗         ✗      A        A
                                        Service deployment              ✗           A       A        ✗         ✗     A✧        ✗
                                   3rd party storage connection         A           A       A        ✗         ✗      A        A

                                            API access                  A           A       A        A         A      A        A
                                       Downloadable results             A           A       A        ✗         B      A        B
                                 Analysis components contribution       B           A       A        ✗         ✗      A        B
                                        Model deployment                A           A       A        A         ✗      A        A
                                    Visualizations exportability        ✗           B       B        ✗         B      A        A
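
The statement in Section 3.5 that DAI has a clear edge in customizability can be checked against Table 3 with a simple tally. The sketch below transcribes the six Customizability rows of the table (the split between Customizability and Connectivity rows follows Sections 3.5 and 3.6). The numeric weights (A = 2, B = 1, ✗ = 0) are an assumption made purely for illustration; the grading in this paper is qualitative and defines no such scale.

```python
# Customizability rows of Table 3, transcribed in the table's column order.
SERVICES = ["Auger.AI", "BigML", "DAI", "Darwin", "JAD", "RM", "Watson"]

CUSTOMIZABILITY = {
    "Time budget":                       ["B", "B", "A", "A", "B", "✗", "✗"],
    "Resources budget":                  ["✗", "✗", "A", "✗", "A", "✗", "B"],
    "Analysis components customization": ["A", "A", "A", "B", "B", "A", "B"],
    "Enforce Model Interpretability":    ["✗", "✗", "A", "✗", "B", "✗", "✗"],
    "Feature selection options":         ["✗", "A", "A", "✗", "A", "B", "✗"],
    "Visualizations customization":      ["✗", "A", "B", "✗", "✗", "A", "A"],
}

# ASSUMPTION: illustrative weights only; the paper does not assign numbers to grades.
POINTS = {"A": 2, "B": 1, "✗": 0}


def tally(rows, services=SERVICES, points=POINTS):
    """Sum the weighted grades of each service over the given criteria rows."""
    totals = {service: 0 for service in services}
    for grades in rows.values():
        for service, grade in zip(services, grades):
            totals[service] += points[grade]
    return totals


totals = tally(CUSTOMIZABILITY)
# Print services from the highest to the lowest illustrative score.
for service, score in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{service:10s} {score}")
```

Under these assumed weights DAI obtains the highest total, in line with the qualitative reading of the table; a different weighting could change the order of the remaining services.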

criterion is about the available Feature selection options. If the AutoML service allows the user to
set the exact number of features to select, it is graded with A (BigML, DAI and JAD), and if it
allows the user to set certain parameters, such as the effort put into feature selection, with B (RM).
Finally, we also consider the Visualizations customization options. When a service gives the user the
ability to set user-specific thresholds on certain visualizations, its grade is B (DAI). If the user
can fully customize the resulting visualizations (e.g. changing the axes, titles, legend, colors), the
service's grade is A (BigML, RM and Watson). In general, when it comes to customizability, DAI has a
clear edge over the competition, giving users options to fine-tune and set up an analysis according to
their needs. We distinguish two different schools of thought in this category. On the one hand,
services such as DAI let the user fully customize the algorithms and hyperparameter values to search
during an analysis. On the other hand, services like JAD provide the user with a few preference
choices that do not require expert knowledge of ML. The first approach empowers an expert analyst, but
it may be intimidating to the non-expert user. There is a fine line between providing enough choices
for an expert to fully customize an analysis and achieve better results, and providing so many choices
that the process becomes complex and easy to break. For this reason, we recommend equipping AutoML
services with some kind of warning system that can detect when the selected setup might create
problems and notify the user accordingly.

3.6    Connectivity
The Connectivity criteria (Table 3) grade the options offered to connect a service with external tools
and resources. First, regarding the Service's deployment at an external infrastructure, the services
supporting it score A (BigML, DAI and RM). The ones able to Connect to 3rd party storage providers
also get an A (all except Darwin and JAD). Furthermore, all services have implemented their own API
(grade A). We also look into the Downloadable results options. Where only part of the results is
downloadable, the services are graded with B (JAD and Watson), while the ones allowing the user to
download all the results and also generate a summary report are graded with A (all services except
JAD and Watson). A user might be interested in Adding custom components to the AutoML service. If the
user is allowed to add components through a service's API, the service is graded with B (Auger.AI and
Watson). If the service has moreover implemented a complete system for user-defined components, by
creating its own marketplace or extensions library, its grade is A (BigML, DAI and RM). Creating the
best final model does not always suffice, as the user will probably want to deploy it in an external
service and use it for predictions on new data. Most of the participating services have added various
model deployment options (grade A) (all except JAD). The currently implemented ideas are to use data
transfer libraries, e.g. cURL (Auger.AI, Watson), or to create actionable models (BigML, Darwin, RM)
or scoring pipelines (DAI). All of the above provide the same functionality: predicting labels on new
unseen data. Finally, when writing reports or papers with the results, the visualizations need to be
exported. The services which provide fewer than 3 export options score B (BigML, DAI and JAD) and
those with more score A (RM and Watson). Looking at the participating services, most of them cover the
majority of the proposed criteria. The export formats available for data visualizations are static in
all tools, an area that could be greatly improved. Additionally, we consider the lack of connections
to public repositories, such as OpenML [52], an important gap, as such connections can be useful to a
user who is interested in conducting ML analyses for academic reasons.

4   LIMITATIONS AND DISCUSSION
Admittedly, the current study has several limitations. We take the opportunity to discuss some in
depth, pointing to important open issues and future work. First of all, we were not able to evaluate
every known AutoML service.

Estimates: While all services provide estimated quantities from the data, the major question remains:
are the estimates returned correct and reliable? Statistical estimations are particularly challenging
with small sample sizes; even more so with high dimensional data. Is performance overestimated,
standard deviations underestimated, probabilities of individual predictions uncalibrated, feature
importances inaccurate, or multiple feature subsets returned not statistically equivalent? Which
AutoML services return reliable results one can trust, and which ones are actually misleading the user
and potentially harmful? In medical applications, overestimating performance or confidence in a
prediction (uncalibrated predicted probabilities) is dangerous and could impact human health, while in
business applications it may have significant monetary costs. Such questions require
significant experimentation with all services to answer. Experimentation should be performed on
datasets with a wide range of characteristics, e.g., sample size, number of features, percentage of
missing values, mixture of types of predictors (continuous, discrete, ordinal, zero-inflated, etc.),
and outcomes, to provide a full quantitative picture of the pros and cons of each service and its
correctness properties. Unfortunately, most quantitative evaluations are currently performed on
datasets with a limited range of such characteristics or are restricted by time limitations.

Scope: In this paper, we are only concerned with predictive modeling (supervised learning) tasks and
not other ML categories. Each different task would require a separate set of criteria that applies to
it. We do note, however, that BigML, DAI, RM, and Watson also support clustering, anomaly detection,
and some NLP tasks, which are useful to numerous users. A major limitation of our scope grading is
that it misses important criteria concerning the maximum volume of data a service can handle within
reasonable time or memory resources, in terms of the number of features, the number of samples, or
their combination (total volume). Unfortunately, we are not able to test the limits of each service,
as we are confined to analyses that run on the free trial versions. However, regarding scalability
with respect to feature size, we note that almost all services have difficulty scaling to thousands of
features. JAD, on the other hand, was created to scale up to the feature size of typical multi-omics
datasets, which can reach hundreds of thousands of features.

Productivity/Interpretability: Although we presented a first qualitative assessment, a true measure
of productivity increase requires an extensive user study with representative datasets spanning a wide
range of characteristics (in terms of the number of features and samples). In such a user study, one
should measure how much productivity has improved over manual scripting, possibly at the cost of some
learning performance, and how much insight has been gained from the interpretation tools offered by
each service. To assess how an AutoML tool performs against human experts, Kaggle and other ML
competitions could be exploited. As data and tasks are specific to a competition problem, solutions by
human experts usually take the top positions, as they apply domain-specific knowledge and sometimes
create custom methods and mechanisms to help them win these competitions. Still, AutoML tools that
have been tested on such tasks achieve comparable performance. AutoML tools are becoming more and more
sophisticated, automating an increasing number of tasks in ML pipelines (e.g., feature engineering)
while supporting meta-level learning techniques. This can help to minimize the gap between human
experts and AutoML in competitive environments [45] and aid in producing high quality ML models for
both commercial and academic purposes.

There are several other criteria categories that are missing from the present methodology, due to
space limitations. These include model monitoring and maintenance, i.e., functionalities for
maintaining a model in production [30], such as monitoring the health of the production model, raising
alarms when there is a drift in the data distribution, automatically re-training and updating the
model, and others. As ML systems move from computer-science laboratories into the open world, their
accountability [13] and auditing [10] become a high priority problem. In this respect, we need a deep
understanding of the ML system behavior of system performance that hide important shortcomings.
Understanding details about failures is important for finding ways to improve, for communicating the
reliability of systems in different settings, and for specifying appropriate human oversight and
engagement [34].

Finally, we would like to mention that each category could be expanded with many more criteria. Only
the criteria that were addressed by at least one of the services were included. Functionalities that
were not addressed by any of the services examined are missing. One example is the ability to handle
continuous signals and streaming data [38].

5    CONCLUSION
AutoML has made tremendous progress since its first embodiment in the GEMS system. Several AutoML
services are already available, routinely analyzing business and scientific data for thousands of
users. They do increase productivity and allow non-experts to perform sophisticated ML analyses. Our
prediction is that within a few years, most data analysis will involve the use of an AutoML service or
library; scripting as a means of manual ML analysis will gradually become obsolete or pass to the next
level, where it consists of customizing and invoking AutoML functionalities.

The proposed criteria are intended to turn the spotlight back onto the human user. Users do not only
consider learning performance when choosing a service. They also consider a plethora of other
criteria, such as the ones presented. One of the most important is the interpretability of results.
Users are rarely satisfied with just a predictive model; they also seek to understand the patterns in
their data. Thus, results should not be a black box, but explained, visualized, and interpreted. Users
need to examine the analysis process and ensure its correctness or optimality: AutoML should automate,
not obfuscate. The analysis process should be transparent, verifiable, and customizable by the user.
Some of the AutoML services examined clearly abide by these principles, but others fall short on this
set of criteria. Arguably, it is interpretation of results and ease-of-use that will determine the
success of an AutoML service, and not necessarily predictive performance.

Current AutoML systems mostly focus on tabular, iid-sampled data. However, most of the world's data is
neither in this format nor sampled iid. Ultimately, AutoML competes with the human expert not only in
learning performance but also in scope and the range of problems it can handle. There are ongoing
efforts to develop AutoML solutions for regression or anomaly detection tasks in time-series,
time-course data, and streaming data (e.g., Microsoft Azure [31], Yahoo EGADS [25], Facebook Prophet
[47]), or to generate features from relational tables or CSV/JSON files [16]. Future AutoML systems
should also automate more data preparation tasks, including data cleaning (e.g. error correction and
deduplication) [41], and support ML tasks such as reinforcement, transfer and federated learning, or
causal modeling [37], to name a few. Still, interpreting the results of the analysis in each category
is quite challenging and probably requires a different, specialized set of methods. There is a long
road ahead, where ML is entering a new generation of systems and algorithms, but an exciting road
indeed.
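
One of the reliability questions raised under Estimates in Section 4, namely whether the predicted probabilities of a service are calibrated, can be probed by the user on held-out data whenever a service lets them download per-sample predictions. The following is a minimal sketch, assuming a binary outcome and predictions available as plain Python lists. The function name, the number of bins and the toy data are illustrative assumptions and are not part of any of the examined services. It computes a binned expected calibration error: samples are grouped by predicted probability, and within each group the mean predicted probability is compared with the observed frequency of the positive class.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned calibration error for binary outcomes.

    y_true: list of 0/1 labels.
    y_prob: list of predicted probabilities of the positive class, in [0, 1].
    n_bins: number of equal-width probability bins (illustrative default).
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")

    # Assign every sample to an equal-width bin; p == 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))

    n_samples = len(y_true)
    error = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        # Weight each bin's gap by the fraction of samples it contains.
        error += len(members) / n_samples * abs(observed - predicted)
    return error


# Toy data (made up for illustration): three of four positives at p = 0.75.
calibrated = expected_calibration_error([1, 1, 1, 0], [0.75] * 4)
# Toy data: only half are positive although the model claims p = 0.95.
overconfident = expected_calibration_error([1, 0, 1, 0], [0.95] * 4)
print(calibrated, overconfident)
```

A value close to zero indicates that predicted probabilities match observed frequencies on the given sample; larger values indicate over- or under-confidence. With few samples the estimate itself is noisy, which is precisely the low-sample difficulty noted in Section 4.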
