             Enhancing the friendliness of data analytics tasks: an
                          automated methodology
                                 Paolo Bethaz                                                                Tania Cerquitelli
       Department of Control and Computer Engineering,                                  Department of Control and Computer Engineering,
              Politecnico di Torino, Turin, Italy                                              Politecnico di Torino, Turin, Italy
                    paolo.bethaz@polito.it                                                         tania.cerquitelli@polito.it




                                                    Figure 1: Automated Data Exploration Workflow
ABSTRACT
This paper presents ADESCA (Automated Data Exploration with Storytelling CApability), a new methodology to automatically extract models and structures from data, with the aim of democratizing data science. Particular importance was placed on the creation of an innovative storyboard for automated data analytics, allowing the creation and visualization of data stories. In addition, ADESCA also offers a guided, more traditional step-by-step exploration, where the algorithm parameters are configured automatically and the results are shown graphically through different plots so as to be as understandable and user-friendly as possible. Preliminary experimental results are promising and demonstrate the effectiveness and efficiency of the proposed automated methodology.

1 INTRODUCTION
In today's world a vast amount of data is available for analysis, and extracting useful insights from these data has become a fundamental operation in many sectors. However, exploring these data requires the intervention of a data scientist with great expertise in the field, who often devotes a long time to finding a good combination of data-driven algorithms able to discover interesting knowledge items in the data. The purpose of our research activity is to democratize data science by designing and developing a new data-analytics engine that provides self-learning capabilities jointly with parameter auto-selection, off-loading the data scientist from testing and tuning so that s/he can easily focus her/his attention only on the most interesting and valuable insights hidden in her/his data.
Some research efforts have already been made to move the data exploration process towards automated approaches. Authors in [14] re-examined the architecture of database management systems, as in the DICE system, to combine sampling with speculative execution in the same engine [15]. Differently from the above approaches, we did not focus on the architecture but on the methodological approach, like the authors in [17, 32]. Specifically, in [32] the authors proposed Helix, a declarative machine learning system that seeks to optimize end-to-end workflow execution across iterations, while in [17] the READ engine has been discussed to reduce the time required for data exploration.
Regarding the democratization of data science, the works described in [9] and [2] are of considerable importance, as the authors propose to assist "data enthusiastic" users [12, 21] in the interactive exploration of transactional databases. However, these research activities focus on query-oriented techniques, while our purpose is to enhance the friendliness of the knowledge-extraction approach.
There is also a series of tools that try to make data exploration automatic and more graphically intuitive. However, tools like DataWrapper1, Datapine2 or ZenVisage [26] only offer a traditional visualization service, where the user can see the distribution of each attribute after choosing which type of chart to use among those proposed.
Other tools, in addition to the graphic display of the attributes, also make it possible to perform some analyses on the data. This is, for example, the case of AnswerMiner3 (which allows building decision trees on the data), Dive [13] (which allows, for example, performing the ANOVA test) and NCSS4 (which allows performing the ANOVA test, multivariate analysis and a very simplified cluster analysis).

1 https://github.com/datawrapper/datawrapper
2 https://www.datapine.com/
3 https://www.answerminer.com/
4 https://www.ncss.com/

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

Taking a cue from the works listed so far, in this paper we propose an exploratory data methodology, named ADESCA (Automated Data Exploration with Storytelling CApability), that is able to automatically extract models and structures from the data without the intervention of a data scientist. In particular, we propose a new web-based system that automatically combines state-of-the-art data exploration approaches into a single tool to automatically address all the steps of the traditional knowledge
data discovery pipeline, adding some innovative new steps compared to the traditional approaches. The innovative steps include: (i) an automated data transformation strategy, (ii) self-tuning strategies tailored to each data analytics algorithm, and (iii) a data storytelling methodology. The results of each of these techniques are shown to the user in a very intuitive way, thanks to user-friendly graphs that effectively support the decision-making process.
The automated data transformation section modifies the structure of the original dataset, presenting the data in a different way in order to extract information that is hidden or absent in the original structure. The self-tuning strategies applied to supervised and unsupervised techniques relieve the data scientist from manually selecting the best algorithm and its input-parameter setting. Finally, data-driven storytelling is a dedicated section of the implemented tool in which the most important information and results of all the analyses performed on the data are shown in a completely automatic way, thanks to an effective display system based on multiple web pages, each containing the visual results of different analyses.
The paper is organized as follows. Section 2 details the proposed automated methodology, introducing the various innovative techniques integrated in the system. Section 3 presents preliminary development details, while Section 4 discusses the preliminary experimental results obtained on two real datasets. Finally, Section 5 discusses open issues and presents future developments of this work.

2 METHODOLOGY
The main purpose of the ADESCA methodology is to create automated data stories capable of summarizing the most important information of a dataset; at the same time, ADESCA also allows a step-by-step exploration of the knowledge data discovery (KDD) pipeline shown in Figure 1. In the latter case, the user explores the data, but s/he does not necessarily need to know the algorithms to be applied, nor to choose the best values for the algorithm parameters.
The first step in the workflow shown in Figure 1 is a general characterization of the data, where some statistics and graphs concerning the various attributes of the dataset are computed (e.g. correlation matrix, boxplot and Cumulative Distribution Function). Then, to make the data ready for subsequent analysis, ADESCA removes the outliers in the dataset under analysis, identifying them with DBSCAN clustering [28] (if there is no class label in the original dataset) or through boxplot analysis (if a label is present). In the latter case, ADESCA looks for the attributes that contribute more than the others to determining the final label. To find these attributes, a univariate ANOVA test [20] is performed on each of them, assigning each attribute a p-value. Each attribute associated with a p-value below 0.05 is considered significant, and for each of these, outlier removal is performed by analyzing its boxplot representation [8] and eliminating each point below the minimum (first quartile minus 1.5 times the interquartile range) or above the maximum (third quartile plus 1.5 times the interquartile range). The outlier points found are shown to the user in a very intuitive way, thanks to an interactive 3D scatter chart that represents all the points of the analyzed dataset according to its 3 most significant dimensions (found by applying PCA [31]), where all the points labeled as 'outliers' are shown in red and the other points in green.
But the most innovative aspects that ADESCA offers include the Automated Data Exploration section, the self-tuning strategy and the Data Storytelling section, described below.

2.1 Automated Data Transformation
Data transformation plays a key role in the KDD process applied to any data type. Furthermore, the more complex the dataset, the greater the benefit that can be obtained by applying a data transformation technique. ADESCA integrates common transformation techniques such as standardization and One-Hot Encoding [23], but also more innovative transformations based on pivot tables [16].
As a first attempt at this last type of transformation, we build on the idea that by presenting the same data in a different structure than the original one, insights that were barely visible before may become clearer. We preliminarily applied the proposed strategy only to those datasets in which temporal information exists (e.g. date, month, year, etc.). Further data transformation techniques will be integrated later in ADESCA to deal with complex and heterogeneous data. The proposed transformation strategy integrated in ADESCA consists in considering only two of the original attributes and reporting the original data in the form of frequency values relative to the chosen attributes, as shown in Figure 2, where the original clinical dataset on the left is transformed into a new dataset that reports in each row the history of the clinical exams carried out by each patient. The KDD process applied to the new dataset may lead to better insights from the data. For example, interesting groups containing patients with a similar examination history (with standard or more specific examinations) might be discovered through an unsupervised learning methodology.

Figure 2: Example of the proposed transformation. The original dataset on the left is transformed according to its two attributes 'Id_Patient' and 'Cod.Exam'

Moreover, the results of the transformations are not all shown: ADESCA carries out analyses and suggests only the transformations for which the results of these analyses are better, providing more interesting knowledge. The proposed analysis to identify the best transformations consists in: (i) considering each pair of attributes, merged together into a new single column, (ii) calculating the Average Frequency of each of these new columns, and (iii) choosing only the transformations in which this value is between 1 and a threshold ∆, found empirically by applying these transformations to different datasets. Each transformation obtained is shown in a very user-friendly page that contains the schema of the new transformed dataset, which the user can download entirely in CSV format with a click. Then, to provide a graphic idea of the transformation, two interactive 3D scatter charts are shown, representing the distribution
of the dataset before and after the transformation. The user can click on these graphs, enlarging and rotating them at any angle, allowing an effective comparison between them.

2.2 Data modeling through self-tuning unsupervised and supervised algorithms
A key point of the whole KDD process is that the results of any analysis are computed and shown graphically in an automatic way, without the user having to interact with the system by choosing the most suitable algorithm and its parameter values. To guarantee the automatic nature of the process, we have proposed an ad-hoc methodology for the automatic configuration of the parameters of each algorithm used. These methodologies are particularly relevant for cluster analysis and for supervised learning techniques, where the algorithms require parameters that greatly affect the quality of the results and must therefore be chosen appropriately.

2.2.1 Self-Tuning Unsupervised Learning. As a first attempt, unsupervised learning in ADESCA includes cluster analysis performed through three different algorithms, K-means (which requires k as a parameter), DBSCAN (which requires MinPoints and Eps as parameters) and Hierarchical clustering (which requires the number of clusters as a parameter) [28], properly enriched with a self-tuning strategy to automatically identify the best input parameter setting. The results of each clustering algorithm are shown to the user through a clickable 3D scatter chart, in which each cluster is highlighted with a different color, as shown in Figure 3. The same colors are also used in a pie chart, which shows how many elements belong to each cluster and the respective percentages. If the algorithm used implies the concept of centroids, these are also reported as red points in the scatter chart. By clicking on one of these points, the radar chart of the centroid is shown, which jointly represents the value of each component of the centroid on different Cartesian axes having the same origin.
Let us now see in detail the proposed self-tuning strategies that automatically choose the input parameters of each algorithm integrated in ADESCA.

Figure 3: The algorithm used and the values of the parameters chosen automatically by ADESCA are shown in the first box. The user can then manually change the values of these parameters. The second box contains the pie chart with the same colors as the scatter chart, while in the circle on the right there is the radar chart obtained by clicking on the centroid of cluster 2

K-means. The choice of the parameter k is fundamental in the k-means algorithm, because it determines the number of clusters
that will be obtained in the final partition. To look for the best solution, ADESCA computes k-means for each value of k in the interval [1, φ] and, for each of these solutions, calculates the SSE (Sum of Squared Errors). Plotting these results in decreasing order of SSE, ADESCA looks for the elbow of the resulting curve, that is, the point where increasing k causes only a very small decrease in SSE, while decreasing k sharply increases it. To calculate the elbow point automatically [27], we draw a straight line between the first and the last point of the curve and look for the point of the curve with the greatest distance from that line. To give the user a graphical view of how this parameter has been chosen, the elbow graph is plotted next to the scatter chart, highlighting the elbow point chosen as the value of k.
DBSCAN. This algorithm requires as parameters a radius Eps and a number of points MinPoints. To automatically choose the MinPoints value, ADESCA calculates the k-dist graph for each value of k in a range between 2 and φ1. Starting with k=2 and increasing this value by 1 each time, we measure the difference between the k-dist graph and the (k+1)-dist graph using the MAPE percentage error [10]. If this percentage is less than a pre-set threshold τ, we stop and choose the current k as MinPoints.
Using the value of MinPoints (k) just found, Eps is then obtained by calculating the elbow point of the sorted k-dist graph, with the same technique used for k-means. To make the choice of these parameters more meaningful to the user, the sorted k-dist graph, with the elbow point highlighted, is plotted next to the 3D scatter chart.
Hierarchical Clustering. This algorithm requires as a parameter the number of clusters in the final result. To find the most suitable value for this parameter, we use the Silhouette index [25], calculating the Average Silhouette of each solution with a number of clusters between 2 and a pre-set threshold φ2. The number of clusters chosen as the parameter is the one associated with the highest Average Silhouette.
The result of Hierarchical clustering is then visualized in a very user-friendly dendrogram, a tree-like diagram that effectively shows how the various clusters have been grouped together.

2.2.2 Self-Tuning Supervised Learning. The self-tuning approach in ADESCA provides an automatic overview of the predictive performance of a wide set of classification algorithms, whose parameters are automatically optimized by a grid search, hence automatically identifying the best one and its performance gain with respect to the others. Prediction performance is evaluated by exploiting a 10-fold stratified cross-validation, and the machine learning algorithms used in this section are Support Vector Machine, Decision Tree, K-Nearest Neighbor, and the Naive Bayes Classifier [1].

2.3 Storytelling
In the Storytelling section, the most salient information regarding the uploaded dataset is automatically shown to the user through a storyboard (i.e. a visual layout mechanism able to summarize specific events and data) that allows addressing visual analytics challenges. We have designed a storyboard in which all the salient information is shown to the user on different pages. However, differently from the traditional concept of a storyboard [30], which is always placed on a timeline, we removed the temporal dependence between the visualized pages. In fact, the different
pages that make up our section are independent of each other, as each page contains the results of a particular analysis, unrelated to the others.
Specifically, the purpose of this data-driven storytelling is to tell the curious data user the story of the uploaded data automatically, so that the user can have an immediate overview of the insights hidden in the data without having technical and methodological expertise on the underlying algorithms. Structures and models hidden in the data are reported in a completely automatic way, and the user is not required to interact with the application anymore; s/he can focus her/his attention on the discovered insights. In practice, the information contained here was already present in the previous sections; but while in the other sections all the possible results were shown in cascade, ADESCA now shows only the most interesting solutions, i.e. those that best characterize the dataset. To provide this summary, the Storytelling section offers a visual layout strategy able to summarize the most important concepts. As a first attempt, the storyboard includes four different views, each of which contains the results of different analyses. More details on each view of the storyboard are reported below.

Figure 4: D1 Storytelling section. The names of the attributes in screen 1 and screen 2.b are: "SepalLength, SepalWidth, PetalLength, PetalWidth". The names of the labels in screen 3 are: "Iris-setosa, Iris-versicolor, Iris-virginica"

First View - Data Characterization. ADESCA shows a brief overview of the dataset, including general information (number of rows, number of columns, list of attributes, etc.), a 3D scatter chart of the distribution and a correlation matrix containing the Pearson coefficients between the various numeric fields. Each cell of the matrix is clickable and opens a pop-up window containing the boxplots and the Cumulative Distribution Functions of the attributes related to the clicked cell.
Second View - Unsupervised Learning. When the user exploits ADESCA step by step, the clustering section provides all the results obtained by the various clustering algorithms performed with the parameters chosen automatically by the proposed self-tuning strategy, and the user can later obtain new results by manually changing these parameters. In the second view of the data-driven storyboard, however, among all the solutions automatically computed by ADESCA and those manually selected by the user (if any), only the best one is displayed (chosen using the Silhouette score). This solution is shown as a 3D scatter chart, depicting each cluster with a different color. By clicking on a point belonging to a particular cluster, a new page is opened containing specific information characterizing that cluster's content (e.g., radar chart, boxplots, paths extracted from the decision tree built on that solution).
Third View - Supervised Learning. This page is present only if the original dataset contains a label. In this case, the results of a series of classification algorithms are shown here. For each integrated algorithm a confusion matrix is shown to compare the predicted labels with the original ones, highlighting the algorithm with the highest performance.
Fourth View - Data Transformation. This page shows the three best transformations among those obtained in the appropriate section, reporting for each of them the schema of the new dataset and its distribution in a 3D scatter chart. To choose the best transformations among those obtained, we apply to each of them the same clustering algorithm, automatically setting the input parameters; the transformation that yields the highest Silhouette score is selected. This page is present only if there is a temporal attribute in the original dataset.

3 PRELIMINARY DEVELOPMENT
A preliminary implementation of ADESCA has been developed in Python [24], including the scikit-learn library [22] (for the analytic tasks) and the pandas library [19] (for manipulating data in a tabular or sequential format). To build the architecture of the system we used the Flask micro-framework [11], while the graphics of the web pages are managed with CSS and JavaScript [7], including the HighCharts library [18] for creating interactive graphs.
ADESCA has been experimentally evaluated by analyzing different datasets, using τ=3, φ=10, φ1=50 and φ2=11 as default values in the clustering analysis, and ∆=2.5 as the default value of the parameter in the automated data transformation.
As a first try, ADESCA was promoted at the ’Festival della Tec-                     Table 1: ADESCA Computational Time
nologia’ 5 , held at the Politecnico di Torino on November 8th
2019. During this event, about 200 people composed of university                                                  D1      D2
students and students of the last year of high school were able                         Data Characterization     1.58    2.01
to test our tool. What made this event significant was the fact                               Clustering          9.34   14.37
that most of the students who used ADESCA had very limited                                   Storytelling         2.09    9.56
skills in areas such as data exploration or data mining. How-                           Data Transformation         -     1.50
ever, they managed to understand the tool without problems,
also testing it in its various analyzes, and easily understanding
automated insights extracted from data. This experience showed
us the practicality and the user-friendliness of the work done.           by clicking on a point of it. Finally, screen 3 shows the results of
                                                                          the classification techniques and it is present because the original
                                                                          dataset contains a label.
                                                                          Regarding instead the storytelling section of D2, the results re-
                                                                          lated to data characterization and clustering analysis are in the
                                                                          same form as those reported for D1. But, since D2 does not have
                                                                          a label, the page containing the classification algorithms is no
                                                                          longer present. In its place there is a page indicating the three
                                                                          best possible transformations for D2 (because now a temporal
                                                                          attribute is present). Each of these transformations is shown to
                                                                          the user on a page with the same layout reported in the Figure 5.
                                                                             In particular, the first proposed transformation contains a
                                                                          record for each patient and represents the history of clinical ex-
                                                                          aminations performed by each patient. Then, applying unsuper-
                                                                          vised analysis on the dataset before and after the transformation,
Figure 5: Layout that illustrates how the page showing each               we can note that on the original dataset we are not able to find
transformation obtained is divided. In rectangle 2 the first              subgroups (because it is not relevant to group individual pre-
five lines of the new dataset are shown, while clicking in the            scriptions together), while in the second case we are grouping
red circle at the top the user can download the entire dataset            the patients and from the automated cluster analysis we can
in csv format. Finally, in rectangle 3 there is the comparison            clearly see two distinct groups containing patients with similar
between the 3d distribution of the dataset before and after               examination histories. From this result, we can deduce that the
the transformation                                                        transformation in this case was useful, because we are now able
                                                                          to capture a model from the data that was not previously visible.
4     PRELIMINARY EXPERIMENTAL RESULTS
The objective of the preliminary experiments is to demonstrate the capability of ADESCA to discover useful knowledge items in an automated fashion and to present only the relevant outcomes to the end user. To this aim, dataset size impacts the analysis neither negatively nor positively. Furthermore, for reasons of space, we do not report all the analyses performed by the tool, but only the results of the storytelling section, which is the most significant and innovative one. The first dataset analyzed, D1, contains a label attribute, while the second dataset, D2, contains a temporal attribute. D1 is 'Iris', a free dataset available at the UCI Machine Learning Repository [6]. It contains 150 records, each of which represents a different flower. Each flower is characterized by a series of numerical attributes and belongs to a 'Species' label; each species contains 50 flowers. D2 is instead a dataset containing medical information. In particular, it has 500 lines, and each line contains information about the exam taken by a patient on a given date (the temporal attribute). Table 1 reports the execution time (in seconds) of each section of ADESCA, separately per dataset. The computational cost of the storytelling section is obtained by summing the time of each page that composes it.
   The results of the storytelling section for D1 are shown in Figure 4 in four screens. The first screen shows the results of the data characterization. Screen 2.a shows the best clustering solution to apply to the dataset, while screen 2.b is a window containing information related to a specific cluster and is obtained

5 https://www.festivaltecnologia.it/

5   OPEN ISSUES AND FUTURE DIRECTIONS
A first attempt towards automated data exploration has been achieved through ADESCA. Preliminary experimental results on small datasets demonstrate the friendliness of the knowledge-extraction approach towards the democratization of data science. The storytelling capability, based on a storyboard, allows data-curious users to easily analyze their data through a visual layout able to summarize specific events and data without technical and methodological expertise. Specifically, thanks to ADESCA, all users, even the less experienced, can take advantage of their data by viewing only its most salient aspects in the appropriate storyboard. Moreover, since the algorithms and the parameters are chosen automatically by ADESCA, the proposed methodology is suitable for all those non-technical users who can draw interesting insights from their data. Obviously, what has been done so far must be considered a preliminary step: data exploration is a very complex topic, and automating it makes everything even more difficult.
The idea is to extend ADESCA, covering topics that are currently still open, such as: (i) the management of heterogeneous data types (e.g., combinations of time-series data with unstructured data, images [29], or audio signals); (ii) the capability to perform feature engineering to capture specific data properties; (iii) the integration of more complex machine learning algorithms able to deal with unstructured and high-dimensional data [5]; (iv) the ability to deal with huge datasets; (v) the capability to enrich the data under analysis with additional and related open data; (vi) the capability of properly managing geo-referenced [4] and time-series data [3]; and (vii) the improvement of the proposed data transformation strategies to obtain more generalizable outcomes.
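To make the idea of automatically chosen algorithms and parameters concrete, here is a minimal sketch (our own illustration, not ADESCA's implementation) that selects the number of clusters for the Iris dataset [6] by maximizing the silhouette index [25] with scikit-learn [22]:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data  # 150 flowers, 4 numerical attributes

# Evaluate a small range of candidate k values and keep the one
# whose partition obtains the highest average silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On Iris the silhouette favors a small k, the kind of compact summary a storyboard page can report without asking the user to tune anything.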
REFERENCES
 [1] Charu C Aggarwal. 2014. Data classification: algorithms and applications. CRC Press.
 [2] Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, and Letizia Tanca. 2015. Database Challenges for Exploratory Computing. SIGMOD Rec. 44, 2 (Aug. 2015), 17–22. https://doi.org/10.1145/2814710.2814714
 [3] L. Cagliero, T. Cerquitelli, S. Chiusano, P. Garza, and X. Xiao. 2017. Predicting critical conditions in bicycle sharing systems. Computing 99, 1 (2017), 39–57. https://doi.org/10.1007/s00607-016-0505-x
 [4] Tania Cerquitelli, Evelina Di Corso, Stefano Proto, Paolo Bethaz, Daniele Mazzarelli, Alfonso Capozzoli, Elena Baralis, Marco Mellia, Silvia Casagrande, and Martina Tamburini. 2020. A Data-Driven Energy Platform: From Energy Performance Certificates to Human-Readable Knowledge through Dynamic High-Resolution Geospatial Maps. Electronics 9, 12 (2020). https://doi.org/10.3390/electronics9122132
 [5] Evelina Di Corso, Tania Cerquitelli, and Francesco Ventura. 2017. Self-tuning
     techniques for large scale cluster analysis on textual data collections. In Pro-
     ceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Mo-
     rocco, April 3-7, 2017, Ahmed Seffah, Birgit Penzenstadler, Carina Alves, and
     Xin Peng (Eds.). ACM, 771–776.
 [6] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http:
     //archive.ics.uci.edu/ml
 [7] David Flanagan. 2006. JavaScript: the definitive guide. O'Reilly Media, Inc.
 [8] Michael Galarnyk. 2018. Understanding Boxplots. https://towardsdatascience.
     com/understanding-boxplots-5e2df7bcbd51
 [9] Antonio Giuzio, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Do-
     natello Santoro, and Letizia Tanca. 2019. INDIANA: An interactive system for
     assisting database exploration. Information Systems 83 (2019), 40–56.
[10] Paul Goodwin and Richard Lawton. 1999. On the asymmetry of the symmetric
     MAPE. International Journal of Forecasting 15, 4 (1999), 405–408. https:
     //EconPapers.repec.org/RePEc:eee:intfor:v:15:y:1999:i:4:p:405-408
[11] Miguel Grinberg. 2018. Flask web development: developing web applications with Python. O'Reilly Media, Inc.
[12] Pat Hanrahan. 2012. Analytic database technologies for a new kind of user:
     the data enthusiast. In Proceedings of the 2012 ACM SIGMOD International
     Conference on Management of Data. ACM, 577–578.
[13] Kevin Hu, Diana Orghian, and César Hidalgo. 2018. Dive: A mixed-initiative
     system supporting integrated data exploration workflows. In Proceedings of
     the Workshop on Human-In-the-Loop Data Analytics. ACM, 5.
[14] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview
     of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD Inter-
     national Conference on Management of Data. ACM, 277–281.
[15] Prasanth Jayachandran, Karthik Tunga, Niranjan Kamat, and Arnab Nandi.
     2014. Combining User Interaction, Speculative Query Execution and Sampling
     in the DICE System. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1697–1700. https:
     //doi.org/10.14778/2733004.2733064
[16] Bill Jelen and Michael Alexander. 2010. Pivot Table Data Crunching: Microsoft
     Excel 2010. Pearson Education.
[17] Udayan Khurana, Srinivasan Parthasarathy, and Deepak S. Turaga. 2014.
     READ: Rapid data Exploration, Analysis and Discovery. In EDBT. 612–615.
[18] Joe Kuan. 2015. Learning highcharts 4. Packt Publishing Ltd.
[19] Wes McKinney et al. 2011. pandas: a foundational Python library for data
     analysis and statistics. Python for High Performance and Scientific Computing,
     Seattle 14, 9 (2011), 1–9.
[20] Rupert G Miller Jr. 1997. Beyond ANOVA: basics of applied statistics. CRC
     press.
[21] Kristi Morton, Magdalena Balazinska, Dan Grossman, and Jock Mackinlay.
     2014. Support the data enthusiast: Challenges for next-generation data-
     analysis systems. Proceedings of the VLDB Endowment 7, 6 (2014), 453–456.
[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,
     Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
     Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python.
     Journal of machine learning research 12, Oct (2011), 2825–2830.
[23] Kedar Potdar, Taher Pardawala, and Chinmay Pai. 2017. A Comparative Study
     of Categorical Variable Encoding Techniques for Neural Network Classifiers.
     International Journal of Computer Applications 175 (10 2017), 7–9. https:
     //doi.org/10.5120/ijca2017915495
[24] Guido van Rossum. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands.
[25] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and
     validation of cluster analysis. Journal of computational and applied mathemat-
     ics 20 (1987), 53–65.
[26] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya
     Parameswaran. 2016. Effortless Data Exploration with Zenvisage: An Ex-
     pressive and Interactive Visual Analytics System. Proc. VLDB Endow. 10, 4
     (Nov. 2016), 457–468. https://doi.org/10.14778/3025111.3025126
[27] Manoj Singh. 2017. Finding the elbow or knee of a curve. https://dataplatform.cloud.ibm.com/analytics/notebooks/54d79c2a-f155-40ec-93ec-ed05b58afa39/view?access_token=6d8ec910cf2a1b3901c721fcb94638563cd646fe14400fecbb76cea6aaae2fb1
[28] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2013. Data mining cluster analysis: basic concepts and algorithms. Introduction to Data Mining, Pearson Education India (2013), 487–533.
[29] Bartolomeo Vacchetti, Tania Cerquitelli, and Riccardo Antonino. 2020. Cinematographic Shot Classification through Deep Learning. In 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 345–350.
[30] Rick Walker, Llyr Ap Cenydd, Serban Pop, Helen C Miles, Chris J Hughes, William J Teahan, and Jonathan C Roberts. 2015. Storyboarding for visual analytics. Information Visualization 14, 1 (2015), 27–50.
[31] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 1-3 (1987), 37–52.
[32] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya G. Parameswaran. 2018. Helix: Accelerating Human-in-the-loop Machine Learning. CoRR abs/1808.01095 (2018). http://arxiv.org/abs/1808.01095