=Paper=
{{Paper
|id=Vol-2841/DARLI-AP_15
|storemode=property
|title=Enhancing the friendliness of data analytics tasks: an automated methodology
|pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_15.pdf
|volume=Vol-2841
|authors=Paolo Bethaz,Tania Cerquitelli
|dblpUrl=https://dblp.org/rec/conf/edbt/BethazC21
}}
==Enhancing the friendliness of data analytics tasks: an automated methodology==
Paolo Bethaz, Tania Cerquitelli
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
paolo.bethaz@polito.it, tania.cerquitelli@polito.it

Figure 1: Automated Data Exploration Workflow

ABSTRACT
This paper presents ADESCA (Automated Data Exploration with Storytelling CApability), a new methodology to automatically extract models and structures from data, with the aim of democratizing data science. Particular importance was placed on the creation of an innovative storyboard for automated data analytics, allowing the creation and visualization of data stories. In addition, ADESCA also offers a guided, more traditional step-by-step exploration, where the algorithm parameters are configured automatically and the results are shown graphically through different plots, in order to be as understandable and user-friendly as possible. Preliminary experimental results are promising and demonstrate the effectiveness and efficiency of the proposed automated methodology.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In today's world there is an enormous amount of data that can be analyzed, and extracting useful insights from these data has become a fundamental operation in many sectors. However, the exploration of these data requires the intervention of a data scientist with great expertise in the field, who often devotes a long time to finding a good combination of data-driven algorithms able to discover interesting knowledge items in the data. The purpose of our research activity is to democratize data science by designing and developing a new data-analytics engine that provides self-learning capabilities jointly with parameter auto-selection, to off-load the data scientist from testing and tuning and to let her/him focus only on the most interesting and valuable insights hidden in her/his data.

Some research efforts have already been made to move the data exploration process towards automated approaches. The authors in [14] re-examined the architecture of database management systems, as in the DICE system, to combine sampling with speculative execution in the same engine [15]. Unlike the above approaches, we did not focus on the architecture but on the methodological approach, like the authors in [17, 32]. Specifically, in [32] the authors proposed Helix, a declarative machine learning system that seeks to optimize end-to-end workflow execution across iterations, while in [17] the READ engine was proposed to reduce the time required for data exploration.

Regarding the concept of democratizing data science, the works described in [9] and [2] are of considerable importance, as the authors propose to assist "data enthusiast" users [12, 21] in the exploration of transactional databases in an interactive way. However, these research activities focus on query-oriented techniques, while our purpose is to enhance the friendliness of the knowledge-extraction approach.

There is also a series of tools that try to make data exploration automatic and more graphically intuitive. However, tools like DataWrapper (https://github.com/datawrapper/datawrapper), Datapine (https://www.datapine.com/) or ZenVisage [26] only offer a traditional visualization service, where the user can see the distribution of each attribute after choosing which type of chart to use among those proposed. Other tools, in addition to the graphical display of the attributes, also make it possible to perform some analyses on the data. This is, for example, the case of AnswerMiner (https://www.answerminer.com/), which allows building decision trees on the data; Dive [13], which allows, for example, performing the ANOVA test; and NCSS (https://www.ncss.com/), which allows performing the ANOVA test, multivariate analysis and a very simplified cluster analysis.

Taking a cue from the works listed so far, in this paper we propose an exploratory data methodology, named ADESCA (Automated Data Exploration with Storytelling CApability), that is able to automatically extract models and structures from the data without the intervention of a data scientist. In particular, we propose a new web-based system that automatically combines state-of-the-art data exploration approaches into a single tool to address all the steps of the traditional knowledge data discovery pipeline, adding some innovative steps compared to the traditional approaches. The innovative steps include: (i) an automated data transformation strategy, (ii) self-tuning strategies tailored to each data analytics algorithm, and (iii) a data storytelling methodology. The results of each of these techniques are shown to the user in a very intuitive way, thanks to user-friendly graphs that effectively support the decision-making process.

The automated data transformation section modifies the structure of the original dataset, reporting the data in a different way, to try to extract information that is hidden or absent in the original structure. The self-tuning strategies applied to supervised and unsupervised techniques off-load the data scientist from manually selecting the best algorithm and its input-parameter setting. Finally, the data-driven storytelling is a section of the implemented tool in which the most important information and results regarding all the analyses performed on the data are shown in a completely automatic way, thanks to an effective display system based on multiple web pages, each containing the visual results of a different analysis. The most innovative aspects that ADESCA offers include the Automated Data Exploration section, the self-tuning strategy and the Data Storytelling section, described below.

The paper is organized as follows. Section 2 details the proposed automated methodology, introducing the various innovative techniques integrated in the system. Section 3 presents preliminary development details, while Section 4 discusses the preliminary experimental results obtained on two real datasets. Finally, Section 5 discusses open issues and presents the future development of this work.

2 METHODOLOGY
The ADESCA methodology has the main purpose of creating automated data stories, capable of summarizing the most important information of a dataset; at the same time, ADESCA also allows a step-by-step exploration of the knowledge data discovery (KDD) pipeline shown in Figure 1. In this last case, the user explores the data, but s/he does not necessarily need to know the algorithms to be applied, nor to choose the best values for the algorithm parameters.

2.1 Automated Data Transformation
Data transformation plays a key role in the KDD process applied to any data type. Furthermore, the more complex the dataset, the greater the benefit that can be obtained by applying a data transformation technique. ADESCA integrates common transformation techniques such as standardization and One Hot Encoding [23], but also more innovative transformations based on pivot tables [16].

As a first attempt at this last type of transformation, we focus on the idea that, by reporting the same data in a structure different from the original one, new insights that were hardly visible before may become clearer. We preliminarily applied the proposed strategy only to datasets in which temporal information exists (e.g. date, month, year, etc.). Further data transformation techniques will be integrated later in ADESCA to deal with complex and heterogeneous data. The proposed transformation strategy integrated in ADESCA consists in considering only two of the original attributes, reporting the original data in the form of frequency values relative to the chosen attributes, as shown in Figure 2.
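The frequency-based pivot described above maps naturally onto a pandas cross-tabulation. A minimal sketch, using the attribute names of the Figure 2 example (the DataFrame content is invented for illustration):

```python
import pandas as pd

# Toy clinical dataset: one row per prescription (illustrative values).
df = pd.DataFrame({
    "Id_Patient": [1, 1, 2, 2, 2, 3],
    "Cod_Exam":   ["blood", "xray", "blood", "blood", "ecg", "xray"],
})

# Pivot the two chosen attributes into a frequency table:
# one row per patient, one column per exam type, cells = counts.
pivoted = pd.crosstab(df["Id_Patient"], df["Cod_Exam"])
print(pivoted)
```

Each row of `pivoted` now summarizes one patient's examination history, which is the representation on which the subsequent cluster analysis operates.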
In Figure 2, the original clinical dataset on the left is transformed into a new dataset which reports, for each row, the history of the clinical exams carried out by each patient. The KDD process applied to the new dataset may lead to better insights from the data. For example, interesting groups containing patients with a similar examination history (with standard or more specific examinations) might be discovered through an unsupervised learning methodology.

Figure 2: Example of the proposed transformation. The original dataset on the left is transformed according to its two attributes 'Id_Patient' and 'Cod.Exam'

The first step in the workflow shown in Figure 1 is a general characterization of the data, where some statistics and graphs concerning the various attributes of the dataset are computed (e.g. correlation matrix, boxplot and Cumulative Distribution Function). Then, to make the data ready for the subsequent analyses, ADESCA removes the outliers in the dataset under analysis, identifying them with DBSCAN clustering [28] (if there is no class label in the original dataset) or through boxplot analysis (if a label is present). In the latter case, ADESCA looks for the attributes that give a greater contribution than the others in choosing the final label.
To find these attributes, an ANOVA univariate test [20] is performed on each of them, assigning each attribute a p-value. Each attribute associated with a p-value lower than 0.05 is considered significant, and for each of these the outlier removal is performed by analyzing its boxplot representation [8] and eliminating each point below the minimum (first quartile minus 1.5 times the interquartile range) or above the maximum (third quartile plus 1.5 times the interquartile range). The outlier points found are shown to the user in a very intuitive way, thanks to an interactive 3d scatter chart which represents all the points of the analyzed dataset according to its 3 most significant dimensions (found by applying PCA [31]), where all the points labeled as outliers are shown in red and the other points in green.

Moreover, the results of not all the transformations are shown: ADESCA carries out analyses and suggests only the transformations whose results are better, providing more interesting knowledge. The proposed analysis to identify the best transformations consists in: (i) considering each pair of attributes, merged together into a new single column, (ii) calculating the Average Frequency of each of these new columns, and (iii) choosing only the transformations in which this value lies between 1 and a threshold ∆, found empirically after applying these transformations to different datasets. Each transformation obtained is shown in a very user-friendly page that contains the schema of the new transformed dataset, which the user can download entirely in CSV format with a click. Then, to give a graphical idea of the transformation, two interactive 3d scatter charts are shown, representing the distribution of the dataset before and after the transformation. The user can click on these charts, enlarging and rotating them in all angles, allowing an effective comparison between them.
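The ANOVA gate and the boxplot rule described above can be sketched as follows; this is a simplified, self-contained illustration on synthetic data, with `scipy.stats.f_oneway` standing in for the ANOVA univariate test:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10, 1, 100), [25.0, -8.0]])  # two planted outliers
labels = np.array(["a"] * 51 + ["b"] * 51)

# ANOVA univariate test: the attribute is significant if p < 0.05.
_, p_value = f_oneway(*(values[labels == c] for c in np.unique(labels)))
significant = p_value < 0.05

# Boxplot (IQR) rule: drop points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
inliers = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
```

With these synthetic values the two planted extremes fall outside the whiskers and are removed, while the bulk of the distribution is kept.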
2.2 Data Modeling through Self-Tuning Unsupervised and Supervised Algorithms
A key point in the whole KDD process is that the results of any analysis are computed and graphically shown automatically, without the user having to interact with the system by choosing the most suitable algorithm and its parameter values. To guarantee the automatic nature of the process, we have proposed an ad-hoc methodology for the automatic configuration of the parameters of each algorithm used. These methodologies are particularly relevant for cluster analysis and for supervised learning techniques, where the algorithms require parameters that greatly affect the quality of the results, and so these must be chosen appropriately.

2.2.1 Self-Tuning Unsupervised Learning.
K-means. The choice of the parameter k is fundamental in the k-means algorithm, because it determines the number of clusters obtained in the final partition. To look for the best solution, ADESCA runs k-means for each value of k in the interval [1, φ], and for each of these solutions the SSE (Sum of Squared Errors) is calculated. Plotting these results in decreasing order of SSE, ADESCA looks for the elbow of the resulting curve, that is, the point where an increase in k causes only a very small decrease of the SSE, while a decrease in k sharply increases the SSE. To calculate the elbow point automatically [27], we draw a straight line between the first and the last point of the curve and look for the point of the curve with the greatest distance from that line. To show the user how this parameter has been chosen, the elbow graph is plotted next to the scatter chart, highlighting the elbow point chosen as the value of k.
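The automatic elbow detection (maximum distance of a curve point from the chord joining the first and last points of the SSE curve) can be sketched in a few lines; the SSE values below are invented to show a clear elbow:

```python
import numpy as np

def elbow_index(sse):
    """Index of the elbow: the point of the curve with the greatest
    distance from the straight line joining its first and last points."""
    y = np.asarray(sse, dtype=float)
    x = np.arange(len(y), dtype=float)
    x0, y0, xn, yn = x[0], y[0], x[-1], y[-1]
    # Point-to-line distance, line through (x0, y0) and (xn, yn).
    num = np.abs((yn - y0) * x - (xn - x0) * y + xn * y0 - yn * x0)
    den = np.hypot(yn - y0, xn - x0)
    return int(np.argmax(num / den))

# Decreasing SSE curve with a clear elbow at k = 3 (index 2).
sse = [1000, 400, 120, 90, 80, 75, 72]
k_best = elbow_index(sse) + 1  # values of k start at 1
```

The same chord-distance rule is reused below to pick Eps for DBSCAN from the sorted k-dist curve.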
As a first attempt, the unsupervised learning in ADESCA includes cluster analysis performed through three different algorithms, properly enriched with a self-tuning strategy to automatically identify the best input-parameter setting: K-means (which requires k as a parameter), DBSCAN (which requires MinPoints and Eps as parameters) and Hierarchical clustering (which requires the number of clusters as a parameter) [28]. The results of each clustering algorithm are shown to the user through a clickable 3d scatter chart, in which each cluster is highlighted with a different color, as shown in Figure 3. The same colors are also used in a pie chart, which shows how many elements belong to each cluster and the respective percentages. If the algorithm used implies the concept of centroids, these are also reported as red points in the scatter chart.

DBSCAN. This algorithm requires as parameters a radius Eps and a number of points MinPoints. To choose the MinPoints value automatically, ADESCA calculates the k-dist graph for each value of k in a range between 2 and φ1. Starting with k=2 and increasing this value by 1 each time, we measure the difference between the k-dist graph and the (k+1)-dist graph using the MAPE percentage error [10]. If this percentage value is less than a pre-set threshold τ, we stop the procedure and choose the current k as MinPoints. Using the value of MinPoints (k) just found, to find Eps we then calculate the elbow point of the sorted k-dist graph, with the same technique used in the case of k-means. To make the choice of these parameters more meaningful to the user, the sorted k-dist graph is plotted next to the 3d scatter chart, with the elbow point highlighted.
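One possible reading of this MinPoints selection, sketched with scikit-learn on synthetic data (the stopping rule follows the description above; the exact MAPE formulation used by ADESCA is our assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_dist(points, k):
    """Sorted distance of each point to its k-th nearest neighbour."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(points).kneighbors(points)
    return np.sort(dist[:, k])  # column 0 is the point itself (distance 0)

def choose_min_points(points, tau=3.0, k_max=50):
    """Stop at the first k whose k-dist and (k+1)-dist curves differ
    by less than tau percent (MAPE), and use that k as MinPoints."""
    for k in range(2, k_max):
        cur, nxt = k_dist(points, k), k_dist(points, k + 1)
        if 100.0 * np.mean(np.abs((nxt - cur) / nxt)) < tau:
            return k
    return k_max

points = np.random.default_rng(0).normal(size=(200, 2))
min_points = choose_min_points(points)
```

Eps would then be taken at the elbow of `k_dist(points, min_points)`, with the same chord-distance technique used for the k-means SSE curve.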
Hierarchical Clustering. This algorithm requires as a parameter the number of clusters to be obtained in the final result. To find the most suitable value for this parameter, we use the Silhouette index [25], calculating the Average Silhouette of the solutions with a number of clusters between 2 and a pre-set threshold φ2. The number of clusters chosen as the parameter is the one associated with the highest Average Silhouette. The result of the hierarchical clustering is then visualized graphically in a very user-friendly dendrogram, a tree-like diagram which effectively shows how the various clusters have been grouped together.

Figure 3: The algorithm used and the values of the parameters chosen automatically by ADESCA are shown in the first box. The user can then manually change the values of these parameters. The second box contains the pie chart with the same colors as the scatter chart, while the circle on the right shows the radar chart obtained by clicking on the centroid related to cluster 2

By clicking on one of the centroid points in the scatter chart, the radar chart of the centroid is shown, which jointly represents the value of each component of the centroid on different Cartesian axes having the same origin.

2.2.2 Self-Tuning Supervised Learning. The self-tuning approach in ADESCA provides an automatic overview of the predictive performance of a wide set of classification algorithms, whose parameters are automatically optimized by a grid search, hence automatically identifying the best one and its performance gain with respect to the others. Prediction performance is evaluated by exploiting a 10-fold stratified cross-validation, and the machine learning algorithms used in this section are Support Vector Machine, Decision Tree, K-Nearest Neighbor, and Naive Bayes Classifier [1].
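This supervised self-tuning step corresponds closely to scikit-learn's grid search with stratified cross-validation. A compact sketch on the Iris data (the grids shown are illustrative, not necessarily the ones used by ADESCA):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One small, illustrative grid per classifier.
candidates = {
    "svm":  (SVC(), {"C": [0.1, 1, 10]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "knn":  (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "nb":   (GaussianNB(), {}),
}

# Grid-search each classifier under 10-fold stratified CV, keep the best.
scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy").fit(X, y)
    scores[name] = search.best_score_
best = max(scores, key=scores.get)
```

The winning model and the accuracy gap to the other candidates are exactly the kind of summary ADESCA reports to the user.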
2.3 Storytelling
In the Storytelling section, the most salient information regarding the uploaded dataset is automatically shown to the user through a storyboard, i.e. a visual layout mechanism able to summarize specific events and data, which allows addressing visual analytics challenges. We have designed a storyboard in which all the salient information is shown to the user on different pages. However, differently from the traditional concept of storyboard [30], which is always placed on a timeline, we removed the temporal dependence between the visualized pages. In fact, the different pages that make up our section are independent of each other, as each page contains results relating to a particular analysis, which is unrelated to the others.

Specifically, the purpose of this data-driven storytelling is to tell the curious data user the story of the uploaded data automatically, so that the user can have an immediate overview of the insights hidden in the data without having technical and methodological expertise on the underlying algorithms. Structures and models hidden in the data are reported in a completely automatic way, and the user is not required to interact any further with the application: s/he can focus her/his attention on the discovered insights. In practice, the information contained here was already present in the previous sections; but while in the other sections all the possible results were shown in cascade, now ADESCA only shows the most interesting solutions, that is, those that mostly characterize the dataset. To provide this summary, the Storytelling section tries to offer a visual layout strategy able to summarize the most important concepts. As a first attempt, the storyboard includes four different views, each of which contains the results of different analyses. More details on each view in the storyboard are reported below.

First View - Data Characterization. ADESCA shows a brief overview of the dataset, including general information (number of rows, number of columns, list of attributes, etc.), a 3d scatter chart of the distribution and a correlation matrix containing the Pearson coefficients between the various numeric fields. Each cell of the matrix is clickable and opens a pop-up window containing the boxplots and the Cumulative Distribution Functions concerning the attributes related to the clicked cell.

Second View - Unsupervised Learning. When the user exploits ADESCA step by step, the clustering section provides all the results obtained by the various clustering algorithms performed with the parameters chosen automatically by the proposed self-tuning strategy, and later the user can also see new results by manually changing these parameters. In the second view of the data-driven storyboard, instead, among all the solutions automatically computed by ADESCA and those manually selected by the user (if any), only the best one is displayed (chosen using the Silhouette score). This solution is shown as a 3d scatter chart, depicting each cluster with a different color. By clicking on a point belonging to a particular cluster, a new page is opened containing specific information characterizing that cluster's content (e.g., radar chart, boxplots, paths extracted from the decision tree built on that solution).

Third View - Supervised Learning. This page is present only if the original dataset contains a label. In this case, the results of a series of classification algorithms are shown here. For each integrated algorithm a confusion matrix is shown to compare the predicted labels with the original ones, highlighting the algorithm with the highest performance.

Fourth View - Data Transformation. This page shows the three best transformations among those obtained in the appropriate section, reporting for each of them the schema of the new dataset and its distribution in a 3d scatter chart. To choose the best transformations among those obtained, we apply to each of them the same clustering algorithm, with automatically set input parameters, and select the transformations that yielded the highest silhouette scores. This page is present only if there is a temporal attribute in the original dataset.

Figure 4: D1 Storytelling section. The names of the attributes in screen 1 and screen 2.b are: "SepalLength, SepalWidth, PetalLength, PetalWidth". The names of the labels in screen 3 are: "Iris-setosa, Iris-versicolor, Iris-virginica"

3 PRELIMINARY DEVELOPMENT
A preliminary implementation of ADESCA has been developed in Python [24], including the scikit-learn library [22] (for the analytic tasks) and the pandas library [19] (for manipulating data in a tabular or sequential format). To build the architecture of the system we used the micro-framework Flask [11], while the graphical part of the web pages is managed through CSS and Javascript [7], including the Highcharts library [18] for creating interactive graphs.

ADESCA has been experimentally evaluated by analyzing different datasets, using τ=3, φ=10, φ1=50 and φ2=11 as default values in the clustering analysis, and ∆=2.5 as the default value of the parameter in the automated data transformation.

As a first try, ADESCA was promoted at the 'Festival della Tecnologia' (https://www.festivaltecnologia.it/), held at the Politecnico di Torino on November 8th, 2019. During this event, about 200 people, composed of university students and students in the last year of high school, were able to test our tool.
What made this event significant was the fact that most of the students who used ADESCA had very limited skills in areas such as data exploration or data mining. However, they managed to understand the tool without problems, testing it in its various analyses and easily understanding the automated insights extracted from the data. This experience showed us the practicality and the user-friendliness of the work done.

4 PRELIMINARY EXPERIMENTAL RESULTS
The objective of the preliminary experiments is to demonstrate the capability of ADESCA to discover useful knowledge items in an automated fashion and to present only the relevant outcomes to the end user. To this aim, dataset size impacts the analysis neither negatively nor positively. Furthermore, for reasons of space, we do not report all the analyses performed by the tool, but only the results of the storytelling section, which is the most significant and innovative one. The first dataset analyzed, D1, contains a label attribute, while the second dataset, D2, contains a temporal attribute. D1 is the 'Iris' dataset, freely available at the UCI Machine Learning Repository [6]. It contains 150 records, each representing a different flower. Each flower is characterized by a series of numerical attributes and belongs to a 'Species' label; each species contains 50 flowers. D2 is instead a dataset containing medical information. In particular, it has 500 lines, and each line contains information about the exam taken by a patient on a given date (the temporal attribute). Table 1 reports the execution time (in seconds) of each section of ADESCA, separately per dataset. The computational cost of the storytelling section is obtained by summing the times of the pages that compose it.

Table 1: ADESCA Computational Time

                          D1      D2
Data Characterization     1.58    2.01
Clustering                9.34   14.37
Storytelling              2.09    9.56
Data Transformation       -       1.50

The results of the Storytelling section for D1 are shown in Figure 4 in four screens. The first screen shows the results of the data characterization. Screen 2.a shows the best clustering solution for the dataset, while screen 2.b is a window containing information related to a specific cluster, obtained by clicking on one of its points. Finally, screen 3 shows the results of the classification techniques, and it is present because the original dataset contains a label.

Regarding the storytelling section of D2, the results related to data characterization and clustering analysis are in the same form as those reported for D1. But, since D2 does not have a label, the page containing the classification algorithms is no longer present. In its place there is a page indicating the three best possible transformations for D2 (because now a temporal attribute is present). Each of these transformations is shown to the user on a page with the layout reported in Figure 5.

Figure 5: Layout that illustrates how the page showing each transformation obtained is divided. In rectangle 2 the first five lines of the new dataset are shown, while by clicking on the red circle at the top the user can download the entire dataset in CSV format. Finally, in rectangle 3 there is the comparison between the 3d distributions of the dataset before and after the transformation

In particular, the first proposed transformation contains a record for each patient and represents the history of the clinical examinations performed by each patient. Then, applying unsupervised analysis to the dataset before and after the transformation, we note that on the original dataset we are not able to find subgroups (because it is not relevant to group individual prescriptions together), while in the second case we are grouping the patients, and from the automated cluster analysis we can clearly see two distinct groups containing patients with similar examination histories. From this result, we can deduce that the transformation was useful in this case, because we are now able to capture a model from the data that was not previously visible.

5 OPEN ISSUES AND FUTURE DIRECTIONS
A first attempt towards automated data exploration has been achieved through ADESCA. Preliminary experimental results on small datasets demonstrate the friendliness of the knowledge-extraction approach towards the democratization of data science. The storytelling capability, based on a storyboard, allows curious data users to easily analyze their data through a visual layout mechanism able to summarize specific events and data without technical and methodological expertise. Specifically, thanks to ADESCA, all users, even the less experienced, can take advantage of their data by viewing only its most salient aspects in the appropriate storyboard. Moreover, since the algorithms and the parameters are chosen automatically by ADESCA, the proposed methodology is suitable for all those non-technical users who can draw interesting insights from their data. Obviously, what has been done so far must be considered a preliminary step: data exploration is a very complex topic, and automating it makes everything even more difficult.

The idea is to extend ADESCA, covering topics that are currently still open, such as: (i) the management of heterogeneous data types (e.g. the combination of time-series data with unstructured data, images [29] or audio signals), (ii) the capability to perform feature engineering to capture specific data properties, (iii) the integration of more complex machine learning algorithms able to deal with unstructured and high-dimensional data [5], (iv) the ability to deal with huge datasets, (v) the capability to enrich the data under analysis with additional, related open data, (vi) the capability of properly managing geo-referenced [4] and time-series data [3], and (vii) the improvement of the proposed data transformation strategies to obtain more generalizable outcomes.

REFERENCES
[1] Charu C. Aggarwal. 2014. Data Classification: Algorithms and Applications. CRC Press.
[2] Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, and Letizia Tanca. 2015. Database Challenges for Exploratory Computing. SIGMOD Record 44, 2 (Aug. 2015), 17–22. https://doi.org/10.1145/2814710.2814714
[3] L. Cagliero, T. Cerquitelli, S. Chiusano, P. Garza, and X. Xiao. 2017. Predicting critical conditions in bicycle sharing systems. Computing 99, 1 (2017), 39–57. https://doi.org/10.1007/s00607-016-0505-x
[4] Tania Cerquitelli, Evelina Di Corso, Stefano Proto, Paolo Bethaz, Daniele Mazzarelli, Alfonso Capozzoli, Elena Baralis, Marco Mellia, Silvia Casagrande, and Martina Tamburini. 2020. A Data-Driven Energy Platform: From Energy Performance Certificates to Human-Readable Knowledge through Dynamic High-Resolution Geospatial Maps. Electronics 9, 12 (2020). https://doi.org/10.3390/electronics9122132
[5] Evelina Di Corso, Tania Cerquitelli, and Francesco Ventura. 2017. Self-tuning techniques for large scale cluster analysis on textual data collections. In Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3-7, 2017. ACM, 771–776.
[6] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[7] David Flanagan. 2006. JavaScript: The Definitive Guide. O'Reilly Media, Inc.
[8] Michael Galarnyk. 2018. Understanding Boxplots. https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
[9] Antonio Giuzio, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, and Letizia Tanca. 2019. INDIANA: An interactive system for assisting database exploration. Information Systems 83 (2019), 40–56.
[10] Paul Goodwin and Richard Lawton. 1999. On the asymmetry of the symmetric MAPE. International Journal of Forecasting 15, 4 (1999), 405–408.
[11] Miguel Grinberg. 2018. Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc.
[12] Pat Hanrahan. 2012. Analytic database technologies for a new kind of user: the data enthusiast. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 577–578.
[13] Kevin Hu, Diana Orghian, and César Hidalgo. 2018. Dive: A mixed-initiative system supporting integrated data exploration workflows. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics. ACM, 5.
[14] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 277–281.
[15] Prasanth Jayachandran, Karthik Tunga, Niranjan Kamat, and Arnab Nandi. 2014. Combining User Interaction, Speculative Query Execution and Sampling in the DICE System. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1697–1700. https://doi.org/10.14778/2733004.2733064
[16] Bill Jelen and Michael Alexander. 2010. Pivot Table Data Crunching: Microsoft Excel 2010. Pearson Education.
[17] Udayan Khurana, Srinivasan Parthasarathy, and Deepak S. Turaga. 2014. READ: Rapid data Exploration, Analysis and Discovery. In EDBT. 612–615.
[18] Joe Kuan. 2015. Learning Highcharts 4. Packt Publishing Ltd.
[19] Wes McKinney et al. 2011. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, Seattle 14, 9 (2011), 1–9.
[20] Rupert G. Miller Jr. 1997. Beyond ANOVA: Basics of Applied Statistics. CRC Press.
[21] Kristi Morton, Magdalena Balazinska, Dan Grossman, and Jock Mackinlay. 2014. Support the data enthusiast: Challenges for next-generation data-analysis systems. Proceedings of the VLDB Endowment 7, 6 (2014), 453–456.
[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[23] Kedar Potdar, Taher Pardawala, and Chinmay Pai. 2017. A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. International Journal of Computer Applications 175 (Oct. 2017), 7–9. https://doi.org/10.5120/ijca2017915495
[24] Guido van Rossum. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands.
[25] Peter J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[26] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya Parameswaran. 2016. Effortless Data Exploration with Zenvisage: An Expressive and Interactive Visual Analytics System. Proc. VLDB Endow. 10, 4 (Nov. 2016), 457–468. https://doi.org/10.14778/3025111.3025126
[27] Manoj Singh. 2017. Finding the elbow or knee of a curve. https://dataplatform.cloud.ibm.com/analytics/notebooks/54d79c2a-f155-40ec-93ec-ed05b58afa39/view?access_token=6d8ec910cf2a1b3901c721fcb94638563cd646fe14400fecbb76cea6aaae2fb1
[28] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2013. Data mining cluster analysis: basic concepts and algorithms. Introduction to Data Mining, Pearson Education India (2013), 487–533.
[29] Bartolomeo Vacchetti, Tania Cerquitelli, and Riccardo Antonino. 2020. Cinematographic Shot Classification through Deep Learning. In 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 345–350.
[30] Rick Walker, Llyr Ap Cenydd, Serban Pop, Helen C. Miles, Chris J. Hughes, William J. Teahan, and Jonathan C. Roberts. 2015. Storyboarding for visual analytics. Information Visualization 14, 1 (2015), 27–50.
[31] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 1-3 (1987), 37–52.
[32] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya G. Parameswaran. 2018. Helix: Accelerating Human-in-the-loop Machine Learning. CoRR abs/1808.01095 (2018). http://arxiv.org/abs/1808.01095