Model-Driven Technologies for Data Mining Democratisation

Alfonso de la Vega and Pablo Sánchez
Software Engineering and Real-Time, University of Cantabria, Santander (Spain)
{alfonso.delavega, p.sanchez}@unican.es

Abstract. Data mining techniques allow the discovery of insights previously hidden in the data of a domain. However, these techniques demand very specialised skills, which people often lack, and this hinders data mining democratisation. To alleviate this situation, we defined a model-driven framework and several domain-specific languages that contribute to the democratisation of data mining. Here we summarise these contributions.

Keywords: Model-Driven Engineering · Domain-Specific Languages · Data Mining · Data Mining Democratisation

1 Introduction

Currently, computer systems gather large amounts of data that, when properly analysed, can be of great help for different purposes [9]. For instance, data collected by Uber is being used by different city halls to improve public transport networks, whereas Netflix is using their data to determine its next productions. Nevertheless, data mining techniques, which can find valuable facts hidden in data, require very specialised skills. For instance, before grouping some data by their similarities, we must decide which of the dozens of available clustering algorithms best fits our needs. Then, some preprocessing is necessary to adapt the input data to the requirements of the selected algorithm, such as converting categorical values to a numerical representation, or normalising numbers into the range [0, 1]. People willing to analyse data often lack the technical skills to perform these tasks, which hampers data mining democratisation.

As a first step to address these issues, we analysed the state of the art of the data mining democratisation field by means of a systematic literature review [4]. In this review, more than 700 works were considered, including both research articles and industrial tools.
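As an illustration of the preprocessing skills mentioned above, the two operations named in the introduction can be sketched in a few lines of plain Python (a minimal, hypothetical sketch with invented sample data; real pipelines would typically rely on libraries such as pandas or scikit-learn):

```python
def one_hot(values):
    """Convert categorical values to a numerical (one-hot) representation."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max_normalise(numbers):
    """Normalise a list of numbers into the range [0, 1]."""
    lo, hi = min(numbers), max(numbers)
    return [(n - lo) / (hi - lo) for n in numbers]

# Hypothetical attribute values to be prepared for a clustering algorithm.
colours = ["red", "green", "red"]      # categorical attribute
sizes = [10, 20, 15]                   # numerical attribute

encoded = one_hot(colours)             # [[0, 1], [1, 0], [0, 1]] (columns: green, red)
scaled = min_max_normalise(sizes)      # [0.0, 1.0, 0.5]
```

Deciding that such steps are needed at all, and in which order, is precisely the kind of expertise that non-specialists lack.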
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Some conclusions of this review are: (1) generic, completely domain-independent solutions might exhibit accuracy problems, since they do not take into account the particularities of each domain to configure their algorithms or to preprocess input data; and (2) the issue of facilitating the data selection and data formatting stages is scarcely addressed in the literature.

Model-Driven Engineering (MDE) and Domain-Specific Languages (DSLs) have proven to be effective means of providing domain-adapted solutions that are easy to use and feel familiar to experts in an application domain. Therefore, we explored whether these benefits could be brought to the data mining area. Our initial idea was to create a DSL with a high-level syntax that hid the low-level details of the applied mining techniques, so that it could be used by people without expertise in them. This DSL was initially devised to work with data coming from any domain, but ignoring domain details quickly turned out to be unfeasible, as the first conclusion of our review states. Thus, we opted to develop FLANDM, a model-driven framework for the rapid generation of DSLs for data mining [7], where the generated DSLs are adapted to the specificities of each concrete context. Additionally, this framework uses two DSLs, Lavoisier [6] and Pinset [8], to support its customisation. These DSLs address the second conclusion of our review by helping with the data transformation steps, i.e., making data conform to the requirements imposed by the applied data mining algorithms. Our approach has been validated by generating DSLs for several domains, with a special focus on the analysis of data extracted from e-learning platforms, web systems, and model-driven artefacts [8].
Moreover, we performed a set of empirical experiments to assess whether the generated DSLs can actually be used by people without knowledge of data mining techniques.

The rest of this paper is organised as follows: Section 2 introduces FLANDM, our framework for the generation of DSLs for data mining. Sections 3 and 4 describe Lavoisier and Pinset, our languages for the transformation of data into an analysis-ready format. Finally, Section 5 concludes this work.

2 FLANDM: A Model-Driven Framework for the Generation of DSLs for Data Mining Democratisation

To address the first issue stated in the introduction, some authors created frameworks for the development of data mining applications, in which an expert initially configures some elements of the framework so that the resulting application is adapted to a specific domain. In these cases, it is important to reduce the intervention of experts as much as possible, in order to decrease development costs.

With this idea in mind, we created FLANDM (Framework to develop LANguages for Data Mining) [7]. FLANDM is an MDE-based framework that can be used to create DSLs for data mining democratisation. These DSLs hide the technical details of the applied analysis techniques behind a high-level, query-based syntax, so that they can be used by people without expertise in data mining. Generated DSLs are adapted to the particularities of each domain, which makes them feel familiar to use and contributes to improving the accuracy of the analyses.

Figure 1 provides a general overview of how FLANDM works. As in any data mining process, we start with a set of business questions to be answered. For instance, a software engineer might want to know why some classes of a software system are more likely to contain bugs than others. These questions are complemented with a characterisation of the analysis context by means of a domain model.
[Figure: a set of Business Questions and a Domain Model are provided as input to FLANDM, which generates a Domain-Specific Analysis Language; queries in this language, run over the Data, produce the Answers.]
Fig. 1. FLANDM's language generation process (Icons: Flaticon, Noto Emoji font).

The purpose of this domain model is twofold: (1) to indicate the terminology with which domain experts are familiar; and (2) to specify the data available for the analysis. These data might reside in a well-defined source, such as a relational database, or might need to be extracted from several sources. For instance, continuing with our previous example, we can use as data some quality metrics computed for each class of the software system. In addition, these data could be complemented with information extracted from a bug tracking tool. The steps of extracting and integrating data from different sources are not currently addressed by FLANDM, and need to be performed manually.

Listing 1.1. Query examples of an analysis language generated with FLANDM.
1 find_reasons for num_bugs > 10 of classes_bug_info;
2
3 find_reasons for num_bugs > 10 of classes_bug_info
4     with package not_equals "legacyAccountMng";

Using this information as input for FLANDM, we could generate a query-based language such as the one depicted in Listing 1.1. As can be seen, the employed terms (num_bugs, class, package) should be familiar to software engineers. The structure of these sentences would be similar for all domains. Each query is composed of a command, which specifies the kind of answer to be computed; a dataset, which determines the data to be used for that analysis; and, optionally, filters that might exclude some data from the analysis. In line 1 of Listing 1.1 we try to find the reasons that lead to a number of bugs higher than a given threshold, using a dataset called classes_bug_info. In lines 3-4 we perform the same query, but in this case we omit the classes of package "legacyAccountMng" from the analysis.

These high-level sentences are translated, by means of model transformation and code generation techniques, into low-level code that configures and invokes specific data mining algorithms. This generated code is then executed to provide an answer to the specified query.

Both the DSL generation infrastructure and the sentence transformation process have been designed so that they can be easily configured by data mining experts to fit the particularities of each domain. For instance, an expert can easily change the underlying algorithm that is used to compute a specific command, or fine-tune some of its parameters.

    dataset releasesInfo {
        mainClass Class [name]
        include rs by releaseId {
            include ms by metric.name
        }
    }

    name  | rs 3.1 ms cbo | rs 3.1 ms dit | rs 3.2 ms cbo | rs 3.2 ms dit
    Book  |       3       |       0       |       4       |       0
    Novel |       1       |       1       |       1       |       1

[Domain model (top of figure): a Class has a name and a set of releases rs; each release has a releaseId and a set of measurements ms, each holding a value for a named metric.]
Fig. 2. Top: domain model for class-level metrics; bottom-left: target dataset; bottom-right: Lavoisier query that performs the extraction.
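The paper does not show the generated low-level code itself. Purely as a hypothetical Python sketch (all data, names and the final algorithm choice are invented for illustration), the second query of Listing 1.1 might be translated into something along these lines:

```python
# Hypothetical in-memory representation of the classes_bug_info dataset.
classes_bug_info = [
    {"name": "Parser", "cbo": 14, "package": "core", "num_bugs": 13},
    {"name": "Logger", "cbo": 2, "package": "util", "num_bugs": 1},
    {"name": "Billing", "cbo": 11, "package": "legacyAccountMng", "num_bugs": 12},
]

# 'with package not_equals "legacyAccountMng"' becomes a row filter...
rows = [r for r in classes_bug_info if r["package"] != "legacyAccountMng"]

# ...and 'for num_bugs > 10' becomes a binary target column.
target = [r["num_bugs"] > 10 for r in rows]
features = [{k: v for k, v in r.items() if k != "num_bugs"} for r in rows]

# At this point, the generated code would configure and invoke whichever
# rule-induction or decision-tree algorithm the data mining expert selected
# for the find_reasons command, e.g. (illustrative only):
#   model = DecisionTreeClassifier(max_depth=3).fit(encode(features), target)
```

The point of the framework is that none of this plumbing is ever seen by the query writer.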
To evaluate the benefits of FLANDM, we carried out two different actions. First, we compared the effort of developing DSLs for data mining from scratch and with the help of FLANDM, for four different domains. The results showed that our framework reduces development effort by around 50%. Secondly, we checked whether the generated languages can actually be used by people without expertise in data mining by carrying out some empirical experiments. University teachers from heterogeneous areas used an educational analysis language to study course data from an e-learning platform. At the time of writing, we are still processing the gathered data, but preliminary results indicate that teachers were able to use this language correctly after minimal training.

As mentioned, each executed query indicates a dataset as its input data. In our framework, a dataset is a tabular representation of a bundle of data selected from the domain model. The need for a tabular representation is a requirement imposed by most data mining algorithms. Our framework provides two languages, called Lavoisier and Pinset, which allow non-experts to create datasets from a domain model by themselves, i.e., without the assistance of data mining experts. These languages are briefly described in the next sections.

3 Lavoisier: High-Level Data Selection and Processing

Lavoisier [6] is a language for creating datasets from object-oriented domain models. Dataset creation, i.e., the process of transforming data into a two-dimensional format to serve as the input of an analysis algorithm, is considered one of the key stages of any data mining process [5]. In our framework (Figure 1), once a domain model has been created and populated with accurate and clean data, datasets can be produced by specifying, through a Lavoisier query, the subset of this domain model to be considered for a specific analysis. Then, this subset must be transformed into an analysis-ready dataset that can be digested by data mining tools.
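The kind of join-and-pivot transformation that Lavoisier automates can be sketched in plain Python (a minimal, hypothetical sketch: the nested-dictionary representation and column-naming scheme are invented to mirror the releasesInfo example of Figure 2):

```python
# Hypothetical nested data: each Class owns releases (rs), each release
# owns measurements (ms) of named metrics, as in the Figure 2 domain model.
classes = [
    {"name": "Book", "rs": [
        {"releaseId": "3.1", "ms": {"cbo": 3, "dit": 0}},
        {"releaseId": "3.2", "ms": {"cbo": 4, "dit": 0}},
    ]},
    {"name": "Novel", "rs": [
        {"releaseId": "3.1", "ms": {"cbo": 1, "dit": 1}},
        {"releaseId": "3.2", "ms": {"cbo": 1, "dit": 1}},
    ]},
]

def to_dataset(classes):
    """Flatten the graph-like structure into one wide row per class."""
    rows = []
    for c in classes:
        row = {"name": c["name"]}
        for r in c["rs"]:  # pivot: one group of columns per release
            for metric, value in r["ms"].items():
                row[f'rs {r["releaseId"]} ms {metric}'] = value
        rows.append(row)
    return rows

dataset = to_dataset(classes)
# dataset[0] == {"name": "Book", "rs 3.1 ms cbo": 3, "rs 3.1 ms dit": 0,
#                "rs 3.2 ms cbo": 4, "rs 3.2 ms dit": 0}
```

Writing (and maintaining) this sort of ad hoc flattening code by hand is exactly the technical burden that a Lavoisier query removes.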
The problem of data formatting is illustrated in Figure 2. The top of this figure shows a domain model about quality metrics of a software system. For each class contained in this system, several metrics per release are computed. Examples of these metrics are CBO (Coupling Between Objects) and DIT (Depth of Inheritance Tree). These and other metrics have previously been used, for instance, to predict the defects that will be found in a software release [3].

A domain model represents information in a graph-like format, whereas most analyses require data to be transformed into a tabular format like the one depicted in Figure 2 (bottom left). To perform this task, several data transformation operations, such as joins or pivots, are typically required. Domain experts are key to the proper creation of datasets, since they can give useful input to correctly guide an analysis. Therefore, it would be desirable for these experts to be able to define their own datasets. Nevertheless, domain experts often lack the technical skills to accomplish this task.

Lavoisier tries to alleviate this shortcoming. It provides a high-level syntax that we expect can be used by domain experts, since it hides the technical details of the dataset creation process. Therefore, a domain expert can focus on data selection, rather than on which combination of low-level operations has to be used to obtain data in the required format.

Figure 2 (bottom right) shows an example of dataset creation using Lavoisier. The dataset releasesInfo will be used to compare class metrics across different releases, so each row of this dataset contains the information of one class. We indicate this in the query by selecting Class as the mainClass of the dataset. From each class object, we include its name. As information for the analysis, we include data from all the releases rs of each class.
Each set of columns extracted from a release is identified by its releaseId. Finally, for each release, we include all its measurements, each one corresponding to a metric name (e.g., cbo or dit). Lavoisier automatically uses the value of each measurement to fill the corresponding column. This specification, when executed, produces a dataset like the one shown in Figure 2 (bottom left). It should be noted that, in this case, the number of columns of the resulting dataset varies dynamically depending on the number of releases and gathered metrics.

The execution of a dataset specification is carried out by Lavoisier transparently, freeing domain experts from these low-level details. To perform this execution, Lavoisier employs a set of data transformation patterns [6] that we defined by adapting typical procedures applied in object-relational mappers and in data management tools.

4 Pinset: MDE that Helps Data Mining Help MDE

Following a current trend [1,2], we tried to employ Lavoisier to enable the use of data mining techniques on data extracted from MDE artefacts. During this evaluation, we realised that Lavoisier's high-level syntax might be inadequate for domain experts with programming skills, such as software engineers. We found that some fine-grained aspects of dataset creation, like the computation of aggregate values, cannot be easily specified using Lavoisier constructs. Thus, we extended the initial objectives of Lavoisier to create a new DSL, called Pinset [8], which offers a lower-level syntax for performing such computations.

Listing 1.2. Dataset extraction with Pinset.

1 dataset classAggregates over c : Class {
2     properties [name as className]
3     column numDefectiveReleases : c.rs.select(r | r.ms.exists(m |
4         m.metric.name = "num_bugs" and m.value > 0)).size()
5     ...
6 }

Listing 1.2 shows a dataset extraction over the domain model of Figure 2.
In this dataset, the entities to be analysed are again Classes (line 1). Several metrics are computed for each class. For space reasons, only the numDefectiveReleases metric is shown (lines 3-4); it indicates the number of releases of a class in which at least one defect was detected. This metric is calculated by chaining different operators that are interpreted to generate the resulting value.

5 Conclusions and Future Work

This paper has briefly described our MDE-based contributions to the field of data mining democratisation: the FLANDM framework [7], and the Lavoisier [6] and Pinset [8] languages. As future work, we plan to perform new empirical experiments with these contributions, including new analysis domains. We also want to explore new research lines, such as how to define explainable (white-box) analysis processes for non-experts, or how to allow a more fine-grained configuration of these processes with a controlled increase in complexity.

Acknowledgements. Funded by the University of Cantabria's Doctorate Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.

References

1. Babur, Ö., Cleophas, L., van den Brand, M.: Hierarchical clustering of metamodels for comparative analysis and visualization. In: Modelling Foundations and Applications - 12th European Conference, ECMFA. pp. 3-18 (2016)
2. Basciani, F., Rocco, J.D., Ruscio, D.D., Iovino, L., Pierantonio, A.: Automated clustering of metamodel repositories. In: CAiSE. pp. 342-358 (2016)
3. D'Ambros, M., Lanza, M., Robbes, R.: An extensive comparison of bug prediction approaches. In: IEEE Int. Conf. Mining Software Repositories. pp. 31-41 (2010)
4. de la Vega, A., et al.: How Far are we from Data Mining Democratisation? A Systematic Review. arXiv e-prints 1903.08431 (2019), https://arxiv.org/abs/1903.08431
5. Munson, M.A.: A study on the importance of and time spent on different modeling steps. SIGKDD Explor. Newsl. 13(2), 65-71 (May 2012)
6.
de la Vega, A., García-Saiz, D., Zorrilla, M., Sánchez, P.: On the Automated Transformation of Domain Models into Tabular Datasets. ER FORUM 1979 (2017)
7. de la Vega, A., García-Saiz, D., Zorrilla, M., Sánchez, P.: FLANDM: a development framework of domain-specific languages for data mining democratisation. Computer Languages, Systems and Structures 54, 316-336 (2018)
8. de la Vega, A., Sánchez, P., Kolovos, D.: Pinset: A DSL for Extracting Datasets from Models for Data Mining-Based Quality Analysis. In: Quality of Information and Communications Technology (QUATIC). pp. 83-91 (2018)
9. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. 4th edn. (2016)