
Model-Driven Technologies for Data Mining Democratisation

Software Engineering and Real-Time, University of Cantabria, Santander, Spain


Data mining techniques allow discovering insights previously hidden in data from a domain. However, these techniques demand very specialised skills. People often lack these skills, which hinders data mining democratisation. To alleviate this situation, we defined a model-driven framework and some domain-specific languages that contribute to the democratisation of data mining. Here we summarise these contributions.

Keywords: Model-Driven Engineering · Domain-Specific Languages · Data Mining · Data Mining Democratisation

1 Introduction

Currently, computer systems gather large amounts of data that, when properly analysed, can be of great help for different purposes [9]. For instance, data collected by Uber is being used by different city halls to improve public transport networks, whereas Netflix is using its data to determine its next productions.

Nevertheless, data mining techniques, which can find valuable facts hidden in data, require very specialised skills. For instance, before grouping some data by their similarities, we must decide which one of the dozens of available clustering algorithms best fits our needs. Then, some preprocessing is necessary to adapt the input data to the requirements of the selected algorithm, such as converting categorical values to a numerical representation, or normalising numbers into the range [0, 1]. People willing to analyse data often lack the technical skills to achieve these tasks, which hampers data mining democratisation.
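To make the two preprocessing steps mentioned above concrete, the following minimal Python sketch shows min-max normalisation into [0, 1] and a one-hot encoding of categorical values. The function names are hypothetical and chosen only for illustration; they are not part of FLANDM or of any specific library.

```python
def min_max_normalise(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal: map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Convert categorical values into numerical indicator vectors.

    Categories are ordered alphabetically so the encoding is deterministic.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


if __name__ == "__main__":
    print(min_max_normalise([10, 20, 30]))
    print(one_hot_encode(["bus", "metro", "bus"]))
```

Even for such small steps, the analyst has to know that they are needed and which variant suits the chosen algorithm, which is the kind of expertise the paper argues non-experts usually lack.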

As a first step to address these issues, we analysed the state of the art of the data mining democratisation field by means of a systematic literature review [4]. In this review, more than 700 works were considered, including both research articles and industrial tools. Some conclusions of this review are: (1) generic solutions, which are completely domain-independent, might exhibit accuracy problems, since they do not take into account the particularities of each domain to configure their algorithms or to preprocess input data; and (2) the issue of facilitating the data selection and data formatting stages is scarcely addressed in the literature.

Model-Driven Engineering (MDE) and Domain-Specific Languages (DSLs) have proven to be effective methods for providing domain-adapted solutions that are easy to use and feel familiar to experts in an application domain. Therefore, we explored whether these benefits can be applied to the data mining area. Our initial idea was to create a DSL with a high-level syntax that hid low-level details of the applied mining techniques, so that it could be used by people without expertise in them. This DSL was initially devised to work with data coming from any domain, but ignoring domain details quickly proved infeasible, as the first conclusion of our review states. Thus, we opted to develop FLANDM: a model-driven framework for the rapid generation of DSLs for data mining [7], where generated DSLs are adapted to the specificities of each concrete context.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Additionally, this framework uses two DSLs, Lavoisier [6] and Pinset [8], to support its customisation. These DSLs address the second conclusion of our review by helping with the data transformation steps, i.e., making data conform to the requirements imposed by the applied data mining algorithms.

Our approach has been validated by generating DSLs for several domains, with a special focus on the analysis of data extracted from e-learning platforms, web systems, and model-driven artefacts [8]. Moreover, we performed a set of empirical experiments to assess whether the generated DSLs can actually be used by people without knowledge of data mining techniques.

The rest of this paper is organised as follows: Section 2 introduces FLANDM, i.e., our framework for the generation of DSLs for data mining. Sections 3 and 4 describe Lavoisier and Pinset, which are our languages for the transformation of data into an analysis-ready format. Finally, Section 5 concludes this work.

2 FLANDM: A Model-Driven Framework for the Generation of DSLs for Data Mining Democratisation

To address the first issue stated in the introduction, some authors created frameworks for the development of data mining applications, where an expert initially configures some elements of the framework so that the resulting application is adapted to a specific domain. In these cases, it is important to reduce the intervention of experts as much as possible, in order to decrease development cost.

With this idea in mind, we created FLANDM (Framework to develop LANguages for Data Mining) [7]. FLANDM is an MDE-based framework that can be used to create DSLs for data mining democratisation. These DSLs hide technical details of the applied analysis techniques behind a high-level, query-based syntax, in order to be usable by people without expertise in data mining. Generated DSLs are adapted to the particularities of each domain, which makes them feel familiar to use, and contributes to improving the accuracy of the analyses.

Figure 1 provides a general overview of how FLANDM works. As in any data mining process, we start with a set of business questions to be answered. For instance, a software engineer might want to know why some classes of a software system are more likely to contain bugs than others.

These questions are complemented with a characterisation of the analysis context by means of a domain model. The purpose of this domain model is twofold: (1) to indicate the terminology with which domain experts are familiar; and (2) to specify the available data for the analysis. These data might be present in a well-defined source, such as a relational database; or it might need to be extracted from several sources. For instance, continuing with our previous example, we can use as data some quality metrics computed for each class of the software system. In addition, these data could be complemented with information extracted from a bug tracking tool. The steps of extracting and integrating data from different sources are not currently addressed by FLANDM, and need to be performed manually.

[Figure 1: overview of how FLANDM works. Diagram labels: input, output, Business Questions, Data, FLANDM, Domain-Specific Analysis Language, Answers.]

Listing 1.1. Query examples of an analysis language generated with FLANDM.

    1 find_reasons for num_bugs > 10 of classes_bug_info;
    2
    3 find_reasons for num_bugs > 10 of classes_bug_info
    4   with package not_equals "legacyAccountMng";

Using this information as input for FLANDM, we could generate a query-based language such as the one depicted in Listing 1.1. As can be seen, the employed terms (num_bugs, class, package) should be familiar to software engineers. The structure of these sentences would be similar for all domains. Each query is composed of a command, which specifies the kind of answer to be computed; a dataset, which determines the data to be used for that analysis; and, optionally, filters that might exclude some data from the analysis. In line 1 of Listing 1.1, we try to find reasons that lead to a number of bugs higher than a specific threshold using a dataset called classes_bug_info. In lines 3-4 we perform the same query, but in this case we omit the classes of package "legacyAccountMng" from the analysis.
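The command/dataset/filters structure of a query can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `Query` and `Filter` classes, the `apply_filters` function and the `equals` operator are hypothetical names introduced here, and do not reflect how FLANDM actually represents or processes queries (only `find_reasons`, `not_equals` and the dataset and attribute names come from Listing 1.1).

```python
from dataclasses import dataclass, field


@dataclass
class Filter:
    attribute: str   # e.g. "package"
    operator: str    # e.g. "not_equals"
    value: object    # e.g. "legacyAccountMng"


@dataclass
class Query:
    command: str     # kind of answer to compute, e.g. "find_reasons"
    target: str      # condition of interest, e.g. "num_bugs > 10"
    dataset: str     # name of the dataset to analyse
    filters: list = field(default_factory=list)


# "not_equals" appears in Listing 1.1; "equals" is an assumed counterpart.
OPERATORS = {
    "equals": lambda a, b: a == b,
    "not_equals": lambda a, b: a != b,
}


def apply_filters(query, rows):
    """Keep only the rows that satisfy every filter of the query."""
    return [
        row for row in rows
        if all(OPERATORS[f.operator](row[f.attribute], f.value)
               for f in query.filters)
    ]


if __name__ == "__main__":
    # Second query of Listing 1.1, expressed with the hypothetical classes.
    query = Query("find_reasons", "num_bugs > 10", "classes_bug_info",
                  [Filter("package", "not_equals", "legacyAccountMng")])
    rows = [
        {"name": "Account", "package": "legacyAccountMng", "num_bugs": 12},
        {"name": "Invoice", "package": "billing", "num_bugs": 3},
    ]
    print(apply_filters(query, rows))
```

Filtering happens before any analysis, so the command only ever sees the rows the user chose to keep.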

These high-level sentences are translated, by means of model transformation and code generation techniques, into low-level code that configures and invokes specific data mining algorithms. This generated code is then executed to provide an answer to the specified query.

Both the DSL generation infrastructure and the sentence transformation process have been designed so that they can be easily configured by data mining experts to fit the particularities of each domain. For instance, an expert can easily change the underlying algorithm that is used to compute a specific command, or fine-tune some of its parameters.
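The idea of an expert rebinding the algorithm behind a command, or adjusting its parameters, can be sketched as a simple dispatch table. This is a loose analogy only: FLANDM achieves this through model transformations and code generation, whereas the `CommandBindings` class, the command name `summarise_bugs` and the two toy analyses below are hypothetical and exist purely for illustration.

```python
def mean_of(rows, column):
    """Toy analysis: arithmetic mean of a numeric column."""
    return sum(row[column] for row in rows) / len(rows)


def share_above(rows, column, threshold):
    """Toy analysis: fraction of rows whose column value exceeds a threshold."""
    return sum(1 for row in rows if row[column] > threshold) / len(rows)


class CommandBindings:
    """Maps each command name to an algorithm plus its parameters."""

    def __init__(self):
        self._bindings = {}

    def bind(self, command, algorithm, **params):
        # Rebinding a command replaces both the algorithm and its parameters.
        self._bindings[command] = (algorithm, params)

    def run(self, command, rows):
        algorithm, params = self._bindings[command]
        return algorithm(rows, **params)


if __name__ == "__main__":
    rows = [{"num_bugs": 2}, {"num_bugs": 12}, {"num_bugs": 16}]
    bindings = CommandBindings()

    # Initial configuration chosen by the (hypothetical) expert.
    bindings.bind("summarise_bugs", mean_of, column="num_bugs")
    print(bindings.run("summarise_bugs", rows))

    # The expert swaps the algorithm and tunes a parameter; users still
    # issue the same command.
    bindings.bind("summarise_bugs", share_above, column="num_bugs", threshold=10)
    print(bindings.run("summarise_bugs", rows))
```

The point of the sketch is that the command name seen by end users stays stable while the expert changes what runs behind it.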

To evaluate the benefits of FLANDM, we carried out two different actions. First, we compared the effort of developing DSLs for data mining from scratch and with the help of FLANDM, for four different domains. Results showed that our framework reduces development effort by around 50%. Secondly, we checked whether the generated languages can actually be used by people without expertise in data mining by carrying out some empirical experiments. University teachers from heterogeneous areas used an educational analysis language to study course data from an e-learning platform. At the time of writing this paper, we are still processing the gathered data, but preliminary results indicate that teachers were able to correctly use this language after minimal training.

As mentioned above, each executed query indicates a dataset as input data. In our framework, a dataset is a tabular representation of a data bundle selected from the domain model. The tabular format is a requirement imposed by most data mining algorithms. Our framework provides two languages, called Lavoisier and Pinset, which allow non-experts to create datasets from a domain model by themselves, i.e., without the assistance of data mining experts. These languages are briefly described in the following sections.

3 Lavoisier: High-Level Data Selection and Processing

Lavoisier [6] is a language for creating datasets from object-oriented domain models. Dataset creation, i.e., the process of transforming data into a two-dimensional format to serve as input of an analysis algorithm, is considered one of the key stages of any data mining process [5]. In our framework (Figure 1), once a domain model has been created and populated with accurate and clean data, datasets can be produced by specifying, through a Lavoisier query, a subset of this domain model to be considered for a specific analysis. Then, this subset must be transformed into an analysis-ready dataset to be digested by data mining tools.

The problem of data formatting is illustrated in Figure 2. The top of this figure shows a domain model about quality metrics of a software system. For each class contained in this system, several metrics per release are computed. Examples of these metrics could be CBO (Coupling Between Objects) or DIT (Depth of Inheritance Tree). These and other metrics have been previously used, for instance, to predict the defects that will be found in a software release [3].

A domain model represents information in a graph-like format, whereas most analyses require data to be transformed into a tabular format like the one depicted in Figure 2 (bottom left). To perform this task, several data transformation operations, such as joins or pivots, are typically used.

Domain experts are key for the proper creation of datasets, since they might give some useful input to correctly guide an analysis. So, it would be desirable if these experts were able to define their own datasets. Nevertheless, domain experts often lack the technical skills to accomplish this task.

Lavoisier tries to alleviate this shortcoming. This language provides a high-level syntax that we expect can be used by domain experts, since it tries to hide any technical details of the dataset creation process. Therefore, a domain expert might focus on data selection, rather than on which combination of low-level operations has to be used to obtain data in the required format.

Figure 2 (bottom right) shows an example of dataset creation using Lavoisier. The dataset releasesInfo will be used to compare class metrics of different releases, so each row of this dataset would contain the information of a class. We indicate this in the query by selecting Class as the mainClass of the dataset. From each class object, we include its name. As information for the analysis, we include data from all the releases rs of each class. Each set of columns extracted from a release will be identified by its releaseId. Finally, for each release, we include all measurements, each one corresponding to a metric name (e.g. cbo or dit). Lavoisier automatically uses the value of a measurement to fill the corresponding columns. This specification, when executed, produces a dataset like the one shown in Figure 2 (bottom left). It should be noted that, in this case, the number of columns of the resulting dataset varies dynamically depending on the number of releases and gathered metrics.
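The kind of flattening described above, one row per class with one column per release and metric, can be approximated in plain Python. This is a minimal sketch, not Lavoisier's implementation: the class definitions are a simplified guess at the domain model of Figure 2, and the `<releaseId>_<metric>` column naming scheme and the `build_dataset` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    metric: str      # metric name, e.g. "cbo" or "dit"
    value: float


@dataclass
class Release:
    release_id: str
    ms: list = field(default_factory=list)   # measurements of this release


@dataclass
class Class:
    name: str
    rs: list = field(default_factory=list)   # releases of this class


def build_dataset(classes):
    """Pivot classes -> releases -> measurements into one row per class."""
    columns = ["name"]
    rows = []
    for c in classes:
        row = {"name": c.name}
        for r in c.rs:
            for m in r.ms:
                # Assumed naming scheme: "<releaseId>_<metric>".
                column = f"{r.release_id}_{m.metric}"
                row[column] = m.value
                if column not in columns:
                    columns.append(column)
        rows.append(row)
    # Cells with no measurement are filled with None so all rows align.
    table = [[row.get(col) for col in columns] for row in rows]
    return columns, table


if __name__ == "__main__":
    classes = [
        Class("Account", [
            Release("v1", [Measurement("cbo", 4), Measurement("dit", 2)]),
            Release("v2", [Measurement("cbo", 6), Measurement("dit", 2)]),
        ]),
        Class("Invoice", [Release("v1", [Measurement("cbo", 1)])]),
    ]
    columns, table = build_dataset(classes)
    print(columns)
    for line in table:
        print(line)
```

As in the Lavoisier example, the set of columns is not fixed in advance: it grows with the releases and metrics actually present in the data.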

The execution of a dataset specification is carried out by Lavoisier transparently, freeing domain experts of these low-level details. To perform this execution, Lavoisier employs a set of data transformation patterns [6] that we defined by adapting some typical procedures applied in object-relational data mappers and in data management tools.

4 Pinset: MDE that Helps Data Mining Help MDE

Following a current trend [1,2], we tried to employ Lavoisier to enable the use of data mining techniques on data extracted from MDE artefacts. During this evaluation, we realised that Lavoisier's high-level syntax might be inadequate for domain experts with programming skills, such as software engineers. We found that some fine-grained aspects of dataset creation, like the computation of aggregate values, cannot be easily specified using Lavoisier constructs. Thus, we extended the initial objectives of Lavoisier to create a new DSL, called Pinset [8], which offers a lower-level syntax for performing some computations.

Listing 1.2. Dataset extraction with Pinset.

    1 dataset classAggregates over c : Class {
    2   properties [name as className]
    3   column numDefectiveReleases : c.rs.select(r | r.ms.exists(m |
    4     m.metric.name = "num_bugs" and m.value > 0)).size()
    5   ...
    6 }

Listing 1.2 shows a dataset extraction over the domain model of Figure 2. In this dataset, the entities to be analysed are again Classes (line 1). Several metrics are computed for each class. For space reasons, only the numDefectiveReleases metric is shown (lines 3-4), which indicates the number of releases per class where at least one defect was detected. This metric is calculated by chaining different operators that are interpreted to generate the resulting value.
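For readers more used to general-purpose languages, the numDefectiveReleases column of Listing 1.2 can be read as the following Python computation. This is an equivalent sketch only, not how Pinset evaluates the expression: the dataclasses are a simplified guess at the domain model of Figure 2, and the function name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str


@dataclass
class Measurement:
    metric: Metric
    value: float


@dataclass
class Release:
    release_id: str
    ms: list = field(default_factory=list)   # measurements of this release


@dataclass
class Class:
    name: str
    rs: list = field(default_factory=list)   # releases of this class


def num_defective_releases(c):
    """Count the releases of class c with at least one reported bug.

    Mirrors c.rs.select(r | r.ms.exists(m | ...)).size() from Listing 1.2:
    the generator plays the role of select, any() the role of exists.
    """
    return sum(
        1 for r in c.rs
        if any(m.metric.name == "num_bugs" and m.value > 0 for m in r.ms)
    )


if __name__ == "__main__":
    bugs = Metric("num_bugs")
    account = Class("Account", [
        Release("v1", [Measurement(bugs, 2)]),
        Release("v2", [Measurement(bugs, 0)]),
        Release("v3", [Measurement(bugs, 5)]),
    ])
    print(account.name, num_defective_releases(account))
```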

5 Conclusions and Future Work

This paper has briefly described our MDE-based contributions in the field of data mining democratisation: the FLANDM framework [7], and the Lavoisier [6] and Pinset [8] languages. As future work, we plan to perform new empirical experiments on these contributions, also including new analysis domains. We also want to explore new research lines, such as how to define explainable (white-box) analysis processes for non-experts, or how to allow for a more fine-grained configuration of these processes with a controlled increase in complexity.

Acknowledgements. Funded by the University of Cantabria's Doctorate Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.

References

1. Babur, O., Cleophas, L., van den Brand, M.: Hierarchical clustering of metamodels for comparative analysis and visualization. In: Modelling Foundations and Applications - 12th European Conference, ECMFA. pp. 3–18 (2016)

2. Basciani, F., Rocco, J.D., Ruscio, D.D., Iovino, L., Pierantonio, A.: Automated clustering of metamodel repositories. In: CAiSE. pp. 342–358 (2016)

3. D'Ambros, M., Lanza, M., Robbes, R.: An extensive comparison of bug prediction approaches. In: IEEE Int. Conf. Mining Software Repositories. pp. 31–41 (2010)

4. de la Vega, A., et al.: How Far are we from Data Mining Democratisation? A Systematic Review. arXiv e-prints 1903.08431 (2019), https://arxiv.org/abs/1903.08431

5. Munson, M.A.: A study on the importance of and time spent on different modeling steps. SIGKDD Explor. Newsl. 13(2), 65–71 (May 2012)

6. de la Vega, A., García-Saiz, D., Zorrilla, M., Sanchez, P.: On the Automated Transformation of Domain Models into Tabular Datasets. ER FORUM 1979 (2017)

7. de la Vega, A., García-Saiz, D., Zorrilla, M., Sanchez, P.: FLANDM: a development framework of domain-specific languages for data mining democratisation. Computer Languages, Systems and Structures 54, 316–336 (2018)

8. de la Vega, A., Sanchez, P., Kolovos, D.: Pinset: A DSL for Extracting Datasets from Models for Data Mining-Based Quality Analysis. Quality of Information and Communications Technology (QUATIC) pp. 83–91 (2018)

9. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. 4th edn. (2016)