<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model-Driven Technologies for Data Mining Democratisation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Engineering and Real-Time, University of Cantabria</institution>
          ,
          <addr-line>Santander</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>9</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Data mining techniques make it possible to discover insights previously hidden in data from a domain. However, these techniques demand very specialised skills. People often lack these skills, which hinders data mining democratisation. To alleviate this situation, we defined a model-driven framework and some domain-specific languages that contribute to the democratisation of data mining. Here we summarise these contributions.</p>
      </abstract>
      <kwd-group>
        <kwd>Model-Driven Engineering</kwd>
        <kwd>Domain-Specific Languages</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Data Mining Democratisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Currently, computer systems gather large amounts of data that, when properly
analysed, can be of great help for different purposes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For instance, data
collected by Uber is being used by different city halls to improve public transport
networks, whereas Netflix is using its data to determine its next productions.
      </p>
      <p>
        Nevertheless, data mining techniques, which can find valuable facts hidden in
data, require very specialised skills. For instance, before grouping some data by
their similarities, we must decide which one of the dozens of available clustering
algorithms best fits our needs. Then, some preprocessing is necessary to
adapt the input data to the requirements of the selected algorithm, such as
converting categorical values to a numerical representation, or normalising numbers
into the range [0, 1]. People willing to analyse data often lack the technical skills
to carry out these tasks, which hampers data mining democratisation.
      </p>
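      <p>The two preprocessing steps mentioned above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names and the sample values are hypothetical, and real analyses would normally rely on a data mining library rather than hand-written helpers.</p>
      <preformat>
```python
# Illustrative preprocessing helpers; all names and data are hypothetical.

def min_max_normalise(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Replace each categorical value by a 0/1 indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


sizes = [10, 20, 40]
kinds = ["legacy", "core", "legacy"]
normalised = min_max_normalise(sizes)
encoded = one_hot_encode(kinds)
```
      </preformat>
      <p>Even in this small case the analyst has to choose an encoding and a scaling strategy, which is the kind of decision that non-experts are usually not prepared to make.</p>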
      <p>
        As a first step to address these issues, we analysed the state of the art of the
data mining democratisation field by means of a systematic literature review [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In this review, more than 700 works were considered, including both research
articles and industrial tools. Some conclusions of this review are: (1) generic
solutions, which are completely domain-independent, might exhibit accuracy
problems, since they do not take into account the particularities of each domain
to configure their algorithms or to preprocess input data; and (2) the issue of
facilitating the data selection and data formatting stages is scarcely addressed
in the literature.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Model-Driven Engineering (MDE) and Domain-Specific Languages (DSLs)
have proven to be effective methods to provide domain-adapted solutions
that are easy to use and feel familiar to experts in an application domain.
Therefore, we explored whether these benefits can be applied to the data mining area.
Our initial idea was to create a DSL with a high-level syntax that hid low-level
details of the applied mining techniques, so that it could be used by people
without expertise in them. This DSL was initially devised to work with data
coming from any domain, but ignoring domain details quickly proved to be an
unfeasible option, as the first conclusion of our review states. Thus, we opted to
develop FLANDM: a model-driven framework for the rapid generation of DSLs
for data mining [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where generated DSLs are adapted to the specificities of
each concrete context.
      </p>
      <p>
        Additionally, this framework uses two DSLs, Lavoisier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Pinset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
to support its customisation. These DSLs address the second conclusion of our
review by helping with the data transformation steps, i.e., making data conform
to the requirements imposed by the applied data mining algorithms.
      </p>
      <p>
        Our approach has been validated by generating DSLs for several domains,
with a special focus on the analysis of data extracted from e-learning platforms,
web systems, and model-driven artefacts [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Moreover, we performed
a set of empirical experiments to assess whether the generated DSLs can
actually be used by people without knowledge of data mining techniques.
      </p>
      <p>The rest of this paper is organised as follows: Section 2 introduces FLANDM,
i.e., our framework for the generation of DSLs for data mining. Sections 3 and 4
describe Lavoisier and Pinset, which are our languages for the transformation
of data into an analysis-ready format. Finally, Section 5 concludes this work.</p>
    </sec>
    <sec id="sec-3">
      <title>FLANDM: A Model-Driven Framework for the Generation of DSLs for Data Mining Democratisation</title>
      <p>To address the first issue stated in the introduction, some authors created
frameworks for the development of data mining applications, where an expert initially
configures some elements of the framework so that the resulting application is
adapted to a specific domain. In these cases, it is important to reduce the
intervention of experts as much as possible, in order to decrease development cost.</p>
      <p>
        With this idea in mind, we created FLANDM (Framework to develop
LANguages for Data Mining) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. FLANDM is an MDE-based framework that can be
used to create DSLs for data mining democratisation. These DSLs hide technical
details of the applied analysis techniques behind a high-level, query-based
syntax, so that they can be used by people without expertise in data mining. Generated
DSLs are adapted to the particularities of each domain, which makes them feel
familiar to use, and contributes to improving the accuracy of the analyses.
      </p>
      <p>Figure 1 provides a general overview of how FLANDM works. As in any
data mining process, we start with a set of business questions to be
answered. For instance, a software engineer might want to know why some classes
of a software system are more likely to contain bugs than others.</p>
      <p>These questions are complemented with a characterisation of the analysis
context by means of a domain model. The purpose of this domain model is
twofold: (1) to indicate the terminology with which domain experts are familiar;
and (2) to specify the available data for the analysis. These data might be
present in a well-defined source, such as a relational database, or they might need
to be extracted from several sources. For instance, continuing with our previous
example, we can use as data some quality metrics computed for each class of the
software system. In addition, these data could be complemented with information
extracted from a bug tracking tool. The steps of extracting and integrating data
from different sources are not currently addressed by FLANDM, and need to be
performed manually.</p>
      <p>[Figure 1: overview of FLANDM. Diagram labels: input, Business Questions, Data, FLANDM, Domain-Specific Analysis Language, output, Answers.]</p>
      <p>Listing 1.1. Query examples of an analysis language generated with FLANDM.</p>
      <preformat>1 find_reasons for num_bugs &gt; 10 of classes_bug_info;
2
3 find_reasons for num_bugs &gt; 10 of classes_bug_info
4   with package not_equals "legacyAccountMng";</preformat>
      <p>Using this information as input for FLANDM, we could generate a
query-based language such as the one depicted in Listing 1.1. As can be seen, the
employed terms (num_bugs, class, package) should be familiar to software
engineers. The structure of these sentences would be similar for all domains. Each
query is composed of a command, which specifies the kind of answer to be
computed; a dataset, which determines the data to be used for that analysis; and,
optionally, filters that might exclude some data from the analysis. In Listing 1.1,
line 1, we try to find reasons that lead to a number of bugs higher than a specific
threshold using a dataset called classes_bug_info. In lines 3-4 we perform the same
query, but in this case we omit the classes of the package "legacyAccountMng"
from the analysis.</p>
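      <p>To make the structure of a query more tangible, the following Python sketch splits the second query of Listing 1.1 into its parts. It is not FLANDM's parser, which is generated with model-driven tooling. The function name, the dictionary keys and the term "condition" for the argument of the command are our own assumptions, introduced for illustration only.</p>
      <preformat>
```python
# Hypothetical sketch: splitting a query into command, dataset and filter.
# This is not FLANDM's parser, only an illustration of the query structure.

def parse_query(text):
    """Split 'COMMAND for CONDITION of DATASET [with FILTER];' into parts."""
    body = text.strip().rstrip(";")
    command, rest = body.split(" for ", 1)
    condition, rest = rest.split(" of ", 1)
    if " with " in rest:
        dataset, filter_text = rest.split(" with ", 1)
    else:
        dataset, filter_text = rest, None
    return {
        "command": command.strip(),
        "condition": condition.strip(),
        "dataset": dataset.strip(),
        "filter": filter_text.strip() if filter_text else None,
    }


query = parse_query(
    'find_reasons for num_bugs > 10 of classes_bug_info '
    'with package not_equals "legacyAccountMng";'
)
```
      </preformat>
      <p>Applied to the first query of Listing 1.1, the same function would return no filter, since that part of a query is optional.</p>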
      <p>These high-level sentences are translated, by means of model transformation
and code generation techniques, into low-level code that configures and invokes
specific data mining algorithms. This generated code is then executed to provide
an answer to the specified query.</p>
      <p>Both the DSL generation infrastructure and the sentence transformation
process have been designed so that they can be easily configured by data mining
experts to fit the particularities of each domain. For instance, an expert
can easily change the underlying algorithm that is used to compute a specific
command, or fine-tune some of its parameters.</p>
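      <p>One way to picture this kind of configurability is a table that binds each command to an algorithm and its parameters. The Python sketch below is purely hypothetical: FLANDM achieves this through its generation infrastructure, and the algorithm names, parameters and stand-in functions used here (which only return a description and do not mine anything) are assumptions made for the example.</p>
      <preformat>
```python
# Hypothetical sketch of a per-command configuration that an expert could edit.
# Command bindings, algorithm names and parameters are illustrative assumptions.

def decision_tree(rows, max_depth):
    """Stand-in for a real algorithm: only reports how it was invoked."""
    return "decision_tree(max_depth={}) on {} rows".format(max_depth, len(rows))


def rule_miner(rows, min_support):
    """Stand-in for an alternative algorithm."""
    return "rule_miner(min_support={}) on {} rows".format(min_support, len(rows))


# Each command is bound to an algorithm and its parameters.
COMMAND_CONFIG = {
    "find_reasons": {"algorithm": decision_tree, "params": {"max_depth": 3}},
}


def run_command(command, rows):
    """Look up the configured algorithm for a command and invoke it."""
    entry = COMMAND_CONFIG[command]
    return entry["algorithm"](rows, **entry["params"])


rows = [{"num_bugs": 12}, {"num_bugs": 0}]
before = run_command("find_reasons", rows)

# An expert swaps the underlying algorithm without touching the queries.
COMMAND_CONFIG["find_reasons"] = {
    "algorithm": rule_miner,
    "params": {"min_support": 0.1},
}
after = run_command("find_reasons", rows)
```
      </preformat>
      <p>The queries written by domain experts stay the same in both cases; only the binding behind the command changes.</p>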
      <p>To evaluate the benefits of FLANDM, we carried out two different actions.
First, we compared the effort of developing DSLs for data mining from scratch
and with the help of FLANDM, for four different domains. Results showed that
our framework reduces development effort by around 50%. Secondly, we
checked whether the generated languages can actually be used by people
without expertise in data mining by carrying out some empirical experiments.
University teachers from heterogeneous areas used an educational analysis language
to study course data from an e-learning platform. At the time of writing this
paper, we are still processing the gathered data, but preliminary results indicate
that teachers were able to use this language correctly after minimal training.</p>
      <p>As mentioned, each executed query indicates a dataset as input data. In our
framework, a dataset is a tabular representation of a data bundle selected from
the domain model. The tabular form is a requirement imposed by most
data mining algorithms. Our framework provides two languages, called Lavoisier
and Pinset, which allow non-experts to create datasets from a domain model by
themselves, i.e., without the assistance of data mining experts. These languages
are briefly described in the next sections.</p>
    </sec>
    <sec id="sec-4">
      <title>Lavoisier: High-Level Data Selection and Processing</title>
      <p>
        Lavoisier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a language for creating datasets from object-oriented domain
models. Dataset creation, i.e., the process of transforming data into a
two-dimensional format to serve as input of an analysis algorithm, is considered
one of the key stages of any data mining process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In our framework
(Figure 1), once a domain model has been created and populated with accurate and
clean data, datasets can be produced by specifying, through a Lavoisier query,
a subset of this domain model to be considered for a specific analysis. Then,
this subset must be transformed into an analysis-ready dataset to be digested
by data mining tools.
      </p>
      <p>
        The problem of data formatting is illustrated in Figure 2. The top of this
figure shows a domain model about quality metrics of a software system. For
each class contained in this system, several metrics per release are computed.
Examples of these metrics could be CBO (Coupling Between Objects) or DIT
(Depth of Inheritance Tree). These and other metrics have been previously used,
for instance, to predict the defects that will be found in a software release [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>A domain model represents information in a graph-like format, whereas most
analyses require data to be transformed into a tabular format like the one
depicted in Figure 2 (bottom left). To perform this task, several data
transformation operations, such as joins or pivots, are typically used.</p>
      <p>Domain experts are key for the proper creation of datasets, since they can
provide useful input to correctly guide an analysis. So, it would be desirable
for these experts to be able to define their own datasets. Nevertheless, domain
experts often lack the technical skills to accomplish this task.</p>
      <p>Lavoisier tries to alleviate this shortcoming. This language provides a
high-level syntax that we expect can be used by domain experts, since it tries to hide
any technical details of the dataset creation process. Therefore, a domain expert
might focus on data selection, rather than on which combination of low-level
operations has to be used to obtain data in the required format.</p>
      <p>Figure 2 (bottom right) shows an example of dataset creation using Lavoisier.
The dataset releasesInfo will be used to compare class metrics of different
releases, so each row of this dataset would contain the information of a class. We
indicate this in the query by selecting Class as the mainClass of the dataset.
From each class object, we include its name. As information for the analysis,
we include data from all the releases rs of each class. Each set of columns
extracted from a release will be identified by its releaseId. Finally, for each
release, we include all measurements, each one corresponding to a metric name
(e.g. cbo or dit). Lavoisier automatically uses the value of a measurement to
fill the corresponding columns. This specification, when executed, produces a
dataset like the one shown in Figure 2 (bottom left). It should be noted that,
in this case, the number of columns of the resulting dataset varies dynamically
depending on the number of releases and gathered metrics.</p>
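      <p>The effect of such a specification can be approximated by the following Python sketch, which flattens a small object graph into one row per class. It does not reproduce Lavoisier's implementation or syntax. The sample objects, the dictionary encoding of the domain model and the column naming scheme (releaseId, underscore, metric name) are assumptions made for the example.</p>
      <preformat>
```python
# Hypothetical data shaped like the domain model: classes own releases (rs),
# and releases own measurements (ms). The column naming is an assumption.
classes = [
    {"name": "Account", "rs": [
        {"releaseId": "v1", "ms": [{"metric": "cbo", "value": 4},
                                   {"metric": "dit", "value": 2}]},
        {"releaseId": "v2", "ms": [{"metric": "cbo", "value": 6},
                                   {"metric": "dit", "value": 2}]},
    ]},
    {"name": "Ledger", "rs": [
        {"releaseId": "v1", "ms": [{"metric": "cbo", "value": 1}]},
    ]},
]


def flatten(cls):
    """Build one row per class, with one column per (release, metric) pair."""
    row = {"name": cls["name"]}
    for release in cls["rs"]:
        for m in release["ms"]:
            row[release["releaseId"] + "_" + m["metric"]] = m["value"]
    return row


rows = [flatten(c) for c in classes]
# The column set depends on the releases and metrics actually present.
columns = sorted({key for row in rows for key in row})
# Missing cells are filled with None so that every row has the same width.
table = [[row.get(col) for col in columns] for row in rows]
```
      </preformat>
      <p>Adding a release or a metric to the input changes the set of columns, which mirrors the dynamic number of columns noted above.</p>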
      <p>
        The execution of a dataset specification is carried out by Lavoisier
transparently, freeing domain experts of these low-level details. To perform this execution,
Lavoisier employs a set of data transformation patterns [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that we defined by
adapting some typical procedures applied in object-relational data mappers and
in data management tools.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Pinset: MDE that Helps Data Mining Help MDE</title>
      <p>
        Following a current trend [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ], we tried to employ Lavoisier to enable the use
of data mining techniques on data extracted from MDE artefacts. During this
evaluation, we realised that Lavoisier's high-level syntax might be inadequate
for domain experts with programming skills, such as software engineers. We
found that some fine-grained aspects of dataset creation, like the computation of
aggregate values, cannot be easily specified using Lavoisier constructs. Thus, we
extended the initial objectives of Lavoisier to create a new DSL, called Pinset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
which offers a lower-level syntax for performing some computations.
      </p>
      <p>Listing 1.2. Dataset extraction with Pinset.</p>
      <preformat>1 dataset classAggregates over c : Class {
2   properties [name as className]
3   column numDefectiveReleases : c.rs.select(r | r.ms.exists(m |
4     m.metric.name = "num_bugs" and m.value &gt; 0)).size()
5   ...
6 }</preformat>
      <p>Listing 1.2 shows a dataset extraction over the domain model of Figure 2. In
this dataset, the entities to be analysed are again Classes (line 1). Several metrics
are computed for each class. For space reasons, only the numDefectiveReleases
metric is shown (lines 3-4), which indicates the number of releases per class
where at least one defect was detected. This metric is calculated by chaining
different operators that are interpreted to generate the resulting value.</p>
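      <p>For readers more familiar with general-purpose languages, the chained expression of Listing 1.2 can be read roughly as the Python function below. This is a sketch and not code generated by Pinset: the dictionary encoding of the model and the sample class are assumptions, while the attribute names (rs, ms, metric.name, value) follow the listing.</p>
      <preformat>
```python
# Rough Python reading of the numDefectiveReleases column of Listing 1.2.
# The dictionary encoding of the model and the sample data are hypothetical.

def num_defective_releases(cls):
    """Count the releases that have a num_bugs measurement greater than zero."""
    return sum(
        1
        for r in cls["rs"]
        if any(
            m["metric"]["name"] == "num_bugs" and m["value"] > 0
            for m in r["ms"]
        )
    )


account = {
    "name": "Account",
    "rs": [
        {"releaseId": "v1", "ms": [{"metric": {"name": "num_bugs"}, "value": 2}]},
        {"releaseId": "v2", "ms": [{"metric": {"name": "num_bugs"}, "value": 0}]},
        {"releaseId": "v3", "ms": [{"metric": {"name": "num_bugs"}, "value": 5},
                                   {"metric": {"name": "cbo"}, "value": 7}]},
    ],
}
defective = num_defective_releases(account)
```
      </preformat>
      <p>Here the outer generator plays the role of select followed by size, and any plays the role of exists.</p>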
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>
        This paper has briefly described our MDE-based contributions in the field of
data mining democratisation: the FLANDM framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the Lavoisier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and Pinset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] languages. As future work, we plan to perform new empirical
evaluations of these contributions, including new analysis domains. We
also want to explore new research lines, such as how to define explainable
(white-box) analysis processes for non-experts, or how to allow for a more fine-grained
configuration of these processes with a controlled increase in complexity.
      </p>
      <p>Acknowledgements. Funded by the University of Cantabria's Doctorate
Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Babur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cleophas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van den Brand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Hierarchical clustering of metamodels for comparative analysis and visualization</article-title>
          .
          <source>In: Modelling Foundations and Applications - 12th European Conference</source>
          , ECMFA. pp.
          <fpage>3</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Basciani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocco</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruscio</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iovino</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pierantonio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automated clustering of metamodel repositories</article-title>
          .
          <source>In: CAiSE</source>
          . pp.
          <fpage>342</fpage>
          –
          <lpage>358</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>D'Ambros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robbes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An extensive comparison of bug prediction approaches</article-title>
          .
          <source>In: IEEE Int. Conf. Mining Software Repositories</source>
          . pp.
          <fpage>31</fpage>
          –
          <lpage>41</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>de la Vega</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>How Far are we from Data Mining Democratisation? A Systematic Review</article-title>
          . arXiv e-prints 1903.08431 (
          <year>2019</year>
          ), https://arxiv.org/abs/1903.08431
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Munson</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>A study on the importance of and time spent on different modeling steps</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>13</volume>
          (
          <issue>2</issue>
          ),
          <fpage>65</fpage>
          –
          <lpage>71</lpage>
          (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>de la Vega</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Saiz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zorrilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the Automated Transformation of Domain Models into Tabular Datasets</article-title>
          .
          <source>ER FORUM</source>
          <volume>1979</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>de la Vega</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Saiz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zorrilla</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>FLANDM: a development framework of domain-specific languages for data mining democratisation</article-title>
          .
          <source>Computer Languages, Systems and Structures</source>
          <volume>54</volume>
          ,
          <fpage>316</fpage>
          –
          <lpage>336</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>de la Vega</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Pinset: A DSL for Extracting Datasets from Models for Data Mining-Based Quality Analysis</article-title>
          .
          <source>Quality of Information and Communications Technology (QUATIC)</source>
          pp.
          <fpage>83</fpage>
          –
          <lpage>91</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <source>Data Mining: Practical Machine Learning Tools and Techniques. 4th edn</source>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>