Data Mining Design and Systematic Modelling

© Yannic Kropp, © Bernhard Thalheim
Christian Albrechts University Kiel, Department of Computer Science, D-24098 Kiel, Germany
yk@is.informatik.uni-kiel.de, thalheim@is.informatik.uni-kiel.de
Abstract. Data mining is now a well-established technique supported by many algorithms. It depends on the data at hand, on the properties of the algorithms, on the technology developed so far, and on the expectations and limits that apply. It must therefore be mature, predictable, optimisable, evolving, adaptable and well-founded, similar to mathematics and SPICE/CMM-based software engineering. Data mining must thus be systematic if its results are to fit their purpose. One basis of such a systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The central notion is that of the model in a variety of forms, together with abstraction and associations among models.

Keywords: data mining, modelling, models, framework, deep model, normal model, modelling matrix

1 Introduction

Data mining and analysis is nowadays well understood from the algorithms side. Thousands of algorithms have been proposed. The number of success stories is overwhelming and has caused the big data hype. At the same time, brute-force application of algorithms is still the standard. Data analysis and data mining algorithms are still taken for granted. They transform data sets and hypotheses into conclusions. For instance, cluster algorithms check for a given data set and a clustering requirements portfolio whether this portfolio can be supported, and provide a set of clusters as output in the positive case. The Hopkins index is one of the criteria that allow one to judge whether clusters exist within a data set (a small sketch follows later in this section). A systematic approach to data mining has already been proposed in [3, 17]. It is based on mathematics and mathematical statistics and is thus able to handle errors, biases and the configuration of data mining as well. Our experience in large data mining projects in archaeology, ecology, climate research, medical research etc. has however shown that ad-hoc and brute-force mining is still the main approach. The results are taken for granted and believed despite the pitfalls in modelling, understanding, flow of work and data handling. So, the results often become dubious.

Data are the main source of information in data mining and analysis. Their quality properties have been neglected for a long time. At the same time, modern data management allows us to handle these problems. In [16] we compare the critical findings or pitfalls of [21] with resolution techniques that can be applied to overcome the crucial pitfalls of data mining in environmental sciences reported there. The algorithms themselves are another source of pitfalls. They are typically used for the solution of data mining and analysis tasks, while it is neglected that an algorithm also has an application area, application restrictions, data requirements, and results at a certain granularity and precision. These problems must be systematically tackled if we want to rely on the results of mining and analysis. Otherwise the analysis may become misleading, biased, or impossible. Therefore, we explicitly treat properties of mining and analysis. A similar observation can be made for data handling.

Data mining is often considered to be a separate sub-discipline of computer engineering and science. The statistical basis of data mining is well accepted. We typically start with a general (or better, generic) model and use the data that are at hand and that seem to be appropriate for the refinement or improvement of the model. This technique is known in the sciences under several names such as inverse modelling, generic modelling, pattern-based reasoning, (inductive) learning, universal application, and systematic modelling.

Data mining is typically based not on one model only but rather on a model ensemble or model suite. The associations among the models in a model suite are explicitly specified. Reasoning techniques combine methods from logics (deductive, inductive, abductive, counter-inductive, etc.), from artificial intelligence (hypothetic, qualitative, concept-based, adductive, etc.), from computational methods (algorithmics [6], topology, geometry, reduction, etc.), and from cognition (problem representation and solving, causal reasoning, etc.).

These choices and handling approaches need a systematic underpinning. Techniques from artificial intelligence, statistics, and engineering are bundled within the CRISP framework (e.g. [3]). They can be enhanced by techniques that have originally been developed for modelling, design science, business informatics, learning theory, action theory etc.
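As an illustration of such a clustering-tendency criterion, the following is a minimal sketch of the Hopkins statistic mentioned above, assuming numpy and scipy are available; the function name and the sampling choices are our own illustration, not part of [3, 17].

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(data: np.ndarray, m: int = 50, rng=None) -> float:
    """Estimate the Hopkins statistic of a data set (rows = points)."""
    rng = np.random.default_rng(rng)
    n, d = data.shape
    m = min(m, n - 1)
    tree = cKDTree(data)

    # u_i: distances from m uniformly sampled points in the bounding
    # box of the data to their nearest data point.
    uniform = rng.uniform(data.min(axis=0), data.max(axis=0), size=(m, d))
    u, _ = tree.query(uniform, k=1)

    # w_i: distances from m sampled data points to their nearest other
    # data point (k=2 so that the point itself is skipped).
    sample = data[rng.choice(n, size=m, replace=False)]
    w, _ = tree.query(sample, k=2)
    w = w[:, 1]

    return u.sum() / (u.sum() + w.sum())
```

Values near 0.5 suggest randomly distributed data, while values close to 1 suggest that clusters exist and that cluster analysis may be worthwhile.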
We combine and generalize the CRISP, heuristics, modelling theory, design science, business informatics, statistics, and learning approaches in this paper. First, we introduce our notion of the model. Next we show how data mining can be designed. We apply this investigation to systematic modelling and later to systematic data mining. Our goal is to develop a holistic and systematic framework for data mining and analysis. Many issues are left out of the scope of this paper, such as a literature review, a formal introduction of the approach, and a detailed discussion of data mining application cases.

2 Models and Modelling

Models are principal instruments in mathematics, data analysis, modern computer engineering (CE), the teaching of any kind of computer technology, and also modern computer science (CS). They are built, applied, revised and manufactured in many CE&CS sub-disciplines in a large variety of application cases with different purposes and contexts for different communities of practice. It is now well understood that models are something different from theories. They are often intuitive, visualizable, and ideally capture the essence of an understanding within some community of practice and some context. At the same time, they are limited in scope, context and applicability.

2.1 The Notion of the Model

There is however a general notion of a model and of a conception of the model:
A model is a well-formed, adequate, and dependable instrument that represents origins [9, 29, 30].
Its criteria of well-formedness, adequacy, and dependability must be commonly accepted by its community of practice within some context and correspond to the functions that a model fulfills in utilization scenarios.
A well-formed instrument is adequate for a collection of origins if it is analogous to the origins to be represented according to some analogy criterion, it is more focused (e.g. simpler, truncated, more abstract or reduced) than the origins being modelled, and it sufficiently satisfies its purpose.
Well-formedness enables an instrument to be justified by an empirical corroboration according to its objectives, by rational coherence and conformity explicitly stated through conformity formulas or statements, by falsifiability or validation, and by stability and plasticity within a collection of origins.
The instrument is sufficient by its quality characterization for internal quality, external quality and quality in use, or through quality characteristics [28] such as correctness, generality, usefulness, comprehensibility, parsimony, robustness, novelty etc. Sufficiency is typically combined with some assurance evaluation (tolerance, modality, confidence, and restrictions).

2.2 Generic and Specific Models

The general notion of a model covers all aspects of adequateness, dependability, well-formedness, scenario, functions and purposes, backgrounds (grounding and basis), and outer directives (context and community of practice). It covers all notions known so far in agriculture, archaeology, arts, biology, chemistry, computer science, economics, electro-technics, environmental sciences, farming, geosciences, historical sciences, languages, mathematics, medicine, ocean sciences, pedagogical science, philosophy, physics, political sciences, sociology, and sports. The models used in these disciplines are instruments used in certain scenarios.

Sciences distinguish between general, particular and specific things. Particular things are specific for general things and general for specific things. The same abstraction may be used for modelling. We may start with a general model. So far, however, nobody knows how to define general models for most utilization scenarios. Models function as instruments or tools. Typically, instruments come in a variety of forms and fulfill many different functions. Instruments are partially independent or autonomous of the thing they operate on. Models are however special instruments. They are used with a specific intention within a utilization scenario. The quality of a model becomes apparent in the context of this scenario.

It might thus be better to start with generic models. A generic model [4, 26, 31, 32] is a model which broadly satisfies the purpose and broadly functions in the given utilization scenario. It is later tailored to suit the particular purpose and function. It generally represents origins of interest, provides means to establish adequacy and dependability of the model, and establishes the focus and scope of the model. Generic models should satisfy at least five properties: (i) they must be accurate; (ii) their quality allows them to be used consciously; (iii) they should be descriptive, not evaluative; (iv) they should be flexible so that they can be modified from time to time; (v) they can be used as a first “best guess”.

2.3 Model Suites

Most disciplines integrate a variety of models or a society of models, e.g. [7, 14]. Models used in CE&CS are mainly at the same level of abstraction. It has been well known for threescore years that they form a model ensemble (e.g. [10, 23]) or a horizontal model suite (e.g. [8, 27]). Developed models vary in their scopes, in the aspects and facets they represent, and in their abstraction.

A model suite consists of a set of models {M1, ..., Mn}, of an association or collaboration schema among the models, of controllers that maintain consistency or coherence of the model suite, of application schemata for explicit maintenance and evolution of the model suite, and of tracers for the establishment of coherence (a sketch of this structure follows below).
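The following is a minimal sketch, in Python, of how such a model suite could be represented as a data structure; all class and field names are our own illustration of the definition above, not an implementation from the cited literature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ModelSuite:
    """A model suite in the sense of Section 2.3."""
    # The set of models {M1, ..., Mn}, addressed by name.
    models: Dict[str, object]
    # Association/collaboration schema: directed, labelled links
    # among models, e.g. ("M1", "M2", "refines").
    associations: List[Tuple[str, str, str]]
    # Controllers that check consistency or coherence of the suite.
    controllers: List[Callable[["ModelSuite"], bool]] = field(default_factory=list)
    # Application schemata for explicit maintenance and evolution.
    application_schemata: Dict[str, str] = field(default_factory=dict)
    # Tracers that record how coherence was established.
    tracers: List[Callable[["ModelSuite"], None]] = field(default_factory=list)

    def coherent(self) -> bool:
        """The suite is coherent if every controller accepts it."""
        return all(check(self) for check in self.controllers)
```

A controller here is simply a predicate over the suite; in practice it would, for instance, check that every association endpoint names an existing model.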
Multi-modelling [11, 19, 24] has become a culture in CE&CS. Maintenance of coherence, co-evolution, and consistency among models has become a bottleneck in development. Moreover, different languages with different capabilities have become an obstacle, similar to multi-language retrieval [20] and impedance mismatches. Models are often loosely coupled. Their dependence and relationships are often not explicitly expressed. This problem becomes more complex if models are used for different purposes such as construction of systems, verification, optimization, explanation, and documentation.

2.4 Stepwise Refinement of Models

Refinement of a model to a particular or special model provides mechanisms for model transformation along the adequacy, the justification and the sufficiency of a model. Refinement is based on specialization for better suitability of a model, on removal of unessential elements, on combination of models to provide a more holistic view, on integration that is based on binding of model components to other components, and on enhancement that typically improves a model to become more adequate or dependable.

Control of correctness of refinement [33] for information systems takes into account (A) a focus on the refined structure and refined vocabulary, (B) a focus on information systems structures of interest, (C) abstract information systems computation segments, (D) a description of database segments of interest, and (E) an equivalence relation among those data of interest.

2.5 Deep Models and the Modelling Matrix

Model development is typically based on an explicit and rather quick description of the ‘surface’ or normal model and on the mostly unconditional acceptance of a deep model. The latter directs the modelling process and the surface or normal model. Modelling itself is often understood as development and design of the normal model. The deep model is taken for granted and accepted for a number of normal models.

The deep model can be understood as the common basis for a number of models. It consists of the grounding for modelling (paradigms, postulates, restrictions, theories, culture, foundations, conventions, authorities), the outer directives (context and community of practice), and the basis (assumptions, general concept space, practices, language as carrier, thought community and thought style, methodology, pattern, routines, commonsense) of modelling. It uses a collection of undisputable elements of the background as grounding and additionally a disputable and adjustable basis which is commonly accepted in the given context by the community of practice. Education on modelling starts, for instance, directly with the deep model. In this case, the deep model has to be accepted and is thus hidden and latent.

A (modelling) matrix is something within or from which something else originates, develops, or takes form. The matrix is assumed to be correct for normal models. It consists of the deep model and the modelling scenarios. The modelling agenda is derived from the modelling scenario and the utilization scenarios. The modelling scenario and the deep model serve as a part of the definitional frame within a model development process. They also define the capacity and potential of a model whenever it is utilized.

Deep models and the modelling matrix also define a frame for adequacy and dependability. This frame is enhanced for specific normal models. It is then used for a statement about the cases in which a normal model represents the origins under consideration (a sketch of these structures is given at the end of Section 2.6).

2.6 Deep Models and Matrices in Archaeology

Let us consider an application case. The CRC 1266 ¹
    “Scales of Transformation – Human Environmental Interaction in Prehistoric and Archaic Societies”
investigates processes of transformation from 15,000 BCE to 1 BCE, including crisis and collapse, on different scales and dimensions, and as involving different types of groups, societies, and social formations. It is based on the matrix and a deep model as sketched in Figure 1. This matrix determines which normal models can still be considered and which cannot. The initial model for any normal model accepts this matrix.

Figure 1 Modeling in archaeology with a matrix

We base our consideration of the matrix and the deep model on [19] and on the discussions in the CRC. Whether the deep model or the model matrix is appropriate has already been discussed. The final version presented in this paper illustrates our understanding.

¹ https://www.sfb1266.uni-kiel.de/en
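As a minimal sketch of the notions of Sections 2.5 and 2.6, the following Python fragment fixes deep model and matrix as data structures; the names and the acceptance check are our own illustration, not part of the CRC 1266 models.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DeepModel:
    """Common, mostly unconditionally accepted basis for normal models."""
    grounding: List[str]   # undisputable: paradigms, postulates, theories, ...
    basis: List[str]       # disputable but accepted: assumptions, concepts, ...
    directives: List[str]  # outer directives: context, community of practice

@dataclass(frozen=True)
class Matrix:
    """Deep model plus the modelling scenarios; assumed to be correct
    for every normal model developed within it."""
    deep_model: DeepModel
    scenarios: List[str]

@dataclass
class NormalModel:
    """A 'surface' model; its initial version must accept the matrix."""
    name: str
    matrix: Matrix
    assumptions: List[str]

    def accepts_matrix(self) -> bool:
        # Purely illustrative check: a normal model may refine, but not
        # contradict, the basis; here we only verify that none of its
        # assumptions is explicitly negated ("not X") in the basis.
        negated = {f"not {a}" for a in self.assumptions}
        return negated.isdisjoint(self.matrix.deep_model.basis)
```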
2.7 Stereotyping of a Data Mining Process

Typical modelling (and data mining) processes follow some kind of ritual or typical guideline, i.e. they are stereotyped. The stereotype of a modelling process is based on a general modelling situation. Most modelling methodologies are bound to one stereotype and one kind of model within one model utilization scenario. Stereotypes govern, condition, steer and guide the model development. They determine the model kind, the background and the way of modelling activities. They persuade the activities of modelling. They provide a means for considering the economics of modelling. Often, stereotypes use a definitional frame that primes and orients the processes and that considers the community of practice or actors within the model development and utilization processes, the deep model or the matrix with its specific language and model basis, and the agenda for model development. It might be enhanced by initial models which are derived from generic models in accordance with the matrix.

The model utilization scenario determines the function that a model might have and therefore also the goals and purposes of a model.

2.8 The Agenda

The agenda is something like a guideline for modelling activities and for model associations within a model suite. It improves the quality of model outcomes by spending some effort on deciding what and how much reasoning to do, as opposed to what activities to do. It balances resources between the data-level actions and the reasoning actions. For example, [17] uses an agent approach with preparation agents, exploration agents, descriptive agents, and predictive agents. The agenda for a model suite thus uses decision points that require agenda control according to performance and resource considerations. This understanding supports introspective monitoring of the performance of the data mining process, coordinated control of the entire mining process, and coordinated refinement of the models. Such control is already necessary due to the problem space, the limitations of resources, and the amount of uncertainty in knowledge, concepts, data, and the environment.

3 Data Mining Design

3.1 Conceptualization of Data Mining and Analysis

The data mining and analysis task must be enhanced by an explicit treatment of the languages used for concepts and hypotheses, and by an explicit description of the knowledge that can be used. The algorithmic solution of the task is based on knowledge of the algorithms that are used and of the data that are available and required for the application of the algorithms. Typically, analysis algorithms are iterative and can run forever. We are interested only in convergent ones and thus need termination criteria. Therefore, the conceptualization of the data mining and analysis task consists of a detailed description of six main parameters (e.g. for inductive learning [34]):
(a) The data analysis algorithm: Algorithm development is the main activity in data mining research. Each of these algorithms transfers data and some specific parameters of the algorithm to a result.
(b) The concept space: The concept space defines the concepts under consideration for the analysis, based on a certain language and common understanding.
(c) The data space: The data space typically consists of a multi-layered data set of different granularity. Data sets may be enhanced by metadata that characterize the data sets and associate them with other data sets.
(d) The hypotheses space: An algorithm is supposed to map evidence on the concepts to be supported or rejected into hypotheses about them.
(e) The prior knowledge space: Specifying the hypothesis space already provides some prior knowledge. In particular, the analysis task starts with the assumption that the target concept is representable in a certain way.
(f) The acceptability and success criteria: Criteria for successful analysis allow one to derive termination criteria for the data analysis.
Each instantiation and refinement of the six parameters leads to specific data mining tasks. The result of data mining and data analysis is described within the knowledge space. The data mining and analysis task may thus be considered to be a transformation of data sets, concept sets and hypothesis sets into chunks of knowledge through the application of algorithms. A sketch of this parameterization follows below.
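The following is a minimal Python sketch of the six-parameter conceptualization above; the record type and its field names are our own illustration of the structure, not a published interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MiningTask:
    """A data mining task as a detailed description of the six
    parameters (a)-(f) of Section 3.1."""
    algorithm: Callable[[List[dict], dict], object]  # (a) data + parameters -> result
    concept_space: List[str]                         # (b) concepts under consideration
    data_space: List[dict]                           # (c) possibly multi-layered data
    hypotheses: List[str]                            # (d) candidate hypotheses
    prior_knowledge: List[str]                       # (e) e.g. representability assumptions
    success: Callable[[object], bool]                # (f) acceptability/termination criterion

    def run(self, parameters: dict) -> object:
        """One instantiation of the task: apply the algorithm and
        accept the result only if the success criterion holds."""
        result = self.algorithm(self.data_space, parameters)
        return result if self.success(result) else None
```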
Problem solving and modelling, however, typically considers six aspects [16]:
(1) Application, problems, and users: The domain consists of a model of the application, a specification of the problems under consideration, of the tasks that are issued, and of profiles of users.
(2) Context: The context of a problem is anything that could support the problem solution, e.g. the sciences’ background, theories, knowledge, foundations, and concepts to be used for the problem specification, the problem background, and solutions.
(3) Technology: Technology is the enabler and defines the methodology. It provides [23] means for the flow of problem solving steps, the flow of activities, the distribution, the collaboration, and the exchange.
(4) Techniques and methods: Techniques and methods can be given as algorithms. Specific algorithms are data improvers and cleaners, data aggregators, data integrators, controllers, checkers, acceptance determiners, and termination algorithms.
(5) Data: Data have their own structuring, their quality and their life span. They are typically enhanced by metadata. Data management is a central element of most problem solving processes.
(6) Solutions: The solutions to problem solving can be formally given, illustrated by visual means, and presented by models. Models are typically only normal models. The deep model and the matrix are already provided by the context and accepted by the community of practice, depending on the needs of this community for the given application scenario. Therefore, models may be the final result of a data mining and analysis process besides other means.

Comparing these six aspects with the six parameters, we discover that only four of them have been considered so far in data mining. We miss the user and application space as well as the representation space. Figure 2 shows the difference.

Figure 2 Parameters of Data Mining and the Problem Solving Aspects

3.2 Meta-models of Data Mining

An abstraction layer approach separates the application domain, the model domain and the data domain [17]. This separation is illustrated in Figure 3.

Figure 3 The V meta-model of Data Mining Design

The data mining design framework uses the inverse modelling approach. It starts with the consideration of the application domain and develops models as mediators between the data world and the application domain world. In the sequel we are going to combine the three approaches of this section. The meta-model corresponds to other meta-models such as inductive modelling or hypothetical reasoning (hypotheses development, experimenting and testing, analysis of results, interim conclusions, reappraisal against the real world).

4 Data Mining: A Systematic Model-Based Approach

Our approach presented so far allows us to revise and reformulate the model-oriented data mining process on the basis of well-defined engineering [15, 25] or, alternatively, of systematic mathematical problem solving [22]. Figure 4 displays this revision. We realize that the first two phases are typically implicitly assumed and not considered. We concentrate on the non-iterative form. Iterative processes can be handled in a similar form.

4.1 Setting the Deep Model and the Matrix

The problem to be tackled must be clearly stated in dependence on the utilization scenario, the tasks to be solved, the community of practice involved, and the given context. The result of this step is the deep model and its matrix. The former is based on the background, the specific context parameters such as infrastructure and environment, and candidates for deep models.

Figure 4 The Phases in Data Mining Design (Non-iterative form)

The data mining tasks can now be formulated based on the matrix and the deep model. We set up the context, the environment, the general goal of the problem and also criteria for the adequateness and dependability of the solution, e.g. invariance properties for the problem description and for the task setting and its mathematical formulation, and solution faithfulness properties for the later application of the solution in the given environment. What exactly is the problem, the expected benefit? What should a solution look like? What is known about the application?

Deep models already use a background consisting of an undisputable grounding and a selectable basis. The explicit statement of the background provides an understanding of the postulates, paradigms, assumptions, conceptions, practices, etc. Without the background, the results of the analysis cannot be properly understood. Models have their profile, i.e. goals, purposes and functions. These must be explicitly given. The parameters of a generic model can be either order or slave parameters [12], either primary or secondary or tertiary (also called genotypes or phenotypes or observables) [1, 5], and either ruling (or order) or driven parameters [12]. Data mining can be enhanced by knowledge management techniques.

Additionally, the concept space into which the data mining task is embedded must be specified. This concept space is enhanced during data analysis. A sketch of the output of this phase follows below.
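The following is a minimal Python sketch of what the output of this first phase could look like; the record type, its fields and the acceptance check are our own illustration of the step, not a published interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TaskSetting:
    """Output of the first phase: deep model and matrix together with
    explicit acceptance criteria for any later solution."""
    goal: str
    scenario: str
    grounding: List[str]   # undisputable background elements
    basis: List[str]       # selectable, disputable basis elements
    adequacy_criteria: List[Callable[[object], bool]]       # e.g. invariance checks
    dependability_criteria: List[Callable[[object], bool]]  # e.g. faithfulness checks

    def acceptable(self, solution: object) -> bool:
        """A solution is acceptable only if every stated criterion holds."""
        return all(criterion(solution) for criterion in
                   self.adequacy_criteria + self.dependability_criteria)
```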
4.2 Stereotyping the Process

The general flow of data mining activities is typically implicitly assumed on the basis of stereotypes which form a set of tasks, e.g. proof tasks in whatever system, transformation tasks, description tasks, and investigation tasks. Proofs can follow the classical deductive or inductive setting. Also abductive, adductive, hypothetical and other reasoning techniques are applicable. Stereotypes typically use model suites as a collection of associated models, are already biased by priming and orientation, and follow policies, data mining design constraints, and framing.

Data mining and analysis is rather stereotyped. For instance, mathematical culture has already developed a good number of stereotypes for problem formulation. It is based on a mathematical language for the formulation of analysis tasks, and on the selection and instantiation of the best fitting variable space and the space of opportunities provided by mathematics.

Data mining uses generic models which are the basis of normal models. Models are based on a separation of concern according to the problem setting: dependence-indicating, dependence-describing, separation or partition spaces, pattern kinds, reasoning kinds, etc. This separation of concern governs the classical data mining algorithmic classes: association analysis, cluster analysis, data grouping with or without classification, classifiers and rules, dependences among parameters and data subsets, predictor analysis, synergetics, blind or informed or heuristic investigation of the search space, and pattern learning.

4.3 Initialization of the Normal Data Models

Data mining algorithms have their capacity and potential [2]. Potential and capacity can be based on a SWOT (strengths, weaknesses, opportunities, and threats), SCOPE (situation, core competencies, obstacles, prospects, expectation), or SMART (how simple, meaningful, adequate, realistic, and trackable) analysis of methods and algorithms. Each of the algorithm classes has its strengths and weaknesses, its satisfaction of the tasks and the purpose, and its limits of applicability. Algorithm selection also includes an explicit specification of the order of application of these algorithms and of the mapping of parameters that are derived by means of one algorithm to those that are an input for the others, i.e. an explicit association within the model suite. Additionally, evaluation algorithms for the success criteria are selected. Algorithms have their own obstinacy, their hypotheses and assumptions that must be taken into consideration. Whether an algorithm can be considered depends on acceptance criteria derived in the previous two steps.

So, we ask: What kind of model suite architecture suits the problem best? What are applicable development approaches for modelling? What is the best modelling technique to get the right model suite? What kind of reasoning is supported? What is not? What are the limitations? Which pitfalls should be avoided?

The result of the entire data mining process heavily depends on the appropriateness of the data sets, their properties and quality, and more generally on the data schemata with essentially three components: the application data schema with a detailed description of data types, the metadata schema [18], and the generated and auxiliary data schemata. The first component is well investigated in data mining and data management monographs. The second and third components inherit research results from database management, from data marts or warehouses, and from the layering of data. An essential element is the explicit specification of the quality of data. It allows one to derive algorithms for data improvement and to derive limitations on the applicability of algorithms. Auxiliary data support the performance of the algorithms.

Therefore, typical data-oriented questions are: What data do we have available? Are the data relevant to the problem? Are they valid? Do they reflect our expectations? Are data quality, quantity and recency sufficient? Which data should we concentrate on? How are the data transformed for modelling? How may we increase the quality of the data?

4.4 The Data Mining Process Itself

The data mining process can be understood as a coherent and stepwise refinement of the given model suite. The model refinement may use an explicit transformation or an extract-transform-load process among the models within the model suite. Evaluation and termination algorithms are an essential element of any data mining algorithm. They can be based on quality criteria for the finalized models in the model suite, e.g. generality, error-proneness, stability, selection-proneness, validation, understandability, repeatability, usability, usefulness, and novelty.

Typical questions to answer within this process are: How good is the model suite in terms of the task setting? What have we really learned about the application domain? What is the real adequacy and dependability of the models in the model suite? How can these models be deployed best? How do we know that the models in the model suite are still valid? Which data support which model in the model suite? Which kinds of data errors are inherited by which part of which model?

The final result of the data mining process is then a combination of the deep model and the normal model, whereas the former is a latent or hidden component in most cases. If we want, however, to reason about the results, then the deep model must be understood as well. Otherwise, the results may become surprising and may not be convincing. A sketch of such a refinement loop with explicit termination follows below.
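The following is a minimal sketch of the stepwise refinement of a model suite with an evaluation-based termination criterion, as described above; the generic suite type, the refinement step and the quality measure are our own placeholders under the stated assumptions.

```python
from typing import Callable, TypeVar

Suite = TypeVar("Suite")  # stands for a model suite, cf. Section 2.3

def refine_until_acceptable(
    suite: Suite,
    refine: Callable[[Suite], Suite],    # one transformation/ETL step among models
    quality: Callable[[Suite], float],   # evaluation over criteria such as
                                         # generality, stability, validation, ...
    threshold: float,
    max_steps: int = 100,
) -> Suite:
    """Stepwise refinement of a model suite with explicit termination.

    Analysis algorithms may run forever; we therefore stop either when
    the quality criteria are satisfied or after a bounded number of steps.
    """
    for _ in range(max_steps):
        if quality(suite) >= threshold:
            break
        candidate = refine(suite)
        # Accept a refinement step only if it does not degrade quality.
        if quality(candidate) < quality(suite):
            break
        suite = candidate
    return suite
```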
4.5 Controllers and Selectors

Algorithmics [6] treats algorithms as general solution patterns that have parameters for their instantiation, handling mechanisms for their specialization to a given environment, and enhancers for context injection. So, an algorithm can be derived based on explicit selectors and control rules [4] if we neglect context injection. We can use this approach for data mining design (DMD). For instance, an algorithm pattern such as regression uses a generic model of parameter dependence, is based on blind search, has parameters for similarity and model quality, and has selection support for specific treatment of the given data set. In this case, the controller is based on enablers that specify the applicability of the approach, on error rules, on data evaluation rules that detect dependencies among control parameters and derive data quality measures, and on quality rules for confidence statements.

4.6 Data Mining and Design Science

Let us finally associate our approach with design science research [13]. Design science considers systematic modelling as an embodiment of three closely related cycles of activities. The relevance cycle initiates design science research with an application context that not only provides the requirements for the research as inputs but also defines acceptance criteria for the ultimate evaluation of the research results. The central design cycle iterates between the core activities of building and evaluating the design artifacts and processes of the research. The orthogonal rigor cycle provides past knowledge to the research project to ensure its innovation. It is contingent on the researchers thoroughly researching and referencing the knowledge base in order to guarantee that the designs produced are research contributions and not routine designs based upon the application of well-known processes.

The relevance cycle is concerned with the problem specification and setting and with the derivation of the matrix and the agenda. The design cycle is related to all other phases of our framework. The rigor cycle is enhanced by our framework and thus provides a systematic modelling approach.

5 Conclusion

The literature on data mining is fairly rich. Mining tools have already gained the maturity to support any kind of data analysis if the data mining problem is well understood, the intentions for the models are properly understood, and the problem is professionally set up. Data mining aims at the development of model suites that allow one to derive and draw dependable and thus justifiable conclusions on the given data set. Data mining is a process that can be based on a framework for systematic modelling that is driven by a deep model and a matrix. Textbooks on data mining typically explore algorithms in detail as blind search. Data mining is a specific form of modelling. Therefore, we can combine modelling with data mining in a more sophisticated form. Models have, however, an inner structure with parts which are given by the application, by the context, by commonsense and by a community of practice. These fixed parts are then enhanced by normal models. A typical normal model is the result of a data mining process.

The current state of the art in data mining is mainly technology and algorithm driven. The problem selection is made on intuition and experience. So, the matrix and the deep model are latent and hidden. The problem specification is not explicit. Therefore, this paper aims at the entire data mining process and highlights a way to leave ad-hoc, blind and somewhat chaotic data analysis behind. The approach we are developing integrates the theory of models, the theory of problem solving, design science, and knowledge and content management. We realized that data mining can be systematized. The framework for data mining design presented exemplarily in Figure 4 is one such example.

Acknowledgement. We thank the CRC 1266 for the support of this paper. We are very thankful for the fruitful discussions with the members of the CRC.

References

[1] G. Bell. The Mechanism of Evolution. Chapman and Hall, New York (1997)
[2] R. Berghammer and B. Thalheim. Methodenbasierte mathematische Modellierung mit Relationenalgebren. In: Wissenschaft und Kunst der Modellierung: Modelle, Modellieren, Modellierung, pp. 67–106. De Gruyter, Boston (2015)
[3] M.R. Berthold, C. Borgelt, F. Höppner, and F. Klawonn. Guide to Intelligent Data Analysis. Springer, London (2010)
[4] A. Bienemann, K.-D. Schewe, and B. Thalheim. Towards a theory of genericity based on government and binding. In: Proc. ER'06, LNCS 4215, pp. 311–324. Springer (2006)
[5] L.B. Booker, D.E. Goldberg, and J.H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40(1-3): pp. 235–282 (1989)
[6] G. Brassard and P. Bratley. Algorithmics - Theory and Practice. Prentice Hall, London (1988)
[7] A. Coleman. Scientific models as works. Cataloging & Classification Quarterly, Special Issue: Works as Entities for Information Retrieval, 33, pp. 3–4 (2006)
[8] A. Dahanayake and B. Thalheim. Co-evolution of (information) system models. In: EMMSAD 2010, LNBIP vol. 50, pp. 314–326. Springer (2010)
[9] D. Embley and B. Thalheim (eds). The Handbook of Conceptual Modeling: Its Usage and Its Challenges. Springer (2011)
[10] N.P. Gillett, F.W. Zwiers, A.J. Weaver, G.C. Hegerl, M.R. Allen, and P.A. Stott. Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett., 29: pp. 31–34 (2002)
[11] E. Guerra, J. de Lara, D.S. Kolovos, and R.F. Paige. Inter-modelling: From theory to practice. In: MoDELS 2010, LNCS 6394, pp. 376–391. Springer, Berlin (2010)
[12] H. Haken, A. Wunderlin, and S. Yigitbasi. An introduction to synergetics. Open Systems and Information Dynamics, 3(1): pp. 1–34 (1994)
[13] A. Hevner, S. March, J. Park, and S. Ram. Design science in information systems research. MIS Quarterly, 28(1): pp. 75–105 (2004)
[14] P.J. Hunter, W.W. Li, A.D. McCulloch, and D. Noble. Multiscale modeling: Physiome project standards, tools, and databases. IEEE Computer, 39(11), pp. 48–54 (2006)
[15] ISO/IEC 25020 (Software and system engineering - software product quality requirements and evaluation (SQuaRE) - measurement reference model and guide). ISO/IEC JTC1/SC7 N3280 (2005)
[16] H. Jaakkola, B. Thalheim, Y. Kidawara, K. Zettsu, Y. Chen, and A. Heimbürger. Information modelling and global risk management systems. In: Information Modeling and Knowledge Bases XX, pp. 429–446. IOS Press (2009)
[17] K. Jannaschk. Infrastruktur für ein Data Mining Design Framework. PhD thesis, Christian-Albrechts University, Kiel (2017)
[18] F. Kramer and B. Thalheim. A metadata system for quality management. In: Information Modelling and Knowledge Bases, pp. 224–242. IOS Press (2014)
[19] O. Nakoinz and D. Knitter. Modelling Human Behaviour in Landscapes. Springer (2016)
[20] J. Pardillo. A systematic review on the definition of UML profiles. In: MoDELS 2010, LNCS 6394, pp. 407–422. Springer, Berlin (2010)
[21] D. Petrelli, S. Levin, M. Beaulieu, and M. Sanderson. Which user interaction for cross-language information retrieval? Design issues and reflections. JASIST, 57(5): pp. 709–722 (2006)
[22] O.H. Pilkey and L. Pilkey-Jarvis. Useless Arithmetic: Why Environmental Scientists Can't Predict the Future. Columbia University Press, New York (2006)
[23] A.S. Podkolsin. Computer-based modelling of solution processes for mathematical tasks (in Russian). ZPI at Mech-Mat MGU, Moscow (2001)
[24] M. Pottmann, H. Unbehauen, and D.E. Seborg. Application of a general multi-model approach for identification of highly nonlinear processes - a case study. Int. Journal of Control, 57(1): pp. 97–120 (1993)
[25] B. Rumpe. Modellierung mit UML. Springer, Heidelberg (2012)
[26] A. Samuel and J. Weir. Introduction to Engineering: Modelling, Synthesis and Problem Solving Strategies. Elsevier, Amsterdam (2000)
[27] G. Simsion and G.C. Witt. Data Modeling Essentials. Morgan Kaufmann, San Francisco (2005)
[28] M. Skusa. Semantische Kohärenz in der Softwareentwicklung. PhD thesis, CAU Kiel (2011)
[29] B. Thalheim. Towards a theory of conceptual modelling. Journal of Universal Computer Science, 16(20): pp. 3102–3137 (2010)
[30] B. Thalheim. The conceptual model ≡ an adequate and dependable artifact enhanced by concepts. In: Information Modelling and Knowledge Bases XXV, pp. 241–254. IOS Press (2014)
[31] B. Thalheim. Conceptual modeling foundations: The notion of a model in conceptual modeling. In: Encyclopedia of Database Systems. Springer (2017)
[32] B. Thalheim and M. Tropmann-Frick. Wherefore models are used and accepted? The model functions as a quality instrument in utilisation scenarios. In: I. Comyn-Wattiau, C. du Mouza, and N. Prat, editors, Ingenierie Management des Systemes D'Information (2016)
[33] B. Thalheim, M. Tropmann-Frick, and T. Ziebermayr. Application of generic workflows for disaster management. In: Information Modelling and Knowledge Bases, pp. 64–81. IOS Press (2014)
[34] B. Thalheim and Q. Wang. Towards a theory of refinement for data migration. In: ER 2011, LNCS 6998, pp. 318–331. Springer (2011)
[35] T. Zeugmann. Inductive inference of optimal programs: A survey and open problems. In: Nonmonotonic and Inductive Logics, pp. 208–222. Springer, Berlin (1991)