<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>M. Pottmann, H. Unbehauen, and D.E.
Seborg. Application of a general multi-model
approach for identification of highly nonlinear
processes - a case study. Int. Journal of
Control</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2017)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Co Yannic Kropp © Bernhard Thalheim Christian Albrechts University Kiel, Department of Computer Science</institution>
          ,
          <addr-line>D-24098 Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>120</volume>
      <issue>1993</issue>
      <fpage>273</fpage>
      <lpage>280</lpage>
      <abstract>
        <p>Data mining is currently a well-established technique and is supported by many algorithms. It depends on the data at hand, on properties of the algorithms, on the technology developed so far, and on the expectations and limits to be applied. It must thus be mature, predictable, optimisable, evolving, adaptable and well-founded, similar to mathematics and SPICE/CMM-based software engineering. Data mining must therefore be systematic if its results are to be fit for purpose. One basis of this systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The main notion is that of the model in a variety of forms, together with abstraction and associations among models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Data mining and analysis is nowadays
well understood from the algorithmic side. Thousands of
algorithms have been proposed. The
number of success stories is overwhelming and has
caused the big data hype. At the same time, brute-force
application of algorithms is still the standard, and
data analysis and data mining algorithms are still taken
for granted. They transform data sets and hypotheses
into conclusions. For instance, cluster algorithms check,
for a given data set and a portfolio of clustering
requirements, whether this portfolio can be supported and,
in the positive case, provide a set of clusters as
output. The Hopkins index is one of the criteria that
allow us to judge whether clusters exist within a data set.
A systematic approach to data mining has already been
proposed in [
        <xref ref-type="bibr" rid="ref17 ref3">3, 17</xref>
        ]. It is based on mathematics and
mathematical statistics and is thus able to handle errors,
biases and the configuration of data mining as well. Our
experience in large data mining projects in archaeology,
ecology, climate research, medical research etc. has
shown, however, that ad-hoc and brute-force mining is
still the main approach. The results are taken for
granted and believed despite pitfalls in modelling,
understanding, workflow and data handling.
As a consequence, the results often become dubious.
      </p>
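      <p>As an illustration of such a criterion, the following minimal Python sketch estimates a Hopkins-type statistic. It is not taken from the cited work. The function name hopkins_statistic, the bounding box used as sampling window, the exponent equal to the data dimension, and the convention that values near 0.5 suggest spatially random data while values near 1 suggest a clustering tendency are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import math
import random


def hopkins_statistic(points, sample_size, seed=0):
    """Estimate a Hopkins-type statistic for a list of equal-length tuples.

    Assumed convention: values near 0.5 suggest spatially random data,
    values near 1 suggest a clustering tendency.
    Assumes the data points are not all identical.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dim)]
    highs = [max(p[k] for p in points) for k in range(dim)]

    def nearest(query, exclude=None):
        # Distance from query to the closest data point, optionally skipping one index.
        return min(math.dist(query, p) for i, p in enumerate(points) if i != exclude)

    # u: distances from uniformly drawn probe points to the data.
    probes = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
              for _ in range(sample_size)]
    u = sum(nearest(q) ** dim for q in probes)

    # w: distances from sampled data points to their nearest other data point.
    picked = rng.sample(range(len(points)), sample_size)
    w = sum(nearest(points[i], exclude=i) ** dim for i in picked)

    return u / (u + w)
```
]]></preformat>
      <p>For two tight, well-separated groups of points this sketch yields a value close to 1, whereas for data drawn uniformly at random from the window the value is expected to lie near 0.5.</p>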
      <p>
        Data are the main source of information in data
mining and analysis. Their quality properties have been
neglected for a long time. At the same time, modern
data management allows us to handle these problems. In
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] we compare the critical findings or pitfalls of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
with resolution techniques that can be applied to
overcome the crucial pitfalls of data mining in
environmental sciences reported there. The algorithms
that are typically used to solve data mining and
analysis tasks are another source of pitfalls.
It is often neglected that an algorithm also has
an application area, application restrictions, data
requirements, and results at a certain granularity and
precision. These problems must be systematically
tackled if we want to rely on the results of mining and
analysis. Otherwise analysis may become misleading,
biased, or impossible. Therefore, we explicitly treat
properties of mining and analysis. A similar observation
can be made for data handling.
      </p>
      <p>Data mining is often considered to be a separate
sub-discipline of computer engineering and science.
The statistical basis of data mining is well accepted. We
typically start with a general (or better generic) model
and use the data that are on hand and that seem
appropriate to refine or improve the model.
This technique is known in the sciences under several
names such as inverse modelling, generic modelling,
pattern-based reasoning, (inductive) learning, universal
application, and systematic modelling.</p>
      <p>
        Data mining is typically based not on one
model alone but rather on a model ensemble or model suite.
The associations among the models in a model suite are
explicitly specified. Model suites give these associations
an explicit form. Reasoning techniques
combine methods from logics (deductive, inductive,
abductive, counter-inductive, etc.), from artificial
intelligence (hypothetic, qualitative, concept-based,
adductive, etc.), from computational methods (algorithmics
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], topology, geometry, reduction, etc.), and from cognition
(problem representation and solving, causal reasoning,
etc.).
      </p>
      <p>
        These choices and handling approaches need a
systematic underpinning. Techniques from artificial
intelligence, statistics, and engineering are bundled
within the CRISP framework (e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). They can be
enhanced by techniques that have originally been
developed for modelling, for design science, business
informatics, learning theory, action theory etc.
      </p>
      <p>We combine and generalize the CRISP, heuristics,
modelling theory, design science, business informatics,
statistics, and learning approaches in this paper. First,
we introduce our notion of the model. Next we show
how data mining can be designed. We apply this
investigation to systematic modelling and later to
systematic data mining. It is our goal to develop a
holistic and systematic framework for data mining and
analysis. Many issues are beyond the scope of this
paper, such as a literature review, a formal introduction
of the approach, and a detailed discussion of data
mining application cases.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Models and Modelling</title>
      <p>Models are principal instruments in mathematics, data
analysis, modern computer engineering (CE), teaching
any kind of computer technology, and also modern
computer science (CS). They are built, applied, revised
and manufactured in many CE&amp;CS sub-disciplines in a
large variety of application cases with different
purposes and context for different communities of
practice. It is now well understood that models are
something different from theories. They are often
intuitive, visualizable, and ideally capture the essence of
an understanding within some community of practice
and some context. At the same time, they are limited in
scope, context and applicability.</p>
      <sec id="sec-2-1">
        <title>2.1 The Notion of the Model</title>
        <p>
          There is however a general notion of a model and of a
conception of the model:
A model is a well-formed, adequate, and dependable
instrument that represents origins [
          <xref ref-type="bibr" rid="ref9">9, 29, 30</xref>
          ].
        </p>
        <p>Its criteria of well-formedness, adequacy, and
dependability must be commonly accepted by its
community of practice within some context and
correspond to the functions that a model fulfills in
utilization scenarios.</p>
        <p>A well-formed instrument is adequate for a collection
of origins if it is analogous to the origins to be
represented according to some analogy criterion, it is
more focused (e.g. simpler, truncated, more abstract or
reduced) than the origins being modelled, and it
sufficiently satisfies its purpose.</p>
        <p>Well-formedness enables an instrument to be
justified by an empirical corroboration according to its
objectives, by rational coherence and conformity
explicitly stated through conformity formulas or
statements, by falsifiability or validation, and by
stability and plasticity within a collection of origins.</p>
        <p>The instrument is sufficient by its quality
characterization for internal quality, external quality and
quality in use or through quality characteristics [28]
such as correctness, generality, usefulness,
comprehensibility, parsimony, robustness, novelty etc.
Sufficiency is typically combined with some assurance
evaluation (tolerance, modality, confidence, and
restrictions).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Generic and Specific Models</title>
        <p>The general notion of a model covers all aspects of
adequateness, dependability, well-formedness, scenario,
functions and purposes, backgrounds (grounding and
basis), and outer directives (context and community of
practice). It covers all notions known so far in
agriculture, archaeology, arts, biology, chemistry,
computer science, economics, electro-technics,
environmental sciences, farming, geosciences, historical
sciences, languages, mathematics, medicine, ocean
sciences, pedagogical science, philosophy, physics,
political sciences, sociology, and sports. The models
used in these disciplines are instruments used in certain
scenarios.</p>
        <p>Sciences distinguish between general, particular
and specific things. Particular things are specific for
general things and general for specific things. The same
abstraction may be used for modelling. We may start
with a general model. So far, nobody knows how to
define general models for most utilization scenarios.
Models function as instruments or tools. Typically,
instruments come in a variety of forms and fulfill many
different functions. Instruments are partially
independent or autonomous of the thing they operate
on. Models are however special instruments. They are
used with a specific intention within a utilization
scenario. The quality of a model becomes apparent in
the context of this scenario.</p>
        <p>
          It might thus be better to start with generic models.
A generic model [
          <xref ref-type="bibr" rid="ref23 ref4">4, 26, 31, 32</xref>
          ] is a model which
broadly satisfies the purpose and broadly functions in
the given utilization scenario. It is later tailored to suit
the particular purpose and function. It generally
represents origins of interest, provides means to
establish adequacy and dependability of the model, and
establishes focus and scope of the model. Generic
models should satisfy at least five properties: (i) they
must be accurate; (ii) their quality must allow
them to be used consciously; (iii) they should
be descriptive, not evaluative; (iv) they should be
flexible so that they can be modified from time to time;
(v) they can be used as a first “best guess”.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Model Suites</title>
        <p>
          Most disciplines integrate a variety of models or a
society of models, e.g. [
          <xref ref-type="bibr" rid="ref14 ref7">7, 14</xref>
          ]. Models used in CE&amp;CS
are mainly at the same level of abstraction. It has been
well known for threescore years that they form a model
ensemble (e.g. [
          <xref ref-type="bibr" rid="ref10">10, 23</xref>
          ]) or horizontal model suite (e.g.
[
          <xref ref-type="bibr" rid="ref8">8, 27</xref>
          ]). Developed models vary in their scope, in the
aspects and facets they represent, and in their abstraction.
        </p>
        <p>A model suite consists of a set of models {M1,...,
Mn}, of an association or collaboration schema among
the models, of controllers that maintain consistency or
coherence of the model suite, of application schemata
for explicit maintenance and evolution of the model
suite, and of tracers for the establishment of the
coherence.</p>
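        <p>The components of a model suite can be sketched as a small data structure. The following Python fragment is a hypothetical illustration only: the names ModelSuite, associate, check and no_dangling_associations are assumptions introduced here, models are reduced to plain dictionaries, and application schemata are omitted for brevity.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ModelSuite:
    # The models M1, ..., Mn, here simply named dictionaries.
    models: Dict[str, dict] = field(default_factory=dict)
    # Association schema as (source, target, kind) triples.
    associations: List[Tuple[str, str, str]] = field(default_factory=list)
    # Controllers inspect the suite and report coherence violations.
    controllers: List[Callable[["ModelSuite"], List[str]]] = field(default_factory=list)
    # Tracer log that records how associations were established.
    trace: List[str] = field(default_factory=list)

    def associate(self, source, target, kind):
        self.associations.append((source, target, kind))
        self.trace.append(f"associate {source} -> {target} ({kind})")

    def check(self):
        """Run all controllers and collect the reported violations."""
        violations = []
        for controller in self.controllers:
            violations.extend(controller(self))
        return violations


def no_dangling_associations(suite):
    # Controller: every association must connect two models of the suite.
    return [f"dangling: {s} -> {t}" for s, t, _ in suite.associations
            if s not in suite.models or t not in suite.models]
```
]]></preformat>
        <p>A suite built with this controller reports no violations as long as every association connects two registered models; an association to an unknown model is reported by check and remains visible in the trace.</p>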
        <p>
          Multi-modelling [
          <xref ref-type="bibr" rid="ref11 ref19">11, 19, 24</xref>
          ] has become a culture in
CE&amp;CS. Maintenance of coherence, co-evolution, and
consistency among models has become a bottleneck in
development. Moreover, different languages with
different capabilities have become an obstacle similar to
multi-language retrieval [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and impedance
mismatches. Models are often loosely coupled. Their
dependences and relationships are often not explicitly
expressed. This problem becomes more complex if
models are used for different purposes such as
construction of systems, verification, optimization,
explanation, and documentation.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Stepwise Refinement of Models</title>
        <p>Refinement of a model to a particular or special model
provides mechanisms for model transformation along
the adequacy, the justification and the sufficiency of a
model. Refinement is based on specialization for better
suitability of a model, on removal of inessential
elements, on combination of models to provide a more
holistic view, on integration that is based on binding of
model components to other components, and on
enhancement that typically improves a model so that it becomes
more adequate or dependable.</p>
        <p>
          Control of correctness of refinement [
          <xref ref-type="bibr" rid="ref24">33</xref>
          ] for
information systems takes into account (A) a focus on
the refined structure and refined vocabulary, (B) a focus
on information systems structures of interest, (C)
abstract information systems computation segments,
(D) a description of database segments of interest, and
(E) an equivalence relation among those data of interest.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5 Deep Models and the Modelling Matrix</title>
        <p>Model development is typically based on an explicit
and rather quick description of the ‘surface’ or normal
model and on the mostly unconditional acceptance of a
deep model. The latter one directs the modelling process
and the surface or normal model. Modelling itself is
often understood as development and design of the
normal model. The deep model is taken for granted and
accepted for a number of normal models.</p>
        <p>The deep model can be understood as the common
basis for a number of models. It consists of the
grounding for modelling (paradigms, postulates,
restrictions, theories, culture, foundations, conventions,
authorities), the outer directives (context and
community of practice), and basis (assumptions, general
concept space, practices, language as carrier, thought
community and thought style, methodology, pattern,
routines, commonsense) of modelling. It uses a
collection of indisputable elements of the background
as grounding and additionally a disputable and
adjustable basis which is commonly accepted in the
given context by the community of practice. Education
on modelling starts, for instance, directly with the deep
model. In this case, the deep model has to be accepted
and is thus hidden and latent.</p>
        <p>A (modelling) matrix is something within or from
which something else originates, develops, or takes
form. The matrix is assumed to be correct for normal
models. It consists of the deep model and the modelling
scenarios. The modelling agenda is derived from the
modelling scenario and the utilization scenarios. The
modelling scenario and the deep model serve as a part
of the definitional frame within a model development
process. They also define the capacity and potential of a
model whenever it is utilized.</p>
        <p>Deep models and the modelling matrix also define
some frame for adequacy and dependability. This frame
is enhanced for specific normal models. It is then used
to state in which cases a normal model
represents the origins under consideration.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6 Deep Models and Matrices in Archaeology</title>
        <p>Let us consider an application case. The CRC 1266<sup>1</sup>
“Scales of Transformation – Human-Environmental
Interaction in Prehistoric and
Archaic Societies”
investigates processes of transformation from 15,000
BCE to 1 BCE, including crisis and collapse, on
different scales and dimensions, and as involving
different types of groups, societies, and social
formations. It is based on the matrix and a deep model
as sketched in Figure 1. This matrix determines which
normal models can still be considered and which cannot.
The initial model for any normal model accepts this
matrix.</p>
        <p>Figure 1 Modeling in archaeology with a matrix</p>
        <p>
          We base our consideration on the matrix and the
deep model on [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and the discussions in the CRC.
Whether the deep model or the model matrix is
appropriate has already been discussed. The final
version presented in this paper illustrates our
understanding.
        </p>
        <p><sup>1</sup> https://www.sfb1266.uni-kiel.de/en</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7 Stereotyping of a Data Mining Process</title>
        <p>Typical modeling (and data mining) processes follow
some kind of ritual or typical guideline, i.e. they are
stereotyped. The stereotype of a modelling process is
based on a general modelling situation. Most modelling
methodologies are bound to one stereotype and one
kind of model within one model utilization scenario.
Stereotypes govern, condition, steer and
guide the model development. They determine the
model kind, the background and way of modelling
activities. They persuade the activities of modelling.
They provide a means for considering the economics of
modelling. Often, stereotypes use a definitional frame
that primes and orients the processes and that considers
the community of practice or actors within the model
development and utilization processes, the deep model
or the matrix with its specific language and model basis,
and the agenda for model development. It might be
enhanced by initial models which are derived from
generic models in accordance with the matrix.</p>
        <p>The model utilization scenario determines the
function that a model might have and therefore also the
goals and purposes of a model.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8 The Agenda</title>
        <p>
          The agenda is something like a guideline for modelling
activities and for model associations within a model
suite. It improves the quality of model outcomes by
spending some effort on deciding what and how much
reasoning to do as opposed to what activities to do. It
balances resources between the data-level actions and
the reasoning actions. For example, [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] uses an agent approach
with preparation agents, exploration agents, descriptive
agents, and predictive agents. The agenda for a model
suite thus uses decision points that require agenda
control according to performance and resource
considerations. This understanding supports
introspective monitoring of performance for the data
mining process, coordinated control of the entire mining
process, and coordinated refinement of the models.
This kind of control is already necessary due to the
problem space, the limitations of resources, and the
amount of uncertainty in knowledge, concepts, data,
and the environment.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Data Mining Design</title>
      <sec id="sec-3-1">
        <title>3.1 Conceptualization of Data Mining and Analysis</title>
        <p>
          The data mining and analysis task must be enhanced by
an explicit treatment of the languages used for concepts
and hypotheses, and by an explicit description of
knowledge that can be used. The algorithmic solution of
the task is based on knowledge on algorithms that are
used and on data that are available and that are required
for the application of the algorithms. Typically, analysis
algorithms are iterative and can run forever. We are
interested only in convergent ones and thus need
termination criteria. Therefore, conceptualization of the
data mining and analysis task consists of a detailed
description of six main parameters (e.g. for inductive
learning [
          <xref ref-type="bibr" rid="ref25">34</xref>
          ]):
        </p>
        <p>(a) The data analysis algorithm: Algorithm
development is the main activity in data mining
research. Each of these algorithms maps data and
some specific parameters of the algorithm to a result.</p>
        <p>(b) The concept space: The concept space defines the
concepts under consideration for analysis based on
a certain language and common understanding.</p>
        <p>(c) The data space: The data space typically consists of
a multi-layered data set of different granularity. Data
sets may be enhanced by metadata that characterize the
data sets and associate the data sets to other data sets.</p>
        <p>(d) The hypotheses space: An algorithm is supposed to
map evidence on the concepts to be supported or
rejected into hypotheses about them.</p>
        <p>(e) The prior knowledge space: Specifying the
hypothesis space already provides some prior
knowledge. In particular, the analysis task starts with
the assumption that the target concept is representable
in a certain way.</p>
        <p>(f) The acceptability and success criteria: Criteria for
successful analysis allow us to derive termination criteria
for the data analysis.</p>
        <p>Each instantiation and refinement of the six parameters
leads to specific data mining tasks.</p>
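        <p>To make the role of the six parameters concrete, the following Python sketch bundles them into one task description and runs an iterative algorithm until the success criterion yields termination. It is a toy illustration under stated assumptions: the names MiningTask, run, move_towards_mean and max_steps are hypothetical, the concept space and prior knowledge are carried along only as descriptive values, and the algorithm merely estimates a mean.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class MiningTask:
    algorithm: Callable[[Any, List[Any]], Any]  # (a) one refinement step
    concepts: List[str]                         # (b) concept space (names only)
    data: List[Any]                             # (c) data space
    initial_hypothesis: Any                     # (d) starting point in the hypotheses space
    prior_knowledge: Dict[str, Any]             # (e) assumed representation of the target
    accept: Callable[[Any, Any], bool]          # (f) success criterion on (previous, refined)
    max_steps: int = 1000                       # safety bound, an assumption of this sketch

    def run(self) -> Tuple[Any, int]:
        """Refine the hypothesis until the success criterion holds."""
        current = self.initial_hypothesis
        for step in range(1, self.max_steps + 1):
            refined = self.algorithm(current, self.data)
            if self.accept(current, refined):
                return refined, step
            current = refined
        raise RuntimeError("no acceptable hypothesis within max_steps")


def move_towards_mean(estimate, data):
    # Toy algorithm: move the estimate halfway towards the sample mean.
    return estimate + 0.5 * (sum(data) / len(data) - estimate)


task = MiningTask(
    algorithm=move_towards_mean,
    concepts=["central tendency"],
    data=[1.0, 2.0, 3.0, 4.0],
    initial_hypothesis=0.0,
    prior_knowledge={"representation": "a single real number"},
    accept=lambda previous, refined: abs(refined - previous) < 1e-6,
)
estimate, steps = task.run()
```
]]></preformat>
        <p>Changing any single field, for example the success criterion or the data, produces a different task instance, which mirrors the statement that each instantiation of the parameters leads to a specific data mining task.</p>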
        <p>The result of data mining and data analysis is described
within the knowledge space. The data mining and
analysis task may thus be considered to be a
transformation of data sets, concept sets and hypothesis
sets into chunks of knowledge through the application
of algorithms.</p>
        <p>
          Problem solving and modelling considers,
however, typically six aspects [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]:
        </p>
        <p>(1) Application, problems, and users: The domain
consists of a model of the application, a specification of
the problems under consideration, of the tasks that are issued,
and of profiles of users.</p>
        <p>(2) Context: The context of a problem is anything that
could support the problem solution, e.g. the sciences’
background, theories, knowledge, foundations, and
concepts to be used for problem specification, problem
background, and solutions.</p>
        <p>(3) Technology: Technology is the enabler and defines
the methodology. It provides [23] means for the flow of
problem solving steps, the flow of activities, the
distribution, the collaboration, and the exchange.</p>
        <p>(4) Techniques and methods: Techniques and methods
can be given as algorithms. Specific algorithms are data
improvers and cleaners, data aggregators, data
integrators, controllers, checkers, acceptance
determiners, and termination algorithms.</p>
        <p>(5) Data: Data have their own structuring, their quality
and their life span. They are typically enhanced by
metadata. Data management is a central element of
most problem solving processes.</p>
        <p>(6) Solutions: The solutions to problem solving can be
formally given, illustrated by visual means, and
presented by models. Models are typically only normal
models. The deep model and the matrix are already
provided by the context and accepted by the community
of practice, depending on the needs of this
community for the given application scenario.
Therefore, models may be the final result of a data
mining and analysis process besides other means.</p>
        <p>
          Comparing these six spaces with the six
parameters, we discover that only four spaces have been
considered so far in data mining. We miss the user and
application space as well as the representation space.
Figure 2 shows the difference.
        </p>
        <p>
The problem to be tackled must be clearly stated in
dependence on the utilization scenario, the tasks to be
solved, the community of practice involved, and the
given context. The result of this step is the deep model
and its matrix. The former is based on the background,
the specific context parameters such as infrastructure and
environment, and candidates for deep models.
        </p>
        <p>
An abstraction layer approach separates the application
domain, the model domain and the data domain [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
This separation is illustrated in Figure 3.
The data mining design framework uses the inverse
modeling approach. It starts with the consideration of
the application domain and develops models as
mediators between the data and the application domain
worlds. In the sequel we are going to combine the three
approaches of this section. The meta-model corresponds
to other meta-models such as inductive modelling or
hypothetical reasoning (hypotheses development,
experimenting and testing, analysis of results, interim
conclusions, reappraisal against real world).
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Data Mining: A Systematic Model-Based Approach</title>
      <p>
        Our approach presented so far allows us to revise and to
reformulate the model-oriented data mining process on
the basis of well-defined engineering [
        <xref ref-type="bibr" rid="ref15">15, 25</xref>
        ] or
alternatively on systematic mathematical problem
solving [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Figure 4 displays this revision. We realize
that the first two phases are typically implicitly assumed
and not considered. We concentrate on the non-iterative
form. Iterative processes can be handled in a similar
form.
      </p>
      <p>Figure 4 The Phases in Data Mining Design
(Non-iterative form)</p>
      <p>The data mining tasks can now be formulated based
on the matrix and the deep model. We set up the
context, the environment, the general goal of the
problem and also criteria for adequateness and
dependability of the solution, e.g. invariance properties
for the problem description, for the task setting and for its
mathematical formulation, and solution faithfulness
properties for later application of the solution in the
given environment. What exactly is the problem, and what is the
expected benefit? What should a solution look like?
What is known about the application?</p>
      <p>
        Deep models already use a background consisting of
an indisputable grounding and a selectable basis. The
explicit statement of the background provides an
understanding of the postulates, paradigms,
assumptions, conceptions, practices, etc. Without the
background, the results of the analysis cannot be
properly understood. Models have their profile, i.e.
goals, purposes and functions. These must be explicitly
given. The parameters of a generic model can be either
order or slave parameters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], either primary or
secondary or tertiary (also called genotypes or
phenotypes or observables) [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], and either ruling (or
order) or driven parameters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Data mining can be
enhanced by knowledge management techniques.
      </p>
      <p>Additionally, the concept space into which the data
mining task is embedded must be specified. This
concept space is enhanced during data analysis.</p>
      <sec id="sec-5-1">
        <title>4.2 Stereotyping the Process</title>
        <p>The general flow of data mining activities is typically
implicitly assumed on the basis of stereotypes which
form a set of tasks, e.g. proof tasks in whatever
system, transformation tasks, description tasks, and
investigation tasks. Proofs can follow the classical
deductive or inductive setting. Abductive,
adductive, hypothetical and other reasoning techniques
are also applicable. Stereotypes typically use model suites as
a collection of associated models, are already biased by
priming and orientation, and follow policies, data mining
design constraints, and framing.</p>
        <p>Data mining and analysis is rather stereotyped. For
instance, mathematical culture has already developed a
good number of stereotypes for problem formulation. It
is based on a mathematical language for the formulation
of analysis tasks, on selection and instantiation of the
best fitting variable space and the space of opportunities
provided by mathematics.</p>
        <p>Data mining uses generic models which are the
basis of normal models. Models are based on a
separation of concerns according to the problem setting:
dependence-indicating, dependence-describing,
separation or partition spaces, pattern kinds, reasoning
kinds, etc. This separation of concerns governs the
classical data mining algorithmic classes: association
analysis, cluster analysis, data grouping with or without
classification, classifiers and rules, dependences among
parameters and data subsets, predictor analysis,
synergetics, blind or informed or heuristic investigation of
the search space, and pattern learning.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.3 Initialization of the Normal Data Models</title>
        <p>
          Data mining algorithms have their capacity and
potential [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Potential and capacity can be based on
SWOT (strengths, weaknesses, opportunities, and
threats), SCOPE (situation, core competencies,
obstacles, prospects, expectation), and SMART (how
simple, meaningful, adequate, realistic, and trackable)
analysis of methods and algorithms. Each of the
algorithm classes has its strengths and weaknesses, its
satisfaction of the tasks and the purpose, and its limits
of applicability. Algorithm selection also includes an
explicit specification of the order of application of these
algorithms and of mapping parameters that are derived
by means of one algorithm to those that are an input for
the others, i.e. an explicit association within the model
suite. Additionally, evaluation algorithms for the
success criteria are selected. Algorithms have their own
obstinacy, their hypotheses and assumptions that must
be taken into consideration. Whether an algorithm can
be considered depends on acceptance criteria derived in
the previous two steps.
        </p>
        <p>So, we ask: What kind of model suite architecture suits
the problem best? What are applicable development
approaches for modelling? What is the best modelling
technique to get the right model suite? What kind of
reasoning is supported? What not? What are the
limitations? Which pitfalls should be avoided?</p>
        <p>
          The result of the entire data mining process heavily
depends on the appropriateness of the data sets, their
properties and quality, and more generally the data
schemata with essentially three components: application
data schema with detailed description of data types,
metadata schema [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and generated and auxiliary data
schemata. The first component is well-investigated in
data mining and data management monographs. The
second and third components inherit research results
from database management, from data marts or
warehouses, and from the layering of data. An essential element
is the explicit specification of the quality of data. It
allows us to derive algorithms for data improvement and
limitations on the applicability of algorithms.
Auxiliary data support the performance of the algorithms.
        </p>
        <p>Therefore, typical data-oriented questions are: What
data do we have available? Is the data relevant to the
problem? Is it valid? Does it reflect our expectations?
Are the data quality, quantity, and recency sufficient? Which
data should we concentrate on? How is the data
transformed for modelling? How may we increase the
quality of data?</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.4 The Data Mining Process Itself</title>
        <p>The data mining process can be understood as a
coherent and stepwise refinement of the given model
suite. The model refinement may use an explicit
transformation or an extract-transform-load process
among models within the model suite. Evaluation and
termination algorithms are an essential element of any
data mining algorithm. They can be based on quality
criteria for the finalized models in the model suite, e.g.
generality, error-proneness, stability,
selection-proneness, validation, understandability, repeatability,
usability, usefulness, and novelty.</p>
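        <p>Stepwise refinement with an evaluation and a termination criterion can be sketched as follows. This is a minimal illustration under stated assumptions: refine_until, half_step, mse, and DATA are hypothetical names, the "model" is a single constant predictor, and the only quality criterion used is the mean squared error.</p>
        <preformat preformat-type="code">
```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")


def refine_until(
    model: M,
    refine: Callable[[M], M],
    evaluate: Callable[[M], float],
    tolerance: float = 1e-6,
    max_steps: int = 50,
) -> Tuple[M, int]:
    """Refine a model stepwise; stop when the error no longer improves enough.

    Returns the last accepted model and the number of accepted refinements.
    """
    error = evaluate(model)
    for step in range(1, max_steps + 1):
        candidate = refine(model)
        candidate_error = evaluate(candidate)
        improvement = error - candidate_error
        # Termination rule: the refinement step did not pay off.
        if tolerance >= improvement:
            return model, step - 1
        model, error = candidate, candidate_error
    return model, max_steps


# Toy data set and toy model (a constant predictor); both are assumptions.
DATA: List[float] = [1.0, 2.0, 3.0, 6.0]


def mse(m: float) -> float:
    """Evaluation: mean squared error of the constant predictor m."""
    return sum((x - m) ** 2 for x in DATA) / len(DATA)


def half_step(m: float) -> float:
    """Refinement: move the predictor halfway towards the data mean."""
    return m + 0.5 * (sum(DATA) / len(DATA) - m)


if __name__ == "__main__":
    final_model, steps = refine_until(0.0, half_step, mse)
    print(final_model, steps)
```
</preformat>
        <p>Other quality criteria from the list above, such as stability or repeatability, could replace the error measure by supplying a different evaluation function.</p>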
        <p>Typical questions to answer within this process
are: How good is the model suite in terms of the task
setting? What have we really learned about the
application domain? What is the real adequacy and
dependability of the models in the model suite? How
can these models best be deployed? How do we know
that the models in the model suite are still valid? Which
data support which model in the model suite?
Which kinds of data errors are inherited by which part
of which model?</p>
        <p>The final result of the data mining process is then a
combination of the deep model and the normal model,
where the former is a latent or hidden component in
most cases. If, however, we want to reason about the
results, then the deep model must be understood as well.
Otherwise, the results may be surprising and
unconvincing.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.5 Controllers and Selectors</title>
        <p>
          Algorithmics [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] treats algorithms as general solution
patterns that have parameters for their instantiation,
handling mechanisms for their specialization to a given
environment, and enhancers for context injection. So,
an algorithm can be derived based on explicit selectors
and control rules [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] if we neglect context injection. We
can use this approach for data mining design (DMD).
For instance, an algorithm pattern such as regression
uses a generic model of parameter dependence, is based
on blind search, has parameters for similarity and model
quality, and has selection support for specific treatment
of the given data set. In this case, the controller is based
on enablers that specify applicability of the approach,
on error rules, on data evaluation rules that detect
dependencies among control parameters and derive data
quality measures, and on quality rules for confidence
statements.
        </p>
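        <p>The regression example can be made concrete with a small sketch of a controller. It is a simplified illustration, not the authors' implementation: the names Controller, fit_line, r_squared, and default_controller are hypothetical, only enablers and one quality rule are modelled, and error rules and data evaluation rules are omitted.</p>
        <preformat preformat-type="code">
```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple, Union

Data = List[Tuple[float, float]]  # (x, y) observations
Result = Dict[str, Union[str, float]]


def fit_line(data: Data) -> Tuple[float, float]:
    """Instantiated pattern: ordinary least squares for y = a + b * x."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    b = sxy / sxx
    return my - b * mx, b


def r_squared(data: Data, a: float, b: float) -> float:
    """Model quality: coefficient of determination of the fitted line."""
    my = mean(y for _, y in data)
    ss_tot = sum((y - my) ** 2 for _, y in data)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in data)
    return 1.0 - ss_res / ss_tot if ss_tot else 1.0


@dataclass
class Controller:
    """Enablers decide applicability; a quality rule decides acceptance."""
    enablers: List[Callable[[Data], bool]]
    quality_threshold: float = 0.9

    def run(self, data: Data) -> Result:
        if not all(enabler(data) for enabler in self.enablers):
            return {"status": "not applicable"}
        a, b = fit_line(data)
        quality = r_squared(data, a, b)
        status = "accepted" if quality >= self.quality_threshold else "rejected"
        return {"status": status, "intercept": a, "slope": b, "r2": quality}


def default_controller() -> Controller:
    """Two assumed enablers: enough observations and varying x values."""
    return Controller(enablers=[
        lambda d: len(d) >= 3,
        lambda d: len({x for x, _ in d}) >= 2,
    ])


if __name__ == "__main__":
    print(default_controller().run([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]))
```
</preformat>
        <p>Keeping the enablers and the quality threshold outside the fitting routine mirrors the separation between the algorithm pattern and its controller described above.</p>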
      </sec>
      <sec id="sec-5-5">
        <title>4.6 Data Mining and Design Science</title>
        <p>
          Let us finally associate our approach with design
science research [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Design science considers
systematic modelling as an embodiment of three closely
related cycles of activities. The relevance cycle initiates
design science research with an application context that
not only provides the requirements for the research as
inputs but also defines acceptance criteria for the
ultimate evaluation of the research results. The central
design cycle iterates between the core activities of
building and evaluating the design artifacts and
processes of the research. The orthogonal rigor cycle
provides past knowledge to the research project to
ensure its innovation. It is contingent on the
researchers thoroughly researching and referencing the
knowledge base in order to guarantee that the designs
produced are research contributions and not routine
designs based upon the application of well-known
processes.
        </p>
        <p>The relevance cycle is concerned with the problem
specification and setting and with the derivation of the matrix
and agenda. The design cycle is related to all other
phases of our framework. The rigor cycle is enhanced
by our framework and thus provides a systematic
modelling approach.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Conclusion</title>
      <p>The literature on data mining is fairly rich. Mining tools
are already mature enough to support any
kind of data analysis, provided that the data mining problem is well
understood, the intentions for models are properly
understood, and the problem is professionally set up.
Data mining aims at the development of model suites that
allow us to draw dependable and thus
justifiable conclusions on the given data set. Data
mining is a process that can be based on a framework
for systematic modelling that is driven by a deep model
and a matrix. Textbooks on data mining typically
explore algorithms in detail as blind search. Data
mining is a specific form of modelling. Therefore, we
can combine modelling with data mining in a more
sophisticated form. Models, however, have an inner
structure with parts that are given by the application,
by the context, by common sense, and by a
community of practice. These fixed parts are then
enhanced by normal models. A typical normal model is
the result of a data mining process.</p>
      <p>The current state of the art in data mining is mainly
technology and algorithm driven. Problem selection
is based on intuition and experience. So, the matrix and
the deep model are latent and hidden. The problem
specification is not explicit. Therefore, this paper addresses
the entire data mining process and highlights a way to
move beyond ad-hoc, blind, and somewhat chaotic data
analysis. The approach we are developing integrates the
theory of models, the theory of problem solving, design
science, and knowledge and content management. We
realized that data mining can be systematized. The
framework for data mining design is presented
as an example in Figure 4.</p>
      <p>Acknowledgement. We are grateful for the support of this
paper by the CRC 1266. We are very thankful for the
fruitful discussions with the members of the CRC.</p>
      <p>[27] G. Simsion and G.C. Witt. Data modeling
essentials. Morgan Kaufmann, San Francisco
(2005)</p>
      <p>B. Thalheim. Towards a theory of
conceptual modelling. Journal of Universal
Computer Science, 16(20): pp. 3102–3137
(2010)</p>
      <p>[30] B. Thalheim. The conceptual model ≡ an
adequate and dependable artifact enhanced
by concepts. In: Information Modelling and
Knowledge Bases XXV, pp. 241–254. IOS
Press (2014)</p>
      <p>[31] B. Thalheim. Conceptual modeling
foundations: The notion of a model in
conceptual modeling. In: Encyclopedia of
Database Systems, Springer (2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <source>The mechanism of evolution</source>
          . Chapman and Hall, New York (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Berghammer</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          .
          <article-title>Methodenbasierte mathematische Modellierung mit Relationenalgebren</article-title>
          . In: Wissenschaft und Kunst der Modellierung: Modelle, Modellieren, Modellierung, pp.
          <fpage>67</fpage>
          -
          <lpage>106</lpage>
          . De Gruyter, Boston (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Berthold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Borgelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Höppner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Klawonn</surname>
          </string-name>
          .
          <article-title>Guide to intelligent data analysis</article-title>
          . Springer, London (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bienemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-D.</given-names>
            <surname>Schewe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          .
          <article-title>Towards a theory of genericity based on government and binding</article-title>
          .
          <source>In: Proc. ER'06, LNCS 4215</source>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>324</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.B.</given-names>
            <surname>Booker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Holland</surname>
          </string-name>
          .
          <article-title>Classifier systems and genetic algorithms</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>40</volume>
          (
          <issue>1-3</issue>
          ): pp.
          <fpage>235</fpage>
          -
          <lpage>282</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brassard</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bratley</surname>
          </string-name>
          .
          <source>Algorithmics - Theory and Practice</source>
          . Prentice Hall, London (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Coleman</surname>
          </string-name>
          .
          <article-title>Scientific models as works</article-title>
          .
          <source>Cataloging &amp; Classification Quarterly</source>
          , Special Issue:
          <article-title>Works as Entities for Information Retrieval</article-title>
          ,
          <volume>33</volume>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dahanayake</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          .
          <article-title>Coevolution of (information) system models</article-title>
          .
          <source>In: EMMSAD</source>
          <year>2010</year>
          ,
          LNBIP vol.
          <volume>50</volume>
          , pp.
          <fpage>314</fpage>
          -
          <lpage>326</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Embley</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          (eds).
          <source>The Handbook of Conceptual Modeling: Its Usage and Its Challenges</source>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.P.</given-names>
            <surname>Gillett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.W.</given-names>
            <surname>Zwiers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.J.</given-names>
            <surname>Weaver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.C.</given-names>
            <surname>Hegerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Allen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.A.</given-names>
            <surname>Stott</surname>
          </string-name>
          .
          <article-title>Detecting anthropogenic influence with a multi-model ensemble</article-title>
          .
          <source>Geophys. Res. Lett.</source>
          ,
          <volume>29</volume>
          :
          <fpage>31</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Lara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.S.</given-names>
            <surname>Kolovos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.F.</given-names>
            <surname>Paige</surname>
          </string-name>
          .
          <article-title>Inter-modelling: From theory to practice</article-title>
          .
          <source>In MoDELS</source>
          <year>2010</year>
          , LNCS 6394, pp.
          <fpage>376</fpage>
          -
          <lpage>391</lpage>
          , Springer, Berlin (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Haken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wunderlin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yigitbasi</surname>
          </string-name>
          .
          <article-title>An introduction to synergetics</article-title>
          .
          <source>Open Systems and Information Dynamics</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hevner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>March</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ram</surname>
          </string-name>
          .
          <article-title>Design science in information systems research</article-title>
          .
          <source>MIS Quarterly</source>
          ,
          <volume>28</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>75</fpage>
          -
          <lpage>105</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.J.</given-names>
            <surname>Hunter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>McCulloch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Noble</surname>
          </string-name>
          .
          <article-title>Multiscale modeling: Physiome project standards, tools, and databases</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <volume>39</volume>
          (
          <issue>11</issue>
          ), pp.
          <fpage>48</fpage>
          -
          <lpage>54</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>ISO/IEC 25020 (Software and system engineering - software product quality requirements and evaluation (square) - measurement reference model and guide)</article-title>
          .
          <source>ISO/IEC JTC1/SC7 N3280</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jaakkola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kidawara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zettsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Heimbürger</surname>
          </string-name>
          .
          <article-title>Information modelling and global risk management systems</article-title>
          .
          <source>In: Information Modeling and Knowledge Bases XX</source>
          , pp.
          <fpage>429</fpage>
          -
          <lpage>446</lpage>
          . IOS Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jannaschk</surname>
          </string-name>
          .
          <article-title>Infrastruktur für ein Data Mining Design Framework</article-title>
          .
          <source>PhD thesis</source>
          , Christian-Albrechts University, Kiel (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kramer</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          .
          <article-title>A metadata system for quality management</article-title>
          .
          <source>In: Information Modelling and Knowledge Bases</source>
          , pp.
          <fpage>224</fpage>
          -
          <lpage>242</lpage>
          . IOS Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nakoinz</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Knitter</surname>
          </string-name>
          . Modelling Human Behaviour in Landscapes. Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pardillo</surname>
          </string-name>
          .
          <article-title>A systematic review on the definition of UML profiles</article-title>
          .
          <source>In: MoDELS</source>
          <year>2010</year>
          , LNCS 6394, pp.
          <fpage>407</fpage>
          -
          <lpage>422</lpage>
          , Springer, Berlin (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Petrelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>Which user interaction for cross-language information retrieval? Design issues and reflections</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>57</volume>
          (
          <issue>5</issue>
          ): pp.
          <fpage>709</fpage>
          -
          <lpage>722</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>O.H.</given-names>
            <surname>Pilkey</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Pilkey-Jarvis</surname>
          </string-name>
          .
          <article-title>Useless Arithmetic: Why Environmental Scientists Can't Predict the Future</article-title>
          . Columbia University Press, New York (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tropmann-Frick</surname>
          </string-name>
          .
          <article-title>Wherefore models are used and accepted? The model functions as a quality instrument in utilisation scenarios</article-title>
          . In: I. Comyn-Wattiau, C. du Mouza, and N. Prat, editors,
          <source>Ingenierie Management des Systemes D'Information</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tropmann-Frick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ziebermayr</surname>
          </string-name>
          .
          <article-title>Application of generic workflows for disaster management</article-title>
          .
          <source>In: Information Modelling and Knowledge Bases</source>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>81</lpage>
          . IOS Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Towards a theory of refinement for data migration</article-title>
          .
          <source>In: ER'</source>
          <year>2011</year>
          , LNCS 6998, pp.
          <fpage>318</fpage>
          -
          <lpage>331</lpage>
          . Springer, (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zeugmann</surname>
          </string-name>
          .
          <article-title>Inductive inference of optimal programs: A survey and open problems</article-title>
          .
          <source>In: Nonmonotonic and Inductive Logics</source>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>222</lpage>
          . Springer, Berlin (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>