=Paper=
{{Paper
|id=Vol-2022/paper44
|storemode=property
|title=Data Mining Design and Systematic Modelling
|pdfUrl=https://ceur-ws.org/Vol-2022/paper44.pdf
|volume=Vol-2022
|authors=Yannic Kropp,Bernhard Thalheim
|dblpUrl=https://dblp.org/rec/conf/rcdl/KroppT17a
}}
==Data Mining Design and Systematic Modelling==
© Yannic Kropp © Bernhard Thalheim
Christian Albrechts University Kiel, Department of Computer Science, D-24098 Kiel, Germany
yk@is.informatik.uni-kiel.de thalheim@is.informatik.uni-kiel.de
Abstract. Data mining is currently a well-established technique and is supported by many algorithms. It depends on the data at hand, on properties of the algorithms, on the technology developed so far, and on the expectations and limits to be applied. It must thus be matured, predictable, optimisable, evolving, adaptable and well-founded, similar to mathematics and SPICE/CMM-based software engineering. Data mining must therefore be systematic if the results are to fit their purpose. One basis of this systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The main notion is the notion of the model in a variety of forms, abstraction and associations among models.

Keywords: data mining, modelling, models, framework, deep model, normal model, modelling matrix

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017
1 Introduction

Data mining and analysis is nowadays well-understood from the algorithms side. There are thousands of algorithms that have been proposed. The number of success stories is overwhelming and has caused the big data hype. At the same time, brute-force application of algorithms is still the standard. Nowadays data analysis and data mining algorithms are still taken for granted. They transform data sets and hypotheses into conclusions. For instance, cluster algorithms check for given data sets and a clustering requirements portfolio whether this portfolio can be supported, and provide a set of clusters as output in the positive case. The Hopkins index is one of the criteria that allow one to judge whether clusters exist within a data set. A systematic approach to data mining has already been proposed in [3, 17]. It is based on mathematics and mathematical statistics and is thus able to handle errors, biases and configuration of data mining as well. Our experience in large data mining projects in archaeology, ecology, climate research, medical research etc. has however shown that ad-hoc and brute-force mining is still the main approach. The results are taken for granted and believed despite the modelling, understanding, flow-of-work and data handling pitfalls. So, the results often become dubious.
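To make the cluster-tendency criterion mentioned above concrete, the following minimal sketch computes the Hopkins statistic for a numeric data set. It is an illustration only; the function name, the sampling choices and the example data are ours and are not taken from [3, 17].

import numpy as np

def hopkins_statistic(X, m=None, rng=None):
    """Hopkins statistic: values near 0.5 suggest random data, values near 1.0 suggest a clustering tendency."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)                       # size of the probe sample
    rng = rng or np.random.default_rng(0)

    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))           # uniform probes in the bounding box of the data
    idx = rng.choice(n, size=m, replace=False)     # probes drawn from the data itself

    def nn_dist(P, exclude_self=False):
        dists = np.linalg.norm(P[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            dists[np.arange(len(P)), idx] = np.inf  # ignore the probe point itself
        return dists.min(axis=1)

    u = nn_dist(U)                                 # nearest-neighbour distances of the uniform probes
    w = nn_dist(X[idx], exclude_self=True)         # nearest-neighbour distances of the data probes
    return u.sum() / (u.sum() + w.sum())

# Example: two well-separated blobs should yield a value close to 1.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
print(round(hopkins_statistic(data), 2))

Values near 0.5 indicate essentially random data, while values near 1 indicate a clustering tendency; only in the latter case does a clustering requirements portfolio have a chance of being supported.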
Data are the main source of information in data mining and analysis. Their quality properties have been neglected for a long time. At the same time, modern data management allows these problems to be handled. In [16] we compare the critical findings or pitfalls of [21] with resolution techniques that can be applied to overcome the crucial pitfalls of data mining in environmental sciences reported there. Another source of pitfalls are the algorithms themselves that are typically used for the solution of data mining and analysis tasks. It is neglected that an algorithm also has an application area, application restrictions, data requirements, and results at a certain granularity and precision. These problems must be systematically tackled if we want to rely on the results of mining and analysis. Otherwise analysis may become misleading, biased, or not possible. Therefore, we explicitly treat properties of mining and analysis. A similar observation can be made for data handling.

Data mining is often considered to be a separate sub-discipline of computer engineering and science. The statistics basis of data mining is well accepted. We typically start with a general (or better generic) model and use for refinement or improvement of the model the data that are on hand and that seem to be appropriate. This technique is known in sciences under several names such as inverse modelling, generic modelling, pattern-based reasoning, (inductive) learning, universal application, and systematic modelling.

Data mining is typically not only based on one model but rather on a model ensemble or model suite. The association among models in a model suite is explicitly specified. These associations provide an explicit form via model suites. Reasoning techniques combine methods from logics (deductive, inductive, abductive, counter-inductive, etc.), from artificial intelligence (hypothetic, qualitative, concept-based, adductive, etc.), computational methods (algorithmics [6], topology, geometry, reduction, etc.), and cognition (problem representation and solving, causal reasoning, etc.).

These choices and handling approaches need a systematic underpinning. Techniques from artificial intelligence, statistics, and engineering are bundled within the CRISP framework (e.g. [3]). They can be enhanced by techniques that have originally been developed for modelling, for design science, business informatics, learning theory, action theory etc.

We combine and generalize the CRISP, heuristics,
statistics, and learning approaches in this paper. First, we introduce our notion of the model. Next we show how data mining can be designed. We apply this investigation to systematic modelling and later to systematic data mining. It is our goal to develop a holistic and systematic framework for data mining and analysis. Many issues are left out of the scope of this paper, such as a literature review, a formal introduction of the approach, and a detailed discussion of data mining application cases.

2 Models and Modelling

Models are principal instruments in mathematics, data analysis, modern computer engineering (CE), teaching any kind of computer technology, and also modern computer science (CS). They are built, applied, revised and manufactured in many CE&CS sub-disciplines in a large variety of application cases with different purposes and context for different communities of practice. It is now well understood that models are something different from theories. They are often intuitive, visualizable, and ideally capture the essence of an understanding within some community of practice and some context. At the same time, they are limited in scope, context and applicability.

2.1 The Notion of the Model

There is however a general notion of a model and of a conception of the model:

A model is a well-formed, adequate, and dependable instrument that represents origins [9, 29, 30].

Its criteria of well-formedness, adequacy, and dependability must be commonly accepted by its community of practice within some context and correspond to the functions that a model fulfills in utilization scenarios.

A well-formed instrument is adequate for a collection of origins if it is analogous to the origins to be represented according to some analogy criterion, it is more focused (e.g. simpler, truncated, more abstract or reduced) than the origins being modelled, and it sufficiently satisfies its purpose.

Well-formedness enables an instrument to be justified by an empirical corroboration according to its objectives, by rational coherence and conformity explicitly stated through conformity formulas or statements, by falsifiability or validation, and by stability and plasticity within a collection of origins. The instrument is sufficient by its quality characterization for internal quality, external quality and quality in use or through quality characteristics [28] such as correctness, generality, usefulness, comprehensibility, parsimony, robustness, novelty etc. Sufficiency is typically combined with some assurance evaluation (tolerance, modality, confidence, and restrictions).

2.2 Generic and Specific Models

The general notion of a model covers all aspects of adequateness, dependability, well-formedness, scenario, functions and purposes, backgrounds (grounding and basis), and outer directives (context and community of practice). It covers all notions known so far in agriculture, archaeology, arts, biology, chemistry, computer science, economics, electro-technics, environmental sciences, farming, geosciences, historical sciences, languages, mathematics, medicine, ocean sciences, pedagogical science, philosophy, physics, political sciences, sociology, and sports. The models used in these disciplines are instruments used in certain scenarios.

Sciences distinguish between general, particular and specific things. Particular things are specific for general things and general for specific things. The same abstraction may be used for modelling. We may start with a general model. So far, nobody knows how to define general models for most utilization scenarios.

Models function as instruments or tools. Typically, instruments come in a variety of forms and fulfill many different functions. Instruments are partially independent or autonomous of the thing they operate on. Models are however special instruments. They are used with a specific intention within a utilization scenario. The quality of a model becomes apparent in the context of this scenario.

It might thus be better to start with generic models. A generic model [4, 26, 31, 32] is a model which broadly satisfies the purpose and broadly functions in the given utilization scenario. It is later tailored to suit the particular purpose and function. It generally represents origins of interest, provides means to establish adequacy and dependability of the model, and establishes the focus and scope of the model. Generic models should satisfy at least five properties: (i) they must be accurate; (ii) the quality of generic models allows that they are used consciously; (iii) they should be descriptive, not evaluative; (iv) they should be flexible so that they can be modified from time to time; (v) they can be used as a first “best guess”.

2.3 Model Suites

Most disciplines integrate a variety of models or a society of models, e.g. [7, 14]. Models used in CE&CS are mainly at the same level of abstraction. It has been well-known for threescore years that they form a model ensemble (e.g. [10, 23]) or horizontal model suite (e.g. [8, 27]). Developed models vary in their scopes, in the aspects and facets they represent, and in their abstraction.

A model suite consists of a set of models {M1, ..., Mn}, of an association or collaboration schema among the models, of controllers that maintain consistency or coherence of the model suite, of application schemata for explicit maintenance and evolution of the model suite, and of tracers for the establishment of the coherence.
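As a small illustration, this definition can be mirrored directly in a data structure; the class and field names below are our own shorthand for the listed components and are not taken from [8, 27].

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ModelSuite:
    """A model suite as described above: models M1..Mn plus the glue that keeps them coherent."""
    models: Dict[str, object]                                                            # named models M1..Mn
    associations: List[Tuple[str, str, str]] = field(default_factory=list)               # (source, target, kind of association)
    controllers: List[Callable[["ModelSuite"], List[str]]] = field(default_factory=list) # each returns detected inconsistencies
    application_schemata: Dict[str, Callable[["ModelSuite"], None]] = field(default_factory=dict)  # maintenance and evolution operations
    tracers: List[Callable[["ModelSuite", str], None]] = field(default_factory=list)     # record how coherence is (re-)established

    def check_coherence(self) -> List[str]:
        """Run all controllers and collect their complaints."""
        issues = []
        for controller in self.controllers:
            issues.extend(controller(self))
        for tracer in self.tracers:
            tracer(self, f"coherence check: {len(issues)} issue(s)")
        return issues

A concrete suite would of course carry real models and controllers; the point is only that associations, controllers, application schemata and tracers are first-class parts of the suite rather than implicit conventions.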
Multi-modelling [11, 19, 24] became a culture in CE&CS. Maintenance of coherence, co-evolution, and consistency among models has become a bottleneck in development. Moreover, different languages with
different capabilities have become an obstacle similar to multi-language retrieval [20] and impedance mismatches. Models are often loosely coupled. Their dependence and relationship is often not explicitly expressed. This problem becomes more complex if models are used for different purposes such as construction of systems, verification, optimization, explanation, and documentation.

2.4 Stepwise Refinement of Models

Refinement of a model to a particular or special model provides mechanisms for model transformation along the adequacy, the justification and the sufficiency of a model. Refinement is based on specialization for better suitability of a model, on removal of unessential elements, on combination of models to provide a more holistic view, on integration that is based on binding of model components to other components, and on enhancement that typically improves a model to become more adequate or dependable.

Control of correctness of refinement [33] for information systems takes into account (A) a focus on the refined structure and refined vocabulary, (B) a focus on information systems structures of interest, (C) abstract information systems computation segments, (D) a description of database segments of interest, and (E) an equivalence relation among those data of interest.

2.5 Deep Models and the Modelling Matrix

Model development is typically based on an explicit and rather quick description of the ‘surface’ or normal model and on the mostly unconditional acceptance of a deep model. The latter directs the modelling process and the surface or normal model. Modelling itself is often understood as development and design of the normal model. The deep model is taken for granted and accepted for a number of normal models.

The deep model can be understood as the common basis for a number of models. It consists of the grounding for modelling (paradigms, postulates, restrictions, theories, culture, foundations, conventions, authorities), the outer directives (context and community of practice), and the basis (assumptions, general concept space, practices, language as carrier, thought community and thought style, methodology, pattern, routines, commonsense) of modelling. It uses a collection of undisputable elements of the background as grounding and additionally a disputable and adjustable basis which is commonly accepted in the given context by the community of practice. Education on modelling starts, for instance, directly with the deep model. In this case, the deep model has to be accepted and is thus hidden and latent.

A (modelling) matrix is something within or from which something else originates, develops, or takes form. The matrix is assumed to be correct for normal models. It consists of the deep model and the modelling scenarios. The modelling agenda is derived from the modelling scenario and the utilization scenarios. The modelling scenario and the deep model serve as a part of the definitional frame within a model development process. They also define the capacity and potential of a model whenever it is utilized.

Deep models and the modelling matrix also define some frame for adequacy and dependability. This frame is enhanced for specific normal models. It is then used for a statement in which cases a normal model represents the origins under consideration.

2.6 Deep Models and Matrices in Archaeology

Let us consider an application case. The CRC 1266 “Scales of Transformation – Human-Environmental Interaction in Prehistoric and Archaic Societies” (https://www.sfb1266.uni-kiel.de/en) investigates processes of transformation from 15,000 BCE to 1 BCE, including crisis and collapse, on different scales and dimensions, and as involving different types of groups, societies, and social formations. It is based on the matrix and a deep model as sketched in Figure 1. This matrix determines which normal models can still be considered and which not. The initial model for any normal model accepts this matrix.

Figure 1 Modeling in archaeology with a matrix

We base our consideration of the matrix and the deep model on [19] and the discussions in the CRC. Whether the deep model or the model matrix is appropriate has already been discussed. The final version presented in this paper illustrates our understanding.
2.7 Stereotyping of a Data Mining Process

Typical modeling (and data mining) processes follow some kind of ritual or typical guideline, i.e. they are stereotyped. The stereotype of a modelling process is based on a general modelling situation. Most modelling methodologies are bound to one stereotype and one kind of model within one model utilization scenario. Stereotypes govern, condition, steer and guide the model development. They determine the model kind, the background and the way of modelling activities. They persuade the activities of modelling. They provide a means for considering the economics of modelling. Often, stereotypes use a definitional frame that primes and orients the processes and that considers the community of practice or actors within the model development and utilization processes, the deep model or the matrix with its specific language and model basis, and the agenda for model development. It might be enhanced by initial models which are derived from generic models in accordance with the matrix.

The model utilization scenario determines the function that a model might have and therefore also the goals and purposes of a model.

2.8 The Agenda

The agenda is something like a guideline for modeling activities and for model associations within a model suite. It improves the quality of model outcomes by spending some effort to decide what and how much reasoning to do as opposed to what activities to do. It balances resources between the data-level actions and the reasoning actions. E.g. [17] uses an agent approach with preparation agents, exploration agents, descriptive agents, and predictive agents. The agenda for a model suite thus uses decision points that require agenda control according to performance and resource considerations. This understanding supports introspective monitoring of performance for the data mining process, coordinated control of the entire mining process, and coordinated refinement of the models. Such kind of control is already necessary due to the problem space, the limitations of resources, and the amount of uncertainty in knowledge, concepts, data, and the environment.

3 Data Mining Design

3.1 Conceptualization of Data Mining and Analysis

The data mining and analysis task must be enhanced by an explicit treatment of the languages used for concepts and hypotheses, and by an explicit description of knowledge that can be used. The algorithmic solution of the task is based on knowledge about the algorithms that are used and about the data that are available and that are required for the application of the algorithms. Typically, analysis algorithms are iterative and can run forever. We are interested only in convergent ones and thus need termination criteria. Therefore, conceptualization of the data mining and analysis task consists of a detailed description of six main parameters (e.g. for inductive learning [34]; one way to bundle them into a task description is sketched after the list):

(a) The data analysis algorithm: Algorithm development is the main activity in data mining research. Each of these algorithms transfers data and some specific parameters of the algorithm to a result.

(b) The concept space: The concept space defines the concepts under consideration for analysis based on a certain language and common understanding.

(c) The data space: The data space typically consists of a multi-layered data set of different granularity. Data sets may be enhanced by metadata that characterize the data sets and associate the data sets to other data sets.

(d) The hypotheses space: An algorithm is supposed to map evidence on the concepts to be supported or rejected into hypotheses about it.

(e) The prior knowledge space: Specifying the hypothesis space already provides some prior knowledge. In particular, the analysis task starts with the assumption that the target concept is representable in a certain way.

(f) The acceptability and success criteria: Criteria for successful analysis allow one to derive termination criteria for the data analysis.
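The following hypothetical sketch bundles the six parameters (a)-(f) into one task description; the class name, the fields and the run method are illustrative and are not part of the cited conceptualization.

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class DataMiningTask:
    """One instantiation of the six parameters (a)-(f) described above."""
    algorithm: Callable[[List[dict], Dict[str, Any]], Any]                          # (a) the data analysis algorithm
    concept_space: List[str]                                                         # (b) concepts under consideration
    data_space: List[dict]                                                           # (c) the (possibly multi-layered) data sets
    hypotheses: List[str]                                                            # (d) hypotheses the evidence is mapped onto
    prior_knowledge: Dict[str, Any] = field(default_factory=dict)                    # (e) e.g. assumed representability of the target concept
    success_criteria: List[Callable[[Any], bool]] = field(default_factory=list)      # (f) acceptability and termination criteria

    def run(self, parameters: Dict[str, Any]) -> Any:
        """Apply the algorithm and accept the result only if every success criterion holds."""
        result = self.algorithm(self.data_space, parameters)
        if all(criterion(result) for criterion in self.success_criteria):
            return result
        raise ValueError("result rejected by the acceptability criteria")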
Each instantiation and refinement of the six parameters leads to specific data mining tasks. The result of data mining and data analysis is described within the knowledge space. The data mining and analysis task may thus be considered to be a transformation of data sets, concept sets and hypothesis sets into chunks of knowledge through the application of algorithms.

Problem solving and modelling, however, typically considers six aspects [16]:

(1) Application, problems, and users: The domain consists of a model of the application, a specification of the problems under consideration, of the tasks that are issued, and of profiles of the users.

(2) Context: The context of a problem is anything that could support the problem solution, e.g. the sciences’ background, theories, knowledge, foundations, and concepts to be used for problem specification, problem background, and solutions.

(3) Technology: Technology is the enabler and defines the methodology. It provides [23] means for the flow of problem solving steps, the flow of activities, the distribution, the collaboration, and the exchange.

(4) Techniques and methods: Techniques and methods can be given as algorithms. Specific algorithms are data improvers and cleaners, data aggregators, data integrators, controllers, checkers, acceptance determiners, and termination algorithms.

(5) Data: Data have their own structuring, their quality and their life span. They are typically enhanced by metadata. Data management is a central element of most problem solving processes.

(6) Solutions: The solutions to problem solving can be formally given, illustrated by visual means, and presented by models. Models are typically only normal models. The deep model and the matrix are already provided by the context and accepted by the community
of practice in dependence on the needs of this community for the given application scenario. Therefore, models may be the final result of a data mining and analysis process beside other means.

Comparing these six spaces with the six parameters, we discover that only four spaces are considered so far in data mining. We miss the user and application space as well as the representation space. Figure 2 shows the difference.

Figure 2 Parameters of Data Mining and the Problem Solving Aspects

3.2 Meta-models of Data Mining

An abstraction layer approach separates the application domain, the model domain and the data domain [17]. This separation is illustrated in Figure 3.

Figure 3 The V meta-model of Data Mining Design

The data mining design framework uses the inverse modeling approach. It starts with the consideration of the application domain and develops models as mediators between the data and the application domain worlds. In the sequel we are going to combine the three approaches of this section. The meta-model corresponds to other meta-models such as inductive modelling or hypothetical reasoning (hypotheses development, experimenting and testing, analysis of results, interim conclusions, reappraisal against the real world).

4 Data Mining: A Systematic Model-Based Approach

Our approach presented so far allows us to revise and to reformulate the model-oriented data mining process on the basis of well-defined engineering [15, 25] or alternatively on systematic mathematical problem solving [22]. Figure 4 displays this revision. We realize that the first two phases are typically implicitly assumed and not considered. We concentrate on the non-iterative form. Iterative processes can be handled in a similar form.

4.1 Setting the Deep Model and the Matrix

The problem to be tackled must be clearly stated in dependence on the utilization scenario, the tasks to be solved, the community of practice involved, and the given context. The result of this step is the deep model and its matrix. The first one is based on the background, the specific context parameters such as infrastructure and environment, and candidates for deep models.

Figure 4 The Phases in Data Mining Design (Non-iterative form)

The data mining tasks can now be formulated based on the matrix and the deep model. We set up the context, the environment, the general goal of the problem and also criteria for adequateness and dependability of the solution, e.g. invariance properties for the problem description and for the task setting and its mathematical formulation, and solution faithfulness properties for later application of the solution in the given environment. What exactly is the problem, the expected benefit? What should a solution look like? What is known about the application?

Deep models already use a background consisting of an undisputable grounding and a selectable basis. The explicit statement of the background provides an understanding of the postulates, paradigms, assumptions, conceptions, practices, etc. Without the background, the results of the analysis cannot be properly understood. Models have their profile, i.e. goals, purposes and functions. These must be explicitly given. The parameters of a generic model can be either order or slave parameters [12], either primary or secondary or tertiary (also called genotypes or phenotypes or observables) [1, 5], and either ruling (or order) or driven parameters [12]. Data mining can be enhanced by knowledge management techniques. Additionally, the concept space into which the data mining task is embedded must be specified. This concept space is enhanced during data analysis.
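As a rough sketch, the result of this step can be written down as an explicit record. The field names below merely mirror the components named in Sections 2.5 and 4.1, and the archaeological values are hypothetical placeholders rather than material from the CRC 1266.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DeepModel:
    """The mostly tacit part of modelling: undisputable grounding plus a disputable, adjustable basis."""
    grounding: List[str]             # paradigms, postulates, theories, conventions (not up for discussion)
    basis: List[str]                 # assumptions, concept space, practices, methodology (accepted, but adjustable)
    context: Dict[str, str]          # infrastructure, environment and other outer directives
    community_of_practice: List[str]

@dataclass
class Matrix:
    """Deep model plus modelling scenarios; every normal model developed later accepts this frame."""
    deep_model: DeepModel
    modelling_scenarios: List[str]
    profile: Dict[str, str]          # goals, purposes and functions of the intended models
    adequacy_criteria: List[str] = field(default_factory=list)        # e.g. invariance properties of the task setting
    dependability_criteria: List[str] = field(default_factory=list)   # e.g. solution faithfulness in the environment

# Hypothetical setup loosely inspired by the archaeological application case of Section 2.6.
crc_matrix = Matrix(
    deep_model=DeepModel(
        grounding=["stratigraphic dating conventions"],
        basis=["transformation processes are observable at several scales"],
        context={"project": "CRC 1266", "period": "15,000 BCE - 1 BCE"},
        community_of_practice=["archaeologists", "data analysts"],
    ),
    modelling_scenarios=["explanation of socio-environmental transformations"],
    profile={"function": "explanation", "purpose": "hypothesis generation"},
)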
4.2 Stereotyping the Process

The general flow of data mining activities is typically implicitly assumed on the basis of stereotypes which form a set of tasks, e.g. proof tasks in whatever system, transformation tasks, description tasks, and investigation tasks. Proofs can follow the classical deductive or inductive setting. Also, abductive, adductive, hypothetical and other reasoning techniques are applicable. Stereotypes typically use model suites as a collection of associated models, are already biased by priming and orientation, and follow policies, data mining design constraints, and framing.

Data mining and analysis is rather stereotyped. For instance, mathematical culture has already developed a good number of stereotypes for problem formulation. It is based on a mathematical language for the formulation of analysis tasks, and on selection and instantiation of the best fitting variable space and the space of opportunities provided by mathematics.

Data mining uses generic models which are the basis of normal models. Models are based on a separation of concern according to the problem setting: dependence-indicating, dependence-describing, separation or partition spaces, pattern kinds, reasoning kinds, etc. This separation of concern governs the classical data mining algorithmic classes: association analysis, cluster analysis, data grouping with or without classification, classifiers and rules, dependences among parameters and data subsets, predictor analysis, synergetics, blind or informed or heuristic investigation of the search space, and pattern learning.

4.3 Initialization of the Normal Data Models

Data mining algorithms have their capacity and potential [2]. Potential and capacity can be based on SWOT (strengths, weaknesses, opportunities, and threats), SCOPE (situation, core competencies, obstacles, prospects, expectation), and SMART (how simple, meaningful, adequate, realistic, and trackable) analysis of methods and algorithms. Each of the algorithm classes has its strengths and weaknesses, its satisfaction of the tasks and the purpose, and its limits of applicability. Algorithm selection also includes an explicit specification of the order of application of these algorithms and of the mapping of parameters that are derived by means of one algorithm to those that are an input for the others, i.e. an explicit association within the model suite. Additionally, evaluation algorithms for the success criteria are selected. Algorithms have their own obstinacy, their hypotheses and assumptions that must be taken into consideration. Whether an algorithm can be considered depends on acceptance criteria derived in the previous two steps.

So, we ask: What kind of model suite architecture suits the problem best? What are applicable development approaches for modelling? What is the best modelling technique to get the right model suite? What kind of reasoning is supported? What not? What are the limitations? Which pitfalls should be avoided?

The result of the entire data mining process heavily depends on the appropriateness of the data sets, their properties and quality, and more generally the data schemata with essentially three components: the application data schema with a detailed description of data types, the metadata schema [18], and generated and auxiliary data schemata. The first component is well-investigated in data mining and data management monographs. The second and third components inherit research results from database management, from data marts or warehouses, and from layering of data. An essential element is the explicit specification of the quality of data. It allows one to derive algorithms for data improvement and to derive limitations for the applicability of algorithms. Auxiliary data support the performance of the algorithms.

Therefore typical data-oriented questions are: What data do we have available? Is the data relevant to the problem? Is it valid? Does it reflect our expectations? Is the data quality, quantity, recency sufficient? Which data should we concentrate on? How is the data transformed for modelling? How may we increase the quality of data?
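These questions can be operationalised as explicit checks. The sketch below is one hypothetical way to turn simple quality measurements into applicability limits for algorithm classes; the thresholds, rule names and example data are ours.

import math
from typing import Dict, List

def data_quality_report(records: List[dict], required_fields: List[str]) -> Dict[str, float]:
    """Very small illustration: completeness and validity per required field."""
    report = {}
    for f in required_fields:
        values = [r.get(f) for r in records]
        present = [v for v in values if v is not None]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not (isinstance(v, float) and math.isnan(v))]
        report[f + "_completeness"] = len(present) / max(1, len(records))
        report[f + "_validity"] = len(numeric) / max(1, len(present))
    return report

def applicable_algorithms(report: Dict[str, float]) -> List[str]:
    """Derive applicability limits from the quality report (illustrative thresholds)."""
    candidates = []
    if min(report.values(), default=0.0) >= 0.95:
        candidates.append("regression")           # sensitive to missing and invalid values
    if min(report.values(), default=0.0) >= 0.80:
        candidates.append("cluster analysis")     # tolerates moderate gaps
    candidates.append("association analysis")     # applicable even on sparse data
    return candidates

records = [{"age": 34, "income": 51000}, {"age": None, "income": 47000}, {"age": 29, "income": float("nan")}]
print(applicable_algorithms(data_quality_report(records, ["age", "income"])))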
4.4 The Data Mining Process Itself

The data mining process can be understood as a coherent and stepwise refinement of the given model suite. The model refinement may use an explicit transformation or an extract-transform-load process among models within the model suite. Evaluation and termination algorithms are an essential element of any data mining algorithm. They can be based on quality criteria for the finalized models in the model suite, e.g. generality, error-proneness, stability, selection-proneness, validation, understandability, repeatability, usability, usefulness, and novelty.
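A minimal sketch of such a refinement loop follows, under the assumption that refinement steps and a quality criterion are supplied as plain functions; none of the names below are taken from the paper.

from typing import Callable, Dict, List

def refine_model_suite(
    suite: Dict[str, object],
    refinement_steps: List[Callable[[Dict[str, object]], Dict[str, object]]],
    quality: Callable[[Dict[str, object]], float],
    good_enough: float = 0.9,
    max_rounds: int = 20,
) -> Dict[str, object]:
    """Stepwise refinement of a model suite with explicit evaluation and termination.

    Each step transforms the suite (e.g. an extract-transform-load pass between its models);
    the loop stops when the quality criterion is met or the round budget is exhausted,
    so the process is forced to be convergent.
    """
    best, best_score = suite, quality(suite)
    for _ in range(max_rounds):
        if best_score >= good_enough:
            break                                  # termination: success criterion reached
        improved = False
        for step in refinement_steps:
            candidate = step(best)
            score = quality(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
        if not improved:
            break                                  # termination: no step improves the suite any more
    return best

In practice each step would be one of the transformations between models of the suite, and the quality function would aggregate criteria such as those listed above.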
Typical questions to answer within this process are: How good is the model suite in terms of the task setting? What have we really learned about the application domain? What is the real adequacy and dependability of the models in the model suite? How can these models be deployed best? How do we know that the models in the model suite are still valid? Which data are supporting which model in the model suite? Which kind of errors of data is inherited by which part of which model?

The final result of the data mining process is then a combination of the deep model and the normal model, whereas the first one is a latent or hidden component in most cases. If we want, however, to reason on the results then the deep model must be understood as well. Otherwise, the results may become surprising and may not be convincing.

4.5 Controllers and Selectors

Algorithmics [6] treats algorithms as general solution patterns that have parameters for their instantiation, handling mechanisms for their specialization to a given environment, and enhancers for context injection. So, an algorithm can be derived based on explicit selectors and control rules [4] if we neglect context injection. We
can use this approach for data mining design (DMD). For instance, an algorithm pattern such as regression uses a generic model of parameter dependence, is based on blind search, has parameters for similarity and model quality, and has selection support for specific treatment of the given data set. In this case, the controller is based on enablers that specify applicability of the approach, on error rules, on data evaluation rules that detect dependencies among control parameters and derive data quality measures, and on quality rules for confidence statements.
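Read as code, the regression example might look as follows; the enabler, the least-squares fit and the R^2-based quality rule are illustrative stand-ins for the rule sets mentioned above, not an implementation prescribed by [4] or [6].

import statistics
from typing import Dict, List, Tuple

def enabler(xs: List[float], ys: List[float]) -> bool:
    """Applicability check: enough paired, non-constant data for a simple linear fit."""
    return len(xs) == len(ys) and len(xs) >= 3 and len(set(xs)) > 1

def fit_regression(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares line y = a*x + b as the generic model of parameter dependence."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def quality_rule(xs: List[float], ys: List[float], a: float, b: float) -> float:
    """Confidence statement via the coefficient of determination R^2."""
    my = statistics.fmean(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0

def controller(xs: List[float], ys: List[float], min_r2: float = 0.8) -> Dict[str, object]:
    """Controller: apply the pattern only if enabled, and attach a confidence statement."""
    if not enabler(xs, ys):
        return {"applicable": False}
    a, b = fit_regression(xs, ys)
    r2 = quality_rule(xs, ys, a, b)
    return {"applicable": True, "model": (a, b), "confidence": r2, "accepted": r2 >= min_r2}

print(controller([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1]))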
4.6 Data Mining and Design Science

Let us finally associate our approach with design science research [13]. Design science considers systematic modelling as an embodiment of three closely related cycles of activities. The relevance cycle initiates design science research with an application context that not only provides the requirements for the research as inputs but also defines acceptance criteria for the ultimate evaluation of the research results. The central design cycle iterates between the core activities of building and evaluating the design artifacts and processes of the research. The orthogonal rigor cycle provides past knowledge to the research project to ensure its innovation. It is contingent on the researchers to thoroughly research and reference the knowledge base in order to guarantee that the designs produced are research contributions and not routine designs based upon the application of well-known processes.

The relevance cycle is concerned with the problem specification and setting and the matrix and agenda derivation. The design cycle is related to all other phases of our framework. The rigor cycle is enhanced by our framework and thus provides a systematic modelling approach.

5 Conclusion

The literature on data mining is fairly rich. Mining tools have already gained the maturity for supporting any kind of data analysis if the data mining problem is well understood, the intentions for models are properly understood, and if the problem is professionally set up. Data mining aims at the development of model suites that allow one to derive and to draw dependable and thus justifiable conclusions on the given data set. Data mining is a process that can be based on a framework for systematic modelling that is driven by a deep model and a matrix. Textbooks on data mining typically explore in detail algorithms as blind search. Data mining is a specific form of modeling. Therefore, we can combine modeling with data mining in a more sophisticated form. Models have however an inner structure with parts which are given by the application, by the context, by the commonsense and by a community of practice. These fixed parts are then enhanced by normal models. A typical normal model is the result of a data mining process.

The current state of the art in data mining is mainly technology and algorithm driven. The problem selection is made on intuition and experience. So, the matrix and the deep model are latent and hidden. The problem specification is not explicit. Therefore, this paper aims at the entire data mining process and highlights a way to leave the ad-hoc, blind and somehow chaotic data analysis. The approach we are developing integrates the theory of models, the theory of problem solving, design science, and knowledge and content management. We realized that data mining can be systematized. The framework for data mining design is exemplarily presented in Figure 4.

Acknowledgement. We thank the CRC 1266 for its support of this paper. We are very thankful for the fruitful discussions with the members of the CRC.

References

[1] G. Bell. The Mechanism of Evolution. Chapman and Hall, New York (1997)
[2] R. Berghammer and B. Thalheim. Methodenbasierte mathematische Modellierung mit Relationenalgebren. In: Wissenschaft und Kunst der Modellierung: Modelle, Modellieren, Modellierung, pp. 67–106. De Gruyter, Boston (2015)
[3] M.R. Berthold, C. Borgelt, F. Höppner, and F. Klawonn. Guide to Intelligent Data Analysis. Springer, London (2010)
[4] A. Bienemann, K.-D. Schewe, and B. Thalheim. Towards a theory of genericity based on government and binding. In: Proc. ER’06, LNCS 4215, pp. 311–324. Springer (2006)
[5] L.B. Booker, D.E. Goldberg, and J.H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40(1–3): pp. 235–282 (1989)
[6] G. Brassard and P. Bratley. Algorithmics – Theory and Practice. Prentice Hall, London (1988)
[7] A. Coleman. Scientific models as works. Cataloging & Classification Quarterly, Special Issue: Works as Entities for Information Retrieval, 33, pp. 3–4 (2006)
[8] A. Dahanayake and B. Thalheim. Co-evolution of (information) system models. In: EMMSAD 2010, LNBIP vol. 50, pp. 314–326. Springer (2010)
[9] D. Embley and B. Thalheim (eds.). The Handbook of Conceptual Modeling: Its Usage and Its Challenges. Springer (2011)
[10] N.P. Gillett, F.W. Zwiers, A.J. Weaver, G.C. Hegerl, M.R. Allen, and P.A. Stott. Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett., 29: pp. 31–34 (2002)
[11] E. Guerra, J. de Lara, D.S. Kolovos, and R.F. Paige. Inter-modelling: From theory to practice. In: MoDELS 2010, LNCS 6394, pp. 376–391. Springer, Berlin (2010)
[12] H. Haken, A. Wunderlin, and S. Yigitbasi. An introduction to synergetics. Open Systems and Information Dynamics, 3(1): pp. 1–34 (1994)
[13] A. Hevner, S. March, J. Park, and S. Ram. Design science in information systems research. MIS Quarterly, 28(1): pp. 75–105 (2004)
[14] P.J. Hunter, W.W. Li, A.D. McCulloch, and D. Noble. Multiscale modeling: Physiome project standards, tools, and databases. IEEE Computer, 39(11): pp. 48–54 (2006)
[15] ISO/IEC 25020 (Software and system engineering – software product quality requirements and evaluation (SQuaRE) – measurement reference model and guide). ISO/IEC JTC1/SC7 N3280 (2005)
[16] H. Jaakkola, B. Thalheim, Y. Kidawara, K. Zettsu, Y. Chen, and A. Heimbürger. Information modelling and global risk management systems. In: Information Modeling and Knowledge Bases XX, pp. 429–446. IOS Press (2009)
[17] K. Jannaschk. Infrastruktur für ein Data Mining Design Framework. PhD thesis, Christian-Albrechts University, Kiel (2017)
[18] F. Kramer and B. Thalheim. A metadata system for quality management. In: Information Modelling and Knowledge Bases, pp. 224–242. IOS Press (2014)
[19] O. Nakoinz and D. Knitter. Modelling Human Behaviour in Landscapes. Springer (2016)
[20] J. Pardillo. A systematic review on the definition of UML profiles. In: MoDELS 2010, LNCS 6394, pp. 407–422. Springer, Berlin (2010)
[21] D. Petrelli, S. Levin, M. Beaulieu, and M. Sanderson. Which user interaction for cross-language information retrieval? Design issues and reflections. JASIST, 57(5): pp. 709–722 (2006)
[22] O.H. Pilkey and L. Pilkey-Jarvis. Useless Arithmetic: Why Environmental Scientists Can't Predict the Future. Columbia University Press, New York (2006)
[23] A.S. Podkolsin. Computer-based modelling of solution processes for mathematical tasks (in Russian). ZPI at Mech-Mat MGU, Moscow (2001)
[24] M. Pottmann, H. Unbehauen, and D.E. Seborg. Application of a general multi-model approach for identification of highly nonlinear processes – a case study. Int. Journal of Control, 57(1): pp. 97–120 (1993)
[25] B. Rumpe. Modellierung mit UML. Springer, Heidelberg (2012)
[26] A. Samuel and J. Weir. Introduction to Engineering: Modelling, Synthesis and Problem Solving Strategies. Elsevier, Amsterdam (2000)
[27] G. Simsion and G.C. Witt. Data Modeling Essentials. Morgan Kaufmann, San Francisco (2005)
[28] M. Skusa. Semantische Kohärenz in der Softwareentwicklung. PhD thesis, CAU Kiel (2011)
[29] B. Thalheim. Towards a theory of conceptual modelling. Journal of Universal Computer Science, 16(20): pp. 3102–3137 (2010)
[30] B. Thalheim. The conceptual model ≡ an adequate and dependable artifact enhanced by concepts. In: Information Modelling and Knowledge Bases XXV, pp. 241–254. IOS Press (2014)
[31] B. Thalheim. Conceptual modeling foundations: The notion of a model in conceptual modeling. In: Encyclopedia of Database Systems. Springer (2017)
[32] B. Thalheim and M. Tropmann-Frick. Wherefore models are used and accepted? The model functions as a quality instrument in utilisation scenarios. In: I. Comyn-Wattiau, C. du Mouza, and N. Prat, editors, Ingenierie Management des Systemes D'Information (2016)
[33] B. Thalheim, M. Tropmann-Frick, and T. Ziebermayr. Application of generic workflows for disaster management. In: Information Modelling and Knowledge Bases, pp. 64–81. IOS Press (2014)
[34] B. Thalheim and Q. Wang. Towards a theory of refinement for data migration. In: ER'2011, LNCS 6998, pp. 318–331. Springer (2011)
[35] T. Zeugmann. Inductive inference of optimal programs: A survey and open problems. In: Nonmonotonic and Inductive Logics, pp. 208–222. Springer, Berlin (1991)