Data Mining Design and Systematic Modelling

© Yannic Kropp   © Bernhard Thalheim

Christian Albrechts University Kiel, Department of Computer Science, D-24098 Kiel, Germany
yk@is.informatik.uni-kiel.de   thalheim@is.informatik.uni-kiel.de

Abstract. Data mining is by now a well-established technique and is supported by many algorithms. It depends on the data at hand, on properties of the algorithms, on the technology developed so far, and on the expectations and limits that apply. It must therefore be mature, predictable, optimisable, evolving, adaptable and well-founded, similar to mathematics and SPICE/CMM-based software engineering. Data mining must thus be systematic if the results are to be fit for their purpose. One basis of this systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The central notion is the notion of the model in its variety of forms, abstractions and associations among models.

Keywords: data mining, modelling, models, framework, deep model, normal model, modelling matrix

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10-13, 2017

1 Introduction

Data mining and analysis is nowadays well understood from the algorithmic side. Thousands of algorithms have been proposed, and the number of success stories is overwhelming and has caused the big data hype. At the same time, brute-force application of algorithms is still the standard. Data analysis and data mining algorithms are taken for granted: they transform data sets and hypotheses into conclusions. For instance, cluster algorithms check for a given data set and a given clustering requirements portfolio whether this portfolio can be supported and, in the positive case, provide a set of clusters as output. The Hopkins index is one of the criteria that allow one to judge whether clusters exist within a data set.
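As an illustration (our own sketch, not taken from the cited literature), a Hopkins-style clustering-tendency statistic can be computed as follows. The sketch assumes NumPy and scikit-learn; the sample size m and the random seed are free choices. Values near 0.5 indicate uniformly scattered data, values near 1 indicate a clustering tendency.

# Illustrative sketch (ours): Hopkins-style clustering-tendency statistic.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    """Estimate the Hopkins statistic for data set X (rows = objects).

    Values near 0.5 suggest randomly scattered data, values near 1 suggest clusters.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(m, n - 1)

    # Distances from m uniformly generated points to their nearest data point.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(mins, maxs, size=(m, d))
    nn = NearestNeighbors(n_neighbors=1).fit(X)
    u_dist = nn.kneighbors(U, return_distance=True)[0].ravel()

    # Distances from m sampled data points to their nearest *other* data point.
    sample = X[rng.choice(n, size=m, replace=False)]
    w_dist = nn.kneighbors(sample, n_neighbors=2, return_distance=True)[0][:, 1]

    return float(u_dist.sum() / (u_dist.sum() + w_dist.sum()))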
A systematic approach to data mining has already been proposed in [3, 17]. It is based on mathematics and mathematical statistics and is thus also able to handle errors, biases and the configuration of data mining. Our experience in large data mining projects in archaeology, ecology, climate research, medical research etc. has however shown that ad-hoc and brute-force mining is still the main approach. The results are taken for granted and believed despite the pitfalls in modelling, understanding, flow of work and data handling. So, the results often become dubious.

Data are the main source of information in data mining and analysis. Their quality properties have been neglected for a long time, although modern data management allows these problems to be handled. In [16] we compare the critical findings or pitfalls of [21] with resolution techniques that can be applied to overcome the crucial pitfalls of data mining in environmental sciences reported there. The algorithms typically used for the solution of data mining and analysis tasks are themselves another source of pitfalls. It is often neglected that an algorithm also has an application area, application restrictions, data requirements, and results of a certain granularity and precision. These problems must be tackled systematically if we want to rely on the results of mining and analysis. Otherwise the analysis may become misleading, biased, or impossible. Therefore, we explicitly treat properties of mining and analysis. A similar observation can be made for data handling.

Data mining is often considered to be a separate sub-discipline of computer engineering and science. The statistical basis of data mining is well accepted. We typically start with a general (or better, generic) model and use the data that are at hand and that seem to be appropriate for refinement or improvement of the model. This technique is known in the sciences under several names such as inverse modelling, generic modelling, pattern-based reasoning, (inductive) learning, universal application, and systematic modelling.

Data mining is typically based not on a single model but rather on a model ensemble or model suite. The associations among the models in a model suite are explicitly specified; the model suite thus gives them an explicit form. Reasoning techniques combine methods from logics (deductive, inductive, abductive, counter-inductive, etc.), from artificial intelligence (hypothetic, qualitative, concept-based, adductive, etc.), from computational methods (algorithmics [6], topology, geometry, reduction, etc.), and from cognition (problem representation and solving, causal reasoning, etc.).

These choices and handling approaches need a systematic underpinning. Techniques from artificial intelligence, statistics, and engineering are bundled within the CRISP framework (e.g. [3]). They can be enhanced by techniques that have originally been developed for modelling, design science, business informatics, learning theory, action theory etc.

In this paper we combine and generalize the CRISP, heuristics, modelling theory, design science, business informatics, statistics, and learning approaches. First, we introduce our notion of the model. Next we show how data mining can be designed. We apply this investigation to systematic modelling and later to systematic data mining. Our goal is to develop a holistic and systematic framework for data mining and analysis. Many issues are left out of the scope of this paper, such as a literature review, a formal introduction of the approach, and a detailed discussion of data mining application cases.

2 Models and Modelling

Models are principal instruments in mathematics, data analysis, modern computer engineering (CE), the teaching of any kind of computer technology, and modern computer science (CS). They are built, applied, revised and manufactured in many CE&CS sub-disciplines in a large variety of application cases with different purposes and contexts for different communities of practice.

It is now well understood that models are something different from theories. They are often intuitive, visualizable, and ideally capture the essence of an understanding within some community of practice and some context. At the same time, they are limited in scope, context and applicability.

2.1 The Notion of the Model

There is, however, a general notion of a model and of a conception of the model:

A model is a well-formed, adequate, and dependable instrument that represents origins [9, 29, 30].

Its criteria of well-formedness, adequacy, and dependability must be commonly accepted by its community of practice within some context and must correspond to the functions that the model fulfills in its utilization scenarios.

A well-formed instrument is adequate for a collection of origins if it is analogous to the origins to be represented according to some analogy criterion, if it is more focused (e.g. simpler, truncated, more abstract or reduced) than the origins being modelled, and if it sufficiently satisfies its purpose.

Well-formedness enables an instrument to be justified by empirical corroboration according to its objectives, by rational coherence and conformity explicitly stated through conformity formulas or statements, by falsifiability or validation, and by stability and plasticity within a collection of origins.

The instrument is sufficient by its quality characterization for internal quality, external quality and quality in use, or through quality characteristics [28] such as correctness, generality, usefulness, comprehensibility, parsimony, robustness, novelty etc. Sufficiency is typically combined with some assurance evaluation (tolerance, modality, confidence, and restrictions).
2.2 Generic and Specific Models

The general notion of a model covers all aspects of adequateness, dependability, well-formedness, scenario, functions and purposes, backgrounds (grounding and basis), and outer directives (context and community of practice). It covers all notions known so far in agriculture, archaeology, arts, biology, chemistry, computer science, economics, electro-technics, environmental sciences, farming, geosciences, historical sciences, languages, mathematics, medicine, ocean sciences, pedagogical science, philosophy, physics, political sciences, sociology, and sports. The models used in these disciplines are instruments used in certain scenarios.

The sciences distinguish between general, particular and specific things. Particular things are specific for general things and general for specific things. The same abstraction may be used for modelling, and we may start with a general model. So far, however, nobody knows how to define general models for most utilization scenarios.

Models function as instruments or tools. Typically, instruments come in a variety of forms and fulfill many different functions. Instruments are partially independent or autonomous of the thing they operate on. Models, however, are special instruments: they are used with a specific intention within a utilization scenario, and the quality of a model becomes apparent in the context of this scenario.

It might thus be better to start with generic models. A generic model [4, 26, 31, 32] is a model which broadly satisfies the purpose and broadly functions in the given utilization scenario. It is later tailored to suit the particular purpose and function. It generally represents origins of interest, provides means to establish adequacy and dependability of the model, and establishes the focus and scope of the model. Generic models should satisfy at least five properties: (i) they must be accurate; (ii) their quality must allow them to be used consciously; (iii) they should be descriptive, not evaluative; (iv) they should be flexible so that they can be modified from time to time; (v) they can be used as a first "best guess".

2.3 Model Suites

Most disciplines integrate a variety of models or a society of models, e.g. [7, 14]. Models used in CE&CS are mainly at the same level of abstraction. It has been well known for threescore years that they form a model ensemble (e.g. [10, 23]) or a horizontal model suite (e.g. [8, 27]). The developed models vary in their scopes, in the aspects and facets they represent, and in their abstraction.

A model suite consists of a set of models {M1, ..., Mn}, of an association or collaboration schema among the models, of controllers that maintain consistency or coherence of the model suite, of application schemata for explicit maintenance and evolution of the model suite, and of tracers for the establishment of coherence.
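This definition can be rendered as a small data structure. The following sketch is our own illustration (the names and the coherence rule are ours, not fixed terminology): models are held by name, associations are typed links between them, controllers are coherence checks, and tracers record the checks that were performed.

# Illustrative sketch (ours) of the model-suite notion {M1, ..., Mn} described above.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class ModelSuite:
    models: Dict[str, Any]                                                   # M1, ..., Mn by name
    associations: List[Tuple[str, str, str]] = field(default_factory=list)   # (source, target, kind)
    controllers: List[Callable[["ModelSuite"], bool]] = field(default_factory=list)  # coherence checks
    tracers: List[str] = field(default_factory=list)                         # record of checks performed

    def coherent(self) -> bool:
        """The suite is considered coherent if every controller accepts it."""
        results = [controller(self) for controller in self.controllers]
        self.tracers.append(f"coherence check: {results}")
        return all(results)

# Usage: two models linked by an association and one controller checking that the link endpoints exist.
suite = ModelSuite(
    models={"conceptual": {"entities": ["Find", "Site"]}, "statistical": {"variables": ["x", "y"]}},
    associations=[("conceptual", "statistical", "refines")],
    controllers=[lambda s: all(a in s.models and b in s.models for a, b, _ in s.associations)],
)
print(suite.coherent())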
Multi-modelling [11, 19, 24] has become a culture in CE&CS. Maintenance of coherence, co-evolution, and consistency among models has, however, become a bottleneck in development. Moreover, different languages with different capabilities have become an obstacle similar to multi-language retrieval [20] and impedance mismatches. Models are often loosely coupled, and their dependences and relationships are often not explicitly expressed. This problem becomes more complex if models are used for different purposes such as construction of systems, verification, optimization, explanation, and documentation.

2.4 Stepwise Refinement of Models

Refinement of a model to a particular or special model provides mechanisms for model transformation along the adequacy, the justification and the sufficiency of a model. Refinement is based on specialization for better suitability of a model, on removal of unessential elements, on combination of models to provide a more holistic view, on integration that binds model components to other components, and on enhancement that typically improves a model so that it becomes more adequate or dependable.

Control of correctness of refinement [33] for information systems takes into account (A) a focus on the refined structure and refined vocabulary, (B) a focus on the information systems structures of interest, (C) abstract information systems computation segments, (D) a description of the database segments of interest, and (E) an equivalence relation among the data of interest.

2.5 Deep Models and the Modelling Matrix

Model development is typically based on an explicit and rather quick description of the 'surface' or normal model and on the mostly unconditional acceptance of a deep model. The latter directs the modelling process and the surface or normal model. Modelling itself is often understood as the development and design of the normal model; the deep model is taken for granted and accepted for a number of normal models. The deep model can be understood as the common basis for a number of models. It consists of the grounding for modelling (paradigms, postulates, restrictions, theories, culture, foundations, conventions, authorities), the outer directives (context and community of practice), and the basis (assumptions, general concept space, practices, language as carrier, thought community and thought style, methodology, pattern, routines, commonsense) of modelling.

It uses a collection of undisputable elements of the background as grounding and, additionally, a disputable and adjustable basis which is commonly accepted in the given context by the community of practice. Education on modelling, for instance, starts directly with the deep model. In this case, the deep model has to be accepted and is thus hidden and latent.

A (modelling) matrix is something within or from which something else originates, develops, or takes form. The matrix is assumed to be correct for normal models. It consists of the deep model and the modelling scenarios. The modelling agenda is derived from the modelling scenario and the utilization scenarios. The modelling scenario and the deep model serve as a part of the definitional frame within a model development process. They also define the capacity and potential of a model whenever it is utilized.

Deep models and the modelling matrix also define a frame for adequacy and dependability. This frame is enhanced for specific normal models. It is then used to state in which cases a normal model represents the origins under consideration.
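To make the separation concrete, the following sketch (our own reading of the notions above, not a definition taken from the references) distinguishes the deep model (grounding, basis, outer directives), the matrix (deep model plus modelling scenarios), and a normal model that is developed against an accepted matrix.

# Illustrative sketch (ours) of the deep model / matrix / normal model separation.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass(frozen=True)
class DeepModel:
    grounding: List[str]          # paradigms, postulates, theories, conventions (undisputable)
    basis: List[str]              # assumptions, concept space, practices, methodology (adjustable)
    outer_directives: List[str]   # context and community of practice

@dataclass(frozen=True)
class Matrix:
    deep_model: DeepModel
    modelling_scenarios: List[str]    # scenarios from which the agenda is derived

@dataclass
class NormalModel:
    matrix: Matrix                                            # accepted, mostly latent frame
    content: Dict[str, Any] = field(default_factory=dict)     # the developed 'surface' model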
2.6 Deep Models and Matrices in Archaeology

Let us consider an application case. The CRC 1266 "Scales of Transformation – Human-Environmental Interaction in Prehistoric and Archaic Societies" (https://www.sfb1266.uni-kiel.de/en) investigates processes of transformation from 15,000 BCE to 1 BCE, including crisis and collapse, on different scales and dimensions, and as involving different types of groups, societies, and social formations. It is based on the matrix and a deep model as sketched in Figure 1. This matrix determines which normal models can still be considered and which cannot. The initial model for any normal model accepts this matrix.

Figure 1 Modelling in archaeology with a matrix

We base our treatment of the matrix and the deep model on [19] and on the discussions in the CRC. Whether the deep model or the model matrix is appropriate has already been discussed there; the version presented in this paper reflects our understanding.

2.7 Stereotyping of a Data Mining Process

Typical modelling (and data mining) processes follow some kind of ritual or typical guideline, i.e. they are stereotyped. The stereotype of a modelling process is based on a general modelling situation. Most modelling methodologies are bound to one stereotype and one kind of model within one model utilization scenario. Stereotypes govern, condition, steer and guide the model development. They determine the model kind, the background and the way of modelling activities. They persuade the activities of modelling and provide a means for considering the economics of modelling. Often, stereotypes use a definitional frame that primes and orients the processes and that considers the community of practice or actors within the model development and utilization processes, the deep model or the matrix with its specific language and model basis, and the agenda for model development. It might be enhanced by initial models which are derived from generic models in accordance with the matrix.

The model utilization scenario determines the function that a model might have and therefore also the goals and purposes of a model.

2.8 The Agenda

The agenda is something like a guideline for modelling activities and for model associations within a model suite. It improves the quality of model outcomes by spending some effort on deciding what and how much reasoning to do as opposed to which activities to carry out. It balances resources between the data-level actions and the reasoning actions. For example, [17] uses an agent approach with preparation agents, exploration agents, descriptive agents, and predictive agents. The agenda for a model suite thus uses decision points that require agenda control according to performance and resource considerations. This understanding supports introspective monitoring of the performance of the data mining process, coordinated control of the entire mining process, and coordinated refinement of the models. Such control is already necessary due to the problem space, the limitations of resources, and the amount of uncertainty in knowledge, concepts, data, and the environment.

3 Data Mining Design

3.1 Conceptualization of Data Mining and Analysis

The data mining and analysis task must be enhanced by an explicit treatment of the languages used for concepts and hypotheses, and by an explicit description of the knowledge that can be used. The algorithmic solution of the task is based on knowledge about the algorithms that are used and about the data that are available and required for the application of the algorithms. Typically, analysis algorithms are iterative and can run forever. We are interested only in convergent ones and thus need termination criteria. Therefore, the conceptualization of the data mining and analysis task consists of a detailed description of six main parameters (e.g. for inductive learning [34]):

(a) The data analysis algorithm: Algorithm development is the main activity in data mining research. Each of these algorithms transfers data and some specific parameters of the algorithm into a result.
(b) The concept space: The concept space defines the concepts under consideration for the analysis, based on a certain language and common understanding.
(c) The data space: The data space typically consists of a multi-layered data set of different granularity. Data sets may be enhanced by metadata that characterize the data sets and associate them with other data sets.
(d) The hypotheses space: An algorithm is supposed to map evidence on the concepts to be supported or rejected into hypotheses about them.
(e) The prior knowledge space: Specifying the hypothesis space already provides some prior knowledge. In particular, the analysis task starts with the assumption that the target concept is representable in a certain way.
(f) The acceptability and success criteria: Criteria for a successful analysis allow termination criteria for the data analysis to be derived.

Each instantiation and refinement of the six parameters leads to specific data mining tasks.
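As a reading aid, the six parameters (a)-(f) can be collected into one task description. The sketch below is our own illustration; the field names are ours and not fixed terminology, and the success criteria double as a termination test.

# Illustrative sketch (ours): a data mining task as an instantiation of the six parameters (a)-(f).
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class DataMiningTask:
    algorithm: Callable[..., Any]                      # (a) the data analysis algorithm
    concept_space: List[str]                           # (b) concepts under consideration
    data_space: Dict[str, Any]                         # (c) possibly multi-layered data sets plus metadata
    hypotheses: List[str]                              # (d) hypotheses that evidence is mapped onto
    prior_knowledge: Dict[str, Any]                    # (e) e.g. assumed representability of the target concept
    success_criteria: List[Callable[[Any], bool]]      # (f) acceptability and termination criteria

    def accepted(self, result: Any) -> bool:
        """A result is acceptable if all success criteria hold; this also yields a termination test."""
        return all(criterion(result) for criterion in self.success_criteria)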
The result of data mining and data analysis is described within the knowledge space. The data mining and analysis task may thus be considered a transformation of data sets, concept sets and hypothesis sets into chunks of knowledge through the application of algorithms.

Problem solving and modelling, however, typically considers six aspects [16]:

(1) Application, problems, and users: The domain consists of a model of the application, a specification of the problems under consideration, of the tasks that are issued, and of profiles of the users.
(2) Context: The context of a problem is anything that could support the problem solution, e.g. the scientific background, theories, knowledge, foundations, and concepts to be used for the problem specification, the problem background, and the solutions.
(3) Technology: Technology is the enabler and defines the methodology. It provides [23] means for the flow of problem solving steps, the flow of activities, the distribution, the collaboration, and the exchange.
(4) Techniques and methods: Techniques and methods can be given as algorithms. Specific algorithms are data improvers and cleaners, data aggregators, data integrators, controllers, checkers, acceptance determiners, and termination algorithms.
(5) Data: Data have their own structuring, their quality and their life span. They are typically enhanced by metadata. Data management is a central element of most problem solving processes.
(6) Solutions: The solutions to problem solving can be given formally, illustrated by visual means, and presented by models. Models are typically only normal models. The deep model and the matrix are already provided by the context and accepted by the community of practice, depending on the needs of this community for the given application scenario.

Therefore, models may be the final result of a data mining and analysis process, besides other means.

Comparing these six aspects with the six parameters, we discover that only four of them are considered so far in data mining. We miss the user and application space as well as the representation space. Figure 2 shows the difference.

Figure 2 Parameters of Data Mining and the Problem Solving Aspects

3.2 Meta-models of Data Mining

An abstraction layer approach separates the application domain, the model domain and the data domain [17]. This separation is illustrated in Figure 3.

Figure 3 The V meta-model of Data Mining Design

The data mining design framework uses the inverse modelling approach. It starts with the consideration of the application domain and develops models as mediators between the data world and the application domain world. In the sequel we are going to combine the three approaches of this section. The meta-model corresponds to other meta-models such as inductive modelling or hypothetical reasoning (hypothesis development, experimenting and testing, analysis of results, interim conclusions, reappraisal against the real world).

4 Data Mining: A Systematic Model-Based Approach

Our approach presented so far allows us to revise and reformulate the model-oriented data mining process on the basis of well-defined engineering [15, 25] or, alternatively, of systematic mathematical problem solving [22]. Figure 4 displays this revision. We realize that the first two phases are typically implicitly assumed and not considered. We concentrate on the non-iterative form; iterative processes can be handled in a similar way.

Figure 4 The Phases in Data Mining Design (Non-iterative form)

4.1 Setting the Deep Model and the Matrix

The problem to be tackled must be clearly stated, depending on the utilization scenario, the tasks to be solved, the community of practice involved, and the context. The result of this step is the deep model and its matrix. The former is based on the background, on specific context parameters such as infrastructure and environment, and on candidates for deep models.

The data mining tasks can now be formulated on the basis of the matrix and the deep model. We set up the context, the environment and the general goal of the problem, and also criteria for the adequateness and dependability of the solution, e.g. invariance properties for the problem description and for the task setting and its mathematical formulation, and solution faithfulness properties for the later application of the solution in the given environment. What exactly is the problem and the expected benefit? What should a solution look like? What is known about the application?

Deep models already use a background consisting of an undisputable grounding and a selectable basis. The explicit statement of the background provides an understanding of the postulates, paradigms, assumptions, conceptions, practices, etc. Without the background, the results of the analysis cannot be properly understood. Models have their profile, i.e. goals, purposes and functions; these must be given explicitly. The parameters of a generic model can be either order or slave parameters [12], either primary or secondary or tertiary (also called genotypes or phenotypes or observables) [1, 5], and either ruling (or order) or driven parameters [12]. Data mining can be enhanced by knowledge management techniques. Additionally, the concept space into which the data mining task is embedded must be specified. This concept space is enhanced during data analysis.

4.2 Stereotyping the Process

The general flow of data mining activities is typically implicitly assumed on the basis of stereotypes which form a set of tasks, e.g. proof tasks in whatever system, transformation tasks, description tasks, and investigation tasks. Proofs can follow the classical deductive or inductive setting; abductive, adductive, hypothetical and other reasoning techniques are applicable as well. Stereotypes typically use model suites as collections of associated models, are already biased by priming and orientation, and follow policies, data mining design constraints, and framing.

Data mining and analysis is rather stereotyped. For instance, mathematical culture has already developed a good number of stereotypes for problem formulation. It is based on a mathematical language for the formulation of analysis tasks and on the selection and instantiation of the best-fitting variable space and the space of opportunities provided by mathematics.

Data mining uses generic models which are the basis of normal models. Models are based on a separation of concern according to the problem setting: dependence-indicating, dependence-describing, separation or partition spaces, pattern kinds, reasoning kinds, etc. This separation of concern governs the classical data mining algorithmic classes: association analysis, cluster analysis, data grouping with or without classification, classifiers and rules, dependences among parameters and data subsets, predictor analysis, synergetics, blind or informed or heuristic investigation of the search space, and pattern learning.

4.3 Initialization of the Normal Data Models

Data mining algorithms have their capacity and potential [2]. Potential and capacity can be assessed by SWOT (strengths, weaknesses, opportunities, and threats), SCOPE (situation, core competencies, obstacles, prospects, expectations), and SMART (how simple, meaningful, adequate, realistic, and trackable) analyses of methods and algorithms. Each of the algorithm classes has its strengths and weaknesses, its satisfaction of the tasks and the purpose, and its limits of applicability. Algorithm selection also includes an explicit specification of the order in which these algorithms are applied and of the mapping of parameters that are derived by means of one algorithm to those that are an input for the others, i.e. an explicit association within the model suite. Additionally, evaluation algorithms for the success criteria are selected. Algorithms have their own obstinacy, their hypotheses and assumptions that must be taken into consideration. Whether an algorithm can be considered depends on acceptance criteria derived in the previous two steps.

So we ask: What kind of model suite architecture suits the problem best? What are applicable development approaches for modelling? What is the best modelling technique to get the right model suite? What kind of reasoning is supported, and what is not? What are the limitations? Which pitfalls should be avoided?
The result of the entire data mining process heavily depends on the appropriateness of the data sets, their properties and quality, and more generally on the data schemata with essentially three components: the application data schema with a detailed description of data types, the metadata schema [18], and the generated and auxiliary data schemata. The first component is well investigated in data mining and data management monographs. The second and third components inherit research results from database management, from data marts or warehouses, and from the layering of data. An essential element is the explicit specification of the quality of data. It allows algorithms for data improvement and limitations on the applicability of algorithms to be derived. Auxiliary data support the performance of the algorithms. Typical data-oriented questions are therefore: What data do we have available? Are the data relevant to the problem? Are they valid? Do they reflect our expectations? Are data quality, quantity and recency sufficient? Which data should we concentrate on? How are the data transformed for modelling? How may we increase the quality of the data?
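Some of these data-oriented questions can be operationalized as explicit quality checks that gate the applicability of algorithms. The following sketch is our own illustration; it assumes pandas, and the thresholds (maximum missing share, minimum number of rows) are arbitrary examples to be set per task.

# Illustrative sketch (ours): explicit data quality checks gating algorithm applicability.
import pandas as pd

def data_quality_report(df: pd.DataFrame, max_missing: float = 0.2, min_rows: int = 100) -> dict:
    """Answer some data-oriented questions: availability, completeness, validity, quantity."""
    report = {
        "rows": len(df),
        "enough_rows": len(df) >= min_rows,                        # quantity sufficient?
        "missing_share": df.isna().mean().to_dict(),               # completeness per attribute
        "columns_too_sparse": [c for c in df.columns if df[c].isna().mean() > max_missing],
        "duplicate_rows": int(df.duplicated().sum()),              # validity indicator
    }
    report["usable"] = report["enough_rows"] and not report["columns_too_sparse"]
    return report

# Usage: refuse to run a mining algorithm if the data do not meet the declared quality limits.
# report = data_quality_report(df)
# if not report["usable"]:
#     raise ValueError(f"data set not fit for the declared mining task: {report}")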
framework for data mining design exemplarily presented is an example in Figure 4. 4.6 Data Mining and Design Science Let us finally associate our approach with design Acknowledgement. We thank for the support of this science research [13]. Design science considers paper by the CRC 1266. We are very thankful for the systematic modelling as an embodiment of three closely fruitful discussions with the members of the CRC. related cycles of activities. The relevance cycle initiates design science research with an application context that References not only provides the requirements for the research as [1] G. Bell. The mechanism of evolution. inputs but also defines acceptance criteria for the Chapman and Hall, New York (1997) ultimate evaluation of the research results. The central design cycle iterates between the core activities of [2] R. Berghammer and B. Thalheim., Metho- building and evaluating the design artifacts and denbasierte mathematische Modellierung mit processes of the research. The orthogonal rigor cycle Relationenalgebren. In: Wissenschaft und provides past knowledge to the research project to Kunst der Modellierung: Modelle, ensure its innovation. It is contingent on the Modellieren, Modellierung, pp. 67–106. De researchers’ thoroughly research and references the Gryuter, Boston ( 2015) knowledge base in order to guarantee that the designs [3] M.R. Berthold, C. Borgelt, F. Höppner, and F. produced are research contributions and not routine Klawonn. Guide to intelligent data analysis. designs based upon the application of well-known Springer, London (2010). processes. The relevance cycle is concerned with the problem [4] A. Bienemann, K.-D. Schewe, and B.Thalheim. specification and setting and the matrix and agenda Towards a theory of genericity based on derivation. The design cycle is related to all other government and binding. In: Proc. ER’06, phases of our framework. The rigor cycle is enhanced LNCS 4215, pp. 311–324. Springer ( 2006) by our framework and provides thus a systematic [5] L.B. Booker, D.E. Goldberg, and J.H. modelling approach. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40 (1–3): 5 Conclusion pp. 235–282 (1989) The literature on data mining is fairly rich. Mining tools [6] G. Brassard and P. Bratley. Algorithmics - have already gained the maturity for supporting any Theory and Practice. Prentice Hall, London kind of data analysis if the data mining problem is well ( 1988) understood, the intentions for models are properly [7] A. Coleman. Scientific models as works. understood, and if the problem is professionally set up. Cataloging & Classification Quarterly, Data mining aims at development of model suites that Special Issue: Works as Entities for allows to derive and to draw dependable and thus Information Retrieval, 33, p p . 3-4 ( 2006) justifiable conclusions on the given data set. Data [8] A. Dahanayake and B. Thalheim. Co- mining is a process that can be based on a framework for systematic modelling that is driven by a deep model evolution of (information) system models. and a matrix. Textbooks on data mining typically In: EMMSAD 2010, LNBIB vol. 50, pp. explore in detail algorithms as blind search. Data 314–326. Springer ( 2010) mining is a specific form of modeling. Therefore, we [9] D. Embley and B. Thalheim (eds). The can combine modeling with data mining in a more Handbook of Conceptual Modeling: Its Usage sophisticated form. Models have however an inner and Its Challenges. 
Typical questions to answer within this process are: How good is the model suite in terms of the task setting? What have we really learned about the application domain? What is the real adequacy and dependability of the models in the model suite? How can these models be deployed best? How do we know that the models in the model suite are still valid? Which data support which model in the model suite? Which kind of data errors is inherited by which part of which model?

The final result of the data mining process is then a combination of the deep model and the normal model, where the former is in most cases a latent or hidden component. If, however, we want to reason about the results, then the deep model must be understood as well. Otherwise, the results may become surprising and unconvincing.

4.5 Controllers and Selectors

Algorithmics [6] treats algorithms as general solution patterns that have parameters for their instantiation, handling mechanisms for their specialization to a given environment, and enhancers for context injection. So, an algorithm can be derived based on explicit selectors and control rules [4] if we neglect context injection. We can use this approach for data mining design (DMD). For instance, an algorithm pattern such as regression uses a generic model of parameter dependence, is based on blind search, has parameters for similarity and model quality, and has selection support for the specific treatment of the given data set. In this case, the controller is based on enablers that specify the applicability of the approach, on error rules, on data evaluation rules that detect dependences among control parameters and derive data quality measures, and on quality rules for confidence statements.
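Read in this way, an algorithm pattern guarded by a rule-based controller can be sketched as follows. The regression itself is ordinary least squares via NumPy; the enabler rules and the quality rule for the confidence statement are invented for illustration and would in practice be derived from the matrix and the acceptance criteria.

# Illustrative sketch (ours): an algorithm pattern (regression) guarded by a rule-based controller.
import numpy as np

def regression_pattern(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Generic parameter-dependence model: least-squares coefficients for y ~ X."""
    X1 = np.column_stack([np.ones(len(X)), X])             # add intercept column
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coeffs

# Controller: enablers that state applicability, plus a quality rule for a confidence statement.
ENABLERS = [
    lambda X, y: len(X) > 10 * X.shape[1],                  # enough data per parameter (illustrative rule)
    lambda X, y: np.isfinite(X).all() and np.isfinite(y).all(),
]

def controlled_regression(X: np.ndarray, y: np.ndarray):
    if not all(rule(X, y) for rule in ENABLERS):            # applicability check before instantiation
        return None, "pattern not applicable to this data set"
    coeffs = regression_pattern(X, y)
    residuals = y - np.column_stack([np.ones(len(X)), X]) @ coeffs
    r2 = 1 - residuals.var() / y.var()                      # quality rule feeding a confidence statement
    return coeffs, f"accepted, R^2 = {r2:.2f}"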
4.6 Data Mining and Design Science

Let us finally relate our approach to design science research [13]. Design science considers systematic modelling as an embodiment of three closely related cycles of activities. The relevance cycle initiates design science research with an application context that not only provides the requirements for the research as inputs but also defines acceptance criteria for the ultimate evaluation of the research results. The central design cycle iterates between the core activities of building and evaluating the design artifacts and processes of the research. The orthogonal rigor cycle provides past knowledge to the research project to ensure its innovation. It relies on the researchers thoroughly researching and referencing the knowledge base in order to guarantee that the designs produced are research contributions and not routine designs based upon the application of well-known processes.

The relevance cycle is concerned with the problem specification and setting and with the derivation of the matrix and the agenda. The design cycle is related to all other phases of our framework. The rigor cycle is enhanced by our framework and thus yields a systematic modelling approach.

5 Conclusion

The literature on data mining is fairly rich. Mining tools have already gained the maturity to support any kind of data analysis if the data mining problem is well understood, the intentions for the models are properly understood, and the problem is professionally set up. Data mining aims at the development of model suites that allow dependable and thus justifiable conclusions to be derived and drawn from the given data set. Data mining is a process that can be based on a framework for systematic modelling that is driven by a deep model and a matrix. Textbooks on data mining typically explore algorithms in detail as blind search. Data mining is a specific form of modelling; therefore, we can combine modelling with data mining in a more sophisticated form. Models, however, have an inner structure with parts which are given by the application, by the context, by commonsense and by a community of practice. These fixed parts are then enhanced by normal models. A typical normal model is the result of a data mining process.

The current state of the art in data mining is mainly technology and algorithm driven. The problem selection is made on intuition and experience, so the matrix and the deep model remain latent and hidden, and the problem specification is not explicit. Therefore, this paper addresses the entire data mining process and highlights a way to leave ad-hoc, blind and somewhat chaotic data analysis behind. The approach we are developing integrates the theory of models, the theory of problem solving, design science, and knowledge and content management. We realized that data mining can be systematized. The framework for data mining design presented in Figure 4 is one example.

Acknowledgement. We thank the CRC 1266 for supporting this paper. We are very thankful for the fruitful discussions with the members of the CRC.

References

[1] G. Bell. The Mechanism of Evolution. Chapman and Hall, New York (1997)
[2] R. Berghammer and B. Thalheim. Methodenbasierte mathematische Modellierung mit Relationenalgebren. In: Wissenschaft und Kunst der Modellierung: Modelle, Modellieren, Modellierung, pp. 67–106. De Gruyter, Boston (2015)
[3] M.R. Berthold, C. Borgelt, F. Höppner, and F. Klawonn. Guide to Intelligent Data Analysis. Springer, London (2010)
[4] A. Bienemann, K.-D. Schewe, and B. Thalheim. Towards a theory of genericity based on government and binding. In: Proc. ER'06, LNCS 4215, pp. 311–324. Springer (2006)
[5] L.B. Booker, D.E. Goldberg, and J.H. Holland. Classifier systems and genetic algorithms. Artificial Intelligence, 40(1–3), pp. 235–282 (1989)
[6] G. Brassard and P. Bratley. Algorithmics - Theory and Practice. Prentice Hall, London (1988)
[7] A. Coleman. Scientific models as works. Cataloging & Classification Quarterly, Special Issue: Works as Entities for Information Retrieval, 33, pp. 3–4 (2006)
[8] A. Dahanayake and B. Thalheim. Co-evolution of (information) system models. In: EMMSAD 2010, LNBIP vol. 50, pp. 314–326. Springer (2010)
[9] D. Embley and B. Thalheim (eds). The Handbook of Conceptual Modeling: Its Usage and Its Challenges. Springer (2011)
[10] N.P. Gillett, F.W. Zwiers, A.J. Weaver, G.C. Hegerl, M.R. Allen, and P.A. Stott. Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett., 29, pp. 31–34 (2002)
[11] E. Guerra, J. de Lara, D.S. Kolovos, and R.F. Paige. Inter-modelling: From theory to practice. In: MoDELS 2010, LNCS 6394, pp. 376–391. Springer, Berlin (2010)
[12] H. Haken, A. Wunderlin, and S. Yigitbasi. An introduction to synergetics. Open Systems and Information Dynamics, 3(1), pp. 1–34 (1994)
[13] A. Hevner, S. March, J. Park, and S. Ram. Design science in information systems research. MIS Quarterly, 28(1), pp. 75–105 (2004)
[14] P.J. Hunter, W.W. Li, A.D. McCulloch, and D. Noble. Multiscale modeling: Physiome project standards, tools, and databases. IEEE Computer, 39(11), pp. 48–54 (2006)
[15] ISO/IEC 25020: Software and system engineering - Software product quality requirements and evaluation (SQuaRE) - Measurement reference model and guide. ISO/IEC JTC1/SC7 N3280 (2005)
[16] H. Jaakkola, B. Thalheim, Y. Kidawara, K. Zettsu, Y. Chen, and A. Heimbürger. Information modelling and global risk management systems. In: Information Modeling and Knowledge Bases XX, pp. 429–446. IOS Press (2009)
[17] K. Jannaschk. Infrastruktur für ein Data Mining Design Framework. PhD thesis, Christian-Albrechts University, Kiel (2017)
[18] F. Kramer and B. Thalheim. A metadata system for quality management. In: Information Modelling and Knowledge Bases, pp. 224–242. IOS Press (2014)
[19] O. Nakoinz and D. Knitter. Modelling Human Behaviour in Landscapes. Springer (2016)
[20] J. Pardillo. A systematic review on the definition of UML profiles. In: MoDELS 2010, LNCS 6394, pp. 407–422. Springer, Berlin (2010)
[21] D. Petrelli, S. Levin, M. Beaulieu, and M. Sanderson. Which user interaction for cross-language information retrieval? Design issues and reflections. JASIST, 57(5), pp. 709–722 (2006)
[22] O.H. Pilkey and L. Pilkey-Jarvis. Useless Arithmetic: Why Environmental Scientists Can't Predict the Future. Columbia University Press, New York (2006)
[23] A.S. Podkolsin. Computer-based modelling of solution processes for mathematical tasks (in Russian). ZPI at Mech-Mat MGU, Moscow (2001)
[24] M. Pottmann, H. Unbehauen, and D.E. Seborg. Application of a general multi-model approach for identification of highly nonlinear processes - a case study. Int. Journal of Control, 57(1), pp. 97–120 (1993)
[25] B. Rumpe. Modellierung mit UML. Springer, Heidelberg (2012)
[26] A. Samuel and J. Weir. Introduction to Engineering: Modelling, Synthesis and Problem Solving Strategies. Elsevier, Amsterdam (2000)
[27] G. Simsion and G.C. Witt. Data Modeling Essentials. Morgan Kaufmann, San Francisco (2005)
[28] M. Skusa. Semantische Kohärenz in der Softwareentwicklung. PhD thesis, CAU Kiel (2011)
[29] B. Thalheim. Towards a theory of conceptual modelling. Journal of Universal Computer Science, 16(20), pp. 3102–3137 (2010)
[30] B. Thalheim. The conceptual model ≡ an adequate and dependable artifact enhanced by concepts. In: Information Modelling and Knowledge Bases XXV, pp. 241–254. IOS Press (2014)
[31] B. Thalheim. Conceptual modeling foundations: The notion of a model in conceptual modeling. In: Encyclopedia of Database Systems. Springer (2017)
[32] B. Thalheim and M. Tropmann-Frick. Wherefore models are used and accepted? The model functions as a quality instrument in utilisation scenarios. In: I. Comyn-Wattiau, C. du Mouza, and N. Prat, editors, Ingenierie Management des Systemes D'Information (2016)
[33] B. Thalheim, M. Tropmann-Frick, and T. Ziebermayr. Application of generic workflows for disaster management. In: Information Modelling and Knowledge Bases, pp. 64–81. IOS Press (2014)
[34] B. Thalheim and Q. Wang. Towards a theory of refinement for data migration. In: ER'2011, LNCS 6998, pp. 318–331. Springer (2011)
[35] T. Zeugmann. Inductive inference of optimal programs: A survey and open problems. In: Nonmonotonic and Inductive Logics, pp. 208–222. Springer, Berlin (1991)