=Paper=
{{Paper
|id=Vol-2177/paper-05-d004
|storemode=property
|title=
Designing Multidimensional Information Systems Using the Data Vault Methodology
|pdfUrl=https://ceur-ws.org/Vol-2177/paper-05-d004.pdf
|volume=Vol-2177
|authors=Anastasiya V. Demidova,Yevgeny A. Kuznetsov,Maxim B. Fomin
}}
==
Designing Multidimensional Information Systems Using the Data Vault Methodology
==
UDC 681.3.016

Anastasiya V. Demidova*, Yevgeny A. Kuznetsov†, Maxim B. Fomin*

(*) Department of Information Technology, Peoples’ Friendship University of Russia (RUDN University), 6 Miklukho-Maklaya str., Moscow, 117198, Russian Federation

(†) Department of digital solutions, Laboratory of New Information Technologies (LANIT), 14 Murmanskiy proezd, Moscow, 129075, Russian Federation

Email: demidova_av@rudn.university, kuznetsovea@lanit.ru, fomin_mb@rudn.university

This paper considers a method for designing information systems with the “Data vault” modeling technique formalized by Dan Linstedt. With “Data vault”, the information system follows the classical three-tier approach to data warehouse design formulated by Bill Inmon, which comprises an Operational warehouse of data, a Data warehouse, and Data marts. This approach makes it possible to build the data warehouse of an information system with a metadata repository based on the multidimensional principle. The metadata repository is responsible for collecting data, storing data, and presenting data for analysis. The proposed method of describing metadata makes it possible to specify how the performance indicators used in data analysis are calculated. The “Data vault” approach allows the data warehouse to be designed around a meta-model that is semantically related to the subject domain of the system and is easily rebuilt when the business model of the subject domain changes. It also provides an easy way to generate data marts based on OLAP principles. The key element in the structure of the information system is the transition from the “Data vault” model to the multidimensional model of data representation, which is based on associative rules relating information objects.
Key words and phrases: data warehouse, multidimensional data model, data mart, OLAP, data vault.

Copyright © 2018 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: K. E. Samouylov, L. A. Sevastianov, D. S. Kulyabov (eds.): Selected Papers of the VIII Conference “Information and Telecommunication Technologies and Mathematical Modeling of High-Tech Systems” (ITTMM-2018), Moscow, Russia, 20-Apr-2018, published at http://ceur-ws.org

=== 1. Introduction ===

The appearance of low-cost high-performance computing systems has made them available to medium-sized enterprises whose work involves a large volume of operations of various types. Such enterprises need low-cost, easy-to-operate information systems for analyzing their activities. These information systems should meet the following requirements:

– the system must process the data that arise in the course of the enterprise’s activities;
– the system should be able to describe and calculate the key performance indicators that are used in the decision-making process for enterprise management;
– the data warehouse metadata structure must correspond to the business processes of the enterprise;
– it must be possible to change the system promptly when the activities of the enterprise change or when the methodology for analyzing those activities changes.

=== 2. Information system architecture ===

In the course of its activity, an enterprise generates heterogeneous data sets, which are stored in information subsystems external to the analytical information system.
The task of the analytical information system is to collect these data from the external subsystems and to calculate the key performance indicators used in analyzing the enterprise’s activities and in making management decisions. To provide these functions, the information system must contain the following subsystems: a Data acquisition subsystem, a Data storage subsystem, a Data representation subsystem, and a Subsystem of control [8, 9]. Data storage is thus separated from the business users, who use it to analyze data slices. This separation reduces the cost of modification at the business level. At the same time, it lets business users directly manage and modify the virtual layer (self-service BI). The architecture of the information system, which automates the information processes in accordance with the data model described in the metadata repository [11–14], is shown in Figure 1.

Figure 1. Data warehouse meta-model structure

The analytical information system interacts with external information systems, which act as data sources. These are OLTP systems, legacy subsystems, standard format data files, and any other sources of structured data. On the basis of data taken from external sources, the Data acquisition subsystem forms the correct content of the Operational warehouse of data (OWD). The OWD is a storage area in which information is held before it is loaded into the Data warehouse (DW). The DW combines information related to all aspects of the enterprise. Information is loaded from the OWD into the DW by normalizing the data according to the rules of the current DW data model. The calculation of performance indicators is based on data taken from the data warehouse.
Performance indicators are placed in special data storage structures, thematic data marts, in the Data representation subsystem. A thematic data mart is a narrow slice of information for users working on one specific task. As a rule, the task of a thematic data mart is to provide data access for business applications [4–6]. Here, business applications are decision support systems that use OLAP data representation, or subsystems that use another form of data representation convenient for generating reports.

The central block in the structure of the information system is the metadata repository. It manages the data model at the meta-model level and controls the movement of data through the information system. The main requirement for the meta-model is as follows: metadata should be described in such a way that the method of calculating the performance indicators used in analyzing the enterprise’s activities and in making management decisions can be specified on its basis [7, 10].

From the point of view of business analysts, the most appropriate approach for describing the metadata repository is the multidimensional principle of data organization (metadata are themselves data stored in the metadata repository). Since a multidimensional data model provides a denormalized way of storing data, a “Data vault” model can provide a convenient way of structuring information for the data warehouse. The “Data vault” methodology makes it possible to describe the semantic links between the data warehouse and the description of the subject domain of the information system. This in turn makes it possible to rebuild the structure of the DW when the business model of the subject domain changes.

=== 3. Description of the data warehouse model using the “Data vault” methodology ===

One of the ways to build a data warehouse is the data vault methodology.
Its use makes it possible to expand the DW data model dynamically without complicated modifications to other subsystems of the information system. The data model must be managed at the meta-model level [15]. The main objects of the meta-model are a business key (“hub” in “Data vault” terminology), a business key transaction (“link”), and a business key history (“sat”). A business key is a property of an object that uniquely identifies it within the subject domain. A business key history is a history of changes to object properties that are functionally dependent on that business key. The business key is used to keep the attributes of a dimension up to date. A business key transaction is a description of an event that occurred between the objects identified by the participating business keys [1–3].

As an example of using “Data vault”, consider the process of on-line sales. The conceptual model of the process is presented in Figure 2. Figure 3 shows the structure of the DW meta-model as a diagram in the E/R+Merise notation. To ensure that modifying the information system does not lose the information about the associative links in the meta-model, namely the link type (aggregation, composition or recursion) and the arity (1:1, 1:N, M:N), this information should be kept with the transactions.

Figure 2. The sales process conceptual model

“Clients” and “Products” act as business keys. “Orders” are transactions between “Clients” and “Products” that implement an association of type “M:N”. “Requisites” form the history of the business keys of the “Clients”.

Figure 3. Data warehouse meta-model structure

The metadata repository model is based on the multidimensional data model.
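A minimal Python sketch of the three meta-model objects, applied to the on-line sales example above, may help fix the terminology. The class names `Hub`, `Link` and `Satellite`, their fields, the sample keys and addresses, and the helper `current_attributes` are illustrative assumptions; they are not taken from the paper or from any Data vault tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hub:
    """A business key: identifies one object within the subject domain."""
    entity: str        # hypothetical entity name, e.g. "Client"
    business_key: str  # hypothetical key value, e.g. "C-001"


@dataclass(frozen=True)
class Link:
    """A business key transaction: an event between objects."""
    name: str
    hubs: tuple        # the hubs that take part in the event
    arity: str         # association arity kept with the transaction


@dataclass
class Satellite:
    """One entry in the history of properties that depend on a hub."""
    hub: Hub
    valid_from: date
    attributes: dict


def current_attributes(hub, history):
    """Return the most recent attribute set recorded for the hub."""
    records = [s for s in history if s.hub == hub]
    if not records:
        return {}
    return max(records, key=lambda s: s.valid_from).attributes


# Sample data (invented for illustration only).
client = Hub("Client", "C-001")
product = Hub("Product", "P-042")
order = Link("Order", (client, product), arity="M:N")
requisites = [
    Satellite(client, date(2018, 1, 10), {"address": "Old street 1"}),
    Satellite(client, date(2018, 3, 5), {"address": "New street 7"}),
]

print(current_attributes(client, requisites))
```

In this sketch the satellite records are only ever appended, so the full history of a client's requisites is preserved while the latest record supplies the current attribute values.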
Basing the metadata repository on the multidimensional model makes it easier to establish a correspondence between the metadata and the business process parameters of an enterprise, and to describe how to calculate the performance indicators and data used to fill the data marts [16, 17]. To execute data requests, rules of connection (association) must be defined between objects of the multidimensional data model and objects in a “Data vault”. Such rules can be formulated on the basis of the following statements:

1. Within the multidimensional data model in the analytical subsystem, a business key can act as a slowly changing dimension;
2. Business key transaction history makes it possible to calculate the values of measures in the multidimensional data models used in data marts.

These rules use connectivity at the conceptual and logical levels of representation of the metadata repository model. A complete diagram of the metadata repository model is shown in Figure 4.

Figure 4. Metadata repository model

=== 4. Multidimensional data model ===

The structure of the multidimensional data model should reflect the aspects of the subject domain that are used in the data analysis process. Each aspect corresponds to one dimension of a multidimensional cube <math>H</math>. The full set of dimensions forms the set <math>D(H) = \{D^1, D^2, \ldots, D^n\}</math>, where <math>D^i</math> is the <math>i</math>-th dimension and <math>n = \dim(H)</math> is the dimensionality of the multidimensional cube [18]. Each dimension is characterized by a set of members <math>D^i = \{d^i_1, d^i_2, \ldots, d^i_{k_i}\}</math>, where <math>i</math> is the number of the dimension and <math>k_i</math> is the quantity of members. Members of <math>D^i</math> are drawn from the set of positions of the basic classifier that corresponds to the aspect of the observed phenomenon associated with <math>D^i</math> [19, 20].

The multidimensional data cube is a structured set of cells. Each cell <math>c</math> is defined by a combination of members <math>c = (d^1_{i_1}, d^2_{i_2}, \ldots, d^n_{i_n})</math>.
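The notation for dimensions, members and cells can be illustrated with a small Python sketch. The three dimensions and their members are invented for illustration and do not come from the paper; the sketch only enumerates member combinations and counts them.

```python
from itertools import product
from math import prod

# Hypothetical dimensions of a small sales cube H, each with its members.
dimensions = {
    "client_region": ["North", "South"],
    "product_group": ["Books", "Games", "Music"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
}

n = len(dimensions)                                    # n = dim(H)
k = [len(members) for members in dimensions.values()]  # k_i per dimension

# Each cell is a tuple holding exactly one member from every dimension.
cells = list(product(*dimensions.values()))

print(n, k, len(cells))  # 3 [2, 3, 4] 24
```

The Cartesian product enumerates every member combination, so the number of tuples equals the product of the member counts of the individual dimensions.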
The combination that defines a cell includes one member for each of the dimensions. If the observed phenomenon is analyzed using a large set of diverse aspects, not all member combinations define possible cells of the multidimensional cube, i.e. cells corresponding to a certain fact. This effect arises because some members from different dimensions are semantically inconsistent with each other, and it generates sparseness in the cube. The complex structure of member compatibility may lead to a situation where a certain dimension becomes semantically uncertain when combined with a set of members from other dimensions. In this situation, when describing a possible cell of the multidimensional cube, the special value “Not in use” can be used as the member of the semantically unspecified dimension.

The subject domain is characterized by the measure values defined in the possible cells of the multidimensional cube. The full set of measures composes the set <math>V(H) = \{v_1, v_2, \ldots, v_p\}</math>, where <math>v_j</math> is the <math>j</math>-th measure and <math>p</math> is the quantity of measures in the hypercube. Not all measures from <math>V(H)</math> can be defined in a given possible cell. This happens when the members defining the cell are semantically inconsistent with some measures. When describing the structure of the multidimensional data cube, it is necessary to define for every possible cell <math>c</math> its own set <math>V(c) = \{v_1, v_2, \ldots, v_{p_c}\}</math>, which consists of the measures defined for this cell, with <math>1 \le p_c \le p</math>. The special value “Not in use” can be used to describe those measures of <math>c</math> that are not included in the set <math>V(c)</math>.

=== 5. Conclusions ===

The paper discussed a method of designing information systems using the “Data vault” methodology.
This approach makes it possible to build a data warehouse system based on a meta-model that is semantically related to the subject domain of the system and is easily rebuilt when the business model of the subject domain changes. It also allows multidimensional data marts to be formed and the performance indicators of the enterprise to be calculated.

=== Acknowledgments ===

The work is partially supported by the Ministry of Education and Science of the Russian Federation (the Agreement number 02.a03.21.0008).

=== References ===

1. D. Linstedt, M. Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0, Elsevier Inc., 2016.
2. W. Inmon, D. Linstedt, Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault, Elsevier Inc., 2015.
3. H. Hultgren, Modeling the Agile Data Warehouse with Data Vault, Brighton Hamilton, 2012.
4. L. Corr, J. Stagnitto, Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema, DecisionOne Press, 2011.
5. W. H. Inmon, Building the Data Warehouse, Wiley Publishing, 2005.
6. W. H. Inmon, Building the Operational Data Store, Wiley Publishing, 1999.
7. W. H. Inmon, D. Strauss, G. Neushloss, DW 2.0: Architecture for the next generation of data warehousing, Elsevier Inc., 2010.
8. R. Kimball, J. Caserta, The Data Warehouse ETL Toolkit, Wiley Publishing, 2004.
9. R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses, Wiley Publishing, 1998.
10. R. Kimball, L. Reeves, R. Merz, The Data Warehouse Toolkit: The Complete Guide to Dimensional modelling, Wiley Publishing, 2002.
11. C. Batini, S. Ceri, S. B. Navathe, Conceptual Database Design: An Entity-Relationship Approach, Benjamin/Cummings, 1992.
12. S. Singh, S. Malhotra, Data Warehouse and its Methods, Journal of Global Research in Computer Science 2 (5) (2011) 113–115.
13. A. Datta, H. Thomas, A conceptual model and Algebra for On-Line Analytical Processing in Decision Support Databases, Information Systems Research 12 (1) (2001) 83–102. doi:10.1287/isre.12.1.83.9715.
14. C. Fahrner, G. Vossen, A survey of database transformations based on the entity-relationship model, Data & Knowledge Engineering 15 (3) (1995) 213–250. doi:10.1016/0169-023X(95)00006-E.
15. E. Medina, J. Trujillo, Standard for Representing Multidimensional Properties: The Common Warehouse Metamodel (CWM), in: Advances in Databases and Information Systems (ADBIS), Lecture Notes in Computer Science 2435 (2002).
16. V. Jovanovic, D. Jaksic, S. Mrdalj, Data modeling styles in data warehousing, in: Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014, Proceedings 6859796, 1458–1463. doi:10.1109/MIPRO.2014.6859796.
17. D. Dymek, W. Komnata, P. Szwed, Proposal of a new data warehouse architecture reference model, in: Beyond Databases, Architectures and Structures (BDAS), Communications in Computer and Information Science 521 (2015) 222–232.
18. M. B. Fomin, Cluster method of description of information system data model based on multidimensional approach, in: Distributed Computer and Communication Networks (DCCN), Communications in Computer and Information Science 678 (2016) 657–668.
19. E. Thomsen, OLAP Solutions: Building Multidimensional Information Systems, Wiley Publishing, 2002.
20. L. Fu, Efficient evaluation of sparse data cubes, in: Advances in Web-Age Information Management (WAIM), Lecture Notes in Computer Science 3129 (2004) 336–345.