Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC'2019), Budva, Becici, Montenegro, September 30 – October 4, 2019

THE VISUALIZATION METHOD PIPELINE FOR THE APPLICATION TO DYNAMIC DATA ANALYSIS

T. Galkin 1,a, D. Popov 2,b, V. Pilyugin 1,c, M. Grigorieva 3,d

1 National Research Nuclear University MEPhI, Moscow, Russia
2 Skolkovo Institute of Science and Technology, Moscow, Russia
3 Lomonosov Moscow State University, Moscow, Russia

E-mail: a tpgalkin@mephi.ru, b dmitry.popov@skoltech.ru, c vvpilyugin@mephi.ru, d Maria.Grigorieva@cern.ch

The new era of scientific research brings an enormous amount of data to scientists. These complex and multidimensional data structures are used for the verification of scientific hypotheses. Exploring such data requires the development of new technologies for its efficient processing, investigation and interpretation. Intelligent data analysis and statistical methods are rapidly developing, and this is where visualization methods find their place. This work describes the mathematical basis of the visualization tool developed by the authors for the analysis of multidimensional dynamic data. The tool provides a pipeline of methods which, combined, make it possible to cope with a set of practical tasks (anomaly detection; cluster, trend and variation analysis) using the visualization method. The authors provide mathematical models of geometrical operations over the data domain, algorithms for solving the mentioned classes of tasks and several use cases with technological and economic data based on the visualization method.

Keywords: visual analysis, dynamic data, time-variant data, multidimensional data, multivariate data, visualization, data analysis, multidimensional analysis.

Timofei Galkin, Dmitry Popov, Victor Pilyugin, Maria Grigorieva
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Intelligent computer algorithms are the state of the art of data analysis today. Artificial intelligence, machine learning and neural networks set the trend of discourse in data science. However, at the same time some studies point out the problem of understanding, interpretation and verification of research results [1]. Various methods are available for these purposes; one of them is data visualization. Industry and science bring us tasks which involve complex multidimensional data analysis, and data visualization can provide a deep understanding of data based on its graphical representation. This paper describes the experience of applying the visualization method to multidimensional dynamic data analysis.

2. Background

An overview of visualization techniques for time-dependent multidimensional data can be found in [2]. The authors divide these techniques into static and dynamic and also consider interactions with the visual representations. The recent overview [3] considers different visualization techniques and data transformations. The data model for visualization can be represented as a multidimensional Euclidean space and affine transformations within this space [4]. Various data structures from different research fields are successfully investigated by imaging, which provides the analyst with advanced and interactive means for data exploration. This paper describes the visualization pipeline, developed by the authors, based on a 3D scatter plot with colored distances between data objects in multidimensional space.
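As a minimal illustration of the data model mentioned above, the sketch below treats one object observed at one moment of time as a point in n-dimensional Euclidean space and applies a simple affine transformation to it (per-axis scaling plus translation). All type and method names here are illustrative assumptions and are not part of the authors' software.

```csharp
// Sketch only: a point of n-dimensional Euclidean space and an affine transformation.
using System;
using System.Linq;

static class EuclideanDataModel
{
    // y_l = scale_l * x_l + shift_l for every coordinate l = 0..n-1.
    static double[] Affine(double[] x, double[] scale, double[] shift) =>
        x.Select((v, l) => scale[l] * v + shift[l]).ToArray();

    public static void Main()
    {
        // One object described by n = 3 parameters at a fixed moment of time.
        double[] point = { 0.42, 1.70, -3.00 };

        // Example: rescale the second parameter and shift the third one.
        double[] transformed = Affine(point,
                                      scale: new[] { 1.0, 0.5, 1.0 },
                                      shift: new[] { 0.0, 0.0, 3.0 });

        Console.WriteLine(string.Join(", ", transformed)); // 0.42, 0.85, 0
    }
}
```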
3. Visualization Method for Dynamic Data

Dynamic multidimensional data is represented as a set of object parameters changing in time. Such data is stored as a set of numeric values given at a number of points in time.

3.1 Task formulation

The task to be solved is formulated as follows.

Given: Let m objects be given, each characterized by n parameters. The data is organized as a set of tables of the following form:

Time = t_j
            Parameter 1   Parameter 2   ...   Parameter n
Object 1    x^j_{11}      x^j_{12}      ...   x^j_{1n}
Object 2    x^j_{21}      x^j_{22}      ...   x^j_{2n}
...         ...           ...           ...   ...
Object m    x^j_{m1}      x^j_{m2}      ...   x^j_{mn}

This tabular representation contains the object parameter values at a specific point in time: table j is filled with the parameter values for time t_j. It is assumed that t_1 < t_2 < ... < t_k, and x^j_{il} is the value of parameter l of object i in table j at the moment of time t_j (l = 1, ..., n; i = 1, ..., m; j = 1, ..., k). To simplify the formulations, the set of values x^j_{il}, l = 1, ..., n, for a fixed j is called an n-tuple.

Task: Find the subsets of similar objects, explore these subsets at each given point in time and make judgements about their behavior in time.

3.2 Data samples

In this research two dynamic data samples were used: technological and financial data.

The technological data sample was taken from Kaggle's dataset "CareerCon 2019 - Help Navigate Robots" (https://www.kaggle.com/c/career-con-2019/data). It represents sensor data gathered while driving a small mobile robot over different floor surfaces: orientation, velocity, acceleration, etc. These data may help robots recognize the floor surface. The dataset has about 4K objects and 128 time measurements, with a total length of about 400K records. The visualization pipeline was applied to these data in order to explore the surface features.

The financial data sample represents data of the banking system. It was obtained from open sources (https://www.banki.ru/) and describes 81 banks over a period of 13 months by a set of features such as sales profit, deposits, overdue debt and others. The main idea of exploring these data is to visually uncover suspicious, anomalous banks and to detect the point in time, or the period, when the anomalous behavior takes place.

3.3 Task solving method

For solving the data analysis task, the scientific visualization method was used [5]. Both data samples were represented as sets of dynamic objects. Each object, with all its corresponding features at each point in time, is stored in a table, and the set of tables for all objects at different points in time forms the preprocessed dataset loaded into the visualization application. The visual analysis of data has two stages. The first stage is the visualization itself: the data tables are transformed into geometrical objects on the screen. This transformation comprises four steps: sourcing (obtaining the data from the source), filtering (getting the data ready for the application), mapping (placing the corresponding geometric objects on the scene) and rendering (producing the resulting picture of the scene). After the visualization is ready, the second stage of the analysis, the interpretation of the images, is performed by the analyst.

Figure 1. The visualization pipeline
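The four steps listed above can be outlined in code as follows. This is a minimal C# sketch under the assumption of simple in-memory tables shaped like those of Section 3.1; the type and method names (Snapshot, Source, Filter, Map, Render) are illustrative only and do not reproduce the authors' actual implementation, and rendering is replaced by a console stub.

```csharp
// Sketch of the four pipeline steps: sourcing, filtering, mapping, rendering.
using System;
using System.Collections.Generic;
using System.Linq;

// Table j: for every object, its n parameter values at time t_j.
record Snapshot(double Time, double[][] Parameters);   // Parameters[i] = n-tuple of object i

static class VisualizationPipeline
{
    // 1. Sourcing: obtain the raw snapshots (here synthetic data instead of a file or database).
    static IEnumerable<Snapshot> Source() =>
        Enumerable.Range(0, 5).Select(j => new Snapshot(
            Time: j,
            Parameters: Enumerable.Range(0, 3)
                .Select(i => new[] { i + 0.1 * j, Math.Sin(i + j), i * j * 0.01 })
                .ToArray()));

    // 2. Filtering: keep only the snapshots relevant for the analysis.
    static IEnumerable<Snapshot> Filter(IEnumerable<Snapshot> data) =>
        data.Where(s => s.Time >= 1);

    // 3. Mapping: pick three of the n parameters as scene coordinates of the spheres.
    static IEnumerable<(double Time, double[][] SpherePositions)> Map(IEnumerable<Snapshot> data) =>
        data.Select(s => (s.Time, s.Parameters.Select(p => p.Take(3).ToArray()).ToArray()));

    // 4. Rendering: done by the graphics engine in the real application; stubbed here.
    static void Render((double Time, double[][] SpherePositions) scene) =>
        Console.WriteLine($"t = {scene.Time}: {scene.SpherePositions.Length} spheres");

    public static void Main()
    {
        foreach (var scene in Map(Filter(Source())))
            Render(scene);
    }
}
```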
Parameters of each step of the visualization pipeline may be changed by the analyst interactively in order to generate another visualization sample. This makes the process of data analysis iterative and interactive.

3.3.1 Visualization pipeline

Figure 1 shows the transformation between the data table and the visualization. Each row in the table represents a data object. The features of the objects are transformed into multidimensional coordinates, and the objects are then depicted as spheres. The application backend calculates the distances between all pairs of objects in the multidimensional space and displays them as segments between the corresponding pairs of spheres. The segments are colored from blue to red, depending on how close the objects are in the original multidimensional space. This allows the analyst to observe the similarity of objects in the multidimensional space by looking at the 3D visualization.

4. Implementation and Application

The algorithms of the visualization method pipeline were implemented in the C# programming language, using the Unity graphics engine.

4.1 Technological data sample visualization

Figure 2. Technological data visualization

In figure 2, the X, Y and Z values are the orientation parameters, in degrees. The spheres in the picture form a circle, and this is the key point of visual analysis: human experts are good at interpreting graphical images, which is hard to program automatically. Moving the time slider allows the analyst to observe changes of the parameter values in time and can be useful for detecting the specific points of time when anomalous behavior takes place.

Figure 3. Cluster evolution in time

One more observation that can be made visually is presented in figure 3. The analyst found a cluster of spheres, and this cluster does not change in time; the corresponding spheres are marked with red color in the picture. Further investigation showed that these spheres correspond to a specific type of floor surface. Therefore, such visualization is useful for the clusterization problem.

4.2 Economic data sample visualization

Figure 4. Economic data visualization

Figure 4 shows that the spheres lie on a plane which noticeably rotates over time. The analyst may visually observe the direction and the velocity of movement of some financial parameters, drawing conclusions about the common financial situation of the banks. Such visualization also allows the analyst to catch anomalous banks, whose features change in time along trajectories different from the others.

4.3 Other applications

This application was also tested on metadata from the ATLAS Grid Information System (http://atlas-agis.cern.ch/agis/), as shown in figure 5. The visualization shows the appearance of computing queues and the duration of these queues in time, and can be used to observe specific tendencies which may be unobvious without a graphical representation.

Figure 5. ATLAS Grid Information System metadata visualization
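To make the coloring scheme of Section 3.3.1 concrete, the sketch below computes pairwise Euclidean distances between n-tuples and maps the normalized distance to a blue-to-red color. It is a plain C# stand-in for illustration only: the actual application performs this inside the Unity engine, and all names and the particular gradient formula are assumptions.

```csharp
// Sketch of the similarity coloring: pairwise Euclidean distances normalized to [0, 1]
// and mapped to a blue-to-red gradient for the segments between spheres.
using System;

static class SimilarityColoring
{
    static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int l = 0; l < a.Length; l++)
            sum += (a[l] - b[l]) * (a[l] - b[l]);
        return Math.Sqrt(sum);
    }

    // t = 0 -> blue (close objects), t = 1 -> red (distant objects), linear blend in between.
    static (byte R, byte G, byte B) BlueToRed(double t) =>
        ((byte)(255 * t), 0, (byte)(255 * (1 - t)));

    public static void Main()
    {
        // Three objects, each described by an n-tuple (here n = 4) at one moment of time.
        double[][] objects =
        {
            new[] { 0.1, 0.2,  0.3,  0.4 },
            new[] { 0.1, 0.25, 0.28, 0.41 },
            new[] { 2.0, -1.0, 0.0,  3.5 },
        };

        // Pairwise distances and the largest one, used for normalization.
        int m = objects.Length;
        var d = new double[m, m];
        double dMax = 0;
        for (int i = 0; i < m; i++)
            for (int k = i + 1; k < m; k++)
            {
                d[i, k] = Distance(objects[i], objects[k]);
                dMax = Math.Max(dMax, d[i, k]);
            }

        // Color of the segment drawn between every pair of spheres.
        for (int i = 0; i < m; i++)
            for (int k = i + 1; k < m; k++)
            {
                double dist = d[i, k];
                var c = BlueToRed(dist / dMax);
                Console.WriteLine($"segment {i}-{k}: distance {dist:F3}, color RGB({c.R}, {c.G}, {c.B})");
            }
    }
}
```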
5. Conclusion

An interactive and iterative method of data analysis, together with an application for dynamic data analysis using the visualization pipeline, was developed. The method considers dynamic objects as time-dependent points of a Euclidean space. The visualization system utilizes a 3D scatter plot with colored distances in the multidimensional space. The developed visualization application was tested on different data samples, showing its applicability to a wide variety of domains. Further research will focus on adapting the developed software to more complex tasks of cluster analysis and to searching for correlations within the data.

Acknowledgement

This work has been supported by the RSF grant No. 18-71-10003.

References

[1] J. Thomas, K. Cook, "Illuminating the Path: A Research and Development Agenda for Visual Analytics", IEEE Press.
[2] W. Muller, H. Schumann, "Visualization methods for time-dependent data - an overview", Proceedings of the 2003 Winter Simulation Conference, New Orleans, LA, USA, vol. 1, pp. 737-745, 2003.
[3] W. Cui, "Visual Analytics: A Comprehensive Overview", IEEE Access, vol. 7, pp. 81555-81573, 2019, DOI: 10.1109/ACCESS.2019.2923736.
[4] I. Milman, A. Pasko, V. Pilyugin, "Survey of approaches to multidimensional data geometrization in the analysis using computer visualization", Scientific Visualization, vol. 7, no. 2, pp. 21-37, 2015.
[5] V. Pilyugin, E. Malikova, A. Pasko, V. Adzhiev, "Scientific visualization as method of scientific data analysis", Scientific Visualization, pp. 56-70, 2012.